IndicNLG Suite

IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages

IndicNLG Suite is a collection of datasets for benchmarking Natural Language Generation (NLG) in 11 Indic languages, spanning five diverse NLG tasks. The datasets were created using a combination of website crawling, machine translation, and cleaning based on n-gram counts and regular expressions. Overall, the suite contains about 8.5M examples across all languages and tasks, making it the largest multilingual NLG dataset to date and the first of its kind for Indic languages. You can use these datasets to benchmark your own NLG systems.

Supported languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Odiya, Punjabi, Kannada, Malayalam, Tamil, and Telugu.

Supported NLG tasks and datasets: Biography generation using Wikipedia infoboxes (WikiBio), news headline generation, sentence summarization, question generation, and paraphrase generation.

Datasets are available as JSON files and in HuggingFace format.

You can read more about IndicNLG Suite in this paper. We have benchmarked our own monolingual and multilingual models based on IndicBART and found that they perform on par with or better than baseline models such as mT5.

Downloads

The datasets and models are available on HuggingFace:

Task	Dataset	Model
Biography Generation	IndicWikiBio	Coming Soon
Headline Generation	IndicHeadlineGeneration	Coming Soon
Sentence Summarization	IndicSentenceSummarization	Coming Soon
Paraphrase Generation	IndicParaphrase	Coming Soon
Question Generation	IndicQuestionGeneration	Coming Soon

IndicBART fine-tuning and decoding

Follow the setup instructions here.

We use the YANMTT toolkit for fine-tuning IndicBART.
Extract the input and target text from the JSONL format files or HuggingFace format files.
For question generation, concatenate the question and context into a single line.
Convert the scripts in the extracted files into Devanagari using the Indic Script Converter.
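
The extraction step above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the function name `extract_parallel_text` and its parameters are hypothetical, and the field names you pass in must match the keys actually present in the dataset files you downloaded. It reads a JSONL file and writes one input line and one target line per record, joining several input fields into a single line when more than one is given (as needed for question generation).

```python
import json


def extract_parallel_text(jsonl_path, src_out, tgt_out,
                          input_fields, target_field, sep=" "):
    """Write one input line and one target line per JSONL record.

    input_fields and target_field are placeholders: check the keys
    in the dataset files before choosing them.
    """
    count = 0
    with open(jsonl_path, encoding="utf-8") as fin, \
         open(src_out, "w", encoding="utf-8") as fsrc, \
         open(tgt_out, "w", encoding="utf-8") as ftgt:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Collapse internal whitespace (including newlines) so that
            # each example occupies exactly one line in the output files.
            source = sep.join(
                " ".join(str(record[field]).split()) for field in input_fields
            )
            target = " ".join(str(record[target_field]).split())
            fsrc.write(source + "\n")
            ftgt.write(target + "\n")
            count += 1
    return count
```

For a single-input task, pass one field name in `input_fields`; for question generation, list every field that should be concatenated, in the order you want them to appear. The resulting files still need script conversion before fine-tuning.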

Here is a command for fine-tuning IndicBART for summarization. Provide the correct input and output file paths, and set the hyperparameters as described in the paper. To decode the test set with the fine-tuned model, modify this command accordingly. Finally, map the output back to the original script using the script converter.

Alternatively, IndicBART is available on the HuggingFace Hub here. Modify the HuggingFace summarization script to use the IndicBART model. This script accepts both the JSON files and the HuggingFace format files. Ensure that script mapping is done before training and after decoding.
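
To illustrate what script mapping before training and after decoding involves, here is a simplified sketch. It is not the Indic Script Converter referred to above, and should not be used in its place for real experiments. It relies only on the fact that the major Indic scripts occupy parallel 128-codepoint blocks in Unicode, so a character can be moved between scripts by shifting its codepoint. The function name `map_script` and the two-letter language codes are assumptions made for this example.

```python
# Start of each script's 128-codepoint Unicode block, keyed by language code.
SCRIPT_BLOCK_START = {
    "as": 0x0980,  # Assamese is written in the Bengali block
    "bn": 0x0980,
    "gu": 0x0A80,
    "hi": 0x0900,  # Devanagari
    "mr": 0x0900,  # Devanagari
    "or": 0x0B00,
    "pa": 0x0A00,  # Gurmukhi
    "kn": 0x0C80,
    "ml": 0x0D00,
    "ta": 0x0B80,
    "te": 0x0C00,
}
BLOCK_SIZE = 0x80
# Danda and double danda live in the Devanagari block but are shared
# punctuation across scripts, so they are left untouched.
SHARED_PUNCTUATION = {"\u0964", "\u0965"}


def map_script(text, src_lang, tgt_lang):
    """Shift characters from the source script block to the target block."""
    src_start = SCRIPT_BLOCK_START[src_lang]
    tgt_start = SCRIPT_BLOCK_START[tgt_lang]
    out = []
    for ch in text:
        offset = ord(ch) - src_start
        if ch not in SHARED_PUNCTUATION and 0 <= offset < BLOCK_SIZE:
            out.append(chr(tgt_start + offset))
        else:
            out.append(ch)  # Latin text, digits, punctuation, etc.
    return "".join(out)
```

With this sketch, training data in Telugu would be passed through `map_script(line, "te", "hi")` before fine-tuning, and each decoded line through `map_script(line, "hi", "te")` afterwards. Because the mapping is a plain offset, some shifted codepoints have no real counterpart in the target script; a dedicated converter handles such cases more carefully.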

Contributors

Aman Kumar
Prachi Sahu
Himani Shrotriya
Raj Dabre
Ratish Puduppully
Anoop Kunchukuttan
Amogh Mishra
Mitesh M. Khapra
Pratyush Kumar

Citing

If you use IndicNLG Suite, please cite the following paper:


@misc{kumar2022indicnlg,
      title={IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages}, 
      author={Aman Kumar and Himani Shrotriya and Prachi Sahu and Raj Dabre and Ratish Puduppully and Anoop Kunchukuttan and Amogh Mishra and Mitesh M. Khapra and Pratyush Kumar},
      year={2022},
      eprint={2203.05437},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

Datasets

IndicHeadlineGeneration, IndicSentenceSummarization and IndicParaphrase are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
IndicWikiBio and IndicQuestionGeneration are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Models

All models are released under the MIT license.