The IndicNLG Suite is a collection of datasets for benchmarking Natural Language Generation (NLG) in 11 Indic languages across five diverse NLG tasks. The datasets were created by combining website crawling, machine translation, and cleaning based on n-gram counts and regular expressions. Overall, the suite contains about 8.5M examples across all languages and tasks; it is the largest multilingual NLG dataset to date and the first of its kind for Indic languages. You can use these datasets to benchmark your own NLG systems.
Supported languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Odiya, Punjabi, Kannada, Malayalam, Tamil, and Telugu.
Supported NLG tasks and datasets: Biography generation using Wikipedia infoboxes (WikiBio), news headline generation, sentence summarization, question generation, and paraphrase generation.
The datasets are available as JSON files and in HuggingFace format.
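As a rough illustration of working with the JSON files, the sketch below reads examples from a stream and collects source/target pairs using only the Python standard library. It assumes one JSON object per line and uses the hypothetical field names `input` and `target`; check the actual files for their layout and keys before relying on either assumption.

```python
import io
import json
from typing import Iterator, List, TextIO, Tuple


def read_examples(handle: TextIO) -> Iterator[dict]:
    """Yield one example per non-empty line (assumes one JSON object per line)."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_pairs(examples, source_key: str = "input",
             target_key: str = "target") -> List[Tuple[str, str]]:
    """Collect (source, target) pairs; the key names are assumptions."""
    return [(ex[source_key], ex[target_key]) for ex in examples]


# Tiny in-memory stand-in for a downloaded file (made-up content).
sample = io.StringIO(
    '{"input": "लंबा समाचार लेख", "target": "छोटा शीर्षक"}\n'
    '{"input": "दूसरा लेख", "target": "दूसरा शीर्षक"}\n'
)
pairs = to_pairs(read_examples(sample))
```

For a real file, replace the in-memory sample with `open(path, encoding="utf-8")` so that Indic text is decoded correctly.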
You can read more about the IndicNLG Suite in the paper cited below. We have benchmarked our own monolingual and multilingual models based on IndicBART and found that they perform on par with or better than baseline models such as mT5.
The datasets and models are available on HuggingFace:
Task | Dataset | Model |
---|---|---|
Biography Generation | IndicWikiBio | Coming Soon |
Headline Generation | IndicHeadlineGeneration | Coming Soon |
Sentence Summarization | IndicSentenceSummarization | Coming Soon |
Paraphrase Generation | IndicParaphrase | Coming Soon |
Question Generation | IndicQuestionGeneration | Coming Soon |
Follow the setup instructions here.
Here is a command for fine-tuning IndicBART for summarization. Provide the correct input and output file paths, and choose hyperparameters according to the paper. To decode the test set with the fine-tuned model, modify this command accordingly, then map the output back to the original script using the script converter.
Alternatively, IndicBART is available on the HuggingFace hub here. Modify the HuggingFace summarization script to use the IndicBART model; this script accepts both the JSON and the HuggingFace format files. Make sure that script mapping is done before training and after decoding.
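To make the script-mapping step concrete, here is a minimal sketch of the idea, assuming the model works on text in a single shared script (Devanagari). It is not the project's script converter: it simply relies on the fact that the major Indic scripts occupy parallel 128-code-point blocks in Unicode, so a character can be moved between scripts by shifting its offset. The function name `map_script` and the language-code table are illustrative choices, and the sketch ignores characters that have no counterpart in the target script.

```python
# Start of each script's Unicode block; the blocks share a parallel layout.
# The language codes used as keys are an assumption of this sketch.
BLOCK_START = {
    "as": 0x0980,  # Assamese is written in the Bengali block
    "bn": 0x0980,
    "gu": 0x0A80,
    "hi": 0x0900,
    "kn": 0x0C80,
    "ml": 0x0D00,
    "mr": 0x0900,
    "or": 0x0B00,
    "pa": 0x0A00,
    "ta": 0x0B80,
    "te": 0x0C00,
}
BLOCK_SIZE = 0x80
# Danda and double danda live in the Devanagari block but are shared by
# other scripts, so they are left untouched.
SHARED_OFFSETS = {0x64, 0x65}


def map_script(text: str, src_lang: str, tgt_lang: str) -> str:
    """Shift characters from the source script block to the target block."""
    src, tgt = BLOCK_START[src_lang], BLOCK_START[tgt_lang]
    out = []
    for ch in text:
        offset = ord(ch) - src
        if 0 <= offset < BLOCK_SIZE and offset not in SHARED_OFFSETS:
            out.append(chr(tgt + offset))
        else:
            out.append(ch)  # digits, punctuation, other scripts
    return "".join(out)


# Before training: map Bengali text to Devanagari.
unified = map_script("কলকাতা", "bn", "hi")
# After decoding: map the model output back to the original script.
restored = map_script(unified, "hi", "bn")
```

In this sketch the same function is applied in both directions: once to the training data before fine-tuning, and once to the decoded output so that it is returned in the original script.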
If you use IndicNLG Suite, please cite the following paper:
@misc{kumar2022indicnlg,
title={IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages},
author={Aman Kumar and Himani Shrotriya and Prachi Sahu and Raj Dabre and Ratish Puduppully and Anoop Kunchukuttan and Amogh Mishra and Mitesh M. Khapra and Pratyush Kumar},
year={2022},
eprint={2203.05437},
archivePrefix={arXiv},
primaryClass={cs.CL}
}