To evaluate language models on Indic languages, we need a robust benchmark. IndicXTREME is a human-annotated NLU benchmark consisting of 9 tasks across 18 Indic languages.

These tasks can be broadly grouped into sentence classification (5), structure prediction (2), question answering (1), and sentence retrieval (1).

The tasks are as follows:

– IndicCOPA – Dataset – We manually translate the COPA test set into 18 Indic languages to create IndicCOPA.

– IndicQA – Dataset – A manually curated cloze-style reading comprehension dataset that can be used for evaluating question-answering models in 11 Indic languages.

– IndicXParaphrase – Dataset – A new, multilingual, and n-way parallel dataset for paraphrase detection in 10 Indic languages.

– IndicSentiment – Dataset – A new, multilingual, and n-way parallel dataset for sentiment analysis in 13 Indic languages.

– IndicXNLI – Dataset – Automatically translated version of XNLI in 11 Indic languages. Created by Divyanshu et al. in this paper.

– Naamapadam – Dataset – NER dataset with manually curated test sets for 9 Indic languages. Created by Arnav et al. in this paper.

– MASSIVE – Dataset – This is an intent classification and slot-filling dataset created using user queries collected by Amazon Alexa for 7 Indic languages. Created by FitzGerald et al. in this paper.

– FLORES – Dataset – To evaluate the retrieval capabilities of models, we include the Indic parts of the FLORES-101 dataset. Available in 18 Indic languages. Created by NLLB Team et al. in this paper.
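
For planning an evaluation run, it can help to have the language coverage of each dataset in one place. The following is a minimal Python sketch, not part of any IndicXTREME tooling: the language counts are copied from the list above, while the `DATASETS` mapping and the `datasets_with_at_least` helper are hypothetical names introduced only for illustration.

```python
# Language counts are copied from the task list above. The dict layout and
# the helper name are illustrative assumptions, not an IndicXTREME API.
DATASETS = {
    "IndicCOPA": 18,
    "IndicQA": 11,
    "IndicXParaphrase": 10,
    "IndicSentiment": 13,
    "IndicXNLI": 11,
    "Naamapadam": 9,
    "MASSIVE": 7,
    "FLORES": 18,
}


def datasets_with_at_least(min_languages):
    """Return dataset names covering at least `min_languages` languages.

    Results are ordered by language count (widest first), then by name.
    """
    matches = [name for name, n in DATASETS.items() if n >= min_languages]
    return sorted(matches, key=lambda name: (-DATASETS[name], name))


if __name__ == "__main__":
    for name in datasets_with_at_least(11):
        print(f"{name}: {DATASETS[name]} languages")
```

Calling `datasets_with_at_least(18)` returns only the two datasets that cover all 18 languages of the benchmark, FLORES and IndicCOPA.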

For more information about the datasets and the models, please refer to the GitHub repository.

Contributors

  • Sumanth Doddapaneni (AI4Bharat, IITM)
  • Rahul Aralikatte (McGill, MILA)
  • Gowtham Ramesh (AI4Bharat, IITM)
  • Shreya Goyal (AI4Bharat)
  • Mitesh Khapra (AI4Bharat, IITM)
  • Anoop Kunchukuttan (Microsoft, AI4Bharat, IITM)
  • Pratyush Kumar (Microsoft, AI4Bharat, IITM)

Corresponding author: Sumanth Doddapaneni

Citing

If you are using any of the resources, please cite the following article:

@article{Doddapaneni2022towards,
  title={Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages},
  author={Sumanth Doddapaneni and Rahul Aralikatte and Gowtham Ramesh and Shreya Goyal and Mitesh M. Khapra and Anoop Kunchukuttan and Pratyush Kumar},
  journal={ArXiv},
  year={2022},
  volume={abs/2212.05409}
}

License

IndicXTREME is released under this licensing scheme:

  • We do not own any of the text from which this data has been extracted.
  • We license the actual packaging of this data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to IndicXTREME.
  • This work is published from: India.