Paper | Huggingface | Benchmarking
Bhasha-Abhijnaanam is a language identification test set for native-script and romanized text spanning 22 Indic languages. Benchmarking results on the Bhasha-Abhijnaanam test set using the IndicLID model can be found here. More details about Bhasha-Abhijnaanam can be found in the paper.
The language-wise statistics for Bhasha-Abhijnaanam are shown in the table below, as the total number of sentences per language and script.
Subset | asm | ben | brx | guj | hin | kan | kas (Perso-Arabic) | kas (Devanagari) | kok | mai | mal | mni (Bengali) | mni (Meetei Mayek) | mar | nep | ori | pan | san | sid | tam | tel | urd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Native | 1012 | 5606 | 1500 | 5797 | 5617 | 5859 | 2511 | 1012 | 1500 | 2512 | 5628 | 1012 | 1500 | 5611 | 2512 | 1012 | 5776 | 2510 | 2512 | 5893 | 5779 | 5751 |
Romanized | 512 | 4595 | 433 | 4785 | 4606 | 4848 | 450 | 0 | 444 | 439 | 4617 | 0 | 442 | 4603 | 423 | 512 | 4765 | 448 | 0 | 4881 | 4767 | 4741 |
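As a quick sanity check on the table, the following minimal sketch sums the sentence counts per subset and lists the columns that have no romanized data. The counts are copied directly from the table; the names `COLUMNS`, `NATIVE`, `ROMANIZED`, and `summarize` are illustrative only and are not part of any released tooling.

```python
# Counts copied from the statistics table above, in column order.
# All names here are hypothetical helpers for illustration.
COLUMNS = [
    "asm", "ben", "brx", "guj", "hin", "kan",
    "kas (Perso-Arabic)", "kas (Devanagari)", "kok", "mai", "mal",
    "mni (Bengali)", "mni (Meetei Mayek)", "mar", "nep", "ori",
    "pan", "san", "sid", "tam", "tel", "urd",
]
NATIVE = [
    1012, 5606, 1500, 5797, 5617, 5859, 2511, 1012, 1500, 2512, 5628,
    1012, 1500, 5611, 2512, 1012, 5776, 2510, 2512, 5893, 5779, 5751,
]
ROMANIZED = [
    512, 4595, 433, 4785, 4606, 4848, 450, 0, 444, 439, 4617,
    0, 442, 4603, 423, 512, 4765, 448, 0, 4881, 4767, 4741,
]


def summarize(columns, native, romanized):
    """Return per-subset totals and the columns with no romanized sentences."""
    if not (len(columns) == len(native) == len(romanized)):
        raise ValueError("columns and count rows must have the same length")
    return {
        "native_total": sum(native),
        "romanized_total": sum(romanized),
        "no_romanized": [c for c, r in zip(columns, romanized) if r == 0],
    }


summary = summarize(COLUMNS, NATIVE, ROMANIZED)
print(summary)
```

The length check guards against a column being dropped when the table is copied by hand, which is the most likely source of a wrong total.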
If you are using any of the resources, please cite the following article:
@misc{madhani2023bhashaabhijnaanam,
  title={Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages},
  author={Yash Madhani and Mitesh M. Khapra and Anoop Kunchukuttan},
  year={2023},
  eprint={2305.15814},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
This data is released under the following licensing scheme:
CC0 License Statement
We would like to thank the Ministry of Electronics and Information Technology of the Government of India for their generous grant through the Digital India Bhashini project. We also thank the Centre for Development of Advanced Computing for providing compute time on the Param Siddhi Supercomputer. We also thank Nilekani Philanthropies for their generous grant towards building datasets, models, tools and resources for Indic languages. We also thank Microsoft for their grant to support research on Indic languages. We would like to thank Jay Gala and Ishvinder Sethi for their help in coordinating the annotation work. Most importantly, we would like to thank all the annotators who helped create the Bhasha-Abhijnaanam benchmark.