IndicLID is a language identifier for all 22 Indian languages listed in the Indian constitution, covering both native-script and romanized text. It is the first LID for romanized text in Indian languages. It is a two-stage classifier: an ensemble of a fast linear classifier and a slower classifier fine-tuned from a pre-trained LM. It can predict 47 classes (24 native-script classes and 21 roman-script classes, plus English and Other). All the classes are listed below.
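One way to picture the two-stage design is as a cascade: the fast classifier answers first, and the slower classifier is consulted only when the fast one is not confident. The sketch below is illustrative only. The routing rule, the `threshold` value, and the names `two_stage_predict`, `toy_fast`, and `toy_slow` are assumptions made for this example and are not the IndicLID API. The toy classifiers are hard-coded stand-ins for real models.

```python
from typing import Callable, Tuple

# A classifier maps a text to (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def two_stage_predict(
    text: str,
    fast: Classifier,
    slow: Classifier,
    threshold: float = 0.6,  # assumed cut-off, chosen only for illustration
) -> Tuple[str, float, str]:
    """Return (label, confidence, stage_used) using a confidence cascade."""
    label, score = fast(text)
    if score >= threshold:
        # The fast model is confident enough, so skip the expensive model.
        return label, score, "fast"
    # Otherwise fall back to the slower, fine-tuned model.
    label, score = slow(text)
    return label, score, "slow"


def toy_fast(text: str) -> Tuple[str, float]:
    # Hypothetical stand-in: confident only when it sees a known keyword.
    if "namaste" in text.lower():
        return "hin_Latn", 0.9
    return "other", 0.3


def toy_slow(text: str) -> Tuple[str, float]:
    # Hypothetical stand-in for the slower fine-tuned model.
    return "mar_Latn", 0.8


if __name__ == "__main__":
    print(two_stage_predict("Namaste duniya", toy_fast, toy_slow))
    print(two_stage_predict("kasa kay", toy_fast, toy_slow))
```

In a cascade like this, the threshold trades throughput against accuracy: a higher value sends more inputs to the slower model.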
Language | IndicLID Code |
---|---|
Assamese (Bengali script) | asm_Beng |
Assamese (Latin script) | asm_Latn |
Bangla (Bengali script) | ben_Beng |
Bangla (Latin script) | ben_Latn |
Bodo (Devanagari script) | brx_Deva |
Bodo (Latin script) | brx_Latn |
Dogri (Devanagari script) | doi_Deva |
Dogri (Latin script) | doi_Latn |
English (Latin script) | eng_Latn |
Gujarati (Gujarati script) | guj_Gujr |
Gujarati (Latin script) | guj_Latn |
Hindi (Devanagari script) | hin_Deva |
Hindi (Latin script) | hin_Latn |
Kannada (Kannada script) | kan_Knda |
Kannada (Latin script) | kan_Latn |
Kashmiri (Perso-Arabic script) | kas_Arab |
Kashmiri (Devanagari script) | kas_Deva |
Kashmiri (Latin script) | kas_Latn |
Konkani (Devanagari script) | kok_Deva |
Konkani (Latin script) | kok_Latn |
Maithili (Devanagari script) | mai_Deva |
Maithili (Latin script) | mai_Latn |
Malayalam (Malayalam script) | mal_Mlym |
Malayalam (Latin script) | mal_Latn |
Manipuri (Bengali script) | mni_Beng |
Manipuri (Meetei Mayek script) | mni_Meti |
Manipuri (Latin script) | mni_Latn |
Marathi (Devanagari script) | mar_Deva |
Marathi (Latin script) | mar_Latn |
Nepali (Devanagari script) | nep_Deva |
Nepali (Latin script) | nep_Latn |
Oriya (Oriya script) | ori_Orya |
Oriya (Latin script) | ori_Latn |
Panjabi (Gurmukhi script) | pan_Guru |
Panjabi (Latin script) | pan_Latn |
Sanskrit (Devanagari script) | san_Deva |
Sanskrit (Latin script) | san_Latn |
Santali (Ol Chiki script) | sat_Olch |
Sindhi (Perso-Arabic script) | snd_Arab |
Sindhi (Latin script) | snd_Latn |
Tamil (Tamil script) | tam_Tamil |
Tamil (Latin script) | tam_Latn |
Telugu (Telugu script) | tel_Telu |
Telugu (Latin script) | tel_Latn |
Urdu (Perso-Arabic script) | urd_Arab |
Urdu (Latin script) | urd_Latn |
Other | other |
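The codes in the table follow a `language_script` pattern, with `other` as the only label that has no script part. A minimal sketch of how downstream code might split a predicted label and check whether it denotes romanized Indic text is shown below. The helper names `split_code` and `is_romanized_indic` are hypothetical and are not part of IndicLID.

```python
from typing import Optional, Tuple


def split_code(code: str) -> Tuple[str, Optional[str]]:
    """Split an IndicLID code such as 'hin_Latn' into (language, script).

    Labels without an underscore (for example 'other') have no script,
    so the script part is returned as None.
    """
    lang, sep, script = code.partition("_")
    return lang, (script if sep else None)


def is_romanized_indic(code: str) -> bool:
    """True for Latin-script labels of Indian languages, excluding English."""
    lang, script = split_code(code)
    return script == "Latn" and lang != "eng"


if __name__ == "__main__":
    for code in ["hin_Deva", "hin_Latn", "eng_Latn", "other"]:
        print(code, split_code(code), is_romanized_indic(code))
```

Splitting on the first underscore keeps the helper independent of the script names themselves, so it works for every label in the table as written.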
Know more about IndicLID
Visit the IndicLID page to learn more about the models, including:
- Downloading IndicLID
- Using the publicly available models
- IndicLID accuracy
- IndicLID evaluation script
Citing
If you are using any of the resources, please cite the following article:
@misc{madhani2023bhashaabhijnaanam,
title={Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages},
author={Yash Madhani and Mitesh M. Khapra and Anoop Kunchukuttan},
year={2023},
eprint={2305.15814},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
We would like to hear from you if:
- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.
License
The IndicLID code (and models) are released under the MIT License.
Contributors
- Yash Madhani (AI4Bharat, IITM)
- Mitesh M. Khapra (AI4Bharat, IITM)
- Anoop Kunchukuttan (AI4Bharat, Microsoft)
Contact
- Anoop Kunchukuttan (anoop.kunchukuttan@gmail.com)
- Mitesh Khapra (miteshk@cse.iitm.ac.in)
- Pratyush Kumar (pratyush@cse.iitm.ac.in)
Acknowledgements
We would like to thank the Ministry of Electronics and Information Technology of the Government of India for their generous grant through the Digital India Bhashini project. We also thank the Centre for Development of Advanced Computing for providing compute time on the Param Siddhi Supercomputer, Nilekani Philanthropies for their generous grant towards building datasets, models, tools and resources for Indic languages, and Microsoft for their grant to support research on Indic languages. We would like to thank Jay Gala and Ishvinder Sethi for their help in coordinating the annotation work. Most importantly, we would like to thank all the annotators who helped create the Bhasha-Abhijnaanam benchmark.