IndicLID – AI4BHĀRAT

GitHub | Downloads | Paper

IndicLID is a language identifier (LID) for all 22 Indian languages listed in the Indian constitution, covering both native-script and romanized text. It is the first LID for romanized text in Indian languages. It is a two-stage classifier: an ensemble of a fast linear classifier and a slower classifier fine-tuned from a pre-trained language model. It predicts 47 classes: 24 native-script classes, 21 roman-script classes, English, and Other. All the classes are listed below.
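As a rough illustration of how a fast classifier and a slower one can be combined, the sketch below uses a confidence-based fallback: the fast model answers when it is confident, otherwise the slow model is consulted. This is a generic pattern, not the IndicLID implementation. The function names, the toy classifiers, and the 0.6 threshold are all assumptions made for this example; see the IndicLID page and paper for the actual design.

```python
from typing import Callable, Tuple

# A classifier maps text to (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def two_stage_predict(
    text: str,
    fast: Classifier,
    slow: Classifier,
    threshold: float = 0.6,  # hypothetical cut-off, chosen for illustration only
) -> Tuple[str, float, str]:
    """Return (label, confidence, stage), where stage names the model that answered."""
    label, conf = fast(text)
    if conf >= threshold:
        return label, conf, "fast"
    # The fast model is unsure, so defer to the slower model.
    label, conf = slow(text)
    return label, conf, "slow"


# Toy stand-ins so the sketch runs; real models would replace these.
def toy_fast(text: str) -> Tuple[str, float]:
    if "namaste" in text.lower():
        return "hin_Latn", 0.9
    return "other", 0.3


def toy_slow(text: str) -> Tuple[str, float]:
    return "eng_Latn", 0.8


if __name__ == "__main__":
    print(two_stage_predict("Namaste dosto", toy_fast, toy_slow))
    print(two_stage_predict("hello there", toy_fast, toy_slow))
```

With this arrangement, most inputs pay only the cost of the fast model, and the slow model runs only on the harder cases.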

Language	IndicLID Code
Assamese (Bengali script)	asm_Beng
Assamese (Latin script)	asm_Latn
Bangla (Bengali script)	ben_Beng
Bangla (Latin script)	ben_Latn
Bodo (Devanagari script)	brx_Deva
Bodo (Latin script)	brx_Latn
Dogri (Devanagari script)	doi_Deva
Dogri (Latin script)	doi_Latn
English (Latin script)	eng_Latn
Gujarati (Gujarati script)	guj_Gujr
Gujarati (Latin script)	guj_Latn
Hindi (Devanagari script)	hin_Deva
Hindi (Latin script)	hin_Latn
Kannada (Kannada script)	kan_Knda
Kannada (Latin script)	kan_Latn
Kashmiri (Perso-Arabic script)	kas_Arab
Kashmiri (Devanagari script)	kas_Deva
Kashmiri (Latin script)	kas_Latn
Konkani (Devanagari script)	kok_Deva
Konkani (Latin script)	kok_Latn
Maithili (Devanagari script)	mai_Deva
Maithili (Latin script)	mai_Latn
Malayalam (Malayalam script)	mal_Mlym
Malayalam (Latin script)	mal_Latn
Manipuri (Bengali script)	mni_Beng
Manipuri (Meetei Mayek script)	mni_Meti
Manipuri (Latin script)	mni_Latn
Marathi (Devanagari script)	mar_Deva
Marathi (Latin script)	mar_Latn
Nepali (Devanagari script)	nep_Deva
Nepali (Latin script)	nep_Latn
Oriya (Oriya script)	ori_Orya
Oriya (Latin script)	ori_Latn
Panjabi (Gurmukhi script)	pan_Guru
Panjabi (Latin script)	pan_Latn
Sanskrit (Devanagari script)	san_Deva
Sanskrit (Latin script)	san_Latn
Santali (Ol Chiki script)	sat_Olch
Sindhi (Perso-Arabic script)	snd_Arab
Sindhi (Latin script)	snd_Latn
Tamil (Tamil script)	tam_Tamil
Tamil (Latin script)	tam_Latn
Telugu (Telugu script)	tel_Telu
Telugu (Latin script)	tel_Latn
Urdu (Perso-Arabic script)	urd_Arab
Urdu (Latin script)	urd_Latn
Other	other
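Each code in the table follows a language_Script pattern, with "other" as the single exception. The short Python sketch below shows one way to split a predicted code into its parts and to check the class counts stated above. The helper name split_code is hypothetical and not part of IndicLID; the code list is copied from the table.

```python
from typing import Optional, Tuple

# All 47 class codes, copied from the table above.
CODES = """
asm_Beng asm_Latn ben_Beng ben_Latn brx_Deva brx_Latn doi_Deva doi_Latn
eng_Latn guj_Gujr guj_Latn hin_Deva hin_Latn kan_Knda kan_Latn kas_Arab
kas_Deva kas_Latn kok_Deva kok_Latn mai_Deva mai_Latn mal_Mlym mal_Latn
mni_Beng mni_Meti mni_Latn mar_Deva mar_Latn nep_Deva nep_Latn ori_Orya
ori_Latn pan_Guru pan_Latn san_Deva san_Latn sat_Olch snd_Arab snd_Latn
tam_Tamil tam_Latn tel_Telu tel_Latn urd_Arab urd_Latn other
""".split()


def split_code(code: str) -> Tuple[str, Optional[str]]:
    """Split 'hin_Latn' into ('hin', 'Latn'); 'other' has no script part."""
    if "_" not in code:
        return code, None
    lang, script = code.split("_", 1)
    return lang, script


# Native-script classes: any script other than Latin.
native = [c for c in CODES if split_code(c)[1] not in (None, "Latn")]
# Roman-script classes: Latin script, excluding English.
roman = [c for c in CODES if split_code(c)[1] == "Latn" and c != "eng_Latn"]
# Distinct Indian languages covered by either group.
languages = {split_code(c)[0] for c in native + roman}

if __name__ == "__main__":
    print(len(CODES), len(native), len(roman), len(languages))  # 47 24 21 22
```

The counts match the description: 24 native-script classes and 21 roman-script classes across 22 languages, plus English and Other.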

Learn more about IndicLID

Visit the IndicLID page to learn more about the models, including:

Downloading IndicLID
Using the publicly available models
IndicLID accuracy
IndicLID evaluation script

Citing

If you use any of these resources, please cite the following article:

@misc{madhani2023bhashaabhijnaanam,
      title={Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages}, 
      author={Yash Madhani and Mitesh M. Khapra and Anoop Kunchukuttan},
      year={2023},
      eprint={2305.15814},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

We would like to hear from you if:

You are using our resources. Please let us know how you are putting these resources to use.
You have any feedback on these resources.

License

The IndicLID code and models are released under the MIT License.

Contributors

Yash Madhani (AI4Bharat, IITM)
Mitesh M. Khapra (AI4Bharat, IITM)
Anoop Kunchukuttan (AI4Bharat, Microsoft)

Contact

Anoop Kunchukuttan (anoop.kunchukuttan@gmail.com)
Mitesh Khapra (miteshk@cse.iitm.ac.in)
Pratyush Kumar (pratyush@cse.iitm.ac.in)

Acknowledgements

We would like to thank the Ministry of Electronics and Information Technology of the Government of India for their generous grant through the Digital India Bhashini project. We thank the Centre for Development of Advanced Computing for providing compute time on the Param Siddhi Supercomputer, Nilekani Philanthropies for their generous grant towards building datasets, models, tools and resources for Indic languages, and Microsoft for their grant to support research on Indic languages. We are grateful to Jay Gala and Ishvinder Sethi for their help in coordinating the annotation work. Most importantly, we would like to thank all the annotators who helped create the Bhasha-Abhijnaanam benchmark.