
Transliteration

To learn more about our contributions over the years, see the timeline below.

At AI4Bharat, we are advancing transliteration and language identification to support India's linguistic diversity across its 22 constitutionally recognized languages.

For transliteration, we introduce Aksharantar, the largest publicly available transliteration dataset for Indian languages, containing 26 million transliteration pairs across 21 languages and 12 scripts. It is 21 times larger than existing resources and is the first to cover 7 languages and 1 language family. Alongside Aksharantar, we release the Aksharantar test set of 103,000 word pairs, which enables fine-grained evaluation of transliteration models. Using this dataset, we developed IndicXlit, a multilingual transliteration model that improves accuracy by 15% on established benchmarks.

For language identification, we created Bhasha-Abhijnaanam, a comprehensive language identification (LID) test set covering native-script and romanized text in all 22 Indic languages. IndicLID, our language identifier, handles both native-script and romanized text and supports 47 classes, including English and Others. It sets a new standard for language identification in romanized Indian text, addressing challenges such as limited training data and close similarity between languages.

These resources are publicly available to drive further innovation in Indic language transliteration and identification.

Timeline


© 2024 AI4Bharat. All rights reserved
