MACHINE TRANSLITERATION

Datasets

Bhasha-Abhijnaanam Dataset

Bhasha-Abhijnaanam is a language identification test set for native-script as well as Romanized text spanning 22 Indic languages.
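
To make the distinction between native-script and romanized text concrete, here is a minimal sketch that labels a string by the Unicode script of its letters. It is only an illustration: the function name `classify_script` and its labels are hypothetical, it does not identify the language, and it is not how the test set or any AI4Bharat model works.

```python
import unicodedata


def classify_script(text: str) -> str:
    """Label text by the script of its letters (hypothetical helper).

    Returns "romanized" if every letter is Latin, "native-script" if no
    letter is Latin, "mixed" if both occur, and "unknown" if the text
    contains no letters at all.
    """
    latin = other = 0
    for ch in text:
        # Skip digits, punctuation, spaces, and combining marks.
        if not ch.isalpha():
            continue
        # Unicode character names begin with the script name, e.g.
        # "LATIN SMALL LETTER N" or "DEVANAGARI LETTER NA".
        if unicodedata.name(ch, "").startswith("LATIN"):
            latin += 1
        else:
            other += 1
    if latin and other:
        return "mixed"
    if latin:
        return "romanized"
    if other:
        return "native-script"
    return "unknown"


print(classify_script("namaste"))
print(classify_script("नमस्ते"))
```

A script check like this only separates the two input types. Deciding which of the 22 languages a romanized string belongs to is the harder problem the test set is meant to evaluate.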

📜Aksharantar dataset

BPCC is a comprehensive, publicly available parallel corpus containing a mix of human-labelled and automatically mined data, totaling approximately 230 million bitext pairs.

Models

🚀IndicXlit model

A multilingual transformer-based model for transliterating romanized input into native-language scripts, supporting 21 languages. It is trained on the Aksharantar corpus and, at the time of its release, was the state-of-the-art open-source model as evaluated on Google's Dakshina benchmark and our Aksharantar benchmark.

IndicLID

A model for identifying the language of romanized content on social media. This model will be trained using the large number of romanized Indian language words in Aksharantar.

Tools

⌨️IndicXlit Keyboard

A transliteration interface (En-Indic online keyboard) that converts romanized text into Indic text as you type.

💻IndicXlit Converter

A transliteration interface that converts Indic text into romanized text and vice versa.

📱IndicSwipe

An ongoing research project on swipe-based keypad typing for Indic languages on Android devices.

Indic Input Tools

Enhance your typing experience in Chrome with AI4Bharat's Input Tools extension. It provides real-time transliteration suggestions for Indian languages and integrates seamlessly into your typing workflow.

Our Partners

Pratham Books

We have partnered with Pratham Books to enable romanized keyboards in their translation interface for low-resource languages such as Bodo, Kashmiri, Konkani, Maithili, Nepali, and Urdu.