India has 22 constitutionally recognised languages with a collective base of over 1 billion speakers. With increasing digital penetration and a growing preference for regional-language content on the web, a good translation system for Indian languages is necessary to provide equitable access to information and content. Despite this fundamental need, the accuracy of machine translation (MT) systems to and from Indic languages is poorer than that of systems for several European languages. At AI4Bharat, our goal is to bridge this gap by (i) mining cheaper parallel data from the web, (ii) manually collecting a small amount of seed data, (iii) creating robust India-centric benchmarks, and (iv) building efficient multilingual models that exploit the similarity between Indian languages.
Open-source datasets (Samanantar) and models (IndicTrans) for neural machine translation between English and 12 Indic languages.
BPCC is a comprehensive, publicly available parallel corpus containing a mix of human-labelled and automatically mined data, totalling approximately 230 million bitext pairs.
A multilingual, single-script, Transformer-based model for translating between English and Indian languages. The model is trained on the Samanantar corpus and, at the time of its release, was the state-of-the-art open-source model as evaluated on Facebook's FLORES benchmark.
The first open-source Transformer-based multilingual NMT model that supports high-quality translation across all 22 scheduled Indic languages.
IndicTrans powers the translation workflow in our open-source annotation platform Shoonya by pre-filling automatic translations, which human translators can then edit. These automatic translations act as initial hints and reduce the translators' cognitive load, thereby improving the efficiency of human translation.
We are in discussions with DesiCrew to deploy Shoonya to help them better manage their translation workflows.
Our goal is to improve the reach of this monthly science and technology magazine, published by IIT Madras, by helping translate it into multiple Indian languages using Shoonya.