To learn more about our contributions over the years, see the timeline below!
At AI4Bharat, we have made significant strides in machine translation for Indic languages. Our Samanantar corpus, the largest publicly available parallel corpus for Indic languages at the time of its release, contains 49.7 million sentence pairs between English and 11 Indic languages, of which 37.4 million were newly mined. Multilingual NMT models trained on this corpus outperform existing models on public benchmarks. We developed IndicTrans, a Transformer-based multilingual NMT model trained on Samanantar, and IndicTrans2, the first open-source model to support high-quality translation across all 22 scheduled Indic languages. IndicTrans2 handles languages written in multiple scripts and uses script unification to improve transfer learning across languages. We also introduced the Bharat Parallel Corpus Collection (BPCC), which contains approximately 230 million bitext pairs covering all 22 scheduled Indic languages. BPCC consists of BPCC-Mined, with 228 million mined pairs, and BPCC-Human, with 2.2 million gold-standard pairs, and it provides new resources for 7 Indic languages. Together, these datasets and models give the community open training data and translation systems for every scheduled Indic language.