IndicTrans2

Published in Transactions on Machine Learning Research (TMLR), 2023

IndicTrans2 is the first open-source transformer-based multilingual NMT model that supports high-quality translations across all the 22 scheduled Indic languages, including multiple scripts for low-resource languages such as Kashmiri, Manipuri and Sindhi. It adopts script unification wherever feasible to leverage transfer learning through lexical sharing between languages. Overall, the model supports five scripts: Perso-Arabic (Kashmiri, Sindhi, Urdu), Ol Chiki (Santali), Meitei (Manipuri), Latin (English), and Devanagari (used for all the remaining languages).
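Script unification here means mapping structurally similar Indic scripts onto a common script (Devanagari) before tokenization, so that cognates share subwords across languages. Below is a minimal illustrative sketch of the idea using the rule-based transliterator from indic-nlp-library; the repository's own preprocessing scripts are the canonical implementation, so treat this as a sketch only:

# Illustrative sketch: map Indic text in different scripts onto Devanagari
# so that related languages share surface forms (and hence subwords).
# Uses indic-nlp-library's Unicode-offset transliterator.
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

samples = {
    "gu": "ભારત",  # Gujarati script
    "pa": "ਭਾਰਤ",  # Gurmukhi script
    "hi": "भारत",  # already Devanagari
}

for lang, text in samples.items():
    # transliterate(text, src_lang, tgt_lang) takes 2-letter ISO codes;
    # "hi" serves as the Devanagari target here.
    unified = UnicodeIndicTransliterator.transliterate(text, lang, "hi")
    print(f"{lang}: {text} -> {unified}")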

We open-source all our training data (BPCC), back-translation data (BPCC-BT), final IndicTrans2 models, evaluation benchmarks (IN22, which includes IN22-Gen and IN22-Conv), and training and inference scripts for easier use and adoption within the research community. We hope that this will foster even more research in low-resource Indic languages, leading to further improvements in the quality of low-resource translation through contributions from the research community.

This code repository contains instructions for downloading the artifacts associated with IndicTrans2, as well as the code for training/fine-tuning the multilingual NMT models.

Here is the list of languages supported by the IndicTrans2 models:

Assamese (asm_Beng)  | Kashmiri (Arabic) (kas_Arab)     | Punjabi (pan_Guru)
Bengali (ben_Beng)   | Kashmiri (Devanagari) (kas_Deva) | Sanskrit (san_Deva)
Bodo (brx_Deva)      | Maithili (mai_Deva)              | Santali (sat_Olck)
Dogri (doi_Deva)     | Malayalam (mal_Mlym)             | Sindhi (Arabic) (snd_Arab)
English (eng_Latn)   | Marathi (mar_Deva)               | Sindhi (Devanagari) (snd_Deva)
Konkani (gom_Deva)   | Manipuri (Bengali) (mni_Beng)    | Tamil (tam_Taml)
Gujarati (guj_Gujr)  | Manipuri (Meitei) (mni_Mtei)     | Telugu (tel_Telu)
Hindi (hin_Deva)     | Nepali (npi_Deva)                | Urdu (urd_Arab)
Kannada (kan_Knda)   | Odia (ory_Orya)                  |
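The codes above (FLORES-style lang_Script tags) are what the models consume at inference time. Here is a minimal inference sketch using the Hugging Face checkpoints; the IndicTransToolkit helper and its IndicProcessor API follow the published model cards, but the import path and generation settings are assumptions, so consult the repository's inference scripts for the canonical pipeline:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# IndicTransToolkit provides pre/post-processing around the HF checkpoints;
# installed separately (pip install IndicTransToolkit). Import path may
# differ across toolkit versions.
from IndicTransToolkit.processor import IndicProcessor

ckpt = "ai4bharat/indictrans2-en-indic-dist-200M"  # distilled En-Indic model
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt, trust_remote_code=True)

ip = IndicProcessor(inference=True)
sents = ["IndicTrans2 supports all 22 scheduled Indic languages."]

# Preprocessing prepends the "src_lang tgt_lang" tag pair the model expects.
batch = ip.preprocess_batch(sents, src_lang="eng_Latn", tgt_lang="hin_Deva")
inputs = tokenizer(batch, padding="longest", return_tensors="pt")

with torch.inference_mode():
    out = model.generate(**inputs, num_beams=5, max_length=256)

decoded = tokenizer.batch_decode(out, skip_special_tokens=True)
print(ip.postprocess_batch(decoded, lang="hin_Deva"))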

Download Models and Other Artifacts

Multilingual Translation Models

Model                        | En-Indic | Indic-En | Indic-Indic | Evaluations
Base (used for benchmarking) | download | download | download    | translations (as of May 10, 2023), metrics
Distilled                    | download | download | download    |

Training Data

Data                                     | URL
Bharat Parallel Corpus Collection (BPCC) | download
Back-translation (BPCC-BT)               | download

Evaluation Data

Data                    | URL
IN22 test set           | download
FLORES-22 Indic dev set | download

Installation

Instructions to set up and install everything before running the code.

# Clone the github repository and navigate to the project directory.
git clone https://github.com/AI4Bharat/IndicTrans2
cd IndicTrans2

# Install all the dependencies and requirements associated with the project.
source install.sh

Note: We recommend creating a virtual environment with python>=3.7.

Data

Training
Bharat Parallel Corpus Collection (BPCC) is a comprehensive and publicly available parallel corpus that includes both existing and new data for all 22 scheduled Indic languages. It comprises two parts, BPCC-Mined and BPCC-Human, totaling approximately 230 million bitext pairs. BPCC-Mined contains about 228 million pairs, with nearly 126 million pairs newly added as a part of this work. On the other hand, BPCC-Human consists of 2.2 million gold standard English-Indic pairs, with an additional 644K bitext pairs from English Wikipedia sentences (forming the BPCC-H-Wiki subset) and 139K sentences covering everyday use cases (forming the BPCC-H-Daily subset). It is worth highlighting that BPCC provides the first available datasets for 7 languages and significantly increases the available data for all languages covered.

You can find the contributions from different sources in the following table:

Subset     | Type        | Source       | Pairs
BPCC-Mined | Existing    | Samanantar   | 19.4M
BPCC-Mined | Existing    | NLLB         | 85M
BPCC-Mined | Newly Added | Samanantar++ | 121.6M
BPCC-Mined | Newly Added | Comparable   | 4.3M
BPCC-Human | Existing    | NLLB         | 18.5K
BPCC-Human | Existing    | ILCI         | 1.3M
BPCC-Human | Existing    | Massive      | 115K
BPCC-Human | Newly Added | Wiki         | 644K
BPCC-Human | Newly Added | Daily        | 139K

Additionally, we provide augmented back-translation data generated by our intermediate IndicTrans2 models for training purposes. Please refer to our paper for more details on the selection of sample proportions and sources.

Data                               | Pairs
English BT data (English Original) | 401.9M
Indic BT data (Indic Original)     | 400.9M
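In back-translation, authentic target-side monolingual text is paired with machine-generated source-side text from a reverse-direction model (e.g., an Indic-En model produces the synthetic English side for En-Indic training). A tiny illustrative sketch of the pairing, with the reverse translation function left abstract:

from typing import Callable, Iterable

def back_translate(
    monolingual_tgt: Iterable[str],
    translate_tgt_to_src: Callable[[str], str],
) -> list[tuple[str, str]]:
    # Pair each authentic target sentence with a synthetic source produced
    # by a reverse-direction model; the forward model is then trained on
    # these pairs mixed with genuine bitext.
    return [(translate_tgt_to_src(t), t) for t in monolingual_tgt]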

Evaluation

The IN22 test set is a newly created comprehensive benchmark for evaluating machine translation performance in multi-domain, n-way parallel contexts across 22 Indic languages. It has been created from three distinct subsets, namely IN22-Wiki, IN22-Web and IN22-Conv. The Wikipedia and Web subsets offer diverse content spanning news, entertainment, culture, legal, and India-centric topics. IN22-Wiki and IN22-Web have been combined for evaluation purposes and released as IN22-Gen, while IN22-Conv, the conversation-domain subset, is designed to assess translation quality in typical day-to-day conversational-style applications.

Subset                          | Sentences | URL
IN22-Gen (IN22-Wiki + IN22-Web) | 1024      | 🤗 ai4bharat/IN22-Gen
IN22-Conv                       | 1503      | 🤗 ai4bharat/IN22-Conv
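As a rough guide, here is a hedged sketch of scoring system outputs against IN22-Gen with chrF++ (via sacrebleu), one of the metrics reported in the paper. The dataset config, split, and column names below are assumptions for illustration; consult the dataset cards for the actual schema:

from datasets import load_dataset
from sacrebleu.metrics import CHRF

# Config/split/column names are illustrative assumptions; check the
# ai4bharat/IN22-Gen dataset card for the real schema.
ds = load_dataset("ai4bharat/IN22-Gen", "eng_Latn-hin_Deva", split="gen")
sources = [ex["sentence_eng_Latn"] for ex in ds]
references = [ex["sentence_hin_Deva"] for ex in ds]

# Replace with your system's translations of `sources`
# (e.g., via the inference sketch earlier in this README).
hypotheses = list(sources)  # placeholder

chrf_pp = CHRF(word_order=2)  # word_order=2 yields chrF++
print(chrf_pp.corpus_score(hypotheses, [references]))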

For more details about running the models, visit our GitHub repository.

License

The mined corpora collection (BPCC-Mined), existing seed corpora (NLLB-Seed, ILCI, MASSIVE), and back-translation data (BPCC-BT) are released under the following licensing scheme:

  • We do not own any of the text from which this data has been extracted.
  • We license the actual packaging of this data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to BPCC-Mined, existing seed corpora (NLLB-Seed, ILCI, MASSIVE) and BPCC-BT.

The following table lists the licenses associated with the different artifacts released as a part of this work:

Artifact                                              | License
Existing Mined Corpora (NLLB & Samanantar)            | CC0
Existing Seed Corpora (NLLB-Seed, ILCI, MASSIVE)      | CC0
Newly Added Mined Corpora (Samanantar++ & Comparable) | CC0
Newly Added Seed Corpora (BPCC-H-Wiki & BPCC-H-Daily) | CC-BY-4.0
Newly Created IN22 test set (IN22-Gen & IN22-Conv)    | CC-BY-4.0
Back-translation data (BPCC-BT)                       | CC0
Model checkpoints                                     | MIT

Citation

@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}