Bharat Parallel Corpus Collection (BPCC)

BPCC is a comprehensive, publicly available parallel corpus that combines human-labelled and automatically mined data, totaling approximately 230 million bitext pairs.

Bharat Parallel Corpus Collection (BPCC) is a comprehensive and publicly available parallel corpus that includes both existing and new data for all 22 scheduled Indic languages. It comprises two parts, BPCC-Mined and BPCC-Human, totaling approximately 230 million bitext pairs. BPCC-Mined contains about 228 million pairs, of which nearly 126 million are newly added as part of this work. BPCC-Human consists of 2.2 million gold-standard English-Indic pairs, including 644K newly added bitext pairs from English Wikipedia sentences (the BPCC-H-Wiki subset) and 139K newly added sentences covering everyday use cases (the BPCC-H-Daily subset). BPCC provides the first available datasets for 7 languages and significantly increases the available data for all languages covered.

You can find the contribution from different sources in the following table:

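Subset	Status	Source	Pairs
BPCC-Mined	Existing	Samanantar	19.4M
BPCC-Mined	Existing	NLLB	85M
BPCC-Mined	Newly Added	Samanantar++	121.6M
BPCC-Mined	Newly Added	Comparable	4.3M
BPCC-Human	Existing	NLLB-Seed	18.5K
BPCC-Human	Existing	ILCI	1.3M
BPCC-Human	Existing	MASSIVE	115K
BPCC-Human	Newly Added	Wiki	644K
BPCC-Human	Newly Added	Daily	139K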
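
The per-source counts in the table can be tallied to cross-check the headline figures quoted earlier. The snippet below is a minimal sketch in Python; the `SOURCES` list and the `total` helper are illustrative names and are not part of any BPCC release. The counts are copied from the rounded table entries, so the sums only approximate the totals stated in the text.

```python
# Pair counts copied from the table (rounded figures, not exact corpus sizes).
# Each entry: (subset, status, source, number of bitext pairs).
SOURCES = [
    ("BPCC-Mined", "Existing", "Samanantar", 19_400_000),
    ("BPCC-Mined", "Existing", "NLLB", 85_000_000),
    ("BPCC-Mined", "Newly Added", "Samanantar++", 121_600_000),
    ("BPCC-Mined", "Newly Added", "Comparable", 4_300_000),
    ("BPCC-Human", "Existing", "NLLB-Seed", 18_500),
    ("BPCC-Human", "Existing", "ILCI", 1_300_000),
    ("BPCC-Human", "Existing", "MASSIVE", 115_000),
    ("BPCC-Human", "Newly Added", "Wiki", 644_000),
    ("BPCC-Human", "Newly Added", "Daily", 139_000),
]


def total(subset=None, status=None):
    """Sum pair counts, optionally filtered by subset and/or status."""
    return sum(
        pairs
        for row_subset, row_status, _source, pairs in SOURCES
        if (subset is None or row_subset == subset)
        and (status is None or row_status == status)
    )


if __name__ == "__main__":
    print("Newly added mined pairs:", total("BPCC-Mined", "Newly Added"))
    print("Human-labelled pairs:", total("BPCC-Human"))
```

With the table's figures, the newly added mined sources sum to 125.9M pairs (the "nearly 126 million" quoted above) and the BPCC-Human sources sum to roughly 2.2M pairs.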

Additionally, we provide augmented back-translation data generated by our intermediate IndicTrans2 models for training purposes. Please refer to our paper for more details on the selection of sample proportions and sources.

English BT data (English Original)	401.9M
Indic BT data (Indic Original)	400.9M

Download the Data

Data	URL
Bharat Parallel Corpus Collection (BPCC)	download
Back-translation (BPCC-BT)	download

License

The mined corpora collection (BPCC-Mined), the existing seed corpora (NLLB-Seed, ILCI, MASSIVE), and the back-translation data (BPCC-BT) are released under the following licensing scheme:

We do not own any of the text from which this data has been extracted.
We license the actual packaging of this data under the Creative Commons CC0 license (“no rights reserved”).
To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to BPCC-Mined, existing seed corpora (NLLB-Seed, ILCI, MASSIVE) and BPCC-BT.

The following table lists the licenses associated with the different artifacts released as a part of this work:

Artifact	LICENSE
Existing Mined Corpora (NLLB & Samanantar)	CC0
Existing Seed Corpora (NLLB-Seed, ILCI, MASSIVE)	CC0
Newly Added Mined Corpora (Samanantar++ & Comparable)	CC0
Newly Added Seed Corpora (BPCC-H-Wiki & BPCC-H-Daily)	CC-BY-4.0
Back-translation data (BPCC-BT)	CC0

Citation

@article{ai4bharat2023indictrans2,
  title   = {IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author  = {AI4Bharat and Jay Gala and Pranjal A. Chitale and Raghavan AK and Sumanth Doddapaneni and Varun Gumma and Aswanth Kumar and Janki Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M. Khapra and Raj Dabre and Anoop Kunchukuttan},
  year    = {2023},
  journal = {arXiv preprint arXiv:2305.16307}
}