Samanantar is the largest publicly available collection of parallel corpora for 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. The corpus contains 49.6M sentence pairs between English and Indian languages.
Samanantar v0.3, along with LaBSE score metadata, is available for download. Go to Downloads.
The publicly released version is randomly shuffled, untokenized, and deduplicated.
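Because the release ships LaBSE scores alongside the sentence pairs, one common use is to keep only pairs above a similarity threshold. The sketch below illustrates that idea only. It assumes the data is available as line-aligned source sentences, target sentences, and one score per pair; the actual file layout of the download is not described here, so `read_lines`, `filter_pairs`, and the threshold of 0.8 are hypothetical choices for illustration, not part of the Samanantar release.

```python
from typing import Iterable, Iterator, Tuple


def read_lines(path: str) -> Iterator[str]:
    """Yield lines of a UTF-8 text file without the trailing newline.

    Hypothetical helper: assumes one sentence (or one score) per line.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def filter_pairs(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    scores: Iterable[float],
    threshold: float,
) -> Iterator[Tuple[str, str, float]]:
    """Yield (source, target, score) for pairs scoring at or above threshold.

    The three inputs are assumed to be line-aligned; zip stops at the
    shortest input, so mismatched lengths are silently truncated.
    """
    for src, tgt, score in zip(src_lines, tgt_lines, scores):
        if score >= threshold:
            yield src, tgt, score


# Toy in-memory data standing in for the real files (scores are made up).
english = ["Hello", "Good morning", "Thank you"]
hindi = ["नमस्ते", "धन्यवाद", "धन्यवाद"]
labse_scores = [0.91, 0.42, 0.88]

# 0.8 is an arbitrary illustrative threshold, not a recommended value.
kept = list(filter_pairs(english, hindi, labse_scores, threshold=0.8))
for src, tgt, score in kept:
    print(f"{score:.2f}\t{src}\t{tgt}")
```

With real data, the in-memory lists could be replaced by `read_lines(...)` calls on whichever files hold the sentences, with the score column parsed to `float`. Since the released text is untokenized, any tokenization would be applied after this filtering step.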
The test sets used to benchmark IndicTrans can be found here
The Semantic Textual Similarity (STS) benchmark can be downloaded from here
The entire dataset can be downloaded from Samanantar v0.3.
The folder has 2 directories
The entire Indic-Indic data can be downloaded from here
If you are using any of the resources, please cite the following article:
@misc{ramesh2021samanantar,
  title={Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
  author={Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
  year={2021},
  eprint={2104.05596},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
The BibTeX entries for the existing data sources are available here
Samanantar is released under this licensing scheme: