Paper | Huggingface | Benchmarking
Aksharantar is the largest publicly available transliteration dataset for Indic languages, covering 21 languages with 26M Indic language-English transliteration pairs. Benchmarking results on the Aksharantar test set using the IndicXlit model can be found here. More details about Aksharantar can be found in the paper.
Downloads
- The Aksharantar dataset can be downloaded from the Aksharantar Hugging Face repository.
- Each language-pair corpus in the Aksharantar dataset is split into training, validation, and test subsets. Each subset is a JSONL file in which every data instance comprises a unique identifier, the native-script word, the English word, the transliteration source, and a score (where applicable).
- Individual language-pair download links are provided in the data split below.
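As a sketch of how the JSONL subsets described above can be consumed: each line is an independent JSON record, so a file can be parsed line by line. Note that the field names used below (`unique_identifier`, `native word`, `english word`, `source`, `score`) are illustrative assumptions based on the description above, not the guaranteed schema; check the downloaded files for the actual keys.

```python
import json

# Illustrative JSONL lines; the keys are assumptions based on the
# field description in this README, not the guaranteed schema.
sample_jsonl = """\
{"unique_identifier": "hin_1", "native word": "नमस्ते", "english word": "namaste", "source": "manual", "score": null}
{"unique_identifier": "hin_2", "native word": "भारत", "english word": "bharat", "source": "mined", "score": 0.94}
"""

def read_pairs(lines):
    """Parse JSONL transliteration records into (native, english) pairs."""
    pairs = []
    for line in lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        pairs.append((record["native word"], record["english word"]))
    return pairs

print(read_pairs(sample_jsonl))  # [('नमस्ते', 'namaste'), ('भारत', 'bharat')]
```

The same loop works on a real subset file by replacing `lines.splitlines()` with iteration over an open file handle.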
Data Split
The language-wise splits for Aksharantar are shown in the table below, with the number of word pairs per subset. Individual download links for each language pair are provided via the hyperlinks in the header row.
Subset | as-en (4.72 MB) | bn-en (31.5 MB) | brx-en (0.933 MB) | gu-en (29.5 MB) | hi-en (31.4 MB) | kn-en (83.7 MB) | ks-en (1.1 MB) | kok-en (16.6 MB) | mai-en (6.74 MB) | ml-en (125 MB) | mni-en (0.313 MB) | mr-en (39.9 MB) | ne-en (67 MB) | or-en (9.09 MB) | pa-en (12.1 MB) | sa-en (56 MB) | sd-en (1.37 MB) | ta-en (92.7 MB) | te-en (69.1 MB) | ur-en (17 MB) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Training | 179K | 1231K | 36K | 1143K | 1299K | 2907K | 47K | 613K | 283K | 4101K | 10K | 1453K | 2397K | 346K | 515K | 1813K | 60K | 3231K | 2430K | 699K |
Validation | 4K | 11K | 3K | 12K | 6K | 7K | 4K | 4K | 4K | 8K | 3K | 8K | 3K | 3K | 9K | 3K | 8K | 9K | 8K | 12K |
Test | 5531 | 5009 | 4136 | 7768 | 5693 | 6396 | 7707 | 5093 | 5512 | 6911 | 4925 | 6573 | 4133 | 4256 | 4316 | 5334 | – | 4682 | 4567 | 4463 |
Change Log
- 07 May 2022 – The Aksharantar dataset is now available for download.
Contributors
- Yash Madhani (AI4Bharat, IITM)
- Sushane Parthan (AI4Bharat, IITM)
- Priyanka Bedekar (AI4Bharat, IITM)
- Ruchi Khapra (AI4Bharat)
- Gokul NC (AI4Bharat)
- Anoop Kunchukuttan (AI4Bharat, Microsoft)
- Pratyush Kumar (AI4Bharat, IITM, Microsoft)
- Mitesh Shantadevi Khapra (AI4Bharat, IITM)
Citing
If you are using any of the resources, please cite the following article:
@misc{madhani2022aksharantar,
title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users},
author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
year={2022},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
License
This data is released under the following licensing scheme:
- Manually collected data: Released under CC-BY license.
- Mined dataset (from Samanantar and IndicCorp): Released under CC0 license.
- Existing sources: Released under CC0 license.
CC-BY License
CC0 License Statement
- We do not own any of the text from which this data has been extracted.
- We license the actual packaging of the mined data under the Creative Commons CC0 license (“no rights reserved”).
- To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Aksharantar manually collected data and existing sources.
- This work is published from: India.
Contact
- Anoop Kunchukuttan (anoop.kunchukuttan@gmail.com)
- Mitesh Khapra (miteshk@cse.iitm.ac.in)
- Pratyush Kumar (pratyush@cse.iitm.ac.in)
Acknowledgements
We would like to thank the EkStep Foundation for their generous grant, which helped set up the Centre for AI4Bharat at IIT Madras to support our students, research staff, and data and computational requirements. We would like to thank the Ministry of Electronics and Information Technology (NLTM) for its grant to support the creation of datasets and models for Indian languages under its ambitious Bhashini project. We would also like to thank the Centre for Development of Advanced Computing, India (C-DAC) for providing access to the Param Siddhi supercomputer for training our models. Lastly, we would like to thank Microsoft for its grant to create datasets, tools, and resources for Indian languages.