IndicCorp is a large, publicly available corpus for Indian languages. It was built over several months by discovering and scraping thousands of web sources, primarily news, magazines, and books.
It has been used to train our released models, which have obtained state-of-the-art performance on many tasks.
The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized, and deduplicated.
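Because the file is large and holds one sentence per line, it is usually safer to stream it than to load it whole. The sketch below is only an illustration: the helper names `head_sample` and `corpus_stats` are hypothetical, and the token count is a plain whitespace split, which may not match how the token counts in the table were computed. Assuming the shuffle is uniform, the first `n` lines can also serve as a rough random sample.

```python
from itertools import islice


def head_sample(path, n):
    """Return the first n lines (sentences) of a one-sentence-per-line file."""
    with open(path, "r", encoding="utf-8") as fp:
        return [line.rstrip("\n") for line in islice(fp, n)]


def corpus_stats(path):
    """Count non-empty sentences and whitespace-separated tokens in one pass."""
    num_sentences = 0
    num_tokens = 0
    with open(path, "r", encoding="utf-8") as fp:
        for line in fp:  # lazy iteration; the file is never fully in memory
            sent = line.strip()
            if not sent:
                continue  # skip blank lines, if any
            num_sentences += 1
            num_tokens += len(sent.split())  # whitespace tokens (an assumption)
    return num_sentences, num_tokens
```

For example, `corpus_stats('kn')` would return a `(sentences, tokens)` pair for a downloaded Kannada file named `kn`.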
Language | # News Articles* | Sentences | Tokens | Link |
---|---|---|---|---|
as | 0.60M | 1.39M | 32.6M | link |
bn | 3.83M | 39.9M | 836M | link |
en | 3.49M | 54.3M | 1.22B | link |
gu | 2.63M | 41.1M | 719M | link |
hi | 4.95M | 63.1M | 1.86B | link |
kn | 3.76M | 53.3M | 713M | link |
ml | 4.75M | 50.2M | 721M | link |
mr | 2.31M | 34.0M | 551M | link |
or | 0.69M | 6.94M | 107M | link |
pa | 2.64M | 29.2M | 773M | link |
ta | 4.41M | 31.5M | 582M | link |
te | 3.98M | 47.9M | 674M | link |
For processing the corpus into other forms (tokenized, transliterated, etc.), you can use the indicnlp library. As an example, the following code snippet can be used to tokenize the corpus:
from indicnlp.tokenize.indic_tokenize import trivial_tokenize
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

lang = 'kn'
input_path = 'kn'  # corpus file, one sentence per line
output_path = 'kn.tok.txt'

normalizer_factory = IndicNormalizerFactory()
normalizer = normalizer_factory.get_normalizer(lang)

def process_sent(sent):
    # Normalize the text, then join the tokens with single spaces.
    normalized = normalizer.normalize(sent)
    processed = ' '.join(trivial_tokenize(normalized, lang))
    return processed

with open(input_path, 'r', encoding='utf-8') as in_fp, open(output_path, 'w', encoding='utf-8') as out_fp:
    # Iterate lazily; readlines() would load the entire corpus into memory.
    for line in in_fp:
        sent = line.rstrip('\n')
        toksent = process_sent(sent)
        out_fp.write(toksent)
        out_fp.write('\n')
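The same loop works for any per-sentence conversion, not only tokenization. As a minimal sketch, the hypothetical helper `transform_corpus` below (not part of indicnlp) takes any function that maps one sentence string to another and streams the file through it, keeping the output line-aligned with the input.

```python
def transform_corpus(input_path, output_path, transform):
    """Apply transform (str -> str) to every line and write one result per line.

    Returns the number of lines written, which should equal the number of
    lines in the input file.
    """
    num_written = 0
    with open(input_path, "r", encoding="utf-8") as in_fp, \
            open(output_path, "w", encoding="utf-8") as out_fp:
        for line in in_fp:  # stream line by line to keep memory use constant
            out_fp.write(transform(line.rstrip("\n")))
            out_fp.write("\n")
            num_written += 1
    return num_written
```

With the snippet above, `transform_corpus('kn', 'kn.tok.txt', process_sent)` would produce the same tokenized file; a transliteration function could be passed in the same way.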
If you are using IndicCorp, please cite the following article:
@inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Kakwani, Divyanshu and Kunchukuttan, Anoop and Golla, Satish and N.C., Gokul and Bhattacharyya, Avik and Khapra, Mitesh M. and Kumar, Pratyush},
    booktitle={Findings of the Association for Computational Linguistics: EMNLP 2020},
    pages={4948--4961},
    year={2020},
    publisher={Association for Computational Linguistics}
}
IndicCorp is released under this licensing scheme: