IndicCorp: A Large Corpus for Indian Languages

IndicCorp is a large, publicly available corpus for Indian languages. It was developed over several months by discovering and scraping thousands of web sources, primarily news sites, magazines, and books.

It has been used to train our released models, which have obtained state-of-the-art performance on many tasks.

Corpus Format

For each language, the corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized, and deduplicated.
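
Because each file holds one sentence per line, it can be streamed rather than loaded into memory. The following is a minimal sketch and not part of the IndicCorp tooling: the helper name `corpus_stats` is hypothetical, and tokens are counted with a plain whitespace split, so the result may not match the Tokens column of the Downloads table.

```python
def corpus_stats(path):
    """Count sentences and whitespace-separated tokens in a
    one-sentence-per-line text file (hypothetical helper)."""
    n_sentences = 0
    n_tokens = 0
    with open(path, 'r', encoding='utf-8') as fp:
        # Iterating over the file object reads one line at a time.
        for line in fp:
            sent = line.strip()
            if not sent:
                continue  # skip blank lines, if any
            n_sentences += 1
            n_tokens += len(sent.split())
    return n_sentences, n_tokens
```

Calling `corpus_stats('kn')` on a downloaded file would return a `(sentences, tokens)` pair; the path `'kn'` here is only an assumed file name.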

Downloads

Language	# News Articles*	Sentences	Tokens	Link
as	0.60M	1.39M	32.6M	link
bn	3.83M	39.9M	836M	link
en	3.49M	54.3M	1.22B	link
gu	2.63M	41.1M	719M	link
hi	4.95M	63.1M	1.86B	link
kn	3.76M	53.3M	713M	link
ml	4.75M	50.2M	721M	link
mr	2.31M	34.0M	551M	link
or	0.69M	6.94M	107M	link
pa	2.64M	29.2M	773M	link
ta	4.41M	31.5M	582M	link
te	3.98M	47.9M	674M	link
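
The Sentences and Tokens columns can be combined to estimate the average sentence length per language. The sketch below is illustrative only: `parse_count` is a hypothetical helper, and the three rows are copied by hand from the table above.

```python
# Multipliers for the suffixes used in the table (M = million, B = billion).
SUFFIXES = {'M': 1e6, 'B': 1e9}

def parse_count(text):
    """Convert a table value such as '63.1M' or '1.86B' to a float."""
    return float(text[:-1]) * SUFFIXES[text[-1]]

# (sentences, tokens) for three rows of the Downloads table.
rows = {
    'hi': ('63.1M', '1.86B'),
    'kn': ('53.3M', '713M'),
    'en': ('54.3M', '1.22B'),
}

for lang, (sentences, tokens) in rows.items():
    avg = parse_count(tokens) / parse_count(sentences)
    print(f'{lang}: {avg:.1f} tokens per sentence')
```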

Processing Corpus

For processing the corpus into other forms (tokenized, transliterated, etc.), you can use the indicnlp library. As an example, the following code snippet can be used to tokenize the corpus:


from indicnlp.tokenize.indic_tokenize import trivial_tokenize
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

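lang = 'kn'                 # language code (Kannada)
input_path = 'kn'           # path to the downloaded corpus file
output_path = 'kn.tok.txt'  # tokenized output, one sentence per line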

normalizer_factory = IndicNormalizerFactory()
normalizer = normalizer_factory.get_normalizer(lang)

def process_sent(sent):
    normalized = normalizer.normalize(sent)
    processed = ' '.join(trivial_tokenize(normalized, lang))
    return processed

with open(input_path, 'r', encoding='utf-8') as in_fp, \
        open(output_path, 'w', encoding='utf-8') as out_fp:
    # Iterate over the file lazily; readlines() would load the whole corpus into memory.
    for line in in_fp:
        sent = line.rstrip('\n')
        toksent = process_sent(sent)
        out_fp.write(toksent)
        out_fp.write('\n')
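
The same read, transform, write loop applies to other per-sentence processing steps. As a sketch under that assumption, the loop can be factored into a helper that takes any function from string to string; the name `transform_corpus` is hypothetical and not part of indicnlp.

```python
def transform_corpus(input_path, output_path, transform):
    """Apply `transform` to every sentence in input_path and write the
    results to output_path, one sentence per line.
    Returns the number of lines processed."""
    n_lines = 0
    with open(input_path, 'r', encoding='utf-8') as in_fp, \
            open(output_path, 'w', encoding='utf-8') as out_fp:
        for line in in_fp:  # stream the file line by line
            sent = line.rstrip('\n')
            out_fp.write(transform(sent))
            out_fp.write('\n')
            n_lines += 1
    return n_lines
```

With the definitions above, `transform_corpus(input_path, output_path, process_sent)` would perform the same tokenization run. A transliteration function, for instance one built on the library's `UnicodeIndicTransliterator`, could be passed in the same way.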

Citing

If you are using IndicCorp, please cite the following article:


@inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    booktitle={Findings of the Association for Computational Linguistics: EMNLP 2020},
    year={2020},
    publisher={Association for Computational Linguistics}
}

License

IndicCorp is released under this licensing scheme:

We do not own any of the text from which this data has been extracted.
We license the actual packaging of this data under the Creative Commons CC0 license (“no rights reserved”).
To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to IndicCorp.
This work is published from: India.