IndicCorp was built by discovering and scraping thousands of web sources (primarily news, magazines and books) over a period of several months.

IndicCorp is one of the largest publicly available corpora for Indian languages. It has also been used to train our released models, which have obtained state-of-the-art performance on many tasks.

Corpus Format

The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.
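
Because the corpus is one sentence per line, it can be read as a stream rather than loaded whole. The following is a minimal sketch of that idea using only the Python standard library. The helper names `iter_sentences` and `corpus_stats` are hypothetical, and the token count here is a plain whitespace split, so it is an assumption that will not necessarily match the token figures reported for the released files.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_sentences(fp: TextIO) -> Iterator[str]:
    """Yield one sentence per line, skipping empty lines."""
    for line in fp:
        sent = line.rstrip("\n")
        if sent:
            yield sent


def corpus_stats(fp: TextIO) -> Tuple[int, int]:
    """Return (sentence count, whitespace-separated token count)."""
    n_sents = 0
    n_tokens = 0
    for sent in iter_sentences(fp):
        n_sents += 1
        # Assumption: whitespace splitting as a rough token count
        n_tokens += len(sent.split())
    return n_sents, n_tokens


# Small in-memory stand-in for a corpus file
sample = io.StringIO("This is a sentence.\nAnother one here.\n")
print(corpus_stats(sample))  # (2, 7)
```

For a real file, the same functions can be applied to a handle opened with `open(path, 'r', encoding='utf-8')`.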

Downloads

Language   # News Articles*   Sentences   Tokens   Link
as         0.60M              1.39M       32.6M    link
bn         3.83M              39.9M       836M     link
en         3.49M              54.3M       1.22B    link
gu         2.63M              41.1M       719M     link
hi         4.95M              63.1M       1.86B    link
kn         3.76M              53.3M       713M     link
ml         4.75M              50.2M       721M     link
mr         2.31M              34.0M       551M     link
or         0.69M              6.94M       107M     link
pa         2.64M              29.2M       773M     link
ta         4.41M              31.5M       582M     link
te         3.98M              47.9M       674M     link
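
As a rough illustration of how to read the size figures, the sketch below converts the abbreviated counts (M for million, B for billion) into numbers and derives an average sentence length for a few languages. The helpers `parse_count` and `tokens_per_sentence` are hypothetical names introduced here, and the values are copied from the table for three languages only.

```python
def parse_count(text: str) -> float:
    """Convert a size string such as '32.6M' or '1.22B' to a number."""
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9}
    suffix = text[-1].upper()
    if suffix in multipliers:
        return float(text[:-1]) * multipliers[suffix]
    return float(text)


def tokens_per_sentence(sentences: str, tokens: str) -> float:
    """Average number of tokens per sentence from two size strings."""
    return parse_count(tokens) / parse_count(sentences)


# (sentences, tokens) copied from the table for a few languages
SIZES = {
    "as": ("1.39M", "32.6M"),
    "hi": ("63.1M", "1.86B"),
    "or": ("6.94M", "107M"),
}

for lang, (sents, toks) in SIZES.items():
    print(f"{lang}: {tokens_per_sentence(sents, toks):.1f} tokens per sentence")
```

Since the table values are rounded, the resulting averages are approximate.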

Processing Corpus

For processing the corpus into other forms (tokenized, transliterated etc.), you can use the indicnlp library. As an example, the following code snippet can be used to tokenize the corpus:

from indicnlp.tokenize.indic_tokenize import trivial_tokenize
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

lang = 'kn'
input_path = 'kn'
output_path = 'kn.tok.txt'

# Build a normalizer for the target language
normalizer_factory = IndicNormalizerFactory()
normalizer = normalizer_factory.get_normalizer(lang)

def process_sent(sent):
    # Normalize the sentence, then join its tokens with single spaces
    normalized = normalizer.normalize(sent)
    processed = ' '.join(trivial_tokenize(normalized, lang))
    return processed

with open(input_path, 'r', encoding='utf-8') as in_fp, \
     open(output_path, 'w', encoding='utf-8') as out_fp:
    # Iterate over the file lazily; readlines() would load the whole corpus into memory
    for line in in_fp:
        sent = line.rstrip('\n')
        toksent = process_sent(sent)
        out_fp.write(toksent)
        out_fp.write('\n')


Citing

If you are using IndicCorp, please cite the following article:

@inproceedings{kakwani2020indicnlpsuite,    
               title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},    
               
               
               

License

IndicCorp is released under this licensing scheme:

  • We do not own any of the text from which this data has been extracted.
  • We license the actual packaging of this data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to IndicCorp.
  • This work is published from: India.