Datasets

Bharat Parallel Corpus Collection (BPCC)

BPCC is a comprehensive, publicly available parallel corpus containing a mix of human-labelled and automatically mined data, totaling approximately 230 million bitext pairs.

IndicCorp v2

IndicCorp v2 is the largest collection of texts for Indic languages, consisting of 20.9 billion tokens, of which 14.4B tokens correspond to 23 Indic languages and 6.5B tokens are Indian English content curated from Indian websites.

IndicXTREME

Benchmark for zero-shot and cross-lingual evaluation of various NLU tasks in multiple Indian languages.

Bhasha-Abhijnaanam

Bhasha-Abhijnaanam is a language identification test set for native-script as well as Romanized text which spans 22 Indic languages.

Naamapadam

Training and evaluation datasets for named entity recognition in multiple Indian languages.

Samanantar

The largest publicly available parallel corpora collection for Indic languages, containing ∼46.9M parallel sentences between English and 11 Indic languages, ranging from 142K pairs for English-Assamese to 8.6M pairs for English-Hindi. Of these, 34.6M pairs were newly mined as part of this work.

Svarah

Indian-accented English benchmark comprising data from 17/22 Indian languages.

Aksharantar

The largest publicly available parallel transliteration corpus, containing 26M word pairs spanning 21 languages, mined from Wikidata, Samanantar and IndicCorp. It also contains a challenging and diverse benchmark for evaluating transliteration models.

IndicNLG Suite

This is a benchmark containing various tasks to evaluate the natural language generation capabilities of language models for Indian languages.

Shrutilipi

Over 6,400 hours of labelled audio across 12 Indian languages mined and aligned from audio broadcasts and PDF transcripts from All India Radio.

IndicSUPERB

A benchmark of speech recognition tasks including ASR, speaker verification, speaker identification, language identification, query by example, and keyword detection for 12 Indian languages.

Dhwani

17,000 hours of raw speech data for 40 Indian languages from a wide variety of domains including education, news, technology, and finance.

IndicGLUE

This is a benchmark of 6 NLU tasks spanning 11 Indian languages, with standard training and evaluation sets for measuring the natural language understanding capabilities of language models.

IndicCorp

Large sentence-level monolingual corpora for 11 languages from two language families (Indo-Aryan branch and Dravidian) and Indian English, with an average 9-fold increase in size over OSCAR. These corpora were created by crawling content from news articles, magazines and blog posts.