AI4BHARAT

Language Understanding

Understanding language is central to building intelligent applications that can interface with users, help make sense of large amounts of text data, and provide insights. This requires basic building blocks like named entity recognizers, question answering systems, sentiment analyzers, intent-analysis modules, etc. To enable these tasks, foundational language models which can understand language are required. At AI4Bharat, our goal is to build these foundational language models as well as high-quality datasets and models for central NLU tasks.

Our Contributions

Benchmarks

IndicGLUE

Benchmark for 6 NLU tasks spanning 11 Indian languages, with standard training and evaluation sets.

Know More →

Coming Soon · Updated
IndicXTREME

Benchmark for zero-shot and cross-lingual evaluation of various NLU tasks in multiple Indian languages.

Know More →

Datasets

IndicCorp

Large sentence-level monolingual corpora for 11 Indian languages and Indian English, containing 8.5 billion words (250 million sentences) from multiple news domain sources.

Know More →

Updated
Naamapadam

Training and evaluation datasets for named entity recognition in multiple Indian languages.

Know More →

Coming Soon
IndicCorp v2

IndicCorp v2 is the largest collection of texts for Indic languages, consisting of 20.9 billion tokens curated from Indian websites, of which 14.4B tokens cover 23 Indic languages and 6.5B tokens are Indian English content.

Models

IndicBERT

Multilingual, compact ALBERT language model trained on IndicCorp, covering 11 major Indian languages and English. A small model (18 million parameters) that is competitive with larger LMs on Indian language tasks.
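Since IndicBERT is a standard ALBERT-style model, it can be loaded with the Hugging Face Transformers library. The sketch below is a minimal example, assuming the hub model id `ai4bharat/indic-bert` (check the model card for the exact name) and using simple mean pooling to get a sentence embedding:

```python
# Minimal sketch of using IndicBERT for sentence embeddings.
# Assumption: the model is published on the Hugging Face hub as
# "ai4bharat/indic-bert" -- verify the id on the model card.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

text = "यह एक उदाहरण वाक्य है।"  # "This is an example sentence." (Hindi)
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Mean-pool token representations into one sentence vector.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```

The resulting vector can feed a lightweight classifier for tasks like sentiment or intent analysis; for best results, fine-tune the full model on the task data instead of using frozen embeddings.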

Know More →

IndicFT

Word embeddings for 11 Indian languages trained on IndicCorp. The embeddings are based on the fastText model and are well suited for the morphologically rich nature of Indic languages.

Know More →

Updated
IndicNER

Named Entity Recognizer models for multiple Indian languages. The models are trained on the Naamapadam NER dataset mined from the Samanantar parallel corpora.
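A token-classification model like IndicNER can be used through the Transformers `pipeline` API. The following is a hedged sketch, assuming the hub model id `ai4bharat/IndicNER` (confirm the exact id and label set on the model card):

```python
# Sketch: tagging named entities in an Indic-language sentence with IndicNER.
# Assumption: the model is published as "ai4bharat/IndicNER" on the hub.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="ai4bharat/IndicNER",
    aggregation_strategy="simple",  # merge subword pieces into whole entity spans
)

sentence = "सचिन तेंदुलकर मुंबई में रहते हैं।"  # "Sachin Tendulkar lives in Mumbai." (Hindi)
entities = ner(sentence)
for ent in entities:
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))
```

Each result carries the aggregated entity label, the surface text span, and a confidence score, which makes it straightforward to filter low-confidence predictions downstream.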

Know More →

Coming Soon · Updated
IndicBERTv2

Language model trained on IndicCorp v2, with competitive performance on the IndicXTREME benchmark.

Know More →

Our Partners

Koo

Koo is exploring the use of IndicNER to organize tweets.

IndicNLU Workshop

On 28th July, we are conducting a workshop to demonstrate our datasets, models, and applications.

Learn More