Understanding language is central to building intelligent applications that can interface with users, help make sense of large amounts of text data, and provide insights. This requires basic building blocks such as named entity recognizers, question answering systems, sentiment analyzers, and intent-analysis modules. These tasks in turn depend on foundational language models that can understand language. At AI4Bharat, our goal is to build these foundational language models as well as high-quality datasets and models for central NLU tasks.
Large sentence-level monolingual corpora for 11 Indian languages and Indian English, containing 8.5 billion words (250 million sentences) from multiple news-domain sources.
Training and evaluation datasets for named entity recognition in multiple Indian languages.
IndicCorp v2, the largest collection of texts for Indic languages, consisting of 20.9 billion tokens, of which 14.4B tokens correspond to 23 Indic languages and 6.5B tokens to Indian English content curated from Indian websites.
Multilingual, compact ALBERT language model trained on IndicCorp, covering 11 major Indian languages and English. A small model (18 million parameters) that is competitive with large LMs on Indian language tasks.
Word embeddings for 11 Indian languages trained on IndicCorp. The embeddings are based on the fastText model and are well suited for the morphologically rich nature of Indic languages.
Named Entity Recognizer models for multiple Indian languages. The models are trained on the Naamapadam NER dataset mined from the Samanantar parallel corpora.
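To illustrate why fastText-style word embeddings suit morphologically rich languages, here is a minimal sketch of the character n-gram decomposition that fastText uses to build subword features. The function name `char_ngrams` and the example words are illustrative assumptions, not part of any AI4Bharat release; the 3 to 6 n-gram range mirrors fastText's usual defaults, and the actual embeddings may have been trained with different settings.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return fastText-style character n-grams for a word (hypothetical helper).

    The word is wrapped in '<' and '>' boundary markers so that prefixes
    and suffixes are distinguishable from word-internal n-grams.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Two inflected forms of the same Hindi stem share subword units, so an
# inflection that is rare in the corpus still inherits information from
# the n-grams it has in common with more frequent forms.
singular = set(char_ngrams("लड़का"))
plural_oblique = set(char_ngrams("लड़कों"))
shared = singular & plural_oblique
print(sorted(shared))
```

In fastText, a word's vector is the sum of the vectors of its n-grams (plus a vector for the whole word), which is what lets related inflections end up close together.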