Large Language Models

To learn more about our contributions over the years, see the timeline below.

At AI4Bharat, building language models and datasets for all 22 constitutionally recognized Indian languages is central to our mission. We combine large-scale data crawling, synthetic data creation, and human annotation and crowdsourced collection to create comprehensive datasets. Our efforts have resulted in an extensive pretraining corpus of 251 billion tokens across 22 languages, complemented by 74.7 million prompt-response pairs in 20 Indian languages. Tools like Setu play a crucial role in large-scale crawling and data cleaning, enabling us to build state-of-the-art models such as Airavata, IndicBART, and IndicBERT. We also emphasize rigorous evaluation of our models, as demonstrated by works such as FBI, IndicXTREME, IndicNLG, and IndicGLUE, which establish benchmarks for measuring language model performance. Looking ahead, we are committed to expanding our pretraining corpora to support the development of more robust generative models with diverse generation capabilities, advancing language technology for India’s diverse linguistic landscape.

Timeline

© 2024 AI4Bharat. All rights reserved
