AI4BHARAT

Language Generation

AI systems that can generate natural language text can aid users in a variety of writing tasks like summarization, headline/caption generation, paraphrasing, grammar correction, and question generation, as well as support tasks like dialog and machine translation. To support these use cases for Indian languages, we are building foundational language models, datasets, and task-specific models to power language generation applications.

Our Contributions

Datasets

IndicCorp

Large sentence-level monolingual corpora for 11 Indian languages and Indian English, containing 8.5 billion words (250 million sentences) from multiple news-domain sources.

Know More →

IndicNLG Suite

Training and evaluation datasets for 5 diverse language generation tasks spanning 11 Indic languages. One of the largest collections of multilingual generation datasets.

Know More →

Models

IndicBART

Multilingual sequence-to-sequence language model trained on IndicCorp, covering 11 major Indian languages and English. It is a single-script model: text in all supported Indic scripts is represented in one common script, which enables better cross-lingual transfer.

Know More →
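
To illustrate the single-script idea, here is a minimal sketch of one common way to map text from several Indic scripts onto Devanagari. It relies on the fact that the major Indic scripts occupy parallel 128-code-point blocks in Unicode, so a character can be moved between scripts by shifting its code point. The function name `to_devanagari` and the table `SCRIPT_BLOCK_START` are hypothetical names chosen for this example, and whether IndicBART's own preprocessing works exactly this way is an assumption, not something stated on this page. The mapping is also approximate: some code points have no counterpart in every script.

```python
# Start of each script's Unicode block. The blocks are laid out in parallel,
# so the same offset within a block denotes (roughly) the same letter.
SCRIPT_BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80  # each block spans 128 code points


def to_devanagari(text: str, source_script: str) -> str:
    """Shift characters of `source_script` into the Devanagari block.

    Characters outside the source script's block (spaces, digits,
    Latin letters, punctuation) are left unchanged.
    """
    start = SCRIPT_BLOCK_START[source_script]
    target = SCRIPT_BLOCK_START["devanagari"]
    converted = []
    for ch in text:
        code_point = ord(ch)
        if start <= code_point < start + BLOCK_SIZE:
            converted.append(chr(code_point - start + target))
        else:
            converted.append(ch)
    return "".join(converted)


if __name__ == "__main__":
    # Bengali "ভারত" (Bharat) becomes Devanagari "भारत".
    print(to_devanagari("ভারত", "bengali"))
```

With every language written in one script, related words across languages share subword units, which is the intuition behind the cross-lingual transfer benefit mentioned above.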

Indic Generation Models

Language generation models for tasks such as headline generation and sentence summarization. The models have been trained by fine-tuning IndicBART on datasets in the IndicNLG Suite.

Our Partners

Coming Soon

IndicNLG Workshop

On 28th July, we are conducting a workshop to demonstrate our datasets, models, and applications for Indic language generation.

Learn More