Language Generation

AI systems that can generate natural language text can aid users in a variety of writing tasks like summarization, headline/caption generation, paraphrasing, grammar correction, question generation as well as support tasks like dialog and machine translation. To support these usecases for Indian languages, we are working on building foundational language models, datasets and task-specific models for power language generation applications.

Datasets

IndicCorp

Large sentence-level monolingual corpora for 11 Indian languages and Indian English containing 8.5 billions words (250 million sentences) from multiple news domain sources.

IndicNLG Suite

Training and Evaluation datasets for 5 diverse language generation tasks spanning 11 Indic languages. One of the largest multilingual generation dataset collections across languages.

Models

IndicBART

Multilingual, sequence-to-sequence language model trained on IndicCorp covering 11 major Indian and English. It is a single script model that enables better cross-lingual transfer.

Indic Generation Models

Language generation models for various tasks like headline generation, sentence summarization, etc. The models have be trained by finetuning IndicBART on datasets in the IndicNLGSuite