AI systems that generate natural language text can aid users in a variety of writing tasks such as summarization, headline/caption generation, paraphrasing, grammar correction, and question generation, as well as support tasks like dialog and machine translation. To support these use cases for Indian languages, we are building foundational language models, datasets, and task-specific models to power language generation applications.
Our Contributions
Datasets
Large sentence-level monolingual corpora for 11 Indian languages and Indian English, containing 8.5 billion words (250 million sentences) from multiple news domain sources.
Know More →
Training and evaluation datasets for 5 diverse language generation tasks spanning 11 Indic languages. One of the largest collections of multilingual generation datasets.
Know More →
Models
A multilingual sequence-to-sequence language model trained on IndicCorp, covering 11 major Indian languages and English. It is a single-script model, which enables better cross-lingual transfer.
Know More →
Language generation models for tasks such as headline generation and sentence summarization. The models have been trained by fine-tuning IndicBART on datasets in the IndicNLGSuite.
Our Partners
Coming Soon
On 28th July, we are conducting a workshop to demonstrate our datasets, models, and applications.