AI systems that generate natural language text can aid users in a variety of writing tasks such as summarization, headline/caption generation, paraphrasing, grammar correction, and question generation, as well as support tasks like dialog and machine translation. To support these use cases for Indian languages, we are building foundational language models, datasets, and task-specific models to power language generation applications.
Large sentence-level monolingual corpora for 11 Indian languages and Indian English, containing 8.5 billion words (250 million sentences) from multiple news-domain sources.
Training and evaluation datasets for 5 diverse language generation tasks spanning 11 Indic languages. One of the largest collections of multilingual generation datasets.
Multilingual sequence-to-sequence language model trained on IndicCorp, covering 11 major Indian languages and English. It is a single-script model, which enables better cross-lingual transfer.
Language generation models for various tasks such as headline generation and sentence summarization. The models have been trained by finetuning IndicBART on datasets in the IndicNLGSuite.