IndicBERT v2 is a multilingual BERT model trained on IndicCorp v2, covering 24 Indic languages. It is competitive with strong baselines and performs best on 7 out of 9 tasks on the IndicXTREME benchmark.

The models are trained with various objectives and datasets. The list of models is as follows; a short usage sketch follows the list:

  • IndicBERT-MLM – a vanilla BERT-style model trained on IndicCorp v2 with the MLM objective
  • +Samanantar – adds TLM as an additional objective, using the Samanantar parallel corpus
  • +Back-Translation – adds TLM as an additional objective, translating the Indic portions of IndicCorp v2 into English with the IndicTrans model
  • IndicBERT-SS – to encourage better lexical sharing among languages, the scripts of the Indic languages are converted to Devanagari and a BERT-style model is trained with the MLM objective
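
These checkpoints can be used through the Hugging Face transformers library. Below is a minimal sketch of masked-token prediction with the MLM model; the checkpoint id `ai4bharat/IndicBERTv2-MLM-only` is an assumption about how the model is published on the Hub and should be replaced with the actual id if it differs.

```python
# Minimal sketch: masked-token prediction with IndicBERT v2.
# NOTE: the checkpoint id below is an assumed Hugging Face Hub id,
# not confirmed by this README; substitute the actual published id.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "ai4bharat/IndicBERTv2-MLM-only"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Fill the masked token in a Hindi sentence ("I like reading books a lot").
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill(f"मुझे किताबें {tokenizer.mask_token} बहुत पसंद है।"):
    print(prediction["token_str"], prediction["score"])
```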

All the code for pretraining and fine-tuning the IndicBERT models is available in this GitHub repository.
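
The repository's scripts are the authoritative fine-tuning recipe; as a rough orientation, a standard sequence-classification fine-tune with the transformers Trainer might look like the sketch below. The checkpoint id and the dataset are placeholders, not names taken from this README.

```python
# A hedged sketch of fine-tuning IndicBERT v2 for sequence classification.
# The checkpoint id and dataset name are placeholders; see the repository's
# own scripts for the reference setup and hyperparameters.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "ai4bharat/IndicBERTv2-MLM-only"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

# Placeholder dataset with "text" and "label" columns; substitute a real
# classification task (e.g. one of the IndicXTREME tasks).
dataset = load_dataset("your_indic_dataset")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="indicbert-finetuned",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```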

Citation

@article{Doddapaneni2022towards,
  title={Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages},
  author={Sumanth Doddapaneni and Rahul Aralikatte and Gowtham Ramesh and Shreyansh Goyal and Mitesh M. Khapra and Anoop Kunchukuttan and Pratyush Kumar},
  journal={ArXiv},
  year={2022},
  volume={abs/2212.05409}
}