IndicBERT v2 is a multilingual BERT model trained on IndicCorp v2, covering 24 Indic languages. IndicBERT performs competitively with strong baselines and achieves the best results on 7 of the 9 tasks in the IndicXTREME benchmark.
The models are trained with different objectives and datasets. The available models are:
- IndicBERT-MLM – A vanilla BERT-style model trained on IndicCorp v2 with the masked language modeling (MLM) objective.
- +Samanantar – Adds translation language modeling (TLM) as an additional objective, using the Samanantar parallel corpus.
- +Back-Translation – Adds TLM as an additional objective, using parallel data created by translating the Indic parts of IndicCorp v2 into English with the IndicTrans model.
- IndicBERT-SS – To encourage better lexical sharing among languages, the scripts of the Indic languages are converted to Devanagari, and a BERT-style model is trained with the MLM objective.
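Two ideas in the list above can be sketched in a few lines of Python: how a TLM input differs from an MLM input, and how text in one Indic script can be mapped to Devanagari. This is a minimal illustration under stated assumptions, not the code used to train IndicBERT. Whitespace splitting stands in for the real subword tokenizer, the 15% masking rate and plain `[MASK]` replacement are a simplification of the original BERT recipe, and the Devanagari mapping relies on the fact that the major Brahmi-derived Unicode blocks share a common layout, which is only approximate and does not cover every script in IndicCorp v2. All function names here are hypothetical.

```python
import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"


def mask_tokens(tokens, rng, prob=0.15):
    """Replace non-special tokens with [MASK] with probability `prob`.

    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None elsewhere. Assumption: real BERT masking also
    keeps or randomizes some selected tokens, which is omitted here.
    """
    masked, labels = [], []
    for tok in tokens:
        if tok not in (CLS, SEP) and rng.random() < prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def mlm_example(sentence, rng, prob=0.15):
    """MLM: mask tokens in a single monolingual sentence."""
    tokens = [CLS] + sentence.split() + [SEP]
    return mask_tokens(tokens, rng, prob)


def tlm_example(source, target, rng, prob=0.15):
    """TLM: concatenate a parallel sentence pair and mask across both halves,
    so a masked word can be predicted from context in either language."""
    tokens = [CLS] + source.split() + [SEP] + target.split() + [SEP]
    return mask_tokens(tokens, rng, prob)


# First code point of each Unicode block. Characters at the same offset in
# these blocks usually (not always) denote the corresponding letter.
DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80
BLOCK_STARTS = {
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def to_devanagari(text, script):
    """Approximate script conversion by shifting code points into the
    Devanagari block; characters outside the source block are kept as-is."""
    start = BLOCK_STARTS[script]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + BLOCK_SIZE:
            out.append(chr(cp - start + DEVANAGARI_START))
        else:
            out.append(ch)  # spaces, punctuation, Latin text, etc.
    return "".join(out)
```

For instance, `to_devanagari("\u0995", "bengali")` maps the Bengali letter ka (U+0995) to the Devanagari letter ka (U+0915), and `tlm_example("मैं घर जा रहा हूँ", "I am going home", random.Random(0))` returns a 12-token sequence spanning both sentences together with its labels. A production pipeline would more likely use a dedicated transliteration library, since the offset trick mishandles code points that are assigned in one block but not the other.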
All the code for pretraining and fine-tuning the IndicBERT models is available in this GitHub repository.
Citation
@article{Doddapaneni2022towards,
title={Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages},
author={Sumanth Doddapaneni and Rahul Aralikatte and Gowtham Ramesh and Shreyansh Goyal and Mitesh M. Khapra and Anoop Kunchukuttan and Pratyush Kumar},
journal={ArXiv},
year={2022},
volume={abs/2212.05409}
}