IndicBERT is a multilingual ALBERT model trained on large-scale corpora, covering 12 major Indian languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. IndicBERT has far fewer parameters than other public models such as mBERT and XLM-R, while still achieving state-of-the-art performance on several tasks.

Download Model

The model can be downloaded here. Both TensorFlow checkpoints and PyTorch binaries are included in the archive. Alternatively, you can also download it from Hugging Face.

Usage

The easiest way to use IndicBERT is through the Hugging Face transformers library. It can be loaded as follows:

from transformers import AutoModel, AutoTokenizer

# Load the IndicBERT tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
model = AutoModel.from_pretrained('ai4bharat/indic-bert')
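
Once loaded, the model can produce contextual embeddings for any sentence in a supported language. The snippet below is a minimal sketch using the standard transformers forward pass (the input sentence is only illustrative, and it assumes a recent transformers version):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
model = AutoModel.from_pretrained('ai4bharat/indic-bert')

# Illustrative input; any sentence in a supported language works
inputs = tokenizer("This is a test sentence.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings of shape (batch, seq_len, hidden), one vector per token
embeddings = outputs.last_hidden_state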

Tutorials

If you want to quickly start experimenting with IndicBERT, we suggest checking out our tutorials and other fine-tuning notebooks that run on Google Colab (a minimal fine-tuning sketch also follows the list below):

  • General Finetuning 
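
For orientation, here is a minimal sketch of fine-tuning IndicBERT for text classification with the transformers API. The dataset, label count, and hyperparameters below are placeholders, not the settings used in our notebooks:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
# num_labels=3 is a placeholder; set it to your task's label count
model = AutoModelForSequenceClassification.from_pretrained(
    'ai4bharat/indic-bert', num_labels=3
)

# Toy batch; replace with your own dataset
texts = ["first example sentence", "second example sentence"]
labels = torch.tensor([0, 2])

enc = tokenizer(texts, padding=True, truncation=True,
                max_length=128, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**enc, labels=labels)  # forward pass also computes the loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()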

Pretraining Details

IndicBERT is pre-trained on the IndicNLP corpus, which covers 12 major Indian languages (including English). The amount of pretraining data for each language is listed below:

Language        as       bn       en       gu       hi       kn
No. of Tokens   36.9M    815M     1.34B    724M     1.84B    712M

Language        ml       mr       or       pa       ta       te       all
No. of Tokens   767M     560M     104M     814M     549M     671M     8.9B

In total, the pretraining corpus has a size of 120GB and contains 8.9B tokens.

Evaluation

We evaluate the IndicBERT model on a set of tasks as described on the IndicGLUE page. Here are the results we obtain:

IndicGLUE

Task                                     mBERT    XLM-R    IndicBERT
News Article Headline Prediction         89.58    95.52    95.87
Wikipedia Section Title Prediction       73.66    66.33    73.31
Cloze-style multiple-choice QA           39.16    27.98    41.87
Article Genre Classification             90.63    97.03    97.34
Named Entity Recognition (F1-score)      73.24    65.93    64.47
Cross-Lingual Sentence Retrieval Task    21.46    13.74    27.12
Average                                  64.62    61.09    66.66

Additional Tasks

Task                                    Task Type                   mBERT    XLM-R    IndicBERT
BBC News Classification                 Genre Classification        60.55    75.52    74.60
IIT Product Reviews                     Sentiment Analysis          74.57    78.97    71.32
IITP Movie Reviews                      Sentiment Analysis          56.77    61.61    59.03
Soham News Article                      Genre Classification        80.23    87.6     78.45
Midas Discourse                         Discourse Analysis          71.20    79.94    78.44
iNLTK Headlines Classification          Genre Classification        87.95    93.38    94.52
ACTSA Sentiment Analysis                Sentiment Analysis          48.53    59.33    61.18
Winograd NLI                            Natural Language Inference  56.34    55.87    56.34
Choice of Plausible Alternative (COPA)  Natural Language Inference  54.92    51.13    58.33
Amrita Exact Paraphrase                 Paraphrase Detection        93.81    93.02    93.75
Amrita Rough Paraphrase                 Paraphrase Detection        83.38    82.20    84.33
Average                                                             69.84    74.42    73.66

* Note: all models have been restricted to a max_seq_length of 128.
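
If you want your own inputs to match this setting, the tokenizer can enforce the same cap. A small illustration (the sentence is a placeholder; the exact evaluation scripts may differ):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')

# Truncate inputs to 128 tokens, matching the evaluation setting above
enc = tokenizer("A placeholder sentence.", truncation=True,
                max_length=128, return_tensors="pt")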

Citing

If you are using any of the resources, please cite the following paper:

@inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    booktitle={Findings of EMNLP},
}

License

The IndicBERT code and model are released under the MIT License.