fastText is a subword-aware word embedding model. It is particularly well suited to Indian languages, many of which have rich, highly agglutinative morphology. We train fastText models on our IndicNLP Corpora and evaluate them on a set of tasks to measure their performance.
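
To make "subword-aware" concrete, here is a minimal sketch of how a word can be broken into character n-grams, with `<` and `>` marking the word boundaries. The function name `char_ngrams` is ours, and the default lengths of 3 to 6 are assumed to match fastText's usual defaults. The real library also hashes the n-grams into a fixed number of buckets, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of `word`, with < and > marking its boundaries."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# Trigrams only, to keep the output short.
print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because a word's vector is built from the vectors of its n-grams, inflected or compounded forms that share n-grams with known words still receive meaningful embeddings. Note that this sketch slices Python strings by Unicode code point, so combining vowel signs in Indic scripts count as separate characters.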

Our fastText models are available for 11 Indian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.

Usage

To use our fastText models, first download them. Next, install the fastText library:

pip3 install fasttext

and then load the models like this:

import fasttext
model = fasttext.load_model(path_to_binary_file)

For instructions on how to use these models, please refer to the official fastText documentation.
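
As a brief illustration, the sketch below queries a loaded model for a word's vector and its nearest neighbours. It uses `get_dimension`, `get_word_vector` and `get_nearest_neighbors` from the fastText Python module. The helper `summarize_word`, the file name in `MODEL_PATH` and the query word are placeholders of ours, not names from this project.

```python
import os

def summarize_word(model, word, k=5):
    """Return the embedding size, the vector, and the k nearest neighbours of `word`."""
    return {
        "dimension": model.get_dimension(),
        "vector": model.get_word_vector(word),
        # get_nearest_neighbors returns a list of (similarity, word) pairs.
        "neighbours": model.get_nearest_neighbors(word, k=k),
    }

# Hypothetical file name: substitute the path of the binary model you downloaded.
MODEL_PATH = "indicft.hi.bin"

if os.path.exists(MODEL_PATH):
    import fasttext

    model = fasttext.load_model(MODEL_PATH)
    info = summarize_word(model, "भारत")
    print(info["dimension"])
    for similarity, neighbour in info["neighbours"]:
        print(f"{similarity:.3f}\t{neighbour}")
```

Since the binary model stores subword information, `get_word_vector` should also return a vector for words that never appeared in the training corpus.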

Downloads

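| Language | as | pa | hi | bn | or | gu | mr | kn | te | ml | ta |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Vectors | link | link | link | link | link | link | link | link | link | link | link |
| Model | link | link | link | link | link | link | link | link | link | link | link |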
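
If you only need static word vectors, the "Vectors" files can be read without the fastText library. The sketch below assumes they follow fastText's usual text format: a header line with the vocabulary size and dimension, then one word per line followed by its values. This is an assumption about the downloads, so check a file before relying on it. The function name `load_text_vectors` and the sample data are ours.

```python
import io

def load_text_vectors(lines, limit=None):
    """Parse text vectors: a 'count dim' header, then 'word v1 v2 ...' rows."""
    lines = iter(lines)
    _count, dim = map(int, next(lines).split())
    vectors = {}
    for line in lines:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed or empty rows rather than guessing
        vectors[word] = [float(v) for v in values]
    return dim, vectors

# Tiny in-memory sample standing in for a downloaded file.
sample = io.StringIO("2 3\nनमस्ते 0.1 0.2 0.3\nभारत 0.4 0.5 0.6\n")
dim, vectors = load_text_vectors(sample)
print(dim, len(vectors))
```

For a real file, pass an open handle instead of the sample, for example `open(path, encoding="utf-8")`, or `gzip.open(path, "rt", encoding="utf-8")` if the file is compressed. Unlike the binary model, plain vectors cannot produce embeddings for out-of-vocabulary words.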

Evaluation

For the full evaluation results, see our paper. Here, we show a selection of them.

Word Similarity

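| Language | fastText wiki | fastText wiki+CC | Indic fastText |
|---|---|---|---|
| pa | 0.467 | 0.384 | 0.445 |
| hi | 0.575 | 0.551 | 0.598 |
| gu | 0.507 | 0.521 | 0.600 |
| mr | 0.497 | 0.544 | 0.509 |
| te | 0.559 | 0.543 | 0.578 |
| ta | 0.439 | 0.438 | 0.422 |
| Average | 0.507 | 0.497 | 0.525 |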
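
This page does not spell out how the scores above are computed, so refer to the paper for the exact protocol. A common setup for word similarity, sketched below, is to take the cosine similarity of each word pair's vectors and correlate those values with human ratings. The choice of Spearman rank correlation, the function names and the toy data are all assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def word_similarity_score(vectors, scored_pairs):
    """Spearman correlation between model cosines and human ratings."""
    model_scores, human_scores = [], []
    for w1, w2, rating in scored_pairs:
        # Pairs with a missing word are skipped in this sketch.
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    return pearson(average_ranks(model_scores), average_ranks(human_scores))

# Toy vectors and ratings, invented purely for illustration.
toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
toy_pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0)]
score = word_similarity_score(toy_vectors, toy_pairs)
print(round(score, 3))
```

In the toy data the model ranks the three pairs in the same order as the ratings, so the correlation is 1.0. Real benchmarks give values well below that, as in the table above.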

News Genre Classification

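| Language | fastText wiki | fastText wiki+CC | Indic fastText |
|---|---|---|---|
| pa | 97.12 | 95.53 | 96.47 |
| bn | 96.57 | 97.57 | 97.71 |
| or | 94.80 | 96.20 | 98.43 |
| gu | 95.12 | 94.63 | 99.02 |
| mr | 96.44 | 97.07 | 99.37 |
| kn | 95.93 | 96.53 | 97.43 |
| te | 98.67 | 98.08 | 99.17 |
| ml | 89.02 | 89.18 | 92.83 |
| ta | 95.99 | 95.90 | 97.26 |
| Average | 95.52 | 95.63 | 97.52 |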

Citing

If you are using IndicFT, please cite the following paper:

@inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    booktitle={Findings of EMNLP},
}

License

The IndicFT embeddings are released under the MIT License.