Naamapadam is the largest publicly available Named Entity Recognition (NER) dataset for 11 major Indian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu. For 9 of the 11 languages, it contains more than 400k sentences per language, annotated with at least 100k entities from three standard entity categories (Person, Location and Organization).
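
To make the three entity categories concrete, here is a minimal sketch of how entity spans could be read off a tagged sentence. It assumes IOB2-style tags (`B-PER`, `I-PER`, `B-LOC`, ...), a common convention for NER corpora that is not stated in this description; the function name `extract_entities` and the example sentence are hypothetical.

```python
from typing import List, Optional, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from IOB2-style tags (assumed scheme)."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if any, and reset the buffer.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity, ends the span.
            flush()
    flush()
    return entities


# Hypothetical example sentence, not taken from the dataset.
tokens = ["Sachin", "Tendulkar", "was", "born", "in", "Mumbai"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(extract_entities(tokens, tags))
```

In this sketch a stray `I-` tag without a preceding `B-` tag of the same type is dropped rather than promoted to an entity; other decoders make the opposite choice.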

The dataset contains train, dev and test splits. We have manually annotated gold-standard test sets for 8 languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi and Telugu.

We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set.

Downloads

Naamapadam dataset: available on our Hugging Face repository

IndicNER model: available on our Hugging Face repository

Comparison of Indian language Named Entity training dataset statistics (total number of named entities). For all datasets, the statistics include only LOC, PER and ORG named entities.

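| Dataset    | as   | bn   | gu     | hi     | kn   | ml   | mr     | or     | pa     | ta     | te     |
|------------|------|------|--------|--------|------|------|--------|--------|--------|--------|--------|
| Naamapadam | 5.0K | 1.6M | 765.5K | 2.1M   | 655K | 1.0M | 731.2K | 189.6K | 875.9K | 741.1K | 747.8K |
| WikiANN    | 443  | 61K  | 1.6K   | 4.7K   | 1.9K | 9.4K | 7.2K   | 658    | 1K     | 12.5K  | 3.4K   |
| FIRE-2014  | -    | 6.1K | -      | 3.5K   | -    | 4.2K | -      | -      | -      | 3.2K   | -      |
| CFILT      | -    | -    | -      | 262.1K | -    | -    | 4.8K   | -      | -      | -      | -      |
| MultiCoNER | -    | 9.9K | -      | 10.5K  | -    | -    | -      | -      | -      | -      | -      |
| MahaNER    | -    | -    | -      | -      | -    | -    | 16K    | -      | -      | -      | -      |
| AsNER      | ~6K  | -    | -      | -      | -    | -    | -      | -      | -      | -      | -      |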

Statistics for the Naamapadam dataset. The test sets for as, or and ta are either silver-standard or small. Larger, manually annotated test sets for these languages are in progress.

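| Lang. | Sentences (Train) | Sentences (Dev) | Sentences (Test) | Train Org | Train Loc | Train Per | Dev Org | Dev Loc | Dev Per | Test Org | Test Loc | Test Per |
|-------|-------------------|-----------------|------------------|-----------|-----------|-----------|---------|---------|---------|----------|----------|----------|
| bn    | 961.7K            | 4.9K            | 607              | 340.7K    | 560.9K    | 725.2K    | 1.7K    | 2.8K    | 3.7K    | 207      | 331      | 457      |
| gu    | 472.8K            | 2.4K            | 1.1K             | 205.7K    | 238.1K    | 321.7K    | 1.1K    | 1.2K    | 1.6K    | 419      | 645      | 673      |
| hi    | 985.8K            | 13.5K           | 437              | 686.4K    | 731.2K    | 767.0K    | 9.7K    | 10.2K   | 10.5K   | 257      | 302      | 263      |
| kn    | 471.8K            | 2.4K            | 1.0K             | 167.5K    | 177.0K    | 310.5K    | 882     | 919     | 1.6K    | 291      | 397      | 614      |
| ml    | 716.7K            | 3.6K            | 974              | 234.5K    | 308.2K    | 501.2K    | 1.2K    | 1.6K    | 2.6K    | 309      | 482      | 714      |
| mr    | 455.2K            | 2.3K            | 1.1K             | 164.9K    | 224.0K    | 342.3K    | 868     | 1.2K    | 1.8K    | 391      | 569      | 696      |
| pa    | 463.5K            | 2.3K            | 993              | 235.0K    | 289.8K    | 351.1K    | 1.1K    | 1.5K    | 1.7K    | 408      | 496      | 553      |
| te    | 507.7K            | 2.7K            | 861              | 194.1K    | 205.9K    | 347.8K    | 1.0K    | 1.0K    | 2.0K    | 263      | 482      | 607      |
| ta    | 497.9K            | 2.8K            | 49               | 177.7K    | 281.2K    | 282.2K    | 1.0K    | 1.5K    | 1.6K    | 26       | 34       | 22       |
| as    | 10.3K             | 52              | 51               | 2.0K      | 1.8K      | 1.2K      | 18      | 5       | 3       | 11       | 7        | 6        |
| or    | 196.8K            | 993             | 994              | 45.6K     | 59.4K     | 84.6K     | 225     | 268     | 386     | 229      | 266      | 431      |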
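
As a quick sanity check on how the two tables relate, the per-type training counts (Org, Loc, Per) appear to add up to the Naamapadam totals in the comparison table. The sketch below verifies this for three languages using figures copied from the tables; the helper `parse_count` is a hypothetical name introduced here for illustration. Because the published figures are rounded, other languages may match only approximately.

```python
def parse_count(text: str) -> int:
    """Convert an abbreviated count such as '765.5K' or '1.6M' to an integer."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = text.strip().lstrip("~")
    suffix = text[-1].upper()
    if suffix in multipliers:
        # round() absorbs floating-point noise, e.g. 205.7 * 1000.
        return round(float(text[:-1]) * multipliers[suffix])
    return int(text)


# Training-set entity counts (Org, Loc, Per) as listed in the statistics table.
train_entities = {
    "gu": ("205.7K", "238.1K", "321.7K"),
    "kn": ("167.5K", "177.0K", "310.5K"),
    "or": ("45.6K", "59.4K", "84.6K"),
}

# Naamapadam totals as listed in the comparison table.
reported_totals = {"gu": "765.5K", "kn": "655K", "or": "189.6K"}

for lang, counts in train_entities.items():
    total = sum(parse_count(c) for c in counts)
    matches = total == parse_count(reported_totals[lang])
    print(f"{lang}: Org + Loc + Per = {total} (matches reported total: {matches})")
```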

Contributors

  • Arnav Mhaske (AI4Bharat, IITM)
  • Harshit Kedia (AI4Bharat, IITM)
  • Sumanth Doddapaneni (AI4Bharat, IITM)
  • Mitesh Khapra (AI4Bharat, IITM)
  • Pratyush Kumar (Microsoft, AI4Bharat, IITM)
  • Rudra Murthy V (IBM Research India, AI4Bharat, IITM), rmurthyv@in.ibm.com
  • Anoop Kunchukuttan (Microsoft, AI4Bharat, IITM), ankunchu@microsoft.com

Corresponding authors: Rudra Murthy V, Anoop Kunchukuttan

Citing

If you are using any of the resources, please cite the following article:

@misc{mhaske2022naamapadam,
  doi = {10.48550/ARXIV.2212.10168},
  url = {https://arxiv.org/abs/2212.10168},
  author = {Mhaske, Arnav and Kedia, Harshit and Doddapaneni, Sumanth and Khapra, Mitesh M. and Kumar, Pratyush and Murthy, Rudra and Kunchukuttan, Anoop},
  title = {Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
  publisher = {arXiv},
  year = {2022},
}

License

Naamapadam is released under this licensing scheme:

  • We do not own any of the text from which this data has been extracted.
  • We license the actual packaging of this data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Naamapadam.
  • This work is published from: India.