AI4BHARAT

Naamapadam

Naamapadam is the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. For 9 of the 11 languages, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization).
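To illustrate what token-level annotation with the three entity categories looks like in practice, here is a minimal sketch that groups tagged tokens into entity spans. It assumes an IOB2 tag scheme with the labels O, B-PER, I-PER, B-ORG, I-ORG, B-LOC and I-LOC, which is not stated on this page. The helper name iob_to_spans and the example sentence are made up for illustration and are not taken from the dataset.

```python
from collections import Counter
from typing import List, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_type, entity_text) spans.

    Hypothetical helper; the IOB2 label set is an assumption.
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity that is currently open, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A B- tag (or a stray I- tag of another type) opens a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the entity that is already open.
            current_tokens.append(token)
        else:
            # An "O" tag closes any open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# Made-up example sentence, not drawn from Naamapadam.
tokens = ["Asha", "Rao", "works", "at", "Example", "Corp", "in", "Chennai", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC", "O"]

spans = iob_to_spans(tokens, tags)
counts = Counter(entity_type for entity_type, _ in spans)
print(spans)
print(dict(counts))
```

Counting spans per type in this way is one plausible route to per-category totals like those reported in the statistics tables on this page.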

The dataset contains train, dev and test splits. We have manually annotated gold-standard test sets for 8 languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Telugu.

We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set.

Downloads

Naamapadam Dataset: Available on our Huggingface repository

IndicNER model: Available on our Huggingface repository

Comparison of Indian-language named entity training datasets (total number of named entities). For all datasets, the statistics include only LOC, PER and ORG named entities.


Dataset     as      bn      gu      hi      kn      ml      mr      or      pa      ta      te
Naamapadam  5.0K    1.6M    765.5K  2.1M    655K    1.0M    731.2K  189.6K  875.9K  741.1K  747.8K
WikiANN     443     61K     1.6K    4.7K    1.9K    9.4K    7.2K    658     1K      12.5K   3.4K
FIRE-2014   -       6.1K    -       3.5K    -       4.2K    -       -       -       3.2K    -
CFILT       -       -       -       262.1K  -       -       4.8K    -       -       -       -
MultiCoNER  -       9.9K    -       10.5K   -       -       -       -       -       -       -
MahaNER     -       -       -       -       -       -       16K     -       -       -       -
AsNER       ~6K     -       -       -       -       -       -       -       -       -       -

Statistics for the Naamapadam dataset. The test sets for as, or and ta are either silver standard or small. Work on creating larger, manually annotated test sets for these languages is in progress.


Lang.  Sentence count          Train entities          Dev entities            Test entities
       Train   Dev     Test    Org     Loc     Per     Org     Loc     Per     Org     Loc     Per
bn     961.7K  4.9K    607     340.7K  560.9K  725.2K  1.7K    2.8K    3.7K    207     331     457
gu     472.8K  2.4K    1.1K    205.7K  238.1K  321.7K  1.1K    1.2K    1.6K    419     645     673
hi     985.8K  13.5K   437     686.4K  731.2K  767.0K  9.7K    10.2K   10.5K   257     302     263
kn     471.8K  2.4K    1.0K    167.5K  177.0K  310.5K  882     919     1.6K    291     397     614
ml     716.7K  3.6K    974     234.5K  308.2K  501.2K  1.2K    1.6K    2.6K    309     482     714
mr     455.2K  2.3K    1.1K    164.9K  224.0K  342.3K  868     1.2K    1.8K    391     569     696
pa     463.5K  2.3K    993     235.0K  289.8K  351.1K  1.1K    1.5K    1.7K    408     496     553
te     507.7K  2.7K    861     194.1K  205.9K  347.8K  1.0K    1.0K    2.0K    263     482     607
ta     497.9K  2.8K    49      177.7K  281.2K  282.2K  1.0K    1.5K    1.6K    26      34      22
as     10.3K   52      51      2.0K    1.8K    1.2K    18      5       3       11      7       6
or     196.8K  993     994     45.6K   59.4K   84.6K   225     268     386     229     266     431
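The abbreviated counts in the table above (for example 961.7K) can be turned into integers for quick checks. The sketch below does this for the training sentence counts and lists the languages with more than 400k training sentences. The helper parse_count is a hypothetical name introduced here; the values are copied from the table and are rounded, so the resulting integers are approximate.

```python
def parse_count(text: str) -> int:
    """Convert an abbreviated count such as '961.7K', '1.6M' or '607' to an int.

    Hypothetical helper; a leading '~' (approximate value) is ignored.
    """
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = text.strip().lstrip("~")
    suffix = text[-1].upper()
    if suffix in multipliers:
        return round(float(text[:-1]) * multipliers[suffix])
    return int(text)


# Training sentence counts, copied from the statistics table above.
train_sentences = {
    "bn": "961.7K", "gu": "472.8K", "hi": "985.8K", "kn": "471.8K",
    "ml": "716.7K", "mr": "455.2K", "pa": "463.5K", "te": "507.7K",
    "ta": "497.9K", "as": "10.3K", "or": "196.8K",
}

# Languages whose training split exceeds 400k sentences.
large = sorted(
    lang for lang, count in train_sentences.items()
    if parse_count(count) > 400_000
)
print(len(large), large)
```

With the table's values this lists nine languages; as and or fall below the threshold.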

Contributors

  • Arnav Mhaske (AI4Bharat, IITM)
  • Harshit Kedia (AI4Bharat, IITM)
  • Sumanth Doddapaneni (AI4Bharat, IITM)
  • Mitesh Khapra (AI4Bharat, IITM)
  • Pratyush Kumar (Microsoft, AI4Bharat, IITM)
  • Rudra Murthy V (IBM Research India, AI4Bharat, IITM), rmurthyv@in.ibm.com
  • Anoop Kunchukuttan (Microsoft, AI4Bharat, IITM), ankunchu@microsoft.com

Corresponding authors: Rudra Murthy V, Anoop Kunchukuttan

Citing

If you are using any of the resources, please cite the following article:

@misc{mhaske2022naamapadam,
  doi = {10.48550/ARXIV.2212.10168},
  url = {https://arxiv.org/abs/2212.10168},
  author = {Mhaske, Arnav and Kedia, Harshit and Doddapaneni, Sumanth and Khapra, Mitesh M. and Kumar, Pratyush and Murthy, Rudra and Kunchukuttan, Anoop},
  title = {Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
  publisher = {arXiv},
  year = {2022},
}

License

Naamapadam is released under this licensing scheme:

  • We do not own any of the text from which this data has been extracted.
  • We license the actual packaging of this data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Naamapadam.
  • This work is published from: India.