Naamapadam is the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu. For 9 of the 11 languages, it contains more than 400k sentences per language, annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization).
The dataset contains train, dev and test splits. The test sets for 8 languages are manually annotated gold standard: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi and Telugu.
We also release IndicNER, a multilingual BERT (mBERT) model fine-tuned on the Naamapadam training set.
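To make the annotation format concrete, the sketch below groups token-level labels into entity spans. It assumes BIO-style tags (`B-PER`, `I-PER`, `O`, and so on) over the PER, LOC and ORG categories; the tagging scheme is an assumption and is not stated in this README. The function name `bio_to_spans` and the English example sentence are made up for illustration.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Assumes tags of the form "B-XXX", "I-XXX" or "O" (an assumption,
    not confirmed by the README).
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A new entity starts here. A stray "I-" tag that does not
            # continue the current entity is treated as a beginning.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label
        elif prefix == "I":
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" (outside): close any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Made-up example sentence, not taken from the dataset.
tokens = ["Sachin", "Tendulkar", "was", "born", "in", "Mumbai"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# → [('Sachin Tendulkar', 'PER'), ('Mumbai', 'LOC')]
```

Counting the spans returned per type gives per-category entity counts of the kind reported in the statistics tables.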
Downloads
Naamapadam Dataset: Available on our Hugging Face repository
IndicNER model: Available on our Hugging Face repository
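A minimal sketch of loading one language with the Hugging Face `datasets` library follows. The repository id `ai4bharat/naamapadam` and the use of the two-letter language code as the configuration name are assumptions; check the repository page for the actual identifiers. The helper `load_naamapadam` is hypothetical.

```python
# Language codes as used in the tables of this README.
LANGUAGES = {
    "as": "Assamese", "bn": "Bengali", "gu": "Gujarati", "hi": "Hindi",
    "kn": "Kannada", "ml": "Malayalam", "mr": "Marathi", "or": "Oriya",
    "pa": "Punjabi", "ta": "Tamil", "te": "Telugu",
}


def load_naamapadam(lang: str, repo_id: str = "ai4bharat/naamapadam"):
    """Load the splits for one language.

    `repo_id` and the per-language configuration name are assumptions,
    not confirmed by this README.
    """
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language code {lang!r}; expected one of {sorted(LANGUAGES)}")
    # Imported lazily so the module can be used without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(repo_id, lang)
```

If the assumptions hold, `load_naamapadam("hi")` would download the Hindi data and return its train, dev and test splits.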
Comparison of Indian language Named Entity training dataset statistics (total number of named entities). For all datasets, the statistics include only LOC, PER and ORG named entities.
Dataset | as | bn | gu | hi | kn | ml | mr | or | pa | ta | te |
---|---|---|---|---|---|---|---|---|---|---|---|
Naamapadam | 5.0K | 1.6M | 765.5K | 2.1M | 655K | 1.0M | 731.2K | 189.6K | 875.9K | 741.1K | 747.8K |
WikiANN | 443 | 61K | 1.6K | 4.7K | 1.9K | 9.4K | 7.2K | 658 | 1K | 12.5K | 3.4K |
FIRE-2014 | – | 6.1K | – | 3.5K | – | 4.2K | – | – | – | 3.2K | – |
CFILT | – | – | – | 262.1K | – | – | 4.8K | – | – | – | – |
MultiCoNER | – | 9.9K | – | 10.5K | – | – | – | – | – | – | – |
MahaNER | – | – | – | – | – | – | 16K | – | – | – | – |
AsNER | ~6K | – | – | – | – | – | – | – | – | – | – |
Statistics for the Naamapadam dataset: sentence counts and entity counts (Org, Loc, Per) for each split. The test sets for as, or and ta are either silver standard or small. Work on creating larger, manually annotated test sets for these languages is in progress.
Lang. | Sentences: Train | Sentences: Dev | Sentences: Test | Train: Org | Train: Loc | Train: Per | Dev: Org | Dev: Loc | Dev: Per | Test: Org | Test: Loc | Test: Per |
---|---|---|---|---|---|---|---|---|---|---|---|---|
bn | 961.7K | 4.9K | 607 | 340.7K | 560.9K | 725.2K | 1.7K | 2.8K | 3.7K | 207 | 331 | 457 |
gu | 472.8K | 2.4K | 1.1K | 205.7K | 238.1K | 321.7K | 1.1K | 1.2K | 1.6K | 419 | 645 | 673 |
hi | 985.8K | 13.5K | 437 | 686.4K | 731.2K | 767.0K | 9.7K | 10.2K | 10.5K | 257 | 302 | 263 |
kn | 471.8K | 2.4K | 1.0K | 167.5K | 177.0K | 310.5K | 882 | 919 | 1.6K | 291 | 397 | 614 |
ml | 716.7K | 3.6K | 974 | 234.5K | 308.2K | 501.2K | 1.2K | 1.6K | 2.6K | 309 | 482 | 714 |
mr | 455.2K | 2.3K | 1.1K | 164.9K | 224.0K | 342.3K | 868 | 1.2K | 1.8K | 391 | 569 | 696 |
pa | 463.5K | 2.3K | 993 | 235.0K | 289.8K | 351.1K | 1.1K | 1.5K | 1.7K | 408 | 496 | 553 |
te | 507.7K | 2.7K | 861 | 194.1K | 205.9K | 347.8K | 1.0K | 1.0K | 2.0K | 263 | 482 | 607 |
ta | 497.9K | 2.8K | 49 | 177.7K | 281.2K | 282.2K | 1.0K | 1.5K | 1.6K | 26 | 34 | 22 |
as | 10.3K | 52 | 51 | 2.0K | 1.8K | 1.2K | 18 | 5 | 3 | 11 | 7 | 6 |
or | 196.8K | 993 | 994 | 45.6K | 59.4K | 84.6K | 225 | 268 | 386 | 229 | 266 | 431 |
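For the languages checked below, the Naamapadam row of the comparison table equals the sum of the training Org, Loc and Per counts in this table. The sketch parses the abbreviated counts and verifies that relation for four languages; the helper name `parse_count` is made up for illustration.

```python
def parse_count(text: str) -> int:
    """Convert an abbreviated count such as '205.7K' or '1.6M' to an integer."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = text.strip()
    if text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


# Training entity counts (Org, Loc, Per) copied from the statistics table.
TRAIN_ENTITIES = {
    "gu": ("205.7K", "238.1K", "321.7K"),
    "kn": ("167.5K", "177.0K", "310.5K"),
    "or": ("45.6K", "59.4K", "84.6K"),
    "as": ("2.0K", "1.8K", "1.2K"),
}

# Totals copied from the Naamapadam row of the comparison table.
REPORTED_TOTALS = {"gu": "765.5K", "kn": "655K", "or": "189.6K", "as": "5.0K"}

for lang, counts in TRAIN_ENTITIES.items():
    total = sum(parse_count(c) for c in counts)
    matches = total == parse_count(REPORTED_TOTALS[lang])
    print(lang, total, matches)
# → gu 765500 True
# → kn 655000 True
# → or 189600 True
# → as 5000 True
```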
Contributors
- Arnav Mhaske (AI4Bharat, IITM)
- Harshit Kedia (AI4Bharat, IITM)
- Sumanth Doddapaneni (AI4Bharat, IITM)
- Mitesh Khapra (AI4Bharat, IITM)
- Pratyush Kumar (Microsoft, AI4Bharat, IITM)
- Rudra Murthy V (IBM Research India, AI4Bharat, IITM), rmurthyv@in.ibm.com
- Anoop Kunchukuttan (Microsoft, AI4Bharat, IITM), ankunchu@microsoft.com
Corresponding authors: Rudra Murthy V, Anoop Kunchukuttan
Citing
If you are using any of the resources, please cite the following article:
@misc{mhaske2022naamapadam,
  doi = {10.48550/ARXIV.2212.10168},
  url = {https://arxiv.org/abs/2212.10168},
  author = {Mhaske, Arnav and Kedia, Harshit and Doddapaneni, Sumanth and Khapra, Mitesh M. and Kumar, Pratyush and Murthy, Rudra and Kunchukuttan, Anoop},
  title = {Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
  publisher = {arXiv},
  year = {2022},
}
License
Naamapadam is released under this licensing scheme:
- We do not own any of the text from which this data has been extracted.
- We license the actual packaging of this data under the Creative Commons CC0 license (“no rights reserved”).
- To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Naamapadam.
- This work is published from: India.