IndicSUPERB is a robust benchmark consisting of 6 speech language understanding (SLU) tasks across 12 Indian languages. The tasks include automatic speech recognition, automatic speaker verification, speech identification, query by example, and keyword spotting. IndicSUPERB also includes the Kathbath dataset, which has 1,684 hours of labelled speech data across 12 Indian languages.
Read more in our paper, IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages.
Kathbath dataset details
Statistic | bengali | gujarati | hindi | kannada | malayalam | marathi | odia | punjabi | sanskrit | tamil | telugu | urdu |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Data duration (hours) | 115.8 | 129.3 | 150.2 | 65.8 | 147.3 | 185.2 | 111.6 | 136.9 | 115.5 | 185.1 | 154.9 | 86.7 |
No. of male speakers | 18 | 44 | 58 | 53 | 12 | 82 | 10 | 65 | 95 | 116 | 53 | 36 |
No. of female speakers | 28 | 35 | 63 | 26 | 20 | 61 | 32 | 77 | 110 | 42 | 51 | 31 |
Vocabulary (no. of unique words) | 6k | 109k | 54k | 181k | 268k | 132k | 94k | 56k | 298k | 171k | 147k | 44k |
The dataset can be downloaded from the links given below.
Download Links (Clean split):
Transcripts: [clean]
Download Links (Noisy split):
Transcripts: [noisy]
Audio Dataset Format
Folder Structure of audios after extraction
Audio Data
data
├── bengali
│   ├── <split_name>
│   │   ├── 844424931537866-594-f.m4a
│   │   ├── 844424931029859-973-f.m4a
│   │   ├── ...
├── gujarati
├── ...
Transcripts
data
├── bengali
│   ├── <split_name>
│   │   ├── transcription_n2w.txt
├── gujarati
├── ...
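The layout above can be walked programmatically. The following is a minimal Python sketch, not part of the official IndicSUPERB tooling, that pairs each audio file with its transcript. It rests on two assumptions that the description above does not confirm: that each line of `transcription_n2w.txt` has the form `<audio file name><TAB><text>`, and that the audio and transcript archives are extracted into `data` folders with matching `<language>/<split_name>` subfolders. The function names `load_transcripts` and `iter_utterances` are hypothetical and exist only for illustration.

```python
from pathlib import Path
from typing import Dict, Iterator, Optional, Tuple


def load_transcripts(split_dir: Path) -> Dict[str, str]:
    """Read transcription_n2w.txt from one <language>/<split_name> folder.

    ASSUMPTION: each line is '<audio file name><TAB><text>'. Adjust the
    parsing if the released files use a different layout.
    """
    table: Dict[str, str] = {}
    transcript_file = split_dir / "transcription_n2w.txt"
    if not transcript_file.is_file():
        return table
    with transcript_file.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            name, _, text = line.partition("\t")
            # Key by file stem so 'id.m4a' and a bare 'id' both match.
            table[Path(name).stem] = text.strip()
    return table


def iter_utterances(
    audio_root: Path, transcript_root: Path
) -> Iterator[Tuple[str, str, Path, Optional[str]]]:
    """Yield (language, split_name, audio_path, transcript_or_None).

    Walks audio_root/<language>/<split_name>/*.m4a and looks up each file
    in the transcript table found under the same relative folder of
    transcript_root. Both roots may point at the same directory.
    """
    for language_dir in sorted(p for p in audio_root.iterdir() if p.is_dir()):
        for split_dir in sorted(p for p in language_dir.iterdir() if p.is_dir()):
            transcripts = load_transcripts(
                transcript_root / language_dir.name / split_dir.name
            )
            for audio_path in sorted(split_dir.glob("*.m4a")):
                yield (
                    language_dir.name,
                    split_dir.name,
                    audio_path,
                    transcripts.get(audio_path.stem),
                )
```

If both archives were extracted into the same `data` folder, `iter_utterances(Path("data"), Path("data"))` would yield one tuple per `.m4a` file, with `None` as the transcript wherever no matching line was found.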
If you are using any of the resources, please cite the following article:
@misc{https://doi.org/10.48550/arxiv.2208.11761,
doi = {10.48550/ARXIV.2208.11761},
url = {https://arxiv.org/abs/2208.11761},
author = {Javed, Tahir and Bhogale, Kaushal Santosh and Raman, Abhigyan and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
title = {IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
We would like to hear from you if:
The IndicSUPERB dataset is released under this licensing scheme:
The IndicSUPERB code and models are released under the MIT License.
We would like to thank the Ministry of Electronics and Information Technology (MeitY) of the Government of India and the Centre for Development of Advanced Computing (C-DAC), Pune, for generously supporting this work and providing us access to multiple GPU nodes on the Param Siddhi Supercomputer. We would like to thank the EkStep Foundation and Nilekani Philanthropies for their generous grant, which went into hiring human resources as well as cloud resources needed for this work. We would like to thank DesiCrew for connecting us to native speakers for collecting data. We would like to thank Vivek Seshadri from Karya Inc. for helping set up the data collection infrastructure on the Karya platform. We would like to thank all the members of the AI4Bharat team for helping create the Query by Example dataset.