IndicSUPERB – AI4BHĀRAT

IndicSUPERB is a benchmark consisting of 6 speech language understanding (SLU) tasks across 12 Indian languages. The tasks are automatic speech recognition, automatic speaker verification, speaker identification, language identification, query by example, and keyword spotting. IndicSUPERB also includes the Kathbath dataset, which has 1,684 hours of labelled speech data across 12 Indian languages.

Kathbath dataset details

	bengali	gujarati	hindi	kannada	malayalam	marathi	odia	punjabi	sanskrit	tamil	telugu	urdu
Data duration (hours)	115.8	129.3	150.2	165.8	147.3	185.2	111.6	136.9	115.5	185.1	154.9	86.7
No. of male speakers	18	44	58	53	12	82	10	65	95	116	53	36
No. of female speakers	28	35	63	26	20	61	32	77	110	42	51	31
Vocabulary (no. of unique words)	6k	109k	54k	181k	268k	132k	94k	56k	298k	171k	147k	44k

Downloads

The dataset can be downloaded from the links given below.

Download Links (Clean split):

Train: [85GB]
Valid: [3GB]
Test Known: [2GB]
Test Unknown: [2GB]

Transcripts: [clean]

Download Links (Noisy split):

Test Known: [2GB]
Test Unknown: [1.4GB]

Transcripts: [noisy]

Audio Dataset Format

The audio files for each language are present in separate folders.
The speaker and gender information are present in the name of the audio file.
The audio files are stored in m4a format. For resampling, please check the sample code here
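
As an illustration of the points above, here is a minimal Python sketch (not the sample code referred to above) that reads the metadata from a file name and builds an ffmpeg command for converting an m4a file to mono WAV. It rests on several assumptions: the name layout is taken to be <recording_id>-<speaker_id>-<gender>.m4a, inferred only from the example file names, so the order of the two numeric fields may be the other way round; the 16 kHz target rate is a common choice for speech models rather than a documented requirement; and the helper names parse_audio_name and ffmpeg_resample_command are hypothetical.

```python
import re
from pathlib import Path

# ASSUMPTION: file stems look like <recording_id>-<speaker_id>-<gender>,
# e.g. 844424931537866-594-f. The field order is inferred, not documented.
_NAME_RE = re.compile(
    r"^(?P<recording_id>\d+)-(?P<speaker_id>\d+)-(?P<gender>[mf])$"
)


def parse_audio_name(path):
    """Return the fields encoded in an audio file name as a dict of strings."""
    stem = Path(path).stem
    match = _NAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unexpected audio file name: {stem!r}")
    return match.groupdict()


def ffmpeg_resample_command(src, dst_dir, sample_rate=16000):
    """Build (but do not run) an ffmpeg command converting src to mono WAV.

    ASSUMPTION: 16 kHz mono is the desired target; adjust sample_rate as needed.
    """
    src = Path(src)
    dst = Path(dst_dir) / (src.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(src),           # input m4a file
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # output sampling rate in Hz
        str(dst),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where ffmpeg is installed; building the command separately makes it easy to inspect or batch before running anything.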

Folder Structure of audios after extraction

Audio Data
data
├── bengali
│   ├── <split_name>
│   │   ├── 844424931537866-594-f.m4a
│   │   ├── 844424931029859-973-f.m4a
│   │   ├── ...
├── gujarati
├── ...

Transcripts
data
├── bengali
│   ├── <split_name>
│   │   ├── transcription_n2w.txt
├── gujarati
├── ...
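
The two folder layouts above can be walked together. The following Python sketch is one possible way to do that, assuming the audio and transcript archives were extracted into two separate root folders that both follow the <language>/<split_name> layout shown. The function name index_kathbath is hypothetical, and the sketch only locates transcription_n2w.txt without parsing it, since the internal format of that file is not described here.

```python
from pathlib import Path


def index_kathbath(audio_root, transcript_root):
    """Map (language, split) to its audio files and transcript file.

    ASSUMPTION: both roots contain <language>/<split_name>/ subfolders,
    with *.m4a files under audio_root and transcription_n2w.txt under
    transcript_root.
    """
    audio_root = Path(audio_root)
    transcript_root = Path(transcript_root)
    index = {}
    for lang_dir in sorted(p for p in audio_root.iterdir() if p.is_dir()):
        for split_dir in sorted(p for p in lang_dir.iterdir() if p.is_dir()):
            transcript = (
                transcript_root / lang_dir.name / split_dir.name
                / "transcription_n2w.txt"
            )
            index[(lang_dir.name, split_dir.name)] = {
                "audio": sorted(split_dir.glob("*.m4a")),
                # None signals that no transcript was found for this split.
                "transcript": transcript if transcript.is_file() else None,
            }
    return index
```

Each value holds the sorted list of audio paths for one language and split, plus the path of the matching transcript file, or None when that file is missing.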

Citing our work

If you are using any of the resources, please cite the following article:

@misc{https://doi.org/10.48550/arxiv.2208.11761,
  doi = {10.48550/ARXIV.2208.11761},
  url = {https://arxiv.org/abs/2208.11761},
  author = {Javed, Tahir and Bhogale, Kaushal Santosh and Raman, Abhigyan and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
  title = {IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

We would like to hear from you if:

You are using our resources. Please let us know how you are putting these resources to use.
You have any feedback on these resources.

License

Dataset

The IndicSUPERB dataset is released under this licensing scheme:

We do not own any of the raw text used in creating this dataset. The text data comes from the IndicCorp dataset which is a crawl of publicly available websites.
The audio transcriptions of the raw text and labelled annotations of the datasets have been created by us.
We license the actual packaging of all this data under the Creative Commons CC0 license (“no rights reserved”).
To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to the IndicSUPERB dataset.
This work is published from: India.

Code and Models

The IndicSUPERB code and models are released under the MIT License.

Contributors

Tahir Javed
Kaushal Bhogale
Abhigyan Raman
Anoop Kunchukuttan
Mitesh Khapra
Pratyush Kumar

Contact

Anoop Kunchukuttan (anoop.kunchukuttan@gmail.com)
Mitesh Khapra (miteshk@cse.iitm.ac.in)
Pratyush Kumar (pratyush@cse.iitm.ac.in)

Acknowledgements

We would like to thank the Ministry of Electronics and Information Technology (MeitY) of the Government of India and the Centre for Development of Advanced Computing (C-DAC), Pune, for generously supporting this work and providing us access to multiple GPU nodes on the Param Siddhi Supercomputer. We would like to thank the EkStep Foundation and Nilekani Philanthropies for their generous grant, which went into hiring human resources as well as cloud resources needed for this work. We would like to thank DesiCrew for connecting us to native speakers for collecting data. We would like to thank Vivek Seshadri from Karya Inc. for helping set up the data collection infrastructure on the Karya platform. We would like to thank all the members of the AI4Bharat team for helping create the Query by Example dataset.