Shrutilipi is a labelled ASR corpus obtained by mining parallel audio and text pairs at the document scale from All India Radio news bulletins for 12 Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, Urdu. The corpus has over 6400 hours of data across all languages.

Read more in our paper – Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages

Dataset Details

Language   Size (in Hours)
bengali    443
gujarati   460
hindi      1620
kannada    459
malayalam  359
marathi    1015
odia       601
punjabi    94
sanskrit   27
tamil      794
telugu     390
urdu       193
Total      6457

Downloads

The dataset can be downloaded from the links given below –

Download transcripts – Link

The transcripts and audio paths are provided in fairseq format, which can be directly used for training models using the fairseq library. It consists of 3 files –

train.tsv file – Each line contains the relative path to an audio file and the number of frames in that audio, separated by a tab. The first line of the file is a header giving the absolute path to the dataset.

train.wrd (word) file – Each line contains the transcription of the audio file listed on the same line number of the ‘.tsv’ file (ignoring the header in the ‘.tsv’ file).

train.ltr (letter) file – Each line contains the character-level tokenization of the corresponding sentence in the ‘.wrd’ file.
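
As an illustration of how the three files line up, here is a minimal Python sketch that pairs each entry of train.tsv with its transcript in train.wrd and converts frame counts to seconds using the 16 kHz sampling rate of the corpus. The names Utterance, read_manifest and to_letters are hypothetical helpers, not part of fairseq or of this release. The to_letters function follows the convention used by fairseq's wav2vec label-preparation script (characters separated by spaces, with "|" marking word boundaries); whether the released .ltr files follow exactly this convention is an assumption.

```python
from pathlib import Path
from typing import Iterator, NamedTuple

SAMPLE_RATE = 16_000  # sampling rate of the corpus audio, in Hz


class Utterance(NamedTuple):
    """One audio clip with its transcript (hypothetical helper type)."""
    audio_path: Path
    num_frames: int
    text: str

    @property
    def seconds(self) -> float:
        # Frame count divided by sampling rate gives the clip duration.
        return self.num_frames / SAMPLE_RATE


def read_manifest(tsv_path: Path, wrd_path: Path) -> Iterator[Utterance]:
    """Pair each audio entry in the .tsv file with the same-numbered .wrd line."""
    with open(tsv_path, encoding="utf-8") as tsv, open(wrd_path, encoding="utf-8") as wrd:
        # The first .tsv line is the header holding the dataset root path.
        root = Path(tsv.readline().strip())
        for entry, text in zip(tsv, wrd):
            rel_path, frames = entry.rstrip("\n").split("\t")
            yield Utterance(root / rel_path, int(frames), text.strip())


def to_letters(words: str) -> str:
    """Character-tokenize a transcript (assumed wav2vec-style convention)."""
    return " ".join(words.strip().replace(" ", "|")) + " |"
```

Because the header stores an absolute path, it will usually need to be edited to match the location where the audio was extracted before the joined paths resolve. Note also that splitting by Python characters separates combining marks from their base letters in Indic scripts, so this sketch may not reproduce the released .ltr files exactly.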

Audio Dataset Format

  • The audio files for each news bulletin are present in separate folders.
  • The audio files are stored in WAV format, sampled at 16 kHz.
  • The audio filenames are numbered by sentence ID within the bulletin, e.g. sent_1.wav

Folder Structure

data
├── bengali
│   ├── <bulletin-1>
│   │   ├── sent_1.wav
│   │   ├── sent_2.wav
│   │   ├── ...
│   │   └── sent_n.wav
│   ├── <bulletin-2>
│   └── ...
├── gujarati
├── ...
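
The layout above can be traversed with a few lines of Python. The following is a rough sketch, assuming the files are plain PCM WAV readable by the standard-library wave module and that the extracted archive matches the tree shown. The function names bulletin_sentences, clip_seconds and language_hours are hypothetical and not part of any released tooling.

```python
import re
import wave
from pathlib import Path

# Sentence clips are named sent_<id>.wav inside each bulletin folder.
SENT_ID = re.compile(r"sent_(\d+)\.wav")


def bulletin_sentences(bulletin_dir: Path) -> list[Path]:
    """Return a bulletin's sentence clips ordered by numeric sentence id."""
    clips = [p for p in bulletin_dir.iterdir() if SENT_ID.fullmatch(p.name)]
    return sorted(clips, key=lambda p: int(SENT_ID.fullmatch(p.name).group(1)))


def clip_seconds(wav_path: Path, expected_rate: int = 16_000) -> float:
    """Duration of one clip; raises if the sampling rate is not the expected one."""
    with wave.open(str(wav_path), "rb") as wav:
        rate = wav.getframerate()
        if rate != expected_rate:
            raise ValueError(f"{wav_path}: {rate} Hz, expected {expected_rate} Hz")
        return wav.getnframes() / rate


def language_hours(data_root: Path, language: str) -> float:
    """Total audio duration, in hours, over all bulletin folders of one language."""
    total_seconds = 0.0
    for bulletin in sorted((data_root / language).iterdir()):
        if bulletin.is_dir():
            total_seconds += sum(clip_seconds(c) for c in bulletin_sentences(bulletin))
    return total_seconds / 3600
```

Sorting by the numeric id rather than by filename keeps sent_10.wav after sent_2.wav, which preserves the sentence order within a bulletin.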

Audio Download Links

Language   Download Link
bengali    Link (65 GB)
gujarati   Link (68 GB)
hindi      Link (229 GB)
kannada    Link (63 GB)
malayalam  Link (84 GB)
marathi    Link (123 GB)
odia       Link (74 GB)
punjabi    Link (12 GB)
sanskrit   Link (6 GB)
tamil      Link (107 GB)
telugu     Link (80 GB)
urdu       Link (30 GB)

Model Download links

Language   Download Link (3.6 GB)
bengali    Link
gujarati   Link
hindi      Link
marathi    Link
odia       Link
tamil      Link
telugu     Link

Shrutilipi – Mining Process

The figure below summarizes the key steps we used to mine audio-text pairs from documents in the All India Radio (AIR) dataset. For a detailed description of the data mining process, please refer to our paper.

[Figure: Mining Process]

Results on Hindi Benchmarks

Benchmarks                 Kathbath-Known  Kathbath-Unknown  Tarini  CommonVoice 6  CommonVoice 7  CommonVoice 8  CommonVoice 9  Avg.
W2V (MUCS)                 14.1            14.7              22.7    19.4           19.5           20.7           20.5           18.8
W2V (MUCS + Shrutilipi)    9.4             9.6               19.7    15             13.4           13.9           13.7           13.5
Conf. (MUCS)               17.2            17.7              25.4    20.9           21.4           22.9           22.8           21.2
Conf. (MUCS + Shrutilipi)  15.2            14.9              23.9    19.3           19.1           20             19.9           18.9

Results on Kathbath Unknown Test Set

Training data          bn    gu    hi    mr    or    ta    te    Avg.
Existing               14.4  15    14.7  25.6  31.5  24.1  22.3  21.1
Existing + Shrutilipi  13.4  9.5   9.6   15.7  21.5  19.7  17.7  15.3

Results on MUCS Benchmark

Training data          gu    hi    mr    or    ta    te    Avg.
Existing               17.9  12    13.6  23.3  20.5  16.4  17.3
Existing + Shrutilipi  12.8  11.1  11.4  23    20.7  13.8  15.5

Citing our work

If you are using any of the resources, please cite the following article:

@misc{https://doi.org/10.48550/arxiv.2208.12666,
  doi = {10.48550/ARXIV.2208.12666},
  url = {https://arxiv.org/abs/2208.12666},
  author = {Bhogale, Kaushal Santosh and Raman, Abhigyan and Javed, Tahir and Doddapaneni, Sumanth and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
  title = {Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

We would like to hear from you if:

  • You are using our resources. Please let us know how you are putting these resources to use.
  • You have any feedback on these resources.

License

Dataset

The Shrutilipi dataset is released under this licensing scheme:

  • We do not own any of the raw text and audio from which this dataset has been extracted.
  • The raw dataset and audio have been crawled from the publicly available website: https://newsonair.gov.in
  • We license the actual packaging of this data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to the Shrutilipi dataset.
  • This work is published from: India.

Code and Models

The code and models are released under the MIT License.

Contributors

  • Kaushal Bhogale
  • Abhigyan Raman
  • Tahir Javed
  • Sumanth Doddapaneni
  • Anoop Kunchukuttan
  • Mitesh Khapra
  • Pratyush Kumar

Acknowledgements

We would like to thank the Ministry of Electronics and Information Technology (MeitY) of the Government of India and the Centre for Development of Advanced Computing (C-DAC), Pune, for generously supporting this work and providing us access to multiple GPU nodes on the Param Siddhi Supercomputer. We would like to thank the EkStep Foundation and Nilekani Philanthropies for their generous grant, which went into hiring human resources as well as the cloud resources needed for this work. We would like to thank Megh Makhwana from Nvidia for helping train the Conformer-based ASR models. We would like to thank the EkStep Foundation for providing the Tarini dataset. We would like to thank Janki Nawale and Anupama Sujatha from AI4Bharat for helping coordinate the annotation task, and we extend our thanks to all the annotators of the AI4Bharat team.