Dhwani Dataset – AI4BHĀRAT

Website | Paper | Downloads

Dhwani is an unlabelled ASR corpus obtained from YouTube and News On AIR news bulletins. The dataset contains raw audio across 40 Indian languages.

Dataset details

Numbers represent hours

Dataset Format

The audio files are stored in separate folders.
For YouTube, each audio file is named after its YouTube ID. For NewsOnAir, the filename is the concatenation of the region name and the bulletin timing.

Folder Structure

For YouTube

YT
├── bengali
│   ├── XXXXXXXXXXX.wav
│   ├── XXXXXXXXXXX.wav
│   ├── XXXXXXXXXXX.wav
│   └── ...
├── gujarati
├── ...
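
Since each YouTube file is named after its YouTube ID, a local copy can be indexed by language and mapped back to its source video. The following is a minimal sketch, assuming the layout shown above (`YT/<language>/<id>.wav`); the function names `index_youtube_audio` and `watch_url` are hypothetical helpers, not part of the Dhwani release.

```python
from collections import defaultdict
from pathlib import Path


def index_youtube_audio(yt_root):
    """Map each language folder under yt_root to the YouTube IDs of its .wav files.

    Assumes the YT/<language>/<id>.wav layout shown in the folder structure.
    """
    index = defaultdict(list)
    for wav in sorted(Path(yt_root).glob("*/*.wav")):
        # Folder name is the language; file stem is the YouTube ID.
        index[wav.parent.name].append(wav.stem)
    return dict(index)


def watch_url(video_id):
    """Build the standard YouTube watch URL for a video ID."""
    return f"https://www.youtube.com/watch?v={video_id}"


if __name__ == "__main__":
    # "YT" is the assumed local path of the extracted YouTube folder.
    for language, ids in index_youtube_audio("YT").items():
        print(language, len(ids), "files")
```

Keeping the index as a plain dictionary makes it easy to count files per language or to check a local copy against the published URL list.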

For NOA

NOA
├── Audio
│   ├── assamese
│   │   ├── audio
│   │   │   ├── newsonair.nic.in
│   │   │   │   ├── NSD-Assamese-Assamese-0705-0715-201810107486.mp3
│   │   │   │   ├── NSD-Assamese-Assamese-0705-0715-20181011161537.mp3
├── gujarati
├── ...
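
The NewsOnAir filenames pack several hyphen-separated fields into one string, so it can help to split them before filtering bulletins. The sketch below is based only on the two example filenames above. The field names (`prefix`, `labels`, `start`, `end`, `suffix`) and the reading of the two four-digit fields as HHMM bulletin times are assumptions, not a documented naming scheme, and `parse_noa_filename` and `slot_minutes` are hypothetical helpers.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class NoaName:
    prefix: str   # first field, e.g. "NSD" (meaning assumed, not documented)
    labels: list  # middle fields, e.g. ["Assamese", "Assamese"]
    start: str    # assumed bulletin start time, HHMM
    end: str      # assumed bulletin end time, HHMM
    suffix: str   # trailing numeric field (meaning not documented)


def parse_noa_filename(filename):
    """Split a NewsOnAir filename into its hyphen-separated fields."""
    parts = Path(filename).stem.split("-")
    if len(parts) < 4:
        raise ValueError(f"unexpected NewsOnAir filename: {filename}")
    return NoaName(
        prefix=parts[0],
        labels=parts[1:-3],
        start=parts[-3],
        end=parts[-2],
        suffix=parts[-1],
    )


def slot_minutes(name):
    """Length of the bulletin slot in minutes, assuming HHMM start/end fields."""
    start = int(name.start[:2]) * 60 + int(name.start[2:])
    end = int(name.end[:2]) * 60 + int(name.end[2:])
    # Modulo handles a slot that crosses midnight.
    return (end - start) % (24 * 60)


if __name__ == "__main__":
    info = parse_noa_filename("NSD-Assamese-Assamese-0705-0715-201810107486.mp3")
    print(info.labels, info.start, info.end, slot_minutes(info))
```

Note that the slot length derived from the filename is only the scheduled window under the HHMM assumption; the actual audio duration should be read from the file itself.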

Downloads

YouTube urls

NewsOnAir – Please crawl the data from the following website – https://newsonair.gov.in/

Citing our work

If you are using any of the resources, please cite the following article:

@dataset{

}

We would like to hear from you if:

You are using our resources. Please let us know how you are putting these resources to use.
You have any feedback on these resources.

License

The Dhwani dataset, models and code are released under the MIT License.

Contributors

Tahir Javed (IITM, AI4Bharat)
Sumanth Doddapaneni (AI4Bharat, RBCDSAI)
Abhigyan Raman (AI4Bharat)
Kaushal Bhogale (AI4Bharat)
Gowtham Ramesh (AI4Bharat, RBCDSAI)
Anoop Kunchukuttan (Microsoft, AI4Bharat)
Pratyush Kumar (Microsoft, AI4Bharat)
Mitesh Khapra (IITM, AI4Bharat, RBCDSAI)

Contact

Anoop Kunchukuttan (anoop.kunchukuttan@gmail.com)
Mitesh Khapra (miteshk@cse.iitm.ac.in)
Pratyush Kumar (pratyush@cse.iitm.ac.in)

Acknowledgements

We would like to thank EkStep Foundation for their generous grant, which helped in setting up the Centre for AI4Bharat at IIT Madras to support our students, research staff, data and computational requirements. We would like to thank the Ministry of Electronics and Information Technology (NLTM) for its grant to support the creation of datasets and models for Indian languages under its ambitious Bhashini project. We would also like to thank the Centre for Development of Advanced Computing, India (C-DAC) for providing access to the Param Siddhi supercomputer for training our models. Lastly, we would like to thank Microsoft for its grant to create datasets, tools and resources for Indian languages.