Description

Recent advancements in text-to-speech (TTS) synthesis show that large-scale models trained on extensive web data produce highly natural-sounding output. However, such data is scarce for Indian languages because platforms like LibriVox and YouTube offer little high-quality, manually subtitled material in these languages. To address this gap, we enhance existing large-scale ASR datasets, which contain natural conversations recorded in low-quality environments, to generate high-quality TTS training data. Our pipeline leverages the cross-lingual generalization of denoising and speech enhancement models that were trained on English and are applied here to Indian languages. The result is IndicVoices-R (IV-R), the largest multilingual Indian TTS dataset derived from an ASR dataset, with 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages. IV-R matches the quality of gold-standard TTS datasets such as LJSpeech, LibriTTS, and IndicTTS. We also introduce the IV-R Benchmark, the first to assess the zero-shot, few-shot, and many-shot speaker generalization capabilities of TTS models on Indian voices, ensuring diversity in age, gender, and style. We demonstrate that fine-tuning an English pre-trained model on a combination of the high-quality IndicTTS dataset and our IV-R dataset yields better zero-shot speaker generalization than fine-tuning on IndicTTS alone. Further, our evaluation reveals that TTS models trained on prior datasets generalize poorly to unseen Indian voices in the zero-shot setting, which we improve by fine-tuning the model on our data containing a diverse set of speakers across language families. We open-source all data and code, and build the first TTS model for all 22 official Indian languages.

Audio Samples

Paired samples from IndicVoices and IndicVoices-R are provided for the following languages: Assamese, Gujarati, Kannada, Konkani, Malayalam, Manipuri, Marathi, Odia, Punjabi, Sanskrit, and Sindhi.

Downloads

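| Language  | Download Link | Size     |
|-----------|---------------|----------|
| Assamese  | Link          | 83.53 GB |
| Bengali   | Link          | 57.33 GB |
| Bodo      | Link          | 79.62 GB |
| Dogri     | Link          | 34.71 GB |
| Gujarati  | Link          | 4.39 GB  |
| Hindi     | Link          | 37.40 GB |
| Kannada   | Link          | 21.72 GB |
| Kashmiri  | Link          | 30.36 GB |
| Konkani   | Link          | 25.88 GB |
| Maithili  | Link          | 41.64 GB |
| Malayalam | Link          | 40.73 GB |
| Marathi   | Link          | 25.22 GB |
| Manipuri  | Link          | 11.15 GB |
| Nepali    | Link          | 50.20 GB |
| Odia      | Link          | 35.03 GB |
| Punjabi   | Link          | 36.89 GB |
| Sanskrit  | Link          | 17.16 GB |
| Santali   | Link          | 29.11 GB |
| Sindhi    | Link          | 5.28 GB  |
| Tamil     | Link          | 49.38 GB |
| Telugu    | Link          | 66.91 GB |
| Urdu      | Link          | 38.80 GB |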

Statistics

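| Language   | # Hours  | # Utterances | # Speakers |
|------------|----------|--------------|------------|
| Assamese   | 175.34   | 73077        | 928        |
| Bengali    | 111.99   | 40943        | 617        |
| Bodo       | 172.05   | 83976        | 941        |
| Dogri *    | 70.68    | 27967        | 470        |
| Gujarati   | 8.94     | 3304         | 45         |
| Hindi      | 74.60    | 27557        | 399        |
| Kannada    | 44.61    | 18127        | 452        |
| Kashmiri * | 64.99    | 26134        | 450        |
| Konkani    | 53.06    | 22357        | 228        |
| Maithili * | 81.77    | 32483        | 627        |
| Malayalam  | 82.57    | 32544        | 462        |
| Manipuri   | 23.99    | 9312         | 127        |
| Marathi    | 50.90    | 20164        | 359        |
| Nepali *   | 105.87   | 43545        | 716        |
| Odia       | 70.95    | 26450        | 441        |
| Punjabi    | 74.94    | 27788        | 335        |
| Sanskrit * | 35.75    | 14604        | 161        |
| Santali *  | 76.37    | 35155        | 309        |
| Sindhi *   | 10.48    | 4197         | 204        |
| Tamil      | 99.47    | 40464        | 1084       |
| Telugu     | 136.40   | 48485        | 681        |
| Urdu *     | 78.61    | 30935        | 460        |
| Total      | 1,704.34 | 689568       | 10496      |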
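As a quick sanity check on the statistics above, the following Python sketch re-adds the per-language rows and derives the average utterance length. The figures are copied from the table; the names `STATS`, `totals`, and `mean_utterance_seconds` are illustrative and not part of any released IndicVoices-R tooling.

```python
# Per-language statistics copied from the table above:
# language -> (hours, utterances, speakers)
STATS = {
    "Assamese": (175.34, 73077, 928),
    "Bengali": (111.99, 40943, 617),
    "Bodo": (172.05, 83976, 941),
    "Dogri": (70.68, 27967, 470),
    "Gujarati": (8.94, 3304, 45),
    "Hindi": (74.60, 27557, 399),
    "Kannada": (44.61, 18127, 452),
    "Kashmiri": (64.99, 26134, 450),
    "Konkani": (53.06, 22357, 228),
    "Maithili": (81.77, 32483, 627),
    "Malayalam": (82.57, 32544, 462),
    "Manipuri": (23.99, 9312, 127),
    "Marathi": (50.90, 20164, 359),
    "Nepali": (105.87, 43545, 716),
    "Odia": (70.95, 26450, 441),
    "Punjabi": (74.94, 27788, 335),
    "Sanskrit": (35.75, 14604, 161),
    "Santali": (76.37, 35155, 309),
    "Sindhi": (10.48, 4197, 204),
    "Tamil": (99.47, 40464, 1084),
    "Telugu": (136.40, 48485, 681),
    "Urdu": (78.61, 30935, 460),
}


def totals(stats):
    """Sum hours, utterances, and speakers over all languages."""
    hours = round(sum(h for h, _, _ in stats.values()), 2)
    utterances = sum(u for _, u, _ in stats.values())
    speakers = sum(s for _, _, s in stats.values())
    return hours, utterances, speakers


def mean_utterance_seconds(hours, utterances):
    """Average utterance duration in seconds."""
    return hours * 3600.0 / utterances


if __name__ == "__main__":
    total_hours, total_utts, total_spk = totals(STATS)
    print(f"hours={total_hours} utterances={total_utts} speakers={total_spk}")
    for lang, (h, u, _) in STATS.items():
        print(f"{lang}: {mean_utterance_seconds(h, u):.2f} s per utterance")
```

The utterance and speaker columns sum exactly to the Total row (689568 and 10496), and the hours agree with the reported 1,704.34 to within rounding of the per-language values.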

License

CC-BY-4.0

Acknowledgements

We thank Digital India Bhashini, the Ministry of Electronics and Information Technology of the Government of India, Centre for Development of Advanced Computing Pune, EkStep Foundation and Nilekani Philanthropies for their generous grants and support. We also thank the entire team at AI4Bharat.