IndicVoices: The Journey

Welcome to the incredible journey of IndicVoices, a monumental endeavor funded by Bhashini, under the Ministry of Electronics and Information Technology, Government of India, and generously supported by Ekstep Foundation and Nilekani Philanthropies. Our ambitious mission? To collect spontaneous speech data across the rich tapestry of Indian languages, while honoring the vast linguistic, cultural, and demographic diversity that India boasts. This quest took us on an exhilarating adventure across the country, from the snowy peaks of the north to the sun-kissed shores of the south, amassing a staggering 7348 hours of read, extempore, and conversational audio from 16,237 speakers, spanning 145 Indian districts and 22 languages. The journey has just begun and we are committed to our goal of capturing ~17,000 hours of voice data across more than 400 districts in India.

The scale of this project was nothing short of epic, involving a dedicated army of 1893 individuals, including language experts, local mobilizers, coordinators, quality control experts, transcribers, language leads, and project managers. As we embarked on this ambitious journey, we didn't just collect data; we collected stories, laughter, and the myriad voices of India, making IndicVoices not just a project but a life-changing expedition for everyone involved.

Join us as we share some of the most unforgettable moments and experiences from this journey, offering a glimpse into the heart and soul of India through the voices of its people.

Humble Beginnings in the Holy City of Madurai

The Forgotten Generation

Right from the beginning, we were very clear that we wanted sufficient participants from the senior citizen age group to capture their rich life experiences and tap into their repository of knowledge about Indian customs, traditions, and beliefs. However, addressing the underrepresentation of the senior demographic emerged as a significant hurdle. The limited presence of individuals aged 60 and above necessitated a reevaluation of our outreach efforts. Engaging with old age homes, senior citizen clubs, and conducting home visits became essential strategies to include this vital segment of the population. This challenge highlighted the importance of inclusivity and the need for tailored approaches to ensure diverse demographic participation. Our concerted efforts to involve senior citizens bore fruit in an unexpectedly delightful way, particularly notable in the diverse regions of India. A striking example of this success was observed in Jammu, where the older population displayed remarkable enthusiasm towards contributing to the project. This enthusiasm stemmed not just from their fluency in their native language (Dogri, in this case) but also from a deep-seated urge to preserve and pass on their linguistic heritage. Contrary to the challenges faced with other age groups and languages, senior citizens in Jammu and beyond became invaluable participants. Their eagerness to contribute not only enriched our dataset with authentic, nuanced language use but also underscored the critical role of senior citizens in safeguarding linguistic diversity.

The Need for a Guiding Hand

In the initial pilots, our fears about ‘What if people don’t speak’ were confirmed. We learned the hard way that eliciting fluent speech with good quality content from individuals in interactions with strangers posed an intriguing challenge. Recognizing the pivotal role of participants’ comfort in encouraging natural dialogue, we acknowledged the importance of assigning dedicated coordinators to accompany participants throughout the data collection process. Through targeted training, these coordinators were equipped with effective communication skills to engage participants authentically. Consequently, we established a procedural framework to ensure coordinators’ proficiency in guiding participants through the data collection procedure. The coordinators also helped the participants in overcoming technical challenges in installing the app, getting accustomed to the “record-verify-submit” workflow on the app, and clarifying the expectation with respect to each microtask.

Back to the drawing board

After these initial pilots, we took a pause to critically reassess our data collection strategy. Insights from our initial pilots revealed that participants often provided repetitive answers, and everyday conversations failed to yield a diverse vocabulary and had very little coverage of names, numbers, entities, and brand names critical for downstream Automatic Speech Recognition (ASR) applications. The pilots also offered us a glimpse into people’s interactions with technology and their expectations from it, such as placing orders or completing digital transactions. We also realized that simple prompts like “talk about politics” fell short, and sparking a meaningful dialogue between strangers over a phone call proved challenging, often resulting in mere exchanges of pleasantries without covering a wide array of topics.

Recognizing these gaps, we returned to the drawing board for an in-depth pre-collection phase. Our goal was to collect sentences with rich vocabulary, craft engaging questions spanning various domains, and create scenarios that simulate everyday digital interactions. We also refined our process to include tasks that would elicit responses rich in sequences of numbers, named entities, locations, dates, etc., and add role-play scenarios with detailed narratives to encourage dynamic conversations between two parties.

Weather Plays Spoilsport

After refining our approach, we headed to the extreme north of India, to the beautiful land of Kashmir. Excited as we were, we encountered the formidable challenge of the region’s harsh winter. Starting our pilot in mid-November in Srinagar, we were keenly aware of the narrow operational window before the onset of heavy snowfall, which renders data collection nearly impossible from December to February. The unique weather conditions and the limited daylight hours significantly restricted our daily operations, allowing us only a brief period between 10 a.m. and 5:30 p.m. for recordings. This time constraint meant that each coordinator could only manage sessions with a maximum of two participants per day, highlighting the need for a more adaptable approach to meet our productivity and deadline goals. As we moved forward, it became crucial to innovate our data collection methods, shifting towards conducting recordings within the warmth and accessibility of participants’ homes in the subsequent districts. This was a valuable lesson and an early realization that, given the diverse geographical landscape of India, we would have to be mindful of weather conditions everywhere: self-imposed afternoon curfews during summer in West Bengal, mobility issues during the monsoon in Kerala and Goa, floods in Northeast India, harsh winters in northern regions such as Punjab, Delhi, Kashmir, and Jammu, and so on. On a lighter note, while working on IndicVoices, we have become experts on weather conditions in different parts of the country.

A Journey to the Remotest Parts of India

After covering Kashmir (Kashmiri) and Jammu (Dogri), our journey took us to the north-eastern parts of India to conduct pilot studies covering Assamese, Bodo, Manipuri, and Nepali. Initially filled with enthusiasm, our venture into Assam’s relatively tranquil settings soon led us into the more secluded and challenging terrains of Bodoland. Here, the sparse population and limited access to resources called the feasibility of our project into question, presenting a stark contrast to our prior experiences. Bodoland, with its serene yet isolated landscape of traditional tribal huts, quiet, dimly lit pathways, and an intimidating silence unbroken by vehicular noise, offered a unique set of challenges. The initial low turnout at our designated collection site in Kokrajhar prompted us to rethink our approach to engaging with remote tribal communities. Their concerns over privacy and their hesitation to share opinions openly reminded us of the delicate balance required to ensure inclusivity while being mindful of socio-political or geographical hurdles.

Our journey further led us to Kalimpong in West Bengal, a region where the linguistic landscape shifts dramatically to predominantly Nepali speakers, diverging significantly from the Bengali culture of the state. This diversity within a single state highlighted the intricate patchwork of India’s linguistic heritage. In response, we tailored our data collection approach, creating district-specific hints to engage participants in a manner that resonated with their unique linguistic identity.

Following this, the endeavor in Imphal West, Manipur, introduced us to a different set of challenges marked by remoteness and the complexities of operating amidst constant disruptions. The initiation of the pilot in the compact confines of a local hotel was just the beginning of a journey punctuated by curfews, riots, and internet shutdowns, which significantly delayed our progress. The decision to romanize text data to include older participants unfamiliar with the local script, and continuing annotation work offline during internet shutdowns, were necessary to adapt to the ever-changing conditions in the state.

This expedition across the remote parts of India was not just a journey through diverse geographical landscapes but a deep dive into the heart of India’s linguistic diversity. Each region, with its unique challenges, taught us the importance of resilience, adaptability, and the profound value of including voices from every region of the country, no matter how remote.

The Rural Urban Divide

Following this, we did some pilots in rural districts in West Bengal covering two languages, Bengali and Santali. This illuminated the stark urban-rural divide, presenting unique challenges and learning opportunities at every turn. As we ventured into the tribal regions to engage with Santali-speaking communities, the serene yet complex rural setting offered a vivid contrast to the bustling urban environments we had previously navigated. Conducting recording sessions in the midst of forests, under the canopy of trees, or in the humble backyards of huts, we were confronted with the realities of rural life: limited internet connectivity, the scarcity of participants fitting specific demographic profiles, unpredictable weather, and the reticence of women to participate.

These experiences shed light on the significant divide between urban and rural contexts, especially in terms of technological accessibility and the relevance of certain questions and use cases. It became apparent that some of our initial questions, designed with an urban mindset, were not resonant with the daily experiences of rural participants. For instance, the concept of “hailing a cab” was alien to many, revealing a disconnect in the applicability of our queries. This insight prompted us to revisit our approach and make our questions more closely aligned with the realities of rural life. We modified scenarios to involve “arranging for transport for cattle or food grains,” among other adjustments, ensuring our questions and use-cases were relevant and relatable to the lives of our rural participants.

This re-calibration was not merely about changing the wording of questions but about cultural and contextual sensitivity in linguistic data collection. This journey through the contrasting landscapes of India reinforced the notion that to ensure inclusivity we need to constantly change our assumptions, adapt, and improve our processes.

The Silence Amidst the Noise

While rural areas offered a raw and unfiltered glimpse into India’s linguistic diversity, urban settings introduced a different set of challenges. In bustling metropolises like Mumbai, the search for tranquil venues for recording sessions became a Herculean task, particularly in densely populated areas. The urban clamor and the scarcity of quiet spaces escalated the costs and complexities of data collection, underscoring the logistical hurdles unique to urban centers. Moreover, the enthusiasm for participating in data collection efforts was noticeably muted among the urban populace, especially among professionals leading busy lives. This apathy necessitated innovative mobilization strategies to engage a demographic that seemed distant from the cause. Despite these obstacles, the endeavor to capture the linguistic essence of India’s urban centers was as crucial as that of its rural counterparts. Overcoming this silence and capturing voices amidst the noise became an essential part of our mission.

India, a land of many festivals

Navigating the vibrant maze of India’s festivals proved to be one of the most colorful challenges in our data collection journey. In a country where each region celebrates its own set of festivals with fervor and devotion, scheduling work around these celebrations was akin to finding a needle in a haystack of holidays. From Durga Puja in West Bengal and Odisha, Ramzaan in Kashmir, Bihu and Pongal across other parts, to the universally celebrated Diwali, our calendar was a mosaic of cultural festivities.

The complexity of scheduling was humorously encapsulated in a conversation with one of our partners, which turned into a comedic back-and-forth of date dodging.

Partner: We can’t start in May as it is too hot that time of the year!
We: Ok, let’s start in June then.
Partner: No, that would be difficult due to the monsoon season. Both June and July would be washed out, quite literally!
We: That looks bad. Then we should definitely start in August.
Partner: But then we would have Ganesh Chaturthi which is a very important festival here. No participants would turn up during that time.
We: Phew, what about September?
Partner: Schools and colleges will have exams so we will not be able to use them as venues (other options would be expensive).
We (frustrated): Okay, then I guess after that we would have to wait for Dussehra (October), Diwali (November), Christmas and New Year (December) to also pass by.
Partner (with a straight face): Yes, that would be ideal!

This humorous exchange underscored a significant reality of executing a project of this scale in India. It taught us the importance of flexibility, patience, and the ability to laugh at the seemingly impossible task of scheduling around the endless cycle of festivals. In the end, these challenges just became a part of our journey, making every successfully covered district feel like a festival in its own right.

Everything that could go wrong

Amidst all the celebrations, we soon realized that in a remote and distributed setup involving a large number of people, ensuring quality is a challenging task. Early in the pilot phases, we observed participants abruptly stopping mid-speech or drifting into unrelated conversations with coordinators without stopping the recording, leading to fragmented audio and content corruption. Despite comprehensive guidelines and thorough training, the human element introduced unpredictability in task execution, with participants sometimes simply reading the questions/prompts instead of answering/enacting them. It became very clear that we needed an in-house QC team whose task would be to listen to every audio file collected on the ground and tag such errors. We iteratively refined our error categorization, adapting to new types of errors as they were discovered.

In their eagerness to assist participants, coordinators sometimes went above and beyond, inadvertently scripting entire responses or conversations. These were then merely read aloud by the participants, transforming what was supposed to be a spontaneous exchange into a rehearsed performance and diminishing the authenticity of the speech. To combat these issues and preserve data integrity, we introduced error categories like ‘Bad extemporaneous’ and ‘Book read’ so that such content could be tagged. Similarly, in telephonic conversations, a tendency emerged for one participant to dominate, leading to minimal contributions from the other party, a phenomenon tagged as ‘SST’ (Single Speaker Talking) by our QC team. Categories like ‘Stretching,’ ‘Repeating Content,’ and ‘Long Pauses’ were introduced to counter verbosity and repetition, ensuring the recordings were content-rich.

In several places, the authenticity of participants’ identities emerged as a significant challenge. Concerns were raised when the voice of a participant didn’t seem to align with their reported age and/or gender, leading to discoveries of intentional misinformation or unintentional errors in registration. Some instances revealed inconsistency in voices under the same participant ID, hinting at multiple individuals sharing a single ID, while others showed the same voice across different IDs, indicating individuals masquerading as multiple participants. To address these authenticity issues, we introduced a micro-task requiring participants to record a video stating basic information, allowing our QC team to verify age and gender visually. Disparities led to data rejection, while voice mismatches in audio samples triggered further scrutiny. Recognizing privacy concerns and cultural sensitivities, particularly among female participants reluctant to record videos, we offered alternatives like live verification through WhatsApp calls, conducted by female QC members, ensuring a respectful and secure verification process.

Data collection across diverse settings—outdoors, in public schools, small hotels, and participants’ homes—brought the challenge of background noise interference. Distant ambient noises were less intrusive compared to the constant buzz of fans in closed spaces. Distinguishing between unavoidable natural background sounds and disruptive persistent noises was essential.

An unexpected challenge arose with the capture of highly expressive, albeit profane, reactions to daily frustrations, necessitating a balance between authenticity and appropriateness. This led to the creation of an ‘Objectionable Content’ category to carefully screen for hate speech or inappropriate content. Through this iterative process of reviewing audio files, our QC team came up with 23 error categories which comprehensively captured everything that could go wrong!
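For readers who think in code, the tagging scheme described above could be represented roughly as follows. This is only a minimal Python sketch, not the team's actual tooling: the enum lists just the categories named in this section (the remaining ones are not listed here), and the `Recording` class, the `is_single_speaker_talking` helper, and its `min_share` threshold of 0.2 are illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class QCError(Enum):
    # Only the categories named in the text; the full set of 23 is not listed there.
    BAD_EXTEMPORANEOUS = "Bad extemporaneous"
    BOOK_READ = "Book read"
    SST = "SST"  # Single Speaker Talking
    STRETCHING = "Stretching"
    REPEATING_CONTENT = "Repeating Content"
    LONG_PAUSES = "Long Pauses"
    OBJECTIONABLE_CONTENT = "Objectionable Content"


@dataclass
class Recording:
    """Hypothetical record of one audio file and the QC tags attached to it."""
    recording_id: str
    tags: set = field(default_factory=set)


def is_single_speaker_talking(seconds_a, seconds_b, min_share=0.2):
    """Return True if the quieter party holds less than min_share of the talk time.

    The 0.2 default is an assumed, illustrative threshold.
    """
    total = seconds_a + seconds_b
    if total <= 0:
        return False
    return min(seconds_a, seconds_b) / total < min_share


def tag_counts(recordings):
    """Count how often each error category was tagged across recordings."""
    return Counter(tag for rec in recordings for tag in rec.tags)


# Usage: one lopsided phone conversation and one scripted monologue.
call = Recording("conv-001")
if is_single_speaker_talking(seconds_a=540.0, seconds_b=45.0):
    call.tags.add(QCError.SST)
monologue = Recording("ext-002", {QCError.BOOK_READ, QCError.LONG_PAUSES})
counts = tag_counts([call, monologue])
```

Keeping the categories in a single enum makes it easy to add new ones as reviewers discover them, which mirrors the iterative refinement described above.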

The Subtle Art of Transcription

While we continued on our journey across the country, little did we know that our greatest challenge lay not in the fieldwork, but in the nuanced art of transcription. Initially perceived as a straightforward task of converting speech to text, the complexity of transcribing the diverse speech styles of thousands of individuals soon became apparent. The main issue was the difference between colloquially spoken language and the standardized language found in textbooks. The former contains words which may not have any spellings in standard textbooks or dictionaries but still cannot be ignored, simply because this is how people talk! This required us to choose between a purely phonetic representation of speech, resulting in non-standard spellings, on one hand, and a pure textbook representation, which deviated from what was being said, on the other. To address this, we ended up with a two-level transcription approach. In the first level, the transcribers were asked to transcribe verbatim without worrying about correctness of spellings (thus emphasizing only phonetic fidelity). In the second level, the transcribers were asked to standardize the transcription by converting the phonetically correct representations to the nearest standard textbook spellings. Thus the lazily spoken Hindi word “muje” would be transcribed verbatim as “muje” in Level 1, ensuring phonetic fidelity, and then standardized to “mujhe” in Level 2, ensuring spelling accuracy. However, aligning all language experts and transcribers with this novel framework proved challenging. Many transcribers initially resisted typing verbatim spellings, feeling it betrayed the standard writing style. Through extensive discussions, iterative rounds of feedback, and a collaborative effort across language teams, we established a set of guidelines that balanced standardized spelling with fidelity to the actual sound wave.

This meticulous process underscored transcription not just as a task but as an art form, requiring a deep understanding of linguistic nuances, cultural context, and the delicate balance between preserving the integrity of spoken language and adhering to standard linguistic conventions.
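The two-level idea can be pictured as keeping two parallel versions of every utterance. The Python sketch below is only an illustration under stated assumptions: in the project both levels were produced by human transcribers, whereas here Level 2 is approximated with a tiny lookup table. The `Transcript` class, the `standardize` function, and the `STANDARD_SPELLINGS` lexicon are hypothetical names, and only the “muje” to “mujhe” pair comes from the text.

```python
import re
from dataclasses import dataclass

# Hypothetical lexicon mapping verbatim spellings to standard ones.
# Only the "muje" -> "mujhe" pair is taken from the text above.
STANDARD_SPELLINGS = {"muje": "mujhe"}


@dataclass(frozen=True)
class Transcript:
    level1_verbatim: str      # what was actually said (phonetic fidelity)
    level2_standardized: str  # nearest standard spelling (spelling accuracy)


def standardize(verbatim, lexicon=STANDARD_SPELLINGS):
    """Replace each word found in the lexicon; leave every other word untouched."""
    def replace_word(match):
        word = match.group(0)
        return lexicon.get(word.lower(), word)

    return re.sub(r"\w+", replace_word, verbatim)


def make_transcript(verbatim):
    """Keep the verbatim text and derive the standardized text alongside it."""
    return Transcript(verbatim, standardize(verbatim))


example = make_transcript("muje pata hai")
```

Storing both levels side by side, rather than overwriting Level 1, preserves what was actually spoken while still offering a standardized form.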

Stories from the Heart of India

Despite the hurdles, the journey was largely filled with numerous positive experiences. The diversity we encountered in weather, languages, and cultures was overwhelming, yet it filled our hearts with an indescribable warmth and a profound appreciation for India’s rich cultural tapestry. The love and hospitality offered by the people, their eagerness to share their stories, and their enthusiasm for preserving their linguistic heritage were truly heartening. Our journey underscored the power of language as a bridge to understanding people, their cultures, and the nation at a deeper level.

In a world where technology often seems to isolate us, our project brought people closer together, allowing them to connect and share their lives through their native languages. This technical endeavor became a conduit for genuine human connection, enabling people from various backgrounds to express their lives, traditions, and experiences. From a Tamil Nadu entrepreneur sharing her journey to success, to a Kashmiri woman finding solace in prayer during tumultuous times; from a young boy in Kashmir divulging his secret recipe, to a Manipuri girl aspiring for higher education amidst challenges, each story added a unique voice to the rich mosaic of IndicVoices.

The diversity in stories we collected — ranging from personal achievements and cultural narratives to expressions of socio-political concerns — highlighted the importance of the dataset we were building. Whether it was an old man reminiscing about life seventy years ago, a professor discussing the significance of a local folk culture, or a young Nepali girl playfully sharing tales of her dates, these narratives painted a vivid picture of the diverse life across India. Such stories not only enrich our understanding but also celebrate the myriad facets of Indian life, from its challenges to its triumphs.

Miles to go

Despite the vast disparities in lifestyle, language, and social circumstances across different regions, a common thread of respect for linguistic heritage and a passion for technology united participants from all walks of life. This shared enthusiasm underscores a collective commitment to preserving India’s linguistic diversity, bridging the gap between tradition and modernity. IndicVoices, thus, stands as a testament to the enduring spirit of India and its people, weaving together the voices of its many inhabitants into a vibrant tapestry of stories that resonate with authenticity, diversity, and a profound sense of belonging. Yet, as extensive as our journey has been, it is but a chapter in a much larger story. With nearly 10,000 hours of recordings still ahead to reach our goal of ~17,000 hours, our expedition through the heart of India’s linguistic landscape is far from over. Like the timeless verse, “miles to go before I sleep,” our path stretches onward, promising more voices to be heard, more stories to be shared, and an ever-deepening appreciation for the rich mosaic of Indian culture and language. This journey has only just begun, and the road ahead is filled with the promise of discovery, understanding, and the celebration of India’s incredible diversity.