
Indic LLM-Arena

Published on November 10, 2025 · 5 min read

By: Mohammed Safi Ur Rahman Khan


Indic LLM-Arena: A New Paradigm for Indian AI Evaluation

At AI4Bharat, our mission has always been to build a robust, open-source ecosystem for Indian language AI. This journey has involved creating foundational datasets, models, and evaluation benchmarks. As Large Language Models (LLMs) become the new frontier, we are encountering a familiar challenge — a significant gap in our ability to properly measure their performance for India.

The global AI landscape is now filled with benchmarks and leaderboards, but they remain overwhelmingly English-centric, designed to evaluate models on Western cultural contexts and use cases. This is insufficient as India enters the era of sovereign LLMs.

A model's ability to discuss a topic in perfect English is irrelevant if it fails to understand a farmer in rural Maharashtra, provides a culturally inappropriate response to a user in Sikkim, or cannot parse a Tanglish query from a student in Tamil Nadu.

To address this, we are proud to announce the Indic LLM-Arena, an initiative by AI4Bharat (IIT Madras), supported by Google Cloud. This platform is a crowd-sourced, human-in-the-loop leaderboard designed to benchmark LLMs on the three pillars that affect the Indian experience: language, context, and safety.


The Gaps in Current LLM Evaluation

Current leaderboards are a necessary part of the AI ecosystem, but they do not capture the realities of our nation. The gap exists across three critical dimensions:

1. The Language Gap

Evaluation is not merely about translating 22 scheduled languages. It is about understanding the natural, fluid way Indians communicate. This includes code-switching (e.g., Hinglish or Tanglish), where users mix multiple languages in a single sentence. Models trained on 'pure' text often fail at this, yet it is the primary mode of communication for millions.

Example:
"Bhai, woh naya restaurant ka review accha hai kya? Wahan Andhra meals milta hai?"
("Bro, is the review of that new restaurant good? Do they serve Andhra meals there?" — a code-mixed Hindi-English prompt of the kind standard benchmarks often overlook.)

2. The Contextual & Cultural Gap

India is not a monolith. A model that provides a generic, pan-Indian answer may be unhelpful or, worse, incorrect.

Example:
If a user asks for a "good gift for a housewarming," the correct answer is not a "bottle of wine" (common in the West).
A culturally aware model would suggest mithai, a Ganesha idol, or other items appropriate for a Griha Pravesh.

This extends to countless scenarios — from understanding local festivals and social etiquette to navigating region-specific agricultural, financial, and healthcare queries.

3. The Safety & Fairness Gap

AI safety cannot be one-size-fits-all. A model's safety and fairness filters must be trained to recognize and mitigate harms that are specific to the Indian social fabric, including subtle forms of regional bias, communal misinformation, or caste-based stereotypes. Standard safety benchmarks do not cover this.


How the Indic LLM-Arena Works

We cannot rely on static, automated benchmarks alone. We need a dynamic, human-powered evaluation model.

Inspired by the success of platforms like LMArena, our approach is built on fair, blind, side-by-side comparison:

  • Anonymous Battle: A user enters a prompt in any language or mix of languages. We support both transliterated text and voice-based input, making the platform easy to use for everyone.
  • Dual Response: The platform presents responses from two anonymous LLMs (e.g., "Model A" and "Model B").
  • Human Judgement: The user votes for the response they find superior, or flags the interaction as a tie. The identity of the models is unknown to the user, eliminating provider bias.
  • Statistically Robust Ranking: Over thousands of such user-voted "battles," we fit a Bradley-Terry model to establish a relative ranking of the models' performance on real-world Indian prompts.
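To make the ranking step concrete, here is a minimal sketch of the Bradley-Terry model fitted with the classic minorization-maximization (MM) updates on toy vote data. The model names and battle records are purely illustrative, this is not the Arena's production pipeline, and the sketch assumes every model has won at least one battle (a standard condition for the MM updates to converge).

```python
from collections import defaultdict

def bradley_terry(battles, iters=200):
    """Fit Bradley-Terry scores from (winner, loser) vote pairs.

    Under the model, P(i beats j) = s_i / (s_i + s_j).  The MM update is
    s_i <- W_i / sum_j n_ij / (s_i + s_j), where W_i is i's total wins
    and n_ij the number of battles between i and j.
    """
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # battles per unordered model pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    scores = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            denom = sum(
                games[frozenset((m, o))] / (scores[m] + scores[o])
                for o in models if o != m
            )
            new[m] = wins[m] / denom if denom else scores[m]
        total = sum(new.values())  # normalize so scores sum to 1
        scores = {m: s / total for m, s in new.items()}
    return scores

# Hypothetical battle log: each tuple is (winner, loser).
battles = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("C", "B")]
leaderboard = sorted(bradley_terry(battles).items(), key=lambda kv: -kv[1])
```

Sorting the fitted scores, as in the last line, yields the leaderboard order; with enough battles, the resulting ranking is robust to which model pairs happened to be sampled.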

Who Benefits from Indic LLM-Arena?

The Indic LLM-Arena is more than a leaderboard; it is a public utility designed to foster a more competitive and inclusive AI ecosystem.

  • For Developers: This platform provides an invaluable, neutral mechanism for benchmarking. Startups and researchers can now see precisely how their models perform against others on Indic-specific use-cases and languages, enabling a faster, more focused innovation cycle.
  • For Enterprises: Businesses across domains can use this data to make informed decisions about which models to adopt, mitigating risk and accelerating the deployment of AI that serves their customers.
  • For the Public: This ensures that the benefits of AI are not confined to English speakers. By creating a transparent standard, we foster the development of digital public goods that are accessible and useful for all Indians.

The Indic LLM-Arena is an open invitation to the entire community to help us define what "good" AI looks like for India. We are eager to see the advancements this new standard will inspire.


Roadmap

We at AI4Bharat are rolling out the Indic LLM-Arena in carefully planned phases:

  1. Phase 1 (currently live):
    Support for text-based inputs across multiple Indian languages and code-mix scenarios.

  2. Phase 2:
    Expansion to omni models, bringing in vision and audio capabilities so that the Arena addresses image-based, voice-based, and mixed-media interactions.

  3. Phase 3:
    Introduction of agentic tasks, such as handling large documents (PDFs), web-search integration, tool-calls, and other advanced workflows.

Other planned features:

  • Leaderboard Roll-out: We will soon publish an updated public leaderboard, once we have sufficient votes to reduce statistical uncertainty in the rankings.
  • Continuous Model Updates: Model-building never stops. Neither will our leaderboard.
  • More Granular Leaderboards: Language-wise, domain-specific, and task-specific rankings.
  • Open-Source Everything: We will release all anonymized data, code, and pipelines for community inspection and extension.

Call to Action

We invite your collaboration — whether you build models, consume models, or simply believe in inclusive language AI:

  • Model Builders: Reach out to get your model added to the Arena; we will work with you to onboard it onto the platform.
  • Model Consumers (enterprises, developers, public-sector): Tell us which evaluations, domains, or metrics matter most. We'll work to add them to our benchmark and leaderboard suites.
  • Everyone Else: Use the platform, push its boundaries, explore interesting input types in your languages or dialects, share your interactions, and contribute to developing AI that works for India — in all its linguistic and cultural diversity.
  • Sponsors & Partners: As an independent academic initiative, we rely on the support of partners to sustain and scale our efforts. We are deeply thankful to Google Cloud for their initial support in making this launch possible.

If you're interested in sponsoring or collaborating, we'd love to hear from you.


Reach out to us: arena@ai4bharat.org


© 2024 AI4Bharat. All rights reserved
