Indians are using chatbots to get cricket scores, look up healthcare information, and access local news. Whether these systems deliver accurate, safe, and high-quality information depends on the language in which the user accesses the chatbot, among the nation’s 22 official languages and hundreds of dialects and mother tongues.
Indian users are a rapidly growing consumer base for chatbots, with high awareness of new technologies and interest in using them. Their experience with chatbots should be equal regardless of the language they speak. But right now, AI systems aren’t built with the local, linguistic, and cultural context required to meet the needs of Indian users, and of the majority of global users. That can and should change.
One way to do this is by ensuring AI model developers and deployers have access to natively created multilingual evaluation tools to test systems and address errors. Right now, many of the widely adopted evaluation methods that test multilingual performance are machine-translated and not culturally or contextually aligned. Running evaluations is expensive and time-intensive, let alone finding the right one. As a result, model developers frequently opt for general-purpose multilingual evaluations that are poor proxies for truly measuring multilingual capabilities.
The good news: Experts in academia, industry, and civil society are working to change this. This week, as over 30,000 industry experts, government representatives, academics, and funders convene in New Delhi for the AI Impact Summit, the Government of India has a chance to direct more attention towards fostering an ecosystem of representative and robust multilingual evaluation tools, ensuring models work as intended and are useful and safe across languages and contexts, and to create channels that incentivise their adoption.
Currently, major AI model developers treat testing for multilingual and context-specific accuracy as secondary. Despite claims by some of the biggest AI model developers that their systems work in over 20 or even over 100 languages, many of these general-purpose models are evaluated using general knowledge evaluation tools such as the Massive Multitask Language Understanding (MMLU) benchmark, which judges model performance against a set of multiple-choice questions on topics ranging from science to law. These general-purpose models then serve as the foundation for more tailored applications used in a range of contexts.
This general approach falls short for many reasons. First, many general-purpose benchmarks like the MMLU contain questions and answers across disciplines that often carry an implicit Western and Anglocentric perspective, meaning that AI models are measured on their ability to grasp Western concepts in whatever language the test is translated into, rather than on their ability to grasp contexts relevant to that language’s speakers. The MMLU translated into Bengali, for example, would still measure a model’s ability to answer questions about the US’s First Amendment, one of the questions in the MMLU, in Bengali, rather than its ability to answer questions about Article 19 of the Indian Constitution, which also enshrines freedom of speech and assembly as a right, or any other Article potentially more relevant to the Bengali-speaking user.
Second, model developers often translate benchmarks like the MMLU to measure performance in non-English languages. But relying on automated translation tools risks producing evaluations that are not only error-prone but also fundamentally misaligned with how people actually use these systems. While machine translation has improved dramatically, it remains far from perfect. Even small mistranslations in evaluation questions can distort results, impede meaningful assessment, and offer false confidence about a model’s capabilities. For instance, gender bias can change substantially when datasets originally written in English or Chinese are translated into Hindi. In such cases, the benchmark is no longer testing the same construct across languages, but a transformed one. This challenge is amplified by the fact that many Indic languages are significantly less resourced than English. They also differ sharply in structure, with rich morphology and complex word formations that machine-translated benchmarks often fail to handle. That makes robust, language-native evaluation even more urgent.
Finally, evaluation tools are too often created without expert input, resulting in significant blind spots. Industry, for example, often relies on tools created in-house to evaluate risks related to self-harm, but many of these fail to account for the complexities of how self-harm tendencies develop, manifest, and persist, overlooking clinically significant risks. AI risks also differ region by region and culture by culture. For example, AI systems producing information about health care must take into account health conditions, environmental factors, and questions of access that are specific to a particular community, among other things, to ensure information provided to users is accurate, relevant, and safe.
Evaluation mechanisms developed in local languages with the input of on-the-ground health care workers, as some, including Microsoft Research and Karya, have produced, will assess model safety and utility more specifically than a general-purpose benchmark. How these experts are involved in the process also matters, as groups like Tattle have outlined previously when documenting how participatory design informed the development of a dataset of gender-based violence-related terms and slurs. Ensuring evaluation mechanisms are developed in concert with subject matter experts, rights experts, and affected stakeholders, in addition to language experts, and that this is done with adequate compensation and capacity, will ensure that evaluations adequately measure risks posed to actual people.
Evaluation experts are working to address many of these gaps and others, including validity questions and rigour issues with current evaluation paradigms. This is made more urgent by the fact that governments around the world, including India, are swiftly developing and deploying “sovereign” AI systems that promise to present an alternative path to the current AI ecosystem, yet at times still rely on Western-made models or infrastructure, which may encode Western and Anglocentric bias.
That’s not to say major AI companies don’t recognise the problem. Companies have begun seeking ways to localise model development and evaluation in response to concerns that models aren’t culturally aligned to the needs of the Global Majority, as seen most recently with OpenAI building models according to specifications set in part by the UAE government. This move raises key human rights concerns and practical questions as well: How do companies define cultural alignment, particularly as culture is neither static nor free from power dynamics? Who gets to say when an AI model works in their language? The government or the people? And what happens when the will of the government differs from the will and views of the people, particularly marginalised groups?
Ensuring model developers and deployers are able to access a robust set of independently-created evaluation mechanisms helps put power back into the hands of language and subject matter experts to adequately test systems, address problems, and even reject systems when necessary. As the adage goes, “what gets measured gets managed”. Fostering an ecosystem of independent, multilingual, and community-driven evaluation mechanisms is one important part of a broader set of governance tools that the India AI Summit should focus on this week to ensure the benefits of AI systems apply to all, regardless of the language they speak.
Bhatia is a senior policy analyst at the Centre for Democracy & Technology (CDT). Vashistha is Director of the Cornell Global AI Initiative.
