India’s AI Playbook: Why Bigger Isn’t Always Better
What if your health care chatbot couldn’t make sense of the way your grandmother described her symptoms? Or if a hurricane warning on the Gulf Coast reached your phone in a language you didn’t understand? Or if your small-business loan application was rejected because the system misread the way you naturally speak?
That’s the quiet risk of today’s AI: it’s not just about whether the technology works, but whether it works for you. Much of the U.S. race is about getting there faster and at massive scale. But India is showing that speed means little without relevance, building AI that truly understands the people it serves. If we don’t pay attention, we may wake up to find AI in rural Bihar working better than AI in rural Alabama.
Across India, researchers are fanning out into villages, marketplaces, and rural districts, not to sell a product, but to listen. They are recording everyday speech in Hindi, Bhojpuri, Tamil, and dozens of other languages, capturing idioms and expressions that carry generations of meaning.
This is AI for Bharat, a project at IIT Madras supported by Nandan Nilekani, the visionary technologist behind India’s Aadhaar digital ID system. It is also one of the most ambitious, ground-up AI initiatives in the world today.
Early on, the AI for Bharat team realized that advancing Indian AI technology meant building the massive, high-quality datasets that simply didn’t exist at scale. With support from India’s Ministry of Electronics and Information Technology (MeitY), they are now leading a nationwide data collection effort as part of the Data Management Unit of Bhashini, the government’s flagship multilingual AI program under which AI for Bharat operates.
Their goal: gather 15,000 hours of transcribed speech from over 400 districts, covering all 22 scheduled languages of India. In parallel, an in-house team of more than 100 translators is creating a parallel corpus of 2.2 million translation pairs across these languages.
In their recording studios, professional voice artists are producing studio-quality data for expressive text-to-speech systems, while annotators meticulously label pages for document layout parsing, accounting for India’s diverse scripts.
And that’s only the beginning. To accelerate the development of large language models, they’re building pipelines to curate and synthetically generate pre-training data, collect contextually grounded prompts, and create evaluation datasets that truly reflect India’s rich linguistic landscape. All of these efforts are supported by open-source tools for data collection and annotation, designed not just for India but for any multilingual region in the world.
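To make one of those curation steps concrete, here is a minimal sketch of a pre-training pass that deduplicates raw text and filters by language. The record schema and the detect_language placeholder are my own assumptions for illustration; a real pipeline, including AI for Bharat’s open-source tooling, would use trained language-ID models and far richer quality filters.

```python
# Minimal sketch of one curation step: exact deduplication plus language
# filtering for multilingual pre-training text. Illustrative only.
import hashlib
from typing import Iterable, Iterator

def detect_language(text: str) -> str:
    """Crude placeholder: tags text containing Devanagari as 'hi'.
    A production system would call a trained language-ID model
    covering all 22 scheduled languages (several share scripts)."""
    return "hi" if any("\u0900" <= ch <= "\u097f" for ch in text) else "und"

def curate(records: Iterable[str], keep_langs: set[str]) -> Iterator[dict]:
    seen: set[str] = set()
    for text in records:
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:            # drop exact duplicates
            continue
        seen.add(digest)
        lang = detect_language(text)
        if lang in keep_langs:        # keep only target languages
            yield {"text": text, "lang": lang, "sha256": digest}

if __name__ == "__main__":
    sample = ["नमस्ते दुनिया", "hello world", "नमस्ते दुनिया"]
    for row in curate(sample, keep_langs={"hi"}):
        print(row)
```

Exact-hash deduplication is only a first line of defense; multilingual corpora at this scale also need near-duplicate detection, script normalization, and per-language quality scoring.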
Nilekani put it plainly: “They’re collecting data from the field, so it’s not just scraping some internet stuff… All that data is being brought in and is open-source.”
The same principles we outline in “AI for Community” apply whether you’re building AI for a single neighborhood or for an entire nation. What Howard University and Google are doing for African American Vernacular English is, in essence, what India is doing on a grand scale: capturing the linguistic and cultural realities of millions so the technology can serve them accurately and respectfully.
Now India is expanding its ambition even further. The government has unveiled BharatGen, the country’s first indigenously developed, government-funded multimodal large language model (LLM), supporting text, speech, and images across all 22 Indian languages.
BharatGen is more than a tech milestone; it’s a declaration that AI can be ethical, multilingual, and anchored in the lived realities of its citizens. It promises region-specific solutions in health care, governance, education, and beyond: AI doctors who speak your dialect, AI-driven grievance redressal in native tongues, and citizen services that feel familiar and accessible.
This is not the AI race you read about in headlines. It’s not a contest to build the biggest large language model. It’s about meaning, context, and relevance, and it’s work that could make AI far more useful to far more people.
From Digital Identity to Digital Understanding
Nilekani’s track record shows why this matters. In 2009, he launched Aadhaar, now the world’s largest digital identity program. Aadhaar assigns each resident a unique 12-digit number linked to biometric and demographic data, enabling secure access to services from banking and mobile connections to welfare benefits and tax systems. Today, more than 1.4 billion Indians use it as the backbone of everyday transactions.
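A brief technical aside: per UIDAI’s published scheme, the twelfth digit of an Aadhaar number is a check digit computed with the Verhoeff algorithm, which catches single-digit typos and adjacent transpositions. Here is a minimal validation sketch using the standard Verhoeff tables; it checks format only and says nothing about whether a number was actually issued.

```python
# Verhoeff check-digit validation, the scheme used for Aadhaar's
# 12-digit numbers. Standard dihedral-group tables; format check only.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]

def aadhaar_format_valid(number: str) -> bool:
    """True if a 12-digit string passes the Verhoeff checksum."""
    if not (number.isdigit() and len(number) == 12):
        return False
    c = 0
    # Process digits right to left; valid numbers reduce to c == 0.
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0
```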
In a Rest of World interview, Nilekani warned that the world’s fixation on massive AI models risks missing the point: “We have to bring the conversation back to making AI useful to people.”
Large models are impressive, but they are also prone to opacity and to flattening cultural context. Data is never neutral; it reflects the voices and contexts it’s drawn from. An AI trained mostly on dominant-language, Western-centric internet content will have a limited worldview, no matter how vast its architecture.
Why This Matters Globally
For governments, AI built with local language capacity means more effective public services, smoother governance, and stronger citizen trust. For businesses, it means access to markets and customer bases that have long been underserved by technology. For communities, it means preserving the voices, knowledge, and traditions that define them, before they disappear in the rush toward technological uniformity.
When AI speaks your language, literally and figuratively, it opens doors. Farmers can get accurate weather and crop advice in their own dialect. Small-business owners can navigate regulations in plain, familiar terms. Health workers can share vital information in ways that fit local customs. AI that understands context earns trust, and trust is the foundation for adoption.
Nilekani envisions AI that responds not just to standard speech, but to the languages and dialects in which people truly live.
“We think the future will be spoken — you speak to the computer, but in a language of your choice, in a dialect of your choice, using the colloquialisms that you like. If a farmer in Bihar can speak to a computer in Maithili or Bhojpuri or whichever language and gets the right answer, you have made AI so much more accessible to him.”
If India can build an AI ecosystem that understands its people in all their linguistic depth, so can everyone else. BharatGen and AI for Bharat show that it’s possible to marry technological ambition with cultural respect, and in doing so, create tools that people trust and actually use. If India gets this right, it just might set the gold standard for the rest of the world.
My co-authored book “AI for Community,” now available from Taylor & Francis, explores how artificial intelligence can preserve cultural heritage, support human flourishing, and foster trustworthy, community-centered innovation.
Editorial note: I used AI to help shape and refine this blog, collaborating with a language model to enhance flow, clarity, and tone.
