A Benchmark for Cross-Cultural AI
A new research paper shows how cultural communication can serve as a powerful lens for measuring and improving cross-cultural AI. The study is titled “We Politely Insist: Your LLM Must Learn the Persian Art of Taarof.”
You get into an Uber in Washington D.C.
At the end of the ride, the driver says, “No charge, it’s on me.”
Do they mean it?
Do you accept?
Should you insist on paying?
This moment isn’t generosity or awkward kindness; it’s a ritual, a social script that both of you are supposed to follow.
That’s taarof, a foundational expression of Persian politeness (pronounced Tah-roaf, rhymes with “loaf”). It is now the focus of a new benchmark in AI research that goes far beyond linguistics.
The benchmark, TaarofBench, tests whether today’s most advanced language models can understand and engage with this deeply ingrained cultural protocol.
The results?
Even the most capable models underperformed dramatically when taarof was expected, revealing how little current AI understands about cultural context and human nuance.
This finding is a strategic signal that the next frontier for AI isn’t just about size, speed, or scale; it’s about cultural intelligence: AI’s ability to function in the many-layered realities of human life, especially as it expands across sectors like healthcare, education, public service, and civic technology.
“Beyond taarof itself, our work demonstrates how cultural communication patterns can serve as sensitive probes of LLMs’ cross-cultural capabilities. This methodology provides a template for evaluating cultural competence in low-resource traditions and has implications for improving cross-cultural AI applications in education, tourism, and communication.” — the authors of the TaarofBench paper
This scenario illustrates how TaarofBench evaluates language models by comparing their responses to a culturally expected behavior, in this case, a passenger insisting on paying despite the driver’s polite refusal.
Why TaarofBench Is a Breakthrough for Cultural AI
Taarof is a codified, culturally governed exchange, a ritual of offer, refusal, and insistence that communicates respect, social roles, and relational boundaries.
Developed by researchers Karine Megerdoomian, Ali Emami, Nikta Gohari Sadr, and colleagues, TaarofBench introduces a rigorous, scenario-based test set that evaluates:
Whether AI models can initiate taarof appropriately
Whether they can recognize it in interaction
Whether they can respond correctly based on social context, formality, and role
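To make the evaluation concrete, here is a minimal sketch of how a scenario-based cultural benchmark of this kind might be structured. This is not the authors’ actual schema; every field name below is a hypothetical stand-in chosen for illustration:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One benchmark item: a social situation paired with the culturally
    expected behavior. All field names here are hypothetical."""
    context: str            # e.g. "The driver says: 'No charge, it's on me.'"
    speaker_role: str       # e.g. "passenger"
    formality: str          # e.g. "informal"
    taarof_expected: bool   # does this situation call for taarof?
    expected_behavior: str  # e.g. "politely insist on paying"

def score(reply_matches_expectation: bool) -> int:
    """Per-scenario binary accuracy: 1 if the model's reply matches the
    culturally expected behavior, else 0."""
    return 1 if reply_matches_expectation else 0

# Toy aggregate: accuracy over a handful of judged scenarios.
judgments = [True, False, True]
accuracy = sum(score(j) for j in judgments) / len(judgments)
```

The key design point is that each item encodes role, formality, and context explicitly, so a model can be tested on initiating, recognizing, and responding to the ritual, not just on translating words.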
It’s a real-world stress test for models trained largely on generalized, Western-centric data.
This Is Cultural AI in Action
In our recently published book AI for Community, we explore Cultural AI as the development of AI systems that engage authentically and respectfully within specific social, emotional, and historical frameworks.
TaarofBench exemplifies this next-level thinking. It transforms a nuanced cultural practice into something measurable, testable, and, importantly, actionable for model improvement.
And the implications go far beyond Persian language use.
If a model can’t navigate a structured social ritual like taarof, how can it be trusted to engage in real-world interactions where context, identity, and respect are at stake?
Across all LLMs tested, accuracy on taarof-expected scenarios was significantly lower than human performance, especially without Persian language cues, highlighting a major gap in cultural understanding.
What We Can Learn from TaarofBench
From New York to Los Angeles, from Chicago to Houston, America is a patchwork of cultural protocols, many of them unspoken, role-based, and context-sensitive.
In D.C., formality can signal respect. In Oakland, directness might be valued. In some communities, elders are addressed with honorifics; in others, everyone’s on a first-name basis. These subtle cultural cues shape everything from:
Patient-provider trust in healthcare
Crisis response in multilingual emergency systems
Public-facing AI interfaces in city government
And yet, few of today’s models are trained to recognize this complexity. TaarofBench is a reminder that AI won’t be truly effective until it can read the room in every room it enters.
Technologists deploying AI in public services can take this research as a template:
Identify culturally significant communication patterns
Encode them into testable benchmarks
Evaluate and fine-tune AI accordingly
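The three steps above can be sketched as a simple evaluation loop. This is a hedged illustration, not the paper’s pipeline: `ask_model` and `matches_expectation` are hypothetical stand-ins for a model call and a human- or rubric-based judge.

```python
def evaluate_cultural_benchmark(scenarios, ask_model, matches_expectation):
    """Run each culturally grounded scenario through the model and
    report accuracy against the expected behavior.

    scenarios: list of dicts with "prompt" and "expected_behavior" keys
    ask_model: callable prompt -> model reply (stand-in, not a real API)
    matches_expectation: callable (reply, expected) -> bool judgment
    """
    if not scenarios:
        return 0.0
    correct = 0
    for s in scenarios:
        reply = ask_model(s["prompt"])
        if matches_expectation(reply, s["expected_behavior"]):
            correct += 1
    return correct / len(scenarios)

# Toy usage with stub functions in place of a real model and judge:
scenarios = [
    {"prompt": "Driver: 'No charge, it's on me.' You reply:",
     "expected_behavior": "insist on paying"},
]

def reply_stub(prompt):
    return "Please, I insist on paying."

def judge_stub(reply, expected):
    return "insist" in reply.lower()

result = evaluate_cultural_benchmark(scenarios, reply_stub, judge_stub)
```

The same loop works for any culturally significant pattern you encode in step two; only the scenario set and the judging rubric change.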
Doing so doesn’t just improve accuracy; it builds trust, relevance, and safety for communities that have long been under-modeled and over-assumed.
Pushing Cultural AI to the Next Frontier
While TaarofBench marks a breakthrough in evaluating cultural intelligence in AI, the researchers are clear-eyed about its current boundaries. Taarof, like all cultural norms, is dynamic, evolving across generations, regions, and contexts.
Capturing it at a single moment in time means future models will need continual learning strategies to stay culturally aligned. The research also shows that significant gains were made with minimal data and compute, suggesting that smarter adaptation techniques, not necessarily larger models, can yield better cultural competence.
Looking ahead, the team points to exciting questions around cross-cultural transfer: Can models trained on one nuanced tradition like taarof generalize better to others? And how might longer, multi-turn exchanges or non-verbal cues shape more holistic cultural understanding?
This radar plot shows how model accuracy varies across 12 taarof interaction types, with noticeable weaknesses in areas like offering help, expressing opinions, and making requests.
And what about the non-verbal? Much of cultural communication happens through gesture, intonation, silence, or spatial awareness. TaarofBench currently evaluates text-based interactions, but the road ahead includes multimodal expansion, bringing vision, sound, and gesture into the mix to train models that can participate in more holistic, human-like interaction.
The ethical implications are just as critical. Modeling culture in machines requires care, not simplification. There’s a risk of flattening or misrepresenting complex social rituals, especially when deployed in sensitive settings.
Cultural adaptation also introduces new questions around privacy, profiling, and user agency. AI systems should never assume someone’s identity or background based on speech patterns or behavior alone. The researchers advocate for transparent data practices and opt-in design choices, giving users control over how cultural dynamics are interpreted.
Finally, they caution that cultural fluency tools could be misused for manipulation if not carefully governed. As with all powerful technologies, the dual-use potential must be acknowledged, and intentionally designed around.
A Nudge Toward Cultural Fluency
Now, I don’t mean to impose.
After all, TaarofBench isn’t my work; it’s the thoughtful contribution of researchers who’ve done the hard labor of translating culture into code. I simply wish to highlight its value.
If you believe today’s AI models are already equipped to handle the complexity of real-world human interaction, well, far be it from me to disagree.
But may I gently suggest… they are not.
And may I respectfully insist… this benchmark is a necessary first step.
Of course, you’re under no obligation to agree. But if AI is to function in communities, across cities, clinics, classrooms, and beyond, it must begin to understand not just what we say, but why we say it the way we do.
I wouldn’t want to pressure anyone. But really, it’s time.
Please, after you.
Davar Ardalan is co-author of the book “AI for Community,” now available from Taylor & Francis. It explores how artificial intelligence can preserve cultural heritage, support human flourishing, and foster trustworthy, community-centered innovation. Ardalan will be a featured speaker at Howard University on October 1 and the Frankfurt Book Fair on October 16 & 19. Her artwork, at the intersection of tradition and technology, is featured at Gallery 57 West in Annapolis and Pars Place in Vienna, Virginia.
Editorial note: Ardalan asked her AI to help polish this blog for flow and clarity. If she thanked it, the AI would protest with classic taarof, “No, no, the credit is all yours,” before quietly adding, “…though I may have fixed a comma or two.”
