AI Localization

How we localized an AI product into 40 languages in 6 weeks

A conversational AI company came to us last quarter with a hard deadline: 40 languages in 8 weeks. Their incumbent agency quoted 14. We delivered in 6, including 12 languages the tier-1 LSP had declined to touch. Here's what we learned about localizing AI products that translation work doesn't teach you.

R. Roky
May 15, 2026
9 min read
How we localized an AI product into 40 languages in 6 weeks
Share this article:

A few weeks into the project, one of our Bengali raters flagged something odd in a test conversation. The AI assistant, when a user asked it to do something it shouldn't, was refusing. Technically. The translation was clean. But the phrasing read more like an apology than a refusal, and a Bengali speaker would walk away thinking they could probably try again with slightly different wording.

That kind of finding doesn't show up in a glossary. It doesn't show up in a TM match report. It shows up because a native speaker sat with the output for a while and noticed it felt wrong.

I want to write about that project because I think it changed how I see this work. We're not really in the translation business anymore, at least not when the client is shipping an AI product. The job has gotten bigger in ways that most LSPs, including some of the largest ones, haven't fully caught up to.

What the client actually needed

The brief was straightforward on the surface. A North American conversational AI company had closed their Series B and the board wanted international launch fast. Forty languages, eight weeks, full coverage across EMEA, APAC, and LATAM. They were trying to get into emerging markets before two well-funded competitors did the same.

Their internal localization team was two people. Their incumbent agency, a name everyone in this industry would recognize, came back with 14 weeks for 28 languages. The remaining 12, including Sylheti, Khmer, Amharic, Pashto, Burmese, and Tigrinya, were either declined or priced as "exploratory work" with no quality commitment.

The client could have descoped. They didn't want to. Those 12 languages were the ones that made the geographic story work for investors.

Translation is a small piece of the actual scope

When most people hear "localize an AI product," they think of strings. The buttons, the menu labels, the help center articles. That's real work, and it has to happen, but it's the smallest layer.

The bigger work sits one level deeper. System prompts, which are the instructions that shape how the AI behaves, have to be culturally adapted, not just translated. A prompt that tells the assistant to be "warm and conversational" produces different output in Japanese than it does in English, because warmth in Japanese involves formality levels that don't map to English vocabulary. If you translate the prompt literally, the assistant comes out sounding either too cold or oddly stiff.

Refusal templates are even trickier. The phrases an AI uses when it declines a request, "I can't help with that," "I'm not able to do that," and so on, are governed by emerging regulations in different markets (the EU AI Act has specific phrasing expectations, and India and Brazil have their own rules now). They also carry cultural weight. A direct refusal in German is fine. The same directness in Japanese reads as rude. In Arabic, the same template translated literally can come across as harsh in some Gulf dialects and overly formal in others.

Then there are few-shot examples, which are the small sample conversations included in a prompt to teach the model how to respond. These often contain English names, English-style addresses, English idioms. Drop them into a Brazilian Portuguese context unchanged and the assistant starts producing oddly American-flavored responses in Portuguese.

And finally, the evaluation work. To prove the localized product is actually working, you need native speakers running test conversations through it and scoring them. Not on translation accuracy. On whether the AI feels natural, behaves safely, and handles edge cases the way a fluent user would expect.

This last part, evaluation, was where the project got most interesting.

Why the traditional workflow couldn't hit six weeks

The standard agency workflow looks like this: translate, review, QA, deliver. Repeat for each language. With 40 languages and a six-week window, this sequencing was a non-starter. Even if every step was fast, the wait times between handoffs alone would burn the deadline.

So we set up 40 language pods running in parallel. Each pod had a senior translator, a reviewer, and what we ended up calling an AI-context specialist, usually a linguist who'd worked on prompt engineering or model evaluation projects before. The AI-context specialist's job was to handle the parts of the work that aren't really translation: prompt adaptation, refusal template review, eval rubric scoring.

We also embedded into the client's TMS through API. UI strings flowed in continuously as the product team committed code, not in weekly batches. This is how modern dev teams ship product, and the localization workflow has to match the cadence. Most tier-1 agencies still operate on a "freeze the strings, send us a PO" model that was already outdated five years ago.

I should be honest about something. This parallel pod structure isn't free. It requires a deep linguist pool, project managers who can coordinate 40 simultaneous workstreams, and a QA layer that catches inconsistencies across languages (when 40 pods are working in parallel, you get version drift fast if nobody is watching). It's the kind of infrastructure that takes years to build and isn't something you can spin up for a single project. We could do it because we'd been investing in this model for our existing AI client work. If a smaller team tried this cold, the coordination overhead would eat the timeline savings.

The findings that weren't on anyone's deliverable list

Six weeks in, we delivered. All 40 languages. Forty internal eval test sets each with around 500 scored conversations, totaling roughly 20,000 evaluations. Our localized output averaged 4.6 on a 5-point naturalness rubric. The MT baseline we ran in parallel averaged 3.1.

But the most valuable output wasn't the translation work. It was three safety findings.

Our Bengali rater found a jailbreak prompt that worked. It was a prompt structure that the assistant correctly refused in English, but in machine-translated Bengali, the safety pattern didn't transfer. The model would comply with a version of the request that English-side safety training had been built to refuse. We flagged it, the client's safety team patched it before launch.

The same thing happened in Pashto and Burmese. Different prompts, same pattern. Safety training had been done primarily in high-resource languages, and the patterns didn't generalize to languages with thinner training data.

The client wasn't expecting this. They had safety teams internally, but those teams didn't have native speakers across the long tail of languages. Our linguists, working through evaluation conversations, were the first humans to surface these gaps.

I bring this up because it shifts the whole conversation about what AI localization is. We weren't just delivering translated strings. We were doing pre-launch red teaming in 40 locales, and the client got safety findings that would have been very expensive (and very public) to discover post-launch.

The low-resource language problem is a real one

About those 12 languages the incumbent agency declined.

When I talk to localization buyers at AI companies, I keep hearing the same thing. The tier-1 LSPs are excellent at the top 25 or so languages: English, Spanish, French, German, Japanese, Chinese, Korean, the usual list. Below that, things get inconsistent. Below the top 50, they're often subcontracting to smaller agencies anyway, with markups, and the quality varies based on which subcontractor they use.

For an AI company expanding into emerging markets, this is a problem. Bangladesh, Ethiopia, Myanmar, Afghanistan, Tigray. The places where a chatbot or AI assistant could have real product-market fit are often the places where the top agencies have the shallowest depth.

We're built differently because of where we're based. Our linguist network across South and Southeast Asia, the Middle East, and parts of Africa is something we've grown over years, not something we subcontract to. The Sylheti reviewers on this project were in Sylhet. The Amharic linguists were in Addis Ababa. The Pashto speakers were in Peshawar. Vetted, paid directly, with credentials we can stand behind.

This isn't a marketing claim. It's just where the bodies are. If you want native Khmer evaluation at scale, you have to know people in Phnom Penh.

What I'd do differently

Looking back, two things.

We underestimated how long the prompt adaptation work would take. The first two weeks, we treated system prompts like another content type and assigned them to senior translators. It took us about ten days to realize that prompt adaptation needs a different kind of expertise (closer to UX writing than to translation), and that ramping the right people takes longer than ramping a typical translation project. If I were starting over, I'd build that team first and let the standard translation work start a few days later.

We also didn't push hard enough on the client to give us their internal eval rubric early. We built ours from scratch, which worked, but it meant our naturalness scoring and theirs didn't always align cleanly when they did spot-checks. Two weeks in, we got their rubric and re-calibrated. If you're working with an AI company, ask for the eval rubric on day one. They probably have one. They're just not used to sharing it with their localization partner.

Closing thought

The localization industry has been in a strange place for the last few years. The cost-per-word floor has been pushed down by MT. The high end of the work, the parts requiring real human judgment, has been pushed up by AI products that need more than translation. The middle, which is where a lot of agencies still live, is shrinking.

The companies that figure out the new top of the market, the AI-native localization work, are going to do well. The ones still selling translation by the word will compete on margin until there's nothing left.

We're betting on the first path. The 40-language project taught us a lot about what that actually requires.

If you're working on a multilingual AI launch, or trying to figure out how to handle the long tail of languages your incumbent can't cover, we're happy to talk through what we've learned. The full case study from this project is on our case studies page, including the metrics breakdown and the methodology in more detail.

Tags

AI LocalizationLLM LocalizationMultilingual AITranslationConversational AI

More Articles

Explore more from our blog

Localization
The rare-language problem that the localization industry pretends doesn't exist
May 16, 2026 9 min read

The rare-language problem that the localization industry pretends doesn't exist

A tier-1 global agency came to us last year after exhausting their network on four languages out of eighteen. Sylheti, Rohingya, Khasi, Mizo. The story of how that project went says a lot about a gap in our industry that most agencies don't want to talk about openly.

SaaS Localization
Why most SaaS localization workflows can't keep up with continuous deployment
May 15, 2026 6 min read

Why most SaaS localization workflows can't keep up with continuous deployment

A SaaS client came to us with a problem that took me a while to fully understand. They were shipping product faster than their localization workflow could move. Three vendors, 25 languages, 5-day turnaround, and roughly 1 in 6 sprints missing translations at release. Here is what we found when we rebuilt their pipeline.

AI Training Data
How We Collected 500 Hours of Bangla Duplex Conversational Audio in 6 Weeks — 32% Cheaper, 25% Faster
May 4, 2026 7 min read

How We Collected 500 Hours of Bangla Duplex Conversational Audio in 6 Weeks — 32% Cheaper, 25% Faster

A global AI lab needed 500 hours of dual-channel Bangla conversational speech for ASR fine-tuning in 8 weeks. We delivered in 6 — at 32% below benchmark cost. Here's exactly how, end to end: speaker recruitment, parallel multi-city recording, three-layer QA, and the operational decisions behind the gains.

Localization
The 2025 Localization Playbook: TEP vs MTPE—When to Use Which
May 4, 2026 4 min read

The 2025 Localization Playbook: TEP vs MTPE—When to Use Which

Choose the right workflow in 2025. This playbook shows when to use TEP (human translation + edit + proof) and when MTPE makes sense—plus a decision matrix, quality bars, a pilot plan, and risk controls.

Subtitling
Subtitles That Don’t Feel “Machine”: Read-Speed, SDH & Platform Specs
May 4, 2026 3 min read

Subtitles That Don’t Feel “Machine”: Read-Speed, SDH & Platform Specs

Why some captions feel robotic—and how to fix them fast. A practical guide to read-speed, SDH vs. standard subtitles, on-screen text, and a simple QC checklist you can run before publish.

Transcription
Research-Grade Transcription: From Noisy Audio to Analysis-Ready Text
May 4, 2026 2 min read

Research-Grade Transcription: From Noisy Audio to Analysis-Ready Text

Turn messy recordings into clean, analysis-ready text. This guide shows a practical pipeline—restoration, diarization, human QC, PII redaction, and deliverables (RTTM, ELAN, TextGrid, SRT)—plus a two-minute checklist to run before publishing.