Localization

The rare-language problem that the localization industry pretends doesn't exist

A tier-1 global agency came to us last year after exhausting their network on four languages out of eighteen. Sylheti, Rohingya, Khasi, Mizo. The story of how that project went says a lot about a gap in our industry that most agencies don't want to talk about openly.

R. Roki

May 16, 2026

9 min read

There's a conversation that happens inside global localization agencies that doesn't usually make it into industry conferences.

It goes like this. A salesperson at a tier-1 agency closes a contract for a Fortune 500 client. The contract covers, say, eighteen languages across South and Southeast Asia. The client signs based on the agency's reputation, their global footprint, their ISO certifications. Standard procurement.

Then the project lands in vendor operations. The vendor operations director opens the language list, gets to about language fourteen, and realizes the agency doesn't actually have anyone to do four of them. Not properly. Not at the quality the contract implies. The internal network has gaps.

What happens next is one of two things. Either the agency goes hunting through their sub-vendor list and hopes for the best, or they call someone like us.

I want to write about this because we got that call last year, and the project that followed taught me a lot about a structural problem in our industry that I don't think gets discussed honestly.

The four languages

The project was healthcare communications. A Fortune 500 client running patient outreach across South and Southeast Asia. Eighteen languages total. The agency had fourteen of them covered through their internal network: Hindi, Bengali, Tamil, Telugu, Urdu, Thai, Vietnamese, and the rest of the usual list.

The four they couldn't staff were Sylheti, Rohingya, Khasi, and Mizo.

If you don't work in localization, those names might not mean much. Let me put them in context. Sylheti is spoken by roughly 11 million people, primarily in northeast Bangladesh and parts of Assam. It's often described, including in academic literature, as a "dialect of Bengali." This is wrong, or at minimum, contested. Sylheti speakers generally don't accept Bengali content as a substitute, which is what the "dialect" framing implies they should do. When you send a Sylheti speaker Bengali-language patient information, they often won't engage with it.

Rohingya is harder to describe briefly. The community is concentrated in Bangladesh, Myanmar, and a global diaspora. The language has limited standardized orthography, which means even getting two translators to agree on how to write a given word can require a terminology decision. Most aggregators don't want to touch it because the work is slow and the linguist pool is small.

Khasi has roughly 1.6 million speakers, concentrated in Meghalaya. Mizo has roughly 1 million, in Mizoram. Both have rich literary traditions and active publishing communities locally, but neither has the kind of agency infrastructure that the more populous languages have built up.

These four languages weren't fringe to the project. They were essential. The client's end communities, the people who would actually receive the patient materials, spoke these languages as their primary language. English fallback wouldn't reach them. Substituting Bengali for Sylheti or Hindi for Khasi wouldn't reach them either, even though I've seen agencies try to argue that it should.

Why the tier-1 networks don't cover these languages

This is the part the industry doesn't talk about openly.

The largest agencies, the ones you'd recognize from industry rankings, build their networks around volume. Their economics work because they can run high throughput through a relatively stable supply chain. Adding a new language to that supply chain has fixed costs: recruiting linguists, vetting them, building terminology resources, training PMs, integrating QA workflows. For a language that brings in millions of words a month across multiple clients, those fixed costs amortize quickly. For a language where you might see 20,000 words per year if you're lucky, they don't.
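To make the amortization argument concrete, here is a minimal sketch in Python. The onboarding cost is a made-up figure used only for illustration, and the high-volume case assumes one million words a month; only the 20,000 words per year comes from the paragraph above. Treat it as a back-of-envelope model, not anyone's real pricing.

```python
def fixed_cost_per_word(fixed_cost: float, words_per_year: int) -> float:
    """Spread a one-time onboarding cost over one year of translated words."""
    if words_per_year <= 0:
        raise ValueError("words_per_year must be positive")
    return fixed_cost / words_per_year


# Hypothetical one-time cost of adding a language (recruiting, vetting,
# terminology, PM training, QA integration). Not a real figure.
ONBOARDING_COST = 30_000.0

# Assumption: a high-volume language at 1,000,000 words per month.
HIGH_VOLUME_WORDS_PER_YEAR = 12 * 1_000_000

# From the text: a rare language might see 20,000 words per year.
RARE_LANGUAGE_WORDS_PER_YEAR = 20_000

high = fixed_cost_per_word(ONBOARDING_COST, HIGH_VOLUME_WORDS_PER_YEAR)
rare = fixed_cost_per_word(ONBOARDING_COST, RARE_LANGUAGE_WORDS_PER_YEAR)

print(f"High-volume language: {high:.4f} per word in year one")
print(f"Rare language:        {rare:.2f} per word in year one")
print(f"Ratio:                {rare / high:.0f}x")
```

Whatever onboarding cost you plug in, the ratio between the two cases depends only on the volumes. Under these assumptions the rare language carries several hundred times more fixed cost per word, which is why the investment rarely clears an internal business case.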

So the rational decision, from the agency's standpoint, is to not invest. They subcontract when these languages come up, mark up the work, and move on.

The problem with this is that the subcontractor market for rare languages is messy. There are a small number of legitimate specialist agencies, mostly regional, mostly hard to find from a tier-1 procurement desk. And there is a much larger number of aggregators who don't actually have linguists for these languages either, but who will accept the work and figure it out somehow. "Figure it out" usually means freelancer marketplaces, machine translation with light cleanup, or in the worst cases, just trusting that nobody on the client side speaks the language well enough to catch problems.

I've seen healthcare materials in rare South Asian languages that were produced this way. Some of them had serious errors. Not subtle stylistic issues. Errors that could affect how someone took medication.

What the agency actually needed

When this tier-1 agency contacted us, they weren't really asking for translation. They were asking for two things, even if they didn't frame it that way at first.

The first thing they needed was qualified linguists they could put their name behind. The end client's compliance team was going to ask who did the work. The agency needed to be able to answer.

The second thing they needed was someone who could integrate into a single workflow rather than handing them four separate problems for four languages. They didn't want to manage four different rare-language vendors, each with their own quote format, their own delivery schedule, their own quality issues.

So the first thing we did, before anything else, was send them the actual CVs of every linguist we proposed. With credentials, regional location, years of experience, prior project context where we could share it. For the four rare languages, we included extra detail: which districts the linguists were from, what their domain experience looked like, references where available.

This is not how the industry typically works. Most rare-language work runs on a "trust us" model where the linguist's identity is abstracted. Sometimes that's because the aggregator doesn't actually know who's doing the work. Sometimes it's because they're worried the client will try to go around them and contract directly. The result is that the procurement side has no visibility into who is producing safety-critical content.

For a healthcare client, this opacity should be unacceptable. In practice, it's often accepted because the alternative (real transparency) is hard to find.

The Sylheti case is the one I keep thinking about

The senior Sylheti linguist on this project had twelve years of medical translation experience. She'd previously worked on a public health campaign for the Bangladesh Ministry of Health. She is based in Sylhet. We've worked with her on four prior projects over six years.

When the agency's compliance team reviewed her credentials, they had a question I found interesting. They asked whether her work could be cross-validated by another Sylheti linguist of similar seniority, for safety-critical content. The answer was yes. We have a small Sylheti senior reviewer pool. Two of them happen to have worked together before, which means their terminology decisions converge more quickly than two strangers' would.

The compliance team had never been able to get this kind of answer from a vendor before, in their words. Not because the answer is hard, but because the people they'd worked with previously didn't have the network depth to make the answer true.

This is what "in-country linguist network" actually means when it's real, versus when it's a marketing phrase. It means specific people in specific cities with specific working histories. Sylhet town. Cox's Bazar. Shillong. Aizawl. People you can name. People who have worked together before.

If you can't name them, you don't have a network. You have a list of contractors.

What the project taught me

Three things, really.

First, the rare-language gap in tier-1 agency networks is bigger than the industry admits publicly. Most agencies will say they "cover 200+ languages" on their marketing site. What that actually means is they will accept work in 200+ languages. The number of languages they can produce in-house, with credentialed linguists they directly employ or contract, is closer to 60-80 for even the largest agencies. The gap is filled by subcontracting, and the subcontracting market is uneven.

Second, the buyers most likely to encounter this gap are the buyers least likely to tolerate it. Healthcare, pharma, legal, financial services, government, regulated communications. The categories where rare-language work most often matters are the categories where opacity is most dangerous.

Third, the fix isn't complicated, but it's slow. You can't build a credentialed Sylheti linguist network in three months. You can't build a Rohingya linguist relationship without spending time in Cox's Bazar, or with the diaspora community, building trust over years. The agencies that do this work seriously are agencies that have been investing in it for a long time, in regions that aren't usually the centers of localization industry attention.

We're one of those agencies. Not because we set out to be, but because we're based in a part of the world where these languages are spoken, and the linguists are our neighbors and colleagues. The depth wasn't a strategy. It's just where we live.

What I'd tell a buyer dealing with this gap

If you're a procurement lead at a Fortune 500 company, or a vendor operations director at a global agency, and you have rare-language requirements you're not sure your current network can really handle, I'd suggest three things.

Ask for linguist CVs upfront. Not summaries. Actual CVs with names, locations, and credentials. If your vendor can't or won't provide them, that's data.

Ask where the work will physically be done. If the answer is "our distributed network," ask which cities. The cities matter. Sylheti work done in Sylhet is different from Sylheti work done by a diaspora linguist in London who hasn't used the language in its regional context for fifteen years.

Ask about back-translation QA on safety-critical content, and whether the back-translation will be done by a different linguist than the forward translation. For healthcare and similar regulated work, this isn't optional. If your vendor doesn't have the depth to staff two independent senior linguists in the same rare language, that's also data.

These questions are uncomfortable to ask. They're uncomfortable for the vendor to answer. Asking them anyway is how the industry slowly gets better.

Closing thought

The localization industry has spent the last decade talking a lot about technology. MT engines, AI, automation, pipelines. Most of these conversations matter. Some of them are overhyped.

But there's a much older question that gets less attention. The question of who, exactly, is producing the words that end up in front of a patient, a customer, a refugee, a citizen. Whether that person is real. Whether they're qualified. Whether they're getting paid fairly. Whether they're actually from the community whose language they're translating into.

This isn't a technology problem. It's an infrastructure problem, and it's been an infrastructure problem in our industry for a long time.

We're trying to do this part of the work better. The 18-language project, including the four rare languages, is part of how we're trying.

If you're a buyer or an agency dealing with rare-language requirements, and the conversation in this post sounds familiar, I'm happy to talk. The full case study with metrics and methodology is on our case studies page.

By ABD Rahman, Founder of Saytica
