
Issue #11 · May 26, 2026

Everyone treats the language gap as a size problem: too little text, so collect more and train harder. A study this month shows the cost of getting that wrong. Feed a model more of a low-resource language without doing the safety work, and you don't close the gap — you widen the dangerous part. It learns to speak the language faster than it learns how to behave in it. Fluency outruns judgment.

THE NUMBER

59.8% · 75.8%

A team at Stellenbosch University tested five major AI models (ChatGPT, Claude, DeepSeek, Gemini and Grok) to see whether their safety guardrails hold in low-resource African languages. A single translated harmful request mostly failed as an attack: the models refused, or returned nonsense.

But a multi-turn conversation in Afrikaans, Swahili, isiXhosa or isiZulu drew harmful answers 59.8% of the time on average, and 75.8% of the time when skilled people ran the attack instead of an automated script. Counterintuitively, the better a model understood the language, the more easily its safety broke.

arXiv 2605.18239 · 18 May 2026

Why care? The finding is that fluency and safety are separate problems. A model's grasp of a language is trained on a large pile of text. Its refusal behaviour is trained on a much smaller, mostly English pile. So as language capability improves, the gap between "can understand it" and "knows when to refuse in it" gets wider, not narrower. More data alone makes a model more fluent at saying things it should not. Note: the paper is a preprint, and it tested older model versions, not the current frontier. Even so, the finding is worth keeping in mind.

THIS WEEK

Story 1

The language you command an AI in can shape the result — at first

A Tsinghua University team had an AI agent redesign an aircraft wing to cut drag, and found it did better when commanded in Chinese than in English — but only at the start. Once the system was properly trained for the task, the gap closed. Why care? The headline reads "Chinese beats English," but the real finding is quieter: a language head-start that training erased. For technical teams it is a caution about reading too much into prompt-language comparisons. One study, one task, in one aviation journal.

Story 2

Zoom now sells its translation tools to other companies

Zoom used to keep its translation and summary features inside Zoom. Now it sells them as APIs, so other companies can put Zoom's translation into their own products. Why care? A video-meeting company is now competing with the dedicated translation vendors — DeepL and the language service providers (LSPs) — by renting out its language tech as a building block.

Story 3

Sam Altman floats micropayments — paid by AI agents, not readers

Asked what the future holds for publishers, OpenAI's Sam Altman pointed to micropayments made by AI agents as they use content, rather than by human readers. Why care? Every serious answer to "how do publishers get paid when AI reads the web for people" is worth tracking, because the current answer is roughly nothing. But this is a podcast remark, not a product — no system, no rate, no timeline — and Nieman Lab linked to a nuanced reply about micropayments: Why micropayments can’t save news, by Ben Werdmuller.

TALK OF THE WEEK

Some light into the dark: what is actually being done to make AI understand more languages

There is a floor. Below a certain amount of text, a language model simply does not work — and the number is more concrete than you might expect. The EU's own EuroLLM project sets a minimum of one billion tokens per language just to start. Up to that floor, collecting more data is the whole game. Above it, the picture flips: more data keeps helping for a while, then quietly stops mattering.

Here is the part that is rarely spelled out — three reasons the data runs out of road, none of which more text can fix:

Morphology. A single word in Turkish or Arabic can carry what English spreads across a whole phrase. The model sees many more distinct word-forms and far fewer examples of each, so even when text exists, it is spread thin. Translating out of these languages is harder for exactly this reason.

Tokenisation. The same sentence costs more tokens in Greek, Tamil or Turkish than in English — sometimes five or six times more. The model gets less actual language per unit of compute, and the user pays more for a worse result.

Language pairs. Translating into English is close to solved. Translating between two non-English languages is the rough road — and the cheap fix, routing through English as a pivot, adds errors at every hop.
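The tokenisation point can be made tangible with a rough sketch. Real tokenizers differ from model to model, so this one does not count tokens at all: it counts UTF-8 bytes per character, which is only a proxy for how much raw input a byte-level tokenizer has to chew through before any merging. The three greetings are illustrative samples chosen here, not data from any study, and the proxy says nothing about Latin-script languages such as Turkish, where the extra cost comes from word structure instead.

```python
import unicodedata

# Proxy only: UTF-8 bytes per character, NOT real token counts.
# Real tokenizers merge frequent byte sequences, so actual ratios will differ.

def bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character after NFC normalisation."""
    normalised = unicodedata.normalize("NFC", text)
    return len(normalised.encode("utf-8")) / len(normalised)

# Illustrative samples: "good morning" in three scripts.
samples = {
    "English": "Good morning",
    "Greek": "Καλημέρα",
    "Tamil": "காலை வணக்கம்",
}

for language, greeting in samples.items():
    print(f"{language}: {bytes_per_char(greeting):.2f} bytes per character")
```

The same greeting needs one byte per character in English, two in Greek and close to three in Tamil, before the tokenizer has done anything. How much of that overhead survives depends on how well the tokenizer's vocabulary covers the script.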

So the interesting work has moved to method: getting more out of the data that exists, instead of chasing data that does not. A few real examples, none of them "just add more":

  • Measure first. Microsoft's Paza is the first transparent speech-recognition leaderboard built for low-resource languages — more than 30 African languages and 50 models — with companion models evaluated directly with local communities, not only in the lab. You cannot improve what nobody is measuring.

  • Borrow, don't collect. Trans-tokenization adapts an existing high-resource model to a new language by mapping its vocabulary across. The approach produced a working Tatar translation model with no high-quality parallel data at all.

  • Route smarter. Choosing the right bridge language can lift cross-lingual accuracy sharply — in one test from 47% to 64.5% — with no new data, just a better path through the model.

  • Collect with the community. Where data must genuinely be made, the method that works is people, not scraping. In New Zealand, Talanoa AI has Pacific elders teach the model directly — the founder's line is that if you speak English the whole internet works for you, and if your first language is Fijian or Sāmoan you are close to invisible to it.
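The "Route smarter" point, and the earlier warning that a pivot adds errors at every hop, can be put into a toy model. This is a sketch under strong assumptions: each hop is treated as independent, so the chance that meaning survives a route is the product of its hops. The route names and every accuracy figure are hypothetical placeholders, not measurements from the studies mentioned above.

```python
# Toy model: a route is a list of per-hop accuracies (hypothetical values).
# Assumes hops fail independently, which real systems will not strictly obey.

def route_accuracy(hops: list[float]) -> float:
    """Probability that every hop preserves the meaning."""
    result = 1.0
    for hop in hops:
        result *= hop
    return result

def best_route(routes: dict[str, list[float]]) -> str:
    """Name of the route with the highest end-to-end accuracy."""
    return max(routes, key=lambda name: route_accuracy(routes[name]))

# Hypothetical numbers for one non-English language pair.
routes = {
    "direct": [0.60],
    "via bridge X": [0.90, 0.85],
    "via bridge Y": [0.95, 0.70],
}

for name, hops in routes.items():
    print(f"{name}: {route_accuracy(hops):.3f}")
print("best:", best_route(routes))
```

Two things show up even in this simplified form. A two-hop route always scores below its weakest hop, which is the cost of pivoting. And the choice of bridge matters more than the quality of any single hop: here the route with the best first hop still loses overall.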

The picture is mixed, honestly. Coverage is improving fast, and at the frontier the translation gap is real but narrowing. Yet the newest 2026 evaluations of frontier models still fall back on the English-pivot trick for Turkic and Mongolic languages — which tells you those languages are still not native to the models, only reachable through a detour.

If you care about the 24 official EU languages, EuroLLM-22B now covers all of them. Its most recent technical report is worth reading for the detail on how the model is trained and which sources go into it. By its own numbers, it still scores about ten points below comparable commercial models. The lesson is that there is no single gap to close — getting to real quality across many languages means closing several at once.

GOOD TO KNOW

Meta's No Language Left Behind, in Nature — The 2024 paper behind the 200-language model, open-sourced along with its benchmarks. Worth reading as the reference point any "we now cover N languages" claim should be measured against.

Microsoft Paza — A transparent speech-recognition leaderboard and model set for more than 30 African languages, evaluated with local communities rather than only in the lab. The useful takeaway is the method: build the benchmark before the model, because for most of these languages no one had measured the baseline.

ON THE CALENDAR

NAMS26 Nordic AI in Media Summit · 27–28 May 2026 · Copenhagen · AI in newsrooms, with a European public-service lens.

TAUS Massively Multilingual AI Conference · 3–5 June 2026 · Rome · Industry conference on the operational stack for multilingual AI. taus.net

BEFORE YOU LEAVE

A practical approach: normally, to teach an AI a new language, you need a huge pile of text in that language, which for most languages doesn't exist. Trans-tokenization skips that. It takes a model that already speaks a big language well and re-points it at a new one, reusing what it already knows instead of starting over. Its authors built a working Tatar translator this way, without any high-quality parallel data.
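For readers who want to see what "re-pointing" can mean in practice, here is a minimal sketch of the core idea as I understand it: each token in the new language starts life with an embedding that is a weighted average of the embeddings of source-language tokens it aligns with. Everything specific below is an assumption made for illustration: the two-dimensional vectors, the tiny vocabulary, the alignment weights and the function name. It is not the authors' implementation.

```python
# Sketch of embedding initialisation by weighted averaging.
# All vectors, tokens and weights are invented for illustration.

Vector = list[float]

def init_target_embedding(alignment: dict[str, float],
                          source_embeddings: dict[str, Vector]) -> Vector:
    """Weighted average of the source embeddings a target token aligns with."""
    total = sum(alignment.values())
    dim = len(next(iter(source_embeddings.values())))
    out = [0.0] * dim
    for src_token, weight in alignment.items():
        vec = source_embeddings[src_token]
        for i in range(dim):
            out[i] += (weight / total) * vec[i]
    return out

# Hypothetical source-model embeddings (real ones have thousands of dimensions).
source_embeddings = {"house": [1.0, 0.0], "home": [0.0, 1.0]}

# Hypothetical alignment counts for one Tatar token (meant as "house").
alignments = {"өй": {"house": 3.0, "home": 1.0}}

target_embeddings = {
    token: init_target_embedding(weights, source_embeddings)
    for token, weights in alignments.items()
}
print(target_embeddings)
```

The rest of the model is reused unchanged, which is why the method needs so little new material: only the vocabulary layer is rebuilt, and it starts from an informed guess instead of random values.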

ABOUT & DISCLOSURE

I am Mirko Lorenz. I work on language technology projects at Deutsche Welle in Germany.

Three projects you will hear about in this newsletter:

  • plain X (plainx.com) — media localisation platform, DW Innovation / Priberam.

  • ChatEurope (chateurope.eu) — AI chatbot network for 15 European news partners.

  • MOSAIC (mosaic-media.eu) — EU DIGITAL EUROPE-funded multilingual media infrastructure.

I cover all three with the same critical lens I apply to competitors.

AI use: I use Claude (Anthropic) for research and to edit this newsletter, based on refined and specific prompts. My goal is to understand where the AI performs and where it fails. I learn something every week. Responsibility for stated facts, names, and links is entirely mine.

babylon-newsletter.com · Every Tuesday

7,000 languages. AI works for 20.
