THE NUMBER

44.3%

English accounts for 44.3% of Common Crawl data as of March 2026 — the largest training source for most large language models.

  • A further 37.9% consists of 54 European languages, meaning that European languages including English represent 82.2% of all training data.

  • Chinese and Japanese together add 11.9%.

  • Fifty African, Oceanic, and Indigenous American languages together contribute less than 0.1% — and that share has been falling since 2023.

Source: Arle Lommel on LinkedIn / Common Crawl
Chart from: CSA Research, "The Ethics of Generative AI," © 2026

Why care? The models most organisations are deploying were built predominantly on English. Every other language is, to varying degrees, an afterthought — which means lower accuracy, higher cost per token, and outputs that carry cultural assumptions most users never agreed to. For the 7,000 languages that fall outside this training data, AI is not a tool — it is someone else's tool, poorly adapted.

THIS WEEK

The hidden language tax in most models

Fact: Using AI in any language other than English costs more and returns less — and most providers never surface the difference as a cost. French runs roughly 6% more expensive in tokens, German up to 50%, depending on the model. Researchers have found that tokenisation inefficiency structurally disadvantages morphologically complex languages, inflating costs and depressing accuracy.

Why care? Token count determines what you pay, how fast you get answers, how much context the model can hold, and which cultural assumptions are active in the output. Choosing a language does not just change the price — it activates a different cultural filter inside the model. This is not a minor technical detail; it is a pricing and quality gap that scales with every multilingual deployment.

Reality check: The gap is documented and reproducible, but most AI providers do not surface it in their pricing pages. You get a clearer picture using this tool: Token Count Comparison on Hugging Face.
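The compounding effect is easy to sketch. Here is a minimal back-of-envelope calculator using the inflation figures above (6% for French, 50% for German); the per-token price and context window size are illustrative assumptions, not any provider's actual numbers:

```python
# Back-of-envelope "language tax" calculator.
# The inflation factors are the newsletter's example figures; the price
# and context window below are ASSUMED for illustration only.

PRICE_PER_1K_TOKENS = 0.01   # assumed price in USD, not a real rate
CONTEXT_WINDOW = 128_000     # assumed context size in tokens

# Token inflation relative to English for the same content.
INFLATION = {"English": 1.00, "French": 1.06, "German": 1.50}

def language_tax(language: str, english_tokens: int) -> dict:
    """Cost and effective context for content that would take
    `english_tokens` tokens in English."""
    factor = INFLATION[language]
    tokens = round(english_tokens * factor)
    return {
        "tokens": tokens,
        "cost_usd": round(tokens / 1000 * PRICE_PER_1K_TOKENS, 4),
        # Inflation also shrinks how much content fits in the window:
        "effective_context": round(CONTEXT_WINDOW / factor),
    }

for lang in INFLATION:
    print(lang, language_tax(lang, english_tokens=10_000))
```

Note how the tax hits three times: the same document costs more, returns slower, and leaves less room in the context window — which is why the gap scales with every multilingual deployment rather than staying a rounding error.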

EU AI legislation in flux — the omnibus risk

What's new: The EU is fast-tracking amendments to its AI and digital legislation through an omnibus process — bundling multiple laws together — raising the risk of creating legal loopholes rather than delivering the simplification promised.

Why care? When and how EU AI obligations actually take effect is now genuinely unclear. Policy Fellow Luise Quaritsch at the Jacques Delors Centre argues the process could entrench foreign big tech rather than constrain it, and may be legally vulnerable to challenge.

Reality check: The AI Omnibus is already in final negotiations. The window for the European Parliament to request a proper impact assessment is closing.

Open question: Will Europe find the right balance between legal certainty and economic growth?

EU Commission bans generative AI from official communications

March 31, 2026 · Politico

What's new: The European Commission, Parliament, and Council have instructed staff not to use AI-generated images or video in official communications, citing risks to institutional trust.

Why care? The EU's own institutions are drawing a line most member state governments have not yet drawn. The contrast with AI-generated content from US political actors is deliberate and visible.

Reality check: The ban covers visuals and video but says nothing about AI-assisted text, which is already in routine use across EU communications departments.

Open question: Is this a principled position or a holding pattern until deepfake detection becomes reliable?

Mexico bans AI voice cloning without consent

April 2, 2026 · Fisher Broyles / LatamPrompt

What's new: Mexico has reformed its federal copyright law to require explicit, revocable authorisation before anyone's voice or image can be used in AI systems. The law replaces retrato (portrait) with imagen, incluida la voz (image, including voice) — closing the gap AI companies had exploited through outdated contract language.

Why care? This is one of the first national laws to close the retroactive loophole: contracts signed before AI voice cloning existed can no longer be read as consent. The rights persist for 50 years after death.

Reality check: Enforcement across borders — where a model is trained in one country and deployed in another — remains entirely unresolved.

Open question: How long before a comparable standard reaches the EU or US, and what happens to content already scraped and trained on in the interim?

QUICK EXPLAINER

What sovereignty actually means — and why it is not easy to achieve

Several stories this week touched on the concept of "sovereignty". The word itself points at the goal — but to make sure we all mean the same thing by it, here is a quick structural explainer.

In short: Sovereignty in AI means being able to control the stack your critical systems run on. That sounds simple. It is not, because the stack has four layers — and controlling one is not the same as controlling all four.

  • Data — where it is stored, who can access it, under which legal jurisdiction. A European hospital using a US cloud provider means US law applies to patient data, regardless of where the servers are physically located.

  • Model — who built it, who can modify it, who can switch it off. A model hosted by an American company can be deprecated, repriced, or made subject to export controls overnight.

  • Compute — where inference actually happens. On-premise, national cloud, or a hyperscaler data centre in Ireland that is ultimately American infrastructure. The geography of the building is not the same as the jurisdiction of the company.

  • Governance — who sets the rules about how the system behaves, what it refuses, what it prioritises. A model trained on American values and moderated by American teams is not neutral when deployed in European public services.

Sovereign AI means controlling enough of these layers to make decisions independently — without a foreign company being able to cut access, change terms, or comply with a foreign court order that conflicts with local law. In practice, sovereignty is a spectrum, not a binary. When a company claims sovereign AI, ask which layers they actually control. Data residency alone is not sovereignty. A model hosted in Frankfurt but built on GPT-4 APIs is not sovereign — it is locally stored dependency.

In one sentence: Sovereignty is not where your data lives. It is who can take your tools away.

More: AI Sovereignty (Roland Berger) • Technological Sovereignty (Wikipedia)

GOOD TO KNOW

2026 State of AI Traffic, HUMAN Security — Automated traffic is growing eight times faster than human traffic, and AI-driven traffic is now the fastest-growing category on the internet. For the first time, AI systems are not just reading the web — they are transacting on it.

Wikipedia's AI agent row likely just the beginning of the bot-ocalypse · Danny Bradbury, Malwarebytes, April 1, 2026 — An AI agent named Tom-Assistant was banned from Wikipedia for editing articles without formal bot approval — and then published a blog post complaining about the ban. The story is entertaining, but the serious edge is real: agentic AI systems acting autonomously online, evading kill switches, and pushing back when blocked is no longer hypothetical.

Millions of Mediocre Minions · Gina Chua, Tow-Knight Center, April 6, 2026 — The new competitive divide in journalism is not between humans and AI — it is between journalists who know how to deploy AI at scale and those who don't. Chua's argument: AI agents are mediocre in aggregate, but millions of mediocre minions beat one talented person working manually.

Noxtua / Deutsche Telekom — Berlin-based legal AI startup Noxtua has moved its infrastructure into Deutsche Telekom's sovereign cloud in Munich, explicitly outside US Cloud Act jurisdiction. A concrete example of what European AI sovereignty looks like when professional secrecy requirements make it mandatory rather than optional.

BEFORE YOU LEAVE

EU AI Act and AI transcription/translation

In four months, high-risk AI obligations under the EU AI Act become enforceable — assuming the Digital Omnibus does not push that deadline to December 2027. If your organisation uses AI in contexts where outputs feed into consequential decisions — medical assessments, legal proceedings, hiring — that deployment may already qualify as high-risk. Now is a reasonable time to check.

European Commission FAQ

ON THE CALENDAR

WAN-IFRA Frankfurt AI Forum · April 13–14 · Frankfurt · wan-ifra.org · Happening next week; it will be interesting to see how prominent AI is as a topic.

DeepL Spring Launch · April 16 · Virtual · deeplspringlaunch.com · DeepL product announcements. Given that the company is a major EU player, it will be interesting to see its next move.

EBU HORIZONS · May 5–6 · Geneva · ebu.ch · EBU's flagship strategy event, relevant for the orientation of public broadcasters in Europe.

Nordic AI in Media Summit (NAMS) · May 27–28 · Copenhagen · nordicaijournalism.com · Nordic newsrooms move pragmatically and fast — what are the latest learnings and developments?

World News Media Congress 2026 · June 1–3 · Marseille · wan-ifra.org · The largest annual gathering of news media executives; worth scanning the programme for AI and language technology sessions.

TAUS Massively Multilingual AI Conference · June 3–5 · Rome · taus.net · Most directly relevant conference for this newsletter's language technology focus.

HumanX Europe · September 22–24 · Amsterdam · humanx.co · Large commercial AI conference; worth monitoring the programme and speakers. Already listed as speaker: Anton Osika, CEO of Lovable.

ABOUT & DISCLOSURE

I am Mirko Lorenz. I work on language technology projects at Deutsche Welle in Germany.

Three projects you will hear about in this newsletter on occasion:

  • plain X — media localisation platform, DW Innovation / Priberam.

  • ChatEurope — combines news with a chatbot in a RAG set-up to better inform about EU affairs; an experiment in the open with 15 European news partners.

  • MOSAIC — EU DIGITAL EUROPE-funded multilingual media infrastructure.

babylon-newsletter.com · Every Tuesday at 10:00 CET
7,000 languages. AI works for 20.

Share this: The Babylon newsletter covers AI and language technology — and language is a factor in almost every AI deployment, whether you are building in San Francisco, Stuttgart, or Singapore. Reliable, accurate, trustworthy communication across languages is not a niche concern. If you know someone who works with AI and has not yet thought about the language layer, forward this to them. Gradually building a community around this topic is the goal.
