Issue #10 · May 19, 2026

Who will keep the record of what is or was published if market dynamics and legal fights kill the “bees of information”? Keeping the record is a fragile thing. The world spends billions on data centres but relies on yearly funding pleas to keep Wikipedia going. Risky.

An ongoing controversy is now harming another keeper of records — the Wayback Machine. We are killing the bees of the information world.

THE NUMBER

87%

While AI companies invest billions in compute, we rely on underfinanced projects like Wikipedia and the Wayback Machine to preserve critical information. The Wayback Machine is a digital archive of the web, run by the Internet Archive, a US nonprofit. Launched in 2001, it lets users go "back in time" to see how websites looked in the past. The founders wanted to provide "universal access to all knowledge."

Now this project is under pressure. Between May and October 2025, the Wayback Machine's snapshots of 100 major news homepages dropped from 1.2 million to 148,628 — an 87% decline. Nieman Lab traced the drop to a breakdown in the Archive's own crawling projects, attributed to resource allocation.
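The 87% figure follows from the two snapshot counts. A minimal Python sketch of the arithmetic, assuming the rounded starting value of 1.2 million quoted above (the function name is only illustrative):

```python
def percent_decline(before: int, after: int) -> float:
    """Return the relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Snapshot counts as quoted above; 1.2 million is a rounded figure,
# so the result is approximate.
decline = percent_decline(1_200_000, 148_628)
print(f"Decline: {decline:.1f}%")
```

Because the starting count is rounded, the result should be read as "roughly 87 to 88 percent" rather than as an exact value.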

On top of that, publishers — including The New York Times, The Guardian, USA Today, and Reddit — have started blocking the Archive's crawlers, citing concerns that AI companies could use the Wayback Machine as a backdoor to scrape their content for training.

Why care? Once an article is no longer archived, that version of it becomes editable without accountability. The legal battle that matters is between publishers and AI companies, with comparable resources on both sides. The Wayback Machine is neither, and it is the one losing snapshots and potentially its existence.

Sources: Andrew Deck and Hanaa' Tameez, Nieman Lab, October 2025 and January 2026. Wired reported on the situation in April 2026, and Andrew Deck published a further update on Nieman Lab the same month.

THIS WEEK

Reuters draws the line at archive content

6 May 2026 · Press Gazette

Reuters CEO Steve Hasker, speaking at the Truth Tellers Summit in London, said the news agency's AI licensing deals cover archive text only — not current reporting. The position separates Reuters from publishers selling broader access and from those refusing to sell at all.

Mistral CEO: Europe has two years

12 May 2026 · Business Insider / French National Assembly

Arthur Mensch, CEO of Mistral, told France's National Assembly on Tuesday that Europe has two years to build its own AI infrastructure before becoming permanently dependent on US providers. "Once supply is monopolised by American players, suddenly we can no longer transform electrons into tokens." Mensch is also pitching Mistral as that European alternative — the company expects to cross €1 billion in revenue this year while spending roughly the same on chips and infrastructure.

Xiaomi publishes 600-language open-source TTS

15 May 2026 · Slator / arXiv 2604.00688

Xiaomi's next-generation Kaldi team open-sourced OmniVoice, a zero-shot text-to-speech model trained on 581,000 hours across 646 languages, with code, weights, and paper published. The team reports a word error rate of 0.84% on Chinese and claims multilingual benchmark results that beat ElevenLabs v2 and MiniMax. What to note: open-source, paper-backed, China-developed, 600+ languages with an explicit focus on low-resource languages.

“Time-to-trustworthy” as a newsroom rule

13 May 2026 · BR Munich AI for Media Meetup #8 / News Machines

Speed of building is no longer the bottleneck. AI coding tools can turn a prototype into running code in hours. The harder question is whether the result can be trusted. Ulrike Langer named it clearly at a meetup of German broadcasters and publishers: "Time-to-prototype is not relevant. Time-to-trustworthy is what matters." A working example: Joe Amditis at the Center for Cooperative Media built Reroute NJ, a transit detour planner, in 1.5 days. About 2,000 commuters used it daily for four weeks.

Source: News Machines

TALK OF THE WEEK

Killing the bees of the information world

Why is it important to keep records of what was published in the ever-changing world of websites and information bits? An example: the Vancouver Police Department edited a press release after PressProgress reporter Brishti Basu published an article criticising it, then accused her of falsifying information. She used the Wayback Machine to retrieve the original version and prove the police had changed their own statement. This is why such records matter, for a single press release and at scale.

But the Wayback Machine is in trouble. According to Wired reporting in April 2026, 23 major news sites now block ia_archiverbot, the Wayback Machine's main crawler, including The New York Times, USA Today, and The Guardian. A wider dataset from the AI-detection startup Originality AI counted 241 sites across nine countries blocking at least one Internet Archive crawler. The New York Times implemented what Wayback Machine director Mark Graham described as a "hard block" starting in late 2025.
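One common way for a site to block a crawler is a robots.txt rule. The reporting cited here does not say which mechanism each publisher uses, so this is an assumption. A minimal Python sketch of checking such a rule with the standard library's urllib.robotparser follows. The user-agent string comes from the paragraph above; the sample robots.txt and the helper name are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt for illustration; not copied from any real publisher.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiverbot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if robots_txt disallows user_agent from fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


print(is_blocked(SAMPLE_ROBOTS_TXT, "ia_archiverbot"))  # archive crawler
print(is_blocked(SAMPLE_ROBOTS_TXT, "SomeOtherBot"))    # any other crawler
```

A robots.txt rule is only a request that well-behaved crawlers honour. A "hard block" as described by Mark Graham may be enforced differently, for example at the server level, which a check like this would not detect.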

What are the publishers saying? According to Wired and Nieman Lab reporting:

  • The New York Times said Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with them.

  • USA Today Co. said the blocking is part of a broader effort against all scraping bots, not specifically aimed at the Internet Archive.

  • The Guardian admitted it has not documented specific instances of its webpages being scraped by AI companies via the Wayback Machine. Its measures are proactive.

What is relevant here is the geometry of the dispute. The Wayback Machine is being squeezed by entities that have orders of magnitude more legal and financial resources than it does, in a fight whose actual antagonist is somewhere else entirely. Publishers suing AI companies makes sense — same weight class, same arena. Publishers blocking the Wayback Machine to defend against AI companies offloads the cost onto a small nonprofit that did not cause the problem and cannot defend itself in the same way.

Michael Nelson, computer scientist at Old Dominion University, told Nieman Lab: "Common Crawl and Internet Archive are widely considered to be the 'good guys' and are used by 'the bad guys' like OpenAI. In everyone's aversion to not be controlled by LLMs, the good guys are collateral damage."

Should a small nonprofit have to sue rich publishers to keep providing that function? There should be a better way and a path to an agreement. Otherwise we lose a foundational block and gain little.

GOOD TO KNOW

A US research lab launches an investigative-journalism competition. The AI research lab at Northwestern University launched a competition for journalists and developers. The task: build agent skills using Claude Code to find newsworthy stories in a corpus of US federal lobbying disclosures and Congressional press releases, 2022 through March 2026. Prizes: $5,000 / $2,500 / $1,000 for the top three. Top submissions also get an invitation to present at an academic symposium in Evanston, Illinois. All submissions are open-source, MIT-licensed. Deadline: 15 July 2026. gain-agent-challenge.northwestern.edu

Countering Disinformation in the Era of Generative AI. A new academic volume from Springer covers detection methods for AI-generated text, image, audio, and video. Rooted in the Horizon Europe vera.ai project, which builds verification tools for fact-checkers and newsrooms. The DW Innovation team is part of the vera.ai consortium. Practical material for anyone deciding what to deploy in a verification workflow. link.springer.com

News Atom Lite. Sannuta Raghu, a journalist at Scroll, an Indian newsroom, published an open framework that turns news articles into two structured outputs: a record of what happened (the event), and a record of how each sentence of an article reported it (the atom). The point is to make journalism machine-readable at the sentence level — usable for archives, fact-checking, source attribution, or training. Running it on a frontier model costs about $0.37 per 500-word article, which is too expensive at archive scale. The next phase is a fine-tuned local extractor that should bring the cost close to zero. github.com/sannuta/news-atom-lite
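To see why $0.37 per article is too expensive at archive scale, it helps to multiply it out. A minimal Python sketch, assuming the per-article cost stays flat; the archive sizes below are hypothetical and chosen only to show the scaling:

```python
COST_CENTS_PER_ARTICLE = 37  # $0.37 per 500-word article, as reported above


def extraction_cost_usd(article_count: int) -> float:
    """Total cost in US dollars, assuming a flat per-article price."""
    return article_count * COST_CENTS_PER_ARTICLE / 100


# Hypothetical archive sizes, not figures from any real newsroom.
for count in (10_000, 1_000_000, 50_000_000):
    print(f"{count:>12,} articles: ${extraction_cost_usd(count):,.0f}")
```

Working in whole cents keeps the multiplication exact before the final conversion to dollars.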

ON THE CALENDAR

NAMS26 — Nordic AI in Media Summit · 27–28 May 2026 · Copenhagen · Fourth edition of the Nordic public-service-broadcaster gathering on AI in media. nordicaimedia.org

TAUS Massively Multilingual AI Conference · 3–5 June 2026 · Rome · Industry conference on the operational stack for multilingual AI. taus.net

BEFORE YOU LEAVE

Four technologies and resources you should have heard about. They might help to turn European sovereignty dreams into something concrete:

  • CommonsDB — an EU-funded registry for openly licensed and public-domain works. Passed one million declarations in March 2026, targeting five million by summer. commonsdb.org

  • FAIA (Fair AI Attribution). Flags whether AI was involved in producing or editing a piece of content. Developed by Liccium with Leiden University and the GO FAIR Foundation. liccium.com

  • TDM·AI (Text and Data Mining for AI). Opt-out declarations bound to content fingerprints, so creators can signal "do not use for training" in a machine-readable way. liccium.com

  • ISCC (International Standard Content Code). The underlying identifier (ISO 24138) that makes the other three work. Generated from the content itself, not attached to the file. iscc.codes

Liccium was founded by Sebastian Posth, who might have solved some really complex challenges with a very small team.

ABOUT & DISCLOSURE

I am Mirko Lorenz. I work on language technology projects at Deutsche Welle in Germany.

Three projects you will hear about in this newsletter:

  • plain X (plainx.com) — media localisation platform, DW Innovation / Priberam.

  • ChatEurope (chateurope.eu) — AI chatbot network for 15 European news partners.

  • MOSAIC (mosaic-media.eu) — EU DIGITAL EUROPE-funded multilingual media infrastructure.

I cover all three with the same critical lens I apply to competitors.

AI use: I use Claude (Anthropic) for research and to edit this newsletter, based on refined and specific prompts. My goal is to understand where the AI performs and where it fails. I learn something every week. Responsibility for stated facts, names, and links is entirely mine.

babylon-newsletter.com · Every Tuesday

7,000 languages. AI works for 20.
