TL;DR

AI companies are facing a tighter market for high-value training data as public web text nears exhaustion, publishers seek licensing fees, and governments and companies guard proprietary datasets. The confirmed shift is legal, economic and strategic: data is no longer treated as a free input, and access may shape which AI firms can compete.

AI companies are entering a new phase in which high-value training data is becoming harder to obtain, more expensive to license and more legally contested, a shift that could reshape competition among model developers and give new leverage to publishers, enterprises and governments that control scarce datasets.

According to Epoch AI estimates cited in the source material, the public internet contains roughly 300 trillion tokens of high-quality text, and frontier AI models are already training on datasets that approach that supply. Epoch AI projects that the stock of usable public human text could be fully used between 2026 and 2032, with a median estimate around 2028. Those figures are projections, not a fixed deadline, and depend on model size, training methods and how researchers define high-quality data.

The source material frames the change as the third chokepoint in an AI control series: compute can be rented, power can be leased, but proprietary or sovereign data cannot be copied if another party controls it. It also cites falling H100 rental rates as a sign that some compute bottlenecks may ease while rare datasets become a larger source of advantage.

Legal and commercial pressure is changing the supply side. The material cites Anthropic’s reported $1.5 billion settlement with authors over alleged use of pirated books, described as the largest recovery in the history of U.S. copyright law. The judge in that case distinguished between training on legally acquired books, which the court described as “quintessentially transformative” fair use, and downloading pirated copies from shadow libraries, which the court did not treat as fair use. The settlement addressed past piracy claims; it did not resolve all questions about future training or model outputs.

[Infographic: AI Dispatch · The Control Series · Part 3. Chokepoint 03: Data]

Data: The One Thing You Can’t Rent

The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat, so data is being fenced, priced, and, in places, treated as a national asset.

Data tiers, from most to least scarce and valuable:

Sovereign / real-world (can’t be bought): Avengers combat data · FSD · ISR
Expert-authored (the new gold): PhDs, lawyers, surgeons define “good”
Licensed content (fenced): paywalled, deal-only, now priced
Public web text (commoditizing): scraped for free, exhausting ~2028

Key figures:

~300T: public text tokens, used up 2026–2032
$1.5B: Anthropic authors settlement, scraping era ends
$14.3B: Meta for 49% of Scale, triggered an exodus
“Keep the model”: Ukraine’s condition, data as sovereign asset

The take

Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own, so guard your proprietary data, and don’t hand it to a provider who can become your competitor (the lesson everyone fled Scale to learn). Nations: license it like Ukraine. Keep the model, keep the leverage.

Sources: Epoch AI; PBS; Intl AI Safety Report 2026; NPR; Authors Guild; Wolters Kluwer; TechCrunch; TIME; CNBC; Ukraine MoD (2024–Jun 2026). Token estimates are projections; valuations as reported.

Scarce Data Shifts AI Power

The change matters because data access may decide which AI companies can build leading models. Large firms can pay for licensing deals, settlements and exclusive partnerships. Smaller developers may face higher barriers if important text, video, enterprise, medical, legal, defense or expert-labeled datasets sit behind contracts or national controls.

For businesses, the shift makes proprietary data a strategic asset rather than an operational byproduct. Customer records, internal workflows, expert decisions and domain-specific archives may become valuable training material, but sharing them with an AI vendor can create competitive risk if contract terms allow the provider to use that data to improve services sold to others.

For readers outside the AI industry, the stakes include copyright payments, media licensing, workplace data governance and national security. The source material points to battlefield, intelligence and autonomous vehicle data as examples of real-world datasets that cannot be easily bought in open markets.

Free Web Training Hits Limits

The first generation of large AI systems drew heavily from open web text, code repositories, books, forums and other public or semi-public sources. That approach helped model developers scale quickly, but it also created disputes over copyright, consent and compensation.

The source material says the industry is now moving away from broad free scraping toward licensing and controlled access. It cites ongoing litigation involving The New York Times and OpenAI, along with publisher licensing deals by News Corp and others, as signs that rights holders are seeking payment for training use.

At the same time, AI labs are turning to synthetic data and expert-generated data. Nvidia’s reported $320 million acquisition of synthetic-data company Gretel and Microsoft’s use of large synthetic token sets are cited as examples. The source material also warns that synthetic data can create risks when answers are hard to verify, because errors can compound across model generations.

“Quintessentially transformative”

— U.S. court language cited in the Anthropic authors case

Licensing Rules Remain Unsettled

Several key issues remain unresolved. It is not yet clear where courts will draw final lines around copyrighted training data, model outputs and compensation for creators. The Anthropic settlement covered past piracy claims, but as a settlement it did not establish an industry-wide legal rule for future training.

The projected exhaustion of public text is also uncertain. Estimates depend on data quality, deduplication, model design, training efficiency and the degree to which synthetic or multimodal data can substitute for fresh human material. Claims about a firm deadline for the end of usable public data should be treated as projections, not confirmed fact.

It is also unclear how much proprietary enterprise data will be shared with AI vendors, how often it will be licensed exclusively, and whether regulators will treat some datasets as national assets.

Contracts Become the Battleground

The next phase is likely to center on licensing terms, court rulings and data-control agreements. Publishers, authors, enterprises and governments are expected to seek clearer payment, consent and retention terms before allowing model developers to train on their material.

AI firms will keep pursuing synthetic data, expert feedback and private corpora, but their competitive advantage may depend less on simply renting more chips and more on securing data that rivals cannot easily replicate. Companies holding proprietary records will face a practical decision: license data for revenue, keep it private for leverage, or build internal AI systems around it.

Key Questions

What is the main development in AI data?

High-value training data is becoming scarcer, more expensive and more controlled as public web text nears projected limits and rights holders push for licensing or legal remedies.

Is public internet data already exhausted?

No. The claim is based on projections. Epoch AI estimates that high-quality public text could be fully used between 2026 and 2032, with a median around 2028, depending on training methods and data definitions.

Why does this favor large AI companies?

Large companies are better positioned to pay for settlements, licensing deals, expert labeling and exclusive data partnerships. Smaller firms may struggle if key datasets are fenced behind contracts.

Can synthetic data solve the shortage?

Synthetic data can help, and major companies are using it. But researchers have warned that machine-generated data can compound errors in areas where outputs are hard to verify, which increases demand for fresh human-verified data.

What should companies do with proprietary data?

The source material argues that companies should treat proprietary data as strategic leverage. That means reviewing AI vendor contracts carefully and deciding whether data should be shared, licensed, retained or used to build internal systems.

Source: Thorsten Meyer AI
