
Reddit's $60mn Google Deal Was a Floor, Not a Ceiling

Google's $60mn/year Reddit deal, signed Feb 2024 the same day as the IPO filing, set the template for AI content licensing. Reddit is now pushing for usage-based pricing tied to AI Overview citations. None of this revenue flows to scraping vendors.

By Signal Census Editorial

In February 2024, Google and Reddit signed a $60mn per year licensing agreement for AI training data. The deal was announced the same day Reddit filed its IPO prospectus — a timing choice that priced the licensing revenue into Reddit’s listing valuation. Two years later, that deal is the template every AI content licensing negotiation works from.

What the press underreported at the time, and what now matters more, is what the deal did to the rest of the data market. The Google-Reddit agreement effectively set a price floor for premium AI training data, established a template for non-monetary terms (attribution, usage limits, citation requirements), and signaled to every other publisher with a meaningful corpus that direct licensing was on the table.

For web scraping vendors, that signal was bad news.

What the Reddit deal actually contains

The $60mn/year headline figure is the anchor. The reported terms also include real-time access to Reddit data via API, structured citation when Reddit content appears in Gemini answers, and rate-limit guarantees that exceed what Reddit offers any other API customer. The deal runs on a multi-year term, and Reddit is reportedly pushing for usage-based pricing in the next renegotiation: a structure in which the fee scales with how often Gemini surfaces Reddit content, rather than staying flat.
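The difference between a flat fee and a usage-based structure can be sketched in a few lines of Python. This is an illustration only. The $60mn floor comes from the reported deal, but the per-citation rate, the citation volumes, and the "floor or usage, whichever is higher" shape are all hypothetical assumptions, not reported terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageDeal:
    """Hypothetical usage-based licensing deal with a guaranteed minimum."""

    floor_usd: int             # guaranteed annual minimum (reported: $60mn)
    usd_per_1k_citations: int  # ASSUMED rate, not a reported term

    def annual_fee(self, citations: int) -> int:
        """Annual fee in USD: the usage charge, but never below the floor."""
        usage_charge = citations * self.usd_per_1k_citations // 1000
        return max(self.floor_usd, usage_charge)

    def breakeven_citations(self) -> int:
        """Citation volume above which usage pricing exceeds the floor."""
        return self.floor_usd * 1000 // self.usd_per_1k_citations


# Assumed numbers for illustration: $10 per 1,000 citations.
deal = UsageDeal(floor_usd=60_000_000, usd_per_1k_citations=10)

for citations in (4_000_000_000, 10_000_000_000):
    print(f"{citations:,} citations -> ${deal.annual_fee(citations):,}")
```

Under these assumed numbers, the publisher never earns less than the flat fee, and every citation past the breakeven volume adds revenue. That is the sense in which the original figure acts as a floor.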

The economic logic is straightforward. Reddit data is uniquely valuable for AI training because of its conversational density, vertical depth, and topical coverage that Common Crawl alone does not produce. Google has the deepest pockets among potential buyers and the most direct use case (Gemini and AI Overviews). The deal that emerged is what a single-buyer, single-seller negotiation produces when both sides know the data has scarcity value.

The follow-on deals that have surfaced since:

  • Stack Overflow with OpenAI (May 2024): undisclosed terms; OverflowAPI subscription product launched with attribution-in-ChatGPT requirement
  • Wiley with an undisclosed AI buyer: reported $44mn deal for academic content
  • Axel Springer, AP, News Corp, TIME with OpenAI: aggregate value reportedly above $250mn over five years
  • Shutterstock with OpenAI: ongoing licensing arrangement

A consistent shape emerges. Premium content goes to a single AI buyer (or a small concentrated set), under multi-year contracts, with terms favorable enough that the publisher prefers licensing to litigating.

What the licensing market looks like to a scraper

The licensing market is platform-to-platform. Reddit gets $60mn a year from Google. Wiley gets a reported $44mn from an undisclosed buyer. None of that revenue flows to the scraping vendors who, in many cases, were the de facto data pipeline before the licensing deals existed.

That is a structural shift. As recently as 2022, the dominant way an LLM lab acquired training data was scraping (Common Crawl, plus targeted scrapes of high-value sources). The licensing era moves a meaningful slice of training-data supply behind paywalled, contracted access — and the scraper sees none of it.

The implication for the scraping infrastructure market is that the highest-value training-data targets are progressively becoming unscrapable, in the sense that the LLM labs have committed contractually not to scrape them. That does not literally close the technical scraping path, but it removes the buyer demand that made the path commercially viable.

Common Crawl, which is free and remains the largest single training corpus, illustrates the new equilibrium. The largest LLM labs have moved off Common Crawl as a primary source not because the data is unusable, but because the legal-liability exposure of training on it is unattractive. The licensing deals are framed explicitly as IP-lawsuit insulation, and Common Crawl-trained models do not have that insulation.

What this means for scraping vendor strategy

Three implications are worth being precise about.

The buyer base for scraped training data is contracting. Foundation labs are moving to licensed sources. The remaining buyers for scraped training data are second-tier model builders, academic researchers, and customers who do not need the legal cover that licensing provides. That is a smaller market than it was three years ago.

The non-training-data use cases are growing. Lead enrichment, price intelligence, real-time market data, and agent-driven workflow scraping all show healthy demand. For the larger scraping vendors, the contraction in the training-data pipeline is offset by these other revenue lines, which the shift to licensing does not touch.

The licensing-era platforms are not friends to scrapers. Reddit is more aggressive about API rate-limiting and bot detection than it was pre-deal. LinkedIn, Stack Overflow, and the major news publishers are similarly hostile. The technical cost of scraping these sources has gone up, the legal exposure has gone up, and the buyer demand has gone down. All three pressures point in the same direction.

Where the surviving demand sits

For Apify Store publishers, the licensing era changes the calculus on which targets to invest in.

High-licensing-pressure targets (Reddit, Stack Overflow, major news publishers, LinkedIn, possibly X/Twitter): the technical access is increasingly hard, the legal exposure is high, and the buyer demand is shifting away from scrape-based supply. Actors targeting these are operating against a falling demand curve.

Low-licensing-pressure targets (job boards, e-commerce sites, real-estate listings, vertical SaaS): the licensing era does not directly affect them, and demand from non-training-data buyers (lead gen, price intel, market research) is healthy. The Q1 2026 censuses on this site show this is where most of the Store’s measured demand actually sits.

The strategic read is that the most defensible long-term position for an Apify Store publisher is in non-training-data verticals, which is exactly where the catalog is already strongest. The temptation to chase headline targets like Reddit or X is largely a trap; the better positioning is to build deeper coverage in the verticals where the buyers are still scraping and the licensing-era headwinds do not apply.

The Reddit deal is two years old and already a template. More premium publishers will sign similar contracts; none of that revenue will reach scraping vendors. The dynamic is durable, and it points scrapers clearly at where the remaining buyer demand will sit — outside the licensed-content perimeter.

