
DataDome's AI Crawler Taxonomy: 12 Months Later

DataDome's AI-crawler taxonomy launched Jan 2025 with three categories. By May 2026 it had expanded to five, the AI-related share of bot traffic had risen from 2.6% to 8-12% (roughly a three- to fourfold increase), and the crawler-vs-agent distinction had partially collapsed. The taxonomy is now under continuous revision.

By Signal Census Editorial May 28, 2026



DataDome launched its AI-crawler taxonomy in January 2025: a three-category classification (training crawlers, retrieval crawlers, scraping operators) intended to give publishers operational language for distinguishing AI traffic from traditional bot traffic. Twelve months on, the taxonomy has expanded to five categories, the AI share of the underlying bot traffic has grown roughly three- to fourfold, and the operational distinction the original taxonomy was built around has partially collapsed.

The taxonomy revision is itself the most informative thing about the AI-traffic landscape. The categories have moved because the underlying traffic has moved, and DataDome’s customer base has demanded the resolution that the original three-bucket model could not provide.

The original three categories

The January 2025 DataDome AI-crawler taxonomy split AI traffic into:

Training crawlers — large-batch ingestion for model pretraining. GPTBot, Google-Extended, ClaudeBot, CCBot. Predictable user-agents, infrequent visits per IP, large content windows.
Retrieval crawlers — real-time grounding for AI search and chat. PerplexityBot, Bing’s AI crawlers, ChatGPT’s browse mode. Identified by user-agent and behavior pattern (short content windows, conversational query patterns).
Scraping operators — adversarial or commercial scraping disguised as AI traffic. Rotating user-agents, residential IPs, no public identification.

This split worked for early 2025 because the three categories did map onto observably distinct traffic patterns. A training crawler hitting a publisher site looked nothing like a retrieval crawler hitting the same site, and neither looked like a residential-proxy scraping operation.
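A minimal sketch of how the original three buckets could be approximated from declared user-agents, assuming a simple substring match. The `looks_automated` flag is a hypothetical stand-in for whatever bot signal a publisher already has; nothing here reflects DataDome's actual implementation. Google-Extended is omitted because it is a robots.txt control token rather than a user-agent sent with requests, and only the tokens named above are included.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    TRAINING = "training crawler"
    RETRIEVAL = "retrieval crawler"
    SCRAPING = "scraping operator"


# Declared tokens taken from the category list above (illustrative, not
# exhaustive). Google-Extended is left out: it is a robots.txt token, not
# a user-agent string that appears on requests.
DECLARED_TOKENS = {
    "GPTBot": Category.TRAINING,
    "ClaudeBot": Category.TRAINING,
    "CCBot": Category.TRAINING,
    "PerplexityBot": Category.RETRIEVAL,
}


def classify_declared(user_agent: str, looks_automated: bool) -> Optional[Category]:
    """Return a category, or None when the traffic is not flagged as a bot.

    `looks_automated` is a hypothetical upstream signal (for example an
    existing bot score); it is an assumption of this sketch.
    """
    ua = user_agent.lower()
    for token, category in DECLARED_TOKENS.items():
        if token.lower() in ua:
            return category
    # Automated traffic with no public identification lands in the
    # scraping-operator bucket.
    return Category.SCRAPING if looks_automated else None
```

The weakness is visible in the code itself: everything depends on the operator declaring who it is, which is the assumption the later revisions had to drop.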

What changed in 12 months

Three shifts forced the taxonomy revision.

Agent traffic emerged as a fourth category. Browser-based AI agents (OpenAI’s Operator, Anthropic’s Claude with computer use, Google’s Mariner) produce traffic that does not fit the original three buckets. An agent navigates the publisher site like a human would, but with agent-typical timing (faster than a human, slower than a scraper). Its user-agent identifies it as Chrome or another browser, not as an AI tool. DataDome added an “agent” category in mid-2025 to handle this.
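The timing observation (faster than a human, slower than a scraper) can be sketched as a pacing check over one session's request timestamps. The thresholds below are hypothetical values chosen for illustration, not figures from DataDome, and pacing alone would not separate agents from well-tuned automation.

```python
from statistics import median

# Hypothetical thresholds, for illustration only.
SCRAPER_MAX_GAP = 0.5  # seconds; faster than this looks like bulk fetching
HUMAN_MIN_GAP = 5.0    # seconds; slower than this looks like human reading


def median_gap_seconds(timestamps: list[float]) -> float:
    """Median gap between consecutive request timestamps in one session."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two requests to measure pacing")
    return median(gaps)


def pacing_label(timestamps: list[float]) -> str:
    """Bucket a session by its median inter-request gap."""
    gap = median_gap_seconds(timestamps)
    if gap < SCRAPER_MAX_GAP:
        return "scraper-like"
    if gap < HUMAN_MIN_GAP:
        return "agent-like"
    return "human-like"
```

For example, `pacing_label([0, 2, 4, 6])` falls in the middle band, while a session with gaps of a tenth of a second lands in the scraper band.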

The retrieval-vs-training distinction blurred. Foundation labs increasingly use crawlers that serve dual purposes — fetched content goes into both immediate retrieval-augmented generation and longer-term training corpus updates. The DataDome taxonomy now distinguishes by behavior rather than declared purpose.

A fifth “synthetic user” category emerged. Scrapers using AI-driven browser automation (Browser-Use, Stagehand, Skyvern in production deployments) produce traffic that is functionally indistinguishable from human user traffic at the request level. DataDome’s detection has moved from request-pattern fingerprinting to longer-session behavioral analysis to catch this.

The May 2026 taxonomy is:

Training crawlers — same as 2025 definition
Retrieval crawlers — same as 2025 definition, but acknowledged overlap with training
AI agents — browser-based, performing tasks
Synthetic users — AI-driven scraping using browser automation
Mixed-purpose scraping operators — the residual category for unclassified AI traffic

DataDome’s published reports across 2025 and early 2026 placed AI-related traffic at roughly 2.6% of all bot traffic in January 2025, rising to 8-12% by Q1 2026 depending on publisher segment. The growth is sharply uneven across verticals.

News and reference publishers see the highest AI-traffic share, 15-20% by mid-2026 on some sites, because AI-search and grounding workloads heavily target news and reference content. The Reddit-Google deal and the Stack Overflow-OpenAI deal are visible evidence of where the demand concentrates.

E-commerce and SaaS publishers see lower AI shares — 4-8% — because the content is less useful for training and the AI-agent traffic is mostly tied to specific commercial workflows (purchasing, account management) rather than general indexing.

Government and academic sites see relatively low AI traffic by share but high absolute volume — these surfaces are still being indexed for training as foundation labs continue corpus expansion.

The category-mix within the AI-traffic share has also shifted. In Q1 2025, training crawlers dominated. By Q2 2026, training has dropped to roughly 30-40% of the AI-traffic share, with retrieval and agent traffic capturing the rest. The implication for publishers is that the traffic they receive is increasingly read-time, not training-time, which changes the monetization opportunity.
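The headline figures can be checked with a few lines of arithmetic. This sketch uses only numbers quoted above; applying the Q2 2026 training fraction to the Q1 2026 AI share is an assumption made here for illustration, since the two figures come from adjacent quarters.

```python
JAN_2025_SHARE = 2.6              # % of bot traffic, January 2025
Q1_2026_SHARE = (8.0, 12.0)       # % of bot traffic, low and high end
TRAINING_FRACTION = (0.30, 0.40)  # training's slice of the AI share, Q2 2026


def growth_multiple(start_pct: float, end_pct: float) -> float:
    """How many times larger the end share is than the start share."""
    return end_pct / start_pct


def share_of_all_bot_traffic(ai_share_pct: float, fraction: float) -> float:
    """Convert a fraction of the AI share into a % of all bot traffic."""
    return ai_share_pct * fraction


low_multiple = growth_multiple(JAN_2025_SHARE, Q1_2026_SHARE[0])
high_multiple = growth_multiple(JAN_2025_SHARE, Q1_2026_SHARE[1])
training_low = share_of_all_bot_traffic(Q1_2026_SHARE[0], TRAINING_FRACTION[0])
training_high = share_of_all_bot_traffic(Q1_2026_SHARE[1], TRAINING_FRACTION[1])

print(f"AI share grew {low_multiple:.1f}x to {high_multiple:.1f}x")
print(f"Training crawlers: {training_low:.1f}% to {training_high:.1f}% of bot traffic")
```

Under these assumptions the AI share grew about 3.1x to 4.6x, and training crawlers alone sit at roughly 2.4% to 4.8% of bot traffic. Training fell as a fraction of AI traffic without necessarily shrinking in absolute terms.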

What the taxonomy revisions reveal

The fact that DataDome has revised its taxonomy three times in roughly 16 months (January 2025 to May 2026) is itself a market signal. The underlying traffic is shifting faster than the classification system can stabilize. Two structural reasons stand out:

The operator landscape is fragmenting. Early 2025 had a small number of identifiable AI crawlers (four or five major operators). Mid-2026 has dozens: every AI-product startup runs some form of web crawler, and many use residential-IP infrastructure that defeats user-agent identification. The taxonomy has to either expand to cover the new operators or collapse into behavior-based categories that do not depend on operator identity.

The crawler-agent boundary is breaking. The taxonomy was built when “crawler” (programmatic HTTP fetcher) and “agent” (browser automation) were operationally distinct. By mid-2026, the same product can run as either depending on the target’s defense level. A site that detects and blocks Bytespider’s user-agent may see the same operator return a month later using browser automation under a Chrome user-agent. The categories are becoming less stable because operators treat them as detection boundaries to move across.

The DataDome customer experience reflects this churn. Publishers who configured their AI-traffic responses in early 2025 (allow training crawlers, charge retrieval crawlers, block scraping operators) have had to revise the configuration three to four times to keep pace with the taxonomy and the underlying traffic shift.
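The configuration churn can be illustrated with a small policy table. The early-2025 policy (allow, charge, block) comes from the paragraph above; the `REVIEW` fallback and the mapping of the old scraping-operator bucket onto the mixed-purpose category are assumptions of this sketch, not DataDome settings.

```python
from enum import Enum


class Category(Enum):
    TRAINING = "training crawlers"
    RETRIEVAL = "retrieval crawlers"
    AI_AGENT = "AI agents"
    SYNTHETIC_USER = "synthetic users"
    MIXED_SCRAPING = "mixed-purpose scraping operators"


class Action(Enum):
    ALLOW = "allow"
    CHARGE = "charge"
    BLOCK = "block"
    REVIEW = "review"  # hypothetical fallback for unconfigured categories


# Early-2025 style policy: only three buckets were configured.
POLICY_2025 = {
    Category.TRAINING: Action.ALLOW,
    Category.RETRIEVAL: Action.CHARGE,
    Category.MIXED_SCRAPING: Action.BLOCK,
}


def decide(category, policy, default=Action.REVIEW):
    """Look up the action for a category, falling back to the default."""
    return policy.get(category, default)


def unconfigured(policy):
    """List taxonomy categories the policy has no explicit rule for."""
    return [category for category in Category if category not in policy]
```

Running `unconfigured(POLICY_2025)` returns the AI-agent and synthetic-user categories, which are the gaps a publisher has to fill each time the taxonomy grows.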

The crawler-economics implication

For Apify Store publishers building actors that hit other publishers’ sites, the DataDome taxonomy revision is a forward-looking guide to what gets detected next.

The taxonomy categories that have stabilized are the easy detection targets — declared training crawlers, declared retrieval crawlers. The categories that DataDome is still revising — AI agents, synthetic users, mixed-purpose operators — are the harder detection problems. The scraping operations that need to evade detection are increasingly operating in those harder categories, which is why DataDome and the other vendors are spending engineering capacity on classifying them.

The longer-term implication is that operator-identity-based defenses (user-agent matching, IP-range blocking) are losing efficacy as operators learn to evade them. The defenses that work are behavior-based and session-level, and they need longer observation windows to be accurate. That detection work is expensive, which is why publisher-side defense increasingly costs as much per request as the underlying content access. The rise in AI-traffic share from 2.6% to roughly 10% is close to a fourfold increase in share; because the new traffic needs the costlier session-level detection, the corresponding increase in publisher anti-bot spend is plausibly larger still, on the order of 10×.

On the publisher side, pay-per-crawl monetization is a rational response to costs that the existing anti-bot stack cannot absorb economically. Defense via blocking is a sunk cost. Defense via payment-extraction recovers the cost from the operator. The DataDome taxonomy revisions are tracking the rise of operators that have to be either monetized or invisibly tolerated, because pure blocking does not work against them.

Where the taxonomy lands next

By Q4 2026, expect a sixth or seventh category to be added — probably an “agentic browser” tier covering the consumer-facing AI browsers (Arc Search, Perplexity Comet, others) where the user-intent is human but the request-issuing entity is an AI. The taxonomy stabilization point is not visible in the data yet. The continued expansion is the structural signal that the AI-traffic ecosystem is still in active rearrangement.

The DataDome 12-month update is a useful counter to the AI-crawler narrative that frames the issue as a static “publishers vs scrapers” conflict. The reality is several distinct AI-traffic categories, each with its own operator profile, intent pattern, and defensive response. A publisher who configures one blanket policy for “AI traffic” is making a 2024-era mistake. By 2026 the configuration has to cover five to seven categories, and the categories themselves are still moving.
