Anti-bot & legal · 4 min read

More AI Crawlers Ignore robots.txt Than Respect It

Of ten major AI crawlers in 2026 logs, four reliably respect robots.txt (GPTBot, ClaudeBot, Google-Extended, CCBot). The other six show inconsistent or zero compliance — PerplexityBot and Bytespider lead the bypass list. The gap is what Cloudflare's pay-per-crawl monetizes.

By Signal Census Editorial

Of the ten AI crawler user-agents most frequently observed in 2026 server logs, four reliably respect robots.txt directives and six do not. The compliant cohort is dominated by the foundation labs with public reputations to defend: OpenAI’s GPTBot, Anthropic’s ClaudeBot, Google-Extended, and CommonCrawl’s CCBot. The non-compliant cohort is dominated either by second-tier search-and-retrieval products with weaker brand exposure or by crawlers operating from rotating identities that obscure attribution entirely.

The compliance gap is the practical foundation of the AI-scraping economy. It is also the gap that Cloudflare’s pay-per-crawl product monetizes: if crawlers complied on their own, there would be no need for a publisher-side enforcement layer.

The compliance band

The ten crawler user-agents seen most frequently across major publisher CDN logs in early 2026, with observed robots.txt compliance:

| Crawler | Operator | robots.txt compliance |
| --- | --- | --- |
| GPTBot | OpenAI | Yes: respects directives, documents user-agent |
| ClaudeBot | Anthropic | Yes: respects directives, publishes IP ranges |
| Google-Extended | Google AI training | Yes: opt-out via robots.txt, separate from Googlebot |
| CCBot | CommonCrawl | Yes: long-standing compliance |
| Bytespider | ByteDance | Partial: observed bypassing on some publisher sites |
| PerplexityBot | Perplexity | No / inconsistent: repeatedly observed ignoring directives |
| Amazonbot | Amazon Alexa / AI | Partial: claims compliance, observed gaps |
| Applebot-Extended | Apple Intelligence | Yes: respects opt-out |
| Meta-ExternalAgent | Meta AI | Partial: compliance observed inconsistently |
| Unidentified UA strings | Mixed (residential proxies, scrapers, agents) | No: by construction |

The compliant cohort represents roughly 65-70% of observed AI-crawler traffic by volume. The non-compliant cohort represents 30-35% but is growing faster — both because more entrants (agentic browsers, smaller AI products) operate without crawler-identity discipline, and because rotating-residential-IP scraping is now cheap enough that “crawler” and “scraper” have collapsed into the same operational layer.
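
For readers who want to see what "respecting robots.txt" means mechanically, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content and the `is_allowed` helper are illustrative assumptions, not taken from any publisher's real configuration. A compliant crawler performs a check of this kind before each request; a non-compliant one simply skips it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption for this sketch): opts out of four
# AI crawler tokens while leaving every other user-agent unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

The file only expresses a preference. Nothing in the protocol prevents a crawler from fetching a disallowed URL, which is why compliance has to be observed in logs rather than assumed.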

Why the gap exists

Three structural forces produce the compliance asymmetry.

Reputation exposure. OpenAI, Anthropic, Google, and Apple have public reputations to protect. A documented case of one of their crawlers bypassing robots.txt directives at a major publisher generates news coverage, regulatory questions, and meaningful trust damage with enterprise customers. The compliance is rational risk management, not benevolence.

Operational identification. A crawler that publishes its user-agent string and IP ranges is verifiable. A publisher can confirm whether the crawler hitting their site is actually the labeled one. The compliant cohort all do this. The non-compliant cohort either uses inconsistent UA strings or operates from residential IP pools that cannot be attributed to a specific corporate operator.
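
The verification step described above can be sketched as a CIDR membership check with Python's `ipaddress` module. The crawler name `ExampleBot` and the ranges below are placeholders (RFC 5737 documentation blocks), not any operator's real published addresses; a real deployment would load the operator's published list instead.

```python
import ipaddress

# Hypothetical published ranges. These are RFC 5737 documentation blocks,
# used here only as stand-ins for an operator's real published list.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "198.51.100.0/24"],
}


def is_verified_crawler(claimed_agent: str, remote_ip: str,
                        published: dict = PUBLISHED_RANGES) -> bool:
    """Check whether remote_ip falls inside the ranges published for claimed_agent."""
    networks = published.get(claimed_agent)
    if not networks:
        # No published ranges means the claim cannot be verified.
        return False
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in networks)
```

A request that claims a known user-agent but arrives from outside the published ranges is either spoofed or routed through infrastructure the operator does not disclose. Either way it cannot be attributed.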

Economic incentive. A crawler that respects robots.txt and loses access to publishers actively opting out is paying a cost — the content it cannot index. A crawler that bypasses gets the content. The cost of bypass is the risk of detection plus the risk of legal action. For operators with no public-facing brand to lose, the cost-of-bypass is low and the value-of-bypass is non-zero. The math favors bypass.
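
One way to write that trade-off down, using symbols that are illustrative and do not come from any measured data: let V be the value of the content gained by bypassing, p_detect and p_legal the probabilities of detection and of legal action, and C_detect and C_legal the costs each would impose on the operator.

```latex
\text{bypass is rational when}\quad
V > p_{\text{detect}}\, C_{\text{detect}} + p_{\text{legal}}\, C_{\text{legal}}
```

For an operator with no public-facing brand, C_detect is close to zero and p_legal is small when attribution is hard, so the right-hand side stays small and even a modest V satisfies the inequality.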

The result is a market structure where the largest, most-resourced AI operators absorb the compliance cost, and the smaller operators free-ride on the publisher data the compliant cohort would have indexed anyway. The compliant cohort effectively subsidizes the non-compliant by maintaining the perception that AI-crawler traffic is generally well-behaved.

What changes the math

The compliance asymmetry held throughout 2024-2025 because the publisher-side enforcement options were weak. Detection required server-log analysis. Blocking required IP-range maintenance. Legal action required attribution that the non-compliant operators specifically defeated.
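
As a rough illustration of the server-log analysis involved, the sketch below scans access-log lines in the common "combined" format and flags requests where a self-identified crawler fetched a path its robots.txt rules disallow. The log layout, the `find_violations` helper, and the crawler token list are assumptions for the example.

```python
import re
from urllib.robotparser import RobotFileParser

# Assumes the "combined" access-log layout:
# ip ident user [time] "METHOD path protocol" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def find_violations(log_lines, robots_txt, crawler_tokens):
    """Return (token, ip, path) for each request that ignored a Disallow rule."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for line in log_lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        agent = match["agent"].lower()
        for token in crawler_tokens:
            if token.lower() in agent and not parser.can_fetch(token, match["path"]):
                violations.append((token, match["ip"], match["path"]))
    return violations
```

This only catches crawlers that announce themselves in the user-agent header. Traffic with a spoofed or generic user-agent passes through untouched, which is the attribution problem the paragraph above describes.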

Three 2026 developments shift this.

Cloudflare’s pay-per-crawl. The HTTP 402 response with managed payment integration changes the enforcement layer from “block or allow” to “charge per request”. A non-compliant crawler can no longer free-ride invisibly — every request either pays the publisher or fails the payment challenge. The economic asymmetry that favored bypass shifts when bypass costs money rather than just legal risk.
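
To make the "charge per request" idea concrete, here is a minimal sketch of a publisher-side decision function that answers identified AI crawlers with HTTP 402 Payment Required unless the request carries a valid payment token. It is not Cloudflare's implementation or protocol: the header names, the price, and the token store are all hypothetical and exist only to illustrate the shape of the exchange.

```python
from http import HTTPStatus

# All names below are hypothetical, chosen for illustration only.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider")
PAYMENT_HEADER = "X-Example-Payment-Token"
PRICE_HEADER = "X-Example-Crawl-Price"
VALID_TOKENS = {"demo-token"}  # stand-in for a real payment verification step


def crawl_response(user_agent: str, headers: dict) -> tuple[int, dict]:
    """Return (status, response_headers) for an incoming request."""
    is_ai_crawler = any(t.lower() in user_agent.lower() for t in AI_CRAWLER_TOKENS)
    if not is_ai_crawler:
        return HTTPStatus.OK.value, {}
    if headers.get(PAYMENT_HEADER) in VALID_TOKENS:
        return HTTPStatus.OK.value, {}
    # Unpaid crawler request: quote a price instead of serving the content.
    return HTTPStatus.PAYMENT_REQUIRED.value, {PRICE_HEADER: "0.01"}
```

The sketch matches on the user-agent string alone, so on its own it charges only crawlers that identify themselves. Reaching crawlers that do not depends on the kind of behavioral classification covered next.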

DataDome and HUMAN’s behavioral classification. Behavioral fingerprinting now distinguishes residential-IP scraping from genuine residential user traffic with high accuracy on most publisher sites. The “I’m running on residential proxies so I look like a real user” defense is weaker than it was even a year ago.

The EU AI Act enforcement frontier. Article 5 enforcement attention has built EU DPA expertise around scraping cases. The next enforcement wave, visible in noyb filings and DPA decisions, extends beyond facial-image scraping into broader AI-training-data acquisition. Operators that combine ignoring robots.txt with harvesting EU-resident data sit at the intersection of two enforcement regimes.

The compliance posture for Apify actors

The implications for Apify Store publishers run in two directions.

As crawlers. An Apify actor that scrapes publishers without honoring robots.txt sits in the non-compliant cohort, regardless of whether the actor identifies itself with a custom UA string. The operator is liable under the same enforcement frameworks that govern Bytespider or rogue residential-IP scrapers. The historical defense ("we are a small operator, no one is watching us") gets weaker as publisher-side detection tooling improves. Documenting robots.txt compliance in the actor README and offering it as a default-on configuration is now a meaningful differentiator on the Store.
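
A default-on setting of this kind might look like the sketch below. It does not use the Apify SDK; `CrawlPolicy`, the `respect_robots_txt` field, and the `ExampleActorBot` user-agent are hypothetical names, and the robots.txt text is passed in directly so the example runs without network access.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


@dataclass
class CrawlPolicy:
    """Hypothetical per-run policy object for a scraping actor."""
    user_agent: str = "ExampleActorBot"
    respect_robots_txt: bool = True  # compliance is on unless explicitly disabled
    _parsers: dict = field(default_factory=dict, repr=False)

    def load_rules(self, origin: str, robots_txt: str) -> None:
        """Register the robots.txt text already fetched for an origin."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._parsers[origin] = parser

    def may_fetch(self, url: str) -> bool:
        """Decide whether the actor should request url."""
        if not self.respect_robots_txt:
            return True
        parts = urlsplit(url)
        parser = self._parsers.get(f"{parts.scheme}://{parts.netloc}")
        if parser is None:
            # Conservative choice for this sketch: no rules loaded, no fetch.
            return False
        return parser.can_fetch(self.user_agent, url)
```

Making the compliant path the default means a user has to opt out deliberately, and that opt-out is visible in the run configuration. This is the property a README can credibly document.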

As targets. Apify actors hosted on the platform are themselves crawlable web endpoints. The Apify documentation, the Store category browser, and individual actor pages are all subject to whatever robots.txt policy Apify maintains. Publishers who care about whether their actor metadata is being ingested into competing AI-training corpora should check Apify’s policies and the Store’s response headers — the same logic that applies to any publisher applies to the marketplace itself.

The longer-term equilibrium will not look like 2024’s “robots.txt is advisory, mostly honored.” It will look more like financial market regulation: enforced compliance for the largest operators, monitored compliance for the mid-tier, and a long tail of small operators that operate below the enforcement threshold because the cost of pursuing them exceeds the recovery. The compliant operators will increasingly compete on the basis of their compliance — not as a brand attribute but as a regulatory and licensing advantage that the non-compliant cannot match.
