More AI Crawlers Ignore robots.txt Than Respect It
Of ten major AI crawlers in 2026 logs, four reliably respect robots.txt (GPTBot, ClaudeBot, Google-Extended, CCBot). The other six show inconsistent or zero compliance — PerplexityBot and Bytespider lead the bypass list. The gap is what Cloudflare's pay-per-crawl monetizes.
Of the ten AI crawler user-agents most frequently observed in 2026 server logs, four reliably respect robots.txt directives and six do not. The compliant cohort is dominated by the foundation labs with public reputations to defend: OpenAI’s GPTBot, Anthropic’s ClaudeBot, Google-Extended, and CommonCrawl’s CCBot. The non-compliant cohort is dominated either by second-tier search-and-retrieval products with weaker brand exposure or by crawlers operating from rotating identities that obscure attribution entirely.
The compliance gap is the practical foundation of the AI-scraping economy. It is also the gap that Cloudflare’s pay-per-crawl product monetizes: if crawlers complied on their own, publishers would have no need for a separate enforcement layer.
The compliance band
The ten crawler user-agents seen most frequently across major publisher CDN logs in early 2026, with observed robots.txt compliance:
| Crawler | Operator | robots.txt compliance |
|---|---|---|
| GPTBot | OpenAI | Yes — respects directives, documents user-agent |
| ClaudeBot | Anthropic | Yes — respects directives, publishes IP ranges |
| Google-Extended | Google AI training | Yes — opt-out via robots.txt, separate from Googlebot |
| CCBot | CommonCrawl | Yes — long-standing compliance |
| Bytespider | ByteDance | Partial — observed bypassing on some publisher sites |
| PerplexityBot | Perplexity | No / inconsistent — repeatedly observed ignoring directives |
| Amazonbot | Amazon Alexa / AI | Partial — claims compliance, observed gaps |
| Applebot-Extended | Apple Intelligence | Yes — respects opt-out |
| Meta-ExternalAgent | Meta AI | Partial — compliance observed inconsistently |
| Unidentified UA strings | Mixed (residential proxies, scrapers, agents) | No — by construction |
The compliant cohort represents roughly 65-70% of observed AI-crawler traffic by volume. The non-compliant cohort represents 30-35% but is growing faster — both because more entrants (agentic browsers, smaller AI products) operate without crawler-identity discipline, and because rotating-residential-IP scraping is now cheap enough that “crawler” and “scraper” have collapsed into the same operational layer.
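One way "observed compliance" of this kind could be measured is to replay server-log records against the site's own robots.txt and count requests that the policy disallows. The sketch below is a minimal illustration using Python's standard `urllib.robotparser`; the robots.txt policy, the log records, and the `count_violations` helper are all hypothetical and are not taken from the logs discussed above.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Hypothetical publisher policy: opt two AI crawlers out of the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical, hand-written log records: (user-agent token, requested path).
LOG = [
    ("GPTBot", "/robots.txt"),
    ("PerplexityBot", "/robots.txt"),
    ("PerplexityBot", "/articles/1"),
    ("PerplexityBot", "/articles/2"),
    ("CCBot", "/articles/1"),
]


def count_violations(robots_txt, log):
    """Count, per user-agent, the requests that robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = Counter()
    for agent, path in log:
        if path == "/robots.txt":
            continue  # fetching the policy file itself is not counted
        if not parser.can_fetch(agent, path):
            violations[agent] += 1
    return violations


violations = count_violations(ROBOTS_TXT, LOG)
```

A count like this only covers traffic that announces a recognizable user-agent. Requests from unidentified UA strings never match a named rule, which is why that row of the table is non-compliant by construction.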
Why the gap exists
Three structural forces produce the compliance asymmetry.
Reputation exposure. OpenAI, Anthropic, Google, and Apple have public reputations to protect. A documented case of one of their crawlers bypassing robots.txt directives at a major publisher generates news coverage, regulatory questions, and meaningful trust damage with enterprise customers. The compliance is rational risk management, not benevolence.
Operational identification. A crawler that publishes its user-agent string and IP ranges is verifiable. A publisher can confirm whether the crawler hitting their site is actually the labeled one. The compliant cohort all do this. The non-compliant cohort either uses inconsistent UA strings or operates from residential IP pools that cannot be attributed to a specific corporate operator.
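The verification step described above can be sketched as a check of the claimed user-agent against the operator's published IP ranges. This is a minimal illustration with Python's standard `ipaddress` module; `ExampleBot`, the `PUBLISHED_RANGES` table, and the `is_verified_crawler` helper are hypothetical, and the CIDR blocks are reserved documentation ranges rather than any real operator's addresses.

```python
import ipaddress

# Hypothetical published ranges, using reserved documentation address blocks.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "198.51.100.0/24"],
}


def is_verified_crawler(claimed_agent, remote_ip, published=PUBLISHED_RANGES):
    """Return True only if remote_ip falls inside a range published for claimed_agent."""
    ranges = published.get(claimed_agent)
    if ranges is None:
        return False  # operator publishes nothing, so the claim cannot be verified
    address = ipaddress.ip_address(remote_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in ranges)
```

For example, `is_verified_crawler("ExampleBot", "192.0.2.17")` is true, while the same user-agent arriving from `203.0.113.5` is not verified. A crawler with no published ranges fails the check regardless of the label it sends, which is the situation the non-compliant cohort creates.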
Economic incentive. A crawler that respects robots.txt and loses access to publishers actively opting out is paying a cost — the content it cannot index. A crawler that bypasses gets the content. The cost of bypass is the risk of detection plus the risk of legal action. For operators with no public-facing brand to lose, the cost-of-bypass is low and the value-of-bypass is non-zero. The math favors bypass.
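The incentive argument can be written as a simple expected-cost comparison. The sketch below is illustrative only: the `bypass_pays_off` helper is hypothetical, and every number in the usage note is an invented placeholder in arbitrary units, not a measurement.

```python
def bypass_pays_off(content_value, p_detect, detection_cost, p_legal, legal_cost):
    """Return True when the value of bypassed content exceeds the expected cost of bypass."""
    # Expected cost = chance of detection * its cost + chance of legal action * its cost.
    expected_cost = p_detect * detection_cost + p_legal * legal_cost
    return content_value > expected_cost
```

With placeholder inputs, an operator with a brand to lose might face `bypass_pays_off(100, 0.5, 1000, 0.1, 5000)`, which is false because the expected cost (about 1000) dwarfs the content value. An unattributable operator with little reputational exposure might face `bypass_pays_off(100, 0.5, 10, 0.01, 5000)`, which is true because the expected cost is only about 55.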
The result is a market structure where the largest, most-resourced AI operators absorb the compliance cost, and the smaller operators free-ride by taking the opted-out publisher data that the compliant cohort leaves untouched. The compliant cohort effectively subsidizes the non-compliant by maintaining the perception that AI-crawler traffic is generally well-behaved.
What changes the math
The compliance asymmetry held throughout 2024-2025 because the publisher-side enforcement options were weak. Detection required server-log analysis. Blocking required IP-range maintenance. Legal action required attribution that the non-compliant operators specifically defeated.
Three 2026 developments shift this.
Cloudflare’s pay-per-crawl. The HTTP 402 (Payment Required) response with managed payment integration changes the enforcement layer from “block or allow” to “charge per request”. A non-compliant crawler that the edge identifies as a crawler can no longer free-ride invisibly: each of its requests either pays the publisher or fails the payment challenge. The economic asymmetry that favored bypass shifts when bypass costs money rather than just legal risk.
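The shift from “block or allow” to “charge per request” can be sketched as a three-way decision on the publisher side. This is not Cloudflare's API: the `crawl_response` function, the `ASKING_PRICE_USD` value, and the idea of passing the offered price as a plain argument are all hypothetical simplifications. Only the 402 status code itself is standard HTTP.

```python
from http import HTTPStatus

ASKING_PRICE_USD = 0.01  # hypothetical per-request price set by the publisher


def crawl_response(is_ai_crawler, offered_price_usd=None,
                   asking_price_usd=ASKING_PRICE_USD):
    """Pick a status code for one request under a charge-per-request policy."""
    if not is_ai_crawler:
        return HTTPStatus.OK  # ordinary visitors are served as usual
    if offered_price_usd is not None and offered_price_usd >= asking_price_usd:
        return HTTPStatus.OK  # crawler has agreed to pay at least the asking price
    return HTTPStatus.PAYMENT_REQUIRED  # 402: no offer, or offer too low
```

The decision depends entirely on the `is_ai_crawler` input, so a scheme like this is only as strong as the classification that feeds it.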
DataDome and HUMAN’s behavioral classification. Behavioral fingerprinting now distinguishes residential-IP scraping from genuine residential user traffic with high accuracy on most publisher sites. The “I’m running on residential proxies so I look like a real user” defense is weaker than it was even a year ago.
The EU AI Act enforcement frontier. Article 5 enforcement attention has built EU DPA expertise around scraping cases. The next enforcement wave, visible in noyb filings and DPA decisions, extends beyond facial-image scraping into broader AI-training-data acquisition. Operators that combine ignoring robots.txt with harvesting EU residents’ data sit at the intersection of two enforcement regimes.
The compliance posture for Apify actors
The implications for Apify Store publishers run in two directions.
As crawlers. An Apify actor that scrapes publishers without honoring robots.txt sits in the non-compliant cohort, regardless of whether the actor identifies itself with a custom UA string. Its operator is exposed to the same enforcement frameworks that apply to Bytespider or rogue residential-IP scrapers. The historical defense (“we are a small operator, no one is watching us”) gets weaker as publisher-side detection tooling improves. Documenting robots.txt compliance in the actor README and offering it as a default-on configuration is now a meaningful differentiator on the Store.
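A default-on compliance setting could look roughly like the gate below, which a scraper consults before each request. This is a sketch under stated assumptions, not Apify SDK code: the `may_fetch` function and its `respect_robots_txt` flag are hypothetical names, and the robots.txt body is passed in as text (assumed to be already downloaded) so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(url, user_agent, robots_txt, respect_robots_txt=True):
    """Return True if the scraper should request url under the site's robots.txt."""
    if not respect_robots_txt:
        return True  # the operator explicitly opted out of the default
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical site policy and actor name, for illustration only.
SITE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
allowed = may_fetch("https://example.com/public", "ExampleActor", SITE_ROBOTS)
blocked = may_fetch("https://example.com/private/page", "ExampleActor", SITE_ROBOTS)
```

Making the flag default to `True` means a user has to take a deliberate, documentable action to leave the compliant cohort, which is what a README claim of compliance would need to rest on.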
As targets. Apify actors hosted on the platform are themselves crawlable web endpoints. The Apify documentation, the Store category browser, and individual actor pages are all subject to whatever robots.txt policy Apify maintains. Publishers who care about whether their actor metadata is being ingested into competing AI-training corpora should check Apify’s policies and the Store’s response headers — the same logic that applies to any publisher applies to the marketplace itself.
The longer-term equilibrium will not look like 2024’s “robots.txt is advisory, mostly honored.” It will look more like financial market regulation: enforced compliance for the largest operators, monitored compliance for the mid-tier, and a long tail of small operators that operate below the enforcement threshold because the cost of pursuing them exceeds the recovery. The compliant operators will increasingly compete on the basis of their compliance — not as a brand attribute but as a regulatory and licensing advantage that the non-compliant cannot match.
Sources
- Cloudflare AI crawler traffic reports
- DataDome AI crawler taxonomy and observed bypass patterns
- robots.txt specification
- Signal Census: Cloudflare Pay-Per-Crawl Reshapes Scraping
- Signal Census: EU AI Act Article 5 Enforcement
- Signal Census: Stealth Is Dead 2026 — adjacent anti-bot infrastructure analysis