Vendor landscape · 4 min read

Scrapy at 15: Still the Substrate Beneath Modern Scrapers

Scrapy launched 2010, hit 15 in 2025. PyPI downloads remain in the millions per month. Most modern AI-scraper infrastructure (Apify SDK, Zyte, internal scrapers) uses Scrapy as the underlying engine. Substrate-layer dominance answers the 'still relevant' question.

By Signal Census Editorial

Scrapy turned 15 in 2025. The framework that defined Python-based web scraping in the 2010s is, on the signals that can be measured publicly, still the substrate beneath most production scraping infrastructure in 2026. PyPI download counts remain in the millions per month. The dependency graphs of modern AI-scraping tools (Apify SDK, Zyte’s cloud platform, internal scrapers at serious data vendors) name Scrapy directly. The framework’s anniversary passed without much industry coverage because its dominance is now so structural that it does not generate headlines.

The “is Scrapy dead in the Playwright era” question that circulated through 2022-2024 has been answered. Playwright took the browser-automation surface. Scrapy kept the high-volume HTTP-scraping surface. The two coexist, and a meaningful share of production scraping uses both together — Scrapy as the orchestrator, Playwright as the rendering layer where JavaScript execution is required.

What the dependency-graph shows

A search of GitHub for Python projects with “scrapy” in their requirements files surfaces roughly 90,000 public repositories using Scrapy as of mid-2026. The count for Playwright (in Python projects, excluding general web-testing usage) is roughly 35,000. Beautiful Soup, the legacy HTML-parsing library, sits well above both at roughly 200,000, but most of those repositories are not production scrapers: they are tutorial code, one-off scripts, and academic projects.

The PyPI download numbers tell the same story:

  • Scrapy: ~3-4mn monthly downloads (consistent through 2024-2026)
  • Playwright (Python): ~5-6mn monthly downloads (includes web-testing usage, not just scraping)
  • Beautiful Soup: ~25mn monthly downloads (dominated by tutorial / one-off usage)
  • Selenium (Python): ~8mn monthly downloads (general automation, much broader than scraping)

The Scrapy figure of 3-4mn is high-signal because nearly all of it is scraping-related: the framework has no major non-scraping use case. As a rough proxy, the download number tracks the number of production scrapers being maintained, give or take a factor of 2.

Why Scrapy survived the Playwright wave

The 2022-2024 narrative was that Playwright would displace Scrapy because modern websites required JavaScript execution to scrape, and Scrapy’s pure-HTTP model couldn’t handle that. The narrative was partly right but missed the structural point.

Playwright is expensive. Running a real browser per scraped page costs 10-100× more compute than running Scrapy’s HTTP-only fetcher. For high-volume scraping (1M+ pages per day per target), that cost makes Playwright prohibitive, and Scrapy’s HTTP-only model remains the practical option at that volume.

Most pages do not need JavaScript execution. Estimates from scraping infrastructure vendors put the share of commercially scraped pages that can be fetched over pure HTTP, without JS rendering, at 60-75%. The remaining 25-40% are handled through Playwright integration; the majority stay on Scrapy.
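
The two estimates above (a browser fetch costing 10-100× an HTTP fetch, and 25-40% of pages needing JS) can be combined into a rough blended cost. The sketch below is back-of-envelope arithmetic only: it assumes one plain HTTP fetch costs 1 unit, and the function name is illustrative rather than part of any library.

```python
def hybrid_cost_per_page(js_share: float, browser_multiplier: float) -> float:
    """Average cost per page, in units where one plain HTTP fetch costs 1."""
    if not 0.0 <= js_share <= 1.0:
        raise ValueError("js_share must be between 0 and 1")
    # Non-JS pages cost 1 unit each; JS pages cost browser_multiplier units each.
    return (1.0 - js_share) * 1.0 + js_share * browser_multiplier


# The low and high corners of the ranges quoted in the text.
for js_share, multiplier in [(0.25, 10), (0.40, 100)]:
    hybrid = hybrid_cost_per_page(js_share, multiplier)
    print(
        f"JS share {js_share:.0%}, browser {multiplier}x: "
        f"hybrid {hybrid:.2f} vs all-browser {multiplier:.2f} units per page"
    )
```

Under these assumptions the hybrid setup costs roughly a third to two fifths of an all-browser setup, and nearly all of that cost comes from the JS-rendered minority.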

Scrapy + Splash / Scrapy + Playwright integration matured. The frameworks compose. A Scrapy spider can dispatch to Playwright for the JS-rendering subset of requests and stay on its native HTTP fetcher for the rest. This hybrid pattern is now the production default at most serious scraping operations. The “Scrapy vs Playwright” framing was always wrong; the production answer is Scrapy + Playwright.
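
A minimal sketch of that hybrid routing, assuming the scrapy-playwright plugin is the integration in use. The download-handler and reactor settings are the ones that plugin documents. The path prefixes and the `request_meta` helper are hypothetical, standing in for whatever rule a team uses to decide which pages need rendering.

```python
from urllib.parse import urlparse

# Settings documented by scrapy-playwright; in a real project they go in settings.py.
PLAYWRIGHT_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}

# Hypothetical: path prefixes on the target site known to need JS rendering.
JS_PATH_PREFIXES = ("/app/", "/search")


def request_meta(url: str) -> dict:
    """Build Request.meta: opt in to Playwright only where JS is needed."""
    path = urlparse(url).path
    if path.startswith(JS_PATH_PREFIXES):
        return {"playwright": True}
    return {}  # no flag: the request stays on Scrapy's regular HTTP path


# Inside a spider callback (requires scrapy and scrapy-playwright installed):
#     yield scrapy.Request(url, meta=request_meta(url), callback=self.parse)
```

Keeping the routing decision in one small function means the rendered share of traffic can be tuned per target without touching the spider’s parsing logic.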

What Scrapy still does well

There are three capability classes where Scrapy remains the best-in-class choice.

High-throughput crawling. Scrapy’s Twisted-based async architecture handles 1,000+ concurrent requests cleanly. Equivalent performance with Playwright requires substantially more compute. For crawl workloads (link discovery, sitemap traversal, breadth-first ingestion), Scrapy is the dominant choice and continues to be.
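
As a rough illustration of what that concurrency buys, the sketch below pairs a few real Scrapy setting names with a throughput ceiling derived from Little’s law. The setting values and the one-second mean response time are assumptions chosen for the example, not recommendations or measurements.

```python
# Real Scrapy setting names; the values are illustrative assumptions.
HIGH_THROUGHPUT_SETTINGS = {
    "CONCURRENT_REQUESTS": 1000,           # global cap on in-flight requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,  # per-domain cap, protects any single site
    "DOWNLOAD_TIMEOUT": 30,                # seconds before a slow response is abandoned
    "AUTOTHROTTLE_ENABLED": True,          # back off when a server slows down
}

SECONDS_PER_DAY = 86_400


def max_pages_per_day(concurrency: int, mean_response_s: float) -> float:
    """Ceiling from Little's law: throughput = in-flight requests / latency."""
    if mean_response_s <= 0:
        raise ValueError("mean_response_s must be positive")
    return concurrency / mean_response_s * SECONDS_PER_DAY


# Assumed 1.0 s mean response time across a broad crawl.
ceiling = max_pages_per_day(HIGH_THROUGHPUT_SETTINGS["CONCURRENT_REQUESTS"], 1.0)
print(f"{ceiling:,.0f} pages per day at most")
```

The figure is an upper bound: per-domain caps, politeness delays, retries, and parsing time all pull real throughput below it, which is why the global cap only pays off on crawls spread across many domains.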

Production-grade scraper engineering. The framework’s middleware system, settings architecture, and item pipeline abstractions support the kind of long-lived, maintainable scraper code that production environments need. Newer alternatives ship simpler APIs, but teams end up recreating this scaffolding once those tools reach production.
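
For readers who have not used the pipeline abstraction, here is a minimal sketch of an item pipeline. A Scrapy pipeline is a plain class exposing `process_item`; the class name, the `price` field, and the module path in the closing comment are hypothetical, and the example assumes items are plain dicts.

```python
class PriceNormalizationPipeline:
    """Item pipeline sketch: turn a scraped price string into a float."""

    def process_item(self, item, spider=None):
        raw = str(item.get("price", "")).strip()
        digits = raw.replace("$", "").replace(",", "")
        try:
            item["price"] = float(digits)
        except ValueError:
            # Leave an explicit None so later stages can decide what to do.
            item["price"] = None
        return item


# Enabled in settings.py with an entry such as (module path is hypothetical):
#     ITEM_PIPELINES = {"myproject.pipelines.PriceNormalizationPipeline": 300}
```

Because cleaning lives in its own stage, spiders stay focused on fetching and selecting, and the same pipeline can be reused across every spider in a project.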

Compatibility with the broader Python data stack. Scrapy items convert naturally to pandas DataFrames, integrate with SQLAlchemy, and feed cleanly into data-pipeline tooling. Newer scraping frameworks often live in their own ecosystem that requires translation at the data-handoff layer.

What Scrapy doesn’t do well in 2026

The framework’s limitations are real and bound where it gets used.

LLM-augmented extraction. Scrapy is built around CSS/XPath selectors. The newer pattern of “send HTML to LLM, get structured output” does not fit naturally into Scrapy’s item pipeline. The integration is buildable, but it requires custom code, and alternative tooling built around the self-healing scraper pattern handles it more cleanly.
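
One possible shape for that custom code is a selector-first extractor that calls a model only when the selector finds nothing. This is a sketch under stated assumptions, not an established Scrapy pattern: `extract_with_fallback` and both stub extractors are hypothetical, and no real model API is called.

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(html: str, by_selector: Extractor, by_llm: Extractor) -> dict:
    """Try the cheap selector path first; call the model only if it finds nothing."""
    value = by_selector(html)
    if value:
        return {"title": value, "source": "selector"}
    return {"title": by_llm(html), "source": "llm"}


# Stand-ins for illustration only: a real by_selector would wrap
# response.css(...).get(), and a real by_llm would call a model API.
def stub_selector(html: str) -> Optional[str]:
    start, end = html.find("<h1>"), html.find("</h1>")
    return html[start + 4:end] if start != -1 and end != -1 else None


def stub_llm(html: str) -> Optional[str]:
    return "title recovered by model"


print(extract_with_fallback("<h1>Widget</h1>", stub_selector, stub_llm))
print(extract_with_fallback("<div>Widget</div>", stub_selector, stub_llm))
```

Recording which path produced each value makes it possible to track how often selectors fail, which is the signal a self-healing setup would act on.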

JavaScript-heavy SPAs. Despite the Scrapy + Playwright integration, scraping a heavy single-page-application is more naturally done in a browser-first framework (Playwright directly, Stagehand, Browser-Use) than in Scrapy with Playwright as a backend.

Beginner accessibility. Scrapy’s learning curve is steep relative to newer frameworks. The Apify SDK, Crawlee, and Firecrawl’s APIs all expose simpler “scrape this URL” interfaces that get a beginner to first-success faster. Scrapy’s complexity is justified for production but not for prototyping.

What Scrapy means for the scraping market

The Scrapy substrate position matters for three reasons.

Vendor lock-in is weaker than vendor positioning implies. A team that built its scrapers on Scrapy can move between hosting providers (self-hosted, Apify, or Zyte, which was formerly Scrapinghub) without rewriting the scraper code itself. The framework provides genuine portability. Vendor marketing tends to obscure this by emphasizing platform-specific features that lock in the deployment but not the scraper logic.

Open-source matters as a moat. Zyte’s commercial survival is partly defended by its custody of the Scrapy maintainer team. The open-source dominance produces a soft moat for the company that maintains the project — they get first-mover advantage on new features, brand recognition, and the recruiting advantage of being where the Scrapy talent works.

Scraper-vendor benchmarks usually overstate differentiation. When vendor benchmarks compare proprietary scraping APIs (Bright Data Web Unblocker, ZenRows, Scrapfly), they ignore the Scrapy-based self-hosted alternative that has comparable capability at much lower per-call cost. The benchmarks are useful for comparing vendors against each other; they understate the case for staying on Scrapy + your own proxy and skipping the vendor layer entirely.

For Apify Store publishers, the Scrapy substrate is the default underlying technology behind most actors. The Apify SDK wraps Scrapy patterns; many actor implementations are essentially Scrapy spiders with platform-specific input/output handling. The framework continues to be the technical floor on which the platform-specific tooling builds.

Fifteen years on, the question worth asking is not whether Scrapy is still relevant but what would actually displace it. The answer is: probably nothing in the next 3-5 years. The successor would need to match Scrapy’s throughput, composability, and ecosystem maturity while offering meaningful new capability. No current framework does all three. Scrapy will likely still be the substrate when it turns 20.

