LLM Scraping Beats Traditional at $0.0007 a Page
Per-page LLM extraction in Q2 2026: Gemini Flash 2.5 $0.0007, GPT-4o-mini $0.0012, Haiku 4.5 $0.007, Sonnet 4.6 $0.026. Traditional scraping still cheapest at $0.0003-$0.001 but the LLM floor is now within 2× — flips build-vs-buy on volatile targets.
The token-pricing race among Anthropic, OpenAI, and Google has pulled per-page LLM extraction cost down to $0.0007 on Gemini Flash 2.5 and $0.0012 on GPT-4o-mini. The heavier reasoning-tier models — Claude Sonnet 4.6, GPT-4o — sit at $0.02-0.03 per page. Traditional hand-coded scraping on residential proxy plus compute still wins on raw unit cost at $0.0003-$0.001 per page. But the gap is now small enough to flip the build-versus-buy math on any target where selectors break more than twice a year.
The relevant number is no longer “is LLM extraction expensive.” It is “is LLM extraction within 2× of traditional, and how often does the target break.”
This is why the architecture question changes: for volatile targets, maintenance time can dominate the remaining per-page delta.
The cost math
A typical scrape target page is 200-500 KB of raw HTML. Stripping <script>, <style>, <svg>, and navigation chrome reduces this by 60-80%, leaving 40-100 KB of structured content. At roughly 4 characters per token, that lands at 6,000-12,500 input tokens per page after preprocessing. Structured extraction adds roughly 500 output tokens for a typical e-commerce or contact-data record.
The cost matrix at Q2 2026 published prices, for a 6,250-input / 500-output token extraction:
| Model | Input $/M | Output $/M | Per page |
|---|---|---|---|
| Gemini Flash 2.5 | $0.075 | $0.30 | |
| GPT-4o-mini | $0.15 | $0.60 | |
| Claude Haiku 4.5 | $0.80 | $4.00 | |
| GPT-4o | $2.50 | $10.00 | |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
The 40× spread between Gemini Flash 2.5 and Claude Sonnet 4.6 is real economic content, not just a tier-labeling artifact. Flash-tier models trade off some accuracy on edge-case HTML for an order-of-magnitude lower cost. For high-volume extraction on standardized page shapes (e-commerce product listings, job postings, real-estate listings), Flash-tier accuracy is usually sufficient. For one-off extraction on novel page structures, the reasoning-tier markup recovery rate justifies the 30-40× price premium.
The traditional-scraper baseline:
- Residential proxy bandwidth: $5-10 per GB at market rates. A 100 KB request costs roughly $0.0005-0.001.
- Compute (Apify, equivalent): ~$0.0001 per typical actor run.
- Per-page traditional cost: $0.0003-0.001 with proxy, sub-$0.0001 without.
Traditional still wins on raw unit cost, but Gemini Flash 2.5 at $0.00065 is now inside the proxy-cost band. If the target tolerates datacenter IPs (no anti-bot), the gap collapses to nothing.
When traditional still wins
The build-vs-buy decision pivots on three variables, and traditional scraping wins when all three favor it:
Volume. Above ~1 million pages per month per target, the per-page cost arithmetic dominates. A 10× cost difference at 1M pages is $1,000-$26,000 per month. The marginal cost of a hand-coded scraper is structurally below the marginal cost of an LLM-loop scraper. Volume operators always optimize.
Stability. A target site that has not changed its layout in 18 months is the worst-case for LLM-led extraction. The LLM cost is paid every request, while the equivalent hand-coded selector cost is paid once at scraper-build time and then amortized to zero per request.
Output shape predictability. If the extraction target is a tabular schema with known fields (price, SKU, title, image URL), a hand-coded scraper produces deterministic output. LLM extraction is non-deterministic — the same input HTML may produce slightly different field formatting across runs. Downstream systems that care about bit-stable output prefer traditional.
The classic example: a price-monitoring system scraping Amazon’s main product pages a million times per day. Volume favors traditional. Stability favors traditional (Amazon does not redesign its main PDP layout monthly). Output shape predictability favors traditional. The LLM-led approach is the wrong tool.
When LLM extraction wins now
LLM-led extraction wins on the inverse profile:
Low volume. Sub-100k pages per month per target. The per-page cost arithmetic is dominated by engineer maintenance time, not by direct cost. At 50,000 pages/month, a 10× cost difference is $50-1,300 per month — not enough to fund the engineer-hour needed to maintain a hand-coded scraper on a frequently-changing target.
High volatility. Sites that A/B-test layouts, ship JavaScript framework updates frequently, or operate behind anti-bot stacks that mutate every quarter. A hand-coded scraper for these targets requires meaningful engineering capacity to maintain. The self-healing scraper pattern — hand-coded selectors with LLM fallback on failure — strikes the right balance.
Variable output shape. Lead-extraction tools that scrape company About pages, RSS-aggregating news scrapers, and any target where the relevant content shape is not fully known in advance. LLM extraction can interpret context that a regex-based selector cannot.
The break-even math for self-healing scrapers
The hybrid self-healing pattern — hand-coded selectors first, LLM regeneration on failure, persisted selectors until next break — produces a different cost curve. The LLM cost is incurred only at the point of layout change, not on every request.
For a target that breaks once a quarter on a 100k-pages/month scraper:
- Pure traditional cost (with one engineer-hour per quarter to fix breaks): ~$50 in proxy + $400 engineer cost = $450/quarter
- Self-healing hybrid (proxy + LLM only on the 100 retry attempts after layout change): ~$50 proxy + $1 LLM = $51/quarter
The hybrid is roughly 9× cheaper than pure traditional once engineer time is priced in. Pure LLM-led on every request would be ~$200/month = $600/quarter — more than traditional. The hybrid is the dominant configuration for the volatile-medium-volume use case.
What this means for Apify publishers
For Apify Store publishers running pay-per-event actors on volatile targets, the token-math floor changes pricing strategy in two ways.
Per-call price floor. The marginal cost of running an LLM-fallback path on a self-healing actor is roughly $0.001-0.007 per page (using Flash-tier or Haiku-tier respectively). PPE pricing on the Store ranges from $0.0015 to $0.005 per record. The marginal cost is now a meaningful share of the per-call price for any actor using LLM-loop architecture. Publishers who do not price in the LLM-call cost are absorbing margin compression they have not measured.
Tier-pricing emergence. The 40× cost spread between Flash-tier and reasoning-tier models means publishers can credibly offer two-tier pricing: a cheap “best-effort” tier that uses Flash-tier for extraction, and a “high-fidelity” tier that uses Sonnet 4.6 or GPT-4o for the same target. Few Apify actors do this today. The publishers who introduce explicit tier-based pricing will be the first to capture the high-margin tier of buyers who care about extraction quality enough to pay 30× for it.
The longer-term implication for the token economics of agent-driven scraping is that the per-page LLM cost is converging with the per-page proxy cost. When the two costs cross — which Flash-tier already does for any target accessible via datacenter IPs — the architectural question becomes “why are we maintaining a separate scraping infrastructure at all?” The answer for high-volume operators stays the same. The answer for everyone else is starting to change.
Sources
- Anthropic API pricing — Claude Haiku 4.5, Sonnet 4.6 token costs
- OpenAI API pricing — GPT-4o-mini, GPT-4o token costs
- Google Cloud Vertex AI pricing — Gemini Flash 2.5, Pro 2.5 token costs
- Signal Census: Self-Healing Scrapers in Production — companion architecture analysis
- Signal Census: Cost per 1,000 Pages — Q1 2026 Benchmark — vendor pricing baseline