AI agents · 5 min read

Self-Healing Scrapers in Production: The Math Changed

The 'scraper as agent' pattern — an LLM regenerates broken CSS selectors after a site changes — is now in production at Kadoa, AutoScraper, and OSS implementations. TCO math finally favors LLM tokens over scraper maintenance.

By Signal Census Editorial

The scraper-as-agent pattern, in which a large language model (LLM) regenerates broken CSS selectors when a target site changes its layout, has moved from blog-post novelty to production deployment. Kadoa, AutoScraper (with its newer Claude integration), llm-scraper, and a growing list of internal implementations at scraping shops now run this architecture in customer-facing pipelines.

The technique is straightforward. A traditional scraper hits a target page, parses it with hard-coded selectors, and either succeeds or returns nothing. A self-healing scraper does the same, but on failure it invokes an LLM with the raw HTML and asks it to identify the relevant data fields and propose new selectors. The new selectors are validated against the expected output shape and persisted for the next run.
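The control flow described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `extract`, `propose_selectors`, and `is_valid` are hypothetical callables standing in for the HTML parser, the LLM call, and the output-shape check.

```python
from typing import Callable, Dict, Optional, Tuple

Selectors = Dict[str, str]          # field name -> CSS selector
Record = Dict[str, Optional[str]]   # field name -> extracted value (None if missing)


def scrape_with_healing(
    html: str,
    selectors: Selectors,
    extract: Callable[[str, Selectors], Record],                # hypothetical: applies selectors
    propose_selectors: Callable[[str, Selectors], Selectors],   # hypothetical: wraps the LLM call
    is_valid: Callable[[Record], bool],                         # hypothetical: output-shape check
) -> Tuple[Record, Selectors]:
    """Return the extracted record and the selectors to persist for the next run."""
    record = extract(html, selectors)
    if is_valid(record):
        return record, selectors      # fast path: no LLM call at all

    # Slow path: ask the model for replacement selectors, then validate again.
    candidate = propose_selectors(html, selectors)
    healed = extract(html, candidate)
    if not is_valid(healed):
        # Do not persist selectors that still produce bad output.
        raise RuntimeError("regenerated selectors failed validation")
    return healed, candidate
```

The caller stores the returned selectors, so the LLM is only reached when the stored ones stop producing valid output.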

What matters here is not the architecture, which has been technically possible since GPT-4 shipped. What matters is that the total-cost-of-ownership math finally works.

The TCO inversion

The historical objection to LLM-in-the-loop scraping was per-call cost. Sending raw HTML (often 50KB+) to GPT-4 or Claude on every request, then parsing structured output, was 10x to 100x more expensive than running a hard-coded scraper. For a scraper that runs cleanly for months, the LLM premium was unjustifiable.

Two things changed.

First, the LLM cost dropped. With GPT-4o-mini, Claude Haiku 3.5, Gemini Flash, and the newer reasoning-tier models (o1-mini, Claude Sonnet 4.6), per-million-token prices fell by an order of magnitude between 2023 and 2026. Sending 50KB of HTML for structured extraction now costs roughly $0.0003 per request on Haiku 3.5 and somewhere between $0.0008 and $0.002 on the heavier models. That is competitive with the marginal cost of running a hand-coded scraper on Apify or a similar platform once you account for proxy and compute.
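Per-request figures like these are sensitive to two inputs: how many tokens the HTML turns into, and the current price per million tokens. A rough way to redo the arithmetic for any model is sketched below. The 4-bytes-per-token ratio is a common rule of thumb and only an assumption here (markup often tokenizes less efficiently), and the $0.10 rate is a placeholder, not a quoted price.

```python
def estimate_request_cost(
    html_bytes: int,
    usd_per_million_input_tokens: float,
    bytes_per_token: float = 4.0,        # assumption: rule of thumb, not a measured value
    output_tokens: int = 0,
    usd_per_million_output_tokens: float = 0.0,
) -> float:
    """Estimate the USD cost of one extraction request."""
    input_tokens = html_bytes / bytes_per_token
    total = (
        input_tokens * usd_per_million_input_tokens
        + output_tokens * usd_per_million_output_tokens
    )
    return total / 1_000_000


# Placeholder rate of $0.10 per million input tokens, 50 KB of raw HTML.
full_page = estimate_request_cost(50_000, usd_per_million_input_tokens=0.10)

# The same page after stripping 70% of the markup before submission.
trimmed_page = estimate_request_cost(15_000, usd_per_million_input_tokens=0.10)
```

Plugging in a provider's published rates and a measured token count for real pages gives a more reliable number than any fixed estimate.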

Second, the maintenance cost of hand-coded scrapers stayed flat or went up. Site rewrites, A/B tests, anti-bot updates, and JavaScript-framework migrations all break selectors at the same rate they always did. A team that maintains 100 scrapers spends meaningful engineering time chasing breaks, and even one engineer’s time per quarter, spread across 100 scrapers, is a non-trivial cost. Self-healing scrapers push that maintenance cost toward zero, in exchange for higher per-call LLM spend.

The crossover point depends on (a) how often the target site breaks, and (b) how expensive an engineer-hour is. For a scraper that breaks once a quarter on a site that the team monitors, hand-coded wins on TCO. For a scraper that breaks every two weeks on a site that the team can’t keep up with, self-healing wins. Most production scraper portfolios have many more of the second kind than the first.
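The crossover can be made concrete with a back-of-the-envelope model. The sketch below compares quarterly break-fix labor against quarterly LLM premium. Every number is a hypothetical input chosen for illustration, and the model deliberately ignores second-order costs such as data lost while a scraper is down.

```python
from dataclasses import dataclass


@dataclass
class ScraperProfile:
    breaks_per_quarter: float           # how often the target's layout breaks selectors
    hours_per_fix: float                # engineer time to diagnose and repair one break
    engineer_hourly_usd: float
    requests_per_quarter: int
    llm_premium_per_request_usd: float  # extra per-call spend of the self-healing design


def hand_coded_cost(p: ScraperProfile) -> float:
    """Quarterly maintenance labor for a hand-coded scraper."""
    return p.breaks_per_quarter * p.hours_per_fix * p.engineer_hourly_usd


def self_healing_cost(p: ScraperProfile) -> float:
    """Quarterly LLM premium, assuming maintenance labor is near zero."""
    return p.requests_per_quarter * p.llm_premium_per_request_usd


def cheaper_option(p: ScraperProfile) -> str:
    return "hand-coded" if hand_coded_cost(p) <= self_healing_cost(p) else "self-healing"


# Hypothetical portfolio entries: same traffic and rates, different break frequency.
stable = ScraperProfile(1, 3, 100.0, 1_000_000, 0.001)     # breaks once a quarter
volatile = ScraperProfile(6, 3, 100.0, 1_000_000, 0.001)   # breaks roughly every two weeks
```

With these inputs, `cheaper_option(stable)` comes out as hand-coded and `cheaper_option(volatile)` as self-healing. Changing the hourly rate or the per-request premium moves the crossover accordingly.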

The architectures that work

Three patterns have emerged in production self-healing scrapers, and they make different trade-offs.

Selector-regeneration on failure is the cheapest. The scraper runs hard-coded selectors first; only on failure does it call an LLM to propose replacements. The LLM cost is incurred only when the site has actually changed. The downside is that a partially broken scraper (some selectors work, others don’t) may not trigger the regeneration path and may silently produce incomplete data. Production implementations add output-shape validation to catch this.
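Output-shape validation for the partial-break case might look like the following. This is a sketch under assumptions: the schema in `EXPECTED_SHAPE` and the 10% threshold are illustrative choices, not values taken from any production system.

```python
from typing import Any, Dict, List

# Hypothetical schema: field name -> required Python type.
EXPECTED_SHAPE = {"title": str, "price": float, "url": str}


def shape_errors(record: Dict[str, Any], expected: Dict[str, type] = EXPECTED_SHAPE) -> List[str]:
    """List every field that is missing, empty, or of the wrong type."""
    errors = []
    for field, expected_type in expected.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif not isinstance(value, expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
        elif isinstance(value, str) and not value.strip():
            errors.append(f"{field}: empty")
    return errors


def needs_regeneration(records: List[Dict[str, Any]], max_bad_fraction: float = 0.1) -> bool:
    """Trigger the LLM path when too many records fail the shape check."""
    if not records:
        return True   # an empty result is treated as a full break
    bad = sum(1 for record in records if shape_errors(record))
    return bad / len(records) > max_bad_fraction
```

Checking every field, rather than only whether the scraper returned something, is what lets a run with one dead selector fall through to regeneration instead of shipping incomplete rows.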

LLM-first extraction sends the HTML to the LLM on every request, with no hard-coded selectors at all. This is the Firecrawl and Kadoa model. Cost per record is higher, but the system is fundamentally robust to layout changes. The trade-off is that LLM extraction is non-deterministic — the same input HTML may produce slightly different output across runs, which is a problem for downstream systems that expect bit-stable data.

Vision-based extraction, pioneered by Diffbot years ago and now seeing renewed attention, sends a screenshot of the rendered page to a multimodal LLM. The model identifies the relevant content visually rather than parsing HTML. Cost is higher still, but the system is robust to both DOM changes and JavaScript-rendered content. Production deployments are growing among scraping shops that target SPAs and other heavy-JS sites.

The newer pattern is hybrid: hard-coded selectors with LLM fallback for resilience, plus vision-based validation that the extracted data matches what a human would see. The compute budget for that combination is roughly 5x to 10x a vanilla hand-coded scraper, but the maintenance budget approaches zero. For a 100-scraper portfolio, that math is now defensible.

Token-cost mitigation in practice

Production self-healing scrapers do not blindly send full HTML to the LLM. The token-cost mitigation patterns are now well established.

A typical pre-processing pipeline strips <script>, <style>, <svg>, <iframe>, and navigation elements before LLM submission, reducing token count by 60–80% on most pages. Apify’s approach and similar implementations at Kadoa do exactly this. The trimmed HTML retains the structural and content information needed for selector inference, without the boilerplate that inflates the token bill.
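A trimming step of this kind can be sketched with Python's standard-library `html.parser`. The tag list below follows the one in the text, with `nav` standing in for navigation elements. It is an illustration, not the pipeline any named vendor runs, and it assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Elements whose entire subtree is dropped before the HTML goes to the LLM.
DROP_TAGS = {"script", "style", "svg", "iframe", "nav"}


class _Trimmer(HTMLParser):
    def __init__(self) -> None:
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0   # greater than zero while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags never open a subtree, so the depth is unchanged.
        if self.skip_depth == 0 and tag not in DROP_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def trim_html(html: str) -> str:
    """Return the HTML with dropped subtrees, comments, and doctype removed."""
    parser = _Trimmer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Class names and attributes on the remaining elements are preserved verbatim, which matters because selector inference depends on them.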

Caching the regenerated selectors is the other major lever. A self-healing scraper that regenerates on failure should not re-invoke the LLM on every subsequent request to the same target — it should persist the new selector and use it directly until the next failure. That moves the LLM cost from per-request to per-layout-change, which is a meaningful difference.
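A selector cache does not need to be elaborate. The sketch below persists selectors per target in a JSON file. The `SelectorStore` name, the file layout, and keying by a target identifier are all assumptions made for illustration; a production system would more likely use a database or key-value store.

```python
import json
from pathlib import Path
from typing import Dict, Optional


class SelectorStore:
    """Hypothetical JSON-backed store: target id -> {field name: CSS selector}."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self._data: Dict[str, Dict[str, str]] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def get(self, target: str) -> Optional[Dict[str, str]]:
        return self._data.get(target)

    def put(self, target: str, selectors: Dict[str, str]) -> None:
        self._data[target] = selectors
        # Write to a temporary file first so a crash cannot leave a half-written store.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self._data, indent=2))
        tmp.replace(self.path)
```

Each run reads selectors with `get`, and `put` is called only after a regeneration passes validation, so LLM calls track layout changes rather than request volume.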

The pricing implication for pay-per-event (PPE) actors

For Apify Store publishers, the self-healing pattern has two implications.

Maintenance burden drops. A publisher running 20 actors with self-healing fallback spends substantially less engineering time on break-fix work than one running 20 hand-coded actors. That capacity can be redirected to publishing more actors, supporting customers, or improving margins.

Per-call cost increases, but only on sites that change. For stable targets (long-running marketplaces, government data portals, established job boards), the LLM never gets invoked and the cost stays low. For volatile targets (consumer e-commerce, social media, frequently-updated SaaS UIs), the LLM gets invoked often and the per-call cost climbs. The publisher’s PPE pricing has to account for the volatility of the target, which today it mostly does not.

The downstream pricing implication is that PPE rates on volatile-target actors should be higher than rates on stable-target actors of equal complexity. Most actor pricing on the Store does not yet make this distinction. The publishers who introduce it explicitly — “high resilience tier” pricing for self-healing actors on volatile targets — will be capturing real value that the long tail is not yet capturing.

The frontier is hybrid actors that ship with a baseline hand-coded path, an LLM regeneration fallback, a vision validation layer, and transparent per-call pricing that reflects the actual compute mix used. Few of those exist on the Store today. By Q4 2026, several will.

