4.1 Billion Job Postings: The Labor-Intel Wholesale Stack
Lightcast, Revelio Labs, and Aura Intelligence all build their datasets by scraping job boards and company career pages. Revelio's COSMOS dataset alone covers 4.1B postings. None of them publishes per-record pricing, which makes labor intelligence the last big data vertical still sold on enterprise-only contracts.
There is a quiet, well-funded, profitable corner of the data market that operates almost entirely on scraping infrastructure but rarely appears in scraping-industry coverage: the labor-intelligence vendors. Lightcast, Revelio Labs, and Aura Intelligence sell job-market data — postings, hiring trends, workforce composition, skills demand — to PE firms, hedge funds, workforce planners, and economists.
Three publicly disclosed facts shape the segment:
- Revelio Labs’ COSMOS dataset covers 4.1 billion current and historic job postings across 6.6 million companies, sourced from “440k company websites + all major job boards + staffing-firm boards”
- Lightcast’s methodology explicitly states “scraping spiders identify new advertisements on a source-level basis,” with a 60-day deduplication window on title/company/location
- Aura Intelligence claims 2 billion+ workforce datapoints, sourced from job boards, profile data, sentiment, and SEC filings — explicitly multi-source to avoid LinkedIn-only bias
None of the three publishes per-record pricing. None publishes seat pricing. The pricing pages are blank or replaced with “request a demo.” This is the last large-scale data vertical that still operates on a full enterprise-sales motion, with deal sizes in the five- to seven-figure annual range and no public benchmark.
That opacity is itself the story. It tells you what kind of buyer the segment serves, why the segment is profitable, and what the entry points look like for new players.
Why the segment is opaque
Three structural reasons drive the lack of public pricing.
The buyer is sophisticated. PE due-diligence teams, hedge fund quant researchers, and Fortune 500 workforce planners do not buy data through self-serve checkout flows. They evaluate through procurement, negotiate through commercial teams, and sign annual contracts with custom terms. Public pricing helps the long-tail buyer; this segment does not have a long tail.
The product is dataset-plus-analytics. Lightcast does not sell raw job postings; it sells the data plus a query interface plus the pre-built taxonomies (titles normalized to occupation codes, skills extracted, salary bands estimated). Revelio sells COSMOS plus the ability to query workforce flows over time. Aura sells the multi-source aggregation plus the analytics layer that traces individual workforce events. The product is the whole stack, not the underlying records.
The buyer cohort is small. The realistic addressable market for “buy 10M+ job postings per quarter for analytical use” is hundreds of organizations, not thousands. At that buyer count, every customer is a custom deal. Published pricing would constrain the negotiation, not enable it.
The combined effect is a market where each of the leading vendors brings in eight- to nine-figure ARR while remaining invisible to the public scraping-industry conversation.
How the data actually flows
The provenance disclosures in the Lightcast and Revelio methodology pages are unusually clear about what scraping the segment depends on. Both vendors maintain in-house spider operations against named job boards (Indeed, LinkedIn, Glassdoor, ZipRecruiter, Monster, regional sites in non-US markets) and against company career pages (Workday, Greenhouse, Lever, SmartRecruiters, BambooHR — the same ATS surfaces documented in the Q1 2026 ATS census on this site).
The deduplication problem is harder than scraping itself. A single job posted by a company can appear on the company career page, on Indeed (sponsored), on LinkedIn (free indexing), on Glassdoor (mirrored), and on three or four staffing firm boards (re-listed). Lightcast’s 60-day window on title/company/location is the standard heuristic; Revelio uses a similar but proprietary approach.
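To make the heuristic concrete, here is a minimal Python sketch of a 60-day window keyed on title/company/location. It is an illustration, not Lightcast's implementation: the `Posting` fields, the whitespace-and-case normalization, and the choice to measure the window from the most recent sighting of a key (rather than the first) are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta

WINDOW = timedelta(days=60)  # window length taken from the methodology quote


@dataclass(frozen=True)
class Posting:
    # Hypothetical record shape, for illustration only.
    title: str
    company: str
    location: str
    seen_on: date
    source: str


def dedup_key(p: Posting) -> tuple:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return tuple(" ".join(s.lower().split()) for s in (p.title, p.company, p.location))


def deduplicate(postings: list) -> list:
    """Keep a posting only if its key was not seen in the previous 60 days."""
    last_seen = {}
    unique = []
    for p in sorted(postings, key=lambda p: p.seen_on):
        key = dedup_key(p)
        prev = last_seen.get(key)
        if prev is None or p.seen_on - prev > WINDOW:
            unique.append(p)
        # Assumption: every sighting, duplicate or not, refreshes the window.
        last_seen[key] = p.seen_on
    return unique


postings = [
    Posting("Data Engineer", "Acme", "Austin, TX", date(2024, 1, 1), "careers"),
    Posting("data  engineer", "ACME", "austin, tx", date(2024, 1, 10), "indeed"),
    Posting("Data Engineer", "Acme", "Austin, TX", date(2024, 4, 1), "linkedin"),
    Posting("Data Engineer", "Globex", "Austin, TX", date(2024, 1, 5), "indeed"),
]
unique = deduplicate(postings)
```

In this toy input the Indeed copy of the Acme role is dropped as a mirror of the career-page listing, while the April sighting counts as a new posting because it falls outside the window. Exact-match keys like these miss retitled or relocated re-listings, which is where the harder dedup work begins.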
The market opportunity for higher-quality dedup is real but technically deep. Most of the value in the labor-intel stack is not the scraping (which is commoditizable) but the dedup, normalization, and skills extraction layers (which are not).
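A toy sketch shows the shape of the normalization and skills-extraction layers, and also why they are hard to commoditize. Everything here is assumed for illustration: the occupation codes are placeholders rather than a real taxonomy, the skill vocabulary has four entries, and production systems rely on far larger taxonomies and statistical models rather than lookups.

```python
import re
from typing import Optional

# Placeholder occupation codes; a real system would map to a published taxonomy.
TITLE_TO_OCCUPATION = {
    "software engineer": "OCC-SOFTWARE-DEV",
    "software developer": "OCC-SOFTWARE-DEV",
    "data scientist": "OCC-DATA-SCI",
}

# Tiny illustrative vocabulary; real skill taxonomies run to tens of thousands of terms.
SKILL_VOCAB = {"python", "sql", "kubernetes", "machine learning"}

# Seniority markers and level suffixes that should not affect the occupation.
SENIORITY = re.compile(r"\b(senior|sr|junior|jr|lead|principal|staff|i{1,3})\b\.?")


def normalize_title(raw: str) -> Optional[str]:
    """Map a raw job title to an occupation code, or None if unknown."""
    cleaned = SENIORITY.sub(" ", raw.lower())
    cleaned = re.sub(r"[^a-z ]+", " ", cleaned)  # drop punctuation and digits
    cleaned = " ".join(cleaned.split())          # collapse whitespace
    return TITLE_TO_OCCUPATION.get(cleaned)


def extract_skills(description: str) -> list:
    """Return vocabulary skills that appear as whole words or phrases."""
    text = description.lower()
    return sorted(
        skill for skill in SKILL_VOCAB
        if re.search(r"\b" + re.escape(skill) + r"\b", text)
    )


code = normalize_title("Sr. Software Engineer II")
skills = extract_skills("We use Python, SQL and machine learning on Kubernetes.")
```

The lookup succeeds for the tidy example and returns `None` for any title outside the table. Closing that gap across millions of idiosyncratic titles is the multi-year investment the incumbents have already made.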
What the segment tells you about the broader scraping market
The labor-intel vendors are a useful case study because they sit at the intersection of three trends visible across the rest of the scraping market.
First, they validate that scraping infrastructure can support a real enterprise-software business. A vendor at $50mn+ ARR that depends on continuous scraping of dozens of source surfaces is proof that the scraping side can be operated reliably at the scale required for institutional customers. That is not a trivial claim; it requires the vendor to have solved the anti-bot stack, the proxy economics, the legal posture, and the operational resilience to keep collecting through site changes.
Second, they confirm where the margin actually sits. Lightcast does not charge for scraping; it charges for the analytics layer. The scraping is commodity; the value is the normalized, deduplicated, queryable data product on top. It is the same pattern as Clay’s $3.1bn valuation built on top of Apollo and ZoomInfo data, and the same pattern as Bright Data’s datasets product on top of its proxy network.
Third, they show that the long-tail Apify Store ecosystem operates in a different segment from the labor-intel vendors. A buyer who needs analytical-quality job-market data buys Lightcast or Revelio. A buyer who needs raw recent job postings for a specific narrow use case (research, lead-gen against hiring companies, vertical-specific recruiting tools) buys an Apify Actor. The two segments do not directly compete because the products are not substitutable, but they share the same underlying scraping infrastructure problem.
The wholesale-primitive path
For Apify Store publishers in the job-board vertical, the labor-intel vendors are not direct competitors, but they are useful reference points.
The implication: if a publisher wants to move upmarket from “raw job postings on demand” to “analytical-quality job-market data product,” the path runs through the same dedup, normalization, and skills-extraction work that Lightcast and Revelio invested years in. The market opportunity is real — the labor-intel vendors are not selling to every PE firm or hedge fund yet, and there is room for more players. But the engineering requirement is significant, and the customer-acquisition cost is enterprise-grade.
The more accessible path for an Apify-class publisher is to be the wholesale data primitive that the analytics layer buys from. A high-quality, well-maintained, broad-coverage job-board Actor that ships clean structured data is a more attractive component for a future analytics product than a scraping operation built from scratch. The publishers who position their Actors explicitly for this wholesale role (clean schema, predictable pricing, strong SLAs, broad target coverage) are the ones who will eventually sell into the analytics-vendor segment.
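As a rough illustration of what “clean schema” can mean in practice, the sketch below defines a versioned, validated record that serializes to one JSON object per line. The field names, the version string, and the validation rules are hypothetical choices, not an Apify or vendor standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

REQUIRED = ("posting_id", "source", "url", "title", "company", "location", "scraped_at")


@dataclass(frozen=True)
class JobPostingRecord:
    # Hypothetical wholesale record shape; all names are illustrative.
    posting_id: str
    source: str
    url: str
    title: str
    company: str
    location: str
    scraped_at: str                      # ISO 8601 timestamp, UTC
    posted_at: Optional[str] = None      # None when the board hides the date
    salary_min: Optional[float] = None
    salary_max: Optional[float] = None
    salary_currency: Optional[str] = None
    schema_version: str = "1.0"          # lets downstream buyers pin a contract

    def __post_init__(self) -> None:
        # Reject records with empty required fields instead of shipping them.
        for name in REQUIRED:
            if not getattr(self, name).strip():
                raise ValueError(f"missing required field: {name}")
        if (
            self.salary_min is not None
            and self.salary_max is not None
            and self.salary_min > self.salary_max
        ):
            raise ValueError("salary_min exceeds salary_max")


def to_json_line(record: JobPostingRecord) -> str:
    """Serialize with sorted keys so output is stable across runs."""
    return json.dumps(asdict(record), sort_keys=True)


record = JobPostingRecord(
    posting_id="example-123",
    source="example-board",
    url="https://example.com/jobs/123",
    title="Data Engineer",
    company="Acme",
    location="Austin, TX",
    scraped_at="2024-01-01T00:00:00+00:00",
)
line = to_json_line(record)
```

Explicit nulls for missing values, a pinned schema version, and validation at write time are the kinds of guarantees an analytics buyer would otherwise have to build on top of a raw scraper's output.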
The labor-intel market is unusual in how opaque it remains to the broader scraping conversation. It is also unusual in how clearly it demonstrates the structure of every other scraping-dependent enterprise data vertical: collection is commodity, value is in the analytics layer, and the publishers who win are the ones whose data is good enough to be re-sold by someone else.
Sources