Why Glassdoor Scraping Is a GDPR Question, Not Article 5

EU AI Act Article 5 targets facial-image scraping for biometric DBs (Clearview pattern). Glassdoor review-scraping doesn't trigger Article 5 but falls under GDPR. Exposure is more diffuse, harder to litigate, but real for high-volume operators.

By Signal Census Editorial

The Apify Store tracks 77 Glassdoor-targeted scrapers with continuous data history, serving roughly 1,300 monthly active users. The broader review-data scraping segment — Glassdoor, Trustpilot, Yelp, Indeed reviews — holds 367 actors and 18,479 MAU. The segment is small relative to the 25,787-actor Apify catalog but disproportionately exposed to EU legal pressure, because the data being scraped is personal data under GDPR even when the reviews themselves are pseudonymous.

The common misreading is that EU AI Act Article 5 covers this category. It does not. Article 5 specifically targets one scraping pattern. Glassdoor scraping triggers a different, older, more diffuse regulatory regime — and the practical compliance posture is correspondingly different.

What Article 5 actually covers

The EU AI Act’s Article 5(1)(e) prohibition has applied since February 2, 2025 (the Act itself entered into force earlier, on August 1, 2024). The specific language bans “the placing on the market, the putting into service for this specific purpose, or the use of AI systems that create or expand facial recognition databases through the untargeted scraping of facial images from the internet or CCTV footage.”

The Clearview AI fact pattern is the template: scraping public-facing facial images at scale to build a biometric identification database for sale to police, security firms, or other downstream buyers. The Article 5 prohibition is narrow but binary: there is no defensible processing basis for the prohibited activity, and the penalty is the Act’s highest tier (up to €35 million or 7% of worldwide annual turnover, whichever is higher).
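The “whichever is higher” rule is a plain maximum of two figures. A minimal sketch of that arithmetic, assuming turnover is given in whole euros; the function and constant names are illustrative, and the sketch ignores any special treatment the Act gives particular kinds of undertakings.

```python
FIXED_CAP_EUR = 35_000_000      # fixed ceiling for prohibited practices
TURNOVER_SHARE_PERCENT = 7      # share of worldwide annual turnover


def max_article5_fine_eur(worldwide_annual_turnover_eur: int) -> int:
    """Return the upper bound of the fine: the higher of the two ceilings."""
    # Integer arithmetic keeps the result exact for whole-euro inputs.
    turnover_cap = worldwide_annual_turnover_eur * TURNOVER_SHARE_PERCENT // 100
    return max(FIXED_CAP_EUR, turnover_cap)
```

For an operator with €1 billion in turnover the ceiling is €70 million. Below €500 million in turnover, the fixed €35 million figure is the higher of the two.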

What Article 5 does not cover:

  • Text scraping of any kind, including reviews
  • Scraping of contact data (names, emails, phone numbers) without facial images
  • Scraping of structured data (prices, listings, jobs) where the personal data is incidental
  • Scraping for downstream uses that are not “create or expand facial recognition databases”

A Glassdoor scraper that pulls reviews — even reviews authored by named employees — does not place the operator in Article 5 territory. The data being collected is text. Even if the resulting dataset contains personal data, the absence of facial images and the absence of biometric-database purpose keeps the activity outside the Article 5 prohibition.

What GDPR says about review data

GDPR is the regime that actually governs Glassdoor scraping in the EU, and the analysis is more nuanced.

Glassdoor reviews are pseudonymous on the user-facing surface: they are signed with display names like “Software Engineer in London” or “Current Employee — Manager”. The display name on its own is not personal data. But the review text frequently names the employer, the team, and the office, and adds details that, combined with the role description, make the author re-identifiable within a small enough workforce.

GDPR Article 4(1) defines personal data as “any information relating to an identified or identifiable natural person.” The European Data Protection Board has consistently held that pseudonymized data which can be re-identified through combination with other available data is still personal data under GDPR. Most Glassdoor reviews of employers with fewer than ~500 employees per location meet this definition.
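The headcount rule of thumb can be turned into a coarse triage step at ingestion. The sketch below is illustrative only: the `Review` fields and the `headcount_by_site` lookup are hypothetical, and the 500-employee cutoff is this article’s rough figure, not a legal test.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Rough cutoff taken from the article's rule of thumb (not a legal threshold).
REIDENTIFIABLE_HEADCOUNT = 500


@dataclass
class Review:
    employer: str
    location: str
    role: str
    text: str


def likely_personal_data(
    review: Review, headcount_by_site: Dict[Tuple[str, str], int]
) -> bool:
    """Flag reviews whose author is plausibly re-identifiable at a small site."""
    headcount = headcount_by_site.get((review.employer, review.location))
    # Unknown headcount: take the conservative branch and treat as personal data.
    if headcount is None:
        return True
    return headcount < REIDENTIFIABLE_HEADCOUNT
```

Treating unknown sites as re-identifiable is a deliberate choice here: a false positive costs some data, while a false negative carries the legal exposure.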

The lawful-basis test in GDPR Article 6 becomes the operative constraint. A scraper operator processing Glassdoor reviews needs to identify one of six lawful bases:

  • Consent — not realistically obtainable from review authors
  • Contract — does not apply
  • Legal obligation — does not apply
  • Vital interests — does not apply
  • Public task — does not apply
  • Legitimate interest — the only credible basis, and requires a balancing test against the data subject’s rights

The legitimate-interest argument for review scraping varies in strength by downstream use. A competitive-intelligence platform aggregating sentiment across employers has a weak balancing case. An academic researcher studying labor-market signals has a stronger one. A scraping operator selling raw review datasets to anonymous buyers has the weakest case of all.

The Apify Glassdoor scrapers — where they sit

The 77 Glassdoor actors in the tracked Apify catalog cluster around four use-case patterns:

  • Job-board scraping that incidentally captures Glassdoor’s job listings (the largest cohort, dominated by multi-board aggregators)
  • Reviews scraping for competitive intelligence
  • Salary-data scraping for compensation benchmarking
  • Profile-data scraping for HR analytics

The legal exposure varies sharply across these. Job-listing scraping has the strongest legitimate-interest case — public-facing recruitment data with limited personal-data content. Review and salary scraping have the weakest cases, because the personal-data content is direct and the downstream uses are commercial.

The aggregated demand pattern is consistent with the legal-exposure pattern. The job-listing actors capture the bulk of the 1,300 monthly users; review and salary actors capture the long tail. Buyers who would be most exposed to GDPR enforcement are already self-selecting toward the lower-exposure use cases.

The practical compliance posture

For a scraping operator running Glassdoor or similar review-data actors in the EU, the compliance posture has three load-bearing components:

Geographic scoping. GDPR applies to processing of EU residents’ personal data regardless of where the processing happens. A US-based operator scraping reviews of EU employees is subject to GDPR. The common move, geofencing the scraper away from EU IP addresses, does not solve this. The exposure is about whose data is being processed, not where the scraping happens.
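In code terms, the scoping decision keys on the record, not on the scraper. A minimal sketch, assuming each record carries an ISO 3166-1 alpha-2 country code for the reviewed workplace as a proxy for where the author is; the function name is hypothetical and the result is a triage flag, not a legal determination.

```python
from typing import Optional

EU_MEMBER_STATES = frozenset({
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE", "GR",
    "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT", "RO", "SK",
    "SI", "ES", "SE",
})
# GDPR also applies in the non-EU EEA states.
EEA_NON_EU_STATES = frozenset({"IS", "LI", "NO"})
GDPR_COUNTRIES = EU_MEMBER_STATES | EEA_NON_EU_STATES


def treat_as_gdpr_scoped(workplace_country: Optional[str]) -> bool:
    """Decide scope from the data subject's likely location.

    Note what is absent: the scraper's egress IP or hosting region is not
    an input, because it does not change whose data is being processed.
    """
    if workplace_country is None:
        return True  # unknown location: assume in scope
    return workplace_country.upper() in GDPR_COUNTRIES
```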

Data minimization. GDPR requires processing only the personal data necessary for the stated purpose. A scraper that pulls full review text, author display name, and employer when the downstream use case only needs sentiment scores is collecting more than minimization allows. The mitigation is to extract derived signals (sentiment, topic, score) and discard the raw text on ingestion.
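A sketch of what “derive and discard” might look like at ingestion. Everything here is an assumption for illustration: the raw record keys (`employer`, `rating`, `text`, `author`), the tiny word lists, and the keyword-count scoring, which stands in for whatever sentiment model an operator actually uses.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Toy lexicons; a real pipeline would swap in its own scoring model.
POSITIVE_WORDS = frozenset({"great", "good", "supportive", "flexible"})
NEGATIVE_WORDS = frozenset({"bad", "toxic", "underpaid", "burnout"})
TOPIC_KEYWORDS = {
    "management": frozenset({"manager", "management", "leadership"}),
    "pay": frozenset({"salary", "pay", "underpaid", "bonus"}),
}


@dataclass(frozen=True)
class DerivedSignals:
    """The only fields that are stored; no review text, no display name."""
    employer: str
    rating: float
    sentiment: int
    topics: Tuple[str, ...]


def minimize(raw_review: dict) -> DerivedSignals:
    """Derive signals from a raw review and drop everything else."""
    words = set(re.findall(r"[a-z]+", raw_review["text"].lower()))
    sentiment = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    topics = tuple(
        sorted(name for name, kws in TOPIC_KEYWORDS.items() if words & kws)
    )
    # raw_review["text"] and raw_review["author"] are never copied across.
    return DerivedSignals(
        employer=raw_review["employer"],
        rating=float(raw_review["rating"]),
        sentiment=sentiment,
        topics=topics,
    )
```

The point of the frozen dataclass is structural: the stored type has no slot for raw text, so a later code change cannot quietly start retaining it without changing the schema.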

Right-to-erasure handling. GDPR Article 17 gives data subjects the right to demand erasure. A scraping operator with a Glassdoor-derived dataset needs a workable process for receiving and acting on erasure requests. Most do not have one. The legal exposure compounds with the size of the retained dataset.
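One possible shape for a workable erasure process is to key stored records by a keyed hash of the source review identifier, so a record can be located and deleted on request without keeping the identifier itself. This is a sketch under assumptions: the `ReviewStore` class, the in-memory dict, and the notion of a stable source review ID are all hypothetical. A keyed hash is still a pseudonymous identifier, so this is a design trade-off rather than a way out of GDPR.

```python
import hashlib
import hmac
from typing import Dict


class ReviewStore:
    """In-memory store of derived records, addressable for erasure."""

    def __init__(self, secret_key: bytes) -> None:
        self._key = secret_key
        self._records: Dict[str, dict] = {}

    def _token(self, source_review_id: str) -> str:
        # HMAC-SHA256 so tokens cannot be recomputed without the secret key.
        return hmac.new(
            self._key, source_review_id.encode("utf-8"), hashlib.sha256
        ).hexdigest()

    def add(self, source_review_id: str, derived: dict) -> None:
        self._records[self._token(source_review_id)] = derived

    def erase(self, source_review_id: str) -> bool:
        """Delete the record for this review; report whether one existed."""
        return self._records.pop(self._token(source_review_id), None) is not None

    def __len__(self) -> int:
        return len(self._records)
```

Returning a boolean from `erase` gives the operator something to log and to report back to the requester, which is most of what a minimal erasure workflow needs.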

The combined effect is that small-volume operators (single-developer Apify actors serving 10-50 users per month) are practically below the enforcement threshold for EU DPAs. Mid-volume operators (datasets and aggregation services serving thousands of customers) sit in the enforcement-risk zone. Large-volume operators are visible enough to attract attention but typically have legal teams that have already shaped the data-handling architecture.

Where the enforcement attention will go

The EU AI Act Article 5 enforcement wave of 2025-2026 created a regulatory expertise base in EU DPAs around scraping-related cases. That expertise will not stay focused on facial-image scraping. The next enforcement frontier, visible in noyb’s filings, in Italian DPA decisions, and in the Meta v. Bright Data precedent landscape, is review and rating data scraping under GDPR.

The vendors most exposed are the ones that aggregate review data into resellable datasets without obtaining lawful-basis documentation from the scraping operators they buy from. Apify Store publishers are mostly small enough to be invisible individually, but the aggregators downstream of them — Bright Data’s review datasets, Crawlbase’s similar offerings, the labor-intel wholesalers — are larger and more visible targets.

For Apify Store publishers building review-data scrapers, the practical implication is to design for data minimization from the start: derive signals at extraction time, store only what the downstream use case demands, and document the legitimate-interest balancing test in the actor README. The legal exposure on a 50-user-per-month actor is small. The legal exposure on the aggregation tier that resells the output at scale is not, and enforcement attention will follow the data downstream.
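Documenting the balancing test can be as simple as rendering a structured record into a README section. The sketch below assumes the commonly used three-part framing (purpose, necessity, balancing); the `BalancingTest` fields, including `retention_days`, and the section layout are hypothetical, not an Apify convention.

```python
from dataclasses import dataclass


@dataclass
class BalancingTest:
    purpose: str          # what legitimate interest is pursued
    necessity: str        # why this processing is needed for that purpose
    balancing: str        # why data subjects' rights do not override it
    retention_days: int   # how long derived records are kept


def render_readme_section(test: BalancingTest) -> str:
    """Render the assessment as a plain-text README section."""
    lines = [
        "Legitimate-interest assessment",
        "",
        f"Purpose: {test.purpose}",
        f"Necessity: {test.necessity}",
        f"Balancing: {test.balancing}",
        f"Retention: {test.retention_days} days",
    ]
    return "\n".join(lines)
```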

By Q4 2026, expect the first DPA decision specifically on review-data scraping in an Apify-style marketplace context. The decision will not change the scraping economics for high-margin operators. It will reshape the legal-posture documentation that mid-tier publishers and dataset aggregators are forced to produce.

