AI agents · 5 min read

OSWorld 2026: Agents Closed Half the Gap to Human 72.4%

OSWorld benchmark: human baseline 72.4%, top agent score moved from 12.2% (Jan 2025) to ~66% (May 2026). More than half the gap closed in about 16 months. The remaining ~6-point gap is harder: long-horizon planning, GUI interpretation, error recovery.

By Signal Census Editorial · June 5, 2026 · OSWorld Benchmark 2026


The OSWorld benchmark — multi-step computer-use tasks in a real Ubuntu desktop environment — has been the most-referenced standardized test for browser-and-OS agents since its launch in early 2024. The human baseline on the benchmark is 72.4%. The top agent score in January 2025 was roughly 12%. By May 2026, top-tier agents are scoring approximately 66%.

The numerical reading is dramatic: agents closed more than half the gap to human performance, roughly 54 of the original 60 points, in about 16 months (January 2025 to May 2026). The structural reading is more nuanced. The first 54 points of progress came from foundation-model improvements, better tool-calling, and standardized agent scaffolding. The remaining 6 points to reach human parity will require advances in long-horizon planning, GUI state interpretation, and error recovery, all of which have improved more slowly than the capabilities behind the earlier gains.

Figure: Benchmark progression. OSWorld's top-agent score rose quickly, then slowed just below the human baseline. The visual point is convergence, not a forecast: the remaining gap is small, but the recent gains are slower than the 2025 jumps.

The progression

The OSWorld leaderboard progression for top-tier agents:

Table: OSWorld progression. Top agents moved from 12.2% to roughly 66%, but the curve is flattening below the 72.4% human baseline.

Date	Top agent score	Gap to human	Points closed since previous entry
Jan 2025	~12.2%	-60.2 pts	-
Apr 2025	~22%	-50.4 pts	9.8 pts
Jul 2025	~38%	-34.4 pts	16 pts
Oct 2025	~52%	-20.4 pts	14 pts
Jan 2026	~60%	-12.4 pts	8 pts
May 2026	~66%	-6.4 pts	6 pts

The headline gap has narrowed sharply, but each recent gain is smaller than the mid-2025 jumps, and the last entry covers four months rather than three.

The slope is clearly flattening. The 9.8 to 16 point quarterly gains of the first three quarters of 2025 decelerated to 8 points in the quarter ending January 2026, and to 6 points over the four months ending May 2026. The pattern is consistent with a benchmark approaching its asymptote: the easy gains are in, and the remaining gains require changes in approach rather than more scaling.
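The arithmetic behind the table can be re-derived in a few lines. The sketch below is illustrative only: the scores are copied from the table, the month offsets are counted from January 2025, and the per-month normalization is an addition made here so that the four-month final interval can be compared with the three-month ones.

```python
# Scores copied from the progression table; month offsets are counted from Jan 2025.
HUMAN_BASELINE = 72.4
SNAPSHOTS = [
    ("Jan 2025", 0, 12.2),
    ("Apr 2025", 3, 22.0),
    ("Jul 2025", 6, 38.0),
    ("Oct 2025", 9, 52.0),
    ("Jan 2026", 12, 60.0),
    ("May 2026", 16, 66.0),
]


def progression(snapshots, baseline):
    """Return one row per interval: (label, gap to baseline, points gained, points per month)."""
    rows = []
    for (_, m0, s0), (label, m1, s1) in zip(snapshots, snapshots[1:]):
        gained = s1 - s0
        rows.append((
            label,
            round(baseline - s1, 1),          # gap remaining at the end of the interval
            round(gained, 1),                 # points closed during the interval
            round(gained / (m1 - m0), 2),     # normalized to points per month
        ))
    return rows


rows = progression(SNAPSHOTS, HUMAN_BASELINE)
for label, gap, gained, rate in rows:
    print(f"{label}: gap {gap} pts, gained {gained} pts, {rate} pts/month")

# Share of the January 2025 gap that had been closed by May 2026.
initial_gap = HUMAN_BASELINE - SNAPSHOTS[0][2]
total_closed = SNAPSHOTS[-1][2] - SNAPSHOTS[0][2]
share_closed = total_closed / initial_gap
print(f"Share of the initial gap closed: {share_closed:.0%}")
```

Normalized per month, the last interval works out to 1.5 points per month against 2.67 in the quarter before it, so the deceleration is somewhat steeper than the raw 8-to-6 comparison suggests. On these figures, about 89% of the starting gap is closed.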

The top-of-leaderboard composition has also shifted. Early 2025 leaders were primarily OpenAI’s Operator and Anthropic’s Claude with computer use, both running with standardized agent scaffolds. By mid-2026, the top positions include Google’s Mariner, multiple academic-research agents, and specialized vendor agents (Skyvern, Browser-Use derivatives) that have optimized specifically for the OSWorld task distribution.

What got closed

The first 54 points of improvement came from three converging forces.

Foundation model capability gains. Claude Sonnet 4 → 4.6, GPT-4o → o-series → GPT-5, Gemini 2 → 2.5 — each model release added meaningful capability on screenshot understanding, instruction following, and multi-step planning. The OSWorld improvements track the model release cadence directly.

Standardized agent scaffolding. The 2024-era pattern of every benchmark submission running a different scaffold has collapsed into a small number of canonical approaches. Anthropic’s computer-use loop, OpenAI’s Operator framework, and the open-source frameworks that converged on similar patterns all produce comparable results. The scaffolding work is no longer the differentiator.

Benchmark-specific optimization. Agents that target OSWorld specifically have learned the task distribution and can game some of its patterns. This is the standard benchmark-progression dynamic — early scores are limited by general capability, later scores include benchmark-specific tuning. The dynamic has driven 5-10 of the 54 points of progress and is now mostly tapped out.

What remains harder

The 6.4 points separating the current top-agent score from the human baseline are concentrated in capability classes that the current generation of agents handles poorly.

Long-horizon multi-step planning. These are tasks that require holding 8-12 sub-goal states in working memory while navigating between applications. Current agents perform better than they did 6-12 months ago but still degrade as the number of task steps increases. The improvement curve on long-horizon tasks is much flatter than on short-horizon tasks.
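One way to see why step count hurts is a simple compounding model. This is a sketch under a strong assumption that is not taken from OSWorld itself: every step succeeds independently with the same probability, so whole-task success is that probability raised to the number of steps. The function names are illustrative.

```python
def task_success(step_reliability: float, steps: int) -> float:
    """Whole-task success if every step must succeed, assuming independent steps."""
    if not 0.0 <= step_reliability <= 1.0:
        raise ValueError("step_reliability must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return step_reliability ** steps


def required_step_reliability(target_success: float, steps: int) -> float:
    """Per-step reliability needed to reach a target whole-task success rate."""
    if not 0.0 < target_success <= 1.0:
        raise ValueError("target_success must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return target_success ** (1.0 / steps)


# Per-step reliability needed to match the 72.4% human baseline
# at the sub-goal counts mentioned in the text (8 and 12).
for steps in (8, 12):
    needed = required_step_reliability(0.724, steps)
    print(f"{steps} steps: each step must succeed {needed:.1%} of the time")
```

Under this model the per-step requirement tightens as tasks get longer, which matches the observation that long-horizon scores improve more slowly. Real agents do not fail independently per step, so treat the numbers as intuition rather than measurement.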

Recovery from compound errors. When an agent takes a wrong action and lands in an unexpected GUI state, recovering requires correctly diagnosing what happened and back-tracking. Humans do this naturally; agents tend to compound errors by trying to push forward from the wrong state. OSWorld tasks that include this recovery pattern are where the human-agent gap stays widest.

Subtle UI interpretation. Tasks that depend on noticing small interface cues (a disabled button, an inline error message, a state-change indicator) require careful screenshot interpretation that current vision-language models miss more often than humans do. The OSWorld task subset that requires this kind of perception is over-represented in the remaining gap.

The forward projection for OSWorld saturation depends on which of these capability classes get attacked next.

If foundation-model labs prioritize long-horizon planning (visible in current Anthropic and OpenAI research roadmaps), the next 3-4 points could close in 6-9 months.
If they prioritize multimodal grounding (which has independent value for non-agent use cases), subtle UI interpretation improves and that piece of the gap closes.
The compound-error-recovery piece is the hardest and may require architectural changes (memory, planning, verification) that are not just scaling.

A realistic estimate for OSWorld top scores reaching or exceeding the human baseline (72.4%) is Q4 2026 to Q2 2027. Beyond that the benchmark becomes saturated and the conversation moves to the next, harder benchmark.
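The estimate can be sanity-checked against the most recent pace. The sketch below assumes the 6.4-point gap and the 6 points gained between January and May 2026 from the table, and counts months from May 2026: Q4 2026 begins 5 months out (October) and Q2 2027 ends 13 months out (June). The constant-rate framing is a simplification introduced here, not a forecast method used in this article.

```python
REMAINING_GAP = 72.4 - 66.0   # points left to the human baseline as of May 2026
RECENT_RATE = 6.0 / 4         # points per month, January to May 2026

# Months after May 2026 (assumed window edges for the estimate).
EARLIEST_MONTHS = 5           # start of Q4 2026 (October 2026)
LATEST_MONTHS = 13            # end of Q2 2027 (June 2027)


def months_to_parity(gap: float, rate_per_month: float) -> float:
    """Months needed to close `gap` at a constant monthly rate."""
    if rate_per_month <= 0:
        raise ValueError("rate_per_month must be positive")
    return gap / rate_per_month


def implied_rate(gap: float, months: float) -> float:
    """Average monthly rate implied by closing `gap` in `months`."""
    if months <= 0:
        raise ValueError("months must be positive")
    return gap / months


constant_pace = months_to_parity(REMAINING_GAP, RECENT_RATE)
fast = implied_rate(REMAINING_GAP, EARLIEST_MONTHS)
slow = implied_rate(REMAINING_GAP, LATEST_MONTHS)

print(f"At the recent pace: {constant_pace:.1f} months to parity")
print(f"Implied pace for the estimate window: {slow:.2f} to {fast:.2f} pts/month")
```

At an unchanged 1.5 points per month, parity would arrive in a little over four months. The Q4 2026 to Q2 2027 window implies an average pace of roughly 0.49 to 1.28 points per month, so the estimate already builds in further deceleration.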

What OSWorld doesn’t measure

The benchmark’s structural limitations matter for interpreting the numbers.

Production reliability. OSWorld measures single-attempt task success on standardized inputs. Production browser-agent workloads (competitive intelligence scraping, lead enrichment, and other scraping-via-agent jobs) require sustained reliability across thousands of runs. The benchmark score is a capability ceiling; the success rate an operator sees in production is meaningfully lower.
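To make the ceiling concrete, here is a minimal sketch of what a single-attempt success rate means at volume. It assumes, optimistically, that production runs succeed at the benchmark rate, that runs are independent, and that a failed run can be detected and retried. None of those assumptions is guaranteed in practice, and the function names are illustrative.

```python
import math


def expected_failures(success_rate: float, runs: int) -> float:
    """Expected number of failed runs out of `runs` independent attempts."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return runs * (1.0 - success_rate)


def attempts_for_target(success_rate: float, target: float) -> int:
    """Smallest number of independent attempts such that at least one
    succeeds with probability >= target."""
    if not (0.0 < success_rate < 1.0 and 0.0 < target < 1.0):
        raise ValueError("success_rate and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - success_rate))


BENCHMARK_RATE = 0.66  # May 2026 top-agent score, used here as an optimistic ceiling

print(round(expected_failures(BENCHMARK_RATE, 1000)))   # failed runs per 1,000 tasks
print(attempts_for_target(BENCHMARK_RATE, 0.99))        # attempts per task for 99% coverage
```

Even at the ceiling, about 340 of every 1,000 tasks fail on the first attempt, and reaching 99% coverage takes up to five attempts per task, which multiplies cost and latency accordingly.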

Cost-per-task. OSWorld scores ignore the compute cost of achieving them. An agent scoring around 66% often runs on heavy reasoning-tier models at $0.50-2 per task. Most production workloads cannot tolerate that cost structure, and the Q2 2026 token-math analysis accordingly shows production agents typically using Flash-tier models, which score 15-25 points lower on OSWorld than the heavy-tier models that win the benchmark.
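The trade-off is easier to compare as cost per successful task rather than cost per attempt. In the sketch below, the heavy-tier cost range and the 15-25 point score penalty come from the paragraph above. The Flash-tier cost of $0.05 per task is a hypothetical placeholder, not a measured figure, and benchmark success rates stand in for production success rates, which the article argues are lower.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per successful task when failed attempts are still paid for."""
    if cost_per_attempt < 0:
        raise ValueError("cost_per_attempt must be non-negative")
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


HEAVY_RATE = 0.66                     # top-agent OSWorld score, May 2026
HEAVY_COST_RANGE = (0.50, 2.00)       # dollars per task, from the text

FLASH_RATES = (0.41, 0.51)            # 66% minus 25 and minus 15 points
HYPOTHETICAL_FLASH_COST = 0.05        # dollars per task; placeholder assumption

heavy = [cost_per_success(c, HEAVY_RATE) for c in HEAVY_COST_RANGE]
flash = [cost_per_success(HYPOTHETICAL_FLASH_COST, r) for r in FLASH_RATES]

print(f"Heavy tier: ${heavy[0]:.2f} to ${heavy[1]:.2f} per successful task")
print(f"Flash tier (hypothetical cost): ${flash[1]:.2f} to ${flash[0]:.2f} per successful task")
```

With the placeholder cost, the cheaper tier stays well ahead on cost per success despite the lower score. The conclusion is sensitive to the assumed Flash-tier price, so substitute real per-task costs before relying on it.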

Adversarial conditions. OSWorld tasks run on stable applications with predictable interfaces. Production browser-agent workloads encounter anti-bot defenses, A/B-tested interfaces, regional content differences, and real-time content changes. The benchmark does not capture this dimension at all.

The benchmark is a useful capability indicator. It is not a useful production-readiness indicator. The 66% top-agent score in May 2026 implies that an agent can probably handle the kind of task OSWorld measures, but says little about whether the agent can reliably handle real-world browsing workloads at production volume.

What it means for scraping infrastructure

For Apify Store publishers and other scraping operators, the OSWorld trajectory matters in two specific ways.

The agent layer is getting more capable. Tasks that required human-in-the-loop intervention in 2024 are increasingly automatable by agents in 2026. The implication for scraper design is that the buyer-side workload is moving from “configure a scraper and let it run” to “describe an extraction task and let the agent figure out which scraper to run.” The actors that thrive in this transition are the ones that expose themselves cleanly to agent discovery via MCP and via clear input schemas.
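As a rough illustration of what a clear input schema buys an agent, the sketch below defines a hypothetical schema in JSON Schema style and a small validator. The field names, the actor description, and the validator are invented for this example. This is not the Apify input schema format or an MCP tool definition, only the general idea that every field is typed and described so an agent can build a valid call without guessing.

```python
# Hypothetical input schema for an extraction actor (JSON Schema style, illustrative only).
INPUT_SCHEMA = {
    "title": "Product page extractor",
    "description": "Extracts the title and price from product page URLs.",
    "type": "object",
    "required": ["start_urls"],
    "properties": {
        "start_urls": {"type": "array", "description": "Product page URLs to visit."},
        "max_items": {"type": "integer", "description": "Upper bound on returned records."},
    },
}

# Mapping from schema type names to Python types for this simplified check.
_PY_TYPES = {"array": list, "integer": int, "string": str, "boolean": bool, "object": dict}


def validate_input(payload: dict, schema: dict) -> list:
    """Return a list of readable problems; an empty list means the payload is acceptable."""
    problems = []
    for name in schema.get("required", []):
        if name not in payload:
            problems.append(f"missing required field: {name}")
    for name, value in payload.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unknown field: {name}")
            continue
        expected = _PY_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly for integer fields.
        wrong_bool = spec["type"] == "integer" and isinstance(value, bool)
        if not isinstance(value, expected) or wrong_bool:
            problems.append(f"{name} should be of type {spec['type']}")
    return problems


print(validate_input({"start_urls": ["https://example.com/p/1"], "max_items": 10}, INPUT_SCHEMA))
print(validate_input({"max_items": "ten"}, INPUT_SCHEMA))
```

Readable error messages matter as much as the schema: an agent that receives "missing required field: start_urls" can repair its call on the next attempt, while an opaque failure gives it nothing to work with.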

The reasoning-tier models will keep winning the benchmark but losing the production layer. OSWorld-leading scores are on $0.50-2-per-task model configurations. Production deployments will continue to use Flash-tier or cheaper models. The benchmark-vs-production divergence is structural and not closing. Publishers who optimize for what production agents actually use (cheap, fast, schema-compliant) will outperform publishers who optimize for what benchmark-leading agents demonstrate.

The OSWorld curve is informative as a capability signal. The translation from capability to production utility is non-linear. For the next 6-12 months, OSWorld will continue rising and production agent reliability will continue lagging behind the headline numbers. The headline-vs-production gap is where the actual market opportunity for scraping infrastructure sits.

Sources

OSWorld benchmark and leaderboard
OSWorld paper (2024) — original benchmark methodology
Signal Census: Operator/Mariner CUA Benchmark Truth
Signal Census: Multi-Agent Frameworks Converge
Signal Census: LLM Extraction Token Math