Operator vs Mariner vs Computer Use: The Benchmark Truth

OpenAI Operator, Google Mariner, and Anthropic Computer Use all shipped within about three months of each other in late 2024 and early 2025. The benchmark spread is narrower than the marketing suggests. All three trail humans badly on multi-step OS tasks, and that gap is not closing fast.

By Signal Census Editorial · Agent Benchmarks

Three frontier labs shipped browsing agents within roughly three months of each other. Anthropic’s Computer Use on October 22, 2024. Google’s Mariner on December 11, 2024. OpenAI’s Operator on January 23, 2025, 93 days after the first launch. The simultaneity is itself a data point: three independent labs concluded within months of each other that browser agents are the next product surface.

A benchmark battery for comparing them now exists. The gap between the leader and the laggard is narrower than the marketing suggests, and the gap to humans on multi-step tasks is wider than the marketing wants to admit.

What the benchmarks actually show

Three benchmarks dominate the public comparison: WebVoyager (web tasks against real sites), OSWorld (multi-step desktop OS tasks across applications), and WebArena (simulated e-commerce / CMS environments).

The current numbers, per the O-Mega 2025–2026 guide and the Stanford HAI 2026 AI Index:

WebVoyager (web-task success rate)

  • OpenAI CUA: 87%
  • Google Mariner: 83.5%
  • Anthropic Computer Use: 56%

OSWorld (desktop OS multi-step)

  • OpenAI CUA: 38.1%
  • Anthropic Computer Use: 22.0%
  • Human baseline: 72.4%

WebArena (simulated e-commerce / CMS)

  • OpenAI CUA: 58.1%

Two readings come out of those numbers immediately.

The web-task gap is small. Operator, which runs on OpenAI’s CUA model, beats Mariner by 3.5 percentage points on WebVoyager (87% vs 83.5%). Anthropic is meaningfully behind on the same benchmark at 56%, but the underlying issue is more positioning than capability: Anthropic shipped Computer Use as a public beta emphasizing human-in-the-loop safety, not as an autonomous agent. The capability gap is real, but it is not 31 points of intrinsic model difference; it is mostly a difference in how aggressively each lab tuned for autonomous task completion.

The desktop OS gap to humans is large. Even the leader, OpenAI CUA at 38.1%, reaches roughly half the human success rate of 72.4% on multi-step OS tasks. Anthropic’s 22% is closer to one-third. That gap has not closed materially in the last year, even though the frontier labs have shipped three model generations. It is a hard problem.
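The two readings above are simple arithmetic on the quoted scores. A minimal sketch that reproduces them follows; the helper names `point_gap` and `share_of_human` are illustrative only, and the inputs are the figures listed in this article, not fresh measurements.

```python
# Scores as quoted in this article (percent success rate).
WEBVOYAGER = {
    "OpenAI CUA": 87.0,
    "Google Mariner": 83.5,
    "Anthropic Computer Use": 56.0,
}
OSWORLD = {"OpenAI CUA": 38.1, "Anthropic Computer Use": 22.0}
OSWORLD_HUMAN = 72.4


def point_gap(scores, a, b):
    """Difference between two agents, in percentage points."""
    return round(scores[a] - scores[b], 1)


def share_of_human(scores, human):
    """Each agent's score as a fraction of the human baseline."""
    return {name: round(score / human, 2) for name, score in scores.items()}


leader_gap = point_gap(WEBVOYAGER, "OpenAI CUA", "Google Mariner")
laggard_gap = point_gap(WEBVOYAGER, "OpenAI CUA", "Anthropic Computer Use")
human_share = share_of_human(OSWORLD, OSWORLD_HUMAN)

print(f"WebVoyager, leader vs runner-up: {leader_gap} points")
print(f"WebVoyager, leader vs laggard:   {laggard_gap} points")
for name, share in human_share.items():
    print(f"OSWorld, {name}: {share:.0%} of the human baseline")
```

Mixing the two measures is deliberate: percentage points describe the spread between labs, while the share of the human baseline shows how far all of them still are from human performance.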

What the strategic split actually looks like

The three labs are pursuing visibly different strategies, and the benchmark numbers reflect those strategies.

OpenAI is positioning Operator as a programmable substrate. Chat with the agent, give it a task, it goes off and does it. The Operator product is gated to ChatGPT Pro for now, but the strategic path is clearly toward a developer API that competes with Browserbase and Browser Use on infrastructure. If that ships, the venture-backed browser-agent class compresses hard.

Google is positioning Mariner as the enterprise agent. Integrated into Gemini Advanced, with Google Workspace connectors, focused on completing tasks across the Google ecosystem. The benchmark performance reflects that focus — strong on web tasks, less attention to general OS-level capability. The economic bet is that enterprise customers want the agent that already speaks Google Drive, Gmail, and Calendar.

Anthropic is positioning Computer Use as the human-in-the-loop tool. Lower autonomous success rates are partly a deliberate product choice — the Anthropic safety posture explicitly favors slower, supervised execution over aggressive autonomous task completion. The lower benchmark numbers do not necessarily reflect lower model capability; they reflect a different product target.

For the buyer (developer or enterprise), the choice between the three is increasingly less about capability and more about which ecosystem the buyer already lives in.

What this means for the third-party browser-agent class

The existence and quality of the frontier-lab agents have direct implications for the venture-backed browser-agent class: Browserbase ($67.5mn raised), Browser Use ($17mn, 79K GitHub stars), and Skyvern ($2.7mn, but 85.85% on WebVoyager).

The threat: if Operator ships a developer API at scale, the substrate value proposition of Browserbase compresses. If Mariner integrates broadly with enterprise SaaS, the integration value proposition of Browser Use compresses. If Computer Use becomes the default Anthropic API call shape, the safety / human-in-loop positioning that some of the smaller players carve out gets absorbed.

The defense: distribution and developer experience. The startups can ship faster than the labs on workflow primitives, can integrate with non-LLM ecosystems (Postgres, Snowflake, custom CRMs) that the labs do not prioritize, and can offer pricing models that the API-only frontier labs do not. Some of those defenses are real. Some are temporary.

The next data point to watch is whether OpenAI ships an Operator developer API in 2026. If yes, the venture-backed substrate plays consolidate to two or three survivors. If no, the field stays open for another year.

Apify as both consumer and competitor

For Apify Store publishers, the frontier-lab agents are both consumers and competitors.

As consumers: an agent (Operator, Mariner, Computer Use) that needs to extract structured data from a target site can call an Apify Actor via MCP and get back clean structured output, instead of running its own browser session against the target. For target-specific extraction, the Actor is faster, cheaper, and more reliable than agent-driven browsing. That is a genuine distribution channel for Apify publishers if the agents pick up MCP-mediated tool routing.

As competitors: the same agent could in principle browse the target site itself instead of calling an Actor. Whether it does depends on the cost differential, the reliability differential, and the agent’s tool-discovery surface. For high-volume, structured extraction from known targets, the Actor wins on every dimension. For one-off, ad-hoc extraction from unknown targets, agent-driven browsing wins on flexibility.
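One way to picture that trade-off is as expected cost per successful extraction, which folds the cost differential and the reliability differential into a single number. The sketch below is hypothetical throughout: `Route`, `expected_cost_per_success`, and `pick_route` are names invented for illustration, the prices and success rates are placeholders rather than measurements, and retries are assumed to be independent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    cost_per_attempt: float  # hypothetical, in dollars
    success_rate: float      # hypothetical, between 0 and 1


def expected_cost_per_success(route: Route) -> float:
    """Expected spend to get one good result, assuming independent retries."""
    if not 0 < route.success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return route.cost_per_attempt / route.success_rate


def pick_route(routes, known_target: bool) -> Route:
    """Pick the cheapest route per success; only browsing works on unknown targets."""
    candidates = [r for r in routes if known_target or r.name == "agent_browsing"]
    return min(candidates, key=expected_cost_per_success)


# Placeholder numbers only; they are not benchmark results or real prices.
ROUTES = [
    Route("apify_actor", cost_per_attempt=0.01, success_rate=0.95),
    Route("agent_browsing", cost_per_attempt=0.20, success_rate=0.60),
]

print(pick_route(ROUTES, known_target=True).name)
print(pick_route(ROUTES, known_target=False).name)
```

With these placeholder inputs the Actor wins for a known target and browsing is the only option for an unknown one, which mirrors the argument above. Real routing would also depend on the agent’s tool-discovery surface, which this sketch does not model.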

The unresolved question is what happens when the agents become reliable enough on multi-step OS tasks that they start replacing human knowledge work — and the demand for “an agent that can do X” goes up by an order of magnitude. The data infrastructure underneath that demand has to come from somewhere. Apify Actors are well-positioned to be that infrastructure if the catalog stays comprehensive enough. The frontier labs are well-positioned to commoditize it if Operator ships a managed scraping API.

For now, the benchmark spread between labs is narrow, the gap to humans on multi-step OS tasks is wide, and both gaps will compress through 2026 as the next generation of models ships. The shape of the agent layer in 2027 will be visible by the end of this year.


Sources