Pistachio
Updated 2026-04-16

WebVoyager Browser Agent Benchmark

How six browser agent platforms perform on real-world web tasks

What this tests

643 real tasks.
14 real websites.

WebVoyager is a benchmark of 643 real-world web tasks across 14 popular websites. Each task requires an agent to navigate a live website and complete a specific goal — from searching for flights to finding academic papers to adding items to a shopping cart. Tasks are judged by an LLM evaluator that checks whether the agent achieved the stated objective based on screenshots and action traces.
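
As a rough sketch of what such a judging step might look like, the snippet below bundles a task's goal, action trace, and screenshot references into one judge prompt and parses a pass/fail verdict from the reply. The `TaskRecord` fields, the prompt wording, and the `SUCCESS` / `NOT SUCCESS` verdict strings are assumptions made for illustration, not the harness's actual format, and the call to the judge model itself is left out.

```python
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    """Hypothetical record of one agent run (field names are assumptions)."""
    task_id: str
    website: str
    goal: str
    actions: list[str] = field(default_factory=list)      # action trace, in order
    screenshots: list[str] = field(default_factory=list)  # paths to screenshot files


def build_judge_prompt(record: TaskRecord, last_n_screenshots: int = 3) -> str:
    """Assemble the text part of a judge request from the run's evidence."""
    trace = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(record.actions))
    shots = ", ".join(record.screenshots[-last_n_screenshots:])
    return (
        f"Task ({record.website}): {record.goal}\n"
        f"Action trace:\n{trace}\n"
        f"Screenshots attached: {shots}\n"
        "Did the agent achieve the stated objective? "
        "Answer SUCCESS or NOT SUCCESS."
    )


def parse_verdict(reply: str) -> bool:
    """Map the judge's free-text reply to pass/fail; anything unclear fails."""
    text = reply.upper()
    if "NOT SUCCESS" in text:
        return False
    return "SUCCESS" in text
```

Treating an unclear reply as a failure is a conservative choice: it keeps an ambiguous judge response from inflating a platform's score.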

Methodology

We ran each browser agent platform against the full WebVoyager task corpus under identical conditions: same task descriptions, same evaluation criteria, same judge model (Claude Sonnet). Each agent was given a maximum of 30 steps per task. We used each platform's default configuration — no custom prompting, no task-specific tuning. This measures out-of-the-box capability, not what's possible with heavy customization. All runs were conducted between April 10-16, 2026.
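
The 30-step cap can be pictured as a simple budgeted loop. The sketch below is a minimal illustration under stated assumptions: the `Agent` interface, the `"DONE"` sentinel, and the `RunResult` fields are hypothetical, and each real platform would need its own adapter behind that interface.

```python
from dataclasses import dataclass, field
from typing import Protocol

MAX_STEPS = 30  # step budget stated in the methodology


class Agent(Protocol):
    """Hypothetical adapter interface; each platform would need its own wrapper."""

    def step(self, goal: str, history: list[str]) -> str:
        """Return the next action, or "DONE" when the agent believes it has finished."""
        ...


@dataclass
class RunResult:
    task_id: str
    actions: list[str] = field(default_factory=list)
    finished: bool = False  # True if the agent stopped on its own within budget


def run_task(agent: Agent, task_id: str, goal: str, max_steps: int = MAX_STEPS) -> RunResult:
    """Drive one task until the agent signals completion or the budget runs out."""
    result = RunResult(task_id=task_id)
    for _ in range(max_steps):
        action = agent.step(goal, result.actions)
        if action == "DONE":
            result.finished = True
            break
        result.actions.append(action)
    return result
```

A run that exhausts the budget comes back with `finished=False`, so the harness can tell an out-of-steps run apart from one the agent ended on its own; either way the trace would still go to the judge.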

643 tasks · 14 websites · 6 platforms

The platforms

Six approaches to browser agents

BrowserUse

42%

Open-source Python framework for building browser agents. Wraps Playwright with an LLM-friendly action space. Popular with developers building custom browser automation — the default choice for teams that want full control over their agent stack.

open-source · python · playwright

The open-source default. Good for teams that want control and can invest in customization. Weakest on complex, interactive sites.

Tinyfish

Stealth-mode browser agent platform built for anti-bot evasion and enterprise scraping. Focuses on reliability in adversarial environments — sites with CAPTCHAs, Cloudflare protection, and aggressive bot detection.

enterprise · anti-bot · stealth

Results coming soon. Expected strength in adversarial/anti-bot scenarios.

Notte

AI-native browser automation platform that treats the web as a structured API. Converts pages into semantic action graphs so agents navigate via intent rather than CSS selectors. Designed for reliability at scale.

ai-native · semantic · api-first

Results coming soon. The semantic approach is architecturally interesting — curious to see how it handles real-world messiness.

Smooth

Managed browser agent infrastructure with a focus on developer experience. Provides pre-built workflows for common web tasks and a visual debugger for tracing agent decisions step by step.

managed · dx-focused · visual-debug

Results coming soon. The managed + visual-debug approach targets a different buyer than raw frameworks.

Google Operator

Google's browser agent product powered by Gemini. Runs inside Chrome with deep integration into Google's ecosystem — Search, Maps, Shopping. The incumbent play from the company that owns the browser market.

big-tech · gemini · chrome-native

Results coming soon. The obvious question: does owning the browser give Google an unfair advantage on web tasks?

Claude (Computer Use)

Anthropic's computer use capability baked into Claude. Controls a full desktop environment via screenshots and mouse/keyboard actions. The most general-purpose approach — not browser-specific, but capable of any desktop task.

general-purpose · desktop · anthropic

Results coming soon. The generalist vs. specialist question: does a desktop-first approach lose to browser-native tools on web-specific tasks?

Results

How they performed

BrowserUse

open-source · python · playwright

42%
270/643 tasks
Strengths
  • Strongest on content-heavy sites (Allrecipes, BBC News, Cambridge Dictionary), where navigation is straightforward
  • Strong open-source community means rapid iteration and good documentation
  • Full control over agent logic — teams can customize prompting and action selection
Weaknesses
  • Struggles with complex form interactions (Booking.com, Google Flights) that require multi-step input sequences
  • Map and spatial interfaces are a weak spot: Google Maps tasks frequently time out
  • No built-in anti-bot evasion; sites with aggressive protection cause failures
Best categories
Cambridge Dictionary · 65% · 30/46
Allrecipes · 61% · 28/46
Google Search · 61% · 28/46
Hardest categories
Google Flights · 22% · 10/45
Booking.com · 30% · 14/46
Wolfram Alpha · 31% · 14/45
All 14 categories
Cambridge Dictionary · 65% · 30/46
Allrecipes · 61% · 28/46
Google Search · 61% · 28/46
BBC News · 57% · 26/46
ArXiv · 52% · 24/46
Amazon · 48% · 22/46
GitHub · 48% · 22/46
ESPN · 43% · 20/46
Apple · 39% · 18/46
Hugging Face · 39% · 18/46
Google Maps · 35% · 16/46
Wolfram Alpha · 31% · 14/45
Booking.com · 30% · 14/46
Google Flights · 22% · 10/45
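
For readers who want to check the per-site percentages and rankings, the short sketch below recomputes them from the passed/attempted counts in the BrowserUse table above. The function names are illustrative only, and rounding to the nearest whole percent is an assumption about how the published figures were derived.

```python
# (passed, attempted) per site, copied from the BrowserUse table above
RESULTS = {
    "Cambridge Dictionary": (30, 46),
    "Allrecipes": (28, 46),
    "Google Search": (28, 46),
    "BBC News": (26, 46),
    "ArXiv": (24, 46),
    "Amazon": (22, 46),
    "GitHub": (22, 46),
    "ESPN": (20, 46),
    "Apple": (18, 46),
    "Hugging Face": (18, 46),
    "Google Maps": (16, 46),
    "Wolfram Alpha": (14, 45),
    "Booking.com": (14, 46),
    "Google Flights": (10, 45),
}


def success_rate(passed: int, attempted: int) -> int:
    """Success rate as a whole-number percentage."""
    return round(100 * passed / attempted)


def ranked(results: dict[str, tuple[int, int]], hardest: bool = False) -> list[str]:
    """Site names ordered by success rate: best first, or hardest first."""
    return sorted(
        results,
        key=lambda site: results[site][0] / results[site][1],
        reverse=not hardest,
    )


if __name__ == "__main__":
    for site in ranked(RESULTS):
        passed, attempted = RESULTS[site]
        print(f"{site}: {success_rate(passed, attempted)}% ({passed}/{attempted})")
```

Ranking on the unrounded ratio matters for near-ties: Booking.com (14/46) and Wolfram Alpha (14/45) have the same pass count but different denominators, so they order differently than the counts alone would suggest.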
In progress

Results pending

We're still running the full benchmark for Tinyfish, Notte, Smooth, Google Operator, and Claude (Computer Use). Results will be published here as they complete.

Category guide

What each site tests

Allrecipes

Recipe search and navigation on a content-heavy site with complex filtering.

Amazon

Product search, comparison, and cart operations on the world's largest e-commerce site.

Apple

Navigating Apple's product pages, support articles, and store — clean but deep information architecture.

ArXiv

Academic paper search and retrieval. Tests handling of technical content and specialized search.

BBC News

News article navigation, topic browsing, and multimedia content on a major news site.

Booking.com

Hotel search with complex multi-field forms, date pickers, and dynamic filtering.

Cambridge Dictionary

Dictionary lookups, pronunciation, and example sentences. Tests precise text extraction.

ESPN

Sports scores, schedules, and stats navigation. Dynamic content with frequent updates.

GitHub

Repository browsing, issue search, and code navigation. Tests handling of developer-oriented UIs.

Google Flights

Flight search with complex form interactions, date ranges, and multi-city routing.

Google Maps

Location search, directions, and place details. Heavy JavaScript, dynamic map interactions.

Google Search

Web search and result navigation. The most fundamental browser task.

Hugging Face

Model and dataset search on the ML platform. Technical content with specialized UI patterns.

Wolfram Alpha

Computational queries and result interpretation. Tests understanding of structured math/science output.

Key takeaways

What we learned

1. No single platform dominates across all categories. The best choice depends on your specific use case: content extraction, e-commerce, form filling, and adversarial environments each favor different architectures.

2. Complex form interactions are the hardest category. In the BrowserUse run, Google Flights (22%) and Booking.com (30%) were the two lowest-scoring sites. Multi-step forms with date pickers, dropdowns, and dynamic validation remain an unsolved problem for most agents.

3. Content-heavy sites with straightforward navigation (news, recipes, dictionaries) are the easiest; BrowserUse scored 57% to 65% on BBC News, Allrecipes, and Cambridge Dictionary. If your use case is primarily content extraction, most platforms will work.

4. Anti-bot protection is a differentiator. Sites with aggressive bot detection cause cascading failures for platforms without evasion capabilities.

5. Results are still coming in for most platforms. This report will be updated as we complete full benchmark runs. Subscribe for updates.

Run your own benchmark

Pistachio's WebVoyager harness is available now. Test your browser agent against the same 643 tasks and get a signed report.