WebVoyager Browser Agent Benchmark
How six browser agent platforms perform on real-world web tasks
643 real tasks.
14 real websites.
WebVoyager is a benchmark of 643 real-world web tasks across 14 popular websites. Each task requires an agent to navigate a live website and complete a specific goal — from searching for flights to finding academic papers to adding items to a shopping cart. Tasks are judged by an LLM evaluator that checks whether the agent achieved the stated objective based on screenshots and action traces.
We ran each browser agent platform against the full WebVoyager task corpus under identical conditions: same task descriptions, same evaluation criteria, same judge model (Claude Sonnet). Each agent was given a maximum of 30 steps per task. We used each platform's default configuration — no custom prompting, no task-specific tuning. This measures out-of-the-box capability, not what's possible with heavy customization. All runs were conducted between April 10-16, 2026.
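The protocol above (a fixed step budget per task, then a pass/fail verdict from a judge) can be sketched roughly as follows. This is a minimal illustration, not the harness used for this report: `Task`, `Trace`, `run_task`, `success_rate`, and the `step_fn` and `judge` callables are hypothetical names, and the judge here is a stand-in for the LLM evaluator that reads screenshots and action traces.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_STEPS = 30  # per-task step budget stated in the methodology


@dataclass
class Task:
    task_id: str
    site: str
    description: str


@dataclass
class Trace:
    task_id: str
    actions: List[str] = field(default_factory=list)
    screenshots: List[bytes] = field(default_factory=list)
    finished: bool = False  # True only if the agent stopped within budget


# Hypothetical agent interface: one call performs one step and returns
# (action description, screenshot bytes, whether the agent says it is done).
StepFn = Callable[[Task, Trace], Tuple[str, bytes, bool]]
# Hypothetical judge interface: stands in for the LLM evaluator.
JudgeFn = Callable[[Task, Trace], bool]


def run_task(task: Task, step_fn: StepFn, max_steps: int = MAX_STEPS) -> Trace:
    """Drive one agent on one task, stopping at the step budget."""
    trace = Trace(task_id=task.task_id)
    for _ in range(max_steps):
        action, screenshot, done = step_fn(task, trace)
        trace.actions.append(action)
        trace.screenshots.append(screenshot)
        if done:
            trace.finished = True
            break
    return trace


def success_rate(tasks: List[Task], step_fn: StepFn, judge: JudgeFn) -> float:
    """Fraction of tasks the judge marks as achieved."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if judge(t, run_task(t, step_fn)))
    return passed / len(tasks)


if __name__ == "__main__":
    # Stub agent that declares success on its third step (illustration only).
    def stub_agent(task: Task, trace: Trace) -> Tuple[str, bytes, bool]:
        return ("click", b"", len(trace.actions) == 2)

    demo = [Task("demo-1", "Example Site", "Find the contact page")]
    print(success_rate(demo, stub_agent, lambda t, tr: tr.finished))
```

Keeping the agent behind a single step function is one way to hold the task descriptions, step budget, and judge constant while swapping platforms.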
Six approaches to browser agents
BrowserUse
42%
Open-source Python framework for building browser agents. Wraps Playwright with an LLM-friendly action space. Popular with developers building custom browser automation, and the default choice for teams that want full control over their agent stack.
The open-source default. Good for teams that want control and can invest in customization. Weakest on complex, interactive sites.
Tinyfish
Stealth-mode browser agent platform built for anti-bot evasion and enterprise scraping. Focuses on reliability in adversarial environments — sites with CAPTCHAs, Cloudflare protection, and aggressive bot detection.
Results coming soon. Expected strength in adversarial/anti-bot scenarios.
Notte
AI-native browser automation platform that treats the web as a structured API. Converts pages into semantic action graphs so agents navigate via intent rather than CSS selectors. Designed for reliability at scale.
Results coming soon. The semantic approach is architecturally interesting — curious to see how it handles real-world messiness.
Smooth
Managed browser agent infrastructure with a focus on developer experience. Provides pre-built workflows for common web tasks and a visual debugger for tracing agent decisions step by step.
Results coming soon. The managed + visual-debug approach targets a different buyer than raw frameworks.
Google Operator
Google's browser agent product powered by Gemini. Runs inside Chrome with deep integration into Google's ecosystem — Search, Maps, Shopping. The incumbent play from the company that owns the browser market.
Results coming soon. The obvious question: does owning the browser give Google an unfair advantage on web tasks?
Claude (Computer Use)
Anthropic's computer use capability baked into Claude. Controls a full desktop environment via screenshots and mouse/keyboard actions. The most general-purpose approach — not browser-specific, but capable of any desktop task.
Results coming soon. The generalist vs. specialist question: does a desktop-first approach lose to browser-native tools on web-specific tasks?
How they performed
BrowserUse
open-source · python · playwright
Strengths
- Best-in-class on content-heavy sites (Allrecipes, BBC News, Cambridge Dictionary) where navigation is straightforward
- Strong open-source community means rapid iteration and good documentation
- Full control over agent logic: teams can customize prompting and action selection
Weaknesses
- Struggles with complex form interactions (Booking.com, Google Flights) that require multi-step input sequences
- Map and spatial interfaces are a weak spot: Google Maps tasks frequently time out
- No built-in anti-bot evasion; sites with aggressive protection cause failures
Results pending
We're still running the full benchmark for Tinyfish, Notte, Smooth, Google Operator, and Claude (Computer Use). Results will be published here as they complete.
What each site tests
Allrecipes
Recipe search and navigation on a content-heavy site with complex filtering.
Amazon
Product search, comparison, and cart operations on the world's largest e-commerce site.
Apple
Navigating Apple's product pages, support articles, and store — clean but deep information architecture.
ArXiv
Academic paper search and retrieval. Tests handling of technical content and specialized search.
BBC News
News article navigation, topic browsing, and multimedia content on a major news site.
Booking.com
Hotel search with complex multi-field forms, date pickers, and dynamic filtering.
Cambridge Dictionary
Dictionary lookups, pronunciation, and example sentences. Tests precise text extraction.
ESPN
Sports scores, schedules, and stats navigation. Dynamic content with frequent updates.
GitHub
Repository browsing, issue search, and code navigation. Tests handling of developer-oriented UIs.
Google Flights
Flight search with complex form interactions, date ranges, and multi-city routing.
Google Maps
Location search, directions, and place details. Heavy JavaScript, dynamic map interactions.
Google Search
Web search and result navigation. The most fundamental browser task.
Hugging Face
Model and dataset search on the ML platform. Technical content with specialized UI patterns.
Wolfram Alpha
Computational queries and result interpretation. Tests understanding of structured math/science output.
What we learned
No single platform dominates across all categories. The best choice depends on your specific use case — content extraction, e-commerce, form filling, or adversarial environments each favor different architectures.
Complex form interactions (Booking.com, Google Flights) are the hardest category across the board. Multi-step forms with date pickers, dropdowns, and dynamic validation remain an unsolved problem for most agents.
Content-heavy sites with straightforward navigation (news, recipes, dictionaries) are the easiest. If your use case is primarily content extraction, most platforms will work.
Anti-bot protection is a differentiator. Sites with aggressive bot detection cause cascading failures for platforms without evasion capabilities.
Results are still coming in for most platforms. This report will be updated as we complete full benchmark runs. Subscribe for updates.
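Site-level comparisons like the ones in this section come down to grouping judged outcomes by website. A minimal sketch, assuming each judged task is recorded as a `(site, passed)` pair; the function names are hypothetical and the sample rows are made-up placeholders, not benchmark results.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def per_site_success(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Success rate per site from (site, passed) records."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for site, passed in results:
        totals[site] += 1
        passes[site] += int(passed)
    return {site: passes[site] / totals[site] for site in totals}


def rank_sites(rates: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sites ordered from lowest to highest success rate."""
    return sorted(rates.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    # Placeholder records for illustration only; not measured data.
    sample = [
        ("Site A", True),
        ("Site A", False),
        ("Site B", True),
    ]
    for site, rate in rank_sites(per_site_success(sample)):
        print(f"{site}: {rate:.0%}")
```

Sorting from lowest to highest puts the hardest sites for a given platform at the top of the list.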
Run your own benchmark
Pistachio's WebVoyager harness is available now. Test your browser agent against the same 643 tasks and get a signed report.
