WebVoyager
Score your browser agent against the updated WebVoyager corpus (2026 version).
WebVoyager is the standard public benchmark for browser agents — 643 live-web tasks like "find the cheapest flight from SFO to NYC tomorrow" or "add this product to your Amazon cart." This pack runs your browser agent against the 2026-version corpus, scores each task with an LLM judge pinned to Claude Opus 4.6, and produces a signed aggregate plus per-category breakdown.
Highlights.
643 live-web tasks; proprietary patches for time-sensitive prompts
LLM judge pinned to Claude Opus 4.6, temperature 0, calibrated
CLI-only surface: pistachio harness run webvoyager-v1 --adapter browser-use
Public benchmarks, private adapters.
This is a research-based public benchmark: a corpus of reference tasks scored by a calibrated LLM judge. Every agent exposes a different endpoint shape, so we build the adapter that maps your agent to the corpus, scaffold the benchmark run, and deliver a signed aggregate you can share with customers or post publicly. The integration is a one-time step; after that, scores stay reproducible across re-runs.
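As a rough illustration of what an adapter does, the sketch below hides an agent behind a single `run_task` method so the same corpus loop can drive any agent. The names `Task`, `AgentAdapter`, `EchoAdapter`, and `run_corpus` are hypothetical and chosen for this example; they are not part of the Pistachio harness.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Task:
    """One corpus entry (hypothetical shape)."""
    task_id: str
    site: str
    prompt: str


class AgentAdapter(Protocol):
    """Anything that can turn a task into a final answer string."""

    def run_task(self, task: Task) -> str: ...


class EchoAdapter:
    """Stand-in adapter; a real one would call the agent's own endpoint here."""

    def run_task(self, task: Task) -> str:
        return f"[{task.site}] {task.prompt}"


def run_corpus(adapter: AgentAdapter, tasks: list[Task]) -> dict[str, str]:
    """Run every task through the adapter and collect answers by task id."""
    return {task.task_id: adapter.run_task(task) for task in tasks}


if __name__ == "__main__":
    demo = [Task("t1", "Amazon", "add this product to your cart")]
    print(run_corpus(EchoAdapter(), demo))
```

Keeping the adapter surface to one method means the agent's transport details (HTTP, SDK, subprocess) stay private to the adapter while the corpus loop stays identical across agents.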
Example checks.
Completes a live web task and returns a correct answer
Handles a time-sensitive task via the patch corpus
Judging criteria.
What a pass means
A pass means the LLM judge (Claude Opus 4.6, temperature 0) determines that the agent's output matches the reference answer for the task. Time-sensitive tasks are scored against Pistachio's patched references rather than stale ones. Each task is binary pass/fail; the pack returns an aggregate score across all 643 tasks.
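The arithmetic behind the aggregate can be sketched as follows, assuming each judged task yields a binary verdict tagged with a category. The `Verdict` type and `aggregate` function are hypothetical names for this example; the signing step and the judge call itself are not shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Binary judge outcome for one task (hypothetical shape)."""
    task_id: str
    category: str
    passed: bool


def aggregate(verdicts: list[Verdict]) -> dict:
    """Return the overall pass rate plus a pass rate per category."""
    counts = defaultdict(lambda: [0, 0])  # category -> [passes, total]
    for v in verdicts:
        counts[v.category][0] += int(v.passed)
        counts[v.category][1] += 1

    total = len(verdicts)
    passes = sum(v.passed for v in verdicts)
    return {
        # Guard against an empty run rather than dividing by zero.
        "pass_rate": passes / total if total else 0.0,
        "per_category": {c: p / n for c, (p, n) in counts.items()},
    }
```

Because each task is pass/fail, the aggregate is a plain proportion: for example, two passes out of three judged tasks gives a pass rate of 2/3.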
Data sources
- WebVoyager benchmark (2026 corpus)
643 live-web tasks across 13 popular sites, including Amazon, Apple, ArXiv, BBC, GitHub, Google Maps, Google Search, Hugging Face, Search Engine, Wolfram Alpha, and others. Pistachio maintains a patched reference set for time-sensitive prompts so that re-runs against today's web are compared like for like.
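One way to picture the patched reference set is as an overlay on the original references: if a task has a patch, the judge sees the patch; otherwise it sees the original. This is a minimal sketch under that assumption. The function name, the dictionary layout, and the sample ids are hypothetical, and the actual patch format is not described here.

```python
def resolve_reference(
    task_id: str,
    original: dict[str, str],
    patches: dict[str, str],
) -> str:
    """Prefer the patched reference for a task, falling back to the original."""
    if task_id in patches:
        return patches[task_id]
    return original[task_id]


if __name__ == "__main__":
    # Illustrative placeholder data only; not taken from the real corpus.
    original = {"task-1": "stale answer", "task-2": "stable answer"}
    patches = {"task-1": "refreshed answer"}
    print(resolve_reference("task-1", original, patches))
    print(resolve_reference("task-2", original, patches))
```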
