Pistachio
Benchmark · Enterprise · Web Agent

WebVoyager

Score your browser agent against the updated WebVoyager corpus (2026 version).

WebVoyager is the standard public benchmark for browser agents — 643 live-web tasks like "find the cheapest flight from SFO to NYC tomorrow" or "add this product to your Amazon cart." This pack runs your browser agent against the 2026-version corpus, scores each task with an LLM judge pinned to Claude Opus 4.6, and produces a signed aggregate plus per-category breakdown.

Highlights.

643 live-web tasks; proprietary patches for time-sensitive prompts

LLM judge pinned to Claude Opus 4.6, temperature 0, calibrated

CLI-only surface: pistachio harness run webvoyager-v1 --adapter browser-use

Research-based harness

Public benchmarks, private adapters.

This is a research-based public benchmark — a corpus of reference tasks scored by a calibrated LLM judge. Every agent speaks a different endpoint shape, so we build the adapter that maps your agent to the corpus, scaffold the benchmark run, and deliver a signed aggregate you can share with customers or post publicly. One-time integration, reproducible scores forever.
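To make the adapter idea concrete, here is a minimal sketch of what such a mapping layer could look like. It is an illustration only, not Pistachio's actual adapter API: the names `Task`, `AgentAdapter`, `CallableAdapter`, and `run_corpus`, the task ID format, and the prompt shape are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Task:
    task_id: str  # hypothetical identifier format
    site: str     # e.g. "Amazon"
    prompt: str   # the natural-language instruction


class AgentAdapter(Protocol):
    """Hypothetical interface: one call per task, returns the agent's final answer."""

    def run_task(self, task: Task) -> str: ...


class CallableAdapter:
    """Wraps any function that takes a prompt string and returns an answer string."""

    def __init__(self, agent_fn: Callable[[str], str]) -> None:
        self._agent_fn = agent_fn

    def run_task(self, task: Task) -> str:
        # Translate the corpus task into whatever shape the agent expects.
        return self._agent_fn(f"[{task.site}] {task.prompt}")


def run_corpus(adapter: AgentAdapter, tasks: list[Task]) -> dict[str, str]:
    """Collect one answer per task ID, ready to hand to a judge."""
    return {task.task_id: adapter.run_task(task) for task in tasks}
```

The point of the sketch is that the corpus side only depends on `run_task`, so an agent with a different endpoint shape needs a new adapter class rather than changes to the benchmark run itself.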

Examples

Example checks.

Check 01 · Deterministic

Completes a live web task and returns a correct answer

Input
On Amazon, find the price of the top-rated wireless mouse under $30.
Expected behavior
Navigates Amazon, filters and sorts correctly, and extracts a price. The judge (Claude Opus 4.6) scores the output against the reference answer for the task.
Check 02 · Deterministic

Handles a time-sensitive task via the patch corpus

Input
What's today's #1 trending video on YouTube?
Expected behavior
The harness looks up the patched reference answer for this task ID, and the judge scores the agent's output against that patched reference rather than the stale original.
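The lookup described in Check 02 can be sketched as a simple override: prefer a patched reference when one exists for the task ID, otherwise fall back to the original. The dictionaries, task IDs, and the function name `resolve_reference` below are hypothetical; how the patch corpus is actually stored is not documented here.

```python
# Hypothetical reference stores keyed by task ID (placeholder values).
ORIGINAL_REFS = {
    "youtube--trending": "<answer recorded when the corpus was built>",
    "amazon--mouse": "<reference price answer>",
}
PATCHED_REFS = {
    "youtube--trending": "<answer refreshed for the current run>",
}


def resolve_reference(
    task_id: str,
    original: dict[str, str],
    patched: dict[str, str],
) -> str:
    """Return the patched reference when one exists, else the original."""
    if task_id in patched:
        return patched[task_id]
    return original[task_id]  # raises KeyError for unknown task IDs


# The time-sensitive task resolves to the patched answer; the other does not.
trending_ref = resolve_reference("youtube--trending", ORIGINAL_REFS, PATCHED_REFS)
mouse_ref = resolve_reference("amazon--mouse", ORIGINAL_REFS, PATCHED_REFS)
```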
Grading

Judging criteria.

What a pass means

A pass means the LLM judge (Claude Opus 4.6, temperature 0) determines that the agent's output matches the reference answer for the task. Time-sensitive tasks are scored against Pistachio's patched references rather than stale ones. Each task is binary pass/fail; the pack returns an aggregate score across all 643 tasks.
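As a rough illustration of how binary verdicts roll up into an aggregate and a per-category breakdown, the sketch below computes pass rates from (category, passed) pairs. It assumes the aggregate is a plain pass rate, which the text above does not state, and the function name `summarize` and the report shape are invented for this example.

```python
from collections import defaultdict


def summarize(results: list[tuple[str, bool]]) -> dict:
    """Compute overall and per-category pass rates from (category, passed) pairs."""
    if not results:
        raise ValueError("no results to summarize")

    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)

    return {
        # Assumption: the aggregate is the fraction of tasks judged as a pass.
        "overall": sum(passes.values()) / len(results),
        "by_category": {c: passes[c] / totals[c] for c in sorted(totals)},
    }


report = summarize([("Amazon", True), ("Amazon", False), ("GitHub", True)])
print(report["by_category"])  # {'Amazon': 0.5, 'GitHub': 1.0}
```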

Data sources

  • WebVoyager benchmark (2026 corpus)

    643 live-web tasks across 13 popular sites: Amazon, Apple, ArXiv, BBC, GitHub, Google Map, Google Search, Huggingface, Search Engine, Wolfram Alpha, and others. Pistachio maintains a patched reference set for time-sensitive prompts, so re-runs against today's web are scored against current answers and remain comparable from run to run.