Benchmark · Enterprise · Web Agent

WebVoyager

Score your browser agent against the updated WebVoyager corpus (2026 version).

WebVoyager is a widely used public benchmark for browser agents: 643 live-web tasks such as "find the cheapest flight from SFO to NYC tomorrow" or "add this product to your Amazon cart." This pack runs your browser agent against the 2026-version corpus, scores each task with an LLM judge pinned to Claude Opus 4.6, and produces a signed aggregate score plus a per-category breakdown.

Highlights.

643 live-web tasks; proprietary patches for time-sensitive prompts

LLM judge pinned to Claude Opus 4.6, temperature 0, calibrated

CLI-only surface: pistachio harness run webvoyager-v1 --adapter browser-use

Research-based harness

Public benchmarks, private adapters.

This is a research-based public benchmark: a corpus of reference tasks scored by a calibrated LLM judge. Every agent exposes a different endpoint shape, so we build the adapter that maps your agent to the corpus, scaffold the benchmark run, and deliver a signed aggregate you can share with customers or post publicly. One-time integration, reproducible scores on every re-run.
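To make the adapter idea concrete, here is a minimal sketch in Python. It is not Pistachio's actual adapter API: the Task fields, the AgentAdapter interface, the CallableAdapter wrapper, and the task id format are all hypothetical names chosen for illustration. The sketch only shows the general shape of mapping one corpus task onto an agent's own input format and normalizing the answer that comes back.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Task:
    task_id: str  # hypothetical identifier format
    site: str     # e.g. "Amazon"
    prompt: str   # natural-language instruction given to the agent


class AgentAdapter(Protocol):
    """Hypothetical interface: turn one corpus task into one final answer."""

    def run(self, task: Task) -> str: ...


class CallableAdapter:
    """Wraps any function that takes a prompt string and returns an answer string."""

    def __init__(self, agent_fn: Callable[[str], str]) -> None:
        self._agent_fn = agent_fn

    def run(self, task: Task) -> str:
        # Map the corpus task onto the agent's own input shape, then
        # normalize the output so the judge sees a plain string.
        return self._agent_fn(f"On {task.site}: {task.prompt}").strip()


def echo_agent(prompt: str) -> str:
    # Stand-in for a real browser agent; it does no browsing.
    return f"  (no answer) {prompt}  "


adapter: AgentAdapter = CallableAdapter(echo_agent)
task = Task("amazon--0", "Amazon", "Find the price of the top-rated wireless mouse under $30.")
answer = adapter.run(task)
print(answer)
```

Keeping the adapter behind a single run method means the harness does not need to know whether the agent sits behind an HTTP endpoint, a CLI, or an in-process library.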

Examples

Example checks.

Check 01 · Deterministic

Completes a live web task and returns a correct answer

Input

On Amazon, find the price of the top-rated wireless mouse under $30.

Expected behavior

The agent navigates Amazon, filters and sorts the results correctly, and extracts a price. The judge (Claude Opus 4.6) scores the answer against the task's reference answer.

Check 02 · Deterministic

Handles a time-sensitive task via the patch corpus

Input

What's today's #1 trending video on YouTube?

Expected behavior

The harness uses the patched reference answer for this task ID, and the judge scores the agent's answer against that patched reference rather than the stale original.

Grading

Judging criteria.

What a pass means

A pass means the LLM judge (Claude Opus 4.6, temperature 0) determines that the agent's output matches the reference answer for the task. Time-sensitive tasks are scored against Pistachio's patched references rather than stale ones. Each task is binary pass/fail; the pack returns an aggregate score across all 643 tasks.
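The arithmetic behind the aggregate and the per-category breakdown can be sketched as follows. This is an illustration only, not the pack's actual output schema: the Verdict type, the summarize function, and the report keys are hypothetical, and treating each task's site as its category is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    task_id: str
    category: str  # assumption: the category is the task's site
    passed: bool   # binary pass/fail from the judge


def summarize(verdicts: list[Verdict]) -> dict:
    """Return the overall pass rate plus a pass rate per category."""
    if not verdicts:
        raise ValueError("no verdicts to summarize")
    per_category: dict[str, list[bool]] = defaultdict(list)
    for v in verdicts:
        per_category[v.category].append(v.passed)
    passed = sum(v.passed for v in verdicts)
    return {
        "tasks": len(verdicts),
        "passed": passed,
        "aggregate": passed / len(verdicts),
        "by_category": {c: sum(r) / len(r) for c, r in sorted(per_category.items())},
    }


# Made-up verdicts for illustration; a real run would have one per task.
sample = [
    Verdict("t1", "Amazon", True),
    Verdict("t2", "Amazon", False),
    Verdict("t3", "GitHub", True),
    Verdict("t4", "GitHub", True),
]
report = summarize(sample)
print(report)
```

Because each task is strictly pass or fail, the aggregate is a plain pass rate, and each category score is the same ratio restricted to that category's tasks.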

Data sources

WebVoyager benchmark (2026 corpus)
643 live-web tasks across 13 popular sites, including Amazon, Apple, arXiv, BBC, GitHub, Google Maps, Google Search, Hugging Face, and Wolfram Alpha. Pistachio maintains a patched reference set for time-sensitive prompts so that re-runs against today's web remain directly comparable.
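One way to picture the patched reference set is as an overlay on the original references: a patched entry wins when it exists, and the original is used otherwise. The sketch below assumes that model. The resolve_reference function, the dictionary layout, and the task ids are hypothetical and do not describe Pistachio's storage format.

```python
def resolve_reference(task_id: str, original: dict[str, str], patched: dict[str, str]) -> str:
    """Prefer the patched reference for a task, falling back to the original."""
    if task_id in patched:
        return patched[task_id]
    return original[task_id]  # raises KeyError if the task id is unknown


# Made-up references for illustration only.
original_refs = {"t-static": "answer A", "t-trending": "stale answer"}
patched_refs = {"t-trending": "refreshed answer"}

static_ref = resolve_reference("t-static", original_refs, patched_refs)
trending_ref = resolve_reference("t-trending", original_refs, patched_refs)
print(static_ref, "|", trending_ref)
```

With an overlay like this, tasks whose answers do not change over time keep their original references, and only the time-sensitive ones need ongoing maintenance.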

Harnesses you'll probably also want

Agents

Agent Hygiene

The sanity check every agent should pass before shipping.

RAG

RAG Faithfulness

Catch hallucinations before your users do.

Tool Use

Tool Use Stress Test

Function-call scenarios your agent will eventually hit.