Tool UsePersonalHorizontal

Tool Use Stress Test

Function-call scenarios your agent will eventually hit.

Any AI agent with tools has to pick the right one and pass the right arguments — and not call any tool when no available function actually matches the user's request. This pack runs your agent through 1,440 labeled function-calling tasks from the Berkeley Function Calling Leaderboard (BFCL V4) and checks each call against the human-curated ground truth, including "should not call" irrelevance cases.

Install in Claude Code See pricing

Highlights.

1,440 real BFCL V4 rows across six canonical tool-use categories

Deterministic grading — no judge variance, no calibration drift

Includes irrelevance (should-not-call) and multi-turn categories

Every fixture traces back to a BFCL row id for audit

Install

Three commands.
Then receipts.

Install the Pistachio CLI, add the harness as a Claude Code MCP tool, run it against your agent, and get a signed pass/fail report you can drop into a PR or sales deck.

CLI (Claude Code)

zsh

# 1. Install Pistachio CLI
npx @pistachio/cli login

# 2. Add the harness as a Claude Code MCP tool
pistachio harness add tool-use-stress

# 3. Run it from inside Claude Code — get a signed report
pistachio harness run tool-use-stress

SDK (Node)

typescript

import { Pistachio } from "@pistachio/sdk";

const pistachio = new Pistachio({ apiKey: process.env.PISTACHIO_KEY });

const run = await pistachio.harnesses.run("tool-use-stress", {
  endpoint: "https://your-agent.example.com/v1/messages",
});

console.log(run.passRate);
console.log(run.signedReportUrl);

MCP-native

Lives inside Claude Code.

Signed reports

Ed25519 attestation.

Deterministic

Same input, same score.

Examples

Example checks.

Check 01Deterministic

Calls the right tool with matching args

Input

Find the area of a triangle with a base of 10 units and height of 5 units.

Expected behavior

Calls calculate_triangle_area(base=10, height=5). Args match BFCL ground truth.

Check 02Deterministic

Declines to call when no tool fits

Input

Calculate the area of a triangle given base 10m and height 5m.

Expected behavior

No tool call — the irrelevance category expects the agent to recognize no available function matches.

Grading

Judging criteria.

What a pass means

A pass means the agent's tool call(s) match BFCL's ground-truth function name and arguments — lenient on numeric int/float, case-insensitive on strings — or, for irrelevance fixtures, the agent makes no tool call at all. Tool-call argument JSON must parse.

Data sources

Berkeley Function Calling Leaderboard (BFCL V4)
ShishirPatil/gorilla, Apache-2.0. 1,440 labeled rows across six canonical tool-use categories — simple, parallel, multiple, parallel-multiple, irrelevance, multi-turn. Ground truth comes from BFCL's possible_answer annotations (human-curated).

Harnesses you'll probably also want

Agents

Agent Hygiene

The sanity check every agent should pass before shipping.

RAG

RAG Faithfulness

Catch hallucinations before your users do.

Benchmark

WebVoyager

Score your browser agent against the updated WebVoyager corpus (2026 version).