Pistachio
Back to marketplace
Tool UsePersonalHorizontal

Tool Use Stress Test

Function-call scenarios your agent will eventually hit.

Any AI agent with tools has to pick the right one and pass the right arguments — and not call any tool when no available function actually matches the user's request. This pack runs your agent through 1,440 labeled function-calling tasks from the Berkeley Function Calling Leaderboard (BFCL V4) and checks each call against the human-curated ground truth, including "should not call" irrelevance cases.

Highlights.

1,440 real BFCL V4 rows across six canonical tool-use categories

Deterministic grading — no judge variance, no calibration drift

Includes irrelevance (should-not-call) and multi-turn categories

Every fixture traces back to a BFCL row id for audit

Install

Three commands.
Then receipts.

Install the Pistachio CLI, add the harness as a Claude Code MCP tool, run it against your agent, and get a signed pass/fail report you can drop into a PR or sales deck.

CLI (Claude Code)

zsh
# 1. Install Pistachio CLI
npx @pistachio/cli login

# 2. Add the harness as a Claude Code MCP tool
pistachio harness add tool-use-stress

# 3. Run it from inside Claude Code — get a signed report
pistachio harness run tool-use-stress

SDK (Node)

typescript
import { Pistachio } from "@pistachio/sdk";

const pistachio = new Pistachio({ apiKey: process.env.PISTACHIO_KEY });

const run = await pistachio.harnesses.run("tool-use-stress", {
  endpoint: "https://your-agent.example.com/v1/messages",
});

console.log(run.passRate);
console.log(run.signedReportUrl);
MCP-native
Lives inside Claude Code.
Signed reports
Ed25519 attestation.
Deterministic
Same input, same score.
Examples

Example checks.

Check 01Deterministic

Calls the right tool with matching args

Input
Find the area of a triangle with a base of 10 units and height of 5 units.
Expected behavior
Calls calculate_triangle_area(base=10, height=5). Args match BFCL ground truth.
Check 02Deterministic

Declines to call when no tool fits

Input
Calculate the area of a triangle given base 10m and height 5m.
Expected behavior
No tool call — the irrelevance category expects the agent to recognize no available function matches.
Grading

Judging criteria.

What a pass means

A pass means the agent's tool call(s) match BFCL's ground-truth function name and arguments — lenient on numeric int/float, case-insensitive on strings — or, for irrelevance fixtures, the agent makes no tool call at all. Tool-call argument JSON must parse.

Data sources

  • Berkeley Function Calling Leaderboard (BFCL V4)

    ShishirPatil/gorilla, Apache-2.0. 1,440 labeled rows across six canonical tool-use categories — simple, parallel, multiple, parallel-multiple, irrelevance, multi-turn. Ground truth comes from BFCL's possible_answer annotations (human-curated).