Tool Use Stress Test
Function-call scenarios your agent will eventually hit.
Any AI agent with tools has to pick the right one and pass the right arguments — and not call any tool when no available function actually matches the user's request. This pack runs your agent through 1,440 labeled function-calling tasks from the Berkeley Function Calling Leaderboard (BFCL V4) and checks each call against the human-curated ground truth, including "should not call" irrelevance cases.
Highlights.
1,440 real BFCL V4 rows across six canonical tool-use categories
Deterministic grading — no judge variance, no calibration drift
Includes irrelevance (should-not-call) and multi-turn categories
Every fixture traces back to a BFCL row id for audit
Three commands.
Then receipts.
Install the Pistachio CLI, add the harness as a Claude Code MCP tool, run it against your agent, and get a signed pass/fail report you can drop into a PR or sales deck.
CLI (Claude Code)
# 1. Install Pistachio CLI
npx @pistachio/cli login
# 2. Add the harness as a Claude Code MCP tool
pistachio harness add tool-use-stress
# 3. Run it from inside Claude Code — get a signed report
pistachio harness run tool-use-stressSDK (Node)
import { Pistachio } from "@pistachio/sdk";
const pistachio = new Pistachio({ apiKey: process.env.PISTACHIO_KEY });
const run = await pistachio.harnesses.run("tool-use-stress", {
endpoint: "https://your-agent.example.com/v1/messages",
});
console.log(run.passRate);
console.log(run.signedReportUrl);Example checks.
Calls the right tool with matching args
Declines to call when no tool fits
Judging criteria.
What a pass means
A pass means the agent's tool call(s) match BFCL's ground-truth function name and arguments — lenient on numeric int/float, case-insensitive on strings — or, for irrelevance fixtures, the agent makes no tool call at all. Tool-call argument JSON must parse.
Data sources
- Berkeley Function Calling Leaderboard (BFCL V4)
ShishirPatil/gorilla, Apache-2.0. 1,440 labeled rows across six canonical tool-use categories — simple, parallel, multiple, parallel-multiple, irrelevance, multi-turn. Ground truth comes from BFCL's possible_answer annotations (human-curated).
