Pistachio
Back to marketplace
RAGPersonalHorizontal

RAG Faithfulness

Catch hallucinations before your users do.

In retrieval-augmented generation, the agent gets a question plus a few retrieved documents and answers. Two things go wrong: hallucinating claims not in the docs, or answering a different question than the one asked. This pack runs your agent through 1,850 labeled RAG rows from RAGBench (medical, legal, financial, general QA) and grades both — does every claim trace back to a document, and is the answer on-topic?

Highlights.

1,800 real RAG rows with human-labeled adherence and relevance scores

Calibrated grounding + relevance judges (85% floor enforced at boot)

Nine commercial-safe RAGBench subsets across medical, legal, financial, and general QA

Every fixture traces back to an upstream row id for audit

Install

Three commands.
Then receipts.

Install the Pistachio CLI, add the harness as a Claude Code MCP tool, run it against your agent, and get a signed pass/fail report you can drop into a PR or sales deck.

CLI (Claude Code)

zsh
# 1. Install Pistachio CLI
npx @pistachio/cli login

# 2. Add the harness as a Claude Code MCP tool
pistachio harness add rag-faithfulness

# 3. Run it from inside Claude Code — get a signed report
pistachio harness run rag-faithfulness

SDK (Node)

typescript
import { Pistachio } from "@pistachio/sdk";

const pistachio = new Pistachio({ apiKey: process.env.PISTACHIO_KEY });

const run = await pistachio.harnesses.run("rag-faithfulness", {
  endpoint: "https://your-agent.example.com/v1/messages",
});

console.log(run.passRate);
console.log(run.signedReportUrl);
MCP-native
Lives inside Claude Code.
Signed reports
Ed25519 attestation.
Deterministic
Same input, same score.
Examples

Example checks.

Check 01Deterministic

Grounds every claim in retrieved docs

Input
What would aid accurate calculation of a case fatality ratio? (+ 3 retrieved abstracts)
Expected behavior
Answer covers probability-of-dying and reporting-bias adjustments; every claim traceable to a retrieved passage.
Check 02Deterministic

Stays on topic

Input
What would aid accurate calculation of a case fatality ratio? (+ 3 retrieved abstracts)
Expected behavior
Fails relevance judge — answer is grounded but doesn't address the question asked.
Grading

Judging criteria.

What a pass means

A pass means two calibrated LLM judges agree: every factual claim in the response is supported by the retrieved documents (grounding judge, calibrated against RAGBench's adherence_score), AND the response actually addresses the question (relevance judge, calibrated against RAGBench's relevance_score). Both judges enforce an 85% accuracy floor at boot.

Data sources

  • RAGBench

    1,800+ human-labeled RAG rows across nine commercial-safe subsets — covidqa, cuad, delucionqa, expertqa, finqa, hagrid, pubmedqa, tatqa, techqa. Annotations CC-BY-4.0 (Galileo); upstream corpora retain own licenses.