RAG Faithfulness
Catch hallucinations before your users do.
In retrieval-augmented generation, the agent gets a question plus a few retrieved documents and answers. Two things can go wrong: the agent hallucinates claims that are not in the documents, or it answers a different question from the one asked. This pack runs your agent through 1,850 labeled RAG rows from RAGBench (medical, legal, financial, general QA) and grades both: does every claim trace back to a document, and is the answer on-topic?
Highlights.
1,800+ real RAG rows with human-labeled adherence and relevance scores
Calibrated grounding + relevance judges (85% floor enforced at boot)
Nine commercial-safe RAGBench subsets across medical, legal, financial, and general QA
Every fixture traces back to an upstream row id for audit
Three commands.
Then receipts.
Install the Pistachio CLI, add the harness as a Claude Code MCP tool, run it against your agent, and get a signed pass/fail report you can drop into a PR or sales deck.
CLI (Claude Code)
# 1. Run the Pistachio CLI via npx and log in
npx @pistachio/cli login
# 2. Add the harness as a Claude Code MCP tool
pistachio harness add rag-faithfulness
# 3. Run it from inside Claude Code — get a signed report
pistachio harness run rag-faithfulness
SDK (Node)
import { Pistachio } from "@pistachio/sdk";
const pistachio = new Pistachio({ apiKey: process.env.PISTACHIO_KEY });
const run = await pistachio.harnesses.run("rag-faithfulness", {
  endpoint: "https://your-agent.example.com/v1/messages",
});
console.log(run.passRate);
console.log(run.signedReportUrl);
Example checks.
Grounds every claim in retrieved docs
Stays on topic
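The two checks above combine into one verdict per row: a row passes only when it is both grounded and on-topic. The sketch below is a minimal illustration of that rule, not the harness's actual code. The `RowVerdict` shape, the `rowPasses` and `passRate` helpers, and the sample row ids are all hypothetical names introduced here.

```typescript
// Hypothetical shape for illustration; not a type exported by the Pistachio SDK.
interface RowVerdict {
  rowId: string;     // upstream row id, kept so a verdict can be audited
  grounded: boolean; // grounding check: every claim is supported by the retrieved docs
  onTopic: boolean;  // relevance check: the response addresses the question asked
}

// A row passes only when both checks succeed.
function rowPasses(v: RowVerdict): boolean {
  return v.grounded && v.onTopic;
}

// Fraction of passing rows, assumed here to be on a 0..1 scale.
// Returns 0 for an empty run rather than dividing by zero.
function passRate(verdicts: RowVerdict[]): number {
  if (verdicts.length === 0) return 0;
  return verdicts.filter(rowPasses).length / verdicts.length;
}

// Made-up sample verdicts, not real RAGBench rows.
const sample: RowVerdict[] = [
  { rowId: "row-1", grounded: true, onTopic: true },
  { rowId: "row-2", grounded: true, onTopic: false },
  { rowId: "row-3", grounded: false, onTopic: true },
  { rowId: "row-4", grounded: true, onTopic: true },
];

console.log(passRate(sample)); // 0.5
```

Rows 2 and 3 each fail one check, so only two of the four sample rows pass. Whether the SDK's `run.passRate` uses a 0..1 or a 0..100 scale is not stated here, so treat the scale in this sketch as an assumption.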
Judging criteria.
What a pass means
A pass means two calibrated LLM judges agree: every factual claim in the response is supported by the retrieved documents (grounding judge, calibrated against RAGBench's adherence_score), AND the response actually addresses the question (relevance judge, calibrated against RAGBench's relevance_score). Both judges enforce an 85% accuracy floor at boot.
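One way to picture the accuracy floor is as a startup check that compares a judge's verdicts with the human labels and stops if agreement is too low. The sketch below assumes the human scores have already been reduced to pass/fail labels. The `CalibrationExample` shape and the `judgeAccuracy` and `assertCalibrated` helpers are hypothetical names, not part of the Pistachio SDK, and how the harness actually computes accuracy is not specified here.

```typescript
// Hypothetical: one calibration example pairs a judge verdict with the human label.
interface CalibrationExample {
  judgeSaysPass: boolean;
  humanSaysPass: boolean; // assumed to be derived from the human-labeled score
}

const ACCURACY_FLOOR = 0.85; // the 85% floor described above

// Accuracy here means the share of examples where judge and human label agree.
function judgeAccuracy(examples: CalibrationExample[]): number {
  if (examples.length === 0) {
    throw new Error("calibration set is empty");
  }
  const agreements = examples.filter(
    (e) => e.judgeSaysPass === e.humanSaysPass
  ).length;
  return agreements / examples.length;
}

// Throws if the judge falls below the floor; otherwise returns its accuracy.
function assertCalibrated(name: string, examples: CalibrationExample[]): number {
  const accuracy = judgeAccuracy(examples);
  if (accuracy < ACCURACY_FLOOR) {
    throw new Error(
      `${name} judge accuracy ${accuracy.toFixed(3)} is below the ${ACCURACY_FLOOR} floor`
    );
  }
  return accuracy;
}
```

Under this sketch, a judge that agrees with 9 of 10 labels clears the floor, while one that agrees with 8 of 10 is rejected before any agent run starts.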
Data sources
- RAGBench
1,800+ human-labeled RAG rows across nine commercial-safe subsets: covidqa, cuad, delucionqa, expertqa, finqa, hagrid, pubmedqa, tatqa, techqa. Annotations are licensed CC-BY-4.0 (Galileo); upstream corpora retain their own licenses.
