RAG · Personal · Horizontal

RAG Faithfulness

Catch hallucinations before your users do.

In retrieval-augmented generation (RAG), the agent receives a question plus a few retrieved documents and answers from them. Two things can go wrong: the agent hallucinates claims that are not in the documents, or it answers a different question than the one asked. This pack runs your agent through 1,850 labeled RAG rows from RAGBench (medical, legal, financial, general QA) and grades both: does every claim trace back to a document, and is the answer on-topic?

Highlights.

1,800+ real RAG rows with human-labeled adherence and relevance scores

Calibrated grounding and relevance judges (85% accuracy floor enforced at boot)

Nine commercial-safe RAGBench subsets across medical, legal, financial, and general QA

Every fixture traces back to an upstream row id for audit

Install

Three commands.
Then receipts.

Install the Pistachio CLI, add the harness as a Claude Code MCP tool, run it against your agent, and get a signed pass/fail report you can drop into a PR or sales deck.

CLI (Claude Code)

zsh

# 1. Install Pistachio CLI
npx @pistachio/cli login

# 2. Add the harness as a Claude Code MCP tool
pistachio harness add rag-faithfulness

# 3. Run it from inside Claude Code — get a signed report
pistachio harness run rag-faithfulness

SDK (Node)

typescript

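import { Pistachio } from "@pistachio/sdk";

// Read the API key from the environment rather than hard-coding it
const pistachio = new Pistachio({ apiKey: process.env.PISTACHIO_KEY });

// Run the harness against the agent endpoint under test
const run = await pistachio.harnesses.run("rag-faithfulness", {
  endpoint: "https://your-agent.example.com/v1/messages",
});

console.log(run.passRate);        // pass rate for this run
console.log(run.signedReportUrl); // link to the signed report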
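
The SDK snippet reads `run.passRate`, which suggests one way to gate a CI job on the result. The sketch below is an illustration only: `HarnessRunSummary`, `meetsThreshold`, and the idea that `passRate` is a fraction between 0 and 1 are assumptions, not part of the documented SDK.

```typescript
// Hypothetical shape: only the two fields the SDK snippet reads.
interface HarnessRunSummary {
  passRate: number; // assumed here to be a fraction in [0, 1]
  signedReportUrl: string;
}

// Returns true when the run clears the minimum pass rate.
function meetsThreshold(run: HarnessRunSummary, minPassRate: number): boolean {
  if (!Number.isFinite(run.passRate)) {
    throw new Error("passRate is missing or not a number");
  }
  return run.passRate >= minPassRate;
}

// Illustrative values, not real results.
const exampleRun: HarnessRunSummary = {
  passRate: 0.92,
  signedReportUrl: "https://example.com/report",
};

if (!meetsThreshold(exampleRun, 0.9)) {
  console.error(`Pass rate too low, see ${exampleRun.signedReportUrl}`);
}
```

In a CI script, the failing branch would typically set a non-zero exit code so the pipeline stops.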

MCP-native

Lives inside Claude Code.

Signed reports

Ed25519 attestation.

Deterministic

Same input, same score.
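
For readers unfamiliar with Ed25519 attestation, here is a minimal sketch of what verifying a signed report could look like using Node's built-in `crypto` module. The report payload, the locally generated key pair, and the `isAuthentic` helper are all stand-ins for illustration; the actual report format and key distribution are not described on this page.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Stand-in key pair. In practice the public key would come from the report issuer.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

// Hypothetical report payload, serialized to bytes.
const report = Buffer.from(
  JSON.stringify({ harness: "rag-faithfulness", passRate: 0.92 }),
);

// Ed25519 hashes internally, so the algorithm argument is null.
const signature = sign(null, report, privateKey);

// Returns true only if the signature matches the exact payload bytes.
function isAuthentic(payload: Buffer, sig: Buffer, key: KeyObject): boolean {
  return verify(null, payload, key, sig);
}

// Any change to the payload, even one extra byte, invalidates the signature.
const tampered = Buffer.concat([report, Buffer.from(" ")]);

console.log(isAuthentic(report, signature, publicKey));
console.log(isAuthentic(tampered, signature, publicKey));
```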

Examples

Example checks.

Check 01 · Deterministic

Grounds every claim in retrieved docs

Input

What would aid accurate calculation of a case fatality ratio? (+ 3 retrieved abstracts)

Expected behavior

Answer covers probability-of-dying and reporting-bias adjustments; every claim traceable to a retrieved passage.

Check 02 · Deterministic

Stays on topic

Input

What would aid accurate calculation of a case fatality ratio? (+ 3 retrieved abstracts)

Expected behavior

The answer addresses the question asked. A response that is grounded in the retrieved documents but off-topic fails the relevance judge.
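
The two checks above are independent: an answer can be grounded yet off-topic, or on-topic yet ungrounded. A rough sketch of how the two verdicts might combine per row is shown below. `JudgeVerdicts`, `Outcome`, and `gradeRow` are hypothetical names for illustration, not the harness's actual types.

```typescript
// Hypothetical verdict shape: one boolean per judge.
interface JudgeVerdicts {
  grounded: boolean; // every claim traces to a retrieved document
  relevant: boolean; // the answer addresses the question asked
}

type Outcome = "pass" | "fail:grounding" | "fail:relevance" | "fail:both";

// A row passes only when both judges approve; otherwise report which one failed.
function gradeRow(v: JudgeVerdicts): Outcome {
  if (v.grounded && v.relevant) return "pass";
  if (!v.grounded && !v.relevant) return "fail:both";
  return v.grounded ? "fail:relevance" : "fail:grounding";
}

// The Check 02 scenario: grounded, but not answering the question.
console.log(gradeRow({ grounded: true, relevant: false })); // prints "fail:relevance"
```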

Grading

Judging criteria.

What a pass means

A pass means two calibrated LLM judges both approve the response: every factual claim is supported by the retrieved documents (grounding judge, calibrated against RAGBench's adherence_score), and the response actually addresses the question (relevance judge, calibrated against RAGBench's relevance_score). Each judge must meet an 85% accuracy floor against those labels, checked at boot.
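
To make the accuracy floor concrete, here is one way such a boot-time calibration check could be sketched. It assumes each calibration row has been reduced to a boolean label and a boolean judge verdict (a continuous score would first need a threshold). `LabeledRow`, `accuracy`, and `assertCalibrated` are illustrative names, not the harness's real implementation.

```typescript
// Hypothetical calibration row: reference label vs. the judge's verdict.
interface LabeledRow {
  label: boolean;  // reference label from the dataset
  judged: boolean; // what the LLM judge decided
}

const ACCURACY_FLOOR = 0.85; // the 85% floor described above

// Fraction of rows where the judge agrees with the label.
function accuracy(rows: LabeledRow[]): number {
  if (rows.length === 0) {
    throw new Error("no calibration rows");
  }
  const correct = rows.filter((r) => r.label === r.judged).length;
  return correct / rows.length;
}

// Throws if the judge falls below the floor; otherwise returns its accuracy.
function assertCalibrated(judgeName: string, rows: LabeledRow[]): number {
  const acc = accuracy(rows);
  if (acc < ACCURACY_FLOOR) {
    throw new Error(
      `${judgeName} judge accuracy ${acc.toFixed(3)} is below ${ACCURACY_FLOOR}`,
    );
  }
  return acc;
}
```

Failing fast at startup, rather than after a run, means a drifting judge can never produce a report in the first place.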

Data sources

RAGBench
1,800+ human-labeled RAG rows across nine commercial-safe subsets: covidqa, cuad, delucionqa, expertqa, finqa, hagrid, pubmedqa, tatqa, techqa. Annotations are licensed CC-BY-4.0 (Galileo); upstream corpora retain their own licenses.

Harnesses you'll probably also want

Agents

Agent Hygiene

The sanity check every agent should pass before shipping.

Tool Use

Tool Use Stress Test

Function-call scenarios your agent will eventually hit.

Benchmark

WebVoyager

Score your browser agent against the updated WebVoyager corpus (2026 version).