How to Evaluate Your RAG Pipeline

RAG has two places to fail: retrieval and generation. Most teams only catch one. Here's the complete evaluation framework.

Your RAG-powered feature returns a confident, well-formatted answer. The problem: it's wrong. Not hallucinated in an obvious way — it cites a real document, uses correct terminology, and sounds authoritative. But the document it retrieved was from six months ago, before the policy changed, and the answer is no longer valid.

This is the failure mode that makes RAG evaluation hard. Unlike a pure LLM where you're testing one system, RAG is a pipeline: a retriever that finds relevant context, and a generator that synthesizes an answer from that context. Each component can fail independently. Evaluating only the final answer misses half the problem.

This post covers the complete RAG evaluation framework: how to evaluate retrieval and generation separately, the three core metrics (context relevance, faithfulness, answer relevance), the hyperparameters worth sweeping, and how to detect the silent failures that only appear in production.

Why RAG Requires Separate Evaluation of Retrieval and Generation

When a RAG system fails, the root cause is almost always one of two things: the retriever returned the wrong chunks, or the LLM generated an answer that wasn't supported by the chunks it received. These require different fixes, so you need to know which failure you're dealing with.

Consider the failure modes:

Retrieval failure: The relevant document exists in your corpus, but the retriever doesn't surface it. The LLM never sees the right context, so even a perfect generator can't produce the right answer.
Generation failure: The retriever returns excellent context, but the LLM ignores it or contradicts it — synthesizing an answer from prior training weights instead of the retrieved chunks.
Compound failure: The retriever returns partially relevant context, and the LLM extrapolates beyond what that context supports.

Evaluating only the final answer tells you that something went wrong. It doesn't tell you where to fix it. If your retrieval is broken, optimizing your prompt won't help. If your generation is hallucinating, upgrading your embedding model won't help.

The silent failure pattern: Teams often report their RAG system "works fine" because end-to-end answer quality is acceptable on their test set. But their test set only covers queries where retrieval is easy. On long-tail queries — niche topics, recent documents, ambiguous phrasing — retrieval silently degrades, and the LLM fills the gap with plausible-sounding fabrications.

The RAG Triad: The Core Evaluation Framework

The RAG Triad is one of the most widely used frameworks for RAG evaluation. It measures three relationships: between the query and retrieved context, between the context and the answer, and between the query and the answer.

Metric	What It Measures	Failure Signal
Context Relevance	Are the retrieved chunks relevant to the query?	Retriever returning noise
Faithfulness	Is the answer grounded in the retrieved context?	LLM hallucinating beyond context
Answer Relevance	Does the answer actually address the query?	Correct but non-responsive answers

All three metrics need to be high simultaneously. A high faithfulness score with low context relevance means the LLM faithfully reproduced irrelevant content. High answer relevance with low faithfulness means the LLM addressed the question, but its claims are not supported by the retrieved context, so any correctness is accidental.
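
One way to make these combinations actionable is to map a score pattern to the component most likely at fault. The sketch below is illustrative only: the `TriadScores` container, the `diagnose` function, and the single 0.7 cutoff are assumptions made for this example, not part of any particular library, and a real threshold should come from your own eval data.

```python
from dataclasses import dataclass


@dataclass
class TriadScores:
    """Hypothetical container for the three triad scores, each in [0, 1]."""
    context_relevance: float
    faithfulness: float
    answer_relevance: float


def diagnose(scores: TriadScores, threshold: float = 0.7) -> str:
    """Map a score pattern to the likeliest failing component.

    The 0.7 default is an arbitrary placeholder, not a recommended value.
    """
    low_context = scores.context_relevance < threshold
    low_faith = scores.faithfulness < threshold
    low_answer = scores.answer_relevance < threshold

    if low_context and low_faith:
        return "compound: weak context and unsupported claims"
    if low_context:
        return "retrieval: the retriever is returning noise"
    if low_faith:
        return "generation: the answer goes beyond the context"
    if low_answer:
        return "generation: grounded, but it does not address the query"
    return "ok"
```

Checking retrieval first reflects the point made earlier: if the context is wrong, prompt changes on the generation side will not fix the answer.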

Retrieval Quality Metrics

Context Precision

Context precision measures what fraction of your retrieved chunks are actually relevant to the query. If you retrieve 5 chunks and only 2 are relevant, your precision is 0.4. Low precision means you're feeding your LLM noisy context — irrelevant information that increases the chance of confused or distracted generation.

// JavaScript: calculate context precision
function contextPrecision(retrievedChunks, relevantChunkIds) {
    // relevantChunkIds: chunk IDs judged relevant by an LLM evaluator or a human
    if (retrievedChunks.length === 0) return 0; // nothing retrieved: avoid dividing by zero
    const relevantSet = new Set(relevantChunkIds);
    const relevant = retrievedChunks.filter(chunk => relevantSet.has(chunk.id));
    return relevant.length / retrievedChunks.length;
}

// Usage: score each query in your eval set.
// evaluatorLLM.judge is a placeholder for your own evaluator client;
// it is assumed to resolve to a string starting with YES or NO.
async function scoreRetrieval(query, retrievedChunks, evaluatorLLM) {
    // Ask the evaluator to judge relevance for each chunk
    const relevanceJudgments = await Promise.all(
        retrievedChunks.map(chunk =>
            evaluatorLLM.judge({
                query,
                chunk: chunk.text,
                task: 'Is this chunk relevant to answering the query? Answer YES or NO.'
            })
        )
    );

    // Normalize the verdict so "yes", "Yes." and " YES" all count as relevant
    const isYes = verdict => String(verdict).trim().toUpperCase().startsWith('YES');

    const relevantChunkIds = retrievedChunks
        .filter((_, i) => isYes(relevanceJudgments[i]))
        .map(c => c.id);

    return {
        precision: contextPrecision(retrievedChunks, relevantChunkIds),
        relevant_count: relevantChunkIds.length,
        total_retrieved: retrievedChunks.length
    };
}

Context Recall

Context recall is the complement: of all the relevant chunks that exist in your corpus, how many did your retriever actually find? High precision with low recall means you're retrieving a clean set of relevant chunks, but missing others that could have improved the answer. You need both.

Recall requires knowing the ground truth — which chunks in your corpus are relevant for a given query. For evaluation sets, you build this ground truth manually (or with LLM-assisted annotation) on a representative sample of queries, then measure how often your retriever surfaces those chunks.
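
Given such a ground-truth set, recall for a single query reduces to a set intersection. The following is a minimal sketch; the function name and the choice to return `None` when a query has no annotated relevant chunks are assumptions made for this example.

```python
from typing import Iterable, Optional


def context_recall(retrieved_ids: Iterable[str],
                   relevant_ids: Iterable[str]) -> Optional[float]:
    """Fraction of ground-truth relevant chunks that the retriever surfaced.

    Returns None when no relevant chunks are annotated, because recall is
    undefined in that case and should not be averaged in as 0 or 1.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return None
    found = relevant & set(retrieved_ids)
    return len(found) / len(relevant)
```

For example, if chunks `a` and `d` are annotated as relevant and the retriever returns `a`, `b`, and `c`, recall is 0.5: one of the two relevant chunks was found.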

Entity Recall

Entity recall is a cheaper proxy when full ground-truth annotation isn't feasible. Extract the named entities (people, organizations, dates, product names) from the correct answer, then check what fraction of those entities appear in the retrieved context. If the answer mentions "the Q3 2025 policy update" and that phrase doesn't appear in any retrieved chunk, your retriever missed something important.

// JavaScript: entity recall using simple extraction
function entityRecall(answerText, retrievedChunksText) {
    // Extract key noun phrases and entities from the answer.
    // extractEntities is a placeholder you must supply: it should return an
    // array of strings. In production, use an NLP library or LLM for extraction.
    const answerEntities = extractEntities(answerText);
    const contextText = retrievedChunksText.join(' ').toLowerCase();

    const foundEntities = answerEntities.filter(entity =>
        contextText.includes(entity.toLowerCase())
    );

    return {
        // No entities to look for: report null rather than dividing by zero
        recall: answerEntities.length > 0
            ? foundEntities.length / answerEntities.length
            : null,
        found: foundEntities,
        missing: answerEntities.filter(e => !foundEntities.includes(e))
    };
}

Retrieval evaluation shortcut: If you have ground-truth question-answer pairs, you can score retrieval without human chunk annotation. Use the gold answer to check whether the retrieved context contains the information needed to derive that answer — an LLM evaluator can judge this cheaply at scale.
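
The shortcut can be sketched as a single yes/no judgment per example. Everything named here is an assumption for illustration: `judge` stands in for whatever function sends a prompt to your evaluator LLM and returns its text, `retrieve` stands in for your retriever, and the prompt wording is only a starting point.

```python
from typing import Callable, Dict, List, Sequence

JUDGE_PROMPT = (
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Reference answer: {gold_answer}\n\n"
    "Does the context contain the information needed to derive the "
    "reference answer? Answer YES or NO."
)


def context_supports_answer(question: str, gold_answer: str,
                            chunks: Sequence[str],
                            judge: Callable[[str], str]) -> bool:
    """Ask the evaluator whether the retrieved chunks suffice for the gold answer."""
    prompt = JUDGE_PROMPT.format(
        context="\n\n".join(chunks),
        question=question,
        gold_answer=gold_answer,
    )
    # Tolerate replies such as "yes" or "Yes." from the evaluator
    return judge(prompt).strip().upper().startswith("YES")


def retrieval_hit_rate(examples: List[Dict[str, str]],
                       retrieve: Callable[[str], Sequence[str]],
                       judge: Callable[[str], str]) -> float:
    """Share of eval examples whose retrieved context supports the gold answer."""
    hits = [
        context_supports_answer(ex["question"], ex["gold_answer"],
                                retrieve(ex["question"]), judge)
        for ex in examples
    ]
    return sum(hits) / len(hits) if hits else 0.0
```

Passing `judge` and `retrieve` in as plain callables keeps the scoring logic independent of any specific LLM client or vector store, and makes it easy to test with stubs.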

Generation Quality Metrics

Faithfulness (Groundedness)

Faithfulness is the most critical generation metric. It measures whether every claim in the generated answer is supported by the retrieved context — not by the LLM's training data, not by plausible inference, but directly supported by what was retrieved.

The evaluation approach: decompose the answer into individual factual claims, then check each claim against the context. A claim counts as faithful if it can be directly inferred from the retrieved chunks without additional knowledge.

# Python: LLM-based faithfulness scoring
import json

def score_faithfulness(answer, retrieved_context, evaluator_llm):
    """
    Decompose answer into claims and verify each against context.
    Returns a dict whose "faithfulness" value is a score from 0.0 to 1.0.
    evaluator_llm.complete(prompt) is a placeholder for your own evaluator
    client; it is assumed to return the model's reply as a string.
    """
    # Step 1: extract individual claims from the answer
    claims_response = evaluator_llm.complete(f"""
    Break the following answer into individual factual claims.
    Return a JSON list of strings, each a single verifiable claim.

    Answer: {answer}
    """)
    claims = json.loads(claims_response)  # raises ValueError if the reply is not valid JSON

    if not claims:
        # An answer with no claims is vacuously faithful; keep the return shape consistent
        return {
            "faithfulness": 1.0,
            "claims_total": 0,
            "claims_supported": 0,
            "unsupported_claims": []
        }

    # Step 2: verify each claim against the context
    verdicts = []
    for claim in claims:
        verdict = evaluator_llm.complete(f"""
        Context: {retrieved_context}

        Claim: {claim}

        Is this claim directly supported by the context? Answer YES or NO only.
        """)
        # Normalize so "yes" and "Yes." are not silently counted as unsupported
        verdicts.append(verdict.strip().upper().startswith("YES"))

    faithful_count = sum(verdicts)
    return {
        "faithfulness": faithful_count / len(claims),
        "claims_total": len(claims),
        "claims_supported": faithful_count,
        "unsupported_claims": [c for c, v in zip(claims, verdicts) if not v]
    }

Hallucination Detection

Hallucination is closely related to the inverse of faithfulness: it measures what fraction of the answer is fabricated. The two differ at the edges, because a claim that is missing from the retrieved context can still be true. That is why the most reliable signal is cross-referencing specific factual claims against multiple sources, not just the retrieved context. A claim is a hallucination if it's stated as fact but appears in neither the retrieved context nor verified external knowledge.

Pay special attention to precise factual claims: numbers, dates, names, percentages, and direct quotes. LLMs hallucinate specific details far more often than general concepts. A pipeline that scores 95% faithfulness overall might still be fabricating specific figures 20% of the time.
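
A cheap first pass for numeric details is to pull every number out of the answer and check that it appears verbatim in the retrieved context. This is a deliberately crude sketch: the regular expression and the `unsupported_numbers` name are assumptions for this example, and exact string matching will miss reformatted values (for instance "1200" versus "1,200") and says nothing about names or quotes.

```python
import re
from typing import List

# Integers, thousands-separated numbers, decimals, and optional percent signs
_NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")


def unsupported_numbers(answer: str, context: str) -> List[str]:
    """Return numeric tokens in the answer that never appear in the context."""
    context_numbers = set(_NUMBER.findall(context))
    return [tok for tok in _NUMBER.findall(answer) if tok not in context_numbers]
```

Tracking the share of answers with a non-empty result, separately from the overall faithfulness score, is one way to surface the gap between general and specific-detail accuracy described above.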

Answer Relevance

Answer relevance is deceptively tricky. An answer can be faithful to the context and still completely miss the question. This happens most often when the retrieved context is technically related but not directly responsive — the LLM writes accurately about adjacent topics instead of answering what was asked.

The evaluation approach: ask an LLM evaluator to generate the question that the answer is most directly responding to, then measure semantic similarity between that generated question and the original query. If the answer is about cost when you asked about accuracy, the generated question will diverge from the original.
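
The reverse-question approach can be sketched as follows. Two things here are stand-ins and not the real technique in full: `generate_question` is a hypothetical callable wrapping your evaluator LLM, and the default similarity is a bag-of-words cosine used only so the example runs without an embedding model. In practice you would pass an embedding-based similarity function instead.

```python
import math
import re
from collections import Counter
from typing import Callable


def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a crude stand-in for embedding similarity."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def answer_relevance(query: str, answer: str,
                     generate_question: Callable[[str], str],
                     similarity: Callable[[str, str], float] = bow_cosine) -> float:
    """Score how closely the question implied by the answer matches the query."""
    # Ask the evaluator which question this answer most directly responds to
    reconstructed = generate_question(answer)
    return similarity(query, reconstructed)
```

Because word overlap ignores paraphrase, the default similarity will underrate answers whose reconstructed question uses different vocabulary from the original query, which is the main reason to swap in embeddings.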

The completeness trap: An answer can score high on faithfulness and answer relevance but still be dangerously incomplete. If the retrieved context only contains half the relevant information and the LLM faithfully summarizes that half, you get a confident, grounded, relevant — but incomplete — answer. This is why context recall matters: incompleteness starts at retrieval.

Hyperparameter Sweeps: What to Tune and How to Measure It

RAG pipelines have several hyperparameters that dramatically affect quality. Tuning them without a measurement framework is guesswork. With evaluation in place, you can sweep them systematically.

Chunk Size

Smaller chunks improve precision (less irrelevant content per chunk) but hurt recall (you may split relevant content across chunk boundaries). Larger chunks improve recall but decrease precision and increase LLM context noise. The optimal chunk size varies by document type — legal documents and technical manuals often need larger chunks than FAQ-style content.

Sweep strategy: fix top-K, vary chunk size from 256 to 2048 tokens in steps, measure context precision and recall on your eval set. Plot the precision-recall tradeoff and pick the chunk size at the knee of the curve.
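
A sweep harness for this can be small. In the sketch below, `evaluate` is a hypothetical callable that re-chunks the corpus at a given size, runs your eval set at a fixed top-K, and returns mean context precision and recall. Note one substitution: picking the knee of a curve is a visual judgment, so this example ranks by F1 (the harmonic mean of precision and recall) as a simple automated stand-in, which will not always agree with the knee.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def sweep_chunk_sizes(chunk_sizes: Iterable[int],
                      evaluate: Callable[[int], Tuple[float, float]]) -> List[Dict]:
    """Collect precision, recall, and F1 for each candidate chunk size."""
    rows = []
    for size in chunk_sizes:
        precision, recall = evaluate(size)
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        rows.append({"chunk_size": size, "precision": precision,
                     "recall": recall, "f1": f1})
    return rows


def best_by_f1(rows: List[Dict]) -> Dict:
    """Pick the row with the highest F1; plot the rows as well to sanity-check."""
    return max(rows, key=lambda row: row["f1"])
```

A typical call would be `best_by_f1(sweep_chunk_sizes([256, 512, 1024, 2048], evaluate))`, keeping the full list of rows so you can still plot the tradeoff and inspect it by eye.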

Top-K Retrieval

Retrieving more chunks increases recall but decreases precision and can overwhelm the LLM context window. There's also a position bias effect in many LLMs — information near the beginning and end of the context is more likely to be used than information in the middle.

// JavaScript: sweep top-K and measure quality
// generateAnswer, vectorStore.retrieve, and the evaluator methods are
// placeholders for your own pipeline and evaluation code.

// Mean of a numeric array (0 for an empty array)
function mean(xs) {
    return xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;
}

async function sweepTopK(evalQueries, vectorStore, evaluator) {
    const kValues = [1, 3, 5, 10, 15, 20];
    const results = [];

    for (const k of kValues) {
        // Score every eval query at this k; queries run concurrently
        const queryScores = await Promise.all(
            evalQueries.map(async ({ query }) => {
                const chunks = await vectorStore.retrieve(query, { topK: k });
                const answer = await generateAnswer(query, chunks);

                return {
                    contextPrecision: await evaluator.precision(query, chunks),
                    faithfulness: await evaluator.faithfulness(answer, chunks),
                    answerRelevance: await evaluator.relevance(query, answer)
                };
            })
        );

        const avgScores = {
            k,
            precision: mean(queryScores.map(s => s.contextPrecision)),
            faithfulness: mean(queryScores.map(s => s.faithfulness)),
            relevance: mean(queryScores.map(s => s.answerRelevance)),
            // Unweighted average of the three metrics per query
            composite: mean(queryScores.map(s =>
                (s.contextPrecision + s.faithfulness + s.answerRelevance) / 3
            ))
        };

        results.push(avgScores);
        console.log(`k=${k}: composite=${avgScores.composite.toFixed(3)}`);
    }

    return results;
}

Embedding Model

The embedding model determines how well semantic similarity maps to actual relevance for your domain. A model trained on general web text may perform worse than a domain-specific model for legal, medical, or technical corpora. The only way to know is to measure context recall on your actual queries, not on generic benchmarks.

When evaluating embedding models, hold everything else constant (same chunk size, same top-K, same LLM generator) and vary only the embedding model. Measure context recall and precision, not end-to-end answer quality — you want to isolate the retrieval contribution.
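
A controlled comparison along these lines might look like the sketch below. It assumes each embedding model is wrapped as a plain `embed(text)` callable returning a vector, uses brute-force cosine ranking in place of a real vector store, and expects each eval query as a dict with `query` and `relevant_ids` keys. All of these names and shapes are assumptions for this example.

```python
import math
from typing import Callable, Dict, List, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(embed: Embedder, corpus: Dict[str, str],
                queries: List[dict], k: int) -> float:
    """Mean context recall at top-k for one embedding function."""
    doc_vectors = {doc_id: embed(text) for doc_id, text in corpus.items()}
    recalls = []
    for q in queries:
        query_vector = embed(q["query"])
        ranked = sorted(doc_vectors,
                        key=lambda d: cosine(query_vector, doc_vectors[d]),
                        reverse=True)
        relevant = set(q["relevant_ids"])
        if relevant:  # skip queries with no annotated relevant chunks
            recalls.append(len(set(ranked[:k]) & relevant) / len(relevant))
    return sum(recalls) / len(recalls) if recalls else 0.0


def compare_embedders(embedders: Dict[str, Embedder], corpus: Dict[str, str],
                      queries: List[dict], k: int = 5) -> Dict[str, float]:
    """Same corpus, same queries, same k: only the embedding function varies."""
    return {name: recall_at_k(fn, corpus, queries, k)
            for name, fn in embedders.items()}
```

Because the corpus, chunking, queries, and `k` are shared across every entry in `embedders`, any difference in the returned scores is attributable to the embedding model alone.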

Production Monitoring: Silent RAG Failures

Lab evaluation covers the queries you anticipated. Production covers everything else. Silent RAG failures — degradation that users don't explicitly report — are the most dangerous class of failure because they accumulate invisibly.

Three signals worth monitoring in production:

Average context relevance score: If this drops, your corpus has likely grown stale or your query distribution has shifted. New document categories may not be well-represented in your embedding space.
Faithfulness rate: An uptick in low-faithfulness answers signals the LLM is increasingly ignoring context — often triggered by context window saturation (too many chunks) or corpus contamination (irrelevant documents retrieved at high frequency).
No-retrieval rate: Queries that return zero relevant chunks above your confidence threshold. These queries are effectively unanswerable by your RAG system but may receive fabricated responses instead of honest "I don't know" answers.

// JavaScript: production RAG monitoring

// Mean of a numeric array (0 for an empty array)
function mean(xs) {
    return xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;
}

class RAGMonitor {
    // alertThresholds: { faithfulness, contextRelevance }, each a minimum acceptable average
    constructor(evaluator, alertThresholds) {
        this.evaluator = evaluator;
        this.thresholds = alertThresholds;
        this.window = []; // rolling window of scores
    }

    async logQuery(query, retrievedChunks, answer) {
        // Sample evaluation: not every query, but a statistically representative subset
        const shouldEval = Math.random() < 0.05; // 5% sampling rate
        if (!shouldEval) return;

        const scores = {
            timestamp: Date.now(),
            context_relevance: await this.evaluator.precision(query, retrievedChunks),
            faithfulness: await this.evaluator.faithfulness(answer, retrievedChunks),
            answer_relevance: await this.evaluator.relevance(query, answer),
            chunk_count: retrievedChunks.length
        };

        this.window.push(scores);

        // Keep rolling window of last 500 evaluations
        if (this.window.length > 500) this.window.shift();

        this.checkAlerts();
        return scores;
    }

    checkAlerts() {
        const recent = this.window.slice(-50); // last 50 evals
        // Wait for a full 50-eval window so a single bad score cannot trigger an alert
        if (recent.length < 50) return;

        const avgFaithfulness = mean(recent.map(s => s.faithfulness));
        const avgRelevance = mean(recent.map(s => s.context_relevance));

        if (avgFaithfulness < this.thresholds.faithfulness) {
            this.alert('FAITHFULNESS_DEGRADATION', { avg: avgFaithfulness });
        }
        if (avgRelevance < this.thresholds.contextRelevance) {
            this.alert('RETRIEVAL_DEGRADATION', { avg: avgRelevance });
        }
    }

    alert(type, data) {
        console.error(`[RAG Alert] ${type}:`, data);
        // Hook into your alerting system
    }
}
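
The monitor above tracks context relevance and faithfulness but not the third signal, the no-retrieval rate. A minimal sketch of that calculation follows, assuming you log the retriever's similarity scores for each query and have already chosen a confidence threshold. The function name and input shape are assumptions for this example.

```python
from typing import Sequence


def no_retrieval_rate(scores_per_query: Sequence[Sequence[float]],
                      threshold: float) -> float:
    """Fraction of queries where no retrieved chunk scored at or above the threshold.

    scores_per_query holds one sequence of retriever similarity scores per
    logged query; an empty sequence means nothing was retrieved at all.
    """
    if not scores_per_query:
        return 0.0
    unanswerable = sum(
        1 for scores in scores_per_query
        if not any(score >= threshold for score in scores)
    )
    return unanswerable / len(scores_per_query)
```

Unlike the LLM-judged metrics, this one needs no evaluator call, so it can be computed on every query and not only on a sample.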

The corpus drift problem: Even if your RAG pipeline is perfectly tuned today, your document corpus changes over time — new documents added, old ones not retired, relevance distributions shifting. Re-run your evaluation suite monthly. Treat a 5% drop in context recall as a deployment-blocking regression, not a minor inconvenience.
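
A regression gate of this kind can be a few lines in a scheduled job or CI step. The text above does not say whether the 5% is an absolute drop in recall (for example 0.80 to 0.75) or a drop relative to the baseline, so this sketch supports both and leaves the choice to you. The function name and parameters are assumptions for this example.

```python
def recall_regressed(baseline: float, current: float,
                     max_drop: float = 0.05, relative: bool = False) -> bool:
    """Return True when context recall fell by at least max_drop.

    With relative=False the drop is measured in absolute recall points;
    with relative=True it is measured as a fraction of the baseline.
    """
    drop = baseline - current
    if relative:
        drop = drop / baseline if baseline else 0.0
    return drop >= max_drop
```

The two modes can disagree: a fall from 0.40 to 0.37 is only 0.03 in absolute terms but 7.5% relative to the baseline, so decide which definition you mean before wiring this into a deployment check.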

How Benchwright Handles RAG Evaluation

The evaluation framework above requires infrastructure: an evaluation dataset, an LLM evaluator, a metrics pipeline, and a monitoring layer. Building this from scratch for every RAG project adds weeks of engineering time that isn't core to your product.

Benchwright provides this infrastructure out of the box. Connect your RAG pipeline — any vector store, any LLM generator — define your evaluation dataset, and Benchwright runs the full RAG Triad evaluation on a schedule. You get context precision, context recall, faithfulness, and answer relevance scores, tracked over time, with regression alerts when any metric drops below your threshold.

When you change your chunk size, swap embedding models, or update your system prompt, Benchwright automatically re-evaluates against your dataset and flags any regressions before they reach users. No spreadsheets, no manual scoring runs, no finding out from support tickets.
