Building Reliable RAG: A Deep Dive into 7 Failure Points and Evaluation Frameworks
Master how to evaluate the RAG pipeline and solve common failures with DeepEval, RAGAS, TruLens, and Phoenix.
By Kuriko IWAI

Table of Contents
Introduction
The Anatomy of RAG Breakdown - 7 Failure Points (FPs)
The Evaluation Stack: Frameworks to Mitigate FPs
Wrapping Up

Introduction
Retrieval-Augmented Generation (RAG) is critical for modern AI architecture, serving as an essential framework for building context-aware agents.
But moving from a basic prototype to a production-ready system involves navigating significant hurdles in data retrieval, context consolidation, and response synthesis.
This article provides a deep dive into seven typical RAG failure points and the evaluation metrics with practical coding examples.
Building production-grade AI systems?
I help teams design and deploy scalable RAG pipelines, LLM systems, and MLOps infrastructure.
Or explore:
- Dive deeper 👉 Research Archive
- Learn by building 👉 AI Engineering Masterclass
- Try it live 👉 Playground
The Anatomy of RAG Breakdown - 7 Failure Points (FPs)
Barnett et al. identify seven critical failure points (FPs) in the development and implementation of Retrieval Augmented Generation (RAG) systems:

Figure A. Indexing and Query processes required for creating a RAG system. The indexing process is done at development time and queries at runtime. Failure points identified in this study are shown in red boxes (source)
These FPs represent key areas where system performance is compromised, leading to suboptimal results or system failures.
◼ FP1. Missing Content
Missing content happens when the system is asked a question that cannot be answered because the relevant information is not present in the available vector store in the first place.
The failure occurs when an LLM provides a plausible-sounding but incorrect response instead of stating it doesn't know.
◼ FP2. Missed the Top-Ranked Documents
This is a situation where a correct document exists in the vector store, but the retriever fails to rank it highly enough to include it in top-k documents fed to an LLM as context.
As a result, the correct information never reaches the LLM.
◼ FP3. Not in Context (Consolidation Strategy Limitations)
This is a situation where a correct document exists and is retrieved from the vector store, but is excluded during the consolidation process.
This happens when too many documents are returned and the system must filter them down to fit within an LLM's context window, token limits, or rate limits.
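To make this failure concrete, here is a minimal, hypothetical sketch (the `consolidate` helper and documents are invented for illustration) of a consolidation step that packs retrieved documents into a fixed token budget in rank order. A lower-ranked but relevant document is silently dropped:

```python
# Naive consolidation: pack retrieved docs into a token budget in rank order,
# so anything ranked below the cutoff is silently dropped (FP3).

def consolidate(docs: list[str], token_budget: int) -> list[str]:
    """Keep docs in rank order until the (rough) token budget is exhausted."""
    kept, used = [], 0
    for doc in docs:
        cost = len(doc.split())  # crude token estimate: whitespace words
        if used + cost > token_budget:
            break  # everything ranked below this point is dropped
        kept.append(doc)
        used += cost
    return kept

retrieved = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Shipping info: orders ship within 2 business days.",
    "Warranty terms: hardware is covered for 24 months.",  # the relevant doc
]
context = consolidate(retrieved, token_budget=18)
# The warranty doc never reaches the LLM, even though retrieval succeeded.
```

Even though the retriever did its job, the budget cutoff decides what the LLM actually sees.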
◼ FP4. Not Extracted
This is a situation where an LLM fails to identify the correct information in the context, even though the correct information was in the vector store and was successfully retrieved and consolidated.
This happens when the context is overly noisy or contains contradictory information that confuses the LLM.
◼ FP5. Wrong Format
This is a situation where storage, retrieval, consolidation, and LLM interpretation are successfully handled, but the LLM fails to follow specific formatting instructions provided in the prompt, such as a table, a bulleted list, or a JSON schema.
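A lightweight guard against this failure is a post-generation format check. The sketch below is illustrative (the output strings and the `is_valid_json_with_keys` helper are invented); it verifies that an output parses as a JSON object with the required keys:

```python
import json

def is_valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    """Return True if output parses as a JSON object containing required_keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys <= parsed.keys()

good = '{"answer": "42", "source": "doc_7"}'
bad = "The answer is 42 (source: doc_7)."  # correct content, wrong format (FP5)

ok_good = is_valid_json_with_keys(good, {"answer", "source"})
ok_bad = is_valid_json_with_keys(bad, {"answer", "source"})
```

A check like this can trigger a retry with a stricter formatting instruction before the response reaches the user.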
◼ FP6. Incorrect Specificity
The LLM's output is technically on-topic, but either too general or too complex for the user's needs.
For example, an LLM generates simple answers to a user query with a complex professional goal.
◼ FP7. Incomplete Answers
This is a situation where an LLM generates an output not necessarily wrong, but missing key pieces of information that were available in the context.
For example, when a user asks a complex question like "What are the key points in documents A, B, and C?" , the LLM only addresses one or two of the sources.
◼ How FPs Impact RAG Performance
Each of these FPs impacts the performance of RAG pipelines in one of the following three ways:
▫ Data Integrity & Trust Failures (FP1, FP4, FP7)
When missing or incorrect information is present, the system is no longer a reliable source of information. Primary FPs include:
FP1 (Missing Content): The answer is not in the doc in the first place.
FP4 (Not Extracted): The LLM decides to ignore the correct answer in the doc.
FP7 (Incomplete): The LLM gives half-truths, missing important pieces.
▫ Retrieval & Efficiency Bottlenecks (FP2, FP3)
The RAG pipeline can be inefficient when it misses key information in the retrieval and consolidation stages. Primary FPs include:
FP2 (Missed Top Ranked): The retriever fails to rank the correct document within the top-k results.
FP3 (Consolidation Strategy): The script to trim docs to fit the LLM limits drops the most important parts.
▫ User Experience & Formatting Errors (FP5, FP6)
Although correct, an output with poor readability or in a wrong format can compromise user experience. Primary FPs include:
FP5 (Wrong Format): The LLM fails to follow the specific output format like JSON.
FP6 (Incorrect Specificity): The LLM generates a lengthy output for a simple yes/no question, or vice versa (a too-brief answer to a complicated question).
The Evaluation Stack: Frameworks to Mitigate FPs
Evaluation metrics are designed to systematically mitigate these FPs.
This section explores major evaluation metrics with practical use cases.
▫ Major RAG Evaluation Metrics:
DeepEval
RAGAS
TruLens
Arize Phoenix
Braintrust
◼ DeepEval - The Unit Test before Deployment
DeepEval calculates a weighted score based on defined criteria.
An LLM-as-a-judge (e.g., GPT-4o) evaluates each criterion against an LLM's output:

DeepEval leverages G-Eval, a chain-of-thought (CoT) framework that takes a multi-step approach to evaluating the output:
Define the criteria to measure (e.g., "coherence," "fluency," or "relevance").
Generate evaluation steps (using an evaluator LLM).
Follow the evaluation steps to analyze the input and the LLM's output.
Calculate an expected weighted sum of the scores across criteria.
Leveraging this approach, DeepEval computes the final score as:

Score(O) = Σ_i w_i · C_i(O)

where:
w_i: the weight of a specific criterion, such as tone or helpfulness.
C_i: the score for criterion i against an output O.
f: the LLM's Likert-scale assessment that produces each C_i:
Table 1. The Likert-Scale Framework for LLM-as-a-Judge Scoring.
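As a rough illustration of this weighted scoring (not DeepEval's actual implementation; the criteria, weights, and Likert scores below are invented), the final score can be computed as a normalized weighted sum of per-criterion judge scores:

```python
# Sketch of a weighted G-Eval-style score: each criterion gets a Likert score
# from a judge LLM (here hard-coded), and the final score is the weighted sum.

def weighted_geval_score(criterion_scores: dict[str, float],
                         weights: dict[str, float]) -> float:
    """Combine per-criterion scores (normalized to 0-1) with their weights."""
    total_weight = sum(weights.values())
    return sum(weights[c] * criterion_scores[c] for c in criterion_scores) / total_weight

# Hypothetical judge outputs on a 0-3 Likert scale, normalized by dividing by 3
scores = {"coherence": 3 / 3, "fluency": 2 / 3, "relevance": 3 / 3}
weights = {"coherence": 0.5, "fluency": 0.2, "relevance": 0.3}
score = weighted_geval_score(scores, weights)  # weighted average in [0, 1]
```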
▫ Common Scenario in Practice
Situation: A technical documentation assistant (bot) for a complex software product appears to keep working every time the engineering team updates the codebase.
Problem: There is no quantitative proof that the bot can still answer user queries (you just "think" it's working...).
Solution: Integrate a PyTest function as a CI/CD regression suite in GitHub Actions, where DeepEval runs G-Eval and other metrics over a test case:
```python
# pytest component
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval

def test_bot_silent_regression():
    # set up metrics with thresholds
    relevancy = AnswerRelevancyMetric(threshold=0.85)
    faithfulness = FaithfulnessMetric(threshold=0.85)

    # g-eval (llm judge)
    geval_correctness = GEval(
        name="Correctness",
        criteria="Determine if the actual output is factually accurate based on the expected output.",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT
        ],
        threshold=0.85
    )

    # define a test case
    test_case = LLMTestCase(
        input="How do I rotate API keys in the dashboard?",
        actual_output="To rotate keys, go to Settings > Security and click 'Regenerate'.",
        retrieval_context=["The security tab allows users to regenerate API keys for safety."],
        expected_output="Users can rotate API keys via the Security section in Settings."
    )

    # assert the test case against the metrics
    assert_test(test_case, [relevancy, faithfulness, geval_correctness])
```
- Expected results: If any metric's score drops below the threshold (0.85), PyTest raises an AssertionError, immediately failing the CI build and preventing the silent regression from reaching production.
▫ Pros
A variety of metrics (50+) including specialized bias and toxicity checks are available.
Seamlessly integrates with existing CI/CD pipelines.
No reference needed; outputs can be assessed based solely on the prompt and provided context.
▫ Cons
The quality of evaluation heavily depends on the judge LLM's capabilities.
Computationally expensive when the judge LLM is a high-end model.
Developer Note - The Test Case for DeepEval
A set of LLMTestCase objects defines the test case that DeepEval runs.
In practice, the test cases should contain the most important user queries and labeled outputs along with the retrieved context.
These can be loaded from a JSON or CSV file.
◼ RAGAS - The Needle in a Haystack Optimizer
Retrieval Augmented Generation Assessment (RAGAS) aims to evaluate RAG without a human-annotated dataset by generating synthetic test sets.
It then computes its flagship metrics:

Figure B. The RAGAS evaluation triad diagram connecting Question, Context, and Answer through Precision, Recall, Faithfulness, and Relevancy metrics (Created by Kuriko IWAI)
The flagship metrics fall into three groups:
Retrieval pipeline (black, solid line, Figure B): Context precision, context recall.
Generation pipeline (black, dotted line, Figure B): Faithfulness, answer relevancy.
Ground truth (red box, Figure B): Answer semantic similarity, answer correctness.
For example, faithfulness calculates the overlap between the claims in the response and the retrieved context such that:

Faithfulness = X_AC / X_A

where:
X_AC: the number of claims in the answer (response) A supported by the given context, and
X_A: the total number of claims in the response A (supported by the context or not).
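The ratio of supported claims to total claims can be sketched in a few lines. Note that RAGAS uses an LLM to extract claims and judge whether each one is supported; the naive substring check and the example claims below are purely illustrative:

```python
def faithfulness(answer_claims: list[str], context: str) -> float:
    """Faithfulness = X_AC / X_A. A claim counts as supported here if its
    text literally appears in the context (RAGAS uses an LLM judge instead)."""
    if not answer_claims:
        return 0.0
    supported = sum(1 for claim in answer_claims if claim.lower() in context.lower())
    return supported / len(answer_claims)

context = "Ibuprofen may cause nausea. It should be taken with food."
claims = [
    "ibuprofen may cause nausea",        # supported by context
    "it should be taken with food",      # supported by context
    "it is safe for children under 2",   # not in context -> unfaithful claim
]
score = faithfulness(claims, context)  # 2 of 3 claims supported
```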
▫ Common Scenario in Practice
Situation: The RAG system for legal contracts is missing key clauses. You are unsure whether the problem is in the search (retriever) or the reading (generator).
Problem: No idea what the optimal top-k (number of chunks retrieved) is.
Solution: Use RAGAS to create a synthetic test set with 100 question-evidence pairs. Then, run the RAG pipeline against the test set to calculate context recall and context precision:
```python
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.docstore.document import Document
from ragas import evaluate
from ragas.testset import TestsetGenerator
from ragas.metrics import context_precision, context_recall, faithfulness, answer_relevancy

# set up models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o")
embedding_model = OpenAIEmbeddings()

# set up documents (doc_1, doc_2: raw document strings loaded elsewhere)
langchain_docs = [
    Document(page_content=doc_1),
    Document(page_content=doc_2),
]

# instantiate the generator and generate a synthetic test set with 100 pairs
generator = TestsetGenerator.from_langchain(
    llm=generator_llm,
    embedding_model=embedding_model
)
testset = generator.generate_with_langchain_docs(langchain_docs, testset_size=100)
test_df = testset.to_pandas()

### <--- rag pipeline execution --->

# create a results dataset with the rag pipeline
# (rag_pipeline: your RAG app, assumed to expose get_chunks() and get_answer())
results_data = {
    "question": test_df["question"].tolist(),
    "contexts": [rag_pipeline.get_chunks(q) for q in test_df["question"]],
    "answer": [rag_pipeline.get_answer(q) for q in test_df["question"]],
    "ground_truth": test_df["ground_truth"].tolist()
}
result_dataset = Dataset.from_dict(results_data)

# evaluate the results with the metrics
score = evaluate(
    dataset=result_dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy
    ],
    llm=critic_llm,
    embeddings=embedding_model,
)
```
Expected result: Depending on the metric results, the action plan can be as follows:
Table 2. RAGAS Diagnostic Action Plan - Mapping Scores to System Adjustments.
▫ Pros
- Excellent for early-stage projects without ground-truth datasets (as shown in the code snippet, RAGAS can generate a synthetic test set).
▫ Cons
A synthetic test set might miss nuanced factual errors.
Requires a robust extractor model to break down answers into individual claims (gpt-4o in the example).
◼ TruLens - The Feedback Loop Specialist
TruLens focuses on the internal mechanics of the RAG process rather than just the final output by using feedback functions.
For example, it measures answer relevance with cosine similarity:

sim(Q, R) = (Q · R) / (‖Q‖ ‖R‖)

where Q and R represent the embedding vectors of the query and response, respectively.
It also uses an LLM-based score on a 4-point Likert scale (0-3) that reflects how well the response satisfies the query's intent, making it well suited for ranking the quality of different search results.
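The cosine-similarity check can be sketched with NumPy. The 4-dimensional embeddings below are invented for illustration; in practice, Q and R come from an embedding model:

```python
import numpy as np

def cosine_similarity(q: np.ndarray, r: np.ndarray) -> float:
    """sim(Q, R) = (Q . R) / (||Q|| * ||R||)"""
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))

query_emb = np.array([0.1, 0.9, 0.2, 0.4])     # hypothetical query embedding
response_emb = np.array([0.1, 0.8, 0.3, 0.5])  # hypothetical response embedding
sim = cosine_similarity(query_emb, response_emb)  # close to 1 -> relevant
```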
▫ Common Scenario in Practice
Situation: A medical advisor bot answers a user's question correctly but adds a pro tip that isn't in the vetted PDF knowledge base.
Problem: The added pro tip might be helpful, but it is not grounded.
Solution: Use TruLens to implement a groundedness feedback function with a threshold such as score > 0.8:
```python
from trulens_eval import Tru, Feedback, Select, TruCustomApp
from trulens_eval.feedback.provider.openai import OpenAI as tOpenAI

MODEL_NAME = "gpt-4o"  # evaluator model (example choice)

# instantiate tru and the evaluator model
tru = Tru()
provider = tOpenAI(model_engine=MODEL_NAME)

# define the feedback func - compare the output vs the source (chunks)
f_groundedness = (
    Feedback(
        provider.groundedness_measure_with_cot_reasons,
        name="Groundedness"
    )
    .on(Select.RecordCalls.func.args.context)  # source (chunks)
    .on_output()  # output (llm response)
)

# wrap the rag pipeline with the tru recorder
tru_recorder = TruCustomApp(
    rag_query_engine,  # rag app
    app_id="app_ver1",
    feedbacks=[f_groundedness]
)

# execute the query with the tru recorder
with tru_recorder as recording:
    res = rag_query_engine(
        prompt="What are the side effects of Ibuprofen?"
    )

# retrieve the tru assessment results
record_df, feedback_cols = tru.get_records_and_feedback(app_ids=['app_ver1'])
```
- Expected results: When the LLM generates a response that contains information not present in the retrieved chunks, TruLens flags the record in your dashboard.
▫ Pros
Visualizes the reasoning chain to identify exactly where the agent went off-track.
Provides built-in support for grounding to catch hallucinations in real-time.
▫ Cons
Learning curve for defining custom feedback functions.
The dashboard can feel heavyweight for simple scripts.
◼ Arize Phoenix - The Silent Failure Map
Arize Phoenix is an open-source observability and evaluation tool to evaluate LLM outputs, including complex RAG systems.
Built on OpenTelemetry by Arize AI, it focuses on observability by treating LLM evaluation as a subset of MLOps.
In the context of RAG evaluation, Phoenix excels at embedding analysis, using Uniform Manifold Approximation and Projection (UMAP) to reduce high-dimensional vector embeddings into 2D/3D space:

UMAP: R^n → R^d

where n is the original dimension of the vector space, and d is the reduced dimension (e.g., d = 3).
This embedding analysis mathematically reveals if the failed queries are semantically grouped together, which indicates a gap in the vector database.
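To illustrate the reduction step: Phoenix relies on UMAP (from the umap-learn package); the sketch below substitutes a linear PCA projection via SVD so it runs on NumPy alone. It is not UMAP's algorithm, but it demonstrates the same n-to-d mapping of embeddings into plottable points:

```python
import numpy as np

# 200 hypothetical 384-dimensional embeddings standing in for a vector store
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))

def project(X: np.ndarray, d: int = 3) -> np.ndarray:
    """Map n-dimensional vectors to d dimensions (here via PCA/SVD;
    Phoenix uses UMAP, which preserves local neighborhood structure)."""
    X_centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:d].T  # project onto top-d principal directions

points_3d = project(embeddings, d=3)  # shape (200, 3), ready to plot
```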
▫ Common Scenario in Practice
Situation: A customer support bot works great for refunds but gives nonsensical answers to warranty claims.
Problem: A data hole in the vector database (it cannot be found in the logs).
Solution: Use Arize Phoenix to generate a UMAP Embedding Visualization (UEV), a 3D map of the vector database, to overlay user queries on the document chunks.
Expected results: You can visually see a cluster of user queries landing in a dark zone where no documents exist, indicating that some documents were never uploaded to the vector store.
▫ Pros
OpenTelemetry-native; integrates with existing enterprise monitoring stacks.
The best tool for visualizing blind spots of the vector store.
▫ Cons
Less focused on scoring, more on observing.
Can be overkill for small-scale applications or single-agent tools.
◼ Braintrust - The Prompt Regression Safety Net
Braintrust is designed for high-frequency iteration cycles by using cross-model comparison.
It assesses whether Model A + Prompt B is mathematically superior to Model C + Prompt D by comparing:

Δ = R_A − R_B

where R_A and R_B are the performance ratings of the two RAG configurations (a positive Δ favors configuration A).
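The comparison logic can be sketched as follows. The per-example scores are invented; in Braintrust they come from scorers such as Levenshtein, and the per-case deltas are what the difference report surfaces:

```python
# Side-by-side comparison of two RAG configurations over a golden dataset.

def mean_rating(scores: list[float]) -> float:
    """Overall rating R for a configuration: mean of per-example scores."""
    return sum(scores) / len(scores)

scores_a = [0.9, 0.8, 1.0, 0.7]    # config A: current prompt (hypothetical)
scores_b = [0.95, 0.85, 0.9, 0.9]  # config B: new prompt (hypothetical)

delta = mean_rating(scores_b) - mean_rating(scores_a)  # overall R_B - R_A
prefer_b = delta > 0

# per-case report: which golden examples got worse under config B
case_deltas = [b - a for a, b in zip(scores_a, scores_b)]
regressions = [i for i, d in enumerate(case_deltas) if d < 0]
```

Even when the overall delta favors the new prompt, the per-case regressions list flags the examples it broke.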
▫ Common Scenario in Practice
Situation: An engineering team upgrades its prompt from "Answer the question" (Case A) to a more complex 500-word system instruction (Case B).
Problem: Improving the prompt for Case B might accidentally break Case A.
Solution: Use Braintrust to create a golden dataset with a set of N perfect examples (e.g., N = 50). Let Braintrust run a side-by-side (SxS) comparison every time the team updates a single word in the prompt:
```python
import braintrust
from autoevals import Levenshtein

# initialize the project
project = braintrust.init(project="Prompt-Upgrade-Regression")

# define the ground-truth dataset (N=50)
dataset = [
    {"input": "What is 2+2?", "expected": "4"},  # case a (simple)
    {"input": "Explain quantum entanglement in the style of a pirate.", "expected": "Arr, particles be linked..."},  # case b (complex)
    ...
]

# evaluate
braintrust.Eval(
    name="Prompt Upgrade SxS",
    data=dataset,
    task=lambda input: {
        "case_a": prompt_case_a(input),  # current prompt
        "case_b": prompt_case_b(input),  # new, complex prompt
    },
    scores=[Levenshtein],
)
```
- Expected result: A difference report showing exactly which of the N = 50 golden examples got better or worse.
▫ Pros
Extremely fast for testing before deployment.
Great UI for non-technical stakeholders to review and grade the output.
▫ Cons
Proprietary/SaaS-focused (though they have open-source components).
Fewer built-in deep-tech metrics compared to DeepEval or Ragas.
Wrapping Up
When handled with proper evaluation frameworks, RAG can be a competitive tool for providing an LLM with the context most relevant to the user query.
◼ Implementation Strategy: Mapping Metrics to Failure Points
Although there's no one-size-fits-all solution, Table 3 shows which evaluation metrics to apply for each FP covered in this article:
Table 3. The Failure Point Mitigation Matrix - Which Tool Solves Which FP?
DeepEval and RAGAS can leverage their faithfulness metrics to measure data integrity failures (FP1, FP4, FP7).
TruLens leverages its context relevance feedback to measure how well the retrieved context matches the query, effectively assessing FP2.
Arize Phoenix provides a visual trace of the retrieval process, making it easy to see if the document retrieved was lost during the consolidation (FP3).
For UX failures (FP5, FP6), DeepEval can define custom metrics, while Braintrust excels at ground-truth dataset comparison.
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.
FAQ
1) What is the primary difference between RAGAS and DeepEval?
👉 RAGAS focuses on 'Reference-Free' evaluation by generating synthetic test sets to measure the 'RAG Triad' (Faithfulness, Relevancy, Precision), while DeepEval is built for CI/CD integration, functioning like a unit-testing framework (PyTest) for LLM outputs.
2) How do I fix FP2 (Missed Top-Ranked Documents)?
👉 To solve FP2, improve your retrieval strategy by implementing Hybrid Search (combining Vector and BM25) or adding a Reranker step (e.g., Cohere) to verify the top-k chunks before feeding them to the LLM.
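As a sketch of the hybrid-search idea, Reciprocal Rank Fusion (RRF) is one common way to merge a BM25 ranking with a vector ranking before any reranking step. The document IDs and rankings below are hypothetical:

```python
# Reciprocal Rank Fusion: score(d) = sum over ranked lists of 1 / (k + rank(d)).
# Documents ranked highly by either retriever rise to the top of the fused list.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of doc IDs into one ranking via RRF."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc3", "doc1", "doc7"]    # keyword-based order (hypothetical)
vector_ranking = ["doc1", "doc9", "doc3"]  # embedding-based order (hypothetical)
fused = rrf([bm25_ranking, vector_ranking])
```

The fused list can then be passed to a reranker (e.g., Cohere) for a final top-k cut.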
3) Can I evaluate RAG without a ground-truth dataset?
👉 Yes. Frameworks like RAGAS and TruLens allow for 'Reference-Free' evaluation by using an LLM-as-a-Judge to compare the generated answer directly against the retrieved context for groundedness and faithfulness.
4) What is the 'Consolidation Failure' (FP3)?
👉 FP3 occurs when relevant documents are successfully retrieved but lost during the prompt-stuffing process, often due to context window limits or ranking logic that pushes the 'needle' to the middle of the prompt where LLMs lose attention.
5) When should I use Arize Phoenix over other tools?
👉 Use Arize Phoenix when you need deep observability into your Vector Database. Its UMAP embeddings visualization helps identify 'dark zones'—clusters of user queries where no relevant documentation exists in your store.
Share What You Learned
Kuriko IWAI, "Building Reliable RAG: A Deep Dive into 7 Failure Points and Evaluation Frameworks" in Kernel Labs
https://kuriko-iwai.com/rag-failure-points-evaluation-metrics-guide
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
How to Design a Production-Ready RAG System (Architecture + Tradeoffs) (2026 Edition)
Understanding Vector Databases and Embedding Pipelines
Related Books for Further Understanding
These books cover a wide range of theories and practices, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation

