Building Reliable RAG: A Deep Dive into 7 Failure Points and Evaluation Frameworks
Master how to evaluate the RAG pipeline and solve common failures with DeepEval, RAGAS, TruLens, and Phoenix.
By Kuriko IWAI

Table of Contents
Introduction
The Anatomy of RAG Breakdown - 7 Failure Points (FPs)
The Evaluation Stack: Frameworks to Mitigate FPs
Wrapping Up

Introduction
Retrieval-Augmented Generation (RAG) is critical for modern AI architecture, serving as an essential framework for building context-aware agents.
But moving from a basic prototype to a production-ready system involves navigating significant hurdles in data retrieval, context consolidation, and response synthesis.
This article provides a deep dive into seven typical RAG failure points and the evaluation metrics with practical coding examples.
Building production-grade AI systems?
I help teams design and deploy scalable RAG pipelines, LLM systems, and MLOps infrastructure.
Or explore:
- Dive deeper 👉 Research Archive
- Learn by building 👉 AI Engineering Masterclass
- Try it live 👉 Playground
The Anatomy of RAG Breakdown - 7 Failure Points (FPs)
Barnett et al. identify seven critical failure points (FPs) in the development and implementation of Retrieval Augmented Generation (RAG) systems:

Figure A. Indexing and Query processes required for creating a RAG system. The indexing process is done at development time and queries at runtime. Failure points identified in this study are shown in red boxes (source)
These FPs represent key areas where system performance is compromised, leading to suboptimal results or system failures.
◼ FP1. Missing Content
Missing content happens when the system is asked a question that cannot be answered because the relevant information is not present in the available vector store in the first place.
The failure occurs when an LLM provides a plausible-sounding but incorrect response instead of stating it doesn't know.
◼ FP2. Missed the Top-Ranked Documents
This is a situation where a correct document exists in the vector store, but the retriever fails to rank it highly enough to include it in top-k documents fed to an LLM as context.
As a result, the correct information never reaches the LLM.
◼ FP3. Not in Context (Consolidation Strategy Limitations)
This is a situation where a correct document exists and is retrieved from the vector store, but is excluded during the consolidation process.
This happens when too many documents are returned and the system must filter them down to fit within an LLM's context window, token limits, or rate limits.
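To make this failure concrete, here is a minimal, hypothetical sketch (the `consolidate` helper and documents are invented for illustration) of a consolidation step that packs retrieved documents into a fixed token budget in rank order. A lower-ranked but relevant document is silently dropped:

```python
# Naive consolidation: pack retrieved docs into a token budget in rank order,
# so anything ranked below the cutoff is silently dropped (FP3).

def consolidate(docs: list[str], token_budget: int) -> list[str]:
    """Keep docs in rank order until the (rough) token budget is exhausted."""
    kept, used = [], 0
    for doc in docs:
        cost = len(doc.split())  # crude token estimate: whitespace words
        if used + cost > token_budget:
            break  # everything ranked below this point is dropped
        kept.append(doc)
        used += cost
    return kept

retrieved = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Shipping info: orders ship within 2 business days.",
    "Warranty terms: hardware is covered for 24 months.",  # the relevant doc
]
context = consolidate(retrieved, token_budget=18)
# The warranty doc never reaches the LLM, even though retrieval succeeded.
```

Even though the retriever did its job, the budget cutoff decides what the LLM actually sees.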
◼ FP4. Not Extracted
This is a situation where an LLM fails to identify the correct information in the context, even though the correct information was in the vector store and was successfully retrieved and consolidated.
This happens when the context is overly noisy or contains contradictory information that confuses the LLM.
◼ FP5. Wrong Format
This is a situation where storage, retrieval, consolidation, and LLM interpretation are successfully handled, but the LLM fails to follow specific formatting instructions provided in the prompt, such as a table, a bulleted list, or a JSON schema.
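A lightweight guard against this failure is a post-generation format check. The sketch below is illustrative (the output strings and the `is_valid_json_with_keys` helper are invented); it verifies that an output parses as a JSON object with the required keys:

```python
import json

def is_valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    """Return True if output parses as a JSON object containing required_keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys <= parsed.keys()

good = '{"answer": "42", "source": "doc_7"}'
bad = "The answer is 42 (source: doc_7)."  # correct content, wrong format (FP5)

ok_good = is_valid_json_with_keys(good, {"answer", "source"})
ok_bad = is_valid_json_with_keys(bad, {"answer", "source"})
```

A check like this can trigger a retry with a stricter formatting instruction before the response reaches the user.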
◼ FP6. Incorrect Specificity
The LLM's output is technically on-topic, but either too general or too complex for the user's needs.
For example, an LLM generates simple answers to a user query with a complex professional goal.
◼ FP7. Incomplete Answers
This is a situation where an LLM generates an output not necessarily wrong, but missing key pieces of information that were available in the context.
For example, when a user asks a complex question like "What are the key points in documents A, B, and C?" , the LLM only addresses one or two of the sources.
◼ How FPs Impact RAG Performance
Each of these FPs impacts the performance of RAG pipelines in one of the following three ways:
▫ Data Integrity & Trust Failures (FP1, FP4, FP7)
When missing or incorrect information is present, the system is no longer a reliable source of information. Primary FPs include:
FP1 (Missing Content): The answer is not in the doc in the first place.
FP4 (Not Extracted): The LLM decides to ignore the correct answer in the doc.
FP7 (Incomplete): The LLM gives half-truths, missing important pieces.
▫ Retrieval & Efficiency Bottlenecks (FP2, FP3)
The RAG pipeline can be inefficient when it misses key information in the retrieval and consolidation stages. Primary FPs include:
FP2 (Missed Top Ranked): The retriever fails to rank the correct document within the top-k results.
FP3 (Consolidation Strategy): The script to trim docs to fit the LLM limits drops the most important parts.
▫ User Experience & Formatting Errors (FP5, FP6)
Although correct, an output with poor readability or in a wrong format can compromise user experience. Primary FPs include:
FP5 (Wrong Format): The LLM fails to follow the specific output format like JSON.
FP6 (Incorrect Specificity): The LLM generates a lengthy output for a simple yes/no question, or vice versa (a too-brief answer to a complicated question).
The Evaluation Stack: Frameworks to Mitigate FPs
Evaluation metrics are designed to systematically mitigate these FPs.
This section explores major evaluation metrics with practical use cases.
▫ Major RAG Evaluation Metrics:
DeepEval
RAGAS
TruLens
Arize Phoenix
Braintrust
◼ DeepEval - The Unit Test before Deployment
DeepEval calculates a weighted score based on defined criteria.
An LLM-as-a-judge (e.g., GPT-4o) evaluates each criterion against an LLM's output:

DeepEval leverages G-Eval, a chain-of-thought (CoT) framework that takes a multi-step approach to evaluating the output:
Define the criteria to measure (e.g., "coherence," "fluency," or "relevance").
Generate evaluation steps (using an evaluator LLM).
Follow the evaluation steps to analyze the input and the LLM's output.
Calculate an expected weighted sum of the scores across criteria.
Leveraging this approach, DeepEval computes the final score as:

Score(O) = Σ_i w_i · C_i(O)

where:
w_i: the weight of a specific criterion, such as tone or helpfulness.
C_i: the score for criterion i against an output O.
f: the LLM's Likert-scale assessment that produces each C_i:
Table 1. The Likert-Scale Framework for LLM-as-a-Judge Scoring.
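As a rough illustration of this weighted scoring (not DeepEval's actual implementation; the criteria, weights, and Likert scores below are invented), the final score can be computed as a normalized weighted sum of per-criterion judge scores:

```python
# Sketch of a weighted G-Eval-style score: each criterion gets a Likert score
# from a judge LLM (here hard-coded), and the final score is the weighted sum.

def weighted_geval_score(criterion_scores: dict[str, float],
                         weights: dict[str, float]) -> float:
    """Combine per-criterion scores (normalized to 0-1) with their weights."""
    total_weight = sum(weights.values())
    return sum(weights[c] * criterion_scores[c] for c in criterion_scores) / total_weight

# Hypothetical judge outputs on a 0-3 Likert scale, normalized by dividing by 3
scores = {"coherence": 3 / 3, "fluency": 2 / 3, "relevance": 3 / 3}
weights = {"coherence": 0.5, "fluency": 0.2, "relevance": 0.3}
score = weighted_geval_score(scores, weights)  # weighted average in [0, 1]
```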
▫ Common Scenario in Practice
Situation: A technical documentation assistant (bot) for a complex software product appears to keep working every time the engineering team updates the codebase.
Problem: There is no quantitative proof that the bot can still answer user queries (you just "think" it's working...).
Solution: Integrate a PyTest function as a CI/CD regression suite in GitHub Actions, where DeepEval runs G-Eval and other metrics over a test case:
```python
# pytest component
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval

def test_bot_silent_regression():
    # set up metrics with thresholds
    relevancy = AnswerRelevancyMetric(threshold=0.85)
    faithfulness = FaithfulnessMetric(threshold=0.85)

    # g-eval (llm judge)
    geval_correctness = GEval(
        name="Correctness",
        criteria="Determine if the actual output is factually accurate based on the expected output.",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT
        ],
        threshold=0.85
    )

    # define a test case
    test_case = LLMTestCase(
        input="How do I rotate API keys in the dashboard?",
        actual_output="To rotate keys, go to Settings > Security and click 'Regenerate'.",
        retrieval_context=["The security tab allows users to regenerate API keys for safety."],
        expected_output="Users can rotate API keys via the Security section in Settings."
    )

    # assert the test case against the metrics
    assert_test(test_case, [relevancy, faithfulness, geval_correctness])
```
- Expected results: If any metric's score drops below the threshold (0.85), PyTest raises an AssertionError, immediately failing the CI build and preventing the silent regression from reaching production.
▫ Pros
A variety of metrics (50+) including specialized bias and toxicity checks are available.
Seamlessly integrates with existing CI/CD pipelines.
No reference needed; outputs can be assessed based solely on the prompt and provided context.
▫ Cons
The quality of evaluation heavily depends on the judge LLM's capabilities.
Computationally expensive when the judge LLM is a high-end model.
Developer Note - The Test Case for DeepEval
A set of LLMTestCase objects defines the test case that DeepEval runs.
In practice, the test cases should contain the most important user queries and labeled outputs along with the retrieved context.
These can be loaded from a JSON or CSV file.
◼ RAGAS - The Needle in a Haystack Optimizer
Retrieval Augmented Generation Assessment (RAGAS) aims to evaluate RAG without a human-annotated dataset by generating synthetic test sets.
It then computes its flagship metrics:

Figure B. The RAGAS evaluation triad diagram connecting Question, Context, and Answer through Precision, Recall, Faithfulness, and Relevancy metrics (Created by Kuriko IWAI)
The flagship metrics fall into three groups:
Retrieval pipeline (black, solid line, Figure B): Context precision, context recall.
Generation pipeline (black, dotted line, Figure B): Faithfulness, answer relevancy.
Ground truth (red box, Figure B): Answer semantic similarity, answer correctness.
For example, faithfulness calculates the overlap between the claims in the response and the retrieved context such that:

Faithfulness = X_AC / X_A

where:
X_AC: the number of claims in the answer (response) A supported by the given context, and
X_A: the total number of claims in the response A (supported by the context or not).
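The ratio of supported claims to total claims can be sketched in a few lines. Note that RAGAS uses an LLM to extract claims and judge whether each one is supported; the naive substring check and the example claims below are purely illustrative:

```python
def faithfulness(answer_claims: list[str], context: str) -> float:
    """Faithfulness = X_AC / X_A. A claim counts as supported here if its
    text literally appears in the context (RAGAS uses an LLM judge instead)."""
    if not answer_claims:
        return 0.0
    supported = sum(1 for claim in answer_claims if claim.lower() in context.lower())
    return supported / len(answer_claims)

context = "Ibuprofen may cause nausea. It should be taken with food."
claims = [
    "ibuprofen may cause nausea",        # supported by context
    "it should be taken with food",      # supported by context
    "it is safe for children under 2",   # not in context -> unfaithful claim
]
score = faithfulness(claims, context)  # 2 of 3 claims supported
```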
▫ Common Scenario in Practice
Situation: The RAG system for legal contracts is missing key clauses. You are unsure whether the problem is in the search (retriever) or the reading (generator).
Problem: No idea what the optimal top-k (number of chunks retrieved) is.
Solution: Use RAGAS to create a synthetic test set with 100 question-evidence pairs. Then, run the RAG pipeline against the test set to calculate context recall and context precision:
```python
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.docstore.document import Document
from ragas import evaluate
from ragas.testset import TestsetGenerator
from ragas.metrics import context_precision, context_recall, faithfulness, answer_relevancy

# set up models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o")
embedding_model = OpenAIEmbeddings()

# set up documents (doc_1, doc_2: raw document strings loaded elsewhere)
langchain_docs = [
    Document(page_content=doc_1),
    Document(page_content=doc_2),
]

# instantiate the generator and generate a synthetic test set with 100 pairs
generator = TestsetGenerator.from_langchain(
    llm=generator_llm,
    embedding_model=embedding_model
)
testset = generator.generate_with_langchain_docs(langchain_docs, testset_size=100)
test_df = testset.to_pandas()

### <--- rag pipeline execution --->

# create a results dataset with the rag pipeline
# (rag_pipeline: your RAG app, assumed to expose get_chunks() and get_answer())
results_data = {
    "question": test_df["question"].tolist(),
    "contexts": [rag_pipeline.get_chunks(q) for q in test_df["question"]],
    "answer": [rag_pipeline.get_answer(q) for q in test_df["question"]],
    "ground_truth": test_df["ground_truth"].tolist()
}
result_dataset = Dataset.from_dict(results_data)

# evaluate the results with the metrics
score = evaluate(
    dataset=result_dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy
    ],
    llm=critic_llm,
    embeddings=embedding_model,
)
```
Expected result: Depending on the metric results, the action plan can be as follows:
Table 2. RAGAS Diagnostic Action Plan - Mapping Scores to System Adjustments.
▫ Pros
- Excellent for early-stage projects without ground-truth datasets (as shown in the code snippet, RAGAS can generate a synthetic test set).
▫ Cons
A synthetic test set might miss nuanced factual errors.
Requires a robust extractor model to break down answers into individual claims (gpt-4o in the example).
◼ TruLens - The Feedback Loop Specialist
TruLens focuses on the internal mechanics of the RAG process rather than just the final output by using feedback functions.
For example, it measures answer relevance with cosine similarity:

sim(Q, R) = (Q · R) / (‖Q‖ ‖R‖)

where Q and R represent the embedding vectors of the query and response, respectively.
It also uses an LLM-based score on a 4-point Likert scale (0-3) that reflects how well the response satisfies the query's intent, making it well suited for ranking the quality of different search results.
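The cosine-similarity check can be sketched with NumPy. The 4-dimensional embeddings below are invented for illustration; in practice, Q and R come from an embedding model:

```python
import numpy as np

def cosine_similarity(q: np.ndarray, r: np.ndarray) -> float:
    """sim(Q, R) = (Q . R) / (||Q|| * ||R||)"""
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))

query_emb = np.array([0.1, 0.9, 0.2, 0.4])     # hypothetical query embedding
response_emb = np.array([0.1, 0.8, 0.3, 0.5])  # hypothetical response embedding
sim = cosine_similarity(query_emb, response_emb)  # close to 1 -> relevant
```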
▫ Common Scenario in Practice
Situation: A medical advisor bot answers a user's question correctly but adds a pro tip that isn't in the vetted PDF knowledge base.
Problem: The added pro tip might be helpful, but it is not grounded.
Solution: Use TruLens to implement a groundedness feedback function with a threshold such as score > 0.8:
```python
from trulens_eval import Tru, Feedback, Select, TruCustomApp
from trulens_eval.feedback.provider.openai import OpenAI as tOpenAI

MODEL_NAME = "gpt-4o"  # evaluator model (example choice)

# instantiate tru and the evaluator model
tru = Tru()
provider = tOpenAI(model_engine=MODEL_NAME)

# define the feedback func - compare the output vs the source (chunks)
f_groundedness = (
    Feedback(
        provider.groundedness_measure_with_cot_reasons,
        name="Groundedness"
    )
    .on(Select.RecordCalls.func.args.context)  # source (chunks)
    .on_output()  # output (llm response)
)

# wrap the rag pipeline with the tru recorder
tru_recorder = TruCustomApp(
    rag_query_engine,  # rag app
    app_id="app_ver1",
    feedbacks=[f_groundedness]
)

# execute the query with the tru recorder
with tru_recorder as recording:
    res = rag_query_engine(
        prompt="What are the side effects of Ibuprofen?"
    )

# retrieve the tru assessment results
record_df, feedback_cols = tru.get_records_and_feedback(app_ids=['app_ver1'])
```
- Expected results: When the LLM generates a response that contains information not present in the retrieved chunks, TruLens flags the record in your dashboard.
▫ Pros
Visualizes the reasoning chain to identify exactly where the agent went off-track.
Provides built-in support for grounding to catch hallucinations in real-time.
▫ Cons
Learning curve for defining custom feedback functions.
The dashboard can feel heavyweight for simple scripts.
◼ Arize Phoenix - The Silent Failure Map
Arize Phoenix is an open-source observability and evaluation tool to evaluate LLM outputs, including complex RAG systems.
Built on OpenTelemetry by Arize AI, it focuses on observability by treating LLM evaluation as a subset of MLOps.
In the context of RAG evaluation, Phoenix excels at embedding analysis, using Uniform Manifold Approximation and Projection (UMAP) to reduce high-dimensional vector embeddings into 2D/3D space:

UMAP: R^n → R^d

where n is the original dimension of the vector space, and d is the reduced dimension (e.g., d = 3).
This embedding analysis mathematically reveals if the failed queries are semantically grouped together, which indicates a gap in the vector database.
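To illustrate the reduction step: Phoenix relies on UMAP (from the umap-learn package); the sketch below substitutes a linear PCA projection via SVD so it runs on NumPy alone. It is not UMAP's algorithm, but it demonstrates the same n-to-d mapping of embeddings into plottable points:

```python
import numpy as np

# 200 hypothetical 384-dimensional embeddings standing in for a vector store
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))

def project(X: np.ndarray, d: int = 3) -> np.ndarray:
    """Map n-dimensional vectors to d dimensions (here via PCA/SVD;
    Phoenix uses UMAP, which preserves local neighborhood structure)."""
    X_centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:d].T  # project onto top-d principal directions

points_3d = project(embeddings, d=3)  # shape (200, 3), ready to plot
```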
▫ Common Scenario in Practice
Situation: A customer support bot works great for refunds but gives nonsensical answers to warranty claims.
Problem: A data hole in the vector database (it cannot be found in the logs).
Solution: Use Arize Phoenix to generate a UMAP Embedding Visualization (UEV), a 3D map of the vector database, to overlay user queries on the document chunks.
Expected results: You can visually see a cluster of user queries landing in a dark zone where no documents exist, indicating that some documents were never uploaded to the vector store.
▫ Pros
OpenTelemetry-native; integrates with existing enterprise monitoring stacks.
The best tool for visualizing blind spots of the vector store.
▫ Cons
Less focused on scoring, more on observing.
Can be overkill for small-scale applications or single-agent tools.
◼ Braintrust - The Prompt Regression Safety Net
Braintrust is designed for high-frequency iteration cycles by using cross-model comparison.
It assesses whether Model A + Prompt B is mathematically superior to Model C + Prompt D by comparing:

Δ = R_A − R_B

where R_A and R_B are the performance ratings of the two RAG configurations (a positive Δ favors configuration A).
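The comparison logic can be sketched as follows. The per-example scores are invented; in Braintrust they come from scorers such as Levenshtein, and the per-case deltas are what the difference report surfaces:

```python
# Side-by-side comparison of two RAG configurations over a golden dataset.

def mean_rating(scores: list[float]) -> float:
    """Overall rating R for a configuration: mean of per-example scores."""
    return sum(scores) / len(scores)

scores_a = [0.9, 0.8, 1.0, 0.7]    # config A: current prompt (hypothetical)
scores_b = [0.95, 0.85, 0.9, 0.9]  # config B: new prompt (hypothetical)

delta = mean_rating(scores_b) - mean_rating(scores_a)  # overall R_B - R_A
prefer_b = delta > 0

# per-case report: which golden examples got worse under config B
case_deltas = [b - a for a, b in zip(scores_a, scores_b)]
regressions = [i for i, d in enumerate(case_deltas) if d < 0]
```

Even when the overall delta favors the new prompt, the per-case regressions list flags the examples it broke.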
▫ Common Scenario in Practice
Situation: An engineering team upgrades its prompt from "Answer the question" (Case A) to a more complex 500-word system instruction (Case B).
Problem: Improving the prompt for Case B might accidentally break Case A.
Solution: Use Braintrust to create a golden dataset with a set of N perfect examples (e.g., N = 50). Let Braintrust run a side-by-side (SxS) comparison every time the team updates a single word in the prompt:
```python
import braintrust
from autoevals import Levenshtein

# initialize the project
project = braintrust.init(project="Prompt-Upgrade-Regression")

# define the ground-truth dataset (N=50)
dataset = [
    {"input": "What is 2+2?", "expected": "4"},  # case a (simple)
    {"input": "Explain quantum entanglement in the style of a pirate.", "expected": "Arr, particles be linked..."},  # case b (complex)
    ...
]

# evaluate
braintrust.Eval(
    name="Prompt Upgrade SxS",
    data=dataset,
    task=lambda input: {
        "case_a": prompt_case_a(input),  # current prompt
        "case_b": prompt_case_b(input),  # new, complex prompt
    },
    scores=[Levenshtein],
)
```
- Expected result: A difference report showing exactly which of the N = 50 golden examples got better or worse.
▫ Pros
Extremely fast for testing before deployment.
Great UI for non-technical stakeholders to review and grade the output.
▫ Cons
Proprietary/SaaS-focused (though they have open-source components).
Fewer built-in deep-tech metrics compared to DeepEval or Ragas.
Wrapping Up
When handled with proper evaluation frameworks, RAG can be a competitive tool for providing an LLM with the context most relevant to the user query.
◼ Implementation Strategy: Mapping Metrics to Failure Points
Although there's no one-size-fits-all solution, Table 3 shows which evaluation metrics to apply for each FP covered in this article:
Table 3. The Failure Point Mitigation Matrix - Which Tool Solves Which FP?
DeepEval and RAGAS can leverage their faithfulness metrics to measure data integrity failures (FP1, FP4, FP7).
TruLens leverages its context relevance feedback to measure how well the retrieved context matches the query, effectively assessing FP2.
Arize Phoenix provides a visual trace of the retrieval process, making it easy to see if the document retrieved was lost during the consolidation (FP3).
For UX failures (FP5, FP6), DeepEval can define custom metrics, while Braintrust excels at ground-truth dataset comparison.
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.
FAQ
1) What is the primary difference between RAGAS and DeepEval?
👉 RAGAS focuses on 'Reference-Free' evaluation by generating synthetic test sets to measure the 'RAG Triad' (Faithfulness, Relevancy, Precision), while DeepEval is built for CI/CD integration, functioning like a unit-testing framework (PyTest) for LLM outputs.
2) How do I fix FP2 (Missed Top-Ranked Documents)?
👉 To solve FP2, improve your retrieval strategy by implementing Hybrid Search (combining Vector and BM25) or adding a Reranker step (e.g., Cohere) to verify the top-k chunks before feeding them to the LLM.
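As a sketch of the hybrid-search idea, Reciprocal Rank Fusion (RRF) is one common way to merge a BM25 ranking with a vector ranking before any reranking step. The document IDs and rankings below are hypothetical:

```python
# Reciprocal Rank Fusion: score(d) = sum over ranked lists of 1 / (k + rank(d)).
# Documents ranked highly by either retriever rise to the top of the fused list.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of doc IDs into one ranking via RRF."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc3", "doc1", "doc7"]    # keyword-based order (hypothetical)
vector_ranking = ["doc1", "doc9", "doc3"]  # embedding-based order (hypothetical)
fused = rrf([bm25_ranking, vector_ranking])
```

The fused list can then be passed to a reranker (e.g., Cohere) for a final top-k cut.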
3) Can I evaluate RAG without a ground-truth dataset?
👉 Yes. Frameworks like RAGAS and TruLens allow for 'Reference-Free' evaluation by using an LLM-as-a-Judge to compare the generated answer directly against the retrieved context for groundedness and faithfulness.
4) What is the 'Consolidation Failure' (FP3)?
👉 FP3 occurs when relevant documents are successfully retrieved but lost during the prompt-stuffing process, often due to context window limits or ranking logic that pushes the 'needle' to the middle of the prompt where LLMs lose attention.
5) When should I use Arize Phoenix over other tools?
👉 Use Arize Phoenix when you need deep observability into your Vector Database. Its UMAP embeddings visualization helps identify 'dark zones'—clusters of user queries where no relevant documentation exists in your store.
Share What You Learned
Kuriko IWAI, "Building Reliable RAG: A Deep Dive into 7 Failure Points and Evaluation Frameworks" in Kernel Labs
https://kuriko-iwai.com/rag-failure-points-evaluation-metrics-guide
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
How to Design a Production-Ready RAG System (Architecture + Tradeoffs) (2026 Edition)
Understanding Vector Databases and Embedding Pipelines
Related Books for Further Understanding
These books cover a wide range of theories and practices, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation

