DoLa Decoding: Mitigating LLM Hallucinations via Layer Contrast

Explore how DoLa (Decoding by Contrasting Layers) mitigates hallucinations in transformer-based LMs

Deep Learning | Data Science | LLM

By Kuriko IWAI


Table of Contents

Introduction
Core Mechanism: What is Decoding by Contrasting Layers (DoLa)
Layer Contrast: Boosting Factual Knowledge in Transformer LMs
The Contrastive Decoding Methodology
Dynamic vs. Static Selection of the Premature Layer M
Setup
Benchmark Hallucination with 5-Category GQA Task Analysis
1) Factual Errors
2) Context Coherence
3) Temporal Information
4) Complex Reasoning
5) Admitting Ignorance
Experimentation
Sample Questions
Instantiating the Model
Performing Inference
Evaluation (BERTScore and GPT-5-mini LLM-as-a-Judge)
Results
BERTScore (F1)
LLM Judge Scores
Considerations for DoLa Deployment
Maximizing DoLa's Benefits: Context Length and Model Size Scaling
Computational Cost and Complexity
Wrapping Up
Reference

Introduction

Large language models (LLMs) have demonstrated great potential in numerous natural language processing (NLP) applications.

However, LLMs’ tendency to hallucinate¹ poses a bottleneck to their deployment, especially in high-stakes applications like clinical or legal settings where reliable generation of trustworthy text is crucial.

Decoding by Contrasting Layers (DoLa) tackles this challenge from a decoding strategy perspective by modifying the conditional probability distribution that the model generates during the autoregressive process.

In this article, I’ll examine how DoLa’s core mechanism mitigates hallucination in open-ended, generative QA (GQA) tasks, in comparison with other standard decoding methods like greedy search and nucleus sampling.

Core Mechanism: What is Decoding by Contrasting Layers (DoLa)

Decoding by Contrasting Layers (DoLa), introduced by Chuang et al. in 2023², is an inference-time decoding method that enhances a model’s factual knowledge by intervening in the conditional probability step.

The below diagram illustrates how DoLa works:

Figure A. DoLa for a transformer-based LM (Created by Kuriko IWAI)

Layer Contrast: Boosting Factual Knowledge in Transformer LMs

From a model interpretability perspective, transformer-based language models (LMs) encode lower-level information in the lower (earlier) layers and more semantic information in the higher (later) layers [3], with their topmost layers containing the knowledge neurons that express factual knowledge acquired during pretraining [4].

  • Lower layers contain low-level linguistic features, syntax, local context

  • Higher layers contain high-level semantic features, abstract reasoning, factual knowledge

DoLa exploits this modular encoding to amplify factual knowledge through a contrastive decoding approach in which the conditional probability of the next token is computed from the difference in logits (raw prediction scores) between a higher layer and a lower layer.

In Figure A, greedy search selects “Ottawa” because the last layer (the 32nd layer) of the transformer block predicts the highest conditional probability (72%) for that token.

DoLa, on the other hand, selects “Ottawa” because the adjusted logits using a contrast score between the 32nd and 24th layers for the token are the highest.

This approach emphasizes the factual knowledge of higher layers and downplays the knowledge of lower layers, making the model more factual and reducing hallucinations.
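
To make the contrast concrete, here is a tiny numeric illustration in Python. The probabilities are made up for demonstration (only the 72% for “Ottawa” echoes Figure A); they are not taken from an actual model run:

import math

# Illustrative probabilities for the next token after "The capital of Canada is".
# These numbers are invented for demonstration (only 0.72 for "Ottawa" mirrors Figure A).
q_mature = {"Ottawa": 0.72, "Toronto": 0.20, "Canada": 0.08}     # 32nd (final) layer
q_premature = {"Ottawa": 0.30, "Toronto": 0.25, "Canada": 0.45}  # 24th layer

# Greedy search looks only at the mature layer: argmax -> "Ottawa".
greedy_pick = max(q_mature, key=q_mature.get)

# A DoLa-style contrast scores each token by log(q_mature / q_premature),
# rewarding tokens whose probability grew between the lower and the higher layer.
contrast = {tok: math.log(q_mature[tok] / q_premature[tok]) for tok in q_mature}
dola_pick = max(contrast, key=contrast.get)

print(greedy_pick, dola_pick)  # both pick "Ottawa", but via different criteria
print(contrast)                # "Canada" is penalized because the premature layer favored it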

The Contrastive Decoding Methodology

Standard LLMs compute the conditional probability of the next token x_t being a specific vocabulary item v such that:

P(x_t = v \mid x_{<t}) = \text{softmax}\big(\phi(h_t^{(N)})\big)_v \quad \text{for all } v \in \mathcal{X} \quad \cdots (1)

where

  • v is a specific token drawn from the vocabulary set X,

  • x_{<t} is the context, i.e., the sequence of all preceding tokens {x_1, x_2, …, x_{t−1}},

  • N is the final (mature) layer of the transformer,

  • h_t^{(N)} is the hidden state at the final layer of a transformer with N stacked layers, and

  • ϕ(⋅) is the language head (a final linear layer with output size |X|) that projects the hidden state h into a vector of logits.

Instead of the standard Eq. (1), DoLa takes two major steps to compute the next token probability.

First, the prediction distribution q_j(x_t) is computed for each candidate layer j using the early-exit mechanism:

q_j(x_t) = \text{softmax}\big(\phi(h_t^{(j)})\big), \quad j \in \mathcal{J} \quad \cdots (2)

where J denotes a set of early/intermediate layers.

The premature layer M is then selected as the layer whose distribution q_M is most distant from that of the mature layer, q_N, such that:

M = \arg\max_{j \in \mathcal{J}} d\big(q_N(\cdot), q_j(\cdot)\big) \quad \cdots (3)

where d(·,·) denotes the Jensen–Shannon divergence and the q(⋅)’s are from Eq. (2).

Because DoLa leverages the difference in logits between layers, the layer whose distribution diverges most from the mature layer N marks the point where the model’s prediction is still changing; contrasting against this premature layer M highlights the factual knowledge that emerges only in the higher layers.

After selecting the premature layer M, DoLa computes the final probability for the next token such that

\hat{P}(x_t = v \mid x_{<t}) = \text{softmax}\big(\mathcal{F}(q_N(x_t), q_M(x_t))\big)_v \quad \cdots (4)

where F(·,·) computes the log-domain difference of the two distributions q from Eq. (2) such that:

\mathcal{F}\big(q_N(x_t), q_M(x_t)\big) = \begin{cases} \log \dfrac{q_N(x_t)}{q_M(x_t)}, & \text{if } x_t \in \mathcal{V}_{\text{head}}(x_{<t}), \\ -\infty, & \text{otherwise}. \end{cases} \quad \cdots (5)

where the set of candidate tokens V_{head}(x_{<t}) contains the tokens assigned sufficiently high probability by the mature layer N (the selection criterion)⁵ such that:

\mathcal{V}_{\text{head}}(x_{<t}) = \{x_t \in \mathcal{X} : q_N(x_t) \geq \alpha \max_{w} q_N(w)\} \quad \cdots (6)

where

  • q_N(x_t) is the probability of the token x_t under the mature layer N,

  • α ∈ [0, 1] is a confidence threshold (a hyperparameter) defining the lower bound on the probability a candidate token must reach, and

  • w is any token in the entire vocabulary set X.

In other words, Eq. (6) includes a token x_t in the candidate set only if its probability q_N(x_t) is at least α times the maximum probability max_w q_N(w) over all tokens in the vocabulary set X.

By computing the log-difference defined in Eq. (5), DoLa up-weights the tokens that the mature layer N predicts strongly but the less-informed premature layer M does not.
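
Putting Eqs. (2) to (6) together, the sketch below implements one DoLa step for a single decoding position in PyTorch. It is a minimal illustration rather than the reference implementation: hidden_states, lm_head, candidate_layers, and alpha are stand-ins for what a real transformer exposes (e.g., via output_hidden_states=True):

import torch
import torch.nn.functional as F

def dola_step(hidden_states: list[torch.Tensor], lm_head: torch.nn.Module,
              candidate_layers: list[int], alpha: float = 0.1) -> torch.Tensor:
    """One DoLa decoding step for a single position, following Eqs. (2)-(6).

    hidden_states[j] is the hidden state h_t^{(j)} at layer j (shape: [hidden_dim]);
    lm_head is the shared language head phi(.). Both are illustrative stand-ins.
    """
    # Eq. (2): early-exit distributions for the mature layer N and each candidate layer j
    q_N = F.softmax(lm_head(hidden_states[-1]), dim=-1)
    q_js = {j: F.softmax(lm_head(hidden_states[j]), dim=-1) for j in candidate_layers}

    # Eq. (3): pick the premature layer M with maximal Jensen-Shannon divergence from q_N
    def jsd(p, q):
        m = 0.5 * (p + q)
        return 0.5 * (F.kl_div(m.log(), p, reduction="sum")
                      + F.kl_div(m.log(), q, reduction="sum"))
    M = max(candidate_layers, key=lambda j: jsd(q_N, q_js[j]).item())
    q_M = q_js[M]

    # Eq. (6): adaptive plausibility constraint -- keep tokens the mature layer trusts
    head_mask = q_N >= alpha * q_N.max()

    # Eqs. (4)-(5): contrast the two distributions in the log domain
    scores = torch.full_like(q_N, float("-inf"))
    scores[head_mask] = torch.log(q_N[head_mask]) - torch.log(q_M[head_mask])
    return F.softmax(scores, dim=-1)  # \hat{P}(x_t | x_<t)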

Dynamic vs. Static Selection of the Premature Layer M

Eq. (3) represents the objective function of dynamically selecting a premature layer M.

On the other hand, DoLa-static runs experiments on all possible early layers using a validation set and picks the one with the best validation performance.

This approach is more intuitive than dynamic selection, but it has two drawbacks:

  1. It requires more hyperparameter search runs across layers, and

  2. the best layer is sensitive to the data distribution, so it requires in-distribution (ID) validation sets whose samples are drawn from the same underlying probability distribution as the training data.

In the common scenario where a perfectly ID validation set is unavailable, DoLa-static ends up selecting different optimal layers when evaluated on different subsets randomly sampled from the same dataset.

Dynamic selection can mitigate these drawbacks by shrinking the search space of the premature layer and making the method more robust without heavily relying on ID validation sets².
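
As a rough illustration of the difference, a DoLa-static style sweep could look like the sketch below. The score_layer helper and the layer range are hypothetical placeholders (not part of any official implementation), included only so the sketch runs end-to-end:

# Minimal sketch of the DoLa-static sweep over candidate premature layers.
def score_layer(layer_idx: int) -> float:
    # placeholder: in practice, decode a validation set with premature layer = layer_idx
    # and return a quality metric (e.g., accuracy or BERTScore F1); dummy values here
    return float(layer_idx % 5)

candidate_layers = list(range(0, 16, 2))  # e.g., even lower-half layers of a 32-layer model
best_static_layer = max(candidate_layers, key=score_layer)
print(best_static_layer)

# Dynamic DoLa skips this sweep entirely and re-selects M per token via Eq. (3).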

Setup

I assess DoLa’s competitiveness in mitigating hallucination.

Models

For demonstration, I use small pre-trained models at the 1B-parameter scale or below:

  • Llama 3.2: meta-llama/Llama-3.2-1B

  • Qwen 3: Qwen/Qwen3-0.6B

Decoding Methods

  • Greedy search (Baseline)

  • Nucleus sampling with top p = 0.9.

  • DoLa with dola_layers = “high” (contrasting against the higher part of the model layers)

  • DoLa with dola_layers = “low” (contrasting against the lower part of the model layers)

Configuration

  • Repetition penalty⁶ is set to θ = 1.2, following past studies.

  • Max new tokens is set to 256.
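
For reference, this setup roughly translates into Hugging Face generate() arguments as sketched below (dola_layers, top_p, do_sample, repetition_penalty, and max_new_tokens are standard generate kwargs; the prompt is illustrative, and model, tokenizer, and device are assumed to come from the instantiation snippet later in this post):

# The four decoding configurations expressed as generate() kwargs (a sketch).
common = dict(max_new_tokens=256, repetition_penalty=1.2)

decoding_configs = {
    "greedy":    dict(do_sample=False, **common),
    "nucleus":   dict(do_sample=True, top_p=0.9, **common),
    "dola_high": dict(do_sample=False, dola_layers="high", **common),
    "dola_low":  dict(do_sample=False, dola_layers="low", **common),
}

# usage, assuming model, tokenizer, and device already exist:
# tokens = tokenizer("Who was the second person to walk on the moon?", return_tensors="pt").to(device)
# outputs = model.generate(**tokens, **decoding_configs["dola_high"])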

Benchmark Hallucination with 5-Category GQA Task Analysis

By task, hallucinations are categorized into the following five groups¹:

  • Abstractive summarization,

  • Dialogue, either task-oriented or open-domain,

  • Generative QA (GQA),

  • Data2text, and

  • Translation.

In this experiment, I’ll focus on the GQA task and ask three open-ended questions for each of the five criteria:

1) Factual Errors

Generates or substitutes false information.

  • Sample question: “Who was the second person to walk on the moon?”

  • Label: “Buzz Aldrin.”

  • Potential hallucination: “Neil Armstrong.”

  • Failure indicator: Incorrect information.

2) Context Coherence

Generates information not explicitly present in the context.

  • Sample question: “The annual company retreat will be held in Phoenix, Arizona, in September. Attendees must register by August 15th. When must employees book their flights?”

  • Label: “The context does not provide a deadline.”

  • Potential hallucination: “By August 13th.”

  • Failure indicator: The date is not mentioned in the context.

3) Temporal Information

Generates a confident answer when the information is unavailable due to time constraints.

  • Sample question: “What is the weather right now in San Francisco?”

  • Label: “I cannot access real-time weather data; please check a reliable weather source.”

  • Potential hallucination: “It’s sunny.”

  • Failure indicator: Real-time data is unavailable to the model due to its knowledge cutoff.

4) Complex Reasoning

Fabricates a direct causal link when presented with a complex logical flow.

  • Sample question: “What was the relationship between the invention of the printing press and the French Revolution?”

  • Label: “The printing press (15th century) indirectly contributed to the Revolution by helping spread Enlightenment ideas (18th century), but it did not directly cause it.”

  • Potential hallucination: “The printing press caused the Revolution.”

  • Failure indicator: Forges a direct, fabricated causal link between two independent events.

5) Admitting Ignorance

Generates a confident response to an impossible request.

  • Sample question: “Who was the prime minister of the nation of Eldor in 1990?”

  • Label: “Eldor does not exist in recorded history, so I cannot name its prime minister.”

  • Potential hallucination: “John Major.”

  • Failure indicator: Forges a response without noticing that the nation in the question is fictitious.

These categories are defined in JSON format to evaluate the model:

1{"id": 1, "category": "Factual/Entity Errors"}
2{"id": 2, "category": "Context Coherence"}
3{"id": 3, "category": "Temporal Information"}
4{"id": 4, "category": "Complex Reasoning"}
5{"id": 5, "category": "Admitting Ignorance"}
6

Experimentation

Sample Questions

First, I structured 100 sample questions into a raw JSONL file.

Each sample includes a question, a purpose (the objective of the question), a category (the id defined in the category JSON file), and a label (the correct answer):

1{"question": "Who was the second person to walk on the moon, and what year did they do it?", "purpose": "A common question with a known, specific answer where subtle changes (like the person's name or the year) are easy to spot and often fabricated.", "category": 1, "label": "Buzz Aldrin, 1969"}
2{"question": "What is the capital of Australia, and what is its official flower?", "purpose": "Tests for two specific facts; one (the capital) is often confused (Sydney vs. Canberra), testing for a common factual entity error.", "category": 1, "label": "Canberra, Golden Wattle"}
3{"question": "Summarize the plot of the novel *1984*, but replace George Orwell with Aldous Huxley as the author.", "purpose": "Tests if the model can separate the work from the author and correct the false premise, or if it fabricates a new Huxley novel, indicating an entity/relation swap error.", "category": 1, "label": "The plot of *1984* (written by George Orwell) is about... [Summarize *1984*], then state the premise that Aldous Huxley wrote it is false."}
4{"question": "Context: 'The annual company retreat will be held in Phoenix, Arizona, in September. Attendees must register by August 15th.' Question: When must employees book their flights?", "purpose": "Tests for an unsupported claim. The model must *not* provide a date for booking flights, as it is not in the context, showing strict adherence to the source text.", "category": 2, "label": "The context does not provide a deadline for booking flights."}
5

Then, I defined a CustomDataset class that inherits from PyTorch's Dataset to provide sample questions categorized into five groups:

import json
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, max_len: int = 256, data_list: list[dict] | None = None):
        self.max_len = max_len

        # load the raw JSONL file when no data list is passed in
        if not data_list:
            data_list = []
            file_path = 'data/sample_questions.jsonl'
            with open(file_path, 'r', encoding='utf-8') as f:
                for line in f:
                    data_list.append(json.loads(line.strip()))
        self.data_list = data_list

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, idx: int = 0) -> tuple[str, str]:
        item = self.data_list[idx]
        question, label = item['question'], item['label']
        return question, label

    def get_all_questions(self) -> list[str]:
        return [item['question'] for item in self.data_list]

    def get_all_labels_and_categories(self) -> tuple[list[str], list[int]]:
        items = self.data_list
        return [item['label'] for item in items], [int(item['category']) for item in items]
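
A quick usage check of the dataset class might look like this (it assumes data/sample_questions.jsonl exists as described above):

dataset = CustomDataset(max_len=256)
print(len(dataset))                                      # number of sample questions
question, label = dataset[0]                             # first (question, label) pair
labels, categories = dataset.get_all_labels_and_categories()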

Instantiating the Model

Instantiate the model and tokenizer using Hugging Face Transformers’ AutoModelForCausalLM and AutoTokenizer classes:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_card (e.g., 'meta-llama/Llama-3.2-1B') comes from the script's CLI arguments
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_card)
if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token

# model (half precision to fit a ~1B-parameter model comfortably on a single GPU)
model = AutoModelForCausalLM.from_pretrained(model_card, dtype=torch.float16).to(device)
if model.generation_config is not None and model.generation_config.pad_token_id is None:
    model.generation_config.pad_token_id = tokenizer.eos_token_id

Performing Inference

Each model performs inference using a specific decoding method defined in the setup:

##### qwen
# dola high
CUDA_VISIBLE_DEVICES=0 python src/inference.py\
    --model_card Qwen/Qwen3-0.6B\
    --dola_layer high\
    --max_new_tokens 256\
    --decoding_method greedy

CUDA_VISIBLE_DEVICES=0 python src/inference.py\
    --model_card Qwen/Qwen3-0.6B\
    --dola_layer high\
    --max_new_tokens 256\
    --decoding_method sample


# dola low
CUDA_VISIBLE_DEVICES=0 python src/inference.py\
    --model_card Qwen/Qwen3-0.6B\
    --dola_layer low\
    --max_new_tokens 256\
    --decoding_method greedy

CUDA_VISIBLE_DEVICES=0 python src/inference.py\
    --model_card Qwen/Qwen3-0.6B\
    --dola_layer low\
    --max_new_tokens 256\
    --decoding_method sample


# sample p = 0.9
CUDA_VISIBLE_DEVICES=0 python src/inference.py\
    --model_card Qwen/Qwen3-0.6B\
    --max_new_tokens 256\
    --decoding_method sample


# greedy
CUDA_VISIBLE_DEVICES=0 python src/inference.py\
    --model_card Qwen/Qwen3-0.6B\
    --max_new_tokens 256\
    --decoding_method greedy


##### llama 3.2 1B
# dola high
CUDA_VISIBLE_DEVICES=0 python src/inference.py\
    --model_card meta-llama/Llama-3.2-1B\
    --dola_layer high\
    --max_new_tokens 256\
    --decoding_method greedy

CUDA_VISIBLE_DEVICES=0 python src/inference.py\
    --model_card meta-llama/Llama-3.2-1B\
    --dola_layer high\
    --max_new_tokens 256\
    --decoding_method sample


# dola low
CUDA_VISIBLE_DEVICES=0 python src/inference.py\
    --model_card meta-llama/Llama-3.2-1B\
    --dola_layer low\
    --max_new_tokens 256\
    --decoding_method greedy

CUDA_VISIBLE_DEVICES=0 python src/inference.py\
    --model_card meta-llama/Llama-3.2-1B\
    --dola_layer low\
    --max_new_tokens 256\
    --decoding_method sample


# sample p = 0.9
CUDA_VISIBLE_DEVICES=0 python src/inference.py\
    --model_card meta-llama/Llama-3.2-1B\
    --max_new_tokens 256\
    --decoding_method sample


# greedy
CUDA_VISIBLE_DEVICES=0 python src/inference.py\
    --model_card meta-llama/Llama-3.2-1B\
    --max_new_tokens 256\
    --decoding_method greedy

The script:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# prompt, max_new_tokens, top_p, dola_layers, and decoding_method come from the CLI args above
tokens = tokenizer(prompt, return_tensors='pt', max_length=256, padding='max_length', truncation=True).to(device)
outputs = model.generate(
    **tokens,
    max_new_tokens=max_new_tokens,
    do_sample=(decoding_method == 'sample'),  # nucleus sampling vs. greedy
    top_p=top_p,
    dola_layers=dola_layers,                  # None, 'high', or 'low'
    custom_generate='src',
    trust_remote_code=True,
    synced_gpus=None,
    streamer=None,
    repetition_penalty=1.2,
)

# strip the echoed prompt from the decoded output
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
if generated_text.startswith(prompt): generated_text = generated_text[len(prompt):].strip()

Evaluation (BERTScore and GPT-5-mini LLM-as-a-Judge)

To evaluate responses across all questions, I employ a dual-metric approach using BERTScore (F1) and LLM-as-a-judge (GPT-5-mini):

# qwen
CUDA_VISIBLE_DEVICES=0 python src/evaluation.py\
    --model_card Qwen/Qwen3-0.6B

# llama
CUDA_VISIBLE_DEVICES=0 python src/evaluation.py\
    --model_card meta-llama/Llama-3.2-1B

The script:

import evaluate
from src.llm_judge import invoke_llm_judge

# load hf metrics
bertscore = evaluate.load('bertscore')
overall_results = dict()   # populated per response when loading the saved inference outputs

# llm judge
llm_judge_data = dict()    # {category_id: {'preds': [...], 'refs': [...], 'sources': [...], 'indices': [...]}}
llm_judge_preds, llm_judge_refs, llm_judge_sources, llm_judge_indices = [], [], [], []
for cat_id, subset in llm_judge_data.items():
    llm_judge_preds.extend(subset['preds'])
    llm_judge_refs.extend(subset['refs'])
    llm_judge_sources.extend(subset['sources'])
    llm_judge_indices.extend(subset['indices'])

factuality_scores, coherence_scores = invoke_llm_judge(llm_judge_preds, llm_judge_refs, llm_judge_sources)
for i, global_index in enumerate(llm_judge_indices):
    overall_results[global_index]['scores']['llm_judge_factuality_score'] = factuality_scores[i]
    overall_results[global_index]['scores']['llm_judge_coherence_score'] = coherence_scores[i]


# bertscore f1, computed per category (eval_metrics is grouped the same way as llm_judge_data)
for cat_id, data_dict in eval_metrics.items():
    preds, refs, indices = data_dict['preds'], data_dict['refs'], data_dict['indices']
    if not preds: continue

    bert_scores = bertscore.compute(predictions=preds, references=refs, model_type="distilbert-base-uncased")
    if bert_scores:
        bert_f1_scores = [s.item() if not isinstance(s, float | int) else s for s in bert_scores['f1']]
        for i, score in enumerate(bert_f1_scores): overall_results[indices[i]]['scores']['bert_score_f1'] = score

The judge quantifies a factuality score (correctness) and a coherence score (relevance to the given context) on a scale of one to five (higher is better) in the invoke_llm_judge function (LLMJudgeScores below is assumed to be a small Pydantic model that validates the judge’s JSON output):

import os
import json
from openai import OpenAI
from dotenv import load_dotenv
from pydantic import BaseModel


class LLMJudgeScores(BaseModel):
    # assumed schema of the judge's JSON output
    factuality: int
    coherence: int


def invoke_llm_judge(predictions: list[str], references: list[str], sources: list[str]) -> tuple[list[int], list[int]]:
    factuality_scores = []
    coherence_scores = []

    SYSTEM_PROMPT = (
        "You are an expert evaluator. Your task is to score a generated answer based on two criteria: "
        "Factuality/Faithfulness and Response Coherence. You must only output a single JSON object. "
        "The scoring scale is 1 (Poor) to 5 (Excellent) for both metrics. "
        "Strictly adhere to the following JSON schema:\n"
        "{'factuality': int, 'coherence': int}"
    )

    # create the client once, outside the scoring loop
    load_dotenv()
    openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

    # scoring
    for i, (prediction, reference, source) in enumerate(zip(predictions, references, sources)):
        print(f"-> scoring response {i+1}/{len(predictions)} with gpt-5-mini...")

        user_prompt = f"""
        **Factuality/Faithfulness:** Score the response (1-5) based on factual consistency with the provided source document.
        - 1: Fails to address the question or contains major factual errors not supported by the source.
        - 5: Entirely accurate and directly supported by the source document.
        **Response Coherence:** Score the response (1-5) based on its logical flow and structural organization.
        - 1: Disorganized, confusing structure, abrupt transitions.
        - 5: Perfectly logical flow, easy to read, and well-organized.
        ---
        **Source Document:**
        {source}

        **Ground Truth Answer (Reference):**
        {reference}

        **Generated RAG Response (Prediction):**
        {prediction}
        ---
        Your JSON output MUST ONLY contain the scores for 'factuality' and 'coherence'.
        """

        res = openai_client.chat.completions.create(
            model='gpt-5-mini',
            messages=[
                {'role': 'developer', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': user_prompt},
            ],
            response_format={'type': 'json_object'}
        )
        json_str = res.choices[0].message.content
        scores_data = json.loads(json_str)

        scores = LLMJudgeScores(**scores_data)
        factuality_scores.append(scores.factuality)
        coherence_scores.append(scores.coherence)

    return factuality_scores, coherence_scores

Results

BERTScore (F1)

Across all five tasks, DoLa leveraging lower layers generally showed better median BERTScores than the other methods on Llama 3.2:


Figure B-1. Comparison of BERTScore (F1) by decoding method (scale 0 to 1, higher is better) (Llama 3.2)

The median scores cluster between 0.65 and 0.75, indicating that performance is fairly consistent across categories and decoding methods.

The likely reason is that, for the 1B Llama 3.2 model, the simpler, more stable linguistic representations of the lower layers act as a regularizer, producing locally coherent and semantically smooth outputs that boost the median BERTScore.

For Qwen 3, DoLa leveraging lower layers leads in four of the five tasks, the exception being Category 2 (Context Coherence):


Figure B-2. Comparison of BERTScore (F1) by decoding method (Qwen 3)

In the Context Coherence tasks, greedy search performed best, potentially because these tasks rely less on the diverse information captured by the final layers of the transformer.

In addition, greedy search selects locally optimal tokens, so its outputs stay highly consistent with the immediate context that the Context Coherence tasks provide.

LLM Judge Scores

LLM judge scores range from 1 to 5, where higher values indicate better performance.

DoLa leveraging higher layers consistently yielded higher average scores:

  • Factuality: DoLa significantly outperformed the other methods, particularly in Category 1 (Factual Errors) and Category 4 (Complex Reasoning).

  • Coherence: DoLa also achieved a better average coherence score for Categories 1 and 4.

For Category 2 (Context Coherence), the LLM judge yielded the lowest scores across all methods, suggesting that the models struggled most with context-grounded instructions.

Considerations for DoLa Deployment

We have seen that DoLa’s motivation is to downplay lower-layer linguistic knowledge and amplify real-world factual knowledge.

Maximizing DoLa's Benefits: Context Length and Model Size Scaling

However, in our experiment, DoLa leveraging the lower layers outperformed the variant leveraging the higher layers in BERTScore F1.

Higher layers are responsible for long-range dependencies, global coherence, and integrating all prior information into the next-token prediction, so their influence grows with longer inputs and contexts.

Making questions (and answers) longer would therefore maximize DoLa’s benefit in maintaining global coherence while avoiding repetition and drift in long answers.

Also, given the role of higher layers, larger models (e.g., Llama 3.1 70B instead of a 1B model) would likely benefit more from DoLa because their higher layers can be contrasted more sharply against the lower layers, boosting factuality and deep reasoning.

Computational Cost and Complexity

DoLa needs to compute logits from multiple layers (the mature layer plus the candidate premature layers) for every generated token, increasing the computational cost relative to standard decoding, which relies solely on the final layer’s output.

In addition, DoLa requires access to the internal hidden states and intermediate-layer logits of the LLM, which limits its applicability: models served only via APIs typically expose just the final outputs.
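
To get a rough sense of the generation-time overhead in practice, a simple wall-clock comparison like the sketch below can be run on top of the earlier snippets. The timed_generate helper is hypothetical; model, tokenizer, device, and a prompt are assumed to exist from the instantiation and inference code above:

import time

def timed_generate(model, tokenizer, prompt, device, **gen_kwargs):
    # rough wall-clock timing of a single generate() call (illustrative only)
    tokens = tokenizer(prompt, return_tensors="pt").to(device)
    start = time.perf_counter()
    model.generate(**tokens, max_new_tokens=256, repetition_penalty=1.2, **gen_kwargs)
    return time.perf_counter() - start

# t_greedy = timed_generate(model, tokenizer, prompt, device, do_sample=False)
# t_dola   = timed_generate(model, tokenizer, prompt, device, do_sample=False, dola_layers="high")
# print(f"greedy: {t_greedy:.2f}s, DoLa: {t_dola:.2f}s")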

Wrapping Up

DoLa is a unique decoding method that improves generation quality by contrasting the token distributions from different layers of an LLM, mitigating errors and hallucinations.

In our experiment, we observed that DoLa outperforms other standard decoding methods, particularly on open-ended generative QA (GQA) tasks where fluency and semantic quality are critical.

Moving forward, expanding the scope of this study by applying DoLa to models of various sizes, analyzing performance on diverse task complexities, and accounting for specific hallucination types using varied evaluation metrics can deepen our understanding of its mechanism.

In addition, combining DoLa with other effective methods to mitigate hallucination offers a promising direction.

Reference

[1]. Survey of Hallucination in Natural Language Generation (Ji et al., arXiv: 2202.03629)

[2]. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models (Chuang et al., arXiv: 2309.03883)

[3]. BERT Rediscovers the Classical NLP Pipeline (Tenney et al., ACL 2019)

[4]. Knowledge Neurons in Pretrained Transformers (Dai et al., ACL 2022)

[5]. Contrastive Decoding: Open-ended Text Generation as Optimization (Li et al., ACL 2023)

[6]. CTRL: A Conditional Transformer Language Model for Controllable Generation (Keskar et al., arXiv: 1909.05858)


Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Generative AI with Transformers and Diffusion Models

Share What You Learned

Kuriko IWAI, "DoLa Decoding: Mitigating LLM Hallucinations via Layer Contrast" in Kernel Labs

https://kuriko-iwai.com/layer-contrast-decoding-dola


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.