LLM Decoding Strategies: A Guide to Algorithms and Sampling Methods

Discover how major decoding methods and algorithms work, and how to use them, with practical examples

Deep Learning | Data Science | LLM

By Kuriko IWAI

Introduction

Decoding strategies are essential guides used in Large Language Models (LLMs) to transform the conditional probability distributions into coherent, high-quality text.

Different strategies influence the characteristics of the LLM’s output, including its quality, creativity, consistency, and variability.

In this article, I’ll delve into decoding strategies, analyzing the underlying mechanisms, usage, and limitations of the major decoding methods and algorithms.

LLM Decoding Basics: The Probability Distribution

A Large Language Model (LLM), especially those with a decoder-only architecture, is a system designed to generate text that mirrors human-like fluency and coherence.

The diagram below illustrates the autoregressive process of text generation by a decoder-only LLM:

Figure A. Decoder-only LLM architecture (left) and its autoregressive process (right) (Created by Kuriko IWAI)

Given the initial input “I lived in Tokyo, so", the decoder-only LLM (left in Figure A):

  • Creates an input embedding X_0,

  • Computes a rich representation of the input (O) through the multi-head attention and feed-forward layers (pink box in Figure A), and

  • Predicts the next token (subword or word) one at a time in the autoregressive process (right in Figure A).

In the autoregressive process, the model generates the next token based on the highest conditional probability given the sequence generated so far.

For example, in Turn 1, the model selects "I" (left, pink box in Figure A) as it has the highest probability of 62% among all others.

Then, this newly generated token is appended to the input sequence, creating a longer context.

In Turn 2, the extended sequence “I lived in Tokyo, so I" is fed back into the model to predict the next token, "can."

This iterative process continues until the model produces an End-of-Sentence (EOS) token or reaches the maximum context length.

As illustrated in Turns 1 to 4 of Figure A, the model successfully completes the sentence:

I lived in Tokyo, so I can speak Japanese

The final linear layer (left) is responsible for computing the conditional probability distribution across the model's entire vocabulary for the next token selection.

Mathematically, the process is generalized:

P(w_{t+1} \mid w_1, \dots, w_t) = \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}}

where:

  • P( | ): The conditional probability of a token w_{t+1} being selected as a next token given the preceding sequence of tokens from w_1 to w_t,

  • t: The current token length,

  • z: Logits (un-normalized scores) such that:

z = h_{\text{final}} W_{\text{vocab}} + b_{\text{vocab}}

where:

  • h_{final}: The final hidden state fed into the linear output layer,

  • W_{vocab}: The weight matrix that projects the hidden state to the size of the vocabulary |V|, and

  • b_{vocab}: The bias term of the vocabulary projection.

The Softmax function computes a probability distribution over the entire vocabulary (e.g., 128K tokens in Llama 3.2), ensuring that the probabilities for all tokens sum to one.
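To make this concrete, here is a minimal sketch of the projection, softmax, and autoregressive loop in Python. The tiny vocabulary, the randomly initialized h_final, W_vocab, and b_vocab, and the `toy_forward` function are illustrative stand-ins for a trained decoder-only model, not a real implementation:

```python
# Minimal sketch: final projection -> softmax -> greedy autoregressive loop.
# All parameters below are random stand-ins for a trained decoder-only LLM.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["I", "lived", "in", "Tokyo", ",", "so", "can", "speak", "Japanese", "<EOS>"]
D_MODEL = 16

W_vocab = rng.normal(size=(D_MODEL, len(VOCAB)))   # projects the hidden state to |V| logits
b_vocab = np.zeros(len(VOCAB))                     # bias term

def toy_forward(token_ids):
    """Stand-in for the transformer stack: returns a fake final hidden state h_final."""
    rng_local = np.random.default_rng(sum(token_ids) + len(token_ids))
    return rng_local.normal(size=D_MODEL)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()                              # probabilities over |V| sum to one

def generate(prompt, max_new_tokens=6):
    token_ids = [VOCAB.index(w) for w in prompt]
    for _ in range(max_new_tokens):
        h_final = toy_forward(token_ids)
        z = h_final @ W_vocab + b_vocab             # logits (un-normalized scores)
        probs = softmax(z)                          # conditional distribution P(w_{t+1} | w_1..w_t)
        next_id = int(np.argmax(probs))             # pick the next token (greedy here)
        token_ids.append(next_id)                   # feed it back as the extended context
        if VOCAB[next_id] == "<EOS>":               # stop at End-of-Sentence
            break
    return [VOCAB[i] for i in token_ids]

print(generate(["I", "lived", "in", "Tokyo", ",", "so"]))
```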

Implementing Decoding Strategies

Now that we know LLMs compute a conditional probability distribution over the vocabulary for the next token, we need a rule to sample a token from this distribution and continue the sequence.

A decoding strategy is the highest-level concept that governs the desired characteristics of the output based on the overall goals of the task at hand.

A Hierarchical Breakdown

The below diagram illustrates some examples of decoding strategies:

Figure B. Use cases in implementing decoding strategies (Created by Kuriko IWAI)

For example, for a factual Q&A (Use Case 1 in Figure B), the decoding strategy should be deterministic because the goal is to consistently generate the single correct answer to the given query.

The decoding strategy for chat (Use Case 2), on the other hand, needs to balance cohesiveness of the conversation and randomness of vocabularies.

To implement these decoding strategies, decoding methods (middle row in Figure B) are used as practical rules for selecting the next token from the model’s conditional probability distribution.

Then, the decoding algorithm (bottom row in Figure B) runs the autoregressive process, repeatedly applying the chosen decoding methods to generate the sequence of tokens.

For example, in Use Case 1, the decoding methods remove stochasticity by adjusting key parameters (e.g., setting top-k = 1), while the Greedy Search decoding algorithm ensures that the model keeps selecting the token with the highest probability at every autoregressive step.

This yields a single, highly probable sequence, which is essential for tasks like factual Q&A that require high confidence and correctness.
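As an illustration only, the two use cases might map to parameter settings like the ones below. The parameter names mirror common generation APIs (do_sample, temperature, top_p, top_k, frequency_penalty), but the exact names, ranges, and defaults vary by provider and library:

```python
# Illustrative decoding configurations for the two use cases in Figure B.
# These are assumptions for demonstration, not settings from any specific API.
FACTUAL_QA_CONFIG = {        # Use Case 1: deterministic strategy
    "do_sample": False,      # Greedy Search: no stochastic sampling
    "top_k": 1,              # keep only the single most probable token
}

CHAT_CONFIG = {              # Use Case 2: balance coherence and variability
    "do_sample": True,
    "temperature": 0.8,      # mild flattening of the distribution
    "top_p": 0.9,            # nucleus sampling
    "frequency_penalty": 0.3 # discourage repeated vocabulary
}
```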

In the next section, I’ll explore major decoding methods and decoding algorithms.

Decoding Methods: Controlling Probability Distributions

**Decoding methods** (also called stochastic sampling methods) introduce randomness into the autoregressive process, which is critical for generating diverse text.

In this section, I’ll explore:

  • Sampling methods: top-k sampling and top-p sampling, and

  • Distribution-modifying methods: temperature scaling and repetition penalties.

Sampling Methods

The below diagram compares major sampling methods, top-k and top-p:

Figure C. Top-k sampling vs top-p sampling (Created by Kuriko IWAI)

Top-k Sampling

Top-k sampling restricts the candidate pool to the k most probable tokens at each step and samples from that set.

In Figure C, to select the next token following the sentence “I like blue“, top-k = 2 keeps the top two tokens (“sky“ and “jeans“) based on the probability ranking.
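A minimal top-k sketch in Python is shown below; the token list and probabilities are illustrative values loosely mirroring Figure C, not outputs from a real model:

```python
# Top-k sampling: keep the k most probable tokens, renormalize, then sample.
import numpy as np

tokens = np.array(["sky", "jeans", "shoes", "cats", "ideas"])   # toy candidates
probs  = np.array([0.45, 0.25, 0.16, 0.09, 0.05])               # illustrative probabilities

def top_k_sample(tokens, probs, k=2, rng=np.random.default_rng(0)):
    top_idx = np.argsort(probs)[::-1][:k]                 # indices of the k most probable tokens
    kept_probs = probs[top_idx] / probs[top_idx].sum()    # renormalize over the kept set
    return rng.choice(tokens[top_idx], p=kept_probs)

print(top_k_sample(tokens, probs, k=2))   # returns either "sky" or "jeans"
```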

Goals

  • Restrict randomness and diversity in the most intuitive way.

Usage

  • Used in early sampling implementations (e.g., initial GPT models).

Limitations

  • Arbitrary k selection: Choosing an optimal k depends heavily on the context, task, and data distribution, making the selection process non-robust.

  • Hard cut-off: Top-k ignores the marginal difference between the k-th and (k+1)-th tokens.

  • Bias amplification: Top-k amplifies the bias of the underlying ranking function, leading to unfair exclusion of under-represented groups.

  • Computational cost: Finding the top k requires scoring every candidate token.

Top-p Sampling (Nucleus Sampling)

Top-p samples from the smallest set of most probable tokens whose cumulative probability exceeds a threshold p.

In Figure C, top-p = 0.85 samples from three tokens (“sky", “jeans", and “shoes"), as these three tokens form the smallest set whose cumulative probability reaches the 0.85 threshold.

By setting the threshold closer to one, top-p can sample more diverse tokens than top-k even when the probability distribution is sharp.
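Below is a comparable top-p sketch over the same toy distribution (again, illustrative values rather than a real model's output):

```python
# Top-p (nucleus) sampling: keep the smallest set of tokens whose cumulative
# probability reaches the threshold p, renormalize, then sample.
import numpy as np

tokens = np.array(["sky", "jeans", "shoes", "cats", "ideas"])
probs  = np.array([0.45, 0.25, 0.16, 0.09, 0.05])

def top_p_sample(tokens, probs, p=0.85, rng=np.random.default_rng(0)):
    order = np.argsort(probs)[::-1]                      # sort tokens by probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1     # smallest nucleus reaching p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()         # renormalize over the nucleus
    return rng.choice(tokens[kept], p=kept_probs)

print(top_p_sample(tokens, probs, p=0.85))   # samples from {"sky", "jeans", "shoes"}
```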

Goals

  • Balance coherence and creativity.

Usage

  • Dominant sampling strategy for major LLMs like ChatGPT, Gemini, Claude, Llama, and Mistral variants.

Limitations

  • Arbitrary p selection: As with top-k, the threshold p is hand-tuned and can be suboptimal.

  • Insensitivity to confidence: Top-p can be overly restrictive when the model is highly confident and produces an extremely sharp distribution. For instance, if the top token has a 99% probability, top-p = 0.95 selects only that one token, whereas top-k guarantees multiple candidates.

  • No guaranteed diversity: When the probability distribution is clustered on redundant tokens, the output from top-p can be repetitive, not diverse.

Temperature Scaling

Temperature scaling is a decoding method for controlling the entropy (randomness and diversity) of the probability distribution by adjusting a hyperparameter T.

The below diagram shows how different temperature values alter the shape of the probability distribution:

Figure D. Temperature scaling and sampling methods (Created by Kuriko IWAI)

When T = 1 (center, orange box in Figure D), the distribution remains unchanged.

As T gets lower (T < 1, going to the left side), the distribution gets sharpened, concentrating the probability mass on the most likely tokens.

For example, when T = 0.2 (left), the probability mass gathers at the top token “sky“. So, when applying top-p = 0.85 (bottom row), the model only selects “sky“, resulting in the most deterministic selection.

On the other hand, high temperature (T > 1, right side) flattens the distribution, making less probable tokens more likely to be sampled.

In Figure D, as T grows to T = 50, the probability distribution gets very flat. So, the same top-p = 0.85 can sample from many more tokens, resulting in more diverse selection.

In contrast, top-k = 2 selects the same two tokens (“sky" and “jeans") regardless of the value of T, because temperature scaling does not change the probability ranking.

This indicates that combining temperature and top-p sampling can control the randomness and diversity of text generation.

Mathematically, the logits z are scaled by the temperature T before the softmax produces the conditional probability distribution:

P_T(w_{t+1} \mid w_1, \dots, w_t) = \text{softmax}\Big(\frac{z}{T}\Big)_i = \frac{e^{z_i / T}}{\sum_{j=1}^{|V|} e^{z_j / T}}
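To see the effect numerically, here is a minimal sketch applying the scaled softmax to a toy set of logits (illustrative values only):

```python
# Temperature scaling: divide the logits by T before applying the softmax.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.2, 0.6, 0.1, -0.5])    # toy logits for 5 candidate tokens

for T in (0.2, 1.0, 5.0):
    probs = softmax(logits / T)
    print(f"T={T}: {np.round(probs, 3)}")
# Low T sharpens the distribution around the top token; high T flattens it,
# so downstream top-p sampling keeps fewer or more candidates accordingly.
```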

Goals

  • Control the level of randomness and diversity.

Usage

  • Universal for major LLMs like ChatGPT, Claude, Gemini, and Llama (setting around 0.7 < T < 1.0).

Limitations

  • Multi-dimensional tuning: Finding the optimal combination of T, p, and k can be complex, leading to unpredictable results if values are pushed to extremes.

  • Forced trade-off between exploitation and exploration: As a fixed, global hyperparameter, T cannot adapt to find the optimal balance at the token level during the autoregressive process. For example, it would be ideal to use a lower T for factual parts of the sequence, such as names, and a higher T for creative parts, such as story development; in reality, T is constant.

  • Ignored token-dependency: The fixed T also fails to leverage the rich internal state of the model like hidden states and attention weights to predict a context-aware temperature for the current token.

  • Compromised miscalibration handling: When the model exhibits severe miscalibration (overconfidence), temperature scaling alone cannot correct the confidence of highly incorrect predictions.

  • Vulnerability to task shift: A temperature optimized for one task (e.g., summarization) is suboptimal for a different task (e.g., code generation). When the model is versatile across many tasks, a single T value can lead to poor performance.

Repetition and Frequency Penalties

Repetition and frequency penalties also modify the original probability distribution before applying a sampling method, by directly penalizing logits z.

The formula used by the OpenAI API combines the frequency penalty and the presence penalty into a single adjustment such that:

\tilde{z}_j = z_j - c_j \cdot \alpha_{\text{freq}} - \mathbb{1}[c_j > 0] \cdot \alpha_{\text{pres}}

where:

  • z̃_j: The new, penalized logit for the j-th token,

  • z_j: Original logit for the j-th token,

  • c_j: The count (frequency) of the j-th token in the previously generated text,

  • α_{freq}: The frequency penalty coefficient (hyperparameter) which ranges from -2.0 to 2.0,

  • 1[c_j > 0]: The indicator function that returns 1 if c_j > 0 (the j-th token has appeared at least once) and 0 otherwise, and

  • α_{pres}: The presence penalty coefficient (hyperparameter) which ranges from -2.0 to 2.0.

In the formula, c_j⋅α_{freq} is the frequency penalty term, proportional to how many times the given token has already appeared (measured by c_j).

This means that a token that has appeared five times receives a penalty five times stronger than a token that has appeared once.

On the other hand, the presence penalty term (1[c_j > 0]⋅α_{pres}) is a fixed, one-time penalty for any token that has appeared at least once, acting as a repetition penalty based on a boolean condition.

After modifying the logits, the model generates a new probability distribution P~ such that:

\tilde{P}_T(w_{t+1} \mid w_1, \dots, w_t) = \text{softmax}\Big(\frac{\tilde{z}}{T}\Big)_i = \frac{e^{\tilde{z}_i / T}}{\sum_{j=1}^{|V|} e^{\tilde{z}_j / T}}

while optionally applying temperature scaling T.

And lastly, the model samples tokens from the new probability distribution using top-k or top-p sampling.
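The whole pipeline (penalize the logits, rescale by T, renormalize) can be sketched as follows; the counts, coefficients, and logits are illustrative values, and the penalty follows the OpenAI-style formula above:

```python
# Frequency and presence penalties applied to toy logits, then temperature + softmax.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.2, 0.6, 0.1])   # original logits z_j for 4 candidate tokens
counts = np.array([3,   1,   0,   0])     # c_j: how often each token already appeared
alpha_freq, alpha_pres, T = 0.5, 0.3, 0.8

penalized = logits - counts * alpha_freq - (counts > 0) * alpha_pres   # penalized logits
probs_before = softmax(logits / T)
probs_after  = softmax(penalized / T)     # new distribution, ready for top-k/top-p sampling

print(np.round(probs_before, 3))          # before penalties: the repeated token dominates
print(np.round(probs_after, 3))           # after penalties: repeated tokens are suppressed
```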

Goals

  • Prevent the model from getting stuck in loops and encourage new vocabulary.

Usage

  • Common in chat interfaces such as ChatGPT.

Limitations

  • No contextual awareness: Because the penalties are applied uniformly based on a token's appearance count, they can penalize necessary repetition of keywords or function words like “the“ and “of“.

  • Arbitrary parameter tuning: The hyperparameters α_{freq} and α_{pres} are hand-tuned, which can result in suboptimal settings.

  • Ineffectiveness in long-range repetition: The penalties are designed to check the immediate history, such as the last few tokens, so they are less effective at catching multi-sentence or thematic repetition over very long sequences.

Decoding Algorithms: Maximizing Sequence-Level Quality

**Decoding algorithms** (also called deterministic search methods) run the autoregressive process, applying decoding methods at each step to ensure that the generated sequence maximizes the overall probability according to the model.

In this section, I’ll explore the major algorithms:

  • Greedy Search,

  • Beam Search (Pure Beam Search, Stochastic Beam Search), and

  • Diverse Beam Search.

Greedy Search

Greedy Search is a deterministic decoding algorithm that guarantees the same output for the same input, prioritizing coherence and likelihood over creativity.

Mechanism

  • At every step, the model chooses the single token with the highest probability.

  • Exactly the same as setting top-k = 1 (a minimal sketch follows below).
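Below is a tiny illustration on the same toy distribution used in the sampling examples above (illustrative values, not a real model):

```python
# Greedy Search as a selection rule: always take the arg-max of the distribution,
# which is equivalent to top-k sampling with k = 1 and is fully deterministic.
import numpy as np

tokens = np.array(["sky", "jeans", "shoes", "cats", "ideas"])
probs  = np.array([0.45, 0.25, 0.16, 0.09, 0.05])

for _ in range(3):
    print(tokens[np.argmax(probs)])   # prints "sky" every time
```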

Goals

  • Maximize immediate local probability.

Usage

  • Used as a baseline for performance metrics.

  • Best for highly constrained tasks like code generation where certainty is key.

Limitations

  • Local optimality: Selects the token with the highest immediate probability at each step, ignoring the potential for a better globally optimal sequence later on.

  • Lack of diversity & repetition: Produces the exact same output for the same input. The model can get stuck in boring, repetitive loops.

Beam Search (BS)

Beam Search is another deterministic decoding algorithm that tracks multiple high-probability candidate sequences (beams) at each step, instead of committing to a single token.

The below diagram shows the process of Pure Beam Search with the beam width N = 2:

Figure E. Example of Pure Beam Search (beam width: N = 2) (Created by Kuriko IWAI)

In each turn of the autoregressive process (Turns 1 to 3 in Figure E), the algorithm selects two beams (sequences) from the candidate pool based on their aggregate scores.

For example, in Turn 1, the algorithm selects the two tokens “sky“ and “jeans“ to complete the sequence “I like blue ____“ based on the score S(Y1), generating two beams:

  • "I like blue sky" and

  • "I like blue jeans".

Then, in Turn 2, the algorithm generates a conditional probability distribution for each beam and computes the score of candidate sequences (S(Y2)).

For example, adding “because“ (score: -1.427, based on its conditional probability) to the beam "I like blue sky" (score: -0.3567) results in a total score of -1.7837 (= -0.3567 + (-1.427)), which is the highest among all candidates.

The algorithm runs the same process for the other beam "I like blue jeans", resulting in selecting two beams:

  • "I like blue sky because", and

  • "I like blue jeans and"

by the end of Turn 2, while pruning the rest.

Mathematically, this score computation is generalized by adding the natural logarithm of the conditional probability (ln P) of every candidate token w to the total score from the previous turn:

S(Y) = \sum_{i=1}^{L} \log P(w_i \mid w_{<i})

where:

  • S(Y): The score (standard score) of the entire sequence Y,

  • Y: The entire sequence, of length L, and

  • P(w_i | w_{<i}): The conditional probability of the i-th token being selected given the preceding tokens.

This process continues until it reaches the maximum token length or EOS, where the algorithm finally chooses a beam with the highest score among all.
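Here is a simplified Beam Search sketch with beam width N = 2; the `toy_log_probs` function is a deterministic stand-in for a real model's conditional distribution, so the selected beams are purely illustrative:

```python
# Simplified Beam Search (beam width N = 2) over a toy next-token distribution.
import numpy as np

VOCAB = ["sky", "jeans", "because", "and", "<EOS>"]

def toy_log_probs(sequence):
    """Stand-in model: returns log P(w | sequence) over VOCAB (illustrative values)."""
    seed = sum(ord(ch) for token in sequence for ch in token) % (2**32)
    logits = np.random.default_rng(seed).normal(size=len(VOCAB))
    return logits - np.log(np.exp(logits).sum())          # log-softmax

def beam_search(prompt, beam_width=2, max_steps=3):
    beams = [(list(prompt), 0.0)]                          # (sequence, accumulated log prob S(Y))
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<EOS>":                         # finished beams are carried over
                candidates.append((seq, score))
                continue
            for token, lp in zip(VOCAB, toy_log_probs(seq)):
                candidates.append((seq + [token], score + lp))   # S(Y) = sum of log P
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]                    # prune everything but the N best
    return beams

for seq, score in beam_search(["I", "like", "blue"]):
    print(" ".join(seq), round(score, 3))
```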

Mechanism (summary)

  • Keeps N beams throughout the autoregressive process and chooses the beam with the highest score at the end.

  • Discards a vast number of low-scoring paths to maintain computational tractability.

  • When combined with top-p or top-k sampling, Beam Search narrows the candidate token pool before each step (Stochastic Beam Search).

  • Technically, Greedy Search is Beam Search with N = 1 and top-k = 1.

Goals

  • Maximizes overall sequence probability (less short-sighted than Greedy Search).

Usage

  • Traditionally popular in machine translation and abstractive summarization (e.g., in early Transformer/encoder-decoder models like T5).

  • Less common for open-ended chat models like ChatGPT due to its tendency to repeat phrases in long-form generation.

Limitations

  • Computational cost: Expanding a large beam increases the computational load, which can be a limitation for real-time applications.

  • Beam degeneration: In some cases, Beam Search may lead to repetitive or degenerate outputs, especially when the beam size is too small or the model’s predictions are highly confident.

Diverse Beam Search (DBS)

Diverse Beam Search (DBS) adds a penalty term to the original Beam Search score to encourage diversity.

DBS first divides the beam width N into smaller, independent groups G and then penalizes sequences within those groups based on redundancy.

The diagram below compares Pure BS and DBS, both applied to a model with a vocabulary of 12 tokens:

Figure F. Pure Beam Search vs Diverse Beam Search (N = 6, |V| = 12, G = 3) (Created by Kuriko IWAI)

Both Pure BS and DBS select six beams as N = 6, but DBS forces diversity.

DBS groups the 12 tokens into three groups from G1 to G3 and selects top two (N / G = 6 / 3 = 2) beams from each group based on the score S’(Y).

S'(Y) adds a penalty term to the standard score such that:

S'(Y) = S(Y) - \lambda \cdot r_g

where:

  • S(Y): The standard score used in Pure BS (accumulated log probability),

  • λ (λ > 0): The penalty hyperparameter controlling how aggressively the algorithm should penalize similar sequences. A higher λ forces more diversity.

  • r_g: The rank of the sequence within the group g. For example:
    r_g = 0 for the best sequence,
    r_g = 1 for the second best,
    r_g = 2 for the third best and so on.

In Figure F, λ = 0.2 subtracts penalties of 0.2, 0.4, and 0.6 from the second, third, and last sequences in each group respectively, increasing the penalty as the rank goes down.

This encourages each group to explore a different high-level idea, prefix, or theme, because the algorithm penalizes lower-ranked sequences within each group more heavily.

In Figure F, DBS ends up selecting “I like blue cats“, whose score S'(Y) = -0.69 is lower than that of “I like blue shoes“ (-0.62) in G1.

Pure BS, on the other hand, is purely score-driven: it retains the top six beams overall, resulting in very similar, most-probable sequences with low diversity.
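Here is a minimal sketch of this rank-based penalty; the group contents and scores S(Y) are illustrative and are not taken from Figure F:

```python
# Diverse Beam Search rank penalty: S'(Y) = S(Y) - lambda * r_g within each group.
lam = 0.2                                   # diversity penalty hyperparameter (lambda)
groups = {                                  # candidate sequences with standard scores S(Y)
    "G1": [("I like blue sky", -0.36), ("I like blue shoes", -0.62), ("I like blue cats", -0.49)],
    "G2": [("I like blue jeans", -0.40), ("I like blue coats", -0.75), ("I like blue boots", -0.90)],
}

for name, candidates in groups.items():
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)             # rank within the group
    penalized = [(seq, s - lam * r_g) for r_g, (seq, s) in enumerate(ranked)]  # apply S'(Y)
    selected = penalized[:2]                # keep N / G = 2 beams per group
    print(name, selected)
```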

Goal

  • Select N sequences that are semantically diverse from one another.

Usage

  • Used in R&D and application scenarios like translation and image captioning tasks where multiple distinct candidates are required.

Limitations

  • Arbitrary hyperparameters: An aggressive λ may generate diverse but low-quality outputs, while a small λ may fail to create enough diversity. An improper G and N setup can limit the diversity benefit.

  • Reduced quality and coherence: Penalizing high-probability sequences can lead to selecting sequences with lower scores than Pure Beam Search, which might result in less coherent outputs.

  • Computational complexity: DBS requires tracking and calculating penalties across multiple diversity groups G in addition to the standard beams N, adding computational overhead and memory usage.

  • Limited diversity mechanism: The penalty based on rank within a group may fail to enforce “semantic“ diversity. For example, DBS with N = 2 can generate two beams, “The Labrador dog is dashing through the field.“ and “The golden retriever runs across the lawn.“, both of which carry the same semantic meaning: a dog is running on the grass.

Conclusion

Decoding strategies are critical in LLM text generation as they directly influence the quality, coherence, diversity, and computational cost of the output, moving beyond simple sampling to produce human-like sequences.

To implement these strategies, we use decoding methods to adjust the probability distribution via temperature scaling or top-k/top-p filtering, and then apply decoding algorithms to efficiently run the autoregressive prediction process.

Moving forward, advanced algorithms like Speculative Decoding, Contrastive Search, and Neural Text Generation (NTG) methods will continue to be explored to achieve even faster inference and higher quality in LLM deployment.


Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation

Share What You Learned

Kuriko IWAI, "LLM Decoding Strategies: A Guide to Algorithms and Sampling Methods" in Kernel Labs

https://kuriko-iwai.com/llm-decoding-strategies


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.