Beyond Zero: A Guide to N-Gram Smoothing and Language Model Robustness

Practical applications of key smoothing algorithms in n-gram models

Machine Learning | Data Science | Python

By Kuriko IWAI


Table of Contents

Introduction
Simple Example: The Zero-Frequency Problem - Bigram Case
Why is Smoothing Necessary?
Common Smoothing Methods
1. Add-One Smoothing (Laplace Smoothing)
2. Add-k Smoothing
3. Good-Turing Smoothing
4. Kneser-Ney Smoothing
Important Note for Real-World Scenarios Using Kneser-Ney
Complementary Strategies: Backoff and Interpolation
Wrapping Up

Introduction

In the realm of natural language processing (NLP) and statistical modeling, particularly when dealing with language models, a common challenge arises:

the problem of zero-frequency events.

This occurs when a particular sequence of words (an n-gram) does not appear in the training corpus.

A language model trained solely on observed frequencies would assign a probability of zero to such unseen events.

This is problematic because it implies impossibility and can lead to severe errors in tasks like speech recognition, machine translation, or text generation.

Figure A. Examples of N-Gram (Image Source)


Simple Example: The Zero-Frequency Problem - Bigram Case

Let us consider a very small training corpus consisting of just two sentences:

  1. "I like red apples."

  2. "I like green grapes."

Our vocabulary size for this tiny corpus is V = 7 (the period counts as a word):

{"I", "like", "red", "apples", "green", "grapes", "."}.

The Zero-Frequency Problem: Calculating the Probability of “like blue”

I’ll try to calculate the probability of the bigram "like blue" using an unsmoothed approach.

First, find the probability of the word "blue" given the preceding word "like": P(blue | like).

  • Count of "like blue": C(like blue)=0 (since "blue" never appears after "like" in our corpus).

  • Count of "like": C(like)=2 (appears in both sentences).

Using the unsmoothed formula:

P(w_n \mid w_1, \dots, w_{n-1}) = \frac {C(w_1, \dots, w_{n})} {C(w_1, \dots, w_{n-1})}

(w_1, w_2, … w_n: n-grams in the training corpus, C(x): the count of the n-grams)

P(\text{blue} \mid \text{like}) = \frac {C(\text{like blue})} {C(\text{like})} = \frac {0} {2} = 0

This zero result implies that it is impossible for "blue" to follow "like".

This is clearly unrealistic; in real language, "I like blue cars" is a perfectly valid phrase.

The zero probability would break any larger sequence containing "like blue".
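The calculation above can be reproduced in a few lines of Python. This is a minimal sketch; the tokenization (period as its own token) follows the article, while the function and variable names are mine:

```python
from collections import Counter

# Toy corpus: the period is treated as its own token.
sentences = [
    ["I", "like", "red", "apples", "."],
    ["I", "like", "green", "grapes", "."],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))

def p_mle(word, prev):
    """Unsmoothed (maximum-likelihood) bigram probability P(word | prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("red", "like"))   # 1/2 = 0.5
print(p_mle("blue", "like"))  # 0/2 = 0.0 -- the zero-frequency problem
```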

Why is Smoothing Necessary?

Smoothing is a technique used to address this zero-frequency problem.

Its core idea is to slightly discount the probabilities of observed n-grams and redistribute this reserved probability mass among the unseen n-grams. This ensures that every possible n-gram, even one not present in the training corpus, receives a small but non-zero probability.

As we saw in the case earlier, without smoothing, a model would suffer from several critical issues:

  1. Impossibility of Unseen Events

    Assigning a probability of zero to an unseen n-gram means the model believes it can never occur. This is often incorrect; it simply hasn't been observed yet in the finite training data.

  2. Fragility

    A single unseen n-gram in a longer sequence would cause the probability of the entire sequence to become zero, rendering the model useless for that input.

    This means that no matter how strong the evidence from the other n-grams is, a single n-gram with zero probability completely nullifies it.

  3. Generalization Issues

    The model would struggle to generalize to new data that contains n-grams not present in its training corpus, limiting its real-world applicability.

Smoothing techniques provide a more robust and realistic probability distribution, allowing the model to handle novel inputs gracefully.

Common Smoothing Methods

Various smoothing methods have been developed, each with its own approach to distributing the probability mass. Here are some of the most common ones:

1. Add-One Smoothing (Laplace Smoothing)

Add-one smoothing is the simplest form of smoothing.

It involves adding 1 to the count of every n-gram (including those with zero counts) and adjusting the denominator accordingly.

\begin {align*} P_{smoothed}(w_n \mid w_1, \dots, w_{n-1}) &= \frac {C(w_1, \dots, w_{n}) + 1} {C(w_1, \dots, w_{n-1}) + V} \\ \text{Cf. } P(w_n \mid w_1, \dots, w_{n-1}) &= \frac {C(w_1, \dots, w_{n})} {C(w_1, \dots, w_{n-1})} \end {align*}

where

  • w_1, w_2, … w_n: n-gram

  • C(w_1, … w_n): the count of the n-gram in the training corpus.

  • V: the size of the vocabulary (number of unique words)

Pros

  • Simple to implement.

Cons

  • Tends to assign too much probability mass to unseen events, especially in large vocabularies, thus over-smoothing the distribution.

The Bigram Case

I’ll apply Add-One Smoothing to the bigram case in the previous section.

For P_smoothed​(blue | like):

P_{smoothed}(\text{blue} \mid \text{like}) = \frac {0+1} {2 + 7} \approx 0.111 \neq 0

(C(like blue)=0, C(like)=2, and V=7)

With Add-One Smoothing, "like blue" now has a non-zero probability of approximately 0.111.

This acknowledges that while "blue" hasn't been seen after "like" in our limited data, it's still a possible sequence.

Let's also look at an observed bigram, "like red":

\begin{align*} P_{smoothed}(\text{red} \mid \text{like}) &= \frac {1+1} {2 + 7} \approx 0.222 \\ \text{Cf. } P(\text{red} \mid \text{like}) &= \frac{1} {2} = 0.50 \end {align*}

(C(like red)=1, C(like)=2, and V=7)

Notice that the probability of "like red" has been slightly discounted from its unsmoothed probability 0.50 to the smoothed probability 0.222.

This discounted probability mass is what gets redistributed to the unseen bigrams, ensuring that no n-gram has a zero probability.
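The two results above can be checked with a short pure-Python sketch (names are mine, not from the article):

```python
from collections import Counter

sentences = [
    ["I", "like", "red", "apples", "."],
    ["I", "like", "green", "grapes", "."],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))

V = len(unigram_counts)  # vocabulary size: 7, period included

def p_add_one(word, prev):
    """Laplace-smoothed bigram probability: add 1 to every count, V to the denominator."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(round(p_add_one("blue", "like"), 3))  # 1/9 ≈ 0.111 (unseen, now non-zero)
print(round(p_add_one("red", "like"), 3))   # 2/9 ≈ 0.222 (seen, discounted from 0.5)
```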

2. Add-k Smoothing

Add-k smoothing is a generalization of Add-one smoothing, where a small fractional constant k (e.g., 0.1 or 0.5) is added instead of 1.

\begin {align*} P_{smoothed}(w_n \mid w_1, \dots, w_{n-1}) &= \frac {C(w_1, \dots, w_{n}) + k} {C(w_1, \dots, w_{n-1}) + kV} \\ \text{Cf. } P(w_n \mid w_1, \dots, w_{n-1}) &= \frac {C(w_1, \dots, w_{n})} {C(w_1, \dots, w_{n-1})} \end {align*}

Pros

  • Attempts to mitigate the over-smoothing issue of Add-one

Cons

  • Choosing an optimal k can be challenging and often requires empirical tuning.
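Add-k differs from Add-one in a single line. In this sketch, k = 0.1 is an arbitrary illustration, not a recommended value:

```python
from collections import Counter

sentences = [
    ["I", "like", "red", "apples", "."],
    ["I", "like", "green", "grapes", "."],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))

V = len(unigram_counts)  # 7

def p_add_k(word, prev, k=0.1):
    """Add-k smoothed bigram probability: add k to every count, kV to the denominator."""
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * V)

# Smaller k shifts less mass to unseen events than Add-one does.
print(round(p_add_k("blue", "like"), 3))  # 0.1/2.7 ≈ 0.037 (vs. 0.111 with Add-one)
print(round(p_add_k("red", "like"), 3))   # 1.1/2.7 ≈ 0.407 (vs. 0.222 with Add-one)
```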

3. Good-Turing Smoothing

Good-Turing smoothing is a more sophisticated method that re-estimates n-gram counts using count-of-counts statistics: N_c, the number of distinct n-grams that appear exactly c times in the corpus.

P_{smoothed}(w_n \mid w_1, \dots, w_{n-1}) = \frac {c^*} {N}

where:

  • N: the total number of n-gram tokens observed in the corpus

  • c∗: Re-estimated count for an n-gram with original count (c):

c^* = (c+1)\frac {N_{c+1}} {N_c}

(N_c: the number of n-grams that appear exactly c times in the corpus)

Pros

  • More theoretically sound, especially for handling zero counts.

  • Particularly effective for estimating the probability of unseen events

Cons

  • Can be complex to implement, especially for higher counts.

  • Requires sufficient data for accurate Nc​ estimations.

The Bigram Case

In the bigram example, the smoothed probability is:

P_{smoothed} (\text{blue} \mid \text{like}) = \frac{0.4} {2} = 0.2

(restricting attention to bigrams that start with "like": N_1 = 2, N_0 = 7 − 2 = 5, c = 0, so c* = (0 + 1) × 2 / 5 = 0.4, and N = C(like red) + C(like green) = 2)
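This restricted computation can be sketched as follows. Note that it mirrors the article's simplification of considering only bigrams that start with "like"; a full Good-Turing implementation would build count-of-counts tables over all n-grams:

```python
from collections import Counter

sentences = [
    ["I", "like", "red", "apples", "."],
    ["I", "like", "green", "grapes", "."],
]

# Count only continuations of the context "like".
continuations = Counter()
for sent in sentences:
    for prev, word in zip(sent, sent[1:]):
        if prev == "like":
            continuations[word] += 1

V = 7                                                  # vocabulary size, period included
N = sum(continuations.values())                        # observed "like *" tokens: 2
N1 = sum(1 for c in continuations.values() if c == 1)  # types seen exactly once: 2
N0 = V - len(continuations)                            # possible but unseen types: 5

c_star = (0 + 1) * N1 / N0   # re-estimated count for an unseen "like *" bigram
p_unseen = c_star / N
print(p_unseen)  # 0.4 / 2 = 0.2
```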

4. Kneser-Ney Smoothing

Kneser-Ney smoothing is one of the most effective and widely used smoothing techniques, particularly for n-gram language models.

It's a discounting method that takes into account the context of a word, distinguishing between the probability of a word appearing for the first time in a new context versus appearing in a known context.

Mathematically, the probability of a specific bigram is defined:

P_{KN}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1}w_i) - D,\, 0)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i)

where:

  • C(w_{i-1} w_i): count of the bigram w_{i-1} w_i.

  • C(w_{i-1}): count of the unigram prefix w_{i-1}.

  • D: Absolute discount factor (e.g., 0.75).

  • P_cont: the continuation probability of word w_i (how likely w_i is to appear as a novel continuation in a new context):

P_{cont}(w_i) = \frac{N_1^+(\cdot\, w_i)}{\sum_{w'} N_1^+(\cdot\, w')}

  • λ(w_{i-1}): a normalization constant for the prefix w_{i-1}, ensuring the probabilities sum to one:

\lambda(w_{i-1}) = \frac{D \cdot N_1^+(w_{i-1}\,\cdot)}{C(w_{i-1})}

The formula shows us its core ideas:

  • Discounting (D): Subtracting a fixed discount (D) from the counts of observed n-grams.

  • Continuation probability (P_cont): Estimates how likely a word is to appear as a new word in a new context, rather than just its overall frequency.

  • Lower-order distribution: Using a lower-order n-gram distribution (e.g., bigram for trigram, unigram for bigram) to distribute the discounted probability mass.

Pros

  • Highly effective, state-of-the-art for many NLP tasks.

Cons

  • Complex to implement.

The Bigram Case

In practice, demonstrating the full effect of Kneser-Ney requires a slightly richer corpus than our toy example, especially for an unseen bigram whose second word is known but whose combination with the context is unseen.

For demonstration purposes, I'll estimate the probability of "red | like":

\begin{align*} C(\text{like red}) &= 1, \quad C(\text{like}) = 2, \quad D = 0.75 \\ \lambda(\text{like}) &= \frac{D \cdot N_1^+(\text{like}\,\cdot)}{C(\text{like})} = \frac {0.75 \cdot 2} {2} = 0.75 \\ P_{cont}(\text{red}) &= \frac{N_1^+(\cdot\,\text{red})}{\sum_{w'} N_1^+(\cdot\, w')} = \frac {1} {7} \\ \implies P_{KN}(\text{red} \mid \text{like}) &= \frac{\max(1 - 0.75,\, 0)}{2} + 0.75 \cdot \frac {1} {7} \approx 0.232143 \end{align*}

where:

  • w_{i-1} = “like“,

  • w_{i} = “red“,

  • N_1^+(·w): the number of unique words that precede a specific word w ("red", for instance),

  • N_1^+(w·): the number of unique words that follow a specific word w,

  • ∑_{w'} N_1^+(·w'): the total number of unique bigrams in the corpus. In this case, it is 7:
    "I like", "like red", "red apples", "apples .", "like green", "green grapes", "grapes ."

Note, however, that the KN probability of "blue | like" turns out to be zero, because both C(like blue) and P_cont(blue) are zero:

P_{KN}(\text{blue} \mid \text{like}) = 0 + \lambda(\text{like}) \cdot P_{cont}(\text{blue}) = 0.75 \cdot 0 = 0

In the next section, we’ll explore how KN handles zero probabilities.
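Before moving on, the bigram-level KN computation can be verified with a short pure-Python sketch. This is a minimal, bigram-only illustration (no backoff below the continuation term); variable names are mine:

```python
from collections import Counter, defaultdict

sentences = [
    ["I", "like", "red", "apples", "."],
    ["I", "like", "green", "grapes", "."],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))

D = 0.75
# N1+(w .): unique words following w;  N1+(. w): unique words preceding w
followers = defaultdict(set)
predecessors = defaultdict(set)
for prev, word in bigram_counts:
    followers[prev].add(word)
    predecessors[word].add(prev)
total_bigram_types = len(bigram_counts)  # 7 in this corpus

def p_kn(word, prev):
    """Interpolated Kneser-Ney bigram probability with absolute discount D."""
    discounted = max(bigram_counts[(prev, word)] - D, 0) / unigram_counts[prev]
    lam = D * len(followers[prev]) / unigram_counts[prev]
    p_cont = len(predecessors[word]) / total_bigram_types
    return discounted + lam * p_cont

print(round(p_kn("red", "like"), 6))  # 0.125 + 0.75 * (1/7) ≈ 0.232143
print(p_kn("blue", "like"))           # 0.0 -- "blue" never occurs, so P_cont is zero
```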

Important Note for Real-World Scenarios Using Kneser-Ney

In practical language modeling, a completely unseen word like "blue" will indeed often get a zero probability from standard smoothing techniques like Kneser-Ney if it has no observed counts at any level.

To handle such truly out-of-vocabulary (OOV) words, language models often employ techniques like:

  • Mapping all OOV words to a special <UNK> (unknown) token and learning probabilities for <UNK>.

  • Using character-level models or subword tokenization.

  • Backing off to a uniform distribution over the entire vocabulary as a final resort if all other probabilities are zero.

I will demonstrate how to apply UNK (Unknown) tokens to the same blue | like example using KN.

Kneser-Ney with an UNK (Unknown) Token

First, I will introduce an UNK (Unknown) token into the corpus to avoid zero probabilities.

The revised corpus would look like:

  1. "I like red apples."

  2. "I like green grapes."

  3. "He likes [UNK] cars." (Here, "[UNK]" represents an unknown word in this specific context)

  4. "They hate [UNK]."

Calculating P(UNK | like) using Kneser-Ney

Then calculate the KN probability of an unknown word (UNK) appearing after "like" using the common discount factor D = 0.75:

P_{KN}(w_i \mid w_{i-1}) = P_{KN} (\text{[UNK]} \mid \text{like}) \approx 0.1071 \neq 0

Let's break down the calculation:

  1. Compute Bigram Count Term: max(C(like [UNK]) − D, 0)​ / C(like)

    • C(like [UNK]): Looking at our revised corpus, the bigram "like [UNK]" does not appear directly (we have "likes [UNK]" and "hate [UNK]", but not "like [UNK]"). So, C(like [UNK])=0.

    • This means the first term is zero: max(0−0.75, 0)​ / C(like) = 0 / 2 = 0

  2. Back-off Triggered: Because the direct bigram count is zero, Kneser-Ney backs off to the second term of the formula, which involves the continuation probability.

  3. Compute Normalization Factor (λ(like))

    • C(like): From "I like red apples." and "I like green grapes.", "like" appears 2 times. So, C(like)=2.

    • Unique words following "like": "red" and "green" follow "like". So, there are 2 unique words.

    • λ(like)=0.75 ​× 2 / 2 =0.75.

  4. Compute Continuation Probability of [UNK], P_cont([UNK]):

    • Numerator (Unique words preceding [UNK]):

      • "likes" precedes "[UNK]" ("likes [UNK]")

      • "hate" precedes "[UNK]" ("hate [UNK]"). So, there are 2 unique words that precede "[UNK]".

    • Denominator (Total unique bigrams in the corpus): Let's list them:

      • "I like" (1)

      • "like red" (1)

      • "red apples" (1)

      • "apples ." (1)

      • "like green" (1)

      • "green grapes" (1)

      • "grapes ." (1)

      • "He likes" (1)

      • "likes [UNK]" (1)

      • "[UNK] cars" (1)

      • "cars ." (1)

      • "They hate" (1)

      • "hate [UNK]" (1)

      • "[UNK] ." (1)

    Total unique bigrams = 14.

    • P_cont​([UNK])=2 / 14 ​≈0.1428.

  5. Final KN Probability P_KN​([UNK]∣like)
    Now, we put all the pieces together: P_KN​([UNK] | like) = 0 + 0.75 × 0.1428 ≈ 0.1071
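The step-by-step calculation above can be verified by re-running the same KN sketch on the revised corpus (treating "[UNK]" as a single token is an assumption of this illustration):

```python
from collections import Counter, defaultdict

# Revised corpus with [UNK] standing in for unknown words.
sentences = [
    ["I", "like", "red", "apples", "."],
    ["I", "like", "green", "grapes", "."],
    ["He", "likes", "[UNK]", "cars", "."],
    ["They", "hate", "[UNK]", "."],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))

D = 0.75
followers = defaultdict(set)
predecessors = defaultdict(set)
for prev, word in bigram_counts:
    followers[prev].add(word)
    predecessors[word].add(prev)
total_bigram_types = len(bigram_counts)  # 14 unique bigrams

def p_kn(word, prev):
    """Interpolated Kneser-Ney bigram probability with absolute discount D."""
    discounted = max(bigram_counts[(prev, word)] - D, 0) / unigram_counts[prev]
    lam = D * len(followers[prev]) / unigram_counts[prev]
    p_cont = len(predecessors[word]) / total_bigram_types
    return discounted + lam * p_cont

print(round(p_kn("[UNK]", "like"), 4))  # 0 + 0.75 * (2/14) ≈ 0.1071
```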

How this Example Shows Backoff with [UNK]

This example demonstrates the power of using an [UNK] token:

  • Handling Unseen Bigrams: Even though the specific bigram "like [UNK]" was unseen in our revised corpus (meaning its direct count was 0), Kneser-Ney didn't immediately assign it a zero probability.

  • Leveraging Context: Instead, it backed off to consider the general likelihood of an "[UNK]" token appearing in any context, as captured by P_cont([UNK]). Because "[UNK]" appeared after "likes" and "hate" in other parts of the corpus, its P_cont value was non-zero.

  • Non-Zero Probability: This allowed Kneser-Ney to assign a plausible, non-zero probability to the sequence "like [UNK]", even though it was not directly observed. This is crucial for real-world language models to handle new or rare words gracefully.

Complementary Strategies: Backoff and Interpolation

These are not smoothing methods in themselves but strategies often used with smoothing techniques.

  • Backoff: As we saw in the KN case, if a higher-order n-gram (e.g., trigram) is not found in the training corpus, the model backs off to a lower-order n-gram (e.g., bigram), and if that's not found, to a unigram while applying smoothing at each level.

    Figure B. Applying backoff to trigram (Image Source)


  • Interpolation: Combines probabilities from different n-gram orders (e.g., unigram, bigram, trigram) using weighted averages. The weights are typically learned from held-out data.
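A minimal sketch of linear interpolation over bigram and unigram estimates. The weights used here are arbitrary placeholders (in practice they would be tuned on held-out data):

```python
from collections import Counter

sentences = [
    ["I", "like", "red", "apples", "."],
    ["I", "like", "green", "grapes", "."],
]

unigram_counts = Counter()
bigram_counts = Counter()
total_tokens = 0
for sent in sentences:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))
    total_tokens += len(sent)

def p_interpolated(word, prev, lambda2=0.7, lambda1=0.3):
    """Weighted average of the bigram and unigram estimates (lambda2 + lambda1 = 1)."""
    p_bigram = (bigram_counts[(prev, word)] / unigram_counts[prev]
                if unigram_counts[prev] else 0.0)
    p_unigram = unigram_counts[word] / total_tokens
    return lambda2 * p_bigram + lambda1 * p_unigram

print(p_interpolated("red", "like"))   # 0.7*0.5 + 0.3*(1/10) = 0.38
print(p_interpolated("blue", "like"))  # 0.0 -- interpolation alone does not fix OOV words
```

Note that an out-of-vocabulary word such as "blue" still gets zero probability here; interpolation must be combined with smoothing or an [UNK] token, as discussed above.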

Wrapping Up

Smoothing is an indispensable technique in language modeling and other statistical applications where sparse data can lead to zero-frequency problems.

By intelligently redistributing probability mass from observed events to unobserved ones, smoothing ensures that models are more robust, generalize better to new data, and provide more realistic probability estimates.

While simple methods like Add-one smoothing offer a basic solution, more advanced techniques like Kneser-Ney smoothing provide superior performance by considering the nuances of word occurrences and contexts.

The choice of smoothing method often depends on the specific application, the size and nature of the training data, and the desired balance between simplicity and accuracy.


Kuriko IWAI, "Beyond Zero: A Guide to N-Gram Smoothing and Language Model Robustness" in Kernel Labs

https://kuriko-iwai.com/smoothing-techniques-in-nlp


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.