Maximum A Posteriori (MAP) Estimation: Balancing Data and Expert Knowledge
Handling data-scarce scenarios with Bayesian Inference and MAP Estimation
By Kuriko IWAI

Table of Contents
Introduction
What is Maximum A Posteriori Estimation
How Maximum A Posteriori Estimation Works in Binary Classification

Introduction
Maximum A Posteriori (MAP) estimation is a statistical method that combines observed data with expert knowledge to determine the most probable model parameters.
This article will explore MAP’s fundamentals, demonstrate its application in churn prediction, and discuss its practical implications across various scenarios, including machine learning.
What is Maximum A Posteriori Estimation
Maximum A Posteriori (MAP) estimation is a modeling approach that finds the most probable model parameters by combining observed data with prior knowledge of their probability distribution.
While Maximum Likelihood Estimation (MLE) focuses solely on finding the parameter (θ_MLE) that maximizes the likelihood function:

θ_MLE = argmax_θ P(D∣θ)

MAP estimation leverages Bayes’ Theorem and finds the parameter (θ_MAP) that maximizes the posterior probability P(θ∣D):

θ_MAP = argmax_θ P(θ∣D) = argmax_θ P(D∣θ) · P(θ) / P(D) = argmax_θ P(D∣θ) · P(θ)
Its key components include:
Posterior Probability (P(θ∣D))
Represents the updated probability distribution of the churn rate after observing the data (D).
This is the main objective of Bayesian inference.
Likelihood Function (P(D∣θ))
Represents how probable our observed churn data would be if the true churn rate were a specific value.
Prior Distribution (P(θ))
Represents our initial belief about the churn rate before seeing any data.
In our case, this includes the owner’s expert opinion.
Marginal Likelihood (P(D)), also known as Evidence
A normalizing constant to ensure that the posterior probability distribution integrates to one, making it a valid probability distribution.
We can skip its computation because it is constant with respect to the parameter θ.
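To make these components concrete, here is a minimal sketch (with a hypothetical three-point grid of candidate churn rates and illustrative counts) showing that the evidence P(D) only rescales the posterior and never changes which parameter is most probable:

```python
import numpy as np
from scipy.stats import binom

# Hypothetical grid of candidate churn rates (theta) and prior beliefs over them
theta = np.array([0.05, 0.10, 0.20])
prior = np.array([0.5, 0.3, 0.2])          # assumed for illustration; sums to 1

# Illustrative data D: 2 churners observed out of 20 customers
likelihood = binom.pmf(2, 20, theta)        # P(D | theta)

unnormalized = likelihood * prior           # P(D | theta) * P(theta)
evidence = unnormalized.sum()               # P(D), constant w.r.t. theta
posterior = unnormalized / evidence         # P(theta | D)

# Dividing by the evidence never changes which theta is most probable
assert np.argmax(unnormalized) == np.argmax(posterior)

for t, post in zip(theta, posterior):
    print(f"theta = {t:.2f}: posterior = {post:.3f}")
```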
How Maximum A Posteriori Estimation Works in Binary Classification
In binary classification tasks, the model's final prediction is a class label, either A or B.
This naturally lends itself to modeling with:
A Bernoulli distribution (for a single instance) or
A Binomial distribution (for the number of successful predictions in a fixed number of independent trials).
In this article, I assume the underlying data (each data point's true label or each single prediction's outcome) follows a Bernoulli distribution.
Our goal is to estimate the underlying parameter p, interpreted as the probability of belonging to class A, using MAP estimation:

p_MAP = argmax_p P(p∣D)
where:
D: A set of true labels and
P(p∣D): The posterior distribution of the parameter p.
Now, take a look at each step.
◼ Step 1: Define the Likelihood Function (P(D∣p))
For a binary classification task, the likelihood function is given by the Binomial probability mass function:

P(D∣p) = (N k) · p^k · (1 − p)^(N−k)
where:
N: total number of customers observed,
k: number of customers who churned,
p: hypothetical churn rate, and
(N k): binomial coefficient, representing the number of ways to choose k churners out of N customers.
▫ Example
For instance, if we observe 2 churners among 20 total customers (N = 20, k = 2), P(D∣p) is given by:

P(D∣p) = (20 2) · p^2 · (1 − p)^18 = 190 · p^2 · (1 − p)^18
The likelihood of observing 2 churners out of 20 depends on our hypothesized churn rate (p);
p = 0.05 (5%) → The likelihood is approximately 0.1887 (or 18.87%)
p = 0.10 (10%) → The likelihood is approximately 0.2852 (or 28.52%)
p = 0.20 (20%) → The likelihood is approximately 0.1369 (or 13.69%)
Since p_MLE = 2 / 20 = 0.10, the likelihood at p = 0.10 is the highest.
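These likelihood values can be reproduced with scipy's Binomial PMF (a quick check of the numbers above, not part of the derivation itself):

```python
from scipy.stats import binom

N, k = 20, 2                         # 20 customers observed, 2 churned
for p in (0.05, 0.10, 0.20):         # hypothetical churn rates
    print(f"p = {p:.2f} -> P(D|p) = {binom.pmf(k, N, p):.4f}")
# p = 0.05 -> P(D|p) = 0.1887
# p = 0.10 -> P(D|p) = 0.2852
# p = 0.20 -> P(D|p) = 0.1369
```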
◼ Step 2: Define the Prior Distribution (P(p))
Next, I’ll define our prior probability distribution for p.
A Beta distribution is a common choice here because:
It's the conjugate prior for the Bernoulli and Binomial likelihoods, and
Its shape parameters (α, β) can be interpreted as "pseudo-counts" of churners and non-churners:
α = the number of "successful" outcomes (e.g., churners)
β = the number of "failed" outcomes (e.g., non-churners)
Mathematically, its probability density function (PDF) is defined with those shape parameters:

P(p∣α, β) = p^(α−1) · (1 − p)^(β−1) / Beta(α, β)
Where:
P(p∣α,β): The probability density of p conditioned on the parameters α and β
p^{α−1} (1−p)^{β−1}: Kernel of the Beta distribution,
p: the parameter being estimated (the churn rate, in this case), and
Beta(α, β): Beta function serving as the normalizing constant for the distribution to ensure the total area under the PDF curve (from p=0 to p=1) integrates to 1.
▫ Roles of shape parameters in Beta distribution
The exponents (α−1) and (β−1) control the shape of the distribution.
If α>1, the density increases as p approaches 1.
If β>1, the density increases as p approaches 0.
If α=1 and β=1, it becomes a uniform distribution (Beta(1,1)).
If α<1 or β<1, the density can be U-shaped or J-shaped, with peaks at the boundaries.

Figure A. PDF of various Beta distributions (source)
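To explore these shape effects yourself, the short sketch below (with illustrative (α, β) pairs) evaluates a few Beta priors with scipy and reports their mean and mode:

```python
from scipy.stats import beta

# Illustrative (alpha, beta) pairs: uniform, low-churn-rate, high-churn-rate, U-shaped
priors = [(1, 1), (2, 38), (5, 2), (0.5, 0.5)]

for a, b in priors:
    mean = a / (a + b)                                          # prior mean
    mode = (a - 1) / (a + b - 2) if a > 1 and b > 1 else None   # interior mode only if a, b > 1
    mode_str = f"{mode:.3f}" if mode is not None else "at boundary / undefined"
    print(f"Beta({a}, {b}): mean = {mean:.3f}, mode = {mode_str}, "
          f"density at p = 0.05: {beta.pdf(0.05, a, b):.3f}")
```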
▫ Translating the owner’s knowledge
In the data-scarce scenario, I’ll translate the owner's knowledge into estimates of the shape parameters α and β.
In the later sections, I’ll explore how to set our initial hypothesis in various scenarios.
◼ Step 3: Define the Posterior Distribution (P(p|D))
After defining the shape parameters, I’ll compute the posterior distribution.
The posterior P(p∣D) is proportional to the likelihood times the prior:

P(p∣D) ∝ P(D∣p) · P(p)

Substituting the Binomial likelihood and the Beta prior, the posterior is again a Beta distribution with new shape parameters α′ and β′:

P(p∣D) ∝ p^(k+α−1) · (1 − p)^((N−k)+β−1), i.e., Beta(α′, β′) with α′ = α + k and β′ = β + (N − k)
◼ Step 4: Compute the MAP estimate (p_MAP)
The MAP estimate is the mode of the Beta posterior’s PDF curve.
Mathematically, p_MAP can be found by taking the derivative of the log-posterior PDF with respect to p:

d/dp log P(p∣D) = (α′ − 1) / p − (β′ − 1) / (1 − p)

Setting the derivative to zero and solving for p gives the MAP estimate p_MAP:

p_MAP = (α′ − 1) / (α′ + β′ − 2)

Substituting α′ = α + k and β′ = β + (N − k), p_MAP can be written in terms of the shape parameters (α, β) and the likelihood parameters (N, k):

p_MAP = (k + α − 1) / (N + α + β − 2)
This mathematically concludes MAP estimation for a binary task.
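Putting Steps 3 and 4 together in code, here is a minimal sketch (the (α, β, N, k) values are placeholders) that applies the conjugate update and confirms the closed-form mode numerically:

```python
import numpy as np
from scipy.stats import beta

def map_estimate(alpha, beta_, N, k):
    """Conjugate Beta-Binomial update followed by the posterior mode."""
    alpha_post = alpha + k              # alpha' = alpha + k
    beta_post = beta_ + (N - k)         # beta'  = beta + (N - k)
    p_map = (alpha_post - 1) / (alpha_post + beta_post - 2)
    return alpha_post, beta_post, p_map

# Placeholder prior and data, for illustration only
a_post, b_post, p_map = map_estimate(alpha=2, beta_=38, N=20, k=2)

# Numerical sanity check: the posterior PDF should peak at p_map
grid = np.linspace(1e-4, 1 - 1e-4, 100_000)
p_numeric = grid[np.argmax(beta.pdf(grid, a_post, b_post))]
print(round(p_map, 4), round(p_numeric, 4))   # both ~0.0517 for these placeholders
```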
In the next section, I'll explore its application to churn rate prediction through walkthrough examples.
The Walkthrough Examples: Churn Prediction
Using concrete scenarios, I’ll examine MAP estimates and compare them with the MLE.
◼ The Project Setup
We only have a limited dataset (N = 50).
The business owner has their own belief about the churn rate.
We’ll define a hypothesis for the Beta prior that aligns with the owner’s belief.
Then, we’ll apply MAP estimation to estimate the churn rate.
◼ Scenario 1: Confident Prior
Owner’s statement:
"Based on our experience in similar markets, we generally expect about 5% churn. We’re in the market for decades and confident about our estimate."
Our interpretation for the Beta prior:
“generally expect about 5% churn” indicates the mean of the churn rate (given by α / (α + β)) is 0.05.
“confident about our estimate“ indicates a strong confidence level (given by α + β).
→ Beta prior hypothesis: Beta(2, 38) - mean 0.05, conf level = 40
Actual data observed:
We observed 5 churners out of 50 customers in the dataset. (p_MLE = 5/50 = 0.10 (or 10%))
Calculating the MAP estimate:
Likelihood parameters: N = 50, k = 5
Prior parameters: α = 2, β = 38
Posterior parameters:
α′ = k + α = 7
β′= ( N−k ) + β = 83
Posterior distribution: Beta(7, 83)
MAP Estimate: p_MAP= ( α′ − 1 ) / ( α′ + β′ − 2 ) ≈ 0.0682 or 6.82%.
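As a quick sanity check of this arithmetic:

```python
alpha, beta_, N, k = 2, 38, 50, 5      # confident prior and the observed data

alpha_post = alpha + k                  # 7
beta_post = beta_ + (N - k)             # 83
p_mle = k / N                           # 0.10
p_map = (alpha_post - 1) / (alpha_post + beta_post - 2)

print(f"MLE = {p_mle:.4f}, MAP = {p_map:.4f}")   # MLE = 0.1000, MAP = 0.0682
```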
Summary:
Owner’s Prior (green): 5%
MLE (red): 10.0%
MAP (blue): 6.82%
In this scenario, the owner's strong prior belief in a 5% churn rate carries significant weight.
When we incorporate the observed data (5 churners out of 50 customers), resulting in an MLE of 10%, the MAP estimate is pulled away from the owner’s prior of 5% towards the observed data.
However, due to the prior's strength, it doesn't shift as drastically as it would with a weaker prior.

Figure B. Scenario 1. Confident Prior (Created by Kuriko IWAI)
◼ Scenario 2: Weak Prior
Owner’s statement:
“I'd guess the churn is somewhere from zero to 10% max, but I'm definitely not sure. I haven't had a chance to dig into the numbers or research, and I'm new to the business."
Our interpretation for the Beta prior:
“somewhere from zero to 10%” indicates the mean of the churn rate (given by α / (α + β)) is likely 0.05.
“I'm definitely not sure“ indicates a very weak confidence level (given by α + β), so the spread (variance) must be much higher than in Scenario 1.
→ Beta prior hypothesis: Beta(0.5, 9.5) - mean 0.05, conf level = 10 (much lower than Scenario 1)
Actual data observed:
We observed 5 churners out of 50 customers in the dataset. (p_MLE = 5/50 = 0.10 (or 10%))
Calculating the MAP estimate:
Likelihood parameters: N=50, k = 5
Prior parameters: α = 0.5, β = 9.5
Posterior parameters:
α′ = k + α = 5.5
β′= ( N−k ) + β = 54.5
Posterior distribution: Beta(5.5, 54.5)
MAP Estimate: p_MAP= ( α′ − 1 ) / ( α′ + β′ − 2 ) ≈ 0.0776 or 7.76%.
Summary:
Owner’s Prior (green): 5%
MLE (red): 10.0%
MAP (blue): 7.76% > 6.82% (Scenario 1)
When an owner provides a weak prior estimate, the observed data plays a much larger role in shaping our updated belief.
In this scenario, the owner's initial guess gives us a prior mean of 5% (Beta(0.5, 9.5)). However, the observed data — 5 churners out of 50 customers, leading to an MLE of 10% — is quite influential due to that weak prior.
As a result, the MAP estimate is calculated to be 7.76%. This value is pulled significantly away from the owner's initial 5% guess and towards the observed data, demonstrating how a less confident prior allows the data to dominate in forming a more accurate estimate.

Figure C. Scenario 2. Weak Prior (Created by Kuriko IWAI)
◼ Scenario 3: More Data
In this "More Data" scenario, we maintain the same moderately confident prior as Scenario 1 (Beta(2, 38), with a mean of 5%), but critically, we observe ten times more data (50 churners out of 500 customers).
Actual data observed:
We observed 50 churners out of 500 customers in the dataset. (p_MLE = 50/500 = 0.10 (or 10%))
Calculating the MAP estimate:
Likelihood parameters: N=500, k = 50
Prior parameters: α = 2, β = 38
Posterior parameters:
α′ = k + α = 52
β′= ( N−k ) + β = 488
Posterior distribution: Beta(52, 488)
MAP Estimate: p_MAP= ( α′ − 1 ) / ( α′ + β′ − 2 ) = 51 / 538 ≈ 0.0948 or 9.48%
Summary:
Owner’s Prior (green): 5%
MLE (red): 10%
MAP (blue): 9.48% » Scenario 1 or 2
In this scenario, the large dataset pulls the MAP estimate (9.48%) close to the MLE (10%), despite the owner's confident prior centered at 5%.
This demonstrates a key principle of Bayesian inference: as the amount of observed data increases, the influence of the prior distribution diminishes, and the posterior distribution (and thus the MAP estimate) converges more closely to the likelihood function and the MLE.

Figure D. Scenario 3. Confident Prior with Large Dataset (Created by Kuriko IWAI)
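To see this convergence in one place, the sketch below recomputes all three scenarios with the same closed-form update, plus an even larger hypothetical dataset (5,000 customers) for comparison:

```python
def beta_map(alpha, beta_, N, k):
    """MAP estimate of the churn rate under a Beta(alpha, beta_) prior."""
    return (alpha + k - 1) / (alpha + beta_ + N - 2)

scenarios = [
    ("Scenario 1: confident prior",            2,   38,  50,   5),
    ("Scenario 2: weak prior",                 0.5, 9.5, 50,   5),
    ("Scenario 3: confident prior, 10x data",  2,   38,  500,  50),
    ("Hypothetical 100x data",                 2,   38,  5000, 500),
]

for name, a, b, N, k in scenarios:
    print(f"{name}: MLE = {k / N:.3f}, MAP = {beta_map(a, b, N, k):.4f}")
# With the confident prior, the MAP moves from 0.0682 (N = 50) to 0.0948 (N = 500)
# to 0.0994 (N = 5,000), approaching the MLE of 0.100 as the data grows.
```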
MAP Estimation in Machine Learning Context
In a machine learning context, MAP estimation appears in cases such as:
Regularization: Equivalent to L1 (Laplace prior) and L2 (Gaussian prior) regularization, preventing overfitting by penalizing large or non-sparse weights (see the sketch after this list).
ML: Ridge/Lasso Regression, Regularized Logistic Regression, SVMs.
Deep Learning: Weight Decay (L2 regularization) in neural networks.
Bayesian Probabilistic Models: Core method for parameter estimation when full Bayesian inference is too complex.
ML: Bayesian Linear/Logistic Regression, Naive Bayes, GMMs, HMMs, Graphical Models.
Approximate Bayesian Inference: Provides a computationally cheaper point estimate of the posterior in complex models.
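To illustrate the regularization point above, here is a minimal sketch (synthetic data, no intercept, noise scale assumed known) showing that the MAP solution for linear regression with a zero-mean Gaussian prior on the weights is exactly ridge regression, while the MLE is ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic dataset (deliberately data-scarce)
n, d = 30, 5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=1.0, size=n)

lam = 10.0  # lambda = noise variance / prior variance (assumed known here)

# MLE (ordinary least squares): w = (X^T X)^{-1} X^T y
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with a zero-mean Gaussian prior on w (== ridge / L2 regularization):
# w = (X^T X + lambda * I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("MLE weights:        ", w_mle.round(2))
print("MAP (ridge) weights:", w_map.round(2))   # shrunk toward zero by the prior
```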
Wrapping Up
Maximum A Posteriori (MAP) estimation is a powerful framework for statistical inference, valued for its ability to integrate prior beliefs with observed data, especially in data-scarce environments.
In our churn prediction simulation, we observed how the strength (i.e., confidence) of the prior and the volume of data influence the MAP estimate, which converges towards the MLE as data increases.
In the machine learning context, applying MAP in data-scarce scenarios acts as a form of regularization, preventing parameter estimates from being overly influenced by a small, potentially unrepresentative dataset.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Advanced Cross-Validation for Sequential Data: A Guide to Avoiding Data Leakage
Data Augmentation Techniques for Tabular Data: From Noise Injection to SMOTE
A Guide to Synthetic Data Generation: Statistical and Probabilistic Approaches
Beyond Simple Imputation: Understanding MICE for Robust Data Science
Maximizing Predictive Power: Best Practices in Feature Engineering for Tabular Data
Related Books for Further Understanding
These books cover a wide range of theory and practice, from the fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
Share What You Learned
Kuriko IWAI, "Maximum A Posteriori (MAP) Estimation: Balancing Data and Expert Knowledge" in Kernel Labs
https://kuriko-iwai.com/bayesian-inference-and-map-estimation
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.




