Beyond K-Means: A Deep Dive into Gaussian Mixture Models and the EM Algorithm

A deep dive into the core concepts of unsupervised clustering with practical application on customer data segmentation

Machine Learning · Data Science · Python

By Kuriko IWAI


Table of Contents

Introduction
What is a Gaussian Mixture Model
How Gaussian Mixture Models Work
Initializing Gaussian Mixture Models
The Objective of GMMs
Leveraging Expectation-Maximization Algorithm
E-Step (Expectation Step)
M-Step (Maximization Step)
3. Declare the Convergence
The Walkthrough Case - How the EM Algorithm Works
1. Initializing the GMM
2. E-Step
3. M-Step
4. Declare the Convergence
Comparing with K-means
Simulation
Data Preprocessing
Model Tuning
Results
Evaluation
Conclusion

Introduction

The Expectation-Maximization (EM) algorithm, particularly its application to Gaussian Mixture Models (GMM), is a foundational unsupervised learning technique.

In this article, I’ll explore core concepts of EM and GMMs, demonstrating their application to tabular data segmentation in comparison with K-Means.

What is a Gaussian Mixture Model

A Gaussian Mixture Model (GMM) is a probabilistic model that represents the data as a combination of multiple Gaussian distributions.

It’s a clustering tool for unsupervised learning that offers more flexibility than methods like K-means by computing the probability of each data point belonging to each cluster.

Its key features include:

  • Probabilistic Model: Generates the likelihood of different outcomes using probability distribution, rather than predicting a single, definite result.

  • Unsupervised Learning: No labeled data required for training.

  • Clustering and Density Estimation: Clusters data points into different groups and estimates the probability density function (PDF) of the data.

  • Flexibility in Modeling: Can approximate any distribution represented as a weighted sum of normal distributions.

GMMs are commonly used in various fields like financial investments, natural language analysis, predictive maintenance, and medical imaging (MRI, CT scans).

How Gaussian Mixture Models Work

The core concept of GMMs is to assume that the data points are generated from a mixture of multiple Gaussian distributions, each of which has its own model parameters:

  • mean (μ_k​),

  • variance (or covariance) (Σ_k​), and

  • mixing coefficient (ϕ_k​).

The first row of the below figure illustrates various GMMs with two distinct Gaussian distributions. Each Gaussian has its own unique model parameters, shaping different plots:


Figure A. Gaussian Mixture examples (top) and other data distribution types (bottom) (Created by Kuriko IWAI)

Based on this assumption, GMMs are first initialized with random Gaussian distributions, then iteratively refine their parameters to find the best-fitting Gaussians for the data.
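To make this generative assumption concrete, here is a minimal sketch of how data would be sampled from a two-component univariate GMM: pick a component according to the mixing coefficients, then draw from that component's Gaussian. The parameter values below are illustrative choices, not taken from any dataset in this article.

import numpy as np

rng = np.random.default_rng(42)

phi = np.array([0.4, 0.6])        # mixing coefficients (must sum to 1)
mu = np.array([160.0, 185.0])     # component means
sigma = np.array([10.0, 8.0])     # component standard deviations

m = 1000
z = rng.choice(len(phi), size=m, p=phi)    # latent component assignment for each point
x = rng.normal(loc=mu[z], scale=sigma[z])  # each observation drawn from its component's Gaussian

print(x[:5], z[:5])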

Let us explore the process step-by-step.

Initializing Gaussian Mixture Models

First, the GMM is initialized with random model parameters that consist of:

1. Mixture weights: ϕ_k

  • Represents the probability that a data point belongs to the k-th Gaussian component.

  • Must follow the constraints:

\sum_{k=1}^{K} \phi_k = 1, \quad \phi_k \geq 0

2. Mean vectors: μ_k

  • Represents the average of a k-th Gaussian component.

3. Covariance matrices: Σ_k

  • Represents the variance and correlation of each feature.

  • Determines the shape and orientation of the Gaussian component.
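As a minimal sketch of this initialization step, the helper below (a hypothetical init_gmm_params function, not part of any library) assigns equal mixture weights, picks K random data points as initial means, and starts every covariance matrix from the overall data covariance:

import numpy as np

rng = np.random.default_rng(0)

def init_gmm_params(X, K):
    # Randomly initialize the GMM parameters phi, mu, Sigma for data X of shape (m, d).
    m, d = X.shape
    phi = np.full(K, 1.0 / K)                          # equal weights: sum to 1, all >= 0
    mu = X[rng.choice(m, size=K, replace=False)]       # K random data points as initial means
    base_cov = np.atleast_2d(np.cov(X, rowvar=False))  # data covariance as a rough starting point
    Sigma = np.stack([base_cov + 1e-6 * np.eye(d) for _ in range(K)])
    return phi, mu, Sigma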

The Objective of GMMs

Now, the objective of the GMM is to find the model parameters that maximize the log likelihood of the observed data:

\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \left( \sum_{j=1}^{K} \phi_j \cdot p(x^{(i)} \mid \mu_j, \Sigma_j) \right)

where:

  • ℓ(ϕ, μ, Σ): the log likelihood of observing the samples X given the parameters ϕ, μ, Σ.

  • X = {x(1), x(2), … , x(m)} : The observed dataset of m data points.

  • ϕ = {ϕ1​,…,ϕK​}: The mixture weights with K entries.

  • μ = {μ1​,…,μK​}: The mean vectors with K entries.

  • Σ = {Σ1​,…,ΣK​}: The covariance matrices with K entries.

  • p(x(i)∣μj​,Σj​): The probability density function (PDF) of data point x(i) under the j-th Gaussian distribution, defined by its mean μj​ and covariance Σj​.
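Given a set of parameters, the log-likelihood itself is straightforward to evaluate. Here is a short sketch, assuming mu is a (K, d) array of means and Sigma a (K, d, d) array of covariances as in the initialization sketch above:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, phi, mu, Sigma):
    # ell(phi, mu, Sigma) = sum_i log( sum_j phi_j * N(x_i; mu_j, Sigma_j) )
    m, K = X.shape[0], len(phi)
    densities = np.zeros(m)
    for j in range(K):
        # mixture density: add each component's weighted PDF evaluated at every data point
        densities += phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
    return np.sum(np.log(densities))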

Now, to find the optimal values for these model parameters, we consider the underlying probabilistic model, which involves an unobserved (latent) variable z(i) for each data point x(i).

The latent variable indicates which Gaussian component in the GMM generated the data point. For example, if z = 1, the data point belongs to Gaussian 1.

Because z(i) is unobserved, the probability density of a single data point x(i) is obtained by summing the joint probability of observing that data point and its latent variable over all possible components:

p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{z^{(i)}=1}^{K} p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma) \cdots \text{ (1)}

Using the product rule of probability, p(A,B) = p(A|B) p(B), the joint probability formula (1) can be rewritten as:

\begin{align} p(x^{(i)}; \phi, \mu, \Sigma) &= \sum_{z^{(i)}=1}^{K} p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma) \cdots \text{ (1)} \\ \\ &= \sum_{z^{(i)}=1}^{K} p(x^{(i)} \mid z^{(i)}; \mu, \Sigma) \cdot p(z^{(i)}; \phi) \cdots \text{ (2)} \end{align}

Substituting the log-likelihood function with the formula (2):

\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \left( \sum_{z^{(i)}=1}^{K} p(x^{(i)} \mid z^{(i)}; \mu, \Sigma) \cdot p(z^{(i)}; \phi) \right) \cdots \text{ (3)}

This formula (3) explicitly shows the marginalization of the log-likelihood function over the latent variable z.

Here, our core challenge in maximizing this log-likelihood is that the latent variables z are unknown.

Imagine we want to segment heights, X = { 150, 175, 190 cm }, into two Gaussian components.

Taking a data point x_2 = 175 for an example, we would ask:

  • Did x_2 come from Gaussian 1?

  • What about Gaussian 2? Did x_2 come from Gaussian 2?

The problem is that we simply don’t know whether the true z is 1 or 2. Hence, we cannot directly compute the log likelihood to find the optimal parameters.

Side note: This unknown z is a defining characteristic of unsupervised learning. In contrast, supervised learning has labeled data where y represents the true value.

Leveraging Expectation-Maximization Algorithm

To tackle these challenges, our strategy is to use the Expectation-Maximization (EM) algorithm to construct the tightest lower bound to the log likelihood function (E-Step) and keep updating the model parameters (M-Step) until the model converges.

The figure below illustrates how the EM algorithm finds convergence on the log-likelihood function by alternating E-Steps and M-Steps:


Figure B. EM Algorithm finding local optimum (Created by Kuriko IWAI)

In the next section, I’ll demonstrate the process step by step using math.

E-Step (Expectation Step)

In the Expectation step (E-step), the GMM first "guesses" the z values by calculating the probability of each data point belonging to each Gaussian component (j) based on the current model parameters:

w^{(i)}_j := p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)

Here, w is called the responsibility, and p is the posterior probability of z = j given a data point x, indicating how likely the data point is to belong to Gaussian component j.

Leveraging Bayes’ Theorem, this posterior probability is denoted:

p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{\sum_{l=1}^{K} p(x^{(i)} \mid z^{(i)} = l; \mu, \Sigma)\, p(z^{(i)} = l; \phi)}

where:

  • p(x|z): the likelihood of observing the data point x(i) in the Gaussian component j and

  • p(z): The prior probability of any data point belonging to the component j.

  • The denominator represents the marginal probability of observing data point x(i) under the entire GMM.

Here, we observe that the responsibility w is a probability rather than a hard assignment.

This allows GMMs to assign a single data point to multiple clusters simultaneously by computing the probability of the data point belonging to each Gaussian component. The responsibility w is therefore also known as a ‘soft’ assignment.
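A minimal E-step sketch, reusing the parameter layout from the earlier snippets, computes exactly this Bayes'-rule ratio for every data point and component:

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, phi, mu, Sigma):
    # Responsibilities w[i, j] = p(z_i = j | x_i; phi, mu, Sigma).
    m, K = X.shape[0], len(phi)
    w = np.zeros((m, K))
    for j in range(K):
        # numerator of Bayes' rule: likelihood under component j times its prior weight
        w[:, j] = phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
    # denominator: marginal probability of each x_i under the whole mixture
    w /= w.sum(axis=1, keepdims=True)
    return w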

M-Step (Maximization Step)

In the Maximization step (M-Step), the EM updates its model parameters ϕ, μ, Σ, using the responsibilities computed in the E-Step.

Each parameter update looks like the following:

1. Update for Mixture Weights (ϕ​)

The new mixture weight for the Gaussian component j is simply the average responsibility of that component across all m data points:

\phi_j := \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}

This is because the responsibilities are continuous values of probabilities — not hard-assigned labels like zero or one.

2. Update for Mean Vectors (μ)

The new mean for the Gaussian component j is the weighted average of all data points.

Each data point x(i) is weighted by its responsibility wj(i)​ for the component j:

\mu_j := \frac{\sum_{i=1}^{m} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_j^{(i)}}

where:

  • The numerator is the responsibility-weighted sum of the data points x(i).

  • The denominator is the sum of the responsibilities (the “effective number” of points in component j), serving as the normalization factor for the weighted average.

3. Update for Covariance Matrices (Σ)

The new covariance matrix for the Gaussian component j is the weighted average of the outer products of the differences between each data point and the new mean μ_j​:

\Sigma_j := \frac{\sum_{i=1}^{m} w_j^{(i)} \overbrace{(x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T}^{\text{deviation of } x^{(i)} \text{ from the mean } \mu_j}}{\sum_{i=1}^{m} w_j^{(i)}}

This is the standard formula for the sample covariance matrix adapted with weights.

Each of the squared deviations from the mean is weighted by wj(i)​, reflecting how much a data point x(i) contributes to the spread of component j.

In essence, the M-step formulas are weighted maximum likelihood estimators.
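The three updates translate almost line for line into code. A minimal sketch, assuming the responsibilities w come from the E-step sketch above:

import numpy as np

def m_step(X, w):
    # Update phi, mu, Sigma from responsibilities w of shape (m, K).
    m, d = X.shape
    K = w.shape[1]
    Nk = w.sum(axis=0)                      # effective number of points per component
    phi = Nk / m                            # new mixture weights: average responsibility
    mu = (w.T @ X) / Nk[:, None]            # responsibility-weighted means
    Sigma = np.zeros((K, d, d))
    for j in range(K):
        diff = X - mu[j]                    # deviations from the new mean
        Sigma[j] = (w[:, j, None] * diff).T @ diff / Nk[j]   # weighted outer products, normalized
    return phi, mu, Sigma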

3. Declare the Convergence

The EM algorithm repeats these E- and M-steps, causing the log-likelihood to increase monotonically until it converges.

One reasonable convergence test is to check whether the change in the log-likelihood (or in the model parameters) between successive iterations is smaller than some tolerance.

Using θ as a simpler notation for a set of the model parameters, the likelihood function (3) is redefined:

\ell(\theta) = \sum_{i=1}^{m} \log \left( \sum_{z^{(i)}=1}^{K} p(x^{(i)}, z^{(i)}; \theta) \right)

According to Jensen’s Inequality, log E[X] ≥ E[ logX ] since the log is a concave function.

So, taking a single data point x(i) as an example, a lower bound on its log-likelihood is defined:

\log p(x^{(i)}; \theta) \ge \sum_{j=1}^{K} w_j^{(i)} \log \frac{p(x^{(i)}, z^{(i)}=j; \theta)}{w_j^{(i)}}

With m total data points, a lower bound for the total log-likelihood is defined by the sum:

\ell(\theta) \geq \sum_{i=1}^{m} \sum_{j=1}^{K} w_j^{(i)} \log \frac{p(x^{(i)}, z^{(i)}=j; \theta)}{w_j^{(i)}}

In the E-step, the EM algorithm calculates responsibilities w that make the lower bound as tight as possible to the true log-likelihood function, making it tangent to the true log-likelihood function.

So, at the t-th epoch for example, the true log-likelihood can be expressed as:

\ell(\theta^{(t)}) = \sum_{i=1}^{m} \sum_{j=1}^{K} w_j^{(i)} \log \frac{p(x^{(i)}, z^{(i)}=j; \theta^{(t)})}{w_j^{(i)}}

where θ = θ_t, representing the set of model parameters at which this lower bound precisely touches the true log-likelihood function.

The updated figure below illustrates how θ_t is found and, consequently, how ℓ(θ_t) is computed:


Figure C. EM Algorithm finding local optimum (Created by Kuriko IWAI)

Then, in the M-step, we choose the next set of parameters θ_(t+1) to maximize this lower bound with respect to each parameter:

\begin{align} \ell(\theta^{(t+1)}) &\geq \sum_{i=1}^{m} \sum_{j=1}^{K} w_j^{(i)} \log \frac{p(x^{(i)}, z^{(i)}=j; \theta^{(t+1)})}{w_j^{(i)}} \cdots \text{ lower bound at the (t+1)-th epoch} \\ \\ &\geq \sum_{i=1}^{m} \sum_{j=1}^{K} w_j^{(i)} \log \frac{p(x^{(i)}, z^{(i)}=j; \theta^{(t)})}{w_j^{(i)}} \cdots \text{ lower bound at the t-th epoch} \\ \\ &= \ell(\theta^{(t)}) \end{align}

This formula illustrates the monotonic increase of the log-likelihood in the EM algorithm.

The first and second lines in the formula represent the lower bounds at the current (t+1) and previous (t) epochs, respectively (orange and red points in the figure).

As the EM keeps choosing the maximum point of the lower bound in each epoch, the value in line 1 (current epoch) must be greater than the one in line 2 (previous epoch).

This guarantees that ℓ(θ_t+1) ≥ ℓ(θ_t), meaning the log-likelihood never decreases and thus converges to a local optimum.

And lastly, the EM declares the convergence when the change in the parameters or log-likelihood falls below a predefined tolerance (e.g., 1e-10) or reaches zero.
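Putting the pieces together, a bare-bones EM loop (reusing the hypothetical init_gmm_params, e_step, m_step, and gmm_log_likelihood helpers sketched above, which are assumptions of this sketch rather than library functions) alternates the two steps and stops when the log-likelihood gain drops below the tolerance:

import numpy as np

def fit_gmm(X, K, max_iter=1000, tol=1e-10):
    # Run EM until the log-likelihood improvement falls below tol.
    phi, mu, Sigma = init_gmm_params(X, K)
    prev_ll = -np.inf
    for t in range(max_iter):
        w = e_step(X, phi, mu, Sigma)             # E-step: tighten the lower bound
        phi, mu, Sigma = m_step(X, w)             # M-step: maximize the lower bound
        ll = gmm_log_likelihood(X, phi, mu, Sigma)
        if ll - prev_ll < tol:                    # monotonic increase => simple convergence test
            break
        prev_ll = ll
    return phi, mu, Sigma, ll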

Now, let us examine how the EM works using a walkthrough case on a height segmentation.

The Walkthrough Case - How the EM Algorithm Works

Now, imagine we collected height measurement data in cm: X = {150, 175, 190}.

We want to model this data using a GMM with two Gaussian distributions (K=2) that best explain these three heights.

In this case, learnable model parameters are:

  • ϕ1​: The proportion of data belonging to Gaussian 1.

  • ϕ2​: The proportion of data belonging to Gaussian 2. (Remember, ϕ1​+ϕ2​=1).

  • μ1​: The mean height of Gaussian 1.

  • μ2​: The mean height of Gaussian 2.

  • Σ1​: The variance of heights for Gaussian 1.

  • Σ2​: The variance of heights for Gaussian 2.

1. Initializing the GMM

We would start with assigning random values to these model parameters. Let’s say (wild guess):

\begin{align} \phi_1 &= 0.5, & \phi_2 &= 0.5 \\ \mu_1 &= 160, & \mu_2 &= 185 \\ \Sigma_1 &= 10^2, & \Sigma_2 &= 10^2 \end{align}

In our case, since K = 2, the parameter sets are ϕ = {ϕ_1, ϕ_2}, μ = {μ_1, μ_2}, and Σ = {Σ_1, Σ_2}, with three data points X = { x_1 = 150, x_2 = 175, x_3 = 190 }:

\begin{align} \phi &= \lbrace \phi_1, \phi_2 \rbrace = \lbrace 0.5, 0.5 \rbrace \\ \\ \mu &= \lbrace \mu_1, \mu_2 \rbrace = \lbrace 160, 185 \rbrace \\ \\ \Sigma &= \lbrace \Sigma_1, \Sigma_2 \rbrace = \lbrace 10^2, 10^2 \rbrace \\ \\ X &= \lbrace x_1, x_2, x_3 \rbrace = \lbrace 150, 175, 190 \rbrace \end{align}

These initial parameters define the GMM shape with two Gaussians (blue and red dotted lines in the figure):


Figure D. The initial GMM (t=0) (Created by Kuriko IWAI)

2. E-Step

The EM starts to optimize the parameters.

Take a look at the computation steps for a data point x_2 = 175 cm in our walkthrough example:

1. Calculate the Likelihood p(x|z)

We’ll use the PDF of a univariate Gaussian distribution:

p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
  • For Gaussian 1:

    • Component: j = 1

    • Mean: μ_1​ = 160

    • Variance: Σ_1​ = 100

    • PDF: p(x_2 = 175 | μ_1 = 160, Σ_1 = 100) = 1/√(2π·100) · exp(−(175−160)²/200) ≈ 0.01295

  • For Gaussian 2:

    • Component: j = 2

    • Mean: μ_2​ = 185

    • Variance: Σ_2 = 100

    • PDF: p(x_2 = 175 | μ_2 = 185, Σ_2 = 100) = 1/√(2π·100) · exp(−(175−185)²/200) ≈ 0.0242

2. Calculate the Weighted Likelihoods

  • For Gaussian 1: p(x_2 | μ_1​, Σ_1​) ⋅ ϕ1​ = 0.01295 ⋅ 0.5 = 0.006475

  • For Gaussian 2: p(x_2 | μ_2, Σ_2​) ⋅ ϕ2 ​= 0.0242 ⋅ 0.5 = 0.0121

3. Calculate Responsibilities (w)

  • Marginal Probability (Denominator): 0.006475 + 0.0121 = 0.018575

  • Gaussian 1’s Responsibility for x_2: w1_2​ = 0.006475 / 0.018575 ​≈ 0.3486

  • Gaussian 2’s Responsibility for x_2: w2_2 ​= 0.0121​ / 0.018575 ≈ 0.6514

For the data point x_2, Gaussian 2 has a higher responsibility (0.6514) than Gaussian 1 (0.3486), indicating x_2 is more likely to come from the Gaussian 2.

Taking the same steps, the EM computes responsibilities of Gaussian 1 and 2 for the other data points:

  • data x_1 = 150 cm:

    • Gaussian 1’s responsibility: w1_1 ≈ 0.9964 > w2_1

    • Gaussian 2’s responsibility: w2_1 ​≈ 0.0036

  • data x_3 = 190 cm:

    • Gaussian 1’s responsibility: w1_3 ​​≈ 0.0124

    • Gaussian 2’s responsibility: w2_3 ​≈ 0.9876 > w1_3

The results indicate that x_1 likely came from Gaussian 1, and x_3 from Gaussian 2.
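These hand calculations are easy to reproduce. The short sketch below evaluates the same univariate Gaussian PDFs with scipy and should print responsibilities close to the values stated above:

import numpy as np
from scipy.stats import norm

X = np.array([150.0, 175.0, 190.0])
phi = np.array([0.5, 0.5])
mu = np.array([160.0, 185.0])
sigma = np.array([10.0, 10.0])   # standard deviations (variance = 100)

# likelihood of each point under each Gaussian: shape (3 points, 2 components)
lik = norm.pdf(X[:, None], loc=mu[None, :], scale=sigma[None, :])
weighted = lik * phi[None, :]                          # weighted likelihoods
w = weighted / weighted.sum(axis=1, keepdims=True)     # responsibilities

print(np.round(w, 4))
# roughly [[0.9964 0.0036]
#          [0.3486 0.6514]
#          [0.0124 0.9876]], matching the walkthrough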

3. M-Step

The EM updates its model parameters based on the computed responsibilities:

1. Mixture Weights ϕ

  • Gaussian 1: ϕ_1​ := ​(w1_1​ + w1_2​ + w1_3) / 3 ≈ 0.4525

  • Gaussian 2: ϕ_2​ := ​(w2_1​ + w2_2​ + w2_3) / 3 ≈ 0.5475

Now, Gaussian 2 has a higher mixture weight than Gaussian 1, reflecting that more of the total responsibility shifted towards Gaussian 2 during the E-step.

2. Mean Vector μ

  • Gaussian 1: μ_1​ := ((0.9964⋅150)+(0.3486⋅175)+(0.0124⋅190)​) / (0.9964+0.3486+0.0124) ​≈ 156.77 cm

  • Gaussian 2: μ_2​ := ((0.0036⋅150)+(0.6514⋅175)+(0.9876⋅190)) / (0.0036+0.6514+0.9876​) ≈ 183.97 cm

Gaussian 1 has a very high responsibility for x_1, which pulls its mean closer to x_1 = 150, shifting μ_1 from 160 to 156.77.

μ_2 also shifted slightly, from 185 to 183.97 cm, as both x_2 = 175 and x_3 = 190 pulled it towards the lower end of its range.

3. Covariance Matrices Σ

  • Gaussian 1: Σ_1 := (0.9964⋅(150−156.77)² + 0.3486⋅(175−156.77)² + 0.0124⋅(190−156.77)²) / (0.9964+0.3486+0.0124) ≈ 129.13 cm² > 100 cm²

  • Gaussian 2: Σ_2 := (0.0036⋅(150−183.97)² + 0.6514⋅(175−183.97)² + 0.9876⋅(190−183.97)²​) / (0.0036+0.6514+0.9876​) ≈ 56.31 cm² < 100 cm²

Gaussian 1 broadened, its variance increasing from 100 to 129.13 cm², as it tries to cover x_1 = 150 and, to a lesser extent, x_2 = 175.

Gaussian 2 tightened, its variance decreasing from 100 to 56.31 cm², because it is now primarily responsible for x_2 = 175 and x_3 = 190, which lie relatively close to its updated mean.
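The M-step numbers can be checked the same way: feeding the responsibilities from the E-step into the weighted-average formulas reproduces the updated parameters (up to rounding).

import numpy as np

x = np.array([150.0, 175.0, 190.0])
# responsibilities from the E-step above: column 0 = Gaussian 1, column 1 = Gaussian 2
w = np.array([[0.9964, 0.0036],
              [0.3486, 0.6514],
              [0.0124, 0.9876]])

Nk = w.sum(axis=0)                                            # effective counts per component
phi_new = Nk / len(x)                                         # ~ [0.4525, 0.5475]
mu_new = (w * x[:, None]).sum(axis=0) / Nk                    # ~ [156.8, 184.0]
var_new = (w * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nk   # ~ [129, 56]
print(phi_new, mu_new, var_new)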

The figure illustrates how the EM algorithm shifted the Gaussians:


Figure E. GMM (t=1): The EM algorithm shifted Gaussian 1 toward x_1 and Gaussian 2 toward x_3. (Created by Kuriko IWAI)

With these updated parameters, the EM algorithm recomputes the responsibilities in the next E-step and repeats the cycle.

4. Declare the Convergence

Now, let us check if the log-likelihood has increased.

The initial true log-likelihood (t = 0) is computed by summing up the log-likelihood of each data point:

For x_1 = 150,

  • Likelihood under Gaussian 1: p(x_1​ | θ¹) ≈ 0.0242

  • Likelihood under Gaussian 2: p(x_1 | θ²) ≈ 0.0000874

  • log(p(x_1; θ)) = log ((0.5 ⋅ 0.0242)+(0.5 ⋅ 0.0000874)) ≈ −4.4109

For x_2 = 175,

  • Likelihood under Gaussian 1: p(x_2 | θ¹ ) ≈ 0.01295

  • Likelihood under Gaussian 2: p(x_2 | θ²) ≈ 0.0242

  • log(p(x_2; θ)) = log((0.5 ⋅ 0.01295)+(0.5 ⋅ 0.0242)) ≈ −3.9859

For x_3 = 190,

  • Likelihood under Gaussian 1: p(x_3 | θ¹) ≈ 0.000443

  • Likelihood under Gaussian 2: p(x_3 | θ²) ≈ 0.0352

  • log(p(x_3; θ)) = log((0.5 ⋅ 0.000443)+(0.5 ⋅ 0.0352)) ≈ −4.0272

=> ℓ(θ_0) = −4.4109 + (−3.9859) + (−4.0272) ≈ −12.424

Taking the same steps, the true log-likelihood at t = 1 is:

ℓ(θ_1) = −4.3879 + (−3.3627) + (−3.2794) ≈ −11.030

Now, we see ℓ(θ_1) ≥ ℓ(θ_0) as guaranteed by the EM algorithm’s monotonic convergence property.
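For completeness, the initial log-likelihood ℓ(θ_0) can be reproduced numerically from the initial parameters; a quick sketch:

import numpy as np
from scipy.stats import norm

X = np.array([150.0, 175.0, 190.0])
phi = np.array([0.5, 0.5])
mu0 = np.array([160.0, 185.0])     # initial means (t = 0)
sigma0 = np.array([10.0, 10.0])

# per-point mixture density, then the sum of logs
mix = (phi[None, :] * norm.pdf(X[:, None], loc=mu0[None, :], scale=sigma0[None, :])).sum(axis=1)
print(np.log(mix))          # ~ [-4.41, -3.99, -4.03]
print(np.log(mix).sum())    # ~ -12.42, matching ell(theta_0) above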

Comparing with K-means

The K-means algorithm can be understood as a simplified case of the EM algorithm applied to Gaussian Mixture Models.

K-means implicitly assumes that the data is generated by a mixture of Gaussians under very restrictive conditions:

  1. Equal Mixture Weights (ϕj​): All components have equal prior probability.

  2. Spherical Covariance (Σj​): Clusters are spherical (covariance proportional to the identity matrix).

  3. Equal Variance (σ2): All spherical clusters have the same variance.

With these strong assumptions, the E-step and M-step derivations simplify:

  • E-Step: Instead of calculating “soft” responsibilities (wj(i)​) as posterior probabilities, K-means performs a “hard” assignment, assigning a responsibility of 1 to the component whose centroid is closest (Euclidean distance) and 0 to all other clusters.

  • M-Step: Updates each component’s mean as the unweighted average of the data points assigned to it (see the sketch below).
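A minimal sketch of one such restricted iteration (a hypothetical kmeans_step helper, ignoring empty-cluster handling for brevity) makes the hard E-step and unweighted M-step explicit:

import numpy as np

def kmeans_step(X, centroids):
    # One K-means iteration, phrased as a "hard" E-step plus an unweighted M-step.
    # hard E-step: responsibility 1 for the closest centroid, 0 for all others
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # M-step: each centroid becomes the plain (unweighted) mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    return labels, new_centroids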

To summarize the differences between K-means and GMMs:

Shape of Clusters

  • K-means assumes spherical clusters, while

  • GMM can model more complex shapes like elliptical clusters.

Soft vs. Hard Clustering

  • K-means assigns each point to a single cluster (hard), while

  • GMM computes the probability of each point belonging to each cluster (soft).

Model Complexity

  • Generally, GMMs are more complex to train than K-means (but can capture more intricate data structures).

When to Use:

  • Use GMM when the data exhibits overlapping clusters or clusters with non-spherical (e.g., elliptical or elongated) shapes, or varying densities/sizes,

  • Use K-means when you prioritize simplicity and speed over clustering flexibility.

Simulation

I’ll demonstrate how the GMM and K-means segment telecom customers from the UC Irvine Machine Learning Repository.

Data Preprocessing

Data transformation is a crucial prerequisite for effective clustering.

After loading the dataset, I applied column-wise transformations, including scaling for numerical data and one-hot encoding for binary categorical features to generate 3,150 data points with 6 features.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

df = pd.read_csv(csv_file_path)
cols = ['subscription_length', 'customer_value', 'age', 'frequency_of_use', 'churn']
X = df[cols]

num_features = ['subscription_length', 'customer_value', 'age', 'frequency_of_use']
num_transformer = Pipeline(steps=[('scaler', StandardScaler())])

cat_features = ['churn']
cat_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ],
    remainder='passthrough'
)

X_processed = preprocessor.fit_transform(X)

Shape (3150, 6)

[[ 0.63672633 0.02681625 -0.11307444 -0.52875851 1. 0. ] [ 0.75338358 -1.12292553 -0.67934551 -0.82203616 1. 0. ] [ 0.52006909 -0.16480738 -0.11307444 2.0612852 1. 0. ] ... [-1.6964185 -0.32159035 -0.11307444 -0.36881527 1. 0. ] [-2.51301919 -0.408692 -0.11307444 1.17358857 1. 0. ] [-2.51301919 -0.77451893 -0.11307444 -0.71632621 0. 1. ]]

Model Tuning

For a comparative analysis, I tuned GMM and K-Means models.

The GMM (from scikit-learn) was configured with 10 components and n_init=1000 to ensure robust initialization and capture diverse Gaussian distributions. K-Means, also from scikit-learn, was set to identify 5 clusters.

from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

# gmm
gmm = GaussianMixture(
    n_components=10,            # number of Gaussian components in the data
    covariance_type='full',     # 'full' for complex underlying data distribution (not spherical)
    tol=1e-10,                  # stop the iteration when the increase in log-likelihood hits the tol
    reg_covar=1e-10,            # added to the covariance to ensure positive definite matrices
    max_iter=1000,              # number of epochs
    n_init=1000,                # number of model runs with diff initializations of the parameters
    init_params='kmeans',       # default
    weights_init=None,          # leave initial weights unset (estimated via init_params)
    means_init=None,
    precisions_init=None,
    random_state=42,
)
gmm_labels = gmm.fit_predict(X_processed)


# k-means (for comparison)
kmeans = KMeans(
    n_clusters=5,
    init="k-means++",
    n_init=500,
    max_iter=500,
    tol=1e-10,
    random_state=42,
    algorithm='lloyd'
)
kmeans_labels = kmeans.fit_predict(X_processed)

Results

K-Means clustering (left) implicitly models clusters as spherical Gaussian distributions, hence the circular boundaries.

On the other hand, GMM (right) allows for elliptical cluster shapes and fits more complex data distributions by computing a unique covariance matrix for each Gaussian component.


Figure F. Comparison of clustering telecom customers (m=3,150) via K-means (left) and GMM (right) (Created by Kuriko IWAI)

Evaluation

I used three key metrics to assess the performance:

  • Silhouette Score: assesses how similar an object is to its own cluster compared to others,

  • Davies-Bouldin Index: measures the average similarity ratio of each cluster with its most similar cluster (lower — more distinct clustering)

  • Calinski-Harabasz Index: evaluates the ratio of between-cluster dispersion to within-cluster dispersion (higher — more distinct clustering)
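All three metrics are available in scikit-learn. The short sketch below, reusing X_processed, gmm_labels, and kmeans_labels from the earlier snippets, computes them for both models:

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

for name, labels in [("GMM", gmm_labels), ("K-Means", kmeans_labels)]:
    print(
        name,
        round(silhouette_score(X_processed, labels), 4),        # higher is better
        round(davies_bouldin_score(X_processed, labels), 4),     # lower is better
        round(calinski_harabasz_score(X_processed, labels), 4),  # higher is better
    )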

The scores highlight distinct characteristics of GMM and K-means.

GMM Performance:

  • Silhouette Score: 0.1267

  • Davies-Bouldin Index: 1.5877

  • Calinski-Harabasz Index: 315.9951

GMM’s scores reflect its soft-assignment approach, which allows data points to belong to multiple clusters.

We saw that this flexibility enables it to capture more complex cluster shapes (e.g., elliptical) and overlapping distributions.

However, the internal validation metrics, which often implicitly favor distinct boundaries and hard cluster assignments, may penalize GMM for this overlap, even when it provides a more statistically robust fit to the data’s true underlying distribution.

K-Means Performance:

  • Silhouette Score: 0.3897

  • Davies-Bouldin Index: 0.8486

  • Calinski-Harabasz Index: 1546.5695

K-Means consistently yielded strong scores across these metrics.

This is expected given its hard assignment nature, where each data point is exclusively assigned to a single cluster, and its objective to minimize the sum of squared distances within clusters.

These metrics inherently reward the creation of distinct, compact, and well-separated clusters, aligning directly with K-Means’ strengths.

Conclusion

The Gaussian Mixture Model (GMM) is a powerful unsupervised learning algorithm that models data as a combination of multiple Gaussian distributions.

In the experiment, we observed that GMMs can model clusters of various shapes (e.g., elliptical) and sizes due to their ability to learn individual means, covariances, and mixing coefficients for each component.

The EM algorithm iteratively refines these parameters by maximizing the log-likelihood of the observed data, leading it to the convergence to a local optimum.

While K-means is a simpler, faster alternative assuming spherical and equally sized clusters, GMMs offer greater flexibility for complex, overlapping, or non-spherically distributed data, making them valuable in applications like image segmentation and customer profiling where intricate data structures need to be captured.


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.