Mastering the Bias-Variance Trade-Off: An Empirical Study of VC Dimension and Generalization Bounds

How model complexity and data size impact generalization performance in machine learning

Machine Learning · Deep Learning · Data Science · Python

By Kuriko IWAI


Table of Contents

Introduction
The Mechanics of Shattering: Defining VC Dimension
Model Complexity and the Bias-Variance Trade-Off
Measuring Model Performance: The Error Breakdown
Underfitting vs. Overfitting
The Ideal Model and the Bias-Variance Trade-Off
How the VC Dimension Defines the Generalization Bound
In-Sample Error
Out-of-Sample Error
VC Dimensions of Machine Learning Models
Linear Classifiers (Perceptrons)
Polynomial Classifiers
Nearest Neighbor (1-NN)
Support Vector Machines (SVM)
Why the VC Bound Matters
Empirical Test: Structural Risk Minimization (SRM) in Python
The Dataset
The Pipeline (Scaling + Complexity)
Calculating the VC Bound & SRM Logic
Results
Predictive Analysis
Wrapping Up

Introduction

The bias-variance trade-off is a core challenge in supervised machine learning.

The Vapnik-Chervonenkis (VC) dimension provides a theoretical rule to estimate how well a model can perform on unseen data, which is crucial for tackling the challenge efficiently.

In this article, I’ll empirically evaluate the relationship between the VC dimension, VC bound, and generalization capabilities of various models, measuring their generalization error on synthetic datasets.

The Mechanics of Shattering: Defining VC Dimension

The Vapnik-Chervonenkis (VC) dimension is a fundamental concept in machine learning that measures the capacity of a statistical classification model by theoretically quantifying how many data points the model can perfectly separate regardless of how they are labelled.

The diagram below illustrates the concept using a linear classifier shattering three data points on a 2D plane:

Diagram showing a linear classifier shattering three points in a 2D plane to illustrate VC dimension capacity.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Figure A. The VC dimension of a linear classifier in a 2D plane

The model shatters a set of three points, meaning it can perfectly classify the data with zero error, regardless of the label assignment (e.g., the red or blue labels shown in Figure A).

Since a line cannot shatter any arrangement of four points, we can conclude that the VC dimension of this classifier is three.
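The shattering argument can be checked by brute force. Below is a minimal sketch, assuming scikit-learn's `SVC` with a very large `C` (a near-hard margin) as the linear-separability test; `can_shatter`, `three_points`, and `four_points` are illustrative names. Three non-collinear points pass every labeling, while four points fail on the XOR labeling:

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def is_linearly_separable(X, y):
    """Check whether some linear boundary classifies (X, y) with zero error."""
    if len(set(y)) < 2:
        return True  # a single class is trivially separable
    clf = SVC(kernel="linear", C=1e9)  # large C approximates a hard margin
    clf.fit(X, y)
    return clf.score(X, y) == 1.0

def can_shatter(X):
    """True if every possible labeling of the points in X is linearly separable."""
    return all(
        is_linearly_separable(X, list(labels))
        for labels in itertools.product([0, 1], repeat=len(X))
    )

three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # non-collinear
four_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print(can_shatter(three_points))  # True: all 8 labelings are separable
print(can_shatter(four_points))   # False: the XOR labeling is not separable
```

The exhaustive loop over labelings is exactly the definition of shattering, which is why it becomes intractable for large point sets.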

Model Complexity and the Bias-Variance Trade-Off

The VC dimension provides a formal measure of a model's capacity to learn complex patterns by defining a generalization bound to predict how well a model will perform on unseen data:

  • Low VC Dimension: Simple models (e.g., linear classifiers) have a low VC dimension. Their ability to capture intricate data structures is limited, but they are less prone to memorizing noise.

  • High VC Dimension: Complex models (e.g., deep neural networks) have a high VC dimension. They can learn highly intricate patterns but carry a higher risk of memorizing noise in the data.

Measuring Model Performance: The Error Breakdown

In supervised learning, it is key to assess if a model has truly learned the underlying logic or merely memorized the training dataset.

The figure below illustrates this assessment process along with three sample results:

Error breakdown chart showing the gaps between Bayes error, training error, and validation error to define bias and variance.


Figure B. The breakdown of errors during training, validation, and test phase (Created by Kuriko IWAI)

First, the raw data is split into three sets:

  • Training to learn patterns,

  • Validation to tune hyperparameters and mimic real-world performance, and

  • Test as the final unseen benchmark.

Then, the practitioner analyzes the gaps between different error rates:

  • Avoidable bias (or bias): The gap between the training error and the Bayes error (the theoretical minimum error rate, often proxied by the human error rate). High bias suggests the model is too simple to grasp the data.

  • Variance: The gap between the training error and the validation error. This gap estimates the generalization error. High variance suggests the model is not transferring its learning effectively to new data.

Underfitting vs. Overfitting

The relationship between bias and variance defines the two most common pitfalls in machine learning.

1. Underfitting (High Bias, Low Variance)

Underfitting occurs when the model is too simple to represent the underlying structure of the data.

  • The Situation: The model fails to capture the "signal" in the data.

  • Potential Reasons: Insufficient model complexity. A lack of high-quality features.

  • Result: High bias, low variance. High training/validation errors.

In Figure B, a simple model yields a very high bias of 14 points and a low variance of 1 point because it is too simple for the training set, failing to learn anything meaningful from it.

2. Overfitting (Low Bias, High Variance)

Overfitting occurs when a complex model learns the training data too well, including its noise and random fluctuations.

  • The Situation: The model memorizes specific training examples rather than learning generalizable rules.

  • Potential Reasons: Excessive model complexity relative to the amount of data. Training for too many iterations.

  • Result: Low bias, high variance.

In Figure B, a complex model achieves a very low bias of 0 points but yields a high variance of 15 points because it is too complex for the training set and memorizes the data completely.

But its memory of the training set is not transferable to the validation set, because the training set contains unique noise and idiosyncratic samples that do not repeat in other sets.

So, the model ends up with a very high validation error, failing to generalize its learning (memory).

The Ideal Model and the Bias-Variance Trade-Off

An ideal model achieves the sweet spot of low bias and low variance, effectively capturing the hidden patterns of the training data in a way that remains transferable to unseen data.

In practice, the bias-variance trade-off arises when we increase model complexity to reduce bias: the more flexible model starts to pick up noise, which increases variance.

As model architectures grow more complex, overfitting frequently becomes a more significant challenge than underfitting.

How the VC Dimension Defines the Generalization Bound

The VC dimension is used to assess the VC bound, a probabilistic upper bound on a model's true error.

It answers the question:

In the worst-case scenario, what is the maximum error this model will likely encounter in the real world?

Mathematically, the VC bound represents the gap between the in-sample error and out-of-sample error such that:

E_{out} \leq E_{in} + \underbrace{\sqrt{\frac{8}{N} \ln \left( \frac{4(2N)^{d_{VC}}}{\delta} \right)}}_{\text{VC Bound}} \quad \cdots (1.1)

where:

  • E_{out}: Out-of-sample error (also called total error or generalization error),

  • E_{in}: In-sample error (the average training error),

  • N: The total number of samples in the training set,

  • d_{VC}: The VC dimension of the model, and

  • δ: Confidence parameter (e.g., δ = 0.05 indicates 95% confidence that the bound holds).

In asymptotic notation simplifying the constants for Big-O analysis, Eq (1.1) is written as:

E_{out} \leq E_{in} + O \Bigg( \sqrt{\frac{d_{VC}}{N} \log \frac{N}{d_{VC}} + \frac{1}{N} \log \frac{1}{\delta}} \Bigg) \quad \cdots (1.2)
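Eq (1.1) is straightforward to evaluate numerically. A minimal sketch, with the logarithm expanded as ln(4/δ) + d_VC·ln(2N) to avoid overflow (`vc_bound` is an illustrative helper name, not from the original code):

```python
import numpy as np

def vc_bound(N, d_vc, delta=0.05):
    # Eq (1.1) penalty; ln(4 * (2N)^d_vc / delta) expanded to avoid overflow
    return np.sqrt((8 / N) * (np.log(4 / delta) + d_vc * np.log(2 * N)))

print(vc_bound(N=100, d_vc=3))      # a linear classifier in 2D, 100 samples
print(vc_bound(N=100_000, d_vc=3))  # same model with far more data: much tighter
```

The two printed values make the trade-off concrete: for a fixed d_VC, growing N shrinks the penalty term, while for a fixed N, growing d_VC inflates it.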

In-Sample Error

The in-sample error measures the average loss (error) on the data the model was actually trained on.

E_{in}(h) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, h(x_i)) \quad \cdots (2)

where:

  • N: The total number of samples in the training set, and

  • L(y, h(x)): The loss function (e.g., Mean Squared Error or 0/1 Loss) to quantify the error.

As Figure B shows, in-sample error can be high when underfitting, and low when overfitting or ideal.
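Under 0/1 loss, Eq (2) reduces to the fraction of misclassified training points. A minimal sketch with hypothetical labels and predictions:

```python
import numpy as np

def in_sample_error(y_true, y_pred):
    # Eq (2) with 0/1 loss: the fraction of misclassified training points
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

y_train = np.array([1, 0, 1, 1, 0])     # hypothetical training labels
y_hat = np.array([1, 0, 0, 1, 0])       # one of five predictions is wrong
print(in_sample_error(y_train, y_hat))  # 0.2
```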

Out-of-Sample Error

The out-of-sample error (also known as total error or generalization error) indicates the expected total error of a model on unseen data.

Theoretically, this error is decomposed into three fundamental components:

E_{out} = \text{Bayes Error} + \text{Bias}^2 + \text{Variance} \quad \cdots (3.1)

*Note: The decomposition in Eq (3.1) is specifically applied to Squared Error loss.

Bayes error is the irreducible noise inherent in the problem itself, such that y = f(x) + ϵ, where y is the observed label (the noisy truth), f(x) is the true, ideal function the model attempts to learn, and ϵ is random noise.

Bias² indicates the difference between the Bayes error and the training error, and the variance indicates the difference between the training error and the true error on unseen data.

When a set of samples (x, y) is drawn from the true, underlying data distribution P, E_{out} is denoted:

E_{out}(h) = \mathbb{E}_{(x,y) \sim P} \left[ L(y, h(x)) \right] \quad \cdots (3.2)

where:

  • h(x): The hypothesis (prediction) generated by the model,

  • E[L]: Expected loss (error) over all possible samples (x, y) drawn from P, and

  • L(y, h(x)): The loss function (e.g., Mean Squared Error or 0/1 Loss) to quantify the error.

But in practice, computing true E_{out} is impossible because the underlying distribution P in the real world is extremely complex and noisy.

The bound assumes the data is i.i.d. (independently and identically distributed), so that an observed error rate serves as a reliable proxy for the true error.

In practice, we therefore estimate the error on a test set that is never used during training, assuming the gap between the true error E_{out} and the estimate Ê_{out} is negligible.

This process is denoted:

E_{out}(h) \approx \hat{E}_{out}(h) = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} L(y_i, h(x_i)) \quad \cdots (3.3)

where N_{test} represents the total number of samples in the test set.
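Eq (3.3) can be sketched directly. All names below are illustrative, and the hypothesis `h` is deliberately set equal to the true function `f`, so the test-set average should recover roughly the noise variance 0.3² = 0.09 (the Bayes error under squared loss):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # the true function (unknown in practice)
    return np.sin(2 * np.pi * x)

# a held-out set of N_test = 500 samples drawn from the same distribution P
X_heldout = rng.random(500)
y_heldout = f(X_heldout) + rng.normal(0, 0.3, size=500)

def h(x):
    # a hypothetical fitted model; here a perfect stand-in for f
    return np.sin(2 * np.pi * x)

# Eq (3.3): the average squared loss on the held-out set estimates E_out
e_out_hat = np.mean((y_heldout - h(X_heldout)) ** 2)
print(e_out_hat)  # close to the noise variance 0.09, since h matches f exactly
```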

VC Dimensions of Machine Learning Models

The VC dimension measures the largest set of samples that the model can shatter. In simpler terms:

  • Step 1: Find the largest number n such that there exists at least one arrangement of n points that the model can shatter.

  • Step 2: Prove that no set of n + 1 points can be shattered by that model.

The formula for the VC dimension varies by model.

Linear Classifiers (Perceptrons)

For a linear separator in an n-dimensional space where n is the number of features:

d_{VC} = n + 1 \quad \cdots (4.1)

As Figure A shows, a line in 2D (n = 2) can shatter three points, and a plane in 3D can shatter four points, since the model has n + 1 parameters in total (n weights plus one bias term, the intercept).

Polynomial Classifiers

A polynomial surface of degree d separates data in an n-dimensional space:

d_{VC} = \binom{n + d}{d} \quad \cdots (4.2)

This represents the number of terms in a general polynomial of degree d in n variables.

For example, a 2nd-degree polynomial in 2D space has d_{VC} = \binom{2+2}{2} = 6.
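Eq (4.2) is a single binomial coefficient, so it can be checked with the standard library (`poly_vc_dim` is an illustrative helper name):

```python
from math import comb

def poly_vc_dim(n_features, degree):
    # Eq (4.2): the number of monomials of degree <= d in n variables
    return comb(n_features + degree, degree)

print(poly_vc_dim(2, 2))   # 6, matching the worked example above
print(poly_vc_dim(2, 10))  # 66: capacity grows rapidly with degree
```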

Nearest Neighbor (1-NN)

A 1-NN model can shatter any number of points by simply memorizing their locations. So, the VC dimension is infinity:

d_{VC} = \infty \quad \cdots (4.3)

Eq (4.3) correctly predicts that 1-NN is highly prone to overfitting without further constraints.

Support Vector Machines (SVM)

For SVMs, the VC dimension is not strictly defined by the number of dimensions, but by the margin γ and the radius R of the sphere containing the data:

d_{VC} < \frac{R^2}{\gamma^2} \quad \cdots (4.4)

Eq (4.4) explains why SVMs generalize well in high-dimensional spaces; their complexity depends on the margin they find, not just the number of features.

Why the VC Bound Matters

Eq (1.1) shows that the VC bound calculates the maximum possible gap between the in-sample and out-of-sample errors, leveraging the VC dimension d_{VC}.

The formula mathematically proves the fundamental tradeoff in Machine Learning:

  • Complex models tend to overfit because increasing d_{VC} increases the VC bound, increasing the gap between the training error and true error (hypothetically, represented by the validation error).

  • More data makes the training error a more reliable estimate of total error because as N increases, the VC bound shrinks, making E_{in} closer to E_{out}.

To achieve the lowest E_{out}, it is critical to balance a model complex enough to minimize E_{in} but simple enough to keep the VC bound small.

Developer Notes

The VC dimension is a traditional theory.

In practice, I rarely use it when selecting or tuning models, not only because it does not fit most modern architectures well, but also because computing the exact VC dimension is mathematically burdensome, especially for complex models.

In addition, the VC bound represents a “loose” upper bound in the worst case scenario that I will not likely encounter in reality.

Instead, I benchmark multiple models and apply regularization and early-stopping almost all the time to prevent overfitting.

The VC dimension is a theoretical tool to quantify the model’s complexity.

Empirical Test: Structural Risk Minimization (SRM) in Python

I’ll demonstrate how to find an optimal polynomial degree using Structural Risk Minimization (SRM).

The Dataset

First, I create a synthetic dataset where the true relationship is y = \sin(2\pi x) and add significant noise to simulate real-world imperfections:

import numpy as np

np.random.seed(42)

N = 100
X = np.sort(np.random.rand(N, 1), axis=0)

def true_func(x):
    return np.sin(2 * np.pi * x)

y = true_func(X).ravel() + np.random.normal(0, 0.3, N)  # add noise

Then, I create a separate test set, drawn with the same noise level, to estimate the true E_{out} later:

import numpy as np

X_test = np.linspace(0, 1, 500).reshape(-1, 1)
y_test = true_func(X_test).ravel() + np.random.normal(0, 0.3, 500)

The Pipeline (Scaling + Complexity)

Inside the loop, I use sklearn.pipeline.Pipeline to perform three tasks in order for every degree d:

  1. PolynomialFeatures(degree=d): This increases the model's capacity (VC Dimension). It transforms a single feature x into [x^1, x^2, …, x^d] as per Eq (4.2).

  2. StandardScaler(): Scaling all features to have a mean of 0 and a variance of 1.

  3. LinearRegression(): Fits the coefficients to the scaled features.

Calculating the VC Bound & SRM Logic

Following Eq (1.1), (2), and (3.3), I compute e_in, e_out, the VC bound, and the SRM score srm.

The Structural Risk Minimization (SRM) is used to compare two different ways to pick the optimal polynomial degree:

  • Empirical selection picks d with the lowest e_out as the winner by focusing on the actual performance on the test set.

  • Theoretical selection based on SRM picks d with the lowest sum of e_in and vc_bound, without ever looking at test data.

Empirical selection is common in practice. Theoretical selection is for demonstration.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def compute_vc_bound(d_vc, N, delta=0.05):
    # VC bound penalty from Eq (1.1), with the log expanded for numerical stability
    return np.sqrt((8 / N) * (np.log(4 / delta) + d_vc * np.log(2 * N)))

degrees = list(range(1, 15))  # polynomial degrees d = 1..14
results = []
for d in degrees:
    # pipeline: polynomial expansion -> scaling -> linear fit
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=d)),
        ("scaler", StandardScaler()),
        ("linear", LinearRegression())
    ])
    model.fit(X, y)

    # compute E_{in}, E_{out}, VC bound, and the SRM score
    e_in = mean_squared_error(y, model.predict(X))
    e_out = mean_squared_error(y_test, model.predict(X_test))
    vc_bound = compute_vc_bound(d_vc=d + 1, N=N)  # d_vc = d + 1 for 1D polynomials
    srm_score = e_in + vc_bound

    # store the results
    results.append({
        'd': d,
        'e_in': e_in,
        'e_out': e_out,
        'srm': srm_score
    })

# find the optimal degree under each criterion
opt_d_empirical = degrees[np.argmin([r['e_out'] for r in results])]
opt_d_theoretical = degrees[np.argmin([r['srm'] for r in results])]
34

Results

The simulation demonstrates that as model complexity (polynomial degree) increases from one to 14, the VC Bound grows, representing a higher penalty for model flexibility.

This causes the gap between the in-sample error E_{in} (red solid line) and the out-of-sample error E_{out} (black solid line) to widen.

In this specific case, the theoretical optimal (d = 7, black dotted line)—chosen by minimizing SRM—is slightly more complex than the empirical optimal (d = 5, red dotted line).

This suggests that the actual data is best represented by a simpler degree-5 model, even though the theory indicates a more complex model with d = 7.

Beyond d = 5, the model begins to fit the noise, leading to wasted computational resources and degraded performance.

Matplotlib plot comparing E_in and E_out against polynomial degrees, highlighting the difference between theoretical and empirical optimal model complexity.


Figure C. Comparison of out-of-sample and in-sample errors with theoretical and actual optimal model complexities (Created by Kuriko IWAI)

Predictive Analysis

As Sample Size (N) Increases:

  • The VC bound shrinks: With more data, the penalty for complexity decreases because the model is constrained by more points, enabling it to learn the true underlying data distribution.

  • The gap narrows: As the VC bound shrinks, E_{in} and E_{out} start to converge.

  • Optimal complexity rises: The optimal degree shifts higher (e.g., from 5 to 10), as the model can now safely learn finer details of the data.

  • Theory matches reality: The theoretical optimal d and the empirical optimal d start to move closer.
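The shrinking-penalty claim can be checked numerically by re-evaluating the Eq (1.1) penalty term at growing N for a fixed capacity; a minimal sketch (`vc_bound` is an illustrative helper, and d_VC = 8 corresponds to a degree-7 polynomial in one feature):

```python
import numpy as np

def vc_bound(N, d_vc, delta=0.05):
    # Eq (1.1) penalty with the log expanded for numerical stability
    return np.sqrt((8 / N) * (np.log(4 / delta) + d_vc * np.log(2 * N)))

# fixed capacity (d_vc = 8), increasing sample size: the penalty decays
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(vc_bound(n, d_vc=8), 3))
```

The monotone decay of the printed values is exactly why more data licenses more complex models under SRM.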

As Model Complexity (d) Increases:

  • E_{in} decreases: The model becomes good at memorizing the training points, eventually reaching near-zero error.

  • Variance explodes: As a result, the model becomes highly sensitive to small fluctuations in the training data (noise).

  • The generalization gap widens: E_{out} eventually starts to climb, signaling overfitting.

Wrapping Up

The VC dimension serves as a critical theoretical pillar in machine learning, providing a way to quantify model capacity.

The VC bound helps us establish a safe bet; a loose upper bound on how much a model’s performance might degrade when handling unseen, real-world data.

The experiment with polynomial models empirically validates several core principles of learning theory:

  • The bias-variance trade-off: Increasing the polynomial degree minimized the in-sample error but increased the generalization gap.

  • Complexity vs. data: Models with high VC dimensions (high polynomial degree) quickly started to treat random noise as a learnable signal (especially when the total number of samples N is small).

  • Theoretical vs. practical approaches to model selection: The looseness of the VC bound leads to selecting a model slightly more complex than what empirical testing suggests is optimal.

Achieving the sweet spot of low bias and low variance remains the primary goal for any predictive model, ensuring that the patterns learned today remain valid on the data in the future.

Continue Your Learning

If you enjoyed this blog, these related entries will complete the picture:

Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Share What You Learned

Kuriko IWAI, "Mastering the Bias-Variance Trade-Off: An Empirical Study of VC Dimension and Generalization Bounds" in Kernel Labs

https://kuriko-iwai.com/vc-dimension-bias-variance-generalization-bound


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.