Regression Loss Functions & Regularization

Navigating model complexity and practical frameworks for model selection in regression problems

Machine Learning | Data Science

By Kuriko IWAI

Table of Contents

Introduction
What is Regression
Measuring Approximation Accuracy
Optimizing Parameters
The Practical Goal: Minimizing Empirical Risks
Major Regression Algorithms
Linear Regression Family
Polynomial Regression
Kernel Ridge Regression (KRR)
Support Vector Regression (SVR)
K-Nearest Neighbors
Trees (Decision Tree, Random Forest, Gradient Boost Family)
Neural Networks
Understanding Model Performance: Learning Guarantees and Generalization Bounds
Why Unbounded Problems are Challenging
Challenges of Outliers
The Big Picture of Generalization Bounds
The Role of Confidence Term
The Role of Complexity Term
Measuring Complexity
Transforming from Unbounded to Bounded
Pseudo-Dimension with Bounded Envelope
Assumptions on Finite Second Moments
Practical Use of Generalization Bounds
Scenario 1: Limited Data with (Potentially) Complex Underlying Pattern
Scenario 2: Abundant Data
Scenario 3: Diagnosing Overfitting
Conclusion

Introduction

Regression is a common task in machine learning with a variety of applications.

In this article, I’ll explore the core of its learning problems and the practical impact of theoretical generalization bounds when choosing regression algorithms.

What is Regression

The learning problem of regression is to identify the most suitable approximate function (hypothesis or prediction function) that can accurately map input values X to output values Y:

h: X \mapsto Y

The model learns this approximation using a given training sample set S with a set of inputs x and their corresponding output values y:

S = ((x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)), \quad x_i \in X, \ y_i \in Y

where:

  • S: A training sample set,

  • X, Y: The input and output sample spaces,

  • x, y: Input and output samples drawn from the sample spaces X and Y, and

  • m: A sample size.

For example, a training sample set can be:

S = ((1, 0.4), (2, 1.2), (3, 1.6))

when m = 3 and X is in a one-dimensional feature space.

Here, each training sample (x, y) is drawn from the respective sample spaces X and Y.

For regression algorithms (or simply called models) to perform well in the real world, we ideally assume these training samples are independently and identically distributed (i.i.d.) from the true underlying data distribution (f), thereby serving as true representatives of real-world data.

This concept is mathematically denoted:

y_i = f(x_i) + \epsilon_i \text{ where } f: X \mapsto Y

where:

  • f: A true function of the true underlying data distribution, mapping the input space X to the output Y,

  • x_i, y_i: The i-th input/output training sample (true values),

  • X, Y: An input / output space that x and y are drawn from, and

  • ε_i: A small deviation of the sample from the true distribution.

Hence, regression algorithms attempt to approximate y using hypothesis h.

Measuring Approximation Accuracy

Now, when a model approximates a true function, we need to measure its accuracy to keep improving its approximation performance.

A loss function is used to measure this accuracy, quantifying the discrepancy (loss) between the model’s hypothesis h and the true value y for each sample set.

In regression, the loss is computed based on the magnitude of the difference between h and y.

For instance, MSE, the most common loss function for regression, computes the squared distance from h to y; when h = 1 and y = 3, MSE = (3 - 1)² = 4.

It is also possible for us to define a custom loss function based on our objectives and unique performance metrics to be used.
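As a brief sketch of both ideas, here is the MSE computation above alongside a hypothetical custom loss that penalizes under-prediction more heavily (the function names and the weighting factor are illustrative assumptions, not from any library):

```python
import numpy as np

def mse(h, y):
    """Mean squared error: the average squared distance between h and y."""
    h, y = np.asarray(h, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((h - y) ** 2))

def asymmetric_loss(h, y, under_weight=2.0):
    """A custom loss that penalizes under-prediction (h < y) twice as much."""
    h, y = np.asarray(h, dtype=float), np.asarray(y, dtype=float)
    diff = h - y
    return float(np.mean(np.where(diff < 0, under_weight * diff**2, diff**2)))

print(mse([1.0], [3.0]))  # (3 - 1)^2 = 4.0
```

Swapping in a custom loss like this changes which errors the optimizer works hardest to avoid, which is the point of tailoring the loss to the objective.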

Learn More: A Comprehensive Guide on Loss Functions in Machine Learning

Optimizing Parameters

After we define a suitable loss function, the model aims to minimize the losses computed by the loss function across training samples.

This is called an optimization problem where a model attempts to find an optimal set of parameters that can minimize the losses.

These parameters (also called model parameters) include weights (w) and biases (b).

For example, mathematically, the optimal weights are denoted:

w^* = \arg \min_w \left( \frac{1}{m} \sum_{i=1}^m L(h(x_i, w), y_i) \right)

where:

  • w∗: Optimal weights,

  • m: A sample size,

  • L: A loss function of our choice,

  • h(x, w): A hypothesis for the sample point x, and

  • y: A true value for the sample point x.

And once the model finds optimal parameters across samples, it computes hypothesis h^(x) using the parameters:

\hat h(x) = w_0 + w_1 x_1 + \cdots + w_n x_n = w^* X + b

where:

  • h^(x): An estimated hypothesis,

  • w_i: The i-th optimal weight corresponding to the i-th feature (x_i),

  • b: A bias term, and

  • n: The number of features in the sample space.
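As a hedged illustration of these two steps (solving for w* and then forming ĥ), here is a closed-form ordinary-least-squares fit on hypothetical 1-D data; the generating coefficients and noise level are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data generated from y = 2x + 1 plus small noise
X = rng.random((50, 1)) * 10
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, 50)

# w* = argmin of the average squared loss, solved in closed form;
# a column of ones absorbs the bias term b
A = np.c_[X, np.ones(len(X))]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef

h_hat = lambda x: w * x + b  # the estimated hypothesis
print(w, b)  # close to the generating values 2 and 1
```

For squared loss and a linear hypothesis, the minimizer has this closed form; iterative optimizers reach the same w* by gradient descent.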

Now, ideally, this h^(x) approximates the true function f with minimum errors.

But what if we have no knowledge of the true function?

We cannot estimate this loss when we have no clue about the true function.

The Practical Goal: Minimizing Empirical Risks

Now, let us refer to losses as “risks”: a risk is high when a faulty hypothesis deviates significantly from the true values.

The ultimate goal of regression algorithms is to minimize the true risk, defined as the expected loss over the true distribution:

R(h) = E_{x \sim D}[L(h(x), f(x))]

where:

  • R(h): A true risk (loss),

  • L(h(x), f(x)): The loss between the hypothesis h(x) and the true value f(x),

  • f(x): A true function, and

  • x: A sample drawn from the true sample space D.

Yet, we see the challenges in computing the true risk when a true function f(x) is complicated and unknown.

In practice, we approximate the true risk with the empirical risk (also called the empirical error or training error), computing an average loss between hypothesis and an output sample value across the samples:

\hat R(h) = \frac{1}{m} \sum_{i=1}^m L(h(x_i), y_i)

where:

  • R^(h): An empirical risk,

  • L(h(x), y): Losses between hypotheses h(x) and true values y,

  • m: A sample size, and

  • x_i, y_i: i-th input/output sample drawn from the true sample space.

using the observed outputs y_i instead of the unknown true function f.
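A minimal sketch of this average, using the toy sample set from the Introduction; the hypothesis below is an arbitrary illustration, not a fitted model:

```python
import numpy as np

def empirical_risk(h, S, loss=lambda p, y: (p - y) ** 2):
    """Average loss of hypothesis h over the training sample S = [(x, y), ...]."""
    return float(np.mean([loss(h(x), y) for x, y in S]))

S = [(1, 0.4), (2, 1.2), (3, 1.6)]  # the sample set from the Introduction
h = lambda x: 0.6 * x               # a hypothetical linear hypothesis

print(empirical_risk(h, S))  # ≈ 0.0267
```

Note that only the observed pairs (x_i, y_i) are needed; the true function f never appears.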

***

With the theoretical objective set, the next step is to explore the diverse algorithms designed to achieve this minimization in practice.

I’ll explore some of the major regression algorithms and their characteristics in the next section.

Major Regression Algorithms

Major regression algorithms include the linear regression family and polynomial regression.

These are parametric models primarily designed for regression problems.

Other model families often used for regression include:

  • Support Vector Regression (SVR),

  • Trees (Decision trees, GBM, Random Forest), and

  • K-nearest Neighbors (KNN)

These are non-parametric models and versatile enough to handle both classification and regression tasks.

In deep learning, neural networks also serve as powerful regression algorithms capable of addressing complex regression problems.

Here’s a quick overview of each family:

Linear Regression Family

One of the most fundamental regression techniques is linear regression. Its family includes models with L1, L2, or combined regularization terms added to the loss function:

  • Linear Regression

  • Lasso (Applying L1 terms to the loss function)

  • Ridge (Applying L2 terms to the loss function)

  • Elastic Net (Applying the combination of L1 and L2 with coefficient)

Key characteristics include:

  • Parametric

  • Applied to regression only (though technically usable for classification, dedicated classification algorithms are strongly preferred).

  • Objective function: Minimize the loss function such as MSE.

  • Decision Function: A linear combination of input features:

\hat y = w_0 + w_1 x_1 + \cdots + w_n x_n

The parametric nature of linear regression makes it rely on specific assumptions:

  1. Linearity: Linear relationship between y and the independent variable X.

  2. Independence: No autocorrelation among y and X .

  3. Homoscedasticity: Equal variance of the errors across all levels of X (e.g., if we plot residuals against X, the spread of residuals should be constant).

  4. Normality: The errors follow a normal distribution.

  5. No multicollinearity: x’s are not correlated with each other.

  6. No endogeneity: No correlation between the errors and the independent variables x’s.

Violating these assumptions leads to biased parameter estimates.

Polynomial Regression

While linear regression assumes a strictly linear relationship, polynomial regression extends this concept by modeling non-linear relationships using polynomial features. It remains parametric, similar to its linear counterpart.

Similar to linear regression, polynomial regression models the relationship between the independent variable(s) and the dependent variable as a k-th degree polynomial.

The prediction is a polynomial combination of the input features such that:

\hat y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k

where k is a polynomial degree.
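A brief scikit-learn sketch on hypothetical quadratic data (the generating coefficients are made up for illustration): the features are expanded to [x, x²] and an ordinary linear model is fit on the expansion.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# hypothetical noiseless data from y = 1 + 2x + 3x^2
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * X[:, 0] + 3 * X[:, 0] ** 2

# expand x into [x, x^2], then fit an ordinary linear model on the expansion
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(X, y)

lin = model.named_steps["linearregression"]
print(lin.intercept_, lin.coef_)  # recovers beta_0 ≈ 1 and (beta_1, beta_2) ≈ (2, 3)
```

This also shows why polynomial regression is still parametric: the model remains linear in its coefficients β.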

Kernel Ridge Regression (KRR)

  • Non-parametric

  • Non-linear (in data, linear in transformed space)

  • Versatile (commonly used for regression)

  • Objective Function: Minimize a loss function of our choice plus an L2 penalty.

  • Decision Function: A weighted sum of kernel functions of the training points:

\hat y = \sum_i \alpha_i K(x_i, x)
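A minimal KernelRidge sketch on hypothetical noisy sine data (the hyperparameters are illustrative, not tuned); the fitted model stores one dual coefficient α_i per training point, matching the decision function above:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(42)

# hypothetical noisy sine data
X = np.sort(rng.random(80) * 5).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 80)

# RBF kernel; alpha is the L2 penalty on the dual coefficients
krr = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5).fit(X, y)

# predictions are a weighted sum of kernel evaluations against the training points
print(krr.dual_coef_.shape)  # one alpha_i per training point: (80,)
```

The L2 penalty keeps the α_i small, which is what controls the model's effective complexity.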

Support Vector Regression (SVR)

  • Non-parametric

  • Non-linear (via kernels)

  • Versatile

  • Objective Function: Minimize the loss function plus a regularization term (L2 penalty on weights).

  • Decision Function: A weighted sum of kernel evaluations at the support vectors:

\hat y = \sum_i (\alpha_i - \alpha_i^*) K(x_i, x) + b

K-Nearest Neighbors

  • Non-parametric

  • Non-linear

  • Versatile

  • Objective Function: No explicit training objective; predictions rely on a distance metric of our choice.

  • Decision Function: The (distance-weighted) average of the target values of the K nearest neighbors of the query point.
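A tiny sketch of this decision rule on toy numbers (the data is invented for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy 1-D training data
X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.0, 2.0, 3.0, 10.0])

# uniform weights: the prediction is the plain average of the K nearest targets
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)
print(knn.predict([[2.5]]))  # neighbors are x=2 and x=3, so (2 + 3) / 2 = [2.5]
```

Setting weights='distance' instead would weight each neighbor's target by its inverse distance to the query.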

Trees (Decision Tree, Random Forest, Gradient Boost Family)

  • Non-parametric

  • Non-linear

  • Versatile

  • Objective Function: Minimizes the impurity (decision trees) or the loss of ensemble models (GB, Random Forest).

  • Decision Function: Average or weighted sum of the weak learner’s prediction:

\hat y = \sum_{m=1}^M f_m(x)
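The additive form ŷ = Σ f_m(x) can be observed in scikit-learn through staged predictions, which expose the running sum after each weak learner (the data below is synthetic for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# hypothetical noisy data
X = rng.random((200, 1)) * 10
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

gbm = GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=0).fit(X, y)

# staged_predict yields the running sum after each weak learner f_m;
# the final stage equals the full ensemble prediction
*_, final_stage = gbm.staged_predict(X)
print(np.allclose(final_stage, gbm.predict(X)))  # True
```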

Neural Networks

  • Parametric (in terms of a fixed number of model parameters)

  • Non-linear

  • Versatile

  • Objective Function: Minimizes the loss.

  • Decision Function: Depends on the output layer’s activation function.

While these algorithms provide powerful tools for training regression models, a critical question remains:

How well will these models perform on new, unseen data?

Understanding their reliability and performance on unseen data requires theoretical assurances.

The concept of learning guarantees and generalization bounds provides these theoretical assurances about a model's performance beyond the training set.

I'll explore this crucial concept in the next section.

Understanding Model Performance: Learning Guarantees and Generalization Bounds

When the loss L(h(x), f(x)) is bounded by some constant M > 0 for all hypotheses h in the hypothesis space H and all input features x in the feature space X, the problem is referred to as a bounded regression problem.

In a bounded regression problem, generalization bounds provide a theoretical upper bound on the true risk based on the empirical risk.

This upper bound offers assurance on how well the model will perform on unseen data after training on a finite dataset.

Hence, bounded regression problems can often establish strong learning guarantees.

Why Unbounded Problems are Challenging

When a model is operating in an unbounded loss space (where the loss can be theoretically infinite) or tackling unbounded regression problems, the generalization bound can indeed be infinite or undefined.

This scenario becomes significantly challenging primarily because an unbounded loss implies infinite variance or heavy-tailed error distributions.

In such scenarios, a function can take arbitrarily large values with arbitrarily small probabilities (often referred to as exceptional “outliers”).

These outliers make it difficult to establish meaningful learning guarantees because the empirical risk can be heavily skewed toward the outliers, and might not be a reliable indicator of the true risk anymore.

The two graphs below demonstrate how the learned models behave under bounded-like losses (left) and unbounded-like losses (right). The unbounded-like case includes many “outliers” in the training samples, which pull the learned model to the point where its loss (MSE) becomes significantly higher than in the bounded case.

Figure A. Comparison of the learned models and MSEs with bounded-like loss (left) and unbounded-like loss (right) (Created by Kuriko IWAI)

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Challenges of Outliers

In practice, removing these outliers from the training samples can be problematic for several reasons.

1. Masking Effect

First of all, detecting outliers is challenging in a real-world case because multiple outliers can collectively shift the detection measure — the mean and standard deviation.

This shift makes them appear less anomalous, making it challenging for the detection method to flag all or some of them correctly.

2. Extreme Observations

Even when we can detect the outliers, these outliers might represent valid but extreme observations that should be a part of the true data distribution (e.g., spikes/drops in stock prices).

Removing these outliers makes the model fail to learn such key observations — which often results in poor generalization.

3. Reduced Sample Size

Removing too many outliers can significantly reduce the sample size, potentially affecting the statistical power and reliability of the model.

Given these challenges, defining appropriate strategies for unbounded problems is crucial.

I’ll detail them in the next section.

The Big Picture of Generalization Bounds

A common form of a generalization bound defines an upper bound for the true risk using the following structure:

R(h) \leq \hat R(h) + \text{Complexity Term} + \text{Confidence Term}

The primary goal of establishing generalization bounds is to constrain both the Complexity Term and the Confidence Term.

This ensures that if the model shows a low empirical risk (good performance on training samples), its true risk on unseen data is also likely to be low, providing a measure of the model’s generalization capability.

The Role of Confidence Term

The confidence term represents the probability that the bound does not hold, using δ (delta) as a parameter (in other words, the bound holds with probability 1 − δ). For example, if δ = 0.05, the bound holds with 95% confidence.

This confidence term often appears in bounds in forms such as √((1/m) log(1/δ)).

This means that if we want to be more certain that the true risk lies within the bound, we need to set a smaller δ, which enlarges the confidence term and widens (loosens) the bound.
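A quick numeric illustration of a √(log(1/δ)/m)-style confidence term; the sample sizes and δ values are arbitrary:

```python
import numpy as np

def confidence_term(m, delta):
    """The sqrt(log(1/delta) / m) form that appears in many generalization bounds."""
    return float(np.sqrt(np.log(1.0 / delta) / m))

# smaller delta (more confidence) widens the bound ...
print(confidence_term(1_000, 0.05))   # ≈ 0.0547
print(confidence_term(1_000, 0.001))  # ≈ 0.0831
# ... while more samples tighten it
print(confidence_term(100_000, 0.05))  # ≈ 0.0055
```

The 1/√m decay is why abundant data makes strong guarantees cheaper: the same confidence costs a far smaller slice of the bound.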

The Role of Complexity Term

The Complexity Term quantifies the model complexity, which indicates how well the model can approximate a complex, non-linear true function.

A complex model has greater capacity to learn intricate patterns within the data, while being prone to overfitting, where the model memorizes the noise in the training samples rather than learning the true underlying relationships.

On the other hand, low model complexity indicates that the model might be too simple to capture the underlying patterns, potentially leading to underfitting.

Measuring Complexity

Model complexity can be measured by the number of model parameters, or by more sophisticated measures such as the VC dimension, especially for more complex models.

The following table illustrates major computation measures and their main purposes, with a simple example of linear regression with two model parameters (w0​, w1​):

| Measure | Purpose | e.g. Linear Regression h = w₁x + w₀ |
| --- | --- | --- |
| Number of Model Parameters | Simplest way to estimate model complexity | 2 |
| VC Dimension | Compute model complexity | 2 |
| Rademacher Complexity | Estimate the capability to fit random noise | B√2 (…) |
| Pseudo-Dimension Bounds | Compute model complexity for real-valued functions | 2 (= 1 + d, d = 1) |

Fig. Table: Major complexity measures (Created by Kuriko IWAI)

I’ll explore some of these key measures.

VC Dimensions (Vapnik-Chervonenkis Dimension)

  • Measures the model’s complexity.

  • A higher VC-dimension generally means a more complex model.

  • Primarily for classification, but concepts extend.

Rademacher Complexity Bounds

  • Quantifies the ability of a model to fit random noise:
\hat R_m(G) = E_\sigma\left[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i \cdot g(x_i, y_i)\right]
  • Lower Rademacher complexity generally leads to tighter generalization bounds:
R(h) \leq \hat R(h) + 2\hat R_m(L \circ H) + O\left(\sqrt{\frac{\log(1/\delta)}{m}}\right)

where

  • R̂_m(G): The empirical Rademacher complexity of a function class G,

  • R̂_m(L∘H): The empirical Rademacher complexity of the loss function composed with the hypothesis class H,

  • m: The sample size,

  • σ: a vector of independent Rademacher random variables, and

  • g(xi​, yi​): The loss value.
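As a hedged sketch, the empirical Rademacher complexity of a finite hypothesis class can be estimated by Monte Carlo over draws of σ; the two-hypothesis loss matrix below is a made-up illustration where the exact value works out to 0.5:

```python
import numpy as np

def empirical_rademacher(loss_matrix, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[sup_g (1/m) sum_i sigma_i * g(x_i, y_i)].

    loss_matrix[k, i] holds the loss of hypothesis k on sample i;
    the sup over a finite class is a max over rows.
    """
    rng = np.random.default_rng(seed)
    _, m = loss_matrix.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher random variables
        total += np.max(loss_matrix @ sigma) / m  # sup over hypotheses
    return total / n_draws

# two hypotheses with opposite losses on two samples: the exact value is 0.5
G = np.array([[1.0, 1.0], [-1.0, -1.0]])
print(empirical_rademacher(G))
```

A class that can align with random sign flips (fit noise) yields a large value; a trivial class of constants yields zero.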

Pseudo-Dimension Bounds

  • Extends the concept of VC-dimension to real-valued functions.

  • Quantifies the complexity of real-valued functions by finding the largest set of points that a model can “shatter“ (classify into two classes in all possible ways).

  • A generalization bound is set using Pdim(H), the pseudo-dimension of the hypothesis space H:

R(h) \leq \hat R(h) + O\left(\sqrt{\frac{Pdim(\mathcal{H})\log(m/Pdim(\mathcal{H}))}{m}}\right)

The generalization bounds provide powerful theoretical assurances, but they often rely on the assumption of a bounded loss function — meaning the loss cannot take arbitrarily large values.

But this assumption can be violated in real-world regression problems where many outliers exist.

I’ll explore major approaches to the problem in the next section.

Transforming from Unbounded to Bounded

A practical approach to tackle unbounded regression problems involves defining bounds on the loss or its components.

Major theoretical approaches are:

Pseudo-Dimension with Bounded Envelope

This approach extends concepts like VC-dimension, traditionally for bounded binary classification, to pseudo-dimension for real-valued function classes.

The process includes defining an envelope function that effectively bounds the absolute value of the unbounded loss by providing a maximum possible finite value, just like considering the worst-case behavior within a reasonable limit.

In practice, the approach defines a robust loss function or applies regularization on the loss function.

  • Loss Function: Using functions like Huber loss directly addresses unboundedness. They are designed to be bounded or have bounded influence for large residuals, creating a suitable envelope function for the loss function, allowing pseudo-dimension application.

  • Regularization: Limiting parameter norms (e.g., ||w||² ​≤ C) can lead to a more easily definable envelope function for the loss.

Assumptions on Finite Second Moments

Losses like squared error can be unbounded if y or a true function f can take arbitrarily large values like outliers. But in practice, we often assume the loss has a finite second moment, hence has a finite variance.

For instance, suppose a model predicts $15,000 for a car whose actual price is $16,000. This is a bounded loss space with a finite squared error of 1,000² = 10⁶.

Now imagine someone accidentally inputs $10,000,000,000 as the actual price. The loss space is then effectively unbounded, with an enormous squared error on the order of 10²⁰.

But even with this occasional $10,000,000,000 outlier, we assume such errors are rare; by averaging all the squared errors across a very large dataset, the mean may still be a finite, manageable number.

In practice, loss functions like Huber loss are designed to have finite second moments, directly addressing unboundedness and making these bounds applicable even with outliers.
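A small sketch of the Huber loss, which is quadratic for small residuals but grows only linearly beyond a threshold δ, so a huge outlier cannot dominate the average the way it does under squared error:

```python
import numpy as np

def huber_loss(residuals, delta=1.35):
    """Quadratic for |r| <= delta, linear beyond it (bounded influence)."""
    r = np.abs(np.asarray(residuals, dtype=float))
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

# the squared error of a huge outlier explodes; the Huber loss grows only linearly
residuals = np.array([0.5, 10.0, 1_000.0])
print(0.5 * residuals**2)      # squared-error scale: 0.125, 50.0, 500000.0
print(huber_loss(residuals))   # Huber scale grows linearly in the residual
```

The linear tail is what gives the loss a finite second moment under heavy-tailed error distributions.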

Practical Use of Generalization Bounds

In practice, we don't typically compute generalization bounds directly.

Instead, their underlying principles offer insights into choosing the right modeling strategies for various scenarios.

I'll demonstrate how these principles influence model choices in three common scenarios, using a few well-known regression algorithms as examples:

Model Considerations (from low complexity to high)

  • Linear Regression: A low-capacity model, ideal for simple, linear relationships.

  • Lasso, Ridge, Elastic Net : Linear models enhanced with regularization to control complexity.

  • Kernel Ridge: A kernel model whose effective complexity is kept in check by its L2 penalty.

  • Support Vector Regression (SVR): A flexible, high-capacity model that excels in non-linear relationships. Implicitly incorporates regularization.

  • Tree-based Models: Very high capacity, prone to overfitting.

Task: Predicting monthly transaction amounts of credit card users

Scenario 1: Limited Data with (Potentially) Complex Underlying Pattern

Situation:

  • Only a few hundred historical transaction records in the database.

  • Assuming the true underlying data distribution has complex patterns.

Insights from Generalization Bounds:

  • With very limited data, even if the true distribution is complex, a simpler model might generalize well simply because it's less prone to overfitting.

  • A complex model might not perform well because there simply aren't enough data points to reliably distinguish the true signal from random noise, effectively resulting in an unbounded loss space.

Strategies

  • Consider simpler models first.

  • Apply tighter regularization to complex models to constrain the upper bounds.

Experiment Setup

To simulate the scenario, I generated a synthetic dataset of 20 noisy samples from a cubic true function, fitted models with degree-10 polynomial features, and computed the MSE between the predictions and the true values.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

np.random.seed(42)

def true_function(X):
    return 0.1*X + 0.5*X**2 + 0.02*X**3

X = np.random.rand(20, 1) * 10
y = true_function(X) + np.random.randn(20, 1) * 15

X_val = np.linspace(0, 10, 200).reshape(-1, 1)  # evaluation grid

# `model` is each estimator from the list in the next section
y_pred = make_pipeline(PolynomialFeatures(degree=10), model).fit(X, y).predict(X_val)
mse = mean_squared_error(y_pred, true_function(X_val))

Model Options

Based on our strategies, I selected the following models with hyperparameter tuning:

  • Baseline: Linear Regression

  • Normal regularization: Lasso, Ridge, Elastic Net

  • Robust regularization (for noise and outliers): Huber, Theil-Sen, RANSAC

  • High Complexity Models: RBF Kernel SVR, GBM with tight or relaxed regularization

from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, HuberRegressor, TheilSenRegressor, RANSACRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor

models = [
    # baseline
    LinearRegression(),

    # simple models with normal regularization
    Lasso(alpha=0.1, max_iter=2000),
    Ridge(alpha=100.0),
    ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=2000),

    # robust regressors
    HuberRegressor(epsilon=1.35, max_iter=1000),
    TheilSenRegressor(max_subpopulation=100),
    RANSACRegressor(estimator=LinearRegression(), random_state=42),

    # high complexity models with tight regularization
    SVR(kernel='rbf', C=0.1, epsilon=0.1, gamma=0.01, max_iter=5000),
    GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=4, subsample=0.7, n_iter_no_change=10, random_state=42),

    # high complexity models with relaxed regularization (for demonstration)
    SVR(kernel='rbf', C=100000, epsilon=0.1, max_iter=5000),
    GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=100, subsample=0.7, n_iter_no_change=100000, random_state=42),
]

Results: MSE

  1. Elastic Net: 63.0199

  2. Lasso: 63.2807

  3. Ridge: 73.3110

  4. GBM (max_depth: 4): 94.4799 ← Tight regularization worked.

  5. Linear Regression: 113.0903

  6. Huber: 130.1813

  7. GBM (max_depth: 100): 140.8863

  8. RANSAC: 141.1297

  9. SVR (Reg: C=0.1): 580.9976 ← Regularization slightly controlled the bound compared to the SVR (C=100000).

  10. SVR (Reg: C=100000): 847.8213

  11. Theil-Sen: 63028.5222

Elastic Net (left in the figure) performed the best (MSE 63.02), followed by other regularized, simple linear models like Lasso and Ridge, outperforming robust models (Huber, RANSAC).

GBM with regularization (right) also performed better (MSE 94.48) than the robust models or the baseline Linear Regression. This is because early stopping was applied on top of the tree-depth constraint, which acted as tighter regularization than in any other model.

Figure B. Model capacity on small, potentially complex dataset (Elastic Net, Regularized GBM) (Created by Kuriko IWAI)

On the other hand, another complex model, SVR (left), severely struggled, showing very high MSEs (580.99–847.82). As the approximation in the graph shows, kernel methods are prone to struggling to find stable, generalizable solutions over sparse data.

Theil-Sen (right) performed exceptionally poorly, suggesting its effectiveness is highly data-dependent under these constrained conditions. The right graph demonstrates that the model was operating in an unbounded loss space.

Figure C. Model capacity on small, potentially complex dataset (Regularized SVR, Theil-Sen) (Created by Kuriko IWAI)

Scenario 2: Abundant Data

Situation:

  • Large number of historical transaction records available.

  • Suspecting the true underlying patterns are very complex.

Insights from Generalization Bounds:

  • With abundant data, models with high capacity can effectively learn and generalize from complex underlying patterns without necessarily leading to overfitting.

  • Simpler models might struggle to learn the complex pattern even with an abundant dataset.

Strategies

  • Consider prioritizing complex models.

Experiment Setup

I generated a synthetic dataset of 2,000 noisy samples from a complex non-linear true function, fitted models with degree-30 polynomial features, and computed the MSE between the predictions and the true values.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

np.random.seed(42)

def true_function(X): # more complex true function than Scenario 1
    return (0.1*X + 0.5*X**2 + 0.02*X**3 - 0.005*X**4 + 8 * np.sin(X * 3.5) + 4 * np.cos(X * 1.5) + 2 * np.sin(X * 7))

X = np.random.rand(2000, 1) * 10
y = true_function(X) + np.random.randn(2000, 1) * 15

X_val = np.linspace(0, 10, 200).reshape(-1, 1)  # evaluation grid

# `model` is each estimator from the list in the next section
y_pred = make_pipeline(PolynomialFeatures(degree=30), model).fit(X, y).predict(X_val)
mse = mean_squared_error(y_pred, true_function(X_val))

Model Options

Based on our strategies, I selected the following models with hyperparameter tuning:

  • Baseline models: Linear Regression, Elastic Net, Huber

  • Tree-based Ensembles: Decision Tree, GBM, Random Forest

  • High Complexity Models: k-NN, RBF Kernel SVR, RBF Kernel Ridge, MLP Regressor

from sklearn.linear_model import LinearRegression, ElasticNet, HuberRegressor
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neural_network import MLPRegressor

models = [
    # baseline models
    LinearRegression(),
    ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=5000),
    HuberRegressor(epsilon=1.35, max_iter=5000),

    # tree family
    DecisionTreeRegressor(max_depth=5),  # as a baseline for trees
    GradientBoostingRegressor(n_estimators=500, learning_rate=0.01, max_depth=8, subsample=0.8, n_iter_no_change=20, random_state=42),
    RandomForestRegressor(n_estimators=500, max_depth=None, min_samples_leaf=1, random_state=42),

    # complex models
    KernelRidge(alpha=0.001, kernel='rbf', gamma=0.1),
    KNeighborsRegressor(n_neighbors=7, weights='distance'),
    SVR(kernel='rbf', C=100000, epsilon=0.1, gamma='scale', max_iter=10000),
    MLPRegressor(hidden_layer_sizes=(100, 50, 30, 20), activation='relu', solver='adam', max_iter=5000, random_state=42),
]

Results: MSE

  1. k-NN: 1.0725

  2. Random Forest: 1.1554

  3. GBM: 1.8373

  4. Decision Tree: 4.2163

  5. Linear Regression: 17.8602

  6. Elastic Net: 40.0838

  7. RBF Kernel SVR: 115.8335

  8. RBF Kernel Ridge: 129.1579

  9. Huber: 318.4782

  10. MLP Regressor: 5251.2137

Best Performers

k-NN (left) performed the best (MSE: 1.07) because it is a non-parametric model that can capture highly local, complex, and non-linear patterns by directly leveraging the large dataset. With sufficient data, its adaptive nature allows it to closely approximate the intricate true function.

Tree ensembles (Random Forest and GBM) also performed better than baseline models like Linear Regression. This is because, unlike linear models, they can effectively model complex, non-linear relationships. Their ensemble techniques of bagging and boosting further enhance their robustness and reduce variance, allowing them to generalize well to complex patterns across the large dataset.

Figure D. Model capacity on large, complex dataset (k-NN, Random Forest) (Created by Kuriko IWAI)

Worst Performers

Huber (left) didn’t perform well (MSE: 318.47) because it is fundamentally a linear model. Despite the abundant data, its linear nature limits its capacity to learn the complex true function, regardless of its robustness to outliers.

MLPRegressor (right) performed worst due to its sensitivity to feature scaling: without scaled inputs, the Adam optimizer struggles with numerical instability, leading to slow or failed convergence.

In the graphs, we can see that both models struggled in unbounded loss spaces.

Figure E. Model capacity on large, complex dataset (Huber, MLP Regressor) (Created by Kuriko IWAI)

Scenario 3: Diagnosing Overfitting

Situation

  • Model shows high performance on training data but poor performance on unseen test data.

  • This discrepancy suggests the model has learned the training data’s noise rather than the general underlying pattern.

Insights from Generalization Bounds

  • When the model’s capacity (complexity) is too high relative to the amount of effective signal in the training data, it overfits.

  • Overfitting leads to a large “generalization gap” — the difference between training error and test error.

  • Even with abundant data, if the model is excessively complex for the actual problem, it can still overfit to irrelevant noise.

Strategies

  • Reduce Model Complexity:

    • Regularization: Apply or increase regularization (e.g., L1, L2, Dropout) to penalize complex models.

    • Simpler Architecture: Choose a less complex model architecture or reduce the number of layers/parameters.

    • Feature Selection/Engineering: Reduce the number of input features, or create more robust features.

  • Increase Data (if feasible): Provide more diverse and representative training data.

  • Early Stopping: Monitor performance on a validation set and stop training when validation error starts to increase.
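Early stopping is straightforward to wire up in scikit-learn; GradientBoostingRegressor, for instance, exposes it directly through validation_fraction and n_iter_no_change. The sketch below is my own toy illustration of the strategy (dataset and hyperparameters are assumed, not taken from the experiments):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# toy dataset: noisy sinusoid (assumed for illustration)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(1000, 1))
y = 5 * np.sin(2 * X).ravel() + rng.normal(scale=2.0, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# hold out 10% of the training data internally and stop once the
# validation score fails to improve for 20 consecutive rounds
gbm = GradientBoostingRegressor(
    n_estimators=2000,      # upper bound, rarely reached with early stopping
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.1,
    n_iter_no_change=20,
    random_state=42,
).fit(X_tr, y_tr)

print("boosting rounds actually used:", gbm.n_estimators_)
print("test MSE:", mean_squared_error(y_te, gbm.predict(X_te)))
```

The fitted attribute n_estimators_ reports how many boosting rounds were actually run, typically far fewer than the 2000 allowed, which is exactly the capacity control the strategy list describes.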

Experiment Setup

I trained a highly complex polynomial regression model on a limited synthetic dataset and then evaluated its performance on both the training set and a separate test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def true_function(X):  # same as Scenario 2
    return (0.1*X + 0.5*X**2 + 0.02*X**3 - 0.005*X**4
            + 8 * np.sin(X * 3.5) + 4 * np.cos(X * 1.5) + 2 * np.sin(X * 7))

np.random.seed(42)
X = np.random.rand(500, 1) * 10
y = true_function(X) + np.random.randn(500, 1) * 15

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)

# `model` is swapped for each estimator listed in the next section;
# LinearRegression serves as a placeholder here
model = LinearRegression()
pipeline = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=50),
    model
).fit(X_train, y_train.ravel())

# train error
y_pred_train = pipeline.predict(X_train)
mse_train = mean_squared_error(y_train, y_pred_train)

# test error
y_pred_test = pipeline.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred_test)
```
Model Options

I selected three models (Decision Tree, SVR, GBM) with different regularization tuning with baseline models of Linear Regression and Elastic Net.

```python
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

models = [
    LinearRegression(),
    ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000),
    DecisionTreeRegressor(max_depth=None, min_samples_leaf=1, random_state=42),
    DecisionTreeRegressor(max_depth=5, min_samples_leaf=10, random_state=42),
    SVR(kernel='rbf', C=100000, epsilon=0.1, gamma='scale', max_iter=5000),
    SVR(kernel='rbf', C=0.1, epsilon=0.1, gamma='scale', max_iter=5000),
    GradientBoostingRegressor(
        n_estimators=500,
        learning_rate=0.1,
        max_depth=10,
        subsample=1.0,
        random_state=42
    ),
    GradientBoostingRegressor(
        n_estimators=500,
        learning_rate=0.01,
        max_depth=5,
        subsample=0.8,
        n_iter_no_change=20,
        random_state=42
    ),
]
```
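Each estimator is then run through the same degree-50 polynomial pipeline and scored on both splits. The loop below is my own sketch of that evaluation; for brevity it regenerates the data and uses an abbreviated model list (in the actual experiment, all eight estimators above are evaluated).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def true_function(X):
    return (0.1*X + 0.5*X**2 + 0.02*X**3 - 0.005*X**4
            + 8 * np.sin(X * 3.5) + 4 * np.cos(X * 1.5) + 2 * np.sin(X * 7))

np.random.seed(42)
X = np.random.rand(500, 1) * 10
y = true_function(X) + np.random.randn(500, 1) * 15
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)

# abbreviated list for illustration
models = [
    LinearRegression(),
    ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000),
    DecisionTreeRegressor(max_depth=5, min_samples_leaf=10, random_state=42),
]

results = {}
for model in models:
    pipeline = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=50),
        model,
    ).fit(X_train, y_train.ravel())
    mse_train = mean_squared_error(y_train, pipeline.predict(X_train))
    mse_test = mean_squared_error(y_test, pipeline.predict(X_test))
    results[model.__class__.__name__] = (mse_train, mse_test)
    print(f"{model.__class__.__name__}: train {mse_train:.4f}, "
          f"test {mse_test:.4f}, diff {mse_train - mse_test:.4f}")
```

The train-minus-test difference printed per model is the quantity used to rank generalization in the results below.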

Results: Generalization Loss (MSE)

  1. RBF Kernel SVR (C=0.1): Train 352.9335, Test 329.8245, Diff 23.1089

  2. Elastic Net: Train 226.9546, Test 283.2070, Diff -56.2525

  3. Linear Regression: Train 203.2903, Test 278.6601, Diff -75.3697

  4. GBM (max_depth=5): Train 170.7339, Test 255.5918, Diff -84.8579

  5. Decision Tree (max_depth=5): Train 189.2130, Test 301.5972, Diff -112.3842

  6. RBF Kernel SVR (C=100000): Train 335.6198, Test 554.4035, Diff -218.7837

  7. GBM (max_depth=10): Train 0.0001, Test 433.6032, Diff -433.6030

  8. Decision Tree (No max depth): Train 0.0000, Test 435.5088, Diff -435.5088

This analysis of regression models highlights key differences in their generalization abilities.

RBF Kernel SVR (C=0.1) showed the best generalization, with the smallest difference between its training and test MSE (23.1089). Its strong regularization (low C value) effectively constrained the model's complexity, preventing it from overfitting the training data and allowing it to generalize well to unseen examples.



Figure F. Model capacity on generalization (RBF Kernel SVR with C=0.1 and C=100,000) (Created by Kuriko IWAI)

GBM (specifically the max_depth=10 version) struggled the most, showing classic signs of severe overfitting. Its training MSE (0.0001) was almost perfect, indicating it completely memorized the training data and its noise, yet its test MSE (433.6032) was drastically higher, revealing poor generalization. This occurred because its high model capacity (max_depth=10) was not sufficiently controlled by other regularization techniques (such as early stopping or row subsampling), allowing it to fit the noise rather than the underlying pattern.



Figure G. Model capacity on generalization (GBMs) (Created by Kuriko IWAI)

Conclusion

Regression in machine learning focuses on finding optimal functions to map inputs to real-valued outputs, aiming to minimize true risk, often approximated by empirical risk.

This article explored diverse regression algorithms and, crucially, generalization bounds — theoretical assurances for a model’s performance on unseen data.

I highlighted how generalization bounds inform model selection, particularly through three scenarios:

1. Limited Data, Potentially Complex Patterns

With scarce data, regularized simpler models like Elastic Net or Lasso performed best. They effectively prevent overfitting and keep the model's complexity term within reasonable bounds, leading to a tighter generalization bound despite the complexity of the true underlying function.

2. Abundant Data, Complex Patterns

Abundant data allows high-capacity models to shine. k-NN and tree ensembles like Random Forest and GBM performed exceptionally, leveraging the rich dataset to learn intricate relationships. Here, sufficient data allows for a higher complexity term without necessarily leading to a loose generalization bound, as the empirical risk can reliably approximate the true risk.

3. Diagnosing Overfitting

A large discrepancy between training and test performance signals overfitting. Regularized RBF Kernel SVR demonstrated superior generalization by effectively controlling its complexity and achieving a tight train-test MSE difference.

The three scenarios underscore how balancing model capacity against the available data, whether by choosing simpler models, applying regularization, or leveraging abundant data for complex models, directly shapes a model's ability to generalize by influencing the complexity and confidence terms within the generalization bound.

Continue Your Learning

If you enjoyed this blog, these related entries will complete the picture:

Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Share What You Learned

Kuriko IWAI, "Regression Loss Functions & Regularization" in Kernel Labs

https://kuriko-iwai.com/regression-in-machine-learning


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.