Gradient Boosting Decoded: From Mathematical Foundations to Competitive Benchmarks

Explore core concepts and practical implementations for enhanced performance.

Machine Learning · Data Science · Python

By Kuriko IWAI


Table of Contents

Introduction
What is Gradient Boosting
Major Types of Gradient Boosting
What is Ensemble Technique
Core Frameworks of Ensemble Techniques
How Gradient Boosting Works
Simulation
Defining the Custom Classifier
Preparing Datasets
Model Tuning
Evaluation
Results
Wrapping Up
Bottlenecks of Gradient Boosting

Introduction

Ensemble techniques are widely used in machine learning to enhance the accuracy of model predictions.

In this article, I'll explore Gradient Boosting, a widely used ensemble technique, both in theory and through coding simulations.

What is Gradient Boosting

Gradient Boosting (GB), also known as Gradient Boosting Machines (GBM), is an ensemble method in the boosting category that captures complex non-linear dependencies by building weak models sequentially.

In the process, each of these weak models (typically a shallow decision tree with only a few terminal leaf nodes) tries to minimize the loss (residuals) of the previous model using gradient descent, improving the prediction overall.

Learn More: Prototyping Gradient Descent in Machine Learning

Major Types of Gradient Boosting

Here are three major models categorized in the GBM family:

1) XGBoost (Extreme Gradient Boosting)

  • A dominant algorithm in the GB family.

  • An optimized, highly efficient implementation of gradient boosting, known for speed, performance, and features like parallel processing, tree pruning, and regularization (L1 and L2).

2) LightGBM (Light Gradient-Boosting Machine):

  • Excels with large datasets thanks to techniques like Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which significantly reduce computational overhead and speed up training.

  • Developed by Microsoft.

3) CatBoost (Categorical Boosting):

  • Handles categorical features by employing a permutation-driven approach and ordered boosting.

  • Often yields good results with minimal hyperparameter tuning.

  • Developed by Yandex.

What is Ensemble Technique

Before detailing GBM, let's quickly cover the big picture of ensemble techniques in machine learning.

The basic concept is to combine multiple models (or neurons in deep learning) to improve overall prediction accuracy and robustness. Boosting is one such technique, but there are many options we can utilize.

Core Frameworks of Ensemble Techniques

In Machine Learning Context:

a) Bagging (Bootstrap Aggregating)

  • Trains multiple models independently on bootstrap samples, then averages/votes their predictions.

  • Goal: Reduces the variance.

  • Major Examples: Random Forest (a classic bagging method: an ensemble of decision trees)

b) Boosting:

  • Trains models sequentially, each correcting errors of the last model.

  • Goal: Reduces the bias.

  • Major Examples: Gradient Boosting, AdaBoost (Adaptive Boosting, assigning weights to misclassified training samples)
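The bagging/boosting contrast above can be sketched with scikit-learn's off-the-shelf ensembles. A minimal, illustrative sketch (the toy dataset and settings here are assumptions, not the churn data used later in this article):

```python
# Bagging vs. boosting in scikit-learn: independent trees on bootstrap
# samples (variance reduction) vs. sequential stumps that reweight the
# previous learner's mistakes (bias reduction).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# bagging: each tree is trained independently on a bootstrap sample
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0).fit(X, y)

# boosting: stumps are trained sequentially, each focusing on prior errors
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bag.score(X, y), boost.score(X, y))
```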

c) Stacking (Stacked Generalization):

  • Combines predictions from multiple diverse base models by training a meta-model (or “meta-learner”) on the base models’ outputs.

d) Voting:

  • A method to decide a “winner” across the models’ predictions.

  • Hard Voting (Majority Vote): The class with the most votes from individual models wins.

  • Soft Voting (Weighted Averaging): For classification tasks, a class with the highest average probability wins. For regression, it typically averages the predictions.
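Hard and soft voting can be sketched with scikit-learn's VotingClassifier; the base models and toy dataset below are illustrative assumptions:

```python
# Hard voting takes the majority class across base models; soft voting
# averages predicted class probabilities and picks the highest.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
base = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier(max_depth=3, random_state=42)),
    ('nb', GaussianNB()),
]

hard = VotingClassifier(estimators=base, voting='hard').fit(X, y)  # majority vote
soft = VotingClassifier(estimators=base, voting='soft').fit(X, y)  # averaged probabilities

print(hard.predict(X[:5]))
print(soft.predict_proba(X[:5]))  # per-class averaged probabilities
```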

Learn More on Stacking & Voting: Ensemble Naive Bayes for Mixed Data Types

In Deep Learning Context:

e) Snapshot Ensemble:

  • Trains a single neural network, saves its weights at different points during training, and averages the resulting predictions.

  • Avoids training multiple large networks.

f) Weight Averaging (e.g., Stochastic Weight Averaging — SWA):

  • Averages the weights of multiple models (or different snapshots of a single model) after training.

  • The resulting averaged model often performs better than any individual model.

g) Multi-Model Training with Diverse Architectures

  • Trains several deep learning models with completely different architectures and combines their predictions via voting or stacking.

How Gradient Boosting Works

The basic idea of Gradient Boosting is to keep building a refined ensemble model upon simple base models (I'll call them "weak learners") to minimize the overall loss.

Mathematically, this process is defined by an iteration in which the algorithm updates the prediction of the ensemble model (F) by adding a scaled output from a new weak learner at each iteration (m):

$$F_m(x) = \underbrace{F_{m-1}(x)}_{\text{previous ensemble model}} + \underbrace{\rho_m}_{\text{scaling factor (step size)}} \cdot \underbrace{h(x;\, a_m)}_{\text{new weak learner}}$$
  • m: The index of the current iteration.

  • F_m(x): The updated ensemble model after the m-th iteration.

  • F_{m-1}(x): The ensemble model from the previous ((m-1)-th) iteration.

  • ρ_m: The scaling factor defining the step size toward the new weak learner.

  • h(x; a_m): A weak learner added to the ensemble model in the m-th iteration.
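As a minimal numerical sketch of this update rule, assuming squared-error loss (for which the pseudo-residuals reduce to y − F_{m-1}(x)) and a fixed learning rate standing in for ρ_m:

```python
# Minimal gradient boosting loop for regression with squared-error loss.
# Shallow decision trees are the weak learners h; a fixed learning rate
# plays the role of the step size rho_m.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

F = np.full(200, y.mean())   # F_0: constant initial model
rho = 0.1                    # fixed step size
learners = []
for m in range(100):
    residuals = y - F        # negative gradient of squared-error loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + rho * h.predict(X)   # F_m = F_{m-1} + rho * h
    learners.append(h)

print(np.mean((y - F) ** 2))     # training MSE shrinks as m grows
```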

Training Weak Learners

Weak learners (h) are trained to predict the pseudo-residuals (ỹ_i, also called negative gradients) that represent the direction and magnitude of the steepest descent of the loss function at the current ensemble model's prediction for each training example.

Mathematically, these values are computed by taking the negative partial derivative (gradient) of the loss function (L) with respect to the ensemble model's prediction (F(xi​)):

$$\tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)} \quad \text{for } i = 1, \dots, N$$

The algorithm then guides the new weak learner (h(x; a_m)) to correct these errors in the direction where the total loss is minimized, by adjusting its model parameters (a_m):

$$a_m = \underset{a,\,\beta}{\operatorname{argmin}} \sum_{i=1}^{N} \left[ \underbrace{\tilde{y}_i}_{\text{pseudo-residuals}} - \underbrace{\beta}_{\text{scaling factor}} \cdot h(x_i; a) \right]^2$$

In the formula, the scaling factor (β) indicates a learning rate or a weight assigned to the weak learner. In some implementations, β might be optimized separately or fixed.

Optimizing Step Sizes

After training the weak learner, the algorithm decides the optimal step size (ρ) that defines how much the new weak learner should contribute to the ensemble model.

Mathematically, the optimal step size is found in a line search where the ensemble model minimizes the loss (L) between the updated prediction and its corresponding true label:

$$\rho_m = \underset{\rho}{\operatorname{argmin}} \sum_{i=1}^{N} L\left(y_i,\; \underbrace{F_{m-1}(x_i) + \rho \cdot h(x_i; a_m)}_{\text{updated prediction with the new weak learner}}\right)$$

(y_i: the corresponding true label, a_m: the weak learner's parameters)

This is an important step: it prevents the ensemble model from taking too large a step toward any single weak learner and, as a consequence, overfitting to it.
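The line search can be approximated with a simple grid over candidate step sizes; the sketch below assumes squared-error loss and synthetic predictions (for squared loss a closed form exists, but the grid makes the idea explicit):

```python
# Crude line search for the step size rho_m: evaluate the total loss
# over a grid of candidates and keep the minimizer.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=50)
F_prev = np.zeros(50)                     # F_{m-1}(x_i)
h_pred = y + rng.normal(0, 0.5, size=50)  # new weak learner's output h(x_i; a_m)

candidates = np.linspace(0.0, 2.0, 201)
losses = [np.sum((y - (F_prev + rho * h_pred)) ** 2) for rho in candidates]
rho_m = candidates[int(np.argmin(losses))]
print(rho_m)   # the step size that minimizes the loss on this grid
```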


The entire process looks like this:

Figure A. Iteration Process of Gradient Boosting (Using a simple decision tree as a weak learner, Image by Kuriko IWAI)


The figure shows that the algorithm:

  • makes a prediction with the current ensemble model,

  • computes residuals,

  • adds a new weak learner (h, colored in red, yellow, and green) to the ensemble model while applying the optimal step size, and

  • iterates these steps continuously.

Simulation

I’ll build the following six models using the scikit-learn, CatBoost, and Keras libraries and compare their performance.

  1. Custom GB Classifier (CustomGB class),

  2. XGBoost Classifier

  3. LightGBM Classifier

  4. CatBoost Classifier

  5. Logistic Regression (as the primary baseline model)

  6. Deep Feedforward Network (as the secondary baseline model).

Learn More: Building a Deep Feedforward Network

Defining the Custom Classifier

I’ll begin by defining the custom classifier with fit(), predict_proba(), and predict() methods.

In the iteration loop, I define the binary cross-entropy loss as the loss function and simplify the computation of the residuals (rho).
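For reference, the residual formula used in fit() follows from differentiating the binary cross-entropy on the log-odds scale, with p_i = σ(F(x_i)):

```latex
% Binary cross-entropy on the log-odds scale, with p_i = \sigma(F(x_i)):
L(y_i, F(x_i)) = -\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]
% Its derivative with respect to F(x_i) is p_i - y_i, so the
% pseudo-residual is:
\tilde{y}_i = -\frac{\partial L}{\partial F(x_i)} = y_i - p_i
% The leaf value gamma is a one-step Newton update, dividing by the
% second derivative p_i (1 - p_i) summed over the leaf:
\gamma_j = \frac{\sum_{i \in \text{leaf } j} (y_i - p_i)}{\sum_{i \in \text{leaf } j} p_i (1 - p_i)}
```

This is why the code below computes rho = y - p and gamma as Σ residuals / Σ p(1-p).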

import numpy as np
from sklearn.tree import DecisionTreeRegressor

class CustomGB:
    def __init__(self, learning_rate, n_estimators, max_depth=1):
        self.learning_rate = learning_rate
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.random_state = 42
        self.learners = []
        self.F_0 = None
        self.epsilon = 1e-10

    def fit(self, X, y):
        # F_0: initial prediction as the log-odds of the positive class
        self.F_0 = np.log(y.mean() / (1 - y.mean()))
        F_m = np.full(len(y), self.F_0)

        # starts the iteration
        for _ in range(self.n_estimators):
            # computes the pseudo-residuals (y - p)
            p = np.exp(F_m) / (1 + np.exp(F_m))
            rho = y - p

            # adds a decision tree as a weak learner
            learner = DecisionTreeRegressor(max_depth=self.max_depth, random_state=self.random_state).fit(X, rho)
            terminal_node_ids = learner.apply(X)

            # loops through the terminal nodes to calculate gamma and update F_m
            for j in np.unique(terminal_node_ids):
                current_id = terminal_node_ids == j

                # computes gamma: (Σ residuals / Σ p(1-p))
                gamma = rho[current_id].sum() / ((p[current_id] * (1 - p[current_id])).sum() + self.epsilon)
                F_m[current_id] += self.learning_rate * gamma

                # replaces the prediction value in the tree's leaves
                learner.tree_.value[j, 0, 0] = gamma

            self.learners.append(learner)
        return self

    def predict_proba(self, X):
        F_m_pred = np.full(len(X), self.F_0)
        for learner in self.learners:
            F_m_pred += self.learning_rate * learner.predict(X)
        # converts final log-odds (F_m_pred) back to probabilities
        return np.exp(F_m_pred) / (1 + np.exp(F_m_pred))

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int)

Preparing Datasets

I used the same dataset as in the voting and stacking article so the performance can be compared, and generated the train, validation, and test datasets after applying column transformation and SMOTE oversampling:

(2826, 61) (2826,) (500, 61) (500,) (500, 61) (500,)

To recap, the base dataset is telecom churn data from the UC Irvine Machine Learning Repository (licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license) with 3,500 data samples and 14 features:

Figure B. Sample dataset (Image source)

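As a rough sketch of this preparation step (the column names below are hypothetical, and the SMOTE resampling from the imbalanced-learn package is noted in a comment rather than applied):

```python
# One-hot encode categorical columns, scale numeric ones, then carve out
# train / validation / test splits. The article additionally applies
# SMOTE (imbalanced-learn) to rebalance the training set.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'intl_plan': rng.choice(['yes', 'no'], size=1000),       # hypothetical categorical feature
    'day_minutes': rng.normal(180, 50, size=1000),           # hypothetical numeric feature
    'churn': rng.choice([0, 1], size=1000, p=[0.85, 0.15]),  # imbalanced target
})
X, y = df.drop(columns='churn'), df['churn']

ct = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['intl_plan']),
    ('num', StandardScaler(), ['day_minutes']),
])

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

X_train_processed = ct.fit_transform(X_train)  # fit on train only to avoid leakage
X_val_processed = ct.transform(X_val)
X_test_processed = ct.transform(X_test)
print(X_train_processed.shape, X_val_processed.shape, X_test_processed.shape)
```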

Model Tuning

I set similar values for key arguments across the models to compare performance apples to apples. The code block below shows the base definition of each model, trained on the preprocessed training samples.

from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier

# sets up the same values
learning_rate = 0.01
n_estimators = 5000
max_depth = 1

# 1. Custom GBM
custom_gbm = CustomGB(
    learning_rate=learning_rate,
    n_estimators=n_estimators,
    max_depth=max_depth
).fit(X_train_processed, y_train)


# 2. XGBoost-style classifier (using GradientBoostingClassifier from the scikit-learn library)
sklearn_xgb = GradientBoostingClassifier(
    loss='log_loss',                    # explicitly mentioned
    learning_rate=learning_rate,
    n_estimators=n_estimators,          # default: 100
    subsample=1.0,
    criterion="friedman_mse",
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_depth=max_depth,
    min_impurity_decrease=0.0,
    validation_fraction=0.1,
    n_iter_no_change=None,
    tol=1e-4,
).fit(X_train_processed, y_train)


# 3. LightGBM-style classifier (using HistGradientBoostingClassifier from the scikit-learn library)
sklearn_lgb = HistGradientBoostingClassifier(
    loss='log_loss',                    # explicitly mentioned
    learning_rate=learning_rate,
    max_depth=max_depth,
    max_iter=n_estimators,              # default is 100; intentionally changed
    max_leaf_nodes=31,
    min_samples_leaf=20,
    l2_regularization=0.01,
    early_stopping=True,                # stops early if the 10% validation split shows no improvement for 10 consecutive iterations
    validation_fraction=0.1,
    n_iter_no_change=10
).fit(X_train_processed, y_train)

# 4. CatBoost Classifier
cat = CatBoostClassifier(
    iterations=n_estimators,
    learning_rate=learning_rate,
    depth=max_depth,
    loss_function='Logloss',
    eval_metric='Accuracy',             # monitors accuracy for early stopping (takes effect when an eval_set is passed to fit)
    random_seed=42,
    verbose=0,
    early_stopping_rounds=10
).fit(X_train_processed, y_train)


# 5. Logistic Regression (as a baseline ML model)
sklearn_lr = LogisticRegression(
    penalty='l2',
    tol=1e-4,
    max_iter=n_estimators,
).fit(X_train_processed, y_train)


# 6. DFN (using the Keras library, as a baseline DL model)
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input

keras_model = Sequential([
    Input(shape=(X_train_processed.shape[1],)),
    Dense(32, activation='relu'),
    Dropout(0.1),
    Dense(16, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid')
])
keras_model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
history = keras_model.fit(
    X_train_processed, y_train,
    epochs=n_estimators,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

One noticeable point here is that the classifiers built on external libraries ship with regularization frameworks such as L1, L2, and early stopping. I set L2 terms and early stopping while keeping a relatively large number of iterations (n_estimators). In practice, keeping the iteration count high is important for improving the accuracy of the ensemble of weak learners.

Evaluation

I made predictions on the training, validation, and test datasets and computed the accuracy scores.

from sklearn.metrics import accuracy_score

# loops through the five classifiers defined above (names stored in model_names)
for i, model in enumerate(models):
    y_pred_train = model.predict(X_train_processed)
    y_pred_val = model.predict(X_val_processed)
    y_pred_test = model.predict(X_test_processed)
    print(f'\n{model_names[i]}\nTrain: {accuracy_score(y_train, y_pred_train):.4f} Val: {accuracy_score(y_val, y_pred_val):.4f} Test: {accuracy_score(y_test, y_pred_test):.4f}')

loss_train, accuracy_train = keras_model.evaluate(X_train_processed, y_train)
loss_val, accuracy_val = keras_model.evaluate(X_val_processed, y_val)
loss_test, accuracy_test = keras_model.evaluate(X_test_processed, y_test)
print(f"\nDFN - Train Accuracy: {accuracy_train:.4f}, Test Accuracy: {accuracy_test:.4f}")

Results

  1. Custom GB Classifier: Train: 0.8960 Val: 0.9100, Test: 0.8940

  2. XGBoost Classifier: Train: 0.8960 Val: 0.9100, Test: 0.8940

  3. LightGBM Classifier: Train: 0.8981 Val: 0.8980, Test: 0.9020

  4. CatBoost Classifier: Train: 0.8949 Val: 0.8960, Test: 0.8980

  5. Logistic Regression (baseline ML model): Train: 0.8638 Val: 0.8800, Test: 0.8520

  6. DFN (baseline DL model): Train: 0.9172, Val: 0.9060, Test: 0.8920

LightGBM showed the highest test accuracy (0.9020) among all models, outperforming other GBM variants and the DFN.

The other gradient boosting models (Custom GB, XGBoost, CatBoost) consistently achieved strong test accuracies between 0.8940 and 0.8980, demonstrating robust performance.

The deep feedforward network (DFN) performed competitively (0.8920 test accuracy), though slightly below LightGBM and CatBoost in this specific comparison.

All Gradient Boosting models significantly outperformed the Logistic Regression Baseline (0.8520 test accuracy), highlighting their superior predictive power for this task.

Wrapping Up

Gradient Boosting Machines (GBMs) offer high flexibility for designing custom models due to their framework-like nature.

In the experiment, we saw the GB models outperform the baseline models with moderate tuning effort.

I’ll conclude this article by listing some considerations for GBMs.

Bottlenecks of Gradient Boosting

Time Complexity

Building and evaluating weak learners (especially decision trees) is time consuming.

For traditional decision trees, splitting a single node generally involves sorting samples for each feature, resulting in a time complexity of O(n⋅m log m) (m: the number of samples at that node, n: the number of features).

LightGBM mitigates this by using histograms, reducing complexity to O(n⋅m).
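The histogram idea can be sketched as follows; the gradients here are illustrative random values, and 255 bins mirrors LightGBM's default `max_bin`:

```python
# Histogram-based split finding: bucket a feature into a fixed number of
# bins, accumulate gradient sums per bin, then scan the bin boundaries
# as candidate splits -- O(m) per feature instead of O(m log m) sorting.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)   # one feature, m samples
g = rng.normal(size=10_000)   # per-sample gradients (illustrative)

n_bins = 255
edges = np.histogram_bin_edges(x, bins=n_bins)
bin_ids = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

grad_sum = np.bincount(bin_ids, weights=g, minlength=n_bins)  # O(m) accumulation
count = np.bincount(bin_ids, minlength=n_bins)

# scan the (at most n_bins - 1) boundaries as candidate split points
left_grad = np.cumsum(grad_sum)[:-1]
left_count = np.cumsum(count)[:-1]
print(len(left_grad))  # 254 candidate splits instead of ~10,000 sorted thresholds
```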

Slow Learning and Evaluation Process

The sequential nature of GBMs makes them inherently difficult to parallelize during the learning phase, unlike other ensemble methods such as Random Forests.

On top of that, tens of thousands of iterations - common in accuracy-intensive applications - require evaluating all base learners for predictions, making real-time inference slow.

This creates a trade-off between model complexity and prediction speed in applications.

Lack of Smooth Continuous Base-Learners

GBMs currently lack fast and efficient implementations of smooth, continuous base-learners that can effectively capture interactions between variables.

Despite these computational challenges, GBMs remain highly applicable, offering strong predictive power and relatively easy interpretability that can provide valuable insights into problems.

Continue Your Learning

If you enjoyed this blog, these related entries will complete the picture:

Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Share What You Learned

Kuriko IWAI, "Gradient Boosting Decoded: From Mathematical Foundations to Competitive Benchmarks" in Kernel Labs

https://kuriko-iwai.com/gradient-boosting-machines-explained


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.