Gradient Boosting Decoded: From Mathematical Foundations to Competitive Benchmarks
Explore core concepts and practical implementations for enhanced performance.
By Kuriko IWAI

Table of Contents
Introduction
What is Gradient Boosting
What is Ensemble Technique
How Gradient Boosting Works
Simulation
Wrapping Up

Introduction
Ensemble techniques are widely used in machine learning to improve the accuracy of model predictions.
In this article, I'll explore Gradient Boosting, a widely used ensemble method, both in theory and through coding simulations.
What is Gradient Boosting
Gradient Boosting (GB), also known as Gradient Boosting Machines (GBM), is a boosting-type ensemble method that captures complex non-linear dependencies by building weak models sequentially.
In the process, each weak model (typically a shallow decision tree with only a few terminal nodes) fits the residuals of the previous model using the gradient descent algorithm, improving the overall prediction.
Learn More: Prototyping Gradient Descent in Machine Learning
◼ Major Types of Gradient Boosting
Here are three major models categorized in the GBM family:
1) XGBoost (Extreme Gradient Boosting)
A dominant algorithm in the GB family.
An optimized, highly efficient implementation of gradient boosting, known for its speed and performance and for features like parallel processing, tree pruning, and L1/L2 regularization.
2) LightGBM (Light Gradient-Boosting Machine):
Excels with large datasets thanks to techniques like Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which significantly reduce computational overhead and speed up training.
Developed by Microsoft.
3) CatBoost (Categorical Boosting):
Handles categorical features by employing a permutation-driven approach and ordered boosting.
Often yields good results with minimal hyperparameter tuning.
Developed by Yandex.
What is Ensemble Technique
Before detailing the GBM, let us quickly cover the big picture of the ensemble techniques in machine learning.
Its basic concept is to combine multiple models (or neurons in deep learning) to improve overall prediction accuracy and robustness. Boosting is one such technique, but there are several others we can utilize.
◼ Core Frameworks of Ensemble Techniques
▫ In Machine Learning Context:
▫ a) Bagging (Bootstrap Aggregating)
Trains multiple models independently on bootstrap samples, then averages or votes on their predictions.
Goal: Reduces the variance.
Major Examples: Random Forest (A classic bagging method of an ensemble of decision trees)
▫ b) Boosting:
Trains models sequentially, each correcting errors of the last model.
Goal: Reduces the bias.
Major Examples: Gradient Boosting, AdaBoost (Adaptive Boosting, assigning weights to misclassified training samples)
▫ c) Stacking (Stacked Generalization):
- Combines predictions from multiple diverse base models by training a meta-model (or “meta-learner”) on the base models’ outputs.
▫ d) Voting:
A method to decide a “winner” across the models’ predictions.
Hard Voting (Majority Vote): The class with the most votes from individual models wins.
Soft Voting (Weighted Averaging): For classification tasks, a class with the highest average probability wins. For regression, it typically averages the predictions.
Learn More on Stacking & Voting: Ensemble Naive Bayes for Mixed Data Types
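As an illustration of the voting idea above, here is a minimal scikit-learn sketch on toy synthetic data (not the article's churn dataset). The estimator names 'lr' and 'dt' are arbitrary labels chosen for this example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Soft voting: average the base models' predicted class probabilities;
# switch to voting='hard' for a simple majority vote instead.
clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('dt', DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting='soft',
).fit(X, y)

proba = clf.predict_proba(X)  # averaged class probabilities, shape (200, 2)
preds = clf.predict(X)        # class with the highest averaged probability
```

Soft voting requires every base estimator to implement predict_proba, which both models here do.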
▫ In Deep Learning Context:
▫ e) Snapshot Ensemble:
Trains a single neural network, saves its weights at different points during training, and averages their predictions.
Avoids training multiple large networks.
▫ f) Weight Averaging (e.g., Stochastic Weight Averaging — SWA):
Averages the weights of multiple models (or different snapshots of a single model) after training.
The resulting averaged model often performs better than any individual model.
▫ g) Multi-Model Training with Diverse Architectures
- Trains several deep learning models with completely different architectures and combines their predictions via voting or stacking.
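To make the weight-averaging idea (f) concrete, here is a minimal NumPy sketch using hypothetical weight snapshots (made-up numbers for illustration, not real training output):

```python
import numpy as np

# Hypothetical snapshots: three sets of weights for the same 2-unit layer,
# saved at different points late in training.
snapshots = [
    np.array([0.90, -0.42]),
    np.array([0.88, -0.40]),
    np.array([0.92, -0.44]),
]

# SWA-style averaging: a single model whose weights are the element-wise
# mean of the snapshots (one averaged model, not an ensemble of models).
swa_weights = np.mean(snapshots, axis=0)
```

The key difference from snapshot ensembling (e) is that SWA averages the weights once and runs a single forward pass at inference, rather than averaging the predictions of several snapshots.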
How Gradient Boosting Works
The basic idea of Gradient Boosting is to keep building a refined ensemble model upon simple base models (I'll call them "weak learners") to minimize the overall loss.
Mathematically, this process is defined by the iteration where the algorithm keeps updating the prediction from the ensemble model (F) by adding a scaled output from a new weak learner added to the ensemble model in the current iteration loop (m):
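The update rule, reconstructed here in standard gradient-boosting notation from the definitions below:

```latex
F_m(x) = F_{m-1}(x) + \rho_m \, h(x; a_m)
```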
m: The current iteration index.
Fm(x): The updated ensemble model after the m-th iteration.
Fm−1(x): The ensemble model from the previous ((m−1)-th) iteration.
ρ_m: The scaling factor defining the step size toward a new weak learner.
h(x; a_m): A weak learner added to the ensemble model in the m-th iteration.
▫ Training Weak Learners
Weak learners (h) are trained to predict the pseudo-residuals (y~i, also called negative gradients) that represent the direction and magnitude of the steepest descent of the loss function at the current ensemble model's prediction for each training example.
Mathematically, these values are computed by taking the negative partial derivative (gradient) of the loss function (L) with respect to the ensemble model's prediction (F(xi)):
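In the standard formulation, the pseudo-residual for sample i at iteration m is:

```latex
\tilde{y}_i = -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}
```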
The algorithm then guides the new weak learner (h(x; a_m)) to correct these errors to the direction where the total loss is minimized - by adjusting its model parameters (a_m):
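In the standard formulation, the weak learner's parameters are fit by least squares against the pseudo-residuals:

```latex
a_m = \arg\min_{a,\,\beta} \sum_{i=1}^{N} \left[ \tilde{y}_i - \beta \, h(x_i; a) \right]^2
```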
In the formula, the scaling factor (β) indicates a learning rate or a weight assigned to the weak learner. In some implementations, β might be optimized separately or fixed.
▫ Optimizing Step Sizes
After training the weak learner, the algorithm decides the optimal step size (ρ) that defines how much the new weak learner should contribute to the ensemble model.
Mathematically, the optimal step size is found in a line search where the ensemble model minimizes the loss (L) between the updated prediction and its corresponding true label:
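The line search, in standard notation:

```latex
\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(x_i) + \rho \, h(x_i; a_m)\big)
```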
(y: the corresponding true label, a_m: the weak learner's parameters)
This is an important step: it prevents the ensemble model from taking too large a step toward a particular weak learner and, in consequence, overfitting to it.
The entire process looks like this:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure A. Iteration Process of Gradient Boosting (Using a simple decision tree as a weak learner, Image by Kuriko IWAI)
The figure shows that the algorithm:
makes a prediction with the current ensemble model,
computes residuals,
adds a new weak learner (h, colored in red, yellow, and green) to the ensemble model while applying the optimal step size, and
iterates these steps continuously.
Simulation
I’ll build the following six models using the scikit-learn, CatBoost, and Keras libraries and compare their performance:
Custom GB Classifier (CustomGB class),
XGBoost-style Classifier (scikit-learn's GradientBoostingClassifier),
LightGBM-style Classifier (scikit-learn's HistGradientBoostingClassifier),
CatBoost Classifier,
Logistic Regression (the primary baseline model), and
Deep Feedforward Network (the secondary baseline model).
Learn More: Building Deep Feedforward Network
◼ Defining the Custom Classifier
I’ll begin by defining the custom classifier with fit(), predict_proba(), and predict() methods.
In the iteration loop, I use binary cross-entropy as the loss function and a simplified computation of the residuals (rho).
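To see why the residuals simplify to y − p in the code below, take the binary cross-entropy loss with p = σ(F); its negative gradient with respect to F is exactly y − p:

```latex
L(y, F) = -\big[\, y \log \sigma(F) + (1 - y) \log\big(1 - \sigma(F)\big) \,\big],
\qquad
-\frac{\partial L}{\partial F} = y - \sigma(F)
```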
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class CustomGB:
    def __init__(self, learning_rate, n_estimators, max_depth=1):
        self.learning_rate = learning_rate
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.random_state = 42
        self.learners = []
        self.F_0 = None
        self.epsilon = 1e-10

    def fit(self, X, y):
        # initial prediction: log-odds of the positive-class rate
        self.F_0 = np.log(y.mean() / (1 - y.mean()))
        F_m = np.full(len(y), self.F_0)

        # starts the iteration
        for _ in range(self.n_estimators):
            # computes the residuals (negative gradient of BCE: y - p)
            p = np.exp(F_m) / (1 + np.exp(F_m))
            rho = y - p

            # adds a decision tree as a weak learner
            learner = DecisionTreeRegressor(
                max_depth=self.max_depth, random_state=self.random_state
            ).fit(X, rho)
            terminal_node_ids = learner.apply(X)

            # loops through the terminal nodes to calculate gamma and update F_m
            for j in np.unique(terminal_node_ids):
                current_id = terminal_node_ids == j

                # computes gamma: (Σ residuals / Σ p(1-p))
                gamma = rho[current_id].sum() / (
                    (p[current_id] * (1 - p[current_id])).sum() + self.epsilon
                )
                F_m[current_id] += self.learning_rate * gamma

                # replaces the prediction value in the tree's leaves
                learner.tree_.value[j, 0, 0] = gamma

            self.learners.append(learner)
        return self

    def predict_proba(self, X):
        F_m_pred = np.full(len(X), self.F_0)
        for learner in self.learners:
            F_m_pred += self.learning_rate * learner.predict(X)
        # converts final log-odds (F_m_pred) back to probabilities
        return np.exp(F_m_pred) / (1 + np.exp(F_m_pred))

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int)
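As a quick, self-contained sanity check of the initialization step: F_0 is the log-odds of the positive-class rate, so applying the sigmoid to it recovers y.mean() exactly (toy labels below are made up for illustration):

```python
import numpy as np

y = np.array([0, 0, 1, 1, 1])            # toy labels: 60% positive
F_0 = np.log(y.mean() / (1 - y.mean()))   # initial log-odds, as in CustomGB.fit()
p_0 = np.exp(F_0) / (1 + np.exp(F_0))     # sigmoid maps log-odds back to probability
```

Since the sigmoid inverts the log-odds, p_0 equals the base rate 0.6, confirming the ensemble starts from the class prior before any weak learner is added.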
◼ Preparing Datasets
I used the same dataset as in the voting and stacking article to compare performance, and generated the train, validation, and test datasets after applying the column transformations and SMOTE oversampling:
(2826, 61) (2826,) (500, 61) (500,) (500, 61) (500,)
To recap, the base dataset is a telecom churn data from the UC Irvine Machine Learning Repository (Licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license) with 3,500 data samples and 14 features:

Figure B. Sample dataset (Image source)
◼ Model Tuning
I set similar values for key hyperparameters across the models to compare them apples to apples. The code block below shows the base definitions of the models, trained on the preprocessed training samples.
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier

# shared hyperparameter values across the models
learning_rate = 0.01
n_estimators = 5000
max_depth = 1

# 1. Custom GBM
custom_gbm = CustomGB(
    learning_rate=learning_rate,
    n_estimators=n_estimators,
    max_depth=max_depth
).fit(X_train_processed, y_train)


# 2. XGBoost-style classifier (scikit-learn's GradientBoostingClassifier)
sklearn_xgb = GradientBoostingClassifier(
    loss='log_loss',              # explicitly stated
    learning_rate=learning_rate,
    n_estimators=n_estimators,    # default: 100
    subsample=1.0,
    criterion="friedman_mse",
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_depth=max_depth,
    min_impurity_decrease=0.0,
    validation_fraction=0.1,
    n_iter_no_change=None,
    tol=1e-4,
).fit(X_train_processed, y_train)


# 3. LightGBM-style classifier (scikit-learn's HistGradientBoostingClassifier)
sklearn_lgb = HistGradientBoostingClassifier(
    loss='log_loss',              # explicitly stated
    learning_rate=learning_rate,
    max_depth=max_depth,
    max_iter=n_estimators,        # default: 100; intentionally raised
    max_leaf_nodes=31,
    min_samples_leaf=20,
    l2_regularization=0.01,
    early_stopping=True,          # stop if the 10% validation split shows no gain for 10 consecutive iterations
    validation_fraction=0.1,
    n_iter_no_change=10
).fit(X_train_processed, y_train)

# 4. CatBoost classifier
cat = CatBoostClassifier(
    iterations=n_estimators,
    learning_rate=learning_rate,
    depth=max_depth,
    loss_function='Logloss',
    eval_metric='Accuracy',       # monitor accuracy on the validation set for early stopping
    random_seed=42,
    verbose=0,
    early_stopping_rounds=10
).fit(X_train_processed, y_train)


# 5. Logistic regression (baseline ML model)
sklearn_lr = LogisticRegression(
    penalty='l2',
    tol=1e-4,
    max_iter=n_estimators,
).fit(X_train_processed, y_train)


# 6. DFN (Keras, baseline DL model)
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input

keras_model = Sequential([
    Input(shape=(X_train_processed.shape[1],)),
    Dense(32, activation='relu'),
    Dropout(0.1),
    Dense(16, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid')
])
keras_model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
history = keras_model.fit(
    X_train_processed, y_train,
    epochs=n_estimators,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)
One notable point here is that the classifiers built on external libraries include regularization frameworks such as L1, L2, and early stopping. I set L2 terms and early stopping while keeping a relatively large iteration budget (n_estimators). In practice, a large budget paired with early stopping lets the ensemble accumulate enough weak learners to reach high accuracy without overfitting.
◼ Evaluation
I made predictions on the training, validation, and test datasets and computed accuracy scores.
from sklearn.metrics import accuracy_score

models = [custom_gbm, sklearn_xgb, sklearn_lgb, cat, sklearn_lr]
model_names = ['Custom GB', 'XGBoost', 'LightGBM', 'CatBoost', 'Logistic Regression']

for i, model in enumerate(models):
    y_pred_train = model.predict(X_train_processed)
    y_pred_val = model.predict(X_val_processed)
    y_pred_test = model.predict(X_test_processed)
    print(f'\n{model_names[i]}\nTrain: {accuracy_score(y_train, y_pred_train):.4f} '
          f'Val: {accuracy_score(y_val, y_pred_val):.4f} '
          f'Test: {accuracy_score(y_test, y_pred_test):.4f}')

# the Keras model is evaluated separately
loss_train, accuracy_train = keras_model.evaluate(X_train_processed, y_train, verbose=0)
loss_val, accuracy_val = keras_model.evaluate(X_val_processed, y_val, verbose=0)
loss_test, accuracy_test = keras_model.evaluate(X_test_processed, y_test, verbose=0)
print(f"\nDFN - Train Accuracy: {accuracy_train:.4f}, Test Accuracy: {accuracy_test:.4f}")
◼ Results
Custom GB Classifier: Train: 0.8960 Val: 0.9100, Test: 0.8940
XGBoost Classifier: Train: 0.8960 Val: 0.9100, Test: 0.8940
LightGBM Classifier: Train: 0.8981 Val: 0.8980, Test: 0.9020
CatBoost Classifier: Train: 0.8949 Val: 0.8960, Test: 0.8980
Logistic Regression (Baseline model): Train: 0.8638 Val: 0.8800, Test: 0.8520
DFN (Baseline DL model): Train: 0.9172, Val: 0.9060, Test: 0.8920
LightGBM showed the highest test accuracy (0.9020) among all models, outperforming other GBM variants and the DFN.
The other Gradient Boosting models (Custom GBM, XGBoost, CatBoost) consistently achieved strong test accuracies between 0.8940 and 0.8980, demonstrating robust performance.
The Deep Feedforward Network (DFN) (0.8920 test accuracy) performed competitively, though slightly below LightGBM and CatBoost in this comparison.
All Gradient Boosting models significantly outperformed the Logistic Regression Baseline (0.8520 test accuracy), highlighting their superior predictive power for this task.
Wrapping Up
Gradient Boosting Machines (GBMs) offer high flexibility for designing custom models due to their framework-like nature.
In the experiment, we saw the GB models outperform the baseline models with moderate tuning effort.
I'll list some considerations for GBMs to conclude this article.
◼ Bottlenecks of Gradient Boosting
▫ Time Complexity
Building and evaluating weak learners (especially decision trees) is time-consuming.
For traditional decision trees, splitting a single node generally involves sorting samples for each feature, resulting in a time complexity of O(n⋅m log m) (m: the number of samples at that node, n: the number of features).
LightGBM mitigates this by using histograms, reducing complexity to O(n⋅m).
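A minimal NumPy sketch of the histogram idea (an illustration of the concept, not LightGBM's actual implementation): discretize each feature into a fixed number of bins once, then search candidate splits over bin boundaries instead of over sorted sample values:

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=1000)  # one continuous feature, m = 1000 samples

# Histogram-based splitting: bin the feature into at most 255 buckets
# (LightGBM's default max_bin), then evaluate splits only at bin edges.
# Binning is O(m) per feature, versus O(m log m) for sorting.
n_bins = 255
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
bin_ids = np.digitize(feature, edges[1:-1])  # bin index for each sample
```

With only 255 candidate boundaries per feature regardless of sample count, split search cost stops growing with m log m, which is where the speedup on large datasets comes from.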
▫ Slow Learning and Evaluation Process
The sequential nature of GBMs makes them inherently difficult to parallelize during the learning phase unlike other ensemble methods like Random Forests.
On top of that, tens of thousands of iterations - common in accuracy-intensive applications - require evaluating all base learners for predictions, making real-time inference slow.
This creates a trade-off between model complexity and prediction speed in real applications.
▫ Lack of Smooth Continuous Base-Learners
GBMs currently lack fast and efficient implementations of smooth, continuous base-learners that can effectively capture interactions between variables.
Despite these computational challenges, GBMs remain highly applicable, offering strong predictive power and relatively easy interpretability, such as variable-importance insights into a problem.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Mastering the Bias-Variance Trade-Off: An Empirical Study of VC Dimension and Generalization Bounds
Regression Loss Functions & Regularization
A Deep Dive into KNN Optimization and Distance Metrics
Mastering Decision Trees: From Impurity Measures to Greedy Optimization
Random Forest Decoded: Architecture, Bagging, and Performance Benchmarks
Building Powerful Naive Bayes Ensembles for Mixed Datasets
Related Books for Further Understanding
These books cover a wide range of theory and practice, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Share What You Learned
Kuriko IWAI, "Gradient Boosting Decoded: From Mathematical Foundations to Competitive Benchmarks" in Kernel Labs
https://kuriko-iwai.com/gradient-boosting-machines-explained
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.





