Gradient Boosting Decoded: From Mathematical Foundations to Competitive Benchmarks

Explore core concepts and practical implementations for enhanced performance.

Machine Learning · Data Science · Python

By Kuriko IWAI


Table of Contents

Introduction
What is Gradient Boosting
Major Types of Gradient Boosting
What is Ensemble Technique
Core Frameworks of Ensemble Techniques
How Gradient Boosting Works
Simulation
Defining the Custom Classifier
Preparing Datasets
Model Tuning
Evaluation
Results
Wrapping Up
Bottlenecks of Gradient Boosting

Introduction

Ensemble techniques are widely used in machine learning to enhance the accuracy of model predictions.

In this article, I'll explore Gradient Boosting, a widely used ensemble technique, both in theory and through coding simulations.

What is Gradient Boosting

Gradient Boosting (GB), also known as Gradient Boosting Machines (GBM), is an ensemble method in the boosting category that captures complex non-linear dependencies by building weak models sequentially.

In the process, each of these weak models (typically a shallow decision tree with only a few terminal leaf nodes) tries to minimize the loss (residuals) of the previous model using gradient descent, improving the prediction overall.

Learn More: Prototyping Gradient Descent in Machine Learning

Major Types of Gradient Boosting

Here are three major models categorized in the GBM family:

1) XGBoost (Extreme Gradient Boosting)

  • A dominant algorithm in the GB family.

  • An optimized, highly efficient implementation of gradient boosting, known for speed, performance, and features like parallel processing, tree pruning, and regularization (L1 and L2).

2) LightGBM (Light Gradient-Boosting Machine):

  • Excels with large datasets thanks to techniques like Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which significantly reduce computational overhead and speed up training.

  • Developed by Microsoft.

3) CatBoost (Categorical Boosting):

  • Handles categorical features by employing a permutation-driven approach and ordered boosting.

  • Often yields good results with minimal hyperparameter tuning.

  • Developed by Yandex.

What is Ensemble Technique

Before detailing GBM, let's quickly cover the big picture of ensemble techniques in machine learning.

The basic concept is to combine multiple models (or neurons in deep learning) to improve overall prediction accuracy and robustness. Boosting is one such technique, but there are many options we can utilize.

Core Frameworks of Ensemble Techniques

In Machine Learning Context:

a) Bagging (Bootstrap Aggregating)

  • Trains multiple models independently on bootstrap samples, then averages/votes their predictions.

  • Goal: Reduces the variance.

  • Major Examples: Random Forest (a classic bagging method: an ensemble of decision trees)

b) Boosting:

  • Trains models sequentially, each correcting errors of the last model.

  • Goal: Reduces the bias.

  • Major Examples: Gradient Boosting, AdaBoost (Adaptive Boosting, assigning weights to misclassified training samples)
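The bagging/boosting contrast above can be sketched with scikit-learn's off-the-shelf ensembles. A minimal, illustrative sketch (the toy dataset and settings here are assumptions, not the churn data used later in this article):

```python
# Bagging vs. boosting in scikit-learn: independent trees on bootstrap
# samples (variance reduction) vs. sequential stumps that reweight the
# previous learner's mistakes (bias reduction).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# bagging: each tree is trained independently on a bootstrap sample
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0).fit(X, y)

# boosting: stumps are trained sequentially, each focusing on prior errors
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bag.score(X, y), boost.score(X, y))
```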

c) Stacking (Stacked Generalization):

  • Combines predictions from multiple diverse base models by training a meta-model (or “meta-learner”) on the base models’ outputs.

d) Voting:

  • A method to decide a “winner” across the models’ predictions.

  • Hard Voting (Majority Vote): The class with the most votes from individual models wins.

  • Soft Voting (Weighted Averaging): For classification tasks, a class with the highest average probability wins. For regression, it typically averages the predictions.
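Hard and soft voting can be sketched with scikit-learn's VotingClassifier; the base models and toy dataset below are illustrative assumptions:

```python
# Hard voting takes the majority class across base models; soft voting
# averages predicted class probabilities and picks the highest.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
base = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier(max_depth=3, random_state=42)),
    ('nb', GaussianNB()),
]

hard = VotingClassifier(estimators=base, voting='hard').fit(X, y)  # majority vote
soft = VotingClassifier(estimators=base, voting='soft').fit(X, y)  # averaged probabilities

print(hard.predict(X[:5]))
print(soft.predict_proba(X[:5]))  # per-class averaged probabilities
```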

Learn More on Stacking & Voting: Ensemble Naive Bayes for Mixed Data Types

In Deep Learning Context:

e) Snapshot Ensemble:

  • Trains a single neural network, saves its weights at different points during training, and averages the resulting predictions.

  • Avoids training multiple large networks.

f) Weight Averaging (e.g., Stochastic Weight Averaging — SWA):

  • Averages the weights of multiple models (or different snapshots of a single model) after training.

  • The resulting averaged model often performs better than any individual model.

g) Multi-Model Training with Diverse Architectures

  • Trains several deep learning models with completely different architectures and combines their predictions via voting or stacking.

How Gradient Boosting Works

The basic idea of Gradient Boosting is to keep building a refined ensemble model upon simple base models (I'll call them "weak learners") to minimize the overall loss.

Mathematically, this process is defined by an iteration in which the algorithm updates the prediction of the ensemble model (F) by adding a scaled output from a new weak learner at each iteration (m):

$$F_m(x) = \underbrace{F_{m-1}(x)}_{\text{previous ensemble model}} + \underbrace{\rho_m}_{\text{scaling factor (step size)}} \cdot \underbrace{h(x;\, a_m)}_{\text{new weak learner}}$$
  • m: The index of the current iteration.

  • F_m(x): The updated ensemble model after the m-th iteration.

  • F_{m-1}(x): The ensemble model from the previous ((m-1)-th) iteration.

  • ρ_m: The scaling factor defining the step size toward the new weak learner.

  • h(x; a_m): A weak learner added to the ensemble model in the m-th iteration.
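As a minimal numerical sketch of this update rule, assuming squared-error loss (for which the pseudo-residuals reduce to y − F_{m-1}(x)) and a fixed learning rate standing in for ρ_m:

```python
# Minimal gradient boosting loop for regression with squared-error loss.
# Shallow decision trees are the weak learners h; a fixed learning rate
# plays the role of the step size rho_m.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

F = np.full(200, y.mean())   # F_0: constant initial model
rho = 0.1                    # fixed step size
learners = []
for m in range(100):
    residuals = y - F        # negative gradient of squared-error loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + rho * h.predict(X)   # F_m = F_{m-1} + rho * h
    learners.append(h)

print(np.mean((y - F) ** 2))     # training MSE shrinks as m grows
```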

Training Weak Learners

Weak learners (h) are trained to predict the pseudo-residuals (ỹ_i, also called negative gradients) that represent the direction and magnitude of the steepest descent of the loss function at the current ensemble model's prediction for each training example.

Mathematically, these values are computed by taking the negative partial derivative (gradient) of the loss function (L) with respect to the ensemble model's prediction (F(xi​)):

$$\tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)} \quad \text{for } i = 1, \dots, N$$

The algorithm then guides the new weak learner (h(x; a_m)) to correct these errors in the direction where the total loss is minimized, by adjusting its model parameters (a_m):

$$a_m = \underset{a,\,\beta}{\operatorname{argmin}} \sum_{i=1}^{N} \left[ \underbrace{\tilde{y}_i}_{\text{pseudo-residuals}} - \underbrace{\beta}_{\text{scaling factor}} \cdot h(x_i; a) \right]^2$$

In the formula, the scaling factor (β) indicates a learning rate or a weight assigned to the weak learner. In some implementations, β might be optimized separately or fixed.

Optimizing Step Sizes

After training the weak learner, the algorithm decides the optimal step size (ρ) that defines how much the new weak learner should contribute to the ensemble model.

Mathematically, the optimal step size is found in a line search where the ensemble model minimizes the loss (L) between the updated prediction and its corresponding true label:

$$\rho_m = \underset{\rho}{\operatorname{argmin}} \sum_{i=1}^{N} L\left(y_i,\; \underbrace{F_{m-1}(x_i) + \rho \cdot h(x_i; a_m)}_{\text{updated prediction with the new weak learner}}\right)$$

(y_i: the corresponding true label, a_m: the weak learner's parameters)

This is an important step: it prevents the ensemble model from taking too large a step toward any single weak learner and, as a consequence, overfitting to it.
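The line search can be approximated with a simple grid over candidate step sizes; the sketch below assumes squared-error loss and synthetic predictions (for squared loss a closed form exists, but the grid makes the idea explicit):

```python
# Crude line search for the step size rho_m: evaluate the total loss
# over a grid of candidates and keep the minimizer.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=50)
F_prev = np.zeros(50)                     # F_{m-1}(x_i)
h_pred = y + rng.normal(0, 0.5, size=50)  # new weak learner's output h(x_i; a_m)

candidates = np.linspace(0.0, 2.0, 201)
losses = [np.sum((y - (F_prev + rho * h_pred)) ** 2) for rho in candidates]
rho_m = candidates[int(np.argmin(losses))]
print(rho_m)   # the step size that minimizes the loss on this grid
```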


The entire process looks like this:

Figure A. Iteration Process of Gradient Boosting (Using a simple decision tree as a weak learner, Image by Kuriko IWAI)


The figure shows that the algorithm:

  • makes a prediction with the current ensemble model,

  • computes residuals,

  • adds a new weak learner (h, colored in red, yellow, and green) to the ensemble model while applying the optimal step size, and

  • iterates these steps continuously.

Simulation

I’ll build the following six models using the scikit-learn, CatBoost, and Keras libraries and compare their performance.

  1. Custom GB Classifier (CustomGB class),

  2. XGBoost Classifier

  3. LightGBM Classifier

  4. CatBoost Classifier

  5. Logistic Regression (as the primary baseline model)

  6. Deep Feedforward Network (as the secondary baseline model).

Learn More: Building a Deep Feedforward Network

Defining the Custom Classifier

I’ll begin by defining the custom classifier with fit(), predict_proba(), and predict() methods.

In the iteration loop, I define the binary cross-entropy loss as the loss function and simplify the computation of the residuals (rho).
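For reference, the residual formula used in fit() follows from differentiating the binary cross-entropy on the log-odds scale, with p_i = σ(F(x_i)):

```latex
% Binary cross-entropy on the log-odds scale, with p_i = \sigma(F(x_i)):
L(y_i, F(x_i)) = -\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]
% Its derivative with respect to F(x_i) is p_i - y_i, so the
% pseudo-residual is:
\tilde{y}_i = -\frac{\partial L}{\partial F(x_i)} = y_i - p_i
% The leaf value gamma is a one-step Newton update, dividing by the
% second derivative p_i (1 - p_i) summed over the leaf:
\gamma_j = \frac{\sum_{i \in \text{leaf } j} (y_i - p_i)}{\sum_{i \in \text{leaf } j} p_i (1 - p_i)}
```

This is why the code below computes rho = y - p and gamma as Σ residuals / Σ p(1-p).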

import numpy as np
from sklearn.tree import DecisionTreeRegressor

class CustomGB:
    def __init__(self, learning_rate, n_estimators, max_depth=1):
        self.learning_rate = learning_rate
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.random_state = 42
        self.learners = []
        self.F_0 = None
        self.epsilon = 1e-10

    def fit(self, X, y):
        # F_0: initial prediction as the log-odds of the positive class
        self.F_0 = np.log(y.mean() / (1 - y.mean()))
        F_m = np.full(len(y), self.F_0)

        # starts the iteration
        for _ in range(self.n_estimators):
            # computes the pseudo-residuals (y - p)
            p = np.exp(F_m) / (1 + np.exp(F_m))
            rho = y - p

            # adds a decision tree as a weak learner
            learner = DecisionTreeRegressor(max_depth=self.max_depth, random_state=self.random_state).fit(X, rho)
            terminal_node_ids = learner.apply(X)

            # loops through the terminal nodes to calculate gamma and update F_m
            for j in np.unique(terminal_node_ids):
                current_id = terminal_node_ids == j

                # computes gamma: (Σ residuals / Σ p(1-p))
                gamma = rho[current_id].sum() / ((p[current_id] * (1 - p[current_id])).sum() + self.epsilon)
                F_m[current_id] += self.learning_rate * gamma

                # replaces the prediction value in the tree's leaves
                learner.tree_.value[j, 0, 0] = gamma

            self.learners.append(learner)
        return self

    def predict_proba(self, X):
        F_m_pred = np.full(len(X), self.F_0)
        for learner in self.learners:
            F_m_pred += self.learning_rate * learner.predict(X)
        # converts final log-odds (F_m_pred) back to probabilities
        return np.exp(F_m_pred) / (1 + np.exp(F_m_pred))

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int)

Preparing Datasets

I used the same dataset as in the voting and stacking article so the performance can be compared, and generated the train, validation, and test datasets after applying column transformation and SMOTE oversampling:

(2826, 61) (2826,) (500, 61) (500,) (500, 61) (500,)

To recap, the base dataset is telecom churn data from the UC Irvine Machine Learning Repository (licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license) with 3,500 data samples and 14 features:

Figure B. Sample dataset (Image source)

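As a rough sketch of this preparation step (the column names below are hypothetical, and the SMOTE resampling from the imbalanced-learn package is noted in a comment rather than applied):

```python
# One-hot encode categorical columns, scale numeric ones, then carve out
# train / validation / test splits. The article additionally applies
# SMOTE (imbalanced-learn) to rebalance the training set.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'intl_plan': rng.choice(['yes', 'no'], size=1000),       # hypothetical categorical feature
    'day_minutes': rng.normal(180, 50, size=1000),           # hypothetical numeric feature
    'churn': rng.choice([0, 1], size=1000, p=[0.85, 0.15]),  # imbalanced target
})
X, y = df.drop(columns='churn'), df['churn']

ct = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['intl_plan']),
    ('num', StandardScaler(), ['day_minutes']),
])

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

X_train_processed = ct.fit_transform(X_train)  # fit on train only to avoid leakage
X_val_processed = ct.transform(X_val)
X_test_processed = ct.transform(X_test)
print(X_train_processed.shape, X_val_processed.shape, X_test_processed.shape)
```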

Model Tuning

I set similar values for key arguments across the models to compare performance apples to apples. The code block below shows the base definition of each model, trained on the preprocessed training samples.

from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier

# sets up the same values
learning_rate = 0.01
n_estimators = 5000
max_depth = 1

# 1. Custom GBM
custom_gbm = CustomGB(
    learning_rate=learning_rate,
    n_estimators=n_estimators,
    max_depth=max_depth
).fit(X_train_processed, y_train)


# 2. XGBoost-style classifier (using GradientBoostingClassifier from the scikit-learn library)
sklearn_xgb = GradientBoostingClassifier(
    loss='log_loss',                    # explicitly mentioned
    learning_rate=learning_rate,
    n_estimators=n_estimators,          # default: 100
    subsample=1.0,
    criterion="friedman_mse",
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_depth=max_depth,
    min_impurity_decrease=0.0,
    validation_fraction=0.1,
    n_iter_no_change=None,
    tol=1e-4,
).fit(X_train_processed, y_train)


# 3. LightGBM-style classifier (using HistGradientBoostingClassifier from the scikit-learn library)
sklearn_lgb = HistGradientBoostingClassifier(
    loss='log_loss',                    # explicitly mentioned
    learning_rate=learning_rate,
    max_depth=max_depth,
    max_iter=n_estimators,              # default is 100; intentionally changed
    max_leaf_nodes=31,
    min_samples_leaf=20,
    l2_regularization=0.01,
    early_stopping=True,                # stops early if the 10% validation split shows no improvement for 10 consecutive iterations
    validation_fraction=0.1,
    n_iter_no_change=10
).fit(X_train_processed, y_train)

# 4. CatBoost Classifier
cat = CatBoostClassifier(
    iterations=n_estimators,
    learning_rate=learning_rate,
    depth=max_depth,
    loss_function='Logloss',
    eval_metric='Accuracy',             # monitors accuracy for early stopping (takes effect when an eval_set is passed to fit)
    random_seed=42,
    verbose=0,
    early_stopping_rounds=10
).fit(X_train_processed, y_train)


# 5. Logistic Regression (as a baseline ML model)
sklearn_lr = LogisticRegression(
    penalty='l2',
    tol=1e-4,
    max_iter=n_estimators,
).fit(X_train_processed, y_train)


# 6. DFN (using the Keras library, as a baseline DL model)
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input

keras_model = Sequential([
    Input(shape=(X_train_processed.shape[1],)),
    Dense(32, activation='relu'),
    Dropout(0.1),
    Dense(16, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid')
])
keras_model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
history = keras_model.fit(
    X_train_processed, y_train,
    epochs=n_estimators,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

One noticeable point here is that the classifiers built on external libraries ship with regularization frameworks such as L1, L2, and early stopping. I set L2 terms and early stopping while keeping a relatively large number of iterations (n_estimators). In practice, keeping the iteration count high is important for improving the accuracy of the ensemble of weak learners.

Evaluation

I made predictions on the training, validation, and test datasets and computed the accuracy scores.

from sklearn.metrics import accuracy_score

# loops through the five classifiers defined above (names stored in model_names)
for i, model in enumerate(models):
    y_pred_train = model.predict(X_train_processed)
    y_pred_val = model.predict(X_val_processed)
    y_pred_test = model.predict(X_test_processed)
    print(f'\n{model_names[i]}\nTrain: {accuracy_score(y_train, y_pred_train):.4f} Val: {accuracy_score(y_val, y_pred_val):.4f} Test: {accuracy_score(y_test, y_pred_test):.4f}')

loss_train, accuracy_train = keras_model.evaluate(X_train_processed, y_train)
loss_val, accuracy_val = keras_model.evaluate(X_val_processed, y_val)
loss_test, accuracy_test = keras_model.evaluate(X_test_processed, y_test)
print(f"\nDFN - Train Accuracy: {accuracy_train:.4f}, Test Accuracy: {accuracy_test:.4f}")

Results

  1. Custom GB Classifier: Train: 0.8960 Val: 0.9100, Test: 0.8940

  2. XGBoost Classifier: Train: 0.8960 Val: 0.9100, Test: 0.8940

  3. LightGBM Classifier: Train: 0.8981 Val: 0.8980, Test: 0.9020

  4. CatBoost Classifier: Train: 0.8949 Val: 0.8960, Test: 0.8980

  5. Logistic Regression (baseline ML model): Train: 0.8638 Val: 0.8800, Test: 0.8520

  6. DFN (baseline DL model): Train: 0.9172, Val: 0.9060, Test: 0.8920

LightGBM showed the highest test accuracy (0.9020) among all models, outperforming other GBM variants and the DFN.

The other gradient boosting models (Custom GB, XGBoost, CatBoost) consistently achieved strong test accuracies between 0.8940 and 0.8980, demonstrating robust performance.

The deep feedforward network (DFN) performed competitively (0.8920 test accuracy), though slightly below LightGBM and CatBoost in this specific comparison.

All Gradient Boosting models significantly outperformed the Logistic Regression Baseline (0.8520 test accuracy), highlighting their superior predictive power for this task.

Wrapping Up

Gradient Boosting Machines (GBMs) offer high flexibility for designing custom models due to their framework-like nature.

In the experiment, we saw the GB models outperform the baseline models with moderate tuning effort.

I’ll conclude this article by listing some considerations for GBMs.

Bottlenecks of Gradient Boosting

Time Complexity

Building and evaluating weak learners (especially decision trees) is time consuming.

For traditional decision trees, splitting a single node generally involves sorting samples for each feature, resulting in a time complexity of O(n⋅m log m) (m: the number of samples at that node, n: the number of features).

LightGBM mitigates this by using histograms, reducing complexity to O(n⋅m).
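The histogram idea can be sketched as follows; the gradients here are illustrative random values, and 255 bins mirrors LightGBM's default `max_bin`:

```python
# Histogram-based split finding: bucket a feature into a fixed number of
# bins, accumulate gradient sums per bin, then scan the bin boundaries
# as candidate splits -- O(m) per feature instead of O(m log m) sorting.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)   # one feature, m samples
g = rng.normal(size=10_000)   # per-sample gradients (illustrative)

n_bins = 255
edges = np.histogram_bin_edges(x, bins=n_bins)
bin_ids = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

grad_sum = np.bincount(bin_ids, weights=g, minlength=n_bins)  # O(m) accumulation
count = np.bincount(bin_ids, minlength=n_bins)

# scan the (at most n_bins - 1) boundaries as candidate split points
left_grad = np.cumsum(grad_sum)[:-1]
left_count = np.cumsum(count)[:-1]
print(len(left_grad))  # 254 candidate splits instead of ~10,000 sorted thresholds
```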

Slow Learning and Evaluation Process

The sequential nature of GBMs makes them inherently difficult to parallelize during the learning phase, unlike other ensemble methods such as Random Forests.

On top of that, tens of thousands of iterations - common in accuracy-intensive applications - require evaluating all base learners for predictions, making real-time inference slow.

This creates a trade-off between model complexity and prediction speed in applications.

Lack of Smooth Continuous Base-Learners

GBMs currently lack fast and efficient implementations of smooth, continuous base-learners that can effectively capture interactions between variables.

Despite these computational challenges, GBMs remain highly applicable, offering strong predictive power and relatively easy interpretability that can provide valuable insights into problems.

Continue Your Learning

If you enjoyed this blog, these related entries will complete the picture:

Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Share What You Learned

Kuriko IWAI, "Gradient Boosting Decoded: From Mathematical Foundations to Competitive Benchmarks" in Kernel Labs

https://kuriko-iwai.com/gradient-boosting-machines-explained


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.