Random Forest Decoded: Architecture, Bagging, and Performance Benchmarks
Explore architecture, optimization strategies, and practical implications.
By Kuriko IWAI

Table of Contents
Introduction
What is Random Forest?
How Random Forest Algorithm Works

Introduction
Ensemble methods are common techniques in machine learning.
In this article, I’ll dive into Random Forest, a bagging ensemble technique, and compare it with Gradient Boosting Machine (GBM).
What is Random Forest?
Random Forest is a versatile supervised learning model that belongs to the bagging family of ensemble techniques.
At its core, a Random Forest builds and trains multiple decision trees in parallel, each on a random subset of the training samples.
By averaging or taking a majority vote over the predictions of these many trees, Random Forest significantly reduces variance and improves the robustness and accuracy of its predictions.
Its key characteristics include:
Supervised Learning Model: Requires labeled data for training.
Ensemble Method (Bagging): Combines predictions from multiple individual models (decision trees) to achieve better performance than any single model (Learn More: Ensemble Technique).
Non-Parametric: Makes no assumptions about the underlying data distribution and imposes no fixed functional form on the model.
Non-linearity: Capable of modeling complex, non-linear relationships within the data.
Versatile: Can handle both regression and classification problems.
Training in Parallel: Individual trees are trained simultaneously, speeding up the training process on appropriate hardware.
Random Forests are a popular choice in many fields because of their predictive abilities, versatility with different data types, and transparent feature importance scores. These scores are especially useful for feature selection and uncovering hidden data patterns.
How Random Forest Algorithm Works
A Random Forest works like a diverse team: instead of relying on a single expert, it mitigates bias in the decision-making process by pooling the opinions of many independent members.
The figure below illustrates the nature of individual trees (team members) within a Random Forest and how they relate to data points in a feature space.
I’ll take a randomly selected sample (the red star in the figure, called s) as an example to demonstrate how the ensemble makes a prediction.

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure A. Nature of individual trees in Random Forest architecture (Created by Kuriko IWAI)
◼ Bootstrapping Samples
The ensemble first draws a bootstrap sample, a completely randomized subset of the training samples, for each tree in the forest, and then builds and trains that tree on its bootstrap sample.
Bootstrap samples are drawn with replacement, meaning any data point might appear multiple times in the same bootstrap sample, while others might not appear at all.
For example, if the training samples have data points of all the alphabets (A, B, … Z), bootstrap samples with size five can look like:
Bootstrap 1: D, A, C, F, Z
Bootstrap 2: C, C, D, G, N
Bootstrap 3: B, D, F, Z, A
…
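The bootstrapping step above can be sketched in a few lines of Python. This is a minimal illustration only; the letters stand in for real data points:

```python
import random

random.seed(42)

# the full training set: one data point per letter of the alphabet
training_samples = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

# draw three bootstrap samples of size five, WITH replacement:
# a point may repeat within one sample while others are left out entirely
bootstraps = [random.choices(training_samples, k=5) for _ in range(3)]

for i, sample in enumerate(bootstraps, start=1):
    print(f"Bootstrap {i}: {', '.join(sample)}")
```

Because `random.choices` samples with replacement, a run may well produce a duplicate within one bootstrap sample, exactly as in Bootstrap 2 above.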
To ensure even more robustness, the ensemble also employs feature bootstrapping, where each tree considers only a random subset of the input features: subsets of size √p for classification tasks and p/3 for regression tasks (p: total number of features).
For instance, if the training samples have five features (p=5), each bootstrap sample has:
For classification tasks: floor(√5) = 2 → any two features among the five
For regression tasks: floor(5/3) = 1 → any one feature among the five
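These subset sizes can be computed directly; a minimal sketch of the rule of thumb above, with p = 5 as in the example:

```python
import math

p = 5  # total number of input features

# rule-of-thumb feature subset sizes, rounded down
n_features_classification = math.floor(math.sqrt(p))  # floor(sqrt(5)) = 2
n_features_regression = max(1, math.floor(p / 3))     # floor(5/3) = 1

print(n_features_classification)  # 2
print(n_features_regression)      # 1
```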
The feature bootstrapping is crucial for a Random Forest.
When the training dataset contains a few strong features that dominate the prediction, then no matter how randomly the ensemble splits the data points, all the trees may end up relying on those same features and generating similar, highly correlated predictions.
Feature bootstrapping decorrelates the trees and hence ensures robust learning by each tree.
***
In the figure, each of the three trees (T1, T2, T3) has 2,000 data points with 3 input features as its own bootstrap sample. These data points and features are randomly drawn and differ from tree to tree.
Assuming the data points are represented by letters, and the features are ‘weight‘, ‘height‘, ‘name’, ‘age‘, ‘score‘, and so on, each tree is assigned a unique combination of data points and features:
T1 - bootstrap sample 1: data points: D, A, C, F, Z,…, features: ‘weight‘, ‘score‘, ‘name‘
T2 - bootstrap sample 2: data points: C, C, D, G, N,…, features: ‘name‘, ‘height‘, ‘age‘
T3 - bootstrap sample 3: data points: B, D, F, Z, A,…, features: ‘score‘, ‘age‘, ‘height‘
This way, the ensemble secures robust learning of each tree.
◼ Out-of-Bag Error Estimation
Random Forest provides an internal mechanism for estimating the model error without the need for a separate validation set by securing the out-of-bag (OOB) samples.
OOB samples are the training samples a given tree did not use during its training.
In the figure, each tree (T1, T2, T3) has 1,000 OOB samples because it was trained on 2,000 bootstrap samples drawn from the 3,000 training samples.
The specific star sample (s) belongs to T2’s and T3’s OOB samples because neither tree used it for its own training.
The ensemble keeps track of these OOB samples, ensuring they are completely "masked" from the trees during the training to prevent any data leakage.
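In scikit-learn, this internal mechanism is exposed through the `oob_score` option. A minimal sketch on a synthetic stand-in dataset (the data and hyperparameters here are illustrative, not the article's churn data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in dataset; any labeled data would do here
X, y = make_classification(n_samples=3000, n_features=10, random_state=42)

# oob_score=True asks the forest to score each training sample using
# only the trees that did NOT see that sample during training
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

print(f"OOB accuracy estimate: {rf.oob_score_:.4f}")
```

The resulting `oob_score_` behaves like a built-in validation accuracy, obtained without holding out a separate validation set.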
◼ Parallel Training
Once each tree has its unique bootstrap samples, the trees are trained in parallel.
During the training, each tree grows independently and asymmetrically until it hits a stopping constraint, such as a given maximum depth (max_depth) or a minimum number of samples required at a leaf node (min_samples_leaf).
When these parameters aren't specified, trees can theoretically grow until every leaf node contains only one data point, which might lead to overfitting for individual trees.
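This overfitting tendency of a single unconstrained tree is easy to verify; a minimal sketch on synthetic data (the dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in dataset for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# with no depth limit and one-sample leaves allowed, the tree keeps
# splitting until every leaf is pure
tree = DecisionTreeClassifier(max_depth=None, min_samples_leaf=1, random_state=0)
tree.fit(X, y)

# perfect training accuracy: the hallmark of an overfit individual tree
print(tree.score(X, y))  # 1.0
```

The forest tolerates such overfit members because averaging many of them cancels out their individual errors.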
◼ Ensemble
After all the individual trees are built and trained, the ensemble makes a prediction for a data point by aggregating the predictions of the individual trees.
Crucially, when evaluating a training sample, the ensemble only allows trees that did not use that data point for their training to participate in this process.
So, in the case of our star sample (s), the ensemble:
Excluded T1 from the process because T1 used this sample for training.
Invited only T2 and T3 to participate in the soft voting process. (For classification tasks, RF often uses soft voting to aggregate probabilities; in the figure, T2 predicted 0.4 and T3 predicted 0.3 as the probability that sample s belongs to class A.)
Averaged T2’s and T3’s individual predictions: (0.4 + 0.3) / 2 = 0.35.
Concluded class B because 0.35 is below the threshold of 0.50.
The ensemble repeats the same process across all the samples.
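The soft-voting arithmetic for the star sample can be replayed directly (the probabilities 0.4 and 0.3 are the figure's values):

```python
# OOB soft voting for the star sample s:
# T1 trained on s, so only T2 and T3 participate
p_t2 = 0.4  # T2's predicted probability of class A for s
p_t3 = 0.3  # T3's predicted probability of class A for s

# average the probabilities over the participating trees
p_class_a = (p_t2 + p_t3) / 2

# apply the 0.50 decision threshold
prediction = "A" if p_class_a >= 0.5 else "B"

print(f"{p_class_a:.2f}")  # 0.35
print(prediction)          # B
```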
Mathematically, the final prediction for classification tasks is denoted:

ŷ = argmax_c (1/B) Σ_{b=1..B} P_b(c|x)

(B: number of trees in the forest, b: a tree, c: a class, P_b(c|x): the probability of class c given input x from tree b)
In the case of regression tasks, where the output is a continuous value:

ŷ = (1/B) Σ_{b=1..B} f_b(x)

(B: number of trees in the forest, b: a tree, f_b(x): the prediction of the b-th tree for a given input x)
This approach makes the ensemble very stable and robust because it leverages an internal mechanism to validate trees’ predictions using the other trees in the system.
Part of the reason the ensemble randomizes the bootstrap samples in the first place is to secure this robustness across all the trees and thereby achieve generalization.
◼ Key Hyperparameters
Random Forest’s capabilities are heavily influenced by 1. Forest Complexity, 2. Tree Complexity, and 3. Training Control.
Referring to RandomForestClassifier from the scikit-learn library, the key hyperparameters are:
▫ 1. Forest Complexity
n_estimators: Number of trees to build before averaging the prediction.
oob_score: Whether to use OOB samples to estimate the generalization score.
bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
max_samples: Number of samples to draw from the training set for each bootstrap sample.
▫ 2. Tree Complexity
max_depth: Maximum depth of each tree.
min_samples_split: Minimum number (or fraction) of samples required to split an internal node.
min_samples_leaf: Minimum number of samples required at a leaf node.
max_features: Number of features considered when searching for the best split ('sqrt', 'log2', an int, or a float).
max_leaf_nodes: Maximum number of leaf nodes in each tree (None = unlimited).
min_impurity_decrease: Minimum impurity decrease required to split a node.
ccp_alpha: Complexity parameter used for Minimal Cost-Complexity Pruning.
▫ 3. Training Control
criterion: Function measuring split quality ('gini', 'entropy', or 'log_loss').
class_weight: Handles class imbalance.
random_state: Controls the randomness of sampling.
n_jobs: Number of processors to use for training.
These hyperparameters are often subject to grid search for optimal results. In the simulation section, I'll demonstrate and evaluate the performance.
Comparing with Gradient Boosting Machines
GBMs are often considered the counterpart of RF.
The primary distinctions between Random Forest and GBM are their ensemble process and optimization focus.

Figure B. Comparison of Gradient Boosting Machines and Random Forest (Created by Kuriko IWAI)
◼ Ensemble Process
Random Forest builds multiple decision trees independently and in parallel.
In contrast, GBM sequentially adds "weak learners" (typically shallow decision trees) to its ensemble, with each new tree correcting the errors of the previous ones.
This parallel nature gives Random Forest an advantage in time complexity, especially when dealing with large and complex datasets.
For GBM, techniques like histogram-based training and sampling (as seen in LightGBM) can help mitigate its sequential processing time.
◼ Optimization Focus
Random Forest primarily aims to reduce variance by averaging the predictions of many decorrelated trees, each validated internally on the data it did not see during training.
GBM, on the other hand, focuses on reducing training errors by iteratively optimizing its weak learners. This optimization often leads to a higher risk of overfitting, making regularization a necessary step to prevent it.
Simulation
In this section, I'll demonstrate the performance of Random Forest Classifiers across four model complexities, comparing them against GBM and Logistic Regression as baselines for binary churn prediction.
Random Forest Classifiers (1. Small, 2. Middle, 3. Large, 4. Optimal)
GBM Family: XGBoost Classifier, LightGBM Classifier, CatBoost Classifier
Baseline Models: Logistic Regression
◼ Preparing Datasets
I used telecom churn data from the UC Irvine Machine Learning Repository (licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license) with 3,500 data samples and 14 features, producing 2,826 training samples with 61 input features after applying column transformations, data scaling, and SMOTE:
(2826, 61) (2826,) (500, 61) (500,) (500, 61) (500,)
Input features (before column transformation):

Figure C. Sample dataset (Image source)
◼ Model Tuning: Grid Search
I used Grid Search with cross-validation to find the best hyperparameters for the Random Forest Classifier.
This process systematically tested different combinations of parameters, such as the number of trees (n_estimators) and maximum depth (max_depth), to optimize the model's performance on the training data.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# sets baseline model
base_rf = RandomForestClassifier(
    # 1. forest complexity
    bootstrap=True,
    oob_score=True,

    # 3. training control
    random_state=42,
    n_jobs=-1,
    class_weight='balanced',

    verbose=0,
    warm_start=False,
)
# 'preprocessor' is the column transformer defined during data preparation
pipeline = Pipeline([('scaler', preprocessor), ('model', base_rf)])

# defines options to test
param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [5, 10, 20, None],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
    'model__max_features': ['sqrt', 0.7],
    'model__criterion': ['gini', 'entropy'],
}

# grid search
grid_search_rm_bt = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=0)
grid_search_rm_bt.fit(X_train, y_train)
best_params_search_model = grid_search_rm_bt.best_params_
rf_opt = grid_search_rm_bt.best_estimator_
print(f"Best params for bootstrapped Random Forest: {best_params_search_model}")
print(f"Best score: {grid_search_rm_bt.best_score_:.4f}")
◼ Random Forests with Different Model Complexity
Then, I added four classifiers, each with a different level of forest / tree complexity:
rf_s: 50 trees in the forest. No sample or feature bootstrapping (the entire training set is used to train each tree). Each tree has a maximum depth of 3 with at least 20 samples per leaf node.
rf_s_bootstrap: Counterpart of rf_s trained on bootstrap samples (still no feature bootstrapping).
rf_m: 200 trees in the forest. Bootstrapping on. Each tree has a maximum depth of 10 with at least 5 samples per leaf node.
rf_l: 500 trees in the forest. Bootstrapping on. Each tree can grow without limit until a single sample remains in each leaf node.
from sklearn.ensemble import RandomForestClassifier

# common training conditions across the models
training_controls = dict(
    random_state=42,
    n_jobs=-1,
    class_weight='balanced',
    verbose=0,
    warm_start=False,
)

# complexity = low
rf_s = RandomForestClassifier(
    # 1. forest complexity
    bootstrap=False,
    oob_score=False,
    n_estimators=50,

    # 2. tree complexity
    max_depth=3,
    min_samples_leaf=20,
    max_features=None,  # all features at every split (no feature bootstrap)

    # 3. training control
    **training_controls,
).fit(X_train_processed, y_train)

# complexity = low (counterpart with bootstrap samples)
rf_s_bootstrap = RandomForestClassifier(
    **{k: v for k, v in rf_s.get_params().items() if k not in ('bootstrap', 'oob_score')},
    bootstrap=True,
    oob_score=True,
).fit(X_train_processed, y_train)

# complexity = middle
rf_m = RandomForestClassifier(
    # 1. forest complexity
    bootstrap=True,
    oob_score=True,
    n_estimators=200,

    # 2. tree complexity
    max_depth=10,
    min_samples_leaf=5,
    max_features='sqrt',  # turns on the feature bootstrap

    # 3. training control
    **training_controls,
).fit(X_train_processed, y_train)

# complexity = high
rf_l = RandomForestClassifier(
    # 1. forest complexity
    bootstrap=True,
    oob_score=True,
    n_estimators=500,

    # 2. tree complexity
    max_depth=None,  # depth is unbounded
    min_samples_leaf=1,
    max_features='log2',

    # 3. training control
    **training_controls,
).fit(X_train_processed, y_train)
◼ GBM Family and Baseline Model
Then, I tuned the XGBoost Classifier, LightGBM Classifier, CatBoost Classifier, and Logistic Regression as our primary baseline model.
The GBM models allocate 10% of the training samples as a validation set and employ early stopping after 10 consecutive iterations without improvement in the validation loss.
This strategy is crucial for GBMs to prevent overfitting because they prioritize minimizing training errors.
To control tree complexity, all GBMs were configured with the same number of trees as the complex Random Forest model (500), while tree depth was left theoretically unbounded.
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier

# complexity hyperparams
learning_rate = 0.01
n_estimators = 500  # same as rf_l
max_depth = None

# regularization hyperparams
validation_fraction = 0.1
n_iter_no_change = 10

# scikit-learn's GradientBoostingClassifier, used as the XGBoost-style model
xgbm = GradientBoostingClassifier(
    loss='log_loss',
    learning_rate=learning_rate,
    n_estimators=n_estimators,
    max_depth=max_depth,
    validation_fraction=validation_fraction,
    n_iter_no_change=n_iter_no_change,
    tol=1e-4,
).fit(X_train_processed, y_train)

# scikit-learn's histogram-based GBM, used as the LightGBM-style model
lgbm = HistGradientBoostingClassifier(
    loss='log_loss',
    learning_rate=learning_rate,
    max_depth=max_depth,
    max_iter=n_estimators,
    validation_fraction=validation_fraction,
    l2_regularization=0.01,
    early_stopping=True,
    n_iter_no_change=n_iter_no_change,
    max_leaf_nodes=31,
    min_samples_leaf=20,
).fit(X_train_processed, y_train)

cat = CatBoostClassifier(
    loss_function='Logloss',
    learning_rate=learning_rate,
    iterations=n_estimators,
    depth=max_depth,  # None falls back to CatBoost's default depth
    early_stopping_rounds=n_iter_no_change,
    eval_metric='Accuracy',
    random_seed=42,
    verbose=0,
).fit(X_train_processed, y_train)

# baseline model
lr = LogisticRegression(
    max_iter=n_estimators,
    penalty='l2',
    tol=1e-4,
).fit(X_train_processed, y_train)
◼ Evaluation
I computed accuracy scores from the predictions on the training, validation, and test datasets.
from sklearn.metrics import accuracy_score

# iterates over the trained models and their display names
for i, model in enumerate(models):
    y_pred_train = model.predict(X_train_processed)
    y_pred_val = model.predict(X_val_processed)
    y_pred_test = model.predict(X_test_processed)
    print(
        f'\n{model_names[i]}'
        f'\nTrain: {accuracy_score(y_train, y_pred_train):.4f}'
        f' Val: {accuracy_score(y_val, y_pred_val):.4f}'
        f' Test: {accuracy_score(y_test, y_pred_test):.4f}'
    )
◼ Results
The table summarizes the performance comparison based on accuracy scores for Random Forest (RF), Gradient Boosting Machine (GBM), and Logistic Regression (LR).

Figure D. Performance comparison (RF, GBM, LR) (Created by Kuriko IWAI)
For training performance, both the optimal and complex Random Forest models achieved a high accuracy of 99.36%.
This highlights how a larger number of trees improves the aggregated voting accuracy on the training data.
For generalization, CatBoost performed best, exhibiting a discrepancy of only 0.021 between training and test accuracy, outperforming the optimal Random Forest's 0.042 discrepancy.
This suggests that with appropriate regularization, GBM models can generalize better than Random Forests.
Conversely, the smaller Random Forests performed below the Logistic Regression baseline, despite their generalization discrepancy (0.013) being better than CatBoost's (0.021).
Notably, bootstrap sampling had little impact on these results, further emphasizing the critical role of the number of trees in a Random Forest's performance.
Wrapping Up
Random Forests demonstrate significant predictive power by leveraging an ensemble of decision trees.
In the experiment, we observed the power of many trees in the forest.
However, this strength can induce computational challenges.
Training many trees on bootstrapped samples with feature subsets can be computationally expensive and time-consuming, especially with large datasets or a high n_estimators.
Also, storing numerous individual trees requires substantial memory, posing a concern for resource-limited systems.
Despite these hurdles, Random Forests remain a versatile and powerful tool for a wide range of projects.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Mastering the Bias-Variance Trade-Off: An Empirical Study of VC Dimension and Generalization Bounds
Regression Loss Functions & Regularization
A Deep Dive into KNN Optimization and Distance Metrics
Mastering Decision Trees: From Impurity Measures to Greedy Optimization
Gradient Boosting Decoded: From Mathematical Foundations to Competitive Benchmarks
Building Powerful Naive Bayes Ensembles for Mixed Datasets
Related Books for Further Understanding
These books cover a wide range of theories and practices, from the fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Share What You Learned
Kuriko IWAI, "Random Forest Decoded: Architecture, Bagging, and Performance Benchmarks" in Kernel Labs
https://kuriko-iwai.com/master-random-forest
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.





