Mastering Decision Trees: From Impurity Measures to Greedy Optimization

Explore the core of decision tree mechanics with practical walkthrough examples, math, and coding implementation

Machine Learning · Data Science · Python

By Kuriko IWAI


Table of Contents

Introduction
What is Decision Tree?
How Decision Tree Algorithm Works
Objective of Tree Algorithm: Finding the Optimal Split
How to Define the “Purity”
1. Entropy: Maximizing Information Gain
2. Gini Impurity: Maximizing Gini Gain
Greedy Algorithm to Maximize Gain
1. Exact Greedy Algorithm
2. Approximate Greedy Algorithm
3. Histogram-based Algorithm
The Walkthrough Example
Scenario: Predicting Customer Subscription
Step 1. Compute Initial Impurity
Step 2. Choosing the Optimization Algorithm
Step 3. Computing Gains
Scenario 1. Apply Exact Greedy Algorithm
Decision Making
Scenario 2. Apply Approximate Greedy Algorithm
Decision Making
Scenario 3. Apply Histogram-based Algorithm
Decision Making
Summary of Findings
Simulation
Preparing Dataset
Model Training
Performance Evaluation
Conclusion

Introduction

Decision trees are intuitive, flowchart-like models widely used in machine learning.

In a machine learning context, they often serve as fundamental building blocks for more complex ensemble models such as Random Forests and Gradient Boosting Machines.

I’ll explore their mechanics with walkthrough examples, demonstrating both the mathematics and a coding implementation.

What is Decision Tree?

A decision tree is a tree-structured model used for both regression and classification problems in various sectors like finance, medicine, and education.

Decision trees’ key features include:

  • A hierarchical tree structure (cf. Boosting’s sequential structure).

  • Non-parametric: No fixed number of parameters, no assumptions on data distribution.

  • Versatile: Effectively used for both regression and classification.

  • Impurity-Based Optimization: Optimize splitting points of a given dataset to maximize purity (or minimize impurity) of child nodes.

  • To compute purity, it uses impurity measures as the decision function.

  • Recursive Data Splitting: Uses greedy algorithms to optimize the data split until it reaches a point where the data in each node achieves maximum purity.

  • Interpretable: The resulting tree provides clear, interpretable decision rules that can be readily understood and applied for predictions or classifications.

How Decision Tree Algorithm Works

The decision tree algorithm aims to identify optimal split points that categorize data into discrete groups.

Specifically, the process begins at the root node, which represents the entire dataset.

From the root, the data is successively split by decision nodes based on specific features, forming branches.

This splitting continues until the structure reaches leaf nodes (also called terminal nodes).

Each leaf node represents an outcome or a decision.

Figure A. Decision Tree Structure (Created by Kuriko IWAI)

Objective of Tree Algorithm: Finding the Optimal Split

Tree algorithms' primary goal is to find a split point in the dataset that can achieve maximum purity (or minimum impurity) in the child nodes.

How to Define the “Purity”

Purity means creating subsets of data as homogeneous as possible with respect to the target variable.

The figure below illustrates a child node with high impurity (left) and maximum purity (right), which is chosen as the optimal split point:

Figure B. Finding the optimal split with maximum purity (Created by Kuriko IWAI)

Mathematically, the purity is quantified by impurity measures.

Here, I’ll walk you through Entropy and Gini Impurity. Both are impurity measures but optimize different gain metrics.

1. Entropy: Maximizing Information Gain

Entropy quantifies the amount of uncertainty or randomness in a dataset; how mixed the classes are within the node.

The entropy of a given dataset D is defined as:

E(D) = -\sum_{i=1}^{C} p_i \log_2(p_i)

where:

  • E(D): The entropy of the dataset D,

  • D: The dataset (or a subset of data at a particular node),

  • C: The number of unique classes in the dataset, and

  • p_i​: The proportion of samples belonging to class i.

Entropy ranges from zero (perfect purity) to log2(C) (perfect impurity, where the samples in the node are evenly distributed among the classes).

  • High Entropy = High Uncertainty = Lots of "surprise" potential. For instance, flipping a fair coin gives a 50:50 chance of heads or tails; you don't know which you'll get.

  • Low Entropy = Low Uncertainty = Little to no "surprise" potential. For instance, picking a yellow ball from a bag full of yellow balls; you know exactly what you'll get.

If an algorithm uses Entropy, it aims to maximize Information Gain, which means maximizing the reduction in entropy from the parent node to the child nodes.

The Information Gain

Information gain quantifies a feature's effectiveness within a tree structure, indicating its usefulness in separating data into distinct classes.

This usefulness is defined by how much a data split reduces impurity during classification.

Mathematically, Information Gain quantifies the reduction in entropy from a parent node to its child nodes with respect to a given feature (A):

IG(D, A) = E(D) - \sum_{v \in A} \underbrace{\frac{|D_v|}{|D|}}_{\text{weighted avg.}} E(D_v)

where:

  • IG(D, A): Information Gain of dataset D when split by feature A,

  • E(D): Entropy of the dataset D before the split,

  • E(D_v): Entropy of subset D_v (a child node),

  • v ∈ A: All possible values of feature A,

  • |D_v|: The number of instances in subset D_v (where feature A has value v), and

  • |D|: Total number of instances in dataset D.
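To make these formulas concrete, here is a minimal Python sketch (the helper names entropy and information_gain are my own) that computes both quantities for a parent node and a proposed two-way split; it reproduces numbers that appear in the walkthrough later in this post:

import numpy as np

def entropy(labels):
    """Entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of its children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# toy split: a 6 'Yes' / 4 'No' parent divided into two child nodes
parent = ['Yes'] * 6 + ['No'] * 4
left, right = ['No', 'No'], ['Yes'] * 6 + ['No'] * 2
print(round(entropy(parent), 3))                          # 0.971
print(round(information_gain(parent, [left, right]), 3))  # 0.322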

2. Gini Impurity: Maximizing Gini Gain

Gini Impurity quantifies how often a randomly chosen element from the set would be incorrectly labeled.

The Gini Impurity of a given dataset D is defined as:

G(D) = 1 - \sum_{i=1}^{C} p_i^2

where:

  • G(D): Gini Impurity,

  • D: The dataset (or a subset of data at a particular node),

  • C: The number of unique classes in the dataset, and

  • p_i​: The proportion of samples belonging to class i.

Key characteristics of Gini Impurity include:

  • Ranges from zero (perfect purity) to 1 − 1/C (perfect impurity),

  • A lower Gini impurity indicates a more homogeneous node.

  • Computationally faster than entropy-based methods.

If an algorithm uses Gini Impurity, it aims to maximize Gini Gain (or minimize Gini Impurity), meaning maximizing the reduction in Gini impurity from a parent node to its child nodes.

The Gini Gain

Gini gain is the parent node’s Gini impurity minus the weighted sum of the child nodes’ Gini impurities:

\Delta G = G_{\text{parent}} - \sum_{j=1}^{k} \frac{N_j}{N} G_{\text{child}_j}

where:

  • G: Gini impurity of the parent or child node,

  • k: The number of child nodes after the split,

  • N_j: The number of samples in child node j, and

  • N: The total number of samples in the parent node.
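Here is the analogous sketch for Gini impurity and Gini gain (again, gini and gini_gain are my own helper names), using the same toy split as the entropy example above:

import numpy as np

def gini(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, children):
    """Parent Gini impurity minus the weighted Gini impurity of its children."""
    n = len(parent)
    weighted = sum(len(c) / n * gini(c) for c in children)
    return gini(parent) - weighted

parent = ['Yes'] * 6 + ['No'] * 4
left, right = ['No', 'No'], ['Yes'] * 6 + ['No'] * 2
print(round(gini(parent), 3))                      # 0.48
print(round(gini_gain(parent, [left, right]), 3))  # 0.18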

Gini gain is the default criterion used by the CART (Classification and Regression Trees) algorithm, which is widely implemented in ML libraries like Scikit-learn.

***

Defining how to compute the gain is crucial for how decision tree algorithms learn and decide the optimal splits at each node to create effective predictive models.

Greedy Algorithm to Maximize Gain

After choosing an impurity measure to quantify the gain, the model uses a greedy algorithm to find the split point that yields the highest gain at each node.

In this section, I'll provide an overview of three major algorithms, then walk through examples in a later section.

1. Exact Greedy Algorithm

This method considers every unique value of a continuous feature as a potential split point.

  • Pros: Precisely calculates the gain for each candidate, leading to the most accurate split.

  • Cons: Computationally intensive and slow for large datasets with many unique values.

2. Approximate Greedy Algorithm

This method addresses scalability by proposing candidate split points based on quantiles of feature distributions.

It buckets continuous features and searches these approximate ranges for optimal splits, balancing accuracy with speed.

  • Pros: Faster than the exact algorithm, and handles massive datasets that do not fit in memory well enough to build full histograms.

  • Cons: Less precise than the exact search, so the chosen split may be suboptimal.

3. Histogram-based Algorithm

This method pre-bins continuous features into fixed discrete bins (histograms).

Splits are then rapidly found by iterating through these pre-built histograms, offering significant speedups at the cost of some precision.

  • Pros: The most efficient method for large datasets.

  • Cons: Might miss the absolute best split by a tiny margin (though in practice the difference is often negligible).
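To make the difference between the three strategies concrete, here is a minimal sketch (the feature values are the ones from the walkthrough below; the quantile choice and bin count are illustrative assumptions) showing which candidate split points each strategy would consider for one continuous feature:

import numpy as np

values = np.sort(np.array([1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5]))

# 1. exact greedy: every midpoint between consecutive distinct values
exact_candidates = (values[:-1] + values[1:]) / 2

# 2. approximate greedy: a reduced candidate set, e.g. quartiles of the distribution
approx_candidates = np.quantile(values, [0.25, 0.5, 0.75])

# 3. histogram-based: pre-binned feature, splits only at interior bin edges
hist_candidates = np.histogram_bin_edges(values, bins=3)[1:-1]

print(len(exact_candidates), len(approx_candidates), len(hist_candidates))  # 9 3 2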

The Walkthrough Example

Using the following scenario as an example, I’ll walk step by step through how a decision tree works with each of the three algorithms.

Scenario: Predicting Customer Subscription

The Goal

A subscription service manager wants to predict which new sign-ups are most likely to convert into long-term subscribers (target variable: is_long_term: Yes/No).

They have access to data on internet_usage_hrs_day (how much time a user spends online daily) and device_preference (their primary device: Mobile, Desktop, Tablet).

The Data

ID | internet_usage_hrs_day | device_preference | is_long_term
1  | 1.2                    | Mobile            | No
2  | 2.8                    | Desktop           | No
3  | 3.1                    | Mobile            | Yes
4  | 4.5                    | Desktop           | Yes
5  | 5.9                    | Mobile            | Yes
6  | 6.3                    | Desktop           | Yes
7  | 7.7                    | Tablet            | Yes
8  | 8.4                    | Mobile            | No
9  | 9.1                    | Desktop           | No
10 | 10.5                   | Tablet            | Yes

* internet_usage_hrs_day contains continuous values; device_preference contains discrete values.

Step 1. Compute Initial Impurity

The root node begins with the entire dataset:

  • Total Samples: 10

  • Class Distribution: 6 'Yes', 4 'No'

The initial proportion of samples belonging to each class (Yes, No) is:

  • p_Yes = 6/10 = 0.6

  • p_No = 4/10 = 0.4

Substituting these values to the impurity measure:

  • Initial Entropy: E(D) = −(0.6 log2(0.6) + 0.4 log2(0.4)) ≈ 0.971

  • Initial Gini Impurity: G(D) = 1 − (0.6² + 0.4²) = 0.48

*For demonstration purposes, I’ll compute both measures. In reality, we’d choose one and stick to it until the end of the process.
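As a quick sanity check (assuming NumPy), both numbers can be reproduced directly from the class proportions:

import numpy as np

p = np.array([0.6, 0.4])        # p_Yes, p_No at the root node
print(-np.sum(p * np.log2(p)))  # ≈ 0.971, the initial entropy
print(1 - np.sum(p ** 2))       # 0.48, the initial Gini impurity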

Step 2. Choosing the Optimization Algorithm

The manager selects the optimization algorithm based on their priorities.

Scenario 1.

Manager: “For this initial pilot, our dataset is quite small, and I don’t think computational expense would be a problem. I prioritize achieving the most accurate split and generate insightful decision rules first.”

→ Choose Exact Greedy Algorithm

Scenario 2.

Manager: “If this pilot scales to millions of users, an exact search would be too slow. Let me start the process with random split points instead of all the splits to see if we still get a good enough split without freezing our system.”

→ Choose Approximate Greedy Algorithm

Scenario 3.

Manager: “I'm considering expanding our datasets with many more features to gain deeper insights. The expansion would require intensive model training. So, I need to prioritize the training speed. I'm willing to accept a slight loss of precision in split points for the speed increase.”

→ Choose Histogram-based Algorithm

Step 3. Computing Gains

I’ll demonstrate how to compute gains and decide which feature to prioritize to make a prediction on long-term subscribers.

The common steps across the scenarios are:

  1. Select split points by feature,

  2. Compute entropy or Gini impurity, then compute the gains, and

  3. Choose the split point with the maximum gain.

Scenario 1. Apply Exact Greedy Algorithm

Feature 1: internet_usage_hrs_day

With float values, the number of potential split points technically increases. In practice, algorithms usually consider midpoints between sorted, distinct values where the target class changes:

  • Split 1: internet_usage_hrs_day <= 2.95 (midpoint of 2.8 and 3.1)

  • Split 2: internet_usage_hrs_day <= 8.05 (midpoint of 7.7 and 8.4)

Figure C. Applying Exact Greedy Algorithm to the feature (Created by Kuriko IWAI)

I’ll evaluate these splits by computing entropy and Gini impurity, and then computing the gains.

Split 1.

  • Left Child

    • 2 samples: 2 'No', 0 'Yes'

    • p_No ​= 2/2=1

    • p_Yes ​= 0/2=0

    • Entropy: E(D) = 0 (Perfectly pure)

    • Gini: G(D) = 0 (Perfectly pure)

  • Right Child

    • 8 samples: 2 'No', 6 'Yes'

    • p_No​ = 2/8 = 0.25

    • p_Yes​ = 6/8 = 0.75

    • Entropy: E(D) = −(0.25 log2(0.25) + 0.75 log2(0.75)) ≈ −(−0.5 − 0.311) ≈ 0.811

    • Gini: G(D) = 1 − (0.25² + 0.75²) = 1 − 0.625 = 0.375

→ Gains

  • Information Gain (Entropy): IG = 0.971 − (2/10 × 0 + 8/10 × 0.811) = 0.971 − 0.649 = 0.322

  • Gini Gain (Gini Impurity): ΔG = 0.48 − (2/10 × 0 + 8/10 × 0.375) = 0.48 − 0.30 = 0.180

The following tree illustrates the computation of information gain (left) and Gini gain (right) from the root node to the first level:

Figure D. Tree map of the subscription scenario based on the exact greedy algorithm with Entropy (left) and Gini (right) impurity measures (Created by Kuriko IWAI)

Split 2.

  • Left Child

    • 7 samples: 2 'No', 5 'Yes'

    • p_No​ = 2/7

    • p_Yes ​= 5/7

    • Entropy: E(D)=− ((2/7) log2​(2/7) + (5/7) log2​(5/7) ) ≈ −(−0.517 − 0.346) ≈ 0.863

    • Gini: G(D)=1− ((2/7)²+(5/7)²) = 1 − 29/49 ≈0.408

  • Right Child

    • 3 samples: 2 'No', 1 'Yes'

    • p_No ​= 2/3

    • p_Yes ​= 1/3

    • Entropy: E(D) = −((2/3) log2​(2/3) + (1/3) log2​(1/3) ) ≈ −(−0.390 − 0.528) ≈ 0.918

    • Gini: G(D) = 1−((2/3)² + (1/3)²) = 1 − 5/9 ≈ 0.444

→ Gains

  • Information Gain: IG = 0.971 − ( 7/10​ × 0.863+ 3/10 ​× 0.918 ) = 0.971 − 0.879 = 0.092

  • Gini Gain: ΔG=0.48 − ( 7/10 ​× 0.408 + 3​/10 × 0.444 ) = 0.48 − 0.419 = 0.061

Feature 2: device_preference

For categorical features, each category can be assigned to a separate node for potential splitting:

  • Split 1: Node 'Mobile'

    • 3 Samples: 2 'No', 1 'Yes'

    • p_Yes = 1/3

    • p_No = 2/3

    • Entropy ≈ 0.918

    • Gini ≈ 0.444

  • Split 2: Node ‘Desktop’

    • 4 Samples: 3 'No', 1 'Yes'

    • p_Yes = 1/4

    • p_No = 3/4

    • Entropy ≈ 0.811

    • Gini ≈ 0.375

  • Split 3: Node ‘Tablet’

    • 2 Samples: 0 'No', 2 'Yes'

    • p_Yes = 2/2 = 1

    • p_No = 0/2 = 0

    • Entropy = 0 (purity)

    • Gini = 0 (purity)

→ Gains:

  • Information Gain: IG = 0.971 − (3/10 ​× 0.918 + 4/10 ​× 0.811 + 2/10 ​× 0) = 0.971 − 0.599 = 0.372

  • Gini Gain: ΔG = 0.48 − (3/10 ​× 0.444 + 4/10​ x 0.375 + 2/10 × 0) = 0.48 - 0.283 = 0.197

Decision Making

With either Entropy or Gini impurity, splitting by the categories of device_preference turned out to be the optimal split, as it outperformed the gains from internet_usage_hrs_day.

This result indicates that in this dataset, device preference is a stronger predictor than daily internet usage hours for whether a customer becomes a long-term subscriber.

Impurity Method | Feature                | Breaking Point          | Gain
Entropy         | internet_usage_hrs_day | ≤ 2.95                  | 0.322
Entropy         | internet_usage_hrs_day | ≤ 8.05                  | 0.092
Entropy         | device_preference      | Mobile, Desktop, Tablet | 0.372 ← WINNER
Gini            | internet_usage_hrs_day | ≤ 2.95                  | 0.180
Gini            | internet_usage_hrs_day | ≤ 8.05                  | 0.061
Gini            | device_preference      | Mobile, Desktop, Tablet | 0.197 ← WINNER

*Gini and Entropy gains shouldn’t be compared directly because their mathematical formulations differ, as we saw in the previous section.
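For reference, the whole comparison above can be reproduced with a short script; the sketch below reuses the entropy, information_gain, gini, and gini_gain helpers from earlier and hard-codes the ten walkthrough rows:

import pandas as pd

df = pd.DataFrame({
    'hrs':    [1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5],
    'device': ['Mobile', 'Desktop', 'Mobile', 'Desktop', 'Mobile',
               'Desktop', 'Tablet', 'Mobile', 'Desktop', 'Tablet'],
    'target': ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes'],
})

# numerical feature: exact greedy candidates (midpoints where the class changes)
for threshold in [2.95, 8.05]:
    mask = df['hrs'] <= threshold
    children = [df.loc[mask, 'target'], df.loc[~mask, 'target']]
    print(f"hrs <= {threshold}: IG = {information_gain(df['target'], children):.3f}, "
          f"Gini gain = {gini_gain(df['target'], children):.3f}")

# categorical feature: one child node per category
children = [group['target'] for _, group in df.groupby('device')]
print(f"device_preference: IG = {information_gain(df['target'], children):.3f}, "
      f"Gini gain = {gini_gain(df['target'], children):.3f}")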


Scenario 2. Apply Approximate Greedy Algorithm

Feature 1: internet_usage_hrs_day

For numerical features, the approximate greedy algorithm evaluates only a reduced subset of candidate split points to save computation.

So, the manager decides to split at random points of 5.0 and 9.0.

  • Split 1: internet_usage_hrs_day <= 5.0

  • Split 2: internet_usage_hrs_day <= 9.0

Figure F. Applying Approximate Greedy Algorithm to the feature (Created by Kuriko IWAI)

Split 1.

  • Left Child

    • 5 samples: 2 'No', 3 'Yes'

    • p_YES = 3/5 = 0.6

    • p_No = 2/5 = 0.4 (happens to be the same distribution as the root node)

    • Entropy ≈ 0.971

    • Gini = 0.48

  • Right Child

    • 5 samples: 2 'No', 3 'Yes'

    • p_YES = 3/5 = 0.6

    • p_No = 2/5 = 0.4

    • Entropy ≈ 0.971

    • Gini = 0.48

→ Gains:

  • Information Gain: IG=0.971−(5 / 10 ​× 0.971 + 5 / 10 ​×0.971) = 0.971 - 0.971 = 0.0

  • Gini Gain: ΔG=0.48−(5 / 10 ​× 0.48 + 5 / 10 ​× 0.48) = 0.48 -0.48 = 0.0

Split 2

  • Left Child

    • 9 samples: 4 'No', 5 'Yes'

    • p_YES = 5/9

    • p_No = 4/9

    • Entropy ≈0.991

    • Gini ≈0.494

  • Right Child

    • 1 sample: 0 'No', 1 'Yes'

    • p_YES = 1

    • p_No = 0

    • Entropy = 0

    • Gini = 0

→ Gains:

  • Information Gain: IG= 0.971 − (9 / 10 ​× 0.991 + 1 / 10 ​× 0) = 0.971 − 0.892 = 0.079

  • Gini Gain: ΔG = 0.48−(9 / 10 ​× 0.494 + 1 / 10 ​× 0) = 0.48 - 0.445 = 0.035

Feature 2: device_preference

As with Exact Greedy, categorical features are typically still evaluated over all categories, so I’ll reuse the results from Scenario 1.

  • Information Gain: IG = 0.372

  • Gini Gain: ΔG = 0.197

Decision Making

Similar to Scenario 1, with either impurity measure, splitting by the categories of device_preference performed the best.

Impurity Method | Feature                | Breaking Point          | Gain
Entropy         | internet_usage_hrs_day | ≤ 5.0                   | 0.0
Entropy         | internet_usage_hrs_day | ≤ 9.0                   | 0.079
Entropy         | device_preference      | Mobile, Desktop, Tablet | 0.372 ← WINNER
Gini            | internet_usage_hrs_day | ≤ 5.0                   | 0.0
Gini            | internet_usage_hrs_day | ≤ 9.0                   | 0.035
Gini            | device_preference      | Mobile, Desktop, Tablet | 0.197 ← WINNER

Scenario 3. Apply Histogram-based Algorithm

Feature 1: internet_usage_hrs_day

The histogram-based greedy algorithm first bins numerical features into a fixed number of bins and then only considers splits at the bin boundaries.

The manager decides to make 3 bins (Hrs <= 3.0, 3.0 < Hrs <= 7.0, Hrs > 7.0), so only the splits at Hrs <= 3.0 and Hrs <= 7.0 are considered.
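Assuming NumPy, the binning step itself is a one-liner; only the two interior edges are then evaluated as split points:

import numpy as np

hrs = np.array([1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5])
edges = [3.0, 7.0]                              # interior bin boundaries chosen by the manager
bin_ids = np.digitize(hrs, edges, right=True)   # 0: hrs <= 3.0, 1: 3.0 < hrs <= 7.0, 2: hrs > 7.0
print(bin_ids)                                  # [0 0 1 1 1 1 2 2 2 2]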

Split 1. Hrs <= 3.0

  • Left Child

    • 2 samples: 2 'No', 0 'Yes'

    • p_Yes = 0

    • p_No = 1

    • Entropy = 0

    • Gini = 0

  • Right Child

    • 8 samples: 2 'No', 6 'Yes'

    • p_Yes = 6/8

    • p_No = 2/8

    • Entropy ≈ 0.811

    • Gini ≈ 0.375

→ Gains:

  • Information Gain: IG = 0.971 − ( 2/10 ​× 0 + 8/10 ​× 0.811 ) = 0.971 - 0.649 = 0.322

  • Gini Gain: ΔG=0.48 − (2/10 ​× 0 + 8/10 ​× 0.375 ) = 0.48− 0.3 = 0.180

Split 2. Hrs <= 7.0

  • Left Child

    • 7 samples: 2 'No', 5 'Yes'

    • p_Yes = 5/7

    • p_No = 2/7

    • Entropy ≈ 0.863

    • Gini ≈ 0.408

  • Right Child

    • 3 samples: 2 'No', 1 'Yes'

    • p_Yes = 1/3

    • p_No = 2/3

    • Entropy ≈ 0.918

    • Gini ≈ 0.444

→ Gains:

  • Information Gain: IG = 0.971 − (7/10 ​× 0.863 + 3/10 ​× 0.918) = 0.971 - 0.879 = 0.092

  • Gini Gain: ΔG=0.48 − (7/10 ​× 0.408 + 3/10 ​× 0.444) = 0.48− 0.419 = 0.061

Feature 2: device_preference

The histogram-based algorithm takes the same approach as Exact / Approximate Greedy.

  • Information Gain: IG = 0.372

  • Gini Gain: ΔG = 0.197

Decision Making

Similar to the other scenarios, splitting by the category of device_preference performed the best.

However, for the numerical feature internet_usage_hrs_day, the split at Hrs <= 3.0 resulted in an Information Gain of 0.322 and a Gini Gain of 0.180.

Interestingly, these gains are identical to the highest gains found by the Exact Greedy algorithm for internet_usage_hrs_day (at ≤2.95).

This demonstrates that if bin boundaries align closely with optimal split points, the Histogram-based method can achieve similar results with much higher efficiency.

Impurity Method | Feature                | Breaking Point          | Gain
Entropy         | internet_usage_hrs_day | ≤ 3.0                   | 0.322
Entropy         | internet_usage_hrs_day | ≤ 7.0                   | 0.092
Entropy         | device_preference      | Mobile, Desktop, Tablet | 0.372 ← WINNER
Gini            | internet_usage_hrs_day | ≤ 3.0                   | 0.180
Gini            | internet_usage_hrs_day | ≤ 7.0                   | 0.061
Gini            | device_preference      | Mobile, Desktop, Tablet | 0.197 ← WINNER

Summary of Findings

Across all three greedy algorithms, device_preference consistently yields the highest gain and thus represents the optimal split point for the initial node.

Dominance of Categorical Feature

  • In Exact Greedy, device_preference achieved an Information Gain of 0.372 and a Gini Gain of 0.197 - the highest gains across all evaluated splits.

  • In Approximate Greedy, device_preference maintained its Information Gain of 0.372 and Gini Gain of 0.197, again surpassing all tested splits.

  • In the Histogram-based Greedy approach, device_preference continued to exhibit gains of 0.372 (Entropy) and 0.197 (Gini), outperforming its numerical counterparts.

Variations in Numerical Feature Handling

  • Exact Greedy performed an exhaustive search, finding the highest gains for Internet_Usage_Hrs_Day at the ≤2.95 split (Entropy: 0.322, Gini: 0.180).

    This represents the maximum gain achievable for this feature under its specified split derivation method.

  • Approximate Greedy, by evaluating only a random subset of split points (e.g., ≤5.0 and ≤9.0), yielded significantly lower gains. This clearly demonstrates the trade-off, where a simplified search missed the higher gain possible for the numerical feature.

  • Histogram-based Greedy, by evaluating at predefined bin boundaries (e.g., ≤3.0 and ≤7.0), remarkably found a split at ≤3.0 that produced an Entropy Gain of 0.322 and Gini Gain of 0.180. These gains are identical to the maximum achieved by the Exact Greedy algorithm for this feature, indicating that the chosen bin boundary aligned perfectly with a highly effective split point in this particular dataset.

Simulation

Using a more complex dataset, I’ll demonstrate how this works with the DecisionTreeClassifier from the scikit-learn library.

Preparing Dataset

I created synthetic datasets by adding features to the walkthrough example data.

Then, I applied SMOTE to the training set to address the class imbalance, generating 120 samples with 20 features.

  • Training dataset: X_train: (100, 20), y_train: (100,)

  • Validation dataset: X_val: (20, 20), y_val: (20,)

  • Test dataset: X_test: (20, 20), y_test: (20,)
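A minimal sketch of that preparation, assuming a synthetic stand-in generated with make_classification and SMOTE from the imbalanced-learn package (the exact feature engineering I used differs, so the shapes below are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# synthetic stand-in for the expanded subscription dataset: 20 features, imbalanced classes
X, y = make_classification(n_samples=140, n_features=20, weights=[0.7, 0.3], random_state=12)

# hold out validation and test sets before any resampling
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=40, stratify=y, random_state=12)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=20, stratify=y_tmp, random_state=12)

# SMOTE is applied to the training split only, so validation/test stay untouched
X_train, y_train = SMOTE(random_state=12).fit_resample(X_train, y_train)
print(X_train.shape, X_val.shape, X_test.shape)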

Model Training

I’ll define the Exact and Approximate Greedy algorithms using the DecisionTreeClassifier from the scikit-learn library.

Exact Greedy

from sklearn.tree import DecisionTreeClassifier

dt_exact_entropy = DecisionTreeClassifier(
    criterion='entropy',       # entropy impurity measure
    max_features=None,         # explicitly set: consider all features at each split
    class_weight='balanced',
    random_state=12
).fit(X_train, y_train)

dt_exact_gini = DecisionTreeClassifier(
    criterion='gini',          # gini impurity measure
    max_features=None,
    class_weight='balanced',
    random_state=12
).fit(X_train, y_train)

Tree Structure

The figure illustrates the tree structure of the classification. Entropy (left) and Gini (right) generated similar results.

Figure G. Tree Structure (Exact Greedy) (Created by Kuriko IWAI)
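For reference, the trees in Figure G can be rendered with scikit-learn's plot_tree; a minimal sketch, assuming the fitted classifiers above and matplotlib:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
plot_tree(dt_exact_entropy, ax=axes[0], filled=True, max_depth=2)  # entropy tree (left)
plot_tree(dt_exact_gini, ax=axes[1], filled=True, max_depth=2)     # gini tree (right)
plt.show()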

Approximate Greedy

from sklearn.tree import DecisionTreeClassifier

dt_approx_entropy = DecisionTreeClassifier(
    criterion='entropy',
    max_features='sqrt',       # approximate-greedy equivalent: considers sqrt(n_features) at each split
    class_weight='balanced',
    random_state=12,
).fit(X_train, y_train)

dt_approx_gini = DecisionTreeClassifier(
    criterion='gini',
    max_features='sqrt',
    class_weight='balanced',
    random_state=12,
).fit(X_train, y_train)

Tree Structure

The figure illustrates the tree structure of the Approximate Greedy algorithm. Entropy (left) and Gini (right) generated different results, and both have more complex structures than the Exact Greedy trees.

Figure H. Tree Structure (Approximate Greedy) (Created by Kuriko IWAI)

Histogram-based Algorithm

I used the HistGradientBoostingClassifier from the Scikit-Learn library to approximate an entropy-based, histogram-based model.

Note that gradient boosting requires a differentiable per-sample loss (here, log loss); Gini impurity is a node-splitting criterion rather than such a loss, so there is no Gini-based counterpart here.

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance

dt_hist_entropy = HistGradientBoostingClassifier(
    loss='log_loss',             # entropy equivalent (cross-entropy loss)
    l2_regularization=0.1,       # L2 regularization term
    learning_rate=0.05,          # step size of each iteration
    max_depth=5,                 # max depth of individual trees
    min_samples_leaf=20,         # minimum samples required at a leaf node
    n_iter_no_change=10,         # early stopping if validation score doesn't improve for 10 iterations
    validation_fraction=0.1,     # 10% of training data used for validation
    random_state=12,
).fit(X_train, y_train)

Permutation Importances by Feature

Since HistGradientBoostingClassifier can't be visualized with plot_tree, I'll examine the permutation importance by feature instead. It turns out that support_contacts is the most important feature.

Figure I. Permutation importance by feature (Created by Kuriko IWAI)
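The importances in Figure I can be computed along these lines (a sketch; the scoring choice and number of repeats are assumptions):

from sklearn.inspection import permutation_importance

result = permutation_importance(
    dt_hist_entropy, X_val, y_val,
    n_repeats=10, scoring='accuracy', random_state=12,
)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}")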

Performance Evaluation

# 'model' stands in for each fitted classifier (dt_exact_entropy, dt_exact_gini, ...)
y_pred_train = model.predict(X_train)
y_pred_val = model.predict(X_val)
y_pred_test = model.predict(X_test)
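The accuracy scores below come from comparisons along these lines (a sketch, assuming accuracy_score from scikit-learn):

from sklearn.metrics import accuracy_score

print(f"Train: {accuracy_score(y_train, y_pred_train):.4f}")
print(f"Validation: {accuracy_score(y_val, y_pred_val):.4f}")
print(f"Test (generalization): {accuracy_score(y_test, y_pred_test):.4f}")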

Accuracy Scores

  • Exact - Entropy: Train: 1.0000, Validation: 0.7500, Test (generalization): 0.5500

  • Exact - Gini: Train: 1.0000, Validation: 0.5500, Test (generalization): 0.7000

  • Approximate - Entropy: Train: 1.0000, Validation: 0.6500, Test (generalization): 0.5500

  • Approximate - Gini: Train: 1.0000, Validation: 0.5000, Test (generalization): 0.4000

  • Histogram-based: Train: 0.9200, Validation: 0.6500, Test (generalization): 0.7000

The HistGradientBoostingClassifier with the L2 and other regularization parameters demonstrates much better generalization capabilities, as intended.

The single DecisionTreeClassifier instances, without strong regularization, tend to overfit the small dataset heavily.

The HistGradientBoostingClassifier's performance highlights the benefits of ensemble methods and proper regularization in achieving a better balance between bias and variance.

Conclusion

Decision trees are versatile and interpretable models that make predictions by recursively splitting the data on the most informative input features.

In the experiment, we observed that the choice of algorithm and impurity measure significantly influences a decision tree's structure and predictive performance.

Conversely, we also observed that decision trees are prone to overfitting.

When trees are used in machine learning, models employ regularization techniques such as depth limits, minimum leaf sizes, and ensembling to improve generalization and prevent overfitting.


Share What You Learned

Kuriko IWAI, "Mastering Decision Trees: From Impurity Measures to Greedy Optimization" in Kernel Labs

https://kuriko-iwai.com/decision-tree-algorithm-core


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.