Mastering Decision Trees: From Impurity Measures to Greedy Optimization

Explore the core of decision tree mechanics with practical walkthrough examples, math, and coding implementation

Machine Learning · Data Science · Python

By Kuriko IWAI


Table of Contents

Introduction
What is Decision Tree?
How Decision Tree Algorithm Works
Objective of Tree Algorithm: Finding the Optimal Split
How to Define the “Purity”
1. Entropy: Maximizing Information Gain
2. Gini Impurity: Maximizing Gini Gain
Greedy Algorithm to Maximize Gain
1. Exact Greedy Algorithm
2. Approximate Greedy Algorithm
3. Histogram-based Algorithm
The Walkthrough Example
Scenario: Predicting Customer Subscription
Step 1. Compute Initial Impurity
Step 2. Choosing the Optimization Algorithm
Step 3. Computing Gains
Scenario 1. Apply Exact Greedy Algorithm
Decision Making
Scenario 2. Apply Approximate Greedy Algorithm
Decision Making
Scenario 3. Apply Histogram-based Algorithm
Decision Making
Summary of Findings
Simulation
Preparing Dataset
Model Training
Performance Evaluation
Conclusion

Introduction

Decision trees are intuitive, flowchart-like models widely used in machine learning.

In a machine learning context, they often serve as fundamental building blocks for more complex ensemble models such as Random Forests and Gradient Boosting Machines.

I’ll explore their mechanics with walkthrough examples, demonstrating both the mathematics and a coding implementation.

What is Decision Tree?

A decision tree is a tree-structured model used for both regression and classification problems in various sectors like finance, medicine, and education.

Decision trees’ key features include:

  • A hierarchical tree structure (cf. Boosting’s sequential structure).

  • Non-parametric: No fixed number of parameters, no assumptions on data distribution.

  • Versatile: Effectively used for both regression and classification.

  • Impurity-Based Optimization: Optimize splitting points of a given dataset to maximize purity (or minimize impurity) of child nodes.

  • To compute purity, it uses impurity measures as the decision function.

  • Recursive Data Splitting: Uses greedy algorithms to optimize the data split until it reaches a point where the data in each node achieves maximum purity.

  • Interpretable: The resulting tree provides clear, interpretable decision rules that can be readily understood and applied for predictions or classifications.

How Decision Tree Algorithm Works

The decision tree algorithm aims to identify optimal split points that categorize data into discrete groups.

Specifically, the process begins at the root node, which represents the entire dataset.

From the root, the data is successively split by decision nodes based on specific features, forming branches.

This splitting continues until the structure reaches leaf nodes (also called terminal nodes).

Each leaf node represents an outcome or a decision.

Figure A. Decision Tree Structure (Created by Kuriko IWAI)

Objective of Tree Algorithm: Finding the Optimal Split

Tree algorithms' primary goal is to find a split point in the dataset that can achieve maximum purity (or minimum impurity) in the child nodes.

How to Define the “Purity”

Purity means creating subsets of data as homogeneous as possible with respect to the target variable.

The figure below illustrates a child node with high impurity (left) and maximum purity (right), which is chosen as the optimal split point:

Figure B. Finding the optimal split with maximum purity (Created by Kuriko IWAI)

Mathematically, the purity is quantified by impurity measures.

Here, I’ll walk you through Entropy and Gini Impurity. Both are impurity measures but optimize different gain metrics.

1. Entropy: Maximizing Information Gain

Entropy quantifies the amount of uncertainty or randomness in a dataset; how mixed the classes are within the node.

The entropy of a given dataset D is defined as:

E(D) = -\sum_{i=1}^{C} p_i \log_2(p_i)

where:

  • E(D): The entropy of the dataset D,

  • D: The dataset (or a subset of data at a particular node),

  • C: The number of unique classes in the dataset, and

  • p_i​: The proportion of samples belonging to class i.

Entropy ranges from zero (perfect purity) to log2(C) (perfect impurity, where the samples in the node are evenly distributed among the classes).

  • High Entropy = High Uncertainty = Lots of "surprise" potential. For instance, flipping a fair coin gives a 50:50 chance of heads or tails; you don't know which you'll get.

  • Low Entropy = Low Uncertainty = Little to no "surprise" potential. For instance, picking a yellow ball from a bag full of yellow balls; you know exactly what you'll get.

If an algorithm uses Entropy, it aims to maximize Information Gain, which means maximizing the reduction in entropy from the parent node to the child nodes.

The Information Gain

Information gain quantifies a feature's effectiveness within a tree structure, indicating its usefulness in separating data into distinct classes.

This usefulness is defined by how much a data split reduces impurity during classification.

Mathematically, Information Gain quantifies the reduction in entropy from a parent node to its child nodes with respect to a given feature (A):

IG(D, A) = E(D) - \sum_{v \in A} \underbrace{\frac{|D_v|}{|D|}}_{\text{weighted avg.}} E(D_v)

where:

  • IG(D, A): Information Gain of dataset D when split by feature A,

  • E(D): Entropy of the dataset D before the split,

  • E(D_v): Entropy of subset D_v (a child node),

  • v ∈ A: All possible values of feature A,

  • |D_v|: The number of instances in subset D_v (where feature A has value v), and

  • |D|: Total number of instances in dataset D.
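To make these formulas concrete, here is a minimal Python sketch (the helper names entropy and information_gain are my own) that computes both quantities for a parent node and a proposed two-way split; it reproduces numbers that appear in the walkthrough later in this post:

import numpy as np

def entropy(labels):
    """Entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of its children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# toy split: a 6 'Yes' / 4 'No' parent divided into two child nodes
parent = ['Yes'] * 6 + ['No'] * 4
left, right = ['No', 'No'], ['Yes'] * 6 + ['No'] * 2
print(round(entropy(parent), 3))                          # 0.971
print(round(information_gain(parent, [left, right]), 3))  # 0.322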

2. Gini Impurity: Maximizing Gini Gain

Gini Impurity quantifies how often a randomly chosen element from the set would be incorrectly labeled.

The Gini Impurity of a given dataset D is defined as:

G(D) = 1 - \sum_{i=1}^{C} p_i^2

where:

  • G(D): Gini Impurity,

  • D: The dataset (or a subset of data at a particular node),

  • C: The number of unique classes in the dataset, and

  • p_i​: The proportion of samples belonging to class i.

Key characteristics of Gini Impurity include:

  • Ranges from zero (perfect purity) to 1 − 1/C (perfect impurity),

  • A lower Gini impurity indicates a more homogeneous node.

  • Computationally faster than entropy-based methods.

If an algorithm uses Gini Impurity, it aims to maximize Gini Gain (or minimize Gini Impurity), meaning maximizing the reduction in Gini impurity from a parent node to its child nodes.

The Gini Gain

Gini gain is the parent node’s Gini impurity minus the weighted sum of the child nodes’ Gini impurities:

\Delta G = G_{\text{parent}} - \sum_{j=1}^{k} \frac{N_j}{N} G_{\text{child}_j}

where:

  • G: Gini impurity of the parent or child node,

  • k: The number of child nodes after the split,

  • N_j: The number of samples in child node j, and

  • N: The total number of samples in the parent node.
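Here is the analogous sketch for Gini impurity and Gini gain (again, gini and gini_gain are my own helper names), using the same toy split as the entropy example above:

import numpy as np

def gini(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, children):
    """Parent Gini impurity minus the weighted Gini impurity of its children."""
    n = len(parent)
    weighted = sum(len(c) / n * gini(c) for c in children)
    return gini(parent) - weighted

parent = ['Yes'] * 6 + ['No'] * 4
left, right = ['No', 'No'], ['Yes'] * 6 + ['No'] * 2
print(round(gini(parent), 3))                      # 0.48
print(round(gini_gain(parent, [left, right]), 3))  # 0.18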

Gini gain is the default criterion used by the CART (Classification and Regression Trees) algorithm, which is widely implemented in ML libraries like Scikit-learn.

***

Defining how to compute the gain is crucial for how decision tree algorithms learn and decide the optimal splits at each node to create effective predictive models.

Greedy Algorithm to Maximize Gain

After choosing an impurity measure to quantify the gain, the model uses a greedy algorithm to find the split point that yields the highest gain at each node.

In this section, I'll provide an overview of three major algorithms, then walk through examples in a later section.

1. Exact Greedy Algorithm

This method considers every unique value of a continuous feature as a potential split point.

  • Pros: Precisely calculates the gain for each candidate, leading to the most accurate split.

  • Cons: Computationally intensive and slow for large datasets with many unique values.

2. Approximate Greedy Algorithm

This method addresses scalability by proposing candidate split points based on quantiles of feature distributions.

It buckets continuous features and searches these approximate ranges for optimal splits, balancing accuracy with speed.

  • Pros: Faster than the exact algorithm, and handles massive datasets that do not fit in memory well enough to build full histograms.

  • Cons: Less precise than the exact search, so the chosen split may be suboptimal.

3. Histogram-based Algorithm

This method pre-bins continuous features into fixed discrete bins (histograms).

Splits are then rapidly found by iterating through these pre-built histograms, offering significant speedups at the cost of some precision.

  • Pros: The most efficient method for large datasets.

  • Cons: Might miss the absolute best split by a tiny margin (though in practice the difference is often negligible).
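To make the difference between the three strategies concrete, here is a minimal sketch (the feature values are the ones from the walkthrough below; the quantile choice and bin count are illustrative assumptions) showing which candidate split points each strategy would consider for one continuous feature:

import numpy as np

values = np.sort(np.array([1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5]))

# 1. exact greedy: every midpoint between consecutive distinct values
exact_candidates = (values[:-1] + values[1:]) / 2

# 2. approximate greedy: a reduced candidate set, e.g. quartiles of the distribution
approx_candidates = np.quantile(values, [0.25, 0.5, 0.75])

# 3. histogram-based: pre-binned feature, splits only at interior bin edges
hist_candidates = np.histogram_bin_edges(values, bins=3)[1:-1]

print(len(exact_candidates), len(approx_candidates), len(hist_candidates))  # 9 3 2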

The Walkthrough Example

Using the following scenario as an example, I’ll walk step by step through how a decision tree works with each of the three algorithms.

Scenario: Predicting Customer Subscription

The Goal

A subscription service manager wants to predict which new sign-ups are most likely to convert into long-term subscribers (target variable: is_long_term: Yes/No).

They have access to data on internet_usage_hrs_day (how much time a user spends online daily) and device_preference (their primary device: Mobile, Desktop, Tablet).

The Data

ID | internet_usage_hrs_day | device_preference | is_long_term
1  | 1.2                    | Mobile            | No
2  | 2.8                    | Desktop           | No
3  | 3.1                    | Mobile            | Yes
4  | 4.5                    | Desktop           | Yes
5  | 5.9                    | Mobile            | Yes
6  | 6.3                    | Desktop           | Yes
7  | 7.7                    | Tablet            | Yes
8  | 8.4                    | Mobile            | No
9  | 9.1                    | Desktop           | No
10 | 10.5                   | Tablet            | Yes

* internet_usage_hrs_day contains continuous values; device_preference contains discrete values.

Step 1. Compute Initial Impurity

The root node begins with the entire dataset:

  • Total Samples: 10

  • Class Distribution: 6 'Yes', 4 'No'

The initial proportion of samples belonging to each class (Yes, No) is:

  • p_Yes = 6/10 = 0.6

  • p_No = 4/10 = 0.4

Substituting these values to the impurity measure:

  • Initial Entropy: E(D) = −(0.6 log2(0.6) + 0.4 log2(0.4)) ≈ 0.971

  • Initial Gini Impurity: G(D) = 1 − (0.6² + 0.4²) = 0.48

*For demonstration purposes, I’ll compute both measures. In reality, we’d choose one and stick to it until the end of the process.
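As a quick sanity check (assuming NumPy), both numbers can be reproduced directly from the class proportions:

import numpy as np

p = np.array([0.6, 0.4])        # p_Yes, p_No at the root node
print(-np.sum(p * np.log2(p)))  # ≈ 0.971, the initial entropy
print(1 - np.sum(p ** 2))       # 0.48, the initial Gini impurity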

Step 2. Choosing the Optimization Algorithm

The manager selects the optimization algorithm based on their priorities.

Scenario 1.

Manager: “For this initial pilot, our dataset is quite small, and I don’t think computational expense would be a problem. I prioritize achieving the most accurate split and generate insightful decision rules first.”

→ Choose Exact Greedy Algorithm

Scenario 2.

Manager: “If this pilot scales to millions of users, an exact search would be too slow. Let me start the process with random split points instead of all the splits to see if we still get a good enough split without freezing our system.”

→ Choose Approximate Greedy Algorithm

Scenario 3.

Manager: “I'm considering expanding our datasets with many more features to gain deeper insights. The expansion would require intensive model training. So, I need to prioritize the training speed. I'm willing to accept a slight loss of precision in split points for the speed increase.”

→ Choose Histogram-based Algorithm

Step 3. Computing Gains

I’ll demonstrate how to compute gains and decide which feature to prioritize to make a prediction on long-term subscribers.

The common steps across the scenarios are:

  1. Select split points by feature,

  2. Compute entropy or Gini impurity, then compute the gains, and

  3. Choose the split point with the maximum gain.

Scenario 1. Apply Exact Greedy Algorithm

Feature 1: internet_usage_hrs_day

With float values, the number of potential split points technically increases. In practice, algorithms usually consider midpoints between sorted, distinct values where the target class changes:

  • Split 1: internet_usage_hrs_day <= 2.95 (midpoint of 2.8 and 3.1)

  • Split 2: internet_usage_hrs_day <= 8.05 (midpoint of 7.7 and 8.4)

Figure C. Applying Exact Greedy Algorithm to the feature (Created by Kuriko IWAI)

I’ll evaluate these splits by computing entropy and Gini impurity, and then computing the gains.

Split 1.

  • Left Child

    • 2 samples: 2 'No', 0 'Yes'

    • p_No ​= 2/2=1

    • p_Yes ​= 0/2=0

    • Entropy: E(D) = 0 (Perfectly pure)

    • Gini: G(D) = 0 (Perfectly pure)

  • Right Child

    • 8 samples: 2 'No', 6 'Yes'

    • p_No​ = 2/8 = 0.25

    • p_Yes​ = 6/8 = 0.75

    • Entropy: E(D) = −(0.25 log2(0.25) + 0.75 log2(0.75)) ≈ −(−0.5 − 0.311) ≈ 0.811

    • Gini: G(D) = 1 − (0.25² + 0.75²) = 1 − 0.625 = 0.375

→ Gains

  • Information Gain (Entropy): IG = 0.971 − (2/10 × 0 + 8/10 × 0.811) = 0.971 − 0.649 = 0.322

  • Gini Gain (Gini Impurity): ΔG = 0.48 − (2/10 × 0 + 8/10 × 0.375) = 0.48 − 0.30 = 0.180

The following tree illustrates the computation of information gain (left) and Gini gain (right) from the root node to the first level:

Figure D. Tree map of the subscription scenario based on the exact greedy algorithm with Entropy (left) and Gini (right) impurity measures (Created by Kuriko IWAI)

Split 2.

  • Left Child

    • 7 samples: 2 'No', 5 'Yes'

    • p_No​ = 2/7

    • p_Yes ​= 5/7

    • Entropy: E(D)=− ((2/7) log2​(2/7) + (5/7) log2​(5/7) ) ≈ −(−0.517 − 0.346) ≈ 0.863

    • Gini: G(D)=1− ((2/7)²+(5/7)²) = 1 − 29/49 ≈0.408

  • Right Child

    • 3 samples: 2 'No', 1 'Yes'

    • p_No ​= 2/3

    • p_Yes ​= 1/3

    • Entropy: E(D) = −((2/3) log2​(2/3) + (1/3) log2​(1/3) ) ≈ −(−0.390 − 0.528) ≈ 0.918

    • Gini: G(D) = 1−((2/3)² + (1/3)²) = 1 − 5/9 ≈ 0.444

→ Gains

  • Information Gain: IG = 0.971 − ( 7/10​ × 0.863+ 3/10 ​× 0.918 ) = 0.971 − 0.879 = 0.092

  • Gini Gain: ΔG=0.48 − ( 7/10 ​× 0.408 + 3​/10 × 0.444 ) = 0.48 − 0.419 = 0.061

Feature 2: device_preference

For categorical features, each category can be assigned to a separate node for potential splitting:

  • Split 1: Node 'Mobile'

    • 3 Samples: 2 'No', 1 'Yes'

    • p_Yes = 1/3

    • p_No = 2/3

    • Entropy ≈ 0.918

    • Gini ≈ 0.444

  • Split 2: Node ‘Desktop’

    • 4 Samples: 3 'No', 1 'Yes'

    • p_Yes = 1/4

    • p_No = 3/4

    • Entropy ≈ 0.811

    • Gini ≈ 0.375

  • Split 3: Node ‘Tablet’

    • 2 Samples: 0 'No', 2 'Yes'

    • p_Yes = 2/2 = 1

    • p_No = 0/2 = 0

    • Entropy = 0 (purity)

    • Gini = 0 (purity)

→ Gains:

  • Information Gain: IG = 0.971 − (3/10 ​× 0.918 + 4/10 ​× 0.811 + 2/10 ​× 0) = 0.971 − 0.599 = 0.372

  • Gini Gain: ΔG = 0.48 − (3/10 ​× 0.444 + 4/10​ x 0.375 + 2/10 × 0) = 0.48 - 0.283 = 0.197

Decision Making

With either Entropy or Gini impurity, splitting by the categories of device_preference turned out to be the optimal split, as it outperformed the gains from internet_usage_hrs_day.

This result indicates that in this dataset, device preference is a stronger predictor than daily internet usage hours for whether a customer becomes a long-term subscriber.

Impurity Method | Feature                | Breaking Point          | Gain
Entropy         | internet_usage_hrs_day | ≤ 2.95                  | 0.322
Entropy         | internet_usage_hrs_day | ≤ 8.05                  | 0.092
Entropy         | device_preference      | Mobile, Desktop, Tablet | 0.372 ← WINNER
Gini            | internet_usage_hrs_day | ≤ 2.95                  | 0.180
Gini            | internet_usage_hrs_day | ≤ 8.05                  | 0.061
Gini            | device_preference      | Mobile, Desktop, Tablet | 0.197 ← WINNER

*Gini and Entropy gains shouldn’t be compared directly because their mathematical formulations differ, as we saw in the previous section.
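For reference, the whole comparison above can be reproduced with a short script; the sketch below reuses the entropy, information_gain, gini, and gini_gain helpers from earlier and hard-codes the ten walkthrough rows:

import pandas as pd

df = pd.DataFrame({
    'hrs':    [1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5],
    'device': ['Mobile', 'Desktop', 'Mobile', 'Desktop', 'Mobile',
               'Desktop', 'Tablet', 'Mobile', 'Desktop', 'Tablet'],
    'target': ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes'],
})

# numerical feature: exact greedy candidates (midpoints where the class changes)
for threshold in [2.95, 8.05]:
    mask = df['hrs'] <= threshold
    children = [df.loc[mask, 'target'], df.loc[~mask, 'target']]
    print(f"hrs <= {threshold}: IG = {information_gain(df['target'], children):.3f}, "
          f"Gini gain = {gini_gain(df['target'], children):.3f}")

# categorical feature: one child node per category
children = [group['target'] for _, group in df.groupby('device')]
print(f"device_preference: IG = {information_gain(df['target'], children):.3f}, "
      f"Gini gain = {gini_gain(df['target'], children):.3f}")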


Scenario 2. Apply Approximate Greedy Algorithm

Feature 1: internet_usage_hrs_day

For numerical features, the approximate greedy algorithm evaluates only a reduced subset of candidate split points to save computation.

So, the manager decides to split at random points of 5.0 and 9.0.

  • Split 1: internet_usage_hrs_day <= 5.0

  • Split 2: internet_usage_hrs_day <= 9.0

Figure F. Applying Approximate Greedy Algorithm to the feature (Created by Kuriko IWAI)

Split 1.

  • Left Child

    • 5 samples: 2 'No', 3 'Yes'

    • p_YES = 3/5 = 0.6

    • p_No = 2/5 = 0.4 (happens to be the same distribution as the root node)

    • Entropy ≈ 0.971

    • Gini = 0.48

  • Right Child

    • 5 samples: 2 'No', 3 'Yes'

    • p_YES = 3/5 = 0.6

    • p_No = 2/5 = 0.4

    • Entropy ≈ 0.971

    • Gini = 0.48

→ Gains:

  • Information Gain: IG=0.971−(5 / 10 ​× 0.971 + 5 / 10 ​×0.971) = 0.971 - 0.971 = 0.0

  • Gini Gain: ΔG=0.48−(5 / 10 ​× 0.48 + 5 / 10 ​× 0.48) = 0.48 -0.48 = 0.0

Split 2

  • Left Child

    • 9 samples: 4 'No', 5 'Yes'

    • p_YES = 5/9

    • p_No = 4/9

    • Entropy ≈0.991

    • Gini ≈0.494

  • Right Child

    • 1 sample: 0 'No', 1 'Yes'

    • p_YES = 1

    • p_No = 0

    • Entropy = 0

    • Gini = 0

→ Gains:

  • Information Gain: IG= 0.971 − (9 / 10 ​× 0.991 + 1 / 10 ​× 0) = 0.971 − 0.892 = 0.079

  • Gini Gain: ΔG = 0.48−(9 / 10 ​× 0.494 + 1 / 10 ​× 0) = 0.48 - 0.445 = 0.035

Feature 2: device_preference

As with Exact Greedy, categorical features are typically still evaluated over all categories, so I’ll reuse the results from Scenario 1.

  • Information Gain: IG = 0.372

  • Gini Gain: ΔG = 0.197

Decision Making

Similar to Scenario 1, with either impurity measure, splitting by the categories of device_preference performed the best.

Impurity Method | Feature                | Breaking Point          | Gain
Entropy         | internet_usage_hrs_day | ≤ 5.0                   | 0.0
Entropy         | internet_usage_hrs_day | ≤ 9.0                   | 0.079
Entropy         | device_preference      | Mobile, Desktop, Tablet | 0.372 ← WINNER
Gini            | internet_usage_hrs_day | ≤ 5.0                   | 0.0
Gini            | internet_usage_hrs_day | ≤ 9.0                   | 0.035
Gini            | device_preference      | Mobile, Desktop, Tablet | 0.197 ← WINNER

Scenario 3. Apply Histogram-based Algorithm

Feature 1: internet_usage_hrs_day

The histogram-based greedy algorithm first bins numerical features into a fixed number of bins and then only considers splits at the bin boundaries.

The manager decides to make 3 bins (Hrs <= 3.0, 3.0 < Hrs <= 7.0, Hrs > 7.0), so only the splits at Hrs <= 3.0 and Hrs <= 7.0 are considered.
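Assuming NumPy, the binning step itself is a one-liner; only the two interior edges are then evaluated as split points:

import numpy as np

hrs = np.array([1.2, 2.8, 3.1, 4.5, 5.9, 6.3, 7.7, 8.4, 9.1, 10.5])
edges = [3.0, 7.0]                              # interior bin boundaries chosen by the manager
bin_ids = np.digitize(hrs, edges, right=True)   # 0: hrs <= 3.0, 1: 3.0 < hrs <= 7.0, 2: hrs > 7.0
print(bin_ids)                                  # [0 0 1 1 1 1 2 2 2 2]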

Split 1. Hrs <= 3.0

  • Left Child

    • 2 samples: 2 'No', 0 'Yes'

    • p_Yes = 0

    • p_No = 1

    • Entropy = 0

    • Gini = 0

  • Right Child

    • 8 samples: 2 'No', 6 'Yes'

    • p_Yes = 6/8

    • p_No = 2/8

    • Entropy ≈ 0.811

    • Gini ≈ 0.375

→ Gains:

  • Information Gain: IG = 0.971 − ( 2/10 ​× 0 + 8/10 ​× 0.811 ) = 0.971 - 0.649 = 0.322

  • Gini Gain: ΔG=0.48 − (2/10 ​× 0 + 8/10 ​× 0.375 ) = 0.48− 0.3 = 0.180

Split 2. Hrs <= 7.0

  • Left Child

    • 7 samples: 2 'No', 5 'Yes'

    • p_Yes = 5/7

    • p_No = 2/7

    • Entropy ≈ 0.863

    • Gini ≈ 0.408

  • Right Child

    • 3 samples: 2 'No', 1 'Yes'

    • p_Yes = 1/3

    • p_No = 2/3

    • Entropy ≈ 0.918

    • Gini ≈ 0.444

→ Gains:

  • Information Gain: IG = 0.971 − (7/10 ​× 0.863 + 3/10 ​× 0.918) = 0.971 - 0.879 = 0.092

  • Gini Gain: ΔG=0.48 − (7/10 ​× 0.408 + 3/10 ​× 0.444) = 0.48− 0.419 = 0.061

Feature 2: device_preference

The histogram-based algorithm takes the same approach as Exact / Approximate Greedy.

  • Information Gain: IG = 0.372

  • Gini Gain: ΔG = 0.197

Decision Making

Similar to the other scenarios, splitting by the category of device_preference performed the best.

However, for the numerical feature internet_usage_hrs_day, the split at Hrs <= 3.0 resulted in an Information Gain of 0.322 and a Gini Gain of 0.180.

Interestingly, these gains are identical to the highest gains found by the Exact Greedy algorithm for internet_usage_hrs_day (at ≤2.95).

This demonstrates that if bin boundaries align closely with optimal split points, the Histogram-based method can achieve similar results with much higher efficiency.

Impurity Method | Feature                | Breaking Point          | Gain
Entropy         | internet_usage_hrs_day | ≤ 3.0                   | 0.322
Entropy         | internet_usage_hrs_day | ≤ 7.0                   | 0.092
Entropy         | device_preference      | Mobile, Desktop, Tablet | 0.372 ← WINNER
Gini            | internet_usage_hrs_day | ≤ 3.0                   | 0.180
Gini            | internet_usage_hrs_day | ≤ 7.0                   | 0.061
Gini            | device_preference      | Mobile, Desktop, Tablet | 0.197 ← WINNER

Summary of Findings

Across all three greedy algorithms, device_preference consistently yields the highest gain and thus represents the optimal split point for the initial node.

Dominance of Categorical Feature

  • In Exact Greedy, device_preference achieved an Information Gain of 0.372 and a Gini Gain of 0.197 - the highest gains across all evaluated splits.

  • In Approximate Greedy, device_preference maintained its Information Gain of 0.372 and Gini Gain of 0.197, again surpassing all tested splits.

  • In the Histogram-based Greedy approach, device_preference continued to exhibit gains of 0.372 (Entropy) and 0.197 (Gini), outperforming its numerical counterparts.

Variations in Numerical Feature Handling

  • Exact Greedy performed an exhaustive search, finding the highest gains for Internet_Usage_Hrs_Day at the ≤2.95 split (Entropy: 0.322, Gini: 0.180).

    This represents the maximum gain achievable for this feature under its specified split derivation method.

  • Approximate Greedy, by evaluating only a random subset of split points (e.g., ≤5.0 and ≤9.0), yielded significantly lower gains. This clearly demonstrates the trade-off, where a simplified search missed the higher gain possible for the numerical feature.

  • Histogram-based Greedy, by evaluating at predefined bin boundaries (e.g., ≤3.0 and ≤7.0), remarkably found a split at ≤3.0 that produced an Entropy Gain of 0.322 and Gini Gain of 0.180. These gains are identical to the maximum achieved by the Exact Greedy algorithm for this feature, indicating that the chosen bin boundary aligned perfectly with a highly effective split point in this particular dataset.

Simulation

Using a more complex dataset, I’ll demonstrate how this works with the DecisionTreeClassifier from the scikit-learn library.

Preparing Dataset

I created synthetic datasets by adding features to the walkthrough example data.

Then, I applied SMOTE to the training set to address the class imbalance, generating 120 samples with 20 features.

  • Training dataset: X_train: (100, 20), y_train: (100,)

  • Validation dataset: X_val: (20, 20), y_val: (20,)

  • Test dataset: X_test: (20, 20), y_test: (20,)
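A minimal sketch of that preparation, assuming a synthetic stand-in generated with make_classification and SMOTE from the imbalanced-learn package (the exact feature engineering I used differs, so the shapes below are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# synthetic stand-in for the expanded subscription dataset: 20 features, imbalanced classes
X, y = make_classification(n_samples=140, n_features=20, weights=[0.7, 0.3], random_state=12)

# hold out validation and test sets before any resampling
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=40, stratify=y, random_state=12)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=20, stratify=y_tmp, random_state=12)

# SMOTE is applied to the training split only, so validation/test stay untouched
X_train, y_train = SMOTE(random_state=12).fit_resample(X_train, y_train)
print(X_train.shape, X_val.shape, X_test.shape)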

Model Training

I’ll define the Exact and Approximate Greedy algorithms using the DecisionTreeClassifier from the scikit-learn library.

Exact Greedy

from sklearn.tree import DecisionTreeClassifier

dt_exact_entropy = DecisionTreeClassifier(
    criterion='entropy',       # entropy impurity measure
    max_features=None,         # explicitly set: consider all features at each split
    class_weight='balanced',
    random_state=12
).fit(X_train, y_train)

dt_exact_gini = DecisionTreeClassifier(
    criterion='gini',          # gini impurity measure
    max_features=None,
    class_weight='balanced',
    random_state=12
).fit(X_train, y_train)

Tree Structure

The figure illustrates the tree structure of the classification. Entropy (left) and Gini (right) generated similar results.

Figure G. Tree Structure (Exact Greedy) (Created by Kuriko IWAI)
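For reference, the trees in Figure G can be rendered with scikit-learn's plot_tree; a minimal sketch, assuming the fitted classifiers above and matplotlib:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
plot_tree(dt_exact_entropy, ax=axes[0], filled=True, max_depth=2)  # entropy tree (left)
plot_tree(dt_exact_gini, ax=axes[1], filled=True, max_depth=2)     # gini tree (right)
plt.show()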

Approximate Greedy

from sklearn.tree import DecisionTreeClassifier

dt_approx_entropy = DecisionTreeClassifier(
    criterion='entropy',
    max_features='sqrt',       # approximate-greedy equivalent: considers sqrt(n_features) at each split
    class_weight='balanced',
    random_state=12,
).fit(X_train, y_train)

dt_approx_gini = DecisionTreeClassifier(
    criterion='gini',
    max_features='sqrt',
    class_weight='balanced',
    random_state=12,
).fit(X_train, y_train)

Tree Structure

The figure illustrates the tree structure of the Approximate Greedy algorithm. Entropy (left) and Gini (right) generated different results, and both have more complex structures than the Exact Greedy trees.

Figure H. Tree Structure (Approximate Greedy) (Created by Kuriko IWAI)

Histogram-based Algorithm

I used the HistGradientBoostingClassifier from the Scikit-Learn library to approximate an entropy-based, histogram-based model.

Note that gradient boosting requires a differentiable per-sample loss (here, log loss); Gini impurity is a node-splitting criterion rather than such a loss, so there is no Gini-based counterpart here.

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance

dt_hist_entropy = HistGradientBoostingClassifier(
    loss='log_loss',             # entropy equivalent (cross-entropy loss)
    l2_regularization=0.1,       # L2 regularization term
    learning_rate=0.05,          # step size of each iteration
    max_depth=5,                 # max depth of individual trees
    min_samples_leaf=20,         # minimum samples required at a leaf node
    n_iter_no_change=10,         # early stopping if validation score doesn't improve for 10 iterations
    validation_fraction=0.1,     # 10% of training data used for validation
    random_state=12,
).fit(X_train, y_train)

Permutation Importances by Feature

Since HistGradientBoostingClassifier can't be visualized with plot_tree, I'll examine the permutation importance by feature instead. It turns out that support_contacts is the most important feature.

Figure I. Permutation importance by feature (Created by Kuriko IWAI)
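The importances in Figure I can be computed along these lines (a sketch; the scoring choice and number of repeats are assumptions):

from sklearn.inspection import permutation_importance

result = permutation_importance(
    dt_hist_entropy, X_val, y_val,
    n_repeats=10, scoring='accuracy', random_state=12,
)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}")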

Performance Evaluation

# 'model' stands in for each fitted classifier (dt_exact_entropy, dt_exact_gini, ...)
y_pred_train = model.predict(X_train)
y_pred_val = model.predict(X_val)
y_pred_test = model.predict(X_test)
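The accuracy scores below come from comparisons along these lines (a sketch, assuming accuracy_score from scikit-learn):

from sklearn.metrics import accuracy_score

print(f"Train: {accuracy_score(y_train, y_pred_train):.4f}")
print(f"Validation: {accuracy_score(y_val, y_pred_val):.4f}")
print(f"Test (generalization): {accuracy_score(y_test, y_pred_test):.4f}")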

Accuracy Scores

  • Exact - Entropy: Train: 1.0000, Validation: 0.7500, Test (generalization): 0.5500

  • Exact - Gini: Train: 1.0000, Validation: 0.5500, Test (generalization): 0.7000

  • Approximate - Entropy: Train: 1.0000, Validation: 0.6500, Test (generalization): 0.5500

  • Approximate - Gini: Train: 1.0000, Validation: 0.5000, Test (generalization): 0.4000

  • Histogram-based: Train: 0.9200, Validation: 0.6500, Test (generalization): 0.7000

The HistGradientBoostingClassifier with the L2 and other regularization parameters demonstrates much better generalization capabilities, as intended.

The single DecisionTreeClassifier instances, without strong regularization, tend to overfit the small dataset heavily.

The HistGradientBoostingClassifier's performance highlights the benefits of ensemble methods and proper regularization in achieving a better balance between bias and variance.

Conclusion

Decision trees are versatile and interpretable models that make predictions by recursively splitting the data on the most informative input features.

In the experiment, we observed that the choice of algorithm and impurity measure significantly influences a decision tree's structure and predictive performance.

Conversely, we also observed that decision trees are prone to overfitting.

When trees are used in machine learning, models employ regularization techniques such as depth limits, minimum leaf sizes, and ensembling to improve generalization and prevent overfitting.


Share What You Learned

Kuriko IWAI, "Mastering Decision Trees: From Impurity Measures to Greedy Optimization" in Kernel Labs

https://kuriko-iwai.com/decision-tree-algorithm-core


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.