Building Powerful Naive Bayes Ensembles for Mixed Datasets

Explore an ensemble approach to make Naive Bayes more robust to complex datasets

Machine Learning | Data Science | Python

By Kuriko IWAI

Table of Contents

Introduction
What is Naive Bayes
The Goal of Naive Bayes
How Naive Bayes Assumptions Reduce Computational Complexity
How Naive Bayes Optimizes Its Model Parameters
Core Decision Rules
Handling Mixed Data Types with Naive Bayes
The Approach
The Project
Categorizing Features
Preprocessing the Dataset
Building Individual Pipelines by Data Type
Ensemble - Multiple Pipelines
Make Predictions
Results
Wrapping Up

Introduction

Naive Bayes is a generative learning algorithm in machine learning, used in classification tasks.

It’s powerful in its simplicity and efficiency, often performing surprisingly well even with a small amount of training data.

However, it requires a careful understanding of the data types because it assumes that features are independent of each other. This is a strong assumption, and it’s rarely true in real-world data, which could limit its performance.

In this article, I’ll leverage ensemble techniques, voting and stacking, to build a more powerful model from multiple Naive Bayes classifiers.

What is Naive Bayes

Naive Bayes is a generative learning algorithm that leverages the Naive Bayes (NB) assumption, where the model assumes input features are conditionally independent of each other given the class label.

This conditional independence assumption streamlines computation, eliminating the need to calculate covariance, unlike Gaussian Discriminant Analysis (GDA), and consequently enables NB to perform well even with limited data.

For instance, in an animal classification task with classes { elephant, dog, cat }, the input features weight and height would not influence each other in predicting the outcome. If the likelihood of observing a weight of 10 pounds given the class ‘cat’ is 0.9 (mathematically, P(weight = 10 pounds | class = ‘cat’) = 0.9), this conditional probability remains the same even after observing the other feature, height = 2 ft.

In reality, this conditional independence assumption rarely holds, but it streamlines the computation process and allows NB to perform well even with limited data, unlike algorithms such as Quadratic Discriminant Analysis (QDA), which require covariance calculations.

The Goal of Naive Bayes

Just like other generative learning algorithms, Naive Bayes aims to find a class label (y*) with a maximum posterior probability (P(y|x)) such that:

y^* = \arg\max_y P(y|x)

To compute P(y|x), NB leverages Bayes’ Theorem where the joint likelihood (P(x,y)) is computed as the product of the prior probability (P(y)) and the conditional probability (P(x∣y)):

P(y|x) = \frac{P(x, y)}{P(x)} = \frac{P(x|y) \cdot P(y)}{P(x)}

where

  • P(y|x): The posterior probability of class y given the input x,

  • P(x|y): The conditional probability (likelihood) of the input x given class y,

  • P(y): The prior probability of the class label y, and

  • P(x): The marginal probability of the evidence (the input features).

The marginal probability of the input feature x is the sum of the joint probabilities of x and each class.

In binary classification, the marginal probability P(x) is denoted:

P(x) = P(x|y=1)P(y=1) + P(x|y=0)P(y=0)

This expression is also known as the evidence in Bayes' theorem.

P(x) is a complex calculation, but it can be disregarded in the final classification decision because, for a given input x, it is a constant normalizing factor across all classes.
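To make the normalizing role of P(x) concrete, here is a minimal sketch with hypothetical numbers (the priors and likelihoods below are made up for illustration):

```python
# Hypothetical numbers for a binary example: priors P(y) and
# likelihoods P(x|y) for one observed input x.
p_y = {0: 0.8, 1: 0.2}          # prior probabilities
p_x_given_y = {0: 0.1, 1: 0.6}  # conditional likelihoods of the observed x

# joint probabilities P(x, y) = P(x|y) * P(y)
joint = {y: p_x_given_y[y] * p_y[y] for y in p_y}

# the evidence P(x) is the sum of the joints over all classes
p_x = sum(joint.values())

# full posterior P(y|x) = P(x, y) / P(x)
posterior = {y: joint[y] / p_x for y in joint}

# argmax over the joint alone picks the same class: P(x) is a constant
best_by_joint = max(joint, key=joint.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_joint == best_by_posterior)  # True
```

Dividing by P(x) rescales every class score by the same amount, so the winning class never changes.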

How Naive Bayes Assumptions Reduce Computational Complexity

In this section, I’ll demonstrate how NB assumptions simplify the computational process of the conditional probability P(x|y).

First, P(x|y) is expanded using the chain rule of probability:

P(x|y) = P(x_1, x_2, \dots, x_n | y) = P(x_1|y) \cdot P(x_2|x_1, y) \cdot P(x_3|x_1, x_2, y) \cdots P(x_n|x_1, \dots, x_{n-1}, y)

where n represents the total number of input features.

The challenge here is that the number of dependencies P(x_i | other features, y) we need to compute in the formula increases exponentially with n.

Taking a binary classification for an example:

  • If n = 10 => 2 × (2¹⁰ − 1) = 2,046

  • If n = 20 => 2 × (2²⁰ − 1) ≈ 2 million

  • If n = 30 => 2 × (2³⁰ − 1) ≈ 2 billion
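The counts above can be reproduced with a short sketch (the helper names are mine):

```python
def full_dependency_terms(n: int, n_classes: int = 2) -> int:
    # without the NB assumption, a full joint table over n binary features
    # requires (2^n - 1) free parameters per class
    return n_classes * (2 ** n - 1)

def naive_bayes_terms(n: int, n_classes: int = 2) -> int:
    # with the NB assumption, only one P(x_i|y) per feature and class
    return n_classes * n

for n in (10, 20, 30):
    print(n, full_dependency_terms(n), naive_bayes_terms(n))
```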

The NB assumption eliminates the need to compute any conditional probability that conditions on other features, such as P(x_2 | x_1, y):

P(x|y) = P(x_1, x_2, \dots, x_n | y) = P(x_1|y) \cdot P(x_2|y) \cdots P(x_n|y) = \prod_{i=1}^n P(x_i|y)

In the above formula, P(x|y) simplifies dramatically to just n terms per class. The objective function then becomes:

y^* = \arg\max_y P(y|x) \propto \arg\max_y P(x, y) \propto \arg\max_y \Big( P(y) \cdot \prod_{i=1}^n P(x_i|y) \Big) \quad \cdots (A)

How Naive Bayes Optimizes Its Model Parameters

Naive Bayes often uses Maximum Likelihood Estimation (MLE) to find its optimal model parameters that can maximize the objective function.

For computational efficiency, MLE takes the logarithm of the objective function.

So, the optimal set of model parameters (Θ) is denoted:

\hat{\Theta}_{MLE} = \arg\max_{\Theta} \log L(\Theta)

where:

  • Θ: The entire set of model parameters,

  • Θ^_{MLE}: The MLE estimate of the optimal set of model parameters,

  • L(Θ): The likelihood of the training data under Θ.

By substituting Formula A:

\hat{\Theta}_{MLE} = \arg\max_{\Theta} \log \Big[ \prod_{i=1}^{m} \Big( P(y^{(i)}|\Theta) \prod_{j=1}^{n} P(x_j^{(i)}|y^{(i)}, \Theta) \Big) \Big] = \arg\max_{\Theta} \sum_{i=1}^{m} \Big( \log P(y^{(i)}|\Theta) + \sum_{j=1}^{n} \log P(x_j^{(i)}|y^{(i)}, \Theta) \Big)

Here, taking the logarithm improves computational efficiency by turning the product into a sum, which also avoids numerical underflow.
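The underflow point is easy to demonstrate: a raw product of many small per-feature likelihoods collapses to zero in floating point, while the sum of logs stays finite. A small sketch with synthetic likelihood values:

```python
import numpy as np

rng = np.random.default_rng(42)
# 1,000 synthetic per-feature likelihoods P(x_i|y), each in (0.01, 0.5)
likelihoods = rng.uniform(0.01, 0.5, size=1000)

direct_product = np.prod(likelihoods)  # underflows to 0.0 in float64
log_sum = np.sum(np.log(likelihoods))  # stays finite and comparable

print(direct_product)  # 0.0
print(log_sum)
```

Comparing classes by summed log probabilities gives the same argmax as comparing raw products, without the underflow.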

Core Decision Rules

After learning the optimal parameters, the model makes a final prediction by finding the target class label such that:

\hat{y}(x; \hat{\Theta}) = \arg\max_{y \in \{c_1, \dots, c_K\}} P(y|x, \hat{\Theta}) = \arg\max_{y \in \{c_1, \dots, c_K\}} P(x|y, \hat{\Theta}) \, P(y|\hat{\Theta}) = \arg\max_{y \in \{c_1, \dots, c_K\}} P(y|\hat{\Theta}) \prod_{j=1}^{n} P(x_j|y, \hat{\Theta})

where:

  • y^(x; Θ^): The final class label predicted for the input x using the optimal model parameters Θ^,

  • P(y∣x, Θ^): The posterior probability of class y given the input x and the optimal model parameters,

  • P(y∣Θ^): The estimated prior probability of class y given the optimal parameters, and

  • P(x_j∣y, Θ^): The estimated conditional probability of feature x_j given class y.
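The decision rule above can be sketched in log space with hypothetical fitted parameters (the priors and per-feature likelihoods below are invented for illustration):

```python
import numpy as np

# Hypothetical fitted parameters for a 3-class problem with 4 features:
# log priors log P(y) and per-feature log likelihoods log P(x_j|y)
# evaluated at one observed input x.
log_prior = np.log(np.array([0.5, 0.3, 0.2]))  # shape (K,)
log_likelihood = np.log(np.array([             # shape (K, n)
    [0.2, 0.7, 0.1, 0.6],
    [0.4, 0.2, 0.3, 0.1],
    [0.1, 0.1, 0.8, 0.2],
]))

# decision rule: argmax_y [ log P(y) + sum_j log P(x_j|y) ]
scores = log_prior + log_likelihood.sum(axis=1)
y_hat = int(np.argmax(scores))
print(y_hat)  # 0
```

This is exactly the product formula above, evaluated as a sum of logs per class.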

Handling Mixed Data Types with Naive Bayes

In this section, I’ll demonstrate how to handle mixed data types with NB classifiers on a binary classification task of churn prediction, using a telecom churn dataset.

The Approach

I’ll first build separate classifiers and then combine their predictions via a voting method and a stacking method with logistic regression as the meta-learner.

The steps include:

  1. Classify input features by data type.

  2. Train multiple NB classifiers using corresponding datasets.

  3. Combine the classifiers’ outcomes via voting and stacking.

  4. Evaluate the final result.

The Project

I'll predict customer churn rate by training the NB classifiers on a telecom churn dataset from the UC Irvine Machine Learning repository.

The Dataset

  • Iranian Churn from the UC Irvine Machine Learning Repository

  • 3,500 samples

  • 14 features:

Figure A. Sample dataset (source)

Categorizing Features

Naive Bayes assumes that the likelihood P(x∣y) follows a specific, pre-defined probability distribution for each feature, conditional on the class y, such as:

  • Bernoulli - x|y ∈ { 0, 1 }
P(x|y) = p_y^{x} (1 - p_y)^{1-x}
  • Categorical - x|y ∈ { 0, 1, 2, …, K }
P(x|y) = \prod_{i=1}^K (p_{y,i})^{\mathbb{1}(x=i)}
  • Multinomial - x|y ~ Multinomial(n, p_y)
P(x|y) = \frac{n!}{x_1! x_2! \cdots x_K!} (p_{y,1})^{x_1} (p_{y,2})^{x_2} \cdots (p_{y,K})^{x_K}
  • Gaussian - x|y ~ N(μ_y, σ_y²)
P(x_i|y) = \frac{1}{\sqrt{2\pi \sigma_{y,i}^2}} e^{-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}}
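As a quick sanity check of the Gaussian case, the density formula can be evaluated directly and compared against scipy.stats.norm.pdf (the parameter values below are made up):

```python
import math
from scipy.stats import norm

# Hypothetical class-conditional parameters for one continuous feature
mu_y, sigma_y = 120.0, 15.0  # mean and std of x | y for some class y
x = 100.0

# P(x|y) from the Gaussian density formula above
manual = (1 / math.sqrt(2 * math.pi * sigma_y ** 2)) * math.exp(
    -((x - mu_y) ** 2) / (2 * sigma_y ** 2)
)

# the same value from scipy's normal pdf
assert math.isclose(manual, norm.pdf(x, loc=mu_y, scale=sigma_y))
```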

If the assumed distribution does not reflect the real, underlying distribution well, the accuracy of the prediction is compromised.

Here, I’ll first generate a quantile-quantile (QQ) plot of the conditional distributions of x|y = 0 and x|y = 1 against a normal distribution for analysis, and then classify input features into four groups: binary, categorical, multinomial, and Gaussian.

In reality, data contains multiple patterns and noise, and the choice is not so clear-cut. This is why I use an ensemble of multiple NB classifiers to improve robustness.

The category results with QQ plots are below.

Each figure shows a QQ plot (blue dots) of each feature given the class y = 0 (left) and y=1 (right). A normal distribution is plotted in red for comparison.
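A sketch of how such per-class QQ plots can be generated with scipy (the helper name qq_plot_by_class and the default target column name are my assumptions; df is the dataframe loaded later in the preprocessing step):

```python
import matplotlib.pyplot as plt
from scipy import stats

def qq_plot_by_class(df, feature: str, target: str = 'churn'):
    """Draw QQ plots of `feature` against a normal distribution, one panel per class."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, label in zip(axes, (0, 1)):
        values = df.loc[df[target] == label, feature]
        # probplot draws the sample quantiles (dots) against normal
        # quantiles, with a reference line for a perfect normal fit
        stats.probplot(values, dist='norm', plot=ax)
        ax.set_title(f'{feature} | y = {label}')
    fig.tight_layout()
    return fig
```

Calling, e.g., qq_plot_by_class(df, 'seconds_of_use') produces the two-panel figure for one feature; dots that hug the reference line suggest the Gaussian assumption is reasonable.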

Binary:

complains , tariff_plan, status

Figure B. QQ plots against normal distributions (Created by Kuriko IWAI)


Categorical:

age_group

Figure B. QQ plots against normal distributions (Created by Kuriko IWAI)


Multinomial:

age , charge_amount

Figure B. QQ plots against normal distributions (Created by Kuriko IWAI)


Gaussian:

call_failure, subscription_length, seconds_of_use, frequency_of_use, frequency_of_sms, distinct_called_numbers, customer_value

Figure B. QQ plots against normal distributions (Created by Kuriko IWAI)


Preprocessing the Dataset

After loading the data, I split it into training and test sets and applied SMOTE to the training set to handle class imbalance:

import os
from collections import Counter

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# load the dataset
file_name = 'iranian_churn.csv'  # assumed file name of the downloaded CSV
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
csv_file_path = f'{parent_dir}/_datasets/{file_name}'
df = pd.read_csv(csv_file_path)

# split the dataframe into X & y
target_col = 'churn'
X = df.drop(target_col, axis=1)
y = df[target_col]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# handle class imbalance by applying SMOTE to the training set
minority_class = y_train.value_counts().idxmin()
minority_df = y_train[y_train == minority_class]
majority_class = y_train.value_counts().idxmax()
majority_df = y_train[y_train == majority_class]

# oversample the minority class up to 3x its size, capped at the majority size
smote_target = min(len(minority_df) * 3, len(majority_df))
sampling_strategies = {
    majority_class: len(majority_df),
    minority_class: smote_target,
}
smote_train = SMOTE(sampling_strategy=sampling_strategies, random_state=42)
X_train, y_train = smote_train.fit_resample(X_train, y_train)

print(Counter(y_train))
  • Before SMOTE: Counter({0: 2135, 1: 385})

  • After SMOTE: Counter({0: 2135, 1: 1155})

Building Individual Pipelines by Data Type

I’ll build a specialized mini-pipeline for each data type, using its best-suited NB model and preprocessing methods:

from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB, CategoricalNB
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# binary
binary_features = ['complains', 'tariff_plan', 'status']
bnb = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('passthrough', 'passthrough', binary_features)], remainder='drop')
    ),
    ('classifier', BernoulliNB())
])

# categorical
categorical_features = ['age_group']
catnb = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('ordinal_encoder',
         OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
         categorical_features)],
        remainder='drop')
    ),
    ('classifier', CategoricalNB())
])

# multinomial
multinomial_features = ['age', 'charge_amount']
mnb = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('onehot', OneHotEncoder(handle_unknown='ignore'),
         multinomial_features)],
        remainder='drop')
    ),
    ('classifier', MultinomialNB())
])

# gaussian
continuous_features = [
    'call_failure',
    'subscription_length',
    'seconds_of_use',
    'frequency_of_use',
    'frequency_of_sms',
    'distinct_called_numbers',
    'customer_value',
]
gnb = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('scaler', MinMaxScaler(), continuous_features)], remainder='drop')
    ),
    ('classifier', GaussianNB())
])

Ensemble - Multiple Pipelines

I’ll integrate the four pipelines using voting and stacking, with a logistic regression model as the meta-learner.

a. Voting Method

from sklearn.ensemble import VotingClassifier

voting_classifier = VotingClassifier(
    estimators=[
        ('gnb', gnb),
        ('bnb', bnb),
        ('mnb', mnb),
        ('cnb', catnb)
    ],
    voting='soft',        # soft voting: average predicted probabilities
    weights=[1, 1, 1, 1]  # equal weights
).fit(X_train, y_train)

b. Stacking Method

import numpy as np
from sklearn.linear_model import LogisticRegression

# fit each pipeline and collect its class probabilities on the training set
prob_train_bnb = bnb.fit(X_train, y_train).predict_proba(X_train)
prob_train_catnb = catnb.fit(X_train, y_train).predict_proba(X_train)
prob_train_mnb = mnb.fit(X_train, y_train).predict_proba(X_train)
prob_train_gnb = gnb.fit(X_train, y_train).predict_proba(X_train)

# stack the probabilities side by side as meta-features
X_meta_train = np.hstack(
    (prob_train_bnb, prob_train_catnb, prob_train_mnb, prob_train_gnb)
)

meta_learner = LogisticRegression(  # the meta-learner
    random_state=42,
    solver='liblinear'
).fit(X_meta_train, y_train)  # train on the stacked probabilities

Make Predictions

Finally, I run inference on the test set using the voting_classifier and the meta_learner:

y_pred_voting = voting_classifier.predict(X_test)

# test-set probabilities from each fitted pipeline
prob_test_bnb = bnb.predict_proba(X_test)
prob_test_catnb = catnb.predict_proba(X_test)
prob_test_mnb = mnb.predict_proba(X_test)
prob_test_gnb = gnb.predict_proba(X_test)

X_meta_test = np.hstack((prob_test_bnb, prob_test_catnb, prob_test_mnb, prob_test_gnb))
y_pred_stacking = meta_learner.predict(X_meta_test)
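The accuracy and F1 figures reported below can be computed with scikit-learn; a minimal helper sketch (the name report is mine):

```python
from sklearn.metrics import accuracy_score, f1_score

def report(name: str, y_true, y_pred) -> str:
    # accuracy and F1, formatted as in the Results section
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    return f'{name}: accuracy={acc:.4f}, f1={f1:.4f}'
```

For example, report('stacking', y_test, y_pred_stacking) reproduces the stacked-ensemble test metrics, and the same call with y_pred_voting covers the voting ensemble.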

Results

The stacking ensemble achieved an 81.6% test accuracy and a 53.6% test F1-score, showing good performance on the majority class but struggling with precision for the minority class (class 1).

  • Accuracy (train): 0.8103, (test): 0.8159

  • Stacked Ensemble F1 (train): 0.7239, (test): 0.5360

Figure C. ROC Curve / Precision-Recall Curve (Comparing the combined result — stacking (sky blue) and the voting method (dark blue)) (Created by Kuriko IWAI)


For the individual Naive Bayes models, BernoulliNB emerged as the strongest performer with an accuracy of 0.875.

Both CategoricalNB and MultinomialNB achieved an accuracy of 0.825, while GaussianNB struggled significantly, yielding the lowest accuracy.

  • BernoulliNB (Binary only) Accuracy: 0.875

  • CategoricalNB Accuracy: 0.825

  • MultinomialNB Accuracy: 0.825

  • GaussianNB (Continuous only) Accuracy: 0.614

This suggests that for this particular dataset, the BernoulliNB was the most effective.

The poor performance of GaussianNB can likely be attributed to noise and to continuous features whose distributions do not conform well to a Gaussian.

Wrapping Up

This exploration demonstrated an effective strategy for handling mixed data types with Naive Bayes, leveraging specialized pipelines and a stacking ensemble.

In our simulation, while the ensemble method performed well overall, the BernoulliNB model was the single strongest classifier, indicating that the dataset's features were likely binary or could be effectively handled as such.

This poses a challenge where the model might be overly reliant on the BernoulliNB's strengths, potentially overlooking valuable information from other data types.

Given the dataset’s complexity and the limitations, a more robust algorithm like LightGBM would likely be a better fit.

Each method offers distinct advantages, and the optimal choice depends on the specific dataset characteristics, computational resources, and performance objectives.


Related Books for Further Understanding

These books cover a wide range of theories and practices, from the fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Share What You Learned

Kuriko IWAI, "Building Powerful Naive Bayes Ensembles for Mixed Datasets" in Kernel Labs

https://kuriko-iwai.com/naive-bayes


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.