The Definitive Guide to Imputation and Data Preprocessing in Machine Learning
A comprehensive guide on missing data imputation, feature scaling and encoding with practical examples
By Kuriko IWAI

Table of Contents
Introduction
Handling Missing Data

Introduction
Machine learning models are powerful, but their effectiveness hinges on the quality of their training data.
Without proper data preparation, even the most sophisticated algorithms will struggle to generate meaningful results.
Data preprocessing is a crucial step in the pipeline, transforming raw data into a clean and suitable format for model training.
This process typically involves:
handling missing data,
scaling numerical variables, and
encoding categorical variables.
Though these preprocessing methods do not choose the model algorithm itself, they prepare the data in a way that makes it compatible with specific algorithms.
In this article, I’ll dive deep into these three data preprocessing methods and explore their impact on major machine learning algorithms.
Handling Missing Data
◼ Missing data is a common issue in real-world datasets and it can significantly impact model performance.
From a statistical standpoint, there are three types of missingness:
Missing Completely At Random (MCAR): The missingness is purely random.
Missing at Random (MAR): The missingness depends on other observed variables in the dataset, but not on the missing value itself.
Missing Not At Random (MNAR): The missingness depends on the missing value itself.
Take a survey asking participants for their household income as an example (a small simulation follows after the list):
If 10 random forms were accidentally skipped by a data-entry person, the missingness is MCAR because the missing values are completely random.
If respondents owning a cat were more likely to fill in the form than those without a cat, it is MAR because the missingness depends on "cat ownership", an observed variable in the data, but not on the specific amount of the income (some cat owners who skipped the question had high incomes, others had low incomes; there is no correlation with the missing value itself).
If respondents with high income were more likely to skip the question, this is MNAR because the missingness of “household income” depends on the specific amount of the income.
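To make the three mechanisms concrete, below is a minimal sketch that simulates each one on a synthetic survey; the column names, sample size, and skip probabilities are illustrative assumptions, not part of the article's dataset.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
survey = pd.DataFrame({
    'income': rng.lognormal(mean=11, sigma=0.5, size=n),   # household income
    'owns_cat': rng.integers(0, 2, size=n),                # observed covariate
})

# MCAR: every response has the same 10% chance of being lost
mcar = survey.copy()
mcar.loc[rng.random(n) < 0.10, 'income'] = np.nan

# MAR: non-cat-owners skip the question more often (depends only on an observed column)
mar = survey.copy()
p_skip = np.where(mar['owns_cat'] == 1, 0.05, 0.30)
mar.loc[rng.random(n) < p_skip, 'income'] = np.nan

# MNAR: high earners skip the question (depends on the missing value itself)
mnar = survey.copy()
p_skip = np.where(mnar['income'] > mnar['income'].quantile(0.8), 0.40, 0.05)
mnar.loc[rng.random(n) < p_skip, 'income'] = np.nan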
In the next section, I’ll explore major strategies for handling missing values, considering their types and other crucial factors.
Identification
The first important step is to identify the extent and patterns of missing data. This can be done by checking for NaN (Not a Number) values or other designated missing value indicators within the dataset.
I created a synthetic dataset for demonstration purposes:
import pandas as pd
import numpy as np

data_missing = {
    'age': [25, 30, np.nan, 40, 55, 32, np.nan, 28, 45, 50],
    'income': [50000, 60000, 75000, np.nan, 90000, 65000, 80000, 52000, np.nan, 95000],
    'gender': ['female', 'female', 'male', 'Female', np.nan, 'male', 'female', 'male', 'female', 'male'],
    'yr_experience': [3, 8, 12, 15, 20, 5, 10, 2, 18, 25],
    'city': ['New York', 'London', 'Paris', 'New York', 'Tokyo', 'London', 'Berlin', 'New York', 'Paris', 'Tokyo']
}
df = pd.DataFrame(data_missing)

# counts and percentage of missing values per column
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)

Missing values per column:
age 2
income 2
gender 1
yr_experience 0
city 0
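Beyond per-column counts, it also helps to check which columns tend to be missing together. A quick way to summarize the missingness patterns of the df created above:

# counts rows by their missingness pattern (True = missing)
pattern_counts = df.isnull().value_counts().rename('n_rows')
print(pattern_counts)

# inspects the rows that contain at least one missing value
print(df[df.isnull().any(axis=1)])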
Imputation
Imputation is a technique that fills in missing data points with estimated values.
Common estimation methods leveraging statistics include mean/median imputation for continuous values and mode imputation for discrete values.
I’ll explore the following major imputation methods in this section:
Statistical methods: Mean, Median, Mode Imputation
Model-based methods: KNN Imputation, Regression Imputation
Deep learning based methods: GAIN (Generative Adversarial Imputation Networks)
Time series specific: Forward Fill/Backward Fill
◼ Mean Imputation
This technique replaces missing values with the mean of the non-missing values in that column.
Key Assumptions
Missing values are similar to the average of the observed values.
The dataset does not contain significant outliers.
Missingness is MCAR or MAR.
Applicable Data Types
- Numerical (integers, floats).
Applicable Models
- Generally safe for most models (Linear Models, Tree-based Models, K-Nearest Neighbors, Support Vector Machines)
Avoid When:
The data distribution is skewed or contains significant outliers, because the mean is heavily influenced by extreme values.
Missingness is not random; mean imputation will introduce bias in this case.
Demonstration:
from sklearn.impute import SimpleImputer

# numerical columns with missing values
cols = ['age', 'income']

# imputes with the column means
df_mean_imputed = df.copy()
imputer_mean = SimpleImputer(strategy='mean')
df_mean_imputed[cols] = imputer_mean.fit_transform(df_mean_imputed[cols])
The NaN values in the age and income columns are replaced with 38.125 and 70875.0, respectively:

◼ Median Imputation
This technique replaces missing numerical values with the median of the non-missing values in that column.
Key Assumptions
Missing values are similar to the central tendency of the observed values.
The dataset has many outliers.
Missingness is MCAR or MAR.
Applicable Data Types
- Numerical (integers, floats).
Applicable Models
- Generally safe for most models (Linear Models, Tree-based Models, K-Nearest Neighbors, Support Vector Machines)
Best When:
- The continuous distribution is skewed or contains significant outliers as median values are more robust than mean values.
Avoid When:
- Similar to mean imputation, when the missingness is not random, it can introduce significant bias.
Demonstration:
from sklearn.impute import SimpleImputer

numerical_cols_missing = ['age', 'income']

imputer_median = SimpleImputer(strategy='median')
df_median_imputed = df.copy()
df_median_imputed[numerical_cols_missing] = imputer_median.fit_transform(df_median_imputed[numerical_cols_missing])
The NaN values in the age and income columns are replaced with the median values of 36.0 and 70000.0, respectively.
Cf. mean imputation replaced them with the mean values of 38.125 and 70875.0.

◼ Mode Imputation
This method replaces missing categorical values with the most frequent category (mode).
Key Assumptions
Missing values are likely to be the most common category.
Missingness is MCAR or MAR.
Applicable Data Types
- Categorical (strings, objects, nominal, ordinal).
Applicable Models
- Generally safe for most models (Linear Models, Tree-based Models, K-Nearest Neighbors, Support Vector Machines)
Best When:
- The data is categorical and the missingness is random.
Avoid When:
The feature has very high cardinality (e.g., an "email" field that is nearly unique across one million "user_id" rows).
The mode represents a disproportionately small fraction of the data (e.g., the most frequent email address appears only five times among 1,000,000+ rows).
Demonstration:
from sklearn.impute import SimpleImputer

df_mode_imputed_cat = df.copy()
imputer_mode = SimpleImputer(strategy='most_frequent')
df_mode_imputed_cat[['gender']] = imputer_mode.fit_transform(df_mode_imputed_cat[['gender']])
The NaN value in the gender column is replaced with the mode female:

◼ K-Nearest Neighbors (KNN) Imputation
Missing values can be imputed using the values from the k-nearest neighbors in the dataset.
It often generates more sophisticated estimates than simple mean/median/mode imputation, potentially preserving more data variance and relationships by capturing local structures in the data.
Key Assumptions
Data points with similar features are located close to each other in the feature space.
Missingness is MCAR or MAR.
Applicable Data Types
Ideally, numerical (must be scaled prior to the imputation)
Can take categorical if encoded appropriately.
Applicable Models
- Generally safe for most models (Linear Models, Tree-based Models, Support Vector Machines).
Best When:
The data is expected to have complex relationships among features.
The missingness is MAR and a more accurate imputation is desired.
Avoid When:
- The dataset is large and/or has high-dimensional features (k-NN is computationally expensive and suffers badly from the curse of dimensionality).
Demonstration:
Because our dataset has both numerical and categorical missingness, I’ll scale numerical features and encode categorical features first, then apply KNN imputation (I’ll cover more details on scaling and encoding in the next section).
I’ll take two strategies:
KNN Focus: Apply k-NN imputation to both numerical NaNs and categorical NaNs.
Hybrid: Apply k-NN imputation to numerical NaNs and mode imputation to categorical NaNs.
▫ 1. KNN Focused
To ensure KNN imputation is applied to the categorical missing values, these values are preserved during encoding.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

df_knn = df.copy()

num_cols = ['age', 'income', 'yr_experience']
cat_cols = ['gender', 'city']

# scales numerical features
scaler = MinMaxScaler()
df_knn[num_cols] = scaler.fit_transform(df_knn[num_cols])

# encodes categorical features (while preserving categorical NaNs)
ordinal_encoders = {}
for col in cat_cols:
    df_knn[col] = df_knn[col].astype(object)
    oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)  # keeps categorical NaNs
    oe.fit(df_knn[[col]].dropna())
    df_knn[col] = oe.transform(df_knn[[col]])
    ordinal_encoders[col] = oe

# applies KNN imputation
imputer_knn = KNNImputer(n_neighbors=2)
df_knn_imputed = imputer_knn.fit_transform(df_knn)
In the scaling and encoding process, all the missing values are preserved to apply KNN imputation:

▫ 2. Hybrid Approach
The hybrid approach, on the other hand, is simpler and more straightforward. The key is to apply mode imputation to the categorical features before encoding them, because the encoder would otherwise replace missing values with zeros or other values. As a result, KNN imputation is applied only to the numerical missing values.
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df_knn = df.copy()

num_cols = ['age', 'income', 'yr_experience']
cat_cols = ['gender', 'city']

# scales numerical features
scaler = MinMaxScaler()
df_knn[num_cols] = scaler.fit_transform(df_knn[num_cols])

# applies mode imputation to the categorical NaNs
for col in cat_cols:
    if df_knn[col].isnull().any():
        mode_val = df_knn[col].mode()[0]
        df_knn[col] = df_knn[col].fillna(mode_val)

# encodes categorical features
encoder_onehot = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_data = encoder_onehot.fit_transform(df_knn[cat_cols])
encoded_df = pd.DataFrame(encoded_data, columns=encoder_onehot.get_feature_names_out(cat_cols), index=df_knn.index)

# merges the encoded and scaled datasets
df_knn = pd.concat([df_knn.drop(columns=cat_cols), encoded_df], axis=1)

# applies KNN imputation to the numerical NaNs
imputer_knn = KNNImputer(n_neighbors=2)
df_knn_imputed = imputer_knn.fit_transform(df_knn)
The mode imputation is applied to the categorical missing value and then the KNN imputation is applied to the numerical missing values.

Fig: KNN and mode imputation process on the mixed dataset (Created by Kuriko IWAI)
◼ Regression Imputation
This method fills in missing values with predictions from a regression model trained on the other features in the dataset.
By leveraging what the model learns, it can preserve complex relationships and offer a more statistically sound imputation than simpler approaches.
However, dealing with many missing values or a large dataset can be computationally intensive, and it carries a significant risk of overfitting if not implemented carefully, for instance, by preventing data leakage during cross-validation.
Key Assumptions
There is a linear or non-linear relationship between the feature with missing values and other features in the dataset.
Missingness is MAR or MNAR.
Applicable Data Types
Imputed features must be numerical.
Other features can be numerical or appropriately encoded categorical.
Applicable Models:
Generally safe for most models (Linear Models, Tree-based Models, Support Vector Machines).
Also suitable for high-complexity models such as Neural Networks.
Best When:
- There is a strong linear or non-linear relationship between the feature with missing values and other features in the dataset.
Avoid When:
The relationships between features are weak or non-existent.
Computational resources are limited.
Demonstration:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# copies the numerical columns from the base DataFrame for demonstration
df_reg = df[['age', 'income', 'yr_experience']].copy()

# applies regression imputation
imputer_regression = IterativeImputer(BayesianRidge(), max_iter=10, random_state=0)
df_reg_imputed = imputer_regression.fit_transform(df_reg)
The NaN values in age and income columns are replaced with the model’s prediction values:

We can see the filled values are plausible, reflecting the relationships the model learned from the other features.
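The data-leakage concern mentioned above is typically addressed by fitting the imputer inside a pipeline, so it is re-fit on each training fold and never sees the validation fold. A minimal sketch on the df_reg frame from above, using a hypothetical random target y purely for illustration:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# hypothetical target, for illustration only
y = np.random.RandomState(0).rand(len(df_reg))

# the imputer lives inside the pipeline, so each CV split fits it on the training fold only
pipe = make_pipeline(
    IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0),
    Ridge(),
)
scores = cross_val_score(pipe, df_reg, y, cv=5, scoring='r2')
print(scores.mean())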
◼ Forward Fill and Backward Fill
This method is for time-series data. Missing values are filled with the previous (forward fill) or next (backward fill) valid observation.
Key Assumptions
The data has a sequential or temporal order, and values are correlated over time.
Missingness is a temporary gap.
Applicable Data Types
- Numerical, typically in time-series or sequential data.
Applicable Models
- Crucial for Recurrent Neural Networks (RNNs), LSTMs, ARIMA, Prophet and other time-series specific models that require continuous sequences.
Best When:
- Dealing with time-series data where the value at a given point is likely to be similar to its immediate neighbors.
Avoid When:
Missing data gaps are very long, losing critical information on trends (e.g., temperature data recorded every second is missing for 16 consecutive hours).
The underlying data pattern changes significantly within the missing gap, even if the gap is short (e.g., if morning temperatures (6 am to 11 am) and midday temperatures (11 am to 2 pm) follow distinct distributions, a missing data point around 11 am is hard to fill because that is exactly when the distribution shifts).
Demonstration
import pandas as pd
import numpy as np

# creates another synthetic dataset
data_ts = {'Value': [10, 12, np.nan, 15, 18, np.nan, np.nan, 25, 28, 30]}
df_ts = pd.DataFrame(data_ts)

# forward fill (propagates the previous valid observation)
df_ffill = df_ts.ffill()

# backward fill (propagates the next valid observation)
df_bfill = df_ts.bfill()
The NaN values in the original data are filled:

◼ Generative Adversarial Imputation Networks (GAIN)
GAIN is a deep learning-based imputation technique inspired by Generative Adversarial Networks (GANs).
It aims to fill in missing values by learning the underlying data distribution through an adversarial process.
GAINs consist of two main components:
Generator: learns to impute plausible values for the missing entries through training, and
Discriminator: learns to distinguish between the actually observed data and the imputed data.
Applicable Data Types
- Primarily designed for tabular data containing continuous, categorical, or mixed data types.
Applicable Models
- Generally safe for most models (Logistic Regression, Decision Trees, Gradient Boosting, other Neural Networks) handling both regression and classification tasks.
Best When:
Dealing with complex, high-dimensional tabular datasets.
The missing data patterns are complex, potentially Missing Not At Random (MNAR).
High fidelity required for model performance.
Sufficient amount of data is available to train GAINs.
Avoid When:
The dataset is very small, as training GAINs requires a sufficient amount of data.
The missing data patterns are simple (MCAR or MAR with simple relationships), as statistical imputation is good enough in such cases.
Interpretability of the imputation is required, as deep learning models are black boxes.
Computational resources are limited, as training GAINs is computationally intensive.
Demonstration:
Using the PyTorch library to define the GAIN model along with its Generator and Discriminator classes.
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# generator
class Generator(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(Generator, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim * 3, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
            nn.Sigmoid()
        )

    def forward(self, x, m, h):
        # concatenates data, mask, and hint vectors
        combined_input = torch.cat([x, m, h], dim=1)
        return self.net(combined_input)

# discriminator
class Discriminator(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim=None):
        super(Discriminator, self).__init__()
        if output_dim is None:
            output_dim = input_dim
        self.net = nn.Sequential(
            nn.Linear(input_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
            nn.Sigmoid()
        )

    def forward(self, x, h):
        # concatenates imputed data and hint vectors
        combined_input = torch.cat([x, h], dim=1)
        return self.net(combined_input)
# GAIN
class GAIN:
    def __init__(self, data_dim):
        # basic config
        self.data_dim = data_dim
        self.batch_size = 128
        self.hint_rate = 0.9
        self.alpha = 100
        self.iterations = 10000
        self.dim = 128
        self.learning_rate = 1e-3
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # generator and its optimizer
        self.generator = Generator(self.data_dim, self.dim, self.data_dim).to(self.device)
        self.optimizer_G = optim.Adam(self.generator.parameters(), lr=self.learning_rate)

        # discriminator and its optimizer
        self.discriminator = Discriminator(self.data_dim, self.dim, self.data_dim).to(self.device)
        self.optimizer_D = optim.Adam(self.discriminator.parameters(), lr=self.learning_rate)

        # loss, min, max values
        self.bce_loss = nn.BCELoss()
        self.min_val = None
        self.max_val = None

    def train(self, data_original_normalized, data_with_missing_normalized, mask):
        num_samples = data_original_normalized.shape[0]

        # training iterations
        for i in range(self.iterations):
            # creates a random mini-batch
            idx = np.random.choice(num_samples, self.batch_size, replace=False)
            batch_data_original_normalized = data_original_normalized[idx]
            batch_data_with_missing_normalized = data_with_missing_normalized[idx]
            batch_mask = mask[idx]
            random_mat = torch.rand(self.batch_size, self.data_dim).to(self.device)
            batch_hint = (batch_mask * self.hint_rate + random_mat * (1 - self.hint_rate)).round()

            # trains the discriminator (initialize optimizer, forward pass, backward pass)
            self.optimizer_D.zero_grad()
            G_sample = self.generator(batch_data_with_missing_normalized, batch_mask, batch_hint)
            D_input = batch_mask * batch_data_original_normalized + (1 - batch_mask) * G_sample
            D_prob = self.discriminator(D_input, batch_hint)
            D_labels = batch_mask
            D_loss = self.bce_loss(D_prob, D_labels)
            D_loss.backward()
            self.optimizer_D.step()

            # trains the generator
            self.optimizer_G.zero_grad()
            G_sample = self.generator(batch_data_with_missing_normalized, batch_mask, batch_hint)
            D_input_G = batch_mask * batch_data_original_normalized + (1 - batch_mask) * G_sample
            D_prob_G = self.discriminator(D_input_G, batch_hint)
            G_adversarial_loss = self.bce_loss(D_prob_G, batch_mask)
            G_reconstruction_loss = torch.mean((1 - batch_mask) * torch.abs(batch_data_original_normalized - G_sample))
            G_loss = G_adversarial_loss + self.alpha * G_reconstruction_loss
            G_loss.backward()
            self.optimizer_G.step()

    def impute(self, data_with_missing_normalized, mask):
        data_with_missing_normalized = data_with_missing_normalized.to(self.device)
        mask = mask.to(self.device)
        self.generator.eval()
        with torch.no_grad():
            # generates candidate values and keeps the observed entries untouched
            random_mat = torch.rand(data_with_missing_normalized.shape).to(self.device)
            hint = (mask * self.hint_rate + random_mat * (1 - self.hint_rate)).round()
            G_sample = self.generator(data_with_missing_normalized, mask, hint)
            final_imputed_data_normalized = mask * data_with_missing_normalized + (1 - mask) * G_sample
        self.generator.train()
        return final_imputed_data_normalized.cpu().numpy()

    def denormalize_data(self, data_normalized):
        # reverses the min-max normalization applied before training
        return data_normalized * (self.max_val - self.min_val) + self.min_val
# instantiates the model
# (data_original, normalized_data_original, normalized_data_with_missing, mask,
#  min_val, and max_val are assumed to be prepared beforehand; see the sketch below)
gain_model = GAIN(data_dim=data_original.shape[1])

# stores min/max values for later denormalization
gain_model.min_val = min_val
gain_model.max_val = max_val

# model training
gain_model.train(normalized_data_original, normalized_data_with_missing, mask)

# imputation and denormalization
imputed_data_normalized_np = gain_model.impute(normalized_data_with_missing, mask)
imputed_data_np = gain_model.denormalize_data(torch.tensor(imputed_data_normalized_np, dtype=torch.float32)).numpy()
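The snippet above assumes that data_original, normalized_data_original, normalized_data_with_missing, mask, min_val, and max_val have been prepared beforehand. Under a min-max normalization assumption, that preparation could look roughly like this sketch (not part of the original demo):

import numpy as np
import torch

# assumed inputs: data_original is a complete numpy array,
# and data_with_missing is the same array with np.nan where values are missing
mask_np = (~np.isnan(data_with_missing)).astype(np.float32)  # 1 = observed, 0 = missing

min_val = np.nanmin(data_with_missing, axis=0)
max_val = np.nanmax(data_with_missing, axis=0)

def normalize(x):
    # column-wise min-max normalization
    return (x - min_val) / (max_val - min_val + 1e-6)

normalized_data_original = torch.tensor(normalize(data_original), dtype=torch.float32)
normalized_data_with_missing = torch.tensor(
    np.nan_to_num(normalize(data_with_missing), nan=0.0),  # missing entries set to zero
    dtype=torch.float32,
)
mask = torch.tensor(mask_np, dtype=torch.float32)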
The missing values (zeros) in the original data are filled with GAIN’s predictions:

◼ Deletion
Rows or columns with missing values can be deleted as a last resort when missing data makes up a very high portion of the row or column, or when imputation is not feasible.
This should be done cautiously as it can lead to loss of valuable information.
Key Assumptions
Listwise Deletion (row-wise): Missingness is MCAR (non-MCAR deletion introduces bias into the remaining data).
Column-wise Deletion (dropping the feature): The feature with missing data is not critical, or has too many missing values to be useful.
Applicable Data Types
- All data types (numerical, categorical).
Applicable Models
Can be used with any model if the data loss is minimal and does not introduce bias.
Most often considered for simple models.
Best When:
The amount of missing data is very small and randomly distributed, making the loss negligible.
A feature has a very high percentage of missing values (e.g., >70–80%), making it mostly useless.
Avoid When:
The amount of missing data is substantial as it reduces statistical power.
The missingness is not MCAR because the loss can introduce bias.
Demonstration:
# drops rows with any missing value (listwise deletion)
df_dropped_rows = df.dropna(axis=0)

# drops columns with any missing value
df_dropped_cols = df.dropna(axis=1)
Five data points are left after the list-wise deletion:

In this section, we saw how to handle missing data while preserving the integrity of the dataset.
In the next section, I'll explore how to handle (non-missing) numerical values with scaling techniques.
Scaling Numerical Features
Scaling numerical features ensures that all features contribute comparably to the model; without scaling, features with larger ranges dominate the feature space over those with smaller ranges.

Fig. Applying standardization and normalization to the original data. (Created by Kuriko IWAI)
This process is particularly important for algorithms sensitive to feature scales, such as K-Means, Support Vector Machines (SVMs), and neural networks.
Common scaling methods include normalization, standardization, and robust scaling.
◼ 1. Normalization (Min-Max Scaling)
This technique scales features to a fixed range of zero to one.
The formula for normalization is:
X_scaled = (X - X_min) / (X_max - X_min)
where X is the original value, and X_min and X_max are the minimum and maximum values of the feature, respectively.
Key Assumptions
- No significant outliers, as they can disproportionately compress the range of other data points.
Applicable Models (must apply)
Neural Networks (especially with sigmoid/tanh activations),
K-Nearest Neighbors
Support Vector Machines (especially RBF kernel)
Clustering Algorithms (e.g., K-Means, Hierarchical Clustering) (for distance computation)
Principal Component Analysis (PCA).
Best When:
Algorithms require features to be within a specific range.
The data distribution is not Gaussian.
There are no significant outliers.
Avoid When:
- The data contains many outliers. Normalization is highly susceptible to outliers because it relies on the minimum and maximum values, which outliers affect the most; significant outliers can severely compress the range of the majority of data points. Use Robust Scaling instead.
Demonstration
Applying normalization on synthetic dataset:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# creates a synthetic dataset
data = {
    'feature_1': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'feature_2': [0.1, 0.5, 0.2, 0.9, 0.3, 0.7, 0.4, 0.6, 0.8, 1.0],
    'feature_3': [1000, 2500, 1500, 4000, 3000, 500, 3500, 2000, 4500, 5000]
}
df_scaling = pd.DataFrame(data)

# applies normalization
scaler_minmax = MinMaxScaler()
df_normed = scaler_minmax.fit_transform(df_scaling)

◼ Standardization (Z-score Normalization)
This method transforms features to have a mean of 0 and a standard deviation of 1.
The formula for standardization is:
Z = (X - μ) / σ
where X is the original value, μ is the mean of the feature, and σ is the standard deviation of the feature.
Standardization is less affected by outliers than normalization.
While both μ and σ are affected by outliers, the transformation scales data based on its distribution, which is generally more robust to extreme values than simply clamping to the min/max values.
Key Assumptions
- Features should contribute equally based on their variance.
Applicable Models
Highly beneficial for models and techniques sensitive to the feature variance.
Linear Regression,
Logistic Regression,
Support Vector Machines (especially with linear kernels),
K-Means,
Neural Networks,
Principal Component Analysis (PCA),
Linear Discriminant Analysis (LDA),
L1/L2 Regularization (Ridge, Lasso).
Best When:
Most effective when the data is approximately normally distributed.
The data contains outliers that should not disproportionately influence the scaling.
Avoid When:
Need to preserve the interpretability of the absolute original scale of the data.
The dataset is very sparse (standardizing could destroy the sparsity structure).
Demonstration:
Applying standardization to the same synthetic dataset:
from sklearn.preprocessing import StandardScaler

scaler_standard = StandardScaler()
df_stan = scaler_standard.fit_transform(df_scaling)

◼ Robust Scaling
This method scales features using the interquartile range (IQR), making it more robust to outliers than normalization or standardization.
Mathematically, it is computed by subtracting the median (centering the data around zero) and dividing by the interquartile range:
X_scaled = (X - median) / IQR
Applicable Models
- Support Vector Machines, K-Nearest Neighbors, and Neural Networks
Best When:
- The dataset contains many outliers; robust scaling ensures that they do not heavily influence the scaling process.
Avoid When:
The dataset does not contain significant outliers (general standardization and normalization are simpler and equally effective).
Need to keep the relationship to the original mean and standard deviation for model interpretability or specific analytical needs.
Demonstration:
import pandas as pd
from sklearn.preprocessing import RobustScaler

# creates a synthetic dataset with outliers
data_rob = {
    'val_1': [1, 2, 3, 4, 5, 100, 6, 7, 8, 9],
    'val_2': [10, 11, 12, 13, 14, 15, 200, 16, 17, 18]
}
df_rob_orig = pd.DataFrame(data_rob)

# applies scaling
scaler_robust = RobustScaler()
df_robust_scaled = scaler_robust.fit_transform(df_rob_orig)
The outliers in the dataset (100, 200) are transformed to 21.0 and approximately 41.2, respectively.
Take a look at the val_1 column:
Median (Q2) = (5 + 6) / 2 = 5.5
Q1 = 3.25 (25th percentile)
Q3 = 7.75 (75th percentile)
IQR = Q3 - Q1 = 7.75 - 3.25 = 4.5
X_scaled = (100 - 5.5) / 4.5 = 21.0
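The quantiles behind this arithmetic can be double-checked with numpy (a quick verification, not part of RobustScaler's API):

import numpy as np

val_1 = np.array([1, 2, 3, 4, 5, 100, 6, 7, 8, 9])
q1, median, q3 = np.percentile(val_1, [25, 50, 75])
iqr = q3 - q1

print(q1, median, q3, iqr)   # 3.25 5.5 7.75 4.5
print((100 - median) / iqr)  # 21.0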

Encoding Categorical Variables
Machine learning models require numerical input, so nominal categorical variables (categories without an inherent order) must be converted into numerical representations.
In this section, I’ll introduce four major techniques to encode categorical features.
◼ One-Hot Encoding
This is a widely used technique for nominal categorical variables.
It creates new binary columns for each category, with a ‘1’ indicating the presence of that category and ‘0’ otherwise, while preventing the model from assuming an arbitrary numerical relationship between categories.
Key Assumptions
Nominal data: no inherent order among categories.
Each category is distinct and independent.
Applicable Models
Essential for models that interpret numerical values as continuous or ordinal, such as
Linear Regression,
Logistic Regression,
Support Vector Machines,
K-Nearest Neighbors, and
Neural Networks.
Best When:
- Nominal categorical values with a relatively low number of unique categories.
Avoid When:
- The feature has very high cardinality (leading to the curse of dimensionality due to the many new columns created; consider Binary Encoding, Target Encoding, or Feature Hashing instead).
Demonstration
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# creates synthetic categorical data
data_cat = {
    'City': ['New York', 'London', 'Paris', 'New York', 'Tokyo', 'London', 'Berlin', 'New York'],
    'VehicleType': ['Car', 'Bike', 'Car', 'Bus', 'Car', 'Bike', 'Bus', 'Car']
}
df_cat = pd.DataFrame(data_cat)

# applies encoding
encoder = OneHotEncoder(
    handle_unknown='ignore',  # unknown categories at transform time become all-zero rows
    sparse_output=False       # returns a dense array
)
encoded_features = encoder.fit_transform(df_cat[['City', 'VehicleType']])
The encoder created eight new binary features: five for the unique values in City and three for those in VehicleType:

◼ Label Encoding
This method assigns a unique integer to each category, instead of making new columns.
Key Assumptions
For Ordinal Data: A meaningful, inherent order exists among categories that can be represented by integers.
For Nominal Data: An arbitrary ordinal relationship is imposed (caution: this can mislead models sensitive to numerical distances, e.g., linear models, SVMs, KNN).
Applicable Data Types
- Mainly for ordinal categorical.
Applicable Models
- Primarily for Tree-based Models (less sensitive to the numerical distance between encoded categories).
Best When:
The categorical feature is ordinal and its order can be reflected by integers (e.g., "low," "medium," "high" to 0, 1, 2).
Used as a pre-processing step for tree-based algorithms.
Avoid When:
- Uses models sensitive to numerical distances (e.g., linear regression, SVMs, K-Means).
Demonstration
Encoding the column VehicleType with unique integer labels:
from sklearn.preprocessing import LabelEncoder

df_cat_le = df_cat.copy()
le = LabelEncoder()
df_cat_le['VehicleType_Encoded'] = le.fit_transform(df_cat_le['VehicleType'])

(Encoded values are added as a new column for demonstration purposes.)
◼ Target Encoding (Mean Encoding)
This technique replaces each category with the mean of the target variable for that category.
It requires cross-validation or regularization to prevent overfitting.
Key Assumptions
The mean of the target category is a good representation of that category’s influence on the target.
The categorical feature and the target variable have a meaningful statistical relationship.
Crucial: use cross-validation to avoid data leakage.
Applicable Data Types
- Categorical (nominal), used in conjunction with a numerical target variable (for regression) or binary target variable (for classification).
Applicable Models
Highly effective for Gradient Boosting Machines and Neural Networks because it provides a single, powerful numerical feature that captures target information.
Can also be used for Linear Models but with higher risk of overfitting without proper validation.
Best When:
Dealing with high-cardinality nominal categorical features.
Strong relationship between the categorical feature and the target variable.
Avoid When:
The dataset is very small, as the target mean for each category might not be robust.
It is implemented without robust cross-validation strategies.
Demonstration:
Mapping each state to the mean of Sales for that state:
import pandas as pd

data_target_encoding = {
    'State': ['CA', 'NY', 'TX', 'CA', 'FL', 'NY', 'TX', 'IL', 'CA', 'FL'],
    'Sales': [100, 150, 120, 110, 180, 160, 130, 90, 105, 175]
}
df_target = pd.DataFrame(data_target_encoding)

# maps each category to the mean of the target for that category
target_means = df_target.groupby('State')['Sales'].mean()
df_target['State_Encoded'] = df_target['State'].map(target_means)

(Encoded values are added as a new column for demonstration purposes.)
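Note that the simple mapping above computes the means on the full dataset, which is exactly the leakage the cross-validation caveat warns about. Below is a sketch of out-of-fold target encoding on the same df_target, where each row is encoded using only the means from the other folds (the column name State_Encoded_OOF is my own):

import numpy as np
from sklearn.model_selection import KFold

df_target['State_Encoded_OOF'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, val_idx in kf.split(df_target):
    # means computed on the training fold only
    fold_means = df_target.iloc[train_idx].groupby('State')['Sales'].mean()
    df_target.loc[df_target.index[val_idx], 'State_Encoded_OOF'] = (
        df_target['State'].iloc[val_idx].map(fold_means)
    )

# categories unseen in a training fold fall back to the global target mean
df_target['State_Encoded_OOF'] = df_target['State_Encoded_OOF'].fillna(df_target['Sales'].mean())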
◼ Binary Encoding
This method converts categories into binary code and then represents each bit as a separate column.
It can reduce the dimensionality compared to one-hot encoding for high-cardinality categorical features.
Key Assumptions: n.a.
Applicable Data Types
- Nominal categorical (strings, objects), especially suitable for high-cardinality features.
Applicable Models
- Can be used with most models; particularly beneficial for Linear Models, SVMs, and Neural Networks when features have high cardinality.
Best When:
The dataset has high-cardinality nominal categorical features.
Need to reduce the dimensionality while preserving information about the original categories.
Avoid When:
Interpretability of individual categories is necessary (The resulting binary features are not directly human-readable representations of the original categories).
When the cardinality is very low (Apply one-hot encoding instead).
Demonstration
import pandas as pd
from category_encoders import BinaryEncoder

data_binary = {
    'Product_Category': ['Electronics', 'Books', 'Clothing', 'HomeGoods', 'Electronics',
                         'Books', 'Sports', 'Toys', 'HomeGoods', 'Electronics', 'Footwear', 'Jewelry']
}
df_binary = pd.DataFrame(data_binary)

# encodes
encoder_binary = BinaryEncoder(cols=['Product_Category'])
df_binary_encoded = encoder_binary.fit_transform(df_binary)
The binary encoder first transforms unique integer IDs into the binary representations (e.g. 5 → 101) and then stores each digit of the binary representation to a new column:

One-hot encoding would create eight new columns (one per unique category), while binary encoding creates only four, saving dimensionality.
Is Preprocessing Always Necessary?
Not always. Though data preprocessing is often essential, there are specific scenarios or model types where some steps might be unnecessary.
◼ Perfectly Clean, Homogeneous, and Appropriately Scaled Data
If the dataset is already perfectly clean (no missing values, no erroneous entries), all numerical features are on a comparable scale (or their scales don't affect the chosen algorithm), and all categorical features are already numerically encoded in a way suitable for your model, then you might skip preprocessing.
This is rare in real-world data and typically implies that significant manual data curation has already occurred.
◼ Specific Domain Expertise or Engineered Features
In some highly specialized domains, features might be meticulously engineered to be directly interpretable by a model without further scaling or transformation.
For instance, if features are already proportions or counts within a constrained range (e.g., 0–1 or 0–100), additional scaling might not be required.
If categorical data comes in as pre-defined integer IDs that the model can interpret directly without assuming order, further encoding is unnecessary.
◼ Specific Models and Their Tolerances
▫ Tree-based Models
Missing Data: Not necessary. Most modern tree-based implementations have built-in mechanisms to handle missing values, so imputation is not a requirement (see the sketch after this list).
Scaling: Not necessary. These models are generally invariant to feature scaling because their splits are based on feature values relative to each other, not on absolute values.
Encoding: Not strictly necessary, though One-Hot Encoding can still help. Trees can split on integer-encoded values without assuming a linear relationship; One-Hot Encoding is preferred for interpretability or when a specific tree implementation struggles with high-cardinality label-encoded features.
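For example, scikit-learn's histogram-based gradient boosting models accept NaN directly, so no imputation step is needed. A minimal sketch on a small illustrative frame with a hypothetical target:

import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

# tiny illustrative frame with the missing values left in place
X = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 55, 32, np.nan, 28, 45, 50],
    'income': [50000, 60000, 75000, np.nan, 90000, 65000, 80000, 52000, np.nan, 95000],
})
y_demo = np.arange(len(X))  # hypothetical target, for illustration only

# NaNs are routed to a child at each split; no imputation or scaling required
model = HistGradientBoostingRegressor(min_samples_leaf=2, random_state=0)
model.fit(X, y_demo)
print(model.predict(X)[:3])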
▫ Deep Learning (with specific layers for text/categorical data)
Missing Data: Imputation is necessary, as most standard neural network layers cannot handle NaN directly.
Scaling: Required for numerical features.
Encoding: Often not necessary. Categorical features are primarily handled by embedding layers, especially very high-cardinality ones (a sketch follows below).
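A minimal PyTorch sketch of the embedding-layer idea mentioned above: each integer-encoded category is mapped to a learned dense vector, so no one-hot expansion is needed (the cardinality and embedding size here are arbitrary assumptions):

import torch
import torch.nn as nn

n_categories = 10_000  # e.g., a high-cardinality feature such as a user ID
embedding_dim = 16     # size of the learned vector, an arbitrary choice

embedding = nn.Embedding(num_embeddings=n_categories, embedding_dim=embedding_dim)

# a batch of integer-encoded category IDs
category_ids = torch.tensor([3, 42, 9999])
dense_vectors = embedding(category_ids)  # shape: (3, 16), trained jointly with the rest of the network
print(dense_vectors.shape)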
▫ Naive Bayes Classifiers
Missing Data: Requires imputation or deletion; the model cannot handle missing values.
Scaling: Can be applied, but it is not as critical as for distance-based algorithms.
Encoding: Requires numerical input. One-Hot Encoding for nominal features and Label Encoding for ordinal features are appropriate.
Wrapping Up
Effective data preprocessing is an important preparation step before model training and deployment.
By meticulously addressing missing values, standardizing or normalizing numerical features, and intelligently encoding categorical data, we can significantly enhance model accuracy, stability, and training efficiency.
While certain algorithms, particularly tree-based models, exhibit inherent robustness to some raw data imperfections, a thoughtful approach tailored to the model algorithm and the dataset is the first step toward high-performing machine learning solutions.
Related Books for Further Understanding
These books cover a wide range of theory and practice, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
Share What You Learned
Kuriko IWAI, "The Definitive Guide to Imputation and Data Preprocessing in Machine Learning" in Kernel Labs
https://kuriko-iwai.com/data-preprocessing-in-machine-learning
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.