Data Augmentation Techniques for Tabular Data: From Noise Injection to SMOTE
A comprehensive guide on enhancing machine learning models using Gaussian noise, interpolation methods (Spline, RBF, IDW), and adaptive SMOTE algorithms for real-world datasets.
By Kuriko IWAI

Table of Contents
Introduction
What is Data Augmentation
Noise Injection
Interpolation
SMOTE Algorithms
Various SMOTE Algorithms
Wrapping Up

Introduction
Machine learning models require training on substantial amounts of high-quality, relevant data.
Yet, real-world data presents significant challenges due to its inherent imperfections.
Data augmentation is a key strategy to tackle these challenges and provide robust training for the model.
In this article, I’ll explore major data augmentation techniques for tabular data:
noise injection and
interpolation methods, including SMOTE algorithms,
along with practical implementation examples.
What is Data Augmentation
Data augmentation is a data enhancement technique in machine learning that expands the original dataset through targeted transformations, helping to address data scarcity and imbalance.
Its major techniques include noise injection, where the model is trained on a dataset with intentionally added noise, and interpolation methods, where the algorithm estimates unknown data points from the original dataset.
Because this approach expands the original dataset itself, fully leveraging data augmentation requires a sufficiently large and accurate dataset that reflects the true underlying data distribution.
Otherwise, noise and outliers in the original dataset that the model shouldn’t learn are also augmented as new data, completely misleading the model.
◼ Why Data Augmentation is Important: The Challenges of Real-World Data
For a model to be effective, it must be trained on data that accurately reflects patterns likely to recur in the future.
Lack of high-quality, relevant data prevents models from learning effectively, leading to poor performance.
However, two primary challenges arise when dealing with real-world datasets: data quantity issues and data quality issues.
▫ Data Quantity Issues:
Acquiring sufficient data can be a significant hurdle when relevant events are extremely rare (e.g., predicting rare decreases).
Insufficient data leads to two major problems:
Underfitting, where the model fundamentally fails to learn patterns from the data, resulting in high bias, and
Class imbalance in classification tasks, where certain classes in the target variable have far fewer samples than others, biasing the model toward the dominant classes.
▫ Data Quality Issues:
Even with sufficient data, imperfections like missing values, noise, or inconsistencies can severely mislead a model.
This commonly causes overfitting, where the model learns incorrect patterns from the training data and fails to generalize to unseen data, resulting in high variance.
◼ Choosing the Right Data Enhancement Approach
Data enhancement collectively refers to machine learning strategies that expand and improve the quality of training datasets to boost a model’s generalization capabilities.
Primary approaches include imputation, synthetic data generation, and data augmentation, each of which handles a different type of data limitation:
▫ Imputation
This technique addresses missing values within existing datasets.
Importantly, it doesn’t increase the number of samples; instead, it fills in gaps in the original data points.
Depending on the type of missing data, imputation approaches vary:
Statistical: Mean, Median, Mode Imputation
Model-based: KNN Imputation, Regression Imputation
Deep learning based: GAIN (Generative Adversarial Imputation Networks)
Time series specific: Forward Fill/Backward Fill
▫ Synthetic Data Generation
This approach is ideal when we are facing limitations in data quantity, privacy concerns, or data sharing restrictions.
It involves creating entirely new datasets from scratch, meticulously designed to reflect the statistical properties of real data without using actual sensitive information.
Advanced techniques like Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) can generate high-fidelity synthetic data, which is particularly useful when real data is scarce, sensitive, or contains significant imperfections.
▫ Data Augmentation
This method tackles limitations in data quantity by expanding the original dataset rather than creating an entirely new one (the key difference from synthetic data generation).
It involves applying various transformations to the original data (e.g., rotating images, adding noise to audio) without collecting new raw data.
This process helps the model generalize better to unseen examples, lowering variance.
Now, let us explore two major data augmentation techniques: noise injection and interpolation methods.
Noise Injection
Noise injection is a data augmentation technique to deliberately introduce controlled random perturbations into continuous features during model training.
This method is applicable to both regression and classification tasks, but noise must be injected only into continuous features.
For example:
- Original Data Point: [age: 35, income: 60000, gender: 1]
Applying noise injection by adding a small, random value to each continuous feature:
Augmented Data Point 1: [age: 35 + 1.2 = 36.2, income: 60000 - 550 = 59450, gender: 1]
Augmented Data Point 2: [age: 35 - 0.8 = 34.2, income: 60000 + 720 = 60720, gender: 1]
In this example, the noise for age and income is randomly drawn from the ranges -10 to 10 and -1,000 to 1,000, respectively.
The discrete feature gender is out of scope, so it remains unchanged.
Although noise injection does not increase the number of samples in the dataset, it implicitly expands the feature space by perturbing continuous features.
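As a minimal sketch of the example above (assuming uniform noise within the stated ranges; the feature values are the illustrative ones from the example):

import numpy as np

rng = np.random.default_rng(42)

# original record from the example: gender is discrete, so it is left untouched
record = {"age": 35.0, "income": 60000.0, "gender": 1}

def add_noise(record, n_copies=2):
    """Create noisy copies of a record by perturbing only its continuous features."""
    copies = []
    for _ in range(n_copies):
        new = dict(record)
        new["age"] += rng.uniform(-10, 10)         # noise range for age
        new["income"] += rng.uniform(-1000, 1000)  # noise range for income
        copies.append(new)
    return copies

print(add_noise(record))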
Major techniques applicable for tabular data include:
Gaussian Noise Injection: Adds random values sampled from a Gaussian distribution to the original dataset, and
Jittering: Applies small, random perturbations (often Gaussian) to individual data points in time-series/sequential data.
Now, let’s take a look at how a common noise injection method, Gaussian noise injection, works.
◼ Demonstration: Gaussian Noise Injection
I created a scenario where a Linear Regression model is trained on extremely noisy data because the deployment environment is expected to be noisy (e.g., sensor readings with measurement errors).
This scenario is challenging because, by nature, Linear Regression needs abundant data with an approximately linear relationship between features and target to learn accurate approximations.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# underfit due to limited samples
n_samples, n_features = 100, 10

# creates true X
X_true = np.random.rand(n_samples, n_features)

# creates true y (extremely noisy)
true_coefficients = np.random.randn(n_features)
true_bias = 100
y_true_noise = np.random.rand(n_samples) * 10000
y_true = np.dot(X_true, true_coefficients) + true_bias + y_true_noise

# splits and scales the data
X_train, X_test, y_train, y_test = train_test_split(X_true, y_true, test_size=30, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# trains the model on the scaled data and makes predictions
model = LinearRegression().fit(X_train_s, y_train)
y_pred_train = model.predict(X_train_s)
y_pred_test = model.predict(X_test_s)

# computes evaluation metrics
mse_train = mean_squared_error(y_train, y_pred_train)
mae_train = mean_absolute_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
mae_test = mean_absolute_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)
▫ Results from Original Data
Without noise injection, the model failed to learn the pattern, ending up with significantly high errors (e.g., a generalization MSE of 48,429.01).
MSE: Train 21,232.91 → Generalization on test set: 48,429.01
MAE: Train 3,472.48 → Generalization on test set: 5,943.21
R2 Score: Train: -1.00 → Generalization on test set: -4.5368
▫ Adding Gaussian Noise
Then, I added Gaussian Noise to the training dataset and retrained the model:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# adds gaussian noise to the training dataset (before scaling)
gaussian_noise = np.random.normal(loc=0, scale=1, size=X_train.shape)
X_train_noise = X_train + gaussian_noise

# scales the dataset
scaler = StandardScaler()
X_train_noise_s = scaler.fit_transform(X_train_noise)
X_test_noise_s = scaler.transform(X_test)

# retrains the model and makes predictions
model = LinearRegression().fit(X_train_noise_s, y_train)
y_pred_train_noise = model.predict(X_train_noise_s)
y_pred_test_noise = model.predict(X_test_noise_s)

# computes evaluation metrics
mse_train = mean_squared_error(y_train, y_pred_train_noise)
mae_train = mean_absolute_error(y_train, y_pred_train_noise)
r2_train = r2_score(y_train, y_pred_train_noise)

mse_test = mean_squared_error(y_test, y_pred_test_noise)
mae_test = mean_absolute_error(y_test, y_pred_test_noise)
r2_test = r2_score(y_test, y_pred_test_noise)
▫ Results from Data with Gaussian Noise
The model’s performance improved significantly: the generalization MSE dropped from 48,429 to 8,962.
MSE: Train 9,240.38 → Generalization on test set: 8,962.52
MAE: Train 2,632.58 → Generalization on test set: 2,610.19
R2 Score: Train: 0.13 → Generalization on test set: -0.0247
These results indicate that the model became more robust to noisy real-world data after being trained with Gaussian noise.
There are, however, occasions when we should avoid noise injection:
When interpretability is crucial: The added noise obscures the relationship between input features and predictions.
When the model is sensitive to small input perturbations: Especially in safety-critical systems, even small changes to the input could lead to inaccurate outputs.
When training time is extremely limited: Injecting noise at scale can increase computational cost and training time.
Otherwise, noise injection is a useful method to combat moderate overfitting by forcing the model to learn varied versions of the data.
Interpolation
Interpolation is a data augmentation technique that expands the original dataset by estimating unknown values between randomly chosen data points, enriching the underlying data distribution.
Because of this estimation process, this method requires the original dataset to be accurate and sufficiently robust.
It’s not suitable for:
Very limited datasets, because new data cannot be estimated reliably, or
Noisy datasets, as the noise is also propagated into the new data, misleading the model.
◼ Types of Interpolation Methods
Among the many interpolation methods, linear interpolation is the most common and intuitive.
Mathematically, the interpolated value y at a point x lying between two randomly chosen samples (x_1, y_1) and (x_2, y_2) is given by:
y = y_1 + (y_2 − y_1) · (x − x_1) / (x_2 − x_1)
The figure below visualizes the linearly interpolated curve (blue line) over the original data points (red dots).
Taking two random original data points:
(x_1, y_1) = (3.00, 1.66),
(x_2, y_2) = (4.00, 2.43) (highlighted in orange in the figure)
for example, the interpolated value y at a random point x = 3.75 between the two original points is computed as y = 2.24:

Figure A. Linear interpolation (Created by Kuriko IWAI)
Linear interpolation is best when:
The original dataset is relatively small.
The underlying relationship between two original data points seems linear.
Higher-order smoothness is not critical (e.g., basic graphing, resampling)
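A quick sanity check of the worked example above with NumPy (np.interp performs piecewise linear interpolation over the known points):

import numpy as np

x_known = [3.00, 4.00]   # (x_1, x_2) from the example
y_known = [1.66, 2.43]   # (y_1, y_2)

# interpolate at the arbitrary point x = 3.75
y_new = np.interp(3.75, x_known, y_known)
print(round(y_new, 2))   # 2.24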
Next, I’ll look at how the major interpolation methods behave at the same arbitrary point, x = 3.75.
▫ Polynomial Interpolation
Polynomial interpolation fits a polynomial function through a set of original data points, instead of a linear function.
For n original data points, the interpolated value is given by a polynomial P(x) of degree up to n − 1:
P(x) = Σ_{j=1..n} y_j · L_j(x)
where L_j(x) is the Lagrange basis polynomial corresponding to the j-th data point x_j:
L_j(x) = Π_{m ≠ j} (x − x_m) / (x_j − x_m)
(in the case of Lagrange interpolation)
This method is best when:
The original dataset is relatively small.
The underlying relationship across the original data points is a single, smooth polynomial of moderate degree (high-degree polynomials suffer from Runge's phenomenon).
The interpolation value for the arbitrary point is 2.46:

Figure B. Polynomial Interpolation (Created by Kuriko IWAI)
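A minimal sketch of Lagrange polynomial interpolation with SciPy. The sample points below are hypothetical stand-ins for the red dots in the figure, so the value at x = 3.75 will not exactly match the 2.46 reported above:

import numpy as np
from scipy.interpolate import lagrange

# hypothetical original data points (not the exact points used in the figure)
x_known = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_known = np.array([0.84, 1.41, 1.66, 2.43, 2.90])

# fit a degree n-1 polynomial through all n points and evaluate it
poly = lagrange(x_known, y_known)
print(poly(3.75))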
▫ Nearest Neighbor Interpolation
Nearest neighbor interpolation assigns the value of an original data point closest to the given unknown data point.
The interpolated value y is given by the value y_k of the original data point whose x_k has the minimum distance to the query point:
y(x) = y_k, where k = argmin_i d(x, x_i)
(Here using Manhattan distance; the distance metric can also be Euclidean or any other metric of our choice.)
This method is best when:
Original data points are discrete values.
It is critical to preserve the exact values of the original data (e.g., image processing for resizing).
A computationally inexpensive method is needed.
The interpolation value is 2.43 as it chooses the nearest original data point: (x_2, y_2).

Figure C. Nearest Interpolation (Created by Kuriko IWAI)
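The same check for nearest-neighbor interpolation, using only the two highlighted points (SciPy's interp1d with kind='nearest'):

import numpy as np
from scipy.interpolate import interp1d

x_known = np.array([3.00, 4.00])
y_known = np.array([1.66, 2.43])

nearest = interp1d(x_known, y_known, kind="nearest")
print(float(nearest(3.75)))  # 2.43 -- the value of the closest original point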
▫ Spline Interpolation
Spline interpolation fits a series of piecewise polynomial functions to the original dataset.
In the most common variant, cubic splines, the interpolated value on each interval [x_i, x_{i+1}] is given by a cubic polynomial S_i(x):
S_i(x) = a_i + b_i (x − x_i) + c_i (x − x_i)² + d_i (x − x_i)³
where the coefficients a_i, b_i, c_i, d_i are computed by requiring that the spline passes through the data points and that S, S′, and S″ are continuous at every interior knot (plus boundary conditions).
This method creates a smooth, continuous curve that passes through the original data points without oscillations. So, it is best when:
The dataset would require a high-degree polynomial, where simple polynomial interpolation suffers from oscillations.
The task requires visually pleasing, differentiable curves, as in computer graphics, CAD/CAM, and numerical analysis.
The interpolation value is 2.12:

Figure D. Spline Interpolation (Created by Kuriko IWAI)
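A minimal cubic-spline sketch with SciPy. The data points are again hypothetical, so the result will differ from the 2.12 shown in the figure:

import numpy as np
from scipy.interpolate import CubicSpline

# hypothetical original data points
x_known = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_known = np.array([0.84, 1.41, 1.66, 2.43, 2.90])

# piecewise cubics with continuous first and second derivatives at the knots
spline = CubicSpline(x_known, y_known)
print(spline(3.75))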
▫ Radial Basis Function (RBF) Interpolation
RBF interpolation constructs the interpolated curve as a linear combination of radial basis functions (RBFs).
The RBF interpolated value s(x) is given by:
s(x) = Σ_{i=1..N} w_i · ϕ(||x − x_i||) + P(x)
where:
x_i: The i-th data point in the original dataset, a D-dimensional feature vector,
w_i: The weights, determined by solving a system of linear equations,
ϕ: An RBF of our choice (e.g., Gaussian, multiquadric, thin-plate spline),
||x − x_i||: The Euclidean distance between x and x_i, and
P(x): A low-degree polynomial term.
This method fully leverages the advantages of RBFs, creating a smooth interpolating surface for complex original data. So, it is best when:
The original dataset is scattered in high-dimensional spaces.
The original dataset has irregular data distribution.
The interpolated value is 2.11:

Figure E. RBF Interpolation (Created by Kuriko IWAI)
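A minimal RBF sketch with SciPy's RBFInterpolator (hypothetical 1-D data points; the kernel choice is illustrative):

import numpy as np
from scipy.interpolate import RBFInterpolator

# hypothetical original data points; RBFInterpolator expects 2-D arrays of points
x_known = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y_known = np.array([0.84, 1.41, 1.66, 2.43, 2.90])

rbf = RBFInterpolator(x_known, y_known, kernel="thin_plate_spline")
print(rbf(np.array([[3.75]])))  # interpolated value at x = 3.75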
▫ Inverse Distance Weighting (IDW) Interpolation
IDW interpolation estimates values as a distance-weighted average of the original data points, under the assumptions that:
the relationship between distance and influence is constant across the space (stationarity), and
no directional biases exist in the data (isotropy).
So, in this method, closer points have more influence, as each weight is inversely proportional to the distance.
The interpolated value ŷ(x) is given by:
ŷ(x) = Σ_i w_i(x) · y_i / Σ_i w_i(x), with w_i(x) = 1 / d(x, x_i)^p
where:
d(x, x_i): The distance between the unknown point x and the i-th original data point x_i, and
p: A positive power parameter (commonly p = 2, "inverse distance squared").
As p increases, the influence of more distant points diminishes more rapidly.
This method is best when:
The data points are relatively dense and evenly distributed (aligning with IDW’s assumptions).
The future prediction is driven more by local variation (as IDW puts more weights on nearby measured values).
Large datasets and real-time applications where quick results are needed.
The interpolated value is 2.40:

Figure F. IDW Interpolation (Created by Kuriko IWAI)
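SciPy has no dedicated IDW routine, so here is a minimal NumPy sketch of the weighted average described above (p = 2, hypothetical 1-D data points):

import numpy as np

def idw_interpolate(x_new, x_known, y_known, p=2, eps=1e-12):
    """Inverse distance weighting: closer known points get larger weights (w_i = 1 / d_i^p)."""
    x_known, y_known = np.asarray(x_known), np.asarray(y_known)
    d = np.abs(x_known - x_new)              # distances to the known points
    if np.any(d < eps):                      # exact hit: return the known value directly
        return y_known[np.argmin(d)]
    w = 1.0 / d**p                           # inverse-distance weights
    return np.sum(w * y_known) / np.sum(w)

x_known = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical data points
y_known = [0.84, 1.41, 1.66, 2.43, 2.90]
print(idw_interpolate(3.75, x_known, y_known))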
Building on this foundation of interpolation methods, in the next section I’ll explore SMOTE algorithms, which leverage linear interpolation.
SMOTE Algorithms
SMOTE (Synthetic Minority Over-sampling Technique) algorithms are data augmentation techniques leveraging linear interpolation.
They are applicable for both classification and regression tasks, handling imbalance in training data to improve model performance.
▫ How SMOTE Algorithms Work
SMOTE algorithms follow an iterative process, interpolating new samples until the given target number of samples is reached.
This process involves:
Step 1. Choose a random sample x_i from the minority class.
Step 2. Find k-nearest neighbors of x_i in the feature space using Euclidean distance.
Step 3. Randomly select one neighbor from the neighbors found in Step 2.
Step 4. Interpolate to create a new sample:
A new synthetic sample x_new is generated using linear interpolation:
x_new = x_i + λ · (x_neighbor − x_i)
where:
x_new: The newly generated synthetic sample,
x_i: The original minority class sample,
x_neighbor: A randomly chosen sample from the k nearest neighbors of the original sample x_i,
λ: A random number between 0 and 1 (all inclusive).
The algorithm chooses a random value for λ to place the new synthetic sample at a random point along the line segment between x_i and x_neighbor.
Step 5. Repeat Steps 1 to 4 until the desired number of synthetic samples is created.
Let us see a walkthrough example.
◼ The Walkthrough Example
Imagine a 2-dimensional feature space. We have five samples in a minority class M:
M_1 = (2.0, 5.0), M_2 = (2.1, 5.2), M_3 = (1.9, 4.8), M_4 = (3.5, 6.0), M_5 = (1.0, 4.0)
and we want to secure at least 10 samples to handle class imbalance.
SMOTE algorithms start with:
Step 1. Choose a random sample x_i: here, x_i = M_1 = (2.0, 5.0).
Step 2. Find k-nearest neighbors of x_i:
First, the algorithm computes the Euclidean distance between the selected sample x_i and the other samples:
Distance from x_i to M_2: d(x_i, M_2) = sqrt((2.1−2.0)²+(5.2−5.0)²) ≈ 0.2236
Distance from x_i to M_3: d(x_i, M_3) = sqrt((1.9−2.0)²+(4.8−5.0)²) ≈ 0.2236
Distance from x_i to M_4: d(x_i, M_4) = sqrt((3.5−2.0)²+(6.0−5.0)²) ≈ 1.8028
Distance from x_i to M_5: d(x_i, M_5) = sqrt((1.0−2.0)²+(4.0−5.0)²) ≈ 1.4142
Here, let’s say k = 2.
The algorithm picks up M_2 and M_3 as neighbors based on the computed distance.
Step 3. Randomly select one neighbor: The algorithm selects M_3.
Step 4. Interpolate to create a new sample:
First, assign a random value to λ: let’s say λ = 0.4.
Then, compute x_new:
x_new = (2.0, 5.0) + 0.4 × ((1.9, 4.8) − (2.0, 5.0)) = (1.96, 4.92)
Add the new sample x_new = (1.96, 4.92) to the sample space as M_6.
Step 5. Repeat Steps 1 to 4 four more times, adding M_7, M_8, M_9, and M_10 to secure 10 samples.
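This walkthrough can be reproduced in a few lines of NumPy (k, λ, and the chosen neighbor are fixed here to match the example):

import numpy as np

# minority class samples M_1 ... M_5 from the walkthrough
M = np.array([[2.0, 5.0], [2.1, 5.2], [1.9, 4.8], [3.5, 6.0], [1.0, 4.0]])

x_i = M[0]                                # Step 1: the chosen sample (M_1)
d = np.linalg.norm(M[1:] - x_i, axis=1)   # Step 2: Euclidean distances to the other samples
print(np.round(d, 4))                     # [0.2236 0.2236 1.8028 1.4142] -> M_2, M_3 are the k=2 neighbors

x_neighbor = M[2]                         # Step 3: randomly pick one neighbor (M_3, fixed to match the example)
lam = 0.4                                 # Step 4: lambda = 0.4
x_new = x_i + lam * (x_neighbor - x_i)
print(x_new)                              # [1.96 4.92] -> added to the minority class as M_6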
Various SMOTE Algorithms
Depending on task types and input data types, SMOTE algorithms are classified into three categories:
Classification tasks with numerical input data,
Classification tasks with categorical or mixed input data, and
Regression tasks.
I’ll explore them one by one in this section.
◼ 1. Classification Task with Numerical Input Data
SMOTE, KMeansSMOTE, the Borderline SMOTE variations (including SVM SMOTE), and ADASYN are applicable only to classification tasks with continuous input values.
Applicable Task Types: Classification
Applicable Input Data Types: Continuous only
SMOTE:
Generates synthetic samples for the minority class by interpolating between existing minority samples and their k-nearest neighbors.
- Best When: Minority class samples are not extremely rare, and in dense feature space.
KMeans SMOTE:
This method combines K-Means clustering with SMOTE.
It first clusters the minority class samples using K-Means and then applies SMOTE within each cluster, focusing on generating synthetic samples in less dense regions of these clusters.
- Best When: The minority class has multiple clusters or sub-distributions.
Borderline SMOTE:
A variation of SMOTE that only oversamples minority samples close to the decision boundary.
- Best When: Minority class samples sit close to the decision boundary, surrounded by a mix of majority and minority class neighbors.
SVM SMOTE:
SVM SMOTE is a variation of Borderline SMOTE that uses an SVM (Support Vector Machine) classifier to identify support vectors as borderline minority samples, and then applies SMOTE only to these samples.
- Best When: Similar to Borderline SMOTE, but best when the decision boundary is more complex.
ADASYN (Adaptive Synthetic Sampling):
ADASYN adaptively generates more synthetic data for minority class samples that are harder to learn because they lie close to majority class samples.
It first identifies these “difficult“ samples by computing, for each minority sample, a ratio r_i: the share of majority class samples among its k nearest neighbors.
A higher r_i means the sample is more “difficult“ because it sits closer to the decision boundary, surrounded by more majority class samples.
Hence, for samples with high r_i, it generates more synthetic minority samples (which is what makes it adaptive), using the same interpolation approach as SMOTE (a minimal sketch follows the list below).
This method is best when:
The decision boundary is extremely complex.
Certain minority class instances are difficult to classify due to their proximity to majority class clusters.
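A hedged sketch of the adaptive weighting idea (not imblearn's exact implementation): for each minority sample, the share of majority-class points among its k nearest neighbors determines how many synthetic samples are generated around it.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, minority_label=1, k=5):
    """Return the normalized difficulty ratio r_i for each minority sample."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbor
    _, idx = nn.kneighbors(X_min)
    # r_i = share of majority-class points among the k nearest neighbors
    r = np.array([(y[i[1:]] != minority_label).mean() for i in idx])
    return r / r.sum() if r.sum() > 0 else r

# each minority sample x_i then receives roughly g_i = round(weight_i * G) synthetic samples,
# where G is the total number of samples to generate, interpolated as in plain SMOTE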
◼ Simulation
Let us see how they work on synthetic continuous data.
To compare the variations, I used the same dataset and the same hyperparameters for Logistic Regression.
LightGBM, trained without data augmentation, was also included as a benchmark.
▫ The Data
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# create a dataset
np.random.seed(42)
n_samples, balance, test_size = 5000, [0.95, 0.05], 1000

# class sizes
n_class_0 = int(n_samples * balance[0])
n_class_1 = int(n_samples * balance[1])

# class 0 (majority)
X_0 = np.random.randn(n_class_0, 2) * 0.8 + np.array([-1, -1])
y_0 = np.zeros(n_class_0, dtype=int)

# class 1 (minority) - slightly shifted and with less variance
X_1 = np.random.randn(n_class_1, 2) * 0.3 + np.array([1, 1])
y_1 = np.ones(n_class_1, dtype=int)

# merge the two classes
X = np.vstack((X_0, X_1))
y = np.hstack((y_0, y_1))

# create and scale train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
▫ The Augmentation & Model Training
SMOTE algorithms are applied only to the training dataset after preprocessing.
import lightgbm as lgb
from collections import Counter
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, KMeansSMOTE, SVMSMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# define a function to build, train, and evaluate the model
def evaluate_model(X_train, y_train, X_test, y_test, model=None, sampler_name="No Oversampling"):
    # build a model (Logistic Regression by default; LightGBM when `model` is set)
    model = LogisticRegression(
        penalty='l2',
        solver='lbfgs',
        random_state=42,
        class_weight='balanced',
        tol=0.0001,
        C=100,
        max_iter=5000,
    ) if not model else lgb.LGBMClassifier(
        class_weight='balanced',
        n_estimators=1000,
        random_state=42,
        n_jobs=-1,
        verbosity=-1,
        learning_rate=0.01,
        num_leaves=200,
        max_depth=10,
        reg_alpha=0.02,
        reg_lambda=0.1,
        subsample=0.4
    )

    # train
    model.fit(X_train, y_train)

    # make predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # log adjusted class balance and f1 scores
    print(f"--- {sampler_name} ---")
    print(f"Train data distribution: class 0: {Counter(y_train)[0]}, class 1: {Counter(y_train)[1]}")
    print(f"F1 (minority class): Train {f1_score(y_train, y_pred_train, pos_label=1):.4f} -> Generalization: {f1_score(y_test, y_pred_test, pos_label=1):.4f}")


# base - logistic regression (lr) without smote
evaluate_model(X_train, y_train, X_test, y_test, sampler_name="No SMOTE (LR)")

# lr with smote
k, m = 3, 10
smote_sampler = SMOTE(k_neighbors=k, random_state=42)
X_res_smote, y_res_smote = smote_sampler.fit_resample(X_train, y_train)
evaluate_model(X_res_smote, y_res_smote, X_test, y_test, sampler_name="SMOTE")

# lr with kmeans smote
kmeans_smote_sampler = KMeansSMOTE(
    k_neighbors=k,
    random_state=42,
    n_jobs=-1,
    cluster_balance_threshold=0.05,  # lowers the threshold to avoid a "no cluster found" error
)
X_res_kmsmote, y_res_kmsmote = kmeans_smote_sampler.fit_resample(X_train, y_train)
evaluate_model(X_res_kmsmote, y_res_kmsmote, X_test, y_test, sampler_name="KMeansSMOTE")

# lr with borderline smote
borderline_smote_sampler = BorderlineSMOTE(k_neighbors=k, m_neighbors=m, random_state=42)
X_res_bsmote, y_res_bsmote = borderline_smote_sampler.fit_resample(X_train, y_train)
evaluate_model(X_res_bsmote, y_res_bsmote, X_test, y_test, sampler_name="BorderlineSMOTE")

# lr with svm smote
svmsmote_sampler = SVMSMOTE(k_neighbors=k, m_neighbors=m, random_state=42)
X_res_svmsmote, y_res_svmsmote = svmsmote_sampler.fit_resample(X_train, y_train)
evaluate_model(X_res_svmsmote, y_res_svmsmote, X_test, y_test, sampler_name="SVMSMOTE")

# lr with adasyn
adasyn_sampler = ADASYN(n_neighbors=k, random_state=42)
X_res_adasyn, y_res_adasyn = adasyn_sampler.fit_resample(X_train, y_train)
evaluate_model(X_res_adasyn, y_res_adasyn, X_test, y_test, sampler_name="ADASYN")

# benchmark - lightgbm without smote (any truthy value for `model` selects LightGBM)
evaluate_model(X_train, y_train, X_test, y_test, model=True, sampler_name="No Oversampling (LightGBM)")
▫ Results
Class balance after data augmentation:
The minority class was expanded from 200 to 3,800 samples, matching the size of the majority class.
Original: class 0: 3800, class 1: 200 (Applied for both Logistic Regression and LightGBM)
SMOTE: class 0: 3800, class 1: 3800
KMeansSMOTE: class 0: 3800, class 1: 3800
BorderlineSMOTE: class 0: 3800, class 1: 3800
SVM SMOTE: class 0: 3800, class 1: 3800
ADASYN: class 0: 3800, class 1: 3796
F1 score for the minority class:

Figure G. Comparison of F1 scores (blue: train, red: generalization) among SMOTE algorithms, Logistic Regression, and LightGBM (Created by Kuriko IWAI)
SMOTE and KMeansSMOTE achieved the best generalization F1 scores for the minority class, surpassing the benchmark of LightGBM.
While the overall training scores remain very high, Borderline SMOTE, SVM SMOTE, and ADASYN did not show the same level of generalization capability, underperforming LightGBM.
This indicates that for this dataset, sample imbalance should be addressed across the feature space of the minority class, not only in regions close to the decision boundary.
Decision boundary:
SMOTE and KMeansSMOTE augmented samples across the minority sample space using k-nearest-neighbor interpolation (black-bordered dots), while the borderline-based SMOTE variants in the second row added samples near the decision boundary.

Figure H. Comparison of decision boundaries among SMOTE algorithms and Logistic Regression (Created by Kuriko IWAI)
◼ 2. Classification Tasks with Mixed / Categorical Input Data
When the input data contains categorical features, SMOTE N and SMOTE NC can handle them well.
SMOTE N (SMOTE for Nominal features):
Specifically designed for datasets composed entirely of nominal (categorical) features.
It generates new samples by selecting the most frequent category among the neighbors, measuring distances between categorical values based on their relationship to the target class.
Applicable Task Types: Classification
Applicable Input Data Types: Discrete/Categorical only
Best When: The original dataset solely consists of categorical features.
SMOTE NC (SMOTE for Nominal and Continuous features):
A variation of SMOTE designed to handle datasets with both continuous and nominal (categorical) features.
It uses linear interpolation for continuous features and mode-based or frequency-based assignment for categorical features.
Applicable Task Types: Classification
Applicable Input Data Types: Mixed
Best When: Need to balance mixed data types when augmenting the data.
◼ Simulation
I compared the performance using the same approach as in the previous case.
▫ The Data
I created synthetic datasets with categorical-only and mixed features, encoded the categorical features, and scaled the numerical features (the categorical-only dataset is shown below).
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTEN

# create a synthetic dataset only with categorical features
np.random.seed(42)
sample_size, minority, test_size = 5000, 2000, 2000
data_cat = {
    'cat_1': np.random.choice(['X', 'Y', 'Z'], sample_size),
    'cat_2': np.random.choice(['P', 'Q'], sample_size),
    'cat_3': np.random.choice(['M', 'N', 'O', 'P', 'R'], sample_size)
}
df_cat = pd.DataFrame(data_cat)

# distribute target variables
target_values = np.array([1] * minority + [0] * (sample_size - minority))
np.random.shuffle(target_values)
df_cat['target'] = target_values

# create train / test datasets
X = df_cat[['cat_1', 'cat_2', 'cat_3']]
y = df_cat['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42, stratify=y)

# apply the encoder
encoder = OneHotEncoder(handle_unknown='ignore')
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
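The mixed-feature dataset used for SMOTE NC below isn't shown in full; the following is a hedged sketch of what it might look like, continuing from the variables defined above (the feature names num_1, num_2, and cat_1, and the generated values, are illustrative assumptions that match the preprocessing code that follows):

# hypothetical mixed dataset assumed for the SMOTE NC run below
data_mixed = {
    'num_1': np.random.randn(sample_size) * 10 + 50,
    'num_2': np.random.randn(sample_size) * 5,
    'cat_1': np.random.choice(['X', 'Y', 'Z'], sample_size),  # categorical column at index 2
}
df_mixed = pd.DataFrame(data_mixed)
df_mixed['target'] = target_values

X_mixed = df_mixed[['num_1', 'num_2', 'cat_1']]
y_mixed = df_mixed['target']
X_train_mix, X_test_mix, y_train_mix, y_test_mix = train_test_split(
    X_mixed, y_mixed, test_size=test_size, random_state=42, stratify=y_mixed
)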
▫ The Augmentation & Model Training
Using the same evaluate_model function, I built, trained, and evaluated each method:
from imblearn.over_sampling import SMOTENC
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# logistic regression (no data augmentation)
evaluate_model(X_train, y_train, X_test, y_test, sampler_name="Logistic Regression (Original)")

# apply SMOTE N to the training data with categorical features only
smoten_sampler = SMOTEN(random_state=42, k_neighbors=3)
X_train_smoten, y_train_smoten = smoten_sampler.fit_resample(X_train, y_train)
evaluate_model(X_train_smoten, y_train_smoten, X_test, y_test, sampler_name="SMOTEN")


# apply SMOTE NC to the mixed features
# (uses the mixed training split sketched above: num_1, num_2, and the categorical cat_1 at column index 2)
smotenc_sampler = SMOTENC(
    categorical_features=[2],
    categorical_encoder=OneHotEncoder(handle_unknown='ignore'),  # encode categorical features
    k_neighbors=3,
    random_state=42,
)
X_train_smote_nc, y_train_smote_nc = smotenc_sampler.fit_resample(X_train_mix, y_train_mix)

# scale numerical features
num_trans = Pipeline(steps=[('scaler', StandardScaler())])
preprocessor = ColumnTransformer(transformers=[('num', num_trans, ['num_1', 'num_2'])])
X_train_smote_nc = preprocessor.fit_transform(X_train_smote_nc)
X_test_mix_s = preprocessor.transform(X_test_mix)

# model training, evaluation
evaluate_model(X_train_smote_nc, y_train_smote_nc, X_test_mix_s, y_test_mix, sampler_name="SMOTE NC")


# light gbm (no data augmentation)
evaluate_model(X_train, y_train, X_test, y_test, model=True, sampler_name="LightGBM")
▫ Results
Class balance after data augmentation:
Original Data: class 0: 1800, class 1: 1200 (Applied to Logistic Regression and LightGBM)
SMOTE N: class 0: 1800, class 1: 1800
SMOTE NC: class 0: 2400, class 1: 2400 (adding numerical feature to the SMOTE N data)
F1 score for the minority class:
SMOTE N (categorical only) achieved the best generalization score of 0.472, outperforming LightGBM. (It does show a tendency to overfit, though; tighter regularization would help.)

Figure I. Comparison of F1 scores (blue: train, red: generalization) for a classification task with categorical data (Created by Kuriko IWAI)
SMOTE NC (for mixed data) also outperformed the benchmark of LightGBM.

Figure J. Comparison of F1 scores (blue: train, red: generalization) for a classification task with mixed data (Created by Kuriko IWAI)
Important Consideration:
When dealing with mixed or categorical-only data with imbalanced classes, we observed lower F1 scores for the minority class, with or without data augmentation.
This is due to the high dimensionality introduced by encoding categorical features, which can exacerbate the "curse of dimensionality" and dilute the effectiveness of the augmentation.
Specifically, in high-dimensional spaces, the synthetic samples generated by SMOTE algorithms might not effectively represent the true underlying distribution of the minority class, or they might introduce noise, leading to suboptimal model performance.
We need to consider applying tactics like:
Feature engineering before encoding,
Encoding strategies like binary encoding to optimize the number of dimensions increased, and
Dimensionality reduction, for example through PCA (see the sketch below).
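For instance, a hedged sketch of the last tactic: reducing the one-hot encoded training features above to a denser space with PCA before oversampling (the component count is illustrative):

from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE

# reduce the 10 one-hot columns to 5 principal components (illustrative choice)
pca = PCA(n_components=5, random_state=42)
X_train_pca = pca.fit_transform(X_train.toarray())  # .toarray() because the encoder returned a sparse matrix
X_test_pca = pca.transform(X_test.toarray())

# then oversample the minority class in the reduced space
smote = SMOTE(k_neighbors=3, random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train_pca, y_train)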
◼ 3. Regression Tasks
Regardless of input data types, SMOTE R from the smogn library can handle value imbalance in regression tasks.
SMOTE R (SMOTE for Regression)
SMOTE R adapts the SMOTE concept for regression problems where the target variable has an imbalanced distribution (e.g., a few extreme values).
It generates synthetic samples by interpolating both features and the target variable, aiming to balance the distribution of the target.
- Best When: Dealing with regression problems where there's an imbalance in the distribution of target variable values (e.g., very few instances with extremely high or low values).
Let us see how it works.
◼ Simulation
I compared the performance using the same approach as in the previous cases.
▫ The Data
I created a synthetic dataset for a regression task and intentionally removed 80% of the samples with target values between 50 and 100 to create an extremely sparse, irregular region:
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

np.random.seed(42)
sample_size, n_features, test_size = 5000, 5, 2000
X, y = make_regression(n_samples=sample_size, n_features=n_features, noise=30, random_state=42)

# introduce imbalance by making a range of y values less frequent (i.e., values between 50 and 100 will be sparse)
y_imbalanced = np.copy(y)
mask_sparse = (y_imbalanced > 50) & (y_imbalanced < 100)

# randomly remove a large portion of samples in this range
sparse_indices = np.where(mask_sparse)[0]
np.random.shuffle(sparse_indices)
remove_count = int(len(sparse_indices) * 0.8)
y_imbalanced = np.delete(y_imbalanced, sparse_indices[:remove_count])
X_imbalanced = np.delete(X, sparse_indices[:remove_count], axis=0)

# make train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X_imbalanced, y_imbalanced, test_size=test_size, random_state=42)

# preprocess (only numerical features here; the categorical pipeline is unused for this dataset)
num_trans = Pipeline(steps=[('scaler', StandardScaler())])
cat_trans = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(transformers=[('num', num_trans, list(range(X_train.shape[1])))])

X_train_en = preprocessor.fit_transform(X_train)
X_test_en = preprocessor.transform(X_test)
▫ Applying SMOTE R
SMOTE R needs both the input features and the target values to assess the imbalance.
So, I first created a DataFrame containing both X_train and y_train, applied SMOTE R to it, and then preprocessed the augmented training dataset X_train_smote:
import smogn
import pandas as pd

# make a DataFrame of the training data (X_train and y_train)
df_train = pd.DataFrame(X_train, columns=[f'feature_{i}' for i in range(X_train.shape[1])])
df_train['target'] = y_train

# apply SMOTE R (this returns a DataFrame)
smote_train = smogn.smoter(data=df_train, y='target')

# split the DataFrame back into X_train and y_train
y_train_smote = smote_train['target']
X_train_smote = smote_train.drop('target', axis=1)

# preprocess
X_train_smote = preprocessor.fit_transform(X_train_smote)
▫ Evaluation
To compare performance, I used a regression counterpart of the evaluate_model helper that trains a simple Linear Regression model and reports MSE.
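A minimal sketch of what such a helper might look like, assuming the same call signature as the calls below (hypothetical reconstruction; the exact implementation isn't shown):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def evaluate_model(X_train, y_train, X_test, y_test, label="LR (Original)"):
    """Hypothetical regression helper: train a plain Linear Regression model and report MSE."""
    model = LinearRegression().fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_test = mean_squared_error(y_test, model.predict(X_test))
    print(f"{label}: Train MSE {mse_train:,.2f} -> Generalization MSE {mse_test:,.2f}")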
evaluate_model(X_train_en, y_train, X_test_en, y_test)
evaluate_model(X_train_smote, y_train_smote, X_test_en, y_test, 'LR with SMOTE R')
▫ Results
Class balance after data augmentation:
SMOTE R shifted the distribution of the target values (right panel in the figure) by generating synthetic samples focused on the unevenly distributed region.

Figure K. Comparison of data distribution (left: normal distribution, middle: imbalanced distribution for experimentation, right: applied SMOTE R) for a regression task (Created by Kuriko IWAI)
MSE Scores:
Trained on the original, imbalanced data: 3,490.29 → Generalization: 3,609.83
Trained on SMOTE R data: 2,127.16 → Generalization: 2,190.76
The model trained on the augmented data achieved lower MSE on both the training and test sets.
By balancing the target value distribution, SMOTE R enables the linear regression model to learn more robustly and make much more accurate predictions on new, unseen data.
Wrapping Up
Data augmentation is extremely useful for building powerful models especially when we only have limited, but high-quality data.
In our experiments, we observed improved performance in both classification and regression tasks on imbalanced datasets when applying SMOTE algorithms.
Yet, we also learned the importance of feature engineering to extract relevant information and of dimensionality control to enable the model to learn efficiently from dense feature spaces, especially when handling categorical data.
Ultimately, data augmentation is key for building robust machine learning models.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Advanced Cross-Validation for Sequential Data: A Guide to Avoiding Data Leakage
A Guide to Synthetic Data Generation: Statistical and Probabilistic Approaches
Maximum A Posteriori (MAP) Estimation: Balancing Data and Expert Knowledge
Beyond Simple Imputation: Understanding MICE for Robust Data Science
Maximizing Predictive Power: Best Practices in Feature Engineering for Tabular Data
Related Books for Further Understanding
These books cover a wide range of theory and practice, from the fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
Share What You Learned
Kuriko IWAI, "Data Augmentation Techniques for Tabular Data: From Noise Injection to SMOTE" in Kernel Labs
https://kuriko-iwai.com/data-augmentation-techniques-in-machine-learning
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.




