Data Augmentation Techniques for Tabular Data: From Noise Injection to SMOTE
A comprehensive guide on enhancing machine learning models using Gaussian noise, interpolation methods (Spline, RBF, IDW), and adaptive SMOTE algorithms for real-world datasets.
By Kuriko IWAI

Table of Contents
Introduction
What is Data Augmentation
Noise Injection
Interpolation
SMOTE Algorithms
Various SMOTE Algorithms
Wrapping Up

Introduction
Machine learning models require training on substantial amounts of high-quality, relevant data.
Yet, real-world data presents significant challenges due to its inherent imperfections.
Data augmentation is a key strategy to tackle these challenges and provide robust training for the model.
In this article, I’ll explore major data augmentation techniques for tabular data:
noise injection and
interpolation methods, including SMOTE algorithms,
along with practical implementation examples.
What is Data Augmentation
Data augmentation is a data enhancement technique in machine learning that expands the original dataset through targeted transformations, helping to address data scarcity and imbalance.
Its major techniques include noise injection, where the model is trained on a dataset with intentionally added noise, and interpolation methods, where the algorithm estimates unknown data points from the original dataset.
Because this approach expands the original dataset itself, fully leveraging data augmentation requires a sufficiently large and accurate dataset that reflects the true underlying data distribution.
Otherwise, noise and outliers in the original dataset that the model shouldn’t learn are also augmented as new data, completely misleading the model.
◼ Why Data Augmentation is Important: The Challenges of Real-World Data
For a model to be effective, it must be trained on data that accurately reflects patterns likely to recur in the future.
Lack of high-quality, relevant data prevents models from learning effectively, leading to poor performance.
However, two primary challenges arise when dealing with real-world datasets: data quantity issues and data quality issues.
▫ Data Quantity Issues:
Acquiring sufficient data can be a significant hurdle when relevant events are extremely rare (e.g., predicting rare decreases).
Insufficient data leads to two major problems:
Underfitting, where the model fundamentally fails to learn patterns from the data, resulting in high bias, and
Class imbalance in classification tasks, where certain classes in the target variable have far fewer samples than others, biasing the model toward the dominant classes.
▫ Data Quality Issues:
Even with sufficient data, imperfections like missing values, noise, or inconsistencies can severely mislead a model.
This commonly causes overfitting, where the model learns incorrect patterns from the training data and fails to generalize to unseen data, resulting in high variance.
◼ Choosing the Right Data Enhancement Approach
Data enhancement collectively refers to machine learning strategies that expand and improve the quality of training datasets to boost a model’s generalization capabilities.
Primary approaches include imputation, synthetic data generation, and data augmentation, each of which handles a different type of data limitation:
▫ Imputation
This technique addresses missing values within existing datasets.
Importantly, it doesn’t increase the number of samples; instead, it fills in gaps in the original data points.
Depending on the type of missing data, imputation approaches vary:
Statistical: Mean, Median, Mode Imputation
Model-based: KNN Imputation, Regression Imputation
Deep learning based: GAIN (Generative Adversarial Imputation Networks)
Time series specific: Forward Fill/Backward Fill
▫ Synthetic Data Generation
This approach is ideal when we are facing limitations in data quantity, privacy concerns, or data sharing restrictions.
It involves creating entirely new datasets from scratch, meticulously designed to reflect the statistical properties of real data without using actual sensitive information.
Advanced techniques like Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) can generate high-fidelity synthetic data, which is particularly useful when real data is scarce, sensitive, or contains significant imperfections.
▫ Data Augmentation
This method tackles limitations in data quantity by expanding the original dataset rather than creating an entirely new one (the key difference from synthetic data generation).
It involves applying various transformations to the original data (e.g., rotating images, adding noise to audio) without collecting new raw data.
This process helps the model generalize better to unseen examples, lowering variance.
Now, let us explore two major data augmentation techniques: noise injection and interpolation methods.
Noise Injection
Noise injection is a data augmentation technique to deliberately introduce controlled random perturbations into continuous features during model training.
This method is applicable to both regression and classification tasks, but noise must be injected only into continuous features.
For example:
- Original Data Point: [age: 35, income: 60000, gender: 1]
Applying noise injection by adding a small, random value to each continuous feature:
Augmented Data Point 1: [age: 35 + 1.2 = 36.2, income: 60000 - 550 = 59450, gender: 1]
Augmented Data Point 2: [age: 35 - 0.8 = 34.2, income: 60000 + 720 = 60720, gender: 1]
In this example, the noise for age and income is randomly drawn from the ranges -10 to 10 and -1,000 to 1,000, respectively.
The discrete feature gender is out of scope, so it remains unchanged.
Although noise injection does not increase the number of samples in the dataset, it implicitly expands the feature space by perturbing continuous features.
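As a minimal sketch of the example above (assuming uniform noise within the stated ranges; the feature values are the illustrative ones from the example):

import numpy as np

rng = np.random.default_rng(42)

# original record from the example: gender is discrete, so it is left untouched
record = {"age": 35.0, "income": 60000.0, "gender": 1}

def add_noise(record, n_copies=2):
    """Create noisy copies of a record by perturbing only its continuous features."""
    copies = []
    for _ in range(n_copies):
        new = dict(record)
        new["age"] += rng.uniform(-10, 10)         # noise range for age
        new["income"] += rng.uniform(-1000, 1000)  # noise range for income
        copies.append(new)
    return copies

print(add_noise(record))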
Major techniques applicable for tabular data include:
Gaussian Noise Injection: Adds random values sampled from a Gaussian distribution to the original dataset, and
Jittering: Applies small, random perturbations (often Gaussian) to individual data points in time-series/sequential data.
Now, let’s take a look at how a common noise injection method, Gaussian noise injection, works.
◼ Demonstration: Gaussian Noise Injection
I created a scenario where a Linear Regression model is trained on extremely noisy data because the deployment environment is expected to be noisy (e.g., sensor readings with measurement errors).
This scenario is challenging because, by nature, Linear Regression needs abundant data with an approximately linear relationship between features and target to learn accurate approximations.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# underfit due to limited samples
n_samples, n_features = 100, 10

# creates true X
X_true = np.random.rand(n_samples, n_features)

# creates true y (extremely noisy)
true_coefficients = np.random.randn(n_features)
true_bias = 100
y_true_noise = np.random.rand(n_samples) * 10000
y_true = np.dot(X_true, true_coefficients) + true_bias + y_true_noise

# splits and scales the data
X_train, X_test, y_train, y_test = train_test_split(X_true, y_true, test_size=30, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# trains the model on the scaled data and makes predictions
model = LinearRegression().fit(X_train_s, y_train)
y_pred_train = model.predict(X_train_s)
y_pred_test = model.predict(X_test_s)

# computes evaluation metrics
mse_train = mean_squared_error(y_train, y_pred_train)
mae_train = mean_absolute_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
mae_test = mean_absolute_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)
▫ Results from Original Data
Without noise injection, the model failed to learn the pattern, ending up with significantly high errors (e.g., a generalization MSE of 48,429.01).
MSE: Train 21,232.91 → Generalization on test set: 48,429.01
MAE: Train 3,472.48 → Generalization on test set: 5,943.21
R2 Score: Train: -1.00 → Generalization on test set: -4.5368
▫ Adding Gaussian Noise
Then, I added Gaussian Noise to the training dataset and retrained the model:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# adds gaussian noise to the training dataset (before scaling)
gaussian_noise = np.random.normal(loc=0, scale=1, size=X_train.shape)
X_train_noise = X_train + gaussian_noise

# scales the dataset
scaler = StandardScaler()
X_train_noise_s = scaler.fit_transform(X_train_noise)
X_test_noise_s = scaler.transform(X_test)

# retrains the model and makes predictions
model = LinearRegression().fit(X_train_noise_s, y_train)
y_pred_train_noise = model.predict(X_train_noise_s)
y_pred_test_noise = model.predict(X_test_noise_s)

# computes evaluation metrics
mse_train = mean_squared_error(y_train, y_pred_train_noise)
mae_train = mean_absolute_error(y_train, y_pred_train_noise)
r2_train = r2_score(y_train, y_pred_train_noise)

mse_test = mean_squared_error(y_test, y_pred_test_noise)
mae_test = mean_absolute_error(y_test, y_pred_test_noise)
r2_test = r2_score(y_test, y_pred_test_noise)
▫ Results from Data with Gaussian Noise
The model’s performance improved significantly: the generalization MSE dropped from 48,429 to 8,962.
MSE: Train 9,240.38 → Generalization on test set: 8,962.52
MAE: Train 2,632.58 → Generalization on test set: 2,610.19
R2 Score: Train: 0.13 → Generalization on test set: -0.0247
These results indicate that the model became more robust to noisy real-world data after being trained with Gaussian noise.
There are, however, occasions when we should avoid noise injection:
When interpretability is crucial: The added noise obscures the relationship between input features and predictions.
When the model is sensitive to small input perturbations: Especially in safety-critical systems, even small changes to the input could lead to inaccurate outputs.
When training time is extremely limited: Injecting noise at scale can increase computational cost and training time.
Otherwise, noise injection is a useful method to combat moderate overfitting by forcing the model to learn varied versions of the data.
Interpolation
Interpolation is a data augmentation technique that expands the original dataset by estimating unknown values between randomly chosen data points, enriching the underlying data distribution.
Because of this estimation process, this method requires the original dataset to be accurate and sufficiently robust.
It’s not suitable for:
Very limited datasets, because new data cannot be estimated reliably, or
Noisy datasets, as the noise is also propagated into the new data, misleading the model.
◼ Types of Interpolation Methods
Among the many interpolation methods, linear interpolation is the most common and intuitive.
Mathematically, the interpolated value y at a point x lying between two randomly chosen samples (x_1, y_1) and (x_2, y_2) is given by:
y = y_1 + (y_2 − y_1) · (x − x_1) / (x_2 − x_1)
The figure below visualizes the linearly interpolated curve (blue line) over the original data points (red dots).
Taking two random original data points:
(x_1, y_1) = (3.00, 1.66),
(x_2, y_2) = (4.00, 2.43) (highlighted in orange in the figure)
for example, the interpolated value y at a random point x = 3.75 between the two original points is computed as y = 2.24:

Figure A. Linear interpolation (Created by Kuriko IWAI)
Linear interpolation is best when:
The original dataset is relatively small.
The underlying relationship between two original data points seems linear.
Higher-order smoothness is not critical (e.g., basic graphing, resampling)
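A quick sanity check of the worked example above with NumPy (np.interp performs piecewise linear interpolation over the known points):

import numpy as np

x_known = [3.00, 4.00]   # (x_1, x_2) from the example
y_known = [1.66, 2.43]   # (y_1, y_2)

# interpolate at the arbitrary point x = 3.75
y_new = np.interp(3.75, x_known, y_known)
print(round(y_new, 2))   # 2.24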
Next, I’ll look at how the major interpolation methods behave at the same arbitrary point, x = 3.75.
▫ Polynomial Interpolation
Polynomial interpolation fits a polynomial function through a set of original data points, instead of a linear function.
For n original data points, the interpolated value is given by a polynomial P(x) of degree up to n − 1:
P(x) = Σ_{j=1..n} y_j · L_j(x)
where L_j(x) is the Lagrange basis polynomial corresponding to the j-th data point x_j:
L_j(x) = Π_{m ≠ j} (x − x_m) / (x_j − x_m)
(in the case of Lagrange interpolation)
This method is best when:
The original dataset is relatively small.
The underlying relationship across the original data points is a single, smooth polynomial of moderate degree (high-degree polynomials suffer from Runge's phenomenon).
The interpolation value for the arbitrary point is 2.46:

Figure B. Polynomial Interpolation (Created by Kuriko IWAI)
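A minimal sketch of Lagrange polynomial interpolation with SciPy. The sample points below are hypothetical stand-ins for the red dots in the figure, so the value at x = 3.75 will not exactly match the 2.46 reported above:

import numpy as np
from scipy.interpolate import lagrange

# hypothetical original data points (not the exact points used in the figure)
x_known = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_known = np.array([0.84, 1.41, 1.66, 2.43, 2.90])

# fit a degree n-1 polynomial through all n points and evaluate it
poly = lagrange(x_known, y_known)
print(poly(3.75))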
▫ Nearest Neighbor Interpolation
Nearest neighbor interpolation assigns the value of an original data point closest to the given unknown data point.
The interpolated value y is given by the value y_k of the original data point whose x_k has the minimum distance to the query point:
y(x) = y_k, where k = argmin_i d(x, x_i)
(Here using Manhattan distance; the distance metric can also be Euclidean or any other metric of our choice.)
This method is best when:
Original data points are discrete values.
It is critical to preserve the exact values of the original data (e.g., image processing for resizing).
A computationally inexpensive method is needed.
The interpolation value is 2.43 as it chooses the nearest original data point: (x_2, y_2).

Figure C. Nearest Interpolation (Created by Kuriko IWAI)
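The same check for nearest-neighbor interpolation, using only the two highlighted points (SciPy's interp1d with kind='nearest'):

import numpy as np
from scipy.interpolate import interp1d

x_known = np.array([3.00, 4.00])
y_known = np.array([1.66, 2.43])

nearest = interp1d(x_known, y_known, kind="nearest")
print(float(nearest(3.75)))  # 2.43 -- the value of the closest original point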
▫ Spline Interpolation
Spline interpolation fits a series of piecewise polynomial functions to the original dataset.
In the most common variant, cubic splines, the interpolated value on each interval [x_i, x_{i+1}] is given by a cubic polynomial S_i(x):
S_i(x) = a_i + b_i (x − x_i) + c_i (x − x_i)² + d_i (x − x_i)³
where the coefficients a_i, b_i, c_i, d_i are computed by requiring that the spline passes through the data points and that S, S′, and S″ are continuous at every interior knot (plus boundary conditions).
This method creates a smooth, continuous curve that passes through the original data points without oscillations. So, it is best when:
The dataset would require a high-degree polynomial, where simple polynomial interpolation suffers from oscillations.
The task requires visually pleasing, differentiable curves, as in computer graphics, CAD/CAM, and numerical analysis.
The interpolation value is 2.12:

Figure D. Spline Interpolation (Created by Kuriko IWAI)
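A minimal cubic-spline sketch with SciPy. The data points are again hypothetical, so the result will differ from the 2.12 shown in the figure:

import numpy as np
from scipy.interpolate import CubicSpline

# hypothetical original data points
x_known = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_known = np.array([0.84, 1.41, 1.66, 2.43, 2.90])

# piecewise cubics with continuous first and second derivatives at the knots
spline = CubicSpline(x_known, y_known)
print(spline(3.75))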
▫ Radial Basis Function (RBF) Interpolation
RBF interpolation constructs the interpolated curve as a linear combination of radial basis functions (RBFs).
The RBF interpolated value s(x) is given by:
s(x) = Σ_{i=1..N} w_i · ϕ(||x − x_i||) + P(x)
where:
x_i: The i-th data point in the original dataset, a D-dimensional feature vector,
w_i: The weights, determined by solving a system of linear equations,
ϕ: An RBF of our choice (e.g., Gaussian, multiquadric, thin-plate spline),
||x − x_i||: The Euclidean distance between x and x_i, and
P(x): A low-degree polynomial term.
This method fully leverages the advantages of RBFs, creating a smooth interpolating surface for complex original data. So, it is best when:
The original dataset is scattered in high-dimensional spaces.
The original dataset has irregular data distribution.
The interpolated value is 2.11:

Figure E. RBF Interpolation (Created by Kuriko IWAI)
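A minimal RBF sketch with SciPy's RBFInterpolator (hypothetical 1-D data points; the kernel choice is illustrative):

import numpy as np
from scipy.interpolate import RBFInterpolator

# hypothetical original data points; RBFInterpolator expects 2-D arrays of points
x_known = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y_known = np.array([0.84, 1.41, 1.66, 2.43, 2.90])

rbf = RBFInterpolator(x_known, y_known, kernel="thin_plate_spline")
print(rbf(np.array([[3.75]])))  # interpolated value at x = 3.75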
▫ Inverse Distance Weighting (IDW) Interpolation
IDW interpolation estimates values as a distance-weighted average of the original data points, under the assumptions that:
the relationship between distance and influence is constant across the space (stationarity), and
no directional biases exist in the data (isotropy).
So, in this method, closer points have more influence, as each weight is inversely proportional to the distance.
The interpolated value ŷ(x) is given by:
ŷ(x) = Σ_i w_i(x) · y_i / Σ_i w_i(x), with w_i(x) = 1 / d(x, x_i)^p
where:
d(x, x_i): The distance between the unknown point x and the i-th original data point x_i, and
p: A positive power parameter (commonly p = 2, "inverse distance squared").
As p increases, the influence of more distant points diminishes more rapidly.
This method is best when:
The data points are relatively dense and evenly distributed (aligning with IDW’s assumptions).
The future prediction is driven more by local variation (as IDW puts more weights on nearby measured values).
Large datasets and real-time applications where quick results are needed.
The interpolated value is 2.40:

Figure F. IDW Interpolation (Created by Kuriko IWAI)
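SciPy has no dedicated IDW routine, so here is a minimal NumPy sketch of the weighted average described above (p = 2, hypothetical 1-D data points):

import numpy as np

def idw_interpolate(x_new, x_known, y_known, p=2, eps=1e-12):
    """Inverse distance weighting: closer known points get larger weights (w_i = 1 / d_i^p)."""
    x_known, y_known = np.asarray(x_known), np.asarray(y_known)
    d = np.abs(x_known - x_new)              # distances to the known points
    if np.any(d < eps):                      # exact hit: return the known value directly
        return y_known[np.argmin(d)]
    w = 1.0 / d**p                           # inverse-distance weights
    return np.sum(w * y_known) / np.sum(w)

x_known = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical data points
y_known = [0.84, 1.41, 1.66, 2.43, 2.90]
print(idw_interpolate(3.75, x_known, y_known))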
Building on this foundation of interpolation methods, in the next section I’ll explore SMOTE algorithms, which leverage linear interpolation.
SMOTE Algorithms
SMOTE (Synthetic Minority Over-sampling Technique) algorithms are data augmentation techniques leveraging linear interpolation.
They are applicable for both classification and regression tasks, handling imbalance in training data to improve model performance.
▫ How SMOTE Algorithms Work
SMOTE algorithms follow an iterative process, interpolating new samples until the given target number of samples is reached.
This process involves:
Step 1. Choose a random sample x_i from the minority class.
Step 2. Find k-nearest neighbors of x_i in the feature space using Euclidean distance.
Step 3. Randomly select one neighbor from the neighbors found in Step 2.
Step 4. Interpolate to create a new sample:
A new synthetic sample x_new is generated using linear interpolation:
x_new = x_i + λ · (x_neighbor − x_i)
where:
x_new: The newly generated synthetic sample,
x_i: The original minority class sample,
x_neighbor: A randomly chosen sample from the k nearest neighbors of the original sample x_i,
λ: A random number between 0 and 1 (all inclusive).
The algorithm chooses a random value for λ to place the new synthetic sample at a random point along the line segment between x_i and x_neighbor.
Step 5. Repeat Steps 1 to 4 until the desired number of synthetic samples is created.
Let us see a walkthrough example.
◼ The Walkthrough Example
Imagine a 2-dimensional feature space. We have five samples in a minority class M:
M_1 = (2.0, 5.0), M_2 = (2.1, 5.2), M_3 = (1.9, 4.8), M_4 = (3.5, 6.0), M_5 = (1.0, 4.0)
and we want to secure at least 10 samples to handle class imbalance.
SMOTE algorithms start with:
Step 1. Choose a random sample x_i: here, x_i = M_1 = (2.0, 5.0).
Step 2. Find k-nearest neighbors of x_i:
First, the algorithm computes the Euclidean distance between the selected sample x_i and the other samples:
Distance from x_i to M_2: d(x_i, M_2) = sqrt((2.1−2.0)²+(5.2−5.0)²) ≈ 0.2236
Distance from x_i to M_3: d(x_i, M_3) = sqrt((1.9−2.0)²+(4.8−5.0)²) ≈ 0.2236
Distance from x_i to M_4: d(x_i, M_4) = sqrt((3.5−2.0)²+(6.0−5.0)²) ≈ 1.8028
Distance from x_i to M_5: d(x_i, M_5) = sqrt((1.0−2.0)²+(4.0−5.0)²) ≈ 1.4142
Here, let’s say k = 2.
The algorithm picks up M_2 and M_3 as neighbors based on the computed distance.
Step 3. Randomly select one neighbor: The algorithm selects M_3.
Step 4. Interpolate to create a new sample:
First, assign a random value to λ: let’s say λ = 0.4.
Then, compute x_new:
x_new = (2.0, 5.0) + 0.4 × ((1.9, 4.8) − (2.0, 5.0)) = (1.96, 4.92)
Add the new sample x_new = (1.96, 4.92) to the sample space as M_6.
Step 5. Repeat Steps 1 to 4 four more times, adding M_7, M_8, M_9, and M_10 to secure 10 samples.
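This walkthrough can be reproduced in a few lines of NumPy (k, λ, and the chosen neighbor are fixed here to match the example):

import numpy as np

# minority class samples M_1 ... M_5 from the walkthrough
M = np.array([[2.0, 5.0], [2.1, 5.2], [1.9, 4.8], [3.5, 6.0], [1.0, 4.0]])

x_i = M[0]                                # Step 1: the chosen sample (M_1)
d = np.linalg.norm(M[1:] - x_i, axis=1)   # Step 2: Euclidean distances to the other samples
print(np.round(d, 4))                     # [0.2236 0.2236 1.8028 1.4142] -> M_2, M_3 are the k=2 neighbors

x_neighbor = M[2]                         # Step 3: randomly pick one neighbor (M_3, fixed to match the example)
lam = 0.4                                 # Step 4: lambda = 0.4
x_new = x_i + lam * (x_neighbor - x_i)
print(x_new)                              # [1.96 4.92] -> added to the minority class as M_6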
Various SMOTE Algorithms
Depending on task types and input data types, SMOTE algorithms are classified into three categories:
Classification tasks with numerical input data,
Classification tasks with categorical or mixed input data, and
Regression tasks.
I’ll explore them one by one in this section.
◼ 1. Classification Task with Numerical Input Data
SMOTE, KMeansSMOTE, the Borderline SMOTE variations (including SVM SMOTE), and ADASYN are applicable only to classification tasks with continuous input values.
Applicable Task Types: Classification
Applicable Input Data Types: Continuous only
SMOTE:
Generates synthetic samples for the minority class by interpolating between existing minority samples and their k-nearest neighbors.
- Best When: Minority class samples are not extremely rare, and in dense feature space.
KMeans SMOTE:
This method combines K-Means clustering with SMOTE.
It first clusters the minority class samples using K-Means and then applies SMOTE within each cluster, focusing on generating synthetic samples in less dense regions of these clusters.
- Best When: The minority class has multiple clusters or sub-distributions.
Borderline SMOTE:
A variation of SMOTE that only oversamples minority samples close to the decision boundary.
- Best When: Minority class samples sit close to the decision boundary, surrounded by a mix of majority and minority class neighbors.
SVM SMOTE:
SVM SMOTE is a variation of Borderline SMOTE that uses an SVM (Support Vector Machine) classifier to identify support vectors as borderline minority samples, and then applies SMOTE only to these samples.
- Best When: Similar to Borderline SMOTE, but best when the decision boundary is more complex.
ADASYN (Adaptive Synthetic Sampling):
ADASYN adaptively generates more synthetic data for minority class samples that are harder to learn because they lie close to majority class samples.
It first identifies these “difficult“ samples by computing, for each minority sample, a ratio r_i: the share of majority class samples among its k nearest neighbors.
A higher r_i means the sample is more “difficult“ because it sits closer to the decision boundary, surrounded by more majority class samples.
Hence, for samples with high r_i, it generates more synthetic minority samples (which is what makes it adaptive), using the same interpolation approach as SMOTE (a minimal sketch follows the list below).
This method is best when:
The decision boundary is extremely complex.
Certain minority class instances are difficult to classify due to their proximity to majority class clusters.
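A hedged sketch of the adaptive weighting idea (not imblearn's exact implementation): for each minority sample, the share of majority-class points among its k nearest neighbors determines how many synthetic samples are generated around it.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, minority_label=1, k=5):
    """Return the normalized difficulty ratio r_i for each minority sample."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbor
    _, idx = nn.kneighbors(X_min)
    # r_i = share of majority-class points among the k nearest neighbors
    r = np.array([(y[i[1:]] != minority_label).mean() for i in idx])
    return r / r.sum() if r.sum() > 0 else r

# each minority sample x_i then receives roughly g_i = round(weight_i * G) synthetic samples,
# where G is the total number of samples to generate, interpolated as in plain SMOTE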
◼ Simulation
Let us see how they work on synthetic continuous data.
To compare the variations, I used the same dataset and the same hyperparameters for Logistic Regression.
LightGBM, trained without data augmentation, was also included as a benchmark.
▫ The Data
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# create a dataset
np.random.seed(42)
n_samples, balance, test_size = 5000, [0.95, 0.05], 1000

# class sizes
n_class_0 = int(n_samples * balance[0])
n_class_1 = int(n_samples * balance[1])

# class 0 (majority)
X_0 = np.random.randn(n_class_0, 2) * 0.8 + np.array([-1, -1])
y_0 = np.zeros(n_class_0, dtype=int)

# class 1 (minority) - slightly shifted and with less variance
X_1 = np.random.randn(n_class_1, 2) * 0.3 + np.array([1, 1])
y_1 = np.ones(n_class_1, dtype=int)

# merge the two classes
X = np.vstack((X_0, X_1))
y = np.hstack((y_0, y_1))

# create and scale train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
▫ The Augmentation & Model Training
SMOTE algorithms are applied only to the training dataset after preprocessing.
import lightgbm as lgb
from collections import Counter
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, KMeansSMOTE, SVMSMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# define a function to build, train, and evaluate the model
def evaluate_model(X_train, y_train, X_test, y_test, model=None, sampler_name="No Oversampling"):
    # build a model (Logistic Regression by default; LightGBM when `model` is set)
    model = LogisticRegression(
        penalty='l2',
        solver='lbfgs',
        random_state=42,
        class_weight='balanced',
        tol=0.0001,
        C=100,
        max_iter=5000,
    ) if not model else lgb.LGBMClassifier(
        class_weight='balanced',
        n_estimators=1000,
        random_state=42,
        n_jobs=-1,
        verbosity=-1,
        learning_rate=0.01,
        num_leaves=200,
        max_depth=10,
        reg_alpha=0.02,
        reg_lambda=0.1,
        subsample=0.4
    )

    # train
    model.fit(X_train, y_train)

    # make predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # log adjusted class balance and f1 scores
    print(f"--- {sampler_name} ---")
    print(f"Train data distribution: class 0: {Counter(y_train)[0]}, class 1: {Counter(y_train)[1]}")
    print(f"F1 (minority class): Train {f1_score(y_train, y_pred_train, pos_label=1):.4f} -> Generalization: {f1_score(y_test, y_pred_test, pos_label=1):.4f}")


# base - logistic regression (lr) without smote
evaluate_model(X_train, y_train, X_test, y_test, sampler_name="No SMOTE (LR)")

# lr with smote
k, m = 3, 10
smote_sampler = SMOTE(k_neighbors=k, random_state=42)
X_res_smote, y_res_smote = smote_sampler.fit_resample(X_train, y_train)
evaluate_model(X_res_smote, y_res_smote, X_test, y_test, sampler_name="SMOTE")

# lr with kmeans smote
kmeans_smote_sampler = KMeansSMOTE(
    k_neighbors=k,
    random_state=42,
    n_jobs=-1,
    cluster_balance_threshold=0.05,  # lowers the threshold to avoid a "no cluster found" error
)
X_res_kmsmote, y_res_kmsmote = kmeans_smote_sampler.fit_resample(X_train, y_train)
evaluate_model(X_res_kmsmote, y_res_kmsmote, X_test, y_test, sampler_name="KMeansSMOTE")

# lr with borderline smote
borderline_smote_sampler = BorderlineSMOTE(k_neighbors=k, m_neighbors=m, random_state=42)
X_res_bsmote, y_res_bsmote = borderline_smote_sampler.fit_resample(X_train, y_train)
evaluate_model(X_res_bsmote, y_res_bsmote, X_test, y_test, sampler_name="BorderlineSMOTE")

# lr with svm smote
svmsmote_sampler = SVMSMOTE(k_neighbors=k, m_neighbors=m, random_state=42)
X_res_svmsmote, y_res_svmsmote = svmsmote_sampler.fit_resample(X_train, y_train)
evaluate_model(X_res_svmsmote, y_res_svmsmote, X_test, y_test, sampler_name="SVMSMOTE")

# lr with adasyn
adasyn_sampler = ADASYN(n_neighbors=k, random_state=42)
X_res_adasyn, y_res_adasyn = adasyn_sampler.fit_resample(X_train, y_train)
evaluate_model(X_res_adasyn, y_res_adasyn, X_test, y_test, sampler_name="ADASYN")

# benchmark - lightgbm without smote (any truthy value for `model` selects LightGBM)
evaluate_model(X_train, y_train, X_test, y_test, model=True, sampler_name="No Oversampling (LightGBM)")
▫ Results
Class balance after data augmentation:
The minority class was expanded from 200 to 3,800 samples, matching the size of the majority class.
Original: class 0: 3800, class 1: 200 (Applied for both Logistic Regression and LightGBM)
SMOTE: class 0: 3800, class 1: 3800
KMeansSMOTE: class 0: 3800, class 1: 3800
BorderlineSMOTE: class 0: 3800, class 1: 3800
SVM SMOTE: class 0: 3800, class 1: 3800
ADASYN: class 0: 3800, class 1: 3796
F1 score for the minority class:

Figure G. Comparison of F1 scores (blue: train, red: generalization) among SMOTE algorithms, Logistic Regression, and LightGBM (Created by Kuriko IWAI)
SMOTE and KMeansSMOTE achieved the best generalization F1 scores for the minority class, surpassing the benchmark of LightGBM.
While the overall training scores remain very high, Borderline SMOTE, SVM SMOTE, and ADASYN did not show the same level of generalization capability, underperforming LightGBM.
This indicates that for this dataset, sample imbalance should be addressed across the feature space of the minority class, not only in regions close to the decision boundary.
Decision boundary:
SMOTE and KMeansSMOTE augmented samples across the minority sample space using k-nearest-neighbor interpolation (black-bordered dots), while the borderline-based SMOTE variants in the second row added samples near the decision boundary.

Figure H. Comparison of decision boundaries among SMOTE algorithms and Logistic Regression (Created by Kuriko IWAI)
◼ 2. Classification Tasks with Mixed / Categorical Input Data
When the input data contains categorical features, SMOTE N and SMOTE NC can handle them well.
SMOTE N (SMOTE for Nominal features):
Specifically designed for datasets composed entirely of nominal (categorical) features.
It generates new samples by selecting the most frequent category among the neighbors, measuring distances between categorical values based on their relationship to the target class.
Applicable Task Types: Classification
Applicable Input Data Types: Discrete/Categorical only
Best When: The original dataset solely consists of categorical features.
SMOTE NC (SMOTE for Nominal and Continuous features):
A variation of SMOTE designed to handle datasets with both continuous and nominal (categorical) features.
It uses linear interpolation for continuous features and mode-based or frequency-based assignment for categorical features.
Applicable Task Types: Classification
Applicable Input Data Types: Mixed
Best When: Need to balance mixed data types when augmenting the data.
◼ Simulation
I compared the performance using the same approach as in the previous case.
▫ The Data
I created synthetic datasets with categorical-only and mixed features, encoded the categorical features, and scaled the numerical features (the categorical-only dataset is shown below).
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTEN

# create a synthetic dataset only with categorical features
np.random.seed(42)
sample_size, minority, test_size = 5000, 2000, 2000
data_cat = {
    'cat_1': np.random.choice(['X', 'Y', 'Z'], sample_size),
    'cat_2': np.random.choice(['P', 'Q'], sample_size),
    'cat_3': np.random.choice(['M', 'N', 'O', 'P', 'R'], sample_size)
}
df_cat = pd.DataFrame(data_cat)

# distribute target variables
target_values = np.array([1] * minority + [0] * (sample_size - minority))
np.random.shuffle(target_values)
df_cat['target'] = target_values

# create train / test datasets
X = df_cat[['cat_1', 'cat_2', 'cat_3']]
y = df_cat['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42, stratify=y)

# apply the encoder
encoder = OneHotEncoder(handle_unknown='ignore')
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
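The mixed-feature dataset used for SMOTE NC below isn't shown in full; the following is a hedged sketch of what it might look like, continuing from the variables defined above (the feature names num_1, num_2, and cat_1, and the generated values, are illustrative assumptions that match the preprocessing code that follows):

# hypothetical mixed dataset assumed for the SMOTE NC run below
data_mixed = {
    'num_1': np.random.randn(sample_size) * 10 + 50,
    'num_2': np.random.randn(sample_size) * 5,
    'cat_1': np.random.choice(['X', 'Y', 'Z'], sample_size),  # categorical column at index 2
}
df_mixed = pd.DataFrame(data_mixed)
df_mixed['target'] = target_values

X_mixed = df_mixed[['num_1', 'num_2', 'cat_1']]
y_mixed = df_mixed['target']
X_train_mix, X_test_mix, y_train_mix, y_test_mix = train_test_split(
    X_mixed, y_mixed, test_size=test_size, random_state=42, stratify=y_mixed
)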
▫ The Augmentation & Model Training
Using the same evaluate_model function, I built, trained, and evaluated each method:
from imblearn.over_sampling import SMOTENC
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# logistic regression (no data augmentation)
evaluate_model(X_train, y_train, X_test, y_test, sampler_name="Logistic Regression (Original)")

# apply SMOTE N to the training data with categorical features only
smoten_sampler = SMOTEN(random_state=42, k_neighbors=3)
X_train_smoten, y_train_smoten = smoten_sampler.fit_resample(X_train, y_train)
evaluate_model(X_train_smoten, y_train_smoten, X_test, y_test, sampler_name="SMOTEN")


# apply SMOTE NC to the mixed features
# (uses the mixed training split sketched above: num_1, num_2, and the categorical cat_1 at column index 2)
smotenc_sampler = SMOTENC(
    categorical_features=[2],
    categorical_encoder=OneHotEncoder(handle_unknown='ignore'),  # encode categorical features
    k_neighbors=3,
    random_state=42,
)
X_train_smote_nc, y_train_smote_nc = smotenc_sampler.fit_resample(X_train_mix, y_train_mix)

# scale numerical features
num_trans = Pipeline(steps=[('scaler', StandardScaler())])
preprocessor = ColumnTransformer(transformers=[('num', num_trans, ['num_1', 'num_2'])])
X_train_smote_nc = preprocessor.fit_transform(X_train_smote_nc)
X_test_mix_s = preprocessor.transform(X_test_mix)

# model training, evaluation
evaluate_model(X_train_smote_nc, y_train_smote_nc, X_test_mix_s, y_test_mix, sampler_name="SMOTE NC")


# light gbm (no data augmentation)
evaluate_model(X_train, y_train, X_test, y_test, model=True, sampler_name="LightGBM")
▫ Results
Class balance after data augmentation:
Original Data: class 0: 1800, class 1: 1200 (Applied to Logistic Regression and LightGBM)
SMOTE N: class 0: 1800, class 1: 1800
SMOTE NC: class 0: 2400, class 1: 2400 (adding numerical feature to the SMOTE N data)
F1 score for the minority class:
SMOTE N (categorical only) achieved the best generalization score of 0.472, outperforming LightGBM. (It does show a tendency to overfit, though; tighter regularization would help.)

Figure I. Comparison of F1 scores (blue: train, red: generalization) for a classification task with categorical data (Created by Kuriko IWAI)
SMOTE NC (for mixed data) also outperformed the benchmark of LightGBM.

Figure J. Comparison of F1 scores (blue: train, red: generalization) for a classification task with mixed data (Created by Kuriko IWAI)
Important Consideration:
When dealing with mixed or categorical-only data with imbalanced classes, we observed lower F1 scores for the minority class, with or without data augmentation.
This is due to the high dimensionality introduced by encoding categorical features, which can exacerbate the "curse of dimensionality" and dilute the effectiveness of the augmentation.
Specifically, in high-dimensional spaces, the synthetic samples generated by SMOTE algorithms might not effectively represent the true underlying distribution of the minority class, or they might introduce noise, leading to suboptimal model performance.
We need to consider applying tactics like:
Feature engineering before encoding,
Encoding strategies like binary encoding to optimize the number of dimensions increased, and
Dimensionality reduction, for example through PCA (see the sketch below).
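For instance, a hedged sketch of the last tactic: reducing the one-hot encoded training features above to a denser space with PCA before oversampling (the component count is illustrative):

from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE

# reduce the 10 one-hot columns to 5 principal components (illustrative choice)
pca = PCA(n_components=5, random_state=42)
X_train_pca = pca.fit_transform(X_train.toarray())  # .toarray() because the encoder returned a sparse matrix
X_test_pca = pca.transform(X_test.toarray())

# then oversample the minority class in the reduced space
smote = SMOTE(k_neighbors=3, random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train_pca, y_train)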
◼ 3. Regression Tasks
Regardless of input data types, SMOTE R from the smogn library can handle value imbalance in regression tasks.
SMOTE R (SMOTE for Regression)
SMOTE R adapts the SMOTE concept for regression problems where the target variable has an imbalanced distribution (e.g., a few extreme values).
It generates synthetic samples by interpolating both features and the target variable, aiming to balance the distribution of the target.
- Best When: Dealing with regression problems where there's an imbalance in the distribution of target variable values (e.g., very few instances with extremely high or low values).
Let us see how it works.
◼ Simulation
I compared the performance using the same approach as in the previous cases.
▫ The Data
I created a synthetic dataset for a regression task and intentionally removed 80% of the samples with target values between 50 and 100 to create an extremely sparse, irregular region:
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

np.random.seed(42)
sample_size, n_features, test_size = 5000, 5, 2000
X, y = make_regression(n_samples=sample_size, n_features=n_features, noise=30, random_state=42)

# introduce imbalance by making a range of y values less frequent (i.e., values between 50 and 100 will be sparse)
y_imbalanced = np.copy(y)
mask_sparse = (y_imbalanced > 50) & (y_imbalanced < 100)

# randomly remove a large portion of samples in this range
sparse_indices = np.where(mask_sparse)[0]
np.random.shuffle(sparse_indices)
remove_count = int(len(sparse_indices) * 0.8)
y_imbalanced = np.delete(y_imbalanced, sparse_indices[:remove_count])
X_imbalanced = np.delete(X, sparse_indices[:remove_count], axis=0)

# make train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X_imbalanced, y_imbalanced, test_size=test_size, random_state=42)

# preprocess (only numerical features here; the categorical pipeline is unused for this dataset)
num_trans = Pipeline(steps=[('scaler', StandardScaler())])
cat_trans = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(transformers=[('num', num_trans, list(range(X_train.shape[1])))])

X_train_en = preprocessor.fit_transform(X_train)
X_test_en = preprocessor.transform(X_test)
▫ Applying SMOTE R
SMOTE R needs both the input features and the target values to assess the imbalance.
So, I first created a DataFrame containing both X_train and y_train, applied SMOTE R to it, and then preprocessed the augmented training dataset X_train_smote:
import smogn
import pandas as pd

# make a DataFrame of the training data (X_train and y_train)
df_train = pd.DataFrame(X_train, columns=[f'feature_{i}' for i in range(X_train.shape[1])])
df_train['target'] = y_train

# apply SMOTE R (this returns a DataFrame)
smote_train = smogn.smoter(data=df_train, y='target')

# split the DataFrame back into X_train and y_train
y_train_smote = smote_train['target']
X_train_smote = smote_train.drop('target', axis=1)

# preprocess
X_train_smote = preprocessor.fit_transform(X_train_smote)
▫ Evaluation
To compare performance, I used a regression counterpart of the evaluate_model helper that trains a simple Linear Regression model and reports MSE.
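A minimal sketch of what such a helper might look like, assuming the same call signature as the calls below (hypothetical reconstruction; the exact implementation isn't shown):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def evaluate_model(X_train, y_train, X_test, y_test, label="LR (Original)"):
    """Hypothetical regression helper: train a plain Linear Regression model and report MSE."""
    model = LinearRegression().fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_test = mean_squared_error(y_test, model.predict(X_test))
    print(f"{label}: Train MSE {mse_train:,.2f} -> Generalization MSE {mse_test:,.2f}")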
evaluate_model(X_train_en, y_train, X_test_en, y_test)
evaluate_model(X_train_smote, y_train_smote, X_test_en, y_test, 'LR with SMOTE R')
▫ Results
Class balance after data augmentation:
SMOTE R shifted the distribution of the target values (right panel in the figure) by generating synthetic samples focused on the unevenly distributed region.

Figure K. Comparison of data distribution (left: normal distribution, middle: imbalanced distribution for experimentation, right: applied SMOTE R) for a regression task (Created by Kuriko IWAI)
MSE Scores:
Trained on the original, imbalanced data: 3,490.29 → Generalization: 3,609.83
Trained on SMOTE R data: 2,127.16 → Generalization: 2,190.76
The model trained on the augmented data achieved lower MSE on both the training and test sets.
By balancing the target value distribution, SMOTE R enables the linear regression model to learn more robustly and make much more accurate predictions on new, unseen data.
Wrapping Up
Data augmentation is extremely useful for building powerful models especially when we only have limited, but high-quality data.
In our experiments, we observed improved performance in both classification and regression tasks on imbalanced datasets when applying SMOTE algorithms.
Yet, we also learned the importance of feature engineering to extract relevant information and of dimensionality control to enable the model to learn efficiently from dense feature spaces, especially when handling categorical data.
Ultimately, data augmentation is key for building robust machine learning models.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Advanced Cross-Validation for Sequential Data: A Guide to Avoiding Data Leakage
A Guide to Synthetic Data Generation: Statistical and Probabilistic Approaches
Maximum A Posteriori (MAP) Estimation: Balancing Data and Expert Knowledge
Beyond Simple Imputation: Understanding MICE for Robust Data Science
Maximizing Predictive Power: Best Practices in Feature Engineering for Tabular Data
Related Books for Further Understanding
These books cover a wide range of theory and practice, from the fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
Share What You Learned
Kuriko IWAI, "Data Augmentation Techniques for Tabular Data: From Noise Injection to SMOTE" in Kernel Labs
https://kuriko-iwai.com/data-augmentation-techniques-in-machine-learning
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.




