Maximizing Predictive Power: Best Practices in Feature Engineering for Tabular Data

A step-by-step guide to minimize generalization errors on large-scale tabular data

Machine Learning · Data Science · Python

By Kuriko IWAI


Table of Contents

Introduction
What is Feature Engineering?
Benefits of Feature Engineering
Common Techniques
The Workflow
Simulation
Basic Setups
Data Extraction & Preprocessing
Data Cleaning
Identifying Empty Samples
Handling Data Types
Exploratory Data Analysis (EDA)
1) Basic Analysis (Univariate Focus)
2) Project Specific EDAs
Hypothesis-Driven Feature Engineering
Hypothesis 1
Hypothesis 2
Hypothesis 3
Model Selection
Training Models on Preprocessed Data
Elastic Net
Random Forest
DFN
Results
Iteration of Feature Engineering and Model Training
Experiment Summary
Wrapping Up

Introduction

Feature engineering is a critical process to run before model training, as the quality of the input data directly determines the quality of the model’s output.

While deep learning models excel at automatically learning features from unstructured data like images or text, explicit feature engineering remains crucial for tabular datasets.

In this article, I’ll demonstrate the impact of feature engineering on regression tasks, specifically focusing on a large tabular dataset with mixed numerical, categorical, and time-based features.

What is Feature Engineering?

Feature engineering is the process of selecting, transforming, and creating new features from raw data to improve the performance of machine learning models.

It involves using domain knowledge to extract the most relevant information from data and represent it in a way that is suitable for the chosen machine learning algorithm.

Benefits of Feature Engineering

Well-engineered features can significantly boost the predictive power of the model to the point where even a simple one can grasp complex relationships, because it can:

  • Reduce data sparsity: Most real-world datasets are sparse, with many zeros and missing values. Feature engineering consolidates information and creates denser representations, making it easier for models to learn.

  • Handle diverse data types: Raw data comes in various formats like numerical, categorical, textual, and temporal. Feature engineering converts these types into a numerical format that the model can process.

  • Address data noise and outliers: Feature engineering mitigates the impact of noisy data and outliers, leading to more robust models.

When engineered features directly correspond to meaningful concepts in the problem domain, they can make the model's decisions more interpretable, accurate, and robust.

Common Techniques

Some of the common feature engineering techniques include:

Log Transformation

  • Applying a logarithm to numerical features.

  • Can make the distribution more symmetrical, which helps models that assume a normal distribution.

  • Best when: Handling skewed numerical features.
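As a minimal sketch (the prices are illustrative, not from the article's dataset), np.log1p applies log(1 + x), which handles zeros safely while compressing a long right tail:

```python
import numpy as np
import pandas as pd

# skewed positive values (e.g., prices); log1p = log(1 + x) handles zeros safely
prices = pd.Series([0.0, 1.0, 10.0, 100.0, 10_000.0])
log_prices = np.log1p(prices)

# the transform compresses the long right tail into a much narrower range
print(log_prices.round(2).tolist())
```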

Polynomial Features Creation

  • Generates new features by raising existing features to a power (e.g., x², x³).

  • Creates interaction terms (e.g., x·y).

  • Best when: Capturing non-linear relationships.

Binning (Discretization)

  • Groups continuous numerical values into bins.

  • Can reduce the impact of small fluctuations and make relationships more linear.

  • Best when: Handling non-linear relationships with linear models, or when the data contains significant outliers or is skewed.
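A minimal sketch of equal-frequency binning with pandas' qcut, the same approach used later for unit_price_bin (the prices here are illustrative):

```python
import pandas as pd

unit_price = pd.Series([0.5, 1.2, 3.0, 8.0, 45.0, 120.0])

# equal-frequency bins (qcut) are often safer than equal-width bins (cut)
# when the distribution is skewed with outliers
price_bin = pd.qcut(unit_price, q=3, labels=['low', 'mid', 'high'])
print(price_bin.tolist())
```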

Time-Based Features

  • Extracts day of the week, month, year, hour, quarter, or even more complex features like "is_weekend", "is_holiday".

  • Calculates time differences between events.

  • Best when: Seasonality impacts the prediction.
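A small sketch of extracting such features with pandas (the two timestamps are illustrative):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2011-11-25 10:00', '2011-11-27 15:30']))

df_t = pd.DataFrame({'invoicedate': dates})
df_t['day_of_week'] = df_t['invoicedate'].dt.dayofweek        # Monday = 0
df_t['is_weekend'] = (df_t['day_of_week'] >= 5).astype(int)   # Sat/Sun flag
df_t['hour'] = df_t['invoicedate'].dt.hour
df_t['quarter'] = df_t['invoicedate'].dt.quarter

print(df_t[['day_of_week', 'is_weekend', 'hour', 'quarter']])
```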

The Workflow

While there's no one-size-fits-all approach, the general workflow for feature engineering often spans the entire lifecycle of the project, starting from defining the problem and success metrics.

I'll demonstrate Phase 1 and Phase 2 (the blue boxes in the figure) particularly focusing on the feature engineering.

Figure A. Machine learning project workflow with feature engineering (Created by Kuriko IWAI)

Kernel Labs | Kuriko IWAI | kuriko-iwai.com


For Phase 3, Achieving Accuracy with Machine Learning covers the core concept of generalization and hypothesizing the root causes of loss.

Simulation

For demonstration, I’ll follow the workflow in the previous section step by step.

Basic Setups

The Problem

For an online retail business, understanding future customer spend is critical for marketing, inventory management, and strategic planning.

Let us imagine a scenario where the business is struggling with sales growth and trying to figure out its next actionable insights.

The Evaluation Metrics

  • Mean Absolute Error (MAE) as the primary metric, due to its robustness to skewed data (later, during the EDA, I'll discuss the distribution of the target variable: sales).
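A toy example of why MAE is more robust than RMSE when the target is skewed (the numbers are illustrative):

```python
import numpy as np

# identical predictions; one target contains an extreme outlier
y_true = np.array([10.0, 12.0, 11.0, 500.0])   # 500 is an outlier
y_pred = np.array([10.0, 12.0, 11.0, 11.0])

mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# a single outlier inflates RMSE far more than MAE
print(f'MAE: {mae:.2f}, RMSE: {rmse:.2f}')
```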

Data Extraction & Preprocessing

This is the foundational data preparation work of the initial phase.

For demonstration, I’ll use the online retail data from the UC Irvine Machine Learning Repository (Licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license). In reality, this data can range from a simple Excel sheet to information stored on a cloud server, or we can combine multiple data sources.


Figure B. Sample dataset (Source / Creative Commons Attribution 4.0 International (CC BY 4.0) license)

import pandas as pd

# csv_file_path points to the downloaded dataset file
df = pd.read_csv(csv_file_path)
df.info()


The dataset has 541,909 data points with eight features.

Data Cleaning

Before jumping into the EDA, I’ll clean up the data following these general principles:

  • Preserve Numeric Types: If a column should be numeric (like quantity, unitprice, customerid), keep it that way. Numeric operations are much more efficient and meaningful with numeric types.

  • Object Type for Mixed/String Data: The object dtype in pandas is a catch-all for columns that contain mixed types or are primarily strings. If a column truly contains a mix of numbers and strings, object is often the default and sometimes necessary.

  • Handling Missing Data by Identifying NaNs Properly: Pandas uses NaN for missing numerical data and None for missing object data. However, when we read data, missing values might be represented as empty strings, specific text like "N/A", or even just spaces, which pandas might not automatically interpret as NaN. We need to identify these values.
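As a side note, pandas can also map such sentinel strings to NaN at read time via the na_values parameter of read_csv; a minimal sketch with inline CSV text (the values are illustrative):

```python
import io
import pandas as pd

# sentinel strings like 'N/A' or 'na', and empty cells, can be mapped to NaN at read time
csv_text = 'customerid,country\n17850,United Kingdom\nN/A,\nna,France\n'

df_demo = pd.read_csv(io.StringIO(csv_text), na_values=['N/A', 'na', ''])
print(df_demo.isna().sum().tolist())  # missing counts per column
```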

Identifying Empty Samples

For columns with mixed data types, I converted potential missing values like spaces or 'nan' strings into NumPy's NaN. This is a crucial step for accurately flagging missing values during imputation.

Learn More: Data Preprocessing for Effective Machine Learning Models

import numpy as np

obj_cols = ['invoiceno', 'stockcode', 'country', 'invoicedate']
null_vals = ['', 'nan', 'N/A', None, 'na', 'None', 'none']
replaced_null = {item: np.nan for item in null_vals}

# assign the result back; in-place replace on a column is unreliable
for col in obj_cols:
    df[col] = df[col].replace(replaced_null)

df.info()


Now, identify missing data in each column:

for i, col in enumerate(df.columns):
    unique_num = df[col].nunique()
    nan_num = df[col].isna().sum()
    print(f'{i}. {col} - {unique_num:,} data points (missing data: {nan_num:,})')
  1. invoiceno - 25,900 data points (missing data: 0)

  2. stockcode - 4,070 data points (missing data: 0)

  3. description - 4,223 data points (missing data: 1,454)

  4. quantity - 722 data points (missing data: 0)

  5. invoicedate - 23,260 data points (missing data: 0)

  6. unitprice - 1,630 data points (missing data: 0)

  7. customerid - 4,372 data points (missing data: 135,080)

  8. country - 38 data points (missing data: 0)

Strategies for missing data in the description and customerid columns:

  • description: I decided to drop the column, assuming it won't impact predictions. (Note: the text data could be valuable for product factoring, especially if the missing portion were more limited.)

  • customerid: I've introduced an is_registered binary column (1: registered, 0: non-registered) for each customer ID, assuming customers without IDs are not registered. I also retained the original customerid column to evaluate the impact of unique users.
    Note: This will result in high cardinality. Later, I’ll address this with binary encoding.

# copy base dataset before engineering
df_rev = df.copy()

df_rev = df_rev.drop(columns='description')
df_rev['is_registered'] = np.where(df_rev['customerid'].isna(), 0, 1)

Handling Data Types

Lastly, considering potential feature engineering, I converted the data types of invoicedate and customerid:

import pandas as pd

df_rev['invoicedate'] = pd.to_datetime(df_rev['invoicedate'])
df_rev['customerid'] = df_rev['customerid'].astype('Int64')

df_rev.info()


Exploratory Data Analysis (EDA)

With the clean dataset, we can now move onto the EDA.

Exploratory Data Analysis (EDA) is a data analysis technique focusing on summarizing and visualizing data to understand its main characteristics.

Technically, we can perform countless EDAs, especially on a complex dataset. But our primary focus here is to reveal which features to engineer and to gain preprocessing insights that improve model performance.

Any other analyses should be left to the models themselves, because the true underlying patterns are often too subtle or too complex for us to discover manually.

Hence, EDA is the first step in pinpointing which analyses (e.g., hidden trends, group differences) are worth translating into the prediction.

EDA should involve 1) basic analysis (univariate focus) for data understanding and 2) project-specific analysis (bivariate focus) based on hypotheses directly relevant to the project goals.

1) Basic Analysis (Univariate Focus)

This initial phase is to know the dataset at a fundamental level by analyzing each variable in isolation.

First, to prepare for EDA, I extracted year, month, day_of_week from the invoicedate column and sorted the data by invoicedate:

df_rev['invoicedate'] = pd.to_datetime(df_rev['invoicedate'])
df_rev['year'] = df_rev['invoicedate'].dt.year
df_rev['year_month'] = df_rev['invoicedate'].dt.to_period('M')
df_rev['month_name'] = df_rev['invoicedate'].dt.strftime('%b')
df_rev['day_of_week'] = df_rev['invoicedate'].dt.strftime('%a')
df_rev = df_rev.sort_values('invoicedate')

I also introduced a sales column for the sales analysis:

df_rev['sales'] = df_rev['quantity'] * df_rev['unitprice']

The dataset looks like:


Understanding Data Distribution

I plotted PDFs for numerical features and histograms for categorical features to identify characteristics like outliers, skew, and heavy tails.

Even though the true data distribution is too complex to fully grasp, analyzing them is crucial for effective preprocessing and model selection.

PDF of Numerical Columns

Both unitprice and sales are sparse and heavily tailed with significant outliers. I’ll use MAE as the evaluation metric due to its robustness to skewed data.

Figure C-1. PDFs of unitprice and sales (Created by Kuriko IWAI)


  • unitprice: Max: 38,970.0, Min: -11,062.1, Mean: 4.6, Std: 96.8

  • sales: Max: 168,469.6, Min: -168,469.6, Mean: 18.0, Std: 378.8

Histogram of Categorical Features

invoiceno, year_month, month_name and day_of_week are evenly distributed:


stockcode has a peak at the left with long tail to the right:


quantity and country show a degenerate distribution where data is concentrated in a few categories:


is_registered and year turned out to be binary:


Building on this, we'll conduct project-specific EDAs to pinpoint additional feature engineering opportunities.

2) Project Specific EDAs

This phase dives deeper, specifically looking for relationships between variables, particularly between potential features and the target variable: sales.

Here are hypotheses I considered:

Hypothesis 1. “Sales trend is driven by day of the week or day of the month.”

Given the dataset's limited 13 months of sales data (with a binary year feature), I focus on the shorter trend cycles.

Potential features to engineer: is_weekend, day_of_month

Potential business solutions in mind: Flash promotions aligned with the trend.

Hypothesis 2. “Product sales are driven by time and price point.”

Potential features to engineer:

  • unit_price_bin: Discretized unitprice into categories of 'low', 'medium', 'high', directly addressing price influence and non-linearity.

  • product_avg_quantity_last_month: Computes the average quantity sold in the previous calendar month for each stockcode and captures recent product popularity.

  • product_sales_growth_last_month: Percentage change in sales for a stockcode from 2 months ago to the previous month to identify trending products.

Potential business solutions in mind:

  • Dynamic pricing (Models to predict the optimal price point by promotion timing).

  • Tailored product recommendation (Models to predict similarity of product factors).

Hypothesis 3. “Engaged customers tend to purchase more and contribute more to sales.”

Potential features to engineer:

  • customer_recency_days: Number of days between the prediction date (end of previous month) and the customer's last purchase date to assess likelihood to buy soon.

  • customer_total_spend_ltm: Total sales revenue generated by the customer in the last 3 months. This is a direct measure of recent monetary value of the customer.

  • customer_freq_ltm: Total number of unique invoices a customer has made in the last 3 months. This is an engagement measure that directly impacts sales.

Potential business solutions:

  • Tiered customer loyalty programs (models to predict retention timing of unique user)

  • Marketing media mix optimization (models to predict customer values of a new customer)

Let us see the EDA results and which features to engineer.

Hypothesis-Driven Feature Engineering

Hypothesis 1. “Sales trend is driven by day of the week or day of the month.“

Sales trends by month and day of the week showed no significant patterns beyond a peak in November, so I opted not to add features from this hypothesis.

Figure D-1. Sales trend by month of the year and day of the week (Created by Kuriko IWAI)


Hypothesis 2. “Product sales are driven by time and price point.”

For unit_price_bin, the median lines of all three price bins sit near zero for nearly every month, and the interquartile ranges (IQR) of all bins are very short, indicating that the 25th–75th percentile data falls within a very small range of low quantities.

However, outliers dominate the high quantities, creating clear stratification.

So I added this feature, while keeping the original unitprice column to retain granularity within each bin for predicting quantity.

Figure D-2. Monthly total units sold with the median and the interquartile range by price range (low, mid, high) (Created by Kuriko IWAI)


Adding unit_price_bin to the final dataset:

import pandas as pd

df_fin = df_rev.copy()

_df_prod_month_agg = df_fin.copy().groupby(['stockcode', 'year_month']).agg(
    prod_total_monthly_quantity=('quantity', 'sum'),
    prod_ave_monthly_price=('unitprice', 'mean')
).reset_index().sort_values(by=['stockcode', 'year_month'])

_df_prod_month_agg['unit_price_bin'] = pd.qcut(
    _df_prod_month_agg['prod_ave_monthly_price'],
    q=3,
    labels=['low', 'mid', 'high'],
    duplicates='drop'
)

_df_prod_bin_per_stockcode = _df_prod_month_agg.groupby('stockcode')['unit_price_bin'].agg(
    lambda x: x.mode()[0] if not x.mode().empty else None
).reset_index()

df_fin = pd.merge(
    df_fin,
    _df_prod_bin_per_stockcode[['stockcode', 'unit_price_bin']],
    on='stockcode',
    how='left'
)

df_fin.info()


product_avg_quantity_last_month also shows a very strong positive correlation, working as a momentum feature: products that sold well last month tend to sell well this month. I’ll add this feature.

Figure D-3. Average products sold this month and last month (Created by Kuriko IWAI)


Adding product_avg_quantity_last_month to the final dataset (also handled imputation):

import pandas as pd

_df_prod_month_agg['product_avg_quantity_last_month'] = _df_prod_month_agg.groupby('stockcode')['prod_total_monthly_quantity'].shift(1)
_df_prod_last_month_agg = _df_prod_month_agg.groupby('stockcode')['product_avg_quantity_last_month'].mean().reset_index()

df_fin = pd.merge(
    df_fin,
    _df_prod_last_month_agg[['stockcode', 'product_avg_quantity_last_month']],
    on='stockcode',
    how='left'
)

# missing data means no products sold during the term; impute with zero
df_fin['product_avg_quantity_last_month'] = df_fin['product_avg_quantity_last_month'].fillna(value=0)

df_fin.info()


On the other hand, product_sales_growth_last_month didn’t show a strong linear/monotonic relationship. Considering the limited predictive power of this feature, I opted not to add it.

Figure D-4. Monthly product quantity vs last month sales growth rate (Created by Kuriko IWAI)


Hypothesis 3. “Registered customers purchase more and contribute to the sales.”

customer_recency_days shows that customers with lower recency (more recent purchases, e.g., x < 60 days) tend to exhibit higher monthly sales revenue, indicating an inverse relationship (red dotted line in the figure). I’ll add this feature for predicting monthly sales revenue.

Figure D-4. Monthly sales vs customer recency days (Created by Kuriko IWAI)


Adding customer_recency_days to the dataset:

import pandas as pd

# create customer base df
_df_all_customers_year_month = pd.MultiIndex.from_product(
    [df_fin['customerid'].unique(), df_fin['year_month'].unique()],
    names=['customerid', 'year_month']
).to_frame(index=False).sort_values(by=['customerid', 'year_month']).reset_index(drop=True)

_df_customer_monthly_agg = df_fin.copy().groupby(['customerid', 'year_month']).agg(
    monthly_sales=('sales', 'sum'),
    monthly_unique_invoices=('invoiceno', 'nunique'),
    monthly_last_purchase_date=('invoicedate', 'max')
).reset_index()

_df_cus = _df_all_customers_year_month.merge(_df_customer_monthly_agg, on=['customerid', 'year_month'], how='left').sort_values(by=['customerid', 'year_month'])

# add timestamps
_df_cus['pfin_last_purchase_date'] = _df_cus.groupby('customerid')['monthly_last_purchase_date'].shift(1)
_df_cus['invoice_timestamp_end'] = _df_cus['year_month'].dt.end_time

# compute recency days
_df_cus['customer_recency_days'] = (_df_cus['invoice_timestamp_end'] - _df_cus['pfin_last_purchase_date']).dt.days

# merge and impute
df_fin['customer_recency_days'] = _df_cus['customer_recency_days']

max_recency = _df_cus['customer_recency_days'].max()
df_fin['customer_recency_days'] = df_fin['customer_recency_days'].fillna(value=max_recency + 30)

df_fin.info()


customer_total_spend_ltm shows a clear positive correlation between a customer's total spend in the last three months and their current monthly sales revenue. This indicates higher past spending generally corresponds to higher current revenue, making it an excellent predictive feature. I’ll add this feature.

Figure D-5. Monthly sales vs customer total spend in the last three months (Created by Kuriko IWAI)


Adding customer_total_spend_ltm:

_df_cus['customer_total_spend_ltm'] = _df_cus.groupby('customerid')['monthly_sales'].rolling(window=3, closed='left').sum().reset_index(level=0, drop=True)

df_fin['customer_total_spend_ltm'] = _df_cus['customer_total_spend_ltm']
df_fin['customer_total_spend_ltm'] = df_fin['customer_total_spend_ltm'].fillna(value=0)

df_fin.info()


customer_freq_ltm also demonstrates a positive relationship between a customer's purchase frequency in the last three months and their current monthly sales revenue. Customers with more unique invoices in the prior three months tend to generate higher monthly revenue. I’ll add this feature as well.

Figure D-6. Monthly sales vs customer frequency in the last three months (Created by Kuriko IWAI)


Adding customer_freq_ltm:

_df_cus['customer_freq_ltm'] = _df_cus.groupby('customerid')['monthly_unique_invoices'].rolling(window=3, closed='left').sum().reset_index(level=0, drop=True)

df_fin['customer_freq_ltm'] = _df_cus['customer_freq_ltm']
df_fin['customer_freq_ltm'] = df_fin['customer_freq_ltm'].fillna(value=0)

df_fin.info()


Final Check on Missing Values

I found five missing values in each of the stockcode, quantity, country, and unit_price_bin columns of the updated dataset:

df_fin.isna().sum()
  • invoiceno 0

  • stockcode 5

  • quantity 5

  • invoicedate 0

  • unitprice 0

  • customerid 135080 (already covered in is_registered)

  • country 5

  • is_registered 0

  • year 0

  • year_month 0

  • month_name 0

  • day_of_week 0

  • sales 0

  • unit_price_bin 5

  • product_avg_quantity_last_month 0

  • customer_recency_days 0

  • customer_total_spend_ltm 0

  • customer_freq_ltm 0

I’ll check missing items one by one and perform imputation.

Note: Given that only max. 20 out of 540k+ samples have missing values, the row-wise deletion (simply deleting them from the dataset) could be a viable option.
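For reference, a row-wise deletion sketch on an illustrative frame, using dropna with a subset of columns:

```python
import pandas as pd

# illustrative frame: 2 of 5 rows have a missing stockcode
df_demo = pd.DataFrame({
    'stockcode': ['A1', None, 'B2', None, 'C3'],
    'quantity': [1, 2, 3, 4, 5],
})

# row-wise deletion: drop samples missing any of the listed columns
df_dropped = df_demo.dropna(subset=['stockcode'])
print(len(df_dropped))  # 3 rows remain
```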

stockcode / unit_price_bin

df_null = df_fin[df_fin['stockcode'].isnull()]
df_null.head().transpose()


Other values in the samples with missing stockcode appear legitimate. I replaced NaNs in stockcode with ‘unknown’ and unit_price_bin with ‘low’:

df_fin['stockcode'] = df_fin['stockcode'].fillna(value='unknown')
df_fin['unit_price_bin'] = df_fin['unit_price_bin'].fillna(value='low')

Following the same process, the missing values in the country and quantity columns are filled with the mode value and with sales / unitprice, respectively:

import numpy as np

df_fin['country'] = df_fin['country'].fillna(value=df_fin['country'].mode().iloc[0])
df_fin['quantity'] = df_fin['quantity'].fillna(value=np.floor(df_fin['sales'] / df_fin['unitprice']))

I then converted the data types to finalize the dataset:

df_fin['year_month'] = df_fin['year_month'].dt.month
df_fin['invoicedate'] = df_fin['invoicedate'].astype('int64') / 10 ** 9  # nanoseconds to seconds
df_fin = df_fin.drop(columns=['month_name'])

df_fin.info()


The final version of the dataset has 541,909 data points with 17 features:

cat_cols = [
    'invoiceno',
    'stockcode',
    'quantity',
    'customerid',
    'country',
    'year',
    'year_month',
    'day_of_week',
    'is_registered',
    'unit_price_bin',
    'customer_recency_days',
]
num_cols = [
    'unitprice',
    'product_avg_quantity_last_month',
    'customer_total_spend_ltm',
    'customer_freq_ltm',
    'invoicedate'
]

target_col = 'sales'

Phase 1 Feature Engineering Recap

Based on the EDA, I added 11 features:

  • From the univariate EDA: is_registered, year, year_month, month_name, day_of_week, sales

  • From the bivariate EDA: unit_price_bin, product_avg_quantity_last_month, customer_recency_days, customer_total_spend_ltm, customer_freq_ltm

and dropped one feature: description, due to its large volume of missing values and limited impact on the prediction.

Model Selection

Given the large, complex dataset, I selected three models:

  • Elastic Net: A regularized linear regression model, suitable as a baseline for linearly separable data.

  • Random Forest: A powerful machine learning model capable of capturing complex, non-linear relationships.

  • Deep Feedforward Network: A deep learning model that serves as a strong baseline for non-linearly separable data. To effectively manage large datasets, I used the PyTorch library.

Each model has different preprocessing needs. In the next section, I'll train each model on its preprocessed dataset.

Training Models on Preprocessed Data

First, I split the dataset into training, validation, and test sets for all models.

I intentionally didn't shuffle the dataset to preserve its temporal order.

from sklearn.model_selection import train_test_split

target_col = 'sales'
X = df_fin.copy().drop(columns=target_col)
y = df_fin.copy()[target_col]

# shuffle=False keeps the splits in temporal order
test_size = 50000
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, shuffle=False)
X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=test_size, shuffle=False)

Elastic Net

Elastic Net needs to be trained on a scaled and encoded dataset.

For numerical features, I applied RobustScaler to handle significant outliers we found during the EDA.

For categorical features, I applied BinaryEncoder to limit the growth in dimensionality, while replacing missing values in the customerid column with zero:

from sklearn.preprocessing import RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders import BinaryEncoder

# numerical features
num_transformer = Pipeline(steps=[
    ('scaler', RobustScaler(with_centering=True, with_scaling=True))
])

# categorical features
cat_transformer = Pipeline(steps=[
    ('encoder', BinaryEncoder(cols=cat_cols, handle_missing='0'))
])

# define a preprocessor
preprocessor_en = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols),
        ('cat', cat_transformer, cat_cols)
    ],
    remainder='passthrough',
)

# transform
X_train_processed = preprocessor_en.fit_transform(X_train)
X_val_processed = preprocessor_en.transform(X_val)
X_test_processed = preprocessor_en.transform(X_test)


# initiate and train the model
from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(
    alpha=1,            # total strength of the regularization
    l1_ratio=0.5,       # l1 to l2 ratio = 1:1
    fit_intercept=True, # fits a y-intercept
    precompute=False,   # do not use a precomputed Gram matrix
    max_iter=5000,      # max 5,000 iterations
    copy_X=True,        # copies X before fitting
    tol=1e-5,           # tolerance to stop iteration
    random_state=42,    # seed for the random number generator
    warm_start=False,   # ignores solutions from the previous fit call
    positive=False,     # allows both negative and positive coefficients
    selection="cyclic"  # updates coefficients cyclically one by one (vs random)
).fit(X_train_processed, y_train)

Random Forest

For random forest, we can skip the scaling:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders import BinaryEncoder

cat_transformer = Pipeline(steps=[
    ('encoder', BinaryEncoder(cols=cat_cols, handle_missing='0'))
])

# define the preprocessor
preprocessor_rf = ColumnTransformer(
    transformers=[
        ('cat', cat_transformer, cat_cols)
    ],
    remainder='passthrough',
)

# transform
X_train_processed = preprocessor_rf.fit_transform(X_train)
X_val_processed = preprocessor_rf.transform(X_val)
X_test_processed = preprocessor_rf.transform(X_test)


# initiate and train the model
from sklearn.ensemble import RandomForestRegressor

random_forest = RandomForestRegressor(
    n_estimators=1000,
    criterion="squared_error",
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0,
    max_features='sqrt',
    max_leaf_nodes=None,
    min_impurity_decrease=1e-10,
    bootstrap=True,
    oob_score=True,
    n_jobs=-1,
    random_state=42,
    verbose=0,
    warm_start=False,
    ccp_alpha=0,
    max_samples=None,
).fit(X_train_processed, y_train)

DFN

DFN also needs scaling and encoding.

For scaling the DFN's numerical features, I used StandardScaler, relying on the DFN’s robustness in handling complex data. The dataset was then transformed into a TensorDataset:

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders import BinaryEncoder

num_transformer = Pipeline(steps=[('scaler', StandardScaler())])
cat_transformer = Pipeline(steps=[('encoder', BinaryEncoder(cols=cat_cols, handle_missing='0'))])

# defines a preprocessor
preprocessor_dfn = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols),
        ('cat', cat_transformer, cat_cols)
    ],
    remainder='passthrough'
)

# transforms the features
X_train_processed_dfn = preprocessor_dfn.fit_transform(X_train)
X_val_processed_dfn = preprocessor_dfn.transform(X_val)
X_test_processed_dfn = preprocessor_dfn.transform(X_test)

import torch
from torch.utils.data import DataLoader, TensorDataset

# converts the NumPy arrays into PyTorch tensors
X_train_tensor = torch.tensor(X_train_processed_dfn, dtype=torch.float32)
X_val_tensor = torch.tensor(X_val_processed_dfn, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_processed_dfn, dtype=torch.float32)

# reshapes the 1D targets into column vectors
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1)
y_val_tensor = torch.tensor(y_val.values, dtype=torch.float32).view(-1, 1)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).view(-1, 1)

# wraps features and targets in TensorDatasets
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

# handles batching (shuffle the training set so batches vary across epochs)
batch_size = 32
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(dataset=val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)
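As a quick sanity check, each loader should yield (features, targets) batches of the expected shape. Here is a minimal sketch using synthetic tensors standing in for the preprocessed data (the shapes are illustrative, not the article's actual dimensions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# synthetic stand-ins for the preprocessed feature matrix and target vector
X_demo = torch.randn(100, 8)             # 100 samples, 8 features
y_demo = torch.randn(100).view(-1, 1)    # targets as a column vector

demo_loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=32, shuffle=True)

# the first batch should be (32, 8) features and (32, 1) targets
features, targets = next(iter(demo_loader))
print(features.shape, targets.shape)
print(len(demo_loader))  # number of batches: ceil(100 / 32)
```

Reshaping the targets to a column vector up front avoids a shape-broadcast warning later when the loss compares predictions of shape `(batch, 1)` against targets.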

Then, I initialized the model:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

class DFN(nn.Module):
    def __init__(self, input_dim):
        super(DFN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 32)
        self.relu1 = nn.ReLU()
        self.dropout1 = nn.Dropout(0.1)
        self.fc2 = nn.Linear(32, 16)
        self.relu2 = nn.ReLU()
        self.dropout2 = nn.Dropout(0.1)
        self.fc3 = nn.Linear(16, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.dropout1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        x = self.dropout2(x)
        x = self.fc3(x)
        return x

input_dim = X_train_processed_dfn.shape[1]
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = DFN(input_dim).to(device)
criterion = nn.L1Loss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

Then I trained the model:

from sklearn.metrics import mean_squared_error, mean_absolute_error

num_epochs = 100
best_val_loss = float('inf')
patience = 10
patience_counter = 0
min_delta = 1e-4
history = {'train_loss': [], 'val_loss': [], 'train_mse': [], 'val_mse': [], 'train_mae': [], 'val_mae': []}


for epoch in range(num_epochs):
    model.train()
    running_train_loss = 0.0
    all_train_preds = []
    all_train_targets = []

    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()

        running_train_loss += loss.item() * data.size(0)
        all_train_preds.extend(outputs.detach().cpu().numpy())
        all_train_targets.extend(target.detach().cpu().numpy())

    epoch_train_loss = running_train_loss / len(train_dataset)

    train_mse = mean_squared_error(np.array(all_train_targets), np.array(all_train_preds))
    train_mae = mean_absolute_error(np.array(all_train_targets), np.array(all_train_preds))

    model.eval()
    running_val_loss = 0.0
    all_val_preds = []
    all_val_targets = []

    with torch.no_grad():
        for data, target in val_loader:
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            loss = criterion(outputs, target)
            running_val_loss += loss.item() * data.size(0)
            all_val_preds.extend(outputs.cpu().numpy())
            all_val_targets.extend(target.cpu().numpy())

    epoch_val_loss = running_val_loss / len(val_dataset)

    val_mse = mean_squared_error(np.array(all_val_targets), np.array(all_val_preds))
    val_mae = mean_absolute_error(np.array(all_val_targets), np.array(all_val_preds))

    history['train_loss'].append(epoch_train_loss)
    history['val_loss'].append(epoch_val_loss)
    history['train_mse'].append(train_mse)
    history['val_mse'].append(val_mse)
    history['train_mae'].append(train_mae)
    history['val_mae'].append(val_mae)

    # early stopping: halt when validation loss stops improving by min_delta
    if epoch_val_loss < best_val_loss - min_delta:
        best_val_loss = epoch_val_loss
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            break

Results

MAE

  • Elastic Net: Train: 19.773 → Validation: 18.508

  • Random Forest: Train: 4.147 → Validation: 10.551

  • Deep Feedforward Network: Train: 10.570 → Validation: 10.987

Elastic Net generalizes well (19.77 train, 18.51 validation) but has the highest average error: its predictions are off by roughly $18.50 to $19.77 from the actual sales.

Random Forest overfits significantly (4.15 train, 10.55 validation). On average, its predictions on new data are off by about $10.55.

Deep Feedforward Network (DFN) shows excellent generalization (10.57 train, 10.99 validation) with a low average error on unseen data; its predictions are off by about $10.99.

Random Forest is the best-performing model on validation error, yet the DFN excels in generalization.

Figure F. Actual vs Predicted Sales (Left: Elastic Net, Middle: Random Forest), Loss History by DFN (Right) (Created by Kuriko IWAI)


Iteration of Feature Engineering and Model Training

The results from the previous phase suggest there is still room to improve generalization across all three models.

I applied a log transformation to the sales values to create a more symmetric distribution for the model's target variable.

To differentiate refunds from positive sales, I created an is_return binary flag (1 for refunds, 0 for sales). This allowed the sales column to focus solely on positive sales values.

(Note: taking the logarithm of a negative value results in NaN. I replaced negative values with zero first, then applied Laplace smoothing (adding a constant alpha before taking the log). This also prevents negative infinity at zero sales.)

import numpy as np

df_fin_rev = df_fin.copy()

# adds the is_return flag
df_fin_rev['is_return'] = (df_fin_rev['sales'] < 0).astype(int)

# keeps zero or positive sales only
df_fin_rev['sales'] = df_fin_rev['sales'].apply(lambda x: max(x, 0))

# applies Laplace smoothing before the log transform
alpha = 1
df_fin_rev['sales'] = np.log(df_fin_rev['sales'] + alpha)

df_fin_rev.info()
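Since the later evaluation reports MAE in dollars, it is worth confirming the transform is invertible: `np.exp` undoes the log, and subtracting `alpha` undoes the offset. A minimal sketch on toy values (not the actual sales data):

```python
import numpy as np

alpha = 1
sales = np.array([0.0, 5.0, 120.0, 3400.0])  # toy non-negative sales values

log_sales = np.log(sales + alpha)            # forward transform
recovered = np.exp(log_sales) - alpha        # inverse transform

print(np.allclose(recovered, sales))  # True
```

A zero sale maps to log(0 + 1) = 0, so the transformed target stays finite everywhere.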


After checking that there were no missing values in the dataset except for the customerid column:

df_fin_rev.isna().sum()


I retrained the models using the same preprocessing steps and hyperparameters:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error
from category_encoders import BinaryEncoder

# creates datasets
X = df_fin_rev.copy().drop(columns=target_col)
y = df_fin_rev.copy()[target_col]

test_size = 50000
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=test_size, random_state=42)

# preprocessing
num_transformer = Pipeline(steps=[('scaler', StandardScaler())])
cat_transformer = Pipeline(steps=[('encoder', BinaryEncoder(cols=cat_cols, handle_missing='0'))])
preprocessor_en = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols),
        ('cat', cat_transformer, cat_cols)
    ],
    remainder='passthrough'
)

X_train_processed_en = preprocessor_en.fit_transform(X_train)
X_val_processed_en = preprocessor_en.transform(X_val)
X_test_processed_en = preprocessor_en.transform(X_test)

# model training
elastic_net.fit(X_train_processed_en, y_train)

# prediction (log sales)
y_pred_train = elastic_net.predict(X_train_processed_en)
y_pred_val = elastic_net.predict(X_val_processed_en)
y_pred_test = elastic_net.predict(X_test_processed_en)

# evaluation on log sales - MSE
mse_train = mean_squared_error(y_train, y_pred_train)
mse_val = mean_squared_error(y_val, y_pred_val)
mse_test = mean_squared_error(y_test, y_pred_test)

# evaluation on real-valued sales - MAE
# (np.exp inverts the log; subtracting alpha undoes the Laplace offset)
mae_train_exp = mean_absolute_error(np.exp(y_train) - alpha, np.exp(y_pred_train) - alpha)
mae_val_exp = mean_absolute_error(np.exp(y_val) - alpha, np.exp(y_pred_val) - alpha)
mae_test_exp = mean_absolute_error(np.exp(y_test) - alpha, np.exp(y_pred_test) - alpha)

Results

I evaluated model performance using MSE on the logged sales data and MAE on the real-valued sales:

Elastic Net:

  • MSE on Log Sales: Train: 1.133 → Validation: 1.132, Generalization: 1.122*

  • MAE on Real-valued Sales: Train: 15.825 → Validation: 14.714, Generalization: 16.509

Random Forest:

  • MSE on Log Sales: Train: 0.020 → Validation: 0.175, Generalization: 0.176

  • MAE on Real-valued Sales: Train: 4.135 → Validation: 7.187, Generalization: 9.041

DFN:

  • MSE on Log Sales: Train: 1.079 → Validation: 0.165, Generalization: 0.079

  • MAE on Real-valued Sales: Train: 5.644 → Validation: 5.016, Generalization: 6.197

(*Generalization on the 50,000 test samples.)

The MAE results for real-valued sales improved across all models compared to Phase 1, indicating the importance of a denser, more symmetric target distribution.
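To see why the denser target helps, here is a small illustration of how the log transform reduces skewness. It uses a synthetic right-skewed (lognormal) sample standing in for the raw sales, not the actual dataset:

```python
import numpy as np

rng = np.random.default_rng(42)
sales_like = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # right-skewed, like raw sales

def skewness(x):
    # sample skewness: third standardized moment
    x = np.asarray(x)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

print(round(skewness(sales_like), 2))            # strongly positive (long right tail)
print(round(skewness(np.log1p(sales_like)), 2))  # much closer to 0 (near-symmetric)
```

A near-symmetric target means the squared- and absolute-error losses are no longer dominated by a handful of extreme sales values.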

Among all, the DFN demonstrates the best performance, with low MAE on both training (5.64) and generalization (6.20), suggesting a strong ability to learn and generalize on complex, large datasets. Its predictions on unseen data are off by about $6.20.

Elastic Net shows excellent generalization, yet its predictions on unseen data are off by $16.51, the largest error among the three, suggesting it struggles with the (potentially) non-linear dataset.

Random Forest exhibits significant overfitting, with a large gap between its low training MAE (4.14) and much higher generalization MAE (9.04). Hyperparameter tuning to tighten the regularization parameters and tree structures can be the next step for this model.
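A minimal sketch of that tuning step, using synthetic data from `make_regression` and a hypothetical parameter grid (not the article's dataset or final settings), constraining tree depth and leaf size via grid search:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the preprocessed training data
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

# tighter tree structure = stronger regularization, less overfitting
param_grid = {
    'max_depth': [5, 10, None],
    'min_samples_leaf': [1, 5, 20],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid,
    scoring='neg_mean_absolute_error',  # matches the article's MAE metric
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

Scoring on cross-validated MAE (rather than training MAE) directly targets the train/validation gap observed above.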

Experiment Summary

The experiment demonstrates that the DFN on PyTorch performs the best on the transformed dataset with features identified during the EDA.

Going back to the initial hypotheses on the business solutions, we can apply this finding directly to marketing media-mix optimization: for instance, using the DFN to predict customer lifetime value for new customers and optimize budget allocation toward high-value customer channels.

This experiment covers Phase 1 and the initial part of Phase 2. As a next step, we can either further explore feature engineering in Phase 2 or move on to Phase 3 for tuning hyperparameters to select the model.

Wrapping Up

Feature engineering is more than just data manipulation; it's a strategic method to gain powerful insights from raw data, dramatically improving the model’s ability to solve the problem at hand.

In our experiment, we observed that feature engineering, especially when it is tightly aligned with EDA and the business objectives, significantly enhanced models’ performance. We can expect further improvement by collaborating with domain experts and business stakeholders to refine the hypotheses.

By investing time and effort in crafting effective inputs, we are fundamentally enhancing the model's ability to learn, generalize, and deliver superior predictive performance.


Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Share What You Learned

Kuriko IWAI, "Maximizing Predictive Power: Best Practices in Feature Engineering for Tabular Data" in Kernel Labs

https://kuriko-iwai.com/best-practices-in-feature-engineering


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.