Advanced Cross-Validation for Sequential Data: A Guide to Avoiding Data Leakage

Improve generalization capabilities while keeping data in order


By Kuriko IWAI

Table of Contents

Introduction
What is Cross Validation (CV)
When to Apply Cross-Validation
Major Cross Validation Methods
Core Principles for Cross-Validating Sequential Data
Data Leakage
Preventing Data Leakage in Sequential Data
1. Maintaining Temporal Order
Single Train-Test Split (Holdout)
Monte Carlo Cross-Validation
Blocked K-Fold Cross-Validation
2. Time-Series Specific Validation
Time Series Cross-Validation (“Growing Window”)
Walk-Forward Validation (“Rolling Window” or “Sliding Window”)
3. Preventing Autocorrelation
Time Series Cross-Validation with a Gap (“Gap”)
hv-Blocked K-Fold Cross-Validation
Purged & Embargo Cross-Validation
Simulation
Creating Original Dataset
Defining the GRU Model
Running Cross-Validation
Performing Inference
Results
Wrapping Up

Introduction

Cross-validation is an effective technique to prevent overfitting and provides a more reliable estimate of a model’s performance on unseen data.

When applied to sequential data, it has risks like data leakage and autocorrelation, which can lead to over-optimistic performance estimates.

In this article, I’ll explore cross-validation methods for time series data, including specialized cross-validation techniques with practical implementation on PyTorch GRU and Scikit-Learn SVR.

What is Cross Validation (CV)

Cross-validation (CV) is a statistical technique to evaluate generalization capabilities of a machine learning model.

CV first partitions the original dataset into training sets and validation (or test) sets.

Then, it repeatedly trains a model on a different training set and validates its performance on a separate validation set, making the evaluation more reliable than a single train-test split.

When to Apply Cross-Validation

By repeating these train-validation cycles, CV can prevent overfitting while mitigating biases of the model.

Best practices in applying CV include:

  • Model selection or tuning: a small-scale cross-validation raises confidence in the selection/tuning results,

  • Sequential data analysis, where a single random holdout split might lead to data leakage, and

  • Highly imbalanced target variables in classification tasks, where a single random split might produce extremely imbalanced classes and bias the model. Alongside data augmentation techniques, cross-validation can mitigate this bias in model performance.

On the other hand, CV is not necessary when:

  • A large, non-sequential training set is available, so the risk of a biased model is limited,

  • The model has already shown a stable loss history and competitive generalization capability, and

  • Computational cost must be kept low, such as during a quick initial experiment.

Major Cross Validation Methods

There are many CV methods, each of which has its own unique way to partition original data.

The diagram below compares common methods of K-fold, Stratified K-fold, and LOOCV:

Figure A. Comparing data partition in K-fold based CV methods (Created by Kuriko IWAI)

In the diagram:

  • K-Fold CV (left) evaluates model performance based on the average results of the K equal-sized folds,

  • Stratified K-Fold CV (middle) is a variation of K-fold CV for classification problems that preserves the class proportions in each fold, and

  • Leave-One-Out CV (LOOCV) (right) is another variation of K-fold CV where K is equal to the number of samples in the dataset so that each sample is used as a validation set exactly once.

These K-fold concepts are applicable to sequential data, but their effectiveness depends on a few core principles.

Let us explore more in the next section.
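As a quick sketch of how these three variants partition data (a toy example with scikit-learn splitters, not the article's dataset):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, LeaveOneOut

X = np.arange(20).reshape(-1, 1)      # 20 toy samples
y = np.array([0] * 10 + [1] * 10)     # balanced binary target

# K-Fold: 5 equal-sized validation folds
kf_splits = list(KFold(n_splits=5).split(X))

# Stratified K-Fold: each validation fold keeps the 50/50 class ratio
skf_splits = list(StratifiedKFold(n_splits=5).split(X, y))

# LOOCV: one split per sample, so 20 folds here
loo_splits = list(LeaveOneOut().split(X))
```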

Core Principles for Cross-Validating Sequential Data

When running cross-validation on sequential data, the most critical consideration is to avoid data leakage.

Data Leakage

Data leakage is a problem in machine learning where information from outside the training dataset is used to create (train) the model.

This external information allows a model to cheat during training or validation, but the model cannot achieve the same performance on genuinely unseen data.

There are two main types of data leakage:

  1. Direct Leakage:

    • A simple situation where the target variable to predict is included in the training data as a feature.

    • e.g., a model predicting home sale prices is trained on data that includes the actual sale price as a feature.

  2. Indirect Leakage:

    • A feature that is highly correlated with the target variable, yet unknown at prediction time, is included in the training data.

    • e.g., a model predicting home sale prices is trained on data that includes the current property tax rate as a feature.

The leakage here is that the future property tax rate is unknown when the model predicts the sales price.

The below diagram showcases how the leakage flows:

Figure B. Data leakage on sales price prediction (Created by Kuriko IWAI)

  • First, the model is trained on input features including the property tax.

  • The model predicts home sales price for $450,000 using the current property tax rate.

  • The home is sold for $500,000.

  • The local government updates the assessed value of the home based on the sale price.

  • The new property tax is calculated based on this new, higher assessed value.

  • The model is then trained with this new property tax value as a feature.

  • The model mistakenly learns that a high tax rate means a high sale price.

This learned relationship is spurious: the tax rate correlates with the sale price only because the tax rate was calculated after the sale.

So the tax rate shouldn't be included in the training data, and the model should focus on predicting the pre-tax sale price.

Preventing Data Leakage in Sequential Data

In sequential data, data leakage happens when the information from the future (validation data) directly or indirectly combines with the past information (training data).

To avoid the leakage, we:

  1. Must maintain temporal order: Ensure the sequence of events is preserved.

  2. Use time-series specific validation: Employ validation methods designed for sequential data.

  3. Prevent autocorrelation: Avoid situations where temporally close data points in the training and validation sets are correlated with one another.

Satisfying Condition 1 is mandatory, and depending on the data type, Conditions 2 and 3 also need to be met.

Let us explore more in the next section.

1. Maintaining Temporal Order

When working with sequential data, it’s crucial to preserve its temporal order.

Even when using simple CV methods that aren’t specific to time series, we must sort the data and avoid shuffling it.

Single Train-Test Split (Holdout)

This is the simplest approach where the dataset is divided into two segments: a training set (the first part) and a validation set (the last part).

The model is trained once on the historical data and evaluated on the validation set:

Figure C-1. Data partition image (Created by Kuriko IWAI)

Best When:

  • A large dataset is available, and a single evaluation is sufficient to mitigate bias.

Disadvantage:

  • Underperforms if the validation period is not representative of future data.
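A minimal sketch of a chronological holdout on a toy series (the traffic dataset is split the same way in the Simulation section):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy series where index order equals temporal order
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# shuffle=False keeps the split chronological: first 80% train, last 20% validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, shuffle=False)
```

Because the split is not shuffled, every training sample precedes every validation sample in time.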

Monte Carlo Cross-Validation

Monte Carlo CV randomly selects data for each fold, assuming that

  • The data points are independent, and

  • Their statistical properties, such as the mean, mode, or median, do not change over time.

When these conditions are met, random sampling does not break the underlying data structure, making the method a valid alternative to traditional time-series cross-validation.

Figure C-2. Data partition image (Created by Kuriko IWAI)

Best When:

  • Stationary data without autocorrelation (i.e., tossing coins multiple times).

Disadvantage:

  • Completely breaks the temporal order when applied to non-stationary time series.
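On a toy index array, Monte Carlo CV is just repeated random splitting; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_folds = 50, 5
mc_splits = []
for seed in range(n_folds):
    # each fold is an independent random 80/20 split of the sample indices
    tr_idx, val_idx = train_test_split(
        np.arange(n_samples), test_size=0.2, shuffle=True, random_state=seed
    )
    mc_splits.append((tr_idx, val_idx))
```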

Blocked K-Fold Cross-Validation

A modified version of the K-fold where the data is not shuffled.

Data is divided into contiguous blocks (folds), and each block is used as the validation set in turn.

Figure C-3. Data partition image (Created by Kuriko IWAI)

Best When:

  • Non-stationary data without autocorrelation.

  • Only a small number of samples is available, and all of them need to be used for training/validation.

Disadvantage:

  • Data with strong autocorrelation carries data leakage risks.
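A minimal sketch: scikit-learn's KFold with shuffle=False already produces contiguous, ordered blocks:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25).reshape(-1, 1)

# shuffle=False keeps each validation fold as one contiguous, ordered block
blocked_splits = list(KFold(n_splits=5, shuffle=False).split(X))
```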

2. Time-Series Specific Validation

The second principle is to use time-series-specific validation methods.

These methods are specifically designed to maintain the temporal order of the data, validating model performance in forward-looking scenarios.

Time Series Cross-Validation (“Growing Window”)

A sequential approach where each fold’s training set expands to include all previous data.

The model is retrained on this ever-growing history and validated on a fixed-size block of subsequent data.

Figure D-1. Data partition image (Created by Kuriko IWAI)

Best When:

  • Model performance benefits from more data.

  • Simulating a production environment where the model is regularly updated with new information.

Disadvantage:

  • Computationally expensive for later folds due to the increasing training set size.
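scikit-learn implements the growing window as TimeSeriesSplit; a toy sketch verifying its expanding behavior:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(30).reshape(-1, 1)

# each successive fold trains on all data seen so far and
# validates on the fixed-size block that follows it
growing_splits = list(TimeSeriesSplit(n_splits=5).split(X))
```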

Walk-Forward Validation (“Rolling Window” or “Sliding Window”)

The walk-forward method shares the same concept as the growing window method, but the training and validation window sizes remain constant.

As the validation window moves forward, old training data is discarded.

This is a common method for cross-validating sequential data.

Figure D-2. Data partition image (Created by Kuriko IWAI)

Best When:

  • Extremely long sequential data where older temporal data is less relevant for future predictions.

Disadvantage:

  • Computationally expensive because the model is retrained at each step of the rolling window.

  • Discarding older data could lose valuable long-term trends or seasonal patterns.
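scikit-learn has no dedicated sliding window splitter, so a sketch with a hypothetical helper (sliding_window_splits, named here for illustration) mirrors the idea:

```python
import numpy as np

def sliding_window_splits(n_samples, n_folds):
    """Fixed-size train and validation windows that slide forward together."""
    window = n_samples // (n_folds + 1)
    splits = []
    for i in range(n_folds):
        # train on one window, validate on the next; both move forward each fold
        train_idx = np.arange(i * window, (i + 1) * window)
        val_idx = np.arange((i + 1) * window, min((i + 2) * window, n_samples))
        splits.append((train_idx, val_idx))
    return splits

sw_splits = sliding_window_splits(60, 5)
```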

3. Preventing Autocorrelation

Autocorrelation is a typical source of data leakage in sequential data.

Some CV methods are designed to prevent autocorrelation by intentionally adding gaps between training sets and validation sets.
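Before choosing one of these methods, it can help to measure the autocorrelation directly; a toy sketch with pandas (random-walk data, not the article's dataset):

```python
import numpy as np
import pandas as pd

# a random walk: a strongly autocorrelated toy series
rng = np.random.default_rng(42)
series = pd.Series(np.cumsum(rng.normal(size=500)))

# lag-1 autocorrelation near 1 means adjacent samples are far from independent,
# so folds that touch at the boundary will leak information
lag1 = series.autocorr(lag=1)
```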

Time Series Cross-Validation with a Gap (“Gap”)

This method is a variation of the time-series-specific CV methods that inserts a gap between the training and validation sets.

This gap helps ensure independence between the training and validation sets.

Figure E-1. Data partition image (Created by Kuriko IWAI)

Best When:

  • A strict separation between training and validation data is required to avoid data leakage.

Disadvantage:

  • Gaps leave some training data unused. When the data is small, a model might underfit.
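In scikit-learn, TimeSeriesSplit exposes a gap parameter for exactly this purpose; a toy sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(36).reshape(-1, 1)

# gap=3 discards the 3 training samples closest to each validation window
gap_splits = list(TimeSeriesSplit(n_splits=5, gap=3).split(X))
```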

hv-Blocked K-Fold Cross-Validation

hv-Blocked K-fold is an advanced form of blocked validation with a time gap between training and validation blocks.

Figure E-2. Data partition image (Created by Kuriko IWAI)

Best When:

  • Data has strong autocorrelation, and a rigorous form of blocked validation is needed to prevent leakage.

Disadvantage:

  • The gaps leave some training data unused. When the data is small, a model might underfit.
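A sketch of the idea with a hypothetical helper (hv_blocked_splits, named here for illustration) built on scikit-learn's KFold, removing a gap on both sides of each validation block:

```python
import numpy as np
from sklearn.model_selection import KFold

def hv_blocked_splits(n_samples, n_folds, gap):
    """Blocked K-fold with `gap` samples removed on BOTH sides of each validation block."""
    splits = []
    for train_idx, val_idx in KFold(n_splits=n_folds, shuffle=False).split(np.arange(n_samples)):
        lo, hi = val_idx.min() - gap, val_idx.max() + gap
        # keep only training indices outside the buffer around the validation block
        train_idx = train_idx[(train_idx < lo) | (train_idx > hi)]
        splits.append((train_idx, val_idx))
    return splits

hv_splits = hv_blocked_splits(50, 5, gap=2)
```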

Purged & Embargo Cross-Validation

The Purged & Embargo method is designed to strictly prevent data leakage.

It “purges” (removes) training data points that are too close to the validation period, and then applies an “embargo” by removing training data after the validation period that could be affected by future information.

Best When:

  • The time series has high autocorrelation and strict prevention of data leakage is paramount (e.g., finance)

Disadvantage:

  • Some training data is left unused because of the gaps, which could lead to underfitting (compared to the K-fold based methods, the amount of discarded data can be large).
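A sketch of the idea with a hypothetical helper (purged_embargo_splits, named here for illustration), purging samples before and embargoing samples after each validation block:

```python
import numpy as np
from sklearn.model_selection import KFold

def purged_embargo_splits(n_samples, n_folds, purge, embargo):
    """Blocked folds with `purge` samples removed before and `embargo` removed after validation."""
    splits = []
    for train_idx, val_idx in KFold(n_splits=n_folds, shuffle=False).split(np.arange(n_samples)):
        # keep training samples strictly outside both buffers
        keep = (train_idx < val_idx.min() - purge) | (train_idx > val_idx.max() + embargo)
        splits.append((train_idx[keep], val_idx))
    return splits

pe_splits = purged_embargo_splits(60, 5, purge=2, embargo=4)
```

The asymmetric buffers are the point: the purge guards against leakage from the recent past, while the embargo guards against labels that only settle after the validation period.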

That’s all for cross-validation methods for sequential data.

The choice of method is heavily influenced by the data type.

In the next section, I'll explore how model characteristics affect performance under each cross-validation method.

Simulation

In this section, I’ll explore how each cross-validation method works on:

  • A GRU (Gated Recurrent Units) network built on PyTorch and

  • A simpler SVR (Support Vector Regression) on Scikit-Learn.

All code is available in my GitHub repo: GRU / SVR

Creating Original Dataset

First, I loaded and engineered CSV data from the UC Irvine Machine Learning repository:

Figure F. Screenshot of the variable table (source)

In this project, I'll predict traffic_volume using time-series features such as rain_1h and date_time:

import pandas as pd

# load data as df
file_path = 'data/Metro_Interstate_Traffic_Volume.csv'
df = pd.read_csv(file_path, sep=',')

# add datetime-related features (critical step for the gru)
df['date_time'] = pd.to_datetime(df['date_time'])
df['year'] = df['date_time'].dt.year
df['month'] = df['date_time'].dt.month
df['hour'] = df['date_time'].dt.hour
df['day_of_week'] = df['date_time'].dt.dayofweek # categorical (0 to 6)
df['is_weekend'] = df['day_of_week'].isin([5, 6])
df['is_holiday'] = df['holiday'].notna()

# drop unnecessary columns
df = df.drop(columns=['holiday', 'weather_description', 'date_time'])

# create input and target vars
target_col = 'traffic_volume'
y = df[target_col]
X = df.drop(target_col, axis=1)

# split the data into two groups: train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

To preserve the temporal order, the original data must not be shuffled when creating the training and test sets.

X_test is reserved for assessing generalization capabilities. It must not be used during the training/validation phase to avoid data leakage.

Lastly, transform the input features:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders import BinaryEncoder # type: ignore

# classify numerical and categorical features
cat_cols, num_cols = [], []
for col in df.columns.to_list():
    if col == target_col:
        continue
    if df[col].dtype == 'object' or df[col].dtype == 'bool':
        cat_cols.append(col)
    else:
        num_cols.append(col)

# define column transformer
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())]
)
cat_transformer = Pipeline(steps=[('encoder', BinaryEncoder(cols=cat_cols))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols),
        ('cat', cat_transformer, cat_cols)
    ],
    remainder='passthrough'
)

# fit on the training set only, then transform both sets (prevents leakage)
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

The final training set has 38,563 samples with 16 input features.

Defining the GRU Model

Next, I defined the GRU class on the PyTorch library:

import torch.nn as nn

# define a simple gru model (many-to-one architecture)
class GRU(nn.Module):
    def __init__(self, input_size=X_train.shape[1], hidden_size=64, output_size=1):
        super().__init__()
        self.gru = nn.GRU(input_size=input_size, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        h_gru, _ = self.gru(x)              # hidden states for every time step
        o_final = self.fc(h_gru[:, -1, :])  # keep only the last time step
        return o_final

The GRU class uses a simple many-to-one architecture: the hidden states across all time steps are reduced to a single final output taken from the last time step.

Running Cross-Validation

Next, I defined the train_and_validate function, where the X_train data is split into five folds based on a selected cross-validation method, and the model is trained and validated using these folds.

Following best practice in cross-validation, the function re-initializes the model and the optimizer in every fold.

For comparison, I used the same params across all methods:

  • num_epochs: The number of training epochs (set as 300)

  • lr: The learning rate of the optimizer (set as 0.001)

  • num_folds: The number of folds created by the cross-validation method (set as 5), and

  • test_size: The size of test (validation) data from the training data (set as 0.2 (20%)).

import numpy as np
import torch
from sklearn.model_selection import KFold, TimeSeriesSplit, train_test_split
from tqdm import tqdm

# test cross val methods
def train_and_validate(
        validation_method,
        num_epochs,
        lr,
        X_train=X_train, y_train=y_train, # split X_train into folds based on the given cross val method
        num_folds=5,
        test_size=0.2
    ) -> dict:

    # recording loss history
    fold_train_losses = []
    fold_val_losses = []
    model = None

    # define splits based on the chosen validation method (match-case needs Python 3.10+)
    match validation_method:
        case "Holdout":
            train_size = int((1 - test_size) * len(X_train))
            train_indices = np.arange(train_size)
            val_indices = np.arange(train_size, len(X_train))
            splits = [(train_indices, val_indices)]

        case "Monte Carlo":
            splits = []
            for _ in range(num_folds):
                train_indices, val_indices = train_test_split(
                    np.arange(len(X_train)),
                    test_size=test_size,
                    shuffle=True
                )
                splits.append((train_indices, val_indices))

        case "K-Fold":
            kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)
            splits = kf.split(X_train)

        case "Blocked K-Fold":
            kf_blocked = KFold(n_splits=num_folds, shuffle=False)
            splits = kf_blocked.split(X_train)

        case "Growing Window":
            tss = TimeSeriesSplit(n_splits=num_folds)
            splits = tss.split(X_train)

        case "Sliding Window":
            splits = []
            window_size = int(len(X_train) / (num_folds + 1))
            for i in range(num_folds):
                train_start = i * window_size
                train_end = train_start + window_size
                val_start = train_end
                val_end = min(val_start + window_size, len(X_train))
                splits.append(
                    (np.arange(train_start, train_end), np.arange(val_start, val_end))
                )

        case "Gap":
            splits = []
            tss_gap = TimeSeriesSplit(n_splits=num_folds)
            for train_idx, val_idx in tss_gap.split(X_train):
                gap_size = int(0.1 * len(val_idx))
                if gap_size > 0:
                    train_idx = train_idx[:-gap_size] # drop the training samples closest to validation
                splits.append((train_idx, val_idx))

        case "hv-Blocked K-Fold":
            splits = []
            kf_blocked_gap = KFold(n_splits=num_folds, shuffle=False)
            for train_idx, val_idx in kf_blocked_gap.split(X_train):
                gap_size = int(0.1 * len(val_idx))
                if gap_size > 0:
                    train_idx = train_idx[:-gap_size]
                splits.append((train_idx, val_idx))

        case _:
            raise ValueError(f"Unknown validation method: {validation_method}")


    # training and validation loop
    for fold, (train_idx, val_idx) in enumerate(tqdm(splits, desc=f"Training with {validation_method}")):
        # initialize a new model, optimizer, and loss function for each fold
        model = GRU(hidden_size=64, output_size=1)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.MSELoss()

        # separate features X and target y for training and validation,
        # using the fold indices produced by the chosen cv method
        X_train_cv, y_train_cv = X_train[train_idx, :], y_train.iloc[train_idx]
        X_val_cv, y_val_cv = X_train[val_idx, :], y_train.iloc[val_idx]

        # convert numpy arrays to pytorch tensors
        X_train_cv = torch.from_numpy(X_train_cv).float()
        y_train_cv = torch.from_numpy(y_train_cv.values).float()
        X_val_cv = torch.from_numpy(X_val_cv).float()
        y_val_cv = torch.from_numpy(y_val_cv.values).float()

        # create the tensor dataset and data loader
        train_dataset = torch.utils.data.TensorDataset(X_train_cv, y_train_cv)
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=False)

        val_dataset = torch.utils.data.TensorDataset(X_val_cv, y_val_cv)
        val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=32, shuffle=False)

        # start validation
        fold_train_loss_history = []
        fold_val_loss_history = []
        for _ in range(num_epochs):
            # training loop
            model.train()
            train_loss = 0
            for X_batch, y_batch in train_loader:
                X_batch = X_batch.unsqueeze(1)
                y_batch = y_batch.unsqueeze(1)

                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                train_loss += loss.item()

            avg_train_loss = train_loss / len(train_loader)
            fold_train_loss_history.append(avg_train_loss)

            # validation
            model.eval()
            val_loss = 0
            with torch.inference_mode():
                for X_batch, y_batch in val_loader:
                    X_batch = X_batch.unsqueeze(1)
                    y_batch = y_batch.unsqueeze(1)
                    outputs = model(X_batch)
                    val_loss += criterion(outputs, y_batch).item()

            avg_val_loss = val_loss / len(val_loader) if len(val_loader) > 0 else 0
            fold_val_loss_history.append(avg_val_loss)

        fold_train_losses.append(fold_train_loss_history)
        fold_val_losses.append(fold_val_loss_history)

    # after completing cv loop, retrain the model with the entire X_train / y_train set
    if model is not None:
        # convert numpy arrays to pytorch tensors
        X_train = torch.from_numpy(X_train).float()
        y_train = torch.from_numpy(y_train.values).float()

        train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=False)

        for _ in range(num_epochs):
            model.train()
            for X_batch, y_batch in train_loader:
                optimizer.zero_grad()
                X_batch = X_batch.unsqueeze(1)
                y_batch = y_batch.unsqueeze(1)
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                loss.backward()
                optimizer.step()


    return {
        'model': model,
        "fold_train_losses": fold_train_losses,
        "fold_val_losses": fold_val_losses,
        "average_train_loss": np.mean(fold_train_losses),
        "average_val_loss": np.mean(fold_val_losses)
    }

Performing Inference

After training, the model performed inference on new, unseen data (X_test).

Losses are recorded to assess the model’s generalization capability.

# convert X_test (numpy) to torch tensors
X_test_float = torch.from_numpy(X_test).float()
y_test_float = torch.from_numpy(y_test.values).float()

# create test loader
test_dataset = torch.utils.data.TensorDataset(X_test_float, y_test_float)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)

# perform inference
model.eval()
test_loss = 0
criterion = nn.MSELoss()
with torch.inference_mode():
    for X_batch, y_batch in test_loader:
        X_batch = X_batch.unsqueeze(1)
        y_batch = y_batch.unsqueeze(1)
        outputs = model(X_batch)
        test_loss += criterion(outputs, y_batch).item()

# compute average loss (MSE)
ave_test_loss = test_loss / len(test_loader)

Results

1. GRU

Blocked K-fold achieved the best generalization loss (MSE) across all.

Each graph below plots CV losses (for all folds), the average loss (blue line), and the generalization loss over X_test​ (red vertical line) by CV method.

The colored area indicates how well the model generalizes its learning (smaller is better).

Overfitting shows up when the generalization loss (red line) rises well above the average CV loss (blue line).

Figure G. Comparison of loss histories by cross-validation methods (GRU) (Created by Kuriko IWAI)

Blocked K-fold achieved the best generalization result, followed by K-fold CV with a Gap, but both start to overfit at around the 150th epoch. Implementing early stopping would further refine the results.

Growing Window achieved well balanced results, generalizing the learning well to avoid overfitting while minimizing the generalization error.

Holdout and Monte Carlo methods show the most severe overfitting to the training data, resulting in extremely high test errors (large pink area). For this dataset and model, these two are unsuitable CV methods.

2. SVR

Sliding Window achieved the best performance across all.

Each graph below plots the CV losses across five folds, showing the average CV loss (blue line) and the generalization loss (red line) for each CV method.

As with the GRU, a discrepancy between the blue and red lines indicates overfitting.

Figure H. Comparison of loss histories by cross-validation methods (SVR) (Created by Kuriko IWAI)

Sliding Window achieved the best accuracy with the lowest average MSE (0.6149) and high stability. This implies that with SVR, using a fixed-size training window is the most effective way to tackle local autocorrelation in this dataset.

Growing Window and hv-Blocked K-Fold showed the worst-case errors in some folds, indicating that the model can be prone to severe overfitting to past data when trained with these methods.

Other standard methods like Holdout, Monte Carlo, and K-Fold showed good generalization capabilities, but their losses remained high, indicating that they couldn't learn as much as the Sliding Window.

Wrapping Up

Cross-validation is a powerful technique for evaluating time series models, as it helps them generalize effectively and avoid overfitting by mirroring the structure of the original data.

In our simulation, we learned that choosing the right CV technique for the model and data type matters when we assess generalization capabilities of the model.

The goal is to develop a model that performs best in production, and cross-validation is key for achieving this by simulating real-world data.


Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to the PhD level.

  • Linear Algebra Done Right

  • Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

  • Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

  • Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

  • Practical Time Series Analysis: Prediction with Statistics and Machine Learning

Share What You Learned

Kuriko IWAI, "Advanced Cross-Validation for Sequential Data: A Guide to Avoiding Data Leakage" in Kernel Labs

https://kuriko-iwai.com/advanced-cross-validation-techniques


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.
