A Comparative Guide to Hyperparameter Optimization Strategies

Explore strategies and practical implementations for tuning an ML model to achieve optimal performance

Machine Learning · Data Science · Python

By Kuriko IWAI


Table of Contents

Introduction
What is a Hyperparameter?
Hyperparameter Tuning in Action
Simulation Setting
Factory Owner’s Challenge
Goals
Manual Search
Simulation
Building and Evaluating Models
Preparing the Dataset
Searching Hyperparameters
Results
Grid Search
The Search Space
Simulation
Results
Random Search
The Search Space
Simulation
Results
Bayesian Optimization
How Bayesian Optimization Works
The Search Space
Simulation
Results
Metaheuristic Algorithms
The Search Space
Simulation
Step 1. Initialization: Creating a population
Step 2. Evaluation: Computing fitness scores
Step 3: Iteration
Step 4 & 5: Selecting the best parameters
Results
Final Evaluation on Generalization Capability and Cost Efficiency
Case Study: Simple Machine Learning Models Trained on Tabular Data
The Tabular Data
The Search Space
Evaluating Models
Results
Manual Search
Grid Search
Random Search
Genetic Algorithm
Wrapping Up

Introduction

Hyperparameter tuning is a critical step in both traditional machine learning and deep learning that significantly impacts model performance and its generalization ability.

While numerous techniques exist, choosing the optimal tuning method depends on factors like:

  • Model Complexity: More complex models inherently lead to larger search spaces.

  • Data Complexity: The characteristics of the dataset impact tuning difficulty.

  • Familiarity with the Model: Our understanding of the model's behavior can guide tuning choices and define search spaces.

In this article, I’ll compare five key tuning methods to demonstrate how to choose the right strategy for different scenarios, using Convolutional Neural Networks (CNNs) for a high-dimensional image dataset and Kernel Support Vector Machines (SVMs) for simpler tabular datasets.

What is a Hyperparameter?

Hyperparameters are the configuration settings of machine learning models that are external to the models themselves.

Unlike model parameters, which are learned during training, hyperparameter values cannot be estimated from data and must be set before the training process begins.

For example, in the case of a CNN model, the input layer shape; convolutional layer settings such as the number of filters, filter size, stride, and padding; the output layer size; and compile settings such as the optimizer, loss function, and evaluation metrics are all hyperparameters that must be set before training.
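In code, such settings are often collected into a single configuration dictionary before the model is built. A minimal sketch (the keys mirror those tuned later in this article; every value is an illustrative assumption, not a recommendation):

```python
# An illustrative CNN hyperparameter set. None of these values are
# learned from data; all must be fixed before training starts.
cnn_hparams = {
    'num_conv_layers': 1,     # architecture: number of conv blocks
    'filters_0': 16,          # filters in the first conv layer
    'batch_norm_0': False,    # whether to apply batch normalization
    'dense_units': 16,        # width of the fully connected layer
    'dropout': 0.0,           # dropout rate (0.0 disables dropout)
    'learning_rate': 0.001,   # optimizer step size
    'epochs': 50,             # maximum training epochs
}
```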

Hyperparameter Tuning in Action

Hyperparameter tuning is a technique used to adjust a model's hyperparameters, aiming to optimize its performance—for example, minimizing loss in regression tasks.

Most machine learning models have many hyperparameters. Tuning them is therefore a critical step in finding the combination that delivers peak performance while maintaining computational efficiency.

Various methods exist to navigate the tuning process, and each method offers a different trade-off between computational cost, search efficiency, and the likelihood of finding the truly optimal configuration (global optima).

Major tuning methods, ordered by complexity and intelligence, include:

  • Manual Search

  • Grid Search

  • Random Search

  • Bayesian Optimization

  • Metaheuristic Algorithms

In the next section, I'll simulate how each method works on models of different complexity.

Simulation Setting

Before jumping into the details, I’ll briefly introduce the simulation setting.

Factory Owner’s Challenge

Let us consider a situation where a factory owner is automating the inspection of small metal gears for a high-precision assembly.

Manual visual inspection is slow, costly, and error-prone.

So, the owner plans to develop an automated AI-powered inspection system using a Convolutional Neural Network (CNN).

This system will perform a regression task, using the CNN model to estimate the continuous radius value from images captured by a high-resolution camera.

Figure A. Small gears (Source)


Goals

The ultimate goal of the simulation is to find the optimal hyperparameters for the CNN model that minimize the generalization error.

I’ll use Mean Absolute Error (MAE) as the evaluation metric to assess prediction accuracy in pixels, examining both the best MAE during the search and the generalization error (MAE) on a separate test dataset.

Now, let us see how each method works, leveraging the simulation.

Manual Search

Manual search is the most basic approach, where a human manually tries different hyperparameter values, trains the model, and evaluates its performance.

It's often driven by intuition, domain expertise, and prior experience.

Best when:

  • Initial Exploration: Just starting with a new model or dataset and need to get a rough sense of the hyperparameter impact.

  • Small Search Space: For models with very few hyperparameters or a narrow range of possible values.

  • Limited Computational Resources: Very restricted computing power and can only afford a few trials.

Limitations:

  • Extremely time-consuming, highly subjective, and rarely finds truly optimal solutions for complex models.

Simulation

Since this is our first trial, I’ll set up the datasets along with the functions to evaluate the model.

Building and Evaluating Models

The build_cnn_model function builds various CNN architectures by taking hparams, a dictionary of hyperparameters, as an argument. All models use MAE as their evaluation metric by default.

Then, the evaluate_model function builds, trains, and evaluates each CNN model using the validation dataset. This function returns the trained model and the key evaluation metrics:

  • eval_count: The number of distinct hyperparameter sets that were trained and evaluated during the tuning process,

  • total_time: The total time in seconds from the start of the search until the end, and

  • mae: The best MAE score generated during the search.

import time

import tensorflow as tf
from tensorflow import keras
from keras import models, layers, optimizers, metrics, losses, backend, callbacks

def build_cnn_model(hparams: dict, input_shape=X_train.shape[1:]):
    # input layer
    inputs = layers.Input(shape=input_shape)

    # first conv block
    x = layers.Conv2D(hparams['filters_0'], (3, 3), activation='relu')(inputs)
    if hparams.get('batch_norm_0', False): x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x)

    # additional conv blocks
    for i in range(1, hparams.get('num_conv_layers', 1)):
        if f'filters_{i}' in hparams:
            x = layers.Conv2D(hparams.get(f'filters_{i}', 16), (3, 3), activation='relu')(x)
            if hparams.get(f'batch_norm_{i}', False): x = layers.BatchNormalization()(x)
            x = layers.MaxPooling2D((2, 2))(x)

    # dense layer
    x = layers.Flatten()(x)
    x = layers.Dense(hparams['dense_units'], activation='relu')(x)
    if hparams['dropout'] > 0: x = layers.Dropout(hparams['dropout'])(x)

    # output layer
    outputs = layers.Dense(1)(x)

    # compiles the model
    model = models.Model(inputs=inputs, outputs=outputs)
    model.compile(
        optimizer=optimizers.Adam(learning_rate=hparams['learning_rate']),
        loss=losses.MeanSquaredError(),
        metrics=[metrics.MeanAbsoluteError(name='mae')]
    )
    return model


def evaluate_model(hparams: dict, eval_count: int = 0, total_time: float = 0.0) -> tuple:
    eval_count += 1
    start_time = time.time()

    # builds the model (calls the build_cnn_model function)
    model = build_cnn_model(hparams)

    # trains on the training dataset
    early_stopping = callbacks.EarlyStopping(
        monitor='val_mae',
        patience=5,
        restore_best_weights=True,
        verbose=1
    )
    model.fit(
        X_train, y_train,
        epochs=hparams.get('epochs', 100),
        validation_data=(X_val, y_val),
        batch_size=16,
        callbacks=[early_stopping],
        verbose=0
    )

    # evaluates on the validation dataset
    _, mae = model.evaluate(X_val, y_val, verbose=0)

    end_time = time.time()
    total_time += (end_time - start_time)

    # clears the keras backend
    backend.clear_session()

    return eval_count, total_time, mae, model

Note: Due to its complexity, a CNN model can easily overfit to the training dataset. Adding an early stopping callback is crucial.

Preparing the Dataset

I created train, validation, and test datasets from 1,000 synthetic 28 × 28 pixel images:

import numpy as np
from sklearn.model_selection import train_test_split

img_height, img_width = 28, 28
min_radius, max_radius = 1, 10
noise_level = 0.1  # noise amplitude for the synthetic images (value assumed)
n_samples = 1000

images, targets = [], []
for i in range(n_samples):
    # generate_image_and_target is a synthetic-data helper defined elsewhere
    img, target = generate_image_and_target(img_height, img_width, min_radius, max_radius, noise_level)
    images.append(img)
    targets.append(target)

X = np.array(images)
X = X.reshape(n_samples, img_height, img_width, 1)
y = np.array(targets)

X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=200, random_state=42, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=200, random_state=42, shuffle=True)

print(X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape)

(600, 28, 28, 1) (600,) (200, 28, 28, 1) (200,) (200, 28, 28, 1) (200,)

Now, let us start the manual search.

Searching Hyperparameters

I randomly defined three sets of hyperparameters and picked the one with the best MAE score:

# randomly set up the hyperparameter options
hparam_options = [
    {
        'num_conv_layers': 1,
        'filters_0': 16,
        'batch_norm_0': False,
        'dense_units': 16,
        'dropout': 0.0,
        'learning_rate': 0.001,
        'epochs': 50
    },
    {
        'num_conv_layers': 1,
        'filters_0': 16,
        'batch_norm_0': False,
        'dense_units': 16,
        'dropout': 0.0,
        'learning_rate': 0.001,
        'epochs': 200
    },
    {
        'num_conv_layers': 2,
        'filters_0': 32,
        'batch_norm_0': True,
        'filters_1': 64,
        'batch_norm_1': True,
        'dense_units': 32,
        'dropout': 0.1,
        'learning_rate': 0.001,
        'epochs': 100
    },
]

# start searching
eval_count = 0
total_time = 0.0
best_mae = float('inf')
best_hparams = None
best_model = None

for item in hparam_options:
    eval_count_returned, total_time_returned, current_mae, model = evaluate_model(hparams=item)
    eval_count += eval_count_returned
    total_time += total_time_returned

    if current_mae < best_mae:
        best_mae = current_mae
        best_hparams = item
        best_model = model

# best_mae, eval_count, total_time, best_hparams, and best_model now hold the results

Results

  • Best MAE: 0.3663 pixels (4.07% of the total range)

  • Execution Time: 4.9146 seconds

  • Total Number of Evaluations: 3

  • Optimal Hyperparameter Set:
    num_conv_layers: 1, filters_0: 16, batch_norm_0: False, dense_units: 16, dropout: 0.0, learning_rate: 0.001, epochs: 200

The second option in the hparam_options list turned out to be the best, resulting in an MAE of 0.3663 pixels.

Given that the target radius ranges from one to ten, the prediction error is about 4.07% of the total range (0.3663 / (10 − 1) ≈ 0.0407). This will serve as my baseline threshold.

Grid Search

Grid search systematically explores all possible combinations of hyperparameters within a predefined, discrete set of values (the grid).

For example, if we define a learning rate to be tested at [0.01, 0.001] and a batch size at [32, 64], Grid Search will test all four combinations [0.01/32, 0.01/64, 0.001/32, 0.001/64].
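That enumeration is just a Cartesian product; a quick sketch with the standard library (parameter names and values match the example above):

```python
from itertools import product

learning_rates = [0.01, 0.001]
batch_sizes = [32, 64]

# Every learning rate is paired with every batch size: 2 x 2 = 4 candidates.
grid = [
    {'learning_rate': lr, 'batch_size': bs}
    for lr, bs in product(learning_rates, batch_sizes)
]
```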

Best when:

  • Well-defined Hyperparameter Space: A moderate range of hyperparameter options can be set sensibly based on domain expertise or prior experience.

  • Exhaustive Search Required: Need to be absolutely certain of finding the best combination within the search space.

  • Auditability: Need a thorough and reproducible search process for compliance or understanding.

  • Abundant Computational Resources: Can secure sufficient computing power to explore every combination in the search space.

Limitations:

  • Can become computationally expensive very quickly as the number of hyperparameters and their options increases (e.g., 5 options for each of 10 hyperparameters = 5¹⁰ ≈ 9.8 million combinations to train and evaluate).

  • Tends to suffer from the curse of dimensionality.

The Search Space

Grid Search efficiently explores smaller search spaces by meticulously checking all possible hyperparameter combinations. Because the search is exhaustive, an excessively large search space quickly becomes prohibitively time-consuming.

So, based on the best-performing hyperparameters from the Manual Search, I defined a small search space with 36 discrete options as the hparams_options dictionary:

hparams_options = {
    'num_conv_layers': [1],
    'filters_0': [16, 32],
    'batch_norm_0': [False],
    'dense_units': [16, 32, 64],
    'dropout': [0.0, 0.1, 0.2],
    'learning_rate': [0.001, 0.0001],
    'epochs': [200],
}

Simulation

I tested all the combinations:

from itertools import product

all_combinations_list = []

eval_count = 0
total_time = 0.0
best_mae = float('inf')
best_hparams = None
best_model = None

# creates all possible combinations (based on the number of convolutional layers)
for num_conv in hparams_options['num_conv_layers']:
    dynamic_filters_bn_combinations = list(
        product(
            *(hparams_options[f'filters_{i}'] for i in range(num_conv)),
            *(hparams_options[f'batch_norm_{i}'] for i in range(num_conv))
        )
    )
    base_combos = product(
        hparams_options['learning_rate'],
        hparams_options['epochs'],
        hparams_options['dense_units'],
        hparams_options['dropout']
    )

    for base_combo in base_combos:
        for dynamic_combo in dynamic_filters_bn_combinations:
            hparam_set = {
                'learning_rate': base_combo[0],
                'epochs': base_combo[1],
                'num_conv_layers': num_conv,
                'dense_units': base_combo[2],
                'dropout': base_combo[3]
            }
            filter_values = dynamic_combo[:num_conv]
            batch_norm_values = dynamic_combo[num_conv:]
            for i in range(num_conv):
                hparam_set[f'filters_{i}'] = filter_values[i]
                hparam_set[f'batch_norm_{i}'] = batch_norm_values[i]
            all_combinations_list.append(hparam_set)


# tests all combinations to find the best performing one
for hparams in all_combinations_list:
    eval_count_returned, total_time_returned, current_mae, model = evaluate_model(hparams=hparams)
    eval_count += eval_count_returned
    total_time += total_time_returned

    if current_mae < best_mae:
        best_mae = current_mae
        best_hparams = hparams
        best_model = model

# best_mae, eval_count, total_time, best_hparams, and best_model now hold the results

Results

  • Best MAE: 0.2675 pixels (2.97% error range)

  • Execution Time: 124.7468 seconds

  • Total Number of Evaluations: 36 (Tested all combinations)

  • Optimal Hyperparameter Set:
    num_conv_layers: 1, dense_units: 64, dropout: 0.0, filters_0: 16, learning_rate: 0.0001, epochs: 200

The Grid Search trained and evaluated all 36 patterns of CNNs and found the best performing combination with MAE of 0.2675 pixels (2.97% error range), significantly improving the result of Manual Search (4.07%).

The execution took 125 seconds (roughly 2 minutes) to complete. On top of the narrow search space, early stopping played a key role here to stop unnecessary epochs and save search time once the error decrease became negligible.

Expanding the search space and relaxing regularization may result in finding different local optima.

Random Search

Instead of exhaustively checking every combination, Random Search samples a fixed number of hyperparameter combinations randomly from the specified search space.

The search space can be continuous or discrete.

Best when:

  • Large Hyperparameter Space: When the number of hyperparameters or their ranges are too vast for Grid Search.

  • Efficiency > Exhaustiveness: Needs a reasonably good solution quickly, especially if only a few hyperparameters truly impact performance.

  • Discovering Important Hyperparameters: Studies have shown that Random Search often finds better results than Grid Search with the same computational budget, as it's more likely to explore widely disparate values.

Limitations:

  • Does not guarantee finding the global optimum as it relies on random sampling.

  • The quality of the solution depends on the number of iterations and the randomness of the sampling.

The Search Space

Random Search performs best when it can select options as randomly as possible from a large search space.

To leverage this capability, I set up an extremely large search space of 2,246,400 discrete combinations with wide value ranges, which also lets the search sample continuous values (such as the learning rate).

hparams_options = {
    'num_conv_layers': [1, 2, 3, 4],
    'filters_0': [16, 32, 48, 64],
    'filters_1': [16, 32, 48, 64],
    'filters_2': [16, 32, 48, 64],
    'filters_3': [16, 32, 48, 64],
    'batch_norm_0': [True, False],
    'batch_norm_1': [True, False],
    'batch_norm_2': [True, False],
    'batch_norm_3': [True, False],
    'dense_units': [32, 64, 96, 128],
    'dropout': [0.0, 0.1, 0.2, 0.3],
    'learning_rate': [0.1, 0.01, 0.005, 0.001, 0.0001],
    'epochs': [100, 200, 300, 400, 500, 1000]
}

Simulation

Even within the same search space, Random Search results vary between trials because it explores random subsets of hyperparameters.

I performed five trials, each with 50 randomly selected combinations from the search space, and recorded the best and average results.

In practice, we need to strike a balance between running many trials to find the absolute global optimum and limiting the time-consuming process to settle for a reasonably good sub-optimum.

import random

n_iterations = 50  # number of randomly sampled combinations per trial

eval_count = 0
total_time = 0.0

best_mae = float('inf')
best_hparams = None
best_model = None

for _ in range(n_iterations):
    # creates a hyperparameter set by randomly choosing values from hparams_options
    hparams = dict()
    hparams['learning_rate'] = random.uniform(hparams_options['learning_rate'][0], hparams_options['learning_rate'][-1])
    hparams['epochs'] = random.choice(hparams_options['epochs'])
    hparams['num_conv_layers'] = random.choice(hparams_options['num_conv_layers'])
    hparams['dense_units'] = random.choice(hparams_options['dense_units'])
    hparams['dropout'] = random.choice(hparams_options['dropout'])
    for i in range(hparams['num_conv_layers']):
        hparams[f'filters_{i}'] = random.choice(hparams_options[f'filters_{i}'])
        hparams[f'batch_norm_{i}'] = random.choice(hparams_options[f'batch_norm_{i}'])

    # evaluate the model
    eval_count_returned, total_time_returned, current_mae, model = evaluate_model(hparams=hparams)
    eval_count += eval_count_returned
    total_time += total_time_returned

    if current_mae < best_mae:
        best_mae = current_mae
        best_hparams = hparams
        best_model = model

# best_mae, eval_count, total_time, best_hparams, and best_model now hold the results

Results

Figure B. Comparison of five Random Search trials (left: MAE, right: search time in seconds) (Created by Kuriko IWAI)


  • Best MAE: 0.1712 pixels (1.90% error range)

  • Average MAE: 0.2233 pixels (2.48% error range) / trial

  • Average Execution Time: 216.08 seconds / trial

  • Total Number of Evaluations: 50 times / trial

  • Optimal Hyperparameter Set:

  • learning_rate: 0.010918659801547131, epochs: 200, num_conv_layers: 2, dense_units: 96, dropout: 0.0, filters_0: 32, batch_norm_0: True, filters_1: 32, batch_norm_1: False

Random Search outperformed Grid Search, resulting in a 2.48% error range on average compared to the Grid Search's 2.97%.

However, due to the broad search space, it took 216 seconds (about 3.5 minutes) on average to test 50 CNN configurations, slightly slower than Grid Search.

This approach is especially useful for exploring broad search spaces efficiently, offering a strong balance between speed and performance.

Random Search can also evaluate continuous parameters, such as learning_rate, without requiring pre-defined values. For example, the optimal learning_rate in this simulation was 0.010918.

By increasing n_iterations, we can expect further improvement of the MAE score.
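As a side note, continuous hyperparameters that span several orders of magnitude, like the learning rate, are commonly sampled log-uniformly rather than uniformly so that small magnitudes are explored as thoroughly as large ones. A standard-library sketch (the bounds are assumptions):

```python
import math
import random

def sample_learning_rate(low=1e-4, high=1e-1):
    """Draw a learning rate log-uniformly from [low, high]."""
    exponent = random.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent

samples = [sample_learning_rate() for _ in range(1000)]
```

With a plain uniform draw over [1e-4, 1e-1], roughly 90% of samples would land above 1e-2; log-uniform sampling spreads them evenly across the decades.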

Bayesian Optimization

Bayesian Optimization is a more intelligent and sample-efficient method.

Unlike Grid or Random Search, it learns from previous evaluations to intelligently decide which hyperparameter combination to try next.

It constructs a surrogate model, often a Gaussian Process, of the objective function and uses an acquisition function to determine the next most promising point to evaluate.

This iterative process aims to balance exploration and exploitation.

Best when:

  • Expensive Objective Functions: Ideal for models where a single training run and evaluation is very time-consuming (e.g., deep neural networks).

  • Limited Evaluation Budget: Need to find the optimal hyperparameters with the fewest possible trials.

  • Complex and Non-Convex Search Spaces: When the relationship between hyperparameters and model performance is non-linear and difficult to navigate manually, with many local optima where a search algorithm can get stuck.

Limitations:

  • Can be more complex to implement than Grid or Random Search.

  • The computational overhead per iteration is higher due to maintaining and updating the surrogate model.

  • May struggle in very high-dimensional hyperparameter spaces.

How Bayesian Optimization Works

Before jumping into the simulation, I’ll explore key components in Bayesian Optimization and how it works step by step.

The objective of Bayesian Optimization is to find the next evaluation point that maximizes the acquisition function.

To solve this problem, it leverages a probabilistic model as a surrogate model, most often a Gaussian Process.

The iterative cycle of the Bayesian Optimization looks like:

  1. Initialization: Start with a few initial random evaluations of the objective function.

  2. Build/Update Surrogate Model: Train a surrogate model on the observed samples. The surrogate model generates the mean μ(x) and variance σ²(x) over the entire search space.

  3. Compute Acquisition Function: Evaluate an acquisition function across the search space using the mean and variance.

  4. Find Next Evaluation Point: Select the point x_next​ that maximizes the acquisition function (x_next​ is the most promising point to evaluate based on the current exploration-exploitation balance).

  5. Evaluate Objective Function: Run the actual (expensive) objective function at x_next​ to obtain the true function value y_next​.

  6. Add to Data and Repeat: Add (x_next​, y_next​) to the set of observed data and go back to Step 2.

This cycle continues for a given number of iterations or until convergence.

The surrogate model is continually refined as more data becomes available, and the acquisition function leverages this updated model to make increasingly informed decisions about where to explore next.
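The cycle above can be sketched end to end on a one-dimensional toy problem using only NumPy, with an RBF-kernel Gaussian Process as the surrogate and Expected Improvement as the acquisition function. Everything here (the toy objective, kernel settings, candidate grid, and iteration budget) is an illustrative assumption, not the setup used in the CNN simulation:

```python
import math
import numpy as np

def objective(x):
    """Toy 'expensive' objective to maximize (illustrative stand-in)."""
    return -(x - 2.0) ** 2 + math.sin(3.0 * x)

def rbf_kernel(a, b, length=1.0):
    """Squared-exponential kernel matrix k(a, b)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(X, y, X_cand, noise=1e-6):
    """Posterior mean/std of a zero-mean GP at the candidate points."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))   # K(X, X) + jitter
    K_s = rbf_kernel(X, X_cand)                     # K(X, x*)
    mu = K_s.T @ np.linalg.solve(K, y)              # posterior mean
    v = np.linalg.solve(K, K_s)
    var = 1.0 - np.sum(K_s * v, axis=0)             # k(x*, x*) = 1 for RBF
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI acquisition for maximization."""
    z = (mu - best - xi) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mu - best - xi) * cdf + sigma * pdf

rng = np.random.default_rng(0)
candidates = np.linspace(-1.0, 5.0, 200)        # discretized search space

# Step 1: a few random initial evaluations
X = rng.uniform(-1.0, 5.0, size=3)
y = np.array([objective(x) for x in X])

# Steps 2-6, repeated for a fixed budget
for _ in range(15):
    mu, sigma = gp_posterior(X, y, candidates)      # update surrogate
    ei = expected_improvement(mu, sigma, y.max())   # compute acquisition
    x_next = candidates[np.argmax(ei)]              # most promising point
    X = np.append(X, x_next)                        # evaluate objective there
    y = np.append(y, objective(x_next))             # record the observation

best_x = X[np.argmax(y)]
```

Each iteration spends exactly one "expensive" evaluation where EI is highest; in practice a library such as keras-tuner or scikit-optimize handles this bookkeeping.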

Now, I’ll explore how a Gaussian Process, a common surrogate model, and an acquisition function work in Bayesian Optimization.

Building Surrogate Model: Gaussian Process

Gaussian Process (GP) is a probabilistic model that generates a distribution over all possible functions that could fit the given data.

Imagine we have a dataset with three datapoints:

  • (1, 2)

  • (2, 4)

  • (3, 3)

Prediction point: x = 2.5

Linear regression, for example, fits an approximate function to the three datapoints, computes the prediction ŷ at x = 2.5, and outputs a single value, ŷ = 3.25, based on the fitted function ŷ = 0.5x + 2.

On the other hand, a GP generates a Gaussian distribution at the prediction point instead of a single value. For instance, a GP might output the prediction at x = 2.5 as N(μ, σ²) with mean μ ≈ 3.28 and variance σ² ≈ 0.15.

Mathematically, this output is generalized as a posterior probability such that:

p(f(x^*) \mid X, y) \sim \mathcal{N}(\mu(x^*), \sigma^2(x^*))

where μ(x∗) and σ²(x∗) indicate the mean function and the variance function at a prediction point (x*) respectively when the dataset (X, y) is given.

Other key characteristics of GP include:

  • Distribution over Functions: Provides a Gaussian distribution over all possible functions

  • Uncertainty Quantification: Quantifies uncertainty in predictions via a variance for each prediction, representing the confidence in that prediction.

  • Multivariate Normal Distributions for Finite Collections: Assumes any finite collection of random variables has a multivariate normal (Gaussian) distribution with mean and covariance defined by mean function and kernel (covariance) function respectively.

  • Non-Parametric: Does not assume a fixed functional form with a predetermined number of parameters.

  • Flexibility: Models a wide variety of functional shapes by choosing different kernel functions.

  • Natural Bayesian Framework: Follows a natural Bayesian approach, starting with a prior distribution and updating it to a posterior distribution after observing data.

How Gaussian Computes the Posterior Probability

Taking advantage of its Bayesian nature, a GP can compute the posterior probability from our prior belief about the underlying true function f(x):

f(x) \sim \mathcal{GP}(m(x), k(x, x'))

where:

  • m(x): the prior mean function and

  • k(x, x′): the prior kernel (covariance) function.

Based on the GP’s assumptions on a multivariate Gaussian distribution, the joint prior distribution of the observed training outputs y and the true function value f(x∗) at the given prediction point x∗ is defined:

\begin{bmatrix} y \\ f(x^*) \end{bmatrix} \sim \mathcal{N}\left( \underbrace{\begin{bmatrix} m(X) \\ m(x^*) \end{bmatrix}}_{\text{mean vector}}, \underbrace{\begin{bmatrix} K(X,X) + \sigma_n^2 I & K(X, x^*) \\ K(x^*, X) & k(x^*, x^*) \end{bmatrix}}_{\text{kernel (covariance) matrix}} \right) \quad \cdots \ (1)

where:

  • X = [x₁, …, xₙ]ᵀ: the input vector

  • m(X) = [m(x₁), …, m(xₙ)]ᵀ: the prior mean vector over the inputs X

  • m(x∗): the prior mean at the prediction point x∗

  • K(X, X): the n-by-n kernel matrix with entries Kᵢⱼ = k(xᵢ, xⱼ)

  • K(X, x∗): the n-by-1 kernel matrix with entries Kᵢ₁ = k(xᵢ, x∗)

  • K(x∗, X) = K(X, x∗)ᵀ: the 1-by-n kernel matrix

  • k(x∗, x∗): the scalar kernel value at the prediction point

  • σₙ²I: Gaussian noise*

*For the Gaussian noise, GP assumes the output y is corrupted by independent Gaussian noise ε such that:

y_i = f(x_i) + \epsilon_i

where the noise is drawn from a Gaussian distribution with noise variance σₙ²:

\epsilon_i \sim N(0, \sigma_n^2)

Leveraging the properties of conditional distributions for multivariate Gaussians on the formula (1), GP can define the posterior probability at the prediction point x* with the mean μ(x∗) and the variance σ^2(x∗):

\mu(x^*) = m(x^*) + K(x^*, X)\,\underbrace{(K(X,X) + \sigma_n^2 I)^{-1}}_{\text{influence of the observed data}}\,(y - m(X))

\sigma^2(x^*) = k(x^*, x^*) - K(x^*, X)\,\underbrace{(K(X,X) + \sigma_n^2 I)^{-1}}_{\text{influence of the observed data}}\,K(X, x^*)

These equations show how the posterior probability is computed by adjusting the prior mean and kernel based on the observed data. Particularly, the inverse of K(X, X) + σ_n² I represents the influence of the observed data, effectively weighting how much each observation contributes to the prediction and how much uncertainty is reduced from the variance.

Different kernel functions k encode different assumptions about the smoothness, periodicity, or other properties of the underlying function, which makes GPs flexible enough to learn diverse functions.
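These posterior equations can be checked numerically on the three datapoints from the earlier example, (1, 2), (2, 4), (3, 3), with prediction point x∗ = 2.5. A NumPy sketch assuming a zero prior mean m(x) = 0 and an RBF kernel (the length scale and noise level are assumptions, so the exact μ and σ² will differ from the values quoted earlier):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])      # observed inputs
y = np.array([2.0, 4.0, 3.0])      # observed outputs
x_star = np.array([2.5])           # prediction point

def k(a, b, length=1.0):
    """RBF (squared-exponential) kernel matrix."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

noise = 1e-2                               # sigma_n^2 (assumed)
K = k(X, X) + noise * np.eye(len(X))       # K(X, X) + sigma_n^2 I
K_s = k(X, x_star)                         # K(X, x*)
k_ss = k(x_star, x_star)                   # k(x*, x*)

# Posterior with zero prior mean:
#   mu(x*)      = K(x*, X) (K(X, X) + sigma_n^2 I)^-1 y
#   sigma^2(x*) = k(x*, x*) - K(x*, X) (K(X, X) + sigma_n^2 I)^-1 K(X, x*)
mu_star = (K_s.T @ np.linalg.solve(K, y)).item()
sigma2_star = (k_ss - K_s.T @ np.linalg.solve(K, K_s)).item()
```

With these settings the posterior mean lands between the two neighboring observations, and the variance drops far below the prior variance of 1, reflecting the nearby data.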

Finding Next Evaluation Point in the Search Space

To find the next evaluation point, Bayesian Optimization leverages an acquisition function, which computes a utility score for each candidate point based on the mean and variance computed by the GP.

The acquisition function balances the trade-off between:

  • Exploitation (High Mean): Choosing a point where the GP predicts a promising outcome, i.e., high μ(x), and

  • Exploration (High Variance): Choosing a point where the GP predicts high uncertainty, i.e., high σ²(x).

One of the common acquisition functions in Bayesian Optimization is Expected Improvement (EI):

EI(x) = \begin{cases} (\mu(x) - f(x^+) - \xi)\,\Phi(Z) + \sigma(x)\,\phi(Z) & \text{if } \sigma(x) > 0 \\ 0 & \text{if } \sigma(x) = 0 \end{cases}, \qquad Z = \frac{\mu(x) - f(x^+) - \xi}{\sigma(x)}

where:

  • μ(x): The predictive mean at point x given by GP

  • f(x+): The current best observed value of the objective.

  • ξ≥0: The exploration-exploitation trade-off parameter. A larger ξ encourages more exploration, while ξ=0 focuses purely on maximizing the expected gain.

  • μ(x) - f(x+): The predicted improvement over the current best.

  • μ(x) - f(x+) - ξ: The potential gain when taking a point x.

  • σ(x): The predictive standard deviation at point x given by GP.

  • Z: A standardized score (a z-score) that normalizes the potential gain by the uncertainty.

  • Φ(Z)/ϕ(Z) : The CDF/ PDF of the standard normal distribution evaluated at Z.

A higher EI(x) value means that x has more expected gain when the system explores it next.

So, when there's no uncertainty at point x (hence σ(x) = 0), the EI value is zero as there is no potential for further expected gain.

Instead, when uncertainty exists (σ(x) > 0), the EI computes the potential gain at point x based on the predicted improvement over the current best, while taking the balance between exploration and exploitation using ξ.
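The EI formula translates directly into code; a standard-library sketch treating μ(x), σ(x), and f(x⁺) as given numbers:

```python
import math

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Expected Improvement at a single point x, for maximization."""
    if sigma == 0.0:
        return 0.0                   # no uncertainty -> no expected gain
    gain = mu - f_best - xi          # potential gain over the current best
    z = gain / sigma                 # standardized score Z
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # Phi(Z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # phi(Z)
    return gain * cdf + sigma * pdf
```

A point well below the current best with little uncertainty scores near zero, while either a high predicted mean (exploitation) or a large standard deviation (exploration) raises the score.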

The Search Space

I used the same search space as Random Search, with 2,246,400 combinations. (Note: Similar to Random Search, Bayesian Optimization can take continuous values as optimal hyperparameter values.)

Simulation

I used the BayesianOptimization class from the keras-tuner library and, for an apples-to-apples comparison with the other advanced methods, ran five trials of 50 evaluations each.

import time
import keras_tuner as kt

# set up keras tuner
max_trials = 50

tuner = kt.BayesianOptimization(
    hypermodel=build_cnn_model_keras,
    objective=kt.Objective("val_mae", direction="min"),  # ensures MAE is minimized
    max_trials=max_trials,
    executions_per_trial=1,
    overwrite=True,
    directory="my_tuning_dir",  # stores results
    project_name="cnn_bayesian_tuning"
)

# start searching
start_tuning_time = time.time()
tuner.search(X_train, y_train, epochs=50, validation_data=(X_val, y_val))
end_tuning_time = time.time()

# results
best_model = tuner.get_best_models(num_models=1)[0]

best_trial = tuner.oracle.get_best_trials(num_trials=1)[0]
best_hparams = best_trial.hyperparameters

best_mae = best_trial.metrics.get_last_value('val_mae')
total_time = end_tuning_time - start_tuning_time
eval_count = max_trials

# returned from the enclosing tuning function
return best_mae, eval_count, total_time, best_hparams, best_model

Results

Figure D. Comparison of five Bayesian Optimization trials (left: MAE, right: search time in seconds) (Created by Kuriko IWAI)


  • Best MAE: 0.2172 pixels (2.41% error range)

  • Average MAE: 0.2377 pixels (2.64% error range) / trial

  • Average Execution Time: 661.26 seconds / trial

  • Total Number of Evaluations: 50 times / trial

  • Optimal Hyperparameter Set:

    num_conv_layers: 1, dropout: 0, dense_units: 32, learning_rate: 0.0001, epochs: 1000, filters_0: 16, batch_norm_0: False

Bayesian Optimization slightly underperformed Random Search, with an average error range of 2.64% and 661 seconds (about 11 minutes) per trial.

This approach offered a balance between the Manual and Grid Search methods, notably generating continuous values like the learning_rate.

To enhance accuracy further, increasing epochs and max_trials could be considered.

Metaheuristic Algorithms

Metaheuristic algorithms are optimization techniques used to find good solutions to complex problems that are difficult or impossible to solve with traditional methods, especially when dealing with extremely large search spaces.

They do not guarantee finding the global optimum solution, but aim to find a "good enough" (near-optimal) solution within a reasonable amount of time.

Their key characteristics include:

  • Inspired by Nature/Heuristics: Draw inspiration from natural phenomena, biological processes, or human behavior.

  • Stochasticity: Incorporate randomness in their search process, exploring different parts of the solution space and avoiding getting stuck in local optima (hence good at handling non-conventional search spaces).

  • Balance of Exploration and Exploitation: Like Bayesian Optimization, they trade off exploring new regions against exploiting known good ones.

  • Problem Independence: Adaptable to a wide range of optimization problems with minimal modifications.

  • Iterative Process: They work by iteratively refining a set of candidate solutions over multiple generations or iterations.

  • Fitness Function: A fitness function (objective function) is used to evaluate the quality of each candidate solution (Maximizing or minimizing this fitness function is their optimization problem).

How they generally work (simplified steps):

  1. Initialization: A set of initial candidate solutions (called a population) is randomly generated within the problem's search space.

  2. Evaluation: Evaluates the fitness of each candidate solution using the objective function.

  3. Update/Improve: Generates a new, potentially better population with better fitness values by applying rules inspired by the underlying metaphor. The process may involve:

    • Mutation: Small, random changes to existing solutions to introduce diversity.

    • Crossover/Recombination: Combining parts of good solutions to create new ones.

    • Movement/Swarm Dynamics: Particles or agents moving through the search space based on their own best experience and the best experience of their neighbors/global best.

    • Acceptance Criteria: Deciding whether to accept a new solution (even if it's worse than the current one, especially in the early stages, to promote exploration).

  4. Selection: The fittest solutions are kept for the next iteration, while less fit solutions are discarded.

  5. Termination: Repeat Steps 2 to 4 for a fixed number of iterations or until convergence.
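The five steps above can be sketched as one generic loop. This is an illustrative skeleton rather than any specific algorithm; `fitness`, `init_candidate`, and `perturb` are hypothetical plug-ins that a concrete metaheuristic (GA, simulated annealing, etc.) would define.

```python
import random

def metaheuristic_search(fitness, init_candidate, perturb,
                         population_size=10, n_iterations=30, seed=0):
    """Generic Steps 1-5: initialize, evaluate, update, select, repeat."""
    rng = random.Random(seed)
    # Step 1: random initial population
    population = [init_candidate(rng) for _ in range(population_size)]
    for _ in range(n_iterations):
        # Step 2: evaluate each candidate (lower fitness = better, as with MAE)
        ranked = sorted(population, key=fitness)
        # Step 4: keep the fitter half ...
        survivors = ranked[: population_size // 2]
        # Step 3: ... and refill with perturbed (mutated) copies of survivors
        population = survivors + [perturb(rng.choice(survivors), rng)
                                  for _ in range(population_size - len(survivors))]
    # Step 5: return the best candidate after the final iteration
    return min(population, key=fitness)
```

For example, minimizing (x - 3)^2 over real x with small Gaussian perturbations converges to a value near 3 within a few dozen iterations.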

Types of Metaheuristic Algorithms:

Metaheuristic algorithms can be broadly categorized based on their inspiration:

  • Evolutionary Algorithms (EAs): Inspired by biological evolution and natural selection.

    • Genetic Algorithms: Mimic processes like mutation, crossover, and selection to evolve solutions.

    • Differential Evolution: Uses vector differences to create new candidate solutions.

    • Evolution Strategies: Focus on self-adaptation of mutation parameters.

  • Swarm Intelligence (SI) Algorithms: Inspired by the collective behavior of decentralized, self-organized systems.

    • Particle Swarm Optimization: Models the social behavior of bird flocking or fish schooling.

    • Ant Colony Optimization: Based on the foraging behavior of ants finding the shortest path to food.

    • Artificial Bee Colony: Simulates the intelligent foraging behavior of honeybee swarms.

  • Physics-Based Algorithms: Inspired by physical phenomena.

    • Simulated Annealing: Analogous to the annealing process in metallurgy, where a material is heated and then slowly cooled to reduce defects.

    • Gravitational Search Algorithm: Based on the law of gravity and mass interactions.

  • Human-Based Algorithms: Inspired by human behavior or social interactions.

    • Tabu Search: Uses a "tabu list" to avoid revisiting previously explored solutions.

    • Teaching-Learning-Based Optimization: Mimics the teaching and learning process in a classroom.

Best when:

  • Highly Complex, Non-Convex, and Discontinuous Search Spaces: When traditional gradient-based methods are inapplicable or inefficient at finding the global optimum.

  • Global Optimization is Crucial: Need to explore a diverse range of solutions and avoid getting stuck in local optima.

Limitations:

  • Can be computationally intensive.

  • Performance can depend on tuning their own model parameters.

The Search Space

I used the same search space as Random Search and Bayesian Optimization with 108,000 combinations.

Simulation

Here, I’ll take a common Metaheuristic algorithm, Genetic Algorithms (GA), for an example.

GAs mimic natural selection to evolve a species. In hyperparameter optimization, the "species" is a population of hyperparameter combinations, which evolve over generations by favoring those with "higher survival rates" (lower MAE scores in our case).

Step 1. Initialization: Creating a population

I set up an initial population with ten individuals (hyperparameter sets):

import random

def create_individual():
    hparam_set = dict()
    hparam_set['num_conv_layers'] = random.choice(hparams_options['num_conv_layers'])

    for i in range(hparam_set['num_conv_layers']):
        hparam_set[f'filters_{i}'] = random.choice(hparams_options['filters_0'])
        hparam_set[f'batch_norm_{i}'] = random.choice(hparams_options['batch_norm_0'])

    hparam_set['dense_units'] = random.choice(hparams_options['dense_units'])
    hparam_set['dropout'] = random.choice(hparams_options['dropout'])
    hparam_set['learning_rate'] = random.uniform(hparams_options['learning_rate'][0], hparams_options['learning_rate'][-1])
    hparam_set['epochs'] = random.choice(hparams_options['epochs'])
    return hparam_set

population_size = 10
population = [create_individual() for _ in range(population_size)]

Step 2. Evaluation: Computing fitness scores

Next, I computed fitness scores across all individuals in the population and chose the best hyperparameter set:

eval_count = 0
total_time = 0.0

best_mae = float('inf')
best_hparams = None
best_model = None

fitnesses = list()

for indiv in population:
    eval_count_returned, total_time_returned, mae, model = evaluate_model(hparams=indiv)
    fitnesses.append(mae)  # lower MAE = better fitness
    eval_count += eval_count_returned
    total_time += total_time_returned

    # track the best individual seen so far
    if mae < best_mae:
        best_mae = mae
        best_hparams = indiv
        best_model = model

Step 3: Iteration

The first step of the iteration is to build a new population by selecting the top 10% (or any percentile of your choice) of elites with the best fitness scores (lowest MAE):

import numpy as np

# initiate new_population
new_population = []

# find and add elites to the new population (lowest MAE first)
num_elites = max(1, int(0.1 * population_size))
elites = sorted(zip(population, fitnesses), key=lambda x: x[1])[:num_elites]
new_population.extend([e[0] for e in elites])

Then, update the population by generating and mutating offspring and elites:

import random

# select parents so the new_population keeps the population size of 10
num_parents_to_select = population_size - num_elites

# lower MAE means fitter, so invert the scores before fitness-proportional selection
inverse_fitnesses = [1.0 / (f + 1e-8) for f in fitnesses]
total_fitness = sum(inverse_fitnesses)
probabilities = [f / total_fitness for f in inverse_fitnesses]
indices = np.random.choice(
    len(population),
    size=num_parents_to_select,
    p=probabilities,
    replace=True
)
parents = [population[i] for i in indices]


# defines crossover function
def crossover(parent1, parent2, crossover_rate):
    offspring1 = parent1.copy()
    offspring2 = parent2.copy()

    if random.random() < crossover_rate:
        offspring1['num_conv_layers'] = random.choice([parent1['num_conv_layers'], parent2['num_conv_layers']])
        offspring2['num_conv_layers'] = random.choice([parent1['num_conv_layers'], parent2['num_conv_layers']])

    for param in ['dense_units', 'dropout', 'learning_rate', 'epochs']:
        if random.random() < crossover_rate:
            offspring1[param], offspring2[param] = parent2[param], parent1[param]

    # fill in any conv-layer hyperparameters the offspring are missing
    for offspring in [offspring1, offspring2]:
        for i in range(offspring['num_conv_layers']):
            if f'filters_{i}' not in offspring:
                offspring[f'filters_{i}'] = random.choice(hparams_options['filters_0'])
            if f'batch_norm_{i}' not in offspring:
                offspring[f'batch_norm_{i}'] = random.choice(hparams_options['batch_norm_0'])
    return offspring1, offspring2

# defines mutate function
def mutate(individual):
    mutated_individual = individual.copy()

    if random.random() < mutation_rate:
        mutated_individual['num_conv_layers'] = random.choice(hparams_options['num_conv_layers'])
        for i in range(max(hparams_options['num_conv_layers'])):
            if i < mutated_individual['num_conv_layers']:
                if f'filters_{i}' not in mutated_individual or random.random() < mutation_rate:
                    mutated_individual[f'filters_{i}'] = random.choice(hparams_options['filters_0'])
                if f'batch_norm_{i}' not in mutated_individual or random.random() < mutation_rate:
                    mutated_individual[f'batch_norm_{i}'] = random.choice(hparams_options['batch_norm_0'])
            else:
                # drop hyperparameters for layers that no longer exist
                mutated_individual.pop(f'filters_{i}', None)
                mutated_individual.pop(f'batch_norm_{i}', None)

    if random.random() < mutation_rate:
        mutated_individual['dense_units'] = random.choice(hparams_options['dense_units'])

    if random.random() < mutation_rate:
        mutated_individual['dropout'] = random.uniform(hparams_options['dropout'][0], hparams_options['dropout'][-1])

    if random.random() < mutation_rate:
        mutated_individual['learning_rate'] = random.uniform(hparams_options['learning_rate'][0], hparams_options['learning_rate'][-1])

    if random.random() < mutation_rate:
        mutated_individual['epochs'] = random.choice(hparams_options['epochs'])
    return mutated_individual


# iterate the process of updating the population
crossover_rate = 0.8
mutation_rate = 0.1

for i in range(0, len(parents) - 1, 2):  # step by pairs; guard against an odd parent count
    if len(new_population) >= population_size: break

    # crossover parent 1 and 2
    parent1 = parents[i]
    parent2 = parents[i + 1]
    offspring1, offspring2 = crossover(parent1, parent2, crossover_rate)

    # add mutated offspring to the new_population
    new_population.append(mutate(offspring1))
    if len(new_population) < population_size: new_population.append(mutate(offspring2))

# fill any remaining slots with mutated copies of elites
while len(new_population) < population_size:
    new_population.append(mutate(random.choice(elites)[0]))

# update the population
population = new_population

Step 4 & 5: Selecting the best parameters

Finally, iterate the process for five generations, creating 10 combinations (individuals) in each, to find the best one:

n_generations = 5

for _ in range(n_generations):
    # step 2: evaluate the current population
    fitnesses = list()
    for indiv in population:
        eval_count_returned, total_time_returned, mae, model = evaluate_model(hparams=indiv)
        fitnesses.append(mae)
        eval_count += eval_count_returned
        total_time += total_time_returned

        if mae < best_mae:
            best_mae = mae
            best_hparams = indiv
            best_model = model

    # step 3: build the next generation
    new_population = []
    num_elites = max(1, int(0.1 * population_size))
    elites = sorted(zip(population, fitnesses), key=lambda x: x[1])[:num_elites]
    new_population.extend([e[0] for e in elites])
    num_offspring = population_size - num_elites
    parents = select_parents(population, fitnesses, num_offspring)  # fitness-proportional selection from Step 3
    for i in range(0, len(parents) - 1, 2):
        if len(new_population) >= population_size: break
        parent1 = parents[i]
        parent2 = parents[i + 1]
        offspring1, offspring2 = crossover(parent1, parent2, crossover_rate)
        new_population.append(mutate(offspring1))
        if len(new_population) < population_size: new_population.append(mutate(offspring2))

    while len(new_population) < population_size:
        new_population.append(mutate(random.choice(elites)[0]))

    population = new_population

# returned from the enclosing tuning function
return best_mae, eval_count, total_time, best_hparams, best_model

Results

Figure E. Comparison of five Genetic Algorithm trials (left: MAE, right: search time in seconds) (Created by Kuriko IWAI)


  • Best MAE: 0.2120 pixels (2.36% error range)

  • Average MAE: 0.2454 pixels (2.73% error range) / trial

  • Average Execution Time: 188.81 seconds / trial

  • Total Number of Evaluations: 50 times / trial

  • Optimal Hyperparameter Set:

    num_conv_layers: 1, dropout: 0, dense_units: 32, learning_rate: 0.0001, epochs: 1000, filters_0: 16, batch_norm_0: False

The Genetic Algorithm demonstrated a broad range of performance, generating a best error rate of 2.36% that surpassed Grid Search. However, its worst error range was the highest among all five methods evaluated.

Its second trial took considerably more time to converge on an optimal solution, likely due to the inherent stochastic nature of genetic algorithms, which explore the search space more dynamically and less systematically than methods like Grid Search.

While this allows for escaping local optima and finding better global solutions in some cases, it can also lead to variability in performance and convergence speed across different runs or trials.

Consequently, the average MAE for the Genetic Algorithm is the highest among the four automation methods.

It also demonstrated the quickest average time per trial, completing the evaluation of 50 CNN models in 188 seconds (about 3 minutes), making it the most time-efficient method.

To further boost accuracy, we can run more trials and/or increase n_generations to explore a more diverse population.

Final Evaluation on Generalization Capability and Cost Efficiency

In evaluating the five methods, the generalization performance, as measured by the MAE on the test dataset, is the most critical metric, indicating a model's ability to perform well on unseen data.

Time efficiency is also a significant consideration, especially when computational resources are limited.

The table below illustrates the best results and the search strategies I took for each method.

Figure F. Performance comparison of hyperparameter tuning methods over CNN (Created by Kuriko IWAI) (Generalization: MAE scores on test dataset, Total Time: Total search time in seconds, Ave. Time: Average time in seconds per evaluation (Total Time / Evaluations))


Among the automated methods, Random Search and Bayesian Optimization achieved the best generalization performance with an error rate of approximately 2.53%.

Their strong generalization was further supported by the minimal difference between their best MAE during the search and their generalization MAE.

In contrast, the Genetic Algorithm showed a slight tendency to overfit, evidenced by a 0.45 percentage point gap between its search and generalization error rates.

Grid Search, while an improvement over Manual Search, ended up with the poorest generalization, suggesting that expanding its search space could enhance performance.

Regarding time efficiency, the Genetic Algorithm proved to be the most efficient automated method with an average time per evaluation of 3.65 seconds and a total time of 182.50 seconds for 50 evaluations.

Random Search was also highly efficient, with an average time of 4.10 seconds and a total time of 205.04 seconds.

In contrast, Bayesian Optimization, despite its competitive generalization, was the least time-efficient, with an average evaluation time of 14.19 seconds and a substantial total time of 709.70 seconds.

This reflects the nature of Bayesian Optimization: it is designed to be sample-efficient, requiring fewer evaluations of an expensive objective function, but it introduces its own computational overhead from building and optimizing the probabilistic surrogate model (the Gaussian Process).

When the individual objective function evaluations are not so costly, this overhead can make the total time longer than simpler methods like Random Search.

In conclusion, Random Search achieved the best generalization and time efficiency for this experiment.

Case Study: Simple Machine Learning Models Trained on Tabular Data

Now, with an understanding of how each method behaves on complex models, in this section I'll explore the same methods on a simpler model: a Kernel SVM trained on a tabular dataset with 3,000 data points.

The Tabular Data

I created train, validation, test datasets from a 3,000 synthetic tabular dataset with 10 features:

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 3000
n_features = 10
test_size = 300
random_state = 42

np.random.seed(random_state)
X = np.zeros((n_samples, n_features))
X[:, 0] = np.random.uniform(-5, 5, n_samples)
X[:, 1] = np.random.normal(0, 3, n_samples)
X[:, 2] = np.random.uniform(-10, 10, n_samples)
if n_features > 3:
    X[:, 3] = np.random.normal(5, 2, n_samples)

y = np.zeros(n_samples)
y += 2 * X[:, 0] + 5
y += 0.5 * (X[:, 1] ** 2) - 3 * X[:, 1]
y += 10 * np.sin(X[:, 2] / 2)
y += 0.7 * X[:, 0] * X[:, 3]
noise = np.random.normal(0, 2, n_samples) * (1 + 0.1 * np.abs(X[:, 0]))
y += noise

X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=test_size, random_state=random_state)


# scales numerical features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape)

(2400, 10) (2400,) (300, 10) (300,) (300, 10) (300,)

The range of the target variable: -47.65321223180952, 101.87027635511603

The Search Space

Taking similar strategies to the CNN experiment, I defined a broader search space with 10,752 combinations for Random Search, Bayesian Optimization, and the Genetic Algorithm:

hparams_options = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['auto', 'scale', 1, 0.1, 0.05, 0.01, 0.001],
    'tol': [0.1, 0.05, 0.01, 0.001],
    'C': [0.1, 0.5, 1, 10],
    'epsilon': [0.1, 0.01, 0.001, 0.0001],
    'max_iter': [30, 50, 100, 200, 500, 1000]
}

Note: A Kernel SVM is much simpler than a CNN, with a limited number of hyperparameter combinations (about 10K) compared to CNNs (2+ million).
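The 10,752 figure is simply the product of the option counts in the `hparams_options` dictionary above:

```python
from math import prod

# number of options per hyperparameter in the broad search space above
option_counts = {'kernel': 4, 'gamma': 7, 'tol': 4, 'C': 4, 'epsilon': 4, 'max_iter': 6}
n_combinations = prod(option_counts.values())
print(n_combinations)  # 10752
```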

For Grid Search, I narrowed down the search space:

hparams_options = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['auto', 'scale', 1, 0.1, 0.05, 0.01, 0.001],
    'tol': [0.01, 0.001],
    'C': [0.1, 1, 10],
    'epsilon': [0.001, 0.0001],
    'max_iter': [100, 200, 500]
}

Evaluating Models

While optional, K-Fold cross-validation is recommended for simpler scikit-learn models to ensure stable performance.

I defined the build_evaluate_model function to handle the building, training, and evaluation of each model using K-Fold of five:

import time
from sklearn.svm import SVR
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error


def build_evaluate_model(hparams: dict, eval_count: int = 0, total_time: float = 0.0, n_splits: int = 5) -> tuple:
    kernel = hparams.get('kernel', 'rbf')
    degree = hparams.get('degree', 3)
    gamma = hparams.get('gamma', 'scale')
    coef0 = hparams.get('coef0', 0)
    tol = hparams.get('tol', 0.001)
    C = hparams.get('C', 1)
    epsilon = hparams.get('epsilon', 0.1)
    max_iter = hparams.get('max_iter', 100)

    start_time = time.time()

    # run k-fold cross validation
    best_mae, best_model = float('inf'), None
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

    for train_index, val_index in kf.split(X_train):
        X_train_fold, X_val_fold = X_train[train_index], X_train[val_index]
        y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]

        # build and train the model
        model = SVR(
            kernel=kernel,
            degree=degree,
            gamma=gamma,
            coef0=coef0,
            tol=tol,
            C=C,
            epsilon=epsilon,
            max_iter=max_iter,
            verbose=False,
        ).fit(X_train_fold, y_train_fold)

        # validate the model
        y_pred_val_kf = model.predict(X_val_fold)
        mae_val_kf = mean_absolute_error(y_val_fold, y_pred_val_kf)
        eval_count += 1

        # update model and mae
        if mae_val_kf < best_mae:
            best_mae = mae_val_kf
            best_model = model

    end_time = time.time()
    total_time += (end_time - start_time)

    return eval_count, total_time, best_mae, best_model

Results

The results demonstrate an improvement in model performance as more sophisticated optimization strategies were employed; Bayesian Optimization achieved the best generalization error (MAE: 2.5194 / 1.68% error range), while manual tuning performed the worst.

Figure G-1. Performance comparison of hyperparameter tuning methods over Kernel SVM (Created by Kuriko IWAI) (Generalization: MAE scores on test dataset, (): error range, Total Time: Total search time in seconds, Ave. Time: Average time in seconds per evaluation (Total Time / Evaluations))


Figure G-2. Visualization of generalization error (MAE) and total search time by tuning method on Kernel SVMs (Created by Kuriko IWAI)


Now, take a look at the performance of each method.

Manual Search

While the least computationally demanding, Manual Search generated the poorest predictive accuracy (MAE of 8.1905 / 5.48% error range), highlighting its limitations for effective hyperparameter tuning.

Best Hyperparameter Set: Kernel: Linear, gamma: 0.1, tol: 0.01, C: 1.0, epsilon: 0.1, max_iter: 50

Grid Search

Grid Search marked a substantial improvement in MAE (2.8035 / 1.87% error range) over manual tuning. However, its exhaustive search required a high number of evaluations (5,040 = 1,008 patterns x 5 folds) and the longest real-time execution (86.08 seconds), making it inefficient for larger search spaces.

Best Hyperparameter Set: Kernel: RBF, gamma: 1, tol: 0.01, C: 10.0, epsilon: 0.001, max_iter: 500

Random Search

Though cost efficient, Random Search performed the worst among the automated methods in this experiment, resulting in an error range of 3.96% after 250 evaluations (50 patterns x 5 folds). It might need more trials to find better options given the vast search space.

Best Hyperparameter Set: Kernel: RBF, gamma: 1, max_iter: 500, tol: 0.020284023357958703, C: 1.1065135587026282, epsilon: 0.02046763470290644
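For reference, the Random Search loop for this SVM case can be sketched as follows. This is a minimal sketch assuming the `hparams_options` and `build_evaluate_model` defined earlier; the wrapper name `random_search` is my own.

```python
import random

def random_search(hparams_options, build_evaluate_model, n_trials=50, seed=42):
    """Sample n_trials random hyperparameter sets and keep the best by MAE."""
    rng = random.Random(seed)
    best_mae, best_hparams, best_model = float('inf'), None, None
    eval_count, total_time = 0, 0.0
    for _ in range(n_trials):
        # draw one value per hyperparameter uniformly from its option list
        hparams = {k: rng.choice(v) for k, v in hparams_options.items()}
        eval_count, total_time, mae, model = build_evaluate_model(
            hparams, eval_count=eval_count, total_time=total_time)
        if mae < best_mae:
            best_mae, best_hparams, best_model = mae, hparams, model
    return best_mae, eval_count, total_time, best_hparams, best_model
```

With 5-fold cross-validation inside `build_evaluate_model`, 50 trials yield the 250 evaluations reported above.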

Bayesian Optimization (Scikit-Optimize)

Bayesian Optimization emerged as the top performer in terms of predictive accuracy, generating the lowest MAE (2.6782) and strong generalization (MAE: 2.5194 / 1.68% error range).

It also achieved this with the fewest evaluations (250) among the automated methods.

However, its probabilistic modeling made it the most time-consuming approach, resulting in the highest average time per evaluation (0.26 seconds).

Best Hyperparameter Set: Kernel: RBF, degree: 1, gamma: 0.5728719373151481, tol: 0.041556522143005495, coef0: 0.15500919954372303, C: 10.0, epsilon: 0.000510393784855045, max_iter: 3,000

Genetic Algorithm

The Genetic Algorithm demonstrated moderate performance (MAE of 4.2248 / 2.83% error range), comparable to Random Search, also with the fewest evaluations (250) and a reasonable real-time execution (11.63 seconds).

Best Hyperparameter Set: Kernel: RBF, gamma: scale, tol: 0.01, C: 10.0, epsilon: 0.001, max_iter: 2,000

In summary, when considering the trade-off between computational time and predictive results, Random Search stands out as providing the best balance, offering strong performance with reasonable computational cost.

Bayesian Optimization, while achieving the absolute lowest MAE, comes with a higher computational overhead per evaluation, making it a good option when we prioritize best peak performance over overhead.

The choice among these methods ultimately depends on the specific project's priorities, balancing the performance gains against the available computational resources and time constraints.

Wrapping Up

In our experiments, we observed varying effectiveness among hyperparameter tuning methods.

For complex models like CNNs, our findings suggest that advanced methods like Genetic Algorithms yield the best performance. However, this often comes at a higher computational cost.

Alternatively, for scenarios where computational resources are limited, methods like Random Search or Bayesian Optimization offer a compelling alternative for CNNs, providing reasonably good results without the extensive time commitment.

Interestingly, for simpler models like Kernel SVM, Random Search might be sufficient to quickly identify optimal hyperparameters.

Overall:

  • Start with a quick Manual Search.

  • To find the global optimum, especially in a small search space, try Grid Search.

  • When a sub-optimal solution is good enough, try stochastic methods:

    • For costly objective functions, Bayesian Optimization is a highly effective option that balances computational cost and the model's generalization performance.

    • For moderately expensive objective functions, Random Search is more efficient than Bayesian Optimization due to its simplicity.

    • For non-convex, complex search spaces, Genetic Algorithms are effective.

If computational resources allow, our experiments suggest running multiple trials with stochastic methods to normalize results. Subsequently, selecting the most common hyperparameter values for the final model can be a viable strategy.
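One simple way to implement that "most common value" strategy is a per-hyperparameter majority vote across the best sets returned by each trial. This is a sketch; the helper name `consensus_hyperparameters` is my own.

```python
from collections import Counter

def consensus_hyperparameters(trial_results):
    """Pick the most common value per hyperparameter across trials.

    trial_results: a list of dicts, one best hyperparameter set per trial.
    """
    keys = set().union(*trial_results)
    return {k: Counter(t[k] for t in trial_results if k in t).most_common(1)[0][0]
            for k in keys}
```

For example, if two of three trials agree on C = 10 with an RBF kernel, those values win the vote for the final model.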

The final choice depends on model complexity, desired predictive power, and practical compute constraints in each project.

Continue Your Learning

If you enjoyed this blog, these related entries will complete the picture:

Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Share What You Learned

Kuriko IWAI, "A Comparative Guide to Hyperparameter Optimization Strategies" in Kernel Labs

https://kuriko-iwai.com/mastering-hyperparameter-tuning


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.