A Deep Dive into KNN Optimization and Distance Metrics

Strategies for distance computation and optimal "k" selection

Machine Learning | Data Science | Python

By Kuriko IWAI


Table of Contents

Introduction
What is K Nearest Neighbor (KNN)
How the KNN Algorithm Works
1. Classification Task
2. Regression Task
How to Compute Distance - Choosing Distance Metrics
Euclidean Distance (L2 Distance)
Manhattan Distance (L1 Distance)
Minkowski Distance (Generalized Distance)
Cosine Distance
Hamming Distance
Jaccard Distance
The Walkthrough Example - Manhattan Distance vs Euclidean Distance
Choosing the Optimal Value for “k”
Simulation
Preprocessing
Defining Pipeline
Finding Optimal k
Grid Search
Model Training and Inference
Results
Conclusion

Introduction

K-Nearest Neighbor (KNN) is a supervised machine learning algorithm widely used for classification and regression tasks.

However, selecting the optimal value for k is challenging because it highly depends on the dataset's characteristics, and the computational cost of finding the nearest neighbors can be high, especially with large datasets.

In this article, I’ll explore the distance computation and k-optimization tactics, while deep diving into the core of KNN algorithms.

What is K Nearest Neighbor (KNN)

K-nearest neighbor is a supervised learning algorithm that approximates a function by computing the distances between a query point and the training samples, then predicting from its ‘k’ nearest neighbors.

Key characteristics include:

  • Supervised Learning: It requires labeled training data to learn from.

  • Non-parametric: Makes no assumptions on the underlying distribution of the data.

  • Versatility: Handles both regression (KNN regression, or NN smoothing) and classification (KNN classifier).

  • Lazy Learning: Stores the training data and performs computations only at prediction time; there is no explicit training phase.

  • Simplicity: Easy to implement, making KNN a good starting point for many machine learning problems.

KNNs are used in various fields including image recognition, recommendation systems, financial forecasting, and fraud detection.
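As a quick illustration of the simplicity noted above, a complete classifier fits in a few lines with scikit-learn; the toy clusters and the choice of k below are illustrative assumptions, not from a real use case:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy 2-D dataset: two loose clusters, labeled 0 and 1 (illustrative values)
X_train = np.array([[1.0, 1.2], [1.5, 0.8], [1.2, 1.0],
                    [4.0, 4.2], [4.5, 3.8], [4.2, 4.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# fit() only stores the data; distances are computed at prediction time (lazy learning)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# a query near the first cluster is assigned class 0
print(knn.predict([[1.3, 1.1]]))  # -> [0]
```

The rest of the article unpacks what happens inside that `predict` call.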

How the KNN Algorithm Works

KNN’s core concept lies in similarity and distance approximation, much as we assume that people in the same neighborhood likely share a similar socioeconomic status.

So, it is crucial to define the ‘neighbor’ — by deciding how to compute the distance and how many neighbors to consider.

In this section, I’ll cover two methods for computing distance — weighted and unweighted — as applied to both classification and regression tasks.

Then in the later sections, I’ll demonstrate the distance computation and k optimization.

1. Classification Task

The main goal of the KNN classifier is to find the class (ŷ) that appears most frequently among the k nearest neighbors of a given query point (x_q).

Unweighted KNN

The unweighted KNN simply counts the occurrences of each class among the k nearest neighbors and selects the class with the highest count as its final prediction.

The figure below shows that the query point (green) is classified as the red triangle class if k=3, and as the blue rectangle class if k=5, simply because more samples from that class exist among the selected neighbors.

Figure A. An example of KNN classification. The test sample (x_q in green dot) is assigned to the red triangles if k=3 (solid line circle), and assigned to the blue squares if k=5 (dashed line circle) (Created by Kuriko IWAI)

Mathematically, this process is defined using an indicator function:

\hat y_q = \arg \max_{c \in \text{classes}} \sum_{x_i \in N_k(x_q)} I(y_i = c)

where

  • ŷ_q: the predicted class label for the query point x_q,

  • N_k(x_q): the set of the k nearest neighbors to the query point x_q,

  • c: a specific class label,

  • y_i: the actual label of the neighbor x_i, and

  • I(⋅): the indicator function (returns one if the condition is true, else zero).

In unweighted KNNs, the calculated distance is only used to identify the nearest neighbors. Once those neighbors are identified, their specific distances are ignored for the prediction step. Hence, we don’t see any distance metric in the formula of y^. To simplify, the formula can be defined as the mode (most frequent value) of the labels of k nearest neighbors:

\hat y_q = \text{Mode}\{y_i \mid x_i \in N_k(x_q)\}

Essentially, the two formulas yield the same result for ŷ.

The code block below demonstrates how the model works when k = 5:

from collections import Counter

dataset = [
    {'label': 0, 'dist': 10},
    {'label': 0, 'dist': 30},
    {'label': 1, 'dist': 5},
    {'label': 0, 'dist': 10},
    {'label': 0, 'dist': 22},
    {'label': 1, 'dist': 45},
    {'label': 1, 'dist': 18},
    {'label': 0, 'dist': 7},
    {'label': 0, 'dist': 10},
    {'label': 1, 'dist': 0}, # distance = zero
]

n_neighbors = 5
dataset.sort(key=lambda x: x['dist'])

# takes the 5 closest data points from the dataset.
k_nearest_neighbors = dataset[:n_neighbors]
k_neighbor_labels = [item['label'] for item in k_nearest_neighbors]

# finds the label with the highest frequency.
most_common = Counter(k_neighbor_labels).most_common(1)
print(f'Selected Label: {most_common[0][0]}, Count {most_common[0][1]}')

# Selected Label: 0, Count 3

It classified the query point as class zero because class zero outnumbers class one among the five nearest neighbors.

Weighted KNN

On the other hand, a weighted KNN applies a weighting scheme so that closer neighbors contribute more to the vote.

In the figure below, the same query point is classified as the red triangle class even when k=5, because the two red triangles are closest to the query point and hence carry larger weights in the overall vote:

Figure B. An example of weighted KNN classification. The test sample (x_q in green dot) is assigned to the red triangles if k=3 (solid line circle) or k=5 (dashed line circle), as the red triangles are closer to the test sample. (Created by Kuriko IWAI)

Mathematically, this process is defined by adding the weight term W(D) to the unweighted KNN formula:

\hat{y}_q = \arg \max_{c \in \text{Classes}} \sum_{x_i \in N_k(x_q)} \underbrace{W(D(x_q, x_i))}_{\text{weight}} \cdot I(y_i = c)

A common weighting scheme uses the inverse of the distance or squared distance:

W(D) = \frac{1}{D(x_q, x_i)} \quad \text{or} \quad W(D) = \frac{1}{D(x_q, x_i)^2}

where D(x_q, x_i) denotes the distance from the query point (x_q) to a neighboring point (x_i) in the sample dataset.

The formula for W(D) shows that closer neighbors (smaller D) receive a larger weight, while farther neighbors (larger D) receive a smaller weight.

This distance is crucial for the weighted classification scheme because it determines how much each neighbor contributes to the final vote.

***

Using the same dataset as in the unweighted case, I’ll demonstrate how the weights influence the final prediction.

In the code block below, the weighted KNN classifies the query point as class one instead of class zero, because it assigns the largest weight to the data point whose distance to the query point is zero.

dataset = [
    {'label': 0, 'dist': 10},
    {'label': 0, 'dist': 30},
    {'label': 1, 'dist': 5},
    {'label': 0, 'dist': 10},
    {'label': 0, 'dist': 22},
    {'label': 1, 'dist': 45},
    {'label': 1, 'dist': 18},
    {'label': 0, 'dist': 7},
    {'label': 0, 'dist': 10},
    {'label': 1, 'dist': 0}, # distance = 0
]

n_neighbors = 5
dataset.sort(key=lambda x: x['dist'])
k_nearest_neighbors = dataset[:n_neighbors]

# adds weights
weighted_votes = {}
for item in k_nearest_neighbors:
    d, l = item['dist'], item['label']

    # weight = inverse of distance; a zero-distance neighbor gets a large
    # capped weight (0.999) instead of dividing by zero.
    weight = 1 / d if d != 0 else 0.999
    weighted_votes[l] = weighted_votes.get(l, 0) + weight

predicted_label = max(weighted_votes, key=weighted_votes.get)
print(f'Selected Label: {predicted_label}, Weighted Vote: {weighted_votes[predicted_label]:.4f}')

# Selected Label: 1, Weighted Vote: 1.1990

2. Regression Task

In regression problems, the predicted value is typically the average of the values of the k nearest neighbors.

Unweighted KNN

The predicted value ŷ_q is the mean of the values of the k nearest neighbors:

\hat y_q = \frac{1}{k} \sum_{x_i \in N_k(x_q)} y_i

where:

  • ŷ_q: the predicted value for the new query point x_q,

  • x_q: the new query point for which we want to predict a value,

  • N_k(x_q): the set of the k nearest data points to the query point x_q,

  • x_i: a data point within the set of the k nearest neighbors, and

  • y_i: the known, true value (the label or target) corresponding to the data point x_i.

Weighted KNN

Similar to the classification problem, closer neighbors contribute more to the average:

\hat{y}_q = \frac{\sum_{x_i \in N_k(x_q)} W(D(x_q, x_i)) \cdot y_i}{\sum_{x_i \in N_k(x_q)} W(D(x_q, x_i))}

where ŷ_q, x_q, N_k(x_q), x_i, and y_i are as defined in the unweighted case, and:

  • D(x_q, x_i): the distance between the query point x_q and a neighboring data point x_i, and

  • W(D(x_q, x_i)): the weight function that assigns a weight to each neighbor based on its distance from the query point.

The function W adjusts the weights by distance: data points closer to the query point x_q receive a higher weight, and those farther away receive a lower weight.

The code below shows a KNN regressor predicting a continuous value of 101.02 with unweighted averaging, and 98.90 with distance weighting, where the data point at zero distance to the query point dictates the prediction:

import numpy as np

dataset = [
    {'val': 100.5, 'dist': 10},
    {'val': 120.0, 'dist': 30},
    {'val': 90.2, 'dist': 5},
    {'val': 110.1, 'dist': 10},
    {'val': 130.8, 'dist': 22},
    {'val': 80.3, 'dist': 45},
    {'val': 95.7, 'dist': 18},
    {'val': 105.4, 'dist': 7},
    {'val': 115.6, 'dist': 10},
    {'val': 98.9, 'dist': 0}, # distance = 0
]

n_neighbors = 5
dataset.sort(key=lambda x: x['dist'])
k_nearest_neighbors = dataset[:n_neighbors]

# unweighted knn
print('--- KNN Unweighted Regression ---')
neighbor_values = [item['val'] for item in k_nearest_neighbors]
predicted_value_unweighted = np.mean(neighbor_values)
print(f'Neighbor values: {neighbor_values}')
print(f'Predicted Value (Unweighted): {predicted_value_unweighted:.2f}')

# weighted knn
print('--- KNN Weighted Regression ---')
total_weighted_sum = 0
total_weight = 0
for item in k_nearest_neighbors:
    d, v = item['dist'], item['val']
    # if the distance is zero, return the value of that sample directly
    if d == 0:
        predicted_value_weighted = v
        print('Found a neighbor with distance 0.')
        break
    # otherwise, accumulate the inverse-distance weighted average
    weight = 1 / d
    total_weighted_sum += v * weight
    total_weight += weight
    predicted_value_weighted = total_weighted_sum / total_weight

print(f'Predicted Value (Weighted): {predicted_value_weighted:.2f}')

--- KNN Unweighted Regression ---
Neighbor values: [98.9, 90.2, 105.4, 100.5, 110.1]
Predicted Value (Unweighted): 101.02

--- KNN Weighted Regression ---
Found a neighbor with distance 0.
Predicted Value (Weighted): 98.90

How to Compute Distance - Choosing Distance Metrics

In KNNs, the choice of distance metric depends on the nature of the data. I’ll list the major metrics and their use cases.

In the notation, d(x_q, x_i) denotes the distance between a query point (x_q) and a training point (x_i), and f_qj, f_ij denote the j-th feature values of the query point and the training point in D-dimensional space.

Euclidean Distance (L2 Distance)

d(x_q, x_i) = \sqrt{\sum_{j=1}^{D} (f_{qj} - f_{ij})^2}

  • Measures the shortest straight-line path between two points in D-dimensional space.

  • The most common and intuitive distance metric

  • Ideal when features represent dimensions in a continuous space and the actual physical distance between points is meaningful.

  • Requires feature scaling (e.g., standardization or normalization) to prevent features with larger ranges from dominating the distance calculation.

euclidean_dist = np.sqrt(np.sum((current_val - target_val)**2))
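Because the squared differences are summed across features, a feature with a larger numeric range dominates the metric, which is why scaling matters. The sketch below (with made-up income and age values) shows the nearest neighbor flipping once the features are standardized:

```python
import numpy as np

# hypothetical points: [annual income, age]
X = np.array([[50_000.0, 60.0],   # A: similar income, very different age
              [60_000.0, 25.0],   # B: different income, similar age
              [70_000.0, 70.0]])  # C: far on both features
query = np.array([50_500.0, 24.0])

# unscaled Euclidean distance: income's large range dominates, so A looks closest
d_raw = np.linalg.norm(X - query, axis=1)

# after standardization, age contributes equally and B becomes the nearest
mu, sigma = X.mean(axis=0), X.std(axis=0)
d_scaled = np.linalg.norm((X - mu) / sigma - (query - mu) / sigma, axis=1)

print('nearest before scaling:', np.argmin(d_raw))
print('nearest after scaling:', np.argmin(d_scaled))
```

With the raw values, point A wins purely on income; after standardization, the 36-year age gap to A outweighs the income gap to B.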

Manhattan Distance (L1 Distance)

d(x_q, x_i) = \sum_{j=1}^{D} |f_{qj} - f_{ij}|

  • Measures the sum of the absolute differences between the coordinates of two points, like navigating a city grid.

  • Preferable when the feature dimensions are not directly comparable or when movements are constrained to axis-parallel directions.

manhattan_dist = np.sum(np.abs(current_val - target_val))

Minkowski Distance (Generalized Distance)

d(x_q, x_i) = \left(\sum_{j=1}^{D} |f_{qj} - f_{ij}|^P\right)^{1/P}

  • Generalized metric that encompasses both Euclidean and Manhattan distances.

  • If P=1, becomes the Manhattan Distance.

  • If P=2, becomes the Euclidean Distance.

  • As P→∞, approaches the Chebyshev Distance.

  • Useful to experiment with different distance sensitivities.

  • Choosing an optimal P value can be a part of hyperparameter tuning for KNN.

P_minkowski = 3
minkowski_dist = np.power(
    np.sum(np.power(np.abs(current_val - target_val), P_minkowski)),
    1/P_minkowski
)
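As a quick numerical check of the limiting behavior, the Minkowski distance with growing P converges to the Chebyshev distance (the largest per-feature difference); the vectors below are arbitrary examples:

```python
import numpy as np

current_val = np.array([1.0, 5.0, 2.0])
target_val = np.array([4.0, 1.0, 2.0])
diffs = np.abs(current_val - target_val)  # per-feature gaps: [3, 4, 0]

# Chebyshev distance: the maximum coordinate-wise difference
chebyshev = np.max(diffs)

# Minkowski distance shrinks toward that maximum as P grows
for P in (1, 2, 10, 100):
    minkowski = np.sum(diffs ** P) ** (1 / P)
    print(f'P={P:>3}: {minkowski:.4f}')

print(f'Chebyshev: {chebyshev:.1f}')  # 4.0
```

At P=1 the result is 7 (Manhattan), at P=2 it is 5 (Euclidean), and by P=100 it is numerically indistinguishable from the Chebyshev value of 4.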

Cosine Distance

d(x_q, x_i) = 1 - \underbrace{\frac{\sum_{j=1}^{D} f_{qj} f_{ij}}{\sqrt{\sum_{j=1}^{D} f_{qj}^2}\,\sqrt{\sum_{j=1}^{D} f_{ij}^2}}}_{\text{Cosine Similarity}}

  • Quantifies the cosine of the angle between two non-zero vectors, focusing on their orientation or direction instead of magnitude.

  • A similarity of 1 means the exact same direction (perfectly similar).

  • A similarity of 0 means orthogonal (no similarity).

  • A similarity of -1 means pointing in opposite directions (perfectly dissimilar).

  • Effective with high-dimensional sparse data, like text documents (where vectors represent word frequencies) or user ratings in recommendation systems (where vectors represent ratings), as it effectively handles varying document lengths or rating scales.

import numpy as np

# computes cosine similarity
dot_product = np.dot(current_val, target_val)
norm_current_val = np.linalg.norm(current_val)
norm_target_val = np.linalg.norm(target_val)
cosine_similarity = dot_product / (norm_current_val * norm_target_val) if norm_current_val * norm_target_val else 0

# computes cosine distance
cosine_dist = 1 - cosine_similarity

Hamming Distance

d(x_q, x_i) = \sum_{j=1}^{D} I(f_{qj} \neq f_{ij})

  • Counts the number of features with different values between the query point and a sample.

  • Primarily suited for comparing binary vectors (e.g., presence/absence of features) or categorical data that can be directly compared for equality (e.g., one-hot encoded features).

import numpy as np

# define two binary vectors of equal length
vector1 = np.array([1, 0, 1, 1, 0, 1])
vector2 = np.array([1, 1, 0, 1, 0, 0])

# counts the number of positions where the elements are different
hamming_distance = np.sum(vector1 != vector2)

Jaccard Distance

d(A,B) = 1 - \underbrace{\frac{|A \cap B|}{|A \cup B|}}_{\text{similarity}(A,B)} = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}

  • Measures the similarity between two sets as the size of their intersection divided by the size of their union.

  • Emphasizes the shared "presences" of features, largely ignoring shared "absences."

  • Useful for comparing binary attributes where a match on '0' (absence) is not considered a sign of similarity.

  • Common in applications like comparing shopping baskets, user interests, or document similarity based on unique word sets.

import numpy as np

# x_q: user A's preference, x_i: user B's preference.
user_a_prefs = np.array([1, 0, 1, 1, 0])
user_b_prefs = np.array([1, 1, 0, 1, 0])

n11 = np.sum((user_a_prefs == 1) & (user_b_prefs == 1))
n10 = np.sum((user_a_prefs == 1) & (user_b_prefs == 0))
n01 = np.sum((user_a_prefs == 0) & (user_b_prefs == 1))

jaccard_similarity = n11 / (n11 + n10 + n01) # only considers shared presences (n11)
jaccard_distance = 1 - jaccard_similarity

The Walkthrough Example - Manhattan Distance vs Euclidean Distance

Manhattan distance and Euclidean distance are major distance metrics in KNNs.

Let us imagine a scenario where an inventory manager needs to prioritize stock replenishment based on the current stock levels of Products A and B.

The Target:

  • Factory 1: Current Stock Level p(A, B) = (100, 50)

  • Factory 2: Current Stock Level p(A, B) = (105, 61)

  • Target Stock Level for Products A and B: q(A, B) = (90, 60)
    *Both factories share the same target.

The distance metrics give us different conclusions:

1. Manhattan Distance

  • Factory 1: D_1 = |100−90| + |50−60| = 20 units ← prioritize

  • Factory 2: D_2 = |105−90| + |61−60| = 16 units

Conclusion based on the Manhattan Distance

Factory 2 has a smaller Manhattan distance (16 units) compared to Factory 1 (20 units).

This means Factory 2 is closer to the target stock level in terms of total units requiring adjustment.

If the inventory manager prioritizes based on the total quantity of units that need adjustment, then Factory 2 would be a lower priority for replenishment.

2. Euclidean Distance

  • Factory 1: D_1 = √((100−90)² + (50−60)²) = √200 ≈ 14.14

  • Factory 2: D_2 = √((105−90)² + (61−60)²) = √226 ≈ 15.03 ← prioritize

Conclusion based on Euclidean Distance

Factory 1 has a smaller Euclidean distance (14.14) compared to Factory 2 (15.03).

This means that Factory 1 is closer to the target when considering the magnitude of individual deviations.

So, if the inventory manager wants to heavily penalize larger individual stock discrepancies (e.g., being significantly overstocked on one product), then Factory 1 would be a lower priority for replenishment.

The results demonstrate how Euclidean distance penalizes larger deviations: Factory 2's 15-unit gap in Product A contributes 225 to the sum before the square root, whereas Manhattan distance penalizes it only linearly (15).
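The arithmetic above can be reproduced in a few lines of NumPy, using the stock levels from the scenario:

```python
import numpy as np

factory_1 = np.array([100, 50])
factory_2 = np.array([105, 61])
target = np.array([90, 60])

for name, stock in (('Factory 1', factory_1), ('Factory 2', factory_2)):
    manhattan = np.sum(np.abs(stock - target))          # total units to adjust
    euclidean = np.sqrt(np.sum((stock - target) ** 2))  # penalizes large single gaps
    print(f'{name}: Manhattan={manhattan}, Euclidean={euclidean:.2f}')

# Factory 1: Manhattan=20, Euclidean=14.14
# Factory 2: Manhattan=16, Euclidean=15.03
```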

***

In a real-world scenario, if the inventory manager needs to weigh the severity of individual product deviations (for example, a large overstock of Product A is more problematic than a small deficit of Product B), then Euclidean distance might be the more suitable option.

Conversely, if the goal is simply to minimize the overall number of units that need to be moved regardless of which product they are, then Manhattan Distance would be more appropriate.

Choosing the Optimal Value for “k”

The choice of k is crucial and impacts the bias-variance trade-off such that:

  • Small k (e.g., k=1) would cause high variance, low bias. The model is highly sensitive to noise in the training data (overfitting).

  • Large k would cause low variance, high bias. The model smooths out the predictions but might miss local patterns, leading to underfitting.

The optimal k is usually found through cross-validation.

I’ll demonstrate how to find the optimal k in the next section using a real-world dataset.

Simulation

Here, I’ll leverage a churn prediction case.

The Dataset

I’ll load the dataset with 3,500 samples / 14 features from the UC Irvine Machine Learning Repository:

 (Licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license)


Fig. Screenshot of the variable table (source)

Preprocessing

Since KNNs heavily rely on the distance computation, standardizing and normalizing the dataset in advance is crucial.

Fig. Image of standardization and normalization


I’ll split the features into numerical and categorical groups, then create a preprocessor that applies the corresponding column transformers:

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

num_features = [
    'subscription_length',
    'charge_amount',
    'seconds_of_use',
    'frequency_of_use',
    'frequency_of_sms',
    'distinct_called_numbers',
    'age',
    'customer_value'
]
num_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])

cat_features = [
    'age_group',
    'tariff_plan',
    'status',
    'call_failure',  # Binary, treated as categorical for consistent encoding
    'complains',     # Binary, treated as categorical
]
cat_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)) # ignore unseen categories; emit a dense array
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features),
    ],
    remainder='passthrough' # keep other columns
)

Defining Pipeline

Next, I defined the pipeline that feeds the preprocessed data into the KNeighborsClassifier class using the Minkowski distance (p=3):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
      ('preprocessor', preprocessor),
      ('knn', KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=3))  # k is set during cross validation below
])

Finding Optimal k

To find the optimal value for k, I applied K-fold cross validation to the pipeline. The search space for the k value is set from 1 to 30.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score

# the candidate values for k are 1 to 30.
k_vals = list(range(1, 31))

# defines k-fold cross validation
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

mean_accs_uw = []
std_accs_uw = []

for k in k_vals:
    # updates the pipeline with the current k before scoring
    pipeline.set_params(knn__n_neighbors=k)

    # run k-fold cross validation
    cv_score = cross_val_score(
        pipeline, X_train, y_train, cv=kf, scoring='accuracy', n_jobs=-1
    )

    # store results
    mean_accs_uw.append(np.mean(cv_score))
    std_accs_uw.append(np.std(cv_score))

# find the optimal k
optimal_k_uw = k_vals[np.argmax(mean_accs_uw)]
best_accuracy = mean_accs_uw[np.argmax(mean_accs_uw)]
print(f"\nOptimal k: {optimal_k_uw} with best mean accuracy: {best_accuracy:.4f}")

The Results

The optimal k was 1 for an unweighted KNN (best mean accuracy: 0.9091) and 14 for a weighted KNN (best mean accuracy: 0.9222):

Figure C. History of the K-fold cross validation process over an unweighted KNN pipeline (left) and a weighted KNN pipeline (right) using Minkowski distance with P=3 (Created by Kuriko IWAI)

Grid Search

Though this step is optional, to further optimize a broader range of hyperparameters, I ran a grid search using GridSearchCV:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

# define a base pipeline to optimize
pipeline_for_grid = Pipeline([
    ('preprocessor', preprocessor),
    ('model', KNeighborsClassifier())
])

# define the search space (p only affects the minkowski metric)
param_grid = {
    'model__n_neighbors': [i for i in range(1, 31)],
    'model__metric': ['euclidean', 'manhattan', 'minkowski'],
    'model__p': [i for i in range(1, 31)],
    'model__weights': ['uniform', 'distance']
}

# run grid search
grid_search = GridSearchCV(
    pipeline_for_grid, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

The Tuning Results

The grid search indicates that a weighted KNN with k=13 and the Manhattan distance is the optimal model based on the accuracy score.

  • Best parameters: {'model__metric': 'manhattan', 'model__n_neighbors': 13, 'model__p': 1, 'model__weights': 'distance'}

  • Best cross-validation accuracy: 0.9246

Model Training and Inference

Given the tuning results, I built three KNN classifiers:

from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

# 1. unweighted knn (minkowski, k = 1)
p_uw = Pipeline([
    ('preprocessor', preprocessor),
    ('knn', KNeighborsClassifier(n_neighbors=1, metric='minkowski', p=3))
]).fit(X_train, y_train)

# 2. weighted knn (minkowski, k = 14)
p_w = Pipeline([
    ('preprocessor', preprocessor),
    ('knn', KNeighborsClassifier(
        n_neighbors=14, weights='distance', metric='minkowski', p=3
    ))
]).fit(X_train, y_train)

# 3. grid search result: weighted knn (manhattan, k = 13)
grid_knn = grid_search.best_estimator_

Then, each model performed inference on train, validation, and test datasets. The results were assessed by accuracy score:

from sklearn.metrics import accuracy_score

y_pred_uw_train = p_uw.predict(X_train)
y_pred_uw_val = p_uw.predict(X_val)
y_pred_uw_test = p_uw.predict(X_test)

acc_score_train_uw = accuracy_score(y_train, y_pred_uw_train)
acc_score_val_uw = accuracy_score(y_val, y_pred_uw_val)
acc_score_test_uw = accuracy_score(y_test, y_pred_uw_test)

Results

The results indicate that the Pipeline with Grid Search achieved the highest overall performance on unseen data, yielding a test accuracy of 0.9320.

Pipeline

The weighted pipeline showed a significant improvement over the unweighted version (0.9280 test accuracy), nearly matching the grid-searched model and underscoring the importance of weighting.

  • Unweighted (optimal k = 1): Train 0.9890 → Validation 0.9140, Test 0.8980

  • Weighted (optimal k = 14): Train 0.9436 → Validation 0.9200, Test 0.9280

Pipeline with Grid Search (optimal: weighted, Manhattan distance, k=13)

  • Train 0.9436 → Validation 0.9260, Test 0.9320

This outcome highlights the importance of tuning across multiple hyperparameters: not only k, but also the weighting scheme and the distance metric.

Conclusion

KNN algorithms are straightforward methods based on the simple principle of proximity.

In our simulation, we observed the importance of choosing the right k value and distance metric for the task.

The main drawbacks to consider are:

  • The high computational cost: Calculating distances to all training points can be computationally expensive, especially for very large datasets, and

  • The curse of dimensionality: In a high-dimensional feature space, the data becomes sparse and pairwise distances become nearly uniform, which makes it difficult for KNNs to discriminate between neighbors.

In the latter case, KNNs need more data points to identify the neighbors.
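This distance concentration is easy to observe empirically. The sketch below (a small illustrative simulation, not the article's dataset) samples uniform random points and measures the relative gap between the nearest and farthest neighbor of a query; the contrast collapses as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_contrast(dim: int, n_points: int = 500) -> float:
    """Relative gap between the farthest and nearest distances from a query."""
    points = rng.random((n_points, dim))  # uniform points in the unit hypercube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# contrast shrinks by orders of magnitude as dimensionality grows
for dim in (2, 10, 100, 1000):
    print(f'dim={dim:>4}: contrast={distance_contrast(dim):.3f}')
```

In low dimensions the farthest point is many times farther than the nearest; in high dimensions "nearest" and "farthest" are almost the same distance, so the ranking of neighbors becomes noisy.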

Looking ahead, exploring techniques like dimensionality reduction (e.g., PCA, t-SNE) or approximate nearest neighbor (ANN) search algorithms could enhance KNN’s scalability and efficiency without compromising its predictive power.


Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Share What You Learned

Kuriko IWAI, "A Deep Dive into KNN Optimization and Distance Metrics" in Kernel Labs

https://kuriko-iwai.com/k-nearest-neighbor


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.