Dimensionality Reduction Unveiled: LLM Fine-tuning and Mechanics of SVD and PCA

Explore foundational concepts and practical applications with a comparison of major PCA methods

Machine Learning · Deep Learning · Data Science

By Kuriko IWAI


Table of Contents

Introduction
What is Dimensionality Reduction
Why Reduce Dimensionality
Dimensionality Reduction Techniques
What is Singular Value Decomposition (SVD)
How SVD Works
Major Applications of SVD for Dimensionality Reduction
What is Principal Component Analysis (PCA)
How PCA Leverages Singular Value Decomposition (SVD)
Prerequisite: Shifting Matrix A to Origin
Step 1. Find Σ
Step 2. Find Principal Component V
Step 3. Find Left Singular Vectors U
Practical Approach
Applying PCA for Dimensionality Reduction
The Full Transformation
The PCA Application
Practical Impact
Simulation
Preprocessing
SVD-Based (Benchmark)
Eigendecomposition
Incremental PCA
Randomized PCA
Kernel PCA
Results
Key Findings
Wrapping Up

Introduction

Dimensionality reduction is a crucial technique for simplifying complex datasets by reducing the number of variables while retaining key information.

In this article, I’ll explore its foundational concept, Singular Value Decomposition (SVD), and its primary application, Principal Component Analysis (PCA), comparing different PCA methodologies.

What is Dimensionality Reduction

Dimensionality reduction is a technique to reduce the number of variables (or features) in a dataset while preserving the most important information.

It simplifies complex data, making it easier to analyze and visualize, especially in machine learning.

Why Reduce Dimensionality

Dimensionality reduction helps make complex data more manageable and insightful by transforming it into a lower-dimensional space while preserving its key characteristics.

Its major purposes include:

  • Simplifies data: High-dimensional data can be difficult to understand, visualize, and process.

  • Improves performance: Speeds up machine learning algorithms and improves their performance by reducing the number of features and dealing with the “curse of dimensionality”.

  • Removes redundancy: Eliminates redundant or irrelevant features, leading to a more focused analysis.

  • Reduces noise: Filters out noise by discarding components that capture little structure, keeping the most relevant signal.

Dimensionality Reduction Techniques

Dimensionality reduction techniques are broadly categorized into:

  • Feature Selection: Selects a subset of the original features without changing them.

  • Feature Extraction: Creates new features by combining or transforming the original features.

Major techniques of feature extraction include:

  • Principal Component Analysis (PCA): A common technique that transforms data into a new set of orthogonal, uncorrelated variables (principal components), capturing the most variance in the data.

  • Linear Discriminant Analysis (LDA): A supervised dimensionality reduction technique that aims to find a linear combination of features that best separates different classes.

  • Autoencoders: Neural networks with a bottleneck layer that learn to encode and decode data, effectively reducing dimensionality.

  • t-distributed Stochastic Neighbor Embedding (t-SNE): A non-linear technique particularly useful for visualizing high-dimensional data in lower dimensions (e.g., 2D or 3D).

The figure below gives a bird’s-eye view of dimensionality reduction:

Figure A. Dimensionality reduction techniques overview (Created by Kuriko IWAI)

In the next section, I'll take a deep dive into Singular Value Decomposition (SVD), a fundamental technique, and various Principal Component Analysis (PCA) methods.

What is Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a fundamental matrix factorization technique that decomposes any matrix (A) into three matrices:

A = U \Sigma V^T

where:

  • U (Left Singular Vectors): Orthogonal matrix whose columns form an orthonormal basis for the output (column) space.

  • Σ (Singular Values): Diagonal matrix containing non-negative, real numbers (singular values) ordered in descending magnitude.

  • V^T (Right Singular Vectors, Transposed): Orthogonal matrix whose rows are the right singular vectors (the columns of V form an orthonormal basis for the row space of A).
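The factorization can be verified numerically. Below is a minimal sketch assuming NumPy; the matrix A here is an arbitrary example:

```python
import numpy as np

# An arbitrary example matrix; any real matrix can be decomposed
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# full_matrices=False returns the reduced ("economy") factorization
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values are non-negative and sorted in descending order
assert s[0] >= s[1] >= 0

# Multiplying the three factors back together recovers A
assert np.allclose(U @ np.diag(s) @ Vt, A)
```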

How SVD Works

The figure illustrates how the SVD approaches the linear transformation step by step:

Figure B. The linear transformation process by SVD (Created by Kuriko IWAI)

The vector transformation runs sequentially, working from right to left in the formula. So, the SVD decomposes the input vector (x) by:

Step 1. Rotates in the row space using V^T

  • Rotates x so its coordinates align with the principal axes, the directions of greatest variance in the original data space (v1 and v2 in the figure).

Step 2. Scales along the principal axes using Σ

  • The reoriented vector (V^T⋅x) is scaled by the diagonal matrix Σ that consists of singular values (σ) on the diagonal:
\Sigma = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n \end{bmatrix}
  • Acts as stretching factors along each of the principal axes.

Step 3. Rotates in the column space using U

  • The scaled vector (ΣV^T⋅x) is transformed by U.

  • Orients the scaled vector into its correct position within the output space.

  • The final result (UΣV^T x) is the same as Ax.
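The three steps above can be traced numerically. Assuming NumPy, each stage applied in sequence reproduces the direct product A·x:

```python
import numpy as np

A = np.array([[1.0, 5.0],
              [2.0, 1.0],
              [3.0, 6.0]])
x = np.array([1.0, 2.0])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 1: rotate x into the principal-axis frame of the row space
step1 = Vt @ x
# Step 2: stretch each coordinate by its singular value
step2 = s * step1          # same as diag(s) @ step1
# Step 3: rotate the scaled vector into the column (output) space
step3 = U @ step2

# The three steps reproduce the direct transformation A @ x
assert np.allclose(step3, A @ x)
```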

Major Applications of SVD for Dimensionality Reduction

Since SVD can reveal inherent structure and compress information, it becomes particularly valuable for dimensionality reduction across various fields.

Here are some key methods where SVD is applied:

Principal Component Analysis (PCA):

  • SVD provides a robust and numerically stable method for performing PCA. I’ll demonstrate how it works in detail in the next section.

Low-Rank Approximation:

  • A fundamental mechanism underlying SVD’s use in data compression and noise reduction.

  • Involves approximating a complex matrix A with a simpler, lower-rank matrix A_k by retaining only the top k (most significant) singular values.

  • Best when: The original data has a significant amount of redundancy, noise, or an inherently lower-dimensional structure (e.g. majority of features are highly correlated).

Latent Semantic Indexing (LSI)

  • Essentially PCA for text data in NLP.

  • Uses SVD to remove noise and group semantically similar terms and documents.

  • Best when: working with large text corpora to identify underlying themes or topics, improving information retrieval (e.g., matching documents to queries based on concepts rather than exact word matches), or handling synonymy and polysemy.

Image Compression and Denoising

  • Uses SVD’s low-rank approximation to reduce the file size of images by representing them with fewer singular values.

  • Removes noise by discarding less significant components that often capture random fluctuations.

Recommender Systems (Matrix Factorization)

  • Uses SVD to form the basis for decomposing large user-item interaction matrices.

  • Helps predict user preferences for unrated items, reducing the complexity of the interaction space.
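Low-rank approximation, the mechanism several of these applications share, can be sketched in a few lines of NumPy. The matrix below is synthetic (approximately rank 5 plus small noise), chosen only to illustrate the idea:

```python
import numpy as np

rng = np.random.default_rng(0)
# A 50x40 matrix that is approximately rank 5, plus small noise
A = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 40))
A += 0.01 * rng.normal(size=A.shape)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5  # keep only the top-k singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# The relative reconstruction error is tiny because the true rank is ~5
rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
assert rel_err < 0.05
```

Storing U[:, :k], s[:k], and Vt[:k] takes k(m + n + 1) numbers instead of m·n, which is where the compression comes from.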

What is Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical method for linear dimensionality reduction.

Its primary goal is to transform a dataset with a large number of potentially correlated variables into a smaller set of uncorrelated variables, known as principal components.

These components are ordered by the variance captured across the dataset.

This process effectively reduces the number of variables while preserving as much of the original data’s underlying variation as possible.

How PCA Leverages Singular Value Decomposition (SVD)

Leveraging SVD for PCA is the most common and numerically stable approach.

In PCA, each of the components given by SVD plays a key role such that:

  • V: Principal components that indicate the directions of maximum variance (highlighted in orange in Figure C).

  • Σ: Variance explained by each component, because its squares are proportional to the eigenvalues of the covariance matrix.

  • UΣ: Principal component scores (or latent variables) in the new coordinate system (blue in Figure C).
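The role of Σ as "variance explained" can be checked numerically. A sketch assuming NumPy, on a small 3 × 2 matrix:

```python
import numpy as np

A = np.array([[1.0, 5.0],
              [2.0, 1.0],
              [3.0, 6.0]])
A_c = A - A.mean(axis=0)          # mean-center first

_, s, _ = np.linalg.svd(A_c)

# Squared singular values are proportional to the variance
# captured by each principal component
explained_ratio = s**2 / np.sum(s**2)
assert np.isclose(explained_ratio.sum(), 1.0)
# The first component dominates here (~88% of the variance)
assert explained_ratio[0] > 0.8
```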

Figure C. PCA leveraging SVD over a 2×2 matrix transformation (Created by Kuriko IWAI)

I’ll demonstrate the computational process of these components step by step, using a simple 3 × 2 matrix A as an example:

A = \begin{bmatrix} 1 & 5 \\ 2 & 1 \\ 3 & 6 \end{bmatrix}

Prerequisite: Shifting Matrix A to Origin

PCA is about variance, not location.

When the input matrix A is given, it is crucial for PCA to mean-center it first, so that the centroid of A lies at the origin (0, 0):


A_s = A - \mu^T = \begin{bmatrix} 1-2 & 5-4 \\ 2-2 & 1-4 \\ 3-2 & 6-4 \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ 0 & -3 \\ 1 & 2 \end{bmatrix}

(μ: the column average, μ_1 = (1 + 2 + 3) / 3 = 2, μ_2 = (5 + 1 + 6) / 3 = 4)

This process is not a part of standard SVD.

But for PCA, it is important because the process ensures that the variance truly reflects the scatter of the data points around their central tendency, rather than their absolute position in space.

Step 1. Find Σ

To find the singular values, first compute the product of the transpose of the mean-centered matrix A_s with A_s itself.

This results in a square, symmetric matrix whose eigenvectors are the right singular vectors (V):

A_s^T A_s = \begin{bmatrix} -1 & 0 & 1 \\ 1 & -3 & 2 \end{bmatrix} \begin{bmatrix} -1 & 1 \\ 0 & -3 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 14 \end{bmatrix}

Find Eigenvalues (λ)

Solve the characteristic equation det(A_s^T A_s − λI) = 0 to find the eigenvalues λ_i.

For our example:

(2 - \lambda)(14 - \lambda) - 1^2 = 0 \implies \lambda_1 \approx 14.083, \ \lambda_2 \approx 1.917

Compute Singular Values (σ)

The singular values are the square roots of the eigenvalues:

\sigma_i = \sqrt{\lambda_i}

So, the singular values are computed:

\sigma_1 = \sqrt{14.083} \approx 3.753, \ \sigma_2 = \sqrt{1.917} \approx 1.385

Construct Σ

Place these singular values on the diagonal of the Σ matrix, ordered from largest to smallest:

\Sigma = \begin{bmatrix} 3.753 & 0 \\ 0 & 1.385 \end{bmatrix}

Here, σ_1 = 3.753 is the largest singular value and σ_2 = 1.385 the second largest, indicating the relative importance of each corresponding principal component.
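The hand computation above can be verified with NumPy. A sketch using the mean-centered matrix from the example:

```python
import numpy as np

# Mean-centered matrix from the worked example
A_s = np.array([[-1.0,  1.0],
                [ 0.0, -3.0],
                [ 1.0,  2.0]])

G = A_s.T @ A_s                 # Gram matrix, equals [[2, 1], [1, 14]]
lam = np.linalg.eigvalsh(G)     # eigenvalues in ascending order
sigma = np.sqrt(lam[::-1])      # singular values, descending

assert np.allclose(G, [[2.0, 1.0], [1.0, 14.0]])
assert np.allclose(sigma, [3.753, 1.385], atol=1e-3)
```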

Step 2. Find Principal Component V

The eigenvectors corresponding to the eigenvalues of A_s^T A_s (λ_1, λ_2) form the columns of the matrix V.

These eigenvectors can be found by solving the following equation for each eigenvalue λi:

(A_s^T A_s - \lambda_i I) x_i = 0

Finding the Eigenvector for λ_1

Substitute λ_1:

A_s^T A_s - \lambda_1 I = \begin{bmatrix} 2 & 1 \\ 1 & 14 \end{bmatrix} - 14.083 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} -12.083 & 1 \\ 1 & -0.083 \end{bmatrix}

Solve the system by finding non-zero vector x:

\begin{bmatrix} -12.083 & 1 \\ 1 & -0.083 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \implies \begin{cases} -12.083 x_1 + x_2 = 0 \\ x_1 - 0.083 x_2 = 0 \end{cases} \implies x_2 \approx 12.083 \, x_1

Construct the first principal component V_1

This relationship gives the first principal component (before normalization):

v_1 = \begin{bmatrix} 1 \\ 12.083 \end{bmatrix}

Applying normalization, we find the first principal component V_1:

V_1 = \frac{v_1}{\lVert v_1 \rVert} = \frac{1}{\sqrt{1^2 + 12.083^2}} \begin{bmatrix} 1 \\ 12.083 \end{bmatrix} = \begin{bmatrix} 0.0825 \\ 0.9966 \end{bmatrix}

Taking the same step for λ_2, we find the second principal component V_2:

V_2 = \begin{bmatrix} 0.9966 \\ -0.0825 \end{bmatrix}

Construct the Matrix V

The columns of V are the principal components V_1 and V_2, ordered by their corresponding singular values (largest first):

V = \begin{bmatrix} V_1 & V_2 \end{bmatrix} = \begin{bmatrix} 0.0825 & 0.9966 \\ 0.9966 & -0.0825 \end{bmatrix}
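The eigenvector computation can be cross-checked with NumPy. Since eigenvectors are only defined up to sign, the check below compares absolute values:

```python
import numpy as np

G = np.array([[2.0, 1.0],
              [1.0, 14.0]])    # A_s^T A_s from the example

lam, vecs = np.linalg.eigh(G)  # ascending eigenvalue order
# Reorder so the largest eigenvalue (first principal component) comes first
V = vecs[:, ::-1]

# Eigenvectors are defined only up to sign, so compare absolute values
assert np.allclose(np.abs(V[:, 0]), [0.0825, 0.9966], atol=1e-3)
assert np.allclose(np.abs(V[:, 1]), [0.9966, 0.0825], atol=1e-3)
```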

Step 3. Find Left Singular Vectors U

The left singular vectors can be computed using the relationship:

A_s v_i = \sigma_i u_i \implies u_i = \frac{1}{\sigma_i} A_s v_i

Substituting the computed values, we find u_1 and u_2:

u_1 = \frac{1}{\sigma_1} A_s v_1 = \frac{1}{3.753} \begin{bmatrix} -1 & 1 \\ 0 & -3 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} 0.0825 \\ 0.9966 \end{bmatrix} \approx \begin{bmatrix} 0.2436 \\ -0.7966 \\ 0.5531 \end{bmatrix}

u_2 = \frac{1}{\sigma_2} A_s v_2 = \frac{1}{1.385} \begin{bmatrix} -1 & 1 \\ 0 & -3 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} 0.9966 \\ -0.0825 \end{bmatrix} \approx \begin{bmatrix} -0.7791 \\ 0.1787 \\ 0.6004 \end{bmatrix}

Then, construct U:

U = \begin{bmatrix} 0.2436 & -0.7791 \\ -0.7966 & 0.1787 \\ 0.5531 & 0.6004 \end{bmatrix}

This process decomposes the mean-centered matrix A_s into its fundamental rotational (U, V^T) and scaling (Σ) components, which are then used for PCA.
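The relation u_i = (1/σ_i) A_s v_i used in Step 3 holds exactly for the factors NumPy returns, which makes a quick self-check possible:

```python
import numpy as np

A_s = np.array([[-1.0,  1.0],
                [ 0.0, -3.0],
                [ 1.0,  2.0]])

U, s, Vt = np.linalg.svd(A_s, full_matrices=False)

# Each left singular vector satisfies u_i = (1 / sigma_i) * A_s @ v_i
for i in range(2):
    u_from_relation = (A_s @ Vt[i]) / s[i]
    assert np.allclose(u_from_relation, U[:, i])
```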

Practical Approach

In practice, we can easily compute the PCA components using the NumPy library:

import numpy as np

A = np.array([
    [1, 5],
    [2, 1],
    [3, 6],
])

# Mean-center each column so the centroid sits at the origin
A_mean_centered = A - np.mean(A, axis=0)
U, s, Vt = np.linalg.svd(A_mean_centered)

# Squared singular values divided by (n - 1) give the eigenvalues
# of the sample covariance matrix (scikit-learn's convention)
eigenvalues_of_covariance_np = (s**2) / (A_mean_centered.shape[0] - 1)

# Embed the singular values in a 3x2 diagonal matrix to compute scores
Sigma_diagonal_matrix = np.zeros(A_mean_centered.shape)
Sigma_diagonal_matrix[:s.shape[0], :s.shape[0]] = np.diag(s)
latent_variables_np = U @ Sigma_diagonal_matrix

  • sigma’s: [3.7527007 1.38464344]

  • V^T: [[ 0.08248053 0.99659268] [-0.99659268 0.08248053]]

  • U: [[ 0.24358781 0.77931486 0.57735027] [-0.79670037 -0.1787042 0.57735027] [ 0.55311256 -0.60061066 0.57735027]]

A side note on U: For an m×n matrix A_s, the U matrix from full SVD is always square with dimensions m×m (3×3 in this case). While we explicitly calculated the first two columns of U (u_1 and u_2) corresponding to our non-zero singular values, the remaining (m−n) columns (here, one column) complete the orthonormal basis for the m-dimensional space.
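The difference between the full and reduced ("thin") U can be seen directly in NumPy via the full_matrices flag:

```python
import numpy as np

A_s = np.array([[-1.0,  1.0],
                [ 0.0, -3.0],
                [ 1.0,  2.0]])

U_full, s, Vt = np.linalg.svd(A_s)                      # U is 3x3
U_thin, _, _ = np.linalg.svd(A_s, full_matrices=False)  # U is 3x2

assert U_full.shape == (3, 3)
assert U_thin.shape == (3, 2)
# The thin U matches the first two columns of the full U (up to sign)
assert np.allclose(np.abs(U_full[:, :2]), np.abs(U_thin))
```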

Applying PCA for Dimensionality Reduction

Now that the matrices are computed, I’ll examine two cases:

  • the full transformation, and

  • the PCA application

using the matrix A from the previous section.

The Full Transformation

When new, unseen data x is given, the full transformation by the original matrix A is defined as:

y_{full} = Ax = (U \Sigma V^T) x

which can be viewed as a sequence of transformations:

Ax = U (\Sigma (V^T x))

Here, the term V^Tx represents the input vector x transformed into the full coordinate system defined by the principal components V.

Each element of V^T x is x’s score along a specific principal component, and the multiplication by Σ and U completes the original transformation, yielding a vector y_full in the original output space.

Taking the matrix A from the previous section and letting x hold two input vectors as its columns, the full transformation lands in the three-dimensional column space of A:

y_{full} = Ax = \begin{bmatrix} 1 & 5 \\ 2 & 1 \\ 3 & 6 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 16 & 22 \\ 5 & 8 \\ 21 & 30 \end{bmatrix}

Each column of the result is a vector in R^3, the column space of A.
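The full-transformation arithmetic is easy to verify with NumPy:

```python
import numpy as np

A = np.array([[1, 5],
              [2, 1],
              [3, 6]])
x = np.array([[1, 2],
              [3, 4]])

# The full transformation maps each column of x into R^3
y_full = A @ x
assert np.array_equal(y_full, [[16, 22],
                               [ 5,  8],
                               [21, 30]])
```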

The PCA Application

The goal of PCA is to create a lower-dimensional representation of the input vector x.

To achieve this, SVD projects x onto a subset of the principal components such that:

y_{pca} = V_k^T x

where k represents the number of dimensions after PCA.

So, for example, when reducing the dimensionality of x to two, SVD takes only the first and second principal components.

This operation essentially projects the input vector onto the subspace spanned by the chosen two (or k in general) principal components, resulting in a lower-dimensional score.

In our case, using only the first principal component V_1, the same input x is projected onto a one-dimensional space:

y_{pca} = V_1^T x = \begin{bmatrix} 0.0825 & 0.9966 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 3.0723 & 4.1514 \end{bmatrix}

Each column of x is reduced to a single score in R^1.
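The projection arithmetic checks out numerically as well:

```python
import numpy as np

V1 = np.array([0.0825, 0.9966])   # first principal component
x = np.array([[1, 2],
              [3, 4]])

# Project each column of x onto the first principal component
y_pca = V1 @ x
assert np.allclose(y_pca, [3.0723, 4.1514], atol=1e-6)
```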

Practical Impact

As illustrated, the full transformation generates a three-dimensional output, while applying PCA reduces it to a one-dimensional score.

This data compression is incredibly powerful.

Imagine if an input vector represents a 300 × 300 pixel color thumbnail image.

This vector has 300×300×3 = 270,000 dimensions.

PCA can reduce this to a few thousand dimensions while retaining the crucial information. As you can imagine, this significantly reduces the computational cost of storing, processing, and analyzing the data.
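The storage arithmetic behind that claim is straightforward. A sketch of the dimension counts (the rank value 50 is taken from the image example below; the exact savings depend on the chosen rank):

```python
# Storing the raw thumbnail vector versus a rank-k approximation
m, n, channels = 300, 300, 3
full_dims = m * n * channels
assert full_dims == 270_000

# A rank-k approximation of each 300x300 channel keeps
# k * (m + n + 1) numbers instead of m * n
k = 50
per_channel_full = m * n                  # 90,000
per_channel_rank_k = k * (m + n + 1)      # 30,050
ratio = per_channel_rank_k / per_channel_full
assert ratio < 0.34   # roughly a 3x reduction per channel at rank 50
```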

In fact, when I compressed the cat image, the rank-50 approximation (bottom left) appears almost identical to the original:

Figure D. Comparison of dimensionality reduction on image via SVD-based PCA by rank (Created by Kuriko IWAI)

PCA has a downside: information loss, especially when crucial discriminative patterns live in the lower-variance components that PCA discards.

Also, PCA prioritizes variance preservation over discriminative information, which can make PCA less effective for classification tasks or categorical feature handling.

Simulation

In this section, I’ll simulate five PCA methods, with SVD-based PCA as the benchmark, and compress the 57-dimensional customer data to 2 dimensions.

Here are the PCA methods:

  1. SVD-based PCA: Benchmark

  2. Eigendecomposition PCA: Traditional and conceptually direct (can be numerically less stable for large/high-dimensional data)

  3. Incremental PCA: Computes PCA in mini-batches, memory-efficient for very large datasets.

  4. Randomized PCA: Uses randomized algorithms to approximate the leading principal components. Significantly faster for very large, sparse feature space.

  5. Kernel PCA: Maps data using kernels and performs linear PCA. Good for handling non-linear relationship, but computationally intense.

Preprocessing

I used telecom churn data from the UC Irvine Machine Learning Repository, generating 2,826 training samples with 61 input features after applying column transformations:

Figure E. List of features in training samples (source)

Shape: (3150, 57)

SVD-Based (Benchmark)

Applying the PCA class from the Scikit-Learn library:

from sklearn.decomposition import PCA

pca_sklearn = PCA(n_components=2).fit(X_processed)
pca_sklearn_transformed = pca_sklearn.transform(X_processed)

Eigendecomposition

Manually computing eigenvectors and principal component scores using the NumPy library:

import numpy as np

# Covariance matrix of the preprocessed features
Sigma = np.cov(X_processed, rowvar=False)

# eigh is the appropriate routine for a symmetric matrix
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Reorder columns from largest to smallest eigenvalue
eigenvectors_eig = eigenvectors[:, eigenvalues.argsort()[::-1]]
pc_scores = X_processed @ eigenvectors_eig

Incremental PCA

Processing the data in mini-batches of 100 points:

from sklearn.decomposition import IncrementalPCA

n_samples = X_processed.shape[0]
n_components = 2
batch_size = 100
ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size)

# Iterate over the samples (not n_components) in batches of 100
for i in range(0, n_samples, batch_size):
    batch = X_processed[i:i + batch_size]
    ipca.partial_fit(batch)

X_transformed_ipca = ipca.transform(X_processed)

Randomized PCA

Setting svd_solver='randomized':

from sklearn.decomposition import PCA

pca_randomized = PCA(
    n_components=2,
    svd_solver='randomized',  # added
    random_state=42
).fit(X_processed)
X_transformed_randomized = pca_randomized.transform(X_processed)

Kernel PCA

Used the KernelPCA class from the Scikit-Learn library:

from sklearn.decomposition import KernelPCA

kpca_rbf_plot = KernelPCA(
    n_components=2,
    kernel='rbf',
    gamma=10,
    fit_inverse_transform=True,
    random_state=42
).fit_transform(X_processed)

Results

Transformed Data (First 3 Rows)

  1. SVD: [[-0.52226898 -0.31852918] [-2.33364661 0.30846163] [ 1.00560426 2.97975418]]

  2. Eigendecomp: [[-0.49955958 0.2649316 ] [-2.31093721 -0.36205921] [ 1.02831366 -3.03335176]]

  3. Incremental PCA: [[-0.70377262 -0.16708658] [-2.36044435 0.41672305] [ 0.90155373 3.06526561]]

  4. Randomized PCA: [[-0.52226898 -0.31852918] [-2.33364661 0.30846163] [ 1.00560426 2.97975418]]

  5. Kernel PCA: [[ 0.31379832 -0.36497502] [-0.00993375 -0.01026973] [ 0.37912822 -0.24748029]]
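The near-identical scores (up to sign flips, which SVD and eigendecomposition are free to differ by) can be demonstrated on synthetic data. The random matrix below is a hypothetical stand-in for X_processed, used only for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic stand-in data with distinct column variances
X = rng.normal(size=(200, 10)) * np.arange(1, 11)

# SVD-based PCA scores (scikit-learn)
Z_svd = PCA(n_components=2).fit_transform(X)

# Eigendecomposition-based PCA scores
X_c = X - X.mean(axis=0)
lam, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
Z_eig = X_c @ vecs[:, ::-1][:, :2]

# The two methods agree up to a per-component sign flip
assert np.allclose(np.abs(Z_svd), np.abs(Z_eig), atol=1e-6)
```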

Scatter plot of the first two principal components:

Figure F. Scatter plots of the first two principal components (SVD, Eigendecomp, Incremental, Randomized, Kernel) (Created by Kuriko IWAI)

Key Findings

SVD-based PCA, Eigendecomposition-PCA: Both methods generate highly comparable principal components, as evidenced by their plots and the nearly identical transformed data for the first few rows and components.

Incremental PCA: The Incremental PCA plot generally mirrors the standard PCA plots. It effectively reduces dimensionality in a similar way, even when processing data in smaller batches. The transformed data also shows values very close to the standard PCA methods.

Randomized PCA: The Randomized PCA plot also shows a similar structure to standard PCA. The transformed data is almost identical to SVD-based PCA.

Kernel PCA: This plot stands out significantly from the others. Unlike the linear PCA methods, Kernel PCA with an RBF kernel implicitly maps the data into a high-dimensional feature space before projecting, producing a markedly different, non-linear embedding of the churn data.

Wrapping Up

PCA, particularly when applied with SVD, is a fundamental and practical approach for dimensionality reduction.

Our mathematical demonstration provided a clear insight into how SVD effectively factors complex input vectors.

In the practical experiments, we observed that while SVD forms a powerful core, various techniques built upon these principles can also be effectively tailored to address diverse use cases and challenges in dimensionality reduction.

Continue Your Learning

If you enjoyed this blog, these related entries will complete the picture:

Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Share What You Learned

Kuriko IWAI, "Dimensionality Reduction Unveiled: LLM Fine-tuning and Mechanics of SVD and PCA" in Kernel Labs

https://kuriko-iwai.com/dimensionality-reduction-with-svd-and-pca


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.