Deconstructing LoRA: The Math and Mechanics of Low-Rank Adaptation
Master the math of rank decomposition, hyperparameter tuning, and tips to avoid common pitfalls
By Kuriko IWAI

Table of Contents
Introduction
What Is LoRA
How LoRA Works - The Core Mechanism
The Math of <1%: Why LoRA is Computationally Superior
Tuning LoRA Hyperparameters
Security & Vulnerabilities: Is LoRA Truly Bulletproof?
Wrapping Up - Various LoRA Configurations and Full Fine-tuning

Introduction
Architecting a language model from the ground up requires massive GPU clusters because the model has an enormous number of trainable parameters.
But in the real world where VRAM is a finite resource and not everyone has a cluster of A100s, we need a more surgical approach.
LoRA tackles this challenge by training only a tiny fraction of the model parameters, achieving performance close to that of a fully fine-tuned model while updating less than 1% of them.
In this article, I’ll take a deep dive into LoRA’s core mechanism and address common pitfalls in its configuration and implementation.
What Is LoRA
Low-Rank Adaptation (LoRA) is a fine-tuning technique that optimizes low-rank decomposition matrices within the transformer's dense layers rather than updating all model parameters.
Originally introduced by Hu et al. [1], LoRA has become a cornerstone of Parameter-Efficient Fine-Tuning (PEFT) for adapting language models to downstream tasks.
The diagram below illustrates how LoRA is applied within a standard decoder-only transformer, using Qwen-3-1.7B model to demonstrate the parameter breakdown:

Figure A. Decoder-only transformer architecture and parameter breakdown for Qwen-3-1.7B model and its LoRA adapter (Created by Kuriko IWAI)
◼ Deconstructing Parameter Allocation in Decoder-Only Transformers
Figure A shows that the model features 1.7 billion parameters distributed across its architecture.
Before detailing the mechanics of LoRA, I will examine its allocation strategy.
▫ 1. The Vocabulary and Input Embedding Layer
The model first converts raw text into input embeddings by looking up each token (a word or sub-word) in a massive lookup table (Green box, Figure A).
Each token is assigned a token ID, which maps to a unique vector.
For example, if the token "cat" is assigned Token ID 1, that ID points to a vector of decimals ranging roughly from -1 to 1, such as [-0.0124, 0.4591, -0.8233, ..., -0.1192]:
These numbers represent the token’s semantic meaning across the model's hidden dimensions.
And because Qwen-3-1.7B has 2,048 hidden dimensions, the model requires 2,048 columns to represent one token.
With a vocabulary size of 151,936, the resulting lookup table has 151,936 rows and 2,048 columns:
| Token ID | Dim 1 | Dim 2 | Dim 3 | ... | Dim 2048 |
| --- | --- | --- | --- | --- | --- |
| ID 0 | 0.0135 | -0.022 | 0.0556 | ... | 0.1167 |
| ID 1 (cat) | -0.0124 | 0.4591 | -0.8233 | ... | -0.1192 |
| ... | ... | ... | ... | ... | ... |
| ID 1523 (apple) | -0.0445 | 0.4503 | -0.8240 | ... | 0.1103 |
| ... | ... | ... | ... | ... | ... |
| ID 151,935 | 0.0904 | 0.0120 | -0.0555 | ... | 0.0224 |
Table 1. A mock vocabulary matrix of Qwen-3-1.7B
To store this matrix, the model must allocate roughly 311 million parameters:

$$151{,}936 \times 2{,}048 = 311{,}164{,}928 \approx 311\text{M}$$
This matrix is reused during the final decoding stage (Linear layer, Figure A) to translate the model's internal math back into human-readable text.
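For intuition, here is a minimal PyTorch sketch of that lookup table and the weight tying described above; the embedding values are randomly initialized placeholders, not Qwen’s actual weights.

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim = 151_936, 2_048

# The input embedding: one 2,048-dim row per token ID (~311M parameters).
embedding = nn.Embedding(vocab_size, hidden_dim)

# Look up the vector for a single token ID (e.g., "cat" -> ID 1 in Table 1).
token_ids = torch.tensor([1])
vector = embedding(token_ids)                 # shape: (1, 2048)

# The same matrix is reused as the output projection (weight tying),
# mapping hidden states back to vocabulary logits at the decoding stage.
lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
lm_head.weight = embedding.weight             # ties the two matrices together

print(vector.shape)                                       # torch.Size([1, 2048])
print(sum(p.numel() for p in embedding.parameters()))     # 311164928
```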
▫ 2. The Core Engine - Transformer Blocks
Next, the model processes the data through the core engine, transformer blocks (Pink box, Figure A).
The parameters within these blocks are distributed:
Attention layers (Orange box, Figure A): 12.5M parameters per block.
Feed-Forward (FF) layers (Blue box, Figure A): 33.8M parameters per block.
Because Qwen-3-1.7B stacks 28 of these transformer blocks on top of each other:
- Attention layers total: 12,582,912 × 28 = 352,321,536 ≈ 352M parameters
- FF layers total: 33,816,576 × 28 = 946,864,128 ≈ 947M parameters
▫ 3. The Final Tally
To reach the final prediction, the model applies a linear transformation using the original 311M parameters from the vocabulary, and a Softmax function to normalize the results into a probability distribution (Grey boxes, Figure A).
Adding up the small contributions from the normalization layers, the model lands at its total of roughly 1.7 billion parameters:

$$311\text{M (embedding)} + 352\text{M (attention)} + 947\text{M (FF)} + \text{norm layers} \approx 1.7\text{B}$$
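For a quick sanity check, the tally can be reproduced in a few lines of Python. The per-block figures below come from Figure A and the tables later in this article, so treat the result as back-of-the-envelope arithmetic rather than an exact count read from the checkpoint.

```python
embedding_params = 151_936 * 2_048     # ~311M, shared with the output linear layer
attention_per_block = 12_582_912       # ~12.5M (see Table 2-1)
ff_per_block = 33_816_576              # ~33.8M (see Table 2-2)
n_blocks = 28

attention_total = attention_per_block * n_blocks   # ~352M
ff_total = ff_per_block * n_blocks                 # ~947M

total = embedding_params + attention_total + ff_total
print(f"{total:,} parameters (~{total / 1e9:.2f}B before normalization layers)")
# 1,610,350,592 parameters (~1.61B before normalization layers)
```

The remaining gap to the headline 1.7B figure comes from the normalization layers and rounding in the per-block counts.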
Full fine-tuning trains all 1.7 billion parameters, requiring massive VRAM and data samples to prevent overfitting.
Developer Note - Balancing Dimensions (storage) and Layers (thinking capacity)
The very first Transformer from the "Attention Is All You Need" paper established the convention for hidden dimension sizes that later models loosely follow:
Small Models: 512 or 768.
Medium Models (e.g., Qwen): 1,024 or 2,048.
Large Models (e.g., Llama): 4,096.
Giant Models (e.g., GPT-3): 12,288.
To stay within a total budget of roughly 1.7 billion parameters, Qwen-3 has to balance the hidden dimension size against the number of layers.
If the model were:
Too wide (e.g., 8,192 dimensions), it could only afford about three layers: plenty of room to represent each token, but not enough depth to reason over the input when generating a response.
Too narrow (e.g., 128 dimensions), it could afford 500+ layers, but each layer would lack the capacity to richly represent even a single word.
Ultimately, the combination of 28 layers × 2,048 dimensions provides enough capacity to store and process relatively complex concepts without making the model too slow to run on consumer GPUs.
How LoRA Works - The Core Mechanism
LoRA optimizes the fine-tuning process by targeting the core transformer blocks (Pink box, Figure A) through a specific mathematical shortcut: low-rank decomposition.
◼ Low-Rank Decomposition
Low-rank decomposition is a technique from linear algebra in which a large matrix is approximated by the product of two smaller matrices.
It builds on rank factorization, a fundamental result stating that any non-zero matrix Z can be written as the product of two full-rank matrices, X and Y:

$$Z_{m \times n} = X_{m \times k}\, Y_{k \times n}, \qquad k = \operatorname{rank}(Z)$$
In the context of standard fine-tuning, Z represents a weight update matrix with the same size as the original model weight (e.g., 2,048 × 2,048 in the case of Qwen-3-1.7B).
Large models have enormous weight matrices, which makes storing and computing the full-rank factors X and Y prohibitively expensive.
LoRA bypasses this bottleneck by leveraging low-rank decomposition:

$$Z_{m \times n} \approx A_{m \times r}\, B_{r \times n}, \qquad r \ll \operatorname{rank}(Z)$$
Under this approximation, the low-rank matrices A and B are taken to represent the most important latent structure of the original large matrix Z, on the assumption that much of the data in Z is redundant or correlated and can therefore be ignored (Gray area, Figure B).

Figure B. Visual representation of low-rank decomposition (Created by Kuriko IWAI)
So, instead of computing the full-rank update Z, LoRA tunes the much smaller matrices A and B with a reduced inner dimension, called the rank (r).
This process is generalized as:

$$\Delta W = \frac{\alpha}{r} A B \tag{1.1}$$
where:
ΔW: The full-rank update matrix (size d x d),
A: The first low-rank matrix of size d x r,
B: The second low-rank matrix of size r x d,
α/r: The scaling factor to control the magnitude of the new learning,
r: The rank (LoRA hyperparameter, e.g., r = 8), and
α: The scaling parameter (LoRA hyperparameter). Default is one (α = 1).
As Figure B shows, the rank r determines how aggressively LoRA compresses the information in the original matrix.
For example, Qwen-3 (d = 2,048) has ΔW with over 4 million parameters (2,048 x 2,048 = 4,194,304).
Fully fine-tuning this matrix means updating all 4 million parameters.
But with a rank-eight decomposition (r = 8), LoRA only needs to tune 32,768 of them:
- Matrix A: 2,048 × 8 = 16,384 parameters
- Matrix B: 8 × 2,048 = 16,384 parameters
- Total trainable parameters: 16,384 + 16,384 = 32,768
LoRA successfully reduces the number of trainable parameters while preserving the essential information in the original weight update.
The calculation of total trainable parameters is generalized as:

$$n_{lora} = r \times (d_{in} + d_{out}) \tag{1.2}$$
where:
n_{lora}: Total number of trainable parameters that LoRA targets to tune,
d_{in}: The input dimension of the original layer (e.g., 2,048),
d_{out}: The output dimension of the original layer (e.g., 2,048), and
r: The rank (where r ≪ d_{in} and r ≪ d_{out}).
And because the LoRA update has the same output dimensions as the full fine-tuning update, it can simply be added to the pre-trained weight W_{pt}:

$$W_{ft} = W_{pt} + \Delta W \tag{1.3}$$

$$W_{ft} \approx W_{pt} + \frac{\alpha}{r} A B \tag{1.4}$$
where:
W_{ft}: The weight matrix after fine-tuning,
W_{pt}: The pre-trained weight matrix,
ΔW: The weight update from full fine-tuning, and
(α/r)AB: LoRA’s tuning result.
Eq (1.4) shows mathematically that the LoRA process introduces no non-linearity.
Traditional adapter methods introduce activation functions, making the model more complex and parameter heavy.
Instead, LoRA maintains the model’s original architecture, and because the update is purely additive, A and B can be merged into the pre-trained weights after training, adding no extra latency or computational cost at inference.
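To make the additive structure of Eqs (1.3) and (1.4) concrete, here is a minimal NumPy sketch with random placeholder weights (the shapes mirror the Qwen example; nothing is loaded from a real checkpoint). It runs the LoRA forward path and verifies that merging A and B into the base weight leaves the output unchanged.

```python
import numpy as np

d, r, alpha = 2_048, 8, 16
rng = np.random.default_rng(0)

W_pt = rng.standard_normal((d, d))           # frozen pre-trained weight (placeholder values)
A = rng.standard_normal((d, r)) * 0.01       # trainable, d x r
B = rng.standard_normal((r, d)) * 0.01       # trainable, r x d (zero-initialized in real LoRA;
                                             # non-zero here to mimic a trained adapter)

x = rng.standard_normal((1, d))              # one token's hidden state

# Forward pass during training: base path + scaled low-rank path, no activation in between.
h_lora = x @ W_pt + (alpha / r) * (x @ A @ B)

# After training, the low-rank update can be folded into the base weight for inference.
W_merged = W_pt + (alpha / r) * (A @ B)
h_merged = x @ W_merged

print(np.allclose(h_lora, h_merged))         # True: merging does not change the output
```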
The Math of <1%: Why LoRA is Computationally Superior
Now, Figure A shows that LoRA with rank eight (r = 8) trains only about 8.3 million parameters, roughly 0.5% of the 1.7 billion total.
Let’s walk through the arithmetic using Eq (1.2).
◼ LoRA Targeting Attention Layers
Qwen-3-1.7B has 16 Query heads and 8 Key-Value (KV) heads in the attention layer.
In the common scenario where LoRA targets the four main projection matrices: W_q, W_k, W_v, and W_{out} in the attention layer, the total number of trainable parameters is 114,688:
| Module | Original (d_in × d_out) | LoRA (r = 8) | LoRA Trainable Params |
| --- | --- | --- | --- |
| Query (W_q) | 2,048 × 2,048 | (2,048 × 8) + (8 × 2,048) | 32,768 |
| Key (W_k) | 2,048 × 1,024 | (2,048 × 8) + (8 × 1,024) | 24,576 |
| Value (W_v) | 2,048 × 1,024 | (2,048 × 8) + (8 × 1,024) | 24,576 |
| Output (W_o) | 2,048 × 2,048 | (2,048 × 8) + (8 × 2,048) | 32,768 |
| Total Per Layer | 12,582,912 | | 114,688 |
Note: Key and Value matrices are smaller (1,024 output dimensions) because Qwen uses Grouped-Query Attention (GQA) with 8 KV heads instead of 16.
Table 2-1. Comparison of total parameters and LoRA trainable parameters in a single attention layer of Qwen-3-1.7B
Because the model has 28 layers, the total number of LoRA-trainable parameters across the attention layers is about 3.2 million:

$$114{,}688 \times 28 = 3{,}211{,}264 \approx 3.2\text{M}$$
which is just 0.9% of all parameters in the attention layers.
◼ LoRA Targeting FF Layers
Similar to the attention layers, I’ll break down the computation in the FF layers.
Qwen-3-1.7B leverages SwiGLU architecture where the information is processed by three layers:
Gate layer.
Up layer.
Down layer.
When LoRA targets all three layers, the total number of trainable parameters per FF layer is 181,248:
| Module | Original (d_in × d_out) | LoRA (r = 8) | LoRA Trainable Params |
| --- | --- | --- | --- |
| Gate (W_g) | 2,048 × 5,504 | (2,048 × 8) + (8 × 5,504) | 60,416 |
| Up (W_u) | 2,048 × 5,504 | (2,048 × 8) + (8 × 5,504) | 60,416 |
| Down (W_d) | 5,504 × 2,048 | (5,504 × 8) + (8 × 2,048) | 60,416 |
| Total Per Layer | 33,816,576 | | 181,248 |
*Note: The model expands its hidden dimension from 2,048 to 5,504 inside the FF layer to add representational capacity.
Table 2-2. Comparison of total parameters and LoRA trainable parameters in a single FF layer of Qwen-3-1.7B
So, the total number of LoRA-trainable parameters across all 28 FF layers is about 5.1 million:

$$181{,}248 \times 28 = 5{,}074{,}944 \approx 5.1\text{M}$$
which is roughly 0.5% of all parameters in the FF layers (about 0.3% of the model’s total).
In total, LoRA tunes only about 8.3 million parameters (3,211,264 + 5,074,944 = 8,286,208).
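The arithmetic above is easy to verify programmatically. The short sketch below applies Eq (1.2) to the module shapes taken from Tables 2-1 and 2-2; the helper function is just for illustration.

```python
def lora_params(d_in: int, d_out: int, r: int = 8) -> int:
    """Eq (1.2): trainable parameters for one LoRA-adapted projection."""
    return r * (d_in + d_out)

# Attention projections (K/V project down to 1,024 dims because of GQA).
attention_modules = [(2048, 2048), (2048, 1024), (2048, 1024), (2048, 2048)]  # W_q, W_k, W_v, W_o
# SwiGLU feed-forward projections.
ff_modules = [(2048, 5504), (2048, 5504), (5504, 2048)]                       # W_gate, W_up, W_down

per_layer_attn = sum(lora_params(i, o) for i, o in attention_modules)   # 114,688
per_layer_ff = sum(lora_params(i, o) for i, o in ff_modules)            # 181,248

n_layers = 28
total = (per_layer_attn + per_layer_ff) * n_layers
print(per_layer_attn, per_layer_ff, total)   # 114688 181248 8286208 (~8.3M)
```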
Tuning LoRA Hyperparameters
Although LoRA seems versatile, tuning its hyperparameters is critical to balance the new knowledge against the learning acquired during pre-training.
Three primary variables drive this adjustment:
Layer selection: Which dense layers to apply LoRA to.
Rank r: The degree of LoRA’s approximation.
Scaling parameter α: The balance between the pre-trained weights and the LoRA adaptation.
◼ Layer Selection
In the initial paper [1], LoRA was originally designed to target the attention layers (Orange block in Figure A), specifically the Query and Value projection matrices (W_q, W_v).
The logic behind this is that these projections are the most critical components for capturing task-specific behavior while keeping the number of trainable parameters minimal.
But the paper also mentioned that LoRA can be applied to any dense layer, so many modern implementations target both attention and FF layers for better performance.
In practice, the choice set is:
Attention layers only.
FF layers only.
Both attention and FF layers.
Developer Note:
Some implementations only target the Query (q_proj) and Value (v_proj) projections by default.
While this saves VRAM, it can limit the adapter’s ability to learn complex reasoning.
When the model’s performance is underwhelming, try expanding the target modules to include the Key and Output projections (k_proj, o_proj) as well as the FF projections (gate_proj, up_proj), as in the configuration sketch below.
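As a concrete illustration, a configuration along these lines might look as follows with Hugging Face PEFT; the model checkpoint name and hyperparameter values are placeholders to adjust for your own setup.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")  # placeholder checkpoint

lora_config = LoraConfig(
    r=8,                       # rank of the decomposition
    lora_alpha=16,             # scaling parameter (alpha / r = 2)
    lora_dropout=0.05,
    target_modules=[           # attention + FF projections instead of q_proj/v_proj only
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # reports trainable vs. total parameter counts
```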
◼ Rank
The rank r defines the adapter’s capacity to learn new knowledge.
A high rank improves knowledge storage and generalization capability, while a low rank saves computational cost, as per Eq (1.2).
Common choices are r = 8 and r = 16, which balance computational cost against the model’s learning capacity.
A high rank such as r = 64 or r = 128 is suitable when:
Handling extremely complex task adaptation (e.g., diagnosing a rare disease), which requires more learning capacity, and
Requiring better generalization capacity when training samples are limited.
Developer Note:
The rank trap: “higher is better” is not always the case.
A high-rank LoRA tends to overfit, hit a performance ceiling, and consume more VRAM.
A practical strategy is to start with a low rank (r = 8) and increase it gradually, only when the model fails to learn the new knowledge.
◼ Scaling Parameter
The scaling parameter α defines the balance between the new learning and the pre-trained learning.
The scaling parameter defaults to one (α = 1). When the scaling factor α/r works out to one (i.e., α = r), the pre-trained weights and the low-rank update (the new learning) are weighted equally during the forward pass.
Adjusting the scaling factor (α/r) depends on the task strategies:
| Strategy | α Setting | Resulting α/r | Use Case |
| --- | --- | --- | --- |
| Aggressive | 4 × r | 4 | When the task is very different from the base model's training (e.g., teaching a completely new language). |
| Standard | 2 × r | 2 | Common choice for general-purpose fine-tuning; most stable. |
| Conservative | 1 × r | 1 | To prevent catastrophic forgetting and keep the model's original learning intact. |
Table 3. Scaling factor examples by use case
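If it helps, here is a tiny helper that maps the Table 3 strategies to concrete α values for a given rank; the function name and strategy keys are mine, not part of any library API.

```python
def lora_alpha_for(r: int, strategy: str = "standard") -> int:
    """Map the Table 3 strategies to a concrete lora_alpha value for a given rank."""
    multipliers = {"conservative": 1, "standard": 2, "aggressive": 4}
    return multipliers[strategy] * r

r = 8
for strategy in ("conservative", "standard", "aggressive"):
    alpha = lora_alpha_for(r, strategy)
    print(f"{strategy:>12}: lora_alpha={alpha:<3} scaling factor alpha/r = {alpha / r:.0f}")
# conservative: lora_alpha=8   scaling factor alpha/r = 1
#     standard: lora_alpha=16  scaling factor alpha/r = 2
#   aggressive: lora_alpha=32  scaling factor alpha/r = 4
```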
Security & Vulnerabilities: Is LoRA Truly Bulletproof?
Aside from misconfiguration, LoRA has adversarial vulnerabilities in its:
Learning process.
Hyperparameters.
Architecture itself.
◼ Vulnerabilities in Learning Process - Data-Centric Attacks
LoRA’s low-rank matrices are prone to learning noise because of their simple structure.
Data-centric attacks, one of the most frequent attack classes, inject poisoned or noisy samples into the training set to mislead the adapter.
Examples include:
Adversarial Fine-tuning (LoFT).
Backdoor Poisoning (LoRATK).
Untargeted data (noise) poisoning.
The Fix:
- Use spectral signatures to detect poisoned samples in the training set before they reach the LoRA adapter.
◼ Vulnerabilities in Hyperparameters - Parameter-Centric Attacks
Parameter-centric attacks exploit LoRA’s hyperparameters and configuration.
Examples include:
Weight amplification (LoBAM), which updates the scaling factor to an extreme value, and
Gradient Assembly Poisoning (GAP) [2], which modifies matrices A and B into statistically identical-looking matrices whose product AB creates a malicious shift in the model’s weights.
The Fix:
- Implement product-space validation: the server reconstructs AB and verifies its spectral properties before merging it into the global model, as sketched below.
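As a rough illustration of the product-space idea, the NumPy sketch below reconstructs the proposed update (α/r)·AB and rejects it when its spectral norm exceeds a threshold. The threshold value and the simulated attack are illustrative assumptions, not a vetted defense.

```python
import numpy as np

def validate_lora_update(A: np.ndarray, B: np.ndarray, alpha: float, r: int,
                         max_spectral_norm: float = 1.0) -> bool:
    """Reconstruct the full update (alpha/r) * A @ B and check its spectral norm
    (largest singular value) before merging it into the global model."""
    delta_w = (alpha / r) * (A @ B)
    spectral_norm = np.linalg.norm(delta_w, ord=2)
    return spectral_norm <= max_spectral_norm

# A well-scaled adapter vs. one with an amplified (LoBAM-style) update.
rng = np.random.default_rng(42)
d, r, alpha = 2_048, 8, 16
A = rng.standard_normal((d, r)) * 0.01
B_benign = rng.standard_normal((r, d)) * 0.01
B_malicious = B_benign * 100

print(validate_lora_update(A, B_benign, alpha, r))     # True  -> accept
print(validate_lora_update(A, B_malicious, alpha, r))  # False -> reject
```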
◼ Architectural Vulnerabilities - Extraction Attacks
Because the LoRA adapter is small and structurally simple, attackers can exploit it to leak information about the training data.
Examples include:
Membership inference attacks (LoRA-Leak), which trace the training data via the loss profile of the LoRA adapter.
Model extraction (StolenLoRA) to clone the functionality of the LoRA adapter to mimic the base model.
The Fix:
- Apply differential privacy (DP) directly to the LoRA updates (ensure that individual data points cannot be reconstructed from the adapter weights).
Wrapping Up - Various LoRA Configurations and Full Fine-tuning
LoRA is an efficient fine-tuning technique for transformer-based language models, especially when carefully configured and implemented.
I’ve included a comparison table below that maps VRAM requirements to specific LoRA ranks:
| Feature | Ref. Full Fine-Tuning | LoRA (r=8, FF Only) | LoRA (r=64, FF Only) | LoRA (r=8, All Layers) | LoRA (r=64, All Layers) |
| --- | --- | --- | --- | --- | --- |
| Parameters Updated | ~1.7B (100%) | ~5M (0.07%) | ~40M (0.5%) | ~18M (0.25%) | ~150M (2.1%) |
| VRAM Required | ~112GB - 140GB | ~6GB - 10GB | ~10GB - 14GB | ~12GB - 16GB | ~20GB - 26GB |
| Training Samples | 5,000 to 50,000+ | 100 to 5,000 | 2,000 to 15,000 | 500 to 10,000 | 5,000 to 20,000 |
| Best Use Case | Deep domain change (e.g., new language). | Style mimicry & basic tasks. | Domain Adaptation (Legal/Med). | Balanced General Finetuning. | High-complexity reasoning/knowledge. |
| Learning Depth | Infinite (Total Rewiring) | Shallow (Pattern Matching) | Deep (Niche knowledge) | Nuanced (Logic + Knowledge) | Very Deep (High-fidelity) |
| Risk | High (Forgetting) | Very Low (Frozen Base) | Low (Safe) | Low | Moderate (Slightly slower) |
Table 4. Comparison of LoRA configuration and full fine-tuning
Full fine-tuning (the reference column in Table 4) updates every parameter, so it requires a significant number of samples to prevent overfitting, as well as substantial VRAM to process and store the updated weights.
When targeting more layers and increasing the rank, LoRA requires more samples and VRAM because the optimizer must track states for a larger number of parameters.
But even with a high-rank (r = 64) LoRA targeting all layers (rightmost column, Table 4), the requirements remain significantly lower than those of full fine-tuning (e.g., up to ~20k samples vs. 50k+ samples).
Save this comparison for your next architectural review!
◼ References
[1] LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
[2] Low Rank Comes with Low Security: Gradient Assembly Poisoning Attacks against Distributed LoRA-based LLM Systems (Dong et al., 2025)
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
The Definitive Guide to LLM Fine-Tuning: Objectives, Mechanisms, and Hardware
Building LoRA Multi-Adapter Inference on AWS SageMaker
Transformer Architecture: Self-Attention & MLOps Guide
The Reasoning Wall: A Comparative Benchmark of Llama 3.2 vs. Qwen 3
Tokenization Strategies for LLM Applications
Optimizing LLM Performance: Context Window Impact on RAG Accuracy
Related Books for Further Understanding
These books cover the wide range of theories and practices; from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation
Share What You Learned
Kuriko IWAI, "Deconstructing LoRA: The Math and Mechanics of Low-Rank Adaptation" in Kernel Labs
https://kuriko-iwai.com/lora-llm-fine-tuning-mechanics-guide
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.





