Deconstructing LoRA: The Math and Mechanics of Low-Rank Adaptation
Master the math of rank decomposition, hyperparameter tuning, and tips to avoid common pitfalls
By Kuriko IWAI

Table of Contents
Introduction
What Is LoRA
How LoRA Works - The Core Mechanism
The Math of <1%: Why LoRA is Computationally Superior
Tuning LoRA Hyperparameters
Security & Vulnerabilities: Is LoRA Truly Bulletproof?
Wrapping Up - Various LoRA Configurations and Full Fine-tuning

Introduction
Architecting a language model from the ground up requires massive GPU clusters because the model has an enormous number of trainable parameters.
But in the real world where VRAM is a finite resource and not everyone has a cluster of A100s, we need a more surgical approach.
LoRA tackles this challenge by training only a tiny fraction of the model parameters, achieving performance close to that of a fully fine-tuned model while updating less than 1% of them.
In this article, I’ll take a deep dive into LoRA’s core mechanism and address common pitfalls in its configuration and implementation.
What Is LoRA
Low-Rank Adaptation (LoRA) is a fine-tuning technique that optimizes low-rank decomposition matrices within the transformer's dense layers rather than updating all model parameters.
Originally introduced by Hu et al. [1], LoRA has become a cornerstone of Parameter-Efficient Fine-Tuning (PEFT) for adapting language models to downstream tasks.
The diagram below illustrates how LoRA is applied within a standard decoder-only transformer, using Qwen-3-1.7B model to demonstrate the parameter breakdown:

Figure A. Decoder-only transformer architecture and parameter breakdown for Qwen-3-1.7B model and its LoRA adapter (Created by Kuriko IWAI)
◼ Deconstructing Parameter Allocation in Decoder-Only Transformers
Figure A shows that the model features 1.7 billion parameters distributed across its architecture.
Before detailing the mechanics of LoRA, I will examine its allocation strategy.
▫ 1. The Vocabulary and Input Embedding Layer
The model first converts raw text into input embeddings by looking up each token (a word or sub-word) in a massive lookup table (Green box, Figure A).
Each token is assigned a token ID, which maps to a unique vector.
For example, if the token "cat" is assigned Token ID 1, that ID points to a vector of decimals ranging roughly from -1 to 1, such as [-0.0124, 0.4591, -0.8233, ..., -0.1192]:
These numbers represent the token’s semantic meaning across the model's hidden dimensions.
And because Qwen-3-1.7B has 2,048 hidden dimensions, the model requires 2,048 columns to represent one token.
With a vocabulary size of 151,936, the resulting lookup table has 151,936 rows and 2,048 columns:
| Token ID | Dim 1 | Dim 2 | Dim 3 | ... | Dim 2048 |
| --- | --- | --- | --- | --- | --- |
| ID 0 | 0.0135 | -0.022 | 0.0556 | ... | 0.1167 |
| ID 1 (cat) | -0.0124 | 0.4591 | -0.8233 | ... | -0.1192 |
| ... | ... | ... | ... | ... | ... |
| ID 1523 (apple) | -0.0445 | 0.4503 | -0.8240 | ... | 0.1103 |
| ... | ... | ... | ... | ... | ... |
| ID 151,935 | 0.0904 | 0.0120 | -0.0555 | ... | 0.0224 |
Table 1. A mock vocabulary matrix of Qwen-3-1.7B
To store this matrix, the model must allocate roughly 311 million parameters:

$$151{,}936 \times 2{,}048 = 311{,}164{,}928 \approx 311\text{M}$$
This matrix is reused during the final decoding stage (Linear layer, Figure A) to translate the model's internal math back into human-readable text.
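For intuition, here is a minimal PyTorch sketch of that lookup table and the weight tying described above; the embedding values are randomly initialized placeholders, not Qwen’s actual weights.

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim = 151_936, 2_048

# The input embedding: one 2,048-dim row per token ID (~311M parameters).
embedding = nn.Embedding(vocab_size, hidden_dim)

# Look up the vector for a single token ID (e.g., "cat" -> ID 1 in Table 1).
token_ids = torch.tensor([1])
vector = embedding(token_ids)                 # shape: (1, 2048)

# The same matrix is reused as the output projection (weight tying),
# mapping hidden states back to vocabulary logits at the decoding stage.
lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
lm_head.weight = embedding.weight             # ties the two matrices together

print(vector.shape)                                       # torch.Size([1, 2048])
print(sum(p.numel() for p in embedding.parameters()))     # 311164928
```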
▫ 2. The Core Engine - Transformer Blocks
Next, the model processes the data through the core engine, transformer blocks (Pink box, Figure A).
The parameters within these blocks are distributed:
Attention layers (Orange box, Figure A): 12.5M parameters per block.
Feed-Forward (FF) layers (Blue box, Figure A): 33.8M parameters per block.
Because Qwen-3-1.7B stacks 28 of these transformer blocks on top of each other:
- Attention layers total: 12,582,912 × 28 = 352,321,536 ≈ 352M parameters
- FF layers total: 33,816,576 × 28 = 946,864,128 ≈ 947M parameters
▫ 3. The Final Tally
To reach the final prediction, the model applies a linear transformation using the original 311M parameters from the vocabulary, and a Softmax function to normalize the results into a probability distribution (Grey boxes, Figure A).
Adding up the small contributions from the normalization layers, the model lands at its total of roughly 1.7 billion parameters:

$$311\text{M (embedding)} + 352\text{M (attention)} + 947\text{M (FF)} + \text{norm layers} \approx 1.7\text{B}$$
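For a quick sanity check, the tally can be reproduced in a few lines of Python. The per-block figures below come from Figure A and the tables later in this article, so treat the result as back-of-the-envelope arithmetic rather than an exact count read from the checkpoint.

```python
embedding_params = 151_936 * 2_048     # ~311M, shared with the output linear layer
attention_per_block = 12_582_912       # ~12.5M (see Table 2-1)
ff_per_block = 33_816_576              # ~33.8M (see Table 2-2)
n_blocks = 28

attention_total = attention_per_block * n_blocks   # ~352M
ff_total = ff_per_block * n_blocks                 # ~947M

total = embedding_params + attention_total + ff_total
print(f"{total:,} parameters (~{total / 1e9:.2f}B before normalization layers)")
# 1,610,350,592 parameters (~1.61B before normalization layers)
```

The remaining gap to the headline 1.7B figure comes from the normalization layers and rounding in the per-block counts.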
Full fine-tuning trains all 1.7 billion parameters, requiring massive VRAM and data samples to prevent overfitting.
Developer Note - Balancing Dimensions (storage) and Layers (thinking capacity)
The very first Transformer from the "Attention Is All You Need" paper established the convention for hidden dimension sizes that later models loosely follow:
Small Models: 512 or 768.
Medium Models (e.g., Qwen): 1,024 or 2,048.
Large Models (e.g., Llama): 4,096.
Giant Models (e.g., GPT-3): 12,288.
To stay within a total budget of roughly 1.7 billion parameters, Qwen-3 has to balance the hidden dimension size against the number of layers.
If the model were:
Too wide (e.g., 8,192 dimensions), it could only afford about three layers: plenty of room to represent each token, but not enough depth to reason over the input when generating a response.
Too narrow (e.g., 128 dimensions), it could afford 500+ layers, but each layer would lack the capacity to richly represent even a single word.
Ultimately, the combination of 28 layers × 2,048 dimensions provides enough capacity to store and process relatively complex concepts without making the model too slow to run on consumer GPUs.
How LoRA Works - The Core Mechanism
LoRA optimizes the fine-tuning process by targeting the core transformer blocks (Pink box, Figure A) through a specific mathematical shortcut: low-rank decomposition.
◼ Low-Rank Decomposition
Low-rank decomposition is a technique from linear algebra in which a large matrix is approximated by the product of two smaller matrices.
It builds on rank factorization, a fundamental result stating that any non-zero matrix Z can be written as the product of two full-rank matrices, X and Y:

$$Z_{m \times n} = X_{m \times k}\, Y_{k \times n}, \qquad k = \operatorname{rank}(Z)$$
In the context of standard fine-tuning, Z represents a weight update matrix with the same size as the original model weight (e.g., 2,048 × 2,048 in the case of Qwen-3-1.7B).
Large models have enormous weight matrices, which makes storing and computing the full-rank factors X and Y prohibitively expensive.
LoRA bypasses this bottleneck by leveraging low-rank decomposition:

$$Z_{m \times n} \approx A_{m \times r}\, B_{r \times n}, \qquad r \ll \operatorname{rank}(Z)$$
Under this approximation, the low-rank matrices A and B are taken to represent the most important latent structure of the original large matrix Z, on the assumption that much of the data in Z is redundant or correlated and can therefore be ignored (Gray area, Figure B).

Figure B. Visual representation of low-rank decomposition (Created by Kuriko IWAI)
So, instead of computing the full-rank update Z, LoRA tunes the much smaller matrices A and B with a reduced inner dimension, called the rank (r).
This process is generalized as:

$$\Delta W = \frac{\alpha}{r} A B \tag{1.1}$$
where:
ΔW: The full-rank update matrix (size d x d),
A: The first low-rank matrix of size d x r,
B: The second low-rank matrix of size r x d,
α/r: The scaling factor to control the magnitude of the new learning,
r: The rank (LoRA hyperparameter, e.g., r = 8), and
α: The scaling parameter (LoRA hyperparameter). Default is one (α = 1).
As Figure B shows, the rank r determines how aggressively LoRA compresses the information in the original matrix.
For example, Qwen-3 (d = 2,048) has ΔW with over 4 million parameters (2,048 x 2,048 = 4,194,304).
Fully fine-tuning this matrix means updating all 4 million parameters.
But with a rank-eight decomposition (r = 8), LoRA only needs to tune 32,768 of them:
- Matrix A: 2,048 × 8 = 16,384 parameters
- Matrix B: 8 × 2,048 = 16,384 parameters
- Total trainable parameters: 16,384 + 16,384 = 32,768
LoRA successfully reduces the number of trainable parameters while preserving the essential information in the original weight update.
The calculation of total trainable parameters is generalized as:

$$n_{lora} = r \times (d_{in} + d_{out}) \tag{1.2}$$
where:
n_{lora}: Total number of trainable parameters that LoRA targets to tune,
d_{in}: The input dimension of the original layer (e.g., 2,048),
d_{out}: The output dimension of the original layer (e.g., 2,048), and
r: The rank (where r ≪ d_{in} and r ≪ d_{out}).
And because the LoRA update has the same output dimensions as the full fine-tuning update, it can simply be added to the pre-trained weight W_{pt}:

$$W_{ft} = W_{pt} + \Delta W \tag{1.3}$$

$$W_{ft} \approx W_{pt} + \frac{\alpha}{r} A B \tag{1.4}$$
where:
W_{ft}: The weight matrix after fine-tuning,
W_{pt}: The pre-trained weight matrix,
ΔW: The weight update from full fine-tuning, and
(α/r)AB: LoRA’s tuning result.
Eq (1.4) shows mathematically that the LoRA process introduces no non-linearity.
Traditional adapter methods introduce activation functions, making the model more complex and parameter heavy.
Instead, LoRA maintains the model’s original architecture, and because the update is purely additive, A and B can be merged into the pre-trained weights after training, adding no extra latency or computational cost at inference.
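To make the additive structure of Eqs (1.3) and (1.4) concrete, here is a minimal NumPy sketch with random placeholder weights (the shapes mirror the Qwen example; nothing is loaded from a real checkpoint). It runs the LoRA forward path and verifies that merging A and B into the base weight leaves the output unchanged.

```python
import numpy as np

d, r, alpha = 2_048, 8, 16
rng = np.random.default_rng(0)

W_pt = rng.standard_normal((d, d))           # frozen pre-trained weight (placeholder values)
A = rng.standard_normal((d, r)) * 0.01       # trainable, d x r
B = rng.standard_normal((r, d)) * 0.01       # trainable, r x d (zero-initialized in real LoRA;
                                             # non-zero here to mimic a trained adapter)

x = rng.standard_normal((1, d))              # one token's hidden state

# Forward pass during training: base path + scaled low-rank path, no activation in between.
h_lora = x @ W_pt + (alpha / r) * (x @ A @ B)

# After training, the low-rank update can be folded into the base weight for inference.
W_merged = W_pt + (alpha / r) * (A @ B)
h_merged = x @ W_merged

print(np.allclose(h_lora, h_merged))         # True: merging does not change the output
```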
The Math of <1%: Why LoRA is Computationally Superior
Now, Figure A shows that LoRA with rank eight (r = 8) trains only about 8.3 million parameters, roughly 0.5% of the 1.7 billion total.
Let’s walk through the arithmetic using Eq (1.2).
◼ LoRA Targeting Attention Layers
Qwen-3-1.7B has 16 Query heads and 8 Key-Value (KV) heads in the attention layer.
In the common scenario where LoRA targets the four main projection matrices: W_q, W_k, W_v, and W_{out} in the attention layer, the total number of trainable parameters is 114,688:
| Module | Original (d_in × d_out) | LoRA (r = 8) | LoRA Trainable Params |
| --- | --- | --- | --- |
| Query (W_q) | 2,048 × 2,048 | (2,048 × 8) + (8 × 2,048) | 32,768 |
| Key (W_k) | 2,048 × 1,024 | (2,048 × 8) + (8 × 1,024) | 24,576 |
| Value (W_v) | 2,048 × 1,024 | (2,048 × 8) + (8 × 1,024) | 24,576 |
| Output (W_o) | 2,048 × 2,048 | (2,048 × 8) + (8 × 2,048) | 32,768 |
| Total Per Layer | 12,582,912 | | 114,688 |
Note: Key and Value matrices are smaller (1,024 output dimensions) because Qwen uses Grouped-Query Attention (GQA) with 8 KV heads instead of 16.
Table 2-1. Comparison of total parameters and LoRA trainable parameters in a single attention layer of Qwen-3-1.7B
Because the model has 28 layers, the total number of LoRA-trainable parameters across the attention layers is about 3.2 million:

$$114{,}688 \times 28 = 3{,}211{,}264 \approx 3.2\text{M}$$
which is just 0.9% of all parameters in the attention layers.
◼ LoRA Targeting FF Layers
Similar to the attention layers, I’ll break down the computation in the FF layers.
Qwen-3-1.7B leverages SwiGLU architecture where the information is processed by three layers:
Gate layer.
Up layer.
Down layer.
When LoRA targets all three layers, the total number of trainable parameters per FF layer is 181,248:
| Module | Original (d_in × d_out) | LoRA (r = 8) | LoRA Trainable Params |
| --- | --- | --- | --- |
| Gate (W_g) | 2,048 × 5,504 | (2,048 × 8) + (8 × 5,504) | 60,416 |
| Up (W_u) | 2,048 × 5,504 | (2,048 × 8) + (8 × 5,504) | 60,416 |
| Down (W_d) | 5,504 × 2,048 | (5,504 × 8) + (8 × 2,048) | 60,416 |
| Total Per Layer | 33,816,576 | | 181,248 |
*Note: The model expands its hidden dimension from 2,048 to 5,504 inside the FF layer to add representational capacity.
Table 2-2. Comparison of total parameters and LoRA trainable parameters in a single FF layer of Qwen-3-1.7B
So, the total number of LoRA-trainable parameters across all 28 FF layers is about 5.1 million:

$$181{,}248 \times 28 = 5{,}074{,}944 \approx 5.1\text{M}$$
which is roughly 0.5% of all parameters in the FF layers (about 0.3% of the model’s total).
In total, LoRA tunes only about 8.3 million parameters (3,211,264 + 5,074,944 = 8,286,208).
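The arithmetic above is easy to verify programmatically. The short sketch below applies Eq (1.2) to the module shapes taken from Tables 2-1 and 2-2; the helper function is just for illustration.

```python
def lora_params(d_in: int, d_out: int, r: int = 8) -> int:
    """Eq (1.2): trainable parameters for one LoRA-adapted projection."""
    return r * (d_in + d_out)

# Attention projections (K/V project down to 1,024 dims because of GQA).
attention_modules = [(2048, 2048), (2048, 1024), (2048, 1024), (2048, 2048)]  # W_q, W_k, W_v, W_o
# SwiGLU feed-forward projections.
ff_modules = [(2048, 5504), (2048, 5504), (5504, 2048)]                       # W_gate, W_up, W_down

per_layer_attn = sum(lora_params(i, o) for i, o in attention_modules)   # 114,688
per_layer_ff = sum(lora_params(i, o) for i, o in ff_modules)            # 181,248

n_layers = 28
total = (per_layer_attn + per_layer_ff) * n_layers
print(per_layer_attn, per_layer_ff, total)   # 114688 181248 8286208 (~8.3M)
```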
Tuning LoRA Hyperparameters
Although LoRA seems versatile, tuning its hyperparameters is critical to balance the new knowledge against the learning acquired during pre-training.
Three primary variables drive this adjustment:
Layer selection: Which dense layers to apply LoRA to.
Rank r: The degree of LoRA’s approximation.
Scaling parameter α: The balance between the pre-trained weights and the LoRA adaptation.
◼ Layer Selection
In the initial paper [1], LoRA was originally designed to target the attention layers (Orange block in Figure A), specifically the Query and Value projection matrices (W_q, W_v).
The logic behind this is that these projections are the most critical components for capturing task-specific behavior while keeping the number of trainable parameters minimal.
But the paper also mentioned that LoRA can be applied to any dense layer, so many modern implementations target both attention and FF layers for better performance.
In practice, the choice set is:
Attention layers only.
FF layers only.
Both attention and FF layers.
Developer Note:
Some implementations only target the Query (q_proj) and Value (v_proj) projections by default.
While this saves VRAM, it can limit the adapter’s ability to learn complex reasoning.
When the model’s performance is underwhelming, try expanding the target modules to include the Key and Output projections (k_proj, o_proj) as well as the FF projections (gate_proj, up_proj), as in the configuration sketch below.
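As a concrete illustration, a configuration along these lines might look as follows with Hugging Face PEFT; the model checkpoint name and hyperparameter values are placeholders to adjust for your own setup.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")  # placeholder checkpoint

lora_config = LoraConfig(
    r=8,                       # rank of the decomposition
    lora_alpha=16,             # scaling parameter (alpha / r = 2)
    lora_dropout=0.05,
    target_modules=[           # attention + FF projections instead of q_proj/v_proj only
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # reports trainable vs. total parameter counts
```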
◼ Rank
The rank r defines the adapter’s capacity to learn new knowledge.
A high rank improves knowledge storage and generalization capability, while a low rank saves computational cost, as per Eq (1.2).
Common choices are r = 8 and r = 16, which balance computational cost against the model’s learning capacity.
A high rank such as r = 64 or r = 128 is suitable when:
Handling extremely complex task adaptation (e.g., diagnosing a rare disease), which requires more learning capacity, and
Requiring better generalization capacity when training samples are limited.
Developer Note:
The rank trap: “higher is better” is not always the case.
A high-rank LoRA tends to overfit, hit a performance ceiling, and consume more VRAM.
A practical strategy is to start with a low rank (r = 8) and increase it gradually, only when the model fails to learn the new knowledge.
◼ Scaling Parameter
The scaling parameter α defines the balance between the new learning and the pre-trained learning.
The scaling parameter defaults to one (α = 1). When the scaling factor α/r works out to one (i.e., α = r), the pre-trained weights and the low-rank update (the new learning) are weighted equally during the forward pass.
Adjusting the scaling factor (α/r) depends on the task strategies:
| Strategy | α Setting | Resulting α/r | Use Case |
| --- | --- | --- | --- |
| Aggressive | 4 × r | 4 | When the task is very different from the base model's training (e.g., teaching a completely new language). |
| Standard | 2 × r | 2 | Common choice for general-purpose fine-tuning; most stable. |
| Conservative | 1 × r | 1 | To prevent catastrophic forgetting and keep the model's original learning intact. |
Table 3. Scaling factor examples by use case
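If it helps, here is a tiny helper that maps the Table 3 strategies to concrete α values for a given rank; the function name and strategy keys are mine, not part of any library API.

```python
def lora_alpha_for(r: int, strategy: str = "standard") -> int:
    """Map the Table 3 strategies to a concrete lora_alpha value for a given rank."""
    multipliers = {"conservative": 1, "standard": 2, "aggressive": 4}
    return multipliers[strategy] * r

r = 8
for strategy in ("conservative", "standard", "aggressive"):
    alpha = lora_alpha_for(r, strategy)
    print(f"{strategy:>12}: lora_alpha={alpha:<3} scaling factor alpha/r = {alpha / r:.0f}")
# conservative: lora_alpha=8   scaling factor alpha/r = 1
#     standard: lora_alpha=16  scaling factor alpha/r = 2
#   aggressive: lora_alpha=32  scaling factor alpha/r = 4
```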
Security & Vulnerabilities: Is LoRA Truly Bulletproof?
Aside from misconfiguration, LoRA has adversarial vulnerabilities in its:
Learning process.
Hyperparameters.
Architecture itself.
◼ Vulnerabilities in Learning Process - Data-Centric Attacks
LoRA’s low-rank matrices are prone to learning noise because of their simple structure.
Data-centric attacks, one of the most frequent attack classes, inject poisoned or noisy samples into the training set to mislead the adapter.
Examples include:
Adversarial Fine-tuning (LoFT).
Backdoor Poisoning (LoRATK).
Untargeted data (noise) poisoning.
The Fix:
- Use spectral signatures to detect poisoned samples in the training set before they reach the LoRA adapter.
◼ Vulnerabilities in Hyperparameters - Parameter-Centric Attacks
Parameter-centric attacks exploit LoRA’s hyperparameters and configuration.
Examples include:
Weight amplification (LoBAM), which updates the scaling factor to an extreme value, and
Gradient Assembly Poisoning (GAP) [2], which modifies matrices A and B into statistically identical-looking matrices whose product AB creates a malicious shift in the model’s weights.
The Fix:
- Implement product-space validation: the server reconstructs AB and verifies its spectral properties before merging it into the global model, as sketched below.
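As a rough illustration of the product-space idea, the NumPy sketch below reconstructs the proposed update (α/r)·AB and rejects it when its spectral norm exceeds a threshold. The threshold value and the simulated attack are illustrative assumptions, not a vetted defense.

```python
import numpy as np

def validate_lora_update(A: np.ndarray, B: np.ndarray, alpha: float, r: int,
                         max_spectral_norm: float = 1.0) -> bool:
    """Reconstruct the full update (alpha/r) * A @ B and check its spectral norm
    (largest singular value) before merging it into the global model."""
    delta_w = (alpha / r) * (A @ B)
    spectral_norm = np.linalg.norm(delta_w, ord=2)
    return spectral_norm <= max_spectral_norm

# A well-scaled adapter vs. one with an amplified (LoBAM-style) update.
rng = np.random.default_rng(42)
d, r, alpha = 2_048, 8, 16
A = rng.standard_normal((d, r)) * 0.01
B_benign = rng.standard_normal((r, d)) * 0.01
B_malicious = B_benign * 100

print(validate_lora_update(A, B_benign, alpha, r))     # True  -> accept
print(validate_lora_update(A, B_malicious, alpha, r))  # False -> reject
```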
◼ Architectural Vulnerabilities - Extraction Attacks
Because the LoRA adapter is small and structurally simple, attackers can exploit it to leak information about the training data.
Examples include:
Membership inference attacks (LoRA-Leak), which trace the training data via the loss profile of the LoRA adapter.
Model extraction (StolenLoRA) to clone the functionality of the LoRA adapter to mimic the base model.
The Fix:
- Apply differential privacy (DP) directly to the LoRA updates (ensure that individual data points cannot be reconstructed from the adapter weights).
Wrapping Up - Various LoRA Configurations and Full Fine-tuning
LoRA is an efficient fine-tuning technique for transformer-based language models, especially when carefully configured and implemented.
I’ve included a comparison table below that maps VRAM requirements to specific LoRA ranks:
| Feature | Ref. Full Fine-Tuning | LoRA (r=8, FF Only) | LoRA (r=64, FF Only) | LoRA (r=8, All Layers) | LoRA (r=64, All Layers) |
| --- | --- | --- | --- | --- | --- |
| Parameters Updated | ~1.7B (100%) | ~5M (0.07%) | ~40M (0.5%) | ~18M (0.25%) | ~150M (2.1%) |
| VRAM Required | ~112GB - 140GB | ~6GB - 10GB | ~10GB - 14GB | ~12GB - 16GB | ~20GB - 26GB |
| Training Samples | 5,000 to 50,000+ | 100 to 5,000 | 2,000 to 15,000 | 500 to 10,000 | 5,000 to 20,000 |
| Best Use Case | Deep domain change (e.g., new language). | Style mimicry & basic tasks. | Domain Adaptation (Legal/Med). | Balanced General Finetuning. | High-complexity reasoning/knowledge. |
| Learning Depth | Infinite (Total Rewiring) | Shallow (Pattern Matching) | Deep (Niche knowledge) | Nuanced (Logic + Knowledge) | Very Deep (High-fidelity) |
| Risk | High (Forgetting) | Very Low (Frozen Base) | Low (Safe) | Low | Moderate (Slightly slower) |
Table 4. Comparison of LoRA configuration and full fine-tuning
Full fine-tuning (the reference column in Table 4) updates every parameter, so it requires a significant number of samples to prevent overfitting, as well as substantial VRAM to process and store the updated weights.
When targeting more layers and increasing the rank, LoRA requires more samples and VRAM because the optimizer must track states for a larger number of parameters.
But even with a high-rank (r = 64) LoRA targeting all layers (rightmost column, Table 4), the requirements remain significantly lower than those of full fine-tuning (e.g., up to ~20k samples vs. 50k+ samples).
Save this comparison for your next architectural review!
◼ References
[1] LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
[2] Low Rank Comes with Low Security: Gradient Assembly Poisoning Attacks against Distributed LoRA-based LLM Systems (Dong et al., 2025)
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
The Definitive Guide to LLM Fine-Tuning: Objectives, Mechanisms, and Hardware
Building LoRA Multi-Adapter Inference on AWS SageMaker
Transformer Architecture: Self-Attention & MLOps Guide
The Reasoning Wall: A Comparative Benchmark of Llama 3.2 vs. Qwen 3
Tokenization Strategies for LLM Applications
Optimizing LLM Performance: Context Window Impact on RAG Accuracy
Related Books for Further Understanding
These books cover the wide range of theories and practices; from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation
Share What You Learned
Kuriko IWAI, "Deconstructing LoRA: The Math and Mechanics of Low-Rank Adaptation" in Kernel Labs
https://kuriko-iwai.com/lora-llm-fine-tuning-mechanics-guide
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.





