Is 4-Bit All You Need? The Math Behind Modern LLM Compression
The Engineer’s Guide to LLM Quantization. Learn How Quantization Makes 70B Models Run on a Local GPU.
By Kuriko IWAI

Table of Contents
Introduction
What is Quantization
Types of Quantization
The Standard FP32 Precision
Major Quantization by Encoding Scheme
Wrapping Up

Introduction
Fine-tuning massive Large Language Models (LLMs) was once a luxury reserved for those with high-performance server clusters.
Today, quantization has transformed this landscape by significantly reducing computational costs, evolving from an optional optimization into a fundamental necessity for modern AI development.
In this article, I’ll outline the standard FP32 format first, then explore the major scalar quantization formats in LLM engineering, categorized by encoding scheme.
What is Quantization
Quantization is the process of mapping a large set of values to a smaller, discrete set.
In the context of LLM and deep learning, quantization reduces the numerical precision of a model’s weights and activations from the standard 32-bit floats (FP32) to lower-bit formats.
Quantization has significant memory saving benefits as the model size increases.
For example, quantizing a 70B parameter model like Llama 3–70B to 4-bit precision can reduce VRAM by over 200GB:

Figure A. Diagram comparing 32-bit vs 4-bit precision VRAM requirements for a 70B parameter model (Created by Kuriko IWAI)
At a standard 32-bit precision (Figure A, left), the model requires 280 GB of VRAM:
32 bits x 70B parameters / 8 bits/byte = 280B bytes = 280 GB
By utilizing 4-bit precision (Figure A, right), the VRAM requirement is reduced to 35 GB — just 1/8th of the original footprint:
4 bits x 70B parameters / 8 bits/byte = 35B bytes = 35 GB
This reduction makes it possible to run a massive 70B parameter model on consumer-grade GPUs.
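The arithmetic above can be wrapped in a small helper; the function name and decimal-GB rounding are my own illustration, not a library API, and the estimate covers weights only (no KV cache or activations):

```python
def vram_gb(num_params: float, bits: int) -> float:
    """Approximate VRAM needed to hold the weights alone."""
    bytes_total = num_params * bits / 8  # bits -> bytes
    return bytes_total / 1e9             # bytes -> GB (decimal)

# 70B parameters at different precisions
print(vram_gb(70e9, 32))  # 280.0 (FP32)
print(vram_gb(70e9, 4))   # 35.0  (4-bit)
```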
Quantization leverages this accessibility across the entire model lifecycle, including training, fine-tuning, and inference.
◼ Quantization in Model Training and Fine-Tuning
As Figure A shows, training an LLM from scratch in the standard FP32 is prohibitively expensive.
Quantization during training—specifically via Quantization-Aware Training (QAT) or QLoRA—offers several benefits:
Reduced VRAM overhead: Fine-tune massive models (like Llama-3-70B) on consumer-grade GPUs.
Faster throughput: Lower precision allows for more data to be processed in parallel, speeding up iterative training cycles.
Lower energy costs: Reduced bit-widths (e.g., from 32 to 16 or 8) lead to fewer bit-flips and less data movement, significantly lowering the electricity required to train the model.
Developer Note:
While lower precision can significantly reduce VRAM usage, it requires a careful balance to maintain model quality because it can cause the vanishing gradient problem during backpropagation.
◼ Quantization in Inference
Quantization enables the model to perform inference with greater speed and efficiency.
Memory footprint: As we saw in the example, a 70B parameter model quantized to 4-bit precision requires only 35GB of VRAM. This allows the model to run on a single high-end workstation or two consumer GPUs.
Latency: Compressing the data moved from the GPU memory to the processors can reduce the memory bandwidth bottleneck and result in faster tokens-per-second.
Edge deployment: With their reduced memory footprint, LLMs can run locally on laptops, smartphones, and embedded devices without constant internet connections.
Quantization levels can range from 16 bits down to 4 bits or even fewer.
After outlining the standard FP32 format, I’ll walk you through prominent quantization options.
Types of Quantization
Quantization can be categorized into four groups by its strategies:
Scalar Quantization (SQ): Maps each high-precision value to the nearest value on a discrete, lower-precision grid.
Binary Quantization (BQ): An extreme form of scalar quantization where every number is reduced to just a 1 or a 0 based on its sign (positive or negative).
Vector Quantization (VQ): Replaces a whole vector of numbers with a single index from a pre-trained codebook (dictionary) of representative patterns.
Product Quantization (PQ): A hybrid of SQ and VQ. Slices a high-dimensional vector into smaller sub-vectors and performs VQ on each piece independently.
The table below summarizes their characteristics and use cases:
| Type | Unit of Measure | Strategy | Accuracy | Examples | Use Cases |
|---|---|---|---|---|---|
| Scalar (SQ) | A single number | Rounding | HIGH | GGUF, EXL2, AWQ, NF4 | Personal AI & Local LLMs: Running SLMs or LLMs on laptops/phones. Best for text generation where speed is preferred over perfection. |
| Binary (BQ) | A single bit | Sign checking | LOW | BitNet 1.58b, HNSW-BQ | Edge AI: "Battery-sipping" AI on smartwatches or IoT sensors that only need to detect simple triggers (voice commands, gesture recognition). |
| Vector (VQ) | A single index. | Pattern matching | HIGHEST | AQLM, VQ-VAE, EnCodec | High-Fidelity Media Gen: Used in Sora-style video models and AI music to compress complex patterns without losing visual/audio textures. |
| Product (PQ) | A sub-vector | Divide & Conquer | MID | FAISS, Pinecone, Milvus | Long-Term Memory (RAG): Searching through billions of PDF pages or company documents in milliseconds to find relevant context for a chatbot. |
Table 1: A List of Quantization Types and Use Cases.
This article focuses on SQ.
The Standard FP32 Precision
Floating Point 32 (FP32) is the industry-standard precision for training deep learning models, offering the high dynamic range necessary to maintain numerical stability.
The format belongs to the Floating Point (FP) encoding scheme.
◼ How Floating Point (FP) Works
FP uses three key components to represent a decimal value:
Sign: Negative or positive.
Exponent: A scale of the number.
Mantissa: Actual digits of the number.
The diagram below illustrates how FP32 represents a decimal (weight) 0.15625:

Figure B. Breakdown of FP32 bit allocation among sign, exponent, and mantissa for decimal 0.15625 (Created by Kuriko IWAI)
FP32 allocates:
1 bit to the sign (Figure B, white)
8 bits to the exponent (orange), and
The remaining 23 bits to the mantissa (pink).
◼ Sign
The sign acts as a toggle switch for positive or negative:
0: The number is positive.
1: The number is negative.
Because it only needs to represent two states, it only ever needs 1 bit.
For a decimal 0.15625 in Figure B, the sign is zero, indicating the number 0.15625 is positive.
◼ Exponent
The exponent determines the range of numbers that the quantization format can represent.
FP32 has the range of 1.2 × 10^{-38} to 3.4 × 10^{38}.
The exponent of the decimal 0.15625 is 01111100 (Figure B, orange), representing 124.
FP32 applies a bias of 127 to the exponent.
So, the actual exponent is 124 - 127 = -3, indicating that the number is multiplied by 2^{-3} = 1/8.
◼ Mantissa
The mantissa (also called significand) represents the actual digits of the number (e.g., the 1.25 in 1.25 x 10^5).
In FP32, the mantissa provides roughly 7 significant decimal digits of precision; anything beyond that gets rounded off.
For example, the decimal 123.456789 is stored as approximately 123.45679 in FP32.
For the decimal 0.15625, the mantissa is 0100000... (lasting 23 bits) (Figure B, pink).
The computer assumes a leading 1. before these bits, so the binary 1.0100000... represents 1.25 (1 + (0 x 0.5) + (1 x 0.25) + (0 × 0.125) + … = 1.25) in decimal.
◼ The Math of FP32
Combining the sign, exponent, and mantissa reproduces the decimal 0.15625:
(-1)^0 × (1 + 0.25) × 2^{124 - 127} = 1.25 × 2^{-3} = 0.15625
The calculation is generalized as:
d = (-1)^S × (1 + M) × 2^{E - 127}
where:
d: The decimal value (weight in decimal),
S: The sign bit,
E: The biased exponent, and
M: The mantissa fraction.
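The FP32 bit layout can be inspected directly with Python’s standard `struct` module; the helper name `fp32_fields` is my own, but the bit positions are exactly those of the IEEE 754 single-precision format:

```python
import struct

def fp32_fields(x: float):
    """Split an FP32 value into its sign, biased exponent, and mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8-bit biased exponent
    mantissa = bits & 0x7FFFFF          # 23-bit fraction
    return sign, exponent, mantissa

s, e, m = fp32_fields(0.15625)
print(s, e, m)                          # 0 124 2097152
# Reconstruct: (-1)^S * (1 + M / 2^23) * 2^(E - 127)
value = (-1) ** s * (1 + m / 2**23) * 2 ** (e - 127)
print(value)                            # 0.15625
```

The mantissa 2097152 equals 0.25 × 2^{23}, which is the fraction 0.25 behind the implicit leading 1.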
Major Quantization by Encoding Scheme
In the context of deep learning, an encoding scheme refers to the method used to transform raw data into a structured numerical representation.
Quantization options are categorized into four major encoding schemes:
Floating Point (FP): FP16, BF16, FP8.
Integer (INT): INT8, INT4.
Normal Float (NF): NF4.
Block Floating Point (BFP): MX, MSFP.
I’ll take the same decimal value (weight) 0.15625 to see how each precision works.
Floating Point (FP) Family
FP has three major quantization options:
FP16.
BF16 (BFloat16)
FP8.
Each option uniquely allocates its total bits to represent the decimal 0.15625:

Figure C. Visual comparison of FP16, BF16, and FP8 bit-width allocation and encoding schemes (Created by Kuriko IWAI)
◼ FP16
FP16 is a 16-bit floating-point format where the sign, exponent, and mantissa are allocated 1, 5, and 10 bits respectively (Figure C, second row).
Pros: Very fast on almost all NVIDIA GPUs; high precision for small numbers.
Cons: Can overflow and crash training.
Best when: Performing inference on older GPUs (T4, V100, RTX 20-series). Training with mixed precision and a loss scaler.
Hardware constraints: Universal (NVIDIA/AMD)
◼ BF16 (BFloat16)
Developed by Google, BF16 largely replaces FP16 for training large models.
Pros: Keeps the same range as FP32. Fewer overflow crashes than FP16.
Cons: Less precision (fewer mantissa bits) than FP32 and FP16.
Best when: Training or fine-tuning LLMs on modern hardware (A100, H100, RTX 30/40-series).
Hardware constraints: Modern GPUs (A100, H100)
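The range difference between FP16 and BF16 is easy to demonstrate with NumPy’s `float16` (NumPy has no native bfloat16, so only the FP16 side is shown here):

```python
import numpy as np

# FP16's maximum finite value is 65504; anything larger overflows to inf.
print(np.finfo(np.float16).max)   # 65504.0
print(np.float16(70000.0))        # inf (overflow)
# BF16 shares FP32's 8-bit exponent, so a value like 7e4 fits easily
# in its ~3.4e38 range -- the reason BF16 training rarely overflows.
```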
◼ FP8
FP8 is supported natively by NVIDIA H100 and newer chips, offering a middle ground between 16-bit and 8-bit integers.
E4M3 (4-bit Exponent, 3-bit Mantissa) focuses on precision by securing 3 bits for the mantissa to represent subtle differences in numbers.
E5M2 (5-bit Exponent, 2-bit Mantissa) focuses on dynamic range by securing more bits to the exponent. The format can represent much larger and much smaller numbers, with less details in subtle differences.
Pros: Faster and more accurate than INT8.
Cons: Requires NVIDIA H100+. Limited software ecosystem compared to INT8.
Best when: E4M3 for fast inference. E5M2 for training and fine-tuning.
Hardware constraints: Modern GPUs (H100).
Developer Note:
Exponent (E) and Mantissa (M) are trade-offs: securing more E bits (a wider representable range) compromises M (the precision of each value).
FP8 is sometimes categorized under the Log Quant encoding scheme because its coarse mantissa makes its value grid behave like a log scale.
Integer (INT) Family
INT quantization takes a fixed-point approach by finding the nearest whole number to the original decimals.
It maps a range from -1.0 to 1.0 onto a fixed grid of whole numbers, ranging up to a maximum integer that the n-bit can represent.
Figure D illustrates how its primary options, INT8 and INT4, represent the decimal:

Figure D-1. Comparison of INT8 and INT4 quantization (Created by Kuriko IWAI)
◼ INT8
INT8 is the industry standard for efficient, production-ready inference.
With 8 bits, INT8 represents values in the range of -128 to 127 (-2^{7} to 2^{7} - 1), so the maximum positive value is 127.
Its quantization process works as follows:
Maps 1 to the maximum value 127.
Scales the original value: 0.15625 × 127 = 19.84375.
Finds the nearest whole number to 19.84375 → 20.
Creates 8-bit representation of 20 → 00010100 (as Figure D-1 shows).
The dequantized final value becomes 0.15748 (20 / 127), a roughly 0.8% error relative to the original value.
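The steps above can be sketched in a few lines of Python. The function names and the fixed scale of 1.0 are my own simplification; real quantizers derive a per-tensor or per-channel scale from the weight distribution:

```python
def quantize_int8(x: float, scale: float = 1.0) -> int:
    """Symmetric INT8 quantization: map [-scale, scale] onto [-127, 127]."""
    q = round(x / scale * 127)
    return max(-127, min(127, q))      # clamp to the representable grid

def dequantize_int8(q: int, scale: float = 1.0) -> float:
    return q / 127 * scale

q = quantize_int8(0.15625)             # 20
d = dequantize_int8(q)                 # 0.15748...
error_pct = abs(d - 0.15625) / 0.15625 * 100
print(q, round(d, 5), round(error_pct, 2))   # 20 0.15748 0.79
```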
Pros: Cuts memory in half compared to FP16. Supported natively by most modern CPUs and GPUs.
Cons: Not suitable for training (gradients are too small to survive the coarse grid, and rounding error accumulates).
Best when: Deploying models on servers with 2x inference speed without losing much intelligence.
Hardware constraints: Modern CPUs / GPUs.
◼ INT4
INT4 is the ultra-compressed integer format used to run huge models on local hardware.
Applying the same logic as INT8, its quantization process works as follows:
Maps 1 to the maximum positive value representable in 4 bits, 7 (2^{3} - 1).
Scales the original value 0.15625 to 7 → 1.09375 (7 x 0.15625).
Finds the nearest whole number to 1.09375 → 1.
4-bit representation of 1: 0001 (as Figure D-1 shows).
The dequantized final value becomes 0.14286 (1 / 7).
In this case, the error rate is around 8.6% relative to the original value, much higher than INT8’s.
This is because INT8’s notch size of 0.00787 (1 / 127, Figure D-2, left) is far smaller than INT4’s 0.143 (1 / 7, Figure D-2, right); each INT4 notch spans a much wider range of values, so rounding produces a larger error.

Figure D-2. Comparison of INT8 and INT4 notches (Created by Kuriko IWAI)
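The INT8 and INT4 walkthroughs generalize to any bit-width; the sketch below (my own helper, assuming a fixed [-1, 1] range) reproduces both error rates side by side:

```python
def quantize_intn(x: float, bits: int) -> int:
    """Symmetric n-bit quantization: map [-1, 1] onto the signed n-bit grid."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    return max(-qmax, min(qmax, round(x * qmax)))

for bits in (8, 4):
    qmax = 2 ** (bits - 1) - 1
    q = quantize_intn(0.15625, bits)
    d = q / qmax                         # dequantize
    err = abs(d - 0.15625) / 0.15625 * 100
    print(f"INT{bits}: q={q}, dequantized={d:.5f}, error={err:.1f}%")
# INT8: q=20, error 0.8%    INT4: q=1, error 8.6%
```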
Pros: Massive VRAM savings, allowing a 13B model to run on a standard smartphone or an 8GB GPU.
Cons: Significant quantization error. Many different decimals get rounded to the same number.
Best when: Edge devices. Local LLM setups.
Hardware constraints: Consumer GPU. MPS.
Developer Note:
Although technically doable, INT16 is rarely used because its floating-point alternatives, BF16 and FP16, represent decimals more accurately while consuming the same 16 bits.
NormalFloat (NF) - The QLoRA Engine
NormalFloat quantization leverages the fact that model weights are approximately normally distributed: billions of weights are mostly tiny numbers clustered around zero.
Its primary format, NormalFloat4 (NF4) places most of its 16 notches right in that crowded center around zero to represent the decimal more accurately than INT4:

Figure E. NormalFloat4 (NF4) distribution curve showing high-density quantization levels around zero (Created by Kuriko IWAI)
NF4 refers to a pre-calculated Look-Up Table (LUT) designed to match a Normal distribution.
The table below shows the LUT for NF4 in Figure E:
| Index (Binary) | Index (Decimal) | NF4 Assigned Value based on Normal Distribution | Note |
|---|---|---|---|
| 0000 | 0 | -1.0000 | The extreme negative edge. |
| ... | ... | ... | ... |
| 0110 | 6 | -0.0910 | High precision near zero. |
| 0111 | 7 | 0.0000 | The center of the bell curve. |
| 1000 | 8 | 0.0796 | High precision near zero. |
| 1001 | 9 | 0.1609 | Closest match for 0.15625! |
| 1010 | 10 | 0.2461 | Spacing starts to widen. |
| ... | ... | ... | ... |
| 1111 | 15 | 1.0000 | The extreme positive edge. |
Table 2. NF4 Look-Up Table (LUT) and Weight Mapping.
Table 2 shows how the numbers are very close together near zero, but get much further apart as they move toward one.
NF4 looks up the LUT entry closest to 0.15625 and selects index 9 (binary 1001), which represents 0.1609.
As Figure E shows, NF4 achieves higher precision for this value (around a 3% error, versus INT4’s 8.6%) by concentrating more quantization levels in the high-density region around zero, whereas its 4-bit counterpart INT4 uses fixed-width notches regardless of the data distribution.
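The lookup can be sketched directly from the codebook. The 16 values below are the NF4 codebook from the QLoRA paper (Dettmers et al., 2023), rounded to 4 decimals; the function name is my own:

```python
# NF4 codebook (QLoRA paper), rounded to 4 decimal places.
NF4_LUT = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
           0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def quantize_nf4(x: float) -> int:
    """Return the 4-bit index of the nearest NF4 codebook entry."""
    return min(range(16), key=lambda i: abs(NF4_LUT[i] - x))

idx = quantize_nf4(0.15625)
print(idx, format(idx, "04b"), NF4_LUT[idx])   # 9 1001 0.1609
```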
Pros: Keeps near 16-bit performance levels while using 4-bit space.
Cons: Computation complexity. It requires dequantization back to BF16 for math computation during forward pass.
Best when: QLoRA (Fine-tuning massive LLMs on consumer-grade GPUs).
Hardware constraints: Modern GPUs.
Block Floating Point (BFP) Family
The Block Floating Point (BFP) family leverages a hybrid approach to numerical representation, sharing a single exponent across a block of values.
It can achieve a much higher dynamic range than integer formats like INT8, while maintaining significantly lower hardware complexity than full floating-point like FP32.
Major formats include
Microsoft Shared Floating Point (MSFP) and
Microscaling (MX).
◼ Microsoft Shared Floating Point (MSFP)
Microsoft originally developed MSFP for its Project Brainwave (FPGA-based AI) to handle high-throughput, low-latency inference.
In MSFP, each element’s 8 bits are treated as a Signed Integer Mantissa, and all elements share one single exponent for the entire block.
The formula looks like this:
d = M_{int} × 2^{(E_{shared} - b)}
where:
d: The decimal value to quantize,
M_{int}: The pure integer mantissa,
E_{shared}: The shared exponent among all 16 numbers in a block, and
b: The exponent bias used to shift the range of the shared exponent.
The decimal 0.15625 in MSFP is encoded as 01010000:
Determine the block’s effective shared exponent: E_{shared} - b = -9.
Compute the pure integer mantissa: M_{int} = 0.15625 / 2^{-9} = 80.
Encode the integer mantissa 80 in 8-bit binary: 01010000.
Because individual elements skip per-value exponent computation, MSFP hardware stays simple and efficient.
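A simplified sketch of block encoding follows. This is my own illustration, not the MSFP specification: it picks the shared exponent so the block’s largest element fills the 8-bit mantissa, and it ignores the clipping strategies a real implementation needs:

```python
import math

def msfp_encode(block, mantissa_bits=8):
    """Toy MSFP block encoding: one shared exponent, signed integer mantissas.
    Assumes a nonzero block; no clipping or saturation handling."""
    max_abs = max(abs(v) for v in block)
    # Choose the effective exponent so max_abs lands in the top half
    # of the signed mantissa range.
    e_shared = math.floor(math.log2(max_abs)) - (mantissa_bits - 2)
    mantissas = [round(v / 2 ** e_shared) for v in block]
    return e_shared, mantissas

e, m = msfp_encode([0.15625, 0.05, -0.02])
print(e, m)   # -9 [80, 26, -10]
```

The first element reproduces the article’s worked example: effective exponent -9 and integer mantissa 80. Small values like -0.02 survive only coarsely, which is the outlier sensitivity mentioned above.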
Pros: Hardware efficient. Saves significant memory bandwidth compared to standard FP8.
Cons: Outlier sensitivity due to the shared exponent. Clipping strategies needed to handle the shared exponent range.
Best when: Deploying on FPGA or custom ASIC architectures for inference where power consumption and silicon area are the primary bottlenecks.
Hardware constraints: Intel Stratix 10 (Project Brainwave), Microsoft custom silicon (Maia).
◼ Microscaling (MX) Formats
The MX standard focuses on ultra-fine-grained scaling used in high-performance AI training.
The MX Alliance (which includes Microsoft, NVIDIA, AMD, Intel, and others) later evolved MSFP into a format interoperable across all modern GPUs and AI accelerators.
Compared to MSFP, MX formats allow larger blocks of 32 individual elements to share one exponent, making them more hardware-efficient by reducing the bits spent on exponents.
The quantization is generalized as:
d = (-1)^S × (1 + M) × 2^{(E - b)} × 2^{E_{shared}} (Eq 2.2)
where:
d: The decimal value to quantize,
S: The sign of the individual element,
M: The mantissa fraction of the individual element,
E: The biased exponent of the individual element,
E_{shared}: The shared exponent among all 32 elements in a block, and
b: The exponent bias of the element format (7 for E4M3).
In standard FP formats like FP16 or BF16, every single number carries its own exponent using up to 8 bits, as Figure C shows.
This consumes expensive VRAM memory.
Eq 2.2 shows that MX uses block-based scaling where a group of numbers (e.g., 32 numbers per block) shares one master exponent E_{shared}, while each element keeps its own sign and mantissa.
For example, to apply an MX format (MXFP8) to the decimal 0.15625 quantized in E4M3:
Convert the decimal into binary scientific notation: 0.15625 = 1.25 × 2^{-3}.
E4M3’s 3 mantissa bits give a resolution of 2^{-3} for the fractional part, so 0.25 is preserved exactly.
Determine E_{shared} from the maximum exponent of the group; here we assume E_{shared} = 0.
Encode the exponent (-3 + E4M3 bias 7 = 4) in 4 bits: 0100.
Encode the mantissa (1.25 - 1 = 0.25) in 3 bits: 010.
Encode the sign in 1 bit: 0.
Combining the sign, exponent, and mantissa, 0.15625 in MXFP8 is quantized as 0 0100 010.
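The steps above can be sketched as a toy encoder. This is my own simplification for positive, normal values: it ignores subnormals, rounding ties, and saturation, and assumes the shared exponent is already known:

```python
import math

def encode_e4m3(x: float, e_shared: int = 0) -> str:
    """Toy FP8 E4M3 encoding of a positive normal value within an MX block."""
    scaled = x / 2 ** e_shared           # divide out the block's shared exponent
    exp = math.floor(math.log2(scaled))  # unbiased exponent: -3 for 0.15625
    frac = scaled / 2 ** exp - 1.0       # mantissa fraction in [0, 1)
    m = round(frac * 2 ** 3)             # 3 mantissa bits
    e = exp + 7                          # E4M3 bias is 7
    return f"0 {e:04b} {m:03b}"          # sign bit 0 for positive x

print(encode_e4m3(0.15625))   # 0 0100 010
```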
Pros: Minimizes the precision loss. Broad support by NVIDIA, AMD, and Intel. Versatile enough to apply various bit-width.
Cons: Management overhead. Sub-optimal for sparse data (performance degrades if the data distribution is irregular across the small blocks).
Best when: Performing large-scale AI training or high-fidelity inference where model accuracy is critical.
Hardware constraints: Requires specialized hardware units capable of performing shifts at a granular level.
Wrapping Up
Among many quantization options, selecting the right one is critical for performance.
My advice is to always test your specific use case.
If you're building a creative writing assistant, 4-bit is plenty. If you're building a medical diagnosis tool, you might want to stick to the higher-fidelity BF16 or FP8.
◼ The Cheat Sheet
I summarize the options in the cheat sheet with hardware requirements:
| Format | Bits | Primary Use Case | Hardware Support |
|---|---|---|---|
| FP32 | 32 | Reference / Scientific Research | Universal |
| BF16 | 16 | Training (Industry Standard) | Modern GPUs (A100, H100) |
| FP16 | 16 | Inference / Older Training | Universal (NVIDIA/AMD) |
| FP8 | 8 | Accelerated Training & Inference | NVIDIA Hopper (H100), Blackwell |
| INT8 | 8 | Production Inference | Most Modern GPUs |
| INT4 | 4 | Local Inference (GGUF/GPTQ) | Consumer GPUs / Mac |
| NF4 | 4 | Fine-tuning (QLoRA) | Modern GPUs |
| MSFP | 12, 16 | FPGA / Custom-Silicon Inference | Intel Stratix 10, Microsoft Maia |
| MX | 4, 6, 8 | Ultra-efficient Scale-out Inference | Next-gen Blackwell (NVIDIA), AMD, Intel |
Table 3. The LLM Quantization Cheat Sheet: Formats, Bits, and Hardware Support
Developer Note:
Vector Quantization (VQ) methods such as AQLM can push quantization limits down to 2-bit.
These techniques achieve higher fidelity at low bitrates by mapping weights to optimized codebooks rather than simply rounding values to the nearest grid point.
Since they are deep methodologies, I’ll cover them in a separate title.
◼ Smaller is Better?
Balancing E and M reveals a crucial truth in model training: lower precision is not always superior.
For example:
At FP32:
- The most accurate model can distinguish tiny differences between 0.0000001 and 0.0000002.
At INT4:
- The model saves significant VRAM, but recognizes the tiny values 0.0000001 and 0.0000002 as the same 0.0000 due to the lack of bits.
This triggers the vanishing gradient problem, where an optimizer fails to converge and the model eventually collapses into NaN (Not a Number) errors.
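The collapse of tiny values is easy to see with the same symmetric 4-bit grid used earlier (a toy illustration with my own helper name, assuming a fixed scale of 1.0):

```python
def to_int4(x: float) -> int:
    """Round onto the symmetric 4-bit grid [-7, 7] (scale = 1.0)."""
    return max(-7, min(7, round(x * 7)))

# Two distinct tiny gradients land on the same notch:
print(to_int4(0.0000001))   # 0
print(to_int4(0.0000002))   # 0 -- the difference vanishes entirely
```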
QLoRA can solve this challenge without sacrificing the memory benefits of quantization.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Deconstructing LoRA: The Math and Mechanics of Low-Rank Adaptation
The Definitive Guide to LLM Fine-Tuning: Objectives, Mechanisms, and Hardware
Transformer Architecture: Self-Attention & MLOps Guide
Tokenization Strategies for LLM Applications
Optimizing LLM Performance: Context Window Impact on RAG Accuracy
Regularizing LLMs with Kullback-Leibler Divergence
Related Books for Further Understanding
These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation
Share What You Learned
Kuriko IWAI, "Is 4-Bit All You Need? The Math Behind Modern LLM Compression" in Kernel Labs
https://kuriko-iwai.com/llm-quantization-guide-fp32-to-int4-nf4-mx
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.





