Is 4-Bit All You Need? The Math Behind Modern LLM Compression
The Engineer’s Guide to LLM Quantization. Learn How Quantization Makes 70B Models Run on a Local GPU.
By Kuriko IWAI

Table of Contents
Introduction
What is Quantization
Types of Quantization
The Standard FP32 Precision
Major Quantization by Encoding Scheme
Wrapping Up

Introduction
Fine-tuning massive Large Language Models (LLMs) was once a luxury reserved for those with high-performance server clusters.
Today, quantization has transformed this landscape by significantly reducing computational costs, evolving from an optional optimization into a fundamental necessity for modern AI development.
In this article, I’ll outline the standard FP32 format first, then explore the major scalar quantization formats in LLM engineering, categorized by encoding scheme.
What is Quantization
Quantization is the process of mapping a large set of values to a smaller, discrete set.
In the context of LLM and deep learning, quantization reduces the numerical precision of a model’s weights and activations from the standard 32-bit floats (FP32) to lower-bit formats.
Quantization has significant memory saving benefits as the model size increases.
For example, quantizing a 70B parameter model like Llama 3–70B to 4-bit precision can reduce VRAM by over 200GB:

Figure A. Diagram comparing 32-bit vs 4-bit precision VRAM requirements for a 70B parameter model (Created by Kuriko IWAI)
At a standard 32-bit precision (Figure A, left), the model requires 280 GB of VRAM:
32 bits x 70B parameters / 8 bits/byte = 280B bytes = 280 GB
By utilizing 4-bit precision (Figure A, right), the VRAM requirement is reduced to 35 GB — just 1/8th of the original footprint:
4 bits x 70B parameters / 8 bits/byte = 35B bytes = 35 GB
This reduction makes it possible to run a massive 70B parameter model on consumer-grade GPUs.
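The arithmetic above can be wrapped in a small helper; the function name and decimal-GB rounding are my own illustration, not a library API, and the estimate covers weights only (no KV cache or activations):

```python
def vram_gb(num_params: float, bits: int) -> float:
    """Approximate VRAM needed to hold the weights alone."""
    bytes_total = num_params * bits / 8  # bits -> bytes
    return bytes_total / 1e9             # bytes -> GB (decimal)

# 70B parameters at different precisions
print(vram_gb(70e9, 32))  # 280.0 (FP32)
print(vram_gb(70e9, 4))   # 35.0  (4-bit)
```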
Quantization leverages this accessibility across the entire model lifecycle, including training, fine-tuning, and inference.
◼ Quantization in Model Training and Fine-Tuning
As Figure A shows, training an LLM from scratch in the standard FP32 is prohibitively expensive.
Quantization during training—specifically via Quantization-Aware Training (QAT) or QLoRA—offers several benefits:
Reduced VRAM overhead: Fine-tune massive models (like Llama-3-70B) on consumer-grade GPUs.
Faster throughput: Lower precision allows for more data to be processed in parallel, speeding up iterative training cycles.
Lower energy costs: Reduced bit-widths (e.g., from 32 to 16 or 8) lead to fewer bit-flips and less data movement, significantly lowering the electricity required to train the model.
Developer Note:
While lower precision can significantly reduce VRAM usage, it requires a careful balance to maintain model quality because it can cause the vanishing gradient problem during backpropagation.
◼ Quantization in Inference
Quantization enables the model to perform inference with greater speed and efficiency.
Memory footprint: As we saw in the example, a 70B parameter model quantized to 4-bit precision requires only 35GB of VRAM. This allows the model to run on a single high-end workstation or two consumer GPUs.
Latency: Compressing the data moved from the GPU memory to the processors can reduce the memory bandwidth bottleneck and result in faster tokens-per-second.
Edge deployment: With their reduced memory footprint, LLMs can run locally on laptops, smartphones, and embedded devices without constant internet connections.
Quantization levels can range from 16 bits down to 4 bits or even fewer.
After outlining the standard FP32 format, I’ll walk you through prominent quantization options.
Types of Quantization
Quantization can be categorized into four groups by its strategies:
Scalar Quantization (SQ): Maps each high-precision value to the nearest value on a discrete, lower-precision grid.
Binary Quantization (BQ): An extreme form of scalar quantization where every number is reduced to just a 1 or a 0 based on its sign (positive or negative).
Vector Quantization (VQ): Replaces a whole vector of numbers with a single index from a pre-trained codebook (dictionary) of representative patterns.
Product Quantization (PQ): A hybrid of SQ and VQ. Slices a high-dimensional vector into smaller sub-vectors and performs VQ on each piece independently.
The table below summarizes their characteristics and use cases:
| Type | Unit of Measure | Strategy | Accuracy | Examples | Use Cases |
|---|---|---|---|---|---|
| Scalar (SQ) | A single number | Rounding | HIGH | GGUF, EXL2, AWQ, NF4 | Personal AI & Local LLMs: Running SLMs or LLMs on laptops/phones. Best for text generation where speed is preferred over perfection. |
| Binary (BQ) | A single bit | Sign checking | LOW | BitNet 1.58b, HNSW-BQ | Edge AI: "Battery-sipping" AI on smartwatches or IoT sensors that only need to detect simple triggers (voice commands, gesture recognition). |
| Vector (VQ) | A single index. | Pattern matching | HIGHEST | AQLM, VQ-VAE, EnCodec | High-Fidelity Media Gen: Used in Sora-style video models and AI music to compress complex patterns without losing visual/audio textures. |
| Product (PQ) | A sub-vector | Divide & Conquer | MID | FAISS, Pinecone, Milvus | Long-Term Memory (RAG): Searching through billions of PDF pages or company documents in milliseconds to find relevant context for a chatbot. |
Table 1: A List of Quantization Types and Use Cases.
This article focuses on SQ.
The Standard FP32 Precision
Floating Point 32 (FP32) is the industry-standard precision for training deep learning models, offering the high dynamic range necessary to maintain numerical stability.
The format belongs to the Floating Point (FP) encoding scheme.
◼ How Floating Point (FP) Works
FP uses three key components to represent a decimal value:
Sign: Negative or positive.
Exponent: A scale of the number.
Mantissa: Actual digits of the number.
The diagram below illustrates how FP32 represents a decimal (weight) 0.15625:

Figure B. Breakdown of FP32 bit allocation among sign, exponent, and mantissa for decimal 0.15625 (Created by Kuriko IWAI)
FP32 allocates:
1 bit to the sign (Figure B, white)
8 bits to the exponent (orange), and
The remaining 23 bits to the mantissa (pink).
◼ Sign
The sign acts as a toggle switch for positive or negative:
0: The number is positive.
1: The number is negative.
Because it only needs to represent two states, it only ever needs 1 bit.
For a decimal 0.15625 in Figure B, the sign is zero, indicating the number 0.15625 is positive.
◼ Exponent
The exponent determines the range of numbers that the quantization format can represent.
FP32 has the range of 1.2 × 10^{-38} to 3.4 × 10^{38}.
The exponent of the decimal 0.15625 is 01111100 (Figure B, orange), representing 124.
FP32 applies a bias of 127 to the exponent.
So, the actual exponent is 124 - 127 = -3, indicating that the number is multiplied by 2^{-3} = 1/8.
◼ Mantissa
The mantissa (also called significand) represents the actual digits of the number (e.g., the 1.25 in 1.25 x 10^5).
In FP32, the mantissa provides roughly 7 significant decimal digits of precision; anything beyond that gets rounded off.
For example, the decimal 123.456789 is stored as approximately 123.45679 in FP32.
For the decimal 0.15625, the mantissa is 0100000... (lasting 23 bits) (Figure B, pink).
The computer assumes a leading 1. before these bits, so the binary 1.0100000... represents 1.25 (1 + (0 x 0.5) + (1 x 0.25) + (0 × 0.125) + … = 1.25) in decimal.
◼ The Math of FP32
Combining the sign, exponent, and mantissa reproduces the decimal 0.15625:
(-1)^0 × (1 + 0.25) × 2^{124 - 127} = 1.25 × 2^{-3} = 0.15625
The calculation is generalized as:
d = (-1)^S × (1 + M) × 2^{E - 127}
where:
d: The decimal value (weight in decimal),
S: The sign bit,
E: The biased exponent, and
M: The mantissa fraction.
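The FP32 bit layout can be inspected directly with Python’s standard `struct` module; the helper name `fp32_fields` is my own, but the bit positions are exactly those of the IEEE 754 single-precision format:

```python
import struct

def fp32_fields(x: float):
    """Split an FP32 value into its sign, biased exponent, and mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8-bit biased exponent
    mantissa = bits & 0x7FFFFF          # 23-bit fraction
    return sign, exponent, mantissa

s, e, m = fp32_fields(0.15625)
print(s, e, m)                          # 0 124 2097152
# Reconstruct: (-1)^S * (1 + M / 2^23) * 2^(E - 127)
value = (-1) ** s * (1 + m / 2**23) * 2 ** (e - 127)
print(value)                            # 0.15625
```

The mantissa 2097152 equals 0.25 × 2^{23}, which is the fraction 0.25 behind the implicit leading 1.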
Major Quantization by Encoding Scheme
In the context of deep learning, an encoding scheme refers to the method used to transform raw data into a structured numerical representation.
Quantization options are categorized into four major encoding schemes:
Floating Point (FP): FP16, BF16, FP8.
Integer (INT): INT8, INT4.
Normal Float (NF): NF4.
Block Floating Point (BFP): MX, MSFP.
I’ll take the same decimal value (weight) 0.15625 to see how each precision works.
Floating Point (FP) Family
FP has three major quantization options:
FP16.
BF16 (BFloat16)
FP8.
Each option uniquely allocates its total bits to represent the decimal 0.15625:

Figure C. Visual comparison of FP16, BF16, and FP8 bit-width allocation and encoding schemes (Created by Kuriko IWAI)
◼ FP16
FP16 is a 16-bit floating-point format where the sign, exponent, and mantissa are allocated 1, 5, and 10 bits respectively (Figure C, second row).
Pros: Very fast on almost all NVIDIA GPUs; high precision for small numbers.
Cons: Can overflow and crash training.
Best when: Performing inference on older GPUs (T4, V100, RTX 20-series). Training with mixed precision and a loss scaler.
Hardware constraints: Universal (NVIDIA/AMD)
◼ BF16 (BFloat16)
Developed by Google, BF16 largely replaces FP16 for training large models.
Pros: Keeps the same range as FP32. Fewer overflow crashes than FP16.
Cons: Less precision (fewer mantissa bits) than FP32 and FP16.
Best when: Training or fine-tuning LLMs on modern hardware (A100, H100, RTX 30/40-series).
Hardware constraints: Modern GPUs (A100, H100)
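The range difference between FP16 and BF16 is easy to demonstrate with NumPy’s `float16` (NumPy has no native bfloat16, so only the FP16 side is shown here):

```python
import numpy as np

# FP16's maximum finite value is 65504; anything larger overflows to inf.
print(np.finfo(np.float16).max)   # 65504.0
print(np.float16(70000.0))        # inf (overflow)
# BF16 shares FP32's 8-bit exponent, so a value like 7e4 fits easily
# in its ~3.4e38 range -- the reason BF16 training rarely overflows.
```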
◼ FP8
FP8 is supported natively by NVIDIA H100 and newer chips, offering a middle ground between 16-bit and 8-bit integers.
E4M3 (4-bit Exponent, 3-bit Mantissa) focuses on precision by securing 3 bits for the mantissa to represent subtle differences in numbers.
E5M2 (5-bit Exponent, 2-bit Mantissa) focuses on dynamic range by securing more bits to the exponent. The format can represent much larger and much smaller numbers, with less details in subtle differences.
Pros: Faster and more accurate than INT8.
Cons: Requires NVIDIA H100+. Limited software ecosystem compared to INT8.
Best when: E4M3 for fast inference. E5M2 for training and fine-tuning.
Hardware constraints: Modern GPUs (H100).
Developer Note:
Exponent (E) and Mantissa (M) are trade-offs: securing more E bits (a wider representable range) compromises M (the precision of each value).
FP8 is sometimes categorized under the Log Quant encoding scheme because its coarse mantissa makes its value grid behave like a log scale.
Integer (INT) Family
INT quantization takes a fixed-point approach by finding the nearest whole number to the original decimals.
It maps a range from -1.0 to 1.0 onto a fixed grid of whole numbers, ranging up to a maximum integer that the n-bit can represent.
Figure D illustrates how its primary options, INT8 and INT4, represent the decimal:

Figure D-1. Comparison of INT8 and INT4 quantization (Created by Kuriko IWAI)
◼ INT8
INT8 is the industry standard for efficient, production-ready inference.
With 8 bits, INT8 represents values in the range of -128 to 127 (-2^{7} to 2^{7} - 1), so the maximum positive value is 127.
Its quantization process works as follows:
Maps 1 to the maximum value 127.
Scales the original value: 0.15625 × 127 = 19.84375.
Finds the nearest whole number to 19.84375 → 20.
Creates 8-bit representation of 20 → 00010100 (as Figure D-1 shows).
The dequantized final value becomes 0.15748 (20 / 127), a roughly 0.8% error relative to the original value.
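The steps above can be sketched in a few lines of Python. The function names and the fixed scale of 1.0 are my own simplification; real quantizers derive a per-tensor or per-channel scale from the weight distribution:

```python
def quantize_int8(x: float, scale: float = 1.0) -> int:
    """Symmetric INT8 quantization: map [-scale, scale] onto [-127, 127]."""
    q = round(x / scale * 127)
    return max(-127, min(127, q))      # clamp to the representable grid

def dequantize_int8(q: int, scale: float = 1.0) -> float:
    return q / 127 * scale

q = quantize_int8(0.15625)             # 20
d = dequantize_int8(q)                 # 0.15748...
error_pct = abs(d - 0.15625) / 0.15625 * 100
print(q, round(d, 5), round(error_pct, 2))   # 20 0.15748 0.79
```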
Pros: Cuts memory in half compared to FP16. Supported natively by most modern CPUs and GPUs.
Cons: Not suitable for training (gradients are too small to survive the coarse grid, and rounding error accumulates).
Best when: Deploying models on servers with 2x inference speed without losing much intelligence.
Hardware constraints: Modern CPUs / GPUs.
◼ INT4
INT4 is the ultra-compressed integer format used to run huge models on local hardware.
Applying the same logic as INT8, its quantization process works as follows:
Maps 1 to the maximum positive value representable in 4 bits, 7 (2^{3} - 1).
Scales the original value 0.15625 to 7 → 1.09375 (7 x 0.15625).
Finds the nearest whole number to 1.09375 → 1.
4-bit representation of 1: 0001 (as Figure D-1 shows).
The dequantized final value becomes 0.14286 (1 / 7).
In this case, the error rate is around 8.6% relative to the original value, much higher than INT8’s.
This is because INT8’s notch size of 0.00787 (1 / 127, Figure D-2, left) is far smaller than INT4’s 0.143 (1 / 7, Figure D-2, right); each INT4 notch spans a much wider range of values, so rounding produces a larger error.

Figure D-2. Comparison of INT8 and INT4 notches (Created by Kuriko IWAI)
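The INT8 and INT4 walkthroughs generalize to any bit-width; the sketch below (my own helper, assuming a fixed [-1, 1] range) reproduces both error rates side by side:

```python
def quantize_intn(x: float, bits: int) -> int:
    """Symmetric n-bit quantization: map [-1, 1] onto the signed n-bit grid."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    return max(-qmax, min(qmax, round(x * qmax)))

for bits in (8, 4):
    qmax = 2 ** (bits - 1) - 1
    q = quantize_intn(0.15625, bits)
    d = q / qmax                         # dequantize
    err = abs(d - 0.15625) / 0.15625 * 100
    print(f"INT{bits}: q={q}, dequantized={d:.5f}, error={err:.1f}%")
# INT8: q=20, error 0.8%    INT4: q=1, error 8.6%
```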
Pros: Massive VRAM savings, allowing a 13B model to run on a standard smartphone or an 8GB GPU.
Cons: Significant quantization error. Many different decimals get rounded to the same number.
Best when: Edge devices. Local LLM setups.
Hardware constraints: Consumer GPU. MPS.
Developer Note:
Although technically doable, INT16 is rarely used because its floating-point alternatives, BF16 and FP16, represent decimals more accurately while consuming the same 16 bits.
NormalFloat (NF) - The QLoRA Engine
NormalFloat quantization leverages the fact that model weights are approximately normally distributed: billions of weights are mostly tiny numbers clustered around zero.
Its primary format, NormalFloat4 (NF4) places most of its 16 notches right in that crowded center around zero to represent the decimal more accurately than INT4:

Figure E. NormalFloat4 (NF4) distribution curve showing high-density quantization levels around zero (Created by Kuriko IWAI)
NF4 refers to a pre-calculated Look-Up Table (LUT) designed to match a Normal distribution.
The table below shows the LUT for NF4 in Figure E:
| Index (Binary) | Index (Decimal) | NF4 Assigned Value based on Normal Distribution | Note |
|---|---|---|---|
| 0000 | 0 | -1.0000 | The extreme negative edge. |
| ... | ... | ... | ... |
| 0110 | 6 | -0.0910 | High precision near zero. |
| 0111 | 7 | 0.0000 | The center of the bell curve. |
| 1000 | 8 | 0.0796 | High precision near zero. |
| 1001 | 9 | 0.1609 | Closest match for 0.15625! |
| 1010 | 10 | 0.2461 | Spacing starts to widen. |
| ... | ... | ... | ... |
| 1111 | 15 | 1.0000 | The extreme positive edge. |
Table 2. NF4 Look-Up Table (LUT) and Weight Mapping.
Table 2 shows how the numbers are very close together near zero, but get much further apart as they move toward one.
NF4 looks up the LUT entry closest to 0.15625 and selects index 9 (binary 1001), which represents 0.1609.
As Figure E shows, NF4 achieves higher precision for this value (around a 3% error, versus INT4’s 8.6%) by concentrating more quantization levels in the high-density region around zero, whereas its 4-bit counterpart INT4 uses fixed-width notches regardless of the data distribution.
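The lookup can be sketched directly from the codebook. The 16 values below are the NF4 codebook from the QLoRA paper (Dettmers et al., 2023), rounded to 4 decimals; the function name is my own:

```python
# NF4 codebook (QLoRA paper), rounded to 4 decimal places.
NF4_LUT = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
           0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def quantize_nf4(x: float) -> int:
    """Return the 4-bit index of the nearest NF4 codebook entry."""
    return min(range(16), key=lambda i: abs(NF4_LUT[i] - x))

idx = quantize_nf4(0.15625)
print(idx, format(idx, "04b"), NF4_LUT[idx])   # 9 1001 0.1609
```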
Pros: Keeps near 16-bit performance levels while using 4-bit space.
Cons: Computation complexity. It requires dequantization back to BF16 for math computation during forward pass.
Best when: QLoRA (Fine-tuning massive LLMs on consumer-grade GPUs).
Hardware constraints: Modern GPUs.
Block Floating Point (BFP) Family
The Block Floating Point (BFP) family leverages a hybrid approach to numerical representation, sharing a single exponent across a block of values.
It can achieve a much higher dynamic range than integer formats like INT8, while maintaining significantly lower hardware complexity than full floating-point like FP32.
Major formats include
Microsoft Shared Floating Point (MSFP) and
Microscaling (MX).
◼ Microsoft Shared Floating Point (MSFP)
Microsoft originally developed MSFP for its Project Brainwave (FPGA-based AI) to handle high-throughput, low-latency inference.
In MSFP, each element’s 8 bits are treated as a Signed Integer Mantissa, and all elements share one single exponent for the entire block.
The formula looks like this:
d = M_{int} × 2^{(E_{shared} - b)}
where:
d: The decimal value to quantize,
M_{int}: The pure integer mantissa,
E_{shared}: The shared exponent among all 16 numbers in a block, and
b: The exponent bias used to shift the range of the shared exponent.
The decimal 0.15625 in MSFP is encoded as 01010000:
Determine the block’s effective shared exponent: E_{shared} - b = -9.
Compute the pure integer mantissa: M_{int} = 0.15625 / 2^{-9} = 80.
Encode the integer mantissa 80 in 8-bit binary: 01010000.
Because individual elements skip per-value exponent computation, MSFP hardware stays simple and efficient.
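A simplified sketch of block encoding follows. This is my own illustration, not the MSFP specification: it picks the shared exponent so the block’s largest element fills the 8-bit mantissa, and it ignores the clipping strategies a real implementation needs:

```python
import math

def msfp_encode(block, mantissa_bits=8):
    """Toy MSFP block encoding: one shared exponent, signed integer mantissas.
    Assumes a nonzero block; no clipping or saturation handling."""
    max_abs = max(abs(v) for v in block)
    # Choose the effective exponent so max_abs lands in the top half
    # of the signed mantissa range.
    e_shared = math.floor(math.log2(max_abs)) - (mantissa_bits - 2)
    mantissas = [round(v / 2 ** e_shared) for v in block]
    return e_shared, mantissas

e, m = msfp_encode([0.15625, 0.05, -0.02])
print(e, m)   # -9 [80, 26, -10]
```

The first element reproduces the article’s worked example: effective exponent -9 and integer mantissa 80. Small values like -0.02 survive only coarsely, which is the outlier sensitivity mentioned above.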
Pros: Hardware efficient. Saves significant memory bandwidth compared to standard FP8.
Cons: Outlier sensitivity due to the shared exponent. Clipping strategies needed to handle the shared exponent range.
Best when: Deploying on FPGA or custom ASIC architectures for inference where power consumption and silicon area are the primary bottlenecks.
Hardware constraints: Intel Stratix 10 (Project Brainwave), Microsoft custom silicon (Maia).
◼ Microscaling (MX) Formats
The MX standard focuses on ultra-fine-grained scaling used in high-performance AI training.
The MX Alliance (which includes Microsoft, NVIDIA, AMD, Intel, and others) later evolved MSFP into a format interoperable across all modern GPUs and AI accelerators.
Compared to MSFP, MX formats allow larger blocks of 32 individual elements to share one exponent, making them more hardware-efficient by reducing the bits spent on exponents.
The quantization is generalized as:
d = (-1)^S × (1 + M) × 2^{(E - b)} × 2^{E_{shared}} (Eq 2.2)
where:
d: The decimal value to quantize,
S: The sign of the individual element,
M: The mantissa fraction of the individual element,
E: The biased exponent of the individual element,
E_{shared}: The shared exponent among all 32 elements in a block, and
b: The exponent bias of the element format (7 for E4M3).
In standard FP formats like FP16 or BF16, every single number carries its own exponent using up to 8 bits, as Figure C shows.
This consumes expensive VRAM memory.
Eq 2.2 shows that MX uses block-based scaling where a group of numbers (e.g., 32 numbers per block) shares one master exponent E_{shared}, while each element keeps its own sign and mantissa.
For example, to apply an MX format (MXFP8) to the decimal 0.15625 quantized in E4M3:
Convert the decimal into binary scientific notation: 0.15625 = 1.25 × 2^{-3}.
E4M3’s 3 mantissa bits give a resolution of 2^{-3} for the fractional part, so 0.25 is preserved exactly.
Determine E_{shared} from the maximum exponent of the group; here we assume E_{shared} = 0.
Encode the exponent (-3 + E4M3 bias 7 = 4) in 4 bits: 0100.
Encode the mantissa (1.25 - 1 = 0.25) in 3 bits: 010.
Encode the sign in 1 bit: 0.
Combining the sign, exponent, and mantissa, 0.15625 in MXFP8 is quantized as 0 0100 010.
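The steps above can be sketched as a toy encoder. This is my own simplification for positive, normal values: it ignores subnormals, rounding ties, and saturation, and assumes the shared exponent is already known:

```python
import math

def encode_e4m3(x: float, e_shared: int = 0) -> str:
    """Toy FP8 E4M3 encoding of a positive normal value within an MX block."""
    scaled = x / 2 ** e_shared           # divide out the block's shared exponent
    exp = math.floor(math.log2(scaled))  # unbiased exponent: -3 for 0.15625
    frac = scaled / 2 ** exp - 1.0       # mantissa fraction in [0, 1)
    m = round(frac * 2 ** 3)             # 3 mantissa bits
    e = exp + 7                          # E4M3 bias is 7
    return f"0 {e:04b} {m:03b}"          # sign bit 0 for positive x

print(encode_e4m3(0.15625))   # 0 0100 010
```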
Pros: Minimizes the precision loss. Broad support by NVIDIA, AMD, and Intel. Versatile enough to apply various bit-width.
Cons: Management overhead. Sub-optimal for sparse data (performance degrades if the data distribution is irregular across the small blocks).
Best when: Performing large-scale AI training or high-fidelity inference where model accuracy is critical.
Hardware constraints: Requires specialized hardware units capable of performing shifts at a granular level.
Wrapping Up
Among many quantization options, selecting the right one is critical for performance.
My advice is to always test your specific use case.
If you're building a creative writing assistant, 4-bit is plenty. If you're building a medical diagnosis tool, you might want to stick to the higher-fidelity BF16 or FP8.
◼ The Cheat Sheet
I summarize the options in the cheat sheet with hardware requirements:
| Format | Bits | Primary Use Case | Hardware Support |
|---|---|---|---|
| FP32 | 32 | Reference / Scientific Research | Universal |
| BF16 | 16 | Training (Industry Standard) | Modern GPUs (A100, H100) |
| FP16 | 16 | Inference / Older Training | Universal (NVIDIA/AMD) |
| FP8 | 8 | Accelerated Training & Inference | NVIDIA Hopper (H100), Blackwell |
| INT8 | 8 | Production Inference | Most Modern GPUs |
| INT4 | 4 | Local Inference (GGUF/GPTQ) | Consumer GPUs / Mac |
| NF4 | 4 | Fine-tuning (QLoRA) | Modern GPUs |
| MSFP | 12, 16 | FPGA / Custom-Silicon Inference | Intel Stratix 10, Microsoft Maia |
| MX | 4, 6, 8 | Ultra-efficient Scale-out Inference | Next-gen Blackwell (NVIDIA), AMD, Intel |
Table 3. The LLM Quantization Cheat Sheet: Formats, Bits, and Hardware Support
Developer Note:
Vector Quantization (VQ) methods such as AQLM can push quantization limits down to 2-bit.
These techniques achieve higher fidelity at low bitrates by mapping weights to optimized codebooks rather than simply rounding values to the nearest grid point.
Since they are deep methodologies, I’ll cover them in a separate title.
◼ Smaller is Better?
Balancing E and M reveals a crucial truth in model training: lower precision is not always superior.
For example:
At FP32:
- The most accurate model can distinguish tiny differences between 0.0000001 and 0.0000002.
At INT4:
- The model saves significant VRAM, but recognizes the tiny values 0.0000001 and 0.0000002 as the same 0.0000 due to the lack of bits.
This triggers the vanishing gradient problem, where an optimizer fails to converge and the model eventually collapses into NaN (Not a Number) errors.
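The collapse of tiny values is easy to see with the same symmetric 4-bit grid used earlier (a toy illustration with my own helper name, assuming a fixed scale of 1.0):

```python
def to_int4(x: float) -> int:
    """Round onto the symmetric 4-bit grid [-7, 7] (scale = 1.0)."""
    return max(-7, min(7, round(x * 7)))

# Two distinct tiny gradients land on the same notch:
print(to_int4(0.0000001))   # 0
print(to_int4(0.0000002))   # 0 -- the difference vanishes entirely
```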
QLoRA can solve this challenge without sacrificing the memory benefits of quantization.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Deconstructing LoRA: The Math and Mechanics of Low-Rank Adaptation
The Definitive Guide to LLM Fine-Tuning: Objectives, Mechanisms, and Hardware
Transformer Architecture: Self-Attention & MLOps Guide
Tokenization Strategies for LLM Applications
Optimizing LLM Performance: Context Window Impact on RAG Accuracy
Regularizing LLMs with Kullback-Leibler Divergence
Related Books for Further Understanding
These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation
Share What You Learned
Kuriko IWAI, "Is 4-Bit All You Need? The Math Behind Modern LLM Compression" in Kernel Labs
https://kuriko-iwai.com/llm-quantization-guide-fp32-to-int4-nf4-mx
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.





