Model Distillation Guide: Compressing LLMs for Edge Efficiency

Learn the fundamentals of model distillation with practical implementation tips.

Machine Learning · Deep Learning · Data Science · Python · LLM

By Kuriko IWAI

Table of Contents

Introduction
What is Model Distillation - Understanding the Teacher-Student Framework
How It Works - 3 Core Distillation Schemes
Response-Based Distillation
Feature-Based Distillation
Relation-Based Distillation
Distillation Strategies
Learning Source
Structural Relation
Training Methods
Task-Specific Distillation
Implementation Strategy: Which Path to Take?
Model Distillation in Action - Distilling GPT-4o into Llama 3-1B
Step 1. Prompt GPT-4o
Step 2. Collect the Teacher Outputs
Step 3. Fine-Tune Llama 3-1B
Step 4. Perform Inference
Wrapping Up
When to Pivot: Distillation vs. RAG vs. Fine-Tuning

Introduction

As Large Language Models (LLMs) have ballooned to hundreds of billions of parameters, a new challenge has emerged: efficiency.

Running a massive model like GPT-4 for every minor task is expensive, slow, and often overkill.

Model distillation is the engineering answer to this problem: it packs the intelligence of a giant model into a smaller, faster, and more cost-effective one.

In this article, I'll explore how model distillation works, along with its common applications and practical implementation tips.

What is Model Distillation - Understanding the Teacher-Student Framework

Model Distillation (also called Knowledge Distillation) is a compression technique in deep learning model engineering where a small model (the student) is trained to reproduce the behavior and output of a large, pre-trained model (the teacher).

The below diagram illustrates its concept:

Figure A. Diagram of Knowledge Distillation architecture illustrating a high-parameter teacher model transferring soft targets to a lightweight student model. (Created by Kuriko IWAI)

The student is trained on the outputs of the teacher model (gray area, Figure A) and, depending on the distillation scheme, other internal signals, instead of being trained on the raw training set from the ground up.

How It Works - 3 Core Distillation Schemes

The technique has evolved into three distinct approaches based on what part of the teacher's knowledge is transferred to the student:

  • Response-based: Mimics the final answer.

  • Feature-based: Mimics the internal logic.

  • Relation-based: Mimics the data structure.

Response-Based Distillation

Response-based distillation is the most common form of distillation, in which the student learns from the probability distribution generated by the final Softmax layer of the teacher.

The below diagram illustrates how the algorithm assesses the student's prediction:

Figure B. Flowchart of Response-based Distillation showing logits passing through a Softmax layer with temperature T to compute Distillation Loss. (created by Kuriko IWAI)

The Objective Function

As Figure B shows, the response-based approach attempts to minimize the total loss, a weighted average of the distillation loss and the student loss, during the backward pass:

L_{total} = \alpha L_{distill} + (1 - \alpha) L_{student} \quad \cdots \text{(1.1)}

Where:

  • L_{total}: The total loss.

  • α: A hyperparameter that determines the relative importance of the teacher's guidance versus the ground-truth (labelled) data.

  • L_{distill}: The distillation loss (white box, Figure B). The difference between the teacher’s and student’s softened outputs.

  • L_{student}: The student loss (pink box, Figure B). The standard cross-entropy loss between the student’s predictions and the actual ground-truth labels (hard targets).

The process has five distinct steps:

Step 1. Forward Pass

Both the teacher model and student model perform forward pass to get logits (raw outputs) z_T and z_S, respectively.

Step 2. Soften the Logits

Apply a temperature parameter T to both z_T and z_S to smooth the probability distribution:

P(i,T)=exp(zi/T)jexp(zj/T)(1.2)P(i, T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \quad \cdots \text{(1.2)}

Where:

  • P(i, T): The softened probability for a given class i at temperature T.

  • z_i: The logit for the class i.

  • T = 1: The standard Softmax.

  • T > 1: The softened Softmax used during distillation.

This process enables the student model to learn the relationship between incorrect classes because as T increases, the probability distribution becomes flatter, revealing which classes the teacher thinks are more similar to the correct one.
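
To make the effect of T concrete, here is a small numerical sketch of Eq. 1.2 (the logits are made up for illustration):

```python
import numpy as np

def softened_softmax(logits, T=1.0):
    """Eq. 1.2: Softmax of logits scaled by temperature T."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# hypothetical teacher logits for the classes [dog, cat, car]
logits = [5.0, 3.0, -2.0]

print(softened_softmax(logits, T=1))  # sharp: nearly all mass on "dog"
print(softened_softmax(logits, T=5))  # flatter: "cat" is visibly closer to "dog" than "car"
```

At T = 1 the runner-up class is almost invisible; at T = 5 the teacher's belief that "cat" is the second-most plausible class becomes a usable training signal for the student.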

Step 3. Compute the Distillation Loss

Compare the soft distributions from Step 2 and compute the distillation loss.

A common way is to use the Kullback-Leibler (KL) divergence:

L_{distill} = T^2 \cdot \sum_{i} P_T(i, T) \log \left( \frac{P_T(i, T)}{P_S(i, T)} \right) \quad \cdots \text{(1.3)}

where:

  • T: Temperature parameter.

  • P_T(i, T): The Softmax probability for the i-th class produced by the teacher model, adjusted by a temperature T. The calculation is defined in Eq. 1.2.

  • P_S(i, T): The Softmax probability for the i-th class produced by the student model, adjusted by a temperature T. The calculation is defined in Eq. 1.2.

Step 4. Compute the Student Loss

Compare the student's raw output (T=1) to the actual labels.

Step 5. Backward Pass

Lastly, compute the total loss L_{total} as per Eq. 1.1 and update only the student’s weights.
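
The five steps above can be condensed into a minimal NumPy sketch of Eq. 1.1 and Eq. 1.3 (the logits, α, and T values are illustrative, not tuned):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_total_loss(z_teacher, z_student, true_label, alpha=0.5, T=4.0):
    """Eq. 1.1: weighted sum of the KL distillation loss (Eq. 1.3)
    and the hard-label cross-entropy student loss."""
    p_t = softmax(z_teacher, T)  # softened teacher distribution (Eq. 1.2)
    p_s = softmax(z_student, T)  # softened student distribution
    l_distill = T**2 * np.sum(p_t * np.log(p_t / p_s))   # Eq. 1.3
    l_student = -np.log(softmax(z_student)[true_label])  # cross-entropy at T=1
    return alpha * l_distill + (1 - alpha) * l_student

# toy logits over 3 classes; class 0 is the ground-truth label
print(distillation_total_loss([5.0, 3.0, -2.0], [2.0, 1.0, 0.0], true_label=0))
```

In a real training loop the same computation runs per batch on the framework's tensors, so the gradient flows back through the student's logits only.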

Common Applications

Response-based distillation works well with classification tasks.

  • Edge device: Compress image classifiers (e.g., from a massive Vision Transformer to a small MobileNet) so they can run locally on smartphones or IoT sensors without cloud latency.

  • Cross-architecture transfer: Transfers knowledge between entirely different architectures, for example, distilling a CNN (teacher) into an MLP-Mixer (student).

  • Ensemble compression: Takes the average output of an ensemble of 10 different models (the teachers) and distills it into a single, much faster student model.
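
The ensemble-compression case from the last bullet reduces to averaging the teachers' soft targets before distilling; a toy sketch with three hypothetical teachers:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

# hypothetical logits from an ensemble of 3 teacher models for one input
ensemble_logits = [[4.0, 2.0, -1.0], [3.5, 2.5, -0.5], [4.2, 1.8, -1.2]]

# the ensemble's soft target is the mean of the member distributions;
# the single student is then trained to match this one distribution
soft_target = np.mean([softmax(z) for z in ensemble_logits], axis=0)

print(soft_target)
```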

Feature-Based Distillation

Instead of just looking at the final answer, feature-based distillation enables the student model to mimic the internal representations of the teacher.

The below diagram illustrates how the algorithm assesses the student's prediction:

Figure C. Architecture diagram for Feature-based Distillation showing the alignment of intermediate hidden layers and activation maps between models. (created by Kuriko IWAI)

The Objective Function

The core objective is to minimize the difference between the teacher’s intermediate feature maps and the student’s corresponding layers (Distillation loss in Figure C):

L_{feat} = \sum_{x \in \mathcal{X}} \mathcal{D}\left( \Phi_T(x), G(\Phi_S(x)) \right) \quad \cdots \text{(2.1)}

where:

  • Φ_T(x): The activation maps from the n-th layer of the teacher model.

  • Φ_S(x): The activation maps from the n-th layer of the student model.

  • G(...): An alignment function like a 1 x 1 convolution or a linear projection that reshapes the student's features to match the teacher's dimensionality.

  • D(...): A distance metric. Commonly Mean Squared Error (MSE), but it can also be the L_1 norm or cosine similarity.

Similar to response-based distillation, the technique attempts to minimize the loss in Eq. 2.1 during the backward pass to find the optimal internal parameters of the student model.
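
A minimal NumPy sketch of Eq. 2.1, assuming MSE as D and a linear projection as the alignment function G (the layer widths here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical activation maps: the teacher's layer is 16-dim, the student's is 8-dim
phi_teacher = rng.normal(size=(32, 16))  # batch of 32 examples
phi_student = rng.normal(size=(32, 8))

# G: a (learnable) linear projection aligning student features to the teacher's width
W = rng.normal(size=(8, 16)) * 0.1

def feature_distill_loss(phi_t, phi_s, W):
    """Eq. 2.1 with D = MSE and G = a linear projection."""
    aligned = phi_s @ W  # G(Phi_S(x)): (32, 8) -> (32, 16)
    return np.mean((phi_t - aligned) ** 2)

print(feature_distill_loss(phi_teacher, phi_student, W))
```

In practice W is trained jointly with the student, so the projection itself learns how to map the student's feature space onto the teacher's.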

Key Variants of Feature Knowledge

There are three distinct variants in feature distillation based on what the student model attempts to mimic:

  • FitNets: Attempts to mimic the teacher's hidden layers, leveraging a regressor.

  • Attention Transfer (AT): Attempts to mimic the teacher's attention maps.

  • Factor Transfer: Attempts to mimic meaningful factors from the teacher's features, leveraging an encoder-decoder paraphraser.

Primary Use Cases

Feature distillation is useful for multi-step reasoning tasks, as it allows the student to learn the internal logic of the teacher.

Its primary use cases involve:

  • Object detection in computer vision: Distill the feature maps to preserve spatial information and object boundaries.

  • Transformer compression: Distill models like BERT into DistilBERT or TinyBERT by matching attention matrices and hidden states, ensuring that the student retains the linguistic nuances and contextual relationships.

  • Cross-modal learning: Distill features from a teacher trained on images into a student trained on depth maps or infrared data, helping the student learn robust features even with limited input types.

  • Transfer learning with small data: Allows the small student model to learn the rich feature hierarchy of the teacher (pre-trained on a giant dataset), while avoiding overfitting.

Relation-Based Distillation

Relation-based distillation shifts the focus from what a model sees to how it perceives the structure of the data.

The below diagram illustrates how the algorithm assesses the student's prediction:

Figure D. Visual representation of Relation-based Distillation focusing on the structural similarity of the data manifold and distance matrices. (Created by Kuriko IWAI)

The approach focuses on the structure of the data manifold, instead of mimicking specific layers or outputs of the teacher.

For example, in an image classification task, the student isn't learning what a "dog" image looks like; it's learning that a "dog" is closer to a "cat" than it is to a "car."

The Objective Function

The goal of relation distillation is to ensure that if the teacher thinks Image A and Image B are similar, the student should also map them close together in its own feature space.

The loss function compares similarity matrices (Gram matrices) or distance matrices from both models:

L_{rel} = \sum_{i,j} \ell \left( \psi(f_T^i, f_T^j), \psi(f_S^i, f_S^j) \right) \quad \cdots \text{(3.1)}

where:

  • f_T^i, f_S^i: The feature embeddings of the i-th input from the teacher and student.

  • ℓ(...): A loss function that penalizes the difference between the teacher's similarity score and the student's, e.g., mean squared error (MSE) or Huber loss.

  • ψ(...): A similarity function, e.g., cosine similarity or Euclidean distance.

Relation distillation tunes the student model to minimize the loss defined in Eq. 3.1 during the backward pass.
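
A toy NumPy sketch of Eq. 3.1, assuming cosine similarity as ψ and MSE as ℓ; note that the embedding dimensions deliberately differ, since only the similarity matrices are compared:

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical embeddings for the same 6 inputs: teacher is 32-dim, student is 8-dim
f_teacher = rng.normal(size=(6, 32))
f_student = rng.normal(size=(6, 8))

def pairwise_cosine(f):
    """psi: cosine similarity between every pair of embeddings."""
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    return f @ f.T

def relation_distill_loss(f_t, f_s):
    """Eq. 3.1 with psi = cosine similarity and l = MSE:
    penalize mismatches between the two similarity matrices."""
    return np.mean((pairwise_cosine(f_t) - pairwise_cosine(f_s)) ** 2)

print(relation_distill_loss(f_teacher, f_student))
```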

Primary Use Cases

Relation distillation allows the student to learn the underlying shape of the data.

And because it only compares distances computed from each model's outputs, the approach is robust and architecture-agnostic.

Its primary use cases involve:

  • Image retrieval and face verification: The student learns, for instance, to cluster dog images together and keep them far from cat images.

  • Zero-shot / few-shot learning: The student can better guess where a new class should sit in the feature space by learning the relationships between known classes.

  • Knowledge graphs: Distilling complex relationships between entities into smaller, faster graph neural networks (GNNs).

Distillation Strategies

Aside from the distillation schemes, there are several factors that dictate distillation strategies:

  1. Learning source: Learning source available from the teacher.

  2. Structural relation: How structurally close the teacher and student are.

  3. Training method: How the student and teacher models are trained.

  4. Task-specific distillation.

Learning Source

The learning source from the teacher dictates what the student can mimic.

There are two categories:

  • Black-box: The student learns only from the teacher's final text outputs.

  • White-box: The student has full access to the teacher's internal parameters and probabilities.

Black-Box

When the teacher is a proprietary model like GPT or Gemini, the student can only access the final outputs via API.

The approach is straightforward and common in standard SFT, focusing on cloning general predictive performance, but the student might miss the teacher's reasoning depth.

  • Typical use case: Creating a small specialized model via API. Basic chatbot fine-tuning.

White-Box

Although requiring hosting the teacher locally, the white-box approach allows the student to access the teacher's internal parameters to mimic its reasoning processes.

  • Typical use case: Distilling Llama-3 70B into a local 8B version.

Structural Relation

Structural relation refers to the relationship between the student's and the teacher's model families, falling into two groups:

  • Same family: Both the teacher and the student belong to the same model family.

  • Cross architecture: The teacher and the student belong to different model families.

Same Family

When the teacher and the student belong to the same model family, they can achieve perfect layer alignment, directly mapping each layer of the teacher to the student.

The approach is straightforward yet rigid; the application is limited to specific model lineages.

  • Typical use case: Distilling Qwen-32B into Qwen-7B.

Cross Architecture

The teacher and the student have different architectures, which makes training harder to converge but allows pairing models across families.

  • Typical use case: Converting a Transformer to a faster linear model.

Training Methods

The nature of training methods dictates how the student learns from the teacher:

  • Offline

  • Online

  • Self-distillation

Figure E. Comparative diagram of Distillation strategies: Offline (static dataset), Online (joint training), and Self-distillation (internal layer refinement). (Created by Kuriko IWAI)

Offline Distillation

Offline distillation is the standard approach, where the teacher creates a static training set once, and then the student is trained on that dataset.

Its learning process is extremely stable, but sometimes, the student fails to learn complex patterns.

  • Typical use case: Standard model compression pipelines.

Online Distillation

Online distillation updates both the teacher and the student, allowing the teacher to adapt to the student's learning pace during training.

The approach is competitive when enough VRAM and computational resources are secured for training both the teacher and student.

  • Typical use case: Research-grade co-training and ensembles.

Self-Distillation

The student refines itself by letting its deeper layers teach shallower ones.

Although the approach tends to reinforce errors from the deeper layers, it is handy because it does not require a separate teacher model.

  • Typical use case: DeepSeek-style internal layer optimization.

Task-Specific Distillation

Specific architectures require specialized distillation logic:

  • Sequence Distillation: Used in NLP (e.g., DistilBERT) where the student learns to match the teacher's hidden states and attention heads.

  • Logic Distillation: Used in RL or reasoning tasks where the student mimics the teacher's policy or value functions.

Implementation Strategy: Which Path to Take?

In practice, these model distillation techniques are rarely used in isolation.

The most effective implementations combine different strategies to balance performance, cost, and target hardware constraints.

The below table introduces common combinations:

| Strategy | Primary Goal | Distillation | Implementation | Example |
| --- | --- | --- | --- | --- |
| Production shortcut | Speed & cost | Black-box + Offline + Response Knowledge Distillation | Collect a static dataset of teacher responses via API. Use them as the ground-truth labelled data for the student. | GPT-4 API → 7B model |
| Reasoning preservation | High fidelity | White-box + Logit + Feature Knowledge Distillation | The student matches both the final answers (logits) and the internal logic (intermediate features/attention maps) of the teacher. | Llama 3 70B → 8B |
| Edge migration | On-device latency | White-box + Architecture Mapping + Response Knowledge Distillation | Focuses on response-based transfer. | BERT → MobileNet |
| Reasoning bridge | Overcome a significant size gap | White-box + Self-Distillation + Response Knowledge Distillation | Distill the 400B model into a 70B teacher assistant (TA), distill the 70B TA into the final 1B student, then run self-distillation on the 1B student. | 400B model → 1B model |

Table 1. Comparison of Model Distillation Strategy Combinations.

Model Distillation in Action - Distilling GPT-4o into Llama 3-1B

In this section, I'll distill a massive GPT-4o model into a tiny student, Llama 3-1B model, for an edge device application.

The distillation follows the offline, response-based knowledge distillation pattern. Since we cannot access GPT-4o's internal weights, I'll collect its outputs into a high-quality instruction dataset and run SFT on the student model.

The process follows the four primary steps:

  • Step 1. Prompt GPT-4o to summarize 50,000 legal briefs with detailed explanations.

  • Step 2. Collect the teacher's outputs as the ground truth.

  • Step 3. Fine-tune the student (Llama 3-1B) on the ground truth data in Step 2.

  • Step 4. The student performs inference.

Step 1. Prompt GPT-4o

The first step is to call the OpenAI API to generate the teacher's outputs:

from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

# ask the teacher (GPT-4o) to summarize each brief with detailed explanations
def summarize_with_gpt4o(text):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize this legal brief with detailed explanations:\n{text}"}],
    )
    return response.choices[0].message.content

queries = ["Legal text A...", "Legal text B..."]
teacher_outputs = [summarize_with_gpt4o(q) for q in queries]

Step 2. Collect the Teacher Outputs

The teacher's outputs are structured and saved in a JSON file:

import json

dataset = []
for i, (original, summary) in enumerate(zip(queries, teacher_outputs)):
    dataset.append({"id": i, "input": original, "teacher_summary": summary})

with open("teacher_data.json", "w") as f:
    json.dump(dataset, f)

Step 3. Fine-Tune Llama 3-1B

Using the dataset from Step 2, fine-tune the student model, Llama 3-1B:

from datasets import load_dataset
from transformers import TrainingArguments, AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

# load the teacher dataset from step 2 and flatten it into prompt-response text
dataset = load_dataset("json", data_files="teacher_data.json", split="train")
dataset = dataset.map(lambda row: {"text": f"Summarize: {row['input']}\n{row['teacher_summary']}"})

# load student model and its corresponding tokenizer
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# instantiate sft trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    processing_class=tokenizer,
    args=TrainingArguments(
        output_dir="./llama-3-legal-distilled",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        num_train_epochs=3,
        save_steps=100,
        logging_steps=10,
        bf16=True,
    ),
)

# train the student model on the ground-truth teacher outputs
trainer.train()

Step 4. Perform Inference

Lastly, the student performs inference to assess the results:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# trained student model
model = trainer.model.to(device)

query = "The petitioner claims a violation of the 4th Amendment..."
inputs = tokenizer(f"Summarize: {query}", return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=200)
output_clean = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output_clean)

Now, the distilled 1B model can retain much of the teacher's quality on this narrow summarization task while running far faster and cheaper; the exact trade-off depends on the task and the capacity gap.

Wrapping Up

Model distillation has shifted the focus of LLM engineering from "how big can we go?" to "how small can we get?"

By effectively transferring knowledge from the teacher to student, AI applications can be not only smart but also sustainable and snappy.

When to Pivot: Distillation vs. RAG vs. Fine-Tuning

While model distillation is a powerful way to shrink large models into smaller, faster versions, it is not always the optimal choice.

Here are five cases where other tuning methods (like Fine-Tuning or RAG) are preferred:

1. High-Stakes Domain Specialization -> Choose Fine-Tuning.

Distillation can result in a loss of depth or nuanced reasoning.

While a distilled model mimics the teacher's style, it may lose the exact factual precision required for specialized fields.

Full or Parameter-Efficient Fine-Tuning (PEFT) is better for baking in specific domain knowledge.

  • Use Case: Medical diagnosis, legal contract analysis, or specialized engineering.

2. Frequent Data Updates -> Choose RAG or Context Engineering.

Distillation is a static process; if the information changes, the student must be re-distilled, which is computationally expensive.

Retrieval-Augmented Generation (RAG) is preferred here because it allows the model to access fresh data without any retraining.

  • Use Case: Real-time news bots, stock market analysis, or internal company wikis.

3. Safety-Critical Applications -> Choose Fine-Tuning or RLHF.

Research suggests that distillation (especially logit-based) can erode safety guardrails by up to 50% compared to the teacher.

The student prioritizes mimicking performance over following safety constraints.

Direct Fine-Tuning with safety-labeled data is more reliable for maintaining guardrails.

  • Use Case: Public-facing AI with strict compliance.

4. Limited Computational Access -> Choose PEFT (LoRA or QLoRA).

Distillation is a high-cost method because a massive teacher model must generate millions of synthetic labels, and then the student must be trained on them.

LoRA or QLoRA is much cheaper and faster because it only tunes a tiny fraction (<1%) of the large model's parameters.

  • Use Case: Startups or researchers with limited GPU access.

5. Bridging the Huge Capacity Gap -> Choose Multi-Stage Tuning.

If the capacity gap between the teacher and the student is too large, the student fails to learn effectively because it cannot model the teacher's complexity.

In these cases, Supervised Fine-Tuning (SFT) on high-quality labelled data yields better results than trying to force a tiny model to mimic a giant one.

  • Use Case: Trying to distill a 400B parameter model directly into a 1B parameter model.

Continue Your Learning

Related Books for Further Understanding

These books cover the wide range of theories and practices; from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation

Share What You Learned

Kuriko IWAI, "Model Distillation Guide: Compressing LLMs for Edge Efficiency" in Kernel Labs

https://kuriko-iwai.com/guide-to-llm-model-distillation-techniques


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experiments on this blog use synthetic or licensed data.