Engineer High-Fidelity SLM for Edge AI with Multi-step Tuning Pipeline

Learn how to engineer a high-fidelity Small Language Model (SLM) with SFT, RKD, and DPO for edge deployment.

Machine Learning · Deep Learning · Data Science · Python · LLM

By Kuriko IWAI

Table of Contents

Introduction
What We'll Build
The Architecture: Hybrid Cloud-Edge Deployment
Comparison of Cloud and Edge Deployment
Infrastructure Strategy: SageMaker LMI vs. Local GGUF
Cloud Deployment Strategies - Choosing the Right Option
The Quantization
Edge Deployment Strategies
The Multi-Step Model Tuning Pipeline
Phase 1. Learn: SFT & QLoRA for Factual Grounding
Phase 2. Mimic: Response Knowledge Distillation (RKD) for Logical Nuance
Phase 3. Align: DPO for Behavioral and Tone Alignment
Saving the Tuned Model Artifacts
Deployment
Why vLLM? PagedAttention and Streaming for LLMs
Wrapping Up

Introduction

Large Language Models (LLMs) are now accessible for any use case, on any device.

However, large-scale models with hundreds of billions of parameters are too expensive for low-latency applications, while vanilla small models with a few billion parameters tend to lack fidelity and consistency in their responses.

To tackle this challenge, I will fine-tune a compact Llama 3.2-3B model to mirror the conversational prowess of much larger LLMs.

This involves a three-phase pipeline—SFT, RKD, and DPO—to effectively embed a distinct personality into the model's smaller weights.

What We'll Build

We'll build a digital clone that mimics our personality when answering questions.

The diagram below illustrates the system architecture:

Figure A. The system architecture of the digital twin for cloud and edge deployments (Created by Kuriko IWAI)

The fine-tuned model is deployed via AWS SageMaker LMI, with a Lambda function acting as a security proxy for downstream services, or distributed as a GGUF file for edge devices like smartphones.

The Architecture: Hybrid Cloud-Edge Deployment

I'll adopt a hybrid deployment strategy across cloud and edge environments.

The approach follows two phases:

  • Cloud Deployment: Enables rapid testing of the persona across any device (Web, Mobile, Tablet) via a simple API.

  • Edge Deployment: Positioned as a premium tier, allowing users to download the model for offline interaction with 100% data privacy.

The Llama 3.2-3B model strikes an optimal balance between computational power and a compact footprint.

Comparison of Cloud and Edge Deployment

The table below compares the cloud and edge deployment strategies:

| Strategy | Cloud Deployment (SageMaker) | Edge Deployment (Local) |
|---|---|---|
| Where the Model Lives | In the cloud, on a cloud provider's (AWS) GPU. | Inside the phone as a .gguf file. |
| Connectivity | Requires an internet connection. | Works offline; zero data cost. |
| Phone Battery Impact | Zero. The phone just displays text. | Drains the battery fast; the model can use 100% CPU. |
| Model Size | No limit. Can run massive 70B models. | Limited to small models. |
| Model Serving Overhead | Low. Requires only standard API calls. | High. Must code for the iOS/Android NPU. |

Table 1. Architectural Trade-offs: Cloud SageMaker vs. Local Edge Execution.

Cloud deployment can host massive models without hardware limitations but requires a stable internet connection for access.

Edge deployment eliminates the need for an internet connection; however, the model must be significantly compressed to fit within smartphone RAM constraints, and real-time inference risks rapid battery depletion.

Infrastructure Strategy: SageMaker LMI vs. Local GGUF

Cloud Deployment Strategies - Choosing the Right Option

Cloud deployment tech stacks fall into four primary categories based on service model and infrastructure:

| Tier | Format × Engine | Infrastructure | Why this works |
|---|---|---|---|
| The Enterprise | AWQ / FP8 × vLLM | SageMaker LMI (Real-time) | Max throughput. Optimized for hundreds of simultaneous users with sub-second latency via PagedAttention. |
| The Cost Saver | AWQ / BF16 × vLLM | SageMaker Async Inference | Scale-to-zero. Best for apps used 5–10 times an hour; saves money by shutting down the GPU during idle time. Cons: decompression latency of zipped tarballs; input/output must be stored in S3. |
| The Serverless | GGUF × llama.cpp | SageMaker Serverless | Pure pay-per-use. $0 cost when idle. Ideal for 1B models because of the max 6GB RAM limit. Cons: CPU-only, hard RAM limit, cold starts. |
| The Hobbyist | GGUF × llama.cpp | SageMaker BYOC (CPU) | Lowest entry barrier. Runs on cheap, always-on CPU instances (~$0.05/hr) like ml.t3.medium. Cons: CPU-only. |

Table 2.1. SageMaker Deployment Tiers: From Enterprise vLLM to Serverless GGUF

The Serverless tier is the zero-maintenance option: SageMaker manages instances on behalf of the user, just like Lambda. But because it's CPU-only and has a strict 6GB RAM limit, the model must stay under roughly 3B parameters. It also incurs a 10–20 second cold start.

The Hobbyist tier avoids the cold start by running llama.cpp on a tiny ml.t3.medium or ml.m5.large instance. Inference is slower than on a GPU, but a 1B model can still hit 15+ tokens/sec.

This project applies the Enterprise tier to serve the model's intelligence in real time.

Developer Note:

In Asynchronous Inference, setting the minimum instance count to zero lets SageMaker shut down the GPU instance when no requests arrive.

This introduces a 1+ minute cold start for the first message, but can save hundreds of dollars a month, especially when traffic is sparse.
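As a sketch, scale-to-zero on an async endpoint is registered through Application Auto Scaling after the endpoint exists; the endpoint and variant names here are placeholders:

```shell
# allow the async endpoint variant to scale between 0 and 1 instances
aws application-autoscaling register-scalable-target \
    --service-namespace sagemaker \
    --resource-id "endpoint/$ENDPOINT_NAME/variant/variant-1" \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --min-capacity 0 \
    --max-capacity 1
```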

The Quantization

On SageMaker GPUs like the T4 or A10G, a 3B model is small enough that quantizing it is actually counterproductive.

Keeping it in BF16 preserves all the subtle nuances the model learns during DPO, without any compression loss.

For models above 8B parameters, vLLM works well with formats like AWQ (4-bit) or FP8 (8-bit floating point), which are designed to let GPU tensor cores do the math much faster than the GGUF format.

Edge Deployment Strategies

For the edge deployment, model and quantization choices depend heavily on the edge device:

Device

Recommended Model Size

Quantization

Experience

Mobile (iOS/Android)

< 3B

Q4_K_S (GGUF)

Smooth, instant local chat. Blazing-fast responses (30–60 tokens/sec).

Laptop (8GB RAM)

3B

Q4_K_M (GGUF)

Very responsive local experience.

MacBook (16GB+ RAM)

8B

Q4_K_M (GGUF)

High intelligence, steady local speed.

Gaming PC (Local GPU)

8B

Q8_0 or FP16 (GGUF)

Best possible local quality.

Table 2.2. Hardware Recommendation Matrix for Edge Quantization (GGUF).

In the case of Llama 3.2 3B, the model requires only ~3.0GB of RAM when quantized with Q4_K_S, leaving plenty of room for the mobile system to run other apps.
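A back-of-the-envelope check of that figure; the ~4.5 bits/weight for Q4_K_S (which mixes 4- and 6-bit blocks) and the 0.5GB runtime overhead are rough assumptions, not exact llama.cpp numbers:

```python
def gguf_ram_gb(params_billions: float, bits_per_weight: float, overhead_gb: float = 0.5) -> float:
    # weights (billions of params x bytes per weight) plus runtime/KV-cache overhead
    weights_gb = params_billions * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)

print(gguf_ram_gb(3.0, 4.5))  # roughly 2.2GB, comfortably under the ~3GB budget
```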

The Multi-Step Model Tuning Pipeline

The training sequence follows a "Learn, Mimic, Align" flow:

| Phase | Tuning Method | Data Source | What the Model Learns | Objective |
|---|---|---|---|---|
| Phase 1. Learn | SFT (QLoRA) | Resume. References. Assessment results. | Factual grounding: equip the model with the knowledge base of "facts" on professional history. | Eliminate hallucination. |
| Phase 2. Mimic | RKD | Meeting transcripts. Copies of negotiation emails. | Logical nuance: distill Chain-of-Thought (CoT) from the teacher (GPT 5.4). | Mimic how the target person thinks. |
| Phase 3. Align | DPO | Social media comments. WhatsApp chat history. | Behavioral & tone alignment: make the model prefer specific tones over generic AI responses. | Sound like the target person. |

Table 3. The Triple-Phase Model Tuning Sequence.

The data sources can be anything that fulfills the model's learning objective in each phase.

For example, in Phase 1, one could use their MBTI results as the ground truth for their professional personality.

In either case, make sure to mask sensitive information before tuning the model.

Phase 1. Learn: SFT & QLoRA for Factual Grounding

The first phase instills my professional background into the model so that it masters the factual ground truth.

After creating at least 100 question-answer pairs from reference documents such as a resume and LinkedIn posts, I'll format the training dataset to follow the chat template of the Llama model family:

[
    {
        "instruction": "QUESTION_1",
        "context": "CONTEXT OF THE ANSWER_1. RELATED_PROJECT ETC",
        "response": "ANSWER_1"
    },
    ...
]
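The formatting_func passed to the trainer has to render each record into Llama 3's chat template. A minimal sketch, assuming the field names from the JSON above (the exact special tokens are those of the Llama 3 family):

```python
def formatting_func(example: dict) -> str:
    # render one QA record into the Llama 3 chat template
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{example['instruction']}\n\nContext: {example['context']}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{example['response']}<|eot_id|>"
    )

sample = {
    "instruction": "QUESTION_1",
    "context": "CONTEXT OF THE ANSWER_1",
    "response": "ANSWER_1",
}
print(formatting_func(sample))
```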

Then, I'll perform QLoRA (Quantized Low-Rank Adaptation) using the SFTTrainer instance from the trl library:

from trl.trainer.sft_trainer import SFTTrainer
from transformers import TrainingArguments

# instantiate the SFT trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset['train'], # HuggingFace's Dataset
    processing_class=tokenizer,
    peft_config=peft_config,
    formatting_func=formatting_func,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        ...
    ),
)

# train
trainer.train()

The trained model and tokenizer are accessible:

trained_model = trainer.model
tokenizer = trainer.processing_class

Developer Note: Synthetic SFT - Another Form of RKD

Instead of manually drafting question-answer pairs, you can prompt an LLM to generate the pairs from a reference document such as a resume. This is another form of Response Knowledge Distillation, with the LLM acting as the teacher.

Phase 2. Mimic: Response Knowledge Distillation (RKD) for Logical Nuance

Next, I'll use the latest GPT model, GPT-5.4, as a teacher and distill the logic behind my responses into the 3B student model, so that the student can grasp the reasoning and linguistic nuance behind each response.

I'll first prompt the teacher to generate the train dataset:

Read the reference, and generate 100 examples that follow a Chain-of-Thought (CoT) structure such that:
Instruction: A task or question related to the reference.
Thought: Based on the given reference, write out the internal logical steps I took (e.g., 'First, I'll check the API logs to see if the token expired...').
Response: Write the final answer I selected based on the given reference.

The training dataset will look like:

[
    {
        "instruction": "GENERAL_QUESTION_1",
        "thought": "COT_1",
        "response": "RESPONSE_1"
    },
    ...
]

Developer Note:

Keep all elements under 150 tokens because small models perform much better and hallucinate less when the training examples are punchy and concise rather than long-winded.
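One way to enforce this budget is to filter the generated examples before training. A sketch (is_concise and the whitespace word count are illustrative; pass a real tokenizer for exact token counts):

```python
def is_concise(example: dict, max_tokens: int = 150, tokenizer=None) -> bool:
    # exact count with a tokenizer if given, whitespace word count as a cheap proxy otherwise
    count = (lambda s: len(tokenizer.encode(s))) if tokenizer else (lambda s: len(s.split()))
    return all(count(example[k]) <= max_tokens for k in ("instruction", "thought", "response"))

# with a HuggingFace Dataset: train_dataset = train_dataset.filter(is_concise)
```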

Then, similar to the previous phase, I'll train the 3B-model using the SFTTrainer instance:

from trl.trainer.sft_trainer import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset['train'],
    processing_class=tokenizer,
    peft_config=peft_config,
    formatting_func=_formatting_func,
    args=training_args,
)

trainer.train()

Phase 3. Align: DPO for Behavioral and Tone Alignment

The last phase performs DPO to align the model with the target tone and conversational style.

DPO needs prompts paired with chosen and rejected answers:

[
    {
        "prompt": "QUESTION_1",
        "chosen": "ANSWER_1 (aligns with tone and style)",
        "rejected": "REJECTED_ANSWER_1 (sounds like AI)"
    },
    ...
]
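Before training, it's worth validating each preference pair; a small sketch (validate_dpo_example and its checks are illustrative, not part of the trl library):

```python
def validate_dpo_example(example: dict) -> bool:
    # every DPO record needs a prompt plus a distinct chosen/rejected pair
    missing = {"prompt", "chosen", "rejected"} - example.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if example["chosen"] == example["rejected"]:
        raise ValueError("chosen and rejected answers must differ")
    return True

print(validate_dpo_example({"prompt": "Q", "chosen": "warm answer", "rejected": "robotic answer"}))
```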

After generating at least 100 training examples, I'll perform DPO using the DPOTrainer instance from the trl library:

from unsloth import PatchDPOTrainer
from trl.trainer.dpo_trainer import DPOTrainer

# apply the Unsloth patch
PatchDPOTrainer()

# initialize the trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset['train'],
    processing_class=tokenizer,
)

# train
dpo_trainer.train()

Saving the Tuned Model Artifacts

Lastly, I'll save the model artifacts:

# for cloud deployment (BF16)
model.save_pretrained_merged(
    <OUTPUT_DIR>,
    tokenizer,
    save_method="merged_16bit", # keep the model in 16 bits
)

# for edge deployment (save the artifacts in a GGUF file)
model.save_pretrained_gguf(
    <OUTPUT_DIR>,
    tokenizer,
    quantization_method="q4_k_s", # llama.cpp quantization to compress the file
)

Deployment

Finally, the model is deployed to cloud and edge devices.

Here, I'll demonstrate the cloud deployment using the Enterprise tier described in Table 2.1: deploying the tuned model on SageMaker LMI with its built-in vLLM engine.

After setting up a dedicated IAM role and trust policy, I'll create a model on SageMaker:

# IMAGE_URI points to SageMaker's built-in vLLM (LMI) container
aws sagemaker create-model \
    --model-name "$MODEL_NAME" \
    --execution-role-arn "$ROLE_ARN" \
    --primary-container "{
        \"Image\": \"$IMAGE_URI\",
        \"Environment\": {
            \"HF_MODEL_ID\": \"$S3_PATH\",
            \"OPTION_TRUST_REMOTE_CODE\": \"true\",
            \"OPTION_ROLLING_BATCH\": \"vllm\",
            \"OPTION_TASK\": \"text-generation\",
            \"OPTION_DTYPE\": \"fp16\",
            \"OPTION_MAX_MODEL_LEN\": \"512\"
        }
    }"

When the script calls aws sagemaker create-model, it passes the necessary configuration to the GPU instance. When the instance starts up, it:

  1. Pulls the built-in container image from the IMAGE_URI.

  2. Checks the Environment variables and applies them as serving.properties parameters:
    - option.rolling_batch=vllm indicates it uses the built-in vLLM engine.
    - S3_PATH as HF_MODEL_ID indicates it uses the remote (tuned) model stored at the S3_PATH instead of the original HF model (e.g., unsloth/Llama-3.2-3B).

  3. Initializes the built-in vLLM engine and deploys the tuned model as a new model on SageMaker.

Developer Note: SageMaker Standard Container vs LMI Container
Standard SageMaker containers (Hugging Face, PyTorch, etc.) require the model artifacts to be zipped into a tarball (model.tar.gz), which triggers decompression latency:

Receive request.
→ Instance starts.
→ Download tar.
→ Unzip tar to the default dir: /opt/ml/model/.
→ Load weights.
→ Perform inference.

Conversely, SageMaker LMI containers do not need a tarball; they pull raw weights directly from the S3 prefix in the HF_MODEL_ID variable, cutting startup latency:

Receive request.
→ Instance starts.
→ Stream weights from S3.
→ Load weights.
→ Perform inference.

After creating a model, I'll configure a dedicated endpoint:

# create an endpoint config
aws sagemaker create-endpoint-config \
    --endpoint-config-name "$CONFIG_NAME" \
    --production-variants "[{
        \"VariantName\": \"variant-1\",
        \"ModelName\": \"$MODEL_NAME\",
        \"InstanceType\": \"ml.g4dn.xlarge\",
        \"InitialInstanceCount\": 1
    }]"

# create a dedicated endpoint and attach the config
aws sagemaker create-endpoint \
    --endpoint-name "$ENDPOINT_NAME" \
    --endpoint-config-name "$CONFIG_NAME"
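Once the endpoint is InService, downstream services (such as the Lambda proxy) can invoke it. A sketch of the request body that LMI rolling-batch containers accept (an "inputs" string plus a "parameters" dict); the endpoint name and generation parameters are placeholders:

```python
import json

def build_payload(prompt: str, max_new_tokens: int = 256) -> bytes:
    # JSON body for a SageMaker LMI (vLLM rolling-batch) endpoint
    return json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.7},
    }).encode("utf-8")

# invocation via boto3 (requires AWS credentials and a live endpoint):
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="<ENDPOINT_NAME>",
#     ContentType="application/json",
#     Body=build_payload("Who are you?"),
# )
# print(response["Body"].read().decode())
```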

Why vLLM? PagedAttention and Streaming for LLMs

vLLM is built natively into SageMaker LMI (including its Asynchronous Inference containers), so there's no need to install the standalone vLLM package.

Using vLLM for even a tiny 3B model is strategic for three specific reasons:

  • Concurrency at scale: no lag even as users multiply.

  • Cost: squeeze more out of limited VRAM.

  • The human feel: text appearing word-by-word.

Concurrency at Scale

Latency on a standard inference engine (e.g., a vanilla HuggingFace pipeline) grows sequentially as users increase. When User A asks a long question, User B has to wait for the model to finish User A's entire response before it even starts on theirs.

vLLM instead uses continuous batching: the system starts generating a response for User B while it's still halfway through User A's paragraph. When multiple people talk to the clone at once, nobody feels a lag, and latency stays nearly flat as concurrency grows.
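A toy simulation of the difference (the request names and token counts are made up, and real vLLM scheduling is far more sophisticated, but the interleaving effect is the same):

```python
from itertools import zip_longest

def sequential(requests):
    # static serving: finish each request entirely before starting the next
    order = []
    for name, n_tokens in requests:
        order += [name] * n_tokens
    return order

def continuous(requests):
    # continuous batching: every step emits one token for each active request
    order = []
    for step in zip_longest(*([name] * n for name, n in requests)):
        order += [name for name in step if name is not None]
    return order

reqs = [("A", 4), ("B", 2)]
print(sequential(reqs))  # B's first token arrives only after all of A's
print(continuous(reqs))  # B's tokens are interleaved from the first step
```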

Cost

vLLM squeezes more out of affordable hardware: its PagedAttention manages the KV cache dynamically, removing OOM risks during long conversations.

The Human Feel

vLLM is built to support high-speed streaming with an OpenAI-compatible API, which is key to the clone app: text appearing word-by-word is vital for the illusion of life.

Wrapping Up

Now, you can access the playground to test the inference.

To mitigate cold-start issues in Asynchronous Inference, implementing a frontend warmup function is recommended.

Continue Your Learning

If you enjoyed this blog, these related entries will complete the picture:

Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation

Share What You Learned

Kuriko IWAI, "Engineer High-Fidelity SLM for Edge AI with Multi-step Tuning Pipeline" in Kernel Labs

https://kuriko-iwai.com/fine-tuning-slm-llama-persona-distillation-guide


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.