Engineer High-Fidelity SLM for Edge AI with Multi-step Tuning Pipeline
Learn how to engineer high-fidelity Small Language Model (SLM) with SFT, RKD, and DPO for edge deployment.
By Kuriko IWAI

Table of Contents
Introduction
What We'll Build

Introduction
Large Language Models (LLMs) are accessible on virtually any occasion and any device today.
However, large-scale models with hundreds of billions of parameters are too expensive for low-latency applications, while vanilla small models with a few billion parameters tend to lack fidelity and consistency in their responses.
To tackle this challenge, I will fine-tune a compact Llama 3.2-3B model to mirror the conversational prowess of much larger LLMs.
This involves a three-phase pipeline—SFT, RKD, and DPO—to effectively embed a distinct personality into the model's smaller weights.
What We'll Build
We'll build a digital clone that mimics our personality when answering questions.
The diagram below illustrates the system architecture:

Figure A. The system architecture of the digital twin for cloud and edge deployments (Created by Kuriko IWAI)
The fine-tuned model is deployed via AWS LMI with a Lambda function acting as a security proxy for downstream services, or distributed as a GGUF file for edge devices such as smartphones.
◼ The Architecture: Hybrid Cloud-Edge Deployment
I'll adopt a hybrid deployment strategy across cloud and edge environments.
The approach follows two phases:
Cloud Deployment: Enables rapid testing of the persona across any device (Web, Mobile, Tablet) via a simple API.
Edge Deployment: Positioned as a premium tier, allowing users to download the model for offline interaction with 100% data privacy.
The Llama 3.2-3B model strikes an optimal balance between computational power and a compact footprint.
◼ Comparison of Cloud and Edge Deployment
The table below compares the cloud and edge deployment strategies:
Table 1. Architectural Trade-offs: Cloud SageMaker vs. Local Edge Execution.
Cloud deployment can host massive models without hardware limitations but requires a stable internet connection for access.
Edge deployment eliminates the need for an internet connection; however, the model must be significantly compressed to fit within smartphone RAM constraints, and real-time inference risks rapid battery depletion.
Infrastructure Strategy: SageMaker LMI vs. Local GGUF
◼ Cloud Deployment Strategies - Choosing the Right Option
Cloud deployment tech stacks fall into four primary categories based on service model and infrastructure:
Table 2.1. SageMaker Deployment Tiers: From Enterprise vLLM to Serverless GGUF
The Serverless tier is the zero-maintenance option: SageMaker manages instances on behalf of the user, much like Lambda. But because it is CPU-only with a strict 6GB RAM limit, the model must stay small (< 3B parameters), and cold starts run 10-20 seconds.
The Hobbyist tier avoids the cold start by running llama.cpp on a tiny ml.t3.medium or ml.m5.large instance. Inference is slower than on a GPU, but a 1B model can still hit 15+ tokens/sec.
This project applies the Enterprise tier to serve the model's full intelligence in real time.
Developer Note:
In Asynchronous Inference, setting the minimum instance count to zero lets SageMaker shut down the GPU instance when no one is accessing the endpoint.
This introduces a 1+ minute cold start for the first message, but can save hundreds of dollars a month, especially when traffic is sparse.
◼ The Quantization
On SageMaker GPUs like the T4 or A10G, a 3B model is so small that quantizing the model is actually counter-productive.
Keeping it in BF16 ensures the model keeps all the subtle nuances that the model learns during DPO without any compression.
If the model exceeds 8B parameters, vLLM works well with formats like AWQ (4-bit) or FP8 (8-bit floating point), which are designed to let the GPU tensor cores do the math much faster than the GGUF format.
◼ Edge Deployment Strategies
For the edge deployment, model and quantization choices depend heavily on the edge device:
Table 2.2. Hardware Recommendation Matrix for Edge Quantization (GGUF).
In the case of Llama 3.2 3B, the model requires only ~3.0GB of RAM when quantized with Q4_K_S, leaving plenty of room for the mobile system to run other apps.
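As a sanity check, that RAM figure can be reproduced with back-of-envelope arithmetic. The sketch below assumes ~3.2B parameters, an average of ~4.5 bits per weight for Q4_K_S, and a flat ~1GB allowance for the KV cache and runtime buffers; all three numbers are rough assumptions, not llama.cpp's exact accounting:

```python
def estimate_gguf_ram_gb(n_params: float, bits_per_weight: float,
                         overhead_gb: float = 1.0) -> float:
    """Back-of-envelope RAM estimate: quantized weight bytes plus a
    flat allowance for KV cache, activations, and runtime buffers."""
    weight_gb = n_params * bits_per_weight / 8 / 1e9
    return round(weight_gb + overhead_gb, 1)

# Llama 3.2 3B (~3.2B params) at ~4.5 bits/weight (Q4_K_S average)
print(estimate_gguf_ram_gb(3.2e9, 4.5))  # → 2.8 (GB), in the ~3GB ballpark
```

The same formula explains why an 8B model at the same quantization level starts to crowd out other apps on a typical smartphone.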
The Multi-Step Model Tuning Pipeline
The training sequence follows a "Learn, Mimic, Align" flow:
Table 3. The Triple-Phase Model Tuning Sequence.
The data sources can be anything that fulfills the model's learning objective in each phase.
For example, in Phase 1, one can use MBTI results as the ground truth of their professional personality.
In either case, make sure to mask sensitive information before tuning the model.
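As a minimal sketch of that masking step (the regexes below are illustrative placeholders, not production-grade PII detection):

```python
import re

# Illustrative patterns only; extend for phone formats, IDs, etc. in your data
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_pii(text: str) -> str:
    """Replace each matched span with a placeholder token before tuning."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Reach me at jane.doe@example.com or +1 415-555-0101."))
# → Reach me at [EMAIL] or [PHONE].
```

Running this over every field of the training records keeps personal contact details out of the model's weights.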
◼ Phase 1. Learn: SFT & QLoRA for Factual Grounding
The first phase embeds my professional background into the model so that it masters the factual ground truth.
After creating at least 100 Question-Answer pairs from reference documents such as a resume and LinkedIn posts, I'll format the training dataset to follow the chat template of the Llama model family:
[
  {
    "instruction": "QUESTION_1",
    "context": "CONTEXT OF THE ANSWER_1. RELATED_PROJECT ETC",
    "response": "ANSWER_1"
  },
  ...
]
Then, I'll perform QLoRA (Quantized Low-Rank Adaptation) using the SFTTrainer instance from the trl library:
from trl.trainer.sft_trainer import SFTTrainer
from transformers import TrainingArguments

# instantiate sft trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset['train'],  # HuggingFace's Dataset
    processing_class=tokenizer,
    peft_config=peft_config,
    formatting_func=formatting_func,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        ...
    ),
)

# train
trainer.train()
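The formatting_func passed to SFTTrainer renders each record into the model's chat template. Here is a minimal sketch, assuming the trainer hands over one record at a time and using Llama 3's special tokens; in practice, tokenizer.apply_chat_template is the safer route since it stays in sync with the tokenizer:

```python
def formatting_func(example: dict) -> str:
    """Render one instruction/context/response record into the
    Llama 3 chat template (sketch; prefer tokenizer.apply_chat_template)."""
    user_turn = example["instruction"]
    if example.get("context"):
        user_turn += f"\n\nContext: {example['context']}"
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_turn}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{example['response']}<|eot_id|>"
    )

print(formatting_func({
    "instruction": "What stack did you use?",
    "context": "Project X",
    "response": "Mostly Python and AWS.",
}))
```

Keeping the template identical between training and inference is what lets the tuned model reliably stop at `<|eot_id|>`.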
The trained model and tokenizer are accessible:
trained_model = trainer.model
tokenizer = trainer.processing_class
Developer Note: Synthetic SFT - Another Form of RKD
Instead of manually drafting question-answer pairs, you can prompt an LLM to generate the pairs from a reference document such as a resume. This is another form of Response Knowledge Distillation, with the LLM acting as the teacher.
◼ Phase 2. Mimic: Response Knowledge Distillation (RKD) for Logical Nuance
Next, I'll use the latest GPT model, GPT-5.4, as the teacher and distill the logic behind my responses into the 3B student model, so that the student grasps the reasoning and linguistic nuance behind each response.
I'll first prompt the teacher to generate the train dataset:
Read the reference, and generate 100 examples that follow a Chain-of-Thought (CoT) structure such that:
Instruction: A task or question related to the reference.
Thought: Based on the given reference, write out the internal logical steps I took (e.g., 'First, I'll check the API logs to see if the token expired...').
Response: Write the final answer I selected based on the given reference.
The train dataset will look like:
[
  {
    "instruction": "GENERAL_QUESTION_1",
    "thought": "COT_1",
    "response": "RESPONSE_1"
  },
  ...
]
Developer Note:
Keep all elements under 150 tokens: small models perform much better and hallucinate less when the training examples are punchy and concise rather than long-winded.
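One simple way to enforce that budget before training is a filter pass. The sketch below approximates token counts from whitespace words (~1.3 tokens per English word is an assumption, not a tokenizer-exact figure); swap in the real tokenizer for exact counts:

```python
def within_budget(example: dict, max_tokens: int = 150,
                  tokens_per_word: float = 1.3) -> bool:
    """Approximate token count per field from whitespace words;
    reject the example if any field exceeds the budget."""
    for field in ("instruction", "thought", "response"):
        approx_tokens = len(example[field].split()) * tokens_per_word
        if approx_tokens > max_tokens:
            return False
    return True

dataset = [
    {"instruction": "Q", "thought": "short reasoning", "response": "short answer"},
    {"instruction": "Q", "thought": "word " * 200, "response": "ok"},  # too long
]
filtered = [ex for ex in dataset if within_budget(ex)]
print(len(filtered))  # → 1
```

Dropping (or re-prompting the teacher to shorten) over-long examples is cheaper than letting them degrade the student.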
Then, similar to the previous phase, I'll train the 3B-model using the SFTTrainer instance:
from trl.trainer.sft_trainer import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset['train'],
    processing_class=tokenizer,
    peft_config=peft_config,
    formatting_func=_formatting_func,
    args=training_args,
)

trainer.train()
◼ Phase 3. Align: DPO for Behavioral and Tone Alignment
The last phase performs DPO to align the model with my conversational tone and style.
DPO needs prompts paired with chosen and rejected answers:
[
  {
    "prompt": "QUESTION_1",
    "chosen": "ANSWER_1 (aligns with tone and style)",
    "rejected": "REJECTED_ANSWER_1 (sounds like AI)"
  },
  ...
]
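For reference, the objective DPO optimizes over these (prompt, chosen, rejected) triples can be written as follows, where $\pi_\theta$ is the tuned policy, $\pi_{\text{ref}}$ the frozen reference model, $\sigma$ the sigmoid, and $\beta$ controls how far the policy may drift from the reference:

```latex
\mathcal{L}_{\text{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
      \right)
    \right]
```

Here $x$ is the prompt, $y_w$ the chosen answer, and $y_l$ the rejected one: the loss pushes the policy to prefer $y_w$ over $y_l$ more strongly than the reference model does, without a separate reward model.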
After generating at least 100 training examples, I'll perform DPO using the DPOTrainer instance from the trl library:
from unsloth import PatchDPOTrainer
from trl.trainer.dpo_trainer import DPOTrainer

# apply the patch
PatchDPOTrainer()

# initialize the trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset['train'],
    processing_class=tokenizer,
)

# train
dpo_trainer.train()
◼ Saving the Tuned Model Artifacts
Lastly, I'll save the model artifacts:
# for cloud deployment (bf16)
model.save_pretrained_merged(
    <OUTPUT_DIR>,
    tokenizer,
    save_method="merged_16bit",  # keep the model in 16 bits
)

# for edge deployment (save the artifacts in a gguf file)
model.save_pretrained_gguf(
    <OUTPUT_DIR>,
    tokenizer,
    quantization_method="q4_k_s",  # llama.cpp quantization to compress the file
)
Deployment
Finally, the model is deployed to cloud and edge devices.
Here, I'll demonstrate the cloud deployment using the Enterprise tier described in Table 2.1: deploying the tuned model on SageMaker LMI with its built-in vLLM engine.
After setting up a dedicated IAM role and trust policy, I'll create a model on SageMaker:
# $IMAGE_URI points to SageMaker's built-in vLLM (LMI) container
aws sagemaker create-model \
  --model-name "$MODEL_NAME" \
  --execution-role-arn "$ROLE_ARN" \
  --primary-container "{
    \"Image\": \"$IMAGE_URI\",
    \"Environment\": {
      \"HF_MODEL_ID\": \"$S3_PATH\",
      \"OPTION_TRUST_REMOTE_CODE\": \"true\",
      \"OPTION_ROLLING_BATCH\": \"vllm\",
      \"OPTION_TASK\": \"text-generation\",
      \"OPTION_DTYPE\": \"fp16\",
      \"OPTION_MAX_MODEL_LEN\": \"512\"
    }
  }"
When the script calls aws sagemaker create-model, it passes the necessary configuration to the GPU instance. When the instance starts up, it:
Pulls the built-in container image from the IMAGE_URI.
Checks the Environment variables and applies them as serving.properties parameters:
- option.rolling_batch=vllm tells it to use the built-in vLLM engine.
- S3_PATH as HF_MODEL_ID tells it to load the tuned model stored at S3_PATH instead of the original HF model (e.g., unsloth/Llama-3.2-3B).
Initializes the built-in vLLM engine and deploys the tuned model as a new model on SageMaker.
Developer Note: SageMaker Standard Container vs LMI Container
Standard SageMaker containers (Hugging Face, PyTorch, etc.) require the model artifacts to be zipped into a tarball (model.tar.gz), which introduces decompression latency:
Receive request.
→ Instance starts.
→ Download tar.
→ Unzip tar to the default dir: /opt/ml/model/.
→ Load weights.
→ Perform inference.
Conversely, SageMaker LMI containers need no tarball; they pull raw weights directly from the S3 prefix in the HF_MODEL_ID variable, achieving better startup latency:
Receive request.
→ Instance starts.
→ Stream weights from S3.
→ Load weights.
→ Perform inference.
After creating a model, I'll configure a dedicated endpoint:
# create an endpoint config
aws sagemaker create-endpoint-config \
  --endpoint-config-name "$CONFIG_NAME" \
  --production-variants "[{
    \"VariantName\": \"variant-1\",
    \"ModelName\": \"$MODEL_NAME\",
    \"InstanceType\": \"ml.g4dn.xlarge\",
    \"InitialInstanceCount\": 1
  }]"

# create a dedicated endpoint and attach the config
aws sagemaker create-endpoint \
  --endpoint-name "$ENDPOINT_NAME" \
  --endpoint-config-name "$CONFIG_NAME"
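Once the endpoint is InService, it can be invoked through the sagemaker-runtime API. Below is a hedged sketch, assuming the DJL-LMI request schema (an "inputs" string plus a "parameters" dict) and a hypothetical <ENDPOINT_NAME> placeholder; check your container's docs for the exact schema it serves:

```python
import json
# import boto3  # uncomment to call the live endpoint

def build_request(prompt: str, max_new_tokens: int = 256) -> str:
    """Build the JSON body the LMI/vLLM container expects (assumed schema)."""
    return json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.7},
    })

body = build_request("Tell me about your latest project.")
print(body)

# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="<ENDPOINT_NAME>",
#     ContentType="application/json",
#     Body=body,
# )
# print(response["Body"].read().decode())
```

In the target architecture, this call would live inside the Lambda security proxy rather than in the client.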
◼ Why vLLM? PagedAttention and Streaming for LLMs
vLLM is built natively into the SageMaker LMI container (used here with Asynchronous Inference), so I didn't need to install the standalone vLLM package.
Using vLLM for even a tiny 3B model is strategic for three specific reasons:
Concurrency at scale: No lag to all users.
Cost: Squeeze more out of limited VRAM.
The human feel: Text appearing word-by-word.
▫ Concurrency at Scale
Latency on a standard inference engine (HuggingFace) grows sequentially as users pile up: when User A asks a long question, User B has to wait for the model to finish User A's entire response before theirs even starts.
vLLM instead uses continuous batching: the system starts generating a response for User B while it's still halfway through User A's paragraph. When multiple people talk to the clone at once, vLLM ensures nobody feels a lag, and latency stays nearly flat as concurrency scales.
▫ Cost
vLLM squeezes more out of affordable hardware: its PagedAttention dynamically manages the KV cache, eliminating OOM risks during long conversations.
▫ The Human Feel
vLLM is built for high-speed streaming behind an OpenAI-compatible API, which is key for the clone app: text appearing word-by-word is vital for the illusion of life.
Wrapping Up
Now, you can access the playground to test the inference.
To mitigate cold-start issues in Asynchronous Inference, implementing a frontend warmup function is recommended.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
A Technical Guide to QLoRA and Memory-Efficient Fine-Tuning
Model Distillation Guide: Compressing LLMs for Edge Efficiency
Is 4-Bit All You Need? The Math Behind Modern LLM Compression
Deconstructing LoRA: The Math and Mechanics of Low-Rank Adaptation
The Definitive Guide to LLM Fine-Tuning: Objectives, Mechanisms, and Hardware
Related Books for Further Understanding
These books cover a wide range of theories and practices, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation
Share What You Learned
Kuriko IWAI, "Engineer High-Fidelity SLM for Edge AI with Multi-step Tuning Pipeline" in Kernel Labs
https://kuriko-iwai.com/fine-tuning-slm-llama-persona-distillation-guide
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.




