Engineer High-Fidelity SLM for Edge AI with Multi-step Tuning Pipeline

Learn how to engineer a high-fidelity Small Language Model (SLM) with SFT, RKD, and DPO for edge deployment.

Machine Learning · Deep Learning · Data Science · Python · LLM

By Kuriko IWAI

Table of Contents

Introduction
What We'll Build
The Architecture: Hybrid Cloud-Edge Deployment
Comparison of Cloud and Edge Deployment
Infrastructure Strategy: SageMaker LMI vs. Local GGUF
Cloud Deployment Strategies - Choosing the Right Option
The Quantization
Edge Deployment Strategies
The Multi-Step Model Tuning Pipeline
Phase 1. Learn: SFT & QLoRA for Factual Grounding
Phase 2. Mimic: Response Knowledge Distillation (RKD) for Logical Nuance
Phase 3. Align: DPO for Behavioral and Tone Alignment
Saving the Tuned Model Artifacts
Deployment
Why vLLM? PagedAttention and Streaming for LLMs
Wrapping Up

Introduction

Large Language Models (LLMs) are now accessible for any use case, on any device.

However, large-scale models with hundreds of billions of parameters are too expensive for low-latency applications, while vanilla small models with a few billion parameters tend to lack fidelity and consistency in their responses.

To tackle this challenge, I will fine-tune a compact Llama 3.2-3B model to mirror the conversational prowess of much larger LLMs.

This involves a three-phase pipeline—SFT, RKD, and DPO—to effectively embed a distinct personality into the model's smaller weights.

What We'll Build

We'll build a digital clone that mimics our personality when answering questions.

The diagram below illustrates the system architecture:

Figure A. The system architecture of the digital twin for cloud and edge deployments (Created by Kuriko IWAI)

The fine-tuned model is deployed via AWS SageMaker LMI, with a Lambda function acting as a security proxy for downstream services, or distributed as a GGUF file for edge devices like smartphones.

The Architecture: Hybrid Cloud-Edge Deployment

I'll adopt a hybrid deployment strategy across cloud and edge environments.

The approach follows two phases:

  • Cloud Deployment: Enables rapid testing of the persona across any device (Web, Mobile, Tablet) via a simple API.

  • Edge Deployment: Positioned as a premium tier, allowing users to download the model for offline interaction with 100% data privacy.

The Llama 3.2-3B model strikes an optimal balance between computational power and a compact footprint.

Comparison of Cloud and Edge Deployment

The table below compares the cloud and edge deployment strategies:

| Strategy | Cloud Deployment (SageMaker) | Edge Deployment (Local) |
|---|---|---|
| Where the Model Lives | In the cloud, on a cloud provider's (AWS) GPU. | Inside the phone as a .gguf file. |
| Connectivity | Requires an internet connection. | Works offline; zero data cost. |
| Phone Battery Impact | Zero. The phone just displays text. | Drains the battery fast; the model can use 100% CPU. |
| Model Size | No limit. Can run massive 70B models. | Limited to small models. |
| Model Serving Overhead | Low. Requires only standard API calls. | High. Must code for the iOS/Android NPU. |

Table 1. Architectural Trade-offs: Cloud SageMaker vs. Local Edge Execution.

Cloud deployment can host massive models without hardware limitations but requires a stable internet connection for access.

Edge deployment eliminates the need for an internet connection; however, the model must be significantly compressed to fit within smartphone RAM constraints, and real-time inference risks rapid battery depletion.

Infrastructure Strategy: SageMaker LMI vs. Local GGUF

Cloud Deployment Strategies - Choosing the Right Option

Cloud deployment tech stacks fall into four primary categories based on service model and infrastructure:

| Tier | Format × Engine | Infrastructure | Why this works |
|---|---|---|---|
| The Enterprise | AWQ / FP8 × vLLM | SageMaker LMI (Real-time) | Max throughput. Optimized for hundreds of simultaneous users with sub-second latency via PagedAttention. |
| The Cost Saver | AWQ / BF16 × vLLM | SageMaker Async Inference | Scale-to-zero. Best for apps used 5–10 times an hour; saves money by shutting down the GPU during idle time. Cons: decompression latency of zipped tarballs; input/output must be stored in S3. |
| The Serverless | GGUF × llama.cpp | SageMaker Serverless | Pure pay-per-use. $0 cost when idle. Ideal for 1B models because of the max 6GB RAM limit. Cons: CPU-only, hard RAM limit, cold starts. |
| The Hobbyist | GGUF × llama.cpp | SageMaker BYOC (CPU) | Lowest entry barrier. Runs on cheap, always-on CPU instances (~$0.05/hr) like ml.t3.medium. Cons: CPU-only. |

Table 2.1. SageMaker Deployment Tiers: From Enterprise vLLM to Serverless GGUF

The Serverless tier is the zero-maintenance option: SageMaker manages instances on behalf of the user, just like Lambda. But because it's CPU-only and has a strict 6GB RAM limit, the model must stay under roughly 3B parameters. It also incurs a 10–20 second cold start.

The Hobbyist tier avoids the cold start by running llama.cpp on a tiny ml.t3.medium or ml.m5.large instance. Inference is slower than on a GPU, but a 1B model can still hit 15+ tokens/sec.

This project applies the Enterprise tier to serve the model's intelligence in real time.

Developer Note:

In Asynchronous Inference, setting the minimum instance count to zero lets SageMaker shut down the GPU instance when no requests arrive.

This introduces a 1+ minute cold start for the first message, but can save hundreds of dollars a month, especially when traffic is sparse.
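As a sketch, scale-to-zero on an async endpoint is registered through Application Auto Scaling after the endpoint exists; the endpoint and variant names here are placeholders:

```shell
# allow the async endpoint variant to scale between 0 and 1 instances
aws application-autoscaling register-scalable-target \
    --service-namespace sagemaker \
    --resource-id "endpoint/$ENDPOINT_NAME/variant/variant-1" \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --min-capacity 0 \
    --max-capacity 1
```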

The Quantization

On SageMaker GPUs like the T4 or A10G, a 3B model is small enough that quantizing it is actually counterproductive.

Keeping it in BF16 preserves all the subtle nuances the model learns during DPO, without any compression loss.

For models above 8B parameters, vLLM works well with formats like AWQ (4-bit) or FP8 (8-bit floating point), which are designed to let GPU tensor cores do the math much faster than the GGUF format.

Edge Deployment Strategies

For the edge deployment, model and quantization choices depend heavily on the edge device:

Device

Recommended Model Size

Quantization

Experience

Mobile (iOS/Android)

< 3B

Q4_K_S (GGUF)

Smooth, instant local chat. Blazing-fast responses (30–60 tokens/sec).

Laptop (8GB RAM)

3B

Q4_K_M (GGUF)

Very responsive local experience.

MacBook (16GB+ RAM)

8B

Q4_K_M (GGUF)

High intelligence, steady local speed.

Gaming PC (Local GPU)

8B

Q8_0 or FP16 (GGUF)

Best possible local quality.

Table 2.2. Hardware Recommendation Matrix for Edge Quantization (GGUF).

In the case of Llama 3.2 3B, the model requires only ~3.0GB of RAM when quantized with Q4_K_S, leaving plenty of room for the mobile system to run other apps.
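A back-of-the-envelope check of that figure; the ~4.5 bits/weight for Q4_K_S (which mixes 4- and 6-bit blocks) and the 0.5GB runtime overhead are rough assumptions, not exact llama.cpp numbers:

```python
def gguf_ram_gb(params_billions: float, bits_per_weight: float, overhead_gb: float = 0.5) -> float:
    # weights (billions of params x bytes per weight) plus runtime/KV-cache overhead
    weights_gb = params_billions * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)

print(gguf_ram_gb(3.0, 4.5))  # roughly 2.2GB, comfortably under the ~3GB budget
```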

The Multi-Step Model Tuning Pipeline

The training sequence follows a "Learn, Mimic, Align" flow:

| Phase | Tuning Method | Data Source | What the Model Learns | Objective |
|---|---|---|---|---|
| Phase 1. Learn | SFT (QLoRA) | Resume. References. Assessment results. | Factual grounding: equip the model with the knowledge base of "facts" on professional history. | Eliminate hallucination. |
| Phase 2. Mimic | RKD | Meeting transcripts. Copies of negotiation emails. | Logical nuance: distill Chain-of-Thought (CoT) from the teacher (GPT 5.4). | Mimic how the target person thinks. |
| Phase 3. Align | DPO | Social media comments. WhatsApp chat history. | Behavioral & tone alignment: make the model prefer specific tones over generic AI responses. | Sound like the target person. |

Table 3. The Triple-Phase Model Tuning Sequence.

The data sources can be anything that fulfills the model's learning objective in each phase.

For example, in Phase 1, one could use their MBTI results as the ground truth for their professional personality.

In either case, make sure to mask sensitive information before tuning the model.

Phase 1. Learn: SFT & QLoRA for Factual Grounding

The first phase instills my professional background into the model so that it masters the factual ground truth.

After creating at least 100 question-answer pairs from reference documents such as a resume and LinkedIn posts, I'll format the training dataset to follow the chat template of the Llama model family:

[
    {
        "instruction": "QUESTION_1",
        "context": "CONTEXT OF THE ANSWER_1. RELATED_PROJECT ETC",
        "response": "ANSWER_1"
    },
    ...
]
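The formatting_func passed to the trainer has to render each record into Llama 3's chat template. A minimal sketch, assuming the field names from the JSON above (the exact special tokens are those of the Llama 3 family):

```python
def formatting_func(example: dict) -> str:
    # render one QA record into the Llama 3 chat template
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{example['instruction']}\n\nContext: {example['context']}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{example['response']}<|eot_id|>"
    )

sample = {
    "instruction": "QUESTION_1",
    "context": "CONTEXT OF THE ANSWER_1",
    "response": "ANSWER_1",
}
print(formatting_func(sample))
```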

Then, I'll perform QLoRA (Quantized Low-Rank Adaptation) using the SFTTrainer instance from the trl library:

from trl.trainer.sft_trainer import SFTTrainer
from transformers import TrainingArguments

# instantiate the SFT trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset['train'], # HuggingFace's Dataset
    processing_class=tokenizer,
    peft_config=peft_config,
    formatting_func=formatting_func,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        ...
    ),
)

# train
trainer.train()

The trained model and tokenizer are accessible:

trained_model = trainer.model
tokenizer = trainer.processing_class

Developer Note: Synthetic SFT - Another Form of RKD

Instead of manually drafting question-answer pairs, you can prompt an LLM to generate the pairs from a reference document such as a resume. This is another form of Response Knowledge Distillation, with the LLM acting as the teacher.

Phase 2. Mimic: Response Knowledge Distillation (RKD) for Logical Nuance

Next, I'll use the latest GPT model, GPT-5.4, as a teacher and distill the logic behind my responses into the 3B student model, so that the student can grasp the reasoning and linguistic nuance behind each response.

I'll first prompt the teacher to generate the train dataset:

Read the reference, and generate 100 examples that follow a Chain-of-Thought (CoT) structure such that:
Instruction: A task or question related to the reference.
Thought: Based on the given reference, write out the internal logical steps I took (e.g., 'First, I'll check the API logs to see if the token expired...').
Response: Write the final answer I selected based on the given reference.

The training dataset will look like:

[
    {
        "instruction": "GENERAL_QUESTION_1",
        "thought": "COT_1",
        "response": "RESPONSE_1"
    },
    ...
]

Developer Note:

Keep all elements under 150 tokens because small models perform much better and hallucinate less when the training examples are punchy and concise rather than long-winded.
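One way to enforce this budget is to filter the generated examples before training. A sketch (is_concise and the whitespace word count are illustrative; pass a real tokenizer for exact token counts):

```python
def is_concise(example: dict, max_tokens: int = 150, tokenizer=None) -> bool:
    # exact count with a tokenizer if given, whitespace word count as a cheap proxy otherwise
    count = (lambda s: len(tokenizer.encode(s))) if tokenizer else (lambda s: len(s.split()))
    return all(count(example[k]) <= max_tokens for k in ("instruction", "thought", "response"))

# with a HuggingFace Dataset: train_dataset = train_dataset.filter(is_concise)
```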

Then, similar to the previous phase, I'll train the 3B-model using the SFTTrainer instance:

from trl.trainer.sft_trainer import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset['train'],
    processing_class=tokenizer,
    peft_config=peft_config,
    formatting_func=_formatting_func,
    args=training_args,
)

trainer.train()

Phase 3. Align: DPO for Behavioral and Tone Alignment

The last phase performs DPO to align the model with the target tone and conversational style.

DPO needs prompts paired with chosen and rejected answers:

[
    {
        "prompt": "QUESTION_1",
        "chosen": "ANSWER_1 (aligns with tone and style)",
        "rejected": "REJECTED_ANSWER_1 (sounds like AI)"
    },
    ...
]
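Before training, it's worth validating each preference pair; a small sketch (validate_dpo_example and its checks are illustrative, not part of the trl library):

```python
def validate_dpo_example(example: dict) -> bool:
    # every DPO record needs a prompt plus a distinct chosen/rejected pair
    missing = {"prompt", "chosen", "rejected"} - example.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if example["chosen"] == example["rejected"]:
        raise ValueError("chosen and rejected answers must differ")
    return True

print(validate_dpo_example({"prompt": "Q", "chosen": "warm answer", "rejected": "robotic answer"}))
```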

After generating at least 100 training examples, I'll perform DPO using the DPOTrainer instance from the trl library:

from unsloth import PatchDPOTrainer
from trl.trainer.dpo_trainer import DPOTrainer

# apply the Unsloth patch
PatchDPOTrainer()

# initialize the trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset['train'],
    processing_class=tokenizer,
)

# train
dpo_trainer.train()

Saving the Tuned Model Artifacts

Lastly, I'll save the model artifacts:

# for cloud deployment (BF16)
model.save_pretrained_merged(
    <OUTPUT_DIR>,
    tokenizer,
    save_method="merged_16bit", # keep the model in 16 bits
)

# for edge deployment (save the artifacts in a GGUF file)
model.save_pretrained_gguf(
    <OUTPUT_DIR>,
    tokenizer,
    quantization_method="q4_k_s", # llama.cpp quantization to compress the file
)

Deployment

Finally, the model is deployed to cloud and edge devices.

Here, I'll demonstrate the cloud deployment using the Enterprise tier described in Table 2.1: deploying the tuned model on SageMaker LMI with its built-in vLLM engine.

After setting up a dedicated IAM role and trust policy, I'll create a model on SageMaker:

# IMAGE_URI points to SageMaker's built-in vLLM (LMI) container
aws sagemaker create-model \
    --model-name "$MODEL_NAME" \
    --execution-role-arn "$ROLE_ARN" \
    --primary-container "{
        \"Image\": \"$IMAGE_URI\",
        \"Environment\": {
            \"HF_MODEL_ID\": \"$S3_PATH\",
            \"OPTION_TRUST_REMOTE_CODE\": \"true\",
            \"OPTION_ROLLING_BATCH\": \"vllm\",
            \"OPTION_TASK\": \"text-generation\",
            \"OPTION_DTYPE\": \"fp16\",
            \"OPTION_MAX_MODEL_LEN\": \"512\"
        }
    }"

When the script calls aws sagemaker create-model, it passes the necessary configuration to the GPU instance. When the instance starts up, it:

  1. Pulls the built-in container image from the IMAGE_URI.

  2. Checks the Environment variables and applies them as serving.properties parameters:
    - option.rolling_batch=vllm indicates it uses the built-in vLLM engine.
    - S3_PATH as HF_MODEL_ID indicates it uses the remote (tuned) model stored at the S3_PATH instead of the original HF model (e.g., unsloth/Llama-3.2-3B).

  3. Initializes the built-in vLLM engine and deploys the tuned model as a new model on SageMaker.

Developer Note: SageMaker Standard Container vs LMI Container
Standard SageMaker containers (Hugging Face, PyTorch, etc.) require the model artifacts to be zipped into a tarball (model.tar.gz), which triggers decompression latency:

Receive request.
→ Instance starts.
→ Download tar.
→ Unzip tar to the default dir: /opt/ml/model/.
→ Load weights.
→ Perform inference.

Conversely, SageMaker LMI containers do not need a tarball; they pull raw weights directly from the S3 prefix in the HF_MODEL_ID variable, cutting startup latency:

Receive request.
→ Instance starts.
→ Stream weights from S3.
→ Load weights.
→ Perform inference.

After creating a model, I'll configure a dedicated endpoint:

# create an endpoint config
aws sagemaker create-endpoint-config \
    --endpoint-config-name "$CONFIG_NAME" \
    --production-variants "[{
        \"VariantName\": \"variant-1\",
        \"ModelName\": \"$MODEL_NAME\",
        \"InstanceType\": \"ml.g4dn.xlarge\",
        \"InitialInstanceCount\": 1
    }]"

# create a dedicated endpoint and attach the config
aws sagemaker create-endpoint \
    --endpoint-name "$ENDPOINT_NAME" \
    --endpoint-config-name "$CONFIG_NAME"
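Once the endpoint is InService, downstream services (such as the Lambda proxy) can invoke it. A sketch of the request body that LMI rolling-batch containers accept (an "inputs" string plus a "parameters" dict); the endpoint name and generation parameters are placeholders:

```python
import json

def build_payload(prompt: str, max_new_tokens: int = 256) -> bytes:
    # JSON body for a SageMaker LMI (vLLM rolling-batch) endpoint
    return json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.7},
    }).encode("utf-8")

# invocation via boto3 (requires AWS credentials and a live endpoint):
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="<ENDPOINT_NAME>",
#     ContentType="application/json",
#     Body=build_payload("Who are you?"),
# )
# print(response["Body"].read().decode())
```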

Why vLLM? PagedAttention and Streaming for LLMs

vLLM is built natively into SageMaker LMI (including its Asynchronous Inference containers), so there's no need to install the standalone vLLM package.

Using vLLM for even a tiny 3B model is strategic for three specific reasons:

  • Concurrency at scale: no lag even as users multiply.

  • Cost: squeeze more out of limited VRAM.

  • The human feel: text appearing word-by-word.

Concurrency at Scale

Latency on a standard inference engine (e.g., a vanilla HuggingFace pipeline) grows sequentially as users increase. When User A asks a long question, User B has to wait for the model to finish User A's entire response before it even starts on theirs.

vLLM instead uses continuous batching: the system starts generating a response for User B while it's still halfway through User A's paragraph. When multiple people talk to the clone at once, nobody feels a lag, and latency stays nearly flat as concurrency grows.
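A toy simulation of the difference (the request names and token counts are made up, and real vLLM scheduling is far more sophisticated, but the interleaving effect is the same):

```python
from itertools import zip_longest

def sequential(requests):
    # static serving: finish each request entirely before starting the next
    order = []
    for name, n_tokens in requests:
        order += [name] * n_tokens
    return order

def continuous(requests):
    # continuous batching: every step emits one token for each active request
    order = []
    for step in zip_longest(*([name] * n for name, n in requests)):
        order += [name for name in step if name is not None]
    return order

reqs = [("A", 4), ("B", 2)]
print(sequential(reqs))  # B's first token arrives only after all of A's
print(continuous(reqs))  # B's tokens are interleaved from the first step
```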

Cost

vLLM squeezes more out of affordable hardware: its PagedAttention manages the KV cache dynamically, removing OOM risks during long conversations.

The Human Feel

vLLM is built to support high-speed streaming with an OpenAI-compatible API, which is key to the clone app: text appearing word-by-word is vital for the illusion of life.

Wrapping Up

Now, you can access the playground to test the inference.

To mitigate cold-start issues in Asynchronous Inference, implementing a frontend warmup function is recommended.

Continue Your Learning

If you enjoyed this blog, these related entries will complete the picture:

Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation

Share What You Learned

Kuriko IWAI, "Engineer High-Fidelity SLM for Edge AI with Multi-step Tuning Pipeline" in Kernel Labs

https://kuriko-iwai.com/fine-tuning-slm-llama-persona-distillation-guide


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.