Large Language Models (LLMs) are now accessible on virtually any occasion and any device.
But LLMs with hundreds of billions of parameters are too expensive for low-latency applications, while vanilla small language models (SLMs) tend to lack fidelity and consistency in their responses.
To tackle this challenge, I’ll tune a compact Llama 3.2-3B model to mirror the conversational prowess of much larger LLMs.
This involves a three-phase pipeline—SFT, RKD, and DPO—to effectively embed a distinct personality into the model's smaller weights.
What We'll Build
We'll build a digital clone that mimics our personality when answering questions.
The diagram below illustrates the system architecture:
Figure A. The system architecture of the digital twin for cloud and edge deployments (Created by Kuriko IWAI)
The fine-tuned model is deployed via AWS LMI with a Lambda function acting as a security proxy for downstream services, or distributed as a GGUF file for edge devices such as smartphones.
◼ The Architecture: Hybrid Cloud-Edge Deployment
I'll adopt a hybrid deployment strategy across cloud and edge environments.
The approach follows two phases:
Cloud Deployment: Enables rapid testing of the persona across any device (Web, Mobile, Tablet) via a simple API.
Edge Deployment: Positioned as a premium tier, allowing users to download the model for offline interaction with 100% data privacy.
The Llama 3.2-3B model strikes the optimal balance between computational power and a compact footprint.
◼ Comparison of Cloud and Edge Deployment
The table below compares the cloud and edge deployment strategies:

| Strategy | Cloud Deployment (SageMaker) | Edge Deployment (Local) |
| --- | --- | --- |
| Where the Model Lives | In the cloud, on a cloud provider's (AWS) GPU. | Inside the phone, as a `.gguf` file. |
| Connectivity | Requires an Internet connection. | Works offline; zero data cost. |
| Phone Battery Impact | Zero. The phone just displays text. | Drains the battery fast; the model uses 100% CPU. |
| Model Size | No limit. Can run massive 70B models. | Limited to small models. |
| Model Serving Overhead | Low. Requires only standard API calls. | High. Must code for the iOS/Android NPU. |

Table 1. Architectural Trade-offs: Cloud SageMaker vs. Local Edge Execution.
Cloud deployment can host massive models without hardware limitations but requires a stable internet connection for access.
Edge deployment eliminates the need for an internet connection; however, the model must be significantly compressed to fit within smartphone RAM constraints, and real-time inference risks rapid battery depletion.
Infrastructure Strategy: SageMaker LMI vs. Local GGUF
◼ Cloud Deployment Strategies - Choosing the Right Option
Cloud deployment tech stacks fall into four primary tiers based on service model and infrastructure:

| Tier | Format × Engine | Infrastructure | Why this works |
| --- | --- | --- | --- |
| The Enterprise | AWQ / FP8 × vLLM | SageMaker LMI (Real-time) | Max throughput. Optimized for hundreds of simultaneous users with sub-second latency via PagedAttention. |
| The Cost Saver | AWQ / BF16 × vLLM | SageMaker Async Inference | Scale-to-zero. Best for apps used 5–10 times an hour; saves money by shutting down the GPU during idle time. Cons: decompression latency of zipped tarballs; input/output must be stored in S3. |
| The Serverless | GGUF × llama.cpp | SageMaker Serverless | Pure pay-per-use. Zero cost when idle. Ideal for ~1B models because of the 6GB max RAM limit. Cons: CPU-only, hard RAM limit, cold starts. |
| The Hobbyist | GGUF × llama.cpp | SageMaker BYOC (CPU) | Lowest entry barrier. Runs on cheap, always-on CPU instances (~$0.05/hr) like ml.t3.medium. Cons: CPU-only. |

Table 2.1. SageMaker Deployment Tiers: From Enterprise vLLM to Serverless GGUF.
The Serverless tier is the zero-maintenance option: SageMaker manages instances on behalf of the user, just like Lambda. But because it is CPU-only with a strict 6GB RAM limit, the model must stay small (< 3B parameters), and cold starts take 10–20 seconds.
The Hobbyist tier avoids the cold start by running llama.cpp on a tiny ml.t3.medium or ml.m5.large instance. Inference is slower than on a GPU, but a 1B model can still hit 15+ tokens/sec.
This project uses the Enterprise tier to serve the model's intelligence in real time.
Developer Note:
With Asynchronous Inference, when the minimum instance count is set to zero, SageMaker shuts down the GPU instance whenever no one is calling the endpoint.
This introduces a 1+ minute cold start for the first message, but can save hundreds of dollars a month, especially when traffic is sparse.
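A sketch of that scale-to-zero setup, assuming Application Auto Scaling's SageMaker integration; the endpoint and variant names below are placeholders:

```python
# Build the Application Auto Scaling request that lets an async endpoint
# scale down to zero instances when idle. Pass the resulting dict to
# boto3.client("application-autoscaling").register_scalable_target(**target).
def scale_to_zero_target(endpoint_name: str, variant_name: str = "variant-1") -> dict:
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": 0,  # shut the GPU down when idle (adds the ~1 min cold start)
        "MaxCapacity": 1,
    }

target = scale_to_zero_target("digital-twin-async")
```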
◼ The Quantization
On SageMaker GPUs like the T4 or A10G, a 3B model is so small that quantizing it is actually counter-productive.
Keeping it in BF16 preserves all the subtle nuances the model learns during DPO, without any compression loss.
For models above 8B parameters, vLLM works well with formats like AWQ (4-bit) or FP8 (8-bit floating point), which are designed to let GPU tensor cores do the math much faster than the GGUF format.
◼ Edge Deployment Strategies
For the edge deployment, model and quantization choices depend heavily on the edge device:
| Device | Recommended Model Size | Quantization | Experience |
| --- | --- | --- | --- |
| Mobile (iOS/Android) | < 3B | Q4_K_S (GGUF) | Smooth, instant local chat. Blazing-fast responses (30–60 tokens/sec). |
| Laptop (8GB RAM) | 3B | Q4_K_M (GGUF) | Very responsive local experience. |
| MacBook (16GB+ RAM) | 8B | Q4_K_M (GGUF) | High intelligence, steady local speed. |
| Gaming PC (Local GPU) | 8B | Q8_0 or FP16 (GGUF) | Best possible local quality. |

Table 2.2. Hardware Recommendation Matrix for Edge Quantization (GGUF).
In the case of Llama 3.2 3B, the model requires only ~3.0 GB of RAM when quantized with Q4_K_S, leaving plenty of room for the mobile system to run other apps.
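That figure is easy to sanity-check. A rough back-of-the-envelope estimator, assuming ~4.5 bits per weight for Q4_K_S and roughly 1 GB of runtime overhead (KV cache, buffers) — both assumptions, not measured values:

```python
def estimate_ram_gb(n_params_b: float, bits_per_weight: float, overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate for a quantized GGUF model."""
    weights_gb = n_params_b * bits_per_weight / 8  # billions of params -> GB of weights
    return round(weights_gb + overhead_gb, 1)

# Llama 3.2 3B (~3.2B params) at Q4_K_S (~4.5 bits/weight, assumed)
estimate = estimate_ram_gb(3.2, 4.5)  # lands near the ~3.0 GB figure above
```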
The Multi-Step Model Tuning Pipeline
The training sequence follows a "Learn, Mimic, Align" flow:

| Phase | Objective | Outcome |
| --- | --- | --- |
| 1. Learn (SFT) | Factual Grounding: embed the target person's professional background as ground truth. | Knows the facts. |
| 2. Mimic (RKD) | Response Knowledge Distillation: distill the teacher's reasoning and linguistic nuance. | Thinks like the target person. |
| 3. Align (DPO) | Behavioral & Tone Alignment: make the model prefer specific tones over generic AI responses. | Sounds like the target person. |

Table 3. The Triple-Phase Model Tuning Sequence.
The data sources can be anything that fulfills the model's learning objective in each phase.
For example, in Phase 1, one can use their MBTI results as the ground truth of their professional personality.
In either case, make sure you mask sensitive information before tuning the model.
◼ Phase 1. Learn: SFT & QLoRA for Factual Grounding
The first phase injects my professional background into the model so that it masters the factual truth.
After creating at least 100 question-answer pairs from reference documents such as my resume and LinkedIn posts, I'll format the training dataset so it can be rendered into the Llama model family's chat template:

```json
[
  {
    "instruction": "QUESTION_1",
    "context": "CONTEXT OF THE ANSWER_1. RELATED_PROJECT ETC",
    "response": "ANSWER_1"
  },
  ...
]
```
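Before training, each record must be rendered into Llama 3's chat template. A minimal formatter is sketched below; in practice Unsloth/TRL can apply the template via the tokenizer, and the special tokens follow Llama 3's documented chat format:

```python
def to_llama3_chat(record: dict) -> str:
    """Render an instruction/context/response record as Llama 3 chat text."""
    user_turn = record["instruction"]
    if record.get("context"):
        user_turn += "\n\nContext: " + record["context"]
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n" + user_turn + "<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        + record["response"] + "<|eot_id|>"
    )

sample = {"instruction": "QUESTION_1", "context": "", "response": "ANSWER_1"}
text = to_llama3_chat(sample)
```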
Developer Note: Synthetic SFT - Another Form of RKD
Instead of manually drafting question-answer pairs, you can prompt an LLM to generate the pairs from a reference document such as a resume. This is another form of Response Knowledge Distillation (RKD), with the LLM acting as the teacher.
This process distills the logic behind my responses into the 3B student model, so that it can grasp the reasoning and linguistic nuance behind each response.
I'll first prompt the teacher to generate the train dataset:
> Read the reference, and generate 100 examples that follow a Chain-of-Thought (CoT) structure such that:
> Instruction: A task or question related to the reference.
> Thought: Based on the given reference, write out the internal logical steps I took (e.g., 'First, I'll check the API logs to see if the token expired...').
> Response: Write the final answer I selected based on the given reference.
Keep all elements under 150 tokens: small models perform much better and hallucinate less when the training examples are punchy and concise rather than long-winded.
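That 150-token budget can be enforced before training. A crude filter is sketched below, assuming roughly 4 characters per token as a rule of thumb; a real tokenizer count would be more accurate:

```python
def within_budget(example: dict, max_tokens: int = 150, chars_per_token: float = 4.0) -> bool:
    """Approximate token-budget check for a CoT training example."""
    for field in ("instruction", "thought", "response"):
        approx_tokens = len(example.get(field, "")) / chars_per_token
        if approx_tokens > max_tokens:
            return False
    return True

dataset = [
    {"instruction": "Debug the API", "thought": "First, check the logs.", "response": "Token expired."},
    {"instruction": "x", "thought": "y" * 2000, "response": "z"},  # far too long: filtered out
]
kept = [ex for ex in dataset if within_budget(ex)]
```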
Then, after training the 3B model with the SFTTrainer as in the previous phase, I'll export the artifacts for each deployment target:
```python
# for cloud deployment (bf16)
model.save_pretrained_merged(
    <OUTPUT_DIR>,
    tokenizer,
    save_method="merged_16bit",  # keep the model in 16 bits
)

# for edge deployment (save the artifacts in a gguf file)
model.save_pretrained_gguf(
    <OUTPUT_DIR>,
    tokenizer,
    quantization_method="q4_k_s",  # llama.cpp quantization to compress the file
)
```
Deployment
Lastly, the model is deployed on cloud and edge devices.
Here, I'll demonstrate the cloud deployment using the Enterprise tier described in Table 2.1: deploying the tuned model on SageMaker LMI with its built-in vLLM engine.
After setting up a dedicated IAM role and trust policy, I'll create a model on SageMaker:
When the script calls `aws sagemaker create-model`, it passes the necessary configuration to the GPU instance. When the instance starts up, it:

1. Pulls the built-in container image from the `IMAGE_URI`.
2. Checks the environment variables and applies them as `serving.properties` parameters: `option.rolling_batch=vllm` selects the built-in vLLM engine, and the `S3_PATH` passed as `HF_MODEL_ID` points to the tuned model stored in S3 instead of the original HF model (e.g., `unsloth/Llama-3.2-3B`).
3. Initializes the built-in vLLM engine and deploys the tuned model as a new model on SageMaker.
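The environment-to-`serving.properties` mapping can be sketched as the container spec below. The image URI and S3 path are placeholders, and the `OPTION_`-prefix convention (env vars becoming `option.*` keys) is the LMI container behavior being assumed here:

```python
def lmi_container_spec(image_uri: str, s3_model_path: str) -> dict:
    """Primary-container spec for create-model; the env vars become
    serving.properties entries inside the LMI container."""
    return {
        "Image": image_uri,                  # built-in LMI (DJL) container image
        "Environment": {
            "HF_MODEL_ID": s3_model_path,    # raw weights pulled straight from S3
            "OPTION_ROLLING_BATCH": "vllm",  # -> option.rolling_batch=vllm
            "OPTION_DTYPE": "bf16",          # -> option.dtype=bf16 (no quantization)
        },
    }

spec = lmi_container_spec("<IMAGE_URI>", "s3://<BUCKET>/tuned-model/")
```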
Developer Note: SageMaker Standard Container vs. LMI Container
Standard SageMaker containers (Hugging Face, PyTorch, etc.) require the model artifacts to be zipped into a tarball (model.tar.gz), which triggers decompression latency:
Receive request. → Instance starts. → Download tar. → Unzip tar to the default dir /opt/ml/model/. → Load weights. → Perform inference.
Conversely, SageMaker LMI containers do not need a tarball; they pull raw weights directly from the S3 prefix in the HF_MODEL_ID variable, achieving faster model loading:
After creating a model, I'll configure a dedicated endpoint:
```shell
# create an endpoint config
aws sagemaker create-endpoint-config \
  --endpoint-config-name "$CONFIG_NAME" \
  --production-variants "[{
    \"VariantName\": \"variant-1\",
    \"ModelName\": \"$MODEL_NAME\",
    \"InstanceType\": \"ml.g4dn.xlarge\",
    \"InitialInstanceCount\": 1
  }]"

# create a dedicated endpoint and attach the config
aws sagemaker create-endpoint \
  --endpoint-name "$ENDPOINT_NAME" \
  --endpoint-config-name "$CONFIG_NAME"
```
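Once the endpoint is in service, the LMI/vLLM container accepts a JSON body with `inputs` and `parameters`. A sketch of the request payload (the exact schema can vary by container version, so treat the field names as an assumption), which would be sent via `invoke-endpoint`:

```python
import json

def build_payload(prompt: str, max_new_tokens: int = 256, temperature: float = 0.7) -> bytes:
    """JSON body for an LMI/vLLM endpoint invocation."""
    body = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }
    return json.dumps(body).encode("utf-8")

# e.g. boto3.client("sagemaker-runtime").invoke_endpoint(
#          EndpointName="<ENDPOINT_NAME>", ContentType="application/json",
#          Body=build_payload("Tell me about your last project."))
payload = build_payload("Tell me about your last project.")
```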
◼ Why vLLM? PagedAttention and Streaming for LLMs
vLLM is natively built into SageMaker Asynchronous Inference and LMI, so I don't need to explicitly load the standalone vLLM package.
Using vLLM even for a tiny 3B model is strategic for three specific reasons:

- Concurrency at scale: no lag for any user.
- Cost: squeeze more out of VRAM.
- The human feel: text appearing word-by-word.
▫ Concurrency at Scale
Latency on a standard inference engine (e.g., a vanilla Hugging Face pipeline) grows sequentially as users increase: when User A asks a long question, User B has to wait for the model to finish User A's entire response before it even starts on theirs.
vLLM uses continuous batching: the system starts generating a response to User B while it's still halfway through User A's paragraph. When multiple people talk to the clone at once, vLLM ensures nobody feels a lag; latency stays nearly flat as concurrency scales.
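The effect can be illustrated with a toy scheduler that charges one time-step per generated token. This is a deliberately simplified model, not vLLM's actual internals:

```python
def sequential_finish_times(jobs: list[int]) -> list[int]:
    """Sequential serving: each job must fully finish before the next starts."""
    t, out = 0, []
    for tokens in jobs:
        t += tokens
        out.append(t)
    return out

def interleaved_finish_times(jobs: list[int]) -> list[int]:
    """Continuous batching: every active request gets one token per step."""
    remaining = list(jobs)
    out = [0] * len(jobs)
    t = 0
    while any(r > 0 for r in remaining):
        t += 1
        for i, r in enumerate(remaining):
            if r > 0:
                remaining[i] -= 1
                if remaining[i] == 0:
                    out[i] = t
    return out

# User A needs a long 300-token answer; User B a short 20-token one.
seq = sequential_finish_times([300, 20])   # B waits behind all of A
par = interleaved_finish_times([300, 20])  # B finishes almost immediately
```

In the sequential case User B waits 320 steps; with interleaving, B is done after 20 steps while A barely notices.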
▫ Cost
vLLM squeezes more out of affordable hardware: its PagedAttention manages the KV cache dynamically, avoiding OOM risks during long conversations.
▫ The Human Feel
vLLM supports high-speed streaming through an OpenAI-compatible API, which is key to the clone app: text appearing word-by-word is vital for the illusion of life.