Architecting Production ML: A Deep Dive into Deployment and Scalability
Explore a practical walkthrough of deployment decisions from inference types to serving platforms
By Kuriko IWAI

Table of Contents
Introduction
What is a Machine Learning System?
Decision Making in Deployment
The Serving Platform
Scalability & Reliability
The Walkthrough Example
Wrapping Up

Introduction
A machine learning (ML) system is a comprehensive infrastructure built to develop, deploy, and maintain ML models in a production environment.
This system is crucial for ensuring models are reliable, scalable, and performant, preventing issues like model degradation and infrastructure bottlenecks.
In this article, I’ll walk through key deployment decisions, including the choice of inference types, serving platforms, and scalability strategies.
What is a Machine Learning System?
Machine learning (ML) system design refers to the comprehensive process of architecting and building the entire infrastructure needed to develop, deploy, monitor, and maintain ML models in a production environment.
The diagram below illustrates the scope of ML system design:

Figure A. ML system design flow (Created by Kuriko IWAI)
The process covers the end-to-end workflow of a model, ensuring that the entire system achieves these core principles:
Reliability: Functioning correctly and consistently even under some failures.
Scalability: Handling increasing amounts of data, users, and requests.
Performance: Meeting specific latency (speed) requirements and accuracy benchmarks.
Security: Protecting sensitive data and models.
Maintainability: Easy to update, debug, and evolve over time.
Cost-Efficiency: Optimizing resource usage.
These principles are critical for the system to consistently deliver optimal predictions while preventing suboptimal performance due to:
Concept drift where the relationship between input features and the target variable starts to shift,
Data drift where the properties of the input features used to train the model start to change,
Infrastructure bottleneck where the system cannot handle the increasing requests, leading to its degradation and poor user experience, and
Lack of alert systems where problems go undetected until they cause significant damage in production.
In the next section, I’ll detail practical decision-making points in the Deployment phase.
Decision Making in Deployment
When deploying an ML model, three key decisions are crucial for meeting core system principles:
Inference Type: How will the model serve predictions?
Serving Platform: Is container orchestration or a serverless function the best fit?
Scalability & Reliability: Can the system handle expected load and potential failures?
◼ Inference Type
Inference is the process by which a trained model serves predictions on new, unseen data.
Inference types are categorized into batch inference and real-time inference.
Choosing one depends on the type of predictions the model will serve.
▫ Batch Inference
Batch inference processes and stores predictions for a large volume of data at scheduled intervals like daily or weekly.
The client then queries this prediction store to retrieve the results.
The following diagram illustrates how the batch inference architecture works:

Figure B. Batch inference architecture (Created by Kuriko IWAI)
In each batch run, the system:
Pulls preprocessed features from the offline feature store (blue),
Stores the trained model in the model store (orange), and
Generates and stores predictions in the prediction store (pink).
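To make this concrete, here is a minimal sketch of a scheduled batch-scoring job in Python. The file paths, the Parquet-based feature, model, and prediction stores, and the `user_id` column are illustrative stand-ins for whatever stores your system uses, and the model is assumed to be a serialized scikit-learn-style estimator.

```python
# Minimal sketch of a nightly batch-scoring job.
# Paths, column names, and store formats are hypothetical placeholders.
import pandas as pd
import joblib

def run_batch_inference(feature_path: str, model_path: str, output_path: str) -> None:
    # 1. Pull preprocessed features from the offline feature store (here: a Parquet file).
    features = pd.read_parquet(feature_path)

    # 2. Load the trained model from the model store (here: a serialized estimator).
    model = joblib.load(model_path)

    # 3. Generate predictions for the whole batch and persist them to the prediction store.
    features["prediction"] = model.predict(features.drop(columns=["user_id"]))
    features[["user_id", "prediction"]].to_parquet(output_path)

if __name__ == "__main__":
    # Typically triggered by a scheduler (cron, Airflow, AWS Batch) rather than run by hand.
    run_batch_inference("features.parquet", "model.joblib", "predictions.parquet")
```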
Batch inference is best when:
No immediate predictions are required, such as predicting daily customer churn or computing weekly sales forecasts for inventory management.
Cost-effectiveness is a priority, leveraging cheaper, less time-sensitive compute resources.
Its disadvantages include:
Time gaps between data and predictions: changes in the data are not reflected until the next batch run.
Major service providers:
Cloud batch processing services (AWS Batch, Google Cloud Dataflow, Azure Batch),
Data warehousing solutions (Snowflake, Databricks).
▫ Real-time Inference
Real-time inference, on the other hand, serves predictions within milliseconds by feeding inputs to the model in real time.
There are two major types:
Streaming real-time inference that continuously serves predictions without explicit client requests, and
Non-streaming (request-response) real-time inference that serves predictions based on discrete, individual requests from the client.
▫ Streaming Real-time Inference
Without explicit client requests, streaming real-time inference proactively serves predictions on new data that arrives as a continuous, unbounded stream.
The following diagram illustrates its architecture:

Figure C. Streaming real-time inference architecture (Created by Kuriko IWAI)
In this architecture, data flows in as a continuous, unbounded feature stream (blue).
Then, the stream processing engine (white box) creates a runtime environment for the model and continuously:
Receives data from the feature stream,
Retrieves the latest model from the model store, and
Streams predictions to downstream systems like message queues, APIs, and other applications.
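As an illustration, the following sketch shows a stream consumer that scores events as they arrive and publishes predictions downstream. It assumes a Kafka-style broker accessed through the kafka-python client; the topic names, feature fields, and the single upfront model load (rather than a periodic refresh from the model store) are simplifications for the example.

```python
# Minimal sketch of a streaming real-time inference worker.
# Topic names, feature keys, and the model file are hypothetical.
import json
import joblib
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

model = joblib.load("model.joblib")  # latest model from the model store (refresh omitted)

consumer = KafkaConsumer(
    "feature-stream",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

# Continuously receive events, score them, and stream predictions downstream.
for event in consumer:
    features = [[event.value["feature_1"], event.value["feature_2"]]]
    prediction = model.predict(features)[0]
    producer.send("prediction-stream", {"id": event.value["id"], "prediction": float(prediction)})
```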
Streaming real-time inference is best when:
The application needs to react to live events and continuously evolving data and predictions (e.g., sensor data, clickstreams, financial transactions), and
Low latency and high throughput are required for the incoming data.
Its disadvantages include:
Increasing end-to-end latency and debugging complexity when the stream involves multiple processing steps, and
Risk of high costs due to unexpected spikes in data volume.
Major service providers:
Stream processing and messaging platforms: Apache Kafka, Apache Flink, Apache Spark Streaming, Amazon Kinesis, Google Cloud Pub/Sub, Azure Event Hubs.
Specialized serving frameworks: TensorFlow Serving, TorchServe, Triton Inference Server.
Cloud-based managed services: AWS SageMaker Endpoints, Google Cloud Vertex AI Endpoints, Azure Machine Learning Endpoints.
▫ Request-Response Real-time Inference
Request-response real-time inference makes predictions based on discrete requests from a client and returns an immediate response.
The following diagram illustrates its architecture:

Figure D. Non-streaming real-time inference architecture (Created by Kuriko IWAI)
Similar to the streaming real-time inference, input data continuously flows into the system where it:
Stores preprocessed features in the online feature store,
Synchronizes features with the offline feature store,
Retrieves features from the offline feature store, and
Trains and stores the model in the model store.
When a client sends a request via an API endpoint, the model delivers predictions within milliseconds.
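Below is a minimal sketch of such a request-response endpoint using FastAPI. The feature fields, the model file, and the omission of the online feature store lookup are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch of a request-response inference API.
# Feature names and the model file are hypothetical placeholders.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # loaded once at startup from the model store

class PredictionRequest(BaseModel):
    user_id: str
    feature_1: float
    feature_2: float

@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    # In production, features would typically be enriched from the online feature store here.
    features = [[request.feature_1, request.feature_2]]
    prediction = model.predict(features)[0]
    return {"user_id": request.user_id, "prediction": float(prediction)}

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000
```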
This architecture is best for:
Interactive applications where clients receive predictions directly from the model on demand.
Major service providers:
ML model serving APIs: TensorFlow Serving, TorchServe, BentoML
Cloud-based managed inference endpoints: AWS SageMaker Endpoints, Google Cloud Vertex AI Endpoints.
The Serving Platform
A serving platform in an ML system refers to the infrastructure and tools used to deploy, manage, and run a trained model in production.
The major architectural options are:
Container Orchestration,
Serverless Function (Function-as-a-Service, FaaS), and
Containerized Serverless.
The choice comes down to control, cost, and traffic patterns.
◼ Container Orchestration
Container orchestration is an architecture in which multiple applications are containerized and an orchestrator manages, coordinates, and deploys them as a single system.
This diagram shows the orchestrator managing multiple application nodes that contain Docker containers:

Figure E. Simplified container orchestration architecture (Created by Kuriko IWAI)
In this architecture, orchestrators handle tasks like:
Scheduling containers on hosts,
Managing resource allocation,
Facilitating inter-container communication,
Rolling out updates, and
Automatically recovering from failures.
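As a small illustration of interacting with an orchestrator programmatically, the sketch below uses the official Kubernetes Python client to submit a one-off training job; the image name, namespace, and resource requests are hypothetical, and in practice such jobs are often defined declaratively in YAML instead.

```python
# Minimal sketch: submitting a one-off training job to a Kubernetes cluster.
# Image, namespace, and resource requests are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-model"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="registry.example.com/trainer:latest",
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "2", "memory": "4Gi"}
                        ),
                    )
                ],
            )
        )
    ),
)

# The orchestrator schedules the container on a suitable node and restarts it on failure policy.
client.BatchV1Api().create_namespaced_job(namespace="ml", body=job)
```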
This architecture is best when:
Handling complex microservice architectures where the system needs to scale many interconnected microservices while meeting consistent baseline capacities,
Running stateful applications like databases or message queues inside containers, and
Maximizing portability across different cloud providers or between cloud and on-premise.
Its disadvantages involve:
Operational complexity in setting up and managing an orchestration platform, and
Cost and resource management across many nodes and containers, including idle ones.
Major players:
Kubernetes (K8s): The de-facto standard for container orchestration, available as managed services from all major cloud providers.
Amazon ECS (Elastic Container Service): AWS’s proprietary container orchestration service.
Docker Swarm: Docker’s native orchestration solution, simpler to set up than Kubernetes but less feature-rich for large-scale deployments.
◼ Serverless Functions (FaaS)
A serverless function (FaaS) is a set of code that runs in response to an event like an HTTP request, a file upload, or a message in a queue.
By nature, FaaS is:
Event-driven: The function is triggered by events,
Stateless: Each execution of a function is independent of the others; no data is stored between executions, and
Ephemeral: Functions have a short lifespan, existing only while they are executing.
In FaaS, the cloud provider dynamically manages the infrastructure and charges based on the actual execution time.
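For illustration, here is a minimal sketch of an event-driven handler in the AWS Lambda style. The model location and the event shape (an API-gateway-style JSON body) are assumptions made for the example.

```python
# Minimal sketch of a serverless (FaaS) inference handler.
# The model path and event fields are hypothetical.
import json
import joblib

# Loaded outside the handler so warm invocations reuse the model (cold starts still pay this cost).
model = joblib.load("/opt/model.joblib")

def lambda_handler(event, context):
    # Triggered by an event such as an HTTP request routed through an API gateway.
    body = json.loads(event["body"])
    prediction = model.predict([[body["feature_1"], body["feature_2"]]])[0]
    return {"statusCode": 200, "body": json.dumps({"prediction": float(prediction)})}
```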
This architecture is best for:
Pursuing resource and cost-effectiveness for intermittent use, and
Unpredictable traffic where automatic scaling handled by the cloud provider ensures efficient resource utilization.
Its disadvantages involve:
Cold starts: A delay occurs while the runtime environment is initialized when a function hasn’t been invoked recently,
Resource limits: Functions have strict limits on package size, memory, and execution duration,
Statelessness: Maintaining state across the system requires external services (databases, caches), which adds complexity to the architecture.
◼ Containerized Serverless
Containerized serverless runs a container, rather than raw code, on a serverless platform, combining the benefits of containers and FaaS.
This architecture is best when:
Applications can leverage the benefits of containerization like using a specific version of Python or having large dependencies, and
Shifting existing containerized applications to serverless without extensive refactoring.
Its disadvantages are similar to FaaS, but cold starts tend to be longer due to larger container sizes or complex startup operations inside the container.
Major FaaS/containerized serverless players include:
AWS Lambda: Amazon's pioneering Function as a Service (FaaS) offering.
Azure Functions: Microsoft Azure's serverless compute service.
Google Cloud Functions: Google Cloud's event-driven serverless platform.
Cloudflare Workers: A serverless platform that runs JavaScript, WASM, and other code on Cloudflare's global network edge.
Scalability & Reliability
Once we’ve chosen the deployment method, it’s crucial to design the system to handle varying workloads and remain operational even when parts of it fail.
In this section, I’ll cover the core concepts of horizontal scaling with load balancing and failover strategies in ML systems.
◼ Horizontal Scaling
Horizontal scaling means distributing the workload across multiple instances (machines or containers) rather than increasing the resources of a single instance (vertical scaling).
The figure below illustrates key differences between horizontal scaling and vertical scaling:

Figure F. Comparison of horizontal scaling and vertical scaling (Created by Kuriko IWAI)
Horizontal scaling increases the system capacity by adding more instances.
So, when one instance fails, the remaining instances can absorb its workload, ensuring continued service.
On the other hand, vertical scaling, relying on a single, larger instance, risks complete system failure if that instance is compromised.
Although scaling matters for all inference types, high-throughput, low-latency ML inference especially favors this approach because it offers elasticity (growing and shrinking with demand) and ensures high availability by shifting workloads away from failed instances.
◼ Load Balancer as an Enabler
When horizontal scaling adds more instances to the system, a load balancer is required to distribute incoming requests among them.
The diagram below shows four key load balancing strategies for an ML system:

Figure G. Major strategies of load balancing in an ML system (Created by Kuriko IWAI)
In order of simplicity:
Hash-based routing (top left) uses model IDs or user session IDs to route all requests for a specific model to the instance where that model is hosted, reducing cold start time (see the sketch after this list).
Resource-based (adaptive) load balancing (top right) uses real-time telemetry from each server to make intelligent routing decisions based on current CPU/GPU utilization, memory usage, or queue length, sending a new request to a server with enough free resources.
Predictive load balancing (bottom left) leverages an RNN-family model to predict future traffic patterns to proactively scale the server when traffic surge is expected.
Reinforcement learning (RL) based load balancing (bottom right) learns the optimal strategy in load balancing through trial and error using an RL algorithm where the agent attempts to maximize the rewards from the environment.
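As a small illustration of the first strategy, the sketch below routes requests by hashing a key onto a fixed server pool; the server names are hypothetical, and a production load balancer would use consistent hashing so that adding or removing servers remaps as few keys as possible.

```python
# Minimal sketch of hash-based routing onto a fixed, hypothetical server pool.
import hashlib

SERVERS = ["model-server-a", "model-server-b", "model-server-c"]

def hash_based_route(key: str) -> str:
    # The same user/model ID always hashes to the same server, so that server's
    # warm model and cached state can be reused (reducing cold starts).
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

print(hash_based_route("user-42"))  # deterministic: always the same server for this key
```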
Major services:
Managed cloud service providers: AWS Elastic Load Balancing (ELB), Google Cloud Load Balancing, Azure Load Balancer.
Nginx: A popular open-source web server that can also function as a load balancer for HTTP, HTTPS, TCP, and UDP traffic.
HAProxy: A free, open-source solution specifically designed for reliable load balancing and proxying of TCP and HTTP-based applications.
◼ Failover Strategies
It is critical for the ML system to have failover strategies in place when scaling.
Redundancy refers to duplicating components in the system so that a healthy one can immediately take over from a failed one. For example:
Building multiple instances that host the same model in different availability zones to ensure there is a backup, or
Building a primary server and a cold standby server, where the standby is only brought online manually after the primary fails.
Graceful degradation refers to the system’s ability to maintain functionality, instead of completely crashing, when a component fails.
For example, when the primary model crashes, the system falls back to a previous version of the model or a less resource-intensive model to keep serving predictions.
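A minimal sketch of this kind of graceful degradation might look like the following, where a failing primary model falls back to an older, lighter model; the model files and feature format are hypothetical.

```python
# Minimal sketch of graceful degradation: fall back to a simpler model instead of crashing.
# Model paths and the feature format are hypothetical.
import joblib

primary_model = joblib.load("primary_model.joblib")
fallback_model = joblib.load("previous_version.joblib")  # older or lighter model

def predict_with_fallback(features):
    try:
        return primary_model.predict(features)
    except Exception:
        # Prioritize availability over accuracy: degrade to the fallback model.
        return fallback_model.predict(features)
```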
Key service providers include:
Managed ML serving platforms (like AWS, Google Cloud Vertex AI, Azure Machine Learning),
Circuit breakers (e.g., Hystrix, Resilience4j)
Fallback mechanisms in API design
Queueing systems (Kafka, RabbitMQ)
These mechanisms prioritize availability over performance, ensuring the application remains highly available even when its performance is degraded.
For a low-traffic or non-critical application, elaborate scaling and failover strategies may be unnecessary.
The Walkthrough Example
In this section, I’ll use a personal recommendation engine as an example to demonstrate designing an ML system in the deployment phase.
The diagram illustrates the architecture of the recommendation engine:

Figure H. ML system architecture of a personal recommendation engine (Created by Kuriko IWAI)
The recommendation engine consists of three main components:
The main recommendation system (center in the diagram), which serves personalized recommendations.
The supporting systems:
The user profiling system (left in the diagram), which retrieves clickstream data and builds a real-time user profile including browsing information.
The daily optimization system (grey box in the diagram), which retrieves aggregated profile data from the user profiling system and retrains the model nightly.
◼ The Inference Types
Since each component deals with different types of data and tasks, the engine uses a hybrid inference architecture to create a robust solution:
▫ Recommendation System
Real-time request-response inference, delivering instant predictions whenever a user visits the site.
▫ User Profiling System
Streaming real-time inference, proactively updating profiles with a constant flow of clickstream data, even when a user isn’t making any request.
▫ Daily Optimization System
Batch inference, with a scheduled task dedicated to retraining the model.
◼ The Serving Platforms
Each component is delivered on a different serving platform:
▫ Recommendation System: Containerized Serverless
A request-response recommendation system needs to serve a prediction quickly every time a user accesses the site.
This workload is highly variable, with bursts of activity followed by periods of low traffic.
Containerized serverless is an excellent choice for this system: it scales automatically with the bursty traffic while offering more control over the runtime environment, which is crucial for a system that may require specific libraries or a larger memory footprint.
▫ User Profiling System: Container Orchestration
The user profiling system continuously processes a stream of clickstream data.
This is a long-running, stateful workload that requires consistent, high-throughput processing.
Container orchestration (e.g., Kubernetes) is the ideal choice because it provides granular control over resource allocation, state management, and continuous operation for a system that needs to run around the clock.
▫ Daily Optimization System: Container Orchestration
The system involves a scheduled task, like model retraining, which is a resource-intensive batch job.
Container orchestration is an excellent choice here: a batch job can be defined in Kubernetes and run on a schedule.
The orchestrator will spin up the necessary containers, execute the training pipeline, and then tear down the resources upon completion.
◼ Scaling Strategies
Each component has different needs, so a single scaling strategy for all of them isn’t optimal.
▫ Recommendation System
The recommendation system needs low latency and high availability.
The best strategy here is a combination of Hash-Based Routing and Resource-Based Load Balancing.
Hash-Based Routing: Consistently routes a user’s requests to the same server by using a hash on the user ID, ensuring that the user’s latest profile information is readily available. This significantly reduces latency by avoiding a fresh fetch from the profiling system or a database for every request.
Resource-Based Load Balancing acts as a fallback. If the hashed-to server is overloaded, the system can dynamically route the request to another server with available resources, ensuring a fast response even under high load.
This hybrid approach provides both the benefit of caching (low latency) and the resilience of dynamic routing (high availability).
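A rough sketch of this hybrid routing logic is shown below; the server names, the utilization metrics, and the 0.8 saturation threshold are hypothetical assumptions, not a specific platform’s behavior.

```python
# Minimal sketch of hybrid routing: hash-based (sticky) routing with a
# resource-based fallback. Server names, metrics, and threshold are hypothetical.
import hashlib

SERVERS = ["rec-server-a", "rec-server-b", "rec-server-c"]

def route_request(user_id: str, utilization: dict, threshold: float = 0.8) -> str:
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    preferred = SERVERS[digest % len(SERVERS)]
    if utilization.get(preferred, 0.0) < threshold:
        return preferred  # sticky routing keeps the user's cached profile warm
    # Overloaded: fall back to the least-busy server to preserve low latency.
    return min(utilization, key=utilization.get)

print(route_request("user-42", {"rec-server-a": 0.95, "rec-server-b": 0.30, "rec-server-c": 0.55}))
```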
The system can also fall back to showing generic popular content instead of personalized recommendations when the personalization pipeline is down.
▫ User Profiling System
The best load balancing strategy here is Resource-Based (Adaptive) Load Balancing because the processing demands can fluctuate based on the volume of user activity.
An adaptive strategy can route new data streams to the least-busy server based on real-time metrics like CPU, memory, and network utilization.
This prevents any single server from becoming a bottleneck and ensures the continuous flow of data is processed efficiently.
▫ Daily Optimization System
The best strategy here is Resource-Based Load Balancing in combination with a batch scheduler.
Batch jobs are about throughput and efficient use of resources.
The system needs to process a large amount of data as fast as possible.
A resource-based load balancer can route the batch jobs to servers with the most available compute power (e.g., free CPU and GPU cycles), ensuring that the training process is completed in the shortest possible time without interfering with the real-time inference systems.
A scheduler, like Kubernetes CronJobs, can manage when these jobs run.
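As an illustration, the retraining entrypoint that such a scheduled job might launch could look like the sketch below; the data path, label column, and choice of a scikit-learn logistic regression are placeholder assumptions.

```python
# Minimal sketch of the nightly retraining entrypoint run by a scheduler
# (e.g., a Kubernetes CronJob). Data paths and the model choice are hypothetical.
import pandas as pd
import joblib
from sklearn.linear_model import LogisticRegression

def retrain() -> None:
    # 1. Pull the aggregated profile data produced by the user profiling system.
    data = pd.read_parquet("aggregated_profiles.parquet")
    X, y = data.drop(columns=["label"]), data["label"]

    # 2. Retrain the model on the latest data.
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # 3. Store the refreshed model in the model store for the serving systems to pick up.
    joblib.dump(model, "model.joblib")

if __name__ == "__main__":
    retrain()
```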
Wrapping Up
Deploying production-ready machine learning systems requires a multidisciplinary approach, blending expertise in machine learning, software engineering, and operations.
In the walkthrough example, we examined a hybrid architecture where multiple inference types are combined to serve predictions.
By carefully considering the aspects outlined above, we can deliver robust, scalable, and impactful solutions to our audience.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Data Pipeline Architecture: From Traditional DWH to Modern Lakehouse
Engineering a Fully-Automated Lakehouse: From Raw Data to Gold Tables
Building a Production-Ready Data CI/CD Pipeline: Versioning, Drift Detection, and Orchestration
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.


