Scaling Securely - A Technical Deep Dive into AWS VPC Architecture for MLOps
Master AWS VPC for Machine Learning and MLOps with Practical Use Cases.
By Kuriko IWAI

Table of Contents
Introduction
What is Virtual Private Cloud (VPC)
How VPC Works - Decoding Core Components
AWS VPC In Action
Wrapping Up

Introduction
Data security and infrastructure protection are critical for LLMs and broader Machine Learning (ML) systems.
Major cloud providers - AWS, Azure, and Google Cloud - provide Virtual Private Cloud (VPC) necessary to protect sensitive model weights and training data.
Yet, the complexity of networking requirements creates a bottleneck during system deployment.
In this article, I’ll take a deep dive into VPC architecture through four common ML use cases, using Amazon Virtual Private Cloud (AWS VPC) as the example.
What is Virtual Private Cloud (VPC)
A Virtual Private Cloud (VPC) is a logically isolated section of the cloud where a client launches a cloud provider’s resources in their own virtual network, with full control over the IP address ranges and network components.
The diagram below illustrates its architecture, taking AWS VPC as an example:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure A. Logical architecture diagram of AWS VPC showing Public and Private Subnets with Internet and NAT Gateways (Created by Kuriko IWAI)
The VPC functions like a traditional network that the client would operate in their own data center, but with the scalable benefits of using AWS's infrastructure.
Within an AWS VPC, subnets (Figure A, white boxes) are used to partition and control the network environment for Amazon Elastic Compute Cloud (EC2) instances, the virtual servers in Amazon's data centers that the client rents to run the system.
Public EC2 instances can connect to the internet via the route table in the public subnet and the Internet Gateway (Figure A, pink box), while private EC2 instances remain isolated from direct inbound traffic.
Private EC2 instances can securely access:
The internet and other AWS resources via a NAT Gateway (Figure A, light orange box) in the public subnet,
Other AWS resources like S3 via their corresponding VPC endpoints, and
The client’s data centers and servers outside of the AWS VPC via the VPN gateway (Figure A, left, orange box).
◼ The Network Bottleneck: Why VPC Matters for Production ML
A VPC is critical for a secure ML lifecycle from data ingestion to model deployment, ensuring that the training data and model artifacts never touch the public internet.
▫ Private Subnets as Security Guardrails
Private subnets can secure GPU training instances on SageMaker or EC2.
Dedicated security groups for the private subnet not only block unauthorized traffic but also help the system meet HIPAA, PCI, or SOC 2 requirements, provided that sensitive datasets are stored and processed within a logically isolated network.
▫ Hybrid Cloud Integration
The VPN gateway connected to the private EC2 instances can securely ingest raw data from on-premises databases into the cloud.
It can also enable models to call internal company APIs or authentication services.
▫ Cost & Speed Management with VPC Endpoints
The NAT gateway charges data processing fees (e.g., $0.045 per GB), which inflate model training expenses when the model is trained on massive datasets.
Using VPC endpoints to exchange data between S3 and private EC2 instances mitigates both the cost and the training latency by keeping the traffic within the AWS network.
How VPC Works - Decoding Core Components
At its core, a VPC uses Software Defined Networking (SDN) to carve out a private network for the client within AWS's massive infrastructure.
The client defines a CIDR block (e.g., 10.0.0.0/16), and AWS handles the underlying packet routing between the client’s compute and storage resources (e.g., EC2 instances and S3).
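As a quick sizing sanity check, the address count implied by a CIDR prefix is 2^(32 − prefix), and AWS additionally reserves five addresses in every subnet (network address, VPC router, DNS, one reserved for future use, and broadcast). A minimal bash sketch (the function names are illustrative):

```shell
# total addresses in a CIDR block: 2^(32 - prefix_length)
cidr_size() { echo $(( 1 << (32 - $1) )); }

# usable addresses in an AWS subnet: AWS reserves 5 addresses per subnet
# (network, VPC router, DNS, reserved-for-future-use, broadcast)
aws_usable() { echo $(( $(cidr_size "$1") - 5 )); }

cidr_size 16    # a /16 VPC block: prints 65536
aws_usable 24   # a /24 subnet: prints 251 usable instance IPs
```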
This system comprises five key elements:
Subnets,
Gateways,
VPC Endpoints,
Security Layers, and
Routing Configuration.
◼ Subnets: The Logical Partition
Subnets are CIDR-assigned IP address ranges within the VPC.
In the ML context, subnets define the security zones:
Public subnets are the only subnets with a direct route to the Internet gateway. They host the NAT gateway that serves private EC2 instances, as well as load balancers.
Private subnets contain GPU clusters (e.g., P4d instances), RDS databases, and SageMaker Notebooks, completely isolated from direct ingress.
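The subnet CIDRs themselves are simply fixed slices of the VPC block. A small helper sketching how the /24 subnets used throughout this article (10.0.1.0/24 public, 10.0.2.0/24 private) are carved out of the 10.0.0.0/16 VPC (the helper name is illustrative):

```shell
# carve sequential /24 subnets out of the 10.0.0.0/16 VPC block
vpc_prefix="10.0"
subnet_cidr() { echo "${vpc_prefix}.$1.0/24"; }

PUBLIC_CIDR=$(subnet_cidr 1)    # 10.0.1.0/24, for the public subnet
PRIVATE_CIDR=$(subnet_cidr 2)   # 10.0.2.0/24, for the private subnet
echo "$PUBLIC_CIDR $PRIVATE_CIDR"
```

These are the same CIDR values passed to aws ec2 create-subnet in the implementation sections below.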
◼ Gateways: The Traffic Controllers
Gateways manage the entry and exit points of traffic to the VPC network.
There are three types of gateways:
Internet gateway (IGW) connects the VPC to the internet, performing 1:1 static Network Address Translation (NAT) for instances with public IPs.
NAT Gateway is a managed service that allows EC2 instances in a private subnet to connect to services outside the VPC (e.g., for pip install or OS patches), while preventing the internet from initiating a connection with those private instances.
Virtual Private Gateway (VGW) is the VPN concentrator that enables the client to extend an on-premises data center into the VPC over an encrypted tunnel.
◼ VPC Endpoints
VPC endpoints allow the client to privately connect the VPC to supported AWS services without requiring an IGW, NAT gateway, or VPN connection.
There are two types of VPC endpoints:
Gateway endpoints: Entry points for S3 and DynamoDB. Functions by adding a prefix list to the route table of the subnet.
Interface endpoints: Entry points for services like SageMaker API, EC2 API, or Kinesis. Functions by assigning a private IP address to the endpoint from the subnet's IP range.
To compare Gateway and Interface endpoints:
| Feature | Gateway Endpoint (S3 & DynamoDB) | Interface Endpoint (most other services) |
| --- | --- | --- |
| Mechanism | Adds a prefix list entry to the route table. | Creates an Elastic Network Interface (ENI) with a private IP in the subnet. |
| Cost | Free. | Hourly charge + data processing fees. |
| Connectivity | Uses routing logic; no network interface. | Uses DNS to point service URLs to private ENIs. |
| Requirement | Requires --route-table-ids. | Requires --subnet-ids and --security-group-ids. |
Table 1. Technical Comparison: Gateway Endpoints vs. Interface Endpoints for ML Data Ingestion.
Utilizing gateway endpoints is practically mandatory for ML workloads, as it avoids massive NAT data charges when pulling training datasets.
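As a back-of-the-envelope check of that claim, here is the NAT data-processing cost for a dataset pull at the $0.045/GB rate quoted earlier; the same pull through an S3 gateway endpoint incurs no data-processing charge (the function name is illustrative):

```shell
# rough NAT data-processing cost for pulling a dataset, at $0.045 per GB
nat_cost() { awk -v gb="$1" 'BEGIN { printf "%.2f", gb * 0.045 }'; }

nat_cost 10000   # a 10 TB training corpus through the NAT gateway: prints 450.00
# the same pull via an S3 gateway endpoint: zero data-processing charge
```

Repeated over many training epochs or retraining runs, this difference dominates the networking bill.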
◼ Security Layers
AWS employs a two-tier security system:
Network ACLs (NACLs): Stateless, subnet-level filters that block specific CIDR ranges.
Security Groups (SGs): Stateful, instance-level virtual firewalls.
The diagram below illustrates how NACLs and SGs work for distributed training with a head node and four worker nodes:

Figure B. Technical diagram of Security Group self-reference rules for distributed ML training nodes (Created by Kuriko IWAI)
In Figure B, the head node (pink box) and worker nodes (yellow boxes) are on the private subnet protected by the network ACLs.
When the head node submits an SSH request to a worker node, the worker node refers to an inbound rule defined in its security group to check if the incoming request is acceptable (Inbound check).
If acceptable, the worker node can return a response to the head node without any explicit outbound rule.
This is because the security group is stateful: it remembers the state of the connection between the head and worker nodes and automatically allows the return traffic from the worker node.
SSH requests that fail the inbound check, on the other hand, are blocked (Figure B, grey boxes).
This architecture allows the ML system to:
Reduce the overhead of managing complex outbound rules for high-speed data syncing (e.g., NCCL or MPI) between GPUs.
Tighten security by blocking outbound connections to the internet, while returning responses to legitimate internal management commands.
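Because NACLs are stateless while security groups are stateful, a subnet NACL must explicitly allow return traffic on the ephemeral port range; 1024-65535 is a common rule, though Linux clients typically use 32768-60999. A minimal sketch of that inbound check (the function name and port range are illustrative assumptions):

```shell
# NACLs are stateless: a reply from an external server arrives on a
# high-numbered "ephemeral" client port, which an inbound NACL rule must
# explicitly allow. Hypothetical inbound rule: allow TCP ports 1024-65535.
nacl_allows_return() {
  if [ "$1" -ge 1024 ] && [ "$1" -le 65535 ]; then
    echo ALLOW
  else
    echo DENY
  fi
}

nacl_allows_return 49152   # typical ephemeral reply port -> prints ALLOW
nacl_allows_return 443     # below the ephemeral range    -> prints DENY
```

A security group needs no such rule, which is one reason SGs carry most of the filtering burden in practice.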
◼ Routing Configuration
Routing configuration requires route tables, a logical container for a set of rules (routes) that dictate where traffic is directed.
On top of the main route table created by default, the client can add a custom route table for each subnet, containing rules like 0.0.0.0/0 -> nat-gateway-id.
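How a route table resolves a destination can be sketched locally: the most specific (longest-prefix) matching route wins, which is why the local VPC route takes precedence over the 0.0.0.0/0 catch-all. A minimal bash simulation of a typical private-subnet table (the routes and helper names are illustrative):

```shell
# convert a dotted-quad IP to a 32-bit integer
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# pick the route whose prefix matches the destination with the longest
# prefix length; routes mirror a typical private-subnet table:
# the VPC CIDR stays local, everything else goes to the NAT gateway
route_lookup() {
  local dest=$(ip_to_int "$1") best_len=-1 best_target=""
  for route in "10.0.0.0/16 local" "0.0.0.0/0 nat-gateway"; do
    set -- $route
    local net="${1%/*}" len="${1#*/}" target=$2
    local mask=$(( len == 0 ? 0 : (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
    if [ $(( dest & mask )) -eq $(( $(ip_to_int "$net") & mask )) ] \
       && [ "$len" -gt "$best_len" ]; then
      best_len=$len
      best_target=$target
    fi
  done
  echo "$best_target"
}

route_lookup 10.0.5.9     # inside the VPC CIDR -> prints local
route_lookup 151.101.0.1  # anything else       -> prints nat-gateway
```

The real VPC router applies the same longest-prefix logic, so a gateway-endpoint prefix list (more specific than 0.0.0.0/0) automatically captures S3-bound traffic before it reaches the NAT route.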
AWS VPC In Action
In this section, I cover four typical VPC architectures for ML systems:
MVP and rapid experimentation,
Tabular data pipeline,
Multi-modal training, and
Distributed LLM training.
◼ Use Case 1. MVP & Rapid Experimentation
In the MVP phase, the system runs on a single instance (e.g., g5.xlarge) and expects to pull various open-source libraries via pip or conda.
▫ The Goal
- Speed to market. Minimal running cost.
▫ VPC Configuration
- Workstation on public subnet + model/feature stores on S3
The system utilizes SageMaker Notebook as a workstation on EC2 in the public subnet and downloads libraries via the Internet Gateway, while securely accessing model and feature stores on S3 through a VPC endpoint:

Figure C-1. MVP ML architecture using SageMaker Notebooks in a Public Subnet with S3 VPC Endpoints. (Created by Kuriko IWAI)
▫ Key Points
Simple configuration. The public subnet routes straight through the Internet gateway, with no NAT gateway to provision or manage.
Cost saving. The system pulls data and artifacts from S3 via the VPC endpoint, avoiding the NAT gateway’s data processing fees.
▫ Implementation
Using the AWS CLI, deploy the VPC:
# create vpc
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 --query 'Vpc.VpcId' --output text)

# create public subnet
SUBNET_ID=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID \
  --cidr-block 10.0.1.0/24 \
  --query 'Subnet.SubnetId' \
  --output text)

# enable auto-assign public ip
aws ec2 modify-subnet-attribute --subnet-id $SUBNET_ID --map-public-ip-on-launch

# create and attach igw to the created vpc
IGW_ID=$(aws ec2 create-internet-gateway --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 attach-internet-gateway --vpc-id $VPC_ID --internet-gateway-id $IGW_ID

# route public traffic (0.0.0.0/0) to the igw so the subnet is actually public
RTB_ID=$(aws ec2 create-route-table --vpc-id $VPC_ID --query 'RouteTable.RouteTableId' --output text)
aws ec2 create-route --route-table-id $RTB_ID --destination-cidr-block 0.0.0.0/0 --gateway-id $IGW_ID
aws ec2 associate-route-table --subnet-id $SUBNET_ID --route-table-id $RTB_ID

Developer Note:
Complex VPC configurations are rare for an MVP where speed and flexibility matter.
But VPC can be necessary when handling sensitive data or when required by architectural constraints, such as site-to-site VPN connections to on-premises databases.
◼ Use Case 2. Tabular Data Pipelines (PII Protection & Batch ROI)
The tabular batch pipeline requires optimized cost management for structured data processing.
Additionally, to protect sensitive PII, the computing environment must be isolated within a private network.
▫ The Goal
Processing structured data (CSV, Parquet).
Maintain data security.
▫ VPC Configuration
- Private Subnet + VPC Endpoints + VPN Gateway
The system leverages a private subnet to secure data and artifacts, using VPC endpoints to access AWS resources like S3 and DynamoDB, and a VPN gateway for encrypted connectivity to off-site corporate data centers:

Figure C-2. Secure tabular data pipeline architecture featuring VPN Gateway and Private Subnets. (Created by Kuriko IWAI)
▫ Key Points
Security. The traffic stays within the private subnet, connecting S3 and DynamoDB with the private route table. Encrypted connection to the off-site data center.
Cost effectiveness. Gateway endpoints do not charge hourly fees, making them a cost-effective option for intermittent batch processing.
▫ Implementation
Similar to the previous use case, use the AWS CLI to deploy the VPC:
# create a private subnet (no public ip mapping)
PRIVATE_SUBNET_ID=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID \
  --cidr-block 10.0.2.0/24 \
  --query 'Subnet.SubnetId' --output text)

# create a dedicated route table
PRIVATE_ROUTE_TABLE_ID=$(aws ec2 create-route-table \
  --vpc-id $VPC_ID \
  --query 'RouteTable.RouteTableId' --output text)

# explicitly associate the private subnet with the route table
aws ec2 associate-route-table \
  --subnet-id $PRIVATE_SUBNET_ID \
  --route-table-id $PRIVATE_ROUTE_TABLE_ID

# region-scoped service names for the gateway endpoints
S3_SERVICE=com.amazonaws.$REGION.s3
DDB_SERVICE=com.amazonaws.$REGION.dynamodb

# create s3 gateway endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name $S3_SERVICE \
  --route-table-ids $PRIVATE_ROUTE_TABLE_ID

# create dynamodb gateway endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name $DDB_SERVICE \
  --route-table-ids $PRIVATE_ROUTE_TABLE_ID

◼ Use Case 3. Multi-Modal Training (High-Frequency API Orchestration)
For a multi-modal ML architecture, the VPC configuration must balance massive raw data ingestion with high-speed API orchestration.
▫ The Goal
- Integrate various data streams (video, audio, LiDAR, text) into the model training pipeline via high-frequency API calls.
▫ VPC Configuration
- Private Subnet + NAT Gateway + Interface Endpoints.
In this system, a dedicated private subnet hosts the SageMaker EC2 instances used for model hosting.
These instances securely access model artifacts and feature stores on S3, as well as other AWS services like Amazon Rekognition for multi-modal data processing, using the private subnet’s route table and dedicated VPC endpoints.
Although optional, these private instances can securely ingest data from an off-site corporate data center via a VPN Gateway for continuous model refinement through additional training with on-premises data.
Lastly, inference results are routed through a NAT gateway in the public subnet and the Internet gateway.

Figure C-3. Multi-modal ML workflow using Interface Endpoints for Amazon Rekognition and SageMaker API (Created by Kuriko IWAI)
▫ Key Points
Prevent NAT congestion by using the NAT gateway only for small external pulls, while using VPC endpoints for high-frequency requests to other AWS resources.
Security. The private subnet ensures that the training cluster has no direct exposure to the public internet.
▫ Implementation
Using the AWS CLI, first configure a NAT gateway and add a route to it in the private subnet's route table:
# allocate an elastic ip to the vpc
ALLOCATION_ID=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)

# create a nat gateway in the public subnet
NAT_GW_ID=$(aws ec2 create-nat-gateway \
  --subnet-id $PUBLIC_SUBNET_ID \
  --allocation-id $ALLOCATION_ID \
  --query 'NatGateway.NatGatewayId' \
  --output text)

# route all 0.0.0.0/0 traffic from the private subnet to the nat gateway
aws ec2 create-route \
  --route-table-id $PRIVATE_RTB_ID \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id $NAT_GW_ID

Then, create interface endpoints for each of the required AWS services: sagemaker.runtime, sagemaker.api, and rekognition:
# create interface endpoints for related services in the private subnet
SERVICES=("sagemaker.runtime" "sagemaker.api" "rekognition")
for SERVICE in "${SERVICES[@]}"; do
  aws ec2 create-vpc-endpoint \
    --vpc-id $VPC_ID \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.$REGION.$SERVICE \
    --subnet-ids $PRIVATE_SUBNET_ID \
    --security-group-ids $SECURITY_GROUP_ID \
    --private-dns-enabled
done

◼ Use Case 4. Distributed LLM Training (EFA & Latency)
Tuning LLMs requires distributed training across multiple GPU instances.
▫ The Goal
- High-performance distributed training across multiple GPU instances (p4d or p5 nodes).
▫ VPC Configuration
- Private Subnet + EFA + Security Group with self-reference rules.
The system utilizes the Elastic Fabric Adapter (EFA) rather than standard TCP/IP to achieve microsecond-level latency.
To support this, the Security Group is configured to allow stateful ingress traffic from all associated GPU instances (Figure B).
▫ Implementation
After creating the VPC, first create a security group and authorize all ingress and egress traffic from the same security group:
# create the security group
SG_ID=$(aws ec2 create-security-group \
  --group-name "llm-training-sg" \
  --description "Security group for EFA and NCCL communication" \
  --vpc-id $VPC_ID \
  --query 'GroupId' --output text)

# authorize all ingress traffic originating from the same security group
aws ec2 authorize-security-group-ingress \
  --group-id $SG_ID \
  --protocol all \
  --port -1 \
  --source-group $SG_ID

# authorize all egress traffic to the same security group
aws ec2 authorize-security-group-egress \
  --group-id $SG_ID \
  --protocol all \
  --port -1 \
  --source-group $SG_ID

Then, create a dedicated private subnet and launch an EFA-enabled training instance:
# create a dedicated private subnet for training
# ($AZ is an availability zone in the region, e.g. us-east-1a)
TRAIN_SUBNET_ID=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID \
  --cidr-block 10.0.2.0/24 \
  --availability-zone $AZ \
  --query 'Subnet.SubnetId' \
  --output text)

# create a placement group (pg) for maximum performance of efa
PG_NAME=$(aws ec2 create-placement-group \
  --group-name "llm-training-pg" \
  --strategy cluster \
  --query 'PlacementGroup.GroupName' \
  --output text)

# launch a training instance with efa enabled, applying the placement group
aws ec2 run-instances \
  --image-id ami-xxxxxx \
  --instance-type p4d.24xlarge \
  --key-name $KEY_NAME \
  --placement "GroupName=$PG_NAME" \
  --network-interfaces "DeviceIndex=0,InterfaceType=efa,Groups=$SG_ID,SubnetId=$TRAIN_SUBNET_ID"

Wrapping Up
A well-architected VPC is critical for a secure, production-ready AI platform.
Mastering subnets and choosing the right endpoints ensure that data remains private and egress costs remain low.
◼ Trade-offs - When to Skip the VPC
VPC is not always a hard requirement for ML systems.
VPC increases network complexity, latency, and overhead.
When these costs outweigh the security benefits, the system is more efficient using IAM roles and encrypted public endpoints rather than a VPC.
For example:
Public dataset research, low-sensitivity projects: When working with open-source datasets or non-proprietary data.
Prototyping: When prioritizing iteration speed and easy access to external libraries over strict network isolation.
Serverless and fully-managed services with the built-in encryption systems.
◼ Cloud Comparison: AWS VPC vs. GCP VPC vs. Azure VNet
Lastly, Google Cloud Platform (GCP) VPC and Azure VNet are alternative options for AWS VPC.
They offer similar services with different architectures:

Figure D. Comparison chart of global networking architectures: AWS VPC vs. GCP VPC vs. Azure VNet (Created by Kuriko IWAI)
GCP VPC offers a single VPC that spans multiple regions worldwide, simplifying global load balancing and cross-region communication.
This makes GCP VPC suitable for global reach and multi-region architectures.
Azure VNets, on the other hand, are regionally scoped and offer seamless integration with the Microsoft ecosystem.
AWS VPCs are designed to span multiple Availability Zones within a region, providing robust regional control and fault isolation.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Building LoRA Multi-Adapter Inference on AWS SageMaker
Deconstructing LoRA: The Math and Mechanics of Low-Rank Adaptation
The Definitive Guide to LLM Fine-Tuning: Objectives, Mechanisms, and Hardware
A Complete Guide to Resilient Quant ML Engines on AWS SageMaker
Regularizing LLMs with Kullback-Leibler Divergence
Architecting Production ML: A Deep Dive into Deployment and Scalability
Data Pipeline Architecture: From Traditional DWH to Modern Lakehouse
Engineering a Fully-Automated Lakehouse: From Raw Data to Gold Tables
Building an Automated CI/CD Pipeline for Serverless Machine Learning on AWS
Building a Production-Ready Data CI/CD Pipeline: Versioning, Drift Detection, and Orchestration
From Notebook to Production: Building a Resilient ML Pipeline on AWS Lambda
Building a Serverless ML Lineage: AWS Lambda, DVC, and Prefect
Related Books for Further Understanding
These books cover a wide range of theory and practice, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation
Share What You Learned
Kuriko IWAI, "Scaling Securely - A Technical Deep Dive into AWS VPC Architecture for MLOps" in Kernel Labs
https://kuriko-iwai.com/aws-vpc-architecture-for-machine-learning-mlops
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.