Scaling Securely - A Technical Deep Dive into AWS VPC Architecture for MLOps
Master AWS VPC for Machine Learning and MLOps with Practical Use Cases.
By Kuriko IWAI

Table of Contents
Introduction
What is Virtual Private Cloud (VPC)
How VPC Works - Decoding Core Components
AWS VPC In Action
Wrapping Up

Introduction
Data security and infrastructure protection are critical for LLMs and broader Machine Learning (ML) systems.
Major cloud providers - AWS, Azure, and Google Cloud - provide Virtual Private Cloud (VPC) necessary to protect sensitive model weights and training data.
Yet, the complexity of networking requirements creates a bottleneck during system deployment.
In this article, I’ll take a deep dive into VPC architecture through four common ML use cases, using Amazon Virtual Private Cloud (AWS VPC) as the example.
What is Virtual Private Cloud (VPC)
A Virtual Private Cloud (VPC) is a logically isolated section of the cloud where a client launches a cloud provider’s resources in their own virtual network, with full control over the IP address ranges and network components.
The diagram below illustrates its architecture, taking AWS VPC as an example:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure A. Logical architecture diagram of AWS VPC showing Public and Private Subnets with Internet and NAT Gateways (Created by Kuriko IWAI)
The VPC functions like a traditional network that the client would operate in their own data center, but with the scalable benefits of using AWS's infrastructure.
Within an AWS VPC, subnets (Figure A, white boxes) are used to partition and control the network environment for Amazon Elastic Compute Cloud (EC2) instances, the virtual servers in Amazon's data centers that the client rents to run the system.
Public EC2 instances can connect to the internet via the route table in the public subnet and the Internet Gateway (Figure A, pink box), while private EC2 instances remain isolated from direct inbound traffic.
Private EC2 instances can securely access:
The internet and other AWS resources via a NAT Gateway (Figure A, light orange box) in the public subnet,
Other AWS resources like S3 via their corresponding VPC endpoints, and
The client’s data centers and servers outside of the AWS VPC via the VPN gateway (Figure A, left, orange box).
◼ The Network Bottleneck: Why VPC Matters for Production ML
A VPC is critical for a secure ML lifecycle from data ingestion to model deployment, ensuring that the training data and model artifacts never touch the public internet.
▫ Private Subnets as Security Guardrails
Private subnets can secure GPU training instances on SageMaker or EC2.
Dedicated security groups for the private subnet not only block unauthorized traffic but also help the system meet HIPAA, PCI, or SOC 2 requirements, provided that sensitive datasets are stored and processed within a logically isolated network.
▫ Hybrid Cloud Integration
The VPN gateway connected to the private EC2 instances can securely ingest raw data from on-premises databases into the cloud.
It can also enable models to call internal company APIs or authentication services.
▫ Cost & Speed Management with VPC Endpoints
The NAT gateway charges data processing fees (e.g., $0.045 per GB), which inflate model training expenses when the model is trained on massive datasets.
Using VPC endpoints to exchange data between S3 and private EC2 instances mitigates both the cost and the training latency by keeping the traffic within the AWS network.
How VPC Works - Decoding Core Components
At its core, a VPC uses Software Defined Networking (SDN) to carve out a private network for the client within AWS's massive infrastructure.
The client defines a CIDR block (e.g., 10.0.0.0/16), and AWS handles the underlying packet routing between the client’s compute and storage resources (e.g., EC2 instances and S3).
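As a quick sizing sanity check, the address count implied by a CIDR prefix is 2^(32 − prefix), and AWS additionally reserves five addresses in every subnet (network address, VPC router, DNS, one reserved for future use, and broadcast). A minimal bash sketch (the function names are illustrative):

```shell
# total addresses in a CIDR block: 2^(32 - prefix_length)
cidr_size() { echo $(( 1 << (32 - $1) )); }

# usable addresses in an AWS subnet: AWS reserves 5 addresses per subnet
# (network, VPC router, DNS, reserved-for-future-use, broadcast)
aws_usable() { echo $(( $(cidr_size "$1") - 5 )); }

cidr_size 16    # a /16 VPC block: prints 65536
aws_usable 24   # a /24 subnet: prints 251 usable instance IPs
```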
This system comprises five key elements:
Subnets,
Gateways,
VPC Endpoints,
Security Layers, and
Routing Configuration.
◼ Subnets: The Logical Partition
Subnets are CIDR-assigned IP address ranges within the VPC.
In the ML context, subnets define the security zones:
Public subnets are the only subnets with a direct route to the Internet gateway. They host the NAT gateway that serves private EC2 instances, as well as load balancers.
Private subnets contain GPU clusters (e.g., P4d instances), RDS databases, and SageMaker Notebooks, completely isolated from direct ingress.
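The subnet CIDRs themselves are simply fixed slices of the VPC block. A small helper sketching how the /24 subnets used throughout this article (10.0.1.0/24 public, 10.0.2.0/24 private) are carved out of the 10.0.0.0/16 VPC (the helper name is illustrative):

```shell
# carve sequential /24 subnets out of the 10.0.0.0/16 VPC block
vpc_prefix="10.0"
subnet_cidr() { echo "${vpc_prefix}.$1.0/24"; }

PUBLIC_CIDR=$(subnet_cidr 1)    # 10.0.1.0/24, for the public subnet
PRIVATE_CIDR=$(subnet_cidr 2)   # 10.0.2.0/24, for the private subnet
echo "$PUBLIC_CIDR $PRIVATE_CIDR"
```

These are the same CIDR values passed to aws ec2 create-subnet in the implementation sections below.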
◼ Gateways: The Traffic Controllers
Gateways manage the entry and exit points of traffic to the VPC network.
There are three types of gateways:
Internet gateway (IGW) connects the VPC to the internet, performing 1:1 static Network Address Translation (NAT) for instances with public IPs.
NAT Gateway is a managed service that allows EC2 instances in a private subnet to connect to services outside the VPC (e.g., for pip install or OS patches), while preventing the internet from initiating a connection with those private instances.
Virtual Private Gateway (VGW) is the VPN concentrator that enables the client to extend an on-premises data center into the VPC over an encrypted tunnel.
◼ VPC Endpoints
VPC endpoints allow the client to privately connect the VPC to supported AWS services without requiring an IGW, NAT gateway, or VPN connection.
There are two types of VPC endpoints:
Gateway endpoints: Entry points for S3 and DynamoDB. Functions by adding a prefix list to the route table of the subnet.
Interface endpoints: Entry points for services like SageMaker API, EC2 API, or Kinesis. Functions by assigning a private IP address to the endpoint from the subnet's IP range.
To compare Gateway and Interface endpoints:
| Feature | Gateway Endpoint (S3 & DynamoDB) | Interface Endpoint (most other services) |
| --- | --- | --- |
| Mechanism | Adds a prefix list entry to the route table. | Creates an Elastic Network Interface (ENI) with a private IP in the subnet. |
| Cost | Free. | Hourly charge + data processing fees. |
| Connectivity | Uses routing logic; no network interface. | Uses DNS to point service URLs to private ENIs. |
| Requirement | Requires --route-table-ids. | Requires --subnet-ids and --security-group-ids. |
Table 1. Technical Comparison: Gateway Endpoints vs. Interface Endpoints for ML Data Ingestion.
Utilizing gateway endpoints is practically mandatory for ML workloads, as it avoids massive NAT data charges when pulling training datasets.
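As a back-of-the-envelope check of that claim, here is the NAT data-processing cost for a dataset pull at the $0.045/GB rate quoted earlier; the same pull through an S3 gateway endpoint incurs no data-processing charge (the function name is illustrative):

```shell
# rough NAT data-processing cost for pulling a dataset, at $0.045 per GB
nat_cost() { awk -v gb="$1" 'BEGIN { printf "%.2f", gb * 0.045 }'; }

nat_cost 10000   # a 10 TB training corpus through the NAT gateway: prints 450.00
# the same pull via an S3 gateway endpoint: zero data-processing charge
```

Repeated over many training epochs or retraining runs, this difference dominates the networking bill.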
◼ Security Layers
AWS employs a two-tier security system:
Network ACLs (NACLs): Stateless, subnet-level filters that block specific CIDR ranges.
Security Groups (SGs): Stateful, instance-level virtual firewalls.
The diagram below illustrates how NACLs and SGs work for distributed training with a head node and four worker nodes:

Figure B. Technical diagram of Security Group self-reference rules for distributed ML training nodes (Created by Kuriko IWAI)
In Figure B, the head node (pink box) and worker nodes (yellow boxes) are on the private subnet protected by the network ACLs.
When the head node submits an SSH request to a worker node, the worker node refers to an inbound rule defined in its security group to check if the incoming request is acceptable (Inbound check).
If acceptable, the worker node can return a response to the head node without any explicit outbound rule.
This is because the security group is stateful: it remembers the state of the connection between the head and worker nodes and automatically allows the return traffic from the worker node.
SSH requests that fail the inbound check, on the other hand, are blocked (Figure B, grey boxes).
This architecture allows the ML system to:
Reduce the overhead of managing complex outbound rules for high-speed data syncing (e.g., NCCL or MPI) between GPUs.
Tighten security by blocking outbound connections to the internet, while returning responses to legitimate internal management commands.
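Because NACLs are stateless while security groups are stateful, a subnet NACL must explicitly allow return traffic on the ephemeral port range; 1024-65535 is a common rule, though Linux clients typically use 32768-60999. A minimal sketch of that inbound check (the function name and port range are illustrative assumptions):

```shell
# NACLs are stateless: a reply from an external server arrives on a
# high-numbered "ephemeral" client port, which an inbound NACL rule must
# explicitly allow. Hypothetical inbound rule: allow TCP ports 1024-65535.
nacl_allows_return() {
  if [ "$1" -ge 1024 ] && [ "$1" -le 65535 ]; then
    echo ALLOW
  else
    echo DENY
  fi
}

nacl_allows_return 49152   # typical ephemeral reply port -> prints ALLOW
nacl_allows_return 443     # below the ephemeral range    -> prints DENY
```

A security group needs no such rule, which is one reason SGs carry most of the filtering burden in practice.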
◼ Routing Configuration
Routing configuration requires route tables, a logical container for a set of rules (routes) that dictate where traffic is directed.
On top of the main route table created by default, the client can add a custom route table for each subnet, containing rules like 0.0.0.0/0 -> nat-gateway-id.
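How a route table resolves a destination can be sketched locally: the most specific (longest-prefix) matching route wins, which is why the local VPC route takes precedence over the 0.0.0.0/0 catch-all. A minimal bash simulation of a typical private-subnet table (the routes and helper names are illustrative):

```shell
# convert a dotted-quad IP to a 32-bit integer
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# pick the route whose prefix matches the destination with the longest
# prefix length; routes mirror a typical private-subnet table:
# the VPC CIDR stays local, everything else goes to the NAT gateway
route_lookup() {
  local dest=$(ip_to_int "$1") best_len=-1 best_target=""
  for route in "10.0.0.0/16 local" "0.0.0.0/0 nat-gateway"; do
    set -- $route
    local net="${1%/*}" len="${1#*/}" target=$2
    local mask=$(( len == 0 ? 0 : (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
    if [ $(( dest & mask )) -eq $(( $(ip_to_int "$net") & mask )) ] \
       && [ "$len" -gt "$best_len" ]; then
      best_len=$len
      best_target=$target
    fi
  done
  echo "$best_target"
}

route_lookup 10.0.5.9     # inside the VPC CIDR -> prints local
route_lookup 151.101.0.1  # anything else       -> prints nat-gateway
```

The real VPC router applies the same longest-prefix logic, so a gateway-endpoint prefix list (more specific than 0.0.0.0/0) automatically captures S3-bound traffic before it reaches the NAT route.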
AWS VPC In Action
In this section, I cover four typical VPC architectures for ML systems:
MVP and rapid experimentation,
Tabular data pipeline,
Multi-modal training, and
Distributed LLM training.
◼ Use Case 1. MVP & Rapid Experimentation
In the MVP phase, the system runs on a single instance (e.g., g5.xlarge) and expects to pull various open-source libraries via pip or conda.
▫ The Goal
- Speed to market. Minimal running cost.
▫ VPC Configuration
- Workstation on public subnet + model/feature stores on S3
The system utilizes SageMaker Notebook as a workstation on EC2 in the public subnet and downloads libraries via the Internet Gateway, while securely accessing model and feature stores on S3 through a VPC endpoint:

Figure C-1. MVP ML architecture using SageMaker Notebooks in a Public Subnet with S3 VPC Endpoints. (Created by Kuriko IWAI)
▫ Key Points
Simple configuration. The public subnet routes straight through the Internet gateway, with no NAT gateway to provision or manage.
Cost saving. The system pulls data and artifacts from S3 via the VPC endpoint, avoiding the NAT gateway’s data processing fees.
▫ Implementation
Using the AWS CLI, deploy the VPC:
# create vpc
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 --query 'Vpc.VpcId' --output text)

# create public subnet
SUBNET_ID=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID \
  --cidr-block 10.0.1.0/24 \
  --query 'Subnet.SubnetId' \
  --output text)

# enable auto-assign public ip
aws ec2 modify-subnet-attribute --subnet-id $SUBNET_ID --map-public-ip-on-launch

# create and attach igw to the created vpc
IGW_ID=$(aws ec2 create-internet-gateway --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 attach-internet-gateway --vpc-id $VPC_ID --internet-gateway-id $IGW_ID

# route public traffic (0.0.0.0/0) to the igw so the subnet is actually public
RTB_ID=$(aws ec2 create-route-table --vpc-id $VPC_ID --query 'RouteTable.RouteTableId' --output text)
aws ec2 create-route --route-table-id $RTB_ID --destination-cidr-block 0.0.0.0/0 --gateway-id $IGW_ID
aws ec2 associate-route-table --subnet-id $SUBNET_ID --route-table-id $RTB_ID

Developer Note:
Complex VPC configurations are rare for an MVP where speed and flexibility matter.
But VPC can be necessary when handling sensitive data or when required by architectural constraints, such as site-to-site VPN connections to on-premises databases.
◼ Use Case 2. Tabular Data Pipelines (PII Protection & Batch ROI)
The tabular batch pipeline requires optimized cost management for structured data processing.
Additionally, to protect sensitive PII, the computing environment must be isolated within a private network.
▫ The Goal
Processing structured data (CSV, Parquet).
Maintain data security.
▫ VPC Configuration
- Private Subnet + VPC Endpoints + VPN Gateway
The system leverages a private subnet to secure data and artifacts, using VPC endpoints to access AWS resources like S3 and DynamoDB, and a VPN gateway for encrypted connectivity to off-site corporate data centers:

Figure C-2. Secure tabular data pipeline architecture featuring VPN Gateway and Private Subnets. (Created by Kuriko IWAI)
▫ Key Points
Security. The traffic stays within the private subnet, connecting S3 and DynamoDB with the private route table. Encrypted connection to the off-site data center.
Cost effectiveness. Gateway endpoints do not charge hourly fees, making them a cost-effective option for intermittent batch processing.
▫ Implementation
Similar to the previous use case, use the AWS CLI to deploy the VPC:
# create a private subnet (no public ip mapping)
PRIVATE_SUBNET_ID=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID \
  --cidr-block 10.0.2.0/24 \
  --query 'Subnet.SubnetId' --output text)

# create a dedicated route table
PRIVATE_ROUTE_TABLE_ID=$(aws ec2 create-route-table \
  --vpc-id $VPC_ID \
  --query 'RouteTable.RouteTableId' --output text)

# explicitly associate the private subnet with the route table
aws ec2 associate-route-table \
  --subnet-id $PRIVATE_SUBNET_ID \
  --route-table-id $PRIVATE_ROUTE_TABLE_ID

# region-scoped service names for the gateway endpoints
S3_SERVICE=com.amazonaws.$REGION.s3
DDB_SERVICE=com.amazonaws.$REGION.dynamodb

# create s3 gateway endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name $S3_SERVICE \
  --route-table-ids $PRIVATE_ROUTE_TABLE_ID

# create dynamodb gateway endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name $DDB_SERVICE \
  --route-table-ids $PRIVATE_ROUTE_TABLE_ID

◼ Use Case 3. Multi-Modal Training (High-Frequency API Orchestration)
For a multi-modal ML architecture, the VPC configuration must balance massive raw data ingestion with high-speed API orchestration.
▫ The Goal
- Integrate various data streams (video, audio, LiDAR, text) into the model training pipeline via high-frequency API calls.
▫ VPC Configuration
- Private Subnet + NAT Gateway + Interface Endpoints.
In this system, a dedicated private subnet hosts the SageMaker EC2 instances used for model hosting.
These instances securely access model artifacts and feature stores on S3, as well as other AWS services like Amazon Rekognition for multi-modal data processing, using the private subnet’s route table and dedicated VPC endpoints.
Although optional, these private instances can securely ingest data from an off-site corporate data center via a VPN Gateway for continuous model refinement through additional training with on-premises data.
Lastly, inference results are routed through a NAT gateway in the public subnet and the Internet gateway.

Figure C-3. Multi-modal ML workflow using Interface Endpoints for Amazon Rekognition and SageMaker API (Created by Kuriko IWAI)
▫ Key Points
Prevent NAT congestion by using the NAT gateway only for small external pulls, while using VPC endpoints for high-frequency requests to other AWS resources.
Security. The private subnet ensures that the training cluster has no direct exposure to the public internet.
▫ Implementation
Using the AWS CLI, first configure a NAT gateway and add a route to it in the private subnet's route table:
# allocate an elastic ip to the vpc
ALLOCATION_ID=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)

# create a nat gateway in the public subnet
NAT_GW_ID=$(aws ec2 create-nat-gateway \
  --subnet-id $PUBLIC_SUBNET_ID \
  --allocation-id $ALLOCATION_ID \
  --query 'NatGateway.NatGatewayId' \
  --output text)

# route all 0.0.0.0/0 traffic from the private subnet to the nat gateway
aws ec2 create-route \
  --route-table-id $PRIVATE_RTB_ID \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id $NAT_GW_ID

Then, create interface endpoints for each of the required AWS services: sagemaker.runtime, sagemaker.api, and rekognition:
# create interface endpoints for related services in the private subnet
SERVICES=("sagemaker.runtime" "sagemaker.api" "rekognition")
for SERVICE in "${SERVICES[@]}"; do
  aws ec2 create-vpc-endpoint \
    --vpc-id $VPC_ID \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.$REGION.$SERVICE \
    --subnet-ids $PRIVATE_SUBNET_ID \
    --security-group-ids $SECURITY_GROUP_ID \
    --private-dns-enabled
done

◼ Use Case 4. Distributed LLM Training (EFA & Latency)
Tuning LLMs requires distributed training across multiple GPU instances.
▫ The Goal
- High-performance distributed training across multiple GPU instances (p4d or p5 nodes).
▫ VPC Configuration
- Private Subnet + EFA + Security Group with self-reference rules.
The system utilizes the Elastic Fabric Adapter (EFA) rather than standard TCP/IP to achieve microsecond-level latency.
To support this, the Security Group is configured to allow stateful ingress traffic from all associated GPU instances (Figure B).
▫ Implementation
After creating the VPC, first create a security group and authorize all ingress and egress traffic from the same security group:
# create the security group
SG_ID=$(aws ec2 create-security-group \
  --group-name "llm-training-sg" \
  --description "Security group for EFA and NCCL communication" \
  --vpc-id $VPC_ID \
  --query 'GroupId' --output text)

# authorize all ingress traffic originating from the same security group
aws ec2 authorize-security-group-ingress \
  --group-id $SG_ID \
  --protocol all \
  --port -1 \
  --source-group $SG_ID

# authorize all egress traffic to the same security group
aws ec2 authorize-security-group-egress \
  --group-id $SG_ID \
  --protocol all \
  --port -1 \
  --source-group $SG_ID

Then, create a dedicated private subnet and launch an EFA-enabled training instance:
# create a dedicated private subnet for training
# ($AZ is an availability zone in the region, e.g. us-east-1a)
TRAIN_SUBNET_ID=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID \
  --cidr-block 10.0.2.0/24 \
  --availability-zone $AZ \
  --query 'Subnet.SubnetId' \
  --output text)

# create a placement group (pg) for maximum performance of efa
PG_NAME=$(aws ec2 create-placement-group \
  --group-name "llm-training-pg" \
  --strategy cluster \
  --query 'PlacementGroup.GroupName' \
  --output text)

# launch a training instance with efa enabled, applying the placement group
aws ec2 run-instances \
  --image-id ami-xxxxxx \
  --instance-type p4d.24xlarge \
  --key-name $KEY_NAME \
  --placement "GroupName=$PG_NAME" \
  --network-interfaces "DeviceIndex=0,InterfaceType=efa,Groups=$SG_ID,SubnetId=$TRAIN_SUBNET_ID"

Wrapping Up
A well-architected VPC is critical for a secure, production-ready AI platform.
Mastering subnets and choosing the right endpoints ensure that data remains private and egress costs remain low.
◼ Trade-offs - When to Skip the VPC
VPC is not always a hard requirement for ML systems.
VPC increases network complexity, latency, and overhead.
When these costs outweigh the security benefits, the system is more efficient using IAM roles and encrypted public endpoints rather than a VPC.
For example:
Public dataset research, low-sensitivity projects: When working with open-source datasets or non-proprietary data.
Prototyping: When prioritizing iteration speed and easy access to external libraries over strict network isolation.
Serverless and fully-managed services with the built-in encryption systems.
◼ Cloud Comparison: AWS VPC vs. GCP VPC vs. Azure VNet
Lastly, Google Cloud Platform (GCP) VPC and Azure VNet are alternative options for AWS VPC.
They offer similar services with different architectures:

Figure D. Comparison chart of global networking architectures: AWS VPC vs. GCP VPC vs. Azure VNet (Created by Kuriko IWAI)
GCP VPC offers a single VPC that spans multiple regions worldwide, simplifying global load balancing and cross-region communication.
This makes GCP VPC suitable for global reach and multi-region architectures.
Azure VNets, on the other hand, are regionally scoped and offer seamless integration with the Microsoft ecosystem.
AWS VPCs are designed to span multiple Availability Zones within a region, providing robust regional control and fault isolation.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Building LoRA Multi-Adapter Inference on AWS SageMaker
Deconstructing LoRA: The Math and Mechanics of Low-Rank Adaptation
The Definitive Guide to LLM Fine-Tuning: Objectives, Mechanisms, and Hardware
A Complete Guide to Resilient Quant ML Engines on AWS SageMaker
Regularizing LLMs with Kullback-Leibler Divergence
Architecting Production ML: A Deep Dive into Deployment and Scalability
Data Pipeline Architecture: From Traditional DWH to Modern Lakehouse
Engineering a Fully-Automated Lakehouse: From Raw Data to Gold Tables
Building an Automated CI/CD Pipeline for Serverless Machine Learning on AWS
Building a Production-Ready Data CI/CD Pipeline: Versioning, Drift Detection, and Orchestration
From Notebook to Production: Building a Resilient ML Pipeline on AWS Lambda
Building a Serverless ML Lineage: AWS Lambda, DVC, and Prefect
Related Books for Further Understanding
These books cover a wide range of theory and practice, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation
Share What You Learned
Kuriko IWAI, "Scaling Securely - A Technical Deep Dive into AWS VPC Architecture for MLOps" in Kernel Labs
https://kuriko-iwai.com/aws-vpc-architecture-for-machine-learning-mlops
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.