Building an Automated CI/CD Pipeline for Serverless Machine Learning on AWS

A step-by-step guide to automating the infrastructure pipeline for a serverless AWS Lambda architecture

Machine Learning | Deep Learning | Python

By Kuriko IWAI

Table of Contents

Introduction
What is a CI/CD Pipeline
The Workflow in Action
Testing and Building Workflow
Adding PyTest Scripts
Configuring the Snyk Credential for SAST and SCA Tests
Setting Up OIDC for AWS Credentials
Configuring an IAM Role for GitHub Actions
Configuring AWS CodeBuild
Deployment Workflow
Monitoring with Grafana
Creating an AWS IAM User for Grafana
Attaching Roles and Policies
Connecting the Data Source to Grafana
Wrapping Up

Introduction

A CI/CD pipeline is a set of automated processes that helps machine learning teams deliver models more reliably and efficiently.

This automation is crucial for ensuring that new model versions are continuously integrated, tested, and deployed to production without manual intervention.

In this article, I'll walk through a step-by-step guide to integrating an infrastructure CI/CD pipeline for a machine learning application deployed on a serverless AWS Lambda architecture.

What is a CI/CD Pipeline

A CI/CD (Continuous Integration / Continuous Delivery) pipeline is an automated process that helps deliver code changes more reliably and efficiently by automating the steps of building, testing, and deploying software.

Continuous Integration (CI) focuses on the practice of developers regularly merging code changes into a central repository.

After each merge, an automated build and a series of tests like unit tests are run to ensure the new code doesn't break the existing application.

Continuous Delivery (CD) automates the process of taking the code that passed CI and getting it ready for release.

In the process, the software is built, tested, and packaged into a release-ready state.

Then, Continuous Deployment (CD) goes one step further, automatically deploying the code that passes all automated tests to production without human intervention.

Using a CI/CD pipeline is critical in DevOps practices, providing benefits like:

  • Faster releases as automation eliminates manual, time-consuming operational tasks,

  • Reducing risk by running automated tests on every code change,

  • Improving collaboration through a shared, automated pipeline that provides a consistent process, and

  • Improving code quality through the immediate feedback from the automated tests.

The Workflow in Action

To establish a robust CI/CD pipeline for an ML application, it is critical to automate the entire lifecycle of the infrastructure, models, and data.

This process is referred to as MLOps, which extends traditional DevOps practices to address the unique challenges of machine learning, such as data and model versioning.

In this article, I’ll focus on building an infrastructure CI/CD pipeline for the dynamic pricing system built on AWS Lambda:

Figure A. Infrastructure CI/CD pipelines (Created by Kuriko IWAI)

The pipeline covers the four stages:

  • Source: Commits a code change to GitHub to trigger the pipeline,

  • Test: Runs automated tests and security scans on the committed code,

  • Build: Compiles the committed code into an artifact, and

  • Deploy: Deploys the approved build to a staging or production environment.

All code is hosted on GitHub, where it's protected by branch protection rules and enforced pull request reviews.

Once a change is ready, a GitHub Actions workflow (green box in the diagram) is triggered to run the testing and building processes.

To prevent errors from reaching production, I added a human review phase (pink box in the diagram) between the build and deployment workflows, ensuring any issues are addressed before the final deployment.

If the code passes the human review, another GitHub Actions workflow is manually triggered to deploy the code as a Lambda function in staging or production.

This entire process is enhanced with comprehensive monitoring and security checkups (orange boxes).

Testing and Building Workflow

I’ll first configure the GitHub Actions workflow to trigger testing and building on every push and pull request.

This automation process involves three phases:

Environment Setup:

  • Setting up Python,

  • Installing dependencies,

  • Configuring AWS credentials using OIDC,

Test Phase:

  • Running PyTest,

  • Running the Static Application Security Testing (SAST),

  • Scanning dependencies with Software Composition Analysis (SCA), and

Build Phase:

  • Once the code passes all tests, triggering AWS CodeBuild to run the build project, where the container image is built and pushed to Amazon ECR.

These phases are configured in the build_test.yml script stored in the .github/workflows folder at the root of the project directory:

.github/workflows/build_test.yml

name: Build and Test

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

env:
  API_ENDPOINT: ${{ secrets.API_ENDPOINT }}
  CLIENT_A: ${{ secrets.CLIENT_A }}


# set permissions for oidc (open id connect) authentication with aws
permissions:
  id-token: write  # for requesting the jwt token from GitHub's OIDC provider.
  contents: read   # for checking out the code in the repo
  security-events: write

jobs:
  build_and_test:
    runs-on: ubuntu-latest
    timeout-minutes: 60

    steps:
      # environment setup
      - name: checkout repository code
        uses: actions/checkout@v4

      - name: set up python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'

      - name: install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install -r requirements_dev.txt

      # config aws credentials using oidc
      - name: configure aws credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-region: ${{ secrets.AWS_REGION_NAME }}
          role-to-assume: ${{ secrets.AWS_IAM_ROLE_ARN }} # iam role for github actions
          role-session-name: GitHubActions-Build-Test-${{ github.run_id }}

      - name: test aws access
        run: |
          aws sts get-caller-identity
          echo "✅ oidc authentication successful"

      # testing
      - name: run pytest
        run: pytest
        env:
          CORS_ORIGINS: 'http://localhost:3000,http://127.0.0.1:3000'
          PYTEST_RUN: true

      # dependency scanning (sca) and static code analysis (sast) with snyk
      - name: run snyk sca (dependency scan)
        uses: snyk/actions/python@master
        with:
          command: test
          args: --severity-threshold=high --policy-path=.snyk --python-version=3.12 --skip-unresolved --file=requirements.txt
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

      - name: run snyk sast (static code analysis)
        uses: snyk/actions/python@master
        with:
          command: code test
          args: --severity-threshold=high
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

      # building - trigger aws codebuild to start the project named ${{ secrets.CODEBUILD_PROJECT }}
      - name: trigger aws codebuild
        uses: aws-actions/aws-codebuild-run-build@v1
        id: codebuild
        with:
          project-name: ${{ secrets.CODEBUILD_PROJECT }}
          source-version-override: ${{ github.sha }}
          env-vars-for-codebuild: | # pass the env vars to buildspec.yml. set BUILD_TYPE as test not to trigger the deployment
            GITHUB_SHA=${{ github.sha }},
            BUILD_TYPE=test

      - name: check codebuild status
        if: always()
        run: |
          BUILD_ID="${{ steps.codebuild.outputs.aws-build-id }}"
          echo "codebuild id: $BUILD_ID"

          BUILD_STATUS=$(aws codebuild batch-get-builds --ids "$BUILD_ID" \
            --query 'builds[0].buildStatus' --output text)
          echo "build status: $BUILD_STATUS"

          aws codebuild batch-get-builds --ids "$BUILD_ID" \
            --query 'builds[0].phases[].{Phase:phaseType,Status:phaseStatus,Duration:durationInSeconds}' \
            --output table

          if [ "$BUILD_STATUS" != "SUCCEEDED" ]; then
            echo "❌ codebuild failed with status: $BUILD_STATUS"
            exit 1
          else
            echo "✅ codebuild completed successfully"
          fi

      - name: report build summary
        if: always()
        run: |
          echo "build completed for commit: ${{ github.sha }}"
          echo "branch: ${{ github.ref_name }}"
          echo "build ID: ${{ steps.codebuild.outputs.aws-build-id }}"

Next, I’ll add support components to make the workflow run successfully.

This process involves:

  • Adding PyTest scripts,

  • Configuring the Snyk credential for SAST and SCA tests,

  • AWS related configuration:

    1. Setting up OIDC for AWS credentials,

    2. Defining an IAM role for GitHub Actions, and

    3. Configuring AWS CodeBuild

Adding PyTest Scripts

I’ll start the process by adding PyTest scripts to the tests folder located at the root of the project repository.

For demonstration, I’ll add two test files to evaluate the main script and the Flask app script:

tests/main_test.py (Testing the main script)

import os

import src.main as main_script

# NOTE: the mock_* fixtures and the *_PATH constants used below are not shown in the article;
# they are assumed to come from the shared test setup (e.g. tests/conftest.py, sketched later
# in this section) or the project configuration.


def test_data_loading_and_preprocessor_saving(mock_data_handling, mock_s3_upload, mock_joblib_dump):
    """tests that data loading is called and the preprocessor is saved and uploaded."""

    main_script.run_main()

    # verify that data_handling.main_script was called
    mock_data_handling.assert_called_once()

    # verify preprocessor is dumped in mock file
    mock_joblib_dump.assert_called_once_with(mock_data_handling.return_value[-1], PREPROCESSOR_PATH)

    # verify preprocessor is uploaded to mock s3
    mock_s3_upload.assert_any_call(file_path=PREPROCESSOR_PATH)


def test_model_optimization_and_saving(mock_data_handling, mock_model_scripts, mock_s3_upload):
    """tests that each model's optimization script is called and the results are saved and uploaded."""

    mock_torch_script, mock_sklearn_script = mock_model_scripts
    main_script.run_main()

    # verify each model's main_script was called
    assert mock_torch_script.called
    assert mock_sklearn_script.call_count == len(main_script.sklearn_models)

    # verify that each model file exists and s3_upload was called for it
    ## dfn
    assert os.path.exists(DFN_FILE_PATH)
    mock_s3_upload.assert_any_call(file_path=DFN_FILE_PATH)

    ## svr model
    assert os.path.exists(SVR_FILE_PATH)
    mock_s3_upload.assert_any_call(file_path=SVR_FILE_PATH)

    ## elastic net
    assert os.path.exists(EN_FILE_PATH)
    mock_s3_upload.assert_any_call(file_path=EN_FILE_PATH)

    ## light gbm
    assert os.path.exists(GBM_FILE_PATH)
    mock_s3_upload.assert_any_call(file_path=GBM_FILE_PATH)

tests/app_test.py (Testing the Flask app scripts)

import os
import json
import io
import pandas as pd
import numpy as np
from unittest.mock import patch, MagicMock

# import scripts to test
import app

# add cors origin
os.environ['CORS_ORIGINS'] = 'http://localhost:3000, http://127.0.0.1:3000'


@patch('app.t.scripts.load_model')
@patch('torch.load')
@patch('app._redis_client', new_callable=MagicMock)
@patch('app.joblib.load')
@patch('app.s3_load_to_temp_file')
@patch('app.s3_load')
def test_predict_endpoint_primary_model(
    mock_s3_load,
    mock_s3_load_to_temp_file,
    mock_joblib_load,
    mock_redis_client,
    mock_torch_load,
    mock_load_model,
    flask_client,
):
    """test a prediction from the primary model without cache hit."""

    # mock return values for file loading
    mock_preprocessor = MagicMock()
    mock_joblib_load.return_value = mock_preprocessor
    mock_s3_load.return_value = io.BytesIO(b'dummy_data')
    mock_s3_load_to_temp_file.return_value = 'dummy_path'

    # config redis cache for cache miss
    mock_redis_client.get.return_value = None

    # config the model and torch mock
    mock_torch_model = MagicMock()
    mock_load_model.return_value = mock_torch_model
    mock_torch_load.return_value = {'state_dict': 'dummy'}

    # mock model's prediction array
    num_rows = 1200
    num_bins = 100
    expected_length = num_rows * num_bins
    mock_prediction_array = np.random.uniform(1.0, 10.0, size=expected_length)

    # mock the return chain for the model's forward pass
    mock_torch_model.return_value.cpu.return_value.numpy.return_value.flatten.return_value = mock_prediction_array

    # create a mock dataframe
    mock_df_expanded = pd.DataFrame({
        'stockcode': ['85123A'] * num_rows,
        'quantity': np.random.randint(50, 200, size=num_rows),
        'unitprice': np.random.uniform(1.0, 10.0, size=num_rows),
        'unitprice_min': np.random.uniform(1.0, 3.0, size=num_rows),
        'unitprice_median': np.random.uniform(4.0, 6.0, size=num_rows),
        'unitprice_max': np.random.uniform(8.0, 12.0, size=num_rows),
    })

    # set global variables used by the app endpoint
    app.X_test = mock_df_expanded.drop(columns='quantity')
    app.preprocessor = mock_preprocessor

    with patch.object(pd, 'read_parquet', return_value=mock_df_expanded):
        response = flask_client.get('/v1/predict-price/85123A')

    # assertion
    assert response.status_code == 200

    data = json.loads(response.data)
    assert isinstance(data, list)
    assert len(data) == num_bins
    assert data[0]['stockcode'] == '85123A'
    assert 'predicted_sales' in data[0]

In the app_test.py script, I used the @patch decorator from Python’s unittest.mock library to temporarily replace functions and objects with mock objects.

This allows the tests to run without depending on external resources like files or cloud storage.

In practice, these tests need to be updated with every code change to make sure the changes don’t introduce errors.
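The fixtures referenced above (mock_data_handling, mock_s3_upload, mock_joblib_dump, mock_model_scripts, flask_client) and the artifact path constants are not shown in this article. Below is a minimal conftest.py sketch of what they might look like; the patch targets (src.main.data_handling, src.main.s3_upload, and so on), file paths, and the app.app Flask instance are assumptions about the project layout and should be adjusted to the real module structure.

tests/conftest.py (A hypothetical sketch of the shared fixtures)

# a minimal sketch of the shared fixtures the tests above depend on;
# all patch targets and paths below are assumptions, not the project's actual layout
import pytest
from unittest.mock import patch, MagicMock

# hypothetical artifact paths referenced by tests/main_test.py
PREPROCESSOR_PATH = "artifacts/preprocessor.joblib"
DFN_FILE_PATH = "artifacts/dfn.pt"
SVR_FILE_PATH = "artifacts/svr.joblib"
EN_FILE_PATH = "artifacts/elastic_net.joblib"
GBM_FILE_PATH = "artifacts/lightgbm.joblib"


@pytest.fixture
def mock_data_handling():
    # replace the data-loading step so no real dataset is read;
    # the last element of the return tuple stands in for the fitted preprocessor
    with patch("src.main.data_handling") as mocked:
        mocked.return_value = (MagicMock(), MagicMock(), MagicMock())
        yield mocked


@pytest.fixture
def mock_s3_upload():
    with patch("src.main.s3_upload") as mocked:
        yield mocked


@pytest.fixture
def mock_joblib_dump():
    with patch("src.main.joblib.dump") as mocked:
        yield mocked


@pytest.fixture
def mock_model_scripts():
    # one mock for the pytorch training script, one for the scikit-learn scripts
    with patch("src.main.train_torch_model") as torch_script, \
         patch("src.main.train_sklearn_model") as sklearn_script:
        yield torch_script, sklearn_script


@pytest.fixture
def flask_client():
    # import inside the fixture so env vars set at module import time take effect first;
    # assumes the Flask instance is exposed as app.app
    import app
    app.app.config["TESTING"] = True
    with app.app.test_client() as client:
        yield client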

Configuring the Snyk Credential for SAST and SCA Tests

For SAST and SCA, I’ll use Snyk, a security platform, to find and fix vulnerabilities in the code and its dependencies.

Snyk’s primary goal is to shift left by integrating security into the development workflow as early as possible.

So, the GitHub Actions workflow must run the Snyk SAST and SCA scans before triggering the build process.

To configure the Snyk credential, visit the Snyk account page, copy the Auth Token, and store it in the GitHub repository secrets as SNYK_TOKEN.
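If you prefer the terminal over the web UI, the same secret can be stored with the GitHub CLI; a small sketch, assuming gh is installed and authenticated for the repository:

# store the snyk auth token as a repository secret (paste the token when prompted)
gh secret set SNYK_TOKEN --repo <YOUR GITHUB ACCOUNT>/<REPOSITORY NAME>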

Figure B. Screenshot of the Snyk account page

Setting Up OIDC for AWS Credentials

Next, I’ll configure AWS credential handling with OIDC (OpenID Connect).

OIDC avoids storing long-lived AWS credentials in the CI environment by leveraging a federated identity approach with an external identity provider (IdP).

The IdP issues a short-lived token, which is exchanged with AWS for temporary security credentials that grant access to specific resources for a limited time.

To make the process work, I’ll first add the identity provider to the AWS account.

Visit the AWS IAM console > Identity providers:

  • Provider type: Select OpenID Connect

  • Provider URL: https://token.actions.githubusercontent.com

  • Audience: sts.amazonaws.com

  • Click Add provider.
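The IAM role for GitHub Actions (created in the next step) references this identity provider in its trust policy. Here is a minimal sketch of such a trust policy; the account ID and repository placeholders are assumptions to be replaced with your own values, and the sub condition restricts the role to a single repository:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::<AWS ACCOUNT ID>:oidc-provider/token.actions.githubusercontent.com"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
                },
                "StringLike": {
                    "token.actions.githubusercontent.com:sub": "repo:<YOUR GITHUB ACCOUNT>/<REPOSITORY NAME>:*"
                }
            }
        }
    ]
}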

Configuring an IAM Role for GitHub Actions

Next, I’ll add an IAM role for GitHub Actions.

An IAM role is a security entity in AWS that defines a set of permissions for making service requests.

To let GitHub Actions access the necessary AWS resources for the project, the IAM role must have permissions for:

  • Retrieving the identity from the AWS Security Token Service (STS): GetCallerIdentity

  • Running AWS CodeBuild commands: BatchGetBuilds, BatchGetProjects, StartBuild

  • Reading logs of the CodeBuild project: GetLogEvents, FilterLogEvents, DescribeLogStreams, DescribeLogGroups

  • Reading parameters from the Systems Manager (SSM) Parameter Store: GetParameter, GetParameters, GetParametersByPath

  • Retrieving and modifying project related resources:

    • Lambda function: GetFunction, UpdateFunctionCode, UpdateFunctionConfiguration, InvokeFunction

    • ECR: ListImages, DescribeImages, DescribeRepositories

I’ll configure these permissions as an inline policy github_actions_permissions in a JSON format.

The inline policy narrows the security scope to the minimum required, even though some permissions are already covered by broader AWS managed policies like AWSCodeBuildDeveloperAccess and CloudWatchLogsReadOnlyAccess.

IAM console > Roles > Create role > Add permissions > Create inline policy > JSON > github_actions_permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sts:GetCallerIdentity"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "codebuild:BatchGetBuilds",
                "codebuild:StartBuild",
                "codebuild:BatchGetProjects"
            ],
            "Resource": [
                "ADD_CODEBUILD_PROJECT_ARN"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:GetLogEvents",
                "logs:DescribeLogStreams",
                "logs:DescribeLogGroups",
                "logs:FilterLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:AWS_ACCOUNT_ID:log-group:/aws/codebuild/*:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter",
                "ssm:GetParameters",
                "ssm:GetParametersByPath"
            ],
            "Resource": [
                "ADD_SSM_PARAMETER_ARN"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lambda:GetFunction",
                "lambda:UpdateFunctionCode",
                "lambda:UpdateFunctionConfiguration",
                "lambda:InvokeFunction"
            ],
            "Resource": "ADD_LAMBDA_FUNCTION_ARN"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:ListImages",
                "ecr:DescribeImages",
                "ecr:DescribeRepositories"
            ],
            "Resource": "ADD_ECR_ARN"
        }
    ]
}

Configuring AWS CodeBuild

Lastly, I’ll configure the AWS CodeBuild project from the AWS console.

The process involves:

  • Step 1. Add an IAM role for the CodeBuild,

  • Step 2. Create a CodeBuild project, and

  • Step 3. Configure a buildspec.yml file to customize the build process.

Step 1. Adding an IAM Role for CodeBuild

The IAM role for CodeBuild needs to have permissions on:

  • Using the CodeConnections connection to the GitHub repository: UseConnection

  • Creating logs on the CodeBuild project: CreateLogGroup, CreateLogStream, PutLogEvents.

  • Managing parameters in the SSM Parameter Store: PutParameter, GetParameter, GetParameters, DeleteParameter, DescribeParameters

  • Reading and putting objects in the CodePipeline S3 buckets: GetObject, GetObjectVersion, PutObject

  • Retrieving and modifying project related resources:

    • Lambda function: GetFunction, UpdateFunctionCode, UpdateFunctionConfiguration

    • ECR: pushing and pulling container images (GetAuthorizationToken, BatchGetImage, PutImage, and the layer upload/download actions)

Similar to the GitHub Actions Role, these permissions are defined as an inline policy:

IAM console > Roles > Create role > Add permissions > Create inline policy > JSON > codebuild_permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "codeconnections:UseConnection"
            ],
            "Resource": "ADD_CONNECTION_ARN"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:AWS_ACCOUNT_ID:log-group:/aws/codebuild/*:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:PutParameter",
                "ssm:GetParameter",
                "ssm:GetParameters",
                "ssm:DeleteParameter",
                "ssm:DescribeParameters"
            ],
            "Resource": [
                "ADD_SSM_ARN"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::codepipeline-us-east-1-*/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lambda:UpdateFunctionCode",
                "lambda:GetFunction",
                "lambda:UpdateFunctionConfiguration"
            ],
            "Resource": "ADD_LAMBDA_FUNCTION_ARN"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:InitiateLayerUpload",
                "ecr:UploadLayerPart",
                "ecr:CompleteLayerUpload",
                "ecr:PutImage"
            ],
            "Resource": "ADD_ECR_ARN"
        }
    ]
}

Step 2. Creating a CodeBuild Project

Visit Developer Tools > CodeBuild > Build projects > Create build project and create a new CodeBuild project:

  • Project name: pj-sales-pred (or any name, as long as it matches the CODEBUILD_PROJECT secret referenced in the workflow file),

  • Project type: Default project,

  • Source 1: GitHub (follow the instructions to connect the GitHub account to CodeBuild)

  • Repository: https://github.com/<YOUR GITHUB ACCOUNT>/<REPOSITORY NAME> (The CodeBuild project needs to understand where to get the source code)

  • Service Role: Choose the IAM role created in Step 1

  • Buildspec: Choose the Use a buildspec file option and specify buildspec.yml as the buildspec name

  • Configure environment:

    • Environment image: Managed image

    • Operating system: Amazon Linux 2

    • Runtime(s): Standard

    • Image: aws/codebuild/amazonlinux2-x86_64-standard:5.0

    • Image version: Always use the latest image for this runtime version

    • Environment type: Linux

    • Compute: 3 GB memory, 2 vCPUs (BUILD_GENERAL1_SMALL)

    • Check "Privileged" (required for building Docker images)
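The same project can also be created from the AWS CLI instead of the console; a rough sketch, assuming the GitHub connection has already been authorized and the placeholders are replaced with your own values:

# cli sketch of the console configuration above
aws codebuild create-project \
  --name pj-sales-pred \
  --source type=GITHUB,location=https://github.com/<YOUR GITHUB ACCOUNT>/<REPOSITORY NAME>.git,buildspec=buildspec.yml \
  --artifacts type=NO_ARTIFACTS \
  --environment type=LINUX_CONTAINER,image=aws/codebuild/amazonlinux2-x86_64-standard:5.0,computeType=BUILD_GENERAL1_SMALL,privilegedMode=true \
  --service-role <CODEBUILD SERVICE ROLE ARN>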

Figure C. Screenshot of the AWS CodeBuild console

Step 3. Adding a buildspec File

Lastly, add the buildspec.yml file at the root of the project repository.

The buildspec.yml file configures the CodeBuild process by defining key components:

  • version: Specifies the buildspec version.

  • env: Defines environment variables.

  • phases: Defines the commands to run:

    • pre_build: Commands to run before the main build: log in to ECR and create the repository if it does not exist.

    • build: The main part of the build where the Docker image is built and tagged.

    • post_build: Commands to run after the main build is complete. The Docker image is pushed to the ECR.

  • artifacts: Specifies the files or directories that store the build output. The artifacts are passed to the next stage of the CI/CD pipeline: the deployment stage.

  • cache: Defines files or directories to cache between builds to speed up the process.

AWS CodeBuild automatically reads this file and executes the commands accordingly.

buildspec.yml

version: 0.2

phases:
  pre_build:
    commands:
      # login to ecr
      - echo "=== Pre-build Phase Started ==="
      - AWS_ACCOUNT_ID=$(echo $CODEBUILD_BUILD_ARN | cut -d':' -f5)
      - ECR_REGISTRY="$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com"
      - aws ecr get-login-password --region $AWS_DEFAULT_REGION > /tmp/ecr_password
      - cat /tmp/ecr_password | docker login --username AWS --password-stdin $ECR_REGISTRY
      - rm /tmp/ecr_password
      - REPOSITORY_URI="$ECR_REGISTRY/$ECR_REPOSITORY_NAME"

      # use github sha or codebuild commit hash as an image tag
      - |
        if [ -n "$GITHUB_SHA" ]; then
          COMMIT_HASH=$(echo $GITHUB_SHA | cut -c 1-7)
          echo "Using GitHub SHA: $GITHUB_SHA"
        else
          COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
          echo "Using CodeBuild SHA: $CODEBUILD_RESOLVED_SOURCE_VERSION"
        fi
      - IMAGE_TAG="${COMMIT_HASH:-latest}"

      # store image tag in aws ssm parameter store
      - |
        aws ssm put-parameter --name "/my-app/image-tag" --value "$IMAGE_TAG" --type "String" --overwrite

      # create an ecr repository if it does not exist
      - |
        aws ecr describe-repositories --repository-names $ECR_REPOSITORY_NAME --region $AWS_DEFAULT_REGION || \
        aws ecr create-repository --repository-name $ECR_REPOSITORY_NAME --region $AWS_DEFAULT_REGION

  build:
    commands:
      - echo "=== Build Phase Started ==="
      # build the docker image and tag it with the repository name so the tag/push steps below resolve
      - docker build -t $ECR_REPOSITORY_NAME:latest -f Dockerfile.lambda .
      - docker tag $ECR_REPOSITORY_NAME:latest $REPOSITORY_URI:$IMAGE_TAG
      - docker images | grep $ECR_REPOSITORY_NAME

  post_build:
    commands:
      - echo "=== Post-build Phase Started ==="

      # push the docker image to ecr
      - docker push ${REPOSITORY_URI}:${IMAGE_TAG}

artifacts:
  files:
    - '**/*'
  name: ml-sales-prediction-$(date +%Y-%m-%d)

cache:
  paths:
    - '/root/.cache/pip/**/*'

After a successful test and build run from the GitHub Actions workflow triggered by a push to the GitHub repository, the CodeBuild project now has a build history:

Figure D. Screenshot of the CodeBuild console
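The build history can also be confirmed from the terminal; a quick sketch using the AWS CLI, assuming the project name pj-sales-pred used above:

# list the most recent builds of the project
aws codebuild list-builds-for-project --project-name pj-sales-pred --max-items 5

# inspect the status and phases of a specific build from the list
aws codebuild batch-get-builds --ids "<BUILD ID FROM THE LIST>" \
  --query 'builds[0].{Status:buildStatus,Phases:phases[].phaseType}'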

This concludes the build_test.yml workflow.

Deployment Workflow

After a human review of the build results, the container image is deployed as a Lambda function using another GitHub Actions workflow.

The process involves:

  • Environment Setup

    • Setting up Python,

    • Installing dependencies,

    • Configuring AWS credentials using OIDC, and

    • Extracting a shortened version of the Git commit SHA into the SHORT_SHA

  • Deployment

    • Checking if the Lambda function exists,

    • Retrieving the latest image tag from the SSM parameter store,

    • If the image tag is found, update the Lambda function with the retrieved image,

    • If not, start a new CodeBuild run to rebuild the container image, and

    • Update the Lambda function with the newly built container image.

  • Verification and Testing:

    • Check if the Lambda function is updated, and

    • Test the updated Lambda function.

  • Configuration Update:

    • After a successful test run, update the environment variables for the Lambda function, and

    • Clean up the temporary files, ensuring a clean state for the next run.

.github/workflows/deploy.yml

name: Deploy Containerized Lambda

on:
  workflow_dispatch: # manual run
    inputs:
      branch:
        description: 'The branch to deploy from'
        required: true
        default: 'develop'
        type: choice
        options:
          - main
          - develop

env:
  GITHUB_SHA: ${{ github.sha }}

permissions:
  id-token: write
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
    ### environment setup ###
    - name: checkout code
      uses: actions/checkout@v4
      with:
        ref: ${{ github.event.inputs.branch }}

    - name: set up python
      uses: actions/setup-python@v5
      with:
        python-version: '3.12'
        cache: 'pip'

    # configure aws credentials using oidc
    - name: configure aws credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        aws-region: ${{ secrets.AWS_REGION_NAME }}
        role-to-assume: ${{ secrets.AWS_IAM_ROLE_ARN }}

    - name: set environment variables
      run: |
        echo "SHORT_SHA=${GITHUB_SHA::8}" >> $GITHUB_ENV

    ### deployment ###
    - name: check lambda function exists
      run: |
        aws lambda get-function --function-name ${{ secrets.LAMBDA_FUNCTION_NAME }} --region ${{ secrets.AWS_REGION_NAME }}

    - name: retrieve image tag and validate image
      id: validate_image
      run: |
        IMAGE_TAG=$(aws ssm get-parameter --name "/my-app/image-tag" --query "Parameter.Value" --output text || echo "")
        echo "IMAGE_TAG=$IMAGE_TAG" >> $GITHUB_ENV

        if [[ -z "$IMAGE_TAG" ]]; then
          echo "has_image=false" >> $GITHUB_OUTPUT
        else
          echo "... checking for image with tag: $IMAGE_TAG"
          IMAGE_URI=${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_REGION_NAME }}.amazonaws.com/${{ secrets.ECR_REPOSITORY }}:${IMAGE_TAG}

          if aws ecr describe-images --repository-name ${{ secrets.ECR_REPOSITORY }} --image-ids imageTag=$IMAGE_TAG --region ${{ secrets.AWS_REGION_NAME }} > /dev/null 2>&1; then
            echo "has_image=true" >> $GITHUB_OUTPUT
            echo "IMAGE_URI=$IMAGE_URI" >> $GITHUB_OUTPUT
          else
            echo "has_image=false" >> $GITHUB_OUTPUT
          fi
        fi

    - name: update lambda function with existing image
      if: ${{ steps.validate_image.outputs.has_image == 'true' }}
      run: |
        aws lambda update-function-code \
          --function-name ${{ secrets.LAMBDA_FUNCTION_NAME }} \
          --region ${{ secrets.AWS_REGION_NAME }} \
          --image-uri ${{ steps.validate_image.outputs.IMAGE_URI }}
        echo "...lambda function updated with existing image ..."

    - name: start codebuild for container build
      if: ${{ steps.validate_image.outputs.has_image == 'false' }} # run only when the image is not found
      uses: aws-actions/aws-codebuild-run-build@v1
      id: codebuild
      with:
        project-name: ${{ secrets.CODEBUILD_PROJECT }}
        source-version-override: ${{ github.event.inputs.branch }}
        env-vars-for-codebuild: |
          [
            {
              "name": "GITHUB_REF",
              "value": "refs/heads/${{ github.event.inputs.branch }}"
            },
            {
              "name": "BRANCH_NAME",
              "value": "${{ github.event.inputs.branch }}"
            },
            {
              "name": "ECR_REPOSITORY_NAME",
              "value": "${{ secrets.ECR_REPOSITORY }}"
            },
            {
              "name": "LAMBDA_FUNCTION_NAME",
              "value": "${{ secrets.LAMBDA_FUNCTION_NAME }}"
            }
          ]

    - name: update lambda function with a new image (after build)
      if: ${{ steps.validate_image.outputs.has_image == 'false' }} # run only when the image is not found
      run: |
        # look up the tag of the most recently pushed image and rebuild its uri
        LATEST_TAG=$(aws ecr describe-images \
          --repository-name ${{ secrets.ECR_REPOSITORY }} \
          --region ${{ secrets.AWS_REGION_NAME }} \
          --query 'sort_by(imageDetails,&imagePushedAt)[-1].imageTags[0]' \
          --output text)

        if [[ -z "$LATEST_TAG" || "$LATEST_TAG" == "None" ]]; then
          echo "... failed to retrieve the new image tag ..."
          exit 1
        fi

        LATEST_IMAGE_URI=${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_REGION_NAME }}.amazonaws.com/${{ secrets.ECR_REPOSITORY }}:${LATEST_TAG}

        aws lambda update-function-code \
          --function-name ${{ secrets.LAMBDA_FUNCTION_NAME }} \
          --region ${{ secrets.AWS_REGION_NAME }} \
          --image-uri "$LATEST_IMAGE_URI"
        echo "... lambda function updated with newly built image ..."

    ### verification and testing ###
    - name: verify lambda updates
      run: |
        CURRENT_IMAGE=$(aws lambda get-function \
          --function-name ${{ secrets.LAMBDA_FUNCTION_NAME }} \
          --region ${{ secrets.AWS_REGION_NAME }} \
          --query 'Code.ImageUri' \
          --output text)

        if [[ $CURRENT_IMAGE == *"dkr.ecr"* ]]; then
          echo "✅ lambda function successfully updated with new image"
        else
          echo "❌ lambda function update may have failed"
          exit 1
        fi

    - name: test lambda function
      if: github.event.inputs.branch == 'main'
      run: |
        aws lambda invoke \
          --function-name ${{ secrets.LAMBDA_FUNCTION_NAME }} \
          --region ${{ secrets.AWS_REGION_NAME }} \
          --payload '{"test": true}' \
          --cli-binary-format raw-in-base64-out \
          response.json

        cat response.json

    ### update the lambda func env vars ###
    - name: update lambda environment variables
      if: github.event.inputs.branch == 'main'
      run: |
        DEPLOY_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)
        IMAGE_TAG="${{ env.IMAGE_TAG }}"

        echo "ENVIRONMENT: production"
        echo "VERSION: $IMAGE_TAG"
        echo "DEPLOY_TIME: $DEPLOY_TIME"

        # wait until the function finishes the code update before changing its configuration
        MAX_ATTEMPTS=30
        ATTEMPT=1

        while [ $ATTEMPT -le $MAX_ATTEMPTS ]; do
          echo "... attempt $ATTEMPT/$MAX_ATTEMPTS: checking function state ..."

          FUNCTION_STATE=$(aws lambda get-function \
            --function-name ${{ secrets.LAMBDA_FUNCTION_NAME }} \
            --region ${{ secrets.AWS_REGION_NAME }} \
            --query 'Configuration.State' \
            --output text)

          LAST_UPDATE_STATUS=$(aws lambda get-function \
            --function-name ${{ secrets.LAMBDA_FUNCTION_NAME }} \
            --region ${{ secrets.AWS_REGION_NAME }} \
            --query 'Configuration.LastUpdateStatus' \
            --output text)

          if [ "$FUNCTION_STATE" = "Active" ] && [ "$LAST_UPDATE_STATUS" = "Successful" ]; then
            echo "✅ function is ready for configuration update"
            break
          elif [ "$LAST_UPDATE_STATUS" = "Failed" ]; then
            echo "❌ function update failed"
            exit 1
          else
            echo "function not ready yet, waiting 30 seconds..."
            sleep 30
            ATTEMPT=$((ATTEMPT + 1))
          fi
        done

        if [ $ATTEMPT -gt $MAX_ATTEMPTS ]; then
          echo "❌ Timeout waiting for function to be ready"
          exit 1
        fi

        aws lambda update-function-configuration \
          --function-name ${{ secrets.LAMBDA_FUNCTION_NAME }} \
          --region ${{ secrets.AWS_REGION_NAME }} \
          --environment "Variables={ENVIRONMENT=production,VERSION=$IMAGE_TAG,DEPLOY_TIME=$DEPLOY_TIME}"

    # clean up temp files to keep a clean state for the next run
    - name: cleanup
      if: always()
      run: |
        echo "=== Cleanup ==="
        rm -f response.json
        echo "✅ cleanup completed"

That’s all for the Infrastructure CI/CD pipeline integration.

Next, I'll configure Grafana for more advanced monitoring. This step is optional, as AWS CloudWatch can also cover your monitoring needs.

Monitoring with Grafana

Lastly, I’ll configure Grafana for advanced logging and monitoring on top of AWS CloudWatch.

Grafana is an open-source data visualization and analytics tool.

It allows you to query, visualize, alert on, and understand your metrics no matter where they are stored.

The configuration process involves:

  • Creating an IAM User,

  • Attaching roles and policies to the IAM User, and

  • Connecting the data source to Grafana.

Creating an AWS IAM User for Grafana

First, I’ll add a new IAM User dedicated to the Grafana integration and grant it read-only access to various AWS services.

This segregates the permissions to only the specific resources Grafana needs to access, following the principle of least privilege.

Visit IAM console > User > Create user > Add user name “grafana” > Attach policies directly > Create policy > JSON:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListAllLogGroups",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AccessSpecificLogGroups",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogStreams",
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
                "logs:StartQuery",
                "logs:StopQuery",
                "logs:GetQueryResults",
                "logs:DescribeMetricFilters",
                "logs:GetLogGroupFields",
                "logs:DescribeExportTasks",
                "logs:DescribeDestinations"
            ],
            "Resource": [
                "arn:aws:logs:*:<AWS ACCOUNT ID>:log-group:/aws/lambda/*",
                "arn:aws:logs:*:<AWS ACCOUNT ID>:log-group:/aws/codebuild/*",
                "arn:aws:logs:*:<AWS ACCOUNT ID>:log-group:/aws/apigateway/*",
                "arn:aws:logs:*:<AWS ACCOUNT ID>:log-group:<RDS NAME>*",
                "arn:aws:logs:*:<AWS ACCOUNT ID>:log-group:<PROJECT NAME>*",
                "arn:aws:logs:*:<AWS ACCOUNT ID>:log-group:application/*",
                "arn:aws:logs:*:<AWS ACCOUNT ID>:log-group:custom/*"
            ]
        },
        {
            "Sid": "CloudWatchLogsQueryOperations",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeQueries",
                "logs:DescribeResourcePolicies",
                "logs:DescribeSubscriptionFilters"
            ],
            "Resource": "*"
        },
        {
            "Sid": "CloudWatchMetricsAccess",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricStatistics",
                "cloudwatch:GetMetricData",
                "cloudwatch:ListMetrics",
                "cloudwatch:DescribeAlarms",
                "cloudwatch:DescribeAlarmsForMetric",
                "cloudwatch:GetDashboard",
                "cloudwatch:ListDashboards",
                "cloudwatch:DescribeAlarmHistory",
                "cloudwatch:GetMetricWidgetImage",
                "cloudwatch:ListTagsForResource"
            ],
            "Resource": "*"
        },
        {
            "Sid": "EC2DescribeAccess",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:DescribeRegions",
                "ec2:DescribeTags",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "ec2:DescribeVolumes",
                "ec2:DescribeNetworkInterfaces"
            ],
            "Resource": "*"
        },
        {
            "Sid": "ResourceGroupsAccess",
            "Effect": "Allow",
            "Action": [
                "resource-groups:ListGroups",
                "resource-groups:GetGroup",
                "resource-groups:ListGroupResources",
                "resource-groups:SearchResources"
            ],
            "Resource": "*"
        },
        {
            "Sid": "LambdaDescribeAccess",
            "Effect": "Allow",
            "Action": [
                "lambda:ListFunctions",
                "lambda:GetFunction",
                "lambda:ListTags",
                "lambda:GetAccountSettings",
                "lambda:ListEventSourceMappings"
            ],
            "Resource": "*"
        },
        {
            "Sid": "APIGatewayDescribeAccess",
            "Effect": "Allow",
            "Action": [
                "apigateway:GET"
            ],
            "Resource": [
                "arn:aws:apigateway:*::/restapis",
                "arn:aws:apigateway:*::/restapis/*/stages",
                "arn:aws:apigateway:*::/restapis/*/resources",
                "arn:aws:apigateway:*::/domainnames",
                "arn:aws:apigateway:*::/usageplans"
            ]
        },
        {
            "Sid": "ECSDescribeAccess",
            "Effect": "Allow",
            "Action": [
                "ecs:ListClusters",
                "ecs:DescribeClusters",
                "ecs:ListServices",
                "ecs:DescribeServices",
                "ecs:ListTasks",
                "ecs:DescribeTasks"
            ],
            "Resource": "*"
        },
        {
            "Sid": "RDSDescribeAccess",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBInstances",
                "rds:DescribeDBClusters",
                "rds:ListTagsForResource"
            ],
            "Resource": "*"
        },
        {
            "Sid": "TaggingAccess",
            "Effect": "Allow",
            "Action": [
                "tag:GetResources",
                "tag:GetTagKeys",
                "tag:GetTagValues"
            ],
            "Resource": "*"
        },
        {
            "Sid": "XRayAccess",
            "Effect": "Allow",
            "Action": [
                "xray:BatchGetTraces",
                "xray:GetServiceGraph",
                "xray:GetTimeSeriesServiceStatistics",
                "xray:GetTraceSummaries"
            ],
            "Resource": "*"
        },
        {
            "Sid": "SNSAccess",
            "Effect": "Allow",
            "Action": [
                "sns:ListTopics",
                "sns:GetTopicAttributes"
            ],
            "Resource": "*"
        },
        {
            "Sid": "SQSAccess",
            "Effect": "Allow",
            "Action": [
                "sqs:ListQueues",
                "sqs:GetQueueAttributes"
            ],
            "Resource": "*"
        }
    ]
}
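The same user setup can be scripted with the AWS CLI; a sketch, assuming the policy JSON above is saved locally as grafana_policy.json (this variant attaches it as an inline policy for brevity):

# create the dedicated iam user for grafana
aws iam create-user --user-name grafana

# attach the read-only policy defined above as an inline policy
aws iam put-user-policy --user-name grafana \
  --policy-name grafana_monitoring_readonly \
  --policy-document file://grafana_policy.json

# generate the access keys used later in the grafana data source configuration
aws iam create-access-key --user-name grafana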

Attaching Roles and Policies

After creating the IAM User, I’ll create a standalone policy by visiting IAM console > Policies > Create policy > JSON and pasting the same policy JSON attached to the IAM User in the previous step.

Then, configure a new role by visiting IAM console > Roles > Create role > Custom trust policy:

1{
2    "Version": "2012-10-17",
3    "Statement": [
4        {
5            "Effect": "Allow",
6            "Principal": {
7                "AWS": "arn:aws:iam::<AWS ACCOUNT ID>:user/grafana"
8            },
9            "Action": "sts:AssumeRole"
10        }
11    ]
12}
13

Then, attach the policy created in the previous step.

The IAM role trust policy defines who can assume a specific IAM role, answering the question: “Who is trusted to use this role’s permissions?”

I configured the trust policy to allow the grafana IAM User created in the first step to assume the role.
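As an optional check, you can confirm the trust relationship works before touching Grafana; a sketch, assuming the grafana user's access keys are stored in an AWS CLI profile named grafana:

# should return temporary credentials if the trust policy is set up correctly
aws sts assume-role \
  --role-arn arn:aws:iam::<AWS ACCOUNT ID>:role/<GRAFANA ROLE NAME> \
  --role-session-name grafana-trust-check \
  --profile grafana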

Connecting the Data Source to Grafana

Lastly, I’ll configure the data source for Grafana.

Visit the Grafana console > Data sources > CloudWatch > Add:

  • Access Key ID: Access key ID of the IAM User grafana

  • Secret Access Key: Secret access key of the IAM User grafana

  • Assume Role ARN: ARN of the IAM Role created in the previous step.

  • Click save & test.
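If Grafana is managed with provisioning files rather than the UI, the same data source can be declared in YAML; a sketch with placeholder values (the file location and role name are assumptions):

provisioning/datasources/cloudwatch.yml (A hypothetical provisioning entry)

apiVersion: 1
datasources:
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: keys                 # authenticate with the grafana user's access keys
      defaultRegion: <AWS REGION>
      assumeRoleArn: arn:aws:iam::<AWS ACCOUNT ID>:role/<GRAFANA ROLE NAME>
    secureJsonData:
      accessKey: <ACCESS KEY ID>
      secretKey: <SECRET ACCESS KEY>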

This allows Grafana to import data from the relevant AWS resources:

Figure E. Screenshot of the Grafana dashboard

Wrapping Up

In this article, we demonstrated how to integrate a robust CI/CD pipeline into a machine learning application.

While the specific services used may vary depending on the project's needs, the principles remain the same: automating the process and detecting errors as early as possible before the actual deployment.

The next step would be to extend this pipeline to include the crucial aspects of model and data CI/CD.

Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Share What You Learned

Kuriko IWAI, "Building an Automated CI/CD Pipeline for Serverless Machine Learning on AWS" in Kernel Labs

https://kuriko-iwai.com/integrating-cicd-pipelines

Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.