Data Pipeline Architecture: From Traditional DWH to Modern Lakehouse

A practical guide to transforming raw data into actionable predictions


By Kuriko IWAI


Table of Contents

Introduction
What is Data Pipeline Architecture
The Traditional Data Warehouse
Adding a Staging Area
The Cloud-Native Data Lake
Standard ELT Approach
Push ELT Approach
EtLT (Extract, light transform, Load, and Transform) Approach
Data Ingestion Patterns
Lambda Architecture
Kappa Architecture
The Modern Lakehouse
Wrapping Up

Introduction

A data pipeline architecture serves as the strategic blueprint for transforming raw data into actionable predictions.

But designing the architecture seems complex because it involves numerous components, and the specific choices for each are driven by the data's characteristics and business needs.

In this article, I’ll structure these components and explore three common patterns:

  • The Traditional Data Warehouse,

  • The Cloud-Native Data Lake, and

  • The Modern Lakehouse,

using stock price prediction as a practical use case.

What is Data Pipeline Architecture

Data pipeline architectures define the structure, components, and flow of data from ingestion to a usable state for analytics and machine learning.

The diagram below shows the key components of a data pipeline architecture and the major options for each:

Figure A. Key components and major options in data pipeline architecture (Created by Kuriko IWAI)

Kernel Labs | Kuriko IWAI | kuriko-iwai.com


Key components include:

  • Data Source: The origin of the data,

  • Ingestion: The method of collecting and bringing data into the system,

  • Storage: Where the data is housed,

  • Processing: The transformation and cleaning of data,

  • Serving: Making the processed data accessible to end-users and applications, and

  • Governance: Ensuring data quality, security, privacy, and compliance.

The processing component itself involves two sub-components:

  • Loading strategies like full load, incremental load, and delta load, and

  • Data transformation like cleaning, imputation, and preprocessing.
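To make the loading strategies concrete, here is a toy sketch in plain Python, with a dict standing in for the warehouse table; the row shape and function names are illustrative, not a real framework API:

```python
from datetime import date

# Toy warehouse table, keyed by (trade_date, ticker).
warehouse = {}

def full_load(source_rows):
    """Full load: replace the entire table with the source snapshot."""
    warehouse.clear()
    warehouse.update({(r["trade_date"], r["ticker"]): r for r in source_rows})

def incremental_load(source_rows, since):
    """Incremental load: append only rows newer than the last loaded date."""
    for r in source_rows:
        if r["trade_date"] > since:
            warehouse[(r["trade_date"], r["ticker"])] = r

def delta_load(changed_rows):
    """Delta load: upsert only rows that changed at the source."""
    for r in changed_rows:
        warehouse[(r["trade_date"], r["ticker"])] = r
```

A full load is simplest but rewrites everything; incremental and delta loads touch only new or changed rows, which is what makes scheduled batch updates affordable at scale.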

Choices are driven by the data characteristics, including its variety, volume, and velocity, as well as by specific business requirements.

  • Variety: The diversity of data (structured, semi-structured, unstructured) influences the storage choice.

  • Volume: The sheer amount of data dictates the need for scalable, distributed technologies (e.g., Spark, Hadoop) and cost-effective storage solutions (like cloud object storage).

  • Velocity: The speed at which data is generated and processed determines whether it should be a real-time streaming architecture for high velocity or a batch processing one for low velocity.

  • Business requirements are the ultimate guiding force.

Because some options across components are strongly related, I’ll present three common combinations in the next section, using stock price prediction as an example.

The Traditional Data Warehouse

The first combination is a traditional, data warehouse architecture using an ETL (Extract, Transform, Load) approach.

The diagram below shows its standard architecture, where the original data is extracted, transformed into a structured form, and then loaded to fit a predefined schema in the data warehouse:

Figure B. Standard ETL / DWH architecture (Created by Kuriko IWAI)


Typical options selected in each component are:

  • Source: Structured, batch

  • Ingestion: Batch

  • Storage: Data warehouse

  • Processing: ETL (Extract, Transform, Load)

  • Serving: BI, Low frequency reports

The ETL process rigorously cleans and transforms data before loading, which ensures:

  • Access to stable, well-defined data sources,

  • High level of accuracy and consistency, and

  • Very fast query performance.
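A minimal ETL sketch, using Python's built-in sqlite3 as a stand-in for the data warehouse; the sample rows and schema are hypothetical. The key point is that cleaning and validation happen before anything reaches the warehouse:

```python
import sqlite3

def extract():
    # In practice this reads from source systems; here, in-memory sample rows.
    return [
        {"ticker": "ACME", "close": "101.5", "trade_date": "2024-01-02"},
        {"ticker": "ACME", "close": "n/a", "trade_date": "2024-01-03"},  # invalid row
    ]

def transform(rows):
    # Transform BEFORE loading: enforce types, reject rows that fail validation.
    clean = []
    for r in rows:
        try:
            clean.append((r["ticker"], float(r["close"]), r["trade_date"]))
        except ValueError:
            continue  # quarantine invalid records instead of loading them
    return clean

def load(rows, conn):
    # Load into a predefined schema in the warehouse.
    conn.execute("CREATE TABLE IF NOT EXISTS prices (ticker TEXT, close REAL, trade_date TEXT)")
    conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
```

Because only validated rows are loaded, queries against the warehouse never see the malformed record, which is exactly the accuracy and consistency guarantee described above.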

Drawbacks include:

  • Data types: Not suitable for unstructured or semi-structured data like images or text.

  • Cost: Maintaining a data warehouse can be expensive.

  • Latency: The batch process means data is updated on a scheduled basis, like daily or weekly; not suitable for real-time inference.

Use Case - Stock Price Prediction

The architecture is best suited for long-term forecasting based on quarterly or annual financial reports alongside historical end-of-day stock prices.

Figure C. Long-term forecasting on a traditional data warehouse architecture (Created by Kuriko IWAI)


In the diagram, the architecture first populates the data warehouse with historical data over decades.

Then, daily stock trading volumes are incrementally loaded via scheduled batch processes (e.g., daily or weekly).

When financial records are adjusted, the batch process also performs a delta load to update the data warehouse.

Then, the structured data is used to train a model that serves predictions for low-frequency decision-making.

This structure is not suitable for real-time stock prediction, as the scheduled batch process introduces latency between data arrival and prediction.

Adding a Staging Area

An advanced approach uses a staging area to store extracted data before transformation by SQL queries:

Figure D. ETL / DWH architecture with a staging area (Created by Kuriko IWAI)


The main difference lies in the isolation and efficiency of the transformation process.

Without a staging area, transformations are done directly on the source system or within the target data warehouse. This can be inefficient and risky:

  • Source System Overload: Complex transformations can slow down the source systems, impacting core business operations.

  • Data Warehouse Bottlenecks: Transformations consume the warehouse's computing resources, slowing down queries and reports.

A staging area can offload the heavy transformation process from the data warehouse or the source system by loading raw data into a temporary storage space like a dedicated S3 bucket.

Then, a separate processing engine like Apache Spark runs transformations without affecting the source systems or the data warehouse.

Although it increases operational complexity, other perks include:

  • Error handling: If a transformation fails, the data warehouse is untouched; simply rerun the transformation on the raw data kept in the staging area.

  • Data quality control: Only high-quality data is loaded into the data warehouse, because multi-step transformations like cleaning, feature engineering, and preprocessing run in the staging area first.
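The staging flow can be sketched as follows, with a temporary directory standing in for a dedicated staging bucket (all file names, fields, and transformation steps are illustrative):

```python
import json
import pathlib
import tempfile

staging = pathlib.Path(tempfile.mkdtemp())  # stands in for a staging S3 bucket

def land_raw(batch_id, rows):
    # Step 1: dump the raw extract untouched into the staging area.
    (staging / f"{batch_id}.json").write_text(json.dumps(rows))

def transform_in_staging(batch_id):
    # Step 2: multi-step transformation runs here, off the source and warehouse.
    rows = json.loads((staging / f"{batch_id}.json").read_text())
    cleaned = [r for r in rows if r.get("close") is not None]  # cleaning step
    for r in cleaned:
        r["close_usd"] = round(float(r["close"]), 2)           # light feature step
    return cleaned  # only this validated output is loaded downstream

land_raw("2024-01-02", [{"ticker": "ACME", "close": "101.45"},
                        {"ticker": "ACME", "close": None}])
curated = transform_in_staging("2024-01-02")
```

Because the raw file persists in staging, a failed transformation can simply be rerun without touching the source systems or the warehouse.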

The Cloud-Native Data Lake

The second combination is a cloud-native data lake architecture.

This architecture is flexible and cost-effective, ideal for handling massive volumes of diverse data, including unstructured data.

There are three major approaches:

  • Standard ELT (Extract, Load, Transform),

  • Push ELT, and

  • EtLT (Extract, light transform, Load, Transform).

Standard ELT Approach

The standard approach leverages ELT processing:

Figure E-1. Standard ELT / data lake architecture (Created by Kuriko IWAI)


First, the original data is extracted into the data lake, then loaded into the data warehouse, where it is transformed.
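In contrast to ETL, the transformation here runs inside the warehouse engine after loading. A minimal sketch with sqlite3 standing in for the warehouse (the schema and rows are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw rows land in the warehouse as-is, with no upfront cleaning.
conn.execute("CREATE TABLE raw_prices (ticker TEXT, close TEXT, trade_date TEXT)")
conn.executemany(
    "INSERT INTO raw_prices VALUES (?, ?, ?)",
    [("ACME", "101.5", "2024-01-02"), ("ACME", "n/a", "2024-01-03")],
)

# Transform: SQL runs inside the warehouse engine after loading.
# (In SQLite, CAST of a non-numeric string yields 0, so the filter drops it.)
conn.execute("""
    CREATE TABLE clean_prices AS
    SELECT ticker, CAST(close AS REAL) AS close, trade_date
    FROM raw_prices
    WHERE CAST(close AS REAL) > 0
""")
```

Keeping the transformation as SQL inside the warehouse is what makes it easy to store, track, and review, as noted under manageability below.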

Push ELT Approach

An alternative data ingestion approach for a data lake is the Push method, where external sources push data directly to the data lake:

Figure E-2. Push ELT / data lake architecture (Created by Kuriko IWAI)


This approach may result in limited control over data extraction, requiring coordination with teams responsible for data sources in cases of missing or corrupted data.

EtLT (Extract, light transform, Load, and Transform) Approach

Data extracted from sources may contain confidential data that should not be accessible to unauthorized individuals.

An EtLT approach includes an additional 'light' transformation step where sensitive information is masked or encrypted before loading data into the data lake:

Figure E-3. EtLT architecture (Created by Kuriko IWAI)

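The 'light' transform can be as simple as hashing sensitive fields before records land in the lake. A sketch, assuming the sensitive field names are known in advance (all names are illustrative):

```python
import hashlib

SENSITIVE = {"account_id", "email"}  # fields that must never land in the lake as-is

def light_transform(record):
    """Mask sensitive fields before the record enters the data lake."""
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE:
            # One-way hash: still joinable across tables, but not reversible.
            masked[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            masked[key] = value
    return masked

row = light_transform({"account_id": "AC-991", "ticker": "ACME", "qty": 100})
```

Hashing (rather than dropping) the field preserves its use as a join key while keeping the raw identifier out of the lake; reversible encryption is the alternative when the original value must be recoverable.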

In each approach, the combination of a data lake and a data warehouse makes it easy to apply different analysis techniques to the original data stored in the data lake.

Typical options selected in each component are:

  • Source: Versatile in data structure; streaming.

  • Ingestion: Suitable for streaming, but batch can be an option.

  • Storage: Data lake + data warehouse

  • Processing: ELT or EtLT

Although each approach introduces complexity in managing multiple tools, other perks include:

  • Scalability: Separation of ingestion and transformation processes enhances scalability and flexibility.

  • Manageability: Easy to store, track, and review SQL queries (transformation).

Data Ingestion Patterns

All approaches can be used for both batch and streaming pipelines.

However, because ELT and EtLT load first and transform later, they are especially well suited to the real-time needs of streaming data.

Hybrid architectures like Lambda and Kappa, meanwhile, are designed to seamlessly combine batch and streaming ingestion for comprehensive data processing.

Let us take a look.

Lambda Architecture

Lambda architecture uses a dual-path approach where the batch layer follows an ETL process to transform large historical datasets, and the speed layer handles real-time data suited for an ELT or EtLT approach.

Figure F. Lambda architecture (Created by Kuriko IWAI)


The architecture can serve both real-time and batch access to the predictions depending on the business needs.

Use Case - Stock Price Prediction

Using the same stock price prediction case, the Lambda architecture can be extended to serve real-time predictions via the speed layer, while the batch layer serves long-term forecasting.

The predictions from both layers are combined and served to the user, providing both a stable, long-term outlook and a volatile, real-time forecast.
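Merging the two layers at query time can be sketched as follows, with dicts standing in for the batch and real-time serving views (the values and key names are illustrative):

```python
def serve(batch_view, speed_view, key):
    """Merge the layers at query time: speed-layer results override stale batch values."""
    merged = dict(batch_view)
    merged.update(speed_view)  # real-time results take precedence
    return merged.get(key)

batch_view = {"ACME": 101.5}   # recomputed nightly over full history (batch layer, ETL)
speed_view = {"ACME": 103.2}   # updated per event from the stream (speed layer, ELT)
```

Tickers present only in the batch view fall back to the stable long-term forecast, while tickers with fresh events get the real-time value.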

Kappa Architecture

Kappa architecture simplifies Lambda by using a single, unified stream processing pipeline for both real-time and historical data.

Figure H. Kappa architecture (Created by Kuriko IWAI)


Kappa utilizes an ELT model, where all data from different sources is loaded as a stream and then transformed by the single processing engine.
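A sketch of the single-pipeline idea: one transform handles live events and historical replays alike (the rolling-mean feature and window size are illustrative):

```python
from collections import deque

class StreamProcessor:
    """One pipeline for everything: history is just a replay of old events."""

    def __init__(self, window=3):
        self.window = deque(maxlen=window)

    def process(self, event):
        # Every event, whether live or replayed from the log, flows through
        # the same transformation code path.
        self.window.append(event["price"])
        return sum(self.window) / len(self.window)  # rolling-mean feature

proc = StreamProcessor(window=3)
for event in [{"price": 100.0}, {"price": 102.0}, {"price": 104.0}]:
    latest = proc.process(event)
```

Reprocessing historical data means replaying the event log through the same processor, which is what eliminates Lambda's duplicated batch/speed code paths.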

Use Case - Stock Price Prediction

In the same use case of the stock price prediction, when the company prioritizes reducing development complexity and operational overhead while providing both real-time predictions and long-term forecasting, Kappa is the best choice.

The Modern Lakehouse

The last combination is the lakehouse architecture.

The lakehouse architecture aims to combine the best features of both data warehouse and data lake, creating a unified platform that handles both structured and unstructured data:

Figure I. Lakehouse architecture with a medallion structure (Created by Kuriko IWAI)

Typical options selected in each component are:

  • Source: Versatile for both structure and data flow.

  • Ingestion: Batch, Stream

  • Storage: Lakehouse

  • Processing: ELT (Extract, Load, Transform)


In the lakehouse, the Bronze Layer works as a data lake where raw data from external sources is ingested.

Then the Silver Layer handles transformation of data by cleaning and structuring the raw data.

The Gold Layer builds curated tables for a specific project, performing feature engineering to produce the features needed for predictions.

This combination provides both the flexibility of a data lake and the reliability and performance of a data warehouse within a single, simplified system.

Major advantages include:

  • Unified storage: The Lakehouse can store all types of data—structured, unstructured, and semi-structured—in a single platform.

  • Cost-effective: Leverages low-cost, cloud-based object storage like S3.

  • Openness: Avoids vendor lock-in by using open-source technologies like Apache Spark and Delta Lake, and open file formats like Parquet.

This architecture involves complexity in implementation and data governance.

Use Case - Stock Price Prediction

The architecture handles both historical data and real-time streams in a single platform.

The medallion structure, with Bronze, Silver, and Gold layers, progressively refines raw data into high-quality features.

Let’s say we have raw data from three data sources:

  • APIs - Unstructured stock prices

  • RSS feeds - Unstructured news

  • Internal database - Structured financial records

Bronze Layer:

The bronze layer works as a data lake to store the raw data from multiple data sources.

Silver Layer:

Then, the silver layer cleans and structures the raw data:

  • Running a query to join all stock prices with the corresponding financial records,

  • Cleansing messy text data from news feeds, and

  • Extracting news articles for a given date range.
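The three silver-layer steps above can be sketched in plain Python (the sample records, field names, and cleansing rules are all hypothetical):

```python
import re
from datetime import date

# Toy bronze-layer inputs from the three sources.
prices = [{"ticker": "ACME", "trade_date": date(2024, 1, 2), "close": 101.5}]
financials = {"ACME": {"eps": 2.1}}
news = [{"published": date(2024, 1, 2), "body": "  ACME <b>beats</b>   estimates "}]

# Step 1: join stock prices with the matching financial records.
joined = [{**p, **financials[p["ticker"]]} for p in prices]

# Step 2: cleanse messy text, stripping markup and collapsing whitespace.
def cleanse(text):
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", "", text)).strip()

# Step 3: keep only articles within the date range of interest.
articles = [cleanse(n["body"]) for n in news
            if date(2024, 1, 1) <= n["published"] <= date(2024, 1, 31)]
```

In a real lakehouse these steps would typically run as Spark or SQL jobs over bronze tables, but the shape of the work is the same: join, cleanse, filter.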

Gold Layer:

The gold layer runs feature engineering like calculating 30-day moving averages, volatility measures, and a market sentiment score from the cleaned news data.

This final, highly-refined dataset is used to train a model.
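A sketch of the gold-layer feature computations, using a toy price series; the window sizes follow the 30-day example in the text, and the helper names are illustrative:

```python
import statistics

closes = [100.0 + i * 0.5 for i in range(40)]  # toy daily closing prices

def moving_average(series, window=30):
    """Mean of the last `window` closing prices."""
    return sum(series[-window:]) / min(window, len(series))

def volatility(series, window=30):
    """Sample standard deviation of daily returns over the window."""
    w = series[-(window + 1):]
    returns = [(b - a) / a for a, b in zip(w, w[1:])]
    return statistics.stdev(returns)

# Gold-layer feature columns for the training set.
ma30 = moving_average(closes)
vol30 = volatility(closes)
```

A sentiment score from the cleaned news text would be computed the same way, as one more column in the curated gold table alongside these price-based features.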

Wrapping Up

Data pipeline architectures play key roles in transforming raw data into meaningful predictions.

In this article, we learned that each of the three common architectures - the traditional data warehouse, the cloud-native data lake, and the modern lakehouse - has its own advantages and disadvantages.

The optimal architecture is not a one-size-fits-all solution but a strategic choice guided by a careful assessment of data characteristics and business objectives.


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.