Data Pipeline Architecture: From Traditional DWH to Modern Lakehouse

A practical guide to transforming raw data into actionable predictions


By Kuriko IWAI


Table of Contents

Introduction
What is Data Pipeline Architecture
The Traditional Data Warehouse
Adding a Staging Area
The Cloud-Native Data Lake
Standard ELT Approach
Push ELT Approach
EtLT (Extract, light transform, Load, and Transform) Approach
Data Ingestion Patterns
Lambda Architecture
Kappa Architecture
The Modern Lakehouse
Wrapping Up

Introduction

A data pipeline architecture serves as the strategic blueprint for transforming raw data into actionable predictions.

But designing the architecture seems complex because it involves numerous components, and the specific choices for each are driven by the data's characteristics and business needs.

In this article, I’ll structure these components and explore three common patterns:

  • The Traditional Data Warehouse,

  • The Cloud-Native Data Lake, and

  • The Modern Lakehouse,

using stock price prediction as a practical use case.

What is Data Pipeline Architecture

Data pipeline architectures define the structure, components, and flow of data from ingestion to a usable state for analytics and machine learning.

The diagram below shows the key components of a data pipeline architecture and the major options for each:

Figure A. Key components and major options in data pipeline architecture (Created by Kuriko IWAI)

Kernel Labs | Kuriko IWAI | kuriko-iwai.com


Key components include:

  • Data Source: The origin of the data,

  • Ingestion: The method of collecting and bringing data into the system,

  • Storage: Where the data is housed,

  • Processing: The transformation and cleaning of data,

  • Serving: Making the processed data accessible to end-users and applications, and

  • Governance: Ensuring data quality, security, privacy, and compliance.

The processing component itself involves two sub-components:

  • Loading strategies like full load, incremental load, and delta load, and

  • Data transformation like cleaning, imputation, and preprocessing.
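To make the loading strategies concrete, here is a toy sketch in plain Python, with a dict standing in for the warehouse table; the row shape and function names are illustrative, not a real framework API:

```python
from datetime import date

# Toy warehouse table, keyed by (trade_date, ticker).
warehouse = {}

def full_load(source_rows):
    """Full load: replace the entire table with the source snapshot."""
    warehouse.clear()
    warehouse.update({(r["trade_date"], r["ticker"]): r for r in source_rows})

def incremental_load(source_rows, since):
    """Incremental load: append only rows newer than the last loaded date."""
    for r in source_rows:
        if r["trade_date"] > since:
            warehouse[(r["trade_date"], r["ticker"])] = r

def delta_load(changed_rows):
    """Delta load: upsert only rows that changed at the source."""
    for r in changed_rows:
        warehouse[(r["trade_date"], r["ticker"])] = r
```

A full load is simplest but rewrites everything; incremental and delta loads touch only new or changed rows, which is what makes scheduled batch updates affordable at scale.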

Choices are driven by the data characteristics, including its variety, volume, and velocity, as well as by specific business requirements.

  • Variety: The diversity of data (structured, semi-structured, unstructured) influences the storage choice.

  • Volume: The sheer amount of data dictates the need for scalable, distributed technologies (e.g., Spark, Hadoop) and cost-effective storage solutions (like cloud object storage).

  • Velocity: The speed at which data is generated and processed determines whether it should be a real-time streaming architecture for high velocity or a batch processing one for low velocity.

  • Business requirements are the ultimate guiding force.

Because some options across components are strongly related, I’ll present three common combinations in the next section, using stock price prediction as an example.

The Traditional Data Warehouse

The first combination is a traditional, data warehouse architecture using an ETL (Extract, Transform, Load) approach.

The diagram below shows its standard architecture, where the original data is extracted, transformed into a structured form, and then loaded to fit a predefined schema in the data warehouse:

Figure B. Standard ETL / DWH architecture (Created by Kuriko IWAI)


Typical options selected in each component are:

  • Source: Structured, batch

  • Ingestion: Batch

  • Storage: Data warehouse

  • Processing: ETL (Extract, Transform, Load)

  • Serving: BI, Low frequency reports

The ETL process rigorously cleans and transforms data before loading, which ensures:

  • Access to stable, well-defined data sources,

  • High level of accuracy and consistency, and

  • Very fast query performance.
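A minimal ETL sketch, using Python's built-in sqlite3 as a stand-in for the data warehouse; the sample rows and schema are hypothetical. The key point is that cleaning and validation happen before anything reaches the warehouse:

```python
import sqlite3

def extract():
    # In practice this reads from source systems; here, in-memory sample rows.
    return [
        {"ticker": "ACME", "close": "101.5", "trade_date": "2024-01-02"},
        {"ticker": "ACME", "close": "n/a", "trade_date": "2024-01-03"},  # invalid row
    ]

def transform(rows):
    # Transform BEFORE loading: enforce types, reject rows that fail validation.
    clean = []
    for r in rows:
        try:
            clean.append((r["ticker"], float(r["close"]), r["trade_date"]))
        except ValueError:
            continue  # quarantine invalid records instead of loading them
    return clean

def load(rows, conn):
    # Load into a predefined schema in the warehouse.
    conn.execute("CREATE TABLE IF NOT EXISTS prices (ticker TEXT, close REAL, trade_date TEXT)")
    conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
```

Because only validated rows are loaded, queries against the warehouse never see the malformed record, which is exactly the accuracy and consistency guarantee described above.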

Drawbacks include:

  • Data types: Not suitable for unstructured or semi-structured data like images or text.

  • Cost: Maintaining a data warehouse can be expensive.

  • Latency: The batch process means data is updated on a scheduled basis, like daily or weekly; not suitable for real-time inference.

Use Case - Stock Price Prediction

The architecture is best suited for long-term forecasting based on quarterly or annual financial reports alongside historical end-of-day stock prices.

Figure C. Long-term forecasting on a traditional data warehouse architecture (Created by Kuriko IWAI)


In the diagram, the architecture first populates the data warehouse with historical data over decades.

Then, daily stock trading volumes are incrementally loaded via scheduled batch processes (e.g., daily or weekly).

When financial records are adjusted, the batch process also performs a delta load to update the data warehouse.

Then, the structured data is used to train a model that serves predictions for low-frequency decision-making.

This structure is not suitable for real-time stock prediction, as the scheduled batch process introduces latency between data arrival and prediction.

Adding a Staging Area

An advanced approach uses a staging area to store extracted data before transformation by SQL queries:

Figure D. ETL / DWH architecture with a staging area (Created by Kuriko IWAI)


The main difference lies in the isolation and efficiency of the transformation process.

Without a staging area, transformations are done directly on the source system or within the target data warehouse. This can be inefficient and risky:

  • Source System Overload: Complex transformations can slow down the source systems, impacting core business operations.

  • Data Warehouse Bottlenecks: Transformations consume the warehouse's computing resources, slowing down queries and reports.

A staging area can offload the heavy transformation process from the data warehouse or the source system by loading raw data into a temporary storage space like a dedicated S3 bucket.

Then, a separate processing engine like Apache Spark runs transformations without affecting the source systems or the data warehouse.

Although it increases operational complexity, other perks include:

  • Error handling: If a transformation fails, the data warehouse is untouched; simply rerun the transformation on the raw data kept in the staging area.

  • Data quality control: Only high-quality data is loaded into the data warehouse, because multi-step transformations like cleaning, feature engineering, and preprocessing run in the staging area first.
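The staging flow can be sketched as follows, with a temporary directory standing in for a dedicated staging bucket (all file names, fields, and transformation steps are illustrative):

```python
import json
import pathlib
import tempfile

staging = pathlib.Path(tempfile.mkdtemp())  # stands in for a staging S3 bucket

def land_raw(batch_id, rows):
    # Step 1: dump the raw extract untouched into the staging area.
    (staging / f"{batch_id}.json").write_text(json.dumps(rows))

def transform_in_staging(batch_id):
    # Step 2: multi-step transformation runs here, off the source and warehouse.
    rows = json.loads((staging / f"{batch_id}.json").read_text())
    cleaned = [r for r in rows if r.get("close") is not None]  # cleaning step
    for r in cleaned:
        r["close_usd"] = round(float(r["close"]), 2)           # light feature step
    return cleaned  # only this validated output is loaded downstream

land_raw("2024-01-02", [{"ticker": "ACME", "close": "101.45"},
                        {"ticker": "ACME", "close": None}])
curated = transform_in_staging("2024-01-02")
```

Because the raw file persists in staging, a failed transformation can simply be rerun without touching the source systems or the warehouse.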

The Cloud-Native Data Lake

The second combination is a cloud-native data lake architecture.

This architecture is flexible and cost-effective, ideal for handling massive volumes of diverse data, including unstructured data.

There are three major approaches:

  • Standard ELT (Extract, Load, Transform),

  • Push ELT, and

  • EtLT (Extract, light transform, Load, Transform).

Standard ELT Approach

The standard approach leverages ELT processing:

Figure E-1. Standard ELT / data lake architecture (Created by Kuriko IWAI)


First, the original data is extracted into the data lake, then loaded into the data warehouse, where it is transformed.
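In contrast to ETL, the transformation here runs inside the warehouse engine after loading. A minimal sketch with sqlite3 standing in for the warehouse (the schema and rows are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw rows land in the warehouse as-is, with no upfront cleaning.
conn.execute("CREATE TABLE raw_prices (ticker TEXT, close TEXT, trade_date TEXT)")
conn.executemany(
    "INSERT INTO raw_prices VALUES (?, ?, ?)",
    [("ACME", "101.5", "2024-01-02"), ("ACME", "n/a", "2024-01-03")],
)

# Transform: SQL runs inside the warehouse engine after loading.
# (In SQLite, CAST of a non-numeric string yields 0, so the filter drops it.)
conn.execute("""
    CREATE TABLE clean_prices AS
    SELECT ticker, CAST(close AS REAL) AS close, trade_date
    FROM raw_prices
    WHERE CAST(close AS REAL) > 0
""")
```

Keeping the transformation as SQL inside the warehouse is what makes it easy to store, track, and review, as noted under manageability below.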

Push ELT Approach

An alternative data ingestion approach for a data lake is the Push method, where external sources push data directly to the data lake:

Figure E-2. Push ELT / data lake architecture (Created by Kuriko IWAI)


This approach may result in limited control over data extraction, requiring coordination with teams responsible for data sources in cases of missing or corrupted data.

EtLT (Extract, light transform, Load, and Transform) Approach

Data extracted from sources may contain confidential data that should not be accessible to unauthorized individuals.

An EtLT approach includes an additional 'light' transformation step where sensitive information is masked or encrypted before loading data into the data lake:

Figure E-3. EtLT architecture (Created by Kuriko IWAI)

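The 'light' transform can be as simple as hashing sensitive fields before records land in the lake. A sketch, assuming the sensitive field names are known in advance (all names are illustrative):

```python
import hashlib

SENSITIVE = {"account_id", "email"}  # fields that must never land in the lake as-is

def light_transform(record):
    """Mask sensitive fields before the record enters the data lake."""
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE:
            # One-way hash: still joinable across tables, but not reversible.
            masked[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            masked[key] = value
    return masked

row = light_transform({"account_id": "AC-991", "ticker": "ACME", "qty": 100})
```

Hashing (rather than dropping) the field preserves its use as a join key while keeping the raw identifier out of the lake; reversible encryption is the alternative when the original value must be recoverable.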

In each approach, the combination of a data lake and a data warehouse makes it easy to apply different analysis techniques to the original data stored in the data lake.

Typical options selected in each component are:

  • Source: Versatile in data structure; streaming.

  • Ingestion: Suitable for streaming, but batch can be an option.

  • Storage: Data lake + data warehouse

  • Processing: ELT or EtLT

Although each approach introduces complexity in managing multiple tools, other perks include:

  • Scalability: Separation of ingestion and transformation processes enhances scalability and flexibility.

  • Manageability: Easy to store, track, and review SQL queries (transformation).

Data Ingestion Patterns

All approaches can be used for both batch and streaming pipelines.

However, because ELT and EtLT load first and transform later, they are especially well suited to the real-time needs of streaming data.

Hybrid architectures like Lambda and Kappa, meanwhile, are designed to seamlessly combine batch and streaming ingestion for comprehensive data processing.

Let us take a look.

Lambda Architecture

Lambda architecture uses a dual-path approach where the batch layer follows an ETL process to transform large historical datasets, and the speed layer handles real-time data suited for an ELT or EtLT approach.

Figure F. Lambda architecture (Created by Kuriko IWAI)


The architecture can serve both real-time and batch access to the predictions depending on the business needs.

Use Case - Stock Price Prediction

Using the same stock price prediction case, the Lambda architecture can be extended to serve real-time predictions via the speed layer, while the batch layer serves long-term forecasting.

The predictions from both layers are combined and served to the user, providing both a stable, long-term outlook and a volatile, real-time forecast.
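Merging the two layers at query time can be sketched as follows, with dicts standing in for the batch and real-time serving views (the values and key names are illustrative):

```python
def serve(batch_view, speed_view, key):
    """Merge the layers at query time: speed-layer results override stale batch values."""
    merged = dict(batch_view)
    merged.update(speed_view)  # real-time results take precedence
    return merged.get(key)

batch_view = {"ACME": 101.5}   # recomputed nightly over full history (batch layer, ETL)
speed_view = {"ACME": 103.2}   # updated per event from the stream (speed layer, ELT)
```

Tickers present only in the batch view fall back to the stable long-term forecast, while tickers with fresh events get the real-time value.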

Kappa Architecture

Kappa architecture simplifies Lambda by using a single, unified stream processing pipeline for both real-time and historical data.

Figure H. Kappa architecture (Created by Kuriko IWAI)


Kappa utilizes an ELT model, where all data from different sources is loaded as a stream and then transformed by the single processing engine.
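A sketch of the single-pipeline idea: one transform handles live events and historical replays alike (the rolling-mean feature and window size are illustrative):

```python
from collections import deque

class StreamProcessor:
    """One pipeline for everything: history is just a replay of old events."""

    def __init__(self, window=3):
        self.window = deque(maxlen=window)

    def process(self, event):
        # Every event, whether live or replayed from the log, flows through
        # the same transformation code path.
        self.window.append(event["price"])
        return sum(self.window) / len(self.window)  # rolling-mean feature

proc = StreamProcessor(window=3)
for event in [{"price": 100.0}, {"price": 102.0}, {"price": 104.0}]:
    latest = proc.process(event)
```

Reprocessing historical data means replaying the event log through the same processor, which is what eliminates Lambda's duplicated batch/speed code paths.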

Use Case - Stock Price Prediction

In the same use case of the stock price prediction, when the company prioritizes reducing development complexity and operational overhead while providing both real-time predictions and long-term forecasting, Kappa is the best choice.

The Modern Lakehouse

The last combination is the lakehouse architecture.

The lakehouse architecture aims to combine the best features of both data warehouse and data lake, creating a unified platform that handles both structured and unstructured data:

Figure I. Lakehouse architecture with a medallion structure (Created by Kuriko IWAI)

Typical options selected in each component are:

  • Source: Versatile for both structure and data flow.

  • Ingestion: Batch, Stream

  • Storage: Lakehouse

  • Processing: ELT (Extract, Load, Transform)


In the lakehouse, the Bronze Layer works as a data lake where raw data from external sources is ingested.

Then the Silver Layer handles transformation of data by cleaning and structuring the raw data.

The Gold Layer builds curated tables for a specific project, performing feature engineering to produce the features needed for predictions.

This combination provides both the flexibility of a data lake and the reliability and performance of a data warehouse within a single, simplified system.

Major advantages include:

  • Unified storage: The Lakehouse can store all types of data—structured, unstructured, and semi-structured—in a single platform.

  • Cost-effective: Leverages low-cost, cloud-based object storage like S3.

  • Openness: Avoids vendor lock-in by using open-source technologies like Apache Spark and Delta Lake, and open file formats like Parquet.

This architecture involves complexity in implementation and data governance.

Use Case - Stock Price Prediction

The architecture handles both historical data and real-time streams in a single platform.

The medallion structure, with Bronze, Silver, and Gold layers, progressively refines raw data into high-quality features.

Let’s say we have raw data from three data sources:

  • APIs - Unstructured stock prices

  • RSS feeds - Unstructured news

  • Internal database - Structured financial records

Bronze Layer:

The bronze layer works as a data lake to store the raw data from multiple data sources.

Silver Layer:

Then, the silver layer cleans and structures the raw data:

  • Running a query to join all stock prices with the corresponding financial records,

  • Cleansing messy text data from news feeds, and

  • Extracting news articles for a given date range.
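The three silver-layer steps above can be sketched in plain Python (the sample records, field names, and cleansing rules are all hypothetical):

```python
import re
from datetime import date

# Toy bronze-layer inputs from the three sources.
prices = [{"ticker": "ACME", "trade_date": date(2024, 1, 2), "close": 101.5}]
financials = {"ACME": {"eps": 2.1}}
news = [{"published": date(2024, 1, 2), "body": "  ACME <b>beats</b>   estimates "}]

# Step 1: join stock prices with the matching financial records.
joined = [{**p, **financials[p["ticker"]]} for p in prices]

# Step 2: cleanse messy text, stripping markup and collapsing whitespace.
def cleanse(text):
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", "", text)).strip()

# Step 3: keep only articles within the date range of interest.
articles = [cleanse(n["body"]) for n in news
            if date(2024, 1, 1) <= n["published"] <= date(2024, 1, 31)]
```

In a real lakehouse these steps would typically run as Spark or SQL jobs over bronze tables, but the shape of the work is the same: join, cleanse, filter.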

Gold Layer:

The gold layer runs feature engineering like calculating 30-day moving averages, volatility measures, and a market sentiment score from the cleaned news data.

This final, highly-refined dataset is used to train a model.
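A sketch of the gold-layer feature computations, using a toy price series; the window sizes follow the 30-day example in the text, and the helper names are illustrative:

```python
import statistics

closes = [100.0 + i * 0.5 for i in range(40)]  # toy daily closing prices

def moving_average(series, window=30):
    """Mean of the last `window` closing prices."""
    return sum(series[-window:]) / min(window, len(series))

def volatility(series, window=30):
    """Sample standard deviation of daily returns over the window."""
    w = series[-(window + 1):]
    returns = [(b - a) / a for a, b in zip(w, w[1:])]
    return statistics.stdev(returns)

# Gold-layer feature columns for the training set.
ma30 = moving_average(closes)
vol30 = volatility(closes)
```

A sentiment score from the cleaned news text would be computed the same way, as one more column in the curated gold table alongside these price-based features.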

Wrapping Up

Data pipeline architectures play key roles in transforming raw data into meaningful predictions.

In this article, we learned that each of the three common architectures - the traditional data warehouse, the cloud-native data lake, and the modern lakehouse - has its own advantages and disadvantages.

The optimal architecture is not a one-size-fits-all solution but a strategic choice guided by a careful assessment of data characteristics and business objectives.


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.