Data Pipeline Architecture: From Traditional DWH to Modern Lakehouse
A practical guide to transforming raw data into actionable predictions
By Kuriko IWAI

Table of Contents
Introduction
What is Data Pipeline Architecture
The Traditional Data Warehouse
The Cloud-Native Data Lake
The Modern Lakehouse
Wrapping Up

Introduction
A data pipeline architecture serves as the strategic blueprint for transforming raw data into actionable predictions.
But designing the architecture can seem complex because it involves numerous components, and the right choice for each is driven by the data's characteristics and the business needs.
In this article, I’ll structure these components and explore three common patterns:
The Traditional Data Warehouse,
The Cloud-Native Data Lake, and
The Modern Lakehouse,
using stock price prediction as a practical use case.
What is Data Pipeline Architecture
Data pipeline architectures define the structure, components, and flow of data from ingestion to a usable state for analytics and machine learning.
The diagram below shows the key components of a data pipeline architecture and the major options for each:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure A. Key components and major options in data pipeline architecture (Created by Kuriko IWAI)
The key components include:
Data Source: The origin of the data,
Ingestion: The method of collecting and bringing data into the system,
Storage: Where the data is housed,
Processing: The transformation and cleaning of data,
Serving: Making the processed data accessible to end-users and applications, and
Governance: Ensuring data quality, security, privacy, and compliance.
The processing component itself involves two sub-components:
Loading strategies like full load, incremental load, and delta load, and
Data transformation like cleaning, imputation, and preprocessing.
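To make the loading strategies concrete, here is a minimal Python sketch of full, incremental, and delta loads against a toy in-memory warehouse (the `warehouse` dict and the row shape are illustrative assumptions, not a real warehouse API):

```python
from datetime import date

# Toy warehouse table keyed by (ticker, trade_date); values are closing prices.
warehouse = {}

def full_load(source_rows):
    """Full load: wipe the target and reload everything from the source."""
    warehouse.clear()
    warehouse.update({(r["ticker"], r["date"]): r["close"] for r in source_rows})

def incremental_load(source_rows, since):
    """Incremental load: append only rows newer than the last load watermark."""
    for r in source_rows:
        if r["date"] > since:
            warehouse[(r["ticker"], r["date"])] = r["close"]

def delta_load(changed_rows):
    """Delta load: upsert only the rows the source flagged as changed."""
    for r in changed_rows:
        warehouse[(r["ticker"], r["date"])] = r["close"]

rows = [
    {"ticker": "ACME", "date": date(2024, 1, 2), "close": 100.0},
    {"ticker": "ACME", "date": date(2024, 1, 3), "close": 101.5},
]
full_load(rows)
incremental_load([{"ticker": "ACME", "date": date(2024, 1, 4), "close": 99.8}],
                 since=date(2024, 1, 3))
delta_load([{"ticker": "ACME", "date": date(2024, 1, 3), "close": 101.7}])  # restated value
print(len(warehouse))  # 3
```

The full load rebuilds everything, the incremental load appends past a watermark, and the delta load upserts only changed rows; real pipelines pick among these per table based on data volume and how often the source restates values.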
Choices are driven by the data characteristics, including its variety, volume, and velocity, as well as by specific business requirements.
Variety: The diversity of data (structured, semi-structured, unstructured) influences a storage choice.
Volume: The sheer amount of data dictates the need for scalable, distributed technologies (e.g., Spark, Hadoop) and cost-effective storage solutions (like cloud object storage).
Velocity: The speed at which data is generated and processed determines whether it should be a real-time streaming architecture for high velocity or a batch processing one for low velocity.
Business requirements are the ultimate guiding force.
Because some options across different components are strongly related, I’ll present three common combinations in the next sections, using stock price prediction as an example.
The Traditional Data Warehouse
The first combination is a traditional data warehouse architecture using an ETL (Extract, Transform, Load) approach.
The diagram below shows the standard architecture, where the original data is extracted, transformed into a structured format, and then loaded to fit a predefined schema in the data warehouse:

Figure B. Standard ETL / DWH architecture (Created by Kuriko IWAI)
Typical options selected in each component are:
Source: Structured, batch
Ingestion: Batch
Storage: Data warehouse
Processing: ETL (Extract, Transform, Load)
Serving: BI, Low frequency reports
The ETL process rigorously cleans and transforms data before loading, which ensures:
Access to stable, well-defined data sources,
High level of accuracy and consistency, and
Very fast query performance.
Drawbacks include:
Data types: Not suitable for unstructured or semi-structured data like images or text.
Cost: Maintaining a data warehouse can be expensive.
Latency: Batch processing means data is updated on a schedule (e.g., daily or weekly), making it unsuitable for real-time inference.
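The ETL flow above can be sketched in a few lines of Python, using an in-memory SQLite table as a stand-in for the warehouse (the table schema and the raw row shape are illustrative assumptions):

```python
import sqlite3

# Raw extract from a source system; one row fails validation on purpose.
raw = [
    {"symbol": "acme", "close": "100.50", "date": "2024-01-02"},
    {"symbol": "ACME", "close": "n/a",    "date": "2024-01-03"},  # bad value
]

def transform(rows):
    """Clean and conform rows *before* loading (the T precedes the L in ETL)."""
    out = []
    for r in rows:
        try:
            out.append((r["symbol"].upper(), r["date"], float(r["close"])))
        except ValueError:
            continue  # rigorously reject rows that fail validation
    return out

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (ticker TEXT, trade_date TEXT, close REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", transform(raw))
print(conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0])  # 1
```

Because the transform runs before the load, only clean, schema-conformant rows ever reach the warehouse table, which is exactly where the accuracy and consistency guarantees come from.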
▫ Use Case - Stock Price Prediction
The architecture is best suited for long-term forecasting based on quarterly or annual financial reports alongside historical end-of-day stock prices.

Figure C. Long-term forecasting on a traditional data warehouse architecture (Created by Kuriko IWAI)
In the diagram, the architecture first populates the data warehouse with historical data over decades.
Then, daily stock trading volumes are incrementally loaded via scheduled batch processes (e.g., daily or weekly).
When financial records are adjusted, the batch process also performs a delta load to update the data warehouse.
Then, the structured data is used to train a model that serves predictions for low-frequency decision-making.
This structure is not suitable for real-time stock prediction as the scheduled batch process generates latency between data and prediction.
◼ Adding a Staging Area
An advanced approach utilizes a staging area to store extracted data before transformation by SQL queries:

Figure D. ETL / DWH architecture with a staging area (Created by Kuriko IWAI)
The main difference lies in the isolation and efficiency of the transformation process.
Without a staging area, transformations are done directly on the source system or within the target data warehouse. This can be inefficient and risky:
Source System Overload: Complex transformations slow down the source systems, impacting core business operations.
Data Warehouse Bottlenecks: Transformations consume the warehouse’s compute resources, slowing down queries and reports.
A staging area can offload the heavy transformation process from the data warehouse or the source system by loading raw data into a temporary storage space like a dedicated S3 bucket.
Then, a separate processing engine like Apache Spark runs transformations without affecting the source systems or the data warehouse.
Although it increases operational complexity, other perks include:
Error handling: A failed transformation does not impact the data warehouse; simply rerun the transformation in the staging area.
Data quality control: Only high-quality data is loaded into the data warehouse, because multi-step transformations like cleaning, feature engineering, and preprocessing happen in the staging area.
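A minimal sketch of the staged flow, using a temporary directory as a stand-in for the S3 staging bucket (the file layout, batch IDs, and row shape are assumptions for illustration):

```python
import json
import pathlib
import tempfile

# Stage raw extracts as files first, then transform in a separate step, so a
# failed transformation never touches the source systems or the warehouse.
staging = pathlib.Path(tempfile.mkdtemp())
warehouse = []  # toy stand-in for the warehouse table

def extract_to_staging(batch_id, rows):
    """Extract step: dump raw rows into the staging area untouched."""
    (staging / f"{batch_id}.json").write_text(json.dumps(rows))

def transform_and_load(batch_id):
    """Transform step: read from staging, apply a quality gate, then load."""
    rows = json.loads((staging / f"{batch_id}.json").read_text())
    cleaned = [r for r in rows if r.get("close") is not None]  # quality gate
    warehouse.extend(cleaned)  # load only once transformation succeeds

extract_to_staging("2024-01-02", [{"ticker": "ACME", "close": 100.5},
                                  {"ticker": "ACME", "close": None}])
transform_and_load("2024-01-02")  # safe to rerun: the staged file persists
print(len(warehouse))  # 1
```

Because the raw file stays in staging after a failure, a rerun only repeats the cheap transform-and-load step, never the expensive or disruptive extraction from the source system.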
The Cloud-Native Data Lake
The second combination is a cloud-native data lake architecture.
This architecture is flexible and cost-effective, ideal for handling massive volumes of diverse data, including unstructured data.
There are three major approaches:
Standard ELT (Extract, Load, Transform),
Push ELT, and
EtLT (Extract, light transform, Load, Transform).
◼ Standard ELT Approach
The standard approach leverages ELT processing:

Figure E-1. Standard ELT / data lake architecture (Created by Kuriko IWAI)
First, the original data is extracted into the data lake, then loaded into the data warehouse and transformed there.
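The load-first, transform-later pattern can be sketched with in-memory SQLite standing in for both the lake (an untyped landing table) and the warehouse (the table names and the quality filter are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Load first: raw rows land untyped in the lake table with no upfront cleaning.
conn.execute("CREATE TABLE lake_prices (ticker TEXT, trade_date TEXT, close TEXT)")
conn.executemany("INSERT INTO lake_prices VALUES (?, ?, ?)", [
    ("acme", "2024-01-02", "100.5"),
    ("acme", "2024-01-03", "n/a"),  # a bad value survives the load step
])

# Transform later: SQL inside the engine cleans and conforms the data.
conn.execute("""
    CREATE TABLE dwh_prices AS
    SELECT upper(ticker) AS ticker, trade_date, CAST(close AS REAL) AS close
    FROM lake_prices
    WHERE close GLOB '[0-9]*'
""")
print(conn.execute("SELECT COUNT(*) FROM dwh_prices").fetchone()[0])  # 1
```

Note the contrast with ETL: the bad row is allowed into the lake and is only filtered out by the SQL transformation, which keeps ingestion fast and leaves the raw record available for later inspection.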
◼ Push ELT Approach
An alternative ingestion approach is the Push method, where external sources push data directly into the data lake:

Figure E-2. Push ELT / data lake architecture (Created by Kuriko IWAI)
This approach may result in limited control over data extraction, requiring coordination with teams responsible for data sources in cases of missing or corrupted data.
◼ EtLT (Extract, light transform, Load, and Transform) Approach
Data extracted from sources may contain confidential data that should not be accessible to unauthorized individuals.
An EtLT approach includes an additional ‘light‘ transformation step where sensitive information is masked or encrypted before loading data into the data lake:
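The “light” masking step can be sketched as follows, assuming a hypothetical `account_id` field as the sensitive attribute (field names and the hashing choice are illustrative, not prescribed by EtLT):

```python
import hashlib

def light_transform(row, sensitive=("account_id",)):
    """Mask sensitive fields before the row ever lands in the data lake."""
    masked = dict(row)
    for key in sensitive:
        if key in masked:
            # A one-way hash keeps joins possible without exposing the raw value.
            masked[key] = hashlib.sha256(str(masked[key]).encode()).hexdigest()[:12]
    return masked

record = {"account_id": "ACCT-001", "ticker": "ACME", "qty": 50}
loaded = light_transform(record)
print(loaded["qty"], loaded["account_id"] != "ACCT-001")  # 50 True
```

Hashing (rather than deleting) the identifier preserves the ability to join records belonging to the same account downstream, while the raw value never reaches storage accessible to unauthorized users.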

Figure E-3. EtLT architecture (Created by Kuriko IWAI)
In each approach, combining a data lake with a data warehouse makes it possible to apply different analysis techniques to the original data stored in the lake.
Typical options selected in each component are:
Source: Versatile for any data structure; batch or streaming.
Ingestion: Streaming, though batch is also an option.
Storage: Data lake + data warehouse
Processing: ELT or EtLT
Although each approach introduces complexity in managing multiple tools, other perks include:
Scalability: Separation of ingestion and transformation processes enhances scalability and flexibility.
Manageability: Easy to store, track, and review SQL queries (transformation).
Data Ingestion Patterns
All approaches can be used for both batch and streaming pipelines.
However, due to ELT and EtLT’s nature of loading first and transforming later, they are well-suited for the real-time needs of streaming data.
Hybrid architectures like Lambda and Kappa, meanwhile, are designed to seamlessly combine batch and streaming ingestion for comprehensive data processing.
Let us take a look.
◼ Lambda Architecture
Lambda architecture uses a dual-path approach where the batch layer follows an ETL process to transform large historical datasets, and the speed layer handles real-time data suited for an ELT or EtLT approach.

Figure F. Lambda architecture (Created by Kuriko IWAI)
The architecture can serve both real-time and batch access to the predictions depending on the business needs.
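One common serving pattern, merging the batch view with the speed view, can be sketched as follows (the dictionaries stand in for real serving stores, and letting the speed layer win per key is an illustrative design choice):

```python
# Batch view: accurate but stale, recomputed nightly over the full history.
batch_view = {"ACME": 101.5}
# Speed view: approximate but fresh, updated per streamed event.
speed_view = {"ACME": 102.1, "BETA": 55.0}

def serve(ticker):
    """Prefer the real-time value when one exists, else fall back to batch."""
    return speed_view.get(ticker, batch_view.get(ticker))

print(serve("ACME"))  # 102.1, the real-time value wins
print(serve("BETA"))  # 55.0, only the speed layer has seen this ticker yet
```

In a production Lambda deployment the speed view is typically truncated each time the batch layer catches up, so the speed layer only ever covers the gap since the last batch run.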
▫ Use Case - Stock Price Prediction
Using the same stock price prediction case as the previous architecture, Lambda can be extended to serve real-time predictions via the speed layer, while the batch layer serves long-term forecasting.
The predictions from both layers are combined and served to the user, providing both a stable, long-term outlook and a volatile, real-time forecast.
◼ Kappa Architecture
Kappa architecture simplifies Lambda by using a single, unified stream processing pipeline for both real-time and historical data.

Figure H. Kappa architecture (Created by Kuriko IWAI)
Kappa utilizes an ELT model, where all data from different sources is loaded as a stream and then transformed by the single processing engine.
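A minimal sketch of that single code path, here an incremental running mean over closing prices, with replayed history and live events flowing through the same handler (the event shape and the metric are illustrative assumptions):

```python
# Kappa: one stream handler processes both replayed history and live events.
state = {"count": 0, "mean": 0.0}

def handle(event):
    """Single processing path: incremental running mean of closing prices."""
    state["count"] += 1
    state["mean"] += (event["close"] - state["mean"]) / state["count"]

historical = [{"close": 100.0}, {"close": 101.0}]  # replayed from the event log
live = [{"close": 102.0}]                           # arriving in real time

for event in historical + live:  # same code path for both kinds of data
    handle(event)
print(round(state["mean"], 2))  # 101.0
```

Because reprocessing history just means replaying the log through the same handler, there is no separate batch codebase to keep in sync, which is exactly the operational simplification Kappa trades for.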
▫ Use Case - Stock Price Prediction
In the same stock price prediction use case, Kappa is the best choice when the company prioritizes reducing development complexity and operational overhead while still providing both real-time predictions and long-term forecasting.
The Modern Lakehouse
The last combination is the lakehouse architecture.
The lakehouse architecture aims to combine the best features of both data warehouse and data lake, creating a unified platform that handles both structured and unstructured data:

Figure I. Lakehouse architecture with a medallion structure (Created by Kuriko IWAI)
Typical options selected in each component are:
Source: Versatile for both data structure and data flow.
Ingestion: Batch, Stream
Storage: Lakehouse
Processing: ELT (Extract, Load, Transform)
In the lakehouse, the Bronze Layer works as a data lake where raw data from external sources is ingested.
Then the Silver Layer handles transformation of data by cleaning and structuring the raw data.
The Gold Layer builds curated tables for a specific project, performing feature engineering to produce the features needed for future predictions.
This combination provides both the flexibility of a data lake and the reliability and performance of a data warehouse within a single, simplified system.
Major advantages include:
Unified storage: The Lakehouse can store all types of data—structured, unstructured, and semi-structured—in a single platform.
Cost-effective: Leverages low-cost, cloud-based object storage like S3.
Openness: Avoid vendor lock-in by using open-source technologies like Apache Spark, Delta Lake and open file formats like Parquet.
This architecture involves complexity in implementation and data governance.
▫ Use Case - Stock Price Prediction
The architecture handles both historical data and real-time streams in a single platform.
The medallion structure with Bronze, Silver, and Gold layers progressively refines raw data into high-quality features.
Let’s say we have raw data from three data sources:
APIs - Unstructured stock prices
RSS feeds - Unstructured news
Internal database - Structured financial records
Bronze Layer:
The bronze layer works as a data lake to store the raw data from multiple data sources.
Silver Layer:
Then, the silver layer cleans and structures the raw data:
Running a query to join all stock prices with the corresponding financial records,
Cleanse messy text data from news feeds, and
Extract news articles for a given date range.
Gold Layer:
The gold layer runs feature engineering like calculating 30-day moving averages, volatility measures, and a market sentiment score from the cleaned news data.
This final, highly-refined dataset is used to train a model.
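The three layers can be sketched end to end in a few lines of Python (a 3-day window replaces the 30-day one for brevity, and the pipe-delimited raw format is an assumption for illustration):

```python
# Medallion flow: bronze keeps raw payloads, silver cleans and structures them,
# gold engineers features (a moving average of closing prices).
bronze = [{"raw": "ACME|2024-01-02|100.0"}, {"raw": "ACME|2024-01-03|102.0"},
          {"raw": "ACME|2024-01-04|104.0"}, {"raw": "bad record"}]

def to_silver(layer):
    """Silver: parse, type, and drop malformed records during structuring."""
    rows = []
    for rec in layer:
        parts = rec["raw"].split("|")
        if len(parts) == 3:
            rows.append({"ticker": parts[0], "date": parts[1],
                         "close": float(parts[2])})
    return rows

def to_gold(rows, window=3):
    """Gold: engineer a rolling moving-average feature from clean rows."""
    closes = [r["close"] for r in rows]
    return [sum(closes[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(closes))]

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # [102.0]
```

Each layer only reads from the one before it, so a change in feature logic reruns the gold step alone, while the raw bronze records remain available for auditing or reprocessing.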
Wrapping Up
Data pipeline architectures play key roles in transforming raw data into meaningful predictions.
In this article, we learned that each of the three common architectures - the traditional data warehouse, the cloud-native data lake, and the modern lakehouse - has its own advantages and disadvantages.
The optimal architecture is not a one-size-fits-all solution but a strategic choice guided by a careful assessment of data characteristics and business objectives.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Architecting Production ML: A Deep Dive into Deployment and Scalability
Engineering a Fully-Automated Lakehouse: From Raw Data to Gold Tables
Building a Production-Ready Data CI/CD Pipeline: Versioning, Drift Detection, and Orchestration
Related Books for Further Understanding
These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
Share What You Learned
Kuriko IWAI, "Data Pipeline Architecture: From Traditional DWH to Modern Lakehouse" in Kernel Labs
https://kuriko-iwai.com/data-pipeline-architectures-for-ml-models
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.


