Causal Inference in ML Pipelines: Beyond Feature Flattening
[Series] Why Machine Learning Fails in Production and How to Fix It with Causal Inference
By Kuriko IWAI

Table of Contents
- Introduction
- What is Causal Inference in Machine Learning
- Why Not Standard ML

Introduction
In enterprise architectures, we optimize highly complex predictive models like deep neural networks to minimize empirical risk.
These models operate on observational data, making them suitable to answer what will happen under the current data-generating distribution, but failing to predict what would happen if an operational policy actively alters that distribution.
In other words, when predictions feed directly into decisions, predictive models can become operational liabilities: they treat conditional correlations in observational data as if they were invariant under policy change.
Causal inference tackles this challenge by shifting the objective from passive observation to structural intervention.
This article dives deep into the mathematical foundations, hidden structural traps, and practical Python implementations of integrating causal inference into production machine learning pipelines.
What is Causal Inference in Machine Learning
Causal inference is the formal statistical and mathematical framework dedicated to uncovering, quantifying, and predicting the independent, directional effects of specific variables (treatments) on targeted outcomes.
The diagram below illustrates how causal inference differs from standard predictive machine learning:

Figure 1. Symmetrical Statistical Associations in Predictive ML vs. Directional Causal Inference Mechanics (Created by Kuriko IWAI)
Rather than treating data as a collection of symmetrical statistical associations (Left, Figure 1), causal inference models the underlying physical mechanisms and data-generating processes of the system (Right, Figure 1).
Why Not Standard ML
To illustrate the divergence between statistical pattern matching and causal identification, let us consider a high-stakes enterprise staffing pipeline tracking three chronologically ordered variables:
1. The treatment / intervention:
A human manager assigns a candidate to a poorly matched corporate project due to unobserved cognitive bias or historical proxy variables (e.g., prestige resume tags).
2. A time-varying mediator:
Months later, the candidate experiences low role alignment, causing a sharp drop in quantifiable communication metrics (e.g., Slack/email response rates fall significantly).
3. The target outcome:
The candidate prematurely terminates their contract with the company.
◼ The Statistical View (Standard ML)
Traditional supervised learning models operate entirely on conditional probabilities of the form:
$$P(Y \mid X) \tag{1.1}$$
Where:
Y is the random variable representing the terminal target outcome (e.g., contract termination).
X is the vector of observed baseline features or covariates available to the model during passive observation.
P(Y|X) denotes the conditional probability distribution of Y given the observed context X.
P(Y|X) captures only the statistical association between X and Y, not the direction of influence.
Because the intermediate symptom (a drop in the response rate M) occurs in close temporal proximity to the terminal event (Y = 1, indicating that the staff member exits), the observational data exhibits a strong conditional dependency:
$$P(Y = 1 \mid M = \text{Drop}) \approx 1.0$$
Where:
Y = 1 is the specific realization of the outcome variable, indicating the contract termination in this case.
M = Drop is the specific realization of the post-treatment mediator, indicating a severe decline in the communication response rate M.
≈ 1.0 indicates an empirical conditional probability approaching unity due to tight chronological and statistical correlation.
A standard ML model optimizes the objective in Eq. 1.1 by assigning maximum predictive weight to the response rate M, and so concludes that communication degradation is the primary driver of turnover.
Consequently, an operational policy derived from the model would misallocate capital toward symptoms—such as automated HR alerts to improve response rates—while leaving the true generative mechanism (the structural mismatch A_0) entirely unaddressed.
◼ The Feature Flattening Trap
This fundamental failure of supervised ML in decision-making contexts stems from the feature flattening trap.
When constructing input matrices for models like XGBoost or deep neural networks, practitioners compress highly complex, chronologically ordered, and causally dependent structural processes into a single, static feature vector:
$$\mathbf{x} = [x_1, x_2, \dots, x_n]^\top$$
Where:
x is the flattened input tensor passed to the model inference engine.
x_1, x_2, ..., x_n are the individual scalar features containing baseline covariates (X), treatment actions (A_0), and post-treatment mediators (M) mixed without topological ordering.
By mapping all features to a flat, identical plane, the empirical risk minimizer strips away all structural topology, temporal ordering, and directional dependency.
The model possesses no internal representation of the Directed Acyclic Graph (DAG) that generated the data.
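As a minimal sketch of what flattening discards, the snippet below builds one hypothetical candidate record and collapses it into a plain feature vector. The class name `CandidateRecord`, its field names, and the `ROLES` mapping are illustrative assumptions, not part of any real pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateRecord:
    """Hypothetical record; field names are illustrative only."""
    talent_score: float   # baseline covariate X (known before assignment)
    bad_match: int        # treatment A_0 (the assignment decision)
    response_drop: int    # mediator M (observed months after A_0)


# Causal roles known to the domain expert but absent from the model input.
ROLES = {
    "talent_score": "confounder",
    "bad_match": "treatment",
    "response_drop": "mediator",
}


def flatten(record: CandidateRecord) -> List[float]:
    """Collapse a record into an unordered feature vector [x_1, ..., x_n]."""
    return [
        float(record.talent_score),
        float(record.bad_match),
        float(record.response_drop),
    ]


record = CandidateRecord(talent_score=0.8, bad_match=1, response_drop=1)
x = flatten(record)
print(x)            # three floats, nothing else
print(len(ROLES))   # role metadata lives outside the vector
```

The vector handed to the model contains three numbers and nothing about which of them came first or which one causes the others. Any ordering or role information would have to be supplied separately, for example through a DAG.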
This induces two primary structural failures in production:
- Mediator Conditioning
- Weight Stealing
▫ Challenge 1. Mediator Conditioning - Confusing Symptoms with Causes
By flattening the feature space, the model optimizes solely for predictive accuracy on the observational manifold. It cannot distinguish between an upstream cause (A_0) and a downstream mediator (M).
Because M shares high mutual information with Y due to its temporal proximity, the model treats the symptom as the lever.
▫ Challenge 2. Weight Stealing - Statistical Masking
During optimization, the empirical risk minimizer identifies the path of least resistance.
Because the downstream mediator M sits closer to Y on the causal path, it absorbs the bulk of the gradient updates during empirical training.
If we analyze the resulting model parameters or regression coefficients β, the mediator dominates the true intervention lever:
$$\beta_{M=1} \gg \beta_{A_0}$$
Where:
β_{M=1} is the estimated coefficient or feature contribution score that the model assigns to the mediator M, here the drop in response rate.
β_{A_0} is the estimated coefficient or feature contribution score that the model assigns to the primary action lever A_0, here the poorly matched initial project assignment.
This produces statistical masking: a model with access to post-treatment mediators on a flattened plane lets the symptom (M = 1) steal the predictive credit that structurally belongs to the actionable decision lever (A_0).
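Weight stealing is easy to reproduce on a toy linear system. The sketch below assumes a hypothetical linear SCM in which A_0 raises a continuous mediator M by 2.0 and Y depends on both; the coefficients (0.5, 1.5, 2.0) are made up for illustration and are unrelated to the logistic simulation later in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical linear SCM (illustrative coefficients only):
# A0 -> M -> Y carries most of the effect, A0 -> Y a small direct part.
a0 = rng.binomial(1, 0.5, size=n).astype(float)
m = 2.0 * a0 + rng.normal(0.0, 1.0, size=n)            # mediator caused by A0
y = 0.5 * a0 + 1.5 * m + rng.normal(0.0, 1.0, size=n)  # outcome


def ols(features, target):
    """Ordinary least squares with an intercept; returns the slopes only."""
    design = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1:]


# Flattened regression: A0 and M sit side by side in the feature matrix.
beta_a0_flat, beta_m_flat = ols([a0, m], y)

# Regression with the mediator left out: A0 keeps its full downstream effect.
(beta_a0_total,) = ols([a0], y)

print(f"flattened:   beta_A0={beta_a0_flat:.2f}, beta_M={beta_m_flat:.2f}")
print(f"no mediator: beta_A0={beta_a0_total:.2f}")
```

Under these assumptions the flattened fit credits A_0 with roughly its direct effect (0.5) while M receives 1.5 per unit. Dropping M returns a coefficient near 0.5 + 2.0 × 1.5 = 3.5, which is the total effect of the lever.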
Developer Notes - The Fallacy of Feature Importance (SHAP / Gain)
Post-hoc explainability methods like SHAP (SHapley Additive exPlanations) or tree-based feature importance do not reflect causal mechanisms.
SHAP measures the marginal contribution of a feature to the model's output relative to the expected output over the current training distribution. It does not map the physical topology of the real world. A high SHAP value indicates statistical predictive utility under the status quo policy, but provides no guarantee of structural invariance under policy intervention.
The Causal View: Graph Surgery and Judea Pearl’s do-Operator
Causal inference transitions the paradigm from passive observational conditioning to active, counterfactual intervention.
Instead of evaluating the conditional probability distribution on the flat plane, causal inference isolates the interventional distribution:
$$P(Y \mid do(A_0 = a))$$
Where:
Y is the random variable representing the terminal target outcome.
A_0 is the actionable treatment variable (e.g., project assignment policy).
a is the specific counterfactual assignment value enforced by the intervention.
do(A_0 = a) is Judea Pearl’s interventional operator, indicating an exogenous manipulation that alters the data-generating mechanism.
◼ Judea Pearl’s Do-operator
Judea Pearl’s do-operator (Pearl, 2009) provides the mathematical syntax required to formalize this distinction by separating "seeing" (passive observation) from "doing" (active intervention).
Standard machine learning operates on the observational distribution denoted in Eq. 1.1, which conditions on the subpopulation that happened to receive treatment A = a under the historical, non-randomized policy.
The do-operator, denoted as do(A = a), represents a hypothetical, physical intervention that overrides the natural data-generating mechanism.
It forces the variable A to take the exact value a for the entire population, irrespective of historical tendencies.
For example, suppose we decide to programmatically override the human managers.
We instantiate a strict routing policy that forces every single candidate in the pipeline into a high-pressure project, completely ignoring their resume prestige, their historical background, or the managers' personal preferences.
Mathematically, we evaluate the interventional distribution:
$$P(Y \mid do(A_0 = 1))$$
Where:
do(A_0 = 1) represents the programmatic mutation of the system where we exogenously force the assignment variable to 1 for the entire incoming population (candidates).
This physical intervention performs graph surgery on the operational pipeline.
Consider a Directed Acyclic Graph (DAG) where a set of baseline confounders X determines the treatment assignment (X → A).
▫ The Observational World (Before Graph Surgery)
In historical logs, the data is heavily confounded.
Baseline factors X, such as a candidate's prestigious resume tag (e.g., harvard_tag) or human manager biases, directly dictate the assignment action (A), which in turn affects the project outcome (Y). The same factors X also influence Y directly.
        [ Baseline Confounders (X) ]
             /              \
            /                \
           v                  v
[ Action (A) ] ───> [ Final Outcome (Y) ]
Figure 2(a). The Confounded Observational Network.
In Figure 2(a), an open backdoor path exists between the action and the outcome:
$$A \leftarrow X \rightarrow Y$$
Standard ML architectures fail here because they operate on pure statistical association.
Lacking directional awareness, the empirical risk minimizer cannot determine whether a favorable outcome Y was caused by the strategic assignment action A or merely inherited from the privileged background context X.
▫ The Interventional World (After Graph Surgery)
When the do-operator is invoked, the forced action completely overrides human habit, performing graph surgery by severing all incoming arrows to A.
        [ Baseline Confounders (X) ]
                              \
                               \
                                v
[ FORCED Action (A = a) ] ───> [ Final Outcome (Y) ]
Figure 2(b). Topological Graph Surgery - Structural Backdoor Path Elimination via Exogenous do-Operator Interventions
Now the edge X → A is erased, and the backdoor path A ← X → Y is destroyed because X no longer causes A.
In this surgically modified graph, any remaining association between the forced action (A = a) and the outcome Y is a pure, unconfounded causal effect.
Provided every confounder is observed and adjusted for, this lets us emulate an unbiased A/B experiment using static, biased historical logs.
◼ Standard ML vs Causal Inference
This fundamentally shifts the mathematical objective of the engineering pipeline:
Standard ML evaluates P(Y | A_0 = a), answering: "What is the probability of an exit among the sub-population of candidates who happened to be assigned to this project style under the current biased policy?"
Causal inference evaluates P(Y | do(A_0 = a)), answering: "What would the exit rate look like if we actively forced the assignment of every candidate to this project style, overriding human habit across the entire population?"
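The gap between the two questions can be made concrete with a small simulation. The sketch below assumes a toy SCM with a binary confounder X; the function name `simulate`, the policy probabilities (0.8 / 0.2), and the outcome coefficients are all hypothetical. Passing `do_a` forces the action and cuts the X → A edge, which mimics graph surgery.

```python
import numpy as np


def simulate(n, rng, do_a=None):
    """Sample (X, A, Y) from a toy SCM; do_a forces A and cuts the X -> A edge."""
    x = rng.binomial(1, 0.5, size=n)                      # binary confounder
    if do_a is None:
        # Historical policy: managers assign A=1 far more often when X=1.
        a = rng.binomial(1, np.where(x == 1, 0.8, 0.2))
    else:
        # Graph surgery: A no longer listens to X.
        a = np.full(n, do_a)
    p_y = 0.1 + 0.2 * a + 0.4 * x                         # exit probability
    y = rng.binomial(1, p_y)
    return x, a, y


rng = np.random.default_rng(1)
n = 200_000

# "Seeing": P(Y=1 | A=1) - P(Y=1 | A=0) in the confounded logs.
x, a, y = simulate(n, rng)
observational_gap = y[a == 1].mean() - y[a == 0].mean()

# "Doing": P(Y=1 | do(A=1)) - P(Y=1 | do(A=0)) after forcing the action.
_, _, y1 = simulate(n, rng, do_a=1)
_, _, y0 = simulate(n, rng, do_a=0)
interventional_gap = y1.mean() - y0.mean()

print(f"observational gap:  {observational_gap:.3f}")
print(f"interventional gap: {interventional_gap:.3f}")
```

With these made-up numbers the true effect of the action is 0.2 by construction. The observational gap comes out at roughly 0.44, because the sub-population that received A = 1 is also enriched in X = 1.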
Mathematical Foundations of Structural Causal Models (SCMs)
To analyze how standard ML fails, we formalize our domain knowledge using Directed Acyclic Graphs (DAGs) and structural causal models, and use them to audit our feature space.
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a DAG representing our enterprise staffing system, where the vertex set is defined as $\mathcal{V} = \{X, A_0, M, Y\}$ and $\mathcal{E}$ is the set of directed edges.
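The structural topology defines the true data-generating process via the following non-parametric structural equations:
$$
\begin{aligned}
X &= f_X(\varepsilon_X) \\
A_0 &= f_A(X, \varepsilon_A) \\
M &= f_M(A_0, \varepsilon_M) \\
Y &= f_Y(X, A_0, M, \varepsilon_Y)
\end{aligned}
$$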
Where:
X, A_0, M, Y are the endogenous random variables defining the structural system.
f_X, f_A, f_M, f_Y are deterministic, non-parametric structural mapping functions that generate the value of each node based on its parents and exogenous noise.
ε_X, ε_A, ε_M, ε_Y are mutually independent, unobserved background noise terms (exogenous variables) drawn from an arbitrary joint probability space.
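This induces the factorization of the observational joint distribution according to Markov compatibility:
$$P(X, A_0, M, Y) = \prod_{V_i \in \mathcal{V}} P(V_i \mid PA_i) = P(X)\,P(A_0 \mid X)\,P(M \mid A_0)\,P(Y \mid X, A_0, M)$$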
Where:
P(X, A_0, M, Y) is the joint observational density function over all nodes in the graph.
P(V_i | PA_i) represents the conditional probability density of node V_i given its immediate graph-theoretic parents PA_i in G.
When we execute an intervention do(A_0 = a), the truncated factorization theorem dictates that the interventional distribution factors as:
$$P(X, M, Y \mid do(A_0 = a)) = P(X)\,P(M \mid A_0 = a)\,P(Y \mid X, A_0 = a, M) \tag{3.5}$$
Where:
P(X, M, Y | do(A_0 = a)) is the post-interventional joint probability distribution over the remaining active variables.
a is the constant value to which A_0 is exogenously forced, replacing the conditional density function P(A_0 | X) with an atomic indicator mass.
In Eq. 3.5, notice that the conditional factor P(A_0 | X) has been eliminated, reflecting the deletion of incoming edges pointing into A_0.
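The two factorizations can be checked by hand on a small discrete system. The sketch below assumes all four variables are binary and uses made-up conditional probability tables (`p_x`, `p_a_given_x`, `p_m_given_a`, `p_y1`); only the structure X → A_0, A_0 → M, and (X, A_0, M) → Y is taken from the article.

```python
from itertools import product

# Hypothetical conditional probability tables for binary X, A0, M, Y.
p_x = {0: 0.6, 1: 0.4}
p_a_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p_a_given_x[x][a]
p_m_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # p_m_given_a[a][m]


def p_y1(x, a, m):
    """P(Y=1 | X=x, A0=a, M=m), an arbitrary illustrative table."""
    return 0.05 + 0.10 * x + 0.15 * a + 0.40 * m


def p_y_given(y, x, a, m):
    return p_y1(x, a, m) if y == 1 else 1.0 - p_y1(x, a, m)


def observational_joint(x, a, m, y):
    """Markov factorization: P(X) P(A0|X) P(M|A0) P(Y|X,A0,M)."""
    return p_x[x] * p_a_given_x[x][a] * p_m_given_a[a][m] * p_y_given(y, x, a, m)


def interventional_joint(x, m, y, a):
    """Truncated factorization under do(A0=a): the factor P(A0|X) is dropped."""
    return p_x[x] * p_m_given_a[a][m] * p_y_given(y, x, a, m)


bits = (0, 1)
obs_total = sum(observational_joint(*v) for v in product(bits, repeat=4))
do_total = sum(interventional_joint(x, m, y, a=1) for x, m, y in product(bits, repeat=3))

# P(Y=1 | A0=1) from the observational joint vs. P(Y=1 | do(A0=1)).
p_a1 = sum(observational_joint(x, 1, m, y) for x, m, y in product(bits, repeat=3))
p_y1_given_a1 = sum(observational_joint(x, 1, m, 1) for x, m in product(bits, repeat=2)) / p_a1
p_y1_do_a1 = sum(interventional_joint(x, m, 1, a=1) for x, m in product(bits, repeat=2))

print(f"P(Y=1 | A0=1)     = {p_y1_given_a1:.3f}")
print(f"P(Y=1 | do(A0=1)) = {p_y1_do_a1:.3f}")
```

Both joints sum to one, and the two quantities differ (0.504 versus 0.480 for these tables) only because conditioning on A_0 = 1 shifts the distribution of X, while the intervention leaves P(X) untouched.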
◼ The Hazard of Over-Adjustment Bias
If a machine learning engineer naively runs a regression or optimizes an empirical network conditioning on the mediator M alongside A_0 and X, they are attempting to isolate the Direct Effect of A_0 while holding M constant.
Mathematically, the Controlled Direct Effect (CDE) is defined as:
$$\mathrm{CDE}(m) = E[Y \mid do(A_0 = 1), do(M = m)] - E[Y \mid do(A_0 = 0), do(M = m)]$$
Where:
CDE(m) is the Controlled Direct Effect evaluated at a fixed mediator state m.
E[Y | ...] is the mathematical expectation operator under the designated interventional distributions.
do(A_0 = 1) and do(A_0 = 0) represent setting the treatment to active and baseline control states respectively.
do(M = m) denotes forcing the mediator to a static value m via a joint, secondary intervention.
However, if the strategic operational objective is to optimize the policy governing A_0, we must identify the Total Causal Effect (TCE), which accounts for the causal influence flowing naturally through the mediator M:
$$\mathrm{TCE} = E[Y \mid do(A_0 = 1)] - E[Y \mid do(A_0 = 0)] \tag{3.7}$$
Where:
TCE is the Total Causal Effect representing the net change in outcome expectation generated across all downstream pathways radiating from the intervention at A_0.
The TCE in Eq. 3.7 includes the effect transmitted along the directed path A_0 → M → Y, and conditioning on M blocks that path.
By including M in the conditioning set, we hold fixed the variation in M that is structurally induced by the action A_0, so the estimate drops the mediated share of the effect. This is over-adjustment bias.
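To see how far the two estimands can drift apart, the sketch below evaluates both in closed form. It borrows the structural coefficients from the simulation script later in this article and, as a simplifying assumption, evaluates everything within the single stratum X = 0 so that no integration over X is needed. The helper names (`sigmoid`, `p_m1`, `p_y1`, `cde`, `tce`) are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Structural coefficients taken from the article's simulation script.
def p_m1(a0):
    """P(M=1 | A0=a0)."""
    return sigmoid(-2.0 + 2.5 * a0)


def p_y1(a0, m, x):
    """P(Y=1 | A0=a0, M=m, X=x)."""
    return sigmoid(-3.0 + 1.2 * a0 + 1.8 * m + 0.5 * x)


def cde(m, x=0.0):
    """Controlled direct effect: M is pinned at m in both arms."""
    return p_y1(1, m, x) - p_y1(0, m, x)


def p_y1_do_a0(a0, x=0.0):
    """P(Y=1 | do(A0=a0), X=x): M is left free to respond to A0."""
    pm = p_m1(a0)
    return pm * p_y1(a0, 1, x) + (1.0 - pm) * p_y1(a0, 0, x)


def tce(x=0.0):
    """Total causal effect within the stratum X=x."""
    return p_y1_do_a0(1, x) - p_y1_do_a0(0, x)


print(f"CDE(m=0) = {cde(0):.3f}")
print(f"CDE(m=1) = {cde(1):.3f}")
print(f"TCE      = {tce():.3f}")
```

On the probability scale the total effect at X = 0 is about 0.295, larger than either controlled direct effect (about 0.094 and 0.269), because forcing a bad match also raises the chance of a communication dropout from roughly 12% to 62%.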
◼ Identification via Backdoor Criterion
To recover the true interventional distribution from purely observational logs, we must locate an adjustment set
$$\mathcal{Z} \subseteq \mathcal{V} \setminus \{A_0, Y\}$$
that satisfies Pearl's Backdoor Criterion relative to (A_0, Y):
- No vertex in Z is a descendant of A_0.
- Z blocks every backdoor path between A_0 and Y, that is, every path that contains an arrow pointing into A_0.
In our topology, $\mathcal{Z} = \{X\}$ satisfies the criterion. The path $A_0 \leftarrow X \rightarrow Y$ is a confounding backdoor path. Conversely, $M$ is a descendant of $A_0$ and must be excluded from $\mathcal{Z}$.
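Applying the backdoor adjustment formula yields:
$$P(Y \mid do(A_0 = a)) = \sum_{x \in \mathcal{X}} P(Y \mid A_0 = a, X = x)\,P(X = x)$$
For a continuous confounder, the sum becomes an integral with respect to $dP(X = x)$.
Where:
$\mathcal{X}$ is the support space of the baseline confounder variable X.
P(Y | A_0=a, X=x) is the standard conditional observational probability of the outcome given specific instances of treatment and confounder.
P(X = x) or dP(X=x) is the marginal observational probability weight of the confounder stratum, used to average the conditional estimates over the same population-wide confounder distribution for every treatment level.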
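The adjustment formula translates almost line by line into code when the confounder is discrete. The sketch below assumes hypothetical logs with a binary confounder and a known ground-truth effect of 0.25, so the estimate can be checked; the function name `backdoor_adjusted` and all probabilities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Hypothetical logs: binary confounder X, biased assignment A0, binary outcome Y.
x = rng.binomial(1, 0.3, size=n)
a0 = rng.binomial(1, np.where(x == 1, 0.9, 0.2))
y = rng.binomial(1, 0.10 + 0.25 * a0 + 0.30 * x)   # true effect of A0 is 0.25


def backdoor_adjusted(a, x, a0, y):
    """Estimate P(Y=1 | do(A0=a)) = sum_x P(Y=1 | A0=a, X=x) P(X=x)."""
    total = 0.0
    for level in np.unique(x):
        stratum = (a0 == a) & (x == level)
        # Conditional outcome rate in the stratum, weighted by the
        # marginal (not treatment-specific) share of that stratum.
        total += y[stratum].mean() * np.mean(x == level)
    return total


naive = y[a0 == 1].mean() - y[a0 == 0].mean()
adjusted = backdoor_adjusted(1, x, a0, y) - backdoor_adjusted(0, x, a0, y)

print(f"naive difference:    {naive:.3f}")
print(f"adjusted difference: {adjusted:.3f}")
```

The naive contrast lands near 0.43 because treated rows are dominated by X = 1, while the adjusted contrast recovers a value close to the 0.25 built into the simulation. This only works here because X is the sole confounder and every stratum contains both treated and untreated rows.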
◼ The Enterprise Unlock: Off-Policy Evaluation (OPE)
Shifting from P(Y | A_0) to P(Y | do(A_0)) enables counterfactual Off-Policy Evaluation (OPE).
Instead of deploying an unverified matching algorithm to production and risking live revenue or retention, a causally identified model allows engineers to leverage biased historical logs to answer the counterfactual optimization problem:
What would our retention curve look like if we had executed deployment policy π_{new} over the past 24 months instead of the legacy human-driven policy π_{old}?
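One simple way to answer that question, sketched below, is a direct-method estimate that plugs backdoor-adjusted outcome rates into each candidate policy. The article does not prescribe an estimator, so this is only one possible choice. The context labels, action labels, exit rates, and both policies are hypothetical numbers standing in for quantities estimated from logs.

```python
# Hypothetical quantities estimated from historical logs (illustrative numbers).
p_x = {"junior": 0.5, "senior": 0.5}            # P(X = x)
exit_rate = {                                   # P(Y=1 | A0=a, X=x)
    ("junior", "high_pressure"): 0.40,
    ("junior", "standard"): 0.10,
    ("senior", "high_pressure"): 0.15,
    ("senior", "standard"): 0.12,
}

# A policy maps each context x to a distribution over actions a.
pi_old = {
    "junior": {"high_pressure": 0.7, "standard": 0.3},
    "senior": {"high_pressure": 0.3, "standard": 0.7},
}
pi_new = {
    "junior": {"high_pressure": 0.0, "standard": 1.0},
    "senior": {"high_pressure": 0.5, "standard": 0.5},
}


def policy_value(policy):
    """Expected exit rate under a policy: sum_x P(x) sum_a pi(a|x) P(Y=1|a,x)."""
    return sum(
        p_x[x] * prob * exit_rate[(x, a)]
        for x, actions in policy.items()
        for a, prob in actions.items()
    )


print(f"legacy policy exit rate: {policy_value(pi_old):.4f}")
print(f"new policy exit rate:    {policy_value(pi_new):.4f}")
```

The comparison is only as trustworthy as the stratum-level exit rates: they carry a causal reading only if X satisfies the backdoor criterion and each (x, a) cell was actually observed under the legacy policy.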
Production Implementation - Synthetic A/B Testing in Python
Let us simulate an enterprise dataset characterized by our structural causal topology.
After defining a confounded dataset, the script fits both the standard ML model and the causal model, then tabulates the estimation bias that arises in production frameworks:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# 1. define a confounded dataset
np.random.seed(42)
N = 10000

# X: baseline context confounder (e.g., talent score)
X = np.random.normal(0, 1, N)

# A0: match action (historical human managers over-index on X when making matches)
prob_A0 = 1 / (1 + np.exp(-(1.5 * X)))
A0 = np.random.binomial(1, prob_A0)

# M: downstream mediator (communication dropout; a bad match A0=1 causes dropouts)
prob_M = 1 / (1 + np.exp(-(-2.0 + 2.5 * A0)))
M = np.random.binomial(1, prob_M)

# Y: true project exit, driven directly by the bad match AND the communication dropout.
# The true total effect of A0 on Y combines the direct and the mediated pathways.
prob_Y = 1 / (1 + np.exp(-(-3.0 + 1.2 * A0 + 1.8 * M + 0.5 * X)))
Y = np.random.binomial(1, prob_Y)

# create a dataframe
df = pd.DataFrame({'X': X, 'A0': A0, 'M': M, 'Y': Y})

# 2. standard ML approach (feature flattening):
#    the mediator M is thrown into the feature matrix
standard_model = smf.logit("Y ~ A0 + M + X", data=df).fit(disp=0)

# 3. causal inference:
#    the mediator M is structurally left out of the regression
causal_model = smf.logit("Y ~ A0 + X", data=df).fit(disp=0)

# 4. tabulate the estimation bias
#    (coefficient on A0 with its default 95% confidence interval)
models_data = {
    'Model Type': [
        'Standard ML (Flattened)',
        'Causal Backdoor Adjustment'
    ],
    'Estimated Beta (A0)': [
        standard_model.params['A0'],
        causal_model.params['A0']
    ],
    'CI_lower': [
        standard_model.conf_int().loc['A0', 0],
        causal_model.conf_int().loc['A0', 0]
    ],
    'CI_upper': [
        standard_model.conf_int().loc['A0', 1],
        causal_model.conf_int().loc['A0', 1]
    ]
}

results_df = pd.DataFrame(models_data)
print(results_df)
◼ The Simulation Results: Proof of the Trap
Running the Python script generates a dataset where the initial match decision (A_0) has both a direct impact on project exit (Y) and an indirect impact through the downstream communication dropout (M) that it causes.
When we fit both the standard ML model and the causal model, the resulting coefficients expose a massive blind spot.
▫ The Results - Empirical Breakdown of the Coefficients
Table 1. Empirical Parameter Comparison: Structural Ground Truth vs. Observational Feature-Flattened Frameworks
The Model Illusion
The Standard ML model will report a highly deceptive feature importance profile.
It will convince humans that a drop in communication (M = 1.81) is roughly 1.6 times as critical to fix as the actual matching process (A_0 = 1.12).
The Downstream Value Drain
If leadership uses the standard model's outputs to plan investments, the company will pour capital into downstream band-aids such as automated Slack nudges or communication alerts.
These interventions will ultimately fail because they leave the true, high-leverage driver—the broken matching habit—completely unaddressed.
The Causal Unlock
By applying Pearl's Backdoor Criterion and adjusting only for the baseline confounder (X), the causal model recovers a total-effect coefficient of 2.00 on the log-odds scale, which combines the direct path and the path running through M.
This gives leadership an accurate, non-confounded ROI metric to justify rewriting the entire onboarding and assignment policy.
Wrapping Up
Standard machine learning architectures fail in production because they operate under the assumption of a static, passive environment.
They flatten time, collapse topological structure, and incentivize systems to optimize downstream symptoms rather than upstream levers.
By enforcing structural causal identification constraints, we can neutralize selection biases and prevent our models from falling into the feature flattening trap.
However, this static backdoor adjustment framework assumes a massive simplification: it treats the post-treatment environment as a passive channel. In complex operational pipelines, the system is highly dynamic. Human agents continuously monitor trajectories mid-flight, observe intermediate performance drops, and apply spontaneous, ad-hoc corrections.
In our next deep-dive, we will explore how these dynamic, mid-trajectory human adjustments introduce a severe statistical paradox known as Time-Varying Confounding, and analyze why standard static adjustment frameworks completely collapse under feedback loops.
◼ References
Judea Pearl (2009). Causal inference in statistics: An overview. Statistics Surveys, 3, 96-146.
Pearl’s Backdoor Criterion is the mathematical theory to decide exactly which variables one must control for to isolate a true causal effect from observational data.
It essentially tells how to select a conditioning set of variables (Z) that blocks all spurious correlations without accidentally destroying the true causal signal.
▫ The Two Rules of the Backdoor Criterion
A set of variables Z satisfies the backdoor criterion relative to an action A and an outcome Y if it passes two strict conditions:
Rule 1: No Downstream Mediators (No Descendants)
No variable in Z can be a descendant of A.
If a variable M is caused by A, meaning it sits downstream on the causal path like A → M → Y, M is a mediator or a symptom. If you control for it, you block the very effect you are trying to measure, which is what triggers the feature flattening trap.
Rule 2: Block All Upstream Leaks (The Backdoor Paths)
Z must block every path between A and Y that contains an arrow pointing into A.
A path that begins with an arrow pointing into A, such as A ← N → Y, runs through an upstream common cause N (a confounder such as human bias or the background environment) that influences both A and Y. These paths allow statistical information to flow backwards from A, through the confounder, and into Y, creating an illusion of causality.
▫ Backdoor Criterion in Action
Imagine a system with the following topology:
[ Conditioning Set (Experience level, Z) ]
       /                          \
      /                            \
     v                              v
   [ A ] ───────> [ M ] ────────> [ Y ]
Figure 3. Graphical d-Separation Audit - Validating Covariate Conditioning Sets (Z) via Pearl’s Backdoor Criterion.
In the system, the conditioning set Z (Experience Level) perfectly satisfies the Backdoor Criterion.
Here is why:
Identify the Backdoor Path.
Look for paths from A to Y that start with an arrow pointing into A. Here, it is A ← Z → Y. This path is open by default, meaning historical data is confounded by the candidate's experience level Z.
Test Z
If we put Z into our conditioning set, does it block the path? Yes, it intercepts the flow between A and Y.
Is Z a descendant of A? No, it happens before A.
Test M
Can we put M (communication drop) into our conditioning set? No, because M is a descendant of A. Conditioning on M violates Rule 1 and breaks the causal chain.
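The audit above can be automated for small graphs. The sketch below is a brute-force checker under stated assumptions: the DAG from Figure 3 is encoded as a hypothetical edge list `EDGES`, every undirected path is enumerated, and the usual d-separation rules are applied to each backdoor path. The function names are illustrative, and the approach is only practical for graphs with a handful of nodes.

```python
EDGES = [("Z", "A"), ("Z", "Y"), ("A", "M"), ("M", "Y")]   # (parent, child)


def descendants(node, edges):
    """All nodes reachable from `node` by following directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def simple_paths(start, end, edges, path=None):
    """Yield every simple path from start to end, ignoring edge direction."""
    path = path or [start]
    if path[-1] == end:
        yield path
        return
    for u, v in edges:
        for here, nxt in ((u, v), (v, u)):
            if here == path[-1] and nxt not in path:
                yield from simple_paths(start, end, edges, path + [nxt])


def is_blocked(path, cond, edges):
    """d-separation test for one path given the conditioning set `cond`."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider:
            if node not in cond and not (descendants(node, edges) & cond):
                return True    # an unconditioned collider blocks the path
        elif node in cond:
            return True        # a conditioned chain or fork node blocks the path
    return False


def satisfies_backdoor(cond, treatment, outcome, edges):
    cond = set(cond)
    # Rule 1: no member of the set may be a descendant of the treatment.
    if cond & descendants(treatment, edges):
        return False
    # Rule 2: every path starting with an arrow into the treatment is blocked.
    backdoor = [p for p in simple_paths(treatment, outcome, edges)
                if (p[1], p[0]) in edges]
    return all(is_blocked(p, cond, edges) for p in backdoor)


for candidate in ({"Z"}, {"M"}, set(), {"Z", "M"}):
    print(sorted(candidate), satisfies_backdoor(candidate, "A", "Y", EDGES))
```

Only {Z} passes: the empty set leaves A ← Z → Y open, and any set containing M fails Rule 1.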
▫ Backdoor Adjustment Formula
Once we identify a set Z that satisfies Pearl's Backdoor Criterion, we can rewrite Judea Pearl's do-operator using standard, observable conditional probabilities:
$$P(Y \mid do(A = a)) = \sum_{z} P(Y \mid A = a, Z = z)\,P(Z = z) \tag{4.1}$$
The backdoor adjustment formula in Eq. 4.1 allows us to compute the interventional distribution P(Y|do(A=a)) from observational quantities alone.
Mathematically, Eq. 4.1 re-weights the data to emulate a world where the action A was assigned independently of the common cause Z, enabling a clean, synthetic A/B test directly out of old data logs.
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.
FAQ
1) What is the 'feature flattening trap' in machine learning architectures?
👉 The feature flattening trap occurs when time-varying covariates, baseline confounders, upstream treatment vectors, and downstream post-treatment mediators are compressed into a single, unordered input tensor. This strips away chronological topology and directional dependencies, causing empirical risk minimizers to prioritize highly correlated downstream symptoms over true actionable decision levers.
2) Why can't fail-safes like SHAP or tree-based Gini gain values detect causal validity?
👉 SHAP and traditional feature importance calculate the statistical marginal utility of a feature with respect to the existing conditional probability distribution under a static policy. They measure informational pattern tracking rather than physical invariants. Consequently, a variable can exhibit a high SHAP score due to proximity to the target variable while possessing zero structural capacity to alter outcomes under active intervention.
3) How does Judea Pearl's do-operator alter a system topology mathematically?
👉 The do-operator simulates a counterfactual physical intervention by performing graph surgery. It deletes all incoming directed edges pointing into the targeted treatment node. This effectively forces the conditional factor mapping the historical policy to an indicator mass, completely neutralizing selection bias and breaking open upstream backdoor confounding loops without requiring actual live system disruption.
4) What is over-adjustment bias in causal modeling pipelines?
👉 Over-adjustment bias occurs when a practitioner explicitly conditions on a downstream mediator (a variable along the causal pathway between treatment and outcome) or a collider. Conditioning on a mediator traps the natural causal flow, filtering out the variations induced by the policy lever. This structurally blinds the model to the Total Causal Effect (TCE), leading to severe parameter bias.
5) What core requirements must a covariate set fulfill to satisfy the Backdoor Criterion?
👉 The covariate set must satisfy two distinct conditions relative to the treatment-outcome pair: first, no variable within the set can be a graph-theoretic descendant of the treatment variable (preventing mediator conditioning); second, the set must completely block every backdoor pathway containing an arrow pointing into the treatment node, thereby screening out spurious correlations born from background confounders.
Share What You Learned
Kuriko IWAI, "Causal Inference in ML Pipelines: Beyond Feature Flattening" in Kernel Labs
https://kuriko-iwai.com/causal-inference-machine-learning-backdoor-criterion