Beyond Simple Imputation: Understanding MICE for Robust Data Science
A comprehensive guide to the MICE framework for imputation and uncertainty pooling, with practical examples.
By Kuriko IWAI

Table of Contents
Introduction
What is Multivariate Imputation by Chained-Equations (MICE)
How the MICE Algorithm Works
Practical Implementation
Wrapping Up

Introduction
Multivariate Imputation by Chained-Equations (MICE) is a powerful framework for imputing missing values while minimizing the bias and uncertainty that arise from the imputation process.
But understanding and leveraging MICE can be challenging because its iterative, model-dependent nature makes the process complex.
In this article, I'll explore its core mechanics with practical examples of MICE paired with predictive models like Predictive Mean Matching (PMM) and linear regression (LR), compared against a standard model-based imputation method as a baseline.
What is Multivariate Imputation by Chained-Equations (MICE)
Multivariate Imputation by Chained-Equations (MICE) is an imputation method: a technique that fills in missing data points with estimated values.
Major imputation methods fall into three groups:
Statistical methods: Mean, Median, Mode Imputation, etc.
Model-based methods: KNN Imputation, Regression Imputation, GAIN (Generative Adversarial Imputation Networks), etc.
Time series specific methods: Forward / Backward Fill
MICE is a model-based method, though it takes a more sophisticated approach by iteratively predicting missing values using predictive models.
The diagram below illustrates the entire process:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure A. MICE imputation process (Created by Kuriko IWAI)
MICE runs an imputation cycle in which a predictive model generates the posterior predictive distribution of the target variable with missing values.
It then samples imputation values from the generated distribution, repeating the whole procedure many times (M times in Figure A).
The process generates multiple imputed datasets, each with unique imputation values (Imputed Dataset 1, 2, …, M, colored yellow in Figure A).
Uncertainty analysis and pooling are conducted to generate a final set of statistical inference metrics, including the pooled parameter estimate, the total standard error, the t-statistic, and the p-value.
These metrics measure the uncertainty introduced by the missing values and indicate whether the imputed datasets are statistically sufficient to produce reliable inference later.
◼ Key Concepts
During imputation cycles, MICE involves a chained prediction structure where the predictive model imputes missing values in sequence, leveraging the previous imputation results.
For example, let us imagine a simple dataset with two features, Income (X1) and Age (X2), containing missing values:

Figure B. A sample original data with missing values (Created by Kuriko IWAI)
A standard regression imputation, for instance, imputes the missing values by simply applying the learned approximation function:

Figure C. Standard regression imputation (Created by Kuriko IWAI)
This approach is not wrong.
But because the regression model is trained only on the small subset of complete cases, the imputed values (pink cells) might be biased.
The MICE algorithm, on the other hand, leverages chained prediction:

Figure D. The first imputation cycle by MICE (Created by Kuriko IWAI)
First, MICE replaces all missing values with an initial guess, such as the column means (light pink cells).
Then, it uses all the data, including the newly imputed values, to keep refining the imputations (pink cells).
MICE repeats this process several times until the imputed values stabilize, indicating convergence.
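This initialize-then-refine loop can be sketched with a minimal two-variable example. Note that this is a simplified illustration with made-up data and a deterministic regression step; real MICE additionally samples from a posterior predictive distribution:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 300
x2_true = rng.normal(50, 10, n)                   # an "Age"-like variable
x1_true = 2.0 * x2_true + rng.normal(0, 5, n)     # an "Income"-like variable, linearly related

# mask some entries in each column (disjoint masks, so every row keeps one observed value)
x1, x2 = x1_true.copy(), x2_true.copy()
m1 = rng.random(n) < 0.15
m2 = (rng.random(n) < 0.15) & ~m1
x1[m1], x2[m2] = np.nan, np.nan

# step 1: initial placeholder imputation with the column means
x1_f = np.where(m1, np.nanmean(x1), x1)
x2_f = np.where(m2, np.nanmean(x2), x2)

# steps 2-3: chained refinement - each variable is re-imputed from the other's
# CURRENT (partly imputed) values, and the cycle repeats until it stabilizes
for _ in range(10):
    reg1 = LinearRegression().fit(x2_f[~m1].reshape(-1, 1), x1_f[~m1])
    x1_f[m1] = reg1.predict(x2_f[m1].reshape(-1, 1))
    reg2 = LinearRegression().fit(x1_f[~m2].reshape(-1, 1), x2_f[~m2])
    x2_f[m2] = reg2.predict(x1_f[m2].reshape(-1, 1))
```

On this toy data, the chained estimates end up far closer to the true (masked) values than the initial mean fill, because each refinement pass exploits the relationship between the two variables.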
◼ The Missing At Random (MAR) Assumption
Due to its chained-prediction nature, MICE operates under the strong Missing At Random (MAR) assumption: the likelihood that a value is missing depends only on observed values in the dataset.
When the missingness falls into one of the other categories:
MCAR - Missing Completely At Random: the missing values are purely random, or
MNAR - Missing Not At Random: the missingness depends on a specific range of the values in the missing column (e.g., survey respondents with high household income skip the household income question; the missingness of household income then depends on the actual income values),
it is more promising to use other imputation methods:
For MCAR: statistical methods, or model-based methods other than MICE.
For MNAR: autoencoders (especially for complex datasets), or domain-specific models.
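To make the distinction concrete, here is a small sketch of how MCAR and MAR missingness could be simulated; the column names, sample size, and probabilities are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x2 = rng.normal(50, 10, n)  # an always-observed variable, e.g. Age

# MCAR: the probability of being missing ignores the data entirely
mcar_mask = rng.random(n) < 0.2

# MAR: the probability that X1 is missing depends only on the OBSERVED X2
p_mar = 1 / (1 + np.exp(-(x2 - 50) / 5))  # higher X2 -> more likely missing
mar_mask = rng.random(n) < p_mar

# MNAR would instead make the probability depend on X1's own unobserved value,
# which no observed column can explain - the case MICE cannot handle.
```

Under MCAR the missing rate is flat everywhere; under MAR it rises with the observed X2, which is exactly the dependence the diagnostic t-test in the implementation section later detects.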
◼ When to Use MICE
MICE works best on MAR data, as it can:
Achieve high precision: The iterative process provides valid uncertainty estimates compared to other one-shot imputation methods,
Handle multivariate dependencies: It models multivariate dependencies better than simple, non-iterative imputation methods, and
Avoid bias in imputed values by leveraging other variables to create realistic imputations.
However, it also has limitations:
Slow computation compared to other imputation methods,
Added complexity from the iterative process, and
Dependence on underlying model assumptions: Assumptions such as the linearity of the models used within the MICE framework affect the accuracy of the imputed values.
So, choosing cases that maximize its advantages is key.
These cases involve:
Multivariate missing data: As its name suggests, datasets with related missing values across multiple variables can leverage MICE's chained-prediction nature.
Mixed variable types: Datasets containing continuous and categorical variables with missing data.
High missingness: The proportion of missing values is high, around 10% or more of the samples, where other imputation methods would introduce severe bias.
In practice, studies such as clinical trials and public health surveys, where unbiased parameter estimates are critical, are great real-world use cases for MICE.
In the next section, I’ll explore how the MICE algorithm works in detail.
How the MICE Algorithm Works
As shown in Figure A, the imputation process involves the following steps:
Step 1. Initial imputation with placeholder values.
Step 2. Model the posterior distribution and draw a sample.
Step 3. Repeat Step 2 until convergence.
Step 4. Repeat Step 3 to generate multiple datasets with unique imputation values.
Step 5. Conduct uncertainty analysis and pooling.
Let us take a look.
◼ Step 1. Initial Imputation
The first step is to perform a simple imputation like mean or median imputation for all missing values.
To generalize the process:
where Y is a dataset with P variables, Y = [Y_1, Y_2, …, Y_P], and Y^(t) denotes the dataset Y at imputation cycle t.
Before imputation, Y contains both observed values (complete, labelled entries) and missing values.
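With that notation, the initial imputation can be written as follows; this is a reconstruction consistent with the definitions above, using the observed column mean as the placeholder:

```latex
Y^{(0)} = \left[Y_1^{(0)}, Y_2^{(0)}, \dots, Y_P^{(0)}\right], \qquad
Y_j^{\text{mis},(0)} = \bar{Y}_j^{\text{obs}} \quad \text{for } j = 1, \dots, P
```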
◼ Step 2. The Posterior Distribution and Sampling
The next step is to model the posterior distribution and draw a sample from there.
The process starts by:
Training the predictive model,
Modeling the (conditional) posterior distribution of the target variable, and
Drawing a sample from the distribution.
The target variable (like X1 in the previous example) contains missing values.
This process is generalized with the column index of the target variable j and an imputation cycle t:
where:
Y_j: The observed values of the target variable,
Y_j^{(t−1)}: The imputed dataset from the previous imputation cycle t-1,
g_j: The posterior distribution generated by the predictive model, and
θ_j(t): The learnable parameters of the predictive model at the current imputation cycle t.
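Putting these symbols together, the per-variable sampling step takes the standard fully-conditional-specification form, reconstructed here to match the definitions above:

```latex
\theta_j^{(t)} \sim P\!\left(\theta_j \mid Y_j,\, Y_j^{(t-1)}\right), \qquad
Y_j^{\text{mis},(t)} \sim g_j\!\left(Y_j,\, Y_j^{(t-1)},\, \theta_j^{(t)}\right)
```

That is, the model parameters are first drawn given the observed values and the previous cycle's imputations, and the new imputations are then sampled from the resulting posterior predictive distribution.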
The posterior distribution g_j varies with the choice of the predictive model.
For regression tasks, major options are:
Normal (Bayesian) linear regression:
Foundational approach for regression tasks.
Draws a missing value from a Normal distribution:
where the mean μ_i^(t) is the predicted value and the variance σ_j^2(t) is the residual variance derived from a Bayesian framework.
Normal linear regression with bootstrap:
g_j is the same as the normal linear regression.
The model parameters are estimated from a bootstrap sample of the observed data of Y.
Predictive Mean Matching (PMM):
Uses a standard parametric model, such as linear regression, to generate the predictive means.
Then randomly draws a sample from the donor pool D_i, the set of observed values whose predicted means are closest to the recipient's predicted mean:
where k is the donor pool size. The concept is similar to k-nearest neighbors.
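The donor-pool idea can be illustrated in a few lines of NumPy; all numbers here are made up purely for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
y_pred_donors = np.array([1.0, 2.0, 3.0, 10.0, 11.0])  # predicted means of observed rows
y_donors      = np.array([1.1, 2.2, 2.9,  9.8, 11.3])  # their actually observed values
pred_recipient = 2.4                                    # predicted mean for a missing cell
k = 3

# donor pool D_i: the k observed rows whose predicted means are closest
nearest = np.argsort(np.abs(y_pred_donors - pred_recipient))[:k]

# impute with a random draw of an OBSERVED value from the pool
imputed = rng.choice(y_donors[nearest])
```

Because the imputed value is always a real observed value, PMM cannot produce implausible outputs (e.g., a negative income), which is a key practical advantage over drawing from a fitted Normal distribution.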
For classification tasks, major options are:
Normal (Bayesian) logistic regression:
Foundational approach for binary classification tasks.
Draws a sample from the Bernoulli distribution:
where p_i is the conditional probability of the ith missing value belonging to class 1, at an imputation cycle of t.
Polytomous logistic regression:
Foundational approach for multinomial classification tasks.
Draws a sample from the Multinomial distribution:
where p_i is the conditional probability of the ith missing value belonging to category k out of total C categories.
Proportional odds logistic regression:
Like the other methods, draws a sample from a Bernoulli or Multinomial distribution.
Leverages the proportional odds assumption.
Requires significantly fewer model parameters than the other methods, as a single set of slope coefficients is estimated across all cumulative cut-points.
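The classification draws described above can be reconstructed in LaTeX, using p_i as defined in the text:

```latex
% binary (Bayesian) logistic regression
Y_{i}^{\text{mis},(t)} \sim \text{Bernoulli}\!\left(p_i^{(t)}\right)

% polytomous logistic regression over C categories
Y_{i}^{\text{mis},(t)} \sim \text{Multinomial}\!\left(1;\, p_{i1}^{(t)}, \dots, p_{iC}^{(t)}\right)
```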
◼ Step 3. Iterative Convergence
Now, the MICE algorithm repeats Step 2 for all missing variables in sequence until all imputed values converge.
For instance, if the dataset Y has four target variables with missing values (e.g., j = 0, 1, 2, 3), MICE sequentially repeats the process:
Generating the posterior distribution for the first variable j = 0,
Drawing imputed samples,
Moving to the next variable j = 1, and so on.
This sequential imputation creates a chained prediction in which the imputed values from previously completed variables (j = 0, for instance) are incorporated when imputing the current variable (j = 1).
◼ Step 4. Iteration - Generating Multiple Datasets
Then, MICE repeats the entire cycle of Step 3 multiple times (e.g., 20 times) to create multiple imputed datasets, each with different imputation values.
In each cycle, MICE uses a random draw from the posterior distribution g_j, ensuring variation and proper uncertainty estimation across the datasets.
◼ Step 5. Uncertainty Analysis and Pooling
The final step is to perform uncertainty analysis on all imputed datasets and derive their final statistical inference metrics through pooling.
▫ Uncertainty Analysis
The specific analysis depends entirely on the task and data types.
But the general principle here is to run the exact same statistical procedure on all imputed datasets.
For example:
Task: Determine the effect of Age and BMI on Cholesterol level.
Analysis model applied to each imputed dataset: Standard Ordinary Least Squares (OLS) Linear Regression.
Conceptual model formulation: Cholesterol ∼ β_0 + β_1⋅Age + β_2⋅BMI + ϵ
Analysis outcome for the m-th imputed dataset D(m):
A set of coefficient estimates: θ̂_m = (β̂_0, β̂_1, β̂_2)
The variance-covariance matrix of the coefficients: U_m.
Another example:
Task: Predict the likelihood of a chronic disease based on patient characteristics.
Analysis model applied to each imputed dataset: Binary Logistic Regression.
Conceptual model formulation: log(P(DM=1) / (1−P(DM=1))) ∼ α + γ_1⋅Age + γ_2⋅Gender + γ_3⋅Smoking
Analysis outcome for the m-th imputed dataset D(m):
A set of log-odds ratio estimates: θ̂_m = (α̂, γ̂_1, γ̂_2, γ̂_3)
The variance-covariance matrix: U_m
▫ Pooling with Rubin’s Rules
After analyzing all M datasets, there are M estimates of the model parameters (θ̂_1, θ̂_2, …, θ̂_M) and their variances (U_1, U_2, …, U_M).
In the pooling process, Rubin’s Rules combine these multiple results into a single, valid estimate that reflects the overall findings from the imputation process.
For example, the final model parameter θ is given by the expected value of the M estimates:
where θ̄ is the average of the M estimates (the pooled point estimate).
Rubin's Rules quantify the estimation error (difference) between the final result θ and the pooled point estimate θ̄:
SE_pool is this estimation error, called the pooled standard error (SE), and
T is the total variance summing up the within- and between-imputation variance:
where:
Ū: Within-imputation variance, the average of the M squared standard errors (the U_m), quantifying the uncertainty from sampling variability,
B: Between-imputation variance, the sample variance of the M estimates, quantifying the extra uncertainty introduced by imputing the missing values, and
M: The total number of imputed datasets.
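The pooled quantities described above follow Rubin's Rules, reconstructed here in LaTeX with the symbols defined in the text:

```latex
\bar{\theta} = \frac{1}{M} \sum_{m=1}^{M} \hat{\theta}_m, \qquad
\bar{U} = \frac{1}{M} \sum_{m=1}^{M} U_m, \qquad
B = \frac{1}{M-1} \sum_{m=1}^{M} \left(\hat{\theta}_m - \bar{\theta}\right)^2

T = \bar{U} + \left(1 + \frac{1}{M}\right) B, \qquad
SE_{\text{pool}} = \sqrt{T}
```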
By using the error and appropriate degrees of freedom, the process constructs valid confidence intervals and p-values for the final model.
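As a quick numeric check of the rules, with made-up per-dataset estimates and variances:

```python
import numpy as np

# hypothetical results from M = 5 imputed datasets
estimates = np.array([2.0, 2.1, 1.9, 2.05, 1.95])  # theta_hat_m
variances = np.array([0.01] * 5)                   # U_m (squared standard errors)
M = len(estimates)

theta_bar = estimates.mean()                       # pooled point estimate
u_bar = variances.mean()                           # within-imputation variance
b = estimates.var(ddof=1)                          # between-imputation variance
t_total = u_bar + (1 + 1 / M) * b                  # total variance
se_pool = np.sqrt(t_total)                         # pooled standard error
```

Note that the pooled standard error is larger than any single dataset's standard error: the between-imputation term B is precisely the extra uncertainty that single imputation silently discards.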
Practical Implementation
In this section, I’ll demonstrate its application by training a model for a regression task using a synthetic dataset.
The steps involve:
Structuring missing values,
Analyzing the missing values (statistical types),
Defining the imputer,
Imputation,
Uncertainty analysis and pooling,
Model training and evaluation.
For the experiment, I'll compare the following imputation methods:
MICE with PMM
MICE with Normal Linear Regression
Standard Mean Imputation (as the baseline)
The synthetic dataset contains 10,000 samples (including missing values) with three features: X1, X2, and Y.
◼ Step 1. Structure Missing Values
I’ll first replace all missing values with NumPy’s np.nan.
By default, Pandas DataFrames don't recognize unstructured missing values like a simple space or text inputs such as "NA" or "?".
If not structured, these values are treated as valid entries by Pandas, which would skew the analysis.
import pandas as pd
import numpy as np


def structure_missing_values(df: pd.DataFrame, target_cols: list = []) -> pd.DataFrame:
    target_cols = target_cols if target_cols else df.columns.tolist()

    # list unstructured missing-value representations
    unstructured_missing_vals = ['', '?', ' ', 'nan', 'N/A', None, 'na', 'None', 'none']

    # map each representation to np.nan (assign back, since replace is not in-place)
    structured_nan = {item: np.nan for item in unstructured_missing_vals}
    structured_df = df.copy()
    for col in target_cols:
        structured_df[col] = structured_df[col].replace(structured_nan)
    return structured_df

df_structured = structure_missing_values(df=original_df)
◼ Step 2. Analyzing Missing Values
Next, I’ll analyze the missing values to assess if MICE is suitable.
We’ve learned there are multiple conditions under which MICE would work the best.
For demonstration, I'll focus on categorizing the statistical type of the missingness:
from scipy import stats

def assess_missingness_diagnostics(df, target_variable='X1', missingness_type='mcar'):
    # create a missingness indicator
    df_temp = df.copy()
    df_temp['is_missing'] = df_temp[target_variable].isna().astype(int)
    observed_vars = [col for col in df_temp.columns if col not in [target_variable, 'is_missing']]

    # compare means of observed variables (X2, Y) between 1) group_observed (X1 observed) and 2) group_missing (X1 missing)
    for var in observed_vars:
        # create the groups
        group_observed = df_temp[df_temp['is_missing'] == 0][var]
        group_missing = df_temp[df_temp['is_missing'] == 1][var]

        # check if enough samples exist in both groups
        if len(group_missing) < 2 or len(group_observed) < 2: continue

        # perform two-sample t-test to compare the group means
        _, p_value = stats.ttest_ind(group_observed.dropna(), group_missing.dropna())
        mean_obs = group_observed.mean()
        mean_miss = group_missing.mean()

        print(f"variable: {var}")
        print(f" mean ({target_variable} observed): {mean_obs:.10f}")
        print(f" mean ({target_variable} missing): {mean_miss:.10f}")
        print(f" p-value: {p_value:.10f}")

        # rejection of h0 (equal means) suggests missingness depends on the observed variable
        if p_value < 0.05:
            print(" -> conclusion: mar or mnar because means are statistically different.")
        else:
            # failure to reject h0 suggests independence
            print(" -> conclusion: mcar because means are not statistically different.")

# for mcar, the means of Y and X2 should be similar whether X1 is observed or missing.
assess_missingness_diagnostics(
    df=df_structured,
    target_variable='X1',
    missingness_type='mar'
)
The outputs:
The means of both X2 and Y differ between rows where X1 is observed and rows where X1 is missing, indicating either MAR or MNAR.
variable: X2
mean (X1 observed): 50.8368932981
mean (X1 missing): 38.2861376084
p-value: 0.0000000000
variable: Y
mean (X1 observed): -2.3461196877
mean (X1 missing): 8.5763595796
p-value: 0.0000000000
In reality, these statistical types can be ambiguous, or a single dataset can involve mixed types.
A good strategy is to train the model using several different imputation methods and then compare their impact on the model's generalization capabilities.
◼ Step 3. Defining the Imputer
After ensuring that MICE is suitable, I’ll define the imputer using the IterativeImputer class from the Scikit-learn library:
▫ MICE with PMM
from sklearn.impute import IterativeImputer

imputer_mice_pmm = IterativeImputer(
    estimator=PMM(),          # custom pmm estimator (defined below)
    max_iter=10,              # num of cycles
    initial_strategy='mean',  # initial imputation val
    random_state=42,          # for reproducibility
)
For the estimator, I'll build a custom PMM class inheriting from the BaseEstimator class:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.base import BaseEstimator, RegressorMixin

# implement a custom pmm estimator
class PMM(BaseEstimator, RegressorMixin):
    def __init__(self, k: int = 5, base_estimator=LinearRegression()):
        self.k = k  # donor pool size (k nearest neighbors)
        self.base_estimator = base_estimator

    def fit(self, observed_X, observed_y):
        self.base_estimator.fit(observed_X, observed_y)

        # store the observed data
        self.X_donors = observed_X.copy()
        self.y_donors = observed_y.copy()

        # predict the means for all observed donors
        self.y_pred_donors = self.base_estimator.predict(self.X_donors)
        return self

    # core pmm logic: sample values from the k nearest neighbors of observed data. X has missing vals
    def predict(self, X):
        # predict the mean for the missing data (recipients)
        y_pred_recipients = self.base_estimator.predict(X)

        imputed_values = np.zeros(X.shape[0])

        # perform pmm for each recipient (row in X)
        for i, pred_recipient in enumerate(y_pred_recipients):
            # compute the absolute difference between the recipient's predicted mean and all the donors' predicted means
            diffs = np.abs(self.y_pred_donors - pred_recipient)

            # get the indices of the k smallest differences (k nearest matches)
            nearest_indices = np.argsort(diffs)[:self.k]

            # randomly sample an observed value from the k nearest neighbors (donor pool)
            donor_pool = self.y_donors[nearest_indices]
            imputed_values[i] = np.random.choice(donor_pool)

        return imputed_values

    ## helper for IterativeImputer compatibility
    def _predict_with_uncertainty(self, X, return_std=False):
        if return_std:
            # pmm is semi-parametric; report zero standard deviation
            return self.predict(X), np.zeros(X.shape[0])
        return self.predict(X)
The KNeighborsRegressor class from the same library would perform similarly.
▫ MICE with Normal Linear Regression
For normal linear regression, I’ll use the BayesianRidge model from the Scikit-learn library to ensure unique imputation values:
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

imputer_mice_lr = IterativeImputer(
    estimator=BayesianRidge(max_iter=500, tol=1e-3, alpha_1=1e-10, alpha_2=1e-10, lambda_1=1e-10, lambda_2=1e-10),
    max_iter=10,
    initial_strategy='mean',
    random_state=42,
    sample_posterior=True  # ensure imputation values are unique by adding random noise drawn from the posterior distribution to the predictions
)
◼ Step 4. Imputation
The imputation process iterates the chained prediction M = 5 times, producing five imputed datasets.
I’ll define the run_mice_imputation function:
import pandas as pd
from sklearn.impute import IterativeImputer


def run_mice_imputation(df: pd.DataFrame, imputer: IterativeImputer, M: int = 5) -> list:
    # iteration
    imputed_datasets = []  # subject to analysis and pooling later
    imputed_values_x1 = {}
    missing_indices_x1 = df[df['X1'].isna()].head(3).index.tolist()

    for m in range(M):
        # set up the imputer for each iteration (a unique random state controls the numpy seed for pmm's sampling)
        setattr(imputer, 'random_state', m)

        # impute df and convert the generated array to a pandas df
        df_m = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

        # recording
        imputed_datasets.append(df_m)
        imputed_values_x1[m] = df_m['X1'].loc[missing_indices_x1].tolist()

    # outputs - unique imputation values
    print("Imputed values across M datasets:")
    for m in range(M):
        print(f"imputed dataset {m + 1} - the first three imputed values for X1: {[f'{x:.14f}' for x in imputed_values_x1[m]]}")

    return imputed_datasets
▫ MICE with PMM
imputed_datasets_mice_pmm = run_mice_imputation(df=df, imputer=imputer_mice_pmm, M=5)
The five completed datasets (D’s) generate unique imputation values for the target variable X1:
Imputed values across M datasets:
imputed dataset 1: ['9.24212539359209', '8.86784044761337', '16.27959379662983']
imputed dataset 2: ['10.67728456933526', '6.72710070935441', '8.84037515230672']
imputed dataset 3: ['10.15359370564326', '6.72710070935441', '18.59563606925789']
imputed dataset 4: ['10.78202129747172', '6.72710070935441', '8.84037515230672']
imputed dataset 5: ['9.35638700644611', '5.15233963776858', '13.88818918534419']
▫ MICE with Normal Linear Regression
imputed_datasets_mice_lr = run_mice_imputation(df=df, imputer=imputer_mice_lr, M=5)
The five completed datasets (D’s) generate unique imputation values for the target variable X1:
Imputed values across M datasets:
imputed dataset 1: ['12.21062281693754', '9.96558458658052', '12.49501000922409']
imputed dataset 2: ['8.71573023588752', '10.85416071007341', '12.35300502646557']
imputed dataset 3: ['12.72459653377737', '8.71106811173562', '11.99063594937416']
imputed dataset 4: ['9.73654929041454', '3.72657966114651', '12.08960122126962']
imputed dataset 5: ['8.59526280022813', '12.78831127091796', '12.39797242039489']
◼ Step 5. Uncertainty Analysis and Pooling
After creating multiple datasets, I’ll run uncertainty analysis and pooling:
import numpy as np
import statsmodels.formula.api as smf
from scipy.stats import t


def perform_analysis_and_pooling(imputed_datasets, target_param='X1'):
    m = len(imputed_datasets)
    estimates = []  # theta_m
    variances = []  # within-imputation variance u_m

    # 1. analysis - run the model on each imputed dataset
    for i, df_m in enumerate(imputed_datasets):
        # ols model
        model = smf.ols(formula='Y ~ X1 + X2', data=df_m).fit()

        # extract the estimate (theta_m) and its variance (u_m) for the target parameter
        estimate = model.params[target_param]
        variance = model.bse[target_param]**2

        estimates.append(estimate)
        variances.append(variance)

    # 2. pooling w rubin's rules
    # pooled point estimate (theta_bar)
    theta_bar = np.mean(estimates)

    # within-imputation variance (u_bar)
    u_bar = np.mean(variances)

    # between-imputation variance (b)
    b = (1 / (m - 1)) * np.sum([(est - theta_bar)**2 for est in estimates])

    # total variance (t)
    t_total = u_bar + (1 + (1 / m)) * b

    # total standard error (se)
    se_pooled = np.sqrt(t_total)

    # relative increase in variance (riv) and degrees of freedom (v)
    riv = ((1 + (1 / m)) * b) / u_bar
    v_approx = (m - 1) * (1 + (1 / riv))**2

    # confidence interval
    t_critical = t.ppf(0.975, df=v_approx)
    ci_lower = theta_bar - t_critical * se_pooled
    ci_upper = theta_bar + t_critical * se_pooled

    print("\n--- pooled results using rubin's rules ---")
    print(f"pooled estimate ({target_param}): {theta_bar:.10}")
    print(f"within variance (u_bar): {u_bar:.10}")
    print(f"between variance (b): {b:.10f}")
    print(f"total variance (t): {t_total:.10}")
    print(f"pooled standard error: {se_pooled:.10}")
    print(f"relative increase in variance (riv): {riv:.10}")
    print(f"degrees of freedom (approx): {v_approx:.2f}")
    print(f"95% ci: [{ci_lower:.10}, {ci_upper:.10}]")

    return theta_bar, se_pooled, t_total, v_approx
▫ MICE with PMM
The final coefficient estimate for X1 is 2.021, with a 95% confidence interval from 1.954 to 2.087.
The uncertainty added by the imputation process (RIV) was low (RIV ≈ 5.7%).
Pooled results:
pooled estimate (X1): 2.020731625
within variance (u_bar): 0.001079913886
between variance (b): 0.0000517265
total variance (t): 0.001141985746
pooled standard error: 0.0337932796
relative increase in variance (riv): 0.05747852707
degrees of freedom (approx): 1353.92
95% ci: [1.95443875, 2.087024499]
▫ MICE with Normal Linear Regression
The final coefficient estimate for X1 is 2.029, with a 95% confidence interval spanning from 1.963 to 2.096.
The uncertainty due to the missing data was relatively small (RIV ≈ 6.8%).
Pooled results:
pooled estimate (X1): 2.029474175
within variance (u_bar): 0.001082280232
between variance (b): 0.0000609312
total variance (t): 0.001155397613
pooled standard error: 0.03399114021
relative increase in variance (riv): 0.06755864026
degrees of freedom (approx): 998.81
95% ci: [1.962771936, 2.096176414]
It is safe to say that the uncertainty introduced by the imputation process is acceptable in both methods.
◼ Step 6. Model Training and Evaluation
Lastly, I’ll train the Support Vector Regressor (SVR) on each imputed dataset and average the results to evaluate its generalization capabilities:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

mse_train_mice_pmm_list = []
mse_val_mice_pmm_list = []

for i, df in enumerate(imputed_datasets_mice_pmm):
    # create X, y
    y = df['Y']
    X = df[['X1', 'X2']]

    # create train, val, test datasets
    test_size, random_state = 1000, 42
    X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, shuffle=True, random_state=random_state)
    X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=test_size, shuffle=True, random_state=random_state)

    # preprocess
    num_cols = X.columns
    num_transformer = Pipeline(steps=[('scaler', StandardScaler())])
    preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num_cols)], remainder='passthrough')

    X_train = preprocessor.fit_transform(X_train)
    X_val = preprocessor.transform(X_val)
    X_test = preprocessor.transform(X_test)

    # train the model
    model = SVR(kernel="rbf", degree=3, gamma='scale', coef0=0, tol=1e-5, C=1, epsilon=1e-5, max_iter=1000)
    model.fit(X_train, y_train)

    # inference
    y_pred_train = model.predict(X_train)
    mse_train = mean_squared_error(y_train, y_pred_train)

    y_pred_val = model.predict(X_val)
    mse_val = mean_squared_error(y_val, y_pred_val)

    # recording
    mse_train_mice_pmm_list.append(mse_train)
    mse_val_mice_pmm_list.append(mse_val)


# final performance metric that accounts for imputation uncertainty
pooled_mse_train_mice_pmm = np.mean(mse_train_mice_pmm_list)
pooled_mse_val_mice_pmm = np.mean(mse_val_mice_pmm_list)

print(f'\nfinal performance - train mse: {pooled_mse_train_mice_pmm:.6f}, val mse: {pooled_mse_val_mice_pmm:.6f}')
◼ Results
The training and validation results of each method are evaluated with the Mean Squared Error (MSE):
MICE with PMM: Training: 394.135692, Validation: 368.203358
MICE with Normal Linear Regression: Training: 331.430123, Validation: 308.422768
Regression Imputation (baseline): Training: 410.631099, Validation: 381.695746
Both MICE methods achieved better performance (lower MSE) than the single Regression Imputation baseline.
The MICE with Normal Linear Regression method performed the best, delivering the lowest Validation MSE (308.42).
This superior performance is likely due to:
Imputation model fit: The normal linear regression model used by MICE accurately captures the underlying relationships in the data.
Prediction consistency: PMM draws values through stochastic sampling (randomness), which introduces sampling noise and leads to higher MSEs.
In summary, MICE successfully improved model performance by correctly handling missing data uncertainty, with the Linear Regression imputation technique proving the most efficient for the task.
Wrapping Up
MICE is a competitive imputation method for creating statistically valid inferences and robust predictive models from incomplete data.
In the experiments, we observed that MICE successfully improved model performance, outperforming the single-imputation baseline.
Moving forward, optimizing the estimator, the number of iterations, or the number of imputation cycles would further improve the model's capabilities.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Advanced Cross-Validation for Sequential Data: A Guide to Avoiding Data Leakage
Data Augmentation Techniques for Tabular Data: From Noise Injection to SMOTE
A Guide to Synthetic Data Generation: Statistical and Probabilistic Approaches
Maximum A Posteriori (MAP) Estimation: Balancing Data and Expert Knowledge
Maximizing Predictive Power: Best Practices in Feature Engineering for Tabular Data
Related Books for Further Understanding
These books cover a wide range of theory and practice, from the fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
Share What You Learned
Kuriko IWAI, "Beyond Simple Imputation: Understanding MICE for Robust Data Science" in Kernel Labs
https://kuriko-iwai.com/multivariate-imputation-by-chained-equations
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.




