Beyond Simple Imputation: Understanding MICE for Robust Data Science

A comprehensive guide to MICE framework for imputation and uncertainty pooling with practical examples.

Machine Learning · Data Science · Python

By Kuriko IWAI

Table of Contents

Introduction
What is Multivariate Imputation by Chained-Equations (MICE)
Key Concepts
The Missing At Random (MAR) Assumption
When to Use MICE
How the MICE Algorithm Works
Step 1. Initial Imputation
Step 2. The Posterior Distribution and Sampling
Step 3. Iterative Convergence
Step 4. Iteration - Generating Multiple Datasets
Step 5. Uncertainty Analysis and Pooling
Practical Implementation
Step 1. Structure Missing Values
Step 2. Analyzing Missing Values
Step 3. Defining the Imputer
Step 4. Imputation
Step 5. Uncertainty Analysis and Pooling
Step 6. Model Training and Evaluation
Results
Wrapping Up

Introduction

Multivariate Imputation by Chained-Equations (MICE) is a powerful framework for imputing missing values while minimizing the bias and uncertainty that arise from the imputation process.

But understanding and leveraging MICE can be challenging because of its iterative, model-dependent nature.

In this article, I'll explore its core mechanics with practical examples of MICE using predictive models such as Predictive Mean Matching (PMM) and linear regression, compared against a standard single imputation method as a baseline.

What is Multivariate Imputation by Chained-Equations (MICE)

Multivariate Imputation by Chained Equations (MICE) is an imputation method, i.e., a technique that fills in missing data points with estimated values.

Major imputation methods fall into three groups:

  • Statistical methods: Mean, Median, Mode Imputation, etc.

  • Model-based methods: KNN Imputation, Regression Imputation, GAIN (Generative Adversarial Imputation Networks), etc.

  • Time series specific methods: Forward / Backward Fill

MICE is a model-based method, although it takes a more sophisticated approach by iteratively predicting missing values with predictive models.

The diagram below illustrates the entire process:

Figure A. MICE imputation process (Created by Kuriko IWAI)

MICE runs imputation cycles in which a predictive model generates the posterior predictive distribution of the target variable containing missing values.

It then samples imputation values from the generated distribution across multiple runs (M times in Figure A).

The process generates multiple imputed datasets with unique imputation values (Imputed Dataset 1, 2, …, M, colored yellow in Figure A).

Uncertainty analysis and pooling are conducted to generate a final set of statistical inference metrics, including the pooled parameter estimate, the total standard error, the t-statistic, and the p-value.

These metrics measure the uncertainty introduced by the missing values and indicate whether the imputed datasets are statistically sufficient to support reliable inference later.

Key Concepts

During imputation cycles, MICE involves a chained prediction structure where the predictive model imputes missing values in sequence, leveraging the previous imputation results.

For example, let us imagine a simple dataset with two features of Income (X1) and Age (X2) with missing values:

Figure B. A sample original dataset with missing values (Created by Kuriko IWAI)

A standard regression imputation, for instance, imputes the missing values by simply applying the learned approximation function:

Figure C. Standard regression imputation (Created by Kuriko IWAI)

This approach is not a mistake.

But because the regression model is trained only on the small subset of complete cases, the imputed values (pink cells) might be biased.

The MICE algorithm, on the other hand, leverages chained prediction:

Figure D. The first imputation cycle by MICE (Created by Kuriko IWAI)

First, MICE replaces all missing values with initial guesses such as the column means (light pink cells).

Then, it uses all data including the newly imputed values to keep refining the imputation values (pink cells).

MICE repeats the process several times until the imputed values stabilize, indicating convergence.

The Missing At Random (MAR) Assumption

Due to its chained-prediction nature, MICE relies on the strong Missing At Random (MAR) assumption, under which the probability of a value being missing depends only on the observed values in the dataset.

When the missingness falls into one of the other two categories:

  • MCAR - Missing Completely At Random: The values are missing purely at random, or

  • MNAR - Missing Not At Random: The missingness depends on the unobserved values themselves (e.g., survey respondents with high household income skip the household income question, so the missingness of household income depends on the actual income values),

it is more promising to use other imputation methods like:

  • For MCAR: Statistical methods, model-based methods except MICE.

  • For MNAR: Autoencoders (especially for complex datasets) or domain-specific models.

When to Use MICE

MICE works best on MAR data as it can:

  • Achieve high precision: The iterative process provides valid uncertainty estimates compared to other one-shot imputation methods,

  • Handle multivariate dependencies: It models dependencies across multiple variables better than simple, non-iterative imputation methods, and

  • Avoid bias in imputed values by leveraging other variables to create realistic imputations.

However, it also has limitations like:

  • Slower computation than other imputation methods,

  • Added complexity from the iterative process, and

  • Dependence on the underlying model assumptions: Assumptions such as linearity in the predictive models used within MICE affect the accuracy of the imputed values.

So, choosing cases that maximize its advantages is key.

These cases involve:

  • Multivariate missing data: As its name suggests, datasets with related missing values across multiple variables can leverage MICE’s chained-prediction nature.

  • Mixed variable types: Datasets containing continuous and categorical variables with missing data.

  • High missingness: The proportion of missing values is high (around 10% or more), a regime where simpler imputation methods would introduce severe bias.

In practice, studies such as clinical trials and public health surveys, where unbiased parameter estimates are critical, are typical real-world use cases for MICE.

In the next section, I’ll explore how the MICE algorithm works in detail.

How the MICE Algorithm Works

As shown in Figure A, the imputation process involves the following steps:

  • Step 1. Initial imputation with placeholder values.

  • Step 2. Model the posterior distribution and draw a sample.

  • Step 3. Repeat Step 2 until convergence.

  • Step 4. Repeat Step 3 to generate multiple datasets with unique imputation values.

  • Step 5. Conduct uncertainty analysis and pooling.

Let us take a look.

Step 1. Initial Imputation

The first step is to perform a simple imputation like mean or median imputation for all missing values.

To generalize the process:

Y^{(0)} = \text{SimpleImpute}(Y)

where Y is a dataset with P variables, Y = [Y_1, Y_2, …, Y_P], and Y^{(t)} denotes the dataset Y at imputation cycle t.

Before the imputation, Y contains both observed values (labelled values without missing values) and missing values.
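
As a minimal sketch of this step (assuming a pandas DataFrame Y with numeric columns; simple_impute is a hypothetical helper, not a library function), the initial pass can be a column-wise mean fill:

import numpy as np
import pandas as pd

def simple_impute(Y: pd.DataFrame) -> pd.DataFrame:
    # initial imputation Y(0): replace every missing value with its column mean
    return Y.fillna(Y.mean(numeric_only=True))

# hypothetical toy data with missing entries
Y = pd.DataFrame({'X1': [1.0, np.nan, 3.0], 'X2': [10.0, 20.0, np.nan]})
Y0 = simple_impute(Y)  # Y(0), the starting point for the chained cycles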

Step 2. The Posterior Distribution and Sampling

The next step is to model the posterior distribution and draw a sample from there.

The process starts by:

  • Training the predictive model,

  • Modeling the (conditional) posterior distribution of the target variable, and

  • Drawing a sample from the distribution.

The target variable (like X1 in the previous example) contains missing values.

This process is generalized with the column index of the target variable j and an imputation cycle t:

Y_j \sim g_j(Y_{-j}^{(t-1)}; \theta_j^{(t)})

where:

  • Y_j​: The observed values of the target variable,

  • Y_{-j}^{(t−1)}: All variables other than Y_j, taken from the imputed dataset of the previous imputation cycle t-1,

  • g_j​: The posterior distribution generated by the predictive model, and

  • θ_j(t)​: The learnable parameters of the predictive model at the current imputation cycle t.

The posterior distribution g_j varies with the choice of the predictive model.

For regression tasks, major options are:

Normal (Bayesian) linear regression:

  • Foundational approach for regression task.

  • Draws a missing value from a Normal distribution:

g_j = N(\mu_i^{(t)}, {\sigma_j^2}^{(t)})

where the mean μ_i^(t)​ is the predicted value and the variance σ_j^2(t) is the residual variance derived from a Bayesian framework.
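
As an illustrative sketch (not the exact sampler of any particular library; mu_i and sigma2_j are hypothetical values standing in for the fitted model's outputs at cycle t), a single stochastic draw looks like:

import numpy as np

rng = np.random.default_rng(42)

mu_i = 12.3      # hypothetical predicted mean for the i-th missing value
sigma2_j = 4.0   # hypothetical residual variance for variable j

# one draw from g_j = N(mu_i, sigma_j^2)
y_imputed = rng.normal(loc=mu_i, scale=np.sqrt(sigma2_j))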

Normal linear regression with bootstrap:

  • g_j is the same as the normal linear regression.

  • The model parameters are estimated from a bootstrap sample of the observed data of Y.

Predictive Mean Matching (PMM):

  • Uses a standard parametric model like linear regression to generate the predictive means.

  • Then, randomly draws a sample from the donor pool D_i, the set of observed values whose predicted means are closest to the recipient's predicted mean:

Y_j \sim g_j \implies P(Y_i^{imp} = Y_j) = \frac{1}{k} \text{ for } j \in D_i

where k is the donor pool size. The concept is similar to k-nearest neighbors.
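
A minimal NumPy sketch of the donor step (with hypothetical predicted means; a full estimator version appears later in the practical section):

import numpy as np

rng = np.random.default_rng(0)
k = 5  # donor pool size

# hypothetical predicted means and observed values for the donors, plus one recipient
y_pred_donors = np.array([9.8, 10.1, 10.4, 11.0, 12.2, 15.6, 8.7])
y_obs_donors = np.array([9.5, 10.3, 10.0, 11.4, 12.0, 16.1, 8.9])
y_pred_recipient = 10.2

# indices of the k donors whose predicted means are closest to the recipient's predicted mean
nearest = np.argsort(np.abs(y_pred_donors - y_pred_recipient))[:k]

# impute with one randomly drawn OBSERVED value from the donor pool
y_imputed = rng.choice(y_obs_donors[nearest])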

For classification tasks, major options are:

Normal (Bayesian) logistic regression:

  • Foundational approach for binary classification tasks.

  • Draws a sample from the Bernoulli distribution:

g_j = \text{Bernoulli}(p_i^{(t)})

where p_i​ is the conditional probability of the ith missing value belonging to class 1, at an imputation cycle of t.

Polytomous logistic regression:

  • Foundational approach for multinomial classification tasks.

  • Draws a sample from the Multinomial distribution:

g_j = \text{Categorical}(p_{i1}^{(t)}, p_{i2}^{(t)}, \cdots, p_{iC}^{(t)})

where p_{ic} is the conditional probability of the ith missing value belonging to category c out of C total categories.

Proportional odds logistic regression:

  • Like the other methods, draws a sample from a Bernoulli or Multinomial distribution.

  • Leverages the proportional odds assumption.

  • Requires significantly fewer model parameters than the other methods, since a single set of slope coefficients is estimated across all cumulative cut-points.

Step 3. Iterative Convergence

Now, the MICE algorithm repeats Step 2 for all missing variables in sequence until all imputed values converge.

For instance, if the dataset Y has four target variables with missing values (e.g., j = 0, 1, 2, 3), MICE sequentially repeats the process:

  • Generating the posterior distribution for the first variable j = 0,

  • Drawing imputed samples,

  • Moving to the next variable j = 1, and so on.

This sequential imputation creates a prediction chain in which the freshly imputed values of already-completed variables (j = 0, for instance) are incorporated when imputing the current variable (j = 1), as sketched below.
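
A schematic sketch of one full cycle (fit_and_sample is a hypothetical placeholder for whichever model/posterior pair g_j is chosen; this is not how any specific library implements it internally):

import pandas as pd

def one_mice_cycle(df: pd.DataFrame, missing_mask: pd.DataFrame, fit_and_sample) -> pd.DataFrame:
    # run one chained-equations cycle: impute each variable in turn,
    # reusing the values imputed earlier in the same cycle
    df = df.copy()
    cols_with_missing = [c for c in df.columns if missing_mask[c].any()]
    for col in cols_with_missing:
        rows = missing_mask[col]
        X_obs = df.loc[~rows, df.columns != col]   # predictors for rows where col is observed
        y_obs = df.loc[~rows, col]                 # observed target values
        X_mis = df.loc[rows, df.columns != col]    # predictors for rows where col is missing
        # fit g_j on the observed data, then draw imputations for the missing rows
        df.loc[rows, col] = fit_and_sample(X_obs, y_obs, X_mis)
    return df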

Step 4. Iteration - Generating Multiple Datasets

Then, MICE repeats the entire cycle of Step 3 multiple times (e.g., 20 times) to create multiple imputed datasets, each with different imputation values.

In each run, MICE takes random draws from the posterior distribution g_j, ensuring variation and proper uncertainty estimation across the datasets.

Step 5. Uncertainty Analysis and Pooling

The final step is to perform uncertainty analysis on all imputed datasets and combine their statistical inference metrics through pooling.

Uncertainty Analysis

The specific analysis depends entirely on the task and data types.

But the general principle here is to run the exact same statistical procedure on all imputed datasets.

For example:

  • Task: Determine the effect of Age and BMI on Cholesterol level.

  • Analysis model MICE used: Standard Ordinary Least Squares (OLS) Linear Regression.

  • Conceptual model formulation: Cholesterol ∼ β_0​ + β_1​⋅Age + β_2​⋅BMI + ϵ

  • Analysis outcome for the m-th imputed dataset D(m):

    • A set of coefficient estimates: \hat{\theta}_m = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)

    • The variance-covariance matrix of the coefficients: U_m​.

Another example:

  • Task: Predict the likelihood of a chronic disease based on patient characteristics.

  • Analysis model MICE used: Binary Logistic Regression.

  • Conceptual model formulation: log(P(DM=1) / (1 − P(DM=1))) ∼ α + γ_1⋅Age + γ_2⋅Gender + γ_3⋅Smoking

  • Analysis outcome for the m-th imputed dataset D(m):

    • A set of log-odds ratio estimates: \hat{\theta}_m = (\hat{\alpha}, \hat{\gamma}_1, \hat{\gamma}_2, \hat{\gamma}_3)

    • The variance-covariance matrix: U_m​

Pooling with Rubin’s Rules

After analyzing all M datasets, there are M estimates of the model parameters (\hat{\theta}_1, \hat{\theta}_2, …, \hat{\theta}_M) and their variances (U_1, U_2, …, U_M).

In the pooling process, Rubin’s Rules combine these multiple results into a single, valid estimate that reflects the overall findings from the imputation process.

For example, the final model parameter θ is provided as the expected value of the averaged M estimates:

\theta = E[\bar{\theta}]

where \bar{\theta} is the average of the M estimates (the pooled point estimate).

Rubin's Rules also quantify the uncertainty of the pooled point estimate \bar{\theta} through the pooled standard error:

SE_{pool} = \sqrt{T}

where:

  • SE_{pool} is the pooled standard error (SE), and

  • T is the total variance, which sums the within- and between-imputation variances:

T = \bar{U} + \left(1 + \frac{1}{M}\right)B

where:

  • \bar{U}: Within-imputation variance, the average of the M squared standard errors, quantifying the uncertainty from sampling variability,

  • B: Between-imputation variance, the sample variance of the M estimates, quantifying the extra uncertainty introduced by imputing the missing values, and

  • M: The total number of imputed datasets.

By using the error and appropriate degrees of freedom, the process constructs valid confidence intervals and p-values for the final model.
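
To make the formulas concrete, here is a tiny worked example with hypothetical numbers (M = 5 estimates and their squared standard errors; these are illustrative values, not results from the experiment below):

import numpy as np
from scipy.stats import t

estimates = np.array([2.01, 2.05, 1.98, 2.03, 2.02])            # hypothetical theta_hat_m
variances = np.array([0.0011, 0.0010, 0.0012, 0.0011, 0.0010])  # hypothetical U_m (squared SEs)
M = len(estimates)

theta_bar = estimates.mean()              # pooled point estimate
U_bar = variances.mean()                  # within-imputation variance
B = estimates.var(ddof=1)                 # between-imputation variance
T = U_bar + (1 + 1 / M) * B               # total variance
se_pool = np.sqrt(T)                      # pooled standard error

riv = (1 + 1 / M) * B / U_bar             # relative increase in variance
df_approx = (M - 1) * (1 + 1 / riv) ** 2  # approximate degrees of freedom
ci = theta_bar + np.array([-1, 1]) * t.ppf(0.975, df_approx) * se_pool  # 95% confidence interval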

Practical Implementation

In this section, I’ll demonstrate its application by training a model for a regression task using a synthetic dataset.

The steps involve:

  1. Structuring missing values,

  2. Analyzing the missing values (statistical types),

  3. Defining the imputer,

  4. Imputation,

  5. Uncertainty analysis and pooling,

  6. Model training and evaluation.

For the experiment, I’ll compare the following imputation methods:

  • MICE with PMM

  • MICE with Normal Linear Regression

  • Standard Mean Imputation (as the baseline)

The synthetic dataset contains 10,000 samples (including missing values) with three features: X1, X2, and Y.

Step 1. Structure Missing Values

I’ll first replace all missing values with NumPy’s np.nan.

By default, Pandas DataFrames don't recognize unstructured missing values like a simple space or text inputs like “NA“ or “?“.

If not structured, these values are treated as valid entries by Pandas, which would skew the analysis.

import pandas as pd
import numpy as np


def structure_missing_values(df: pd.DataFrame, target_cols: list = []) -> pd.DataFrame:
    target_cols = target_cols if target_cols else df.columns.tolist()

    # list up unstructured missing value options
    unstructured_missing_vals = ['', '?', ' ', 'nan', 'N/A', None, 'na', 'None', 'none']

    # map every unstructured token to np.nan and replace column by column
    structured_nan = { item: np.nan for item in unstructured_missing_vals }
    structured_df = df.copy()
    for col in target_cols:
        structured_df[col] = structured_df[col].replace(structured_nan)  # assign back; replace() is not in-place
    return structured_df

df_structured = structure_missing_values(df=original_df)

Step 2. Analyzing Missing Values

Next, I’ll analyze the missing values to assess if MICE is suitable.

We’ve seen there are multiple conditions under which MICE works best.

For demonstration, I’ll focus on categorizing the statistical type of the missingness:

from scipy import stats

def assess_missingness_diagnostics(df, target_variable='X1'):
    # create a missingness indicator
    df_temp = df.copy()
    df_temp['is_missing'] = df_temp[target_variable].isna().astype(int)
    observed_vars = [col for col in df_temp.columns if col not in [target_variable, 'is_missing']]

    # compare means of observed variables (X2, Y) between 1) group_observed (X1 observed) and 2) group_missing (X1 missing)
    for var in observed_vars:
        # create the groups
        group_observed = df_temp[df_temp['is_missing'] == 0][var]
        group_missing = df_temp[df_temp['is_missing'] == 1][var]

        # check if enough samples exist in both groups
        if len(group_missing) < 2 or len(group_observed) < 2: continue

        # perform a two-sample t-test on the mean difference
        _, p_value = stats.ttest_ind(group_observed.dropna(), group_missing.dropna())
        mean_obs = group_observed.mean()
        mean_miss = group_missing.mean()

        print(f"variable: {var}")
        print(f"  - mean ({target_variable} observed): {mean_obs:.10f}")
        print(f"  - mean ({target_variable} missing): {mean_miss:.10f}")
        print(f"  - p-value: {p_value:.10f}")

        # rejection of h0 (equal means) suggests missingness depends on the observed variable
        if p_value < 0.05:
            print(f"  -> conclusion: mar or mnar because means are statistically different.")
        else:
            # failure to reject h0 suggests independence
            print(f"  -> conclusion: mcar because means are not statistically different.")

# for mcar, the means of Y and X2 should be similar whether X1 is observed or missing.
assess_missingness_diagnostics(df=df_structured, target_variable='X1')

The outputs:

The means of both X2 and Y differ between rows where X1 is observed and rows where X1 is missing, indicating either MAR or MNAR.

variable: X2

  • mean (X1 observed): 50.8368932981

  • mean (X1 missing): 38.2861376084

  • p-value: 0.0000000000

variable: Y

  • mean (X1 observed): -2.3461196877

  • mean (X1 missing): 8.5763595796

  • p-value: 0.0000000000

In reality, these statistical types can be ambiguous, or a single dataset can involve mixed types.

A good strategy is to train the model using several different imputation methods and then compare their impact on the model's generalization capabilities.

Step 3. Defining the Imputer

After ensuring that MICE is suitable, I’ll define the imputer using the IterativeImputer class from the Scikit-learn library:

MICE with PMM

from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable IterativeImputer)
from sklearn.impute import IterativeImputer

imputer_mice_pmm = IterativeImputer(
    estimator=PMM(),            # custom PMM estimator (defined below)
    max_iter=10,                # num of cycles
    initial_strategy='mean',    # initial imputation value
    random_state=42,            # for reproducibility
)

For an estimator, I’ll customize the PMM class inheriting the BaseEstimator class:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.base import BaseEstimator, RegressorMixin

# implement a custom pmm estimator
class PMM(BaseEstimator, RegressorMixin):
    def __init__(self, k: int = 5, base_estimator=LinearRegression()):
        self.k = k  # donor pool size (k nearest neighbors)
        self.base_estimator = base_estimator

    def fit(self, observed_X, observed_y):
        self.base_estimator.fit(observed_X, observed_y)

        # store the observed data
        self.X_donors = observed_X.copy()
        self.y_donors = observed_y.copy()

        # predict the means for all observed donors
        self.y_pred_donors = self.base_estimator.predict(self.X_donors)
        return self

    # core pmm logic: sample a value from the k nearest neighbors of the observed data. X contains the rows to impute
    def predict(self, X):
        # predict the mean for the missing data (recipients)
        y_pred_recipients = self.base_estimator.predict(X)

        imputed_values = np.zeros(X.shape[0])

        # perform pmm for each recipient (row in X)
        for i, pred_recipient in enumerate(y_pred_recipients):
            # compute the absolute difference between the recipient's predicted mean and all the donors' predicted means
            diffs = np.abs(self.y_pred_donors - pred_recipient)

            # get the indices of the k smallest differences (k nearest matches)
            nearest_indices = np.argsort(diffs)[:self.k]

            # randomly sample an observed value from the k nearest neighbors (donor pool)
            donor_pool = self.y_donors[nearest_indices]
            imputed_value = np.random.choice(donor_pool, size=1)[0]
            imputed_values[i] = imputed_value

        return imputed_values

    ## helper for IterativeImputer compatibility
    def _predict_with_uncertainty(self, X, return_std=False):
        if return_std:
            return self.predict(X), np.zeros(X.shape[0])  # pmm is semi-parametric. set standard deviation = 0
        return self.predict(X)

The KNeighborsRegressor class from the same library would behave similarly as the estimator, as shown in the sketch below.
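
For reference, a minimal sketch of that alternative (note it returns the deterministic neighbor average rather than a random donor draw, so it loses PMM's stochastic element):

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor

imputer_mice_knn = IterativeImputer(
    estimator=KNeighborsRegressor(n_neighbors=5),  # predicts with the average of the 5 nearest observed neighbors
    max_iter=10,
    initial_strategy='mean',
    random_state=42,
)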

MICE with Normal Linear Regression

For normal linear regression, I’ll use the BayesianRidge model from the Scikit-learn library to ensure unique imputation values:

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

imputer_mice_lr = IterativeImputer(
    estimator=BayesianRidge(max_iter=500, tol=1e-3, alpha_1=1e-10, alpha_2=1e-10, lambda_1=1e-10, lambda_2=1e-10),
    max_iter=10,
    initial_strategy='mean',
    random_state=42,
    sample_posterior=True  # ensure unique imputation values by drawing from the posterior distribution instead of using the point prediction
)

Step 4. Imputation

The imputation process repeats the full chained imputation M = 5 times to generate five imputed datasets.

I’ll define the run_mice_imputation function:

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def run_mice_imputation(df: pd.DataFrame, imputer: IterativeImputer, M: int = 5) -> list:
    # iteration
    imputed_datasets = []  # subject to analysis and pooling later
    imputed_values_x1 = {}
    missing_indices_x1 = df[df['X1'].isna()].head(3).index.tolist()

    for m in range(M):
        # reset the imputer for each iteration (a unique random state controls the sampling seed)
        setattr(imputer, 'random_state', m)

        # impute df and convert the generated array back to a pandas df
        df_m = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

        # recording
        imputed_datasets.append(df_m)
        imputed_values_x1[m] = df_m['X1'].loc[missing_indices_x1].tolist()

    # outputs - unique imputation values
    print("Imputed values across M datasets:")
    for m in range(M):
        print(f"imputed dataset {m + 1} - the first three imputed values for X1: {[f'{x:.14f}' for x in imputed_values_x1[m]]}")

    return imputed_datasets

MICE with PMM

imputed_datasets_mice_pmm = run_mice_imputation(df=df_structured, imputer=imputer_mice_pmm, M=5)

The five completed datasets (D’s) generate unique imputation values for the target variable X1:

Imputed values across M datasets:

  • imputed dataset 1: ['9.24212539359209', '8.86784044761337', '16.27959379662983']

  • imputed dataset 2: ['10.67728456933526', '6.72710070935441', '8.84037515230672']

  • imputed dataset 3: ['10.15359370564326', '6.72710070935441', '18.59563606925789']

  • imputed dataset 4: ['10.78202129747172', '6.72710070935441', '8.84037515230672']

  • imputed dataset 5: ['9.35638700644611', '5.15233963776858', '13.88818918534419']

MICE with Normal Linear Regression

imputed_datasets_mice_lr = run_mice_imputation(df=df_structured, imputer=imputer_mice_lr, M=5)

The five completed datasets (D’s) generate unique imputation values for the target variable X1:

Imputed values across M datasets:

  • imputed dataset 1: ['12.21062281693754', '9.96558458658052', '12.49501000922409']

  • imputed dataset 2: ['8.71573023588752', '10.85416071007341', '12.35300502646557']

  • imputed dataset 3: ['12.72459653377737', '8.71106811173562', '11.99063594937416']

  • imputed dataset 4: ['9.73654929041454', '3.72657966114651', '12.08960122126962']

  • imputed dataset 5: ['8.59526280022813', '12.78831127091796', '12.39797242039489']

Step 5. Uncertainty Analysis and Pooling

After creating multiple datasets, I’ll run uncertainty analysis and pooling:

import numpy as np
import statsmodels.formula.api as smf
from scipy.stats import t


def perform_analysis_and_pooling(imputed_datasets, target_param='X1'):
    m = len(imputed_datasets)
    estimates = []  # theta_m
    variances = []  # within-imputation variance u_m

    # 1. analysis - run the model on each imputed dataset
    for i, df_m in enumerate(imputed_datasets):
        # ols model
        model = smf.ols(formula='Y ~ X1 + X2', data=df_m).fit()

        # extract the estimate (theta_m) and its variance (u_m) for the target parameter
        estimate = model.params[target_param]
        variance = model.bse[target_param]**2

        estimates.append(estimate)
        variances.append(variance)

    # 2. pooling with rubin's rules
    # pooled point estimate (theta_bar)
    theta_bar = np.mean(estimates)

    # within-imputation variance (u_bar)
    u_bar = np.mean(variances)

    # between-imputation variance (b)
    b = (1 / (m - 1)) * np.sum([(est - theta_bar)**2 for est in estimates])

    # total variance (t)
    t_total = u_bar + (1 + (1 / m)) * b

    # total standard error (se)
    se_pooled = np.sqrt(t_total)

    # relative increase in variance (riv) and degrees of freedom (v)
    riv = ((1 + (1 / m)) * b) / u_bar
    v_approx = (m - 1) * (1 + (1 / riv))**2

    # confidence interval
    t_critical = t.ppf(0.975, df=v_approx)
    ci_lower = theta_bar - t_critical * se_pooled
    ci_upper = theta_bar + t_critical * se_pooled

    print("\n--- pooled results using rubin's rules ---")
    print(f"pooled estimate ({target_param}): {theta_bar:.10}")
    print(f"within variance (u_bar): {u_bar:.10}")
    print(f"between variance (b): {b:.10f}")
    print(f"total variance (t): {t_total:.10}")
    print(f"pooled standard error: {se_pooled:.10}")
    print(f"relative increase in variance (riv): {riv:.10}")
    print(f"degrees of freedom (approx): {v_approx:.2f}")
    print(f"95% ci: [{ci_lower:.10}, {ci_upper:.10}]")

    return theta_bar, se_pooled, t_total, v_approx
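
Calling the function on each set of imputed datasets from Step 4 (variable names follow the earlier snippets):

# pool the OLS results across the five imputed datasets for each MICE variant
pooled_pmm = perform_analysis_and_pooling(imputed_datasets_mice_pmm, target_param='X1')
pooled_lr = perform_analysis_and_pooling(imputed_datasets_mice_lr, target_param='X1')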

MICE with PMM

The final coefficient estimate for X1 is 2.021, with a 95% confidence interval from 1.954 to 2.087.

The uncertainty added by the imputation process (RIV) was low (RIV ≈ 5.7%).

Pooled results:

  • pooled estimate (X1): 2.020731625

  • within variance (u_bar): 0.001079913886

  • between variance (b): 0.0000517265

  • total variance (t): 0.001141985746

  • pooled standard error: 0.0337932796

  • relative increase in variance (riv): 0.05747852707

  • degrees of freedom (approx): 1353.92

  • 95% ci: [1.95443875, 2.087024499]

MICE with Normal Linear Regression

The final coefficient estimate for X1 is 2.029, with a 95% confidence interval spanning from 1.963 to 2.096.

The uncertainty due to the missing data was relatively small (RIV ≈ 6.8%).

Pooled results:

  • pooled estimate (X1): 2.029474175

  • within variance (u_bar): 0.001082280232

  • between variance (b): 0.0000609312

  • total variance (t): 0.001155397613

  • pooled standard error: 0.03399114021

  • relative increase in variance (riv): 0.06755864026

  • degrees of freedom (approx): 998.81

  • 95% ci: [1.962771936, 2.096176414]

It is safe to say that the uncertainty introduced by the imputation process is acceptable in both methods.

Step 6. Model Training and Evaluation

Lastly, I’ll train the Support Vector Regressor (SVR) on each imputed dataset and average the results to evaluate its generalization capabilities:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

mse_train_mice_pmm_list = []
mse_val_mice_pmm_list = []

for df in imputed_datasets_mice_pmm:
    # create X, y
    y = df['Y']
    X = df[['X1', 'X2']]

    # create train, val, test datasets
    test_size, random_state = 1000, 42
    X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, shuffle=True, random_state=random_state)
    X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=test_size, shuffle=True, random_state=random_state)

    # preprocess
    num_cols = X.columns
    num_transformer = Pipeline(steps=[('scaler', StandardScaler())])
    preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num_cols)], remainder='passthrough')

    X_train = preprocessor.fit_transform(X_train)
    X_val = preprocessor.transform(X_val)
    X_test = preprocessor.transform(X_test)

    # train the model
    model = SVR(kernel="rbf", degree=3, gamma='scale', coef0=0, tol=1e-5, C=1, epsilon=1e-5, max_iter=1000)
    model.fit(X_train, y_train)

    # inference
    y_pred_train = model.predict(X_train)
    mse_train = mean_squared_error(y_train, y_pred_train)

    y_pred_val = model.predict(X_val)
    mse_val = mean_squared_error(y_val, y_pred_val)

    # recording
    mse_train_mice_pmm_list.append(mse_train)
    mse_val_mice_pmm_list.append(mse_val)


# final performance metric that accounts for imputation uncertainty
pooled_mse_train_mice_pmm = np.mean(mse_train_mice_pmm_list)
pooled_mse_val_mice_pmm = np.mean(mse_val_mice_pmm_list)

print(f'\nfinal performance - train mse: {pooled_mse_train_mice_pmm:.6f}, val mse: {pooled_mse_val_mice_pmm:.6f}')

Results

The training and validation results of each method are evaluated with the Mean Squared Error (MSE):

  • MICE with PMM: Training: 394.135692, Validation: 368.203358

  • MICE with Normal Linear Regression: Training: 331.430123, Validation: 308.422768

  • Regression Imputation (baseline): Training: 410.631099, Validation: 381.695746

Both MICE methods achieved better performance (lower MSE) than the single Regression Imputation baseline.

The MICE with Normal Linear Regression method performed the best, delivering the lowest Validation MSE (308.42).

This superior performance is likely due to:

  1. Imputation model fit: The normal linear regression model used by MICE accurately captures the underlying relationships in the data.

  2. Prediction consistency: The PMM method uses stochastic sampling (random donor draws), which introduces inherent sampling noise and can lead to higher MSE.

In summary, MICE successfully improved model performance by properly handling missing-data uncertainty, with the linear regression imputation model proving the most effective for this task.

Wrapping Up

MICE is a competitive imputation method for creating statistically valid inferences and robust predictive models from incomplete data.

In the experiments, we observed that MICE successfully improved model performance, outperforming the single-imputation baseline.

Moving forward, tuning the estimator, the number of imputed datasets (M), or the number of cycles could further improve the model’s capabilities.

Continue Your Learning

Related Books for Further Understanding

These books cover the wide range of theories and practices; from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Share What You Learned

Kuriko IWAI, "Beyond Simple Imputation: Understanding MICE for Robust Data Science" in Kernel Labs

https://kuriko-iwai.com/multivariate-imputation-by-chained-equations

Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.