A Guide to Synthetic Data Generation: Statistical and Probabilistic Approaches
Explore statistical approaches to transform expert knowledge into data, with practical examples
By Kuriko IWAI

Table of Contents
Introduction
What is Synthetic Data Generation

Introduction
Generating synthetic data is a powerful technique to address various challenges in data-driven applications, especially when real-world data is scarce, sensitive, or insufficient for robust model training.
In this article, I'll explore major methods of synthetic data generation leveraging statistical / probabilistic models.
I'll examine both univariate approaches driven by PDF estimation and multivariate approaches like Kernel Density Estimation and Bayesian Networks, using a real-world use case as an example.
What is Synthetic Data Generation
Synthetic data generation is a data enhancement technique in machine learning that generates completely new data from scratch.
Its fundamental approaches involve using statistical models or deep generative models to analyze the patterns and relationships within existing data to produce new data:
Statistical / Probabilistic Models:
Univariate approach: Column-by-column PDF Estimation
Multivariate approach: Kernel Density Estimation (KDE), Bayesian Networks
Deep Generative Models:
Generative Adversarial Networks (GANs)
Variational Autoencoders (VAEs)
In the statistical model approach, we can take univariate approaches, where an individual column is examined and generated, or multivariate approaches, where multiple correlated columns are generated and updated accordingly.
◼ Why Generating Synthetic Data
To make accurate predictions, machine learning models need to be trained on sufficient, high-quality data that reflects the patterns they will encounter in the future.
There are many data enhancement techniques to prepare such data, such as:
Imputation to fill missing values,
Data augmentation to create modified versions of existing data, and
Feature engineering to create features more relevant to the problem.
However, generating synthetic data can offer a more comprehensive solution, especially when:
Real data is extremely limited: New products, rare events, niche scenarios, hypothetical or future conditions lack historical data to test the model’s resilience.
Data privacy is paramount: Sensitive information (e.g., medical records, financial transactions, personal identifiable information) cannot be directly used for development or sharing due to regulations (GDPR, HIPAA) or ethical concerns.
Accelerating development: Providing immediate access to large datasets, removing dependencies on lengthy data collection or access approval processes.
Now, I’ll detail how it is leveraged in a real-world use case.
Univariate Approach: Column-by-Column PDF Estimation
Univariate approaches focus on understanding a probability density function (PDF) of each individual column (or feature) in the dataset.
This approach is based on the assumption that each column is independent, without any correlation to other columns in the dataset.
Hence, sampling occurs independently for each column. When generating synthetic data, values for one column are drawn from its estimated univariate distribution, regardless of the values being generated for any other column.
Best when:
The dataset has dominant columns or is simple enough to disregard correlations.
A computationally efficient method is needed.
◼ How Univariate Approaches Work
In the univariate approach, we take three simple steps to generate a synthetic dataset:
Step 1. Estimate a theoretical PDF of the column we want to generate.
Step 2. Based on the estimation, generate new synthetic values for the column (I’ll cover various methods later).
Step 3. Combine the synthetic column with an existing dataset.
Step 1 is critical for synthetic data to accurately reflect a true underlying distribution of real-world data.
In the next section, I’ll explore how the algorithm mimics the PDFs of real-world data, starting with an overview of PDFs.
◼ What is Probability Density Function (PDF)
A Probability Density Function (PDF) is a statistical function that describes the relative likelihood of a continuous random variable taking on a given value.
The following figure visualizes a valid PDF for a continuous random variable X uniformly distributed between 0 and 5 (uniform distribution).
The area under the PDF curve indicates the probability of a random value falling within a certain range.
For instance, the probability of a random value x being in the range of 0.9 to 1.6 is 14%, because the area under the PDF curve over that interval is 0.2 × (1.6 − 0.9) = 0.14, as highlighted in pink.
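As a quick sanity check (not part of the original walkthrough), the pink area can be computed with scipy's uniform distribution:

```python
from scipy.stats import uniform

# X ~ Uniform(0, 5): the density is 1/5 = 0.2 everywhere on [0, 5]
dist = uniform(loc=0, scale=5)

# P(0.9 <= X <= 1.6) = area under the PDF between 0.9 and 1.6
prob = dist.cdf(1.6) - dist.cdf(0.9)
print(round(prob, 2))  # 0.14
```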

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure A. A PDF for a continuous random variable uniformly distributed between 0 and 5 (Created by Kuriko IWAI)
For a function f(x) to be a valid PDF, it must satisfy two conditions (which we can see from the diagram):
- Non-negative for all possible values of x: f(x) ≥ 0
- The integral of the function over its entire domain must equal one: ∫ f(x) dx = 1
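Both conditions can be verified numerically for the uniform PDF in Figure A (a small sketch using scipy):

```python
from scipy.integrate import quad
from scipy.stats import uniform

# the Uniform(0, 5) PDF from Figure A
dist = uniform(loc=0, scale=5)

# condition 1: non-negative for all x (checked at a few sample points)
assert all(dist.pdf(x) >= 0 for x in [-1.0, 0.0, 2.5, 5.0, 6.0])

# condition 2: the density integrates to one over its domain
area, _ = quad(dist.pdf, 0, 5)
print(round(area, 6))  # 1.0
```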
Different distributions have unique PDF shapes like bell curves, exponential decays, or uniform spreads.
The below figure visualizes various PDF variations.
Distributions in the first row are discrete (categorical), and the rest are continuous. We can also see that the uniform distribution referenced in the previous example is one of these variations.

Figure B. PDF variations (Created by Kuriko IWAI)
In the context of synthetic data generation, the PDF of the synthetic data needs to match the PDF of the real-world data.
In the next section, I’ll explore how we can estimate a PDF for synthetic data.
◼ How to Estimate PDF of Synthetic Data
Based on data availability, we can take two strategies to estimate accurate PDFs:
Strategy 1. Form a strong assumption
When we don’t have any data to start with, we can form a strong assumption about the true underlying distribution, based on our research and experts’ knowledge for instance, and create data from scratch.
Strategy 2. Use univariate non-parametric methods
When we have original data (whether sufficient or not), we can leverage univariate non-parametric methods, where the algorithm learns the underlying distribution from the data without assuming any specific distribution.
Real-world data is imperfect with noise, skewness, heavy tails, and so on. It might follow a combination of multiple PDFs.
Strategy 2 helps tackle these imperfections especially when we don’t have confident clues on true PDFs.
Let us explore each strategy using a real-world use case.
Use Case: Synthetic Dataset for Customer Service Call Durations
Let us imagine a scenario where a manager of a customer service call center plans to reallocate their workforce based on call duration and frequency.
Problem: However, the call center didn’t track the call duration, so they don’t have call duration data.
Our Objective: We’ll create synthetic data that reflects true data distribution of call durations based on the manager’s insights.
Target Variable: call_duration_sec (continuous)
The primary variable to synthesize is call_duration_sec that represents the total time in seconds a customer service call lasts.
Other Features (Manually recorded by call center agents and stored in database)
call_type (categorical): "Order_Inquiry", "Technical_Support", "E-commerce_Return", "Billing_Issue", "General_Inquiry", "Complaint_Resolution".
customer_tier (categorical/ordinal): "Bronze", "Silver", "Gold", "Platinum".
previous_contact_in_7_days (binary/boolean): "True", "False".
product_category (categorical): "Electronics", "Apparel", "Home_Goods", "Software".
day_of_week (categorical): "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday".
time_of_day_bucket (categorical):"Morning_Peak", "Afternoon_Off-Peak", "Evening_Rush", "Late_Night".

Figure C. Original dataset with 10,000 samples
Strategy 1. Forming Assumptions
In this strategy, we first interview the call center manager and choose theoretical distributions based on their description.
We might choose one or multiple combinations from the PDF variations.
I’ll demonstrate three scenarios where the manager’s comments vary.
◼ Scenario 1.
Call Center Manager:
Most of our customers are just checking order statuses or confirming billing details. Since we have template answers to these common questions, most calls don’t take too long. I think it’s around five minutes. We rarely see super short calls unless it's a wrong number or truly marathon calls.
Choosing Distribution
“We rarely see super short calls or truly marathon calls” suggests most calls fall within a predictable range, and extreme durations are rare.
This indicates that the theoretical distribution has a central tendency with symmetrical deviations, which is a characteristic of a Gaussian (normal) distribution.
Tuning PDF Parameters
Normal distributions have two parameters to adjust: the mean and sigma (standard deviation). I set:
mean_duration=5: “I think it’s around five minutes” directly informs that the mean is five.
std_dev=1: Rare "super short" or "marathon" calls indicate that the spread of durations isn't extremely wide. A standard deviation of one minute means that most calls (about 68%) will fall within 4 to 6 minutes (5 ± 1), and nearly all (about 99.7%) within 2 to 8 minutes (5 ± 3), which aligns with the idea of few outliers.
Generating Synthetic Data
I generated and merged the feature call_duration_sec with the original dataset:
import numpy as np

np.random.seed(42)

# generate synthetic data
n_samples = 10000
mean_duration = 5
std_dev = 1
durations = np.random.normal(loc=mean_duration, scale=std_dev, size=n_samples)

# convert negative values to zero
durations = np.maximum(durations, 0)

# merge
df_1 = df.copy()
df_1['call_duration_sec'] = durations * 60  # convert unit from min to sec

df_1.head().transpose()
▫ Results
We can find call_duration_sec in the dataset:

Figure D. Scenario 1. A new dataset with synthetic data call_duration_sec

Figure E. Scenario 1 where call_duration_sec follows a normal distribution (Created by Kuriko IWAI)
◼ Scenario 2.
Call Center Manager:
We have very frequent quick calls – like customers just confirming the payment status. But we also have some calls that just drag on. In those calls, most of the time, customers are very upset and confused by some issues, and we need a lot of de-escalation. So, there's always that chance of a really, really long call.
Choosing Distribution
“Very frequent quick calls” with a chance of “really, really long calls” indicates a long-tailed distribution with a higher density of short calls.
This suggests the theoretical distribution follows the shape of a (shifted) exponential distribution or a Pareto distribution.
While both have long tails, a Pareto distribution's tail is heavier, meaning extremely long calls are more probable under a Pareto distribution than an exponential distribution.
According to the manager, they have “very frequent quick calls“ and “some calls that just drag on”, indicating the tail is long, but not so heavy.
So, I picked an exponential distribution as a theoretical distribution.
Tuning PDF Parameters
An exponential distribution has a single scale parameter to define:
scale=2: The scale parameter equals the mean of the distribution. Assuming “very frequent quick calls” pull the mean lower, I set two (minutes) as a suitable mean (scale).
Since scale = 1 / lambda_rate, this gives lambda_rate = 0.5.
Generating Synthetic Data
import numpy as np

np.random.seed(42)

# generate synthetic data
lambda_rate = 0.5
n_samples = 10000
durations = np.random.exponential(scale=1/lambda_rate, size=n_samples)

# merge
df_2 = df.copy()
df_2['call_duration_sec'] = durations * 60  # convert unit from min to sec

df_2.head().transpose()
▫ Results
We can find call_duration_sec in the dataset:

Fig. Scenario 2. A new dataset with synthetic data call_duration_sec

Figure F. Scenario 2 where call_duration_sec follows an exponential distribution (Created by Kuriko IWAI)
◼ Scenario 3.
Call Center Manager:
We have quick fix calls – very straightforward, often just providing a piece of information or resetting something, and these are usually done in a few minutes. But then we have our deep dive calls as well. These are highly involved, like setting up complex new services, resolving long-standing issues, or detailed product training. These calls always take significantly longer, often 15-20 minutes or more. There's not much in between; it's either quick or long.
Choosing Distribution
The manager’s comment indicates two types of distributions exist: one for “quick fix calls“ and the other for “deep dive calls“.
For “quick fix calls”, since the manager didn’t mention any specific characteristics, we can safely assume a normal distribution.
For “deep dive calls“, “15-20 minutes or more” indicates long tail characteristics of the data distribution. I picked an exponential distribution.
“There's not much in between; it's either quick or long.” indicates that we can equally split the data into the two groups.
Tuning PDF Parameters
For “quick fix calls“:
- mean=2.5, std_dev=0.25: The manager explicitly mentioned “these are usually done in a few minutes”. So I set the parameters such that about 95% of the calls (within two standard deviations of the mean) fall between two and three minutes.
For “deep dive calls“:
- scale=18: The manager mentioned most long calls last “15-20 minutes or more”, meaning the mean (scale) of the distribution might be closer to 20 minutes. I chose 18 for the scale, hence lambda_rate = 1/18.
Generate Synthetic Data
import numpy as np

np.random.seed(42)
n_samples = 10000

# quick calls (first group)
mean_1 = 2.5
std_1 = 0.25
weight_1 = 0.5
n_samples_1 = int(n_samples * weight_1)
durations_1 = np.random.normal(loc=mean_1, scale=std_1, size=n_samples_1)
durations_1 = np.maximum(durations_1, 0)  # to avoid negative values

# deep dive calls (second group)
lambda_rate = 1/18
weight_2 = 0.5
n_samples_2 = int(n_samples * weight_2)
durations_2 = np.random.exponential(scale=1/lambda_rate, size=n_samples_2)
durations_2 = np.maximum(durations_2, 0)  # to avoid negative values

# combine two groups
combined_durations = np.concatenate((durations_1, durations_2))

# (optional) generate call_type data
call_types_1 = ['Quick_Fix'] * n_samples_1
call_types_2 = ['Deep_Dive'] * n_samples_2
combined_call_types = np.array(call_types_1 + call_types_2)

# shuffle durations and call types together so the labels stay aligned
perm = np.random.permutation(len(combined_durations))
combined_durations = combined_durations[perm]
combined_call_types = combined_call_types[perm]

# add synthetic data
df_3 = df.copy()
df_3['call_duration_sec'] = combined_durations * 60  # convert to seconds
df_3['call_type'] = combined_call_types

df_3.head().transpose()
▫ Results
We can find call_duration_sec in the dataset:

Figure G. Scenario 3. A new dataset with synthetic data call_duration_sec

Figure H. Scenario 3 where call_duration_sec follows a combination of multiple PDFs (Created by Kuriko IWAI)
These scenarios illustrate the importance of first forming sound assumptions about the true data distribution, as these expert-driven insights dictate vastly different synthetic data.
Now, let us take a look at the second strategy.
◼ Strategy 2. Using Univariate Non-Parametric Methods
This strategy is based on a scenario where we have some original data.
In this scenario, the original data for the call_duration_sec feature looks like:

Figure I. Original data distribution of call_duration_sec (Created by Kuriko IWAI)
Now, we need to estimate a PDF based on this original data and mimic it for synthetic data.
Examining the diagram, we might think it follows two normal distributions in the range of x = 0 to 750, and then shifts to a uniform distribution in the long tail. But the initial segment from x = 0 to approximately 250 looks like a Chi-square distribution as well.
It is extremely challenging to interpret the true underlying distribution.
This is the reason we leverage non-parametric methods where we don’t need to assume any distribution patterns in advance.
Taking a subset of call_duration_sec: [70, 85, 60, 90, 75] for an example, let us take a look at four major methods:
Histograms,
Empirical Cumulative Distribution Function (ECDF),
Nearest Neighbor Approach, and
Kernel Density Estimation (KDE)
◼ Histograms
Histograms are the simplest non-parametric method: the algorithm creates bins by dividing the range of continuous values into intervals (or grouping categorical values) and counts the frequency in each bin.
▫ Walkthrough Example
Now let us see how it works on the original data (test scores): [70, 85, 60, 90, 75]
Step 1: Sort the data [60, 70, 75, 85, 90] (optional but helpful for visualization and binning).
Step 2: Define bins and count frequencies.
This is the crucial step where binning bias comes into play.
Let's try two different binning strategies to show its effect.
Strategy A: Narrow Bins (Width of 10, starting at 50)
Bin 1: [50, 60) (scores from 50 up to, but not including, 60)
Count: 0
Frequency: 0/5 = 0%
Bin 2: [60, 70)
Count: 1 (score 60)
Frequency: 1/5 = 20%
Bin 3: [70, 80)
Count: 2 (scores 70, 75)
Frequency: 2/5 = 40%
Bin 4: [80, 90)
Count: 1 (score 85)
Frequency: 1/5 = 20%
Bin 5: [90, 100)
Count: 1 (score 90)
Frequency: 1/5 = 20%
Strategy B: Wider Bins (Width of 20, starting at 50)
Bin 1: [50, 70)
Count: 1 (score 60)
Frequency: 1/5 = 20%
Bin 2: [70, 90)
Count: 3 (scores 70, 75, 85)
Frequency: 3/5 = 60%
Bin 3: [90, 110)
Count: 1 (score 90)
Frequency: 1/5 = 20%

Figure J. Binning strategies for synthetic data generation using histogram (Created by Kuriko IWAI)
Step 3. Generating a synthetic score.
1. Generate a random number between 0 and 1. Let's say u = 0.55.
2. Determine which bin this corresponds to based on cumulative frequencies:
Strategy A (blue):
Bin [50, 60): Cumulative Freq = 0%
Bin [60, 70): Cumulative Freq = 0% + 20% = 20%
Bin [70, 80): Cumulative Freq = 20% + 40% = 60% (Our u = 0.55 falls into this range)
Strategy B (red):
Bin [50, 70): Cumulative Freq = 20%
Bin [70, 90): Cumulative Freq = 20% + 60% = 80% (Our u = 0.55 falls into this range)
3. Pick a value from within that bin
Strategy A (blue): For bin [70, 80), the original points were 70 and 75. We'd randomly pick one, say 75.
Strategy B (red): For bin [70, 90), the original points were 70, 75, 85. We'd randomly pick one, say 85.
As we can see, the results vary based on the bin width and starting point.
Truly Non-Parametric Approach
In the last step of choosing a value from the bin, we could pick a random uniform value within the bin's range (e.g., np.random.uniform(70, 80)), let’s say 72.3, especially when dealing with continuous values.
But this implicitly makes a parametric assumption that the data points within that specific bin are uniformly distributed (following a uniform distribution).
This violates the non-parametric premise of the method.
So, for histograms, resampling existing points is the truly non-parametric way.
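The whole procedure above can be sketched in a few lines of Python; the bin edges (Strategy A) and the seed are illustrative choices, not from the article:

```python
import numpy as np

rng = np.random.default_rng(42)
scores = np.array([70, 85, 60, 90, 75])

# Step 2: bin the data (Strategy A: width 10, starting at 50)
edges = np.arange(50, 110, 10)
counts, _ = np.histogram(scores, bins=edges)
cum_freq = np.cumsum(counts) / counts.sum()

# Step 3: draw u ~ Uniform(0, 1) and find its bin via cumulative frequencies
u = rng.uniform()
bin_idx = np.searchsorted(cum_freq, u)

# truly non-parametric: resample an existing point from that bin
lo, hi = edges[bin_idx], edges[bin_idx + 1]
in_bin = scores[(scores >= lo) & (scores < hi)]
synthetic = rng.choice(in_bin)
print(lo, hi, synthetic)
```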
Best When:
The data has discrete values or a naturally binned structure.
Computational resources are extremely limited.
Cons:
Binning Bias: The choice of bin width and starting point affects the shape of the estimated distribution.
Discontinuity: A step-wise, discontinuous nature of histograms is not suitable for continuous data.
Lack of Smoothness: Cannot offer a smooth curve that a truly continuous distribution can offer.
◼ Empirical Cumulative Distribution Function (ECDF)
The ECDF is a step function that, for any given value x, returns the proportion of data points less than or equal to x.
In the synthetic data generation context, we draw a random number u and use the smallest observed value x with F(x) ≥ u as a synthetic sample.
▫ Walkthrough Example
Now let us see how it works on the original data (test scores): [70, 85, 60, 90, 75]
Step 1: Sort the data in ascending order. [60, 70, 75, 85, 90]
Step 2: Calculate the ECDF F(x) for each unique value:
For x < 60: F(x) = 0/5 = 0
For x = 60: F(x) = 1/5 = 0.2 (1 score is ≤ 60)
For x = 70: F(x) = 2/5 = 0.4 (2 scores are ≤ 70)
For x = 75: F(x) = 3/5 = 0.6 (3 scores are ≤ 75)
For x = 85: F(x) = 4/5 = 0.8 (4 scores are ≤85)
For x = 90: F(x) = 5/5 = 1.0 (5 scores are ≤90)
For x > 90: F(x) = 5/5 = 1.0
Step 3. Generate a random number u:
Pick a random number u from a uniform distribution between 0 and 1 (e.g., np.random.uniform(0, 1)).
Let’s say u = 0.5.
Step 4. Find a synthetic data point:
Take the smallest value x in the original sorted data that meets F(x) ≥ u as a synthetic data point.
Looking at the ECDF values, our synthetic score is 75:
F_5(x=60)=0.2
F_5(x=70)=0.4
F_5(x=75)=0.6 (This is the first value where F_5(x) ≥ u = 0.5)
Step 5. Merge: [60, 70, 75, 75, 85, 90]
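The steps above amount to inverting the ECDF; here is a minimal sketch (the helper name ecdf_sample is mine, not from the article):

```python
import numpy as np

scores = np.sort(np.array([70, 85, 60, 90, 75]))  # Step 1: sort

def ecdf_sample(sorted_data, u):
    # smallest x with F(x) >= u, where F(sorted_data[i]) = (i + 1) / n
    n = len(sorted_data)
    idx = np.searchsorted(np.arange(1, n + 1) / n, u)
    return sorted_data[min(idx, n - 1)]

print(ecdf_sample(scores, 0.5))  # u = 0.5 -> 75, as in the walkthrough
```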
Best when:
Synthetic data needs to strictly adhere to the observed data points.
The original data is ordinal or discrete.
Maintaining the exact rank ordering matters.
Cons:
Value reproduction: ECDFs only reproduce values present in the original dataset, leading to a discrete set of synthetic values for continuous variables.
Step-wise nature: Similar to histograms, ECDFs are inherently step-wise, which is not suitable for continuous data.
◼ Nearest Neighbor Approaches (e.g., K-Nearest Neighbors)
While more commonly used for classification or regression, the principle of nearest neighbors can be adapted for density estimation.
For a given point, its density can be estimated based on the distance to its k-th nearest neighbor.
▫ Walkthrough Example
Now let us see how it works on the original data (test scores): [70, 85, 60, 90, 75]
I’ll set k=2.
Step 1: Sort the data [60, 70, 75, 85, 90] (helpful for finding neighbors)
Step 2: Compute local density for each data point:
First, the algorithm computes the distance to its k-th nearest neighbor for each data point.
A smaller distance implies higher local density (points are closer together), suggesting less noise should be added when generating synthetic data around that point.
A larger distance implies lower local density, suggesting more noise should be added.
Let's calculate the distance to the 2nd nearest neighbor for each point:
For 60:
1st nearest neighbor: 70 (distance = 10)
2nd nearest neighbor: 75 (distance = 15)
Distance to the second nearest neighbor: d = 15
For 70:
1st nearest neighbor: 75 (distance = 5)
2nd nearest neighbor: 60 (distance = 10)
Distance to the second nearest neighbor: d = 10
For 75:
1st nearest neighbor: 70 (distance = 5)
2nd nearest neighbor: 85 (distance = 10)
Distance to the second nearest neighbor: d = 10
For 85:
1st nearest neighbor: 90 (distance = 5)
2nd nearest neighbor: 75 (distance = 10)
Distance to the second nearest neighbor: d = 10
For 90:
1st nearest neighbor: 85 (distance = 5)
2nd nearest neighbor: 75 (distance = 15)
Distance to the second nearest neighbor: d = 15
Step 3: Generate a synthetic data point:
To generate a synthetic data point, the algorithm:
Randomly selects an original data point. Let's say 70.
Determines its local density measure. For 70, d = 10.
Adds noise proportional to the local density:
Since the distance of 10 is relatively small, 70 is in a denser region.
The algorithm defines a noise factor: a Gaussian distribution whose mean is zero and standard deviation is d/C = 10/5 = 2 (with scaling constant C = 5), i.e., np.random.normal(0, 2).
Choose a random value from this Gaussian distribution as noise, let’s say 0.8.
Now, our new synthetic sample is 70 + 0.8 = 70.8.
If the algorithm picked 60 where d = 15 (less dense area), the noise standard deviation is 15 / 5 = 3.
A random number from np.random.normal(0, 3) could be something like 2.5, leading to a synthetic point of 60 + 2.5 = 62.5.
This point has more deviation from its original counterpart, reflecting the sparser local density.
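A compact sketch of this procedure (the scaling constant C = 5, the helper name knn_synthetic, and the seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
scores = np.array([60, 70, 75, 85, 90], dtype=float)
k, C = 2, 5.0  # k-th nearest neighbor; C turns distance into a noise std

def knn_synthetic(data, n_new):
    samples = []
    for _ in range(n_new):
        x = rng.choice(data)                      # randomly pick an original point
        d = np.sort(np.abs(data - x))[k]          # distance to k-th nearest neighbor (index 0 is the point itself)
        samples.append(x + rng.normal(0, d / C))  # noise scaled to local density
    return np.array(samples)

synthetic = knn_synthetic(scores, 3)
print(np.round(synthetic, 1))
```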
Best When:
Synthetic data stays very close to the original data points (e.g., for data augmentation where slight variations of existing points are desired).
The data has a complex, non-linear manifold structure that other methods might struggle to capture locally.
Cons:
Complexity: Can be more complex to implement for density estimation.
Boundary issues: May struggle with generating points accurately near the boundaries of the data distribution.
Computational cost: Finding nearest neighbors can be computationally intensive for large datasets.
◼ Kernel Density Estimation (KDE)
Kernel Density Estimation (KDE) learns the shape of the distribution directly from the data by placing a kernel function (a symmetric probability density function) at each data point.
Then it sums these functions to create a smooth, continuous estimate of the probability density function.
▫ Walkthrough Example (Conceptual)
Now let us see how it works conceptually on the original data (test scores): [70, 85, 60, 90, 75]
Step 1. Imagine a small Gaussian curve centered at each data point.
A bump at 60.
A bump at 70.
A bump at 75.
A bump at 85.
A bump at 90.
Step 2. The algorithm sums up all these bumps to estimate the PDF:
Where data points are clustered (e.g., 70, 75), the bumps overlap more, resulting in a higher peak in the estimated PDF.
Where data points are sparse (e.g., between 60 and 70, or above 90), the combined curve will be lower, indicating lower probability density.
Step 3. Generating a synthetic score:
Once this smooth PDF is estimated, we sample from it to get new synthetic scores by picking points along the x-axis. Higher parts of the PDF are denser, so those values have a higher chance of being chosen.
Scenario 1: Sampling a typical score:
Given the original data, the density is highest around 70-85.
If we sample from this KDE, we are very likely to get a score like 72.3 or 80.5.
These values weren't in the original dataset, but they are plausible because they fall within a high-density region of the estimated distribution.
Scenario 2: Sampling a less typical score:
There's still some probability density at the tails (below 60 or above 90), though lower.
We sample a score like 58.1 or 92.7 with a lower probability.
These are outside the original range but are still considered plausible by the estimated distribution.
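Conceptually, scipy's gaussian_kde does exactly this bump-summing; here is a minimal sketch on the same five scores (bandwidth left at scipy's default):

```python
import numpy as np
from scipy.stats import gaussian_kde

scores = np.array([70, 85, 60, 90, 75], dtype=float)

# a Gaussian "bump" is placed at each point; their sum is the estimated PDF
kde = gaussian_kde(scores)

# density is higher inside the 70-85 cluster than out in the tails
print(kde(75)[0] > kde(95)[0])  # True

# draw new synthetic scores from the estimated PDF
synthetic = kde.resample(5, seed=42)[0]
print(np.round(synthetic, 1))
```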
Best When:
Need smooth, continuous synthetic data for individual columns.
The true underlying distribution is unknown or complex.
Prioritize visual fidelity and realistic interpolation between observed data points.
Limitations:
Bandwidth selection: The choice of bandwidth is crucial and can significantly impact the resulting density estimate.
Boundary effects: KDE can sometimes produce density estimates that extend beyond the realistic range of the data (e.g., negative values for inherently positive data like age), especially near the boundaries of the observed data.
Computational cost: For very large datasets, fitting KDE can be more computationally intensive than simpler methods like histograms or ECDFs.
◼ Simulation on Use Case
With an understanding of each method, let us get back to the use case.
Since call_duration_sec is continuous, I’ll use KDE to generate synthetic data. It is straightforward:
import numpy as np
from scipy.stats import gaussian_kde

# fit the original data to kde
original_data = df_new['call_duration_sec']
kde = gaussian_kde(original_data)

# generate the same number of synthetic samples as in the original data
n_synthetic_samples = len(df_new)
synthetic_data = kde.resample(n_synthetic_samples)[0]
synthetic_data = np.maximum(synthetic_data, 0)  # make sure there are no negative values

# add
df_new['call_duration_sec_generated'] = synthetic_data

df_new.head().transpose()
▫ Results

Fig. A new dataset with synthetic data call_duration_sec_generated using KDE (Created by Kuriko IWAI)

Figure K. Synthetic data distribution of call_duration_sec using KDE(blue) and original data (orange) (Created by Kuriko IWAI)
Various univariate approaches give us intuitive ways to generate synthetic data.
But what if call_duration_sec is correlated with other features like customer_tier or call_type?
In the univariate approach, these correlations are disregarded, so when call_type changes, the model will not capture the impact of that change on call_duration_sec.
Now, I’ll explore a multivariate approach where these correlations are well considered.
Multivariate Approach
In the multivariate approach, the algorithm considers correlations among features, including the synthesized ones.
Using the same use case, I’ll explore KDE and Bayesian Network in this section.
◼ Kernel Density Estimation (KDE) for Multivariate Approach
KDE can be used in a multivariate approach by including all the columns whose correlations are to be analyzed.
KDE then captures the joint distribution directly by smoothing the data points in the multi-dimensional feature space.
In our use case, I first encoded the categorical features as numerical codes, then fit a KDE to the combined numerical features:
import numpy as np
from scipy.stats import gaussian_kde

np.random.seed(42)
df_new = df.copy()

# store category-to-code mappings for decoding
cat_mappings = {}
categorical_cols = ['customer_tier', 'previous_contact_in_7_days', 'product_category', 'day_of_week', 'time_of_day_bucket']
for col in categorical_cols:
    df_new[col] = df_new[col].astype('category')  # convert to category dtype
    cat_mappings[col] = df_new[col].cat.categories.tolist()  # store the categories for decoding later
    df_new[col] = df_new[col].cat.codes  # replace the column with its numerical codes

# fit a kde to the combined numerical data
kde_features = ['call_duration_sec'] + categorical_cols
data_for_kde = df_new[kde_features].values.T  # transpose for gaussian_kde (d, N)
kde = gaussian_kde(data_for_kde)
Then, I drew synthetic samples from the fitted kernel:
# generate synthetic data
n_synthetic_samples = 10000
synthetic_samples_raw = kde.resample(n_synthetic_samples)  # returns (d, n_synthetic_samples)
Lastly, I finalized the dataset by decoding the numerical codes back to categories (optional), adjusting data types, and combining the synthetic data:
import pandas as pd

# convert to dataframe
synthetic_df_raw = pd.DataFrame(synthetic_samples_raw.T, columns=kde_features)  # transpose back to (n_synthetic_samples, d)
synthetic_df_raw['call_duration_sec'] = np.maximum(synthetic_df_raw['call_duration_sec'], 0)

# finalize dataframe (assign the duration column first so the scalar call_type broadcasts to the right length)
df_final = pd.DataFrame()
df_final['call_duration_sec'] = synthetic_df_raw['call_duration_sec']
df_final['call_type'] = 'Routine_Inquiry'

# (optional) decode cat columns
for col in categorical_cols:
    # round the sampled numerical values to the nearest integer for decoding - range [0, num_categories - 1]
    num_categories = len(cat_mappings[col])
    decoded_codes = np.round(synthetic_df_raw[col]).astype(int)
    decoded_codes = np.clip(decoded_codes, 0, num_categories - 1)

    # map integer codes back to original categories
    df_final[col] = [cat_mappings[col][code] for code in decoded_codes]

    # convert back to category dtype
    df_final[col] = df_final[col].astype('category')

df_final['previous_contact_in_7_days'] = df_final['previous_contact_in_7_days'].astype(bool)

df_final.head().transpose()
▫ Results
We can see that the kernel generates smooth, continuous synthetic data that is easier for the model to learn from:

Figure L. Comparison of the original data (left: blue) and synthetic data generated in the multivariate approach using KDE (right: green) (Created by Kuriko IWAI)
Based on the joint distribution, the distributions of the other features also shift by varying magnitudes:
customer_tier
Bronze: 0.3412 → 0.3525
Gold: 0.3340 → 0.3309
Silver: 0.3248 → 0.3166

Figure M1. Comparison of the original data (left) and updated data via KDE (right) (Created by Kuriko IWAI)
previous_contact_in_7_days
False: 0.802 → 0.8018
True: 0.198 → 0.1982

Figure M2. Comparison of the original data (left) and updated data via KDE (right) (Created by Kuriko IWAI)
product_category
apparel 0.2569 → 0.2482
software 0.2529 → 0.2513
home_goods 0.2489 → 0.2529
electronics 0.2413 → 0.2476

Figure M3. Comparison of the original data (left) and updated data via KDE (right) (Created by Kuriko IWAI)
day_of_week
Sun: 0.1461 → 0.1433
Mon: 0.1416 → 0.1389
Tue: 0.1424 → 0.1421
Wed: 0.1363 → 0.1447
Thu: 0.1454 → 0.1456
Fri: 0.1440 → 0.1429
Sat: 0.1442 → 0.1425

Figure M4. Comparison of the original data (left) and updated data via KDE (right) (Created by Kuriko IWAI)
time_of_day_bucket
Morning_Peak: 0.2564 →0.2549
Afternoon_Off-Peak: 0.2519 → 0.2491
Evening_Rush: 0.2458 → 0.2455
Late_Night: 0.2459 → 0.2505

Figure M5. Comparison of the original data (left) and updated data via KDE (right) (Created by Kuriko IWAI)
Using KDE in the multivariate approach is especially advantageous when we need to smooth out a relatively limited amount of original data, while
preserving correlation among features,
capturing complexities - arbitrary shapes, skewness, and multimodality - in real-world data, and
securing data privacy by generating data based on the statistical properties without directly reproducing individual records.
On the other hand, it has a major limitation: the curse of dimensionality.
As the number of dimensions (columns) increases, the data becomes sparser, and the number of samples needed for an accurate density estimate grows exponentially.
KDE therefore struggles to estimate the density accurately in high-dimensional spaces, and the quality of the estimator degrades.
Bayesian Networks are a better choice to handle this challenge.
◼ Bayesian Networks (BN)
Bayesian Networks (BNs) are probabilistic graphical models that represent a set of variables and their conditional dependencies using a directed acyclic graph (DAG).
Each node in the graph represents a random variable, and directed edges represent probabilistic dependencies (e.g., A influences B).
Associated with each node is a conditional probability distribution (CPD) that quantifies the relationship between a node and its parents.
▫ Key Difference from KDEs
There are two key differences from KDEs:
- BNs are parametric or semi-parametric.
While the graph structure is learned, the CPDs associated with each node can be parametric (e.g., Gaussian, categorical) or non-parametric (e.g., using KDE for CPDs).
- BNs explicitly model conditional dependencies between features:
They learn which features directly influence others, making the relationships more interpretable (cf. KDE, which learns the full joint distribution among features).
This brings BNs unique strengths:
Handling high dimension feature spaces: By factorizing the joint distribution into conditional probabilities, BNs can handle higher-dimensional data more effectively than direct KDE.
High interpretability: The graphical structure clearly shows the assumed dependencies between features, insightful for understanding the data generation process.
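The factorization benefit can be quantified with a back-of-the-envelope parameter count. The sketch below assumes 10 binary features and a DAG where each node has at most 2 parents; both numbers are illustrative.

```python
# full joint table over n binary features: 2**n states, minus 1 for normalization
n_features = 10
full_joint_params = 2 ** n_features - 1

# factorized BN: each node stores one CPD row per parent configuration,
# and each row over a binary variable has 1 free parameter
max_parents = 2
bn_params = n_features * (2 ** max_parents) * 1

print(full_joint_params, bn_params)  # → 1023 40
```

Fewer free parameters means each one is estimated from more data, which is exactly why BNs cope with dimensionality better than a direct joint-density estimate.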
◼ Simulation
To generate synthetic data, we first learn the structure of the graph (the BN) and the parameters (the CPDs) from the real data.
And once the network is learned, we can sample new data points by traversing the graph and sampling from the conditional distributions.
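The traversal-and-sampling step, often called forward (ancestral) sampling, can be sketched without any library. The two-node network and its probabilities below are illustrative assumptions, not the article's learned CPDs.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy CPDs for a two-node DAG: tier -> duration_bin (values are illustrative)
p_tier = {'Bronze': 0.34, 'Gold': 0.33, 'Silver': 0.33}
p_bin_given_tier = {
    'Bronze': {'short': 0.5, 'long': 0.5},
    'Gold':   {'short': 0.7, 'long': 0.3},
    'Silver': {'short': 0.6, 'long': 0.4},
}

def forward_sample(n: int) -> list:
    """Sample parents first, then children conditioned on the sampled parent values."""
    samples = []
    tiers, tier_probs = zip(*p_tier.items())
    for _ in range(n):
        tier = rng.choice(tiers, p=tier_probs)
        bins_, bin_probs = zip(*p_bin_given_tier[tier].items())
        duration_bin = rng.choice(bins_, p=bin_probs)
        samples.append((str(tier), str(duration_bin)))
    return samples

synthetic = forward_sample(1000)
print(synthetic[:3])
```

This is essentially what pgmpy's BayesianModelSampling.forward_sample does, just vectorized and for arbitrary DAGs.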
First, I created a binned feature, call_duration_sec_binned, because BNs typically handle binned (discrete) features better than continuous ones.
import pandas as pd

# creates a binned feature
num_bins = 5000
df['call_duration_sec_binned'] = pd.cut(
    df['call_duration_sec'],
    bins=num_bins,
    labels=[f'Bin_{i+1}' for i in range(num_bins)],
    include_lowest=True
)

# convert all categorical columns to 'category' dtype for pgmpy
for col in df.columns:
    if df[col].dtype == 'object' or df[col].dtype == 'bool':
        df[col] = df[col].astype('category')

# drop the original continuous column
df_binned = df.drop(columns=['call_duration_sec'])
Next, I structured the BN using the DiscreteBayesianNetwork class from the pgmpy library:
from pgmpy.models import DiscreteBayesianNetwork

# define edges (dependencies) in the bayesian network
model = DiscreteBayesianNetwork([
    ('customer_tier', 'call_duration_sec_binned'),
    ('previous_contact_in_7_days', 'call_duration_sec_binned'),
    ('product_category', 'call_duration_sec_binned'),
    ('day_of_week', 'call_duration_sec_binned'),
    ('time_of_day_bucket', 'call_duration_sec_binned'),
    ('call_type', 'call_duration_sec_binned'),
])
Then, I learned the CPD of each node from the original data using maximum likelihood estimation:
from pgmpy.estimators import MaximumLikelihoodEstimator

model.fit(df_binned, estimator=MaximumLikelihoodEstimator)
Lastly, I drew samples from the learned structure:
from pgmpy.sampling import BayesianModelSampling

# generates samples using BayesianModelSampling
n_synthetic_samples = n_samples
synthetic_data_sampler = BayesianModelSampling(model)
synthetic_df_binned = synthetic_data_sampler.forward_sample(size=n_synthetic_samples)

# (optional) converts binned data back to a numerical representation for 'call_duration_sec'
bins_edges = pd.cut(df['call_duration_sec'], bins=num_bins, include_lowest=True, retbins=True)[1] # gets bin edges
bin_midpoints = [(bins_edges[i] + bins_edges[i+1]) / 2 for i in range(len(bins_edges) - 1)]
bin_labels = [f'Bin_{i+1}' for i in range(num_bins)]
synthetic_df_binned['call_duration_sec_generated'] = synthetic_df_binned['call_duration_sec_binned'].map(dict(zip(bin_labels, bin_midpoints)))

synthetic_df_binned.head().transpose()
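Regardless of which generator produced the samples, a quick way to validate them is to compare category proportions side by side, as the Results sections do. The sketch below uses toy stand-ins for the original and synthetic customer_tier columns; the names and proportions are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
tiers = ['Bronze', 'Gold', 'Silver']

# toy stand-ins for the original and synthetic customer_tier columns
original = pd.Series(rng.choice(tiers, size=5000, p=[0.34, 0.33, 0.33]))
synthetic = pd.Series(rng.choice(tiers, size=5000, p=[0.35, 0.33, 0.32]))

# side-by-side proportions per category, plus the absolute drift
comparison = pd.DataFrame({
    'original': original.value_counts(normalize=True),
    'synthetic': synthetic.value_counts(normalize=True),
})
comparison['abs_diff'] = (comparison['original'] - comparison['synthetic']).abs()
print(comparison.round(4))
```

Small absolute differences across all categories indicate the sampler preserved the marginals; larger, systematic drifts would point to a mis-specified structure or too-coarse binning.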
▫ Results
The new dataset contains a synthetic binned feature, call_duration_sec_binned, and a numerical feature, call_duration_sec_generated:

Figure N. New dataset (Created by Kuriko IWAI)
▫ BN Structure
The following diagram shows the learned BN structure.
Each number in the diagram indicates the mutual information (MI) score between nodes; higher values suggest a stronger correlation between the features (e.g., call_duration_sec_binned has the strongest correlation (MI: 0.59) with day_of_week, and no correlation with call_type because that feature contains only one category).
This high interpretability is one of the key strengths of BNs.

Figure O. Bayesian Network structure (Created by Kuriko IWAI)
▫ CPDs of Each Node (Feature)
The BN also learned a conditional probability distribution for each node (feature) with incoming edges, and a marginal distribution for each node without them.
Conditional Probabilities
time_of_day_bucket
The table below shows the probability of time_of_day_bucket taking on a certain value, given the value of its parent, day_of_week.

Figure P1. Conditional probability computed in the BN structure (Created by Kuriko IWAI)
Mathematically, these conditional probabilities (P's) are denoted:
P(time_of_day_bucket = [State 1 (... in the figure)] | day_of_week = Wed) = 0.24944974321349964
P(time_of_day_bucket = [State 2 (... in the figure)] | day_of_week = Wed) = 0.24651504035216434
P(time_of_day_bucket = [State 3 (... in the figure)] | day_of_week = Wed) = 0.2516507703595011
P(time_of_day_bucket = [State 4 (... in the figure)] | day_of_week = Wed) = 0.2523844460748349
The four probabilities sum to 1.
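As a check on the figures above, any valid CPD column must sum to 1; the values below are taken verbatim from the four conditional probabilities listed for day_of_week = Wed.

```python
# the four conditional probabilities P(time_of_day_bucket = state | day_of_week = Wed)
probs = [
    0.24944974321349964,
    0.24651504035216434,
    0.2516507703595011,
    0.2523844460748349,
]
total = sum(probs)
print(round(total, 9))  # → 1.0
```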
product_category
Similarly for product_category, the BN estimated conditional probabilities given its parent, customer_tier (e.g., being Silver):

Figure P2. Conditional probability computed in the BN structure (Created by Kuriko IWAI)
Marginal Probability (No incoming edges)
When a feature has no incoming edges, the BN simply summarizes the proportions of its categories as a marginal distribution:
customer_tier

day_of_week

previous_contact_in_7_days

In a BN, the quality of the synthetic data heavily depends on how accurately the network structure and conditional probabilities are learned from the original data.
One consideration to keep in mind is the computational cost: learning an optimal network structure can be computationally intensive, especially for datasets with many variables.
Yet, BNs can effectively capture complex multivariate dependencies and correlations between variables, leading to more realistic synthetic datasets.
Wrapping Up
Statistical models are useful for generating synthetic data that accurately reflects the true underlying data distribution.
In our experiment, we observed two approaches: univariate and multivariate.
When we don't have any data to rely on, forming a strong assumption about the underlying distributions is critical to achieving accuracy in synthetic data.
When we have original data (even a small amount), univariate non-parametric approaches are simple and computationally efficient methods.
And to tackle the lack of correlation among independently generated features, Kernel Density Estimation (especially for lower-dimensional feature spaces) and Bayesian Networks (for higher dimensions) are powerful approaches that generate synthetic data while preserving correlations among features.
Overall, synthetic data generation is a useful tactic for building robust models.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Advanced Cross-Validation for Sequential Data: A Guide to Avoiding Data Leakage
Data Augmentation Techniques for Tabular Data: From Noise Injection to SMOTE
Maximum A Posteriori (MAP) Estimation: Balancing Data and Expert Knowledge
Beyond Simple Imputation: Understanding MICE for Robust Data Science
Maximizing Predictive Power: Best Practices in Feature Engineering for Tabular Data
Related Books for Further Understanding
These books cover a wide range of theories and practices, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
Share What You Learned
Kuriko IWAI, "A Guide to Synthetic Data Generation: Statistical and Probabilistic Approaches" in Kernel Labs
https://kuriko-iwai.com/synthetic-data-generation
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.