Generative Adversarial Network (GAN): From Vanilla Minimax to ProGAN

Explore core GAN principles with a walkthrough example and major GAN architectures

Deep Learning | Data Science

By Kuriko IWAI

Table of Contents

Introduction
What is a Generative Adversarial Network (GAN)?
The Adversarial Game
Use Cases
How Vanilla GANs Work
The Generator’s Objective
The Discriminator’s Objective
Optimizing Vanilla GANs
The Training Process
The Nash Equilibrium
The Walkthrough Example
Hyperparameter Tuning
Types of GANs
Deep Convolutional GAN (DCGAN)
Conditional GAN (cGAN)
Progressive GAN (ProGAN)
Wrapping Up

Introduction

Generative Adversarial Networks (GANs) are widely used neural networks for creating realistic synthetic content such as text, audio, and high-resolution images and video.

Because GANs have many types designed to address specific data and complexity hurdles, navigating their mechanics and the broad landscape can be challenging.

In this article, I’ll first explore the core mechanics of the foundational Vanilla (Standard) GAN with a walkthrough example, and then introduce three major, influential variations:

  • Deep Convolutional GAN (DCGAN),

  • Conditional GAN (cGAN), and

  • Progressive Growing of GANs (ProGAN).

What is a Generative Adversarial Network (GAN)?

Generative Adversarial Networks (GANs) are a class of deep learning architectures designed for generative modeling, which focuses on generating new, realistic examples that resemble the original data.

The below diagram illustrates the architecture of a vanilla GAN, the most basic form of GAN:

Figure A. The vanilla GAN architecture (Created by Kuriko IWAI)

While GANs have many variations (which I’ll cover later), every GAN consists of two competing neural networks:

  • A generator (G) (blue in Figure A) and

  • A discriminator (D) (pink in Figure A).

Vanilla GANs are the basic form of GANs, using a fully-dense multilayer neural network for both the generator (G) and the discriminator (D).

In the network, G maps a low-dimensional random noise vector (grey box in Figure A) to synthetic data: the fake samples (white boxes).

D, on the other hand, is trained to distinguish the fake samples from real ones (orange boxes) and outputs a probability score from zero (completely fake) to one (completely real).

The Adversarial Game

The adversarial game is a game-theoretic contest played between G and D during training.

In the game:

  • G tries to fool D into classifying its fake samples as real, and

  • D tries to correctly identify the real samples from the fake ones.

This continuous competition improves both networks to the point where G creates highly realistic fake samples while D spots even subtle imperfections.

At the end of the game, where the network converges, G generates fake samples so authentic that D cannot tell them apart from the real samples.

Use Cases

Common use cases of GANs are:

  • Generating highly realistic objects like images, video, and audio.
Figure B. Imaginary celebrities generated by ProGAN (source)

  • Performing image-to-image translation like style transfer and domain adaptation, and
Figure C. Image-to-image translation by CycleGAN (source)

  • Generating synthetic data for training other machine learning models:
Figure D. Synthetic medical image generation using GAN (source)

In practice, a specific GAN variant is selected for each use case based on factors like:

  • Desired output quality,

  • Desired levels of output diversity,

  • Required training stability, and

  • Data types like images, text, or time series.

In the next section, I’ll explore the mechanics of vanilla GANs in detail.

How Vanilla GANs Work

The adversarial game of the vanilla GAN is the minimax game between D and G.

Mathematically, the game score is measured with a function called a value function V(D, G):

V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad \cdots \text{(A)}

where:

  • x: The real sample,

  • G(z): The fake sample created by G (G’s output),

  • D(G(z)): The probability of the fake sample G(z) being real (D’s output),

  • D(x): The probability of the real sample x being real (D’s output),

  • p_data(x): The data distribution of the real samples,

  • p_z(z): The data distribution of the random noise (e.g., Gaussian or uniform distribution), and

  • E: The expected value.

In the minimax game, D tries to maximize the value function, while G tries to minimize the function.

So, the overall objective function of the vanilla GAN is defined by combining these two networks’ objectives:

\min_G \max_D V(D,G) \quad \cdots \text{(B)}

The Generator’s Objective

Now, as we learned, G’s objective is to minimize the value function:

\min_G V(D,G)

In Function A, G can only control the second term because the first term depends only on D and the real data x.

So, G’s objective function is simplified:

L_g = \min_G V(D,G) = \min_G \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

where L_g indicates the vanilla generator’s loss, which measures how well G fools D across all noise vectors z drawn from the noise distribution p_z(z).

When successful, D recognizes the fake sample as real and outputs a value close to one:

D(G(z)) \approx 1

which causes the log value in L_g to approach negative infinity, minimizing the generator’s loss:

\log(1 - D(G(z))) \to -\infty

This is the mathematical basis of G’s objective: the value function is minimized when G can fool D with realistic fake samples.

The Vanishing Gradient Problem

In practice, the vanilla generator’s loss faces a vanishing gradient problem during early stages of training, especially when D becomes much stronger than G, returning D(G(z)) ≈ 0 for all fake samples.

The gradient of the vanilla generator’s loss is defined as below, leveraging the chain rule:

\frac{\partial L_g}{\partial G_o} = \frac{\partial \log(1 - D_o)}{\partial G_o} = \underbrace{\frac{\partial \log(1 - D_o)}{\partial D_o}}_{\approx\, -1} \times \underbrace{\frac{\partial D_o}{\partial Z}}_{\approx\, 0} \times \frac{\partial Z}{\partial G_o}

where:

  • ∂L_g/∂G_o: The gradient of the vanilla generator’s loss with respect to G’s output,

  • L_g: The (vanilla) generator’s loss,

  • D_o: The output values from D, D_o= D(G(z)),

  • G_o: The output values from G, and

  • Z: The pre-activation input to the activation function (i.e., sigmoid) that produces D_o. Z = Wx + b, where W and b are weight and bias parameters and x is the layer input.

Gradients vanish when D gets strong and outputs D(G(z)) ≈ 0 because, in this situation, the sigmoid saturates and ∂D_o/∂Z quickly approaches zero, zeroing out the entire product.

Modified Generator Loss

To tackle this challenge, G leverages the modified (non-saturating) generator loss as its training objective:

L_g^{\text{modified}} = \min_G \mathbb{E}_{z \sim p_z(z)}[-\log(D(G(z)))]

The modified generator loss aims to maximize the log probability of fake samples being classified real.

Crucially, when D is strong enough to detect all fake samples and return D(G(z)) ≈ 0, the gradient of the modified generator loss approaches infinity:

\frac{\partial L_g^{\text{modified}}}{\partial G_o} \propto \frac{1}{D(G(z))} \to \frac{1}{0} \to \infty

This large gradient at the start of training signals G to move away from such easily-detectable fake samples, leading to stable initial training for G.
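
To make the contrast concrete, here is a minimal PyTorch sketch comparing the two losses’ gradients with respect to the discriminator’s logit (the value -6.9 is an assumed stand-in for a confident D, so that sigmoid(z) = D(G(z)) ≈ 0.001):

```python
import torch

# z stands in for the logit feeding D's final sigmoid when D
# confidently rejects a fake sample: sigmoid(-6.9) ~ 0.001.
z = torch.tensor([-6.9], requires_grad=True)

# Original (saturating) generator loss: log(1 - D(G(z)))
loss_sat = torch.log(1 - torch.sigmoid(z))
loss_sat.backward()
print(z.grad)  # ~ -0.001: the gradient has effectively vanished

z.grad = None

# Modified (non-saturating) loss: -log(D(G(z)))
loss_mod = -torch.log(torch.sigmoid(z))
loss_mod.backward()
print(z.grad)  # ~ -1.0: a strong signal that pushes D(G(z)) upward
```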

The Discriminator’s Objective

D, on the other hand, aims to maximize the value function:

\max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

The first term of the formula indicates that D attempts to output one (D(x) = 1) for all real samples x drawn from the distribution p(x).

When D(x) = 1, D has the highest confidence that x is real, which maximizes the term log D(x), since log(1) = 0 is its largest possible value.

The second term of the formula indicates that D attempts to detect fake samples G(z) as fake.

When D recognizes G(z) as fake, it outputs zero (D(G(z)) = 0), making the log value log(1 - D(G(z))) approach zero.
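
In practice, maximizing V(D, G) over D is exactly the standard binary cross-entropy objective with labels 1 (real) and 0 (fake). A quick sanity-check sketch (the probabilities 0.9 and 0.05 are illustrative assumptions):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
d_real = torch.tensor([0.9])   # D(x): confidence that x is real
d_fake = torch.tensor([0.05])  # D(G(z)): confidence that G(z) is real

# The two terms of V(D, G) for this pair of samples
v = torch.log(d_real) + torch.log(1 - d_fake)

# Negative binary cross-entropy with labels 1 (real) and 0 (fake)
neg_bce = -(bce(d_real, torch.ones(1)) + bce(d_fake, torch.zeros(1)))

print(v.item(), neg_bce.item())  # identical: maximizing V = minimizing BCE
```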

Optimizing Vanilla GANs

Optimization of vanilla GANs involves tuning and training both D and G until the network converges, the point where G generates fake samples authentic enough to completely deceive D.

The Training Process

Training the networks is an adversarial process that alternates between D and G.

D is trained as a standard binary classifier that distinguishes real samples from fake ones.

In the process:

  • D first receives a batch of real samples (labeled as 1) and a batch of fake samples from G (labeled as 0).

  • The optimizer updates D's model parameters (weights and biases in each neuron) to increase the confidence on real samples (D(x)→1) and decrease the confidence on fake samples (D(G(z))→0).

Then, G is trained to deceive D:

  • G first takes a random noise vector (z) as input and generates fake samples G(z).

  • The optimizer updates G's model parameters to make D classify the fake samples as real (D(G(z))→1).
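
Putting the two alternating updates together, here is a minimal training-loop sketch in PyTorch (the layer sizes, learning rates, and the standardized synthetic data are my own illustrative assumptions, not values prescribed by the architecture):

```python
import torch
import torch.nn as nn

Z_DIM = 16
G = nn.Sequential(nn.Linear(Z_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.LeakyReLU(0.2),
                  nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(5000):
    real = torch.randn(64, 1)          # real samples x (standardized)
    fake = G(torch.randn(64, Z_DIM))   # fake samples G(z)

    # D step: push D(x) -> 1 and D(G(z)) -> 0
    d_loss = bce(D(real), torch.ones(64, 1)) + \
             bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # G step: modified loss -log D(G(z)), i.e. push D(G(z)) -> 1
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

Note that `bce(D(fake), torch.ones(...))` is exactly the modified generator loss -log D(G(z)), and `fake.detach()` keeps the D step from backpropagating into G.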

The Nash Equilibrium

The network converges when G and D reach the Nash Equilibrium, where D outputs a random guess, 0.5, for both real and fake samples (D(x) = D(G(z)) = 0.5).

The equilibrium indicates that:

  • G has learned the data distribution of the real samples, and

  • D can no longer distinguish fake samples from real ones, so outputs a random guess (0.5) for any input.

In this scenario, the loss for both networks stabilizes, and neither can improve without the other changing its model parameters.

The primary challenge here is maintaining the power balance between D and G until the equilibrium, because the adversarial dynamics between the two make the training process unstable.

For example:

  • Stronger D causes vanishing gradients: When D is trained too quickly compared to G, it easily identifies every fake sample, causing the gradient vanishing problem.

  • Stronger G causes mode collapse: When G is trained too quickly, it starts to generate one type of fake sample that can fool D and keeps generating the same fake sample throughout training (mode collapse). This results in a lack of diversity in the fake samples.

The ideal training process involves continuous oscillation around the equilibrium point, following the steps like:

  • G improves,

  • D quickly catches up,

  • G learns new tricks,

  • D catches up,

  • … continues till the convergence.

The Walkthrough Example

Let us see this oscillation in a simple example with synthetic tabular data.

Step 1. Both D and G are Weak.

At the start, both D and G are weak.

So, G first generates an obviously fake sample: G(z) = $10,000,000 for coffee shop sales, for instance.

Then, D receives this fake sample and the real sample x = $3,056 and generates:

  • For the real sample, D(x) = 0.5 as its random guess,

  • For the fake sample, D(G(z)) = 0.1.

  • The value function V(D, G) is V(D, G) = log(0.5) + log (1-0.1) ≈ -0.8

  • The modified generator loss is Lg = -log(D(G(z))) = -log(0.1) ≈ 2.30

The low V(D, G) value sends signals that:

  • D (aiming to maximize V(D, G)) lacked confidence in its classification, and

  • G (aiming to minimize V(D, G)) performed poorly, as its fake sample was easily spotted by D, which assigned it only a 0.1 probability of being real.

Step 2. D is Trained, G Remains Weak.

In the next epoch, D has improved and can more confidently distinguish between the real and fake samples:

  • For the real sample, D generates a high value with confidence: D(x) = 0.9.

  • For the fake sample, as G(z) has not improved yet, D is very confident and outputs D(G(z)) = 0.05.

  • The value function is V(D, G) = log(0.9) + log(1 - 0.05) ≈ -0.16 > -0.8.

  • The modified generator loss is Lg = -log(0.05) ≈ 3.00 > 2.30.

The higher V(D, G) value than the initial epoch indicates that D performed better and G performed poorly.

And because D(G(z)) moved closer to 0, the modified loss increased sharply from 2.30 to 3.00.

This huge penalty delivers an even stronger gradient signal to G, forcing it to escape the saturated region and improve the quality of its fake samples.

Step 3. G Catches Up.

Based on the feedback, G has improved its fake sample, generating G(z) = $10,000.

D starts to lose confidence and generates:

  • For the real sample, D(x) = 0.8

  • For the fake sample, D(G(z)) = 0.2.

  • The value function is V(D, G) = log(0.8) + log(1 - 0.2) ≈ -0.45 < -0.16.

  • The modified generator loss is Lg = -log(0.2) ≈ 1.61 << 3.00.

The lower V(D, G) value indicates D performed worse while G performed better.

And the modified loss dropped significantly from 3.00 to 1.61.

This drop confirms that G performed better because its penalty is now much lower.

G starts to successfully optimize its objective.

Step 4. Continuous Oscillation

The network repeats Steps 2 and 3, maintaining a good power balance between D and G.

Step 5. Converge (Reach the Nash Equilibrium)

G has completely learned the real distribution and produces statistically identical fake samples.

Now, for both real and fake samples, D can no longer tell the difference, so it randomly guesses the authenticity, generating 0.5 for any sample (Nash Equilibrium).

  • For the real sample, D(x) = 0.5

  • For the fake sample, D(G(z)) = 0.5

  • The value function is V(D, G) = log(0.5) + log(1 - 0.5) ≈ -1.4

  • The modified generator loss is Lg = -log(0.5) ≈ 0.69 << 1.61

At this Nash Equilibrium, D is just guessing, and G has minimized its modified loss and the value function as much as possible, successfully producing realistic fake samples.

And this is the convergence of the vanilla GAN.
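
The walkthrough arithmetic is easy to reproduce; the short Python sketch below recomputes V(D, G) and the modified loss at each step (natural logarithms, as used throughout):

```python
import math

def value_fn(d_real, d_fake):
    # V(D, G) = log D(x) + log(1 - D(G(z))) for one real/fake pair
    return math.log(d_real) + math.log(1 - d_fake)

def modified_g_loss(d_fake):
    # Non-saturating generator loss: -log D(G(z))
    return -math.log(d_fake)

# (D(x), D(G(z))) at each step of the walkthrough
steps = {"Step 1": (0.5, 0.1), "Step 2": (0.9, 0.05),
         "Step 3": (0.8, 0.2), "Step 5": (0.5, 0.5)}
for name, (dx, dgz) in steps.items():
    print(f"{name}: V = {value_fn(dx, dgz):+.2f}, "
          f"Lg = {modified_g_loss(dgz):.2f}")
```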

Hyperparameter Tuning

GANs require careful hyperparameter tuning to make continuous oscillation happen.

Hyperparameters unique to vanilla GANs involve:

Latent space dimension (z size)

  • The dimensionality of the random noise input vector fed to G.

  • A larger dimension (e.g., 100 to 200) can allow G to capture more features and generate greater diversity in the fake samples.

D/G update ratio (k)

  • The ratio of the number of times D is trained (epochs) over G’s training epochs.

  • Setting k = 1 is common.

  • k > 1 (more training epochs to D than G) can be necessary to ensure D is robust enough to provide meaningful gradients to a weak G, and vice versa.
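
Structurally, k simply controls how the training loop interleaves the two updates; a minimal sketch, where `train_d_step` and `train_g_step` are hypothetical stand-ins for the update steps described earlier:

```python
def train_d_step():
    pass  # hypothetical helper: one discriminator update

def train_g_step():
    pass  # hypothetical helper: one generator update

k, num_steps = 2, 1000  # k = 2: D receives two updates per G update
for step in range(num_steps):
    for _ in range(k):
        train_d_step()
    train_g_step()
```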

Other critical hyperparameters are similar to a standard deep neural network:

  • Network architecture: The number of layers, the activation functions used (e.g., Leaky ReLU is preferred over standard ReLU in D to prevent gradient sparsity).

  • Optimizer: The choice of the optimization algorithm and its parameter settings.

  • Learning rate (η): The step size for model parameter updates.

  • Batch size: The number of samples processed at once.

In vanilla GANs, these hyperparameters are set separately for G and D to help the networks reach the equilibrium.

That’s all for vanilla GANs.

In the next section, I’ll explore various forms of GANs.

Types of GANs

While numerous types of GANs exist, I’ll explore three major GAN architectures and their use cases in this section:

  • Deep Convolutional GAN (DCGAN)

  • Conditional GAN (cGAN) / Conditional DCGAN (cDCGAN)

  • Progressive GAN (ProGAN)

Deep Convolutional GAN (DCGAN)

Deep convolutional GAN (DCGAN) uses convolutional neural networks (CNNs) for both G and D, instead of a standard neural network.

The below diagram illustrates DCGAN’s architecture:

Figure E. DCGAN architecture (Created by Kuriko IWAI)

G upscales random noise into a larger, more detailed output by using transposed convolutions (also called deconvolutions).

D, on the other hand, uses standard convolutional layers to progressively downsample the input, examining both the overall structure and the fine details of the data to judge its authenticity.

This approach makes DCGANs effective for generating high-quality images.
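
For intuition, here is a minimal DCGAN-style generator sketch in PyTorch; the channel counts and the 32×32 single-channel output are illustrative assumptions rather than a reproduction of Figure E:

```python
import torch
import torch.nn as nn

# Transposed convolutions upscale a noise vector into an image.
generator = nn.Sequential(
    # z: (N, 100, 1, 1) -> (N, 128, 4, 4)
    nn.ConvTranspose2d(100, 128, kernel_size=4, stride=1, padding=0),
    nn.BatchNorm2d(128), nn.ReLU(),
    # -> (N, 64, 8, 8)
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(),
    # -> (N, 32, 16, 16)
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm2d(32), nn.ReLU(),
    # -> (N, 1, 32, 32); tanh keeps pixel values in [-1, 1]
    nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
    nn.Tanh(),
)

img = generator(torch.randn(8, 100, 1, 1))  # 8 synthetic 32x32 images
```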

Its use cases involve:

  • Synthetic image creation: Generating realistic images of non-existent objects.

  • Foundation models for image transformation.

  • Data augmentation: Generating realistic training examples, and

  • Feature learning: Training D to extract rich, hierarchical visual features.

Major DCGAN-based architectures involve:

  • SRGAN: Leverages DCGAN architecture with a perceptual loss function for upscaling low-resolution images.

  • AC-GAN: Conditional extension of DCGAN that classifies the class label of the input image on top of the real-fake sample classification.

  • Pix2Pix: A conditional GAN that uses DCGAN's convolutional concepts.

Conditional GAN (cGAN)

A Conditional GAN (cGAN) includes an additional input called the condition (or label) y to both G and D.

These conditions are applicable to both vanilla GANs and DCGANs.

The below diagram illustrates how a cDCGAN (conditions applied to a DCGAN) works:

Figure F. cDCGAN architecture (Created by Kuriko IWAI)

In Figure F, the condition “sitting” (orange boxes) is added to both real and fake samples.

This signals G to generate specific images of “sitting” cats.

By adding more condition patterns to the real samples, like “sleeping”, “eating”, and so on, G can learn to generate each of these images specifically.

So, the conditions provide context, enabling G to generate fake samples with specific characteristics based on the given conditions y rather than relying solely on random noise z.

The Objective Function

The objective function of a cGAN is an extension of the original GAN's minimax game, where both G and D are conditioned with y:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)))]

This controlled generation makes cGANs useful for tasks requiring precise control over the output.
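
A common way to implement the conditioning is to embed the label y and concatenate it with the noise z before G’s layers (and with the sample before D’s layers). A minimal sketch of the generator side, with all sizes assumed for illustration:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim=100, n_classes=10, emb_dim=16, out_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb_dim)  # label y -> dense vector
        self.net = nn.Sequential(
            nn.Linear(z_dim + emb_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh(),
        )

    def forward(self, z, y):
        # G(z|y): the condition enters as extra input features
        return self.net(torch.cat([z, self.embed(y)], dim=1))

g = ConditionalGenerator()
fake = g(torch.randn(8, 100), torch.randint(0, 10, (8,)))  # class-conditioned samples
```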

Its use cases involve:

  • Image generation: Generating images of specific classes (e.g., a "cat" or "car").

  • Image-to-image translation: a sketch to photo, outline to a photorealistic image.

  • Text-to-image synthesis: Creating images based on a descriptive text caption.

Major cGAN-based architectures involve:

  • Pix2Pix: A seminal architecture for general-purpose image-to-image translation.

  • CycleGAN: An architecture designed for unpaired image-to-image translation.

  • StarGAN: Focuses on multi-domain image-to-image translation using a single model.

Progressive GAN (ProGAN)

Progressive GAN (ProGAN or PGGAN) progressively adds layers to both G and D (called the critic here) during training, aiming for exceptionally high-quality synthesis while maintaining training stability.

Figure G. ProGAN and its training progress (Created by Kuriko IWAI)

The training process starts with a small-scale GAN which outputs low resolution images like 4×4 pixels.

As training stabilizes, a new layer is added in a gradual manner to both G and D, effectively doubling the resolution: 8×8 → 16×16 → … → 1024×1024.

To ensure a stable transition, the network fades each new layer in: it assigns the layer a small blending weight at first and increases the weight as training proceeds.
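
This fade-in is typically implemented as a linear blend between the upsampled output of the previous stage and the new high-resolution branch; a minimal sketch (the blending schedule for alpha is an assumption):

```python
import torch
import torch.nn.functional as F

def faded_output(low_res, high_res, alpha):
    """Blend the new high-resolution branch in gradually (0 <= alpha <= 1).

    low_res:  output of the previous, stable stage (N, C, H, W)
    high_res: output of the newly added layer (N, C, 2H, 2W)
    """
    upsampled = F.interpolate(low_res, scale_factor=2, mode="nearest")
    return (1 - alpha) * upsampled + alpha * high_res

# e.g. mid-transition from 4x4 to 8x8, with alpha ramped up over training
out = faded_output(torch.randn(1, 3, 4, 4), torch.randn(1, 3, 8, 8), alpha=0.3)
```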

The Objective Function - WGAN-GP Loss

ProGAN uses the Wasserstein GAN with Gradient Penalty (WGAN-GP) loss, which measures Wasserstein-1 Distance (Earth-Mover's Distance) between real and fake samples.

In ProGAN, D aims to maximize this objective, driving the estimated distance between real and fake samples as wide as possible.

Mathematically, this objective is generalized:

\max_D L_D = \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] - \lambda \cdot L_{GP}

where L_{GP} is the Gradient Penalty term:

L_{GP} = \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]

where:

  • x̂: An interpolated sample between a real and a fake sample:
\hat{x} = \epsilon x + (1 - \epsilon) G(z)
  • λ: The penalty weight (λ = 10 is common).
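
Here is a minimal sketch of the gradient penalty term in PyTorch (the flat feature vectors and the toy critic are assumptions for illustration; λ is applied by the caller):

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake):
    """WGAN-GP penalty: push the critic's gradient norm at interpolated
    samples toward 1."""
    eps = torch.rand(real.size(0), 1)                      # per-sample epsilon
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat,
                                create_graph=True)[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Toy critic: an unbounded score, as required by the WGAN formulation
critic = nn.Sequential(nn.Linear(8, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1))
gp = gradient_penalty(critic, torch.randn(4, 8), torch.randn(4, 8))
```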

WGAN-GP loss offers critical benefits for ProGANs to generate realistic images:

  • Stability for high-resolution training:

High-resolution training is unstable because the overlap of the real and fake data distributions is minimal.

With the WGAN-GP loss, the earth-mover’s distance avoids the vanishing gradient problem, while the gradient penalty enforces smoothness of the critic.

  • Meaningful loss metrics:

WGAN-GP produces a continuous loss that correlates directly with image quality.

ProGAN’s use cases involve:

  • High-resolution image synthesis: Generates photorealistic images of faces, objects, and scenes at resolutions up to 1024×1024.

  • Synthetic data generation like large datasets of high-fidelity images for training other computer vision models.

Major ProGAN-based architectures involve:

  • StyleGAN: An evolution of ProGAN that retained the progressive growing idea but introduced a style-based generator to better control the features (styles) at different levels.

  • Other resolution-scaling architectures: The progressive training scheme has been adopted by many subsequent GANs that generate high-resolution images.

Wrapping Up

GANs are highly capable of generating realistic, high-resolution synthetic data across various domains, including images, audio, and text, by mastering complex underlying data distributions.

In this article, we observed the working principles of vanilla GANs through a practical, walkthrough example and explored essential architectural variants with use cases.

Moving forward, the continuous evolution of GAN architectures promises breakthroughs in fields like creative design, drug discovery, and data augmentation, further polishing synthetic content.


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.