Autoencoders (AEs): Dense, CNN, and RNN Implementation Guide
Explore the core mechanics of AEs with essential regularization techniques and various layer architectures
By Kuriko IWAI

Table of Contents
Introduction
What is Autoencoder
How Vanilla Autoencoder Works
Various Constraints to Prevent Overfitting
Wrapping Up

Introduction
Autoencoders are powerful neural networks built to learn a compact, feature-rich representation of data without any explicit labels.
However, navigating the intricacies of their encoder-decoder architecture and the various methods used to impose learning constraints remains a difficult task.
In this article, I’ll explore the core architecture of the autoencoder, covering its layered structure, the various learning constraints used to mitigate overfitting, and practical use cases.
What is Autoencoder
An autoencoder (AE) is a type of artificial neural network trained to copy its input to its output by learning from unlabeled data through unsupervised learning.
The diagram below illustrates the foundational architecture of a vanilla AE built with dense layers:

Figure A. Vanilla AE architecture with standard dense layers (Created by Kuriko IWAI)
The vanilla AE consists of an encoder (colored pink in Figure A) and a decoder (colored blue in Figure A):
Encoder takes the input data and compresses it into the code (also called latent code or bottleneck vector), a lower-dimensional representation where the network learns the most important features of the data.
Decoder takes the compressed code and attempts to reconstruct the original input data to generate the reconstruction.
Notably, vanilla AEs use an undercomplete representation, where the dimensionality of the code layer (red in Figure A) in the latent space is much smaller than that of the input or output layer.
This constraint forces the network to learn only the most salient features from the input data in order to successfully perform the reconstruction in the decoder layers.
On the other hand, non-vanilla AEs use an overcomplete representation, where the code layer has more neurons than the input layer.
I will cover these AEs in a later section, after explaining the vanilla AE’s architecture.
◼ Addressing Data Modality
AEs can address different data modalities by changing their layer architecture.
When dealing with images or sequential data, AEs leverage convolutional layers or recurrent network layers instead of standard dense layers.
▫ Dense AE
As shown in Figure A, a vanilla AE with standard dense layers is suitable for handling single, flat vector inputs like tabular data, feature vectors, or flattened images.
Its primary use cases include dimensionality reduction, feature extraction, and anomaly detection, leveraging its undercomplete structure where the code layer is squeezed.
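To make this concrete, below is a minimal Keras sketch of a dense AE like the one in Figure A. The 784-dimensional input (a flattened 28×28 image), the 32-dimensional code, and the intermediate layer sizes are my assumptions for illustration, not values taken from the figure.

```python
# A minimal sketch of a dense (vanilla) AE, assuming flattened 784-dim inputs.
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(784,))

# Encoder: progressively narrower dense layers down to the code (bottleneck)
x = layers.Dense(128, activation="relu")(inputs)
x = layers.Dense(64, activation="relu")(x)
code = layers.Dense(32, activation="relu", name="code")(x)

# Decoder: a mirror of the encoder, expanding back to the input dimension
x = layers.Dense(64, activation="relu")(code)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(784, activation="sigmoid")(x)  # assumes inputs scaled to [0, 1]

autoencoder = tf.keras.Model(inputs, outputs, name="dense_ae")
autoencoder.summary()
```

The encoder narrows the representation at each step down to the code, and the decoder mirrors it back up to the original input dimension.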
▫ Recurrent AE (RAE)
Recurrent AEs (RAEs) handle sequential data by leveraging recurrent neural network layers like LSTMs or GRUs:

Figure B. Vanilla RAE architecture (Created by Kuriko IWAI)
The recurrent nature allows the network to process inputs in sequence and capture temporal dependencies in sequential data like time series, text, or audio.
The code in an RAE is a summary of the entire sequence history.
Their primary use cases are similar to those of vanilla dense AEs, but applied to sequential data such as time series.
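Below is a minimal Keras sketch of an RAE along these lines; the sequence length, the number of features, and the code size are my own illustrative assumptions.

```python
# A minimal sketch of a recurrent AE using LSTM layers.
import tensorflow as tf
from tensorflow.keras import layers

timesteps, n_features, code_size = 100, 8, 16   # assumed dimensions

inputs = tf.keras.Input(shape=(timesteps, n_features))

# Encoder LSTM: its final hidden state acts as the code summarizing the sequence
code = layers.LSTM(code_size, name="code")(inputs)

# Decoder: repeat the code at every timestep, then unroll it back into a sequence
x = layers.RepeatVector(timesteps)(code)
x = layers.LSTM(code_size, return_sequences=True)(x)
outputs = layers.TimeDistributed(layers.Dense(n_features))(x)

recurrent_ae = tf.keras.Model(inputs, outputs, name="recurrent_ae")
```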
▫ Convolutional AE (CAE)
Convolutional AEs (CAEs) are designed to learn spatial patterns through convolutional layers in both the encoder and decoder:

Figure C. Vanilla CAE architecture (Created by Kuriko IWAI)
A CAE uses convolutional layers (Conv2D or Conv3D) for downsampling, and deconvolutional layers (also called transposed convolutional layers) for upsampling.
By using convolutions, CAEs are well suited to learning the spatial structure in data like images or video frames.
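Here is a minimal Keras sketch of such a CAE, assuming 28×28 grayscale images; the filter counts and strides are illustrative assumptions.

```python
# A minimal sketch of a convolutional AE for 28x28x1 images.
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(28, 28, 1))

# Encoder: strided convolutions downsample the image into a compact feature map
x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inputs)  # 14x14
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)       # 7x7
code = layers.Conv2D(64, 3, padding="same", activation="relu", name="code")(x)  # 7x7x64

# Decoder: transposed convolutions upsample back to the original resolution
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(code)  # 14x14
x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)     # 28x28
outputs = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)

conv_ae = tf.keras.Model(inputs, outputs, name="conv_ae")
```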
Their primary use cases include:
Image (or video) reconstruction,
Image (or video) colorization, and
Image search.
In image colorization, the CAE maps features of a black-and-white image to a colored image:

Figure D-1. Image colorization by vanilla CAE (Created by Kuriko IWAI)
In image reconstruction, the CAE is trained to remove noise from an input image so that it can output a clean version of that image:

Figure D-2. Image reconstruction by vanilla CAE (Created by Kuriko IWAI)
In image search, deep CAEs compress images into compact code vectors (for example, 30 numbers each) and retrieve the stored images whose codes most closely match that of the query:

Figure D-3. Image search by vanilla CAE (Created by Kuriko IWAI)
AEs address data modality by leveraging these various layer architectures.
In the next section, I’ll explore how vanilla AE works in detail.
How Vanilla Autoencoder Works
Regardless of the layer architecture, AEs follow three steps to reconstruct the input data:
Compression (encoding) process,
Decompression (decoding) process, and
Training process.
Let us take a look at each process.
◼ The Compression (Encoding) Process
In the first step, the encoder compresses the input data into a lower-dimensional representation through its hidden layers.
Mathematically, this compression process is represented as:
h = s_f(Wx + b)
where:
h: The code,
x: Input vector,
W: The weight matrix of the encoder,
b: The bias vector of the encoder, and
s_f: The activation function like ReLU or sigmoid in the hidden layer of the encoder.
As shown in Figure A, the hidden layers in the encoder contain a progressively smaller number of nodes than the input layer so that the input data is compressed into the code (h) with much smaller dimensions than the input data.
◼ The Decompression (Decoding) Process
Then, the code (h) is passed onto the decoder.
The decoder maps the compressed code back to the original input space, producing the reconstruction.
Mathematically, this process is denoted as:
r = s_g(W′h + b′)
where:
r: The reconstruction (output) produced by the decoder,
W': The weight matrix of the decoder,
b': The bias vector of the decoder, and
s_g: The activation function of the decoder.
The decoder comprises hidden layers with a progressively larger number of nodes that decompress the data, ultimately reconstructing the data back to its original, pre-encoding form.
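As a worked example, here is a minimal NumPy sketch of the two equations above, assuming a 784-dimensional input, a 32-dimensional code, ReLU as s_f, and sigmoid as s_g; the random weights simply stand in for learned parameters.

```python
# A minimal NumPy sketch of h = s_f(Wx + b) and r = s_g(W'h + b').
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)                                        # input vector x
W, b = rng.normal(scale=0.01, size=(32, 784)), np.zeros(32)           # encoder parameters
W_prime, b_prime = rng.normal(scale=0.01, size=(784, 32)), np.zeros(784)  # decoder parameters

h = np.maximum(0.0, W @ x + b)                             # code: h = s_f(Wx + b), s_f = ReLU
r = 1.0 / (1.0 + np.exp(-(W_prime @ h + b_prime)))         # reconstruction: r = s_g(W'h + b'), s_g = sigmoid

print(h.shape, r.shape)                                    # (32,) (784,)
```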
◼ Types of Weight Matrix
The weight matrix of the decoder, W′, is either tied to or untied from that of the encoder.
▫ Tied Weights
Tied weights are common practice in simpler, linear AEs, where the weights of the decoder are the exact transpose of the weights of the corresponding encoder layers:
W′ = Wᵀ
where W and W′ are the weight matrices of the encoder and decoder, respectively.
This constraint reduces the number of learnable parameters in the network, which helps prevent overfitting in simpler architectures.
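As an illustration of weight tying in practice, here is a sketch of a hypothetical DenseTranspose Keras layer that reuses the transpose of an encoder layer’s kernel, so only a new bias is learned on the decoder side. The layer name and sizes are my own; this is one possible implementation, not a built-in Keras API.

```python
# A minimal sketch of tied weights: the decoder reuses the encoder kernel, transposed.
import tensorflow as tf

class DenseTranspose(tf.keras.layers.Layer):
    """Hypothetical decoder layer whose kernel is tied to an encoder Dense layer."""
    def __init__(self, tied_to, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.tied_to = tied_to                                 # the encoder Dense layer to tie to
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        # Only the bias is a new parameter; the output width equals the tied kernel's input width.
        out_dim = self.tied_to.kernel.shape[0]
        self.bias = self.add_weight(name="bias", shape=(out_dim,), initializer="zeros")

    def call(self, inputs):
        # W' = W^T: multiply by the transpose of the shared encoder kernel.
        z = tf.matmul(inputs, self.tied_to.kernel, transpose_b=True)
        return self.activation(z + self.bias)

# Usage sketch: the encoder layer is built first, so its kernel exists when tying.
enc = tf.keras.layers.Dense(32, activation="relu")
inputs = tf.keras.Input(shape=(784,))
code = enc(inputs)
outputs = DenseTranspose(enc, activation="sigmoid")(code)
tied_ae = tf.keras.Model(inputs, outputs)
```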
▫ Untied Weights
Untied weights, on the other hand, are the standard for deep AE variants where the weights of the decoder layers (W′) are entirely independent of the weights of the encoder layers (W).
Untied weights provide the network with greater capacity and flexibility to learn highly complex, non-linear mappings.
◼ The Training Process
A fundamental goal of training an AE is to minimize the reconstruction loss (error).
The reconstruction loss quantifies the difference between the reconstruction r and the ground truth (the original input, in most cases) to assess the accuracy of the AE.
Just like other deep learning models, an AE uses a loss function to measure its reconstruction loss.
Common loss functions are:
Mean Squared Error (MSE) for continuous data like image pixel values and
Binary Cross-Entropy for binary or [0, 1]-scaled data, such as binary word-occurrence vectors in text.
Then, the AE optimizes its learnable parameters (the weights and biases of the encoder and decoder) using an optimization algorithm such as gradient descent.
This process compels the AE to learn the most efficient representation of the input data.
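A minimal training sketch might look like the following, assuming the autoencoder model from the dense AE sketch earlier and a hypothetical x_train array of flattened inputs scaled to [0, 1].

```python
# A minimal training sketch; `autoencoder` and `x_train` are assumed from above.
autoencoder.compile(optimizer="adam", loss="mse")   # MSE for continuous inputs
# For binary inputs, loss="binary_crossentropy" could be used instead.

# The input doubles as the target: the network learns to reconstruct x from x.
autoencoder.fit(x_train, x_train, epochs=20, batch_size=256,
                shuffle=True, validation_split=0.1)
```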
◼ Hyperparameter Tuning
To build robust AEs, tuning hyperparameters is critical in addition to constructing the architecture.
AEs have multiple hyperparameters such as:
Code size: Controls how much the data is to be compressed,
The number of hidden layers of decoder and encoder: Controls the depth of the AE,
The number of nodes per layer: Controls how much data is compressed and decompressed, and
Loss function: Controls how the reconstruction loss is measured.
Increasing the number of layers and/or the number of nodes per layer gives the AE significantly more capacity to learn complex patterns.
But it also invites overfitting, where the AE simply memorizes the input data and copies inputs to outputs.
In vanilla AEs, setting a smaller code size regularizes the network by forcing it to significantly compress the input data.
Beyond the code size, other constraints are introduced, particularly for non-vanilla AEs whose code layer has more dimensions than the input or output layers.
I'll explore them in the next section.
Various Constraints to Prevent Overfitting
The core goal of all AEs is to copy the input to the output.
However, the complexity of the encoder and decoder structures makes overfitting a common challenge.
So, various architectural constraints have been introduced on top of the code size.
In this section, I’ll cover three methods and their major AE architectures:
Regularization in the loss function: Sparse AE (SAE), Contractive AE (CAE)
Noise Perturbation (Input corruption method): Denoising AE (DAE)
Probabilistic modeling: Variational AE (VAE)
◼ Regularization in the Loss Function
These AEs add a penalty term to the standard reconstruction loss.
▫ Sparse AE (SAE)
Sparse AEs (SAEs) introduce a sparsity penalty in an overcomplete representation so that only a small subset of neurons (grey in the diagram) fires for any given input during training.

Figure E. SAE architecture (Created by Kuriko IWAI)
The constraint:
Adds a sparsity penalty to the loss function.
Allows only a small subset of neurons in the code layer to be active (non-zero) for any given input (see the sketch at the end of this subsection).
The goal:
- To encourage the network to learn a diverse, independent set of features and improve feature interpretability.
Typical use cases:
Feature selection on complex data like images or text documents.
High dimensional data clustering for a downstream task like training a classifier.
Transfer learning, where each layer of a deep neural network is trained as an SAE to learn highly organized features from the layer below.
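As a concrete illustration of the sparsity penalty, here is a minimal Keras sketch that places an L1 activity regularizer on an overcomplete code layer; the 1024-unit code width and the 1e-5 penalty strength are assumptions, and L1 is only one common way to impose sparsity.

```python
# A minimal sketch of a sparse AE: an L1 activity penalty keeps most code activations near zero.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

inputs = tf.keras.Input(shape=(784,))

# Overcomplete code layer (wider than the input), kept sparse by the L1 penalty
code = layers.Dense(1024, activation="relu",
                    activity_regularizer=regularizers.l1(1e-5),
                    name="sparse_code")(inputs)
outputs = layers.Dense(784, activation="sigmoid")(code)

sparse_ae = tf.keras.Model(inputs, outputs, name="sparse_ae")
sparse_ae.compile(optimizer="adam", loss="mse")   # total loss = reconstruction loss + L1 penalty
```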
▫ Contractive AE (CAE)
Contractive AEs (CAEs) aim to make the learned features less sensitive to small variations in the input data.
The constraint:
Adds a penalty term that measures the sensitivity of the code to small changes in the input.
The penalty term is based on the squared Frobenius norm of the Jacobian matrix (J = ∂h / ∂x) of the encoder’s activations with respect to the input data:
Ω(x) = ‖J‖_F² = Σ_ij (∂h_j / ∂x_i)²
where:
Ω(x): The penalty term for a given input x,
h = s_f(Wx + b): The vector of activations in the hidden layer of the encoder,
x: The input data vector, and
J = ∂h / ∂x: The Jacobian matrix of the hidden layer activations h with respect to the input x.
The Jacobian matrix J describes how much each element of the hidden representation (h) changes for an infinitesimal change in each element of the input (x).
The squared Frobenius norm then squares every partial derivative in the Jacobian matrix and sums them all, providing a single scalar measure of the total sensitivity of the entire code.
For example, a highly sensitive code receives a higher penalty (25) than a less sensitive one (0.25):
High Sensitivity Example: Let h = 5x.
The derivative: dh/dx = 5.
The penalty: Ω(x) = 5² = 25.
Interpretation: If x changes by 0.1, h changes by 0.5. The feature is highly sensitive to the input.
Low Sensitivity Example: Let h = 0.5x.
The derivative: dh/dx = 0.5.
The penalty: Ω(x) = 0.5² = 0.25 < 25
Interpretation: If x changes by 0.1, h changes by only 0.05. The feature is robust (insensitive) to the input change.
This penalty term is added to the standard reconstruction loss (a code sketch of the penalty follows the use cases below):
L_total = L(x, g(h)) + λ Ω(x)
where:
L(x, g(h)): The reconstruction loss and
λ: The regularization parameter that controls the strength of the contractive penalty.
The goal:
- To make the learned features robust and stable against minor variations or noise in the input data.
Typical use cases:
Feature extraction, especially from noisy data.
Dimensionality reduction for high-dimensional data.
Image recognition tasks to capture the core identity of an object regardless of minor changes in lighting, rotation, or translation.
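To make the contractive penalty concrete, here is a minimal TensorFlow sketch that computes it for a batch with a single-layer sigmoid encoder using GradientTape.batch_jacobian; the layer sizes and the λ value are assumptions, and the full training loop is omitted.

```python
# A minimal sketch of the contractive loss: reconstruction error + lambda * ||J||_F^2.
import tensorflow as tf
from tensorflow.keras import layers

encoder = tf.keras.Sequential([tf.keras.Input(shape=(784,)),
                               layers.Dense(32, activation="sigmoid")])
decoder = tf.keras.Sequential([tf.keras.Input(shape=(32,)),
                               layers.Dense(784, activation="sigmoid")])
lam = 1e-4   # assumed regularization strength (lambda)

def contractive_loss(x_batch):
    x = tf.convert_to_tensor(x_batch, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        h = encoder(x)                                   # code h = s_f(Wx + b)
    J = tape.batch_jacobian(h, x)                        # Jacobian per sample: (batch, 32, 784)
    penalty = tf.reduce_sum(tf.square(J), axis=[1, 2])   # squared Frobenius norm per sample
    r = decoder(h)                                       # reconstruction g(h)
    recon = tf.reduce_sum(tf.square(x - r), axis=1)      # reconstruction loss L(x, g(h))
    return tf.reduce_mean(recon + lam * penalty)
```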
◼ Noise Perturbation
▫ Denoising AE (DAE)
Denoising AEs (DAEs) use an input corruption method: the network is fed corrupted input data and forced to recover the original, clean input during training.

Figure F. DAE architecture on CAE (Created by Kuriko IWAI)
The Constraint:
The input data is artificially corrupted by adding random noise or setting some values to zero.
The noisy input is fed to the encoder layers of the DAE, while the clean, original input serves as the ground truth for the reconstruction loss.
The network must learn how to denoise the input to produce a clean reconstruction, as sketched at the end of this subsection.
The Goal:
- The denoising process makes the network learn a robust representation that can recover the true signal, improving its generalization capabilities.
Use cases:
- Feature variation: the network extracts only the required features of an image and generates a clean output, removing noise and other unwanted artifacts.
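A minimal denoising setup might look like the following sketch, assuming the conv_ae model from the convolutional AE sketch earlier and a hypothetical x_train array of images scaled to [0, 1]; the Gaussian noise level is an assumption.

```python
# A minimal denoising sketch: corrupted inputs, clean originals as targets.
import numpy as np

noise_factor = 0.3   # assumed corruption strength
x_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
x_noisy = np.clip(x_noisy, 0.0, 1.0)

conv_ae.compile(optimizer="adam", loss="mse")
# The network sees noisy inputs but is scored against the clean images, so it learns to denoise.
conv_ae.fit(x_noisy, x_train, epochs=20, batch_size=128)
```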
◼ Probabilistic Framework
▫ Variational AE (VAE)
Variational AEs (VAEs) are generative models that map the input to a probability distribution over the latent space, parameterized by a mean μ and standard deviation σ.
The constraint:
The encoder is forced to map the input not to a single code, but to the parameters of a probability distribution in the latent space.
The loss function includes the reconstruction loss and a Kullback-Leibler (KL) divergence term, ensuring the latent distribution stays close to a standard normal distribution (see the training sketch at the end of this subsection).
The goal:
Ensures the latent space is smooth and continuous using KL divergence terms.
Enables the model to generate new data by sampling points from the continuous latent space.
Typical use cases:
Image generation: Generate new, novel images that visually resemble the training data. For example, after training on thousands of face images, a VAE can generate entirely new, plausible faces.
Sequence generation: Generates new sequences like novel sentences, music, or time series data. Takes the form of Variational Recurrent AEs (VRAEs).
Data imputation: Filling in missing or corrupted parts of an image or other data.
Disentangled representation learning: Separates the underlying factors of variation in the data to make the input features more interpretable.
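To tie the pieces together, here is a minimal Keras sketch of a VAE trained with a custom train_step that combines the reconstruction term and the KL divergence; the layer sizes, the two-dimensional latent space, and the binary cross-entropy likelihood are assumptions for flat, [0, 1]-scaled inputs.

```python
# A minimal VAE sketch: encoder outputs (mu, log sigma^2), the decoder reconstructs from a sample z.
import tensorflow as tf
from tensorflow.keras import layers

original_dim, latent_dim = 784, 2   # assumed input and latent sizes

# Encoder: maps x to the mean and log-variance of the latent distribution q(z|x)
enc_in = tf.keras.Input(shape=(original_dim,))
h = layers.Dense(256, activation="relu")(enc_in)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)
encoder = tf.keras.Model(enc_in, [z_mean, z_log_var])

# Decoder: maps a latent sample z back to the input space
dec_in = tf.keras.Input(shape=(latent_dim,))
dec_out = layers.Dense(original_dim, activation="sigmoid")(
    layers.Dense(256, activation="relu")(dec_in))
decoder = tf.keras.Model(dec_in, dec_out)

class VAE(tf.keras.Model):
    def __init__(self, encoder, decoder, **kwargs):
        super().__init__(**kwargs)
        self.encoder, self.decoder = encoder, decoder

    def train_step(self, x):
        with tf.GradientTape() as tape:
            z_mean, z_log_var = self.encoder(x)
            # Reparameterization trick: z = mu + sigma * epsilon
            eps = tf.random.normal(shape=tf.shape(z_mean))
            z = z_mean + tf.exp(0.5 * z_log_var) * eps
            r = self.decoder(z)
            # Reconstruction term: binary cross-entropy summed over input dimensions
            recon = tf.reduce_mean(
                tf.keras.losses.binary_crossentropy(x, r)) * original_dim
            # KL divergence between q(z|x) and the standard normal prior
            kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
                1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
            loss = recon + kl
        grads = tape.gradient(loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"loss": loss, "kl": kl}

vae = VAE(encoder, decoder)
vae.compile(optimizer="adam")
# vae.fit(x_train, epochs=30, batch_size=128)   # inputs only; no targets are needed
```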
Different types of AEs make the network better suited to different tasks and data types.
Choosing the right constraint is also key to building robust AEs.
Wrapping Up
The autoencoder (AE) stands as a powerful unsupervised neural network capable of learning efficient, lower-dimensional representations of data for tasks like anomaly detection, denoising, and dimensionality reduction.
We observed that success hinges on careful design choices, where the layer architecture, the implementation of learning constraints, and the tuning of hyperparameters are key to building robust AEs tailored to specific task goals and data types.
Moving forward, exploring hybrid architectures and dynamic constraint mechanisms can further enhance AE performance and adaptability across increasingly complex, high-dimensional datasets.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Generative Adversarial Network (GAN): From Vanilla Minimax to ProGAN
Decoding CNNs: A Deep Dive into Convolutional Neural Network Architectures
Related Books for Further Understanding
These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Generative Deep Learning: Teaching Machines To Paint, Write, Compose, and Play
Share What You Learned
Kuriko IWAI, "Autoencoders (AEs): Dense, CNN, and RNN Implementation Guide" in Kernel Labs
https://kuriko-iwai.com/autoencoders-for-advanced-unsupervised-learning
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.

