Decoding CNNs: A Deep Dive into Convolutional Neural Network Architectures

Explore how CNN architectures work, leveraging convolutional, pooling, and fully connected layers

Deep Learning | Data Science

By Kuriko IWAI

Table of Contents

Introduction
What is a Convolutional Neural Network (CNN)?
The Convolutional Block
Convolutional Layer
Batch Normalization
Non-Linear Activation
Pooling Layer
Alternatives to Pooling Layers
The Fully Connected (FC) Layer
The Flattening Step
Consideration
The Output Layer
Types of Convolutional Neural Networks
1D CNNs
2D CNNs
3D CNNs
Wrapping Up

Introduction

A convolutional neural network (CNN) is a deep learning model designed to process and analyze visual data, particularly effective for tasks like image recognition or object detection.

However, the architecture of CNNs is notoriously difficult to grasp due to its inherent complexity and rapidly evolving nature.

This post explains a standard CNN architecture and a variety of models in the CNN family, covering fundamental building blocks like:

  • Convolutional layers,

  • Pooling layers, and

  • Dense layers

with key concepts like stride, kernel, and pooling.

What is a Convolutional Neural Network (CNN)?

A Convolutional Neural Network (CNN) is a specialized type of neural network inspired by the visual cortex of the human brain.

Unlike traditional neural networks that treat images as a flat array of pixels, CNNs use a hierarchical approach, learning to identify features from simple patterns like edges and curves to more complex objects and textures.

Its primary function is to adaptively learn these features as spatial hierarchies by stacking many layers of neurons.

The below diagram illustrates how a basic CNN architecture works on an image classification task:

Figure A. Standard CNN architecture (Created by Kuriko IWAI)

Each layer performs a specific operation to complete the assigned task (an image classification task in the case of Figure A):

  • Convolutional blocks, each containing convolutional layers and a pooling layer, extract unique features from the input data,

  • A flattening layer transforms the extracted features into a one-dimensional vector,

  • Fully connected layers process the flattened vector to learn the classification task, and

  • An output layer provides the final outcome of the network (a probabilistic distribution across the target classes: bird, lion, and cat).

In the next section, I’ll detail each component.
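To make the pipeline concrete, below is a minimal sketch of a Figure A-style network in PyTorch. The channel counts, layer sizes, and the 32×32 input are illustrative assumptions, not the figure's exact values:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    # Convolutional block 1: convolution -> non-linearity -> pooling
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    # Convolutional block 2
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    # Flattening step: collapse the feature maps into one vector per image
    nn.Flatten(),
    # Fully connected layer, then a 3-class output (bird, lion, cat)
    nn.Linear(32 * 8 * 8, 64), nn.ReLU(),
    nn.Linear(64, 3),
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
logits = model(x)               # shape: (1, 3)
probs = logits.softmax(dim=1)   # probability distribution over the classes
```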

The Convolutional Block

A convolutional block is a fundamental building block in a CNN that contains a collection of convolutional layers and a pooling layer.

As shown in Figure A, these blocks are stacked one after another to form the core of the CNN architecture.

The below diagram details a basic convolutional block with a single convolutional layer (orange box) and a pooling layer (green box):

Figure B. A standard convolutional block architecture with a single filter (Created by Kuriko IWAI)

Although specific layers within a convolutional block can vary depending on the architecture, a standard convolutional block contains:

  • One or multiple convolutional layers and

  • A pooling layer.

Convolutional Layer

A convolutional layer is the core of the block; it detects specific features in the input data.

As shown in Figure B, the layer has

  • A filter with three 3-by-3 kernels (a small matrix of numbers),

  • Batch normalization, and

  • Non-linear activation.

First, the filter performs the convolution operation, highlighting specific features in the input data as distinct feature maps.

Then, the network applies batch normalization and non-linear activation to these feature maps, and passes them onto the pooling layer.

Convolutional Operation

The convolution operation is an element-wise multiplication and summation process that helps the network recognize features like edges, textures, and shapes.

The operation starts with the kernel sliding (or convolving) over the input data, multiplying its weights by the corresponding pixel values in the image patch it currently covers.

The results of these multiplications are summed up to produce a single value in a feature map.

For example, Figure C illustrates the first three positions of a convolution with a Sobel kernel, a type of two-dimensional kernel:

Figure C. Convolutional operation by a Sobel kernel (Created by Kuriko IWAI)

In practice, the process continues for every possible position in the input to create the complete feature map.
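As a rough sketch of this sliding-window process, the NumPy loop below computes one feature-map value per position by multiplying and summing, using a Sobel kernel and a toy 6×6 input:

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    """Slide the kernel over the image (valid padding): at each position,
    multiply element-wise and sum to produce one feature-map value."""
    m, n = kernel.shape
    out_h = (image.shape[0] - m) // stride + 1
    out_w = (image.shape[1] - n) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride : i*stride + m, j*stride : j*stride + n]
            feature_map[i, j] = np.sum(patch * kernel)  # multiply and sum
    return feature_map

# A Sobel kernel that responds to vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

image = np.random.rand(6, 6)                   # toy 6x6 input, as in Figure C
print(convolve2d_valid(image, sobel_x).shape)  # (4, 4)
```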

Stride and padding are the key settings that define how many positions the kernel processes.

Stride and Padding

Stride is the number of pixels that the kernel moves over the input matrix.

In Figure C, I set a stride of one so that the kernel shifts one pixel to the right after each computation.

Stride values of two or greater are less common in plain feature-extraction layers, and a larger stride yields a smaller output because the kernel visits fewer positions.

Padding, on the other hand, is a technique to add extra pixels around the input image borders.

Padding serves two primary purposes:

  • Preserve spatial dimensions: Convolutional operations reduce the size of the output feature map. Adding a border around the input image allows the output to have the same or a larger size than the input.

  • Prevent information loss at the borders: Pixels at the edges of an image are only processed a few times by the kernel, while pixels in the center are processed many times. Adding padding ensures that all pixels are treated equally, preventing the loss of important edge information.

The default setting is valid padding (also called "no padding"), where no padding is added to the input; the kernel only moves over valid sections, resulting in an output that is smaller than the input.

On the other hand, zero padding is a common padding method that adds zeros to the border. Its strategies include:

  • Same Padding: Adds just enough zeros to the borders so the output has the same dimensions as the input. The amount of padding is automatically calculated based on the kernel size and stride.

  • Full Padding: Adds f − 1 rows and columns of zeros to each border so that every pixel, even those in the corners, is covered by every position of the kernel. This results in an output that is larger than the input.

In Figure C, I applied valid padding (no padding) for simplicity.

With zero padding applied, the input data would look like Figure C’:

Figure C’: Input with zero padding (Created by Kuriko IWAI)

The kernel then slides over these padded borders during the convolution operation.

Mathematical Formulation of the Convolution Operation

Generalizing with input data I, a kernel K of height M and width N, and the coordinates of the current pixel (i, j), the process is denoted:

(I * K)(i,j) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I(i+m,\, j+n) \cdot K(m,n)

where:

  • (I ∗ K): The feature map (the convolution of the input image I with the kernel K, computed as a cross-correlation),

  • (i, j): The coordinates of the current pixel,

  • I: The input data (matrix) with I(i, j) as a pixel value at row i and column j,

  • K: The kernel matrix of height M and width N, and

  • K(m, n): The weight value at row m and column n of the kernel (w_1 to w_9 in Figure B).

(I ∗ K) represents a feature map, an output from the convolutional operation.

When applying a two-dimensional kernel, the size of the feature map (I ∗ K) is defined as:

O = \frac{n - f + 2p}{s} + 1

where

  • O: The feature map (output) size,

  • n: The input size (height or width),

  • f: The kernel size,

  • p: Padding, and

  • s: Stride

In case of Figure C’, since:

  • n = 6

  • f = 3

  • p = 0, 1, 2 (valid, same, and full padding, respectively)

  • s = 1

The size of feature maps is computed:

  • O_{valid_padding} = ((6 − 3 + 0) / 1) + 1 = 4 < input size n = 6

  • O_{same_padding} = ((6 − 3 + 2) / 1) + 1 = 6 = input size n = 6

  • O_{full_padding} = ((6 − 3 + 4) / 1) + 1 = 8 > input size n = 6

These results showcase how the padding impacts the output size.
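These shapes can be checked numerically: SciPy's correlate2d supports valid, same, and full modes that correspond to the three padding strategies at stride s = 1 (a quick sanity check, not part of a training pipeline):

```python
import numpy as np
from scipy.signal import correlate2d  # cross-correlation, as in the formula above

image = np.random.rand(6, 6)   # n = 6
kernel = np.random.rand(3, 3)  # f = 3

# SciPy's modes map onto the three padding strategies (stride fixed at 1):
for mode in ("valid", "same", "full"):
    print(mode, correlate2d(image, kernel, mode=mode).shape)
# valid (4, 4), same (6, 6), full (8, 8), matching O = (n - f + 2p) / s + 1
```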

The Number of Filters and Model Parameters

Lastly, I’ll cover the model parameters in a convolutional block.

The architecture in Figure B uses a single filter with three 3-by-3 kernels because the input data has a depth of three (the number of kernels in a filter must match the input depth).

Each kernel has nine (3 × 3) weights.

Hence, the single filter has 27 weights (9 weights × 3 kernels) and 1 bias term, for a total of 28 learnable parameters.

These model parameters are optimized during training.

When the network has multiple filters, it affects the depth of the output (feature maps).

For example, two distinct filters yield two different feature maps, creating a depth of two:

Figure D. A standard convolutional block architecture with two filters (Created by Kuriko IWAI)

In Figure D, the pooling layer yields two activation maps because the convolutional layer with two filters yields two feature maps.

Each of these activation maps has highly activated neurons (pink cells in Figure D), depending on features that it captures.

So, using more filters allows the network to capture a greater variety of features from the input data, making it more suitable for tasks with high variability like recognizing real-world objects.

On the other hand, adding filters significantly increases the number of learnable model parameters.

From Figure B to Figure D, the number of parameters increases from 28 to 56 per convolutional layer because of the added filter.

This leads to longer training times, higher memory usage, and an increased risk of overfitting.

Finding the right balance is a key part of designing a CNN architecture.
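As a quick check of the arithmetic above, PyTorch's Conv2d reports the same counts; the layer sizes mirror Figures B and D:

```python
import torch.nn as nn

def n_params(layer):
    return sum(p.numel() for p in layer.parameters())

# One filter over a 3-channel input: 3 kernels x (3 x 3) weights + 1 bias = 28
print(n_params(nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)))  # 28

# Two filters double the count: 2 x (27 weights + 1 bias) = 56
print(n_params(nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)))  # 56
```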

Batch Normalization

Some convolutional blocks include a batch normalization (BN) step before the activation function.

This process normalizes the feature maps, which helps stabilize the training process by reducing internal covariate shift.

Internal covariate shift is a phenomenon where the distribution of the layer inputs in the neural network changes during training.

This shift forces each layer to readjust to the changing input distribution every epoch, which slows down the training process.

BN is a common solution to this challenge because the normalization (shifting each feature map to zero mean and unit variance) condenses and stabilizes the input distribution.

Non-Linear Activation

In the last part of the operation in the convolutional layer, a non-linear activation function is applied to each feature map.

A common choice is the ReLU function, which returns the maximum of zero and the neuron's input x (the sum of weighted inputs and the bias):

f(x) = \max(0, x)

This process introduces non-linearity into the network, enabling it to learn complex patterns.
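Putting the three steps together, a convolutional layer as described (convolution, then batch normalization, then ReLU) might be sketched in PyTorch as follows; the 16 output channels and 32×32 input are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

conv_layer = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),  # normalizes each of the 16 feature maps over the batch
    nn.ReLU(),           # f(x) = max(0, x), applied element-wise
)

x = torch.randn(8, 3, 32, 32)   # a batch of 8 RGB images
out = conv_layer(x)
print(out.shape)                # torch.Size([8, 16, 32, 32])
print((out >= 0).all().item())  # True: ReLU has zeroed out negative values
```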

Pooling Layer

A pooling layer downsamples the feature maps by reducing their spatial dimensions and generates activation maps.

The main purposes of the pooling layer are:

  • Reducing computational load by shrinking the feature maps, and with them the number of parameters and calculations in subsequent layers, and

  • Creating translation invariance to make the network more robust to feature shifts in the input, enabling it to recognize the feature regardless of its precise position.

Each generated activation map acts as a summary of the features in the input data extracted by the convolutional layers.

Types of Pooling Operations

The below diagram shows various types of pooling operations:

Figure E. Various pooling operations with stride s = 2 (Created by Kuriko IWAI)

Max Pooling

The most common is max pooling, which selects the most activated feature in a given region.

This method extracts the most prominent feature in the region while discarding the rest as less important.

Common use cases:

  • Most computer vision tasks for classification, where the goal is to identify the presence of a feature regardless of its precise location.

Average Pooling

Average pooling computes the average value of the elements within a pooling window.

Unlike max pooling, average pooling takes into account all values in the region. This can be beneficial in certain scenarios, as it helps to smooth out the feature map and reduce noise.

Common use cases:

  • Medical imaging.

  • Satellite imagery applications.

  • Any task that depends on the overall distribution of features in a region, where broad patterns matter more than single strong signals.

Lp-Pooling

Lp-pooling is a generalized form of pooling that includes both average and max pooling as special cases by computing the Lp norm of the values in the pooling window:

  • When p=1, it becomes average pooling.

  • When p=∞, it becomes max pooling.

Common use cases:

  • A generalized approach used in research.

Mixed Pooling

Mixed pooling is a linear combination of max pooling and average pooling.

The network can learn the optimal combination for a given task, making it more flexible.

Common use cases:

  • A generalized approach used in research.

Stochastic Pooling

Instead of deterministically choosing a value (like the max or average), stochastic pooling randomly samples an activation from the pooling region with a probability proportional to its magnitude.

This adds a degree of randomness that can help with regularization and reduce overfitting.

Common use cases:

  • A regularization method to combat overfitting, especially when dealing with smaller datasets.

Global Pooling

Instead of applying a small sliding window, global pooling summarizes an entire feature map into a single value.

Global pooling is applied at the end of the convolutional part of a network, just before the final fully connected layers.

Two distinct types are:

  • Global Average Pooling (GAP) that calculates the average of all elements in the feature map, and

  • Global Max Pooling (GMP) that takes the maximum value from the entire feature map.

Although many variants exist, these pooling operations all serve the same general purpose of downsampling feature maps to reduce dimensionality and computational load while providing translation invariance.
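For illustration, the sketch below applies several of these pooling variants to the same stack of feature maps in PyTorch; the shapes are arbitrary, and the Lp example uses p = 2:

```python
import torch
import torch.nn as nn

fmaps = torch.randn(1, 16, 8, 8)  # a batch with 16 feature maps of size 8x8

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keeps the strongest signal
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)  # averages each region
lp_pool = nn.LPPool2d(norm_type=2, kernel_size=2, stride=2)  # Lp-pooling, p = 2
gap = nn.AdaptiveAvgPool2d(1)                     # global average pooling (GAP)

print(max_pool(fmaps).shape)  # torch.Size([1, 16, 4, 4])
print(avg_pool(fmaps).shape)  # torch.Size([1, 16, 4, 4])
print(lp_pool(fmaps).shape)   # torch.Size([1, 16, 4, 4])
print(gap(fmaps).shape)       # torch.Size([1, 16, 1, 1]): one value per map
```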

Alternatives to Pooling Layers

Pooling layers can be replaced entirely by convolutional layers with a larger stride.

For instance, a stride of two reduces the spatial dimensions of the feature map by half, achieving the same downsampling effect as a pooling layer.

There are two distinct benefits:

  • Learned downsampling: Allows the network to learn the optimal downsampling operation, potentially leading to more expressive models and better performance.

  • Preserving spatial information throughout the network: Strided convolutions can retain more information about the input because they do not discard values within a region outright, allowing the network to preserve more context.

In particular, replacing all pooling layers with strided convolutions creates a fully convolutional network in which spatial information is maintained throughout.

Research shows that this architecture is preferred for tasks that require complex context understanding, such as semantic segmentation, anomaly detection, object detection, or image super-resolution.

But as we saw in Figures B and D, adding filters or convolutional layers increases the number of learnable parameters, making the model more complex and computationally expensive.

So, considering the trade-off is key.
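A minimal shape comparison of the two downsampling options, assuming 16-channel feature maps; the strided convolution halves the spatial size just like pooling, but at the cost of learnable parameters:

```python
import torch
import torch.nn as nn

fmaps = torch.randn(1, 16, 8, 8)

pool = nn.MaxPool2d(kernel_size=2, stride=2)                          # no parameters
strided_conv = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)  # learned

print(pool(fmaps).shape)          # torch.Size([1, 16, 4, 4])
print(strided_conv(fmaps).shape)  # torch.Size([1, 16, 4, 4]), same downsampling
print(sum(p.numel() for p in strided_conv.parameters()))  # 16*16*9 + 16 = 2320
```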

The Fully Connected (FC) Layer

The fully connected (FC) layer, also known as a dense layer, is the final transformation stage in a CNN, where every neuron is connected to every neuron in the preceding layer.

The FC layer is responsible for making the ultimate decision based on the feature maps, the features extracted by the convolutional blocks.

The extracted features range from local patterns like edges, corners, and textures to high-level, abstract representations, and the FC layer uses them to perform the final task, such as:

  • Classification: For an image classification problem, the FC layer takes the flattened feature map and outputs the probability for each possible class.

  • Regression: For a regression task, the FC layer outputs a single value.

The Flattening Step

As shown in Figure A, before the FC layer processes the feature maps, they are converted into a single, long 1D vector, a process called flattening.

This is necessary because FC layers only accept 1D vectors as input.

The flattened vector then serves as the input to the first FC layer and is processed accordingly.
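A small sketch of the flattening step in PyTorch, assuming 32 feature maps of size 8×8 (the shapes are illustrative):

```python
import torch
import torch.nn as nn

fmaps = torch.randn(1, 32, 8, 8)  # 32 feature maps of size 8x8

flatten = nn.Flatten()            # collapses (32, 8, 8) into a single vector
vector = flatten(fmaps)
print(vector.shape)               # torch.Size([1, 2048]) = 32 * 8 * 8

fc = nn.Linear(2048, 64)          # the first FC layer consumes that vector
print(fc(vector).shape)           # torch.Size([1, 64])
```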

Consideration

FC layers are excellent at learning global patterns and relationships among the features, making them highly effective for the final decision-making step.

But FC layers are parameter-heavy.

Because every neuron in the layer is fully connected to every neuron in the previous layer, the number of weights can grow very large.

This makes the network more prone to overfitting and increases computational cost.

Regularization techniques like dropout are used to mitigate this risk.

The Output Layer

The output layer in a CNN plays a critical role as the final layer that produces the actual output of the network.

The output layer takes the high-level, abstracted features from the convolutional blocks and FC layers and transforms them into a final output form.

As shown in Figure A, for classification tasks, the output layer applies a softmax activation to generate a probability distribution over the predefined classes (bird, lion, and cat).

The softmax ensures that the output probabilities sum to 1, making them directly interpretable as class probabilities.

For regression tasks, the output layer consists of one or more neurons with a linear (or no) activation function, providing continuous output values.
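As a tiny illustration of the classification case, softmax turns raw FC outputs (logits) into a probability distribution; the logit values here are made up:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0]])     # made-up FC outputs for 3 classes
probs = nn.functional.softmax(logits, dim=1)  # bird, lion, cat probabilities

print(probs)               # ~tensor([[0.7856, 0.1753, 0.0391]])
print(probs.sum().item())  # 1.0: a valid probability distribution
```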

***

And that’s all for the CNN architecture.

After the convolutional layers have extracted a hierarchy of features, the network architecture transitions to a structure similar to a standard feedforward network.

During training, the learnable model parameters (the weights and biases of both the FC layers and the convolutional kernels) are optimized through backpropagation.

Types of Convolutional Neural Networks

A variety of CNN architectures have been developed to address specific challenges.

First, CNNs are classified into three groups based on the number of kernel dimensions:

  • 1D CNNs,

  • 2D CNNs (the most common type; we used a Sobel kernel in Figure C), and

  • 3D CNNs

Let us take a look.
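Before diving in, the sketch below contrasts the input shapes that 1D, 2D, and 3D convolutions expect in PyTorch (the channel counts and sizes are arbitrary):

```python
import torch
import torch.nn as nn

# 1D: (batch, channels, sequence length), e.g. a sensor signal
print(nn.Conv1d(1, 4, kernel_size=3)(torch.randn(1, 1, 100)).shape)
# torch.Size([1, 4, 98])

# 2D: (batch, channels, height, width), e.g. an RGB image
print(nn.Conv2d(3, 4, kernel_size=3)(torch.randn(1, 3, 32, 32)).shape)
# torch.Size([1, 4, 30, 30])

# 3D: (batch, channels, depth, height, width), e.g. a video or a CT volume
print(nn.Conv3d(1, 4, kernel_size=3)(torch.randn(1, 1, 16, 32, 32)).shape)
# torch.Size([1, 4, 14, 30, 30])
```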

1D CNNs

1D CNNs are used for sequential data like time series analysis or natural language processing, where the filters move in one dimension along the sequence.

Common use cases:

  • Analyzing and extracting features from sequential data like text, audio, or sensor data.

Cons:

  • Not suitable for tasks requiring spatial feature extraction from images or video because 1D CNNs only consider one dimension of the input data.

2D CNNs

2D CNNs are the standard types used for image and video data.

As we observed in the previous section, the filters move over a two-dimensional plane to capture spatial features, making them highly effective at capturing spatial hierarchies.

Common use cases:

  • Image classification, object detection, and any task involving static visual data.

Cons:

  • Not optimized for volumetric data or sequences with a strong temporal component.

Major 2D CNN Models

Here are some major models with 2D CNN architecture:

LeNet-5

  • A pioneering model for handwritten digit recognition.

  • Best When: As a foundational model for simple image classification tasks with low-resolution images.

  • Cons: Limited in depth and capacity, making it unsuitable for complex, high-resolution image tasks.

Figure F. LeNet (Created by Kuriko IWAI)

AlexNet

  • Significantly deeper and wider than LeNet-5, helping popularize deep learning.

  • Best When: For a more powerful image classification baseline than LeNet, suitable for larger datasets.

  • Cons: Relatively shallow by today's standards, with less efficient large filters (11×11, 5×5).

Figure G. AlexNet (Created by Kuriko IWAI)

VGGNet

  • A simple, uniform architecture that increases network depth using small (3×3) filters.

  • Best When: Used as a feature extractor for other models (robust and easy-to-understand).

  • Cons: Slow training and deployment due to its large number of parameters and high memory consumption.

Figure H. VGGNet (Created by Kuriko IWAI)

GoogLeNet

  • A deep network with fewer parameters, leveraging the Inception module, which applies multiple filter sizes and a pooling operation in parallel within a single layer.

  • Best When: For tasks requiring a high-performing and computationally efficient model.

  • Cons: Its relatively complex architecture.

Figure I. GoogLeNet (Created by Kuriko IWAI)

ResNet (Residual Network)

  • A groundbreaking model that solved the vanishing gradient problem in very deep networks.

  • Best When: For training extremely deep networks for any vision task, as its residual connections prevent performance degradation with increasing depth.

  • Cons: The sheer depth can still lead to long training times, although they are more manageable than similarly deep non-residual networks.

Figure J. ResNet (Created by Kuriko IWAI)

DenseNet

  • Its defining feature is dense connectivity: each layer is connected to every subsequent layer in a feed-forward manner.

  • Best When: For tasks where memory and computational resources are a concern.

  • Cons: The dense connections lead to a large number of feature maps, consuming memory.

Figure K. DenseNet (Created by Kuriko IWAI)

3D CNNs

3D CNNs are used for volumetric data, such as medical scans (MRIs, CT scans) or video classification.

The filters operate in three dimensions, capturing spatial and temporal information simultaneously.

Common use cases:

  • Analyzing data with three dimensions where both spatial and temporal features are important for classification.

Cons:

  • Computationally expensive.

  • Require a large amount of data for training due to the high number of parameters.

Major Models

DenseNet

  • Although common implementations of DenseNet use 2D convolutions, the core principle of "dense blocks" can be extended to 3D convolutions for applications like medical imaging or video analysis.

Wrapping Up

CNNs are powerful neural networks primarily used for analyzing visual imagery.

In this article, we explored the inner workings of a standard convolutional neural network (CNN) and observed how it performs across a variety of architectures.

By understanding the fundamental building blocks—convolutional, pooling, and fully connected layers—we can see how these networks are uniquely equipped to process spatial data like images.

Looking ahead, the evolution of CNNs promises even more sophisticated applications, from advancing medical diagnostics to enabling more robust autonomous systems.


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.