Beyond the Black Box: Architecting Deep Feedforward Networks with NumPy
Deep Dive into Feedforward Architectures with Math and Code
By Kuriko IWAI

Table of Contents
- Introduction
- Designing a DFN Architecture
- Adding Back Propagation
- Adding Optimizer
- Model Training
- Simulation
- Wrapping Up

Introduction
Deep Feedforward Networks (DFNs) are the foundational artificial neural networks that many deep learning algorithms are built on.
In this article, I'll explore their mathematical formulation with NumPy to gain deeper insight into how these networks operate inside the black box.
◼ What is a Deep Feedforward Network?
A Deep Feedforward Network (DFN) is a key type of Feedforward Neural Network (FNN) with at least two hidden layers between the input and output layers.
This multi-layered (“deep“) structure enables DFNs to learn increasingly abstract and hierarchical features from data.
Key characteristics of a DFN include:
Multiple Hidden Layers: At least two hidden layers between the input and output layers.
Fully Connected (Dense) Layers: Each neuron in a given layer is connected to every neuron in the subsequent layer.
Parametric: Defined by a finite set of parameters (the weights and biases).
Flexible Outputs: Produces continuous values (for regression tasks, using a linear activation function) or discrete values (for classification tasks, using a sigmoid or softmax function).
Non-linear Activation Functions: Uses non-linear activation functions in their hidden layers (e.g., sigmoid, tanh, ReLU).
This non-linearity enables DFNs to learn and approximate complex, non-linear relationships, overcoming the limitations of single-layer perceptrons.
In fact, with enough hidden neurons and proper training, DFNs become universal function approximators, capable of approximating any continuous function on a compact domain to a desired accuracy.
◼ Various Architectures in Feedforward Neural Network (FNN)
While Feedforward Neural Network is a broad category, here are some specific architectures that fall under it:
Single-Layer Perceptron
The simplest form of an FNN, with only an input and an output layer.
Can only solve linearly separable problems.
Multi-Layer Perceptron
An FNN with one or more hidden layers between the input and output layers.
DFNs fall under the Multi-Layer Perceptron category.
Radial Basis Function (RBF) Networks:
Uses a radial basis function (e.g., a Gaussian) as the activation function of the hidden layer.
Often used for function approximation and classification tasks.
Autoencoders (specifically the encoder and decoder parts):
Designed for unsupervised learning, specifically for learning efficient data encodings.
Consists of two parts: an encoder and a decoder (both are typically MLPs).
Designing a DFN Architecture
I’ll begin by designing a foundational Deep Feedforward Network (DFN) architecture optimized for tasks on structured/tabular data.
◼ How Feedforward Network Works
To simplify the concept, I’ll use a network with two hidden layers as an example and walk through the math and code behind each operation.
The entire architecture looks like this:

Figure A. Forward pass process in the DFN with two hidden layers (Created by Kuriko IWAI)
◼ Forward Pass
A forward pass is the process of feeding input data through a neural network to produce an output.
Let us go through this process step by step:
◼ Step 1. Receiving Input
The input layer takes an input (X) with n features per sample; a single sample is the vector:

x = (x_1, x_2, ..., x_n)
◼ Step 2. Performing a Linear Transformation
The model performs a linear transformation on the input with randomly initialized weights (W_1) and a bias (b_1) to generate a value (Z_1).
Mathematically, this is the weighted sum of the input features:

Z_1 = W_1^T x + b_1
The process is expressed simply with the ndarray from the NumPy library:

import numpy as np

z_1 = np.dot(W1.T, x) + b1
▫ Fully Connected Layers
Before moving on to the next step, recall that DFNs consist of fully connected layers.
This means that every neuron in a layer takes the input, computes a value, and passes it to every neuron it connects to in the next layer.
For instance, with n = 4, each entry of the input vector (x1, x2, x3, x4) is fed to each neuron with its corresponding weight value.
To keep things simple, I’ll focus on the layer level in this article.
◼ Step 3. First Hidden Layer Activation
The first hidden layer (h1) computes a value (A1) by applying its activation function to the input Z1.
▫ Choosing the Right Activation Function
I’ll choose ReLU (Rectified Linear Unit) as the activation function for the hidden layers due to its robustness and computational simplicity.
Mathematically, ReLU is defined as:

ReLU(z) = max(0, z)

Hence,

A_1 = max(0, Z_1)
This process is added to the coding block:
import numpy as np

z_1 = np.dot(W1.T, x) + b1 # linear transformation
A_1 = np.maximum(0, z_1) # ReLU
While no single activation function suits every scenario, understanding the problem's properties allows for a more informed choice, leading to faster network convergence.
For instance, for classifiers, Sigmoid functions and their variants often perform well.
However, both Sigmoid and tanh functions can suffer from the vanishing gradient problem especially in deeper networks, which is why ReLU is a popular default for most modern neural networks.
If dead neurons become an issue, Leaky ReLU is a solid alternative.
Learn More: Various activation functions: A Comprehensive Guide on Neural Network in Deep Learning
I’ll switch to other options in case ReLU doesn’t perform well later.
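For reference, Leaky ReLU is a one-line change in NumPy (a small sketch, with the conventional 0.01 slope as an assumption):

import numpy as np

def leaky_relu(Z, slope=0.01):
    # passes positive values through; gives negatives a small, non-zero gradient
    return np.where(Z > 0, Z, slope * Z)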
◼ Step 4: Second Linear Transformation
Perform a second linear transformation on A1 using a new, randomly initialized weight matrix (W2) and bias vector (b2):

Z_2 = W_2^T A_1 + b_2

This step is identical to Step 2, except it takes its input from the preceding hidden layer.
◼ Step 5: Second Hidden Layer Activation
The second hidden layer (h2) applies an activation function (ReLU in our case) to Z2, generating the output value A2 = max(0, Z2). This mirrors Step 3; the only difference is using Z2 as input instead of Z1.
◼ Step 6: Final Linear Transformation
Perform a final linear transformation on A2 using a randomly initialized weight matrix (W3) and bias vector (b3):

Z_3 = W_3^T A_2 + b_3
◼ Step 7. The Output Layer
The output layer applies an output function to Z3 to generate the final prediction, denoted as y^.
The choice of output function depends on the nature of the problem:
Binary Classification: For problems with two classes, the sigmoid function is commonly used. It squashes the output to a value between 0 and 1, which can be interpreted as a probability.
Multinomial Classification: For problems with more than two classes, the softmax function is applied. This converts the outputs into a probability distribution over the classes, ensuring that the probabilities sum to 1.
Regression (Continuous Output): For continuous output values, a linear activation function (or no activation function) is typically used. This allows the neural network to output any real number, without bounding the prediction within a specific range.
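Each of these output functions is a short function in NumPy (a reference sketch; the one-sample-per-row layout is my assumption):

import numpy as np

def sigmoid(z): # binary classification
    return 1 / (1 + np.exp(-z))

def softmax(z): # multinomial classification; rows are samples
    e = np.exp(z - z.max(axis=1, keepdims=True)) # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def linear(z): # regression: identity, no bounding
    return z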
I’ll choose the sigmoid function for a binary classification task:

σ(z) = 1 / (1 + e^(-z))
With Steps 4 to 7 added, the coding block looks like:

import numpy as np

Z1 = np.dot(W1.T, x) + b1 # step 2
A1 = np.maximum(0, Z1) # step 3

Z2 = np.dot(W2.T, A1) + b2 # step 4 (input is A1, not x)
A2 = np.maximum(0, Z2) # step 5

Z3 = np.dot(W3.T, A2) + b3 # step 6 (input is A2)

Y_pred = 1 / (1 + np.exp(-Z3)) # step 7 (output = sigmoid function)
Notice that Steps 3 to 6 can be repeated (or removed) depending on the number of hidden layers.
Generally, the more input features a dataset has, the more complex the problem space becomes, so adding more hidden layers often helps.
With this in mind, I’ll finalize the forward pass by separately defining _relu(), _sigmoid(), and forward_pass() functions, taking the layer sizes layer_dims as an argument so the network architecture stays flexible:
import numpy as np

def _relu(Z):
    return np.maximum(0, Z)

def _sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_pass(X, layer_dims: list = [10, 8, 4, 1]):
    # 1. initializes and stores parameters (W, b), linear transformation results (Z),
    #    and activation results (A) in dicts (naively zero-initialized; revisited below)
    W = {l: np.zeros((layer_dims[l-1], layer_dims[l])) for l in range(1, len(layer_dims))}
    b = {l: np.zeros((1, layer_dims[l])) for l in range(1, len(layer_dims))}
    A = { 0: X, } # input layer = 0
    Z = dict()

    # 2. starts a forward pass in the hidden layers (using ReLU)
    L = len(layer_dims) - 1 # number of layers excluding the input layer

    # looping Step 2 to 6 (or more, depending on the layer size)
    # note: X holds one sample per row here, so the transform is A·W rather than Wᵀx
    for l in range(1, L):
        Z_l = np.dot(A[l-1], W[l]) + b[l]
        A_l = _relu(Z_l)

        # stores Z_l, A_l for the next layer to refer to
        A[l] = A_l
        Z[l] = Z_l

    # 3. generates the output (using sigmoid)
    Z_L = np.dot(A[L-1], W[L]) + b[L] # takes activation results from the previous layer (A[L-1])
    Y_pred = _sigmoid(Z_L)

    return Y_pred
◼ One Last Note for Forward Pass - Weight Initialization
In the coding block above, I initialized the weights to zero in the forward_pass() function. This is actually a bad idea: every neuron produces zero output, receives an identical gradient, and the network can never break symmetry.
A common practice for weight initialization with ReLU is He Initialization, which takes the following steps:
- Initialize the weights from a Gaussian distribution with mean 0.
- Scale the variance to 2/fan_in, where "fan-in" (n_in) is the number of input connections to the neuron.
Thus, the weights are initialized as:

W ~ N(0, 2/n_in), i.e., W = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)
He Initialization is primarily developed for activation functions with a non-zero gradient for positive inputs and a zero gradient for negative inputs, which include:
ReLU: max(0, x)
Leaky ReLU: max(0.01x, x) (or other small negative slope)
PReLU (Parametric ReLU): max(ax, x) where a is learned.
ELU (Exponential Linear Unit): shares some characteristics with the ReLU variants.

Figure B. ReLU, LeakyReLU, PReLU (source)
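A quick empirical check of why the 2/fan_in scaling matters (my own illustration, not from the original): He initialization is designed to preserve the mean squared activation through a ReLU layer.

import numpy as np

np.random.seed(0)
n_in, n_out = 512, 512
X = np.random.randn(1000, n_in)

W_he = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)
A = np.maximum(0, X @ W_he) # one ReLU layer

print(np.mean(X**2), np.mean(A**2)) # both ≈ 1: the signal scale survives the layer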
For the final touch, I’ll refine the code by adding an _initialize_parameters() function that applies He Initialization to the weights across all layers, while the biases start at zero:
import numpy as np

# added
def _initialize_parameters(layer_dims: list = [10, 8, 4, 1]):
    np.random.seed(42)

    W = dict()
    b = dict()

    for l in range(1, len(layer_dims)):
        # He Initialization
        W[l] = np.random.randn(
            layer_dims[l-1], layer_dims[l]) * np.sqrt(2 / layer_dims[l-1])

        # bias initialization (zero)
        b[l] = np.zeros((1, layer_dims[l]))
    return W, b

def _relu(Z):
    return np.maximum(0, Z)

def _sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_pass(X, layer_dims: list = [10, 8, 4, 1]):
    W, b = _initialize_parameters(layer_dims=layer_dims) # updated
    A = { 0: X, }
    Z = dict()

    L = len(layer_dims) - 1
    for l in range(1, L):
        Z_l = np.dot(A[l-1], W[l]) + b[l]
        A_l = _relu(Z_l)

        A[l] = A_l
        Z[l] = Z_l

    Z_L = np.dot(A[L-1], W[L]) + b[L]
    Y_pred = _sigmoid(Z_L)

    return Y_pred
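As a quick sanity check (a minimal sketch; the sample count of 5 is my own choice), the forward pass can be run on random data matching layer_dims[0] = 10 features:

X = np.random.randn(5, 10) # 5 samples, 10 features
Y_pred = forward_pass(X)
print(Y_pred.shape) # (5, 1): one sigmoid probability per sample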
Adding Back Propagation
Deep learning frameworks automatically perform back propagation, an algorithm that efficiently computes the gradients of the loss function with respect to the model's parameters (W, b) via the intermediate values (Z, A).
This process is essential because the model passes the computed gradients to the optimizer, which learns the optimal parameters (W, b) to make the best predictions.
Technically speaking, gradients can be computed with or without an optimizer. But without an optimizer, the parameters are never updated, so the model cannot learn anything.
So, I’ll first demonstrate the back propagation process step by step, and in the later section, demonstrate how the optimizer utilizes the gradients in the learning process.
◼ Objective: Minimizing the Loss
The model’s objective is to find the parameters that can minimize the loss (an error between a model’s prediction and the true value).
To compute the loss, we use various loss functions depending on the specific task type (e.g., Mean Squared Error for regression problems).
I’ll choose the cross entropy loss function for a binary classification task.
The model starts by computing the binary cross entropy loss of the prediction Y_pred:

L = -(1/M) Σ_{i=1..M} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]

Here, M is the number of samples in a specific batch.
In deep learning, the total dataset size is often enormous, making it computationally unrealistic to calculate the loss at once.
Therefore, the model processes the data in smaller subsets called mini-batches (or simply "batches") to compute the loss and its gradients.
Typical batch sizes range from 16 to 256 - 32 and 64 are very common choices.
▫ Applying Regularization to Prevent Overfitting
Regularization is a technique in machine learning used to prevent overfitting and improve the model's ability to generalize to unseen data.
Mathematically, it adds a penalty term (a small cost) that increases when the model starts to memorize the training samples, noise included (so even as the loss itself approaches zero, the penalty grows). Because of these conflicting objectives, the model has to carefully balance the loss against the penalty.
In deep learning, L2 regularization (also known as weight decay) is a common practice:

L_total = L_CE + (λ / 2M) Σ_{l=1..L} ||W_l||_F²

(M: batch size, λ: weight decay coefficient defining the regularization strength (higher = stronger), L: total number of layers, W_l: weight matrix of the l-th layer)
In the formula above, the squared Frobenius norm serves as the L2 penalty.
It computes the sum of the squares of all individual weights in the weight matrix (W_l), so the penalty grows with the magnitude of the weights in each layer. The model is therefore encouraged to keep the weights small, which generalizes the learning while minimizing the loss.
I’ll define this loss in the compute_loss() function. Assuming it is applied to a selected batch, the code first clips the predictions by a small epsilon to avoid log(0) (and hence negative infinity), computes the cross entropy loss, then loops over all layers to accumulate the L2 term.
import numpy as np

def compute_loss(Y_pred, Y_true, W, lambda_reg = 0.01, layer_dims: list = [10, 8, 4, 1]):
    m = Y_true.shape[0]

    # clips predictions to avoid log(0)
    epsilon = 1e-8
    Y_pred = np.clip(Y_pred, epsilon, 1 - epsilon)

    # computes cross entropy loss
    cross_entropy_loss = - (1/m) * np.sum(Y_true * np.log(Y_pred) + (1 - Y_true) * np.log(1 - Y_pred))

    # computes L2 terms
    l2_regularization_cost = 0
    for l in range(1, len(layer_dims)):
        l2_regularization_cost += np.sum(np.square(W[l]))
    l2_regularization_cost = (lambda_reg / (2 * m)) * l2_regularization_cost

    # computes the total loss (L)
    L = cross_entropy_loss + l2_regularization_cost
    return L
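As a quick check (a sketch with made-up labels), a constant prediction of 0.5 should give a cross entropy of ln 2 ≈ 0.693 plus a tiny L2 term:

np.random.seed(0)
W, b = _initialize_parameters() # He-initialized weights from earlier
Y_true = np.random.randint(0, 2, size=(32, 1))
Y_pred = np.full((32, 1), 0.5) # maximally uncertain predictions
print(compute_loss(Y_pred, Y_true, W)) # ≈ 0.693 plus a small L2 cost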
◼ Computing the Gradients Backward
After defining the loss function and computing the loss, the model starts backward propagation, working step by step in reverse, starting from the loss.
To understand how it works, I’ll lay out its mathematical process side by side with the forward pass steps:

(A_0 = X, A_3 = Y_pred; ⊙ I(Z>0) means elements are zeroed out where Z ≤ 0.)
Figure C. Mathematical formation of forward pass and backward propagation (using cross entropy loss, L2 regularization) (Created by Kuriko IWAI)
The backward pass aims to transmit the partial derivatives (gradients) with respect to the weights (W) and biases (b) to the optimizer.
To compute these gradients, the model must first calculate the gradients with respect to the activations (A) and pre-activations (Z) within each layer.
In the figure, I added the gradients for each parameter—A, Z, W, and b—from left to right as columns.
The process begins with the computation of the gradient of the loss with respect to A3 (shown in red in the figure).
This value is then used to compute the gradient of the loss with respect to Z3 (pink), which in turn is used to compute the gradients with respect to W3 and b3. The same procedure continues in the rest of the layers (green, blue in the figure).
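Written out for our sigmoid output and cross entropy loss (a plain-notation summary of the figure; the sigmoid-plus-cross-entropy simplification dL/dZ_3 = A_3 - Y is a standard identity):

dL/dA_3 = -( Y / A_3 - (1 - Y) / (1 - A_3) )
dL/dZ_3 = dL/dA_3 · σ'(Z_3) = A_3 - Y
dL/dW_3 = (1/M) A_2ᵀ (dL/dZ_3) + (λ/M) W_3
dL/db_3 = (1/M) Σ over the batch of dL/dZ_3
dL/dA_2 = (dL/dZ_3) W_3ᵀ, then dL/dZ_2 = dL/dA_2 ⊙ I(Z_2 > 0), and so on down to layer 1.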
I’ll add this process in the coding block.
Similar to the forward pass, I’ll first define three functions: _relu_backward(), _sigmoid_backward(), and backward_pass(). Inside backward_pass(), I start by computing the gradients in the output (third) layer, then loop backward through the hidden layers, storing the gradients w.r.t. W and b so the function can pass them to the optimizer.
import numpy as np

def _relu_backward(dLdA, Z):
    dLdZ = np.array(dLdA, copy=True)
    dLdZ[Z <= 0] = 0
    return dLdZ

def _sigmoid_backward(dLdA, Z):
    sigmoid = 1 / (1 + np.exp(-Z))
    dLdZ = dLdA * sigmoid * (1 - sigmoid)
    return dLdZ

def backward_pass(Y_pred, Y_true, Z, A, W, lambda_reg = 0.01, layer_dims: list = [10, 8, 4, 1]):
    # stores gradients computed in the process (to pass to the optimizer)
    dLdW = dict()
    dLdb = dict()

    m = Y_true.shape[0]
    L = len(layer_dims) - 1 # output layer index

    # dL/dA3
    dLdA = - (np.divide(Y_true, Y_pred) - np.divide(1 - Y_true, 1 - Y_pred))

    # dL/dZ3 (using sigmoid func = output func)
    dLdZ = _sigmoid_backward(dLdA, Z[L])

    # dL/dW3
    dLdW[L] = (1/m) * np.dot(A[L-1].T, dLdZ) + (lambda_reg / m) * W[L]

    # dL/db3
    dLdb[L] = (1/m) * np.sum(dLdZ, axis=0, keepdims=True)

    # gradient flowing into the last hidden layer
    dLdA_prev_layer = np.dot(dLdZ, W[L].T)

    # gradients in the hidden layers (l = 2, 1)
    for l in reversed(range(1, L)):
        # dL/dZ
        dLdZ = _relu_backward(dLdA_prev_layer, Z[l])

        # dL/dW
        dLdW[l] = (1/m) * np.dot(A[l-1].T, dLdZ) + (lambda_reg / m) * W[l]

        # dL/db
        dLdb[l] = (1/m) * np.sum(dLdZ, axis=0, keepdims=True)

        # computes dL/dA for the previous layer
        if l > 1:
            dLdA_prev_layer = np.dot(dLdZ, W[l].T)

    return dLdW, dLdb
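Hand-derived gradients are easy to get subtly wrong, so a finite-difference check is worth a few extra lines. Here is a sketch (my own addition, run on a tiny one-layer logistic model rather than the full DFN) comparing the analytic gradient with a numerical estimate:

import numpy as np

np.random.seed(0)
X = np.random.randn(8, 3) # 8 samples, 3 features
Y = (np.random.rand(8, 1) > 0.5).astype(float) # binary targets
W = np.random.randn(3, 1) * 0.1

def loss(W):
    Y_pred = 1 / (1 + np.exp(-X @ W))
    return -np.mean(Y * np.log(Y_pred) + (1 - Y) * np.log(1 - Y_pred))

# analytic gradient of the cross entropy: dL/dW = (1/m) X^T (Y_pred - Y)
Y_pred = 1 / (1 + np.exp(-X @ W))
grad_analytic = X.T @ (Y_pred - Y) / X.shape[0]

# numerical gradient via central differences
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(W.size):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus.flat[i] += eps
    W_minus.flat[i] -= eps
    grad_numeric.flat[i] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric))) # should be ~1e-9 or smaller

If the printed gap stays near machine precision, the backward math is trustworthy; the same trick scales to checking backward_pass() layer by layer.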
Adding Optimizer
Now, we can move onto the optimizer settings.
Here, out of the various optimizers, I’ll choose Adam (Adaptive Moment Estimation) for its robustness and computational efficiency.
◼ How Adam Optimizer Works
Adam updates the parameters W and b at every training step, adjusting their values with its own adaptive learning rates:

W_l := W_l - α · m̂ / (√v̂ + ε)
b_l := b_l - α · m̂ / (√v̂ + ε)

(W_l: weight matrix in the l-th layer, b_l: bias vector in the l-th layer, t: step, α: learning rate (default is 0.001), m̂: bias-corrected first moment estimate, v̂: bias-corrected second moment estimate, ε: small constant to avoid division by zero.)
▫ How Adam Leverages the Gradients
To compute the adaptive learning rate, Adam estimates the first and second moments (m̂ and v̂, respectively).
This process requires the gradient of the loss function with respect to the weight or bias depending on the parameter that Adam is optimizing.
Take the weights as an example:
- First moment estimate: Adam leverages the concept of Momentum to accelerate convergence by averaging past gradients: m_t = β_1 · m_{t-1} + (1 - β_1) · g_t
- Second moment estimate: Adam leverages the idea behind RMSprop to scale its learning rate by the gradient magnitude: v_t = β_2 · v_{t-1} + (1 - β_2) · g_t²
(β_1, β_2: decay rates, typically set to β_1 = 0.9, β_2 = 0.999; g_t: the gradient dL/dW at step t)
Both raw moments are then bias-corrected before use: m̂ = m_t / (1 - β_1^t), v̂ = v_t / (1 - β_2^t).
So, I’ll add this process to the coding block as the update_parameters_adam() function. It computes the first and second moments and updates W and b across all layers. One caveat: the moment estimates must persist across update steps, so the standalone version below only sketches the idea; the class version later stores them as attributes:
def update_parameters_adam(learning_rate, W, b, dLdW, dLdb, t=0, beta_1 = 0.9, beta_2 = .999, layer_dims: list = [10, 8, 4, 1]):
    L = len(layer_dims) - 1 # index of the output layer
    epsilon = 1e-8
    t += 1 # incrementally increase t

    # storing first moment estimates
    m_W = dict()
    m_b = dict()

    # storing second moment estimates
    v_W = dict()
    v_b = dict()

    # initialize moments as zero arrays
    # (note: in real training these must persist across calls rather than reset each time)
    for l in range(1, len(layer_dims)):
        m_W[l] = np.zeros_like(W[l])
        m_b[l] = np.zeros_like(b[l])
        v_W[l] = np.zeros_like(W[l])
        v_b[l] = np.zeros_like(b[l])

    # adam optimizer applied to all the layers
    for l in range(1, L + 1):
        dW = dLdW[l] # local names so the gradient dicts are not overwritten
        db = dLdb[l]

        # first moment estimates (m)
        m_W[l] = beta_1 * m_W[l] + (1 - beta_1) * dW
        m_b[l] = beta_1 * m_b[l] + (1 - beta_1) * db

        # second moment estimates (v)
        v_W[l] = beta_2 * v_W[l] + (1 - beta_2) * (dW ** 2)
        v_b[l] = beta_2 * v_b[l] + (1 - beta_2) * (db ** 2)

        # Adam: bias correction for first moment
        m_corrected_W = m_W[l] / (1 - beta_1**t)
        m_corrected_b = m_b[l] / (1 - beta_1**t)

        # Adam: bias correction for second moment
        v_corrected_W = v_W[l] / (1 - beta_2**t)
        v_corrected_b = v_b[l] / (1 - beta_2**t)

        # updates W and b
        W[l] -= learning_rate * m_corrected_W / (np.sqrt(v_corrected_W) + epsilon)
        b[l] -= learning_rate * m_corrected_b / (np.sqrt(v_corrected_b) + epsilon)

    return W, b
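To see the bias correction at work, here is a toy single-step example (my own illustration with a made-up gradient): at t = 1 the raw first moment m = (1 - β_1)·g is only 10% of the gradient, and dividing by (1 - β_1^t) restores its scale.

import numpy as np

g = 0.5 # gradient at step t = 1
beta_1, beta_2, alpha, eps = 0.9, 0.999, 0.001, 1e-8

m = (1 - beta_1) * g # 0.05: heavily damped raw estimate
v = (1 - beta_2) * g ** 2 # 0.00025

t = 1
m_hat = m / (1 - beta_1 ** t) # 0.5: restored to the gradient's scale
v_hat = v / (1 - beta_2 ** t) # 0.25

step = alpha * m_hat / (np.sqrt(v_hat) + eps)
print(step) # ≈ 0.001: roughly one learning-rate-sized step, regardless of |g|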
Model Training
I'll define the train() function as a method within a Python class, making it easier to keep track of all the parameter updates during the process.
In the train() function, I fall back to full-batch epochs when no batch size is provided or when the sample size (m) is smaller than the given batch size. (Note: full-batch training is generally not recommended due to the computational burden.)
Otherwise, the function uses the mini-batch approach, taking the batch size as an argument:
import numpy as np

class DFN:
    def __init__(self, layer_dims: list = [10, 8, 4, 1], lambda_reg: float = 0.01, keep_prob: float = 1.0):
        # nn architecture
        self.layer_dims = layer_dims

        # model params
        self.W = {}
        self.b = {}
        self.Z = {}
        self.A = {}

        # gradients
        self.dLdW = {}
        self.dLdb = {}

        # params for adam optimizer
        self.m_W = {}
        self.m_b = {}
        self.v_W = {}
        self.v_b = {}

        self.beta_1 = 0.9
        self.beta_2 = 0.999
        self.epsilon = 1e-8
        self.t = 0

        # params for L2 reg
        self.lambda_reg = lambda_reg

        # weights and bias initialization
        self._initialize_parameters()

    # <---- skip ---->

    def train(self, X, Y, epochs, learning_rate, batch_size=None, verbose=True):
        m = X.shape[0]
        self._is_training = True

        # training epochs
        for epoch in range(1, epochs + 1):
            # 1. full-batch training (not recommended)
            if batch_size is None or batch_size >= m:
                # forward pass - make a prediction
                Y_pred = self.forward_pass(X)

                # computes the loss for a sanity check
                train_loss = self.compute_loss(Y_pred, Y)

                # backward pass - compute gradients
                self.backward_pass(Y_pred, Y)

                # updates the parameters using Adam
                self.update_parameters_adam(learning_rate)

            # 2. mini-batch training
            else:
                # shuffles the sample datasets
                permutation = np.random.permutation(m)
                shuffled_X = X[permutation, :]
                shuffled_Y = Y[permutation, :]

                num_batches = m // batch_size
                for i in range(num_batches):
                    # creates mini-batch training sets
                    start = i * batch_size
                    end = (i + 1) * batch_size
                    batch_X = shuffled_X[start:end, :]
                    batch_Y = shuffled_Y[start:end, :]

                    # makes a prediction and optimizes params by batch
                    Y_pred_batch = self.forward_pass(batch_X)
                    self.backward_pass(Y_pred_batch, batch_Y)
                    self.update_parameters_adam(learning_rate)

                # finalizes the results and computes the loss after all mini-batches are processed
                self._is_training = False
                Y_pred = self.forward_pass(X)
                train_loss = self.compute_loss(Y_pred, Y)

            if verbose and epoch % 100 == 0:
                print(f"Epoch {epoch}/{epochs}, Train Loss: {train_loss:.4f}")
(To focus on the train() updates, the other methods are elided here.)
◼ Further Techniques to Address Overfitting
Deep learning models can fit decision boundaries so complex that they sometimes memorize all the noise in the training data. So, I need to carefully address overfitting in the training process.
Though I have already implemented L2 regularization in the loss function, I’ll add two more regularization techniques common in deep learning to the train() function: Dropout and Early Stopping.
◼ Early Stopping
Early Stopping is a technique that stops model training when the loss on a validation set doesn't improve for a specified number of consecutive epochs (e.g., 50 epochs).
This guards against overfitting, where the training error gets extremely low yet the validation error remains high.
I’ll add the early-stopping logic to the training epoch loop with new arguments: X_val, y_val, patience, tol, and best_val_loss.
The logic counts consecutive epochs in which the validation loss fails to beat the best validation loss so far plus the tolerance (tol, a very small constant, e.g., 1e-10; zero also works). If this count exceeds the patience (e.g., 50, 100, 200), training is stopped.
import numpy as np

class DFN:
    def __init__(self, layer_dims: list = [10, 8, 4, 1], lambda_reg: float = 0.01, patience: int = 100, tol=1e-10):

        # ... the same as the prev coding block

        # <--- early stopping params --->
        self.tol = tol
        self.patience = patience

        # ... the same as the prev coding block

    def train(self, X, Y, X_val, Y_val, epochs, learning_rate, batch_size=16, verbose=True):
        m = X.shape[0]
        self._is_training = True

        # <--- early stopping params --->
        use_early_stopping = X_val is not None and Y_val is not None
        epochs_no_improve = 0
        best_W = {l: self.W[l].copy() for l in self.W}
        best_b = {l: self.b[l].copy() for l in self.b}
        best_val_loss = float('inf')

        # training epochs
        for epoch in range(1, epochs + 1):

            # ... the same as the prev coding block

            # <--- early stopping logic --->
            val_loss = float(0)
            if use_early_stopping:
                self._is_training = False
                Y_val_pred = self.forward_pass(X_val)
                val_loss = self.compute_loss(Y_val_pred, Y_val)

                if val_loss < best_val_loss + self.tol:
                    best_val_loss = val_loss
                    epochs_no_improve = 0
                    best_W = {l: self.W[l].copy() for l in self.W}
                    best_b = {l: self.b[l].copy() for l in self.b}
                else:
                    epochs_no_improve += 1
                    if epochs_no_improve >= self.patience:
                        print(f"\nEarly stopping at epoch {epoch}: Validation loss did not improve for {self.patience} epochs.")
                        self.W = best_W
                        self.b = best_b
                        break # stop training

            if verbose and epoch % 100 == 0:
                print(f"Epoch {epoch}/{epochs}, Train Loss: {train_loss:.4f}")
◼ Dropout
Lastly, I’ll add Dropout, a technique that randomly "drops out" (deactivates) a percentage of neurons during each training epoch.
Dropout forces the remaining neurons to learn more robust and independent features by preventing any single neuron from consistently relying on the presence of other neurons.
This makes the network less sensitive to minor variations in the input data and less prone to overfitting.

I’ll conclude the Python class by adding the dropout logic alongside the rest of the functions I defined so far.
I added a keep_prob parameter that defines the ratio of neurons to keep (keep_prob = 0.9 → the network drops a random 10% of neurons in each hidden layer). The dropout logic fires in the forward_pass() and backward_pass() functions during training only, and only when keep_prob is less than 1.0.
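To isolate the trick before the full class, here is inverted dropout on a toy activation row (a minimal sketch with made-up numbers): survivors are divided by keep_prob so the expected activation scale is unchanged between training and inference.

import numpy as np

np.random.seed(0)
A = np.ones((1, 8)) # toy activations
keep_prob = 0.75

D = np.random.rand(*A.shape) < keep_prob # boolean mask: keep ~75% of neurons
A_dropped = A * D / keep_prob # zero out the rest, rescale survivors

print(A_dropped) # kept entries become 1/0.75 ≈ 1.33, dropped ones are 0
print(A_dropped.mean()) # ≈ 1.0 in expectation, matching the no-dropout scale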
import numpy as np

class DFN:
    def __init__(
        self,
        layer_dims=[10, 8, 4, 1],
        lambda_reg=0.01,
        keep_prob=0.9,
        tol=1e-10,
        patience=50,
        learning_rate=.05,
        verbose=0
    ):
        self.layer_dims = layer_dims

        self.W = {}
        self.b = {}
        self.dLdW = {}
        self.dLdb = {}
        self.A = {}
        self.Z = {}

        self.m_W = {}
        self.m_b = {}
        self.v_W = {}
        self.v_b = {}

        self.beta1 = 0.9
        self.beta2 = 0.999
        self.epsilon = 1e-8
        self.learning_rate = learning_rate
        self.t = 0

        self.lambda_reg = lambda_reg

        self.tol = tol
        self.patience = patience
        self.val_losses = []

        # <--- Dropout params --->
        self._is_training = True
        self.keep_prob = keep_prob
        self.D = {} # storing dropout masks

        self.verbose = verbose
        self._initialize_parameters()

    def _initialize_parameters(self):
        np.random.seed(42)
        for l in range(1, len(self.layer_dims)):
            self.W[l] = np.random.randn(self.layer_dims[l-1], self.layer_dims[l]) * np.sqrt(2 / self.layer_dims[l-1])
            self.b[l] = np.zeros((1, self.layer_dims[l]))

    def _relu(self, Z):
        A = np.maximum(0, Z)
        return A, Z

    def _sigmoid(self, Z):
        A = 1 / (1 + np.exp(-Z))
        return A, Z

    def forward_pass(self, X):
        self.A[0] = X
        L = len(self.layer_dims) - 1

        for l in range(1, L):
            self.Z[l] = np.dot(self.A[l-1], self.W[l]) + self.b[l]
            self.A[l], _ = self._relu(self.Z[l])

            # <--- Dropout --->
            if self._is_training and self.keep_prob < 1.0:
                self.D[l] = np.random.rand(*self.A[l].shape) < self.keep_prob
                self.A[l] = self.A[l] * self.D[l]
                self.A[l] = self.A[l] / self.keep_prob # inverted dropout: rescale survivors

        self.Z[L] = np.dot(self.A[L-1], self.W[L]) + self.b[L]
        self.A[L], _ = self._sigmoid(self.Z[L])
        return self.A[L]

    def compute_loss(self, Y_pred, Y_true):
        m = Y_true.shape[0]
        Y_pred = np.clip(Y_pred, self.epsilon, 1 - self.epsilon)

        cross_entropy_loss = - (1/m) * np.sum(Y_true * np.log(Y_pred) + (1 - Y_true) * np.log(1 - Y_pred))

        l2_regularization_cost = 0
        for l in range(1, len(self.layer_dims)):
            l2_regularization_cost += np.sum(np.square(self.W[l]))
        l2_regularization_cost = (self.lambda_reg / (2 * m)) * l2_regularization_cost

        L = cross_entropy_loss + l2_regularization_cost
        return L

    def _relu_backward(self, dLdA, Z):
        dLdZ = np.array(dLdA, copy=True)
        dLdZ[Z <= 0] = 0
        return dLdZ

    def _sigmoid_backward(self, dLdA, Z):
        s = 1 / (1 + np.exp(-Z))
        dLdZ = dLdA * s * (1 - s)
        return dLdZ

    def backward_pass(self, Y_pred, Y_true):
        m = Y_true.shape[0]
        L = len(self.layer_dims) - 1
        dLdA = - (np.divide(Y_true, Y_pred) - np.divide(1 - Y_true, 1 - Y_pred))
        dLdZ = self._sigmoid_backward(dLdA, self.Z[L])
        self.dLdW[L] = (1/m) * np.dot(self.A[L-1].T, dLdZ) + (self.lambda_reg / m) * self.W[L]
        self.dLdb[L] = (1/m) * np.sum(dLdZ, axis=0, keepdims=True)
        dLdA_prev_layer = np.dot(dLdZ, self.W[L].T)
        for l in reversed(range(1, L)):

            # <--- Dropout Logic --->
            if self.keep_prob < 1.0:
                dLdA_prev_layer = dLdA_prev_layer * self.D[l]
                dLdA_prev_layer = dLdA_prev_layer / self.keep_prob

            dLdZ = self._relu_backward(dLdA_prev_layer, self.Z[l])
            self.dLdW[l] = (1/m) * np.dot(self.A[l-1].T, dLdZ) + (self.lambda_reg / m) * self.W[l]
            self.dLdb[l] = (1/m) * np.sum(dLdZ, axis=0, keepdims=True)

            if l > 1:
                dLdA_prev_layer = np.dot(dLdZ, self.W[l].T)

    def update_parameters_adam(self):
        # initializes the moment estimates once; they must persist across update steps
        if not self.m_W:
            for l in range(1, len(self.layer_dims)):
                self.m_W[l] = np.zeros_like(self.W[l])
                self.m_b[l] = np.zeros_like(self.b[l])
                self.v_W[l] = np.zeros_like(self.W[l])
                self.v_b[l] = np.zeros_like(self.b[l])

        self.t += 1
        L = len(self.layer_dims) - 1

        for l in range(1, L + 1):
            dLdW = self.dLdW[l]
            dLdb = self.dLdb[l]

            self.m_W[l] = self.beta1 * self.m_W[l] + (1 - self.beta1) * dLdW
            self.m_b[l] = self.beta1 * self.m_b[l] + (1 - self.beta1) * dLdb

            self.v_W[l] = self.beta2 * self.v_W[l] + (1 - self.beta2) * (dLdW ** 2)
            self.v_b[l] = self.beta2 * self.v_b[l] + (1 - self.beta2) * (dLdb ** 2)

            m_corrected_W = self.m_W[l] / (1 - self.beta1**self.t)
            m_corrected_b = self.m_b[l] / (1 - self.beta1**self.t)

            v_corrected_W = self.v_W[l] / (1 - self.beta2**self.t)
            v_corrected_b = self.v_b[l] / (1 - self.beta2**self.t)

            self.W[l] -= self.learning_rate * m_corrected_W / (np.sqrt(v_corrected_W) + self.epsilon)
            self.b[l] -= self.learning_rate * m_corrected_b / (np.sqrt(v_corrected_b) + self.epsilon)

        if self.verbose:
            delta_W_L = self.learning_rate * m_corrected_W / (np.sqrt(v_corrected_W) + self.epsilon)
            print(f"Delta W[L] stats (min, max, mean): {np.min(delta_W_L):.8f}, {np.max(delta_W_L):.8f}, {np.mean(delta_W_L):.8f}")

    def train(self, X, Y, epochs, batch_size=16, verbose=True, X_val=None, Y_val=None):
        m = X.shape[0]
        self._is_training = True
        use_early_stopping = X_val is not None and Y_val is not None
        epochs_no_improve = 0
        best_W = {l: self.W[l].copy() for l in self.W}
        best_b = {l: self.b[l].copy() for l in self.b}
        best_val_loss = float('inf')

        for epoch in range(1, epochs + 1):
            if batch_size is None or batch_size >= m:
                Y_pred = self.forward_pass(X)
                train_loss = self.compute_loss(Y_pred, Y)
                self.backward_pass(Y_pred, Y)
                self.update_parameters_adam()
            else:
                permutation = np.random.permutation(m)
                shuffled_X = X[permutation, :]
                shuffled_Y = Y[permutation, :]

                num_batches = m // batch_size
                for i in range(num_batches):
                    start = i * batch_size
                    end = (i + 1) * batch_size
                    batch_X = shuffled_X[start:end, :]
                    batch_Y = shuffled_Y[start:end, :]

                    Y_pred_batch = self.forward_pass(batch_X)
                    self.backward_pass(Y_pred_batch, batch_Y)
                    self.update_parameters_adam()
                self._is_training = False # turn off dropout for the full-set loss computation
                Y_pred = self.forward_pass(X)
                train_loss = self.compute_loss(Y_pred, Y)
                self._is_training = True # turn back on

            val_loss = None
            if use_early_stopping:
                self._is_training = False # turn off dropout for validation loss computation
                Y_val_pred = self.forward_pass(X_val)
                val_loss = self.compute_loss(Y_val_pred, Y_val)
                self._is_training = True # turn back on

                if val_loss < best_val_loss + self.tol:
                    best_val_loss = val_loss
                    epochs_no_improve = 0
                    best_W = {l: self.W[l].copy() for l in self.W}
                    best_b = {l: self.b[l].copy() for l in self.b}
                else:
                    epochs_no_improve += 1
                    if epochs_no_improve >= self.patience:
                        print(f"\nEarly stopping at epoch {epoch}: Validation loss did not improve for {self.patience} epochs.")
                        self.W = best_W
                        self.b = best_b
                        break

            if verbose and epoch % 100 == 0:
                if use_early_stopping:
                    print(f"Epoch {epoch}/{epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")
                else:
                    print(f"Epoch {epoch}/{epochs}, Train Loss: {train_loss:.4f}")
        return self
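The simulation below calls a predict() method that the class above doesn't define. A minimal sketch of the elided helper (my assumption: a dropout-free forward pass thresholded into class labels) could be:

    def predict(self, X, threshold=0.5):
        was_training = self._is_training
        self._is_training = False # disable dropout for inference
        Y_prob = self.forward_pass(X)
        self._is_training = was_training
        return (Y_prob >= threshold).astype(int)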
Simulation
I’ll demonstrate how the custom DFN works for a binary classification task of churn prediction, using a telecom churn dataset from the UC Irvine Machine Learning Repository.
◼ Preparing Dataset
Dataset:
Iranian Churn from UC Irvine Machine Learning Repository (Licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license)
3,500 data samples
14 features:

Figure D. Sample dataset (Image source)
After applying column transformations and data augmentation to handle the class imbalance, I ended up with 3,178 training samples (1,643 for y=0, 1,535 for y=1) with 13 features:
(3178, 13) (600, 13) (600, 13) (3178,) (600,) (600,)

Counter({0: 1643, 1: 1535})
◼ Model Training, Hyperparameter Tuning
I tuned the hyperparameters multiple times and trained the model with the training dataset.
from sklearn.metrics import f1_score, accuracy_score

custom_model = DFN(
    layer_dims=[X_train.shape[1], 30, 20, 10, 1], # adjusting NN
    lambda_reg=0.1,
    keep_prob=0.99,
    tol=1e-10,
    patience=1000,
    learning_rate=0.01,
).train(
    X=X_train,
    Y=convert(y_train),
    X_val=X_val,
    Y_val=convert(y_val),
    epochs=5000,
    batch_size=32,
)
◼ Evaluation
Lastly, I evaluated the model's predictions on the training, validation, and test datasets, noting the accuracy and F1 scores.
Y_pred_train = custom_model.predict(X=X_train, threshold=.5)
Y_pred_val = custom_model.predict(X=X_val, threshold=.5)
Y_pred_test = custom_model.predict(X=X_test, threshold=.5)
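The f1_score and accuracy_score imported earlier can then score each split (a short sketch; y_train, y_val, and y_test are assumed to be the original 1-D label arrays):

for name, y_true, y_pred in [
    ("train", y_train, Y_pred_train),
    ("val", y_val, Y_pred_val),
    ("test", y_test, Y_pred_test),
]:
    print(name, accuracy_score(y_true, y_pred.ravel()), f1_score(y_true, y_pred.ravel()))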
◼ Result
Accuracy score: Train 0.8763 → Validation 0.8517; Test (generalization): 0.8400
F1 score: Train 0.8772 → Validation 0.6367; Test (generalization): 0.6220
Good Training Performance:
The model is performing very well on the training data, achieving high accuracy (87.63%) and F1 score (87.72%). This indicates that the model has successfully learned the patterns within the data it was trained on.
Strong Generalization for Accuracy:
The validation accuracy (85.17%) is quite close to the training accuracy (87.63%), which is a positive sign. It suggests the model is generally able to make correct predictions on unseen data.
Concerning F1 Score Drop on Validation and Test:
The F1 scores on the validation (63.67%) and test (62.20%) sets are significantly lower than on the training set (87.72%). This is the most critical observation: despite strong overall accuracy, the model generalizes far worse on the class balance that F1 captures.
Wrapping Up
The custom DFN gives us deep insights into how neural networks work in the “black box”.
In practice, Python libraries like TensorFlow offer more robust approaches, which point to areas of improvement for the custom DFN. Some examples are:
Gradient Calculation: Integrates an automatic differentiation engine instead of manually deriving and implementing gradients.
Optimizer Implementation: Offers dynamic learning rate adjustments instead of manual adjustment of the optimizer parameters.
Robust Task Handling: Adds broader options for initializers, regularization methods, optimizers, and loss/activation functions (In this experiment, I used one example per element.)
Performance: GPU acceleration instead of CPU-only execution (NumPy runs on the CPU).
MLOps: Adds custom callbacks for logging or checkpointing, access to metrics libraries or pre-implemented metrics for the performance evaluation and more.
Overall, the custom DFN provides direct control over every aspect of the network, from parameter initialization to the forward pass, backward pass, manual gradient calculation, regularization, and parameter updates.
This makes it an excellent foundational tool for understanding neural network mechanics.
Related Books for Further Understanding
These books cover a wide range of theory and practice, from the fundamentals to the PhD level.
- Linear Algebra Done Right
- Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)
- Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications
- Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
Share What You Learned
Kuriko IWAI, "Beyond the Black Box: Architecting Deep Feedforward Networks with NumPy" in ML Labs
https://kuriko-iwai.com/building-deep-feedforward-networks
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.