Beyond the Black Box: Architecting Deep Feedforward Networks with NumPy
Deep Dive into Feedforward Architectures with Math and Code
By Kuriko IWAI

Table of Contents
- Introduction
- Designing a DFN Architecture
- Adding Back Propagation
- Adding Optimizer
- Model Training
- Simulation
- Wrapping Up

Introduction
Deep Feedforward Networks (DFNs) are the foundational artificial neural networks that many deep learning algorithms are built on.
In this article, I'll explore their mathematical formulation with NumPy to gain deeper insight into how these networks operate inside the black box.
◼ What is a Deep Feedforward Network?
A Deep Feedforward Network (DFN) is a key type of Feedforward Neural Network (FNN) with at least two hidden layers between the input and output layers.
This multi-layered (“deep“) structure enables DFNs to learn increasingly abstract and hierarchical features from data.
Key characteristics of a DFN include:
Multiple Hidden Layers: At least two hidden layers between the input and output layers.
Fully Connected (Dense) Layers: Each neuron in a given layer is connected to every neuron in the subsequent layer.
Parametric: Defined by a finite set of parameters (the weights and biases).
Flexible Outputs: Produces continuous values (for regression tasks, using a linear activation function) or discrete values (for classification tasks, using a sigmoid or softmax function).
Non-linear Activation Functions: Uses non-linear activation functions in their hidden layers (e.g., sigmoid, tanh, ReLU).
This non-linearity enables DFNs to learn and approximate complex, non-linear relationships, overcoming the limitations of single-layer perceptrons.
In fact, with enough hidden neurons and proper training, DFNs become universal function approximators, capable of approximating any continuous function on a compact domain to a desired accuracy.
◼ Various Architectures in Feedforward Neural Network (FNN)
While Feedforward Neural Network is a broad category, here are some specific architectures that fall under it:
Single-Layer Perceptron
The simplest form of an FNN, with only an input and an output layer.
Can only solve linearly separable problems.
Multi-Layer Perceptron
An FNN with one or more hidden layers between the input and output layers.
DFNs fall under the Multi-Layer Perceptron category.
Radial Basis Function (RBF) Networks:
Uses a radial basis function (e.g., a Gaussian) as the activation function of the hidden layer.
Often used for function approximation and classification tasks.
Autoencoders (specifically the encoder and decoder parts):
Designed for unsupervised learning, specifically for learning efficient data encodings.
Consists of two parts: an encoder and a decoder (both are typically MLPs).
Designing a DFN Architecture
I’ll begin by designing a foundational Deep Feedforward Network (DFN) architecture optimized for tasks on structured/tabular data.
◼ How Feedforward Network Works
To simplify the concept, I’ll use a network with two hidden layers as an example and walk through the math and code behind each operation.
The entire architecture looks like this:

Figure A. Forward pass process in the DFN with two hidden layers (Created by Kuriko IWAI)
◼ Forward Pass
A forward pass is the process of feeding input data through a neural network to produce an output.
Let us go through this process step by step:
◼ Step 1. Receiving Input
The input layer takes an input (X) with n features per sample; a single sample is the vector:

x = (x_1, x_2, ..., x_n)
◼ Step 2. Performing a Linear Transformation
The model performs a linear transformation on the input with randomly initialized weights (W_1) and a bias (b_1) to generate a value (Z_1).
Mathematically, this is the weighted sum of the input features:

Z_1 = W_1^T x + b_1
The process is expressed simply with the ndarray from the NumPy library:

import numpy as np

z_1 = np.dot(W1.T, x) + b1
▫ Fully Connected Layers
Before moving on to the next step, recall that DFNs consist of fully connected layers.
This means that every neuron in a layer takes the input, computes a value, and passes it to every neuron it connects to in the next layer.
For instance, with n = 4, each entry of the input vector (x1, x2, x3, x4) is fed to each neuron with its corresponding weight value.
To keep things simple, I’ll focus on the layer level in this article.
◼ Step 3. First Hidden Layer Activation
The first hidden layer (h1) computes a value (A1) by applying its activation function to the input Z1.
▫ Choosing the Right Activation Function
I’ll choose ReLU (Rectified Linear Unit) as the activation function for the hidden layers due to its robustness and computational simplicity.
Mathematically, ReLU is defined as:

ReLU(z) = max(0, z)

Hence,

A_1 = max(0, Z_1)
This process is added to the coding block:
import numpy as np

z_1 = np.dot(W1.T, x) + b1 # linear transformation
A_1 = np.maximum(0, z_1) # ReLU
While no single activation function suits every scenario, understanding the problem's properties allows for a more informed choice, leading to faster network convergence.
For instance, for classifiers, Sigmoid functions and their variants often perform well.
However, both Sigmoid and tanh functions can suffer from the vanishing gradient problem especially in deeper networks, which is why ReLU is a popular default for most modern neural networks.
If dead neurons become an issue, Leaky ReLU is a solid alternative.
Learn More: Various activation functions: A Comprehensive Guide on Neural Network in Deep Learning
I’ll switch to other options in case ReLU doesn’t perform well later.
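For reference, Leaky ReLU is a one-line change in NumPy (a small sketch, with the conventional 0.01 slope as an assumption):

import numpy as np

def leaky_relu(Z, slope=0.01):
    # passes positive values through; gives negatives a small, non-zero gradient
    return np.where(Z > 0, Z, slope * Z)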
◼ Step 4: Second Linear Transformation
Perform a second linear transformation on A1 using a new, randomly initialized weight matrix (W2) and bias vector (b2):

Z_2 = W_2^T A_1 + b_2

This step is identical to Step 2, except it takes its input from the preceding hidden layer.
◼ Step 5: Second Hidden Layer Activation
The second hidden layer (h2) applies an activation function (ReLU in our case) to Z2, generating the output value A2 = max(0, Z2). This mirrors Step 3; the only difference is using Z2 as input instead of Z1.
◼ Step 6: Final Linear Transformation
Perform a final linear transformation on A2 using a randomly initialized weight matrix (W3) and bias vector (b3):

Z_3 = W_3^T A_2 + b_3
◼ Step 7. The Output Layer
The output layer applies an output function to Z3 to generate the final prediction, denoted as y^.
The choice of output function depends on the nature of the problem:
Binary Classification: For problems with two classes, the sigmoid function is commonly used. It squashes the output to a value between 0 and 1, which can be interpreted as a probability.
Multinomial Classification: For problems with more than two classes, the softmax function is applied. This converts the outputs into a probability distribution over the classes, ensuring that the probabilities sum to 1.
Regression (Continuous Output): For continuous output values, a linear activation function (or no activation function) is typically used. This allows the neural network to output any real number, without bounding the prediction within a specific range.
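Each of these output functions is a short function in NumPy (a reference sketch; the one-sample-per-row layout is my assumption):

import numpy as np

def sigmoid(z): # binary classification
    return 1 / (1 + np.exp(-z))

def softmax(z): # multinomial classification; rows are samples
    e = np.exp(z - z.max(axis=1, keepdims=True)) # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def linear(z): # regression: identity, no bounding
    return z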
I’ll choose the sigmoid function for a binary classification task:

σ(z) = 1 / (1 + e^(-z))
With Steps 4 to 7 added, the coding block looks like:

import numpy as np

Z1 = np.dot(W1.T, x) + b1 # step 2
A1 = np.maximum(0, Z1) # step 3

Z2 = np.dot(W2.T, A1) + b2 # step 4 (input is A1, not x)
A2 = np.maximum(0, Z2) # step 5

Z3 = np.dot(W3.T, A2) + b3 # step 6 (input is A2)

Y_pred = 1 / (1 + np.exp(-Z3)) # step 7 (output = sigmoid function)
Notice that Steps 3 to 6 can be repeated (or removed) depending on the number of hidden layers.
Generally, the more input features a dataset has, the more complex the problem space becomes, so adding more hidden layers often helps.
With this in mind, I’ll finalize the forward pass by separately defining _relu(), _sigmoid(), and forward_pass() functions, taking the layer sizes layer_dims as an argument so the network architecture stays flexible:
import numpy as np

def _relu(Z):
    return np.maximum(0, Z)

def _sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_pass(X, layer_dims: list = [10, 8, 4, 1]):
    # 1. initializes and stores parameters (W, b), linear transformation results (Z),
    #    and activation results (A) in dicts (naively zero-initialized; revisited below)
    W = {l: np.zeros((layer_dims[l-1], layer_dims[l])) for l in range(1, len(layer_dims))}
    b = {l: np.zeros((1, layer_dims[l])) for l in range(1, len(layer_dims))}
    A = { 0: X, } # input layer = 0
    Z = dict()

    # 2. starts a forward pass in the hidden layers (using ReLU)
    L = len(layer_dims) - 1 # number of layers excluding the input layer

    # looping Step 2 to 6 (or more, depending on the layer size)
    # note: X holds one sample per row here, so the transform is A·W rather than Wᵀx
    for l in range(1, L):
        Z_l = np.dot(A[l-1], W[l]) + b[l]
        A_l = _relu(Z_l)

        # stores Z_l, A_l for the next layer to refer to
        A[l] = A_l
        Z[l] = Z_l

    # 3. generates the output (using sigmoid)
    Z_L = np.dot(A[L-1], W[L]) + b[L] # takes activation results from the previous layer (A[L-1])
    Y_pred = _sigmoid(Z_L)

    return Y_pred
◼ One Last Note for Forward Pass - Weight Initialization
In the coding block above, I initialized the weights to zero in the forward_pass() function. This is actually a bad idea: every neuron produces zero output, receives an identical gradient, and the network can never break symmetry.
A common practice for weight initialization with ReLU is He Initialization, which takes the following steps:
- Initialize the weights from a Gaussian distribution with mean 0.
- Scale the variance to 2/fan_in, where "fan-in" (n_in) is the number of input connections to the neuron.
Thus, the weights are initialized as:

W ~ N(0, 2/n_in), i.e., W = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)
He Initialization is primarily developed for activation functions with a non-zero gradient for positive inputs and a zero gradient for negative inputs, which include:
ReLU: max(0, x)
Leaky ReLU: max(0.01x, x) (or other small negative slope)
PReLU (Parametric ReLU): max(ax, x) where a is learned.
ELU (Exponential Linear Unit): shares some characteristics with the ReLU variants.

Figure B. ReLU, LeakyReLU, PReLU (source)
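A quick empirical check of why the 2/fan_in scaling matters (my own illustration, not from the original): He initialization is designed to preserve the mean squared activation through a ReLU layer.

import numpy as np

np.random.seed(0)
n_in, n_out = 512, 512
X = np.random.randn(1000, n_in)

W_he = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)
A = np.maximum(0, X @ W_he) # one ReLU layer

print(np.mean(X**2), np.mean(A**2)) # both ≈ 1: the signal scale survives the layer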
For the final touch, I’ll refine the code by adding an _initialize_parameters() function that applies He Initialization to the weights across all layers, while the biases start at zero:
import numpy as np

# added
def _initialize_parameters(layer_dims: list = [10, 8, 4, 1]):
    np.random.seed(42)

    W = dict()
    b = dict()

    for l in range(1, len(layer_dims)):
        # He Initialization
        W[l] = np.random.randn(
            layer_dims[l-1], layer_dims[l]) * np.sqrt(2 / layer_dims[l-1])

        # bias initialization (zero)
        b[l] = np.zeros((1, layer_dims[l]))
    return W, b

def _relu(Z):
    return np.maximum(0, Z)

def _sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_pass(X, layer_dims: list = [10, 8, 4, 1]):
    W, b = _initialize_parameters(layer_dims=layer_dims) # updated
    A = { 0: X, }
    Z = dict()

    L = len(layer_dims) - 1
    for l in range(1, L):
        Z_l = np.dot(A[l-1], W[l]) + b[l]
        A_l = _relu(Z_l)

        A[l] = A_l
        Z[l] = Z_l

    Z_L = np.dot(A[L-1], W[L]) + b[L]
    Y_pred = _sigmoid(Z_L)

    return Y_pred
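As a quick sanity check (a minimal sketch; the sample count of 5 is my own choice), the forward pass can be run on random data matching layer_dims[0] = 10 features:

X = np.random.randn(5, 10) # 5 samples, 10 features
Y_pred = forward_pass(X)
print(Y_pred.shape) # (5, 1): one sigmoid probability per sample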
Adding Back Propagation
Deep learning frameworks automatically perform back propagation, an algorithm that efficiently computes the gradients of the loss function with respect to the model's parameters (W, b) via the intermediate values (Z, A).
This process is essential because the model passes the computed gradients to the optimizer, which learns the optimal parameters (W, b) to make the best predictions.
Technically speaking, gradients can be computed with or without an optimizer. But without an optimizer, the parameters are never updated, so the model cannot learn anything.
So, I’ll first demonstrate the back propagation process step by step, and in the later section, demonstrate how the optimizer utilizes the gradients in the learning process.
◼ Objective: Minimizing the Loss
The model’s objective is to find the parameters that can minimize the loss (an error between a model’s prediction and the true value).
To compute the loss, we use various loss functions depending on the specific task type (e.g., Mean Squared Error for regression problems).
I’ll choose the cross entropy loss function for a binary classification task.
The model starts by computing the binary cross entropy loss of the prediction Y_pred:

L = -(1/M) Σ_{i=1..M} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]

Here, M is the number of samples in a specific batch.
In deep learning, the total dataset size is often enormous, making it computationally unrealistic to calculate the loss at once.
Therefore, the model processes the data in smaller subsets called mini-batches (or simply "batches") to compute the loss and its gradients.
Typical batch sizes range from 16 to 256 - 32 and 64 are very common choices.
▫ Applying Regularization to Prevent Overfitting
Regularization is a technique in machine learning used to prevent overfitting and improve the model's ability to generalize to unseen data.
Mathematically, it adds a penalty term (a small cost) that increases when the model starts to memorize the training samples, noise included (so even as the loss itself approaches zero, the penalty grows). Because of these conflicting objectives, the model has to carefully balance the loss against the penalty.
In deep learning, L2 regularization (also known as weight decay) is a common practice:

L_total = L_CE + (λ / 2M) Σ_{l=1..L} ||W_l||_F²

(M: batch size, λ: weight decay coefficient defining the regularization strength (higher = stronger), L: total number of layers, W_l: weight matrix of the l-th layer)
In the formula above, the squared Frobenius norm serves as the L2 penalty.
It computes the sum of the squares of all individual weights in the weight matrix (W_l), so the penalty grows with the magnitude of the weights in each layer. The model is therefore encouraged to keep the weights small, which generalizes the learning while minimizing the loss.
I’ll define this loss in the compute_loss() function. Assuming it is applied to a selected batch, the code first clips the predictions by a small epsilon to avoid log(0) (and hence negative infinity), computes the cross entropy loss, then loops over all layers to accumulate the L2 term.
import numpy as np

def compute_loss(Y_pred, Y_true, W, lambda_reg = 0.01, layer_dims: list = [10, 8, 4, 1]):
    m = Y_true.shape[0]

    # clips predictions to avoid log(0)
    epsilon = 1e-8
    Y_pred = np.clip(Y_pred, epsilon, 1 - epsilon)

    # computes cross entropy loss
    cross_entropy_loss = - (1/m) * np.sum(Y_true * np.log(Y_pred) + (1 - Y_true) * np.log(1 - Y_pred))

    # computes L2 terms
    l2_regularization_cost = 0
    for l in range(1, len(layer_dims)):
        l2_regularization_cost += np.sum(np.square(W[l]))
    l2_regularization_cost = (lambda_reg / (2 * m)) * l2_regularization_cost

    # computes the total loss (L)
    L = cross_entropy_loss + l2_regularization_cost
    return L
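As a quick check (a sketch with made-up labels), a constant prediction of 0.5 should give a cross entropy of ln 2 ≈ 0.693 plus a tiny L2 term:

np.random.seed(0)
W, b = _initialize_parameters() # He-initialized weights from earlier
Y_true = np.random.randint(0, 2, size=(32, 1))
Y_pred = np.full((32, 1), 0.5) # maximally uncertain predictions
print(compute_loss(Y_pred, Y_true, W)) # ≈ 0.693 plus a small L2 cost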
◼ Computing the Gradients Backward
After defining the loss function and computing the loss, the model starts backward propagation, working step by step in reverse, starting from the loss.
To understand how it works, I’ll lay out its mathematical process side by side with the forward pass steps:

(A_0 = X, A_3 = Y_pred; ⊙ I(Z>0) means elements are zeroed out where Z ≤ 0.)
Figure C. Mathematical formation of forward pass and backward propagation (using cross entropy loss, L2 regularization) (Created by Kuriko IWAI)
The backward pass aims to transmit the partial derivatives (gradients) with respect to the weights (W) and biases (b) to the optimizer.
To compute these gradients, the model must first calculate the gradients with respect to the activations (A) and pre-activations (Z) within each layer.
In the figure, I added the gradients for each parameter—A, Z, W, and b—from left to right as columns.
The process begins with the computation of the gradient of the loss with respect to A3 (shown in red in the figure).
This value is then used to compute the gradient of the loss with respect to Z3 (pink), which in turn is used to compute the gradients with respect to W3 and b3. The same procedure continues in the rest of the layers (green, blue in the figure).
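Written out for our sigmoid output and cross entropy loss (a plain-notation summary of the figure; the sigmoid-plus-cross-entropy simplification dL/dZ_3 = A_3 - Y is a standard identity):

dL/dA_3 = -( Y / A_3 - (1 - Y) / (1 - A_3) )
dL/dZ_3 = dL/dA_3 · σ'(Z_3) = A_3 - Y
dL/dW_3 = (1/M) A_2ᵀ (dL/dZ_3) + (λ/M) W_3
dL/db_3 = (1/M) Σ over the batch of dL/dZ_3
dL/dA_2 = (dL/dZ_3) W_3ᵀ, then dL/dZ_2 = dL/dA_2 ⊙ I(Z_2 > 0), and so on down to layer 1.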
I’ll add this process in the coding block.
Similar to the forward pass, I’ll first define three functions: _relu_backward(), _sigmoid_backward(), and backward_pass(). Inside backward_pass(), I start by computing the gradients in the output (third) layer, then loop backward through the hidden layers, storing the gradients w.r.t. W and b so the function can pass them to the optimizer.
import numpy as np

def _relu_backward(dLdA, Z):
    dLdZ = np.array(dLdA, copy=True)
    dLdZ[Z <= 0] = 0
    return dLdZ

def _sigmoid_backward(dLdA, Z):
    sigmoid = 1 / (1 + np.exp(-Z))
    dLdZ = dLdA * sigmoid * (1 - sigmoid)
    return dLdZ

def backward_pass(Y_pred, Y_true, Z, A, W, lambda_reg = 0.01, layer_dims: list = [10, 8, 4, 1]):
    # stores gradients computed in the process (to pass to the optimizer)
    dLdW = dict()
    dLdb = dict()

    m = Y_true.shape[0]
    L = len(layer_dims) - 1 # output layer index

    # dL/dA3
    dLdA = - (np.divide(Y_true, Y_pred) - np.divide(1 - Y_true, 1 - Y_pred))

    # dL/dZ3 (using sigmoid func = output func)
    dLdZ = _sigmoid_backward(dLdA, Z[L])

    # dL/dW3
    dLdW[L] = (1/m) * np.dot(A[L-1].T, dLdZ) + (lambda_reg / m) * W[L]

    # dL/db3
    dLdb[L] = (1/m) * np.sum(dLdZ, axis=0, keepdims=True)

    # gradient flowing into the last hidden layer
    dLdA_prev_layer = np.dot(dLdZ, W[L].T)

    # gradients in the hidden layers (l = 2, 1)
    for l in reversed(range(1, L)):
        # dL/dZ
        dLdZ = _relu_backward(dLdA_prev_layer, Z[l])

        # dL/dW
        dLdW[l] = (1/m) * np.dot(A[l-1].T, dLdZ) + (lambda_reg / m) * W[l]

        # dL/db
        dLdb[l] = (1/m) * np.sum(dLdZ, axis=0, keepdims=True)

        # computes dL/dA for the previous layer
        if l > 1:
            dLdA_prev_layer = np.dot(dLdZ, W[l].T)

    return dLdW, dLdb
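Hand-derived gradients are easy to get subtly wrong, so a finite-difference check is worth a few extra lines. Here is a sketch (my own addition, run on a tiny one-layer logistic model rather than the full DFN) comparing the analytic gradient with a numerical estimate:

import numpy as np

np.random.seed(0)
X = np.random.randn(8, 3) # 8 samples, 3 features
Y = (np.random.rand(8, 1) > 0.5).astype(float) # binary targets
W = np.random.randn(3, 1) * 0.1

def loss(W):
    Y_pred = 1 / (1 + np.exp(-X @ W))
    return -np.mean(Y * np.log(Y_pred) + (1 - Y) * np.log(1 - Y_pred))

# analytic gradient of the cross entropy: dL/dW = (1/m) X^T (Y_pred - Y)
Y_pred = 1 / (1 + np.exp(-X @ W))
grad_analytic = X.T @ (Y_pred - Y) / X.shape[0]

# numerical gradient via central differences
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(W.size):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus.flat[i] += eps
    W_minus.flat[i] -= eps
    grad_numeric.flat[i] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric))) # should be ~1e-9 or smaller

If the printed gap stays near machine precision, the backward math is trustworthy; the same trick scales to checking backward_pass() layer by layer.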
Adding Optimizer
Now, we can move onto the optimizer settings.
Here, out of the various optimizers, I’ll choose Adam (Adaptive Moment Estimation) for its robustness and computational efficiency.
◼ How Adam Optimizer Works
Adam updates the parameters W and b at every training step, adjusting their values with its own adaptive learning rates:

W_l := W_l - α · m̂ / (√v̂ + ε)
b_l := b_l - α · m̂ / (√v̂ + ε)

(W_l: weight matrix in the l-th layer, b_l: bias vector in the l-th layer, t: step, α: learning rate (default is 0.001), m̂: bias-corrected first moment estimate, v̂: bias-corrected second moment estimate, ε: small constant to avoid division by zero.)
▫ How Adam Leverages the Gradients
To compute the adaptive learning rate, Adam estimates the first and second moments (m̂ and v̂, respectively).
This process requires the gradient of the loss function with respect to the weight or bias depending on the parameter that Adam is optimizing.
Take the weights as an example:
- First moment estimate: Adam leverages the concept of Momentum to accelerate convergence by averaging past gradients: m_t = β_1 · m_{t-1} + (1 - β_1) · g_t
- Second moment estimate: Adam leverages the idea behind RMSprop to scale its learning rate by the gradient magnitude: v_t = β_2 · v_{t-1} + (1 - β_2) · g_t²
(β_1, β_2: decay rates, typically set to β_1 = 0.9, β_2 = 0.999; g_t: the gradient dL/dW at step t)
Both raw moments are then bias-corrected before use: m̂ = m_t / (1 - β_1^t), v̂ = v_t / (1 - β_2^t).
So, I’ll add this process to the coding block as the update_parameters_adam() function. It computes the first and second moments and updates W and b across all layers. One caveat: the moment estimates must persist across update steps, so the standalone version below only sketches the idea; the class version later stores them as attributes:
def update_parameters_adam(learning_rate, W, b, dLdW, dLdb, t=0, beta_1 = 0.9, beta_2 = .999, layer_dims: list = [10, 8, 4, 1]):
    L = len(layer_dims) - 1 # index of the output layer
    epsilon = 1e-8
    t += 1 # incrementally increase t

    # storing first moment estimates
    m_W = dict()
    m_b = dict()

    # storing second moment estimates
    v_W = dict()
    v_b = dict()

    # initialize moments as zero arrays
    # (note: in real training these must persist across calls rather than reset each time)
    for l in range(1, len(layer_dims)):
        m_W[l] = np.zeros_like(W[l])
        m_b[l] = np.zeros_like(b[l])
        v_W[l] = np.zeros_like(W[l])
        v_b[l] = np.zeros_like(b[l])

    # adam optimizer applied to all the layers
    for l in range(1, L + 1):
        dW = dLdW[l] # local names so the gradient dicts are not overwritten
        db = dLdb[l]

        # first moment estimates (m)
        m_W[l] = beta_1 * m_W[l] + (1 - beta_1) * dW
        m_b[l] = beta_1 * m_b[l] + (1 - beta_1) * db

        # second moment estimates (v)
        v_W[l] = beta_2 * v_W[l] + (1 - beta_2) * (dW ** 2)
        v_b[l] = beta_2 * v_b[l] + (1 - beta_2) * (db ** 2)

        # Adam: bias correction for first moment
        m_corrected_W = m_W[l] / (1 - beta_1**t)
        m_corrected_b = m_b[l] / (1 - beta_1**t)

        # Adam: bias correction for second moment
        v_corrected_W = v_W[l] / (1 - beta_2**t)
        v_corrected_b = v_b[l] / (1 - beta_2**t)

        # updates W and b
        W[l] -= learning_rate * m_corrected_W / (np.sqrt(v_corrected_W) + epsilon)
        b[l] -= learning_rate * m_corrected_b / (np.sqrt(v_corrected_b) + epsilon)

    return W, b
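To see the bias correction at work, here is a toy single-step example (my own illustration with a made-up gradient): at t = 1 the raw first moment m = (1 - β_1)·g is only 10% of the gradient, and dividing by (1 - β_1^t) restores its scale.

import numpy as np

g = 0.5 # gradient at step t = 1
beta_1, beta_2, alpha, eps = 0.9, 0.999, 0.001, 1e-8

m = (1 - beta_1) * g # 0.05: heavily damped raw estimate
v = (1 - beta_2) * g ** 2 # 0.00025

t = 1
m_hat = m / (1 - beta_1 ** t) # 0.5: restored to the gradient's scale
v_hat = v / (1 - beta_2 ** t) # 0.25

step = alpha * m_hat / (np.sqrt(v_hat) + eps)
print(step) # ≈ 0.001: roughly one learning-rate-sized step, regardless of |g|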
Model Training
I'll define the train() function as a method within a Python class, making it easier to keep track of all the parameter updates during the process.
In the train() function, I fall back to full-batch epochs when no batch size is provided or when the sample size (m) is smaller than the given batch size. (Note: full-batch training is generally not recommended due to the computational burden.)
Otherwise, the function uses the mini-batch approach, taking the batch size as an argument:
import numpy as np

class DFN:
    def __init__(self, layer_dims: list = [10, 8, 4, 1], lambda_reg: float = 0.01, keep_prob: float = 1.0):
        # nn architecture
        self.layer_dims = layer_dims

        # model params
        self.W = {}
        self.b = {}
        self.Z = {}
        self.A = {}

        # gradients
        self.dLdW = {}
        self.dLdb = {}

        # params for adam optimizer
        self.m_W = {}
        self.m_b = {}
        self.v_W = {}
        self.v_b = {}

        self.beta_1 = 0.9
        self.beta_2 = 0.999
        self.epsilon = 1e-8
        self.t = 0

        # params for L2 reg
        self.lambda_reg = lambda_reg

        # weights and bias initialization
        self._initialize_parameters()

    # <---- skip ---->

    def train(self, X, Y, epochs, learning_rate, batch_size=None, verbose=True):
        m = X.shape[0]
        self._is_training = True

        # training epochs
        for epoch in range(1, epochs + 1):
            # 1. full-batch training (not recommended)
            if batch_size is None or batch_size >= m:
                # forward pass - make a prediction
                Y_pred = self.forward_pass(X)

                # computes the loss for a sanity check
                train_loss = self.compute_loss(Y_pred, Y)

                # backward pass - compute gradients
                self.backward_pass(Y_pred, Y)

                # updates the parameters using Adam
                self.update_parameters_adam(learning_rate)

            # 2. mini-batch training
            else:
                # shuffles the sample datasets
                permutation = np.random.permutation(m)
                shuffled_X = X[permutation, :]
                shuffled_Y = Y[permutation, :]

                num_batches = m // batch_size
                for i in range(num_batches):
                    # creates mini-batch training sets
                    start = i * batch_size
                    end = (i + 1) * batch_size
                    batch_X = shuffled_X[start:end, :]
                    batch_Y = shuffled_Y[start:end, :]

                    # makes a prediction and optimizes params by batch
                    Y_pred_batch = self.forward_pass(batch_X)
                    self.backward_pass(Y_pred_batch, batch_Y)
                    self.update_parameters_adam(learning_rate)

                # finalizes the results and computes the loss after all mini-batches are processed
                self._is_training = False
                Y_pred = self.forward_pass(X)
                train_loss = self.compute_loss(Y_pred, Y)

            if verbose and epoch % 100 == 0:
                print(f"Epoch {epoch}/{epochs}, Train Loss: {train_loss:.4f}")
(To focus on the train() updates, the other methods are elided here.)
◼ Further Techniques to Address Overfitting
Deep learning models can fit decision boundaries so complex that they sometimes memorize all the noise in the training data. So, I need to carefully address overfitting in the training process.
Though I have already implemented L2 regularization in the loss function, I’ll add two more regularization techniques common in deep learning to the train() function: Dropout and Early Stopping.
◼ Early Stopping
Early Stopping is a technique that stops model training when the loss on a validation set doesn't improve for a specified number of consecutive epochs (e.g., 50 epochs).
This guards against overfitting, where the training error gets extremely low yet the validation error remains high.
I’ll add the early-stopping logic to the training epoch loop with new arguments: X_val, y_val, patience, tol, and best_val_loss.
The logic counts consecutive epochs in which the validation loss fails to beat the best validation loss so far plus the tolerance (tol, a very small constant, e.g., 1e-10; zero also works). If this count exceeds the patience (e.g., 50, 100, 200), training is stopped.
import numpy as np

class DFN:
    def __init__(self, layer_dims: list = [10, 8, 4, 1], lambda_reg: float = 0.01, patience: int = 100, tol=1e-10):

        # ... the same as the prev coding block

        # <--- early stopping params --->
        self.tol = tol
        self.patience = patience

        # ... the same as the prev coding block

    def train(self, X, Y, X_val, Y_val, epochs, learning_rate, batch_size=16, verbose=True):
        m = X.shape[0]
        self._is_training = True

        # <--- early stopping params --->
        use_early_stopping = X_val is not None and Y_val is not None
        epochs_no_improve = 0
        best_W = {l: self.W[l].copy() for l in self.W}
        best_b = {l: self.b[l].copy() for l in self.b}
        best_val_loss = float('inf')

        # training epochs
        for epoch in range(1, epochs + 1):

            # ... the same as the prev coding block

            # <--- early stopping logic --->
            val_loss = float(0)
            if use_early_stopping:
                self._is_training = False
                Y_val_pred = self.forward_pass(X_val)
                val_loss = self.compute_loss(Y_val_pred, Y_val)

                if val_loss < best_val_loss + self.tol:
                    best_val_loss = val_loss
                    epochs_no_improve = 0
                    best_W = {l: self.W[l].copy() for l in self.W}
                    best_b = {l: self.b[l].copy() for l in self.b}
                else:
                    epochs_no_improve += 1
                    if epochs_no_improve >= self.patience:
                        print(f"\nEarly stopping at epoch {epoch}: Validation loss did not improve for {self.patience} epochs.")
                        self.W = best_W
                        self.b = best_b
                        break # stop training

            if verbose and epoch % 100 == 0:
                print(f"Epoch {epoch}/{epochs}, Train Loss: {train_loss:.4f}")
◼ Dropout
Lastly, I’ll add Dropout, a technique that randomly "drops out" (deactivates) a percentage of neurons during each training epoch.
Dropout forces the remaining neurons to learn more robust and independent features by preventing any single neuron from consistently relying on the presence of other neurons.
This makes the network less sensitive to minor variations in the input data and less prone to overfitting.

I’ll conclude the Python class by adding the dropout logic alongside the rest of the functions I defined so far.
I added a keep_prob parameter that defines the ratio of neurons to keep (keep_prob = 0.9 → the network drops a random 10% of neurons in each hidden layer). The dropout logic fires in the forward_pass() and backward_pass() functions during training only, and only when keep_prob is less than 1.0.
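To isolate the trick before the full class, here is inverted dropout on a toy activation row (a minimal sketch with made-up numbers): survivors are divided by keep_prob so the expected activation scale is unchanged between training and inference.

import numpy as np

np.random.seed(0)
A = np.ones((1, 8)) # toy activations
keep_prob = 0.75

D = np.random.rand(*A.shape) < keep_prob # boolean mask: keep ~75% of neurons
A_dropped = A * D / keep_prob # zero out the rest, rescale survivors

print(A_dropped) # kept entries become 1/0.75 ≈ 1.33, dropped ones are 0
print(A_dropped.mean()) # ≈ 1.0 in expectation, matching the no-dropout scale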
import numpy as np

class DFN:
    def __init__(
        self,
        layer_dims=[10, 8, 4, 1],
        lambda_reg=0.01,
        keep_prob=0.9,
        tol=1e-10,
        patience=50,
        learning_rate=.05,
        verbose=0
    ):
        self.layer_dims = layer_dims

        self.W = {}
        self.b = {}
        self.dLdW = {}
        self.dLdb = {}
        self.A = {}
        self.Z = {}

        self.m_W = {}
        self.m_b = {}
        self.v_W = {}
        self.v_b = {}

        self.beta1 = 0.9
        self.beta2 = 0.999
        self.epsilon = 1e-8
        self.learning_rate = learning_rate
        self.t = 0

        self.lambda_reg = lambda_reg

        self.tol = tol
        self.patience = patience
        self.val_losses = []

        # <--- Dropout params --->
        self._is_training = True
        self.keep_prob = keep_prob
        self.D = {} # storing dropout masks

        self.verbose = verbose
        self._initialize_parameters()

    def _initialize_parameters(self):
        np.random.seed(42)
        for l in range(1, len(self.layer_dims)):
            self.W[l] = np.random.randn(self.layer_dims[l-1], self.layer_dims[l]) * np.sqrt(2 / self.layer_dims[l-1])
            self.b[l] = np.zeros((1, self.layer_dims[l]))

    def _relu(self, Z):
        A = np.maximum(0, Z)
        return A, Z

    def _sigmoid(self, Z):
        A = 1 / (1 + np.exp(-Z))
        return A, Z

    def forward_pass(self, X):
        self.A[0] = X
        L = len(self.layer_dims) - 1

        for l in range(1, L):
            self.Z[l] = np.dot(self.A[l-1], self.W[l]) + self.b[l]
            self.A[l], _ = self._relu(self.Z[l])

            # <--- Dropout --->
            if self._is_training and self.keep_prob < 1.0:
                self.D[l] = np.random.rand(*self.A[l].shape) < self.keep_prob
                self.A[l] = self.A[l] * self.D[l]
                self.A[l] = self.A[l] / self.keep_prob # inverted dropout: rescale survivors

        self.Z[L] = np.dot(self.A[L-1], self.W[L]) + self.b[L]
        self.A[L], _ = self._sigmoid(self.Z[L])
        return self.A[L]

    def compute_loss(self, Y_pred, Y_true):
        m = Y_true.shape[0]
        Y_pred = np.clip(Y_pred, self.epsilon, 1 - self.epsilon)

        cross_entropy_loss = - (1/m) * np.sum(Y_true * np.log(Y_pred) + (1 - Y_true) * np.log(1 - Y_pred))

        l2_regularization_cost = 0
        for l in range(1, len(self.layer_dims)):
            l2_regularization_cost += np.sum(np.square(self.W[l]))
        l2_regularization_cost = (self.lambda_reg / (2 * m)) * l2_regularization_cost

        L = cross_entropy_loss + l2_regularization_cost
        return L

    def _relu_backward(self, dLdA, Z):
        dLdZ = np.array(dLdA, copy=True)
        dLdZ[Z <= 0] = 0
        return dLdZ

    def _sigmoid_backward(self, dLdA, Z):
        s = 1 / (1 + np.exp(-Z))
        dLdZ = dLdA * s * (1 - s)
        return dLdZ

    def backward_pass(self, Y_pred, Y_true):
        m = Y_true.shape[0]
        L = len(self.layer_dims) - 1
        dLdA = - (np.divide(Y_true, Y_pred) - np.divide(1 - Y_true, 1 - Y_pred))
        dLdZ = self._sigmoid_backward(dLdA, self.Z[L])
        self.dLdW[L] = (1/m) * np.dot(self.A[L-1].T, dLdZ) + (self.lambda_reg / m) * self.W[L]
        self.dLdb[L] = (1/m) * np.sum(dLdZ, axis=0, keepdims=True)
        dLdA_prev_layer = np.dot(dLdZ, self.W[L].T)
        for l in reversed(range(1, L)):

            # <--- Dropout Logic --->
            if self.keep_prob < 1.0:
                dLdA_prev_layer = dLdA_prev_layer * self.D[l]
                dLdA_prev_layer = dLdA_prev_layer / self.keep_prob

            dLdZ = self._relu_backward(dLdA_prev_layer, self.Z[l])
            self.dLdW[l] = (1/m) * np.dot(self.A[l-1].T, dLdZ) + (self.lambda_reg / m) * self.W[l]
            self.dLdb[l] = (1/m) * np.sum(dLdZ, axis=0, keepdims=True)

            if l > 1:
                dLdA_prev_layer = np.dot(dLdZ, self.W[l].T)

    def update_parameters_adam(self):
        # initializes the moment estimates once; they must persist across update steps
        if not self.m_W:
            for l in range(1, len(self.layer_dims)):
                self.m_W[l] = np.zeros_like(self.W[l])
                self.m_b[l] = np.zeros_like(self.b[l])
                self.v_W[l] = np.zeros_like(self.W[l])
                self.v_b[l] = np.zeros_like(self.b[l])

        self.t += 1
        L = len(self.layer_dims) - 1

        for l in range(1, L + 1):
            dLdW = self.dLdW[l]
            dLdb = self.dLdb[l]

            self.m_W[l] = self.beta1 * self.m_W[l] + (1 - self.beta1) * dLdW
            self.m_b[l] = self.beta1 * self.m_b[l] + (1 - self.beta1) * dLdb

            self.v_W[l] = self.beta2 * self.v_W[l] + (1 - self.beta2) * (dLdW ** 2)
            self.v_b[l] = self.beta2 * self.v_b[l] + (1 - self.beta2) * (dLdb ** 2)

            m_corrected_W = self.m_W[l] / (1 - self.beta1**self.t)
            m_corrected_b = self.m_b[l] / (1 - self.beta1**self.t)

            v_corrected_W = self.v_W[l] / (1 - self.beta2**self.t)
            v_corrected_b = self.v_b[l] / (1 - self.beta2**self.t)

            self.W[l] -= self.learning_rate * m_corrected_W / (np.sqrt(v_corrected_W) + self.epsilon)
            self.b[l] -= self.learning_rate * m_corrected_b / (np.sqrt(v_corrected_b) + self.epsilon)

        if self.verbose:
            delta_W_L = self.learning_rate * m_corrected_W / (np.sqrt(v_corrected_W) + self.epsilon)
            print(f"Delta W[L] stats (min, max, mean): {np.min(delta_W_L):.8f}, {np.max(delta_W_L):.8f}, {np.mean(delta_W_L):.8f}")

    def train(self, X, Y, epochs, batch_size=16, verbose=True, X_val=None, Y_val=None):
        m = X.shape[0]
        self._is_training = True
        use_early_stopping = X_val is not None and Y_val is not None
        epochs_no_improve = 0
        best_W = {l: self.W[l].copy() for l in self.W}
        best_b = {l: self.b[l].copy() for l in self.b}
        best_val_loss = float('inf')

        for epoch in range(1, epochs + 1):
            if batch_size is None or batch_size >= m:
                Y_pred = self.forward_pass(X)
                train_loss = self.compute_loss(Y_pred, Y)
                self.backward_pass(Y_pred, Y)
                self.update_parameters_adam()
            else:
                permutation = np.random.permutation(m)
                shuffled_X = X[permutation, :]
                shuffled_Y = Y[permutation, :]

                num_batches = m // batch_size
                for i in range(num_batches):
                    start = i * batch_size
                    end = (i + 1) * batch_size
                    batch_X = shuffled_X[start:end, :]
                    batch_Y = shuffled_Y[start:end, :]

                    Y_pred_batch = self.forward_pass(batch_X)
                    self.backward_pass(Y_pred_batch, batch_Y)
                    self.update_parameters_adam()
                self._is_training = False # turn off dropout for the full-set loss computation
                Y_pred = self.forward_pass(X)
                train_loss = self.compute_loss(Y_pred, Y)
                self._is_training = True # turn back on

            val_loss = None
            if use_early_stopping:
                self._is_training = False # turn off dropout for validation loss computation
                Y_val_pred = self.forward_pass(X_val)
                val_loss = self.compute_loss(Y_val_pred, Y_val)
                self._is_training = True # turn back on

                if val_loss < best_val_loss + self.tol:
                    best_val_loss = val_loss
                    epochs_no_improve = 0
                    best_W = {l: self.W[l].copy() for l in self.W}
                    best_b = {l: self.b[l].copy() for l in self.b}
                else:
                    epochs_no_improve += 1
                    if epochs_no_improve >= self.patience:
                        print(f"\nEarly stopping at epoch {epoch}: Validation loss did not improve for {self.patience} epochs.")
                        self.W = best_W
                        self.b = best_b
                        break

            if verbose and epoch % 100 == 0:
                if use_early_stopping:
                    print(f"Epoch {epoch}/{epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")
                else:
                    print(f"Epoch {epoch}/{epochs}, Train Loss: {train_loss:.4f}")
        return self
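The simulation below calls a predict() method that the class above doesn't define. A minimal sketch of the elided helper (my assumption: a dropout-free forward pass thresholded into class labels) could be:

    def predict(self, X, threshold=0.5):
        was_training = self._is_training
        self._is_training = False # disable dropout for inference
        Y_prob = self.forward_pass(X)
        self._is_training = was_training
        return (Y_prob >= threshold).astype(int)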
Simulation
I’ll demonstrate how the custom DFN works for a binary classification task of churn prediction, using a telecom churn dataset from the UC Irvine Machine Learning Repository.
◼ Preparing Dataset
Dataset:
Iranian Churn from UC Irvine Machine Learning Repository (Licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license)
3,500 data samples
14 features:

Figure D. Sample dataset (Image source)
After applying column transformations and data augmentation to handle the class imbalance, I ended up with 3,178 training samples (1,643 for y=0, 1,535 for y=1) with 13 features:
(3178, 13) (600, 13) (600, 13) (3178,) (600,) (600,)

Counter({0: 1643, 1: 1535})
◼ Model Training, Hyperparameter Tuning
I tuned the hyperparameters multiple times and trained the model with the training dataset.
from sklearn.metrics import f1_score, accuracy_score

custom_model = DFN(
    layer_dims=[X_train.shape[1], 30, 20, 10, 1], # adjusting NN
    lambda_reg=0.1,
    keep_prob=0.99,
    tol=1e-10,
    patience=1000,
    learning_rate=0.01,
).train(
    X=X_train,
    Y=convert(y_train),
    X_val=X_val,
    Y_val=convert(y_val),
    epochs=5000,
    batch_size=32,
)
◼ Evaluation
Lastly, I evaluated the model's predictions on the training, validation, and test datasets, noting the accuracy and F1 scores.
Y_pred_train = custom_model.predict(X=X_train, threshold=.5)
Y_pred_val = custom_model.predict(X=X_val, threshold=.5)
Y_pred_test = custom_model.predict(X=X_test, threshold=.5)
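The f1_score and accuracy_score imported earlier can then score each split (a short sketch; y_train, y_val, and y_test are assumed to be the original 1-D label arrays):

for name, y_true, y_pred in [
    ("train", y_train, Y_pred_train),
    ("val", y_val, Y_pred_val),
    ("test", y_test, Y_pred_test),
]:
    print(name, accuracy_score(y_true, y_pred.ravel()), f1_score(y_true, y_pred.ravel()))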
◼ Result
Accuracy score: Train 0.8763 → Validation 0.8517; Test (generalization): 0.8400
F1 score: Train 0.8772 → Validation 0.6367; Test (generalization): 0.6220
Good Training Performance:
The model is performing very well on the training data, achieving high accuracy (87.63%) and F1 score (87.72%). This indicates that the model has successfully learned the patterns within the data it was trained on.
Strong Generalization for Accuracy:
The validation accuracy (85.17%) is quite close to the training accuracy (87.63%), which is a positive sign. It suggests the model is generally able to make correct predictions on unseen data.
Concerning F1 Score Drop on Validation and Test:
The F1 scores on the validation (63.67%) and test (62.20%) sets are significantly lower than on the training set (87.72%). This is the most critical observation: despite strong overall accuracy, the model generalizes far worse on the class balance that F1 captures.
Wrapping Up
The custom DFN gives us deep insights into how neural networks work in the “black box”.
In practice, Python libraries like TensorFlow offer more robust approaches, which point to areas of improvement for the custom DFN. Some examples are:
Gradient Calculation: Integrates an automatic differentiation engine instead of manually deriving and implementing gradients.
Optimizer Implementation: Offers dynamic learning rate adjustments instead of manual adjustment of the optimizer parameters.
Robust Task Handling: Adds broader options for initializers, regularization methods, optimizers, and loss/activation functions (In this experiment, I used one example per element.)
Performance: GPU acceleration instead of CPU-only execution (NumPy runs on the CPU).
MLOps: Adds custom callbacks for logging or checkpointing, access to metrics libraries or pre-implemented metrics for the performance evaluation and more.
Overall, the custom DFN provides direct control over every aspect of the network, from parameter initialization to the forward pass, backward pass, manual gradient calculation, regularization, and parameter updates.
This makes it an excellent foundational tool for understanding neural network mechanics.
Related Books for Further Understanding
These books cover a wide range of theory and practice, from the fundamentals to the PhD level.
- Linear Algebra Done Right
- Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)
- Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications
- Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
Share What You Learned
Kuriko IWAI, "Beyond the Black Box: Architecting Deep Feedforward Networks with NumPy" in ML Labs
https://kuriko-iwai.com/building-deep-feedforward-networks
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.