Deep Dive into Recurrent Neural Networks (RNN): Mechanics, Math, and Limitations
Explore the core of sequential data modeling and how standard RNNs handle temporal dependencies
By Kuriko IWAI

Table of Contents
Introduction
What is Sequential Data
What is Recurrent Neural Network (RNN)
How Recurrent Neural Networks Work
Notable Limitations of Standard RNNs
Simulation
Conclusion

Introduction
A Recurrent Neural Network (RNN) is a widely used artificial neural network designed to recognize patterns in sequences of data, such as text, speech, video, or time series data.
While there are many architecture types in the RNN family, understanding how the standard architecture models sequential data is key to developing a deeper intuition for more advanced models like LSTMs and GRUs.
In this article, I’ll detail the mechanisms and practical applications of standard RNNs, using a weather forecast example to walk through the process.
What is Sequential Data
Before detailing the RNN architecture, I’ll briefly cover sequential data to bridge the gap between how it is structured and how RNNs are designed to process it.
◼ Sequences are Everywhere
Sequential data refers to collections of variable length where order impacts future predictions.
Typically:
Each variable in the sequence can be repeated, and
The sequence length can be unbounded (e.g., natural language).
Sequential data is everywhere in the real world; a variable in the sequence can be a character in a word, a pixel in an image, or a timestamped record in a time series:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure A. Examples of sequential data (Created by Kuriko IWAI)
◼ Modeling Sequential Data
Modeling sequential data means predicting the probabilities of succeeding data from learned context.
For example, predicting the weather on Day 5 based on the weather conditions of Days 1 to 4 is sequential data modeling, where the model estimates the probability of each weather condition on Day 5:

Figure B-1. Weather forecast (Created by Kuriko IWAI)
▫ Why Sequence Matters
The easiest way to estimate such a probability is to assume independence between the variables: the weather conditions of each day.
Under this assumption, the probability of rain on Day 5 is simply the general probability of a rainy day.
Let’s say 30% is the seasonal probability of rain; then Day 5 has the same 30%:
P_day5 (rainy) = P_seasonal (rainy) = 0.3
P_day5 (sunny) = P_seasonal (sunny) = 0.2
P_day5 (cloudy) = P_seasonal (cloudy) = 0.5
However, this independence assumption does not match the sequential structure of the data: when we observe consecutive rainy days, it is natural to assume a higher probability of rain on the following days.
A sequential model takes past weather as context and estimates the conditional probability of the target weather condition given that context, p(target | context).
For instance, the conditional probability of rain on Day 5 given Days 1 to 4’s weather might be 60%:

Figure B-2. Weather forecast as sequential data modeling (Created by Kuriko IWAI)
Mathematically, this process computes conditional probabilities (conditionals) of all variables in sequential data:

Figure B-3. Applying the chain rule to the weather forecasting (Created by Kuriko IWAI)
And the product of all the conditionals gives the joint probability of observing the exact weather sequence from Day 1 to Day 5, which is generalized as:

P(x_1, x_2, …, x_T) = Π_{t = 1 … T} P(x_t | x_1, …, x_t−1)

where:
T: The sequence length (T = 5), and
x_t: The t-th state in the sequence (Day t’s weather condition).
Computing this joint probability is an RNN's primary job.
RNNs attempt to model the joint probability of the entire sequence by computing conditionals of each variable.
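To make the chain rule concrete, here is a minimal sketch that multiplies per-day conditionals into a joint probability. The conditional values are made up for illustration, not real weather statistics:

```python
# Joint probability of an observed weather sequence via the chain rule:
# P(x_1, ..., x_T) = product over t of P(x_t | x_1, ..., x_{t-1})

# hypothetical per-day conditionals: P(x_1), P(x_2 | x_1), ..., P(x_5 | x_1..x_4)
conditionals = [0.5, 0.6, 0.7, 0.65, 0.6]

joint = 1.0
for p in conditionals:
    joint *= p  # multiply in each day's conditional probability

print(round(joint, 4))  # -> 0.0819
```

Note how quickly the joint probability shrinks: each additional day multiplies in another factor below 1.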
In the next section, I’ll explore how it actually works in the RNN architecture.
What is Recurrent Neural Network (RNN)
A Recurrent Neural Network (RNN) is a type of artificial neural network where connections between nodes form a directed graph along a sequence.
RNNs can handle the sequence using hidden states that function as memory, generating output based on the context observed in the previous hidden states.
The diagram below simplifies its architecture, compared with a standard Feedforward Neural Network (FNN):

Figure C. Architecture comparison of FNN and RNN (Created by Kuriko IWAI)
RNNs perform sequential data handling in their hidden states. Let us detail how this works.
How Recurrent Neural Networks Work
Apart from the key characteristic of recursive computation in the hidden state, RNNs share the core mechanics of neural networks, where they:
Perform a forward pass,
Perform backpropagation through time (BPTT), and
Adjust model parameters using optimizers.
◼ Performing Forward Pass
After receiving the input sequence X in vector format with input variables x_t, standard RNNs start by computing the first hidden state, h_1, and its corresponding output, o_1, using the initial hidden state h_0 (a zero vector):

Figure D. Computation process of standard RNN architecture (Created by Kuriko IWAI)
Mathematically, this process is generalized with a time step t:

h_t = σ(W_xh · x_t + W_hh · h_t−1 + b_h) … (1)

where:
σ: A non-linear activation function (e.g., sigmoid, tanh, ReLU) in the hidden layer,
h_t: The hidden state vector at time step t,
h_t−1: The hidden state vector from the previous time step t−1,
W_xh (pink square in the figure): An input-to-hidden weight matrix,
W_hh (green square in the figure): A recurrent weight matrix (or hidden-to-hidden weight matrix), and
b_h: A hidden layer bias vector.
W_xh, W_hh, and b_h are learnable model parameters.
W_xh and W_hh dictate the importance (weights) of the input from the input layer and the previous hidden state, respectively.
Then, the output o_t is computed:

o_t = W_ho · h_t + b_o … (2)

where:
o_t: The output vector at time step t,
W_ho: A hidden-to-output weight matrix, and
b_o: An output layer bias vector.
W_ho and b_o are also learnable model parameters, where W_ho dictates the importance of the current hidden state computed by formula (1) to the corresponding output.
During the forward pass, the model processes the input variables sequentially from the first to the last (x_1 through x_T) and generates the corresponding outputs.
In the weather forecast example, the RNN processes each day's weather in order; when it processes Day 3’s weather, its hidden state already retains information about the weather on Days 1 and 2.

Figure E. Forward pass of a standard RNN (Created by Kuriko IWAI)
Here, the model parameters (W_xh, W_hh, W_ho, b_h, and b_o) are shared across the time steps.
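The recurrence above can be sketched in a few lines of NumPy. This is an illustrative toy, not a trained model: the layer sizes and randomly initialized weights are my assumptions, standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 3  # illustrative sizes

# learnable parameters, shared across all time steps
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden
W_ho = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden-to-output
b_h, b_o = np.zeros(hidden_size), np.zeros(output_size)

def forward(X):
    """X: sequence of input vectors, shape (T, input_size)."""
    h = np.zeros(hidden_size)  # h_0 is a zero vector
    outputs = []
    for x_t in X:
        # formula (1): h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        # formula (2): o_t = W_ho h_t + b_o
        outputs.append(W_ho @ h + b_o)
    return np.array(outputs), h

X = np.eye(3)[[0, 1, 1, 2]]  # four one-hot encoded weather observations
outputs, h_last = forward(X)
print(outputs.shape)  # one output vector per time step
```

Note that the same W_xh, W_hh, and W_ho are reused at every iteration of the loop, which is exactly the parameter sharing described above.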
◼ Performing Backpropagation Through Time (BPTT)
Similar to a standard FNN, after the forward pass, the optimizer adjusts the model parameters (weights and biases) to minimize the loss.
This optimization requires backpropagation through time (BPTT), where the model computes gradients: partial derivatives of the loss function with respect to each model parameter.
When we use cross-entropy loss as the loss function, for example, the total loss is an aggregation of the loss at each time step from t = 1 to T:

L(θ) = Σ_{t = 1 … T} L_t(y_t, ŷ_t)

where:
L: The loss function,
θ: Model parameters (θ = { W_xh, W_hh, W_ho, b_h, b_o }),
y_t: The actual value at time step t,
ŷ_t: The predicted value at time step t (= the output o_t), and
T: The total number of time steps.
Then the model parameters are adjusted in the direction that reduces the loss:

θ ← θ − η · ∂L/∂θ

where η is the learning rate.
Choices of optimizers can vary; each optimizer has a different formulation to compute the optimal model parameters.
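As a minimal sketch of the update step, plain gradient descent (the simplest optimizer choice; the learning rate value here is illustrative) looks like:

```python
import numpy as np

learning_rate = 0.01  # eta, an illustrative value

def sgd_step(params, grads):
    """Apply theta <- theta - eta * dL/dtheta to each parameter in place."""
    for name in params:
        params[name] -= learning_rate * grads[name]

# toy parameter and gradient, standing in for W_xh, W_hh, W_ho, b_h, b_o
params = {"W_hh": np.array([1.0, 2.0])}
grads = {"W_hh": np.array([10.0, -10.0])}
sgd_step(params, grads)
print(params["W_hh"])  # [0.9 2.1]
```

Adam, used in the simulation later in this article, follows the same pattern but rescales each gradient using running averages of its first and second moments.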
Learn More: A Comprehensive Guide on Neural Network in Deep Learning
◼ Generating Final Output
After the training completes, the model generates its final output based on the architecture.
Here are the common scenarios:
1) Many-to-One Architecture (e.g., Sentiment Analysis)
This architecture generates the final output as a single value.
The output layer applies its activation function (e.g., sigmoid for binary classification or softmax for multi-class classification) to the output from the last time step T.
For example, with softmax:

O_final = softmax(o_T)

where:
O_final: The final output from the model, and
o_T: The output from the hidden layer at the last time step T.
2) Many-to-Many (Synchronized) Architecture (e.g., Part-of-Speech Tagging)
In this case, the final output is a sequence of outputs, each passed through the activation function of its own output layer:

O_t = softmax(o_t), for t = 1, …, T

where the outputs O_1 to O_T are the values generated by the activation function in the output layer.
3) Sequence-to-Sequence Architecture (e.g., Encoder-Decoder models)
This architecture processes input and output sequences of different lengths.
In an encoder-decoder example, an encoder first processes the entire input sequence X and updates the hidden states (also called context vectors); then, the decoder generates an entire sequence of outputs o′_1, …, o′_N, where:
N: The length of the output sequence (N ≠ T in general), and
o′_t: An output (a probability distribution over the target vocabulary) generated through an output layer with an activation function like softmax.
In the weather forecast example, the RNN generates a probability distribution over the possible weather conditions (rainy, sunny, cloudy), leveraging the many-to-one architecture.
Then, the most likely outcome is the one with the highest probability.
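In code, picking the most likely weather condition in the many-to-one case might look like the following sketch (the class order and the logit values are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

classes = ["rainy", "sunny", "cloudy"]
o_T = np.array([2.0, 0.5, 1.0])  # hypothetical output at the last time step

probs = softmax(o_T)  # O_final: a probability distribution over the classes
prediction = classes[int(np.argmax(probs))]
print(prediction)  # rainy
```

The softmax guarantees the scores are positive and sum to 1, so the argmax picks the condition with the highest probability.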
Notable Limitations of Standard RNNs
Now, we theoretically understand how a standard RNN models sequential data.
In this section, I’ll highlight two practical limitations of standard RNNs when it comes to dealing with long sequences:
Curse of Memory, and
Vanishing and Exploding Gradient Problems.
◼ Curse of Memory
Curse of memory refers to the phenomenon where a model struggles to effectively handle the memory of sequential data due to compounding errors and computational complexity.
We discussed earlier that RNNs share their model parameters across time steps in each epoch.
This is computationally efficient, but when any of these model parameters is slightly off, that error affects every step's computation, compounding its impact over long sequences.
For example, a slight change in the hidden state at time step t+k is denoted by formula A:

Δh_t+k = ( Π_{j = t … t+k} diag(σ′(a_j)) · W_hh ) · Δh_t−1 … (A)

where:
Δh_t+k: The resulting change in the hidden state at a much later time step, t+k,
Δh_t−1: A small perturbation in the hidden state at an earlier time step, t−1, and
diag(σ′(a_j)): A diagonal matrix (Jacobian matrix) whose diagonal elements are the derivatives of the activation function at the pre-activation (input) vector a_j.
Formula A indicates that the slight change in the hidden state from h_t+k to (h_t+k + Δh_t+k) is governed by the product of the Jacobian matrices, compounding over k+1 time steps.
When RNNs deal with longer sequences, k becomes extremely large, and the repeated matrix product exponentially amplifies (or dampens) Δh_t+k, significantly affecting the final hidden state h_t+k.
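A quick numerical sketch illustrates this compounding. Here W_hh is a scaled identity matrix, a deliberately simple stand-in so that its eigenvalues all equal w, and we track a small perturbation through k steps:

```python
import numpy as np

def perturbation_norm(w, k):
    """Norm of a small perturbation after k steps of delta -> W_hh @ delta,
    where W_hh = w * I has all eigenvalues equal to w."""
    W_hh = w * np.eye(4)
    delta = np.full(4, 1e-3)  # small perturbation of an early hidden state
    for _ in range(k):
        delta = W_hh @ delta  # one compounding step
    return float(np.linalg.norm(delta))

print(perturbation_norm(0.9, 100))  # shrinks toward zero (vanishing)
print(perturbation_norm(1.1, 100))  # grows explosively (exploding)
```

Even a modest deviation of the eigenvalues from 1 (0.9 vs. 1.1) changes the outcome by many orders of magnitude after 100 steps.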
◼ Vanishing & Exploding Gradient Problems
The vanishing gradient problem refers to the phenomenon where gradients shrink exponentially toward zero, making the contribution of past hidden states negligible.
The exploding gradient problem, on the other hand, refers to the phenomenon where gradients grow exponentially toward infinity, making it difficult for the model to converge to an optimal set of parameters.
The diagrams below show the source of these problems: the recursive nature of the hidden state computation:

Figure F. Recursive updates of the hidden state in the standard RNN architecture (Created by Kuriko IWAI)
In this process, the final hidden state h_T is computed by repeatedly multiplying the previous hidden state by the same hidden-to-hidden weight matrix, W_hh:

h_T = σ( W_xh · x_T + W_hh · σ( W_xh · x_T−1 + W_hh · σ( … ) + b_h ) + b_h )

so W_hh is applied at every one of the T steps.
If the values in W_hh are small (specifically, if its largest eigenvalue is less than 1), the hidden state will shrink exponentially toward zero over time, causing the vanishing gradient problem.
Conversely, if the values in W_hh are large (largest eigenvalue greater than 1), the hidden state will grow exponentially toward infinity, causing the exploding gradient problem.
These problems directly affect the BPTT process because BPTT leverages the chain rule, which involves a repeated product of the hidden-to-hidden weight matrix W_hh:

∂L/∂h_k = ∂L/∂h_t · Π_{j = k+1 … t} W_hhᵀ · diag(σ′(a_j))

where ∂L/∂h_k is the gradient of the loss with respect to the hidden state at time step k, and σ′ is the derivative of the activation function in the hidden state.
This formula indicates that the magnitude of the gradient is bounded by the norm of W_hh raised to the power of the number of time steps, (t−k):

‖∂L/∂h_k‖ ≤ ‖∂L/∂h_t‖ · ( ‖W_hh‖ · max σ′ )^(t−k)

Distant hidden states have extremely large (t−k) values.
When the norm of the hidden-to-hidden matrix W_hh is less / greater than 1, the term ‖W_hh‖^(t−k) exponentially shrinks or grows, causing the vanishing / exploding gradient problem, respectively.
To tackle these challenges, advanced architectures like LSTMs and GRUs have been developed. I’ll cover these architectures in a separate article.
Simulation
Now, let us see how a standard RNN handles sequential data of moderate length.
First, I’ll generate synthetic sequential data of length 1,500 with random weather variables:
import numpy as np

np.random.seed(42)

weather_map = {0: 'Sunny', 1: 'Rainy', 2: 'Cloudy'}
num_classes = len(weather_map)
sequence_length = 1500
raw_data = np.random.randint(0, num_classes, size=sequence_length)

def create_dataset(data, look_back):
    data_x, data_y = [], []

    for i in range(len(data) - look_back):
        seq = data[i:(i + look_back)]
        target = data[i + look_back]
        data_x.append(seq)
        data_y.append(target)

    return np.array(data_x), np.array(data_y)
◼ Defining the Model
Then, I defined the StandardRNN class with a many-to-one architecture, taking the input, hidden state, and output sizes as arguments:
import torch
import torch.nn as nn

class StandardRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(StandardRNN, self).__init__()
        self.hidden_size = hidden_size

        # rnn layer
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=False)

        # output layer to map rnn outputs to weather categories
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # initialize the hidden state with zeros
        h0 = torch.zeros(1, x.size(1), self.hidden_size)

        # pass the input through the rnn layer
        o, hn = self.rnn(x, h0)

        # take the output from the last time step for classification
        o_final = self.fc(o[-1, :, :])
        return o_final
◼ Generating Tensor Data
Using the synthetic data, I defined tensor data for training. This process involves one-hot encoding the categorical input data:
import torch
import torch.nn as nn

def one_hot_encode(data, num_classes):
    return nn.functional.one_hot(
        torch.from_numpy(data).long(), num_classes=num_classes
    ).float()

# sequence length (run with 10, 100, and 1000 for comparison)
look_back = 10

# generate training data
input_sequences, target_labels = create_dataset(raw_data, look_back)

# encode input data
input_tensor = one_hot_encode(input_sequences, num_classes)
input_tensor = input_tensor.permute(1, 0, 2)

# target tensor
target_tensor = torch.from_numpy(target_labels).long()
The look_back variable indicates the sequence length.
For comparison, I set it to 10, 100, and 1,000.
◼ Training
After instantiating the model, I trained it with the Adam optimizer and cross-entropy loss:
# instantiate a new model, loss function, and optimizer
model = StandardRNN(input_size=num_classes, hidden_size=10, output_size=num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# train the model
epochs = 100  # number of training epochs (illustrative value)
loss_history = []
for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = model(input_tensor)
    loss = criterion(outputs, target_tensor)
    loss.backward()
    optimizer.step()
    loss_history.append(loss.item())
◼ Perform Inference
Lastly, I performed inference:
# perform inference
with torch.inference_mode():
    # use the last sequence in the dataset
    last_sequence = raw_data[-look_back:]
    input_for_pred = one_hot_encode(np.array([last_sequence]), num_classes)
    input_for_pred = input_for_pred.permute(1, 0, 2)

    # prediction result (raw scores over the weather classes)
    o_final = model(input_for_pred)

    # choose the class with maximum score
    predicted_class_index = torch.argmax(o_final, dim=1).item()

    # return the weather name instead of the class index
    predicted_weather = weather_map[predicted_class_index]
◼ Results
The model produced different results: “Sunny” for the shorter sequences (T = 10 or 100) and “Rainy” for the longer sequence (T = 1,000), indicating the impact of sequence length on the model’s learning process.
▫ Sequence length T = 10:
Predicted weather for the next day: Sunny
Output (logits): [[0.3132, 0.0199, 0.0895]]
▫ Sequence length T = 100:
Predicted weather for the next day: Sunny
Output (logits): [[ 0.3618, -0.9002, 0.2347]]
▫ Sequence length T = 1,000:
Predicted weather for the next day: Rainy
Output (logits): [[-0.1279, 1.5855, 0.2085]]
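Note that these values are the raw outputs of the final linear layer (logits), not normalized probabilities, which is why some are negative and they don't sum to 1. If a proper distribution is needed, a softmax can be applied to the model output, sketched here for the T = 1,000 run:

```python
import torch

# raw model output (logits) for the T = 1,000 run
o_final = torch.tensor([[-0.1279, 1.5855, 0.2085]])

# softmax normalizes the logits into a probability distribution
probs = torch.softmax(o_final, dim=1)
print(probs.sum().item())  # sums to 1

# class index 1 still has the highest probability, i.e. 'Rainy'
print(torch.argmax(probs, dim=1).item())
```

The argmax is unchanged by the softmax, so the predicted class is the same either way.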
◼ Visualizing Vanishing Gradient Problems
Now, let us see how the gradient problem occurs as the sequence length increases.
▫ Loss History (Learning Capabilities)
The graph below illustrates the loss history for the different sequence lengths; the loss computation becomes unstable as the sequence length increases:

Figure G-1. Standard RNN’s training loss history by sequence lengths of 10, 100, and 1,000 (Created by Kuriko IWAI)
▫ Sequence length T = 10 (Blue):
Shows a relatively smooth and consistent decrease in loss, indicating that the model is effectively learning the patterns in the data.
▫ Sequence length T = 100 (Orange):
Shows that the loss decreases, but it is higher and more erratic than the short-sequence model.
Suggests that the vanishing gradient problem is beginning to take effect; the learning signal is weaker.
▫ Sequence length T = 1,000 (Red):
Shows a very unstable and high loss, with large, erratic spikes.
The gradient has become so small (vanished) or so large (exploded) that the model can no longer learn reliably.
▫ The Magnitude of Gradients
The graph below demonstrates that the gradient fades away as the sequence gets longer:

Figure G-2. Standard RNN - Gradients by sequence lengths of 10, 100, and 1,000 (Created by Kuriko IWAI)
▫ Sequence length T = 10 (Blue):
The line has the tallest and widest hump, indicating that for short sequences, the gradient is strong enough to allow for effective learning.
▫ Sequence length T = 100 (Orange):
The hump is significantly smaller and narrower than the blue line.
The gradient has already started to shrink after being propagated through 100 time steps, making learning difficult.
▫ Sequence length T = 1,000 (Red):
The hump is short and barely rises above zero.
The gradient has become so small after being multiplied over 1,000 time steps that it is effectively zero.
The model has a very weak signal to update its weights, making it challenging to learn long-term dependencies.
Conclusion
Standard RNNs are well-suited for sequential data of short-to-moderate length.
In our simulation, we observed their stable learning with sequences below 100 time steps.
However, their performance declines sharply with longer sequences, a direct result of the vanishing and exploding gradient problems.
A clear understanding of these architectural limitations is critical for selecting the appropriate model for a given task, ensuring better performance and more efficient use of resources.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Mastering Long Short-Term Memory (LSTM) Networks
Understanding GRU Architecture and the Power of Path Signatures
A Deep Dive into Bidirectional RNNs, LSTMs, and GRUs
Deep Recurrent Neural Networks: Engineering Depth for Complex Sequences
Advanced Cross-Validation for Sequential Data: A Guide to Avoiding Data Leakage
Related Books for Further Understanding
These books cover a wide range of theories and practices, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Practical Time Series Analysis: Prediction with Statistics and Machine Learning
Share What You Learned
Kuriko IWAI, "Deep Dive into Recurrent Neural Networks (RNN): Mechanics, Math, and Limitations" in Kernel Labs
https://kuriko-iwai.com/recurrent-neural-network
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.




