Deep Dive into Recurrent Neural Networks (RNN): Mechanics, Math, and Limitations

Explore core of sequential data modeling and how standard RNNs handle temporal dependencies

Deep Learning, Data Science, Python

By Kuriko IWAI


Table of Contents

Introduction
What is Sequential Data
Sequences are Everywhere
Modeling Sequential Data
What is Recurrent Neural Network (RNN)
How Recurrent Neural Networks Work
Performing Forward Pass
Performing Backpropagation Through Time (BPTT)
Generating Final Output
Notable Limitations of Standard RNNs
Curse of Memory
Vanishing & Exploding Gradient Problems
Simulation
Defining the Model
Generating Tensor Data
Training
Perform Inference
Results
Visualizing Vanishing Gradient Problems
Conclusion

Introduction

Recurrent Neural Network (RNN) is a widely used artificial neural network designed to recognize patterns in sequences of data, such as text, speech, video, or time series data.

There are many architecture types in the RNN family, but understanding how the standard architecture models sequential data is key to developing a deeper intuition for more advanced models like LSTMs and GRUs.

In this article, I’ll detail the mechanics and practical applications of standard RNNs, using a weather forecast example to walk through the process.

What is Sequential Data

Before detailing the RNN architecture, I’ll briefly review sequential data to bridge the gap between how it is structured and how RNNs are designed to process it.

Sequences are Everywhere

Sequential data refers to collections of variables of variable length where order impacts future predictions.

Typically:

  • Each variable in the sequence can be repeated, and

  • The variable length can be infinite (e.g., natural language).

Sequential data is everywhere in the real world; a variable in a sequence can be a character in a word, a pixel in an image, or a timestamped observation in a time series:

Figure A. Examples of sequential data (Created by Kuriko IWAI)

Modeling Sequential Data

Modeling sequential data means predicting probabilities of succeeding data from learned context.

For example, predicting the weather on Day 5 based on the weather conditions of Days 1 to 4 is sequential data modeling, where the model estimates the probability of each weather condition on Day 5:

Figure B-1. Weather forecast (Created by Kuriko IWAI)

Why Sequence Matters

The easiest way to estimate such a probability is to assume the independence of each variable: the weather condition of each day.

Under this assumption, the probability of rain on Day 5 is simply the general probability of a rainy day.

Let’s say the seasonal probability of rain is 30%; then Day 5 has the same 30%:

  • P_day5 (rainy) = P_seasonal (rainy) = 0.3

  • P_day5 (sunny) = P_seasonal (sunny) = 0.2

  • P_day5 (cloudy) = P_seasonal (cloudy) = 0.5

However, this independence assumption does not match the sequential structure of the data: when we observe consecutive rainy days, it is natural to assume a higher probability of rain on subsequent days.

A sequential model takes past weather as context to estimate a conditional probability of the target weather condition given such context, p(target | context).

For instance, the conditional probability of rain on Day 5 given Days 1 to 4’s weather might be 60%:

Figure B-2. Weather forecast as sequential data modeling (Created by Kuriko IWAI)

Mathematically, this process computes conditional probabilities (conditionals) of all variables in sequential data:

Figure B-3. Applying the chain rule to the weather forecasting (Created by Kuriko IWAI)

And the product of all the conditionals gives the joint probability of observing the exact weather sequence from Day 1 to Day 5, which is generalized as:

p(x) = \prod_{t=1}^T p(x_t | x_1, x_2, \cdots, x_{t-1})

where:

  • T: The total sequence length (T = 5), and

  • x_i: i-th state in the sequence (Day i’s weather condition).

This computation of joint probability is an RNN's primary job.

RNNs attempt to model the joint probability of the entire sequence by computing conditionals of each variable.
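As a toy illustration, once the conditionals are known, the chain-rule factorization can be computed directly. The probabilities below are hypothetical, not from a trained model:

```python
import numpy as np

# hypothetical conditionals p(x_t | x_1, ..., x_{t-1}) for Days 1-5
# (illustrative values only)
conditionals = [0.3, 0.5, 0.6, 0.7, 0.6]

# the joint probability of the exact 5-day weather sequence
# is the product of the conditionals
joint = float(np.prod(conditionals))
print(round(joint, 4))  # 0.0378
```

Note how quickly the joint probability shrinks: any single exact sequence is unlikely, which is why models work with conditionals rather than the joint directly.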

In the next section, I’ll explore how it actually works in the RNN architecture.

What is Recurrent Neural Network (RNN)

Recurrent Neural Network (RNN) is a type of artificial neural network where connections between nodes form a directed graph along a sequence.

RNNs handle sequences using hidden states that function as memory, generating output based on the context observed in previous hidden states.

The diagram below simplifies its architecture, compared with a standard Feedforward Neural Network (FNN):

Figure C. Architecture comparison of FNN and RNN (Created by Kuriko IWAI)

RNNs perform sequential data handling in their hidden states. Let us detail how this works.

How Recurrent Neural Networks Work

Apart from their key characteristic of recursive computation in the hidden state, RNNs share the core mechanics of neural networks:

  • Perform forward pass,

  • Perform backpropagation through time (BPTT), and

  • Adjust model parameters using optimizers.

Performing Forward Pass

After receiving the input sequence X as a vector of input variables x_t:

X = (x_1, x_2, \cdots, x_T)

Standard RNNs start by computing the first hidden state, h_1, and its corresponding output, o_1, using the initial hidden state h_0 (a zero vector):

Figure D. Computation process of standard RNN architecture (Created by Kuriko IWAI)

Mathematically, this process is generalized with a time step t:

h_t = \sigma( \underbrace{W_{hh} h_{t-1} + W_{xh} x_t + b_h}_{\text{pre-activation value at time step } t: \ a_t} ) \quad \cdots (1)

where:

  • σ: A non-linear activation function (e.g., sigmoid, tanh, ReLU) in hidden layer,

  • h_t: The hidden state vector at the time step t,

  • h_t-1: The hidden state vector from the previous time step t−1,

  • W_xh (pink square in the figure): An input-to-hidden weight matrix,

  • W_hh (green square in the figure): A recurrent weight matrix (or hidden-to-hidden weight matrix), and

  • b_h: A hidden layer bias vector.

W_xh, W_hh, and b_h are learnable model parameters.

W_xh and W_hh dictate the importance (weights) of input from the input layer and previous hidden layer respectively.

Then, the output o_t is computed:

o_t = W_{ho} h_t + b_o \quad \cdots (2)

where:

  • o_t​ represents the output vector at time step t,

  • W_ho: A hidden-to-output weight matrix, and

  • b_o: An output layer bias vector.

W_ho and b_o are also learnable model parameters, where W_ho dictates the importance of the current hidden state value computed by formula (1) to the corresponding output.

During the forward pass, the model processes the input variables sequentially from the first, x_1, to the last, x_T, and generates corresponding outputs.

In the weather forecast example, the RNN processes each day's weather sequentially; when it processes Day 3’s weather (x_3), its hidden state retains information about the weather on Days 1 and 2.

Figure E. Forward pass of a standard RNN (Created by Kuriko IWAI)

Here, the model parameters (W_xh, W_hh, W_ho, b_h, and b_o) are shared across the time steps.
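To make formulas (1) and (2) concrete, here is a minimal NumPy sketch of the forward pass. The dimensions and random parameter values are illustrative; a trained model would learn W_xh, W_hh, W_ho, b_h, and b_o:

```python
import numpy as np

rng = np.random.default_rng(0)

# dimensions (illustrative): 3 one-hot weather classes, 4 hidden units
input_size, hidden_size, output_size = 3, 4, 3

# learnable parameters, shared across every time step
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden
W_ho = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden-to-output
b_h = np.zeros(hidden_size)
b_o = np.zeros(output_size)

# a 5-day one-hot weather sequence (x_1, ..., x_T)
X = np.eye(input_size)[[0, 1, 1, 2, 0]]

h = np.zeros(hidden_size)  # h_0: the initial hidden state of zeros
outputs = []
for x_t in X:
    h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)  # formula (1)
    outputs.append(W_ho @ h + b_o)            # formula (2)

print(len(outputs), outputs[-1].shape)  # 5 (3,)
```

The same five parameter arrays are reused at every step of the loop, which is exactly the parameter sharing described above.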

Performing Backpropagation Through Time (BPTT)

Similar to a standard FNN, after the forward pass, the optimizer adjusts the model parameters (weights and biases) to minimize the loss.

This optimization requires Backpropagation Through Time (BPTT), where the model computes gradients: partial derivatives of the loss function with respect to each model parameter.

When we use cross-entropy as the loss function, for example, the total loss aggregates the loss at each time step from t = 1 to T:

L_{\theta} (y, \hat y) = - \sum_{t=1}^T y_t \cdot \log(\hat y_t) \quad \cdots (3)

where:

  • L: The loss function,

  • θ: Model parameters (θ = { W_xh, W_hh, W_ho, b_h, b_o }),

  • y: Actual value,

  • y^: Predicted value (= the output o), and

  • t: Time step (Total time steps = T)

Then the model parameters are adjusted:

\theta = \arg \min_{\theta} L_{\theta} (y, \hat y)

Choices of optimizer vary, and each optimizer has a different formulation for computing the optimal model parameters.
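Formula (3) is easy to sketch directly. The predicted distributions below are hypothetical softmax outputs for a three-day sequence, not values from a trained model:

```python
import numpy as np

# hypothetical softmax outputs for t = 1..3 and their one-hot targets
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.6, 0.2]])
y = np.eye(3)[[0, 1, 1]]  # actual weather classes as one-hot vectors

# formula (3): the total loss is the sum of per-step cross-entropies
total_loss = float(-np.sum(y * np.log(y_hat)))
print(round(total_loss, 4))  # 1.0906
```

Because the one-hot target zeroes out every term except the true class, each step contributes −log of the probability assigned to the correct weather condition.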

Learn More: A Comprehensive Guide on Neural Network in Deep Learning

Generating Final Output

After the training completes, the model generates its final output based on the architecture.

Here are the common scenarios:

1) Many-to-One Architecture (e.g., Sentiment Analysis)

This architecture generates a final output as a single value.

The output layer applies its activation function (e.g., sigmoid for binary classification or softmax for multi-class classification) to the output from the last time step T.

For example:

O_{final} = \text{sigmoid}(o_T) = \frac{1}{1+e^{-o_T}}

where:

  • O_final: The final output from the model and,

  • o_T: The final output from the hidden layer at T.

2) Many-to-Many (Synchronized) Architecture (e.g., Part-of-Speech Tagging)

In this case, the final output is a sequence of outputs, each passed through the activation function of its own output layer:

O_{final} = (o_1, o_2, \cdots, o_T)

where outputs from o_1 to o_T are the values generated by the activation function in the output layer.

3) Sequence-to-Sequence Architecture (e.g., Encoder-Decoder models)

This architecture processes inputs and outputs of different lengths.

In an encoder-decoder example, an encoder first processes the entire input sequence X and updates the hidden states (also called context vectors); then the decoder generates an entire sequence of outputs:

O_{final} = (o'_1, o'_2, \cdots, o'_N)

where:

  • N: Length of output sequences (N ≠ T)

  • o’_t: An output (a probability distribution over target vocabulary) generated through an output layer with an activation function like softmax.

In the weather forecast example, the RNN leverages the many-to-one architecture to generate a probability distribution over the possible weather conditions (rainy, sunny, cloudy).

Then, the most likely outcome is the one with the highest probability.
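As a sketch, the last-step output can be converted into a class prediction via softmax and argmax. The logit values here are illustrative, not from the trained model:

```python
import numpy as np

weather_map = {0: 'Sunny', 1: 'Rainy', 2: 'Cloudy'}

# hypothetical raw output scores (logits) from the last time step
o_T = np.array([0.36, -0.90, 0.23])

# softmax converts logits into a probability distribution
probs = np.exp(o_T) / np.sum(np.exp(o_T))

# the prediction is the class with the highest probability
print(weather_map[int(np.argmax(probs))])  # Sunny
```

Since softmax is monotonic, taking the argmax of the raw logits gives the same class as the argmax of the probabilities.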

Notable Limitations of Standard RNNs

Now, we theoretically understand how a standard RNN models sequential data.

In this section, I’ll highlight two practical limitations of standard RNNs when it comes to dealing with long sequences:

  1. Curse of Memory, and

  2. Vanishing and Exploding Gradient Problems.

Curse of Memory

The curse of memory refers to the phenomenon where the model struggles to handle the memory of sequential data effectively due to compounding errors and computational complexity.

We discussed earlier that RNNs share their model parameters across time steps in each epoch.

This is computationally efficient, but when any of these model parameters is slightly off, that error affects every step's computation, compounding its impact over long sequences.

For example, a slight change in the hidden state at time step t+k can be approximated as:

\Delta h_{t+k} \approx \left( \prod_{j=t}^{t+k} \text{diag}(\sigma'(a_j)) W_{hh} \right) \Delta h_{t-1} \quad \cdots (A)

where:

  • Δh_t+k: The resulting change in the hidden state at much later time step, t+k,

  • Δh_t−1: A small perturbation in the hidden state at an earlier time step, t−1, and

  • diag(σ′(a_j)): A diagonal matrix whose diagonal elements are the derivatives of the activation function evaluated at the pre-activation vector a_j (its product with W_hh forms the Jacobian ∂h_j/∂h_{j−1}):

a_j = \begin{pmatrix} a_{j,1} \\ a_{j,2} \end{pmatrix} \implies \text{diag}(\sigma'(a_j)) = \begin{pmatrix} \sigma'(a_{j,1}) & 0 \\ 0 & \sigma'(a_{j,2}) \end{pmatrix}

Here, formula (A) indicates that the change in the hidden state from h_{t+k} to (h_{t+k} + Δh_{t+k}) is governed by a product of Jacobian matrices, compounding over k+1 time steps.

When RNNs deal with longer sequences, k becomes extremely large, and the repeated multiplication can exponentially amplify (or suppress) Δh_{t+k}, significantly impacting the final hidden state h_{t+k}.
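The compounding effect in formula (A) can be simulated with recurrent matrices whose largest eigenvalue magnitude is controlled. This is a worst-case linear sketch that treats the activation derivative as 1; the matrix construction is illustrative, not from a trained model:

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 4, 100

# build recurrent matrices with a controlled largest eigenvalue magnitude
W = rng.normal(size=(n, n))
rho = np.max(np.abs(np.linalg.eigvals(W)))
W_small = W / rho * 0.9  # largest eigenvalue magnitude 0.9
W_large = W / rho * 1.1  # largest eigenvalue magnitude 1.1

def propagate(W_hh, steps):
    # worst-case linear propagation of a small perturbation, per formula (A),
    # treating the activation derivative as 1
    delta = np.full(n, 1e-3)
    for _ in range(steps):
        delta = W_hh @ delta
    return float(np.linalg.norm(delta))

initial = float(np.linalg.norm(np.full(n, 1e-3)))
print(propagate(W_small, k) < initial)  # the perturbation decays over 100 steps
print(propagate(W_large, k) > initial)  # the perturbation blows up over 100 steps
```

Even a 10% deviation of the eigenvalue magnitude from 1 makes the perturbation shrink or grow by orders of magnitude over 100 steps.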

Vanishing & Exploding Gradient Problems

The vanishing gradient problem refers to the phenomenon where gradients shrink exponentially towards zero, making the contribution of past hidden states negligible.

The exploding gradient problem, on the other hand, refers to the phenomenon where gradients grow towards infinity, making it difficult for the model to converge to an optimal set of parameters.

The diagrams below show the source of these problems: the recursive nature of hidden state computation:

Figure F. Recursive updates of the hidden state in the standard RNN architecture (Created by Kuriko IWAI)

In this process, the final hidden state h_T is computed by repeatedly multiplying the previous hidden state by the same hidden-to-hidden weight matrix, W_hh (simplified here by ignoring the input term and the activation function):

h_T = W_{hh} h_{T-1} = W_{hh}^2 h_{T-2} = \cdots = W_{hh}^T h_0

If the values in W_hh​ are small (specifically, if its largest eigenvalue is less than 1), the hidden state will shrink exponentially towards zero over time, causing the vanishing gradient problem.

Conversely, if the values in W_hh​ are large, the hidden state will grow exponentially towards infinity, causing the exploding gradient problem.

h_T \to \begin{cases} 0 & \text{if } \|W_{hh}\| < 1 \\ \infty & \text{if } \|W_{hh}\| > 1 \end{cases}

These problems affect the BPTT process because BPTT leverages the chain rule, which involves a product of the hidden-to-hidden weight matrix W_hh:

\begin{aligned} \frac{\partial L_{\theta}}{\partial h_k} &= \frac{\partial L_{\theta}}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_{t-1}} \cdots \frac{\partial h_{k+1}}{\partial h_k} \\ &= \frac{\partial L_{\theta}}{\partial h_t} \prod_{i=k}^{t-1} \frac{\partial h_{i+1}}{\partial h_i} \\ &= \frac{\partial L_{\theta}}{\partial h_t} \prod_{j=k}^{t-1} \text{diag}(\sigma'(a_{j+1})) W_{hh} \end{aligned}

(∂L/∂h_k: The gradient of the loss with respect to the hidden state at time step k; σ′: the derivative of the activation function in the hidden layer)

This formula indicates that the magnitude of the gradient is bounded by the norm of W_hh​ raised to the power of the number of time steps, (t−k):

\begin{aligned} \left\| \frac{\partial L}{\partial h_k} \right\| &\leq \left\| \frac{\partial L}{\partial h_t} \right\| \prod_{j=k}^{t-1} \left\| \text{diag}(\sigma'(a_{j+1})) W_{hh} \right\| \\ &\leq \left\| \frac{\partial L}{\partial h_t} \right\| \prod_{j=k}^{t-1} \| W_{hh} \| \\ &\leq \left\| \frac{\partial L}{\partial h_t} \right\| \| W_{hh} \|^{t-k} \end{aligned}

(The second step assumes the derivative of the activation function is bounded by 1, as it is for tanh and sigmoid.)

Distant hidden states have extremely large (t - k) values.

When the norm of the hidden-to-hidden matrix W_hh is less than or greater than 1, the term ||W_hh||^(t−k) exponentially shrinks or grows, causing the vanishing or exploding gradient problem, respectively.
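This bound is easy to verify numerically; even a norm slightly different from 1 compounds dramatically over t−k steps (the norm values below are illustrative):

```python
# how ||W_hh||^(t-k) scales for norms slightly below and above 1
for norm in (0.95, 1.05):
    for steps in (10, 100, 1000):
        print(f"||W_hh|| = {norm}, t-k = {steps}: {norm ** steps:.3e}")
```

For t−k = 1,000, a norm of 0.95 drives the factor below 1e-22 (vanishing), while a norm of 1.05 pushes it above 1e21 (exploding).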

To tackle these challenges, advanced architectures like LSTMs and GRUs have been developed. I’ll cover these architectures in a separate article.

Simulation

Now, let us see how the standard RNN handles sequential data with moderate length.

First, I’ll generate synthetic sequential data with a maximum length of 1,500 random weather variables:

import numpy as np

np.random.seed(42)

weather_map = {0: 'Sunny', 1: 'Rainy', 2: 'Cloudy'}
num_classes = len(weather_map)
sequence_length = 1500
raw_data = np.random.randint(0, num_classes, size=sequence_length)

def create_dataset(data, look_back):
    data_x, data_y = [], []

    for i in range(len(data) - look_back):
        seq = data[i:(i + look_back)]
        target = data[i + look_back]
        data_x.append(seq)
        data_y.append(target)

    return np.array(data_x), np.array(data_y)

Defining the Model

Then, I defined the StandardRNN class with a many-to-one architecture, taking the input, hidden state, and output sizes as arguments:

import torch
import torch.nn as nn

class StandardRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(StandardRNN, self).__init__()
        self.hidden_size = hidden_size

        # rnn layer
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=False)

        # output layer to map rnn outputs to weather categories
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # initialize the hidden state with zeros
        h0 = torch.zeros(1, x.size(1), self.hidden_size)

        # pass the input through the rnn layer
        o, hn = self.rnn(x, h0)

        # take the output from the last time step for classification
        o_final = self.fc(o[-1, :, :])
        return o_final

Generating Tensor Data

Using the synthetic data, I defined tensor data for training. This process involves one-hot encoding the categorical input data:

import torch
import torch.nn as nn

def one_hot_encode(data, num_classes):
    return nn.functional.one_hot(
            torch.from_numpy(data).long(), num_classes=num_classes
           ).float()

# generate training data
input_sequences, target_labels = create_dataset(raw_data, look_back)

# encode input data
input_tensor = one_hot_encode(input_sequences, num_classes)
input_tensor = input_tensor.permute(1, 0, 2)

# target tensor
target_tensor = torch.from_numpy(target_labels).long()

The look_back variable indicates the sequence length.

For comparison, I set it to 10, 100, and 1,000.

Training

After instantiating the model, I trained it with the Adam optimizer and the cross-entropy loss:

# instantiate a new model, loss function, and optimizer
model = StandardRNN(input_size=num_classes, hidden_size=10, output_size=num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# train the model
epochs = 200  # illustrative value; the original epoch count is not specified
loss_history = []
for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = model(input_tensor)
    loss = criterion(outputs, target_tensor)
    loss.backward()
    optimizer.step()
    loss_history.append(loss.item())

Perform Inference

Lastly, I performed inference:

# perform inference
with torch.inference_mode():
    # use the last sequence in the dataset
    last_sequence = raw_data[-look_back:]
    input_for_pred = one_hot_encode(np.array([last_sequence]), num_classes)
    input_for_pred = input_for_pred.permute(1, 0, 2)

    # prediction result (raw output scores, i.e. logits)
    o_final = model(input_for_pred)

    # choose the class with the maximum score
    predicted_class_index = torch.argmax(o_final, dim=1).item()

    # return the weather name instead of the class index
    predicted_weather = weather_map[predicted_class_index]

Results

The model produced different results: “Sunny” for the shorter sequences (T = 10 or 100) and “Rainy” for the longer sequence (T = 1,000), indicating the impact of sequence length on the model’s learning process.

Sequence length T = 10:

  • Predicted weather for the next day: Sunny

  • Raw output scores (logits): [[0.3132, 0.0199, 0.0895]]

Sequence length T = 100:

  • Predicted weather for the next day: Sunny

  • Raw output scores (logits): [[ 0.3618, -0.9002, 0.2347]]

Sequence length T = 1,000:

  • Predicted weather for the next day: Rainy

  • Raw output scores (logits): [[-0.1279, 1.5855, 0.2085]]

Visualizing Vanishing Gradient Problems

Now, let us see how the gradient problem occurs as the sequence length increases.

Loss History (Learning Capabilities)

The graph below illustrates the loss history for different sequence lengths; the computed losses become unstable as the sequence length increases:

Figure G-1. Standard RNN’s training loss history by sequence lengths of 10, 100, and 1,000 (Created by Kuriko IWAI)

Sequence length T = 10 (Blue):

  • Shows a relatively smooth and consistent decrease in loss, indicating that the model is effectively learning the patterns in the data.

Sequence length T = 100 (Orange):

  • Shows that the loss decreases, but it is higher and more erratic than the short-sequence model.

  • Suggests that the vanishing gradient problem is beginning to take effect; the learning signal is weaker.

Sequence length T = 1,000 (Red):

  • Shows a very unstable and high loss, with large, erratic spikes.

  • The gradient has become so small (vanished) or so large (exploded) that the model can no longer learn effectively.

The Magnitude of Gradients

The graph below demonstrates that the gradient fades away as the sequence gets longer:

Figure G-2. Standard RNN - Gradients by sequence lengths of 10, 100, and 1,000 (Created by Kuriko IWAI)

Sequence length T = 10 (Blue):

  • The line has the tallest and widest hump, indicating that for short sequences, the gradient is strong enough to allow for effective learning.

Sequence length T = 100 (Orange):

  • The hump is significantly smaller and narrower than the blue line.

  • The gradient has already started to shrink after being propagated through 100 time steps, making learning difficult.

Sequence length T = 1,000 (Red):

  • The hump is short and barely rises above zero.

  • The gradient has become so small after being multiplied over 1000 time steps that it's effectively zero.

  • The model has a very weak signal to update its weights, making it challenging to learn long-term dependencies.

Conclusion

Standard RNNs are well-suited for sequential data of short-to-moderate length.

In our simulation, we observed their stable learning with sequences below 100 time steps.

However, their performance declines sharply with longer sequences, a direct result of the vanishing and exploding gradient problems.

A clear understanding of these architectural limitations is critical for selecting the appropriate model for a given task, ensuring better performance and more efficient use of resources.

Continue Your Learning

If you enjoyed this blog, these related entries will complete the picture:

Related Books for Further Understanding

These books cover a wide range of theories and practices, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Practical Time Series Analysis: Prediction with Statistics and Machine Learning

Share What You Learned

Kuriko IWAI, "Deep Dive into Recurrent Neural Networks (RNN): Mechanics, Math, and Limitations" in Kernel Labs

https://kuriko-iwai.com/recurrent-neural-network


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.