A Deep Dive into Bidirectional RNNs, LSTMs, and GRUs

Explore how BRNNs handle contextual predictions over sequential data with practical examples

Machine Learning · Data Science · Python

By Kuriko IWAI


Table of Contents

Introduction
What is a Bidirectional Recurrent Neural Network (BRNN)?
How the Bidirectional Approach Works on a Standard RNN
Consideration - Architecture Inheritance and Computational Cost
Simulation
Preparing for Synthetic Data
Initializing Bidirectional RNNs
Model Training & Evaluation
Results
When to Use BiLSTM, BiGRU, or BRNN
Bidirectional RNN (BRNN)
Bidirectional LSTM (BiLSTM)
Bidirectional GRU (BiGRU)
Conclusion

Introduction

Recurrent Neural Networks (RNNs) are widely used artificial neural networks designed to recognize patterns in sequences of data, such as text, speech, video, or time series data.

Their major variants, such as standard RNNs, LSTMs, and GRUs, process a sequence in a strictly forward direction. This poses a challenge for tasks like handwriting recognition, where a comprehensive understanding of future elements is crucial for accurate predictions.

Bidirectional Recurrent Neural Networks (BRNNs) were developed to overcome this challenge.

In this article, I’ll detail the mechanisms and applications of BRNNs using a synthetic-data simulation.

What is a Bidirectional Recurrent Neural Network (BRNN)?

A Bidirectional Recurrent Neural Network (BRNN) is a type of Recurrent Neural Network that analyzes sequential data in both the forward (past-to-future) and backward (future-to-past) directions.

BRNNs find extensive applications in fields that demand rich contextual understanding, including:

  • Handwriting recognition

  • Speech recognition

  • Machine translation

  • Part-of-speech tagging

  • Dependency parsing

  • Protein structure prediction

BRNNs achieve this contextual understanding by first splitting the hidden states into two distinct directions:

  • Forward states: the positive time direction, from beginning to end, and

  • Backward states: the negative time direction, from end to beginning,

and then combining the two sets of outputs in a shared output layer.

Let us explore more details.

How the Bidirectional Approach Works on a Standard RNN

BRNNs can be built on most RNN architectures, such as LSTMs, GRUs, and even SigGRU.

In this section, I’ll detail its application to a standard RNN for simplicity.

The diagram below showcases the process, compared with a standard (unidirectional) RNN:

Figure A. Comparison of a bidirectional RNN and standard RNN architectures (Created by Kuriko IWAI)


After receiving an input sequence X, a BRNN first computes two hidden state values, h1(f) and h1(b) (the green circles in Figure A; the darker color indicates the backward pass, the lighter one the forward pass).

Then, the BRNN combines these two hidden state values to generate the output (o1, the blue circles in Figure A).

Mathematically, this process is generalized for a given time step t.

For the forward hidden state value:

h_t^{(f)} = \tanh\left(W_{hh}^{(f)} h_{t-1}^{(f)} + W_{xh}^{(f)} x_t + b_h^{(f)}\right)

where:

  • h_t(f)​: The forward hidden state vector at time step t,

  • tanh: The non-linear activation function of the forward direction hidden layer (can be another activation function like sigmoid),

  • W_hh(f)​: Forward recurrent weight matrix,

  • h_t−1(f)​: The forward hidden state vector from the previous time step t−1,

  • W_xh(f)​: Input-to-forward-hidden weight matrix,

  • x_t: The input at time step t, and

  • b_h(f)​: Forward hidden layer bias vector.

For the backward hidden state value:

h_t^{(b)} = \tanh\left(W_{hh}^{(b)} h_{t+1}^{(b)} + W_{xh}^{(b)} x_t + b_h^{(b)}\right)

where:

  • h_t(b)​: The backward hidden state vector at time step t,

  • tanh: The non-linear activation function of the backward direction hidden layer (can be another activation function like sigmoid),

  • W_hh(b)​: Backward recurrent weight matrix,

  • h_{t+1}(b)​: The backward hidden state vector from the next time step t+1,

  • W_xh(b)​: Input-to-backward-hidden weight matrix,

  • x_t: The input at time step t, and

  • b_h(b)​: Backward hidden layer bias vector.

In the backward direction, the computation of the current hidden state depends on the future hidden state value h_t+1(b)​ because the model processes the sequence in reverse.

This is why BRNNs can incorporate future information.

Generating Outputs

In the output layer, the network combines the two hidden state values by computing the weighted sum.

Mathematically, this process is defined:

o_t = W_{ho}^{(f)} h_t^{(f)} + W_{ho}^{(b)} h_t^{(b)} + b_o

where:

  • o_t: The output at time step t,

  • h_t(f), h_t(b): The forward / backward hidden state vector at time step t,

  • W_ho(f)​: Forward-hidden-to-output weight matrix,

  • W_ho(b)​: Backward-hidden-to-output weight matrix, and

  • b_o​: Output layer bias vector.

To simplify the notation, we can concatenate the two hidden states and use a single weight matrix W_ho:

o_t = W_{ho} h_t + b_o \quad \because \; h_t = [h_t^{(f)}; h_t^{(b)}]
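The forward, backward, and output equations can be sketched directly in NumPy. Everything here — the dimensions, the random weights, the variable names — is illustrative only, not taken from the article's repo:

```python
import numpy as np

rng = np.random.default_rng(0)
T, input_size, hidden_size, output_size = 5, 3, 4, 2

# illustrative random parameters for the forward (f) and backward (b) directions
W_hh_f = rng.normal(size=(hidden_size, hidden_size))
W_hh_b = rng.normal(size=(hidden_size, hidden_size))
W_xh_f = rng.normal(size=(hidden_size, input_size))
W_xh_b = rng.normal(size=(hidden_size, input_size))
b_h_f = np.zeros(hidden_size)
b_h_b = np.zeros(hidden_size)
W_ho = rng.normal(size=(output_size, 2 * hidden_size))  # acts on [h_f; h_b]
b_o = np.zeros(output_size)

x = rng.normal(size=(T, input_size))  # input sequence

# forward pass: t = 0 .. T-1
h_f = np.zeros((T, hidden_size))
h_prev = np.zeros(hidden_size)
for t in range(T):
    h_prev = np.tanh(W_hh_f @ h_prev + W_xh_f @ x[t] + b_h_f)
    h_f[t] = h_prev

# backward pass: t = T-1 .. 0, so h_t depends on h_{t+1}
h_b = np.zeros((T, hidden_size))
h_next = np.zeros(hidden_size)
for t in reversed(range(T)):
    h_next = np.tanh(W_hh_b @ h_next + W_xh_b @ x[t] + b_h_b)
    h_b[t] = h_next

# output layer: concatenate both states at each step, then one linear map
h = np.concatenate([h_f, h_b], axis=1)  # (T, 2 * hidden_size)
o = h @ W_ho.T + b_o                    # (T, output_size)
print(o.shape)                          # (5, 2)
```

Note that every output o_t sees the whole sequence: the forward state summarizes x_0..x_t and the backward state summarizes x_t..x_{T-1}.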

A Walkthrough Example

Now, let us see how it works on a simple stock price prediction case (a regression task).

— In the case of a BRNN

Let’s say the hidden state values and weights are given:

  • Forward hidden state at t=3: h3(f)​≈0.95

  • Backward hidden state at t=3: h3(b)​≈0.96

  • Final layer weight: W_o​ = 0.8

  • Final layer bias: b_o​ = 0.5

Step 1. Combine the forward and backward hidden states at t = 3 (with scalar states and a shared output weight, the concatenation reduces to a simple sum):

h_3 = h_3^{(f)} + h_3^{(b)} = 0.95 + 0.96 = 1.91

Step 2. Apply the final linear layer:

y_3 = W_o h_3 + b_o = 0.8 \times 1.91 + 0.5 = 2.028

Final Result

The final predicted stock price for time step t=3 is approximately $2.03.

— In the case of the standard RNN

On the other hand, a standard RNN computes the output from the forward hidden state alone.

Using the same values as the BRNN case, given:

  • Final hidden state at t=3: h3 ​≈ 0.95

  • Final layer weight: W_o ​= 0.8

  • Final layer bias: b_o = 0.5

Step 1. Use the hidden state at t=3 from the forward pass:

h_3 \approx 0.95

Step 2. Apply the final linear layer:

y_3 = W_o h_3 + b_o = 0.8 \times 0.95 + 0.5 = 1.26

Final Result

The final predicted stock price for time step t=3 using the standard RNN is $1.26.

This simple example demonstrates how BRNNs incorporate information from both forward and backward passes.

In a real-world scenario, the hidden states would be vectors, and the final layer would be a more complex matrix multiplication, but the underlying principle remains the same.
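The arithmetic of the two scalar heads above can be checked in a couple of lines of Python:

```python
# BRNN head at t = 3: combine forward/backward scalars, then the linear layer
h3_f, h3_b = 0.95, 0.96
W_o, b_o = 0.8, 0.5

h3 = h3_f + h3_b              # 1.91
y3_brnn = W_o * h3 + b_o      # 0.8 * 1.91 + 0.5
print(round(y3_brnn, 3))      # 2.028

# standard RNN head: forward state only
y3_rnn = W_o * h3_f + b_o     # 0.8 * 0.95 + 0.5
print(round(y3_rnn, 3))       # 1.26
```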

Consideration - Architecture Inheritance and Computational Cost

BRNNs are built on the same principles as their base RNN model, meaning they inherit its challenges.

If the base model is a standard RNN, the BRNN will also suffer from the vanishing or exploding gradient problem, especially with long sequences.

LSTMs and GRUs, on the other hand, were specifically designed to address the gradient problem by leveraging their gate mechanisms.

When used in a bidirectional architecture, they retain this advantage.

So, a Bidirectional LSTM or Bidirectional GRU can effectively process long-term dependencies in both directions.

Learn More: Standard RNN / LSTM / GRU

Another significant consideration is computational cost.

BRNNs run two separate RNNs—one for the forward pass and one for the backward pass.

As shown in Figure A, BRNNs require double the number of parameters and, consequently, double the computational and memory cost compared to a single-direction RNN of the same size.

For tasks with very long sequences, a Bidirectional GRU (BiGRU) is preferred over a Bidirectional LSTM (BiLSTM) because a GRU has fewer parameters than an LSTM, which helps mitigate the increased computational expense.
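Both claims — bidirectionality doubling the parameter count, and a GRU being leaner than an LSTM — can be verified by counting parameters in standard PyTorch layers (the layer sizes here are arbitrary):

```python
import torch.nn as nn

def n_params(module):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 32, 64
lstm   = nn.LSTM(input_size, hidden_size)                      # unidirectional baseline
bilstm = nn.LSTM(input_size, hidden_size, bidirectional=True)
bigru  = nn.GRU(input_size, hidden_size, bidirectional=True)

print(n_params(lstm))    # unidirectional LSTM
print(n_params(bilstm))  # exactly double the unidirectional LSTM
print(n_params(bigru))   # 3/4 of the BiLSTM (3 gate blocks vs the LSTM's 4)
```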

Simulation

Now, I’ll demonstrate how bidirectional approaches work on a standard RNN, an LSTM network, and a GRU, using a weather forecast task.

All code is available in my GitHub repo.

Preparing for Synthetic Data

First, I defined the create_dataset function, which generates 10,000 samples of synthetic weather data:

import torch
import numpy as np

np.random.seed(42)
torch.manual_seed(42)

weather_map = {0: 'Sunny', 1: 'Cloudy', 2: 'Rainy'}
num_classes = len(weather_map)
sequence_length = 10000
raw_data = np.random.randint(0, num_classes, size=sequence_length)

def create_dataset(data, look_back):
    data_x, data_y = [], []
    for i in range(len(data) - look_back):
        seq = data[i:(i + look_back)]
        target = data[i + look_back]
        data_x.append(seq)
        data_y.append(target)
    return np.array(data_x), np.array(data_y)
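The training script later calls a one_hot_encode helper that lives in the repo. A minimal version consistent with how it is used there — turning integer sequences of shape (num_samples, look_back) into a float tensor of shape (num_samples, look_back, num_classes) — is a reasonable assumption:

```python
import numpy as np
import torch

def one_hot_encode(sequences, num_classes):
    """One-hot encode an integer array of shape (num_samples, look_back)
    into a float tensor of shape (num_samples, look_back, num_classes)."""
    sequences = np.asarray(sequences)
    one_hot = np.eye(num_classes, dtype=np.float32)[sequences]
    return torch.from_numpy(one_hot)
```

For example, one_hot_encode(np.array([[0, 2, 1]]), 3) yields a (1, 3, 3) tensor whose second row is [0, 0, 1].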

Initializing Bidirectional RNNs

I built:

  • The BRNN class: A bidirectional version of the standard RNN,

  • The BiLSTM class: A bidirectional version of the standard LSTM, and

  • The BiGRU class: A bidirectional version of the standard GRU,

using the PyTorch library.

Technically, in PyTorch, bidirectional approaches can be implemented by setting the bidirectional parameter to True within an RNN layer's configuration.
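Flipping that flag doubles the feature dimension of the layer's output, since the forward and backward hidden states are concatenated at each time step (the sizes below are arbitrary):

```python
import torch
import torch.nn as nn

rnn_uni = nn.RNN(input_size=5, hidden_size=8)                      # forward only
rnn_bi  = nn.RNN(input_size=5, hidden_size=8, bidirectional=True)  # both directions

x = torch.randn(10, 4, 5)  # (seq_len, batch, input_size); batch_first=False
o_uni, _ = rnn_uni(x)
o_bi, _ = rnn_bi(x)

print(o_uni.shape)  # torch.Size([10, 4, 8])
print(o_bi.shape)   # torch.Size([10, 4, 16]) -- forward/backward states concatenated
```

This is why the final linear layer in the classes below takes num_directions * hidden_size input features.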

All models use a many-to-one architecture, generating a single final output per sequence.

import torch
import torch.nn as nn

# bidirectional rnn (brnn)
class BRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(BRNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_directions = 2

        # turn on the bidirectional parameter in the rnn layer
        self.rnn = nn.RNN(input_size, hidden_size, bidirectional=True, batch_first=False)
        self.fc = nn.Linear(self.num_directions * hidden_size, output_size)

    def forward(self, x):
        h0 = torch.zeros(self.num_directions, x.size(1), self.hidden_size)
        o, hn = self.rnn(x, h0)
        o_final = self.fc(o[-1, :, :])
        return o_final


# bidirectional lstm
class BiLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(BiLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_directions = 2

        # turn on the bidirectional parameter in the rnn layer
        self.lstm = nn.LSTM(input_size, hidden_size, bidirectional=True, batch_first=False)
        self.fc = nn.Linear(self.num_directions * hidden_size, output_size)

    def forward(self, x):
        h0 = torch.zeros(self.num_directions, x.size(1), self.hidden_size)
        c0 = torch.zeros(self.num_directions, x.size(1), self.hidden_size)
        o, (hn, cn) = self.lstm(x, (h0, c0))
        o_final = self.fc(o[-1, :, :])
        return o_final


# bidirectional gru
class BiGRU(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(BiGRU, self).__init__()
        self.hidden_size = hidden_size
        self.num_directions = 2

        # turn on the bidirectional parameter in the rnn layer
        self.gru = nn.GRU(input_size, hidden_size, bidirectional=True, batch_first=False)
        self.fc = nn.Linear(self.num_directions * hidden_size, output_size)

    def forward(self, x):
        h0 = torch.zeros(self.num_directions, x.size(1), self.hidden_size)
        o, hn = self.gru(x, h0)
        o_final = self.fc(o[-1, :, :])
        return o_final

Model Training & Evaluation

Lastly, I trained the models on datasets with look-back lengths of 10, 100, and 1,000:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

epochs = 200
look_back_lengths = [10, 100, 1000]

# the full script loops over look_back_lengths and the three model classes
# (BRNN, BiLSTM, BiGRU); a single configuration is shown here
look_back = look_back_lengths[0]
model_class = BRNN

# create dataset
input_sequences, target_labels = create_dataset(raw_data, look_back)

# convert data to tensors; permute to (seq_len, batch, features) for batch_first=False
input_tensor = one_hot_encode(input_sequences, num_classes)
input_tensor = input_tensor.permute(1, 0, 2)
target_tensor = torch.from_numpy(target_labels).long()

# instantiate a new model, loss function, and optimizer for each configuration
model = model_class(input_size=num_classes, hidden_size=10, output_size=num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# training
for epoch in range(epochs):
    # zero the gradients
    optimizer.zero_grad()

    # forward pass
    outputs = model(input_tensor)
    loss = criterion(outputs, target_tensor)

    # backward pass
    loss.backward()
    optimizer.step()

# perform inference on the most recent window
with torch.inference_mode():
    last_sequence = raw_data[-look_back:]
    input_for_pred = one_hot_encode(np.array([last_sequence]), num_classes)
    input_for_pred = input_for_pred.permute(1, 0, 2)

    prediction_output = model(input_for_pred)
    predicted_class = torch.argmax(prediction_output, dim=1).item()
    predicted_weather = weather_map[predicted_class]

Results

BRNN vs Standard RNN

The BRNN demonstrates moderate advantages over the Standard RNN, especially as the look-back length increases:

Figure B. Loss histories during training (200 epochs) (Created by Kuriko IWAI)


For the short sequence (t = 10, left), both the BRNN and the standard RNN perform well, with stable learning. This indicates that bidirectional approaches offer limited benefit for short sequences with simple models like standard RNNs.

For the middle sequence (t = 100, middle), the standard RNN starts to show the gradient problem, while the BRNN decreases its loss stably. The BRNN's final prediction of "Rainy" seems more credible than the standard RNN's ("Sunny").

For the long sequence (t = 1,000, right), both the standard RNN and the BRNN show slight signs of gradient problems, but both predict "Rainy". The BRNN's loss curve drops below the standard RNN's around the 100-epoch mark and ends with a lower overall loss.

Final Output

Except for t = 100, both models generate the same results:

Figure C. Final predictions (Created by Kuriko IWAI)


BiLSTM vs LSTM

The Bidirectional LSTM (BiLSTM, green line in the graph) consistently outperforms the Standard LSTM (grey line in the graph):

Figure D. Loss histories during training (200 epochs) (Created by Kuriko IWAI)


For the short sequence (t = 10, left), both models perform similarly, with the BiLSTM having a slightly lower final loss (1.0341) compared to the Standard LSTM (1.0365). The loss curves on the graph are nearly identical, suggesting that for short sequences, the additional context provided by the backward pass offers a minimal advantage.

For the middle sequence (t = 100, middle), the BiLSTM demonstrates a clear advantage: the LSTM's loss curve plateaus, indicating it is struggling with the longer sequence, while the BiLSTM continues to show a steady decrease in loss.

For the long sequence (t = 1,000, right), the difference in performance is most pronounced. The BiLSTM achieves a final loss of 1.0538, significantly lower than the LSTM's 1.0830, while the LSTM's loss curve shows signs of stagnation.

Final Output

The models generate completely different results across all three sequences:

Figure E. Final predictions (Created by Kuriko IWAI)


Given the BiLSTM's competitive learning history, its final results seem more credible than those of the standard LSTM.

BiGRU vs GRU

Similar to LSTMs, the Bidirectional GRU (BiGRU, blue line in the graph) consistently outperforms the Standard GRU (grey line in the graph), especially when the sequence length increases:

Figure F. Loss histories during training (Created by Kuriko IWAI)


For the short sequence (t = 10, left), both models perform almost identically. This similarity indicates that the benefits of a bidirectional model are not significant for very short sequences.

For the middle sequence (t = 100, middle), the BiGRU clearly demonstrates its advantage: its loss curve drops more sharply and shows less fluctuation.

For the long sequence (t=1,000, right), the difference in performance is most significant. The Standard GRU's loss curve is quite erratic and struggles to converge as effectively as the BiGRU's.

Final Output

The models show different results for t = 100 and 1,000:

Figure G. Final predictions (Created by Kuriko IWAI)


Similar to the BiLSTM case, the BiGRU’s results seem more reasonable than those of the standard GRU given the stable learning paths.

When to Use BiLSTM, BiGRU, or BRNN

As we saw in the simulation, choosing the right bidirectional architecture—a Bidirectional RNN (BRNN), Bidirectional LSTM (BiLSTM), or Bidirectional GRU (BiGRU)—depends on a careful balance of computational cost, performance, and sequence length.

Here are some common scenarios.

Bidirectional RNN (BRNN)

When to Use

  • Short Sequences: When the sequential data is relatively short (e.g., a few dozen time steps), and the risk of vanishing or exploding gradients is low.

  • Simple Tasks: For tasks where the dependencies are not long-term and a simple model is sufficient. This can be a good starting point for a baseline model.

  • Limited Computational Resources: When you are working on a project with strict memory or processing constraints and cannot afford the higher parameter count of LSTMs or GRUs.

Drawbacks:

  • Gradient Problems: Inherits the vanishing and exploding gradient problems of standard RNNs, making it unsuitable for learning long-term dependencies. Performance degrades significantly on longer sequences.

Bidirectional LSTM (BiLSTM)

When to Use:

  • Long and Complex Sequences: This is the go-to choice for tasks requiring the model to remember information over long distances in both forward and backward directions. LSTMs' "gated" mechanisms are highly effective at preventing gradient issues.

  • High-stakes Applications: When high accuracy is critical, such as in professional-grade machine translation, complex speech recognition, or medical text analysis.

  • Rich Contextual Understanding: For tasks that truly benefit from a deep, bidirectional understanding of the entire sequence, like natural language understanding or protein structure prediction.

Drawbacks:

  • Highest Computational Cost: With a greater number of parameters, BiLSTMs are the most computationally expensive and memory-intensive of the three options. Training and inference can be slow, especially on very large datasets.

Bidirectional GRU (BiGRU)

When to Use:

  • A Balance of Performance and Efficiency: When you need the long-term memory benefits of a BiLSTM but are constrained by computational resources. GRUs have a simpler architecture than LSTMs (two gates instead of three), resulting in fewer parameters.

  • Medium to Long Sequences: They perform very well on long sequences, often rivaling BiLSTMs while being significantly faster to train and use.

  • Mobile and Edge Devices: Due to their efficiency, BiGRUs are often preferred for deployment in resource-limited environments.

Drawbacks:

  • Potentially Lower Accuracy: While often a close second to BiLSTMs, there may be specific, highly complex tasks where the extra parameters of an LSTM provide a marginal but crucial performance edge.

In summary, for any task requiring a comprehensive understanding of sequential data, a bidirectional approach is a strong choice.

Considering the balance of computational resources and performance accuracy:

  • Start with a BiGRU for a good balance of performance and efficiency,

  • When results are not good enough, upgrade to a BiLSTM, and

  • Only consider a BRNN for very short or simple sequences.

Conclusion

Bidirectional approaches excel at incorporating future context into predictions, although their high computational cost can be a challenge.

In our simulations, we found that this approach offered limited advantages for shorter sequences but produced a more competitive and stable learning path for longer sequences.

Bidirectional RNNs inherit both the advantages and disadvantages of the base RNN architecture they build upon.

By understanding these strengths and weaknesses of each RNN architecture, we can optimize their application for specific tasks, leading to better performance and more efficient resource use.

Continue Your Learning

If you enjoyed this blog, these related entries will complete the picture:

Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Practical Time Series Analysis: Prediction with Statistics and Machine Learning

Share What You Learned

Kuriko IWAI, "A Deep Dive into Bidirectional RNNs, LSTMs, and GRUs" in Kernel Labs

https://kuriko-iwai.com/bidirectional-recurrent-neural-network


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.