Deep Recurrent Neural Networks: Engineering Depth for Complex Sequences

Explore how BRNNs handle contextual predictions over sequential data with practical examples

Machine Learning · Data Science · Python

By Kuriko IWAI


Table of Contents

Introduction
What is a Deep Recurrent Neural Network (DRNN)
What is “Depth” in Deep Recurrent Neural Networks
1. Adding Vertical Depth
2. Adding Feedforward Depth
3. Adding Temporal Depth
Practical Construction of Deep Recurrent Neural Networks
Option 1. Single RNN Layer with Deepened I/O and Hidden Functions
Option 2. Stacked RNN with Deepened Output Function
Option 3. Stacked RNN with Deepened Hidden-to-Hidden Function
Option 4. Stacked RNN with Deepened Input and Output Functions
Simulation
Option 1. Single RNN Layer with Deepened I/O and Hidden Functions
Option 2. Stacked RNN with Deepened Output Function
Option 3. Stacked RNN with Deepened Hidden-to-Hidden Function
Option 4. Stacked RNN with Deepened Input and Output Functions
Overall Summary of the Results
Wrapping Up

Introduction

Deep Recurrent Neural Networks (DRNNs) are a powerful evolution of the standard Recurrent Neural Network (RNN) architecture, specifically designed to tackle complex sequential data.

By adding depth to the traditional RNN structure, DRNNs can capture more intricate dependencies, making them highly effective in a wide variety of applications, including Natural Language Processing (NLP), speech recognition, and video captioning.

In this article, I’ll explore the primary methods for adding “depth” to an RNN, examining the architectural choices involved and their impact on model performance.

What is a Deep Recurrent Neural Network (DRNN)

A Deep Recurrent Neural Network (DRNN) is an extension of the standard Recurrent Neural Network (RNN) that incorporates multiple hidden layers, making it "deep."

RNNs are specifically designed to process sequential data by maintaining a memory (also called the hidden state) of past inputs.

The diagram below shows a simplified standard RNN architecture, which processes an input sequence of x's, memorizes them using hidden states (h's), and produces an output sequence of o's:

Figure A. Simplified standard RNN architecture (Created by Kuriko IWAI)

Kernel Labs | Kuriko IWAI | kuriko-iwai.com


Inheriting this architectural design, DRNNs add depth to the architecture to learn more complex and abstract patterns in sequential data.

Its major applications include:

  • Natural Language Processing (NLP): Machine translation, sentiment analysis, text generation, speech recognition,

  • Time Series Prediction: Stock market forecasting, weather prediction, and

  • Video Analysis: Activity recognition, video captioning.


What is “Depth” in Deep Recurrent Neural Networks

Inheriting the core RNN architectural design, the depth of a DRNN falls into three distinct categories:

  1. Vertical Depth: Information flows vertically through the stacked layers, allowing the model to learn a hierarchy of abstractions. This is the most common, straightforward approach.

  2. Feedforward Depth (Deep I/O Functions): The hierarchy is in the multi-layered network that processes the input or output, not in the recurrent layers.

  3. Temporal Depth (Deep Transition): Information flows horizontally through time within each layer, allowing the model to learn dependencies across the sequence.

Figure B. Types of depths in a DRNN architecture (Created by Kuriko IWAI)


1. Adding Vertical Depth

This is the most common method to build a DRNN, which consists of multiple, independent RNN layers stacked vertically (also called Multilayer RNNs).

Each layer processes a progressively more abstract representation of the sequence, where the hidden state of layer n at time step t becomes the input to layer n+1 at the same time step t.

Figure C. Adding vertical depth to an RNN architecture (Created by Kuriko IWAI)

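
This vertical stacking is what PyTorch's `num_layers` argument implements directly. A minimal sketch (the layer sizes here are illustrative, not from the original article):

```python
import torch
import torch.nn as nn

# two vertically stacked RNN layers: the hidden state of layer 1 at each
# time step t is fed as the input to layer 2 at the same time step t
stacked_rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=2, batch_first=True)

x = torch.randn(4, 10, 8)        # (batch, sequence_length, features)
out, h_n = stacked_rnn(x)

print(out.shape)   # (4, 10, 16): hidden states of the TOP layer at every time step
print(h_n.shape)   # (2, 4, 16): final hidden state of EACH stacked layer
```

Note that `out` only exposes the top layer's states; the intermediate layer's states are internal to the stack.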

2. Adding Feedforward Depth

Instead of adding vertical depth, this approach adds depth to the input/output layer by using a multi-layered feedforward network as the input-to-hidden or hidden-to-output functions.

This hierarchy is non-recurrent and learns a multi-layered representation before or after the RNN.

Figure D. Adding feedforward depth to an RNN architecture (Created by Kuriko IWAI)


2-1. Depth to the Input-to-Hidden Function

This method leverages a multi-layered feedforward neural network to map inputs to the hidden layers, instead of using a single input layer.

This is also known as hierarchical feature learning, where the model learns a hierarchy of features from the raw input to more abstract, high-level representations.

The diagram below shows major applications of hierarchical feature learning. In the case of RNNs, this can be an NLP task where a sentence (raw input) is processed from word embeddings (low-level features) to contextual sentence meanings (high-level features).

Figure E. Examples of hierarchical feature learning (Created by Kuriko IWAI based on source)

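
As a hedged sketch of this idea for text: raw token IDs could pass through an embedding layer (low-level word features) and then a feedforward network (higher-level features) before reaching the recurrent layer. All names and sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

# hypothetical sizes for illustration only
vocab_size, embed_dim, feature_dim = 100, 32, 64

embedding = nn.Embedding(vocab_size, embed_dim)   # low-level word features
feature_net = nn.Sequential(                      # higher-level features
    nn.Linear(embed_dim, feature_dim),
    nn.ReLU(),
    nn.Linear(feature_dim, feature_dim),
)

token_ids = torch.randint(0, vocab_size, (4, 12))  # (batch, sequence_length)
low_level = embedding(token_ids)                   # (4, 12, 32)
high_level = feature_net(low_level)                # (4, 12, 64), fed to the RNN next
```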

2-2. Deepening the Hidden-to-Output Function

Similar to the input function, this method uses a multi-layered feedforward network to map the hidden state to the output.

This can be particularly useful for disentangling the different factors of the hidden state, making the final prediction more sophisticated.

3. Adding Temporal Depth

This method replaces a single hidden-to-hidden layer with a multi-layered feedforward neural network to capture more complex temporal dynamics.

This does not increase the number of hidden layers, yet it enables the model to learn more complex patterns at each time step using feedforward networks, thereby increasing the temporal depth.

Figure F. Adding temporal depth to an RNN architecture (Created by Kuriko IWAI)

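
A single time step of this deep transition can be sketched by replacing the usual one-layer update with a small feedforward network over the concatenated input and previous hidden state (sizes are illustrative):

```python
import torch
import torch.nn as nn

input_size, hidden_size = 8, 16

# standard transition: one linear map over [x_t; h_{t-1}]
standard = nn.Linear(input_size + hidden_size, hidden_size)

# deep transition: a multi-layered feedforward network over the same input
deep = nn.Sequential(
    nn.Linear(input_size + hidden_size, hidden_size * 2),
    nn.ReLU(),
    nn.Linear(hidden_size * 2, hidden_size),
)

x_t = torch.randn(4, input_size)
h_prev = torch.zeros(4, hidden_size)
combined = torch.cat((x_t, h_prev), dim=1)

h_standard = torch.tanh(standard(combined))  # (4, 16): one affine map + tanh
h_deep = torch.tanh(deep(combined))          # (4, 16): a more non-linear update
```

Both updates produce a hidden state of the same shape; only the transition function between steps becomes deeper.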

These depths are added individually or in combination. In the next section, I’ll demonstrate common choices.

Practical Construction of Deep Recurrent Neural Networks

In practice, manually designing the architecture of a DRNN requires expert knowledge because the optimal DRNN architecture varies substantially for different tasks and data types.

Though this approach moves beyond a one-size-fits-all solution, in this section I’ll cover four principal architectural choices that combine the different types of depth discussed earlier, along with their major applications, in order of complexity from simple to complex.

Figure G. Four principal architectural choices for adding depth to a DRNN (Created by Kuriko IWAI)


Option 1. Single RNN Layer with Deepened I/O and Hidden Functions

This architecture makes a single RNN layer deep in all three of its primary functions, creating a very wide and powerful model without stacking.

Figure H. Single RNN layer with deepened I/O and hidden functions (Created by Kuriko IWAI)


Types of Depth: Feedforward Depths + Temporal Depth

Best When:

  • Input features and/or outputs are complex, but the temporal dependency is short, so the primary challenge is modeling the complex relationships within each time step rather than across long periods.

  • Computational resources are constrained, such as in embedded systems or on-device inference.

Limitations:

  • Lack of a hierarchical structure. Unlike a stacked model, it cannot learn different levels of abstraction from the data, making it less effective at capturing very long-term dependencies.

Use Cases:

  • Gesture Recognition on Wearables: Analyzing a short sequence of accelerometer data from a smartwatch to detect a specific gesture.

  • Small-Scale Voice Assistants: Implementing a simple command-and-control voice model on a low-power chip, where the input audio signal is processed deeply, and the single layer makes a quick, accurate classification to trigger a command.

Option 2. Stacked RNN with Deepened Output Function

This architecture features multiple stacked RNN layers followed by a multi-layered feedforward network that maps the final hidden state to the output.

Figure I. Stacked RNN with deepened output function (Created by Kuriko IWAI)


Types of Depth: Feedforward Depths + Vertical Depth

Best When:

  • The core sequential dependencies can be effectively modeled by the stacked RNN, but the final prediction requires a more powerful and non-linear mapping.

  • You want the most common and versatile deep RNN architecture.

Limitations:

  • A deepened output function cannot compensate for the stacked RNN's failure to learn long-term relationships in the data.

  • Adding a significant number of parameters to the output layer increases the risk of overfitting.

Use Cases:

  • Sentiment Analysis: The stacked RNN processes the social media post, then the output layer makes the final positive, negative, or neutral classification.

  • Machine Translation: The stacked RNN acts as an encoder to understand the full meaning of a source sentence, and a decoder with a deep output function generates the target sentence word by word.

Option 3. Stacked RNN with Deepened Hidden-to-Hidden Function

In this architecture, a multi-layered feedforward network is used for the recurrent transition, with the output of one RNN layer serving as the input to the next.

Figure J. Stacked RNN with deepened hidden-to-hidden function (Created by Kuriko IWAI)


Types of Depth: Temporal Depths + Vertical Depth

Best When:

  • Ideal choice for tasks that require capturing extremely complex and highly non-linear temporal dynamics.

Limitations:

  • High computational cost due to the deep transition function applied at every single time step.

  • More prone to vanishing or exploding gradients due to the increased depth in the recurrent loop.

Use Cases:

  • Financial Market Forecasting: Predicting subtle, non-linear price movements in a long time series of stock data.

  • Speech Recognition: Modeling the complex co-articulation patterns in human speech, where the transition from one phoneme to the next is highly dependent on a series of past and present sounds.

Option 4. Stacked RNN with Deepened Input and Output Functions

This is a powerful but computationally intensive architecture that combines a deep feedforward network to preprocess the input, followed by stacked RNN layers, and another deep network for the output.

Figure J. Stacked RNN with deepened I/O functions (Created by Kuriko IWAI)


Types of Depth: Feedforward Depths + Vertical Depth

Best When:

  • The input data is high-dimensional and complex, and the final prediction is also intricate.

  • Commonly used in multi-modal learning.

Limitations:

  • Most computationally expensive of the architectures due to the combination of multiple deep components.

  • Requires a very large amount of data to train effectively.

  • Highly prone to overfitting without proper regularization.

Use Cases:

  • Video Captioning: The deep input function (a Convolutional Neural Network (CNN)) extracts rich features from each video frame, while the stacked RNN processes the sequence of frames, and the deep output function generates a coherent, grammatically correct sentence describing the video's content.

  • Complex Control Systems in Robotics: The deep input function processes raw sensor data from cameras, a stacked RNN models the sequence of these readings, and a deep output function generates a precise sequence of motor control commands.

Simulation

Now, I’ll build the four DRNNs to compare their performance on a regression task.

For simplicity, I assume a many-to-one architecture with a standard RNN as the base model.

Option 1. Single RNN Layer with Deepened I/O and Hidden Functions

First, I defined a helper MLP class to deepen functions in the primary class.

Then I defined the DRNN_SingleLayer class that handles predictions.

import torch
import torch.nn as nn

# helper MLP class to deepen a function
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MLP, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size)
        )

    def forward(self, x):
        return self.net(x)


# primary class
class DRNN_SingleLayer(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(DRNN_SingleLayer, self).__init__()
        self.hidden_size = hidden_size

        # deepened input-to-hidden / hidden-to-hidden transition function
        self.deep_input = MLP(input_size + hidden_size, hidden_size * 2, hidden_size)

        # deepened hidden-to-output function
        self.deep_output = MLP(hidden_size, hidden_size * 2, output_size)

    def forward(self, x):
        # ensure the input shape is (batch_size, sequence_length, feature_size)
        if x.dim() == 2:
            x = x.unsqueeze(-1)

        batch_size = x.size(0)
        sequence_length = x.size(1)

        # initialize the hidden state on the same device as the input tensor
        h_t = torch.zeros(batch_size, self.hidden_size).to(x.device)

        for t in range(sequence_length):
            # concatenate the current input and the previous hidden state
            combined_input = torch.cat((x[:, t, :], h_t), dim=1)

            # deepened transition: the helper MLP computes the next hidden state
            h_t = torch.tanh(self.deep_input(combined_input))

        # pass the final hidden state through the deep output function
        prediction = self.deep_output(h_t)
        return prediction

Training and Validation

I created synthetic datasets with 1,000 sequences and trained the model over 50 epochs.

(Using an LSTM or GRU as a base model, we can increase the number of sequences.)
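
The data-generation code isn't shown in the original, so here is one hedged way to create comparable synthetic regression data, along with the constants the training calls below assume. The noisy-sine next-step task and all sizes are my assumptions, not the article's:

```python
import torch

# hypothetical hyperparameters assumed by the training calls
INPUT_SIZE, HIDDEN_SIZE, OUTPUT_SIZE, NUM_LAYERS = 1, 32, 1, 2
NUM_SEQUENCES, SEQ_LEN = 1000, 20

# synthetic many-to-one regression task: predict the next value of a noisy sine wave
t0 = torch.rand(NUM_SEQUENCES, 1) * 6.28                 # random phase per sequence
steps = torch.arange(SEQ_LEN + 1) * 0.1                  # time grid
series = torch.sin(t0 + steps) + 0.05 * torch.randn(NUM_SEQUENCES, SEQ_LEN + 1)

X = series[:, :SEQ_LEN].unsqueeze(-1)   # (1000, 20, 1) input sequences
Y = series[:, SEQ_LEN:]                 # (1000, 1) next-step targets

# 80/20 train-test split
split = int(0.8 * NUM_SEQUENCES)
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]
```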

import torch
import torch.nn as nn
import torch.optim as optim

def train_model(model, X_train, Y_train, X_test, Y_test, model_name, epochs=50):
    # check for a GPU and move the model and data to it
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    X_train, Y_train = X_train.to(device), Y_train.to(device)
    X_test, Y_test = X_test.to(device), Y_test.to(device)

    # define the loss function and optimizer
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    print(f"\n--- Training {model_name} ---")
    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        outputs = model(X_train)
        loss = criterion(outputs, Y_train)
        loss.backward()

        # gradient clipping to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')

    # evaluation on the test data
    model.eval()
    with torch.no_grad():
        test_predictions = model(X_test)
        test_loss = criterion(test_predictions, Y_test)
        print(f"Test Loss: {test_loss.item():.4f}")


model = DRNN_SingleLayer(INPUT_SIZE, HIDDEN_SIZE, OUTPUT_SIZE)
train_model(model, X_train, Y_train, X_test, Y_test, "Single RNN with Deep I/O/H2H Functions")

Performance:

The model generated a generalization error (MSE over the test dataset) of 0.0112 after 50 epochs:


Figure K-1. Generalization performance on synthetic data (Created by Kuriko IWAI)

I’ll use this loss value as a performance benchmark.

Option 2. Stacked RNN with Deepened Output Function

In this architecture, I built a standard stacked RNN with a deepened output function using the same helper MLP class:

import torch
import torch.nn as nn

class DRNN_Stacked_O(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(DRNN_Stacked_O, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # standard stacked RNN
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)

        # deepened output using the helper MLP class
        self.deep_output = MLP(hidden_size, hidden_size * 2, output_size)

    def forward(self, x):
        # initialize the hidden state on the same device as the input tensor
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)

        # pass the input through the stacked RNN
        out, _ = self.rnn(x, h0)

        # take the final hidden state (assuming a many-to-one architecture)
        final_hidden_state = out[:, -1, :]

        # pass the final hidden state through the deep output function
        prediction = self.deep_output(final_hidden_state)
        return prediction

Performance:

Using the same train_model function and hyperparameters, the model achieved a lower generalization error than the benchmark:

model = DRNN_Stacked_O(INPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS, OUTPUT_SIZE)
train_model(model, X_train, Y_train, X_test, Y_test, "Stacked RNN with Deepened Output")

Test Loss (MSE): 0.0032

Figure K-2. Generalization performance on synthetic data (Created by Kuriko IWAI)


Option 3. Stacked RNN with Deepened Hidden-to-Hidden Function

To build this architecture, I first defined the DRNNCell class, a custom DRNN cell that deepens the hidden-to-hidden function using the same helper MLP class.

The DRNNCell class is called during the forward pass to compute the output of the current layer.

import torch
import torch.nn as nn

# define a custom DRNN cell first
class DRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(DRNNCell, self).__init__()

        # deep hidden-to-hidden transition function: takes the previous hidden
        # state and the input, and outputs the next hidden state
        self.deep_h2h = MLP(input_size + hidden_size, hidden_size * 2, hidden_size)

    def forward(self, input_tensor, h_prev):
        combined = torch.cat((input_tensor, h_prev), dim=1)
        h_next = torch.tanh(self.deep_h2h(combined))
        return h_next


class DRNN_Stacked_H2H(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(DRNN_Stacked_H2H, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # create a list of deep-transition RNN cells for stacking
        self.cells = nn.ModuleList([
            DRNNCell(input_size if i == 0 else hidden_size, hidden_size)
            for i in range(num_layers)
        ])

        # standard output function (a single linear layer)
        self.output_layer = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        batch_size = x.size(0)
        sequence_length = x.size(1)

        # initialize hidden states for all layers on the same device as the input
        h_all_layers = [torch.zeros(batch_size, self.hidden_size).to(x.device) for _ in range(self.num_layers)]

        for t in range(sequence_length):
            # the input to the first layer is the sequence element
            input_t = x[:, t, :]

            # the output of the current layer becomes the input to the next layer
            for i in range(self.num_layers):
                h_all_layers[i] = self.cells[i](input_t, h_all_layers[i])
                input_t = h_all_layers[i]

        # use the final hidden state of the top layer for the prediction
        final_hidden_state = h_all_layers[-1]
        prediction = self.output_layer(final_hidden_state)
        return prediction

Performance:

Using the same synthetic data and hyperparameters, the model outperformed the benchmark, yet underperformed its counterpart, the stacked RNN with a deepened output (Option 2).

model = DRNN_Stacked_H2H(INPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS, OUTPUT_SIZE)
train_model(model, X_train, Y_train, X_test, Y_test, "Stacked RNN with Deep Hidden-to-Hidden")

Test Loss (MSE): 0.0110

Figure K-3. Generalization performance on synthetic data (Created by Kuriko IWAI)


Option 4. Stacked RNN with Deepened Input and Output Functions

Similar to Option 2, this architecture uses the helper MLP class, but deepens both the input and output functions.

import torch
import torch.nn as nn

class DRNN_Stacked_IO(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(DRNN_Stacked_IO, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # deepened input function using the helper MLP class
        self.deep_input = MLP(input_size, hidden_size, hidden_size)

        # standard stacked RNN to process the deepened input
        self.rnn = nn.RNN(hidden_size, hidden_size, num_layers, batch_first=True)

        # deepened output function using the helper MLP class
        self.deep_output = MLP(hidden_size, hidden_size * 2, output_size)

    def forward(self, x):
        # process each element of the sequence through the deep input function
        deep_inputs = torch.stack([self.deep_input(x[:, t, :]) for t in range(x.size(1))], dim=1)

        # initialize the hidden state on the same device as the input tensor
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)

        # pass the deep inputs through the stacked RNN
        out, _ = self.rnn(deep_inputs, h0)

        # prediction (assuming a many-to-one architecture)
        final_hidden_state = out[:, -1, :]
        prediction = self.deep_output(final_hidden_state)
        return prediction

Performance:

Although it achieved high accuracy, the model underperformed its counterparts in Options 2 and 3.

model = DRNN_Stacked_IO(INPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS, OUTPUT_SIZE)
train_model(model, X_train, Y_train, X_test, Y_test, "Stacked RNN with Deep Input and Output")

Test Loss (MSE): 0.0116

Figure K-4. Generalization performance on synthetic data (Created by Kuriko IWAI)


Overall Summary of the Results

The experimental results highlight that while deepening various components of an RNN can be beneficial, not all architectural choices yield the same performance gains.

In this task, the most successful approach was Option 2 (Stacked RNN with Deepened Output Function).

This model's superior performance (a 71% reduction in MSE from the benchmark) suggests that the most effective use of deep functions in the task is to focus them on processing the final, consolidated hidden state produced by a standard stacked RNN.

Option 3 improved only marginally on the benchmark (0.0110 vs. 0.0112), while Option 4 slightly underperformed it (0.0116).

The deep hidden-to-hidden transitions in Option 3 and the deep input function in Option 4 provided little benefit compared to the simple, yet effective, stacking of standard RNN layers followed by a deep output.

Wrapping Up

Adding depth to an RNN architecture is a powerful strategy for building robust models that can effectively handle complex sequential data.

In our experiments, we observed that deepening the input, hidden, or output functions can significantly alter performance.

In practice, this deepening principle is applicable across the entire RNN family, including more advanced architectures like Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs).

Here are some use cases that highlight when to prefer specific deepened architectures:

Deepened LSTMs are best when:

  • The sequential data exhibits long-range dependencies, such as in natural language processing tasks like machine translation or text summarization.

  • The data involves complex, time-sensitive patterns, like in financial time series forecasting or medical signal processing (e.g., EKG analysis).
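
As a quick sketch of the same deepening principle applied to an LSTM base, the Option 2-style architecture (stacked recurrent layers plus a deep output head) might look like this; the layer sizes and the two-layer head are illustrative, not from the experiments above:

```python
import torch
import torch.nn as nn

class DeepOutputLSTM(nn.Module):
    """Stacked LSTM with a deepened hidden-to-output function (illustrative sketch)."""
    def __init__(self, input_size=1, hidden_size=32, num_layers=2, output_size=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.deep_output = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 2),
            nn.ReLU(),
            nn.Linear(hidden_size * 2, output_size),
        )

    def forward(self, x):
        out, _ = self.lstm(x)                   # LSTM initializes its own states to zero
        return self.deep_output(out[:, -1, :])  # many-to-one prediction

model = DeepOutputLSTM()
pred = model(torch.randn(4, 20, 1))  # (batch, seq_len, features) -> (4, 1)
```

Swapping `nn.LSTM` for `nn.GRU` here yields the deepened GRU variant with no other changes.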

Deepened GRUs are best when:

  • You need a balance between performance and computational efficiency.

  • The model needs to be deployed on edge devices or in real-time applications where low latency is critical.

Standard RNNs (deepened) are best when:

  • The sequential data has short-term dependencies and is relatively straightforward, such as in simple character-level prediction or some basic time series tasks.

  • Computational resources are extremely limited, and a simple, effective model is all that is required.

By carefully selecting the appropriate base architecture and deepening strategy, we can construct highly effective and robust models capable of tackling the most challenging sequential data problems.

Continue Your Learning

If you enjoyed this blog, these related entries will complete the picture:

Related Books for Further Understanding

These books cover a wide range of theories and practices, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Practical Time Series Analysis: Prediction with Statistics and Machine Learning

Share What You Learned

Kuriko IWAI, "Deep Recurrent Neural Networks: Engineering Depth for Complex Sequences" in Kernel Labs

https://kuriko-iwai.com/constructing-deep-recurrent-neural-networks


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.