Understanding GRU Architecture and the Power of Path Signatures

An in-depth exploration of how GRUs address the gradient problem and the heavy computational overhead of modeling long sequences

Deep LearningData SciencePython

By Kuriko IWAI


Table of Contents

Introduction
What Are Gated Recurrent Units (GRUs)?
Why GRUs - A Hybrid Solution to Computational Cost and Gradient Problems
How GRUs Tackle Vanishing Gradient Problems
The Update Gate
The Reset Gate - Combining Input and Forget Gates
Computing Gradients during BPTT
Saving Computational Cost
Advanced Architecture - Signature GRU (SigGRU)
Simulation
Preparing for Synthetic Data
Initializing Models
Model Training & Evaluation
Results
Loss History (Learning Capability)
Wrapping Up

Introduction

Recurrent Neural Networks (RNNs) are widely used in processing sequential data, where the order of information matters.

Gated Recurrent Units (GRUs) are an RNN architecture developed to tackle the gradient problems of standard RNNs and the high computational cost of LSTMs.

In this article, I'll explore their mechanisms for addressing the gradient problem while maintaining computational efficiency, using a weather-forecasting simulation.

What Are Gated Recurrent Units (GRUs)?

Gated Recurrent Units (GRUs) are a type of RNN architecture designed to reduce computational cost while addressing the vanishing gradient problem.

Unlike LSTMs, which use input, forget, and output gates to avoid gradient problems, GRUs leverage two gates - an update gate and a reset gate - in the hidden layer, where each gate controls how much information is retained when updating the hidden state.

The diagram below showcases the simplicity of the GRU architecture (middle) compared with two other major RNNs: the standard RNN (left) and the LSTM (right):

Figure A. Comparison of RNN architectures (Created by Kuriko IWAI)


The bottom row in the figure lists the model parameters (weights) of each architecture.

GRUs have more parameters than standard RNNs, yet fewer than LSTMs, because GRUs use only two gates instead of three.

Also, GRUs use the hidden state value directly as the output at each time step, which further streamlines the number of model parameters.

In the next section, I’ll detail how GRUs tackle the vanishing gradient problem.

Why GRUs - A Hybrid Solution to Computational Cost and Gradient Problems

Gradient problems are a common challenge in standard RNNs, where the gradient shrinks toward zero or explodes toward infinity during backpropagation through time (BPTT), preventing the model from converging.

In standard RNNs, the same hidden-to-hidden weight matrix (W_hh​) is applied at every time step to update the hidden state.

When the model processes a long sequence, this creates a recursive chain of matrix multiplications:

Figure B. The multiplicative chain of hidden state updates during the forward pass of a standard RNN (W_hh: hidden-to-hidden weight matrix, h_t: hidden state at time step t, x_t: input at time step t, o_t: output at time step t) (Created by Kuriko IWAI)

During BPTT, the gradients are multiplied along this same chain.

Each of these multiplications shrinks or grows the gradient exponentially: when the largest eigenvalue of the W_hh matrix is less than one the gradient vanishes, and when it is greater than one the gradient explodes, often within just a few time steps.

LSTMs were engineered to tackle this challenge by introducing gate mechanisms and a cell state, but their computational cost is up to four times that of a standard RNN due to the larger number of model parameters, as shown in Figure A.

GRUs were developed as a hybrid solution to the gradient problems of standard RNNs and the expensive computation of LSTMs.

How GRUs Tackle Vanishing Gradient Problems

Just like standard RNNs, GRUs compute gradients at the hidden state, but they avoid vanishing gradients by using gates that regulate the flow of information.

The diagram below shows how the gates contribute to the hidden state update:

Figure C. GRU architecture (Created by Kuriko IWAI)


In the diagram, the hidden state at time step t = 1 (h1) is computed as a linear interpolation of the previous hidden state h0 and the candidate hidden state value ~h1 (green circle), weighted by the update gate value z1 (blue circle).

Mathematically, the hidden state is generalized using time step t:

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

where:

  • h_t, h_{t-1}: The hidden state value at time step t, t-1,

  • ~h_t: The candidate hidden state value at time step t, and

  • z_t: the update gate value at time step t.
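As a quick numerical sketch of this interpolation (with made-up gate and state values), each hidden unit blends its previous value and its candidate value elementwise:

```python
import numpy as np

# made-up values for a 3-unit hidden state
h_prev = np.array([0.5, -0.2, 0.8])   # h_{t-1}
h_cand = np.array([0.9,  0.1, -0.4])  # candidate state ~h_t
z = np.array([1.0, 0.0, 0.5])         # update gate values per unit

# h_t = (1 - z_t) * h_{t-1} + z_t * ~h_t, applied elementwise
h_t = (1 - z) * h_prev + z * h_cand

print(h_t)  # [ 0.9 -0.2  0.2]: unit 0 copies the candidate, unit 1 keeps h_prev, unit 2 blends
```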

The Update Gate

The update gate value plays the key role, acting as a memory controller to decide how much information should be carried over to the next hidden state.

Mathematically, the update gate value is computed by taking a weighted sum of the current input and the previous hidden state value:

z_t = \sigma(W_{zx} x_t + W_{zh} h_{t-1} + b_z)

where:

  • z_t: The update gate value at time step t,

  • σ: The activation function of the update gate (typically the sigmoid),

  • W_zx: The weight matrix from the input to the update gate,

  • x_t: The input at time step t,

  • W_zh: The weight matrix from the hidden state to the update gate, and

  • b_z: The bias term.

z_t = 1 indicates no information from the previous hidden state (h_{t−1}​) should be carried over to the current hidden state (h_t​).

z_t = 0 indicates all information from the previous hidden state (h_{t−1}​) should be carried over to the current hidden state (h_t​).

Then, the candidate hidden state (green circle) controls the influence of the current input and the previous hidden state:

\tilde{h}_t = \sigma(W_{hx} x_t + W_{h}(r_t \odot h_{t-1}) + b_h)

where:

  • ~h_t: The candidate hidden state value at time step t,

  • σ: The activation function at the candidate hidden state (commonly tanh),

  • W_hx: The weight matrix from the input to the candidate hidden state,

  • x_t: The input at time step t,

  • W_h: The weight matrix from the hidden state to the candidate hidden state,

  • r_t: The reset gate value at time step t, and

  • b_h: The bias term.

The Reset Gate - Combining Input and Forget Gates

The other gate of GRUs, the reset gate r_t (orange circle) controls how much information from the previous hidden state and the current input should be taken into the candidate hidden state value.

Its concept is to combine the input and forget gates of LSTMs into a single gate to streamline the computational process.

Mathematically, the reset gate value is computed as:

r_t = \sigma(W_{rx} x_t + W_{rh} h_{t-1} + b_r)

where:

  • r_t: The reset gate value at time step t,

  • σ: The activation function at the reset gate,

  • W_rx: The weight matrix from the input to the reset gate,

  • x_t: The input at time step t,

  • W_rh: The weight matrix from the hidden state to the reset gate, and

  • b_r: The bias term.

A reset gate value close to 0 means the candidate state largely disregards the past hidden state.
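Putting the two gates and the candidate state together, a single GRU cell step can be sketched in NumPy; the weights here are random placeholders (in a real model they are learned), and tanh is used for the candidate activation, the common choice:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3

# placeholder weights (learned during training in a real model)
W_zx, W_zh, b_z = rng.normal(size=(hidden_size, input_size)), rng.normal(size=(hidden_size, hidden_size)), np.zeros(hidden_size)
W_rx, W_rh, b_r = rng.normal(size=(hidden_size, input_size)), rng.normal(size=(hidden_size, hidden_size)), np.zeros(hidden_size)
W_hx, W_h,  b_h = rng.normal(size=(hidden_size, input_size)), rng.normal(size=(hidden_size, hidden_size)), np.zeros(hidden_size)

def gru_step(x_t, h_prev):
    z_t = sigmoid(W_zx @ x_t + W_zh @ h_prev + b_z)            # update gate
    r_t = sigmoid(W_rx @ x_t + W_rh @ h_prev + b_r)            # reset gate
    h_cand = np.tanh(W_hx @ x_t + W_h @ (r_t * h_prev) + b_h)  # candidate state
    return (1 - z_t) * h_prev + z_t * h_cand                   # linear interpolation

# run 5 time steps from a zero hidden state
h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):
    h = gru_step(x, h)

print(h.shape)  # (3,) - hidden state stays bounded in (-1, 1)
```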

Computing Gradients during BPTT

After completing this forward pass, GRUs perform BPTT, where the gradient of the loss function is computed by tracing back through the time steps.

Similar to standard RNNs, the gradient is the sum of contributions from all previous time steps:

\frac{\partial L}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \cdot \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \cdot \frac{\partial h_k}{\partial W}

where:

  • L: the loss function,

  • W: the weight matrix, and

  • h_t: the hidden state at time step t.

In standard RNNs, the partial derivative of the hidden state ​∂h_j / ∂h_{j-1} is given by:

\frac{\partial h_j}{\partial h_{j-1}} = \text{diag}\left(1 - \tanh^2(W_{hh} h_{j-1} + W_{xh} x_j)\right) \cdot W_{hh}

The gradient problem emerges in long sequences: the diagonal term is at most one and typically shrinks the product, while repeated multiplication by W_hh either vanishes or explodes, driving the overall gradient toward zero or infinity.

On the other hand, the derivative of the hidden state in a GRU is given by:

\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(1 - z_t) + z_t \odot \frac{\partial \tilde{h}_t}{\partial h_{t-1}}

where:

  • h_t, h_{t-1}: The hidden state value at time step t, t-1,

  • ~h_t: The candidate hidden state value at time step t, and

  • z_t: the update gate value at time step t.

When the GRU chooses an update gate value close to zero, the term (1 − z_t) is close to one.

When this happens, the gradient can be passed back with almost no decay, effectively creating a highway for the error signal to travel back in time.

Let us see how it happens in a simple walkthrough example.

The Walkthrough Example

Assume t = 3, z_2 = 0.1, and z_3 = 0.1.

  • Step 1: Gradient from t=3 to t=2: the key derivative is ∂h_3 / ∂h_2 ≈ 1 − z_3 = 1 − 0.1 = 0.9

  • Step 2: Gradient from t=2 to t=1: the key derivative is ∂h_2 / ∂h_1 ≈ 1 − z_2 = 1 − 0.1 = 0.9

  • Step 3: Total gradient from t=3 to t=1: ∂h_3 / ∂h_2 ⋅ ∂h_2 / ∂h_1 ≈ 0.9 ⋅ 0.9 = 0.81

This ability to choose the gate values is the core mathematical reason GRUs can learn long-term dependencies without suffering from vanishing gradients.

On the other hand, for a standard RNN, let us assume:

  • The weight W_hh = 0.5, and

  • The derivative of the activation function tanh′(h_{j−1}​) ≈ 0.4.

The gradient computation for a standard RNN then flows:

  • Step 1: Gradient from t=3 to t=2: ∂h_3 / ∂h_2 = 0.4 ⋅ 0.5 = 0.2

  • Step 2: Gradient from t=2 to t=1: ∂h_2 / ∂h_1 = 0.4 ⋅ 0.5 = 0.2

  • Step 3: Total gradient from t=3 to t=1: ∂h_3 / ∂h_2 ⋅ ∂h_2 / ∂h_1 ≈ 0.2 ⋅ 0.2 = 0.04

We can see that the shared weight W_hh drives the gradient toward zero in just two time steps.
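The same per-step factors compound over longer horizons; a short sketch shows how quickly the surviving gradient diverges between the two architectures:

```python
# per-step backward factors from the walkthrough
gru_factor = 0.9  # ≈ (1 - z_t) with z_t = 0.1
rnn_factor = 0.2  # ≈ tanh'(·) * W_hh = 0.4 * 0.5

# surviving gradient after k steps is the per-step factor to the k-th power
for k in [2, 10, 50]:
    print(f"k={k:>2}: GRU ≈ {gru_factor**k:.2e}, RNN ≈ {rnn_factor**k:.2e}")
```

At k = 2 this reproduces the 0.81 vs. 0.04 figures above; by k = 50 the standard RNN's gradient is numerically negligible while the GRU's remains usable.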

Saving Computational Cost

As we saw, the two-gate design of GRUs reduces both the number of activation functions required and the number of model parameters to optimize compared to LSTMs.

On top of that, GRUs use the hidden state value directly as the output at each time step:

o_t = h_t

where o_t is an output value at time step t.

This further reduces the number of parameters and computations.

As a result, a GRU's computational cost is roughly 75% of an LSTM's.
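This figure can be sanity-checked by counting parameters directly: for the same input and hidden sizes, PyTorch's recurrent layers differ only in their gate count (1 for a vanilla RNN, 3 for a GRU, 4 for an LSTM):

```python
import torch.nn as nn

def n_params(module):
    # total number of trainable parameters in a module
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 4, 10
rnn  = nn.RNN(input_size, hidden_size)
gru  = nn.GRU(input_size, hidden_size)
lstm = nn.LSTM(input_size, hidden_size)

print(n_params(rnn), n_params(gru), n_params(lstm))  # 160 480 640
print(n_params(gru) / n_params(lstm))                # 0.75 = 3 gates / 4 gates
```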

With this understanding of the GRU mechanism, I'll explore an advanced variant, SigGRU, in the next section.

Advanced Architecture - Signature GRU (SigGRU)

Standard GRUs rely on local context of the current input and the hidden state for their gate decisions.

Signature-GRU (SigGRU) aims to capture broader temporal patterns by replacing input values with learnable path signatures.

A path signature summarizes the entire input trajectory from time step zero to t.

For instance, in the walkthrough example with t = 3 (so the input sequence is X = (x1, x2, x3)), the input path of (time, value) pairs would look like:

  • t=1: x1​ = (1, 5)

  • t=2: x2​=(2, 3)

  • t=3: x3​=(3, 1)

The first-level signature captures the total displacement: the net change in both time and value from start to end:

  • Change in time: 3 - 1 = 2

  • Change in value: 1 - 5 = -4

  • The first-level signature: (2, -4)

The negative value of -4 signals a decreasing trend from t = 1 to t = 3.
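This first-level signature is simply the endpoint difference of the path, which is easy to verify:

```python
import numpy as np

# path of (time, value) points from the example
path = np.array([[1, 5],
                 [2, 3],
                 [3, 1]])

# first-level signature: total displacement from start to end
first_level_sig = path[-1] - path[0]
print(first_level_sig)  # [ 2 -4]
```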

SigGRUs can take this long-term trend into account in the gate value computation by replacing the current input x_t with the path signature.

For the update gate value:

z_t = \sigma(W_{zx} x_t + W_{zh} h_{t-1} + b_z)
\implies z_t = \sigma(W_{zx}\, \mathrm{Sig}(X[0, t]) + W_{zh} h_{t-1} + b_z)

For the reset gate value:

r_t = \sigma(W_{rx} x_t + W_{rh} h_{t-1} + b_r)
\implies r_t = \sigma(W_{rx}\, \mathrm{Sig}(X[0, t]) + W_{rh} h_{t-1} + b_r)

The rest of the process is the same as standard GRUs.

Because SigGRUs can use the entire historical trajectory of the input sequence when deciding gate values, they are particularly beneficial for:

  • Complex time series forecasting like financial market prediction, environmental monitoring, or health care data analysis,

  • Sequential learning over irregular samples because path signatures are robust to irregular sampling rates,

  • "Shape" or "geometry" of sequences like trajectory analysis (e.g., robotics, sports analysis) or handwriting recognition (a stroke is considered as a trajectory).

Simulation

Now, I'll examine the GRU and SigGRU on a weather forecasting task.

Preparing for Synthetic Data

First, I defined the create_dataset function to generate a synthetic input sequence mimicking real-world weather data:

import numpy as np
import torch

np.random.seed(42)
torch.manual_seed(42)

# weather classification values
weather_map = {0: 'Sunny', 1: 'Rainy', 2: 'Cloudy'}
num_classes = len(weather_map)

# total length of the synthetic series (must exceed the longest look-back tested)
sequence_length = 2000
raw_data = np.random.randint(0, num_classes, size=sequence_length)

def create_dataset(data, look_back):
    """Slice a series into (input window, next value) pairs."""
    data_x, data_y = [], []
    for i in range(len(data) - look_back):
        seq = data[i:(i + look_back)]    # input window of length look_back
        target = data[i + look_back]     # value immediately after the window
        data_x.append(seq)
        data_y.append(target)
    return np.array(data_x), np.array(data_y)
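As a quick sanity check (create_dataset is redefined here so the snippet runs on its own), a look-back of 3 over a 6-step series yields three window/target pairs:

```python
import numpy as np

def create_dataset(data, look_back):
    # slice a series into (input window, next value) pairs
    data_x, data_y = [], []
    for i in range(len(data) - look_back):
        data_x.append(data[i:(i + look_back)])
        data_y.append(data[i + look_back])
    return np.array(data_x), np.array(data_y)

x, y = create_dataset(np.array([0, 1, 2, 0, 1, 2]), look_back=3)
print(x.shape, y.shape)  # (3, 3) (3,)
print(x[0], y[0])        # [0 1 2] 0 - the first window and the value that follows it
```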

Initializing Models

GRU

I defined the GRU class to initialize the model, assuming a many-to-one architecture:

import torch
import torch.nn as nn

class GRU(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(GRU, self).__init__()
        self.hidden_size = hidden_size

        # gru layer (batch_first=False: input shape is (seq_len, batch, input_size))
        self.gru = nn.GRU(input_size, hidden_size, batch_first=False)

        # output layer
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # initialize the hidden state with zeros
        h0 = torch.zeros(1, x.size(1), self.hidden_size)

        # pass the input through the gru layer
        o, hn = self.gru(x, h0)

        # use the output from the last time step for classification (many-to-one architecture)
        o_final = self.fc(o[-1, :, :])
        return o_final
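Before training, the many-to-one shape flow can be checked with nn.GRU directly (dummy sizes, random input):

```python
import torch
import torch.nn as nn

seq_len, batch_size, input_size, hidden_size = 7, 2, 3, 10

gru = nn.GRU(input_size, hidden_size, batch_first=False)
x = torch.randn(seq_len, batch_size, input_size)   # (seq_len, batch, input)
h0 = torch.zeros(1, batch_size, hidden_size)       # initial hidden state

o, hn = gru(x, h0)
print(o.shape)            # torch.Size([7, 2, 10]) - one output per time step
print(o[-1, :, :].shape)  # torch.Size([2, 10])    - last step, fed to the classifier
print(torch.allclose(o[-1], hn[0]))  # True: the last output equals the final hidden state
```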

SigGRU

For SigGRU, the SigGRU class uses a simple signature path: the difference between the first and last elements of the input sequence.

The signature is concatenated with the GRU's last-step output and passed to the linear layer to compute the final output.

import torch
import torch.nn as nn

class SigGRU(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SigGRU, self).__init__()
        self.hidden_size = hidden_size
        self.input_size = input_size

        # gru layer
        self.gru = nn.GRU(input_size, hidden_size, batch_first=False)

        # output layer takes both the gru output and the signature as input,
        # so its input size is hidden_size + input_size (for the signature)
        self.fc = nn.Linear(hidden_size + input_size, output_size)

    def forward(self, x):
        # initialize the hidden state with zeros
        h0 = torch.zeros(1, x.size(1), self.hidden_size)

        # pass the input through the gru layer
        o, hn = self.gru(x, h0)

        # compute the signature - a vector summarizing the entire path
        # x shape: (seq_len, batch_size, input_size)
        signature = x[-1, :, :] - x[0, :, :]

        # concatenate the gru's last-step output with the signature
        combined_features = torch.cat((o[-1, :, :], signature), dim=1)

        # pass the combined features to the final classification layer
        o_final = self.fc(combined_features)
        return o_final
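The concatenation step can be verified in isolation with dummy tensors of the shapes used above:

```python
import torch

seq_len, batch_size, input_size, hidden_size = 7, 2, 3, 10

x = torch.randn(seq_len, batch_size, input_size)   # (seq_len, batch, input)
o = torch.randn(seq_len, batch_size, hidden_size)  # stand-in for the gru outputs

signature = x[-1, :, :] - x[0, :, :]               # (batch, input_size)
combined = torch.cat((o[-1, :, :], signature), dim=1)

print(combined.shape)  # torch.Size([2, 13]) = (batch, hidden_size + input_size)
```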

Model Training & Evaluation

Lastly, I trained the models on datasets with sequence lengths of 10, 100, and 1,000 for comparison:

import numpy as np
import torch
import torch.nn as nn

def one_hot_encode(sequences, num_classes):
    # helper (not shown in the original listing): one-hot encode integer sequences
    # output shape: (n_samples, seq_len, num_classes)
    return torch.nn.functional.one_hot(
        torch.from_numpy(sequences).long(), num_classes
    ).float()

epochs = 500
sequence_length_to_test = [10, 100, 1000]
model_class = GRU  # or SigGRU; each model was trained separately

for look_back in sequence_length_to_test:
    # create dataset
    input_sequences, target_labels = create_dataset(raw_data, look_back)

    # convert data to tensors; permute to (seq_len, batch, input_size)
    input_tensor = one_hot_encode(input_sequences, num_classes)
    input_tensor = input_tensor.permute(1, 0, 2)
    target_tensor = torch.from_numpy(target_labels).long()

    # instantiate a new model, loss function, and optimizer for each sequence length
    model = model_class(input_size=num_classes, hidden_size=10, output_size=num_classes)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    # train the model
    for epoch in range(epochs):
        # zero the gradients
        optimizer.zero_grad()

        # prediction
        outputs = model(input_tensor)

        # compute loss
        loss = criterion(outputs, target_tensor)

        # bptt
        loss.backward()
        optimizer.step()

    # perform inference
    with torch.inference_mode():
        last_sequence = raw_data[-look_back:]
        input_for_pred = one_hot_encode(np.array([last_sequence]), num_classes)
        input_for_pred = input_for_pred.permute(1, 0, 2)

        prediction_output = model(input_for_pred)
        predicted_class = torch.argmax(prediction_output, dim=1).item()

        predicted_weather = weather_map[predicted_class]
        print(f'look_back={look_back}: loss={loss.item():.4f}, prediction={predicted_weather}')

Results

GRU and SigGRU generate the same output for the shorter sequences (t = 10 and 100).

For the longer sequence (t = 1,000), however, SigGRU produces a different outcome (Rainy) from the GRU and the standard RNN (Cloudy) by taking the historical trend of the input sequence into account.

GRU

  • t = 10: Loss (cross-entropy): 1.0873, Prediction: Rainy - logits: [-0.1530, -0.1373, -0.0856]

  • t = 100: Loss (cross-entropy): 1.0826, Prediction: Rainy - logits: [-0.1775, -0.3286, -0.0615]

  • t = 1,000: Loss (cross-entropy): 1.0856, Prediction: Cloudy - logits: [0.0020, 0.1844, -0.0650]

SigGRU

  • t = 10: Loss (cross-entropy): 1.0660, Prediction: Rainy - logits: [-0.1380, 0.1518, 0.2880]

  • t = 100: Loss (cross-entropy): 1.0859, Prediction: Rainy - logits: [0.1817, -0.0434, 0.4132]

  • t = 1,000: Loss (cross-entropy): 1.0833, Prediction: Rainy - logits: [-0.0494, -0.1376, 0.0089]

Cf. Standard RNN

  • t = 10: Loss (cross-entropy): 1.0889, Prediction: Rainy - logits: [0.2300, 0.3845, 0.5169]

  • t = 100: Loss (cross-entropy): 1.0919, Prediction: Rainy - logits: [0.3635, 0.1838, 0.0561]

  • t = 1,000: Loss (cross-entropy): 1.0950, Prediction: Cloudy - logits: [0.1453, 0.2404, 0.2110]

Loss History (Learning Capability)

The loss history showcases the effectiveness of the GRU and SigGRU over the standard RNN.

The graphs below show the loss during training:

Figure D. Loss history of standard RNN (blue), GRU (green), and SigGRU (red) with time step t = 10 (left), 100 (middle), and 1,000 (right) (Created by Kuriko IWAI)


Standard RNN

The Standard RNN model shows very little improvement, particularly as the sequence length increases. For t = 1,000, the loss barely decreases from the initial value, indicating that the model is struggling to learn from the long sequence due to the vanishing gradient problem.

GRU

The GRU model performs better than the Standard RNN, especially at longer sequence lengths. For t = 100 and t = 1,000, the GRU's loss curve shows a more consistent downward trend compared to the Standard RNN. This demonstrates the GRU's ability to maintain a useful gradient over more time steps.

SigGRU

The SigGRU model generally performs the best. It shows the steepest and most consistent decrease in loss across all sequence lengths. This suggests that providing a feature that summarizes the entire sequence can help the model learn more quickly and effectively.

Wrapping Up

GRUs excel at managing long sequential data at a lower computational cost than LSTMs.

In the simulation, we observed that a more complex design like SigGRU can outperform GRUs and standard RNNs by learning long-term trends in the input sequence.

By understanding these strengths and weaknesses of each RNN architecture, we can optimize their application for specific tasks, leading to better performance and more efficient resource use.


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.