Mastering Long Short-Term Memory (LSTM) Networks
Uncover how LSTM architecture outperforms standard RNNs in a real-world predictive modeling task
By Kuriko IWAI

Table of Contents
IntroductionWhat is Long Short-Term Memory (LSTM) NetworksIntroduction
Long Short-Term Memory (LSTM) networks are a powerful extension of standard RNNs, designed specifically to overcome the vanishing gradient problem.
LSTMs have many applications from natural language processing to data analysis, but grasping how they manage information flow over long sequences can be challenging.
In this article, I’ll explore its core mechanics and demonstrate how they mitigate the vanishing gradient problem using a synthetic weather prediction task.
What is Long Short-Term Memory (LSTM) Networks
Long Short-Term Memory (LSTM) network is an advanced type of Recurrent Neural Network (RNN) designed to handle long temporal dependencies without suffering from the vanishing gradient problem.
Before diving into LSTMs, I’ll briefly go over standard RNN architectures and how vanishing gradient problems happen.
◼ Standard RNNs and Vanishing Gradient Problems
Recurrent Neural Network (RNN) is a type of neural network designed to handle sequential data.
Although there are many RNN variations including LSTMs, its core architecture always uses its hidden layer as a memory and recursively generates outputs based on the context of previous items in the sequence:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure A. Comparison of a basic FNN and RNN architecture (Created by Kuriko IWAI)
During forward pass, a standard RNN updates its hidden state sequentially by multiplying the same hidden-to-hidden weight matrix (W_hh) and the previous hidden state value (h_{t-1}), and apply the multiplication to its activation function (σ).
This recurring process creates a multiplicative chain of matrix multiplication (green items in the below figure):

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure B. The multiplicative chain of hidden state updates during forward pass of standard RNNs (W_hh: hidden-to-hidden weight matrix, h_t: hidden state value at time step t, x_t: input value at time step t, o_t: output value at time step t) (Created by Kuriko IWAI)
In the chain, the weight matrix W_hh is key to determining the influence of the previous hidden state on the next hidden state.
However, because W_hh is shared across all time steps, the matrix multiplication becomes excessively recursive when dealing with long sequence with the large number of time steps.
This is the root cause of the vanishing (or exploding) gradient problem, because standard RNNs use the gradient of the hidden state to optimize its model parameters during the BPTT (back propagation through time) process.
This means that when the largest eigenvalue of W_hh is less than (or greater than) one, the gradient shrinks (or explodes) exponentially as the number of the time steps increases.
The optimizer struggles to handle these zero (or infinite) gradients, making the model fail to converge.
Learn More: Modeling Sequential Data with Recurrent Neural Network (RNN)
◼ How LSTMs Tackle Vanishing Gradient Problems
LSTMs address this challenge by implementing the additive path for gradients, instead of the recursive matrix multiplication.
Technically, LSTMs leverage the gradient of the cell state, instead of the hidden state, to optimize their model parameters.
The below diagram illustrates forward pass of LSTMs:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure C. Forward pass of a LSTM network (Created by Kuriko IWAI)
In the hidden layer, LSTMs run a multi-step process of updating the hidden state and the cell state (c0, c1, … ), leveraging the gate mechanism.
In the gate mechanism:
The forget gate (colored grey in Figure A) controls which information from the previous hidden state and the current input to forget,
The input gates (colored orange in Figure A) controls which information from the previous hidden state and the current input to retain, and
The output gates (colored red in Figure A) computes the final value to be exposed as a hidden state value.
This gate mechanism plays a key role to eliminate the vanishing gradient problem.
◼ “Deliberate” Vanishing Gradients in the Cell State
We learned that LSTMs leverages the gradient of the cell state instead of the hidden state.
Mathematically, the gradient of the cell state is denoted:
where:
L is the loss function,
C is a cell state at time step t, and
f_t and i_t are the values generated by the forget gate and input gate respectively at time step t (I’ll detail the computation process in the later section).
These f_t and i_t are the critical elements to terminate the vanishing gradient problem, because LSTMs intelligently compute these values, ranged from zero to one.
For instance,
f = 1 means remembering everything learned from the previous hidden state and the current input.
f = 0 means forgetting everything learned from the previous hidden state and the current input.
▫ The Walkthrough Example
For example, let us trace back the gradient from time step t=3 to t=1, assuming the initial gradient is one: ∂L/∂C_3 = 1.0.
- Step 1. Gradient from t=3 to t=2:
The network decides to retain the information by setting f_3 = 0.9:
- Step 2. Gradient from t=2 to t=1:The network decides to release the irrelevant information by setting f_2 = 0.01:
- Step 3: Total gradient from time step 3 to 1: 0.009.
In this example, the gradient is vanished deliberatively, but not accidentally, because the network decides to forget the irrelevant information.
The gradient of the standard RNN, on the other hand, is denoted:
where the term ∂h_j / ∂h_{j−1} is a Jacobian matrix whose elements depend on the weight matrix W_hh and the derivative of the activation function:
This mathematically explains the root cause of the gradient problem:
The tanh' derivative: Ranges from -1 to 1, making the multiplication shrink exponentially when t is large.
The weight matrix (W_hh):
If the largest eigenvalue of W_hh is less than 1, the gradient will vanish, accelerating the shrinking tanh’ derivative.
If the largest eigenvalue is greater than 1, the gradient explode, overpowering the dampening effect of the tanh' derivative.
The key point here is that there is no way for a standard RNN to avoid vanishing or exploding gradients because of the multiplicative process of the same W_hh.
In the previous walkthrough example to track BPTT from t=3 to t=1, let:
The weight W_hh = 0.5, and
The derivative of the activation function tanh′(h_{j−1}) ≈ 0.4.
The gradient computation by a standard RNN flows:
Step 1: Gradient from t=3 to t=2: ∂h_3 / ∂h_2 = 0.4 ⋅ 0.5 = 0.2
Step 2: Gradient from t=2 to t=1: ∂h_2 / ∂h_1 = 0.4 ⋅ 0.5 = 0.2
Step 3: Total gradient from t=3 to t=1:
Now, imagine t = 10,000. The total gradient is (0.2)^9999 = 1e-6989.
This implies that the network forces to discard all the information regardless of the importance of the past learning.
◼ The Hidden State’s Role in LSTMs
As shown in Figure A, the hidden state (h_t) still exists in LSTMs.
But unlike a standard RNN, it is not used to optimize the model parameters, but used to generate the final output that feeds into the next gate.
Mathematically, the process is defined:
where o_t is the value generated by the output gate at time step t such that:
tanh: tanh activation function of the output gate, and
C_t: The cell state value at time step t.
W_xo, W_ho: The weight matrices from input / hidden state to the output gate, and
b_o: A bias term of the output gate.
We notice that the computation of the output value is similar to a standard RNN.
◼ Real-World Applications of LSTMs
The intelligent gate mechanism makes LSTMs suitable in many real-world applications.
Common use cases include:
Natural Language Processing: Works as a foundational model for many NLP tasks, including language translation, sentiment analysis, and text generation.
Speech Recognition (voice assistants, transcription services): Processes the temporal sequence of sound waves in audio data by mapping these sequences to their corresponding text.
Time Series Forecasting: LSTMs are excellent for predicting future values based on historical data like stock market prediction, weather forecasting, and demand forecasting.
Music Generation: Composes new melodies and harmonies by learning patterns in existing music.
Video Analysis: Analyzes a sequence of video frames to recognize actions or track objects in the temporal context of a scene.
In the next section, I’ll walk through its forward pass —particualarly, computation process of the cell state values.
Forward Pass of LSTM Networks
Like standard RNNs, LSTMs take an input sequence, update hidden layers, and generate outputs, using the same variables like:
Input sequence X=(x_1,x_2, ⋯ ,x_T),
Hidden state values h_t, and
Non-linear activation functions.
We learned that its key differences lie in the cell state update using the gate mechanism where each gate computes the gate values to select information.
Mathematically, each gate deals with its own model parameters (weights and biases) as illustrated in Figure A.
Now, let us take a look at the computation of each gate first, then see how these impact the cell state value.
◼ Computing Input Gate Values
The input gate (orange in Figure B) computes its value by applying their own activation functions to a weighted sum of current input value and previous hidden state value:
where:
i_t: An input gate value (output from the input gate) at the current time step t,
σ: A non-linear activation function of the input gate,
x_t: The current input vector at time step t,
W_xi: A weight matrix connecting the current input x_t to the input gate,
W_hi: A weight matrix connecting the previous hidden state h_t−1 to the input gate, and
b_i: A bias vector for the input gate.
The weight matrices applied to the activation function control the importance of the current input x_t or the previous state value h_{t-1}.
Getting i_t closer to one indicates that the LSTM retains the most information.
◼ Computing Forget Gate Values
Similar to the input gate, the forget gate (grey in Figure B) computes its value by applying their own activation functions to a weighted sum of current input value and previous hidden state value:
where:
f_t: A forget gate value (output from the forget gate) at the current time step t,
σ: A non-linear activation function of the forget gate,
h_t−1: A hidden state from the previous time step t−1, and
W_x: A weight matrix applied to the current input x_t,
W_h: A weight matrix applied to the previous hidden state h_t−1, and
b_f: A bias vector for the forget gate.
The weight matrices applied to the activation function control the importance of the current input x_t or the previous state value h_{t-1}.
Getting f_t closer to one indicates that the LSTM forgets the most information.
◼ Computing Candidate Cell State
The network also computes the candidate cell state value (white in Figure B) to determine how much new input should be passed to the cell state:
where:
C~t: A candidate state value - potential new information to be added to the cell state,
W_xc: A weight matrix from the input x_t to the current cell state,
W_hc: A weight matrix from the hidden state at previous time step t-1 to the current candidate cell,
b_c: A bias vector for the candidate cell state.
A large positive value (close to 1) in a specific dimension of C~t means the network has identified strong new information that should be potentially added to the cell state.
A large negative value (close to -1) indicates strong new information that should potentially be subtracted from the cell state.
◼ Computing Cell State Values
Lastly, the cell state value C_t (red in Figure B) is computed by taking a weighted sum of the input and forget gate values:
where:
C_t: The cell state value at time step t,
f_t: A forget gate value (output from the forget gate) at the current time step t,
C_t-1: The previous cell state value at time step t-1,
i_t: An input gate value (output from the input gate) at the current time step t, and
C~t: A candidate state value - potential new information to be added to the cell state.
In the cell state formula,
The first term, controlled by the forget gate, determines how much information from the previous cell state is retained, while
The second term, controlled by the input gate, determines how much new information from the candidate cell state is added.
By adding multiple steps to weigh new inputs and previous learning, an LSTM can selectively retain or forget information over long sequences, allowing it to effectively capture long-term dependencies and mitigate the vanishing gradient problem.
◼ Computational Cost of LSTM Networks
As we observed in the formulas, LSTMs have more model parameters and non-linear activation functions than standard RNNs due to its gate mechanism.
This makes LSTMs computationally intensive, up to four times more expensive than standard RNNs, especially when dealing with a large dataset.
The below diagram showcases complexity of LSTM networks comparing with a standard RNN:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure D. Comparison of standard RNN and LSTM network architectures (Created by Kuriko IWAI)
We can find more model parameters in LSTMs.
GRUs were developed to tackle this computational challenge. I’ll explore it in a separate article.
Simulation
With understanding of how an LSTM network works, I’ll compare its performance with a standard RNN, using synthetic sequential data of weather conditions.
◼ Generating Synthetic Data
First, I prepared for a synthetic input sequence mimicking the real-world weather data:
1import numpy as np
2
3np.random.seed(42)
4
5weather_map = {0: 'Sunny', 1: 'Rainy', 2: 'Cloudy'}
6num_classes = len(weather_map)
7sequence_length = 500000 # extremely long sequence
8
9raw_data = np.random.randint(0, num_classes, size=sequence_length)
10
11def create_dataset(data, look_back):
12 data_x, data_y = [], []
13 for i in range(len(data) - look_back):
14 seq = data[i:(i + look_back)]
15 target = data[i + look_back]
16 data_x.append(seq)
17 data_y.append(target)
18
19 return np.array(data_x), np.array(data_y)
20
◼ Defining the LSTM Network
I’ll define the LSTMModel class to initialize the LSTM network on the PyTorch library.
During the forward pass, the nn.LSTM module generates an output value with a tuple of hidden state and cell state values.
Assuming a many-to-one architecture, the final output is computed by the nn.Linear activation function in the output layer.
1import torch
2import torch.nn as nn
3
4
5class LSTM(nn.Module):
6 def __init__(self, input_size, hidden_size, output_size):
7 super(LSTM, self).__init__()
8 self.hidden_size = hidden_size
9
10 # lstm layers
11 self.lstm = nn.LSTM(input_size, hidden_size, batch_first=False)
12
13 # output layer to map lstm outputs to weather categories
14 self.fc = nn.Linear(hidden_size, output_size)
15
16
17 def forward(self, x):
18 # initialize the hidden state and cell state with zeros.
19 num_layers = 1 # nn.LSTM has one layer by default
20 batch_size = x.size(1)
21 h0 = torch.zeros(num_layers, batch_size, self.hidden_size)
22 c0 = torch.zeros(num_layers, batch_size, self.hidden_size)
23
24 # pass the input and the initial states through the LSTM layers
25 o, (hn, cn) = self.lstm(x, (h0, c0))
26
27 # take the output from the last time step for classification
28 o_final = self.fc(o[-1, :, :])
29 return o_final
30
◼ Defining a Standard RNN
For comparison, I’ll define the StandardRNN class.
The StandardRNN class only outputs a hidden state value in addition to the output value during the forward pass process.
1import torch
2import torch.nn as nn
3
4
5class StandardRNN(nn.Module):
6 def __init__(self, input_size, hidden_size, output_size):
7 super(StandardRNN, self).__init__()
8 self.hidden_size = hidden_size
9
10 # rnn layers
11 self.rnn = nn.RNN(input_size, hidden_size, batch_first=False)
12
13 # output layer to map rnn outputs to weather categories
14 self.fc = nn.Linear(hidden_size, output_size)
15
16 def forward(self, x):
17 # initialize the hidden state with zeros
18 h0 = torch.zeros(1, x.size(1), self.hidden_size)
19
20 # pass the input through the hidden layers
21 o, hn = self.rnn(x, h0)
22
23 # take the output from the last time step for classification
24 o_final = self.fc(o[-1, :, :])
25 return o_final
26
◼ Model Training & Inference
Lastly, I trained the model with a different set of sequences and perform inference:
1import torch
2import torch.nn as nn
3import torch.optim as optim
4from torch.utils.data import DataLoader
5from sklearn.model_selection import train_test_split
6
7# generate dataset
8num_classes = 3
9input_sequences, target_labels = create_dataset(raw_data, look_back)
10
11
12# convert data to tensors
13input_tensor = one_hot_encode(input_sequences, num_classes)
14
15
16# lstm expects input shape (sequence_length, batch_size, input_size)
17input_tensor = input_tensor.permute(1, 0, 2)
18target_tensor = torch.from_numpy(target_labels).long()
19
20
21# instantiate a new model, loss function, and optimizer for each sequence_length
22model = LSTM(input_size=num_classes, hidden_size=10, output_size=num_classes)
23criterion = nn.CrossEntropyLoss()
24optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
25
26
27# train the model
28loss_history = []
29for epoch in range(epochs):
30 # zero the gradient
31 optimizer.zero_grad()
32
33 # perform forward pass
34 outputs = model(input_tensor)
35
36 # perform backward pass
37 loss = criterion(outputs, target_tensor)
38 loss.backward()
39 optimizer.step()
40
41 # for recording
42 loss_history.append(loss.item())
43
44
45# perform inference
46with torch.inference_mode():
47 # create input data from raw data (not used in training)
48 last_sequence = raw_data[-sequence_length:]
49 input_for_pred = one_hot_encode(np.array([last_sequence]), num_classes)
50 input_for_pred = input_for_pred.permute(1, 0, 2)
51
52 # make a prediction
53 prediction_output = model(input_for_pred)
54 predicted_class = torch.argmax(prediction_output, dim=1).item()
55 predicted_weather = weather_map[predicted_class]
56
◼ Results
Both an LSTM and a standard RNN generate the same result among all sequence lengths.
▫ 1. Sequence Length = 10,000
LSTMs
Training MSE 1.0585
Prediction: Rainy*, Prediction probabilities: [[-0.0982, -0.1910, 0.3882]]*
Standard RNNs
Training MSE 1.0796
Prediction: Rainy*, Prediction probabilities: [[-0.2650, -0.2228, -0.1560]]*
▫ 2. Sequence Length = 50,000
LSTMs
Training MSE: 1.0746
Prediction: Cloudy*, Prediction probabilities: [[-0.2796, 0.0365, -0.0602]]*
Standard RNNs
Training MSE 1.0962
Prediction: Rainy*, Prediction probabilities: [[-0.0424, 0.0275, -0.1372]]*
▫ 3. Sequence Length = 100,000
LSTMs
Training MSE: 1.1032
Prediction: Rainy*, Prediction probabilities: [[-0.2634, -0.1805, -0.0692]]*
Standard RNNs
Training MSE 1.0975
Prediction: Rainy*, Prediction probabilities: [[-0.4659, -0.4143, -0.4574]]*
However, their learning histories showed differences.
The below graphs track loss histories of LSTM (left) and standard RNN (right) over 400 training epochs:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure E. Loss history of the LSTM network (left) and standard RNN (right) (Created by Kuriko IWAI)
▫ Summary of the Simulation Results
▫ Quick Initial Convergence
Regardless of sequence length, both LSTMs and standard RNNs start with a similar training loss and converge quickly in the initial epochs (around 0–50).
This suggests that both models are able to quickly grasp the basic patterns in the data, even with longer sequences.
▫ Performance with Longer Sequences
However, LSTMs showed higher capabilities in learning long sequence.
LSTMs trained on 50,000 and 100,000-item sequences consistently achieve a lower final training loss, ultimately reaching the lowest loss. This indicates it’s learning the most complex and nuanced patterns.
On the other hand, Standard RNNs’ performance degrades as the sequence length increases.
The standard RNN models trained on 50,000 and 100,000-item sequences struggle to improve, showing a sign of the vanishing gradient problem where the model fails to learn from long-term dependencies.
▫ Training Volatility
For LSTMs, the loss curves for the 100,000-item sequence, exhibit more volatility. This is likely due to the increased complexity of the data, as the model must manage and remember a much larger history of information.
For standard RNNs, training loss becomes significantly more volatile with longer sequences. The 100,000-item sequence has a highly erratic loss curve with frequent spikes, indicating instability. The 10,000-item sequence, on the other hand, shows a much smoother and more stable training loss.
Conclusion
LSTMs excel at managing sequential data through their unique architectural design.
In our simulation, we observed that the sophisticated architecture of LSTMs demonstrated superior learning capabilities compared to standard RNNs, especially when handling long sequences.
This is a direct result of their ability to selectively retain and forget information, which effectively mitigates the vanishing gradient problem.
By understanding the distinct strengths and weaknesses of each RNN architecture, we can make more informed decisions about their application for specific tasks, ultimately leading to better performance and more efficient resource utilization.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Deep Dive into Recurrent Neural Networks (RNN): Mechanics, Math, and Limitations
Understanding GRU Architecture and the Power of Path Signatures
A Deep Dive into Bidirectional RNNs, LSTMs, and GRUs
Deep Recurrent Neural Networks: Engineering Depth for Complex Sequences
Advanced Cross-Validation for Sequential Data: A Guide to Avoiding Data Leakage
Related Books for Further Understanding
These books cover the wide range of theories and practices; from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Practical Time Series Analysis: Prediction with Statistics and Machine Learning
Share What You Learned
Kuriko IWAI, "Mastering Long Short-Term Memory (LSTM) Networks" in Kernel Labs
https://kuriko-iwai.com/long-short-term-memory-network
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.




