This chapter covers LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit). You will learn the limitations of Vanilla RNN (vanishing and exploding gradients), LSTM's cell state and gate mechanisms, and the GRU architecture.
Learning Objectives
By reading this chapter, you will master the following:
- Understand the limitations of Vanilla RNN (vanishing and exploding gradient problems)
- Explain LSTM's cell state and gate mechanisms (forget, input, and output gates)
- Understand the GRU architecture and its differences from LSTM
- Implement LSTM and GRU in PyTorch and apply them to real-world problems
- Understand the mechanism and advantages of Bidirectional RNN
- Build a practical LSTM model for IMDb sentiment analysis
2.1 Limitations of Vanilla RNN
Vanishing and Exploding Gradient Problems
The standard RNN (Vanilla RNN) learned in Chapter 1 can theoretically handle sequences of arbitrary length, but in practice it struggles to learn long-term dependencies.
In RNN's BPTT (Backpropagation Through Time), the gradient that reaches an early hidden state is a product of per-step Jacobians:
$$ \frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}} $$

This product term is the problem:
- Vanishing gradients: When $\left\|\frac{\partial h_t}{\partial h_{t-1}}\right\| < 1$, gradients decrease exponentially as time steps increase
- Exploding gradients: When $\left\|\frac{\partial h_t}{\partial h_{t-1}}\right\| > 1$, gradients increase exponentially
"Due to vanishing gradients, Vanilla RNN can barely learn long-term dependencies beyond 10 steps"
Visualizing the Problem
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
"""
Example: Visualizing the Problem
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
# Observe gradient magnitude in Vanilla RNN
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)

    def forward(self, x):
        output, hidden = self.rnn(x)
        return output, hidden
# Create model
input_size = 10
hidden_size = 20
sequence_length = 50
model = SimpleRNN(input_size, hidden_size)
# Random input
x = torch.randn(1, sequence_length, input_size, requires_grad=True)
# Forward pass
output, hidden = model(x)
loss = output.sum()
# Backward pass
loss.backward()
# Calculate gradient norm at each time step
gradients = []
for t in range(sequence_length):
    grad = x.grad[0, t, :].norm().item()
    gradients.append(grad)
print("=== Gradient Propagation in Vanilla RNN ===")
print(f"Gradient norm at initial time: {gradients[0]:.6f}")
print(f"Gradient norm at middle time: {gradients[25]:.6f}")
print(f"Gradient norm at final time: {gradients[49]:.6f}")
print(f"\nGradient decay rate: {gradients[0] / gradients[49]:.2f}x")
print("ā Gradients get smaller as we go back in time (vanishing gradients)")
Concrete Example: Long-Term Dependency Task
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
# Task: Predict the first element at the last time step
def create_long_dependency_task(batch_size=32, seq_length=50):
    """
    Task where the important information appears at the first time step
    and must be used at the last time step.
    Example: [5, 0, 0, ..., 0] → predict 5 at the end
    """
    x = torch.zeros(batch_size, seq_length, 10)
    targets = torch.randint(0, 10, (batch_size,))
    # Encode the correct label at the first time step (one-hot)
    for i in range(batch_size):
        x[i, 0, targets[i]] = 1.0
    return x, targets
# Train with Vanilla RNN
class VanillaRNNClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(VanillaRNNClassifier, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        output, hidden = self.rnn(x)
        # Use the output at the last time step
        logits = self.fc(output[:, -1, :])
        return logits
# Experiment
model = VanillaRNNClassifier(input_size=10, hidden_size=32, num_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Training
num_epochs = 100
for epoch in range(num_epochs):
    x, targets = create_long_dependency_task(batch_size=32, seq_length=50)
    optimizer.zero_grad()
    logits = model(x)
    loss = criterion(logits, targets)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 20 == 0:
        with torch.no_grad():
            x_test, targets_test = create_long_dependency_task(batch_size=100, seq_length=50)
            logits_test = model(x_test)
            _, predicted = logits_test.max(1)
            accuracy = (predicted == targets_test).float().mean().item()
            print(f"Epoch {epoch+1}: Accuracy = {accuracy*100:.2f}%")
print("\nā Vanilla RNN cannot learn long-term dependencies and has low accuracy (similar to random prediction)")
Solution: Gate Mechanisms
To solve this problem, LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) were proposed. They can effectively learn long-term dependencies by controlling information flow through gate mechanisms.
2.2 LSTM (Long Short-Term Memory)
Overview of LSTM
LSTM is a gated RNN architecture proposed by Hochreiter and Schmidhuber in 1997. It is characterized by having a cell state for long-term memory in addition to the hidden state of Vanilla RNN.
Four Components of LSTM
An LSTM cell consists of the following four elements:
| Gate | Role | Formula |
|---|---|---|
| Forget Gate | How much past memory to retain | $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ |
| Input Gate | How much new information to add | $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ |
| Candidate Cell | Content of new information to add | $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$ |
| Output Gate | How much to output from cell state | $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ |
Mathematical Definition of LSTM
The complete LSTM update equations are as follows:
$$ \begin{align} f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \quad &\text{(Forget gate)} \\ i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \quad &\text{(Input gate)} \\ \tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) \quad &\text{(Candidate value)} \\ C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad &\text{(Cell state update)} \\ o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \quad &\text{(Output gate)} \\ h_t &= o_t \odot \tanh(C_t) \quad &\text{(Hidden state update)} \end{align} $$

Where:
- $\sigma$: Sigmoid function (outputs values between 0 and 1, used for gate control)
- $\odot$: Element-wise product (Hadamard product)
- $[h_{t-1}, x_t]$: Vector concatenation
- $W_*, b_*$: Learnable parameters
Role of Cell State
The cell state $C_t$ functions as an information highway:
$$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $$

- $f_t \odot C_{t-1}$: Selectively retain past memory (controlled by forget gate)
- $i_t \odot \tilde{C}_t$: Selectively add new information (controlled by input gate)
"Gradients flow directly through the cell state, making vanishing gradients less likely!"
Manual Implementation of LSTM
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import torch.nn.functional as F
class LSTMCell(nn.Module):
    """Manual implementation of an LSTM cell (for educational purposes)"""
    def __init__(self, input_size, hidden_size):
        super(LSTMCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        # Forget gate
        self.W_f = nn.Linear(input_size + hidden_size, hidden_size)
        # Input gate
        self.W_i = nn.Linear(input_size + hidden_size, hidden_size)
        # Candidate value
        self.W_C = nn.Linear(input_size + hidden_size, hidden_size)
        # Output gate
        self.W_o = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev, C_prev):
        """
        x_t: (batch_size, input_size) - current input
        h_prev: (batch_size, hidden_size) - previous hidden state
        C_prev: (batch_size, hidden_size) - previous cell state
        """
        # Concatenate hidden state and input
        combined = torch.cat([h_prev, x_t], dim=1)
        # Forget gate: which information to forget
        f_t = torch.sigmoid(self.W_f(combined))
        # Input gate: which information to add
        i_t = torch.sigmoid(self.W_i(combined))
        # Candidate value: content of the information to add
        C_tilde = torch.tanh(self.W_C(combined))
        # Cell state update
        C_t = f_t * C_prev + i_t * C_tilde
        # Output gate: which information to output
        o_t = torch.sigmoid(self.W_o(combined))
        # Hidden state update
        h_t = o_t * torch.tanh(C_t)
        return h_t, C_t
class ManualLSTM(nn.Module):
    """LSTM processing over multiple time steps"""
    def __init__(self, input_size, hidden_size):
        super(ManualLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.cell = LSTMCell(input_size, hidden_size)

    def forward(self, x, init_states=None):
        """
        x: (batch_size, seq_length, input_size)
        """
        batch_size, seq_length, _ = x.size()
        # Initial state
        if init_states is None:
            h_t = torch.zeros(batch_size, self.hidden_size).to(x.device)
            C_t = torch.zeros(batch_size, self.hidden_size).to(x.device)
        else:
            h_t, C_t = init_states
        # Store the output at each time step
        outputs = []
        # Process the sequence one time step at a time
        for t in range(seq_length):
            h_t, C_t = self.cell(x[:, t, :], h_t, C_t)
            outputs.append(h_t.unsqueeze(1))
        # Concatenate outputs along the time dimension
        outputs = torch.cat(outputs, dim=1)
        return outputs, (h_t, C_t)
# Operation check
batch_size = 4
seq_length = 10
input_size = 8
hidden_size = 16
model = ManualLSTM(input_size, hidden_size)
x = torch.randn(batch_size, seq_length, input_size)
outputs, (h_final, C_final) = model(x)
print("=== Manual LSTM Implementation Check ===")
print(f"Input size: {x.shape}")
print(f"Output size: {outputs.shape}")
print(f"Final hidden state: {h_final.shape}")
print(f"Final cell state: {C_final.shape}")
Using PyTorch's nn.LSTM
In actual development, we use PyTorch's optimized nn.LSTM:
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Using PyTorch's optimized nn.LSTM
Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
# PyTorch's LSTM
lstm = nn.LSTM(
    input_size=10,       # Input dimension
    hidden_size=20,      # Hidden state dimension
    num_layers=2,        # Number of LSTM layers
    batch_first=True,    # (batch, seq, feature) order
    dropout=0.2,         # Dropout between layers
    bidirectional=False  # Whether bidirectional
)
# Dummy data
batch_size = 32
seq_length = 15
input_size = 10
x = torch.randn(batch_size, seq_length, input_size)
# Forward pass
output, (h_n, c_n) = lstm(x)
print("=== Using PyTorch nn.LSTM ===")
print(f"Input: {x.shape}")
print(f"Output: {output.shape} # (batch, seq, hidden_size)")
print(f"Final hidden state: {h_n.shape} # (num_layers, batch, hidden_size)")
print(f"Final cell state: {c_n.shape} # (num_layers, batch, hidden_size)")
# Check number of parameters
total_params = sum(p.numel() for p in lstm.parameters())
print(f"\nTotal parameters: {total_params:,}")
print("ā Many parameters because each layer has 4 gates (f, i, C, o)")
LSTM and Long-Term Dependencies
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: LSTM and Long-Term Dependencies
Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import torch
import torch.nn as nn
# Solve the previous long-term dependency task with LSTM
class LSTMClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(LSTMClassifier, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        output, (h_n, c_n) = self.lstm(x)
        # Use the output at the last time step
        logits = self.fc(output[:, -1, :])
        return logits
# Task creation function (same as before)
def create_long_dependency_task(batch_size=32, seq_length=50):
    x = torch.zeros(batch_size, seq_length, 10)
    targets = torch.randint(0, 10, (batch_size,))
    for i in range(batch_size):
        x[i, 0, targets[i]] = 1.0
    return x, targets
# Train with LSTM
model = LSTMClassifier(input_size=10, hidden_size=32, num_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
print("=== Learning Long-Term Dependencies with LSTM ===")
num_epochs = 100
for epoch in range(num_epochs):
    x, targets = create_long_dependency_task(batch_size=32, seq_length=50)
    optimizer.zero_grad()
    logits = model(x)
    loss = criterion(logits, targets)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 20 == 0:
        with torch.no_grad():
            x_test, targets_test = create_long_dependency_task(batch_size=100, seq_length=50)
            logits_test = model(x_test)
            _, predicted = logits_test.max(1)
            accuracy = (predicted == targets_test).float().mean().item()
            print(f"Epoch {epoch+1}: Accuracy = {accuracy*100:.2f}%")
print("\nā LSTM can effectively learn long-term dependencies and achieve high accuracy!")
2.3 GRU (Gated Recurrent Unit)
Overview of GRU
GRU (Gated Recurrent Unit) is a simplified alternative to LSTM proposed by Cho et al. in 2014. It often matches or exceeds LSTM's performance with fewer parameters.
Differences Between LSTM and GRU
| Item | LSTM | GRU |
|---|---|---|
| Number of gates | 3 (forget, input, output) | 2 (reset, update) |
| States | Hidden state $h_t$ and cell state $C_t$ | Hidden state $h_t$ only |
| Parameters | More | Fewer (about 75% of LSTM) |
| Computation speed | Somewhat slower | Somewhat faster |
| Performance | Task dependent | Task dependent (advantageous for short sequences) |
Mathematical Definition of GRU
The GRU update equations are as follows:
$$ \begin{align} r_t &= \sigma(W_r [h_{t-1}, x_t] + b_r) \quad &\text{(Reset gate)} \\ z_t &= \sigma(W_z [h_{t-1}, x_t] + b_z) \quad &\text{(Update gate)} \\ \tilde{h}_t &= \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h) \quad &\text{(Candidate hidden state)} \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad &\text{(Hidden state update)} \end{align} $$

Role of each gate:
- Reset gate $r_t$: How much to ignore past information (ignores past when close to 0)
- Update gate $z_t$: How much to mix past and present information (retains past when close to 0, adopts new information when close to 1)
"GRU integrates LSTM's forget gate and input gate with the update gate $z_t$"
Manual Implementation of GRU
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
class GRUCell(nn.Module):
    """Manual implementation of a GRU cell"""
    def __init__(self, input_size, hidden_size):
        super(GRUCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        # Reset gate
        self.W_r = nn.Linear(input_size + hidden_size, hidden_size)
        # Update gate
        self.W_z = nn.Linear(input_size + hidden_size, hidden_size)
        # Candidate hidden state
        self.W_h = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        """
        x_t: (batch_size, input_size)
        h_prev: (batch_size, hidden_size)
        """
        # Concatenate hidden state and input
        combined = torch.cat([h_prev, x_t], dim=1)
        # Reset gate
        r_t = torch.sigmoid(self.W_r(combined))
        # Update gate
        z_t = torch.sigmoid(self.W_z(combined))
        # Candidate hidden state (past filtered by the reset gate)
        combined_reset = torch.cat([r_t * h_prev, x_t], dim=1)
        h_tilde = torch.tanh(self.W_h(combined_reset))
        # Hidden state update (mix past and present with the update gate)
        h_t = (1 - z_t) * h_prev + z_t * h_tilde
        return h_t
class ManualGRU(nn.Module):
    """GRU processing over multiple time steps"""
    def __init__(self, input_size, hidden_size):
        super(ManualGRU, self).__init__()
        self.hidden_size = hidden_size
        self.cell = GRUCell(input_size, hidden_size)

    def forward(self, x, init_state=None):
        batch_size, seq_length, _ = x.size()
        # Initial state
        if init_state is None:
            h_t = torch.zeros(batch_size, self.hidden_size).to(x.device)
        else:
            h_t = init_state
        outputs = []
        for t in range(seq_length):
            h_t = self.cell(x[:, t, :], h_t)
            outputs.append(h_t.unsqueeze(1))
        outputs = torch.cat(outputs, dim=1)
        return outputs, h_t
# Operation check
model = ManualGRU(input_size=8, hidden_size=16)
x = torch.randn(4, 10, 8)
outputs, h_final = model(x)
print("=== Manual GRU Implementation Check ===")
print(f"Input: {x.shape}")
print(f"Output: {outputs.shape}")
print(f"Final hidden state: {h_final.shape}")
print("ā GRU has no cell state, only hidden state")
Using PyTorch's nn.GRU
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Using PyTorch's nn.GRU
Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: 5-10 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
# PyTorch's GRU
gru = nn.GRU(
    input_size=10,
    hidden_size=20,
    num_layers=2,
    batch_first=True,
    dropout=0.2,
    bidirectional=False
)
x = torch.randn(32, 15, 10)
output, h_n = gru(x)
print("=== Using PyTorch nn.GRU ===")
print(f"Input: {x.shape}")
print(f"Output: {output.shape}")
print(f"Final hidden state: {h_n.shape}")
# Parameter count comparison with LSTM
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
gru_params = sum(p.numel() for p in gru.parameters())
lstm_params = sum(p.numel() for p in lstm.parameters())
print(f"\nGRU parameters: {gru_params:,}")
print(f"LSTM parameters: {lstm_params:,}")
print(f"Difference: {lstm_params - gru_params:,} (GRU has {(lstm_params/gru_params - 1)*100:.1f}% fewer)")
Performance Comparison Between LSTM and GRU
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import time
class SequenceClassifier(nn.Module):
    """Generic sequence classifier"""
    def __init__(self, input_size, hidden_size, num_classes, rnn_type='lstm'):
        super(SequenceClassifier, self).__init__()
        if rnn_type == 'lstm':
            self.rnn = nn.LSTM(input_size, hidden_size, batch_first=True)
        elif rnn_type == 'gru':
            self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        else:
            raise ValueError("rnn_type must be 'lstm' or 'gru'")
        self.fc = nn.Linear(hidden_size, num_classes)
        self.rnn_type = rnn_type

    def forward(self, x):
        if self.rnn_type == 'lstm':
            output, (h_n, c_n) = self.rnn(x)
        else:
            output, h_n = self.rnn(x)
        logits = self.fc(output[:, -1, :])
        return logits
# Comparison experiment
def compare_models(seq_length=50):
    print(f"\n=== Comparison with sequence length={seq_length} ===")
    # Create models
    lstm_model = SequenceClassifier(10, 32, 10, rnn_type='lstm')
    gru_model = SequenceClassifier(10, 32, 10, rnn_type='gru')
    # Generate data (create_long_dependency_task is defined earlier in this chapter)
    x, targets = create_long_dependency_task(batch_size=32, seq_length=seq_length)
    criterion = nn.CrossEntropyLoss()

    # LSTM training
    optimizer_lstm = torch.optim.Adam(lstm_model.parameters(), lr=0.001)
    start = time.time()
    for _ in range(50):
        optimizer_lstm.zero_grad()
        logits = lstm_model(x)
        loss = criterion(logits, targets)
        loss.backward()
        optimizer_lstm.step()
    lstm_time = time.time() - start

    # GRU training
    optimizer_gru = torch.optim.Adam(gru_model.parameters(), lr=0.001)
    start = time.time()
    for _ in range(50):
        optimizer_gru.zero_grad()
        logits = gru_model(x)
        loss = criterion(logits, targets)
        loss.backward()
        optimizer_gru.step()
    gru_time = time.time() - start

    # Accuracy evaluation
    x_test, targets_test = create_long_dependency_task(batch_size=100, seq_length=seq_length)
    with torch.no_grad():
        logits_lstm = lstm_model(x_test)
        logits_gru = gru_model(x_test)
    _, pred_lstm = logits_lstm.max(1)
    _, pred_gru = logits_gru.max(1)
    acc_lstm = (pred_lstm == targets_test).float().mean().item()
    acc_gru = (pred_gru == targets_test).float().mean().item()
    print(f"LSTM - Accuracy: {acc_lstm*100:.2f}%, Training time: {lstm_time:.2f}s")
    print(f"GRU  - Accuracy: {acc_gru*100:.2f}%, Training time: {gru_time:.2f}s")
# Compare at different sequence lengths
compare_models(seq_length=20)
compare_models(seq_length=50)
compare_models(seq_length=100)
print("\nā GRU tends to be more efficient for short sequences, LSTM advantageous for longer sequences")
2.4 Bidirectional RNN
What is Bidirectional RNN?
Bidirectional RNN processes sequences from both forward (front to back) and backward (back to front) directions and integrates information from both directions.
ā"] B --> C["x_2"] C --> D["Forward
ā"] D --> E["x_3"] E --> F["Backward
ā"] F --> C C --> G["Backward
ā"] G --> A B --> H["h_1"] D --> I["h_2"] F --> J["h_3 (backward)"] G --> K["h_2 (backward)"] style B fill:#b3e5fc style D fill:#b3e5fc style F fill:#ffab91 style G fill:#ffab91
Advantages of Bidirectional RNN
- Complete context capture: both past and future context can be considered at each position
- Part-of-speech tagging: the part of speech is determined from the surrounding words
- Sentiment analysis: sentiment is judged from the entire sentence
- Machine translation: used as the encoder
"Bidirectional RNN cannot be used for real-time processing because the output at time $t$ depends on future information. It is suitable for offline processing (when entire sequence is available)."
Implementation of Bidirectional LSTM
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Implementation of Bidirectional LSTM
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 5-10 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
# In PyTorch, just specify bidirectional=True
class BidirectionalLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(BidirectionalLSTM, self).__init__()
        self.lstm = nn.LSTM(
            input_size,
            hidden_size,
            batch_first=True,
            bidirectional=True  # Enable bidirectional processing
        )
        # hidden_size * 2 because both directions are concatenated
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        # output: (batch, seq, hidden_size * 2)
        output, (h_n, c_n) = self.lstm(x)
        # Use the output at the last time step
        logits = self.fc(output[:, -1, :])
        return logits
# Operation check
model = BidirectionalLSTM(input_size=10, hidden_size=32, num_classes=10)
x = torch.randn(4, 15, 10)
logits = model(x)
print("=== Bidirectional LSTM Operation Check ===")
print(f"Input: {x.shape}")
print(f"Output: {logits.shape}")
# Parameter count comparison
uni_lstm = nn.LSTM(10, 32, batch_first=True, bidirectional=False)
bi_lstm = nn.LSTM(10, 32, batch_first=True, bidirectional=True)
uni_params = sum(p.numel() for p in uni_lstm.parameters())
bi_params = sum(p.numel() for p in bi_lstm.parameters())
print(f"\nUnidirectional LSTM: {uni_params:,} parameters")
print(f"Bidirectional LSTM: {bi_params:,} parameters")
print(f"ā Bidirectional has about 2x the parameters")
Performance Comparison: Bidirectional vs Unidirectional
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Performance Comparison: Bidirectional vs Unidirectional
Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import torch
import torch.nn as nn
class DirectionalClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes, bidirectional=False):
        super(DirectionalClassifier, self).__init__()
        self.lstm = nn.LSTM(
            input_size,
            hidden_size,
            batch_first=True,
            bidirectional=bidirectional
        )
        fc_input_size = hidden_size * 2 if bidirectional else hidden_size
        self.fc = nn.Linear(fc_input_size, num_classes)

    def forward(self, x):
        output, _ = self.lstm(x)
        logits = self.fc(output[:, -1, :])
        return logits
# Comparison experiment
def compare_directionality():
    print("\n=== Unidirectional vs Bidirectional Comparison ===")
    # Create models
    uni_model = DirectionalClassifier(10, 32, 10, bidirectional=False)
    bi_model = DirectionalClassifier(10, 32, 10, bidirectional=True)
    criterion = nn.CrossEntropyLoss()
    # Train and evaluate each model
    for model, name in [(uni_model, "Unidirectional"), (bi_model, "Bidirectional")]:
        optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
        for epoch in range(100):
            x, targets = create_long_dependency_task(batch_size=32, seq_length=50)
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, targets)
            loss.backward()
            optimizer.step()
        # Evaluation
        x_test, targets_test = create_long_dependency_task(batch_size=100, seq_length=50)
        with torch.no_grad():
            logits_test = model(x_test)
            _, predicted = logits_test.max(1)
            accuracy = (predicted == targets_test).float().mean().item()
        print(f"{name} LSTM - Accuracy: {accuracy*100:.2f}%")
compare_directionality()
print("\nā In this task, information is at the beginning, so bidirectional advantage is small")
print(" Bidirectional is advantageous for tasks where both past and future context are important, such as part-of-speech tagging")
2.5 Practice: IMDb Sentiment Analysis
IMDb Dataset
The IMDb (Internet Movie Database) dataset is a movie-review corpus for sentiment analysis:
- 50,000 movie reviews (25,000 training, 25,000 test)
- 2-class classification: Positive, Negative
- Each review is English text
Data Preparation
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
# - torchtext>=0.15.0
"""
Example: Data Preparation
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
# Tokenizer
tokenizer = get_tokenizer('basic_english')
# Load dataset
train_iter = IMDB(split='train')
# Build vocabulary
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)
# Build vocabulary (top 10,000 most frequent words)
vocab = build_vocab_from_iterator(
    yield_tokens(IMDB(split='train')),
    specials=['<unk>', '<pad>'],
    max_tokens=10000
)
vocab.set_default_index(vocab['<unk>'])
print("=== IMDb Dataset Preparation ===")
print(f"Vocabulary size: {len(vocab)}")
print(f"<pad> token index: {vocab['<pad>']}")
print(f"<unk> token index: {vocab['<unk>']}")
# Sample tokenization
sample_text = "This movie is great!"
tokens = tokenizer(sample_text)
indices = [vocab[token] for token in tokens]
print(f"\nSample: '{sample_text}'")
print(f"Tokens: {tokens}")
print(f"Indices: {indices}")
Dataset Class
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Dataset Class
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
class IMDbDataset(Dataset):
    def __init__(self, split='train'):
        self.data = list(IMDB(split=split))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        label, text = self.data[idx]
        # Convert label to a number: recent torchtext yields the integers
        # 1 (negative) and 2 (positive); older versions yield 'neg'/'pos'
        label = 1 if label in (2, 'pos') else 0
        # Tokenize the text and convert tokens to indices
        tokens = tokenizer(text)
        indices = [vocab[token] for token in tokens]
        return torch.tensor(indices), torch.tensor(label)

def collate_batch(batch):
    """Pad all sequences in a batch to the same length"""
    texts, labels = zip(*batch)
    # Padding
    texts_padded = pad_sequence(texts, batch_first=True, padding_value=vocab['<pad>'])
    labels = torch.stack(labels)
    return texts_padded, labels
# Create data loaders
train_dataset = IMDbDataset(split='train')
test_dataset = IMDbDataset(split='test')
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    collate_fn=collate_batch
)
test_loader = DataLoader(
    test_dataset,
    batch_size=64,
    shuffle=False,
    collate_fn=collate_batch
)
print("\n=== Data Loader Check ===")
texts, labels = next(iter(train_loader))
print(f"Batch size: {texts.shape[0]}")
print(f"Sequence length (max): {texts.shape[1]}")
print(f"Labels: {labels[:5]}")
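Padding every review to the batch maximum also makes the LSTM run over <pad> positions. A common refinement is to return the true lengths from the collate function and feed the LSTM a packed sequence; the sketch below assumes the same vocab as above and is one way to wire it up, not the only one:
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

def collate_with_lengths(batch):
    """Like collate_batch, but also returns the original sequence lengths."""
    texts, labels = zip(*batch)
    lengths = torch.tensor([len(t) for t in texts])
    texts_padded = pad_sequence(texts, batch_first=True, padding_value=vocab['<pad>'])
    return texts_padded, lengths, torch.stack(labels)

# Inside a model's forward pass, the padded embeddings would then be packed:
#   packed = pack_padded_sequence(embedded, lengths.cpu(), batch_first=True,
#                                 enforce_sorted=False)
#   packed_output, (h_n, c_n) = self.lstm(packed)
# h_n then holds each sequence's true final state, ignoring the padding.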
LSTM Sentiment Analysis Model
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: LSTM Sentiment Analysis Model
Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: 5-10 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
class LSTMSentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes, num_layers=2, dropout=0.5):
        super(LSTMSentimentClassifier, self).__init__()
        # Embedding layer (padding positions stay at zero and receive no gradient)
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=vocab['<pad>'])
        # LSTM layer
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=True
        )
        # Classification layer (*2 because bidirectional)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text: (batch, seq_len)
        # Embedding: (batch, seq_len, embedding_dim)
        embedded = self.dropout(self.embedding(text))
        # LSTM: output is (batch, seq_len, hidden_dim * 2)
        output, (hidden, cell) = self.lstm(embedded)
        # Concatenate the final hidden states of the forward and backward directions
        # hidden: (num_layers * 2, batch, hidden_dim); [-2] is the final layer's
        # forward state, [-1] its backward state
        hidden_forward = hidden[-2, :, :]
        hidden_backward = hidden[-1, :, :]
        hidden_concat = torch.cat([hidden_forward, hidden_backward], dim=1)
        # Dropout + classification
        hidden_concat = self.dropout(hidden_concat)
        logits = self.fc(hidden_concat)
        return logits
# Create model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"\nUsing device: {device}")
model = LSTMSentimentClassifier(
    vocab_size=len(vocab),
    embedding_dim=100,
    hidden_dim=256,
    num_classes=2,
    num_layers=2,
    dropout=0.5
).to(device)
print(model)
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total_params:,}")
Training Loop
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Training Loop
Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import torch
import torch.nn as nn
import torch.optim as optim
# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    for texts, labels in loader:
        texts, labels = texts.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(texts)
        loss = criterion(logits, labels)
        loss.backward()
        # Gradient clipping (prevents exploding gradients)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
        running_loss += loss.item()
        _, predicted = logits.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
    epoch_loss = running_loss / len(loader)
    epoch_acc = 100. * correct / total
    return epoch_loss, epoch_acc
def test_epoch(model, loader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for texts, labels in loader:
            texts, labels = texts.to(device), labels.to(device)
            logits = model(texts)
            loss = criterion(logits, labels)
            running_loss += loss.item()
            _, predicted = logits.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    epoch_loss = running_loss / len(loader)
    epoch_acc = 100. * correct / total
    return epoch_loss, epoch_acc
# Execute training
num_epochs = 5
best_acc = 0
print("\n=== Training Start ===")
for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = test_epoch(model, test_loader, criterion, device)
    print(f"Epoch [{epoch+1}/{num_epochs}]")
    print(f"  Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")
    print(f"  Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")
    if test_acc > best_acc:
        best_acc = test_acc
        torch.save(model.state_dict(), 'best_imdb_lstm.pth')
print(f"\nTraining complete! Best accuracy: {best_acc:.2f}%")
Inference and Interpretation
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
def predict_sentiment(model, text, vocab, tokenizer, device):
    """Predict the sentiment of a single text"""
    model.eval()
    # Tokenize
    tokens = tokenizer(text)
    indices = [vocab[token] for token in tokens]
    # Convert to a tensor of shape (1, seq_len)
    text_tensor = torch.tensor(indices).unsqueeze(0).to(device)
    # Predict
    with torch.no_grad():
        logits = model(text_tensor)
        probs = torch.softmax(logits, dim=1)
        pred = logits.argmax(1).item()
    sentiment = "Positive" if pred == 1 else "Negative"
    confidence = probs[0, pred].item()
    return sentiment, confidence
# Test
test_reviews = [
    "This movie is absolutely amazing! I loved every moment.",
    "Terrible film. Waste of time and money.",
    "It was okay, nothing special but not bad either.",
    "One of the best movies I've ever seen!",
    "Boring and predictable. Would not recommend."
]
print("\n=== Sentiment Analysis Prediction Results ===")
for review in test_reviews:
    sentiment, confidence = predict_sentiment(model, review, vocab, tokenizer, device)
    print(f"\nReview: {review}")
    print(f"Prediction: {sentiment} (Confidence: {confidence*100:.2f}%)")
2.6 Guidelines for Choosing Between LSTM and GRU
Selection Criteria
| Situation | Recommendation | Reason |
|---|---|---|
| Long sequences (>100) | LSTM | Cell state retains long-term memory |
| Short sequences (<50) | GRU | Fewer parameters, more efficient |
| Computational constraints | GRU | About 25% fewer parameters |
| High accuracy essential | LSTM | Higher expressive power |
| Real-time processing | GRU | Faster computation |
| Both contexts needed | Bidirectional LSTM/GRU | Utilizes information from both directions |
| Uncertain | Try both | High task dependency |
Choosing Hyperparameters
- Hidden layer size: 64-512 (depending on task complexity)
- Number of layers: 1-3 layers (too deep risks overfitting)
- Dropout: 0.2-0.5 (prevent overfitting)
- Embedding dimension: 50-300 (depending on vocabulary size)
- Learning rate: 0.0001-0.001 (Adam recommended)
- Batch size: 32-128 (depending on memory)
Implementation Best Practices
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
class BestPracticeLSTM(nn.Module):
    """LSTM model incorporating best practices"""
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super(BestPracticeLSTM, self).__init__()
        # 1. Specify padding_idx in the Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        # 2. Bidirectional LSTM
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=2,
            batch_first=True,
            dropout=0.3,  # Inter-layer dropout
            bidirectional=True
        )
        # 3. Batch normalization (optional)
        self.batch_norm = nn.BatchNorm1d(hidden_dim * 2)
        # 4. Dropout
        self.dropout = nn.Dropout(0.5)
        # 5. Classification layer
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        embedded = self.dropout(self.embedding(x))
        output, (hidden, cell) = self.lstm(embedded)
        # Concatenate the final hidden states of the forward and backward directions
        hidden_concat = torch.cat([hidden[-2], hidden[-1]], dim=1)
        # Batch norm (optional)
        hidden_concat = self.batch_norm(hidden_concat)
        # Dropout + classification
        hidden_concat = self.dropout(hidden_concat)
        logits = self.fc(hidden_concat)
        return logits
# Training considerations
def train_with_best_practices(model, train_loader, num_epochs=10):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # Learning rate scheduler
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=2
    )
    for epoch in range(num_epochs):
        model.train()
        for texts, labels in train_loader:
            optimizer.zero_grad()
            logits = model(texts)
            loss = criterion(logits, labels)
            loss.backward()
            # Gradient clipping (essential)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
            optimizer.step()
        # Adjust the learning rate based on validation loss
        # (assumes an evaluate() helper and a val_loader are defined elsewhere)
        val_loss = evaluate(model, val_loader)
        scheduler.step(val_loss)
print("=== Best Practices ===")
print("1. Specify padding_idx to exclude <pad> from training")
print("2. Bidirectional LSTM for complete context capture")
print("3. Dropout to prevent overfitting")
print("4. Gradient clipping to prevent gradient explosion")
print("5. Learning rate scheduler to improve optimization")
Exercises
Exercise 1: Observe LSTM Gate Operations
Visualize the values of each LSTM gate (forget, input, output) and observe how they control information.
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Visualize the values of each LSTM gate (forget, input, output)
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
# Exercise: Plot each gate's values over time for simple sequence data
# Hint: Record f_t, i_t, o_t values and graph with matplotlib
Exercise 2: Compare Convergence Speed of GRU and LSTM
Train GRU and LSTM on the same task and compare training curves (loss and accuracy).
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - torch>=2.0.0, <2.3.0
"""
Example: Train GRU and LSTM on the same task and compare training curves
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
# Exercise: Create GRU and LSTM models
# Exercise: Train on same data and record loss and accuracy for each epoch
# Exercise: Plot training curves
# Evaluation metrics: convergence speed, final accuracy, training time
Exercise 3: Verify Effect of Bidirectional RNN
Compare performance of unidirectional and bidirectional RNN on part-of-speech tagging task.
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Compare performance of unidirectional and bidirectional RNN on part-of-speech tagging
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
# Exercise: Implement unidirectional and bidirectional LSTM models
# Exercise: Compare performance on part-of-speech tagging task (predict part of speech for each word)
# Hint: A POS-tagged dataset such as UDPOS from torchtext.datasets can be used
# Verify bidirectional advantage on tasks where both past and future context are important
Exercise 4: Relationship Between Sequence Length and Performance
Compare LSTM and GRU performance at different sequence lengths (10, 50, 100, 200) and determine which is stronger at long-term dependencies.
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Compare LSTM and GRU performance at different sequence lengths
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
# Exercise: Compare accuracy with LSTM and GRU
# Exercise: Create graph of sequence length vs accuracy
# Analyze at what sequence length performance differences become significant
Exercise 5: Improve IMDb Sentiment Analysis
Improve the basic LSTM model to increase test accuracy.
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Improve the basic LSTM model to increase test accuracy.
Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import torch
import torch.nn as nn
# 1. Use pre-trained embeddings (GloVe, Word2Vec)
# 2. Add Attention mechanism
# 3. Adjust number of layers and hidden_size
# 4. Data Augmentation (back-translation, etc.)
# 5. Ensemble learning
# Goal: Improve accuracy by +2% or more from baseline
Summary
In this chapter, we learned about LSTM, GRU, and their applications.
Key Points
- Vanilla RNN limitations: Difficulty learning long-term dependencies due to vanishing/exploding gradients
- LSTM: Realizes long-term memory with cell state and gate mechanisms (forget, input, output)
- GRU: Simplified LSTM, operates efficiently with 2 gates (reset, update)
- LSTM vs GRU differences: Trade-offs in parameter count, computation speed, and performance
- Bidirectional RNN: Processes from both directions to fully capture context
- Practice: Applied to real-world NLP task with IMDb sentiment analysis
- Best practices: Gradient clipping, Dropout, learning rate scheduling
Next Steps
In the next chapter, we will learn about Sequence-to-Sequence (Seq2Seq) and Attention mechanisms. You will master techniques essential for sequence transformation tasks such as machine translation and summarization.