This chapter covers RNN fundamentals and forward propagation, starting from the question: what is sequence data? You will learn the characteristics of sequence data, the basic structure of RNNs, and the mathematical definition of forward propagation.
Learning Objectives
By reading this chapter, you will be able to:
- ✅ Understand the characteristics of sequence data and the need for RNNs
- ✅ Explain the basic structure of RNNs and the concept of hidden states
- ✅ Master the mathematical definition and computational process of forward propagation
- ✅ Understand the principles of Backpropagation Through Time (BPTT)
- ✅ Explain the causes and solutions for vanishing/exploding gradient problems
- ✅ Implement RNNs in PyTorch and build character-level language models
1.1 What is Sequence Data?
Limitations of Traditional Neural Networks
Traditional feedforward networks accept fixed-length inputs and return fixed-length outputs. However, much real-world data consists of variable-length sequences.
"Sequence data has temporal or spatial order. The ability to remember past information and predict the future is necessary."
Examples of Sequence Data
| Data Type | Examples | Characteristics |
|---|---|---|
| Natural Language | Sentences, conversations, translations | Word order determines meaning |
| Speech | Speech recognition, music generation | Has temporal continuity |
| Time-Series Data | Stock prices, temperature, sensor values | Past values influence the future |
| Video | Action recognition, video generation | Temporal dependencies between frames |
| DNA Sequences | Genetic analysis, protein prediction | Base sequence order determines function |
Problem: Why Feedforward Networks Are Insufficient
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Why feedforward networks are insufficient for sequence data
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
import numpy as np
# Example of sequence data: simple sentences
sentence1 = ["I", "love", "machine", "learning"]
sentence2 = ["machine", "learning", "I", "love"]
print("Sentence 1:", " ".join(sentence1))
print("Sentence 2:", " ".join(sentence2))
print("\nSame words but different meanings")
# Problems with feedforward networks
# 1. Fixed-length input is required
# 2. Word order information is lost (Bag-of-Words approach)
# Example of sequences with different lengths
sequences = [
["Hello"],
["How", "are", "you"],
["The", "quick", "brown", "fox", "jumps"]
]
print("\n=== Variable-Length Sequence Problem ===")
for i, seq in enumerate(sequences, 1):
print(f"Sequence {i}: length={len(seq)}, content={seq}")
print("\nFeedforward networks cannot handle these uniformly")
print("→ RNNs are needed!")
Output:
Sentence 1: I love machine learning
Sentence 2: machine learning I love
Same words but different meanings
=== Variable-Length Sequence Problem ===
Sequence 1: length=1, content=['Hello']
Sequence 2: length=3, content=['How', 'are', 'you']
Sequence 3: length=5, content=['The', 'quick', 'brown', 'fox', 'jumps']
Feedforward networks cannot handle these uniformly
→ RNNs are needed!
RNN Task Classification
RNNs are classified by input-output format as follows:
| Type | Input→Output | Applications |
|---|---|---|
| One-to-Many | 1 → N | Image captioning, music generation |
| Many-to-One | N → 1 | Sentiment analysis, document classification |
| Many-to-Many (Asynchronous) | N → M | Machine translation, text summarization |
| Many-to-Many (Synchronized) | N → N | POS tagging, video frame classification |
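To make the table concrete, here is a minimal sketch (the sizes are arbitrary illustrations) using the PyTorch nn.RNN layer that Section 1.5 introduces in detail. It shows that Many-to-One and synchronized Many-to-Many are just different read-outs of the same recurrent layer; the asynchronous patterns additionally require a decoder loop.
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 6, 8)      # batch=2, sequence length=6, features=8
output, hn = rnn(x)           # output: (2, 6, 16), hn: (1, 2, 16)
many_to_one = hn[-1]          # Many-to-One: keep only the final hidden state → (2, 16)
many_to_many = output         # Many-to-Many (synchronized): keep the output at every step → (2, 6, 16)
print(many_to_one.shape, many_to_many.shape)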
1.2 Basic Structure of RNNs
Concept of Hidden States
The core of RNNs (Recurrent Neural Networks) is a mechanism for remembering past information through hidden states.
Basic RNN Equations
The hidden state $h_t$ and output $y_t$ at time $t$ are calculated as follows:
$$ \begin{align} h_t &= \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) \\ y_t &= W_{hy} h_t + b_y \end{align} $$
Where:
- $x_t$: Input vector at time $t$
- $h_t$: Hidden state at time $t$
- $h_{t-1}$: Hidden state at time $t-1$ (memory from previous time)
- $y_t$: Output at time $t$
- $W_{xh}$: Weight matrix from input to hidden state
- $W_{hh}$: Weight matrix from hidden state to hidden state (recurrent connection)
- $W_{hy}$: Weight matrix from hidden state to output
- $b_h, b_y$: Bias terms
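As a quick check on shapes, here is a minimal NumPy sketch of a single update step (the sizes 3, 5, 2 and the 0.1 weight scale are arbitrary assumptions):
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 1))               # input x_t (size 3)
h_prev = np.zeros((5, 1))                       # previous hidden state h_{t-1} (size 5)
Wxh = rng.standard_normal((5, 3)) * 0.1         # input → hidden
Whh = rng.standard_normal((5, 5)) * 0.1         # hidden → hidden (recurrent)
Why = rng.standard_normal((2, 5)) * 0.1         # hidden → output
bh, by = np.zeros((5, 1)), np.zeros((2, 1))
h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
y_t = Why @ h_t + by                            # y_t = W_hy h_t + b_y
print(h_t.shape, y_t.shape)                     # (5, 1) (2, 1)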
Parameter Sharing
An important feature of RNNs is that they share the same parameters across all time steps.
Important: Parameter sharing allows handling sequences of arbitrary length while significantly reducing the number of parameters.
| Comparison | Feedforward NN | RNN |
|---|---|---|
| Parameters | Independent per layer | Shared across all time steps |
| Input Length | Fixed | Variable |
| Memory Mechanism | None | Hidden state |
| Computation Graph | Acyclic | Cyclic (recurrent) |
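The sharing claim is easy to verify with a small sketch (hypothetical sizes): the parameter count of a recurrent layer is fixed when the layer is built and does not depend on the length of the sequences it processes.
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
n_params = sum(p.numel() for p in rnn.parameters())   # fixed once the layer is built
for seq_len in (3, 10, 100):
    output, hn = rnn(torch.randn(1, seq_len, 4))      # the same weights are reused at every step
    print(f"seq_len={seq_len}: output {tuple(output.shape)}, parameters {n_params}")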
Unrolling
To understand RNN computation, we unroll the network along the time axis and view it as a chain of identical cells sharing one set of weights.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
# Manual RNN implementation (simplified)
class SimpleRNN:
def __init__(self, input_size, hidden_size, output_size):
# Parameter initialization (simplified Xavier initialization)
self.Wxh = np.random.randn(hidden_size, input_size) * 0.01
self.Whh = np.random.randn(hidden_size, hidden_size) * 0.01
self.Why = np.random.randn(output_size, hidden_size) * 0.01
self.bh = np.zeros((hidden_size, 1))
self.by = np.zeros((output_size, 1))
self.hidden_size = hidden_size
def forward(self, inputs):
"""
Perform forward propagation
Parameters:
-----------
inputs : list of np.array
List of input vectors at each time step [x_0, x_1, ..., x_T]
Returns:
--------
outputs : list of np.array
Outputs at each time step
hidden_states : list of np.array
Hidden states at each time step
"""
h = np.zeros((self.hidden_size, 1)) # Initial hidden state
hidden_states = []
outputs = []
for x in inputs:
# Update hidden state: h_t = tanh(Wxh @ x_t + Whh @ h_{t-1} + bh)
h = np.tanh(np.dot(self.Wxh, x) + np.dot(self.Whh, h) + self.bh)
# Output: y_t = Why @ h_t + by
y = np.dot(self.Why, h) + self.by
hidden_states.append(h)
outputs.append(y)
return outputs, hidden_states
# Usage example
input_size = 3
hidden_size = 5
output_size = 2
sequence_length = 4
rnn = SimpleRNN(input_size, hidden_size, output_size)
# Dummy sequence input
inputs = [np.random.randn(input_size, 1) for _ in range(sequence_length)]
# Forward propagation
outputs, hidden_states = rnn.forward(inputs)
print("=== SimpleRNN Operation Check ===\n")
print(f"Input size: {input_size}")
print(f"Hidden state size: {hidden_size}")
print(f"Output size: {output_size}")
print(f"Sequence length: {sequence_length}\n")
print("Shape of hidden states at each time step:")
for t, h in enumerate(hidden_states):
print(f" h_{t}: {h.shape}")
print("\nShape of outputs at each time step:")
for t, y in enumerate(outputs):
print(f" y_{t}: {y.shape}")
print("\nParameters:")
print(f" Wxh (input→hidden): {rnn.Wxh.shape}")
print(f" Whh (hidden→hidden): {rnn.Whh.shape}")
print(f" Why (hidden→output): {rnn.Why.shape}")
Output:
=== SimpleRNN Operation Check ===
Input size: 3
Hidden state size: 5
Output size: 2
Sequence length: 4
Shape of hidden states at each time step:
h_0: (5, 1)
h_1: (5, 1)
h_2: (5, 1)
h_3: (5, 1)
Shape of outputs at each time step:
y_0: (2, 1)
y_1: (2, 1)
y_2: (2, 1)
y_3: (2, 1)
Parameters:
Wxh (input→hidden): (5, 3)
Whh (hidden→hidden): (5, 5)
Why (hidden→output): (2, 5)
1.3 Mathematical Definition of Forward Propagation
Detailed Computation Process
Let's examine RNN forward propagation step by step. Consider an input sequence $(x_1, x_2, \ldots, x_T)$ of length $T$.
Step 1: Initial Hidden State
$$ h_0 = \mathbf{0} \quad \text{or} \quad h_0 \sim \mathcal{N}(0, \sigma^2) $$
Typically, the initial hidden state is initialized as a zero vector.
Step 2: Update Hidden State at Each Time Step
For time $t = 1, 2, \ldots, T$:
$$ \begin{align} a_t &= W_{xh} x_t + W_{hh} h_{t-1} + b_h \quad \text{(Linear transformation)} \\ h_t &= \tanh(a_t) \quad \text{(Activation function)} \end{align} $$
Step 3: Output Calculation
$$ y_t = W_{hy} h_t + b_y $$
For classification tasks, apply the softmax function additionally:
$$ \hat{y}_{t,i} = \text{softmax}(y_t)_i = \frac{\exp(y_{t,i})}{\sum_j \exp(y_{t,j})} $$
Concrete Numerical Example
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Concrete Numerical Example
Purpose: Demonstrate core concepts and implementation patterns
Target: Intermediate
Execution time: 5-10 seconds
Dependencies: None
"""
import numpy as np
# Trace computation with a small example
np.random.seed(42)
# Parameter settings (small size for simplicity)
input_size = 2
hidden_size = 3
output_size = 1
# Weight initialization (fixed values for easy verification)
Wxh = np.array([[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6]]) # (3, 2)
Whh = np.array([[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6],
[0.7, 0.8, 0.9]]) # (3, 3)
Why = np.array([[0.2, 0.4, 0.6]]) # (1, 3)
bh = np.zeros((3, 1))
by = np.zeros((1, 1))
# Sequence input (3 time steps)
x1 = np.array([[1.0], [0.5]])
x2 = np.array([[0.8], [0.3]])
x3 = np.array([[0.6], [0.9]])
print("=== Detailed RNN Forward Propagation ===\n")
# Initial hidden state
h0 = np.zeros((3, 1))
print("Initial hidden state h_0:")
print(h0.T)
# Time t=1
print("\n--- Time t=1 ---")
print(f"Input x_1: {x1.T}")
a1 = np.dot(Wxh, x1) + np.dot(Whh, h0) + bh
print(f"Linear transformation a_1 = Wxh @ x_1 + Whh @ h_0 + bh:")
print(a1.T)
h1 = np.tanh(a1)
print(f"Hidden state h_1 = tanh(a_1):")
print(h1.T)
y1 = np.dot(Why, h1) + by
print(f"Output y_1 = Why @ h_1 + by:")
print(y1.T)
# Time t=2
print("\n--- Time t=2 ---")
print(f"Input x_2: {x2.T}")
a2 = np.dot(Wxh, x2) + np.dot(Whh, h1) + bh
print(f"Linear transformation a_2 = Wxh @ x_2 + Whh @ h_1 + bh:")
print(a2.T)
h2 = np.tanh(a2)
print(f"Hidden state h_2 = tanh(a_2):")
print(h2.T)
y2 = np.dot(Why, h2) + by
print(f"Output y_2 = Why @ h_2 + by:")
print(y2.T)
# Time t=3
print("\n--- Time t=3 ---")
print(f"Input x_3: {x3.T}")
a3 = np.dot(Wxh, x3) + np.dot(Whh, h2) + bh
print(f"Linear transformation a_3 = Wxh @ x_3 + Whh @ h_2 + bh:")
print(a3.T)
h3 = np.tanh(a3)
print(f"Hidden state h_3 = tanh(a_3):")
print(h3.T)
y3 = np.dot(Why, h3) + by
print(f"Output y_3 = Why @ h_3 + by:")
print(y3.T)
print("\n=== Summary ===")
print("Hidden states are updated over time and retain past information")
Output Example:
=== Detailed RNN Forward Propagation ===
Initial hidden state h_0:
[[0. 0. 0.]]
--- Time t=1 ---
Input x_1: [[1. 0.5]]
Linear transformation a_1 = Wxh @ x_1 + Whh @ h_0 + bh:
[[0.2 0.5 0.8]]
Hidden state h_1 = tanh(a_1):
[[0.19737532 0.46211716 0.66403677]]
Output y_1 = Why @ h_1 + by:
[[0.62274399]]
--- Time t=2 ---
Input x_2: [[0.8 0.3]]
Linear transformation a_2 = Wxh @ x_2 + Whh @ h_1 + bh:
[[0.45137199 1.06843077 1.68548954]]
Hidden state h_2 = tanh(a_2):
[[0.42302613 0.78886927 0.9335703 ]]
Output y_2 = Why @ h_2 + by:
[[0.96029511]]
--- Time t=3 ---
Input x_3: [[0.6 0.9]]
Linear transformation a_3 = Wxh @ x_3 + Whh @ h_2 + bh:
[[0.72014756 1.66378727 2.60742698]]
Hidden state h_3 = tanh(a_3):
[[0.6170007  0.93072552 0.98918831]]
Output y_3 = Why @ h_3 + by:
[[1.08920334]]
=== Summary ===
Hidden states are updated over time and retain past information
1.4 Backpropagation Through Time (BPTT)
Basic Principles of BPTT
Backpropagation Through Time (BPTT) is the learning algorithm for RNNs. It applies standard backpropagation to the unrolled network.
Loss Function
The loss over the entire sequence is the sum of losses at each time step:
$$ L = \sum_{t=1}^{T} L_t = \sum_{t=1}^{T} \mathcal{L}(y_t, \hat{y}_t) $$
Gradient Computation
The gradient with respect to hidden state $h_t$ at time $t$ includes contributions from all future time steps:
$$ \frac{\partial L}{\partial h_t} = \frac{\partial L_t}{\partial h_t} + \frac{\partial L}{\partial h_{t+1}} \frac{\partial h_{t+1}}{\partial h_t} $$
Expanding this recursively:
$$ \frac{\partial L}{\partial h_t} = \sum_{k=t}^{T} \frac{\partial L_k}{\partial h_k} \prod_{j=t+1}^{k} \frac{\partial h_j}{\partial h_{j-1}} $$
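For a tiny network, this product of Jacobians can be formed explicitly. The sketch below is a minimal illustration; the sizes, the 0.5 weight scale, and the random stand-ins for the input terms are all arbitrary assumptions. It accumulates $\partial h_5 / \partial h_0$ step by step:
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
rng = np.random.default_rng(0)
Whh = rng.standard_normal((3, 3)) * 0.5
h = np.zeros(3)
jacobians = []
for t in range(5):
    a = Whh @ h + rng.standard_normal(3) * 0.1   # stand-in for Wxh @ x_t + bh
    h = np.tanh(a)
    jacobians.append(np.diag(1 - h**2) @ Whh)    # dh_t/dh_{t-1} = diag(tanh'(a_t)) Whh
J = np.eye(3)
for Jt in jacobians:
    J = Jt @ J                                   # accumulate dh_5/dh_0 = J_5 J_4 ... J_1
print(np.linalg.norm(J))                         # shrinks or grows as more factors are multiplied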
Vanishing and Exploding Gradient Problems
The biggest challenges in BPTT are vanishing gradients and exploding gradients.
Core Issue: For long sequences, gradients either decay or grow exponentially as they backpropagate through time.
Mathematical Explanation
The gradient of the hidden state is computed via the chain rule:
$$ \frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(\tanh'(a_t)) \, W_{hh} $$
Going back $k$ time steps:
$$ \frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=0}^{k-1} \frac{\partial h_{t-j}}{\partial h_{t-j-1}} $$
This product leads to:
- Vanishing gradients: since $|\tanh'(x)| \leq 1$ always, when $\|W_{hh}\| \leq 1$ the product shrinks toward 0
- Exploding gradients: when $\|W_{hh}\|$ is large enough to outweigh the $\tanh'$ factors, the product diverges
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
import numpy as np
import matplotlib.pyplot as plt
# Simulate vanishing/exploding gradients
def simulate_gradient_flow(W_norm, sequence_length=50):
"""
Simulate gradient propagation
Parameters:
-----------
W_norm : float
Norm of weight matrix (simplified to 1D)
sequence_length : int
Length of sequence
Returns:
--------
gradients : np.array
Magnitude of gradients at each time step
"""
# Average value of tanh derivative (approximately 0.4)
tanh_derivative = 0.4
# Initial gradient value
gradient = 1.0
gradients = [gradient]
# Calculate gradients going back in time
for t in range(sequence_length - 1):
gradient *= W_norm * tanh_derivative
gradients.append(gradient)
return np.array(gradients[::-1]) # Reverse to chronological order
# Simulate with different norms
sequence_length = 50
W_norms = [0.5, 1.0, 2.0, 4.0]
colors = ['blue', 'green', 'orange', 'red']
# Effective per-step factor = W_norm * tanh_derivative (0.4)
labels = ['W_norm=0.5 (factor 0.2, vanishing)', 'W_norm=1.0 (factor 0.4, vanishing)', 'W_norm=2.0 (factor 0.8, vanishing)', 'W_norm=4.0 (factor 1.6, exploding)']
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
for W_norm, color, label in zip(W_norms, colors, labels):
gradients = simulate_gradient_flow(W_norm, sequence_length)
# Linear scale
ax1.plot(gradients, color=color, label=label, linewidth=2)
# Logarithmic scale
ax2.semilogy(gradients, color=color, label=label, linewidth=2)
ax1.set_xlabel('Time step (distant past → present)')
ax1.set_ylabel('Gradient magnitude')
ax1.set_title('Gradient Propagation (Linear Scale)')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax2.set_xlabel('Time step (distant past → present)')
ax2.set_ylabel('Gradient magnitude (log)')
ax2.set_title('Gradient Propagation (Logarithmic Scale)')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Visualization of vanishing/exploding gradients displayed")
# Numerical analysis
print("\n=== Analysis of Gradient Decay/Growth ===\n")
for W_norm in W_norms:
gradients = simulate_gradient_flow(W_norm, sequence_length)
print(f"W_norm={W_norm}:")
print(f" Initial gradient: {gradients[-1]:.6f}")
print(f" Gradient 50 steps back: {gradients[0]:.6e}")
print(f" Decay rate: {gradients[0]/gradients[-1]:.6e}\n")
Output Example:
Visualization of vanishing/exploding gradients displayed
=== Analysis of Gradient Decay/Growth ===
W_norm=0.5:
Initial gradient: 1.000000
Gradient 50 steps back: 5.629500e-35
Decay rate: 5.629500e-35
W_norm=1.0:
Initial gradient: 1.000000
Gradient 50 steps back: 3.169127e-20
Decay rate: 3.169127e-20
W_norm=2.0:
Initial gradient: 1.000000
Gradient 50 steps back: 1.784060e-05
Decay rate: 1.784060e-05
W_norm=4.0:
Initial gradient: 1.000000
Gradient 50 steps back: 1.004336e+10
Decay rate: 1.004336e+10
Solution for Exploding Gradients: Gradient Clipping
Gradient clipping scales gradients when their norm exceeds a threshold.
$$ \mathbf{g} \leftarrow \begin{cases} \mathbf{g} & \text{if } \|\mathbf{g}\| \leq \theta \\ \theta \frac{\mathbf{g}}{\|\mathbf{g}\|} & \text{if } \|\mathbf{g}\| > \theta \end{cases} $$
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
def gradient_clipping_example():
"""Example implementation of gradient clipping"""
# Dummy parameters and gradients
params = [
torch.randn(10, 10, requires_grad=True),
torch.randn(5, 10, requires_grad=True)
]
# Set large gradients (simulate explosion)
params[0].grad = torch.randn(10, 10) * 100 # Large gradients
params[1].grad = torch.randn(5, 10) * 100
# Norm before clipping
total_norm_before = 0
for p in params:
if p.grad is not None:
total_norm_before += p.grad.data.norm(2).item() ** 2
total_norm_before = total_norm_before ** 0.5
# Gradient clipping
max_norm = 1.0
torch.nn.utils.clip_grad_norm_(params, max_norm)
# Norm after clipping
total_norm_after = 0
for p in params:
if p.grad is not None:
total_norm_after += p.grad.data.norm(2).item() ** 2
total_norm_after = total_norm_after ** 0.5
print("=== Effect of Gradient Clipping ===\n")
print(f"Gradient norm before clipping: {total_norm_before:.4f}")
print(f"Gradient norm after clipping: {total_norm_after:.4f}")
print(f"Threshold: {max_norm}")
print(f"\nGradients constrained below threshold!")
gradient_clipping_example()
Output Example:
=== Effect of Gradient Clipping ===
Gradient norm before clipping: 1224.7449
Gradient norm after clipping: 1.0000
Threshold: 1.0
Gradients constrained below threshold!
Solution for Vanishing Gradients: Architectural Improvements
Fundamental solutions to the vanishing gradient problem include the following architectures:
| Method | Key Features | Effect |
|---|---|---|
| LSTM | Control information flow with gate mechanisms | Learn long-term dependencies |
| GRU | Simplified LSTM with 2 gates | Reduced parameters, faster |
| Residual Connection | Direct gradient propagation via skip connections | Train deep networks |
| Layer Normalization | Normalize per layer | Stabilize training |
Note: LSTM and GRU will be covered in detail in the next chapter.
1.5 RNN Implementation in PyTorch
Basic Usage of nn.RNN
In PyTorch, use the torch.nn.RNN class to define RNN layers.
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Using the torch.nn.RNN class to define RNN layers
Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
# Basic RNN syntax
rnn = nn.RNN(
input_size=10, # Input feature dimension
hidden_size=20, # Hidden state dimension
num_layers=2, # Number of RNN layers
nonlinearity='tanh', # Activation function ('tanh' or 'relu')
batch_first=True, # (batch, seq, feature) order
dropout=0.0, # Dropout rate (between layers)
bidirectional=False # Bidirectional RNN
)
# Dummy input (batch size 3, sequence length 5, features 10)
x = torch.randn(3, 5, 10)
# Initial hidden state (num_layers, batch, hidden_size)
h0 = torch.zeros(2, 3, 20)
# Forward propagation
output, hn = rnn(x, h0)
print("=== PyTorch RNN Operation Check ===\n")
print(f"Input size: {x.shape}")
print(f" [Batch, Sequence, Features] = [{x.shape[0]}, {x.shape[1]}, {x.shape[2]}]")
print(f"\nInitial hidden state: {h0.shape}")
print(f" [Layers, Batch, Hidden] = [{h0.shape[0]}, {h0.shape[1]}, {h0.shape[2]}]")
print(f"\nOutput size: {output.shape}")
print(f" [Batch, Sequence, Hidden] = [{output.shape[0]}, {output.shape[1]}, {output.shape[2]}]")
print(f"\nFinal hidden state: {hn.shape}")
print(f" [Layers, Batch, Hidden] = [{hn.shape[0]}, {hn.shape[1]}, {hn.shape[2]}]")
# Calculate parameter count
total_params = sum(p.numel() for p in rnn.parameters())
print(f"\nTotal parameters: {total_params:,}")
Output:
=== PyTorch RNN Operation Check ===
Input size: torch.Size([3, 5, 10])
[Batch, Sequence, Features] = [3, 5, 10]
Initial hidden state: torch.Size([2, 3, 20])
[Layers, Batch, Hidden] = [2, 3, 20]
Output size: torch.Size([3, 5, 20])
[Batch, Sequence, Hidden] = [3, 5, 20]
Final hidden state: torch.Size([2, 3, 20])
[Layers, Batch, Hidden] = [2, 3, 20]
Total parameters: 1,480
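As a sanity check, this figure can be reproduced by hand. PyTorch stores two bias vectors per layer (b_ih and b_hh), and layer 1 receives the 20-dimensional output of layer 0:
# Layer 0: W_ih (20x10) + W_hh (20x20) + b_ih (20) + b_hh (20) = 640
layer0 = 20 * 10 + 20 * 20 + 20 + 20
# Layer 1: W_ih (20x20) + W_hh (20x20) + b_ih (20) + b_hh (20) = 840
layer1 = 20 * 20 + 20 * 20 + 20 + 20
print(layer0 + layer1)  # 1480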
nn.RNNCell vs nn.RNN
PyTorch has two RNN implementations:
| Class | Features | Use Cases |
|---|---|---|
| nn.RNN | Process entire sequence at once | Standard use, efficient |
| nn.RNNCell | Process one time step at a time manually | When custom control is needed |
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: PyTorch has two RNN implementations:
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
# Usage example of nn.RNNCell
input_size = 5
hidden_size = 10
batch_size = 3
sequence_length = 4
rnn_cell = nn.RNNCell(input_size, hidden_size)
# Sequence input
x_sequence = torch.randn(batch_size, sequence_length, input_size)
# Initial hidden state
h = torch.zeros(batch_size, hidden_size)
print("=== Manual Loop with nn.RNNCell ===\n")
# Manual loop through each time step
outputs = []
for t in range(sequence_length):
x_t = x_sequence[:, t, :] # Input at time t
h = rnn_cell(x_t, h) # Update hidden state
outputs.append(h)
print(f"Time t={t}: input {x_t.shape} → hidden state {h.shape}")
# Stack outputs
output = torch.stack(outputs, dim=1)
print(f"\nAll time steps output: {output.shape}")
print(f" [Batch, Sequence, Hidden] = [{output.shape[0]}, {output.shape[1]}, {output.shape[2]}]")
# Comparison with nn.RNN
rnn = nn.RNN(input_size, hidden_size, batch_first=True)
output_rnn, hn_rnn = rnn(x_sequence)
print(f"\nnn.RNN output: {output_rnn.shape}")
print("→ Same result as nn.RNNCell loop but computed at once")
Output:
=== Manual Loop with nn.RNNCell ===
Time t=0: input torch.Size([3, 5]) → hidden state torch.Size([3, 10])
Time t=1: input torch.Size([3, 5]) → hidden state torch.Size([3, 10])
Time t=2: input torch.Size([3, 5]) → hidden state torch.Size([3, 10])
Time t=3: input torch.Size([3, 5]) → hidden state torch.Size([3, 10])
All time steps output: torch.Size([3, 4, 10])
[Batch, Sequence, Hidden] = [3, 4, 10]
nn.RNN output: torch.Size([3, 4, 10])
→ Same result as nn.RNNCell loop but computed at once
Many-to-One Task: Sentiment Analysis
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import torch.nn.functional as F
class SentimentRNN(nn.Module):
"""
Many-to-One RNN: Classify sentiment from entire sentence
"""
def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
super(SentimentRNN, self).__init__()
# Word embedding layer
self.embedding = nn.Embedding(vocab_size, embedding_dim)
# RNN layer
self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
# Output layer
self.fc = nn.Linear(hidden_dim, output_dim)
def forward(self, x):
# x: (batch, seq_len)
# Word embedding
embedded = self.embedding(x) # (batch, seq_len, embedding_dim)
# RNN
output, hidden = self.rnn(embedded)
# output: (batch, seq_len, hidden_dim)
# hidden: (1, batch, hidden_dim)
# Use only last hidden state (Many-to-One)
last_hidden = hidden.squeeze(0) # (batch, hidden_dim)
# Classification
logits = self.fc(last_hidden) # (batch, output_dim)
return logits
# Model definition
vocab_size = 5000 # Vocabulary size
embedding_dim = 100 # Word embedding dimension
hidden_dim = 128 # Hidden state dimension
output_dim = 2 # 2-class classification (positive/negative)
model = SentimentRNN(vocab_size, embedding_dim, hidden_dim, output_dim)
# Dummy data (batch size 4, sequence length 10)
x = torch.randint(0, vocab_size, (4, 10))
logits = model(x)
print("=== Sentiment RNN (Many-to-One) ===\n")
print(f"Input (word IDs): {x.shape}")
print(f" [Batch, Sequence] = [{x.shape[0]}, {x.shape[1]}]")
print(f"\nOutput (logits): {logits.shape}")
print(f" [Batch, Classes] = [{logits.shape[0]}, {logits.shape[1]}]")
# Convert to probabilities
probs = F.softmax(logits, dim=1)
print(f"\nProbability distribution:")
print(probs)
# Parameter count
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total_params:,}")
print(f" Breakdown:")
print(f" Embedding: {vocab_size * embedding_dim:,}")
print(f" RNN: {(embedding_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim):,}")
print(f" FC: {(hidden_dim * output_dim + output_dim):,}")
Output Example:
=== Sentiment RNN (Many-to-One) ===
Input (word IDs): torch.Size([4, 10])
[Batch, Sequence] = [4, 10]
Output (logits): torch.Size([4, 2])
[Batch, Classes] = [4, 2]
Probability distribution:
tensor([[0.5234, 0.4766],
[0.4892, 0.5108],
[0.5123, 0.4877],
[0.4956, 0.5044]], grad_fn=<SoftmaxBackward0>)
Total parameters: 529,698
Breakdown:
Embedding: 500,000
RNN: 29,440
FC: 258
1.6 Practice: Character-Level Language Model
What is a Character-Level RNN?
A character-level language model learns text at the character level and predicts the next character.
- Input: Previous character sequence
- Output: Probability distribution of the next character
- After training: Can generate new text
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
"""
Example: A character-level language model that learns text character by character
Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: 10-30 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
# Sample text (Shakespeare-style)
text = """To be or not to be, that is the question.
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles"""
# Create character set
chars = sorted(list(set(text)))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
vocab_size = len(chars)
print("=== Character-Level Language Model ===\n")
print(f"Text length: {len(text)} characters")
print(f"Vocabulary size: {vocab_size} characters")
print(f"Character set: {''.join(chars)}")
# Convert text to numbers
encoded_text = [char_to_idx[ch] for ch in text]
print(f"\nOriginal text (first 50 characters):")
print(text[:50])
print(f"\nEncoded:")
print(encoded_text[:50])
# RNN language model definition
class CharRNN(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim):
super(CharRNN, self).__init__()
self.hidden_dim = hidden_dim
# Character embedding
self.embedding = nn.Embedding(vocab_size, embedding_dim)
# RNN layer
self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
# Output layer
self.fc = nn.Linear(hidden_dim, vocab_size)
def forward(self, x, hidden=None):
# x: (batch, seq_len)
embedded = self.embedding(x) # (batch, seq_len, embedding_dim)
output, hidden = self.rnn(embedded, hidden) # (batch, seq_len, hidden_dim)
logits = self.fc(output) # (batch, seq_len, vocab_size)
return logits, hidden
def init_hidden(self, batch_size):
return torch.zeros(1, batch_size, self.hidden_dim)
# Model initialization
embedding_dim = 32
hidden_dim = 64
model = CharRNN(vocab_size, embedding_dim, hidden_dim)
print(f"\n=== CharRNN Model ===")
print(f"Embedding dimension: {embedding_dim}")
print(f"Hidden state dimension: {hidden_dim}")
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
Output Example:
=== Character-Level Language Model ===
Text length: 170 characters
Vocabulary size: 27 characters
Character set: 
 ',.OTWabdefghiklmnoqrstuw
Original text (first 50 characters):
To be or not to be, that is the question.
Whether 
Encoded:
[6, 20, 1, 9, 11, 1, 20, 22, 1, 19, 20, 24, 1, 24, 20, 1, 9, 11, 3, 1, 24, 14, 8, 24, 1, 15, 23, 1, 24, 14, 11, 1, 21, 25, 11, 23, 24, 15, 20, 19, 4, 0, 7, 14, 11, 24, 14, 11, 22, 1]
=== CharRNN Model ===
Embedding dimension: 32
Hidden state dimension: 64
Total parameters: 8,891
Training and Text Generation
def create_sequences(encoded_text, seq_length):
"""
Create training sequences
"""
X, y = [], []
for i in range(len(encoded_text) - seq_length):
X.append(encoded_text[i:i+seq_length])
y.append(encoded_text[i+1:i+seq_length+1])
return torch.LongTensor(X), torch.LongTensor(y)
# Data preparation
seq_length = 25
X, y = create_sequences(encoded_text, seq_length)
print(f"=== Dataset ===")
print(f"Number of sequences: {len(X)}")
print(f"Input size: {X.shape}")
print(f"Target size: {y.shape}")
# Training setup
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
# Simplified training loop
num_epochs = 100
batch_size = 32
print(f"\n=== Training Started ===")
model.train()
for epoch in range(num_epochs):
total_loss = 0
hidden = None
for i in range(0, len(X), batch_size):
batch_X = X[i:i+batch_size]
batch_y = y[i:i+batch_size]
        # Reset the carried hidden state when the batch size changes (the final batch may be smaller)
        if hidden is not None and hidden.size(1) != batch_X.size(0):
            hidden = None
        # Forward propagation
        optimizer.zero_grad()
        logits, hidden = model(batch_X, hidden)
# Loss calculation (reshape required)
loss = criterion(logits.view(-1, vocab_size), batch_y.view(-1))
# Backpropagation
loss.backward()
# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
optimizer.step()
# Detach hidden state (truncate BPTT)
hidden = hidden.detach()
total_loss += loss.item()
if (epoch + 1) % 20 == 0:
        avg_loss = total_loss / ((len(X) + batch_size - 1) // batch_size)
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")
print("\nTraining complete!")
# Text generation
def generate_text(model, start_str, length=100, temperature=1.0):
"""
Generate text with trained model
Parameters:
-----------
model : CharRNN
Trained model
start_str : str
Starting string
length : int
Number of characters to generate
temperature : float
Temperature parameter (higher = more random)
"""
model.eval()
# Encode starting string
chars_encoded = [char_to_idx[ch] for ch in start_str]
input_seq = torch.LongTensor(chars_encoded).unsqueeze(0)
hidden = None
generated = start_str
with torch.no_grad():
for _ in range(length):
# Prediction
logits, hidden = model(input_seq, hidden)
# Output at last time step
logits = logits[0, -1, :] / temperature
probs = torch.softmax(logits, dim=0)
# Sampling
next_char_idx = torch.multinomial(probs, 1).item()
next_char = idx_to_char[next_char_idx]
generated += next_char
# Next input
input_seq = torch.LongTensor([[next_char_idx]])
return generated
# Execute text generation
print("\n=== Text Generation ===\n")
start_str = "To be"
generated_text = generate_text(model, start_str, length=100, temperature=0.8)
print(f"Starting string: '{start_str}'")
print(f"\nGenerated text:")
print(generated_text)
Output Example:
=== Dataset ===
Number of sequences: 145
Input size: torch.Size([145, 25])
Target size: torch.Size([145, 25])
=== Training Started ===
Epoch 20/100, Loss: 2.1234
Epoch 40/100, Loss: 1.8765
Epoch 60/100, Loss: 1.5432
Epoch 80/100, Loss: 1.2987
Epoch 100/100, Loss: 1.0654
Training complete!
=== Text Generation ===
Starting string: 'To be'
Generated text:
To be or not the question.
Whether 'tis nobler in the mind to suffer
The slings and arrows of out
Note: Actual output varies due to training randomness. Better results are obtained with longer text and more epochs.
Summary
In this chapter, we learned about RNN fundamentals and forward propagation.
Key Points
- Sequence data has temporal order and is difficult to handle with traditional NNs
- Hidden states allow RNNs to remember past information
- Parameter sharing enables uniform processing of arbitrary-length sequences
- BPTT is used for training, but vanishing/exploding gradients are challenges
- Gradient clipping suppresses exploding gradients
- PyTorch nn.RNN enables easy implementation
Next Chapter Preview
Chapter 2 will cover the following topics:
- LSTM (Long Short-Term Memory) mechanisms
- GRU (Gated Recurrent Unit) structure
- Bidirectional RNN
- Seq2Seq models and Attention mechanisms
Exercises
Exercise 1: Hidden State Size Calculation
Problem: Calculate the number of parameters for the following RNN.
- Input size: 50
- Hidden state size: 128
- Output size: 10
Solution:
# Wxh: input→hidden
Wxh_params = 50 * 128 = 6,400
# Whh: hidden→hidden
Whh_params = 128 * 128 = 16,384
# Why: hidden→output
Why_params = 128 * 10 = 1,280
# Bias terms
bh_params = 128
by_params = 10
# Total
total = 6,400 + 16,384 + 1,280 + 128 + 10 = 24,202
Answer: 24,202 parameters
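The count can also be cross-checked in PyTorch; note that nn.RNN keeps two bias vectors (b_ih and b_hh), so it reports 128 more parameters than the single-bias formulation used above:
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch.nn as nn
rnn = nn.RNN(input_size=50, hidden_size=128)
fc = nn.Linear(128, 10)
total = sum(p.numel() for p in rnn.parameters()) + sum(p.numel() for p in fc.parameters())
print(total)  # 24330 = 24202 + 128 (the extra hidden-to-hidden bias vector)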
Exercise 2: Sequence Length and Memory Usage
Problem: Calculate the total memory (number of elements) required for hidden states during forward propagation for an RNN with batch size 32, sequence length 100, and hidden state size 256.
Solution:
# Hidden state at each time step: (batch_size, hidden_size)
# Need to store for the sequence length
memory_elements = batch_size * seq_length * hidden_size
= 32 * 100 * 256
= 819,200 elements
# For float32 (4 bytes)
memory_bytes = 819,200 * 4 = 3,276,800 bytes ≈ 3.3 MB
Answer: Approximately 3.3 MB (more is needed during backpropagation)
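The same figure can be verified directly by allocating the hidden-state tensor:
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
h = torch.zeros(100, 32, 256)            # (seq_length, batch_size, hidden_size), float32
print(h.numel())                         # 819200 elements
print(h.numel() * h.element_size())      # 3276800 bytes ≈ 3.3 MB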
Exercise 3: Many-to-Many RNN Implementation
Problem: Implement a Many-to-Many RNN in PyTorch for part-of-speech tagging (assigning POS labels to each word).
Solution Example:
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
class POSTaggingRNN(nn.Module):
"""
Many-to-Many RNN: Predict POS tag for each word
"""
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_tags):
super(POSTaggingRNN, self).__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
self.fc = nn.Linear(hidden_dim, num_tags)
def forward(self, x):
# x: (batch, seq_len)
embedded = self.embedding(x) # (batch, seq_len, embedding_dim)
output, _ = self.rnn(embedded) # (batch, seq_len, hidden_dim)
logits = self.fc(output) # (batch, seq_len, num_tags)
return logits
# Usage example
vocab_size = 5000
embedding_dim = 100
hidden_dim = 128
num_tags = 45 # Penn Treebank POS tag count
model = POSTaggingRNN(vocab_size, embedding_dim, hidden_dim, num_tags)
# Dummy data
x = torch.randint(0, vocab_size, (8, 20)) # batch=8, seq_len=20
logits = model(x)
print(f"Input: {x.shape}")
print(f"Output: {logits.shape}") # (8, 20, 45)
print("Predict POS tag for each word at each time step")
Exercise 4: Vanishing Gradient Experiment
Problem: Train RNNs with different weight initializations and observe the impact of vanishing gradients.
Solution Example:
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
def test_gradient_vanishing(init_scale):
"""
Observe gradients by varying weight initialization scale
"""
rnn = nn.RNN(10, 20, batch_first=True)
# Manually initialize weights
for name, param in rnn.named_parameters():
if 'weight' in name:
nn.init.uniform_(param, -init_scale, init_scale)
# Dummy data (long sequence)
x = torch.randn(1, 50, 10)
target = torch.randn(1, 50, 20)
# Forward propagation
output, _ = rnn(x)
loss = ((output - target) ** 2).mean()
# Backpropagation
loss.backward()
# Calculate gradient norms
grad_norms = []
for name, param in rnn.named_parameters():
if param.grad is not None:
grad_norms.append(param.grad.norm().item())
return grad_norms
# Experiment with different initialization scales
scales = [0.01, 0.1, 0.5, 1.0, 2.0]
results = {scale: test_gradient_vanishing(scale) for scale in scales}
print("=== Gradient Norm Comparison ===")
for scale, norms in results.items():
avg_norm = sum(norms) / len(norms)
print(f"Initialization scale {scale}: average gradient norm = {avg_norm:.6f}")
print("\nToo small scale causes vanishing, too large causes exploding gradients")
Exercise 5: Understanding Bidirectional RNN
Problem: Implement a bidirectional RNN and explain the differences from a unidirectional RNN.
Solution Example:
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Comparing bidirectional and unidirectional RNNs
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
# Bidirectional RNN
bi_rnn = nn.RNN(10, 20, batch_first=True, bidirectional=True)
# Unidirectional RNN
uni_rnn = nn.RNN(10, 20, batch_first=True, bidirectional=False)
# Dummy input
x = torch.randn(3, 5, 10) # batch=3, seq_len=5, features=10
# Forward propagation
bi_output, bi_hidden = bi_rnn(x)
uni_output, uni_hidden = uni_rnn(x)
print("=== Bidirectional RNN vs Unidirectional RNN ===\n")
print(f"Input size: {x.shape}")
print(f"\nBidirectional RNN:")
print(f" Output: {bi_output.shape}") # (3, 5, 40) ← 20*2
print(f" Hidden state: {bi_hidden.shape}") # (2, 3, 20) ← 2 directions
print(f"\nUnidirectional RNN:")
print(f" Output: {uni_output.shape}") # (3, 5, 20)
print(f" Hidden state: {uni_hidden.shape}") # (1, 3, 20)
print("\nBidirectional RNN features:")
print(" ✓ Captures both forward and backward context")
print(" ✓ Output dimension doubles (forward + backward)")
print(" ✓ Uses future information, improves accuracy")
print(" ✗ Not suitable for real-time processing (needs entire sequence)")