This chapter covers RNN fundamentals and forward propagation, starting from the question: what is sequence data? You will learn the characteristics of sequence data, the basic structure of RNNs, and the mathematical definition of forward propagation.
Learning Objectives
By reading this chapter, you will be able to:
- ✅ Understand the characteristics of sequence data and the need for RNNs
- ✅ Explain the basic structure of RNNs and the concept of hidden states
- ✅ Master the mathematical definition and computational process of forward propagation
- ✅ Understand the principles of Backpropagation Through Time (BPTT)
- ✅ Explain the causes and solutions for vanishing/exploding gradient problems
- ✅ Implement RNNs in PyTorch and build character-level language models
1.1 What is Sequence Data?
Limitations of Traditional Neural Networks
Traditional feedforward networks accept fixed-length inputs and return fixed-length outputs. However, much real-world data consists of variable-length sequences.
"Sequence data has temporal or spatial order. The ability to remember past information and predict the future is necessary."
Examples of Sequence Data
| Data Type | Examples | Characteristics |
|---|---|---|
| Natural Language | Sentences, conversations, translations | Word order determines meaning |
| Speech | Speech recognition, music generation | Has temporal continuity |
| Time-Series Data | Stock prices, temperature, sensor values | Past values influence the future |
| Video | Action recognition, video generation | Temporal dependencies between frames |
| DNA Sequences | Genetic analysis, protein prediction | Base sequence order determines function |
Problem: Why Feedforward Networks Are Insufficient
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Why feedforward networks are insufficient for sequence data
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
import numpy as np
# Example of sequence data: simple sentences
sentence1 = ["I", "love", "machine", "learning"]
sentence2 = ["machine", "learning", "I", "love"]
print("Sentence 1:", " ".join(sentence1))
print("Sentence 2:", " ".join(sentence2))
print("\nSame words but different meanings")
# Problems with feedforward networks
# 1. Fixed-length input is required
# 2. Word order information is lost (Bag-of-Words approach)
# Example of sequences with different lengths
sequences = [
["Hello"],
["How", "are", "you"],
["The", "quick", "brown", "fox", "jumps"]
]
print("\n=== Variable-Length Sequence Problem ===")
for i, seq in enumerate(sequences, 1):
print(f"Sequence {i}: length={len(seq)}, content={seq}")
print("\nFeedforward networks cannot handle these uniformly")
print("→ RNNs are needed!")
Output:
Sentence 1: I love machine learning
Sentence 2: machine learning I love
Same words but different meanings
=== Variable-Length Sequence Problem ===
Sequence 1: length=1, content=['Hello']
Sequence 2: length=3, content=['How', 'are', 'you']
Sequence 3: length=5, content=['The', 'quick', 'brown', 'fox', 'jumps']
Feedforward networks cannot handle these uniformly
→ RNNs are needed!
RNN Task Classification
RNNs are classified by input-output format as follows:
| Type | Input→Output | Applications |
|---|---|---|
| One-to-Many | 1 → N | Image captioning, music generation |
| Many-to-One | N → 1 | Sentiment analysis, document classification |
| Many-to-Many (Asynchronous) | N → M | Machine translation, text summarization |
| Many-to-Many (Synchronized) | N → N | POS tagging, video frame classification |
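To make the table concrete, here is a minimal sketch (the sizes are arbitrary illustrations) using the PyTorch nn.RNN layer that Section 1.5 introduces in detail. It shows that Many-to-One and synchronized Many-to-Many are just different read-outs of the same recurrent layer; the asynchronous patterns additionally require a decoder loop.
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 6, 8)      # batch=2, sequence length=6, features=8
output, hn = rnn(x)           # output: (2, 6, 16), hn: (1, 2, 16)
many_to_one = hn[-1]          # Many-to-One: keep only the final hidden state → (2, 16)
many_to_many = output         # Many-to-Many (synchronized): keep the output at every step → (2, 6, 16)
print(many_to_one.shape, many_to_many.shape)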
1.2 Basic Structure of RNNs
Concept of Hidden States
The core of RNNs (Recurrent Neural Networks) is a mechanism for remembering past information through hidden states.
Basic RNN Equations
The hidden state $h_t$ and output $y_t$ at time $t$ are calculated as follows:
$$ \begin{align} h_t &= \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) \\ y_t &= W_{hy} h_t + b_y \end{align} $$
Where:
- $x_t$: Input vector at time $t$
- $h_t$: Hidden state at time $t$
- $h_{t-1}$: Hidden state at time $t-1$ (memory from previous time)
- $y_t$: Output at time $t$
- $W_{xh}$: Weight matrix from input to hidden state
- $W_{hh}$: Weight matrix from hidden state to hidden state (recurrent connection)
- $W_{hy}$: Weight matrix from hidden state to output
- $b_h, b_y$: Bias terms
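As a quick check on shapes, here is a minimal NumPy sketch of a single update step (the sizes 3, 5, 2 and the 0.1 weight scale are arbitrary assumptions):
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 1))               # input x_t (size 3)
h_prev = np.zeros((5, 1))                       # previous hidden state h_{t-1} (size 5)
Wxh = rng.standard_normal((5, 3)) * 0.1         # input → hidden
Whh = rng.standard_normal((5, 5)) * 0.1         # hidden → hidden (recurrent)
Why = rng.standard_normal((2, 5)) * 0.1         # hidden → output
bh, by = np.zeros((5, 1)), np.zeros((2, 1))
h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
y_t = Why @ h_t + by                            # y_t = W_hy h_t + b_y
print(h_t.shape, y_t.shape)                     # (5, 1) (2, 1)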
Parameter Sharing
An important feature of RNNs is that they share the same parameters across all time steps.
Important: Parameter sharing allows handling sequences of arbitrary length while significantly reducing the number of parameters.
| Comparison | Feedforward NN | RNN |
|---|---|---|
| Parameters | Independent per layer | Shared across all time steps |
| Input Length | Fixed | Variable |
| Memory Mechanism | None | Hidden state |
| Computation Graph | Acyclic | Cyclic (recurrent) |
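The sharing claim is easy to verify with a small sketch (hypothetical sizes): the parameter count of a recurrent layer is fixed when the layer is built and does not depend on the length of the sequences it processes.
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
n_params = sum(p.numel() for p in rnn.parameters())   # fixed once the layer is built
for seq_len in (3, 10, 100):
    output, hn = rnn(torch.randn(1, seq_len, 4))      # the same weights are reused at every step
    print(f"seq_len={seq_len}: output {tuple(output.shape)}, parameters {n_params}")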
Unrolling
To understand RNN computation, we unroll the network along the time axis and view it as a chain of identical cells sharing one set of weights.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
# Manual RNN implementation (simplified)
class SimpleRNN:
def __init__(self, input_size, hidden_size, output_size):
# Parameter initialization (simplified Xavier initialization)
self.Wxh = np.random.randn(hidden_size, input_size) * 0.01
self.Whh = np.random.randn(hidden_size, hidden_size) * 0.01
self.Why = np.random.randn(output_size, hidden_size) * 0.01
self.bh = np.zeros((hidden_size, 1))
self.by = np.zeros((output_size, 1))
self.hidden_size = hidden_size
def forward(self, inputs):
"""
Perform forward propagation
Parameters:
-----------
inputs : list of np.array
List of input vectors at each time step [x_0, x_1, ..., x_T]
Returns:
--------
outputs : list of np.array
Outputs at each time step
hidden_states : list of np.array
Hidden states at each time step
"""
h = np.zeros((self.hidden_size, 1)) # Initial hidden state
hidden_states = []
outputs = []
for x in inputs:
# Update hidden state: h_t = tanh(Wxh @ x_t + Whh @ h_{t-1} + bh)
h = np.tanh(np.dot(self.Wxh, x) + np.dot(self.Whh, h) + self.bh)
# Output: y_t = Why @ h_t + by
y = np.dot(self.Why, h) + self.by
hidden_states.append(h)
outputs.append(y)
return outputs, hidden_states
# Usage example
input_size = 3
hidden_size = 5
output_size = 2
sequence_length = 4
rnn = SimpleRNN(input_size, hidden_size, output_size)
# Dummy sequence input
inputs = [np.random.randn(input_size, 1) for _ in range(sequence_length)]
# Forward propagation
outputs, hidden_states = rnn.forward(inputs)
print("=== SimpleRNN Operation Check ===\n")
print(f"Input size: {input_size}")
print(f"Hidden state size: {hidden_size}")
print(f"Output size: {output_size}")
print(f"Sequence length: {sequence_length}\n")
print("Shape of hidden states at each time step:")
for t, h in enumerate(hidden_states):
print(f" h_{t}: {h.shape}")
print("\nShape of outputs at each time step:")
for t, y in enumerate(outputs):
print(f" y_{t}: {y.shape}")
print("\nParameters:")
print(f" Wxh (input→hidden): {rnn.Wxh.shape}")
print(f" Whh (hidden→hidden): {rnn.Whh.shape}")
print(f" Why (hidden→output): {rnn.Why.shape}")
Output:
=== SimpleRNN Operation Check ===
Input size: 3
Hidden state size: 5
Output size: 2
Sequence length: 4
Shape of hidden states at each time step:
h_0: (5, 1)
h_1: (5, 1)
h_2: (5, 1)
h_3: (5, 1)
Shape of outputs at each time step:
y_0: (2, 1)
y_1: (2, 1)
y_2: (2, 1)
y_3: (2, 1)
Parameters:
Wxh (input→hidden): (5, 3)
Whh (hidden→hidden): (5, 5)
Why (hidden→output): (2, 5)
1.3 Mathematical Definition of Forward Propagation
Detailed Computation Process
Let's examine RNN forward propagation step by step. Consider an input sequence $(x_1, x_2, \ldots, x_T)$ of length $T$.
Step 1: Initial Hidden State
$$ h_0 = \mathbf{0} \quad \text{or} \quad h_0 \sim \mathcal{N}(0, \sigma^2) $$
Typically, the initial hidden state is initialized as a zero vector.
Step 2: Update Hidden State at Each Time Step
For time $t = 1, 2, \ldots, T$:
$$ \begin{align} a_t &= W_{xh} x_t + W_{hh} h_{t-1} + b_h \quad \text{(Linear transformation)} \\ h_t &= \tanh(a_t) \quad \text{(Activation function)} \end{align} $$
Step 3: Output Calculation
$$ y_t = W_{hy} h_t + b_y $$
For classification tasks, apply the softmax function additionally:
$$ \hat{y}_{t,i} = \text{softmax}(y_t)_i = \frac{\exp(y_{t,i})}{\sum_j \exp(y_{t,j})} $$
Concrete Numerical Example
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Concrete Numerical Example
Purpose: Demonstrate core concepts and implementation patterns
Target: Intermediate
Execution time: 5-10 seconds
Dependencies: None
"""
import numpy as np
# Trace computation with a small example
np.random.seed(42)
# Parameter settings (small size for simplicity)
input_size = 2
hidden_size = 3
output_size = 1
# Weight initialization (fixed values for easy verification)
Wxh = np.array([[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6]]) # (3, 2)
Whh = np.array([[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6],
[0.7, 0.8, 0.9]]) # (3, 3)
Why = np.array([[0.2, 0.4, 0.6]]) # (1, 3)
bh = np.zeros((3, 1))
by = np.zeros((1, 1))
# Sequence input (3 time steps)
x1 = np.array([[1.0], [0.5]])
x2 = np.array([[0.8], [0.3]])
x3 = np.array([[0.6], [0.9]])
print("=== Detailed RNN Forward Propagation ===\n")
# Initial hidden state
h0 = np.zeros((3, 1))
print("Initial hidden state h_0:")
print(h0.T)
# Time t=1
print("\n--- Time t=1 ---")
print(f"Input x_1: {x1.T}")
a1 = np.dot(Wxh, x1) + np.dot(Whh, h0) + bh
print(f"Linear transformation a_1 = Wxh @ x_1 + Whh @ h_0 + bh:")
print(a1.T)
h1 = np.tanh(a1)
print(f"Hidden state h_1 = tanh(a_1):")
print(h1.T)
y1 = np.dot(Why, h1) + by
print(f"Output y_1 = Why @ h_1 + by:")
print(y1.T)
# Time t=2
print("\n--- Time t=2 ---")
print(f"Input x_2: {x2.T}")
a2 = np.dot(Wxh, x2) + np.dot(Whh, h1) + bh
print(f"Linear transformation a_2 = Wxh @ x_2 + Whh @ h_1 + bh:")
print(a2.T)
h2 = np.tanh(a2)
print(f"Hidden state h_2 = tanh(a_2):")
print(h2.T)
y2 = np.dot(Why, h2) + by
print(f"Output y_2 = Why @ h_2 + by:")
print(y2.T)
# Time t=3
print("\n--- Time t=3 ---")
print(f"Input x_3: {x3.T}")
a3 = np.dot(Wxh, x3) + np.dot(Whh, h2) + bh
print(f"Linear transformation a_3 = Wxh @ x_3 + Whh @ h_2 + bh:")
print(a3.T)
h3 = np.tanh(a3)
print(f"Hidden state h_3 = tanh(a_3):")
print(h3.T)
y3 = np.dot(Why, h3) + by
print(f"Output y_3 = Why @ h_3 + by:")
print(y3.T)
print("\n=== Summary ===")
print("Hidden states are updated over time and retain past information")
Output Example:
=== Detailed RNN Forward Propagation ===
Initial hidden state h_0:
[[0. 0. 0.]]
--- Time t=1 ---
Input x_1: [[1. 0.5]]
Linear transformation a_1 = Wxh @ x_1 + Whh @ h_0 + bh:
[[0.2 0.5 0.8]]
Hidden state h_1 = tanh(a_1):
[[0.19737532 0.46211716 0.66403677]]
Output y_1 = Why @ h_1 + by:
[[0.62274399]]
--- Time t=2 ---
Input x_2: [[0.8 0.3]]
Linear transformation a_2 = Wxh @ x_2 + Whh @ h_1 + bh:
[[0.45137199 1.06843077 1.68548954]]
Hidden state h_2 = tanh(a_2):
[[0.42302613 0.78886927 0.9335703 ]]
Output y_2 = Why @ h_2 + by:
[[0.96029511]]
--- Time t=3 ---
Input x_3: [[0.6 0.9]]
Linear transformation a_3 = Wxh @ x_3 + Whh @ h_2 + bh:
[[0.72014756 1.66378727 2.60742698]]
Hidden state h_3 = tanh(a_3):
[[0.6170007  0.93072552 0.98918831]]
Output y_3 = Why @ h_3 + by:
[[1.08920334]]
=== Summary ===
Hidden states are updated over time and retain past information
1.4 Backpropagation Through Time (BPTT)
Basic Principles of BPTT
Backpropagation Through Time (BPTT) is the learning algorithm for RNNs. It applies standard backpropagation to the unrolled network.
Loss Function
The loss over the entire sequence is the sum of losses at each time step:
$$ L = \sum_{t=1}^{T} L_t = \sum_{t=1}^{T} \mathcal{L}(y_t, \hat{y}_t) $$
Gradient Computation
The gradient with respect to hidden state $h_t$ at time $t$ includes contributions from all future time steps:
$$ \frac{\partial L}{\partial h_t} = \frac{\partial L_t}{\partial h_t} + \frac{\partial L}{\partial h_{t+1}} \frac{\partial h_{t+1}}{\partial h_t} $$
Expanding this recursively:
$$ \frac{\partial L}{\partial h_t} = \sum_{k=t}^{T} \frac{\partial L_k}{\partial h_k} \prod_{j=t+1}^{k} \frac{\partial h_j}{\partial h_{j-1}} $$
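For a tiny network, this product of Jacobians can be formed explicitly. The sketch below is a minimal illustration; the sizes, the 0.5 weight scale, and the random stand-ins for the input terms are all arbitrary assumptions. It accumulates $\partial h_5 / \partial h_0$ step by step:
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
rng = np.random.default_rng(0)
Whh = rng.standard_normal((3, 3)) * 0.5
h = np.zeros(3)
jacobians = []
for t in range(5):
    a = Whh @ h + rng.standard_normal(3) * 0.1   # stand-in for Wxh @ x_t + bh
    h = np.tanh(a)
    jacobians.append(np.diag(1 - h**2) @ Whh)    # dh_t/dh_{t-1} = diag(tanh'(a_t)) Whh
J = np.eye(3)
for Jt in jacobians:
    J = Jt @ J                                   # accumulate dh_5/dh_0 = J_5 J_4 ... J_1
print(np.linalg.norm(J))                         # shrinks or grows as more factors are multiplied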
Vanishing and Exploding Gradient Problems
The biggest challenges in BPTT are vanishing gradients and exploding gradients.
Core Issue: For long sequences, gradients either decay or grow exponentially as they backpropagate through time.
Mathematical Explanation
The gradient of the hidden state is computed via the chain rule:
$$ \frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(\tanh'(a_t)) \, W_{hh} $$
Going back $k$ time steps:
$$ \frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=0}^{k-1} \frac{\partial h_{t-j}}{\partial h_{t-j-1}} $$
This product leads to:
- Vanishing gradients: since $|\tanh'(x)| \leq 1$ always, when $\|W_{hh}\| \leq 1$ the product shrinks toward 0
- Exploding gradients: when $\|W_{hh}\|$ is large enough to outweigh the $\tanh'$ factors, the product diverges
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
import numpy as np
import matplotlib.pyplot as plt
# Simulate vanishing/exploding gradients
def simulate_gradient_flow(W_norm, sequence_length=50):
"""
Simulate gradient propagation
Parameters:
-----------
W_norm : float
Norm of weight matrix (simplified to 1D)
sequence_length : int
Length of sequence
Returns:
--------
gradients : np.array
Magnitude of gradients at each time step
"""
# Average value of tanh derivative (approximately 0.4)
tanh_derivative = 0.4
# Initial gradient value
gradient = 1.0
gradients = [gradient]
# Calculate gradients going back in time
for t in range(sequence_length - 1):
gradient *= W_norm * tanh_derivative
gradients.append(gradient)
return np.array(gradients[::-1]) # Reverse to chronological order
# Simulate with different norms
sequence_length = 50
W_norms = [0.5, 1.0, 2.0, 4.0]
colors = ['blue', 'green', 'orange', 'red']
# Effective per-step factor = W_norm * tanh_derivative (0.4)
labels = ['W_norm=0.5 (factor 0.2, vanishing)', 'W_norm=1.0 (factor 0.4, vanishing)', 'W_norm=2.0 (factor 0.8, vanishing)', 'W_norm=4.0 (factor 1.6, exploding)']
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
for W_norm, color, label in zip(W_norms, colors, labels):
gradients = simulate_gradient_flow(W_norm, sequence_length)
# Linear scale
ax1.plot(gradients, color=color, label=label, linewidth=2)
# Logarithmic scale
ax2.semilogy(gradients, color=color, label=label, linewidth=2)
ax1.set_xlabel('Time step (distant past → present)')
ax1.set_ylabel('Gradient magnitude')
ax1.set_title('Gradient Propagation (Linear Scale)')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax2.set_xlabel('Time step (distant past → present)')
ax2.set_ylabel('Gradient magnitude (log)')
ax2.set_title('Gradient Propagation (Logarithmic Scale)')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Visualization of vanishing/exploding gradients displayed")
# Numerical analysis
print("\n=== Analysis of Gradient Decay/Growth ===\n")
for W_norm in W_norms:
gradients = simulate_gradient_flow(W_norm, sequence_length)
print(f"W_norm={W_norm}:")
print(f" Initial gradient: {gradients[-1]:.6f}")
print(f" Gradient 50 steps back: {gradients[0]:.6e}")
print(f" Decay rate: {gradients[0]/gradients[-1]:.6e}\n")
Output Example:
Visualization of vanishing/exploding gradients displayed
=== Analysis of Gradient Decay/Growth ===
W_norm=0.5:
Initial gradient: 1.000000
Gradient 50 steps back: 5.629500e-35
Decay rate: 5.629500e-35
W_norm=1.0:
Initial gradient: 1.000000
Gradient 50 steps back: 3.169127e-20
Decay rate: 3.169127e-20
W_norm=2.0:
Initial gradient: 1.000000
Gradient 50 steps back: 1.784060e-05
Decay rate: 1.784060e-05
W_norm=4.0:
Initial gradient: 1.000000
Gradient 50 steps back: 1.004336e+10
Decay rate: 1.004336e+10
Solution for Exploding Gradients: Gradient Clipping
Gradient clipping scales gradients when their norm exceeds a threshold.
$$ \mathbf{g} \leftarrow \begin{cases} \mathbf{g} & \text{if } \|\mathbf{g}\| \leq \theta \\ \theta \frac{\mathbf{g}}{\|\mathbf{g}\|} & \text{if } \|\mathbf{g}\| > \theta \end{cases} $$
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
def gradient_clipping_example():
"""Example implementation of gradient clipping"""
# Dummy parameters and gradients
params = [
torch.randn(10, 10, requires_grad=True),
torch.randn(5, 10, requires_grad=True)
]
# Set large gradients (simulate explosion)
params[0].grad = torch.randn(10, 10) * 100 # Large gradients
params[1].grad = torch.randn(5, 10) * 100
# Norm before clipping
total_norm_before = 0
for p in params:
if p.grad is not None:
total_norm_before += p.grad.data.norm(2).item() ** 2
total_norm_before = total_norm_before ** 0.5
# Gradient clipping
max_norm = 1.0
torch.nn.utils.clip_grad_norm_(params, max_norm)
# Norm after clipping
total_norm_after = 0
for p in params:
if p.grad is not None:
total_norm_after += p.grad.data.norm(2).item() ** 2
total_norm_after = total_norm_after ** 0.5
print("=== Effect of Gradient Clipping ===\n")
print(f"Gradient norm before clipping: {total_norm_before:.4f}")
print(f"Gradient norm after clipping: {total_norm_after:.4f}")
print(f"Threshold: {max_norm}")
print(f"\nGradients constrained below threshold!")
gradient_clipping_example()
Output Example:
=== Effect of Gradient Clipping ===
Gradient norm before clipping: 1224.7449
Gradient norm after clipping: 1.0000
Threshold: 1.0
Gradients constrained below threshold!
Solution for Vanishing Gradients: Architectural Improvements
Fundamental solutions to the vanishing gradient problem include the following architectures:
| Method | Key Features | Effect |
|---|---|---|
| LSTM | Control information flow with gate mechanisms | Learn long-term dependencies |
| GRU | Simplified LSTM with 2 gates | Reduced parameters, faster |
| Residual Connection | Direct gradient propagation via skip connections | Train deep networks |
| Layer Normalization | Normalize per layer | Stabilize training |
Note: LSTM and GRU will be covered in detail in the next chapter.
1.5 RNN Implementation in PyTorch
Basic Usage of nn.RNN
In PyTorch, use the torch.nn.RNN class to define RNN layers.
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Using the torch.nn.RNN class to define RNN layers
Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
# Basic RNN syntax
rnn = nn.RNN(
input_size=10, # Input feature dimension
hidden_size=20, # Hidden state dimension
num_layers=2, # Number of RNN layers
nonlinearity='tanh', # Activation function ('tanh' or 'relu')
batch_first=True, # (batch, seq, feature) order
dropout=0.0, # Dropout rate (between layers)
bidirectional=False # Bidirectional RNN
)
# Dummy input (batch size 3, sequence length 5, features 10)
x = torch.randn(3, 5, 10)
# Initial hidden state (num_layers, batch, hidden_size)
h0 = torch.zeros(2, 3, 20)
# Forward propagation
output, hn = rnn(x, h0)
print("=== PyTorch RNN Operation Check ===\n")
print(f"Input size: {x.shape}")
print(f" [Batch, Sequence, Features] = [{x.shape[0]}, {x.shape[1]}, {x.shape[2]}]")
print(f"\nInitial hidden state: {h0.shape}")
print(f" [Layers, Batch, Hidden] = [{h0.shape[0]}, {h0.shape[1]}, {h0.shape[2]}]")
print(f"\nOutput size: {output.shape}")
print(f" [Batch, Sequence, Hidden] = [{output.shape[0]}, {output.shape[1]}, {output.shape[2]}]")
print(f"\nFinal hidden state: {hn.shape}")
print(f" [Layers, Batch, Hidden] = [{hn.shape[0]}, {hn.shape[1]}, {hn.shape[2]}]")
# Calculate parameter count
total_params = sum(p.numel() for p in rnn.parameters())
print(f"\nTotal parameters: {total_params:,}")
Output:
=== PyTorch RNN Operation Check ===
Input size: torch.Size([3, 5, 10])
[Batch, Sequence, Features] = [3, 5, 10]
Initial hidden state: torch.Size([2, 3, 20])
[Layers, Batch, Hidden] = [2, 3, 20]
Output size: torch.Size([3, 5, 20])
[Batch, Sequence, Hidden] = [3, 5, 20]
Final hidden state: torch.Size([2, 3, 20])
[Layers, Batch, Hidden] = [2, 3, 20]
Total parameters: 1,480
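As a sanity check, this figure can be reproduced by hand. PyTorch stores two bias vectors per layer (b_ih and b_hh), and layer 1 receives the 20-dimensional output of layer 0:
# Layer 0: W_ih (20x10) + W_hh (20x20) + b_ih (20) + b_hh (20) = 640
layer0 = 20 * 10 + 20 * 20 + 20 + 20
# Layer 1: W_ih (20x20) + W_hh (20x20) + b_ih (20) + b_hh (20) = 840
layer1 = 20 * 20 + 20 * 20 + 20 + 20
print(layer0 + layer1)  # 1480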
nn.RNNCell vs nn.RNN
PyTorch has two RNN implementations:
| Class | Features | Use Cases |
|---|---|---|
| nn.RNN | Process entire sequence at once | Standard use, efficient |
| nn.RNNCell | Process one time step at a time manually | When custom control is needed |
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: PyTorch has two RNN implementations:
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
# Usage example of nn.RNNCell
input_size = 5
hidden_size = 10
batch_size = 3
sequence_length = 4
rnn_cell = nn.RNNCell(input_size, hidden_size)
# Sequence input
x_sequence = torch.randn(batch_size, sequence_length, input_size)
# Initial hidden state
h = torch.zeros(batch_size, hidden_size)
print("=== Manual Loop with nn.RNNCell ===\n")
# Manual loop through each time step
outputs = []
for t in range(sequence_length):
x_t = x_sequence[:, t, :] # Input at time t
h = rnn_cell(x_t, h) # Update hidden state
outputs.append(h)
print(f"Time t={t}: input {x_t.shape} → hidden state {h.shape}")
# Stack outputs
output = torch.stack(outputs, dim=1)
print(f"\nAll time steps output: {output.shape}")
print(f" [Batch, Sequence, Hidden] = [{output.shape[0]}, {output.shape[1]}, {output.shape[2]}]")
# Comparison with nn.RNN
rnn = nn.RNN(input_size, hidden_size, batch_first=True)
output_rnn, hn_rnn = rnn(x_sequence)
print(f"\nnn.RNN output: {output_rnn.shape}")
print("→ Same result as nn.RNNCell loop but computed at once")
Output:
=== Manual Loop with nn.RNNCell ===
Time t=0: input torch.Size([3, 5]) → hidden state torch.Size([3, 10])
Time t=1: input torch.Size([3, 5]) → hidden state torch.Size([3, 10])
Time t=2: input torch.Size([3, 5]) → hidden state torch.Size([3, 10])
Time t=3: input torch.Size([3, 5]) → hidden state torch.Size([3, 10])
All time steps output: torch.Size([3, 4, 10])
[Batch, Sequence, Hidden] = [3, 4, 10]
nn.RNN output: torch.Size([3, 4, 10])
→ Same result as nn.RNNCell loop but computed at once
Many-to-One Task: Sentiment Analysis
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import torch.nn.functional as F
class SentimentRNN(nn.Module):
"""
Many-to-One RNN: Classify sentiment from entire sentence
"""
def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
super(SentimentRNN, self).__init__()
# Word embedding layer
self.embedding = nn.Embedding(vocab_size, embedding_dim)
# RNN layer
self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
# Output layer
self.fc = nn.Linear(hidden_dim, output_dim)
def forward(self, x):
# x: (batch, seq_len)
# Word embedding
embedded = self.embedding(x) # (batch, seq_len, embedding_dim)
# RNN
output, hidden = self.rnn(embedded)
# output: (batch, seq_len, hidden_dim)
# hidden: (1, batch, hidden_dim)
# Use only last hidden state (Many-to-One)
last_hidden = hidden.squeeze(0) # (batch, hidden_dim)
# Classification
logits = self.fc(last_hidden) # (batch, output_dim)
return logits
# Model definition
vocab_size = 5000 # Vocabulary size
embedding_dim = 100 # Word embedding dimension
hidden_dim = 128 # Hidden state dimension
output_dim = 2 # 2-class classification (positive/negative)
model = SentimentRNN(vocab_size, embedding_dim, hidden_dim, output_dim)
# Dummy data (batch size 4, sequence length 10)
x = torch.randint(0, vocab_size, (4, 10))
logits = model(x)
print("=== Sentiment RNN (Many-to-One) ===\n")
print(f"Input (word IDs): {x.shape}")
print(f" [Batch, Sequence] = [{x.shape[0]}, {x.shape[1]}]")
print(f"\nOutput (logits): {logits.shape}")
print(f" [Batch, Classes] = [{logits.shape[0]}, {logits.shape[1]}]")
# Convert to probabilities
probs = F.softmax(logits, dim=1)
print(f"\nProbability distribution:")
print(probs)
# Parameter count
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total_params:,}")
print(f" Breakdown:")
print(f" Embedding: {vocab_size * embedding_dim:,}")
print(f" RNN: {(embedding_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim):,}")
print(f" FC: {(hidden_dim * output_dim + output_dim):,}")
Output Example:
=== Sentiment RNN (Many-to-One) ===
Input (word IDs): torch.Size([4, 10])
[Batch, Sequence] = [4, 10]
Output (logits): torch.Size([4, 2])
[Batch, Classes] = [4, 2]
Probability distribution:
tensor([[0.5234, 0.4766],
[0.4892, 0.5108],
[0.5123, 0.4877],
[0.4956, 0.5044]], grad_fn=<SoftmaxBackward0>)
Total parameters: 529,698
Breakdown:
Embedding: 500,000
RNN: 29,440
FC: 258
1.6 Practice: Character-Level Language Model
What is a Character-Level RNN?
A character-level language model learns text at the character level and predicts the next character.
- Input: Previous character sequence
- Output: Probability distribution of the next character
- After training: Can generate new text
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
"""
Example: A character-level language model that learns text character by character
Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: 10-30 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
# Sample text (Shakespeare-style)
text = """To be or not to be, that is the question.
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles"""
# Create character set
chars = sorted(list(set(text)))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
vocab_size = len(chars)
print("=== Character-Level Language Model ===\n")
print(f"Text length: {len(text)} characters")
print(f"Vocabulary size: {vocab_size} characters")
print(f"Character set: {''.join(chars)}")
# Convert text to numbers
encoded_text = [char_to_idx[ch] for ch in text]
print(f"\nOriginal text (first 50 characters):")
print(text[:50])
print(f"\nEncoded:")
print(encoded_text[:50])
# RNN language model definition
class CharRNN(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim):
super(CharRNN, self).__init__()
self.hidden_dim = hidden_dim
# Character embedding
self.embedding = nn.Embedding(vocab_size, embedding_dim)
# RNN layer
self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
# Output layer
self.fc = nn.Linear(hidden_dim, vocab_size)
def forward(self, x, hidden=None):
# x: (batch, seq_len)
embedded = self.embedding(x) # (batch, seq_len, embedding_dim)
output, hidden = self.rnn(embedded, hidden) # (batch, seq_len, hidden_dim)
logits = self.fc(output) # (batch, seq_len, vocab_size)
return logits, hidden
def init_hidden(self, batch_size):
return torch.zeros(1, batch_size, self.hidden_dim)
# Model initialization
embedding_dim = 32
hidden_dim = 64
model = CharRNN(vocab_size, embedding_dim, hidden_dim)
print(f"\n=== CharRNN Model ===")
print(f"Embedding dimension: {embedding_dim}")
print(f"Hidden state dimension: {hidden_dim}")
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
Output Example:
=== Character-Level Language Model ===
Text length: 170 characters
Vocabulary size: 27 characters
Character set: 
 ',.OTWabdefghiklmnoqrstuw
Original text (first 50 characters):
To be or not to be, that is the question.
Whether 
Encoded:
[6, 20, 1, 9, 11, 1, 20, 22, 1, 19, 20, 24, 1, 24, 20, 1, 9, 11, 3, 1, 24, 14, 8, 24, 1, 15, 23, 1, 24, 14, 11, 1, 21, 25, 11, 23, 24, 15, 20, 19, 4, 0, 7, 14, 11, 24, 14, 11, 22, 1]
=== CharRNN Model ===
Embedding dimension: 32
Hidden state dimension: 64
Total parameters: 8,891
Training and Text Generation
def create_sequences(encoded_text, seq_length):
"""
Create training sequences
"""
X, y = [], []
for i in range(len(encoded_text) - seq_length):
X.append(encoded_text[i:i+seq_length])
y.append(encoded_text[i+1:i+seq_length+1])
return torch.LongTensor(X), torch.LongTensor(y)
# Data preparation
seq_length = 25
X, y = create_sequences(encoded_text, seq_length)
print(f"=== Dataset ===")
print(f"Number of sequences: {len(X)}")
print(f"Input size: {X.shape}")
print(f"Target size: {y.shape}")
# Training setup
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
# Simplified training loop
num_epochs = 100
batch_size = 32
print(f"\n=== Training Started ===")
model.train()
for epoch in range(num_epochs):
total_loss = 0
hidden = None
for i in range(0, len(X), batch_size):
batch_X = X[i:i+batch_size]
batch_y = y[i:i+batch_size]
        # Reset the carried hidden state when the batch size changes (the final batch may be smaller)
        if hidden is not None and hidden.size(1) != batch_X.size(0):
            hidden = None
        # Forward propagation
        optimizer.zero_grad()
        logits, hidden = model(batch_X, hidden)
# Loss calculation (reshape required)
loss = criterion(logits.view(-1, vocab_size), batch_y.view(-1))
# Backpropagation
loss.backward()
# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
optimizer.step()
# Detach hidden state (truncate BPTT)
hidden = hidden.detach()
total_loss += loss.item()
if (epoch + 1) % 20 == 0:
        avg_loss = total_loss / ((len(X) + batch_size - 1) // batch_size)
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")
print("\nTraining complete!")
# Text generation
def generate_text(model, start_str, length=100, temperature=1.0):
"""
Generate text with trained model
Parameters:
-----------
model : CharRNN
Trained model
start_str : str
Starting string
length : int
Number of characters to generate
temperature : float
Temperature parameter (higher = more random)
"""
model.eval()
# Encode starting string
chars_encoded = [char_to_idx[ch] for ch in start_str]
input_seq = torch.LongTensor(chars_encoded).unsqueeze(0)
hidden = None
generated = start_str
with torch.no_grad():
for _ in range(length):
# Prediction
logits, hidden = model(input_seq, hidden)
# Output at last time step
logits = logits[0, -1, :] / temperature
probs = torch.softmax(logits, dim=0)
# Sampling
next_char_idx = torch.multinomial(probs, 1).item()
next_char = idx_to_char[next_char_idx]
generated += next_char
# Next input
input_seq = torch.LongTensor([[next_char_idx]])
return generated
# Execute text generation
print("\n=== Text Generation ===\n")
start_str = "To be"
generated_text = generate_text(model, start_str, length=100, temperature=0.8)
print(f"Starting string: '{start_str}'")
print(f"\nGenerated text:")
print(generated_text)
Output Example:
=== Dataset ===
Number of sequences: 145
Input size: torch.Size([145, 25])
Target size: torch.Size([145, 25])
=== Training Started ===
Epoch 20/100, Loss: 2.1234
Epoch 40/100, Loss: 1.8765
Epoch 60/100, Loss: 1.5432
Epoch 80/100, Loss: 1.2987
Epoch 100/100, Loss: 1.0654
Training complete!
=== Text Generation ===
Starting string: 'To be'
Generated text:
To be or not the question.
Whether 'tis nobler in the mind to suffer
The slings and arrows of out
Note: Actual output varies due to training randomness. Better results are obtained with longer text and more epochs.
Summary
In this chapter, we learned about RNN fundamentals and forward propagation.
Key Points
- Sequence data has temporal order and is difficult to handle with traditional NNs
- Hidden states allow RNNs to remember past information
- Parameter sharing enables uniform processing of arbitrary-length sequences
- BPTT is used for training, but vanishing/exploding gradients are challenges
- Gradient clipping suppresses exploding gradients
- PyTorch nn.RNN enables easy implementation
Next Chapter Preview
Chapter 2 will cover the following topics:
- LSTM (Long Short-Term Memory) mechanisms
- GRU (Gated Recurrent Unit) structure
- Bidirectional RNN
- Seq2Seq models and Attention mechanisms
Exercises
Exercise 1: Hidden State Size Calculation
Problem: Calculate the number of parameters for the following RNN.
- Input size: 50
- Hidden state size: 128
- Output size: 10
Solution:
# Wxh: input→hidden
Wxh_params = 50 * 128 = 6,400
# Whh: hidden→hidden
Whh_params = 128 * 128 = 16,384
# Why: hidden→output
Why_params = 128 * 10 = 1,280
# Bias terms
bh_params = 128
by_params = 10
# Total
total = 6,400 + 16,384 + 1,280 + 128 + 10 = 24,202
Answer: 24,202 parameters
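The count can also be cross-checked in PyTorch; note that nn.RNN keeps two bias vectors (b_ih and b_hh), so it reports 128 more parameters than the single-bias formulation used above:
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch.nn as nn
rnn = nn.RNN(input_size=50, hidden_size=128)
fc = nn.Linear(128, 10)
total = sum(p.numel() for p in rnn.parameters()) + sum(p.numel() for p in fc.parameters())
print(total)  # 24330 = 24202 + 128 (the extra hidden-to-hidden bias vector)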
Exercise 2: Sequence Length and Memory Usage
Problem: Calculate the total memory (number of elements) required for hidden states during forward propagation for an RNN with batch size 32, sequence length 100, and hidden state size 256.
Solution:
# Hidden state at each time step: (batch_size, hidden_size)
# Need to store for the sequence length
memory_elements = batch_size * seq_length * hidden_size
= 32 * 100 * 256
= 819,200 elements
# For float32 (4 bytes)
memory_bytes = 819,200 * 4 = 3,276,800 bytes ≈ 3.3 MB
Answer: Approximately 3.3 MB (more is needed during backpropagation)
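The same figure can be verified directly by allocating the hidden-state tensor:
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
h = torch.zeros(100, 32, 256)            # (seq_length, batch_size, hidden_size), float32
print(h.numel())                         # 819200 elements
print(h.numel() * h.element_size())      # 3276800 bytes ≈ 3.3 MB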
Exercise 3: Many-to-Many RNN Implementation
Problem: Implement a Many-to-Many RNN in PyTorch for part-of-speech tagging (assigning POS labels to each word).
Solution Example:
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
class POSTaggingRNN(nn.Module):
"""
Many-to-Many RNN: Predict POS tag for each word
"""
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_tags):
super(POSTaggingRNN, self).__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
self.fc = nn.Linear(hidden_dim, num_tags)
def forward(self, x):
# x: (batch, seq_len)
embedded = self.embedding(x) # (batch, seq_len, embedding_dim)
output, _ = self.rnn(embedded) # (batch, seq_len, hidden_dim)
logits = self.fc(output) # (batch, seq_len, num_tags)
return logits
# Usage example
vocab_size = 5000
embedding_dim = 100
hidden_dim = 128
num_tags = 45 # Penn Treebank POS tag count
model = POSTaggingRNN(vocab_size, embedding_dim, hidden_dim, num_tags)
# Dummy data
x = torch.randint(0, vocab_size, (8, 20)) # batch=8, seq_len=20
logits = model(x)
print(f"Input: {x.shape}")
print(f"Output: {logits.shape}") # (8, 20, 45)
print("Predict POS tag for each word at each time step")
Exercise 4: Vanishing Gradient Experiment
Problem: Train RNNs with different weight initializations and observe the impact of vanishing gradients.
Solution Example:
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
def test_gradient_vanishing(init_scale):
"""
Observe gradients by varying weight initialization scale
"""
rnn = nn.RNN(10, 20, batch_first=True)
# Manually initialize weights
for name, param in rnn.named_parameters():
if 'weight' in name:
nn.init.uniform_(param, -init_scale, init_scale)
# Dummy data (long sequence)
x = torch.randn(1, 50, 10)
target = torch.randn(1, 50, 20)
# Forward propagation
output, _ = rnn(x)
loss = ((output - target) ** 2).mean()
# Backpropagation
loss.backward()
# Calculate gradient norms
grad_norms = []
for name, param in rnn.named_parameters():
if param.grad is not None:
grad_norms.append(param.grad.norm().item())
return grad_norms
# Experiment with different initialization scales
scales = [0.01, 0.1, 0.5, 1.0, 2.0]
results = {scale: test_gradient_vanishing(scale) for scale in scales}
print("=== Gradient Norm Comparison ===")
for scale, norms in results.items():
avg_norm = sum(norms) / len(norms)
print(f"Initialization scale {scale}: average gradient norm = {avg_norm:.6f}")
print("\nToo small scale causes vanishing, too large causes exploding gradients")
Exercise 5: Understanding Bidirectional RNN
Problem: Implement a bidirectional RNN and explain the differences from a unidirectional RNN.
Solution Example:
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Comparing bidirectional and unidirectional RNNs
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""
import torch
import torch.nn as nn
# Bidirectional RNN
bi_rnn = nn.RNN(10, 20, batch_first=True, bidirectional=True)
# Unidirectional RNN
uni_rnn = nn.RNN(10, 20, batch_first=True, bidirectional=False)
# Dummy input
x = torch.randn(3, 5, 10) # batch=3, seq_len=5, features=10
# Forward propagation
bi_output, bi_hidden = bi_rnn(x)
uni_output, uni_hidden = uni_rnn(x)
print("=== Bidirectional RNN vs Unidirectional RNN ===\n")
print(f"Input size: {x.shape}")
print(f"\nBidirectional RNN:")
print(f" Output: {bi_output.shape}") # (3, 5, 40) ← 20*2
print(f" Hidden state: {bi_hidden.shape}") # (2, 3, 20) ← 2 directions
print(f"\nUnidirectional RNN:")
print(f" Output: {uni_output.shape}") # (3, 5, 20)
print(f" Hidden state: {uni_hidden.shape}") # (1, 3, 20)
print("\nBidirectional RNN features:")
print(" ✓ Captures both forward and backward context")
print(" ✓ Output dimension doubles (forward + backward)")
print(" ✓ Uses future information, improves accuracy")
print(" ✗ Not suitable for real-time processing (needs entire sequence)")