Chapter 3: Learning Algorithms

Understanding Gradient Descent, Backpropagation, and Training Dynamics

Reading Time: 35-40 minutes | Code Examples: 9 | Exercises: 7 | Difficulty: Intermediate
In this chapter, we will learn how neural networks learn from data. We will understand gradient descent optimization, the backpropagation algorithm for computing gradients, mini-batch learning strategies, and techniques for monitoring and visualizing training progress.

Learning Objectives

By the end of this chapter, you will be able to:

- Explain how gradient descent minimizes a loss function
- Derive and implement backpropagation using the chain rule
- Apply mini-batch learning and SGD with momentum
- Choose an appropriate learning rate and number of epochs
- Monitor and diagnose training progress with learning curves

1. Gradient Descent

1.1 Basic Concepts of Optimization

Optimization is the process of finding the parameter values that minimize (or maximize) an objective function. In deep learning, we want to find weights and biases that minimize the loss function.

The optimization problem can be written as:

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$

Where:

- $\boldsymbol{\theta}$ is the vector of all trainable parameters (weights and biases)
- $\mathcal{L}(\boldsymbol{\theta})$ is the loss function evaluated on the training data
- $\boldsymbol{\theta}^*$ is the parameter setting that minimizes the loss

1.2 Batch Gradient Descent

Gradient descent is an iterative algorithm that updates parameters in the direction opposite to the gradient (steepest descent direction).

Update rule:

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$

Where:

- $\eta$ is the learning rate (step size)
- $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$ is the gradient of the loss with respect to the parameters

```mermaid
graph LR
    A[Initialize Parameters] --> B[Compute Loss]
    B --> C[Compute Gradients]
    C --> D[Update Parameters]
    D --> E{Converged?}
    E -->|No| B
    E -->|Yes| F[Done]
    style A fill:#e3f2fd
    style F fill:#e8f5e9
```
import numpy as np

def gradient_descent_demo():
    """
    Simple gradient descent demonstration on a quadratic function
    f(x) = x^2, minimum at x = 0
    """
    # Initial point
    x = 5.0

    # Learning rate
    learning_rate = 0.1

    # Store history for visualization
    history = [x]

    # Gradient descent iterations
    for i in range(20):
        # Compute gradient: df/dx = 2x
        gradient = 2 * x

        # Update parameter
        x = x - learning_rate * gradient

        history.append(x)

        if i < 5 or i == 19:
            print(f"Iteration {i+1}: x = {x:.6f}, f(x) = {x**2:.6f}")

    return history

print("Gradient Descent on f(x) = x^2")
print("=" * 40)
history = gradient_descent_demo()

1.3 Geometric Meaning of Gradients

The gradient $\nabla \mathcal{L}$ is a vector pointing in the direction of steepest increase of the loss function. Therefore, we move in the opposite direction to decrease the loss.

Key Properties of Gradients

- The gradient points in the direction of steepest increase; its negative points in the direction of steepest decrease
- The magnitude of the gradient indicates how quickly the function changes in that direction
- At a local minimum (or any critical point), the gradient is zero
- The gradient is perpendicular to the level curves (contours) of the function

import numpy as np

def gradient_2d_demo():
    """
    Gradient descent on a 2D function
    f(x, y) = x^2 + 2y^2 (elliptical paraboloid)
    Minimum at (0, 0)
    """
    # Initial point
    x, y = 4.0, 3.0

    learning_rate = 0.1
    history = [(x, y)]

    print("2D Gradient Descent on f(x,y) = x^2 + 2y^2")
    print("=" * 50)

    for i in range(15):
        # Compute gradients
        grad_x = 2 * x
        grad_y = 4 * y  # df/dy = 4y

        # Update parameters
        x = x - learning_rate * grad_x
        y = y - learning_rate * grad_y

        history.append((x, y))

        if i < 3 or i == 14:
            loss = x**2 + 2*y**2
            print(f"Iter {i+1}: x={x:.4f}, y={y:.4f}, f(x,y)={loss:.4f}")

    return history

history_2d = gradient_2d_demo()

2. Backpropagation

2.1 Review of the Chain Rule

The chain rule allows us to compute derivatives of composite functions. If $y = f(g(x))$, then:

$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$

For neural networks with multiple layers, we apply the chain rule repeatedly:

$$\frac{\partial \mathcal{L}}{\partial w^{(1)}} = \frac{\partial \mathcal{L}}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \cdots \frac{\partial z^{(1)}}{\partial w^{(1)}}$$
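The repeated chain rule is easy to verify numerically. A minimal standalone sketch (the function names here are illustrative, not part of the network code below): we differentiate a composite $y = f(g(x))$ both analytically, via the chain rule, and with a finite-difference approximation of the whole composite, and confirm they agree.

```python
import numpy as np

# Composite function y = f(g(x)) with f(u) = sin(u), g(x) = x^2
def g(x):
    return x ** 2

def f(u):
    return np.sin(u)

def analytic_dy_dx(x):
    # Chain rule: dy/dx = (dy/dg) * (dg/dx) = cos(x^2) * 2x
    return np.cos(g(x)) * 2 * x

def numerical_dy_dx(x, eps=1e-6):
    # Central-difference approximation of the full composite
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

x = 1.3
print(f"analytic : {analytic_dy_dx(x):.8f}")
print(f"numerical: {numerical_dy_dx(x):.8f}")
```

The two values match to several decimal places; backpropagation applies exactly this factorization, layer by layer, to a much deeper composite.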

2.2 Gradient Propagation from Output to Input

Backpropagation is an efficient algorithm for computing gradients by propagating errors backward through the network.

```mermaid
graph RL
    subgraph Forward Pass
        X[Input X] --> H1[Hidden 1]
        H1 --> H2[Hidden 2]
        H2 --> Y[Output Y]
        Y --> L[Loss L]
    end
    subgraph Backward Pass
        dL[dL/dL = 1] --> dY[dL/dY]
        dY --> dH2[dL/dH2]
        dH2 --> dH1[dL/dH1]
        dH1 --> dX[dL/dX]
    end
    style X fill:#e3f2fd
    style L fill:#f3e5f5
    style dL fill:#fff3e0
```

2.3 Gradient Computation at Each Layer

For a fully connected layer with ReLU activation, with batch size $m$:

Forward pass:

$$\mathbf{Z} = \mathbf{X}\mathbf{W} + \mathbf{b}, \qquad \mathbf{A} = \text{ReLU}(\mathbf{Z})$$

Backward pass:

$$\mathbf{dZ} = \mathbf{dA} \odot \text{ReLU}'(\mathbf{Z}), \qquad \mathbf{dW} = \frac{1}{m}\mathbf{X}^\top \mathbf{dZ}, \qquad \mathbf{db} = \frac{1}{m}\sum_{i} \mathbf{dZ}_i, \qquad \mathbf{dX} = \mathbf{dZ}\,\mathbf{W}^\top$$

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

class DenseLayerWithBackprop:
    """
    Dense layer with forward and backward pass
    """

    def __init__(self, n_input, n_output, activation='relu'):
        self.W = np.random.randn(n_input, n_output) * np.sqrt(2.0 / n_input)
        self.b = np.zeros((1, n_output))
        self.activation = activation

        # Cache for backpropagation
        self.cache = {}
        # Gradients
        self.dW = None
        self.db = None

    def forward(self, X):
        self.cache['X'] = X
        self.cache['Z'] = np.dot(X, self.W) + self.b

        if self.activation == 'relu':
            self.cache['A'] = relu(self.cache['Z'])
        elif self.activation == 'softmax':
            self.cache['A'] = softmax(self.cache['Z'])
        else:
            self.cache['A'] = self.cache['Z']

        return self.cache['A']

    def backward(self, dA):
        """
        Backward pass

        Parameters:
        -----------
        dA : ndarray
            Gradient of loss with respect to this layer's activation

        Returns:
        --------
        dX : ndarray
            Gradient of loss with respect to input
        """
        m = dA.shape[0]  # Batch size

        # Gradient through activation
        if self.activation == 'relu':
            dZ = dA * relu_derivative(self.cache['Z'])
        else:
            dZ = dA  # For softmax, gradient is already computed

        # Gradients for weights and biases
        self.dW = np.dot(self.cache['X'].T, dZ) / m
        self.db = np.mean(dZ, axis=0, keepdims=True)

        # Gradient for previous layer
        dX = np.dot(dZ, self.W.T)

        return dX

# Example: Manual backpropagation
np.random.seed(42)

# Create simple network
layer1 = DenseLayerWithBackprop(4, 8, 'relu')
layer2 = DenseLayerWithBackprop(8, 3, 'softmax')

# Sample data
X = np.array([[1, 2, 3, 4], [4, 3, 2, 1]], dtype=float)
y_true = np.array([[1, 0, 0], [0, 0, 1]])  # One-hot encoded

# Forward pass
A1 = layer1.forward(X)
A2 = layer2.forward(A1)

print("Forward pass:")
print(f"  Input shape: {X.shape}")
print(f"  Hidden activation shape: {A1.shape}")
print(f"  Output (predictions): \n{A2}")

# Compute loss gradient (cross entropy with softmax)
# For CE + softmax, dL/dZ_output = y_pred - y_true (per sample);
# the division by the batch size m happens inside backward()
dL_dA2 = A2 - y_true

# Backward pass
dA1 = layer2.backward(dL_dA2)
dX = layer1.backward(dA1)

print("\nBackward pass:")
print(f"  dW2 shape: {layer2.dW.shape}, mean: {layer2.dW.mean():.6f}")
print(f"  dW1 shape: {layer1.dW.shape}, mean: {layer1.dW.mean():.6f}")

3. Mini-batch Learning and SGD

3.1 Impact of Batch Size

| Batch Type | Size | Gradient Quality | Speed | Memory |
|------------|----------|------------------|-----------------|----------|
| Batch GD | All data | Most accurate | Slow updates | High |
| Mini-batch | 32-256 | Good balance | Fast | Moderate |
| SGD | 1 | Very noisy | Fastest updates | Low |
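The "Gradient Quality" column can be made concrete by measuring the variance of the mini-batch gradient estimate directly. The sketch below (a standalone illustration with made-up data; `minibatch_grad` is not part of the chapter's network code) evaluates the gradient of a simple 1D regression loss many times at the same point, for several batch sizes, and reports the spread:

```python
import numpy as np

np.random.seed(0)

# Synthetic 1D regression: loss L(w) = mean((x*w - y)^2), true w = 2
x = np.random.randn(10000)
y = 2.0 * x + 0.5 * np.random.randn(10000)

def minibatch_grad(w, batch_size):
    # Gradient of MSE on a random mini-batch: dL/dw = 2 * mean(x * (x*w - y))
    idx = np.random.choice(len(x), batch_size, replace=False)
    xb, yb = x[idx], y[idx]
    return 2 * np.mean(xb * (xb * w - yb))

w = 0.0  # evaluate gradient noise away from the optimum
for bs in [1, 32, 256, len(x)]:
    grads = [minibatch_grad(w, bs) for _ in range(200)]
    print(f"batch={bs:>5}: grad mean={np.mean(grads):+.3f}, std={np.std(grads):.3f}")
```

All batch sizes agree on average (the mini-batch gradient is an unbiased estimate of the full gradient), but the standard deviation shrinks as the batch grows, reaching zero for the full batch.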

3.2 Stochastic Gradient Descent (SGD)

SGD updates parameters using a single sample (or small batch) at a time, introducing noise that can help escape local minima.

import numpy as np

def create_batches(X, y, batch_size, shuffle=True):
    """
    Create mini-batches from data

    Parameters:
    -----------
    X : ndarray
        Input features
    y : ndarray
        Target labels
    batch_size : int
        Number of samples per batch
    shuffle : bool
        Whether to shuffle data

    Yields:
    -------
    X_batch, y_batch : tuple
        Mini-batch of features and labels
    """
    n_samples = X.shape[0]
    indices = np.arange(n_samples)

    if shuffle:
        np.random.shuffle(indices)

    for start_idx in range(0, n_samples, batch_size):
        end_idx = min(start_idx + batch_size, n_samples)
        batch_indices = indices[start_idx:end_idx]
        yield X[batch_indices], y[batch_indices]

# Example usage
np.random.seed(42)
X = np.random.randn(100, 4)
y = np.random.randint(0, 3, 100)

print("Mini-batch iteration:")
for i, (X_batch, y_batch) in enumerate(create_batches(X, y, batch_size=32)):
    print(f"  Batch {i+1}: X shape = {X_batch.shape}, y shape = {y_batch.shape}")

3.3 SGD with Momentum

Momentum accelerates SGD by accumulating a velocity vector in directions of persistent gradient.

Update rules:

$$\mathbf{v} \leftarrow \beta \mathbf{v} + (1-\beta) \nabla_{\boldsymbol{\theta}} \mathcal{L}$$

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \mathbf{v}$$

Where $\beta$ (typically 0.9) is the momentum coefficient.

Intuition: Think of a ball rolling down a hill. Momentum helps it:

- Accelerate along directions where the gradient is consistent across steps
- Dampen oscillations across steep, narrow valleys
- Roll through small bumps such as shallow local minima and plateaus

import numpy as np

class SGDOptimizer:
    """
    SGD optimizer with optional momentum
    """

    def __init__(self, learning_rate=0.01, momentum=0.0):
        self.lr = learning_rate
        self.momentum = momentum
        self.velocities = {}

    def update(self, params, grads):
        """
        Update parameters

        Parameters:
        -----------
        params : dict
            Parameter name -> parameter array
        grads : dict
            Parameter name -> gradient array
        """
        for name in params:
            if name not in self.velocities:
                self.velocities[name] = np.zeros_like(params[name])

            # Update velocity
            self.velocities[name] = (
                self.momentum * self.velocities[name] +
                (1 - self.momentum) * grads[name]
            )

            # Update parameter
            params[name] -= self.lr * self.velocities[name]

# Compare SGD with and without momentum
def optimize_comparison():
    """
    Compare SGD with and without momentum on a 2D function
    f(x, y) = x^2 + 10*y^2 (elongated valley)
    """
    def loss_fn(x, y):
        return x**2 + 10*y**2

    def grad_fn(x, y):
        return 2*x, 20*y

    # Initial point
    x0, y0 = 10.0, 1.0

    results = {}

    for name, momentum in [('SGD', 0.0), ('SGD+Momentum', 0.9)]:
        x, y = x0, y0
        v_x, v_y = 0.0, 0.0
        lr = 0.05
        history = [(x, y, loss_fn(x, y))]

        for _ in range(50):
            gx, gy = grad_fn(x, y)

            v_x = momentum * v_x + (1 - momentum) * gx
            v_y = momentum * v_y + (1 - momentum) * gy

            x -= lr * v_x
            y -= lr * v_y

            history.append((x, y, loss_fn(x, y)))

        results[name] = history
        print(f"{name}: Final loss = {history[-1][2]:.6f}")

    return results

print("SGD vs SGD with Momentum")
print("=" * 40)
results = optimize_comparison()

4. Learning Rate and Epochs

4.1 Choosing the Learning Rate

The learning rate $\eta$ is one of the most important hyperparameters. It controls the step size during optimization.

| Learning Rate | Behavior | Symptoms |
|---------------|--------------------|------------------------------|
| Too small | Slow convergence | Loss decreases very slowly |
| Just right | Smooth convergence | Loss decreases steadily |
| Too large | Overshooting | Loss oscillates or explodes |

import numpy as np

def learning_rate_demo():
    """
    Demonstrate the effect of different learning rates
    """
    def loss_fn(x):
        return x**2

    def grad_fn(x):
        return 2*x

    learning_rates = [0.01, 0.1, 0.5, 1.1]
    x0 = 10.0
    n_steps = 20

    print("Effect of Learning Rate")
    print("=" * 50)

    for lr in learning_rates:
        x = x0
        history = [loss_fn(x)]

        for _ in range(n_steps):
            x = x - lr * grad_fn(x)
            history.append(loss_fn(x))

        final_loss = history[-1]
        if np.isnan(final_loss) or final_loss > 1e10:
            status = "DIVERGED"
        elif final_loss < 0.01:
            status = "CONVERGED"
        else:
            status = "SLOW"

        print(f"lr={lr:.2f}: Final loss = {final_loss:.4f} ({status})")

learning_rate_demo()

4.2 Determining Number of Epochs

An epoch is one complete pass through the entire training dataset. The number of epochs affects:

- Underfitting: too few epochs leave the model undertrained
- Overfitting: too many epochs let the model memorize the training set
- Compute cost: each additional epoch adds training time

4.3 Interpreting Learning Curves

A learning curve plots loss (and/or accuracy) against training iterations or epochs. It reveals:

- How quickly training converges
- Whether the model is underfitting or overfitting (the gap between training and validation loss)
- Instability or divergence caused by a poor learning rate

```mermaid
graph LR
    subgraph Good Fit
        A1[Low Training Loss] --> B1[Low Validation Loss]
    end
    subgraph Underfitting
        A2[High Training Loss] --> B2[High Validation Loss]
    end
    subgraph Overfitting
        A3[Low Training Loss] --> B3[High Validation Loss]
    end
```

5. Visualizing Training

5.1 Loss Curves

import numpy as np

class TrainingHistory:
    """
    Track and store training metrics
    """

    def __init__(self):
        self.train_loss = []
        self.val_loss = []
        self.train_acc = []
        self.val_acc = []

    def log(self, train_loss, val_loss=None, train_acc=None, val_acc=None):
        self.train_loss.append(train_loss)
        if val_loss is not None:
            self.val_loss.append(val_loss)
        if train_acc is not None:
            self.train_acc.append(train_acc)
        if val_acc is not None:
            self.val_acc.append(val_acc)

    def summary(self):
        print("\nTraining Summary:")
        print(f"  Final train loss: {self.train_loss[-1]:.4f}")
        if self.val_loss:
            print(f"  Final val loss: {self.val_loss[-1]:.4f}")
        if self.train_acc:
            print(f"  Final train accuracy: {self.train_acc[-1]:.2%}")
        if self.val_acc:
            print(f"  Final val accuracy: {self.val_acc[-1]:.2%}")

# Simulate training history
np.random.seed(42)
history = TrainingHistory()

for epoch in range(50):
    # Simulate decreasing loss with noise
    train_loss = 2.0 * np.exp(-epoch/15) + 0.1 + np.random.normal(0, 0.05)
    val_loss = 2.0 * np.exp(-epoch/15) + 0.15 + np.random.normal(0, 0.08)

    # After epoch 30, simulate overfitting
    if epoch > 30:
        val_loss += (epoch - 30) * 0.02

    train_acc = 1 - train_loss/2.5
    val_acc = 1 - val_loss/2.5

    history.log(train_loss, val_loss, train_acc, val_acc)

history.summary()
print(f"\nBest validation loss at epoch: {np.argmin(history.val_loss) + 1}")

5.2 Comparing Training and Validation Loss

import numpy as np

def analyze_learning_curves(train_loss, val_loss):
    """
    Analyze learning curves to detect training issues

    Parameters:
    -----------
    train_loss : list
        Training loss history
    val_loss : list
        Validation loss history

    Returns:
    --------
    diagnosis : str
        Training status and recommendations
    """
    train_loss = np.array(train_loss)
    val_loss = np.array(val_loss)

    # Check for divergence
    if train_loss[-1] > train_loss[0]:
        return "DIVERGING: Learning rate may be too high"

    # Check for underfitting
    if train_loss[-1] > 0.5 and val_loss[-1] > 0.5:
        return "UNDERFITTING: Model may need more capacity or training"

    # Check for overfitting
    best_val_epoch = np.argmin(val_loss)
    if best_val_epoch < len(val_loss) - 5:
        gap = val_loss[-1] - train_loss[-1]
        if gap > 0.1:
            return f"OVERFITTING: Best epoch was {best_val_epoch + 1}, consider early stopping"

    # Check convergence
    recent_improvement = val_loss[-5] - val_loss[-1] if len(val_loss) >= 5 else 1
    if recent_improvement < 0.01:
        return "CONVERGED: Training appears to have plateaued"

    return "TRAINING: Still improving"

# Test analysis
print("Learning Curve Analysis")
print("=" * 40)

# Simulate different scenarios
scenarios = {
    "Good training": (
        [2.0, 1.5, 1.0, 0.7, 0.5, 0.4, 0.35, 0.32, 0.30, 0.29],
        [2.1, 1.6, 1.1, 0.8, 0.6, 0.5, 0.45, 0.42, 0.40, 0.39]
    ),
    "Overfitting": (
        [2.0, 1.0, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, 0.002],
        [2.1, 1.1, 0.6, 0.4, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8]
    ),
    "Diverging": (
        [2.0, 3.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0, 500.0, 1000.0],
        [2.1, 3.5, 6.0, 15.0, 30.0, 70.0, 150.0, 300.0, 700.0, 1500.0]
    )
}

for name, (train, val) in scenarios.items():
    diagnosis = analyze_learning_curves(train, val)
    print(f"\n{name}:")
    print(f"  {diagnosis}")

5.3 Complete Training Loop

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_deriv(x):
    return (x > 0).astype(float)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def cross_entropy_loss(y_true, y_pred, epsilon=1e-15):
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

class SimpleNetwork:
    """
    Simple neural network with complete training loop
    """

    def __init__(self, layer_sizes):
        self.weights = []
        self.biases = []

        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
            b = np.zeros((1, n_out))
            self.weights.append(W)
            self.biases.append(b)

    def forward(self, X):
        self.activations = [X]
        self.pre_activations = []

        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = np.dot(A, W) + b
            self.pre_activations.append(Z)

            if i < len(self.weights) - 1:
                A = relu(Z)
            else:
                A = softmax(Z)
            self.activations.append(A)

        return A

    def backward(self, y_true):
        m = y_true.shape[0]
        grads_W = []
        grads_b = []

        # Output layer gradient (CE + softmax)
        dZ = self.activations[-1] - y_true

        for i in range(len(self.weights) - 1, -1, -1):
            dW = np.dot(self.activations[i].T, dZ) / m
            db = np.mean(dZ, axis=0, keepdims=True)

            grads_W.insert(0, dW)
            grads_b.insert(0, db)

            if i > 0:
                dA = np.dot(dZ, self.weights[i].T)
                dZ = dA * relu_deriv(self.pre_activations[i - 1])

        return grads_W, grads_b

    def update(self, grads_W, grads_b, learning_rate):
        for i in range(len(self.weights)):
            self.weights[i] -= learning_rate * grads_W[i]
            self.biases[i] -= learning_rate * grads_b[i]

    def train(self, X_train, y_train, X_val, y_val,
              epochs=100, batch_size=32, learning_rate=0.01):
        history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}
        n_samples = X_train.shape[0]

        for epoch in range(epochs):
            # Shuffle data
            indices = np.random.permutation(n_samples)
            X_shuffled = X_train[indices]
            y_shuffled = y_train[indices]

            # Mini-batch training
            for start in range(0, n_samples, batch_size):
                end = min(start + batch_size, n_samples)
                X_batch = X_shuffled[start:end]
                y_batch = y_shuffled[start:end]

                # Forward
                _ = self.forward(X_batch)

                # Backward
                grads_W, grads_b = self.backward(y_batch)

                # Update
                self.update(grads_W, grads_b, learning_rate)

            # Evaluate
            train_pred = self.forward(X_train)
            val_pred = self.forward(X_val)

            train_loss = cross_entropy_loss(y_train, train_pred)
            val_loss = cross_entropy_loss(y_val, val_pred)

            train_acc = np.mean(np.argmax(train_pred, axis=1) == np.argmax(y_train, axis=1))
            val_acc = np.mean(np.argmax(val_pred, axis=1) == np.argmax(y_val, axis=1))

            history['train_loss'].append(train_loss)
            history['val_loss'].append(val_loss)
            history['train_acc'].append(train_acc)
            history['val_acc'].append(val_acc)

            if (epoch + 1) % 20 == 0:
                print(f"Epoch {epoch+1}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}, val_acc={val_acc:.2%}")

        return history

# Create synthetic dataset
np.random.seed(42)
n_samples = 500
n_features = 4
n_classes = 3

X = np.random.randn(n_samples, n_features)
y_labels = np.random.randint(0, n_classes, n_samples)
y = np.eye(n_classes)[y_labels]  # One-hot encode

# Split data
split = int(0.8 * n_samples)
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

# Train network
print("Training Neural Network")
print("=" * 50)
network = SimpleNetwork([4, 16, 8, 3])
history = network.train(X_train, y_train, X_val, y_val,
                        epochs=100, batch_size=32, learning_rate=0.1)

print(f"\nFinal Results:")
print(f"  Training accuracy: {history['train_acc'][-1]:.2%}")
print(f"  Validation accuracy: {history['val_acc'][-1]:.2%}")

Exercises

Exercise 1: Implement Gradient Checking

Problem: Implement numerical gradient checking to verify backpropagation is correct.

$$\frac{\partial \mathcal{L}}{\partial \theta} \approx \frac{\mathcal{L}(\theta + \epsilon) - \mathcal{L}(\theta - \epsilon)}{2\epsilon}$$

Compare numerical gradients with backprop gradients for a small network.
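As a starting hint, here is a minimal sketch of the central-difference formula applied to a plain scalar function (the names are illustrative; the exercise proper applies the same idea to each weight and bias of the network, one coordinate at a time):

```python
import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-5):
    # Central differences, perturbing one coordinate at a time
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += eps
        theta_minus = theta.copy()
        theta_minus[i] -= eps
        grad[i] = (loss_fn(theta_plus) - loss_fn(theta_minus)) / (2 * eps)
    return grad

# Sanity check on L(theta) = sum(theta^2), whose analytic gradient is 2*theta
theta = np.array([1.0, -2.0, 0.5])
num = numerical_gradient(lambda t: np.sum(t ** 2), theta)
ana = 2 * theta
rel_error = np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))
print(f"relative error: {rel_error:.2e}")  # near machine precision for a quadratic
```

For the network version, flatten all parameters into one vector, compute the relative error against the backprop gradients, and treat values below roughly 1e-7 as a pass.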

Exercise 2: Learning Rate Finder

Problem: Implement a learning rate finder that:

  1. Starts with a very small learning rate (e.g., 1e-7)
  2. Increases it exponentially each batch
  3. Records the loss at each step
  4. Finds the learning rate where loss is decreasing fastest
Exercise 3: Momentum Variants

Problem: Implement and compare:

  1. Standard momentum
  2. Nesterov accelerated gradient (NAG)

Test on f(x,y) = x^2 + 100*y^2 and compare convergence.

Exercise 4: Batch Size Experiment

Problem: Train the same network with batch sizes [1, 8, 32, 128, full batch]:

  1. Plot training loss curves for each
  2. Measure wall-clock time per epoch
  3. Compare final validation accuracy
Exercise 5: Early Stopping

Problem: Implement early stopping:

  1. Monitor validation loss
  2. Stop training if val loss doesn't improve for N epochs (patience)
  3. Restore weights from the best epoch
Exercise 6: Gradient Clipping

Problem: Implement gradient clipping to prevent exploding gradients:

  1. Clip by value: cap gradient magnitudes
  2. Clip by norm: scale gradients if total norm exceeds threshold
Exercise 7: Training on XOR

Problem: Train a neural network to solve the XOR problem:

| x1 | x2 | XOR |
|----|----|-----|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

Plot the decision boundary after training.

Summary

In this chapter, we learned how neural networks learn:

- Gradient descent iteratively updates parameters in the direction opposite to the gradient
- Backpropagation applies the chain rule to compute gradients efficiently, layer by layer
- Mini-batch SGD trades gradient accuracy for speed, and momentum smooths its noisy updates
- The learning rate is a critical hyperparameter: too small is slow, too large diverges
- Learning curves diagnose underfitting, overfitting, and convergence

Next Chapter Preview: In Chapter 4, we will learn about regularization techniques (L1, L2, Dropout, Batch Normalization) and advanced optimization algorithms (Adam, RMSprop) that improve model generalization and training stability.
