Learning Objectives
- Understand the concept of optimization and the gradient descent algorithm
- Master the backpropagation algorithm using the chain rule
- Learn the differences between batch, mini-batch, and stochastic gradient descent
- Understand the role of learning rate and how to choose it
- Know how to interpret learning curves and training dynamics
- Implement a complete training loop from scratch
1. Gradient Descent
1.1 Basic Concepts of Optimization
Optimization is the process of finding the parameter values that minimize (or maximize) an objective function. In deep learning, we want to find weights and biases that minimize the loss function.
The optimization problem can be written as:
$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$
Where:
- $\boldsymbol{\theta}$: all parameters (weights and biases)
- $\mathcal{L}$: loss function
- $\boldsymbol{\theta}^*$: optimal parameters
1.2 Batch Gradient Descent
Gradient descent is an iterative algorithm that updates parameters in the direction opposite to the gradient (steepest descent direction).
Update rule:
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$
Where:
- $\eta$: learning rate (step size)
- $\nabla_{\boldsymbol{\theta}} \mathcal{L}$: gradient of loss with respect to parameters
```python
import numpy as np

def gradient_descent_demo():
    """
    Simple gradient descent demonstration on a quadratic function
    f(x) = x^2, minimum at x = 0
    """
    # Initial point
    x = 5.0
    # Learning rate
    learning_rate = 0.1
    # Store history for visualization
    history = [x]

    # Gradient descent iterations
    for i in range(20):
        # Compute gradient: df/dx = 2x
        gradient = 2 * x
        # Update parameter
        x = x - learning_rate * gradient
        history.append(x)
        if i < 5 or i == 19:
            print(f"Iteration {i+1}: x = {x:.6f}, f(x) = {x**2:.6f}")
    return history

print("Gradient Descent on f(x) = x^2")
print("=" * 40)
history = gradient_descent_demo()
```
1.3 Geometric Meaning of Gradients
The gradient $\nabla \mathcal{L}$ is a vector pointing in the direction of steepest increase of the loss function. Therefore, we move in the opposite direction to decrease the loss.
Key Properties of Gradients
- Direction: Points toward steepest increase
- Magnitude: Indicates how steep the slope is
- At minimum: Gradient is zero (or close to zero)
```python
import numpy as np

def gradient_2d_demo():
    """
    Gradient descent on a 2D function
    f(x, y) = x^2 + 2y^2 (elliptical paraboloid)
    Minimum at (0, 0)
    """
    # Initial point
    x, y = 4.0, 3.0
    learning_rate = 0.1
    history = [(x, y)]

    print("2D Gradient Descent on f(x,y) = x^2 + 2y^2")
    print("=" * 50)

    for i in range(15):
        # Compute partial derivatives
        grad_x = 2 * x  # df/dx = 2x
        grad_y = 4 * y  # df/dy = 4y
        # Update parameters
        x = x - learning_rate * grad_x
        y = y - learning_rate * grad_y
        history.append((x, y))
        if i < 3 or i == 14:
            loss = x**2 + 2*y**2
            print(f"Iter {i+1}: x={x:.4f}, y={y:.4f}, f(x,y)={loss:.4f}")
    return history

history_2d = gradient_2d_demo()
```
2. Backpropagation
2.1 Review of the Chain Rule
The chain rule allows us to compute derivatives of composite functions. If $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$
For neural networks with multiple layers, we apply the chain rule repeatedly:
$$\frac{\partial \mathcal{L}}{\partial w^{(1)}} = \frac{\partial \mathcal{L}}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \cdots \frac{\partial z^{(1)}}{\partial w^{(1)}}$$
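A quick numerical sanity check makes the chain rule concrete: differentiate a small composite function analytically and compare against a central finite difference. A minimal sketch (the function choice here is purely illustrative):

```python
import numpy as np

# Composite function: y = f(g(x)) with f(u) = u^2 and g(x) = sin(x)
def y(x):
    return np.sin(x) ** 2

# Chain rule: dy/dx = (dy/dg) * (dg/dx) = 2*sin(x) * cos(x)
def analytic_grad(x):
    return 2 * np.sin(x) * np.cos(x)

# Central finite-difference approximation of dy/dx
def numeric_grad(x, eps=1e-6):
    return (y(x + eps) - y(x - eps)) / (2 * eps)

x0 = 1.3
print(f"analytic: {analytic_grad(x0):.8f}")
print(f"numeric:  {numeric_grad(x0):.8f}")
```

The two values should agree to several decimal places; the same comparison, applied to every parameter, is the basis of gradient checking (see Exercise 1).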
2.2 Gradient Propagation from Output to Input
Backpropagation is an efficient algorithm for computing gradients by propagating errors backward through the network.
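The efficiency comes from reuse: the gradient arriving from the layer above is the only quantity a layer needs, so each local derivative is computed exactly once. A scalar sketch of this flow for a tiny two-layer chain (the numbers and the squared-error target are illustrative):

```python
# Tiny chain: z1 = w1*x, a1 = ReLU(z1), z2 = w2*a1, L = 0.5*(z2 - t)^2
x, t = 2.0, 1.0
w1, w2 = 0.5, -1.5

# Forward pass (cache intermediate values)
z1 = w1 * x
a1 = max(0.0, z1)
z2 = w2 * a1
L = 0.5 * (z2 - t) ** 2

# Backward pass: each upstream gradient is computed once, then reused
dz2 = z2 - t                          # dL/dz2
dw2 = dz2 * a1                        # dL/dw2 reuses dz2
da1 = dz2 * w2                        # propagate dz2 to the previous layer
dz1 = da1 * (1.0 if z1 > 0 else 0.0)  # gate through the ReLU derivative
dw1 = dz1 * x                         # dL/dw1 reuses everything upstream
print(dw1, dw2)  # 7.5 -2.5
```

The vectorized version of exactly this pattern appears in the layer code below.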
2.3 Gradient Computation at Each Layer
For a fully connected layer with ReLU activation:
Forward pass:
- $\mathbf{z} = \mathbf{W}\mathbf{a}^{(prev)} + \mathbf{b}$
- $\mathbf{a} = \text{ReLU}(\mathbf{z})$
Backward pass:
- $\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}} \odot \text{ReLU}'(\mathbf{z})$
- $\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}} \cdot (\mathbf{a}^{(prev)})^T$
- $\frac{\partial \mathcal{L}}{\partial \mathbf{b}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}}$
- $\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(prev)}} = \mathbf{W}^T \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{z}}$
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

class DenseLayerWithBackprop:
    """
    Dense layer with forward and backward pass
    """
    def __init__(self, n_input, n_output, activation='relu'):
        self.W = np.random.randn(n_input, n_output) * np.sqrt(2.0 / n_input)
        self.b = np.zeros((1, n_output))
        self.activation = activation
        # Cache for backpropagation
        self.cache = {}
        # Gradients
        self.dW = None
        self.db = None

    def forward(self, X):
        self.cache['X'] = X
        self.cache['Z'] = np.dot(X, self.W) + self.b
        if self.activation == 'relu':
            self.cache['A'] = relu(self.cache['Z'])
        elif self.activation == 'softmax':
            self.cache['A'] = softmax(self.cache['Z'])
        else:
            self.cache['A'] = self.cache['Z']
        return self.cache['A']

    def backward(self, dA):
        """
        Backward pass

        Parameters
        ----------
        dA : ndarray
            Gradient of loss with respect to this layer's activation
            (for the softmax output layer, pass dL/dZ directly)

        Returns
        -------
        dX : ndarray
            Gradient of loss with respect to input
        """
        m = dA.shape[0]  # Batch size
        # Gradient through activation
        if self.activation == 'relu':
            dZ = dA * relu_derivative(self.cache['Z'])
        else:
            dZ = dA  # For softmax + cross-entropy, the caller supplies dL/dZ
        # Gradients for weights and biases (averaged over the batch)
        self.dW = np.dot(self.cache['X'].T, dZ) / m
        self.db = np.mean(dZ, axis=0, keepdims=True)
        # Gradient for previous layer
        dX = np.dot(dZ, self.W.T)
        return dX

# Example: Manual backpropagation
np.random.seed(42)

# Create simple network
layer1 = DenseLayerWithBackprop(4, 8, 'relu')
layer2 = DenseLayerWithBackprop(8, 3, 'softmax')

# Sample data
X = np.array([[1, 2, 3, 4], [4, 3, 2, 1]], dtype=float)
y_true = np.array([[1, 0, 0], [0, 0, 1]])  # One-hot encoded

# Forward pass
A1 = layer1.forward(X)
A2 = layer2.forward(A1)
print("Forward pass:")
print(f"  Input shape: {X.shape}")
print(f"  Hidden activation shape: {A1.shape}")
print(f"  Output (predictions): \n{A2}")

# Loss gradient for cross-entropy + softmax: dL/dZ_output = y_pred - y_true
# (the division by batch size happens inside backward())
dL_dA2 = A2 - y_true

# Backward pass
dA1 = layer2.backward(dL_dA2)
dX = layer1.backward(dA1)
print("\nBackward pass:")
print(f"  dW2 shape: {layer2.dW.shape}, mean: {layer2.dW.mean():.6f}")
print(f"  dW1 shape: {layer1.dW.shape}, mean: {layer1.dW.mean():.6f}")
```
3. Mini-batch Learning and SGD
3.1 Impact of Batch Size
| Batch Type | Size | Gradient Quality | Speed | Memory |
|---|---|---|---|---|
| Batch GD | All data | Most accurate | Slow updates | High |
| Mini-batch | 32-256 | Good balance | Fast | Moderate |
| SGD | 1 | Very noisy | Fastest updates | Low |
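The "Gradient Quality" column can be made concrete by measuring how far mini-batch gradient estimates scatter from the full-batch gradient. A small sketch on a synthetic least-squares problem (the data, sizes, and evaluation point are all illustrative):

```python
import numpy as np

np.random.seed(0)
n, d = 1000, 5
X = np.random.randn(n, d)
w_true = np.random.randn(d)
y = X @ w_true + 0.1 * np.random.randn(n)

w = np.zeros(d)  # evaluate gradients at an arbitrary point

def grad(idx):
    """Gradient of mean squared error over the given sample indices."""
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

full = grad(np.arange(n))  # full-batch gradient as reference
for bs in [1, 8, 32, 128]:
    # Average deviation of mini-batch gradients from the full-batch gradient
    devs = [np.linalg.norm(grad(np.random.choice(n, bs, replace=False)) - full)
            for _ in range(200)]
    print(f"batch={bs:4d}: mean deviation = {np.mean(devs):.3f}")
```

The deviation shrinks roughly like $1/\sqrt{\text{batch size}}$, which is why moderate batch sizes already give a good accuracy/speed trade-off.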
3.2 Stochastic Gradient Descent (SGD)
SGD updates parameters using a single sample (or small batch) at a time, introducing noise that can help escape local minima.
```python
import numpy as np

def create_batches(X, y, batch_size, shuffle=True):
    """
    Create mini-batches from data

    Parameters
    ----------
    X : ndarray
        Input features
    y : ndarray
        Target labels
    batch_size : int
        Number of samples per batch
    shuffle : bool
        Whether to shuffle data

    Yields
    ------
    X_batch, y_batch : tuple
        Mini-batch of features and labels
    """
    n_samples = X.shape[0]
    indices = np.arange(n_samples)
    if shuffle:
        np.random.shuffle(indices)
    for start_idx in range(0, n_samples, batch_size):
        end_idx = min(start_idx + batch_size, n_samples)
        batch_indices = indices[start_idx:end_idx]
        yield X[batch_indices], y[batch_indices]

# Example usage
np.random.seed(42)
X = np.random.randn(100, 4)
y = np.random.randint(0, 3, 100)

print("Mini-batch iteration:")
for i, (X_batch, y_batch) in enumerate(create_batches(X, y, batch_size=32)):
    print(f"  Batch {i+1}: X shape = {X_batch.shape}, y shape = {y_batch.shape}")
```
3.3 SGD with Momentum
Momentum accelerates SGD by accumulating a velocity vector in directions of persistent gradient.
Update rules:
$$\mathbf{v} \leftarrow \beta \mathbf{v} + (1-\beta) \nabla_{\boldsymbol{\theta}} \mathcal{L}$$
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \mathbf{v}$$
Where $\beta$ (typically 0.9) is the momentum coefficient.
Intuition: Think of a ball rolling down a hill. Momentum helps it:
- Accelerate in consistent directions
- Smooth out oscillations
- Escape shallow local minima
```python
import numpy as np

class SGDOptimizer:
    """
    SGD optimizer with optional momentum
    """
    def __init__(self, learning_rate=0.01, momentum=0.0):
        self.lr = learning_rate
        self.momentum = momentum
        self.velocities = {}

    def update(self, params, grads):
        """
        Update parameters

        Parameters
        ----------
        params : dict
            Parameter name -> parameter array
        grads : dict
            Parameter name -> gradient array
        """
        for name in params:
            if name not in self.velocities:
                self.velocities[name] = np.zeros_like(params[name])
            # Update velocity (exponential moving average of gradients)
            self.velocities[name] = (
                self.momentum * self.velocities[name] +
                (1 - self.momentum) * grads[name]
            )
            # Update parameter
            params[name] -= self.lr * self.velocities[name]

# Compare SGD with and without momentum
def optimize_comparison():
    """
    Compare SGD with and without momentum on a 2D function
    f(x, y) = x^2 + 10*y^2 (elongated valley)
    """
    def loss_fn(x, y):
        return x**2 + 10*y**2

    def grad_fn(x, y):
        return 2*x, 20*y

    # Initial point
    x0, y0 = 10.0, 1.0
    results = {}

    for name, momentum in [('SGD', 0.0), ('SGD+Momentum', 0.9)]:
        x, y = x0, y0
        v_x, v_y = 0.0, 0.0
        lr = 0.05
        history = [(x, y, loss_fn(x, y))]
        for _ in range(50):
            gx, gy = grad_fn(x, y)
            v_x = momentum * v_x + (1 - momentum) * gx
            v_y = momentum * v_y + (1 - momentum) * gy
            x -= lr * v_x
            y -= lr * v_y
            history.append((x, y, loss_fn(x, y)))
        results[name] = history
        print(f"{name}: Final loss = {history[-1][2]:.6f}")
    return results

print("SGD vs SGD with Momentum")
print("=" * 40)
results = optimize_comparison()
```
4. Learning Rate and Epochs
4.1 Choosing the Learning Rate
The learning rate $\eta$ is one of the most important hyperparameters. It controls the step size during optimization.
| Learning Rate | Behavior | Symptoms |
|---|---|---|
| Too small | Slow convergence | Loss decreases very slowly |
| Just right | Smooth convergence | Loss decreases steadily |
| Too large | Overshooting | Loss oscillates or explodes |
```python
import numpy as np

def learning_rate_demo():
    """
    Demonstrate the effect of different learning rates
    """
    def loss_fn(x):
        return x**2

    def grad_fn(x):
        return 2*x

    learning_rates = [0.01, 0.1, 0.5, 1.1]
    x0 = 10.0
    n_steps = 20

    print("Effect of Learning Rate")
    print("=" * 50)

    for lr in learning_rates:
        x = x0
        history = [loss_fn(x)]
        for _ in range(n_steps):
            x = x - lr * grad_fn(x)
            history.append(loss_fn(x))
        final_loss = history[-1]
        # Diverged if the loss ended up above its starting value
        if np.isnan(final_loss) or final_loss > loss_fn(x0):
            status = "DIVERGED"
        elif final_loss < 0.01:
            status = "CONVERGED"
        else:
            status = "SLOW"
        print(f"lr={lr:.2f}: Final loss = {final_loss:.4f} ({status})")

learning_rate_demo()
```
4.2 Determining Number of Epochs
An epoch is one complete pass through the entire training dataset. The number of epochs affects:
- Underfitting: too few epochs, and the model hasn't learned enough
- Overfitting: too many epochs, and the model starts memorizing the training data
- Just right: stop once the validation loss stops improving
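One practical note: what matters for optimization is the total number of parameter updates, which is epochs × ceil(n_samples / batch_size), so halving the batch size roughly doubles the update count at a fixed epoch budget. A quick check (the numbers here are illustrative):

```python
import math

n_samples = 50_000
epochs = 30

for batch_size in [32, 64, 256]:
    # One update per mini-batch; the last batch may be smaller
    updates_per_epoch = math.ceil(n_samples / batch_size)
    total_updates = epochs * updates_per_epoch
    print(f"batch={batch_size:4d}: {updates_per_epoch} updates/epoch, "
          f"{total_updates} total updates")
```

This is why epoch counts tuned for one batch size usually need revisiting when the batch size changes.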
4.3 Interpreting Learning Curves
A learning curve plots loss (and/or accuracy) against training iterations or epochs. It reveals:
- Convergence: Is the model learning?
- Overfitting: Training loss decreasing but validation loss increasing
- Capacity: Is the model complex enough for the task?
5. Visualizing Training
5.1 Loss Curves
```python
import numpy as np

class TrainingHistory:
    """
    Track and store training metrics
    """
    def __init__(self):
        self.train_loss = []
        self.val_loss = []
        self.train_acc = []
        self.val_acc = []

    def log(self, train_loss, val_loss=None, train_acc=None, val_acc=None):
        self.train_loss.append(train_loss)
        if val_loss is not None:
            self.val_loss.append(val_loss)
        if train_acc is not None:
            self.train_acc.append(train_acc)
        if val_acc is not None:
            self.val_acc.append(val_acc)

    def summary(self):
        print("\nTraining Summary:")
        print(f"  Final train loss: {self.train_loss[-1]:.4f}")
        if self.val_loss:
            print(f"  Final val loss: {self.val_loss[-1]:.4f}")
        if self.train_acc:
            print(f"  Final train accuracy: {self.train_acc[-1]:.2%}")
        if self.val_acc:
            print(f"  Final val accuracy: {self.val_acc[-1]:.2%}")

# Simulate training history
np.random.seed(42)
history = TrainingHistory()
for epoch in range(50):
    # Simulate decreasing loss with noise
    train_loss = 2.0 * np.exp(-epoch/15) + 0.1 + np.random.normal(0, 0.05)
    val_loss = 2.0 * np.exp(-epoch/15) + 0.15 + np.random.normal(0, 0.08)
    # After epoch 30, simulate overfitting
    if epoch > 30:
        val_loss += (epoch - 30) * 0.02
    train_acc = 1 - train_loss/2.5
    val_acc = 1 - val_loss/2.5
    history.log(train_loss, val_loss, train_acc, val_acc)

history.summary()
print(f"\nBest validation loss at epoch: {np.argmin(history.val_loss) + 1}")
```
5.2 Comparing Training and Validation Loss
```python
import numpy as np

def analyze_learning_curves(train_loss, val_loss):
    """
    Analyze learning curves to detect training issues

    Parameters
    ----------
    train_loss : list
        Training loss history
    val_loss : list
        Validation loss history

    Returns
    -------
    diagnosis : str
        Training status and recommendations
    """
    train_loss = np.array(train_loss)
    val_loss = np.array(val_loss)

    # Check for divergence
    if train_loss[-1] > train_loss[0]:
        return "DIVERGING: Learning rate may be too high"

    # Check for underfitting
    if train_loss[-1] > 0.5 and val_loss[-1] > 0.5:
        return "UNDERFITTING: Model may need more capacity or training"

    # Check for overfitting
    best_val_epoch = np.argmin(val_loss)
    if best_val_epoch < len(val_loss) - 5:
        gap = val_loss[-1] - train_loss[-1]
        if gap > 0.1:
            return f"OVERFITTING: Best epoch was {best_val_epoch + 1}, consider early stopping"

    # Check convergence
    recent_improvement = val_loss[-5] - val_loss[-1] if len(val_loss) >= 5 else 1
    if recent_improvement < 0.01:
        return "CONVERGED: Training appears to have plateaued"

    return "TRAINING: Still improving"

# Test analysis
print("Learning Curve Analysis")
print("=" * 40)

# Simulate different scenarios
scenarios = {
    "Good training": (
        [2.0, 1.5, 1.0, 0.7, 0.5, 0.4, 0.35, 0.32, 0.30, 0.29],
        [2.1, 1.6, 1.1, 0.8, 0.6, 0.5, 0.45, 0.42, 0.40, 0.39]
    ),
    "Overfitting": (
        [2.0, 1.0, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, 0.002],
        [2.1, 1.1, 0.6, 0.4, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8]
    ),
    "Diverging": (
        [2.0, 3.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0, 500.0, 1000.0],
        [2.1, 3.5, 6.0, 15.0, 30.0, 70.0, 150.0, 300.0, 700.0, 1500.0]
    )
}

for name, (train, val) in scenarios.items():
    diagnosis = analyze_learning_curves(train, val)
    print(f"\n{name}:")
    print(f"  {diagnosis}")
```
5.3 Complete Training Loop
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_deriv(x):
    return (x > 0).astype(float)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def cross_entropy_loss(y_true, y_pred, epsilon=1e-15):
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

class SimpleNetwork:
    """
    Simple neural network with complete training loop
    """
    def __init__(self, layer_sizes):
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
            b = np.zeros((1, n_out))
            self.weights.append(W)
            self.biases.append(b)

    def forward(self, X):
        self.activations = [X]
        self.pre_activations = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = np.dot(A, W) + b
            self.pre_activations.append(Z)
            if i < len(self.weights) - 1:
                A = relu(Z)
            else:
                A = softmax(Z)
            self.activations.append(A)
        return A

    def backward(self, y_true):
        m = y_true.shape[0]
        grads_W = []
        grads_b = []
        # Output layer gradient (CE + softmax)
        dZ = self.activations[-1] - y_true
        for i in range(len(self.weights) - 1, -1, -1):
            dW = np.dot(self.activations[i].T, dZ) / m
            db = np.mean(dZ, axis=0, keepdims=True)
            grads_W.insert(0, dW)
            grads_b.insert(0, db)
            if i > 0:
                dA = np.dot(dZ, self.weights[i].T)
                dZ = dA * relu_deriv(self.pre_activations[i - 1])
        return grads_W, grads_b

    def update(self, grads_W, grads_b, learning_rate):
        for i in range(len(self.weights)):
            self.weights[i] -= learning_rate * grads_W[i]
            self.biases[i] -= learning_rate * grads_b[i]

    def train(self, X_train, y_train, X_val, y_val,
              epochs=100, batch_size=32, learning_rate=0.01):
        history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}
        n_samples = X_train.shape[0]

        for epoch in range(epochs):
            # Shuffle data
            indices = np.random.permutation(n_samples)
            X_shuffled = X_train[indices]
            y_shuffled = y_train[indices]

            # Mini-batch training
            for start in range(0, n_samples, batch_size):
                end = min(start + batch_size, n_samples)
                X_batch = X_shuffled[start:end]
                y_batch = y_shuffled[start:end]
                # Forward
                _ = self.forward(X_batch)
                # Backward
                grads_W, grads_b = self.backward(y_batch)
                # Update
                self.update(grads_W, grads_b, learning_rate)

            # Evaluate
            train_pred = self.forward(X_train)
            val_pred = self.forward(X_val)
            train_loss = cross_entropy_loss(y_train, train_pred)
            val_loss = cross_entropy_loss(y_val, val_pred)
            train_acc = np.mean(np.argmax(train_pred, axis=1) == np.argmax(y_train, axis=1))
            val_acc = np.mean(np.argmax(val_pred, axis=1) == np.argmax(y_val, axis=1))

            history['train_loss'].append(train_loss)
            history['val_loss'].append(val_loss)
            history['train_acc'].append(train_acc)
            history['val_acc'].append(val_acc)

            if (epoch + 1) % 20 == 0:
                print(f"Epoch {epoch+1}: train_loss={train_loss:.4f}, "
                      f"val_loss={val_loss:.4f}, val_acc={val_acc:.2%}")
        return history

# Create synthetic dataset
np.random.seed(42)
n_samples = 500
n_features = 4
n_classes = 3
X = np.random.randn(n_samples, n_features)
y_labels = np.random.randint(0, n_classes, n_samples)
y = np.eye(n_classes)[y_labels]  # One-hot encode

# Split data
split = int(0.8 * n_samples)
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

# Train network
print("Training Neural Network")
print("=" * 50)
network = SimpleNetwork([4, 16, 8, 3])
history = network.train(X_train, y_train, X_val, y_val,
                        epochs=100, batch_size=32, learning_rate=0.1)

print("\nFinal Results:")
print(f"  Training accuracy: {history['train_acc'][-1]:.2%}")
print(f"  Validation accuracy: {history['val_acc'][-1]:.2%}")
```
Exercises
Exercise 1: Implement Gradient Checking
Problem: Implement numerical gradient checking to verify backpropagation is correct.
$$\frac{\partial \mathcal{L}}{\partial \theta} \approx \frac{\mathcal{L}(\theta + \epsilon) - \mathcal{L}(\theta - \epsilon)}{2\epsilon}$$
Compare numerical gradients with backprop gradients for a small network.
Exercise 2: Learning Rate Finder
Problem: Implement a learning rate finder that:
- Starts with a very small learning rate (e.g., 1e-7)
- Increases it exponentially each batch
- Records the loss at each step
- Finds the learning rate where loss is decreasing fastest
Exercise 3: Momentum Variants
Problem: Implement and compare:
- Standard momentum
- Nesterov accelerated gradient (NAG)
Test on f(x,y) = x^2 + 100*y^2 and compare convergence.
Exercise 4: Batch Size Experiment
Problem: Train the same network with batch sizes [1, 8, 32, 128, full batch]:
- Plot training loss curves for each
- Measure wall-clock time per epoch
- Compare final validation accuracy
Exercise 5: Early Stopping
Problem: Implement early stopping:
- Monitor validation loss
- Stop training if val loss doesn't improve for N epochs (patience)
- Restore weights from the best epoch
Exercise 6: Gradient Clipping
Problem: Implement gradient clipping to prevent exploding gradients:
- Clip by value: cap gradient magnitudes
- Clip by norm: scale gradients if total norm exceeds threshold
Exercise 7: Training on XOR
Problem: Train a neural network to solve the XOR problem:
- Input: 4 samples [(0,0), (0,1), (1,0), (1,1)]
- Output: [0, 1, 1, 0]
- Network: 2-4-1 with sigmoid output
Plot the decision boundary after training.
Summary
In this chapter, we learned how neural networks learn:
- Gradient Descent: Iteratively updates parameters in the direction of steepest loss decrease
- Backpropagation: Efficiently computes gradients using the chain rule
- Mini-batch SGD: Balances gradient accuracy with computational efficiency
- Momentum: Accelerates training by accumulating velocity
- Learning Rate: Critical hyperparameter that controls step size
- Learning Curves: Essential for diagnosing training issues
Next Chapter Preview: In Chapter 4, we will learn about regularization techniques (L1, L2, Dropout, Batch Normalization) and advanced optimization algorithms (Adam, RMSprop) that improve model generalization and training stability.