## Learning Objectives
- Understand overfitting and the bias-variance tradeoff
- Master L1 and L2 regularization techniques
- Implement and apply Dropout correctly
- Understand Batch Normalization and its variants
- Know when to use different optimization algorithms (Adam, RMSprop)
- Apply learning rate scheduling and early stopping
## 1. Understanding Overfitting

### 1.1 Training Error vs Generalization Error
Overfitting occurs when a model learns the training data too well, including its noise and peculiarities, resulting in poor performance on new, unseen data.
| Scenario | Training Error | Validation Error | Status |
|---|---|---|---|
| Underfitting | High | High | Model too simple |
| Good Fit | Low | Low | Optimal |
| Overfitting | Very Low | High | Model memorizing |
### 1.2 Bias-Variance Tradeoff
The bias-variance tradeoff describes the relationship between model complexity and error:
$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$
- Bias: Error from simplifying assumptions (underfitting)
- Variance: Error from sensitivity to training data (overfitting)
- Irreducible Error: Noise inherent in the problem
```mermaid
graph LR
    B{Model Complexity} -->|Too Simple| C[High Bias<br/>Underfitting]
    B -->|Just Right| D[Optimal<br/>Good Generalization]
    B -->|Too Complex| E[High Variance<br/>Overfitting]
    style D fill:#e8f5e9
    style C fill:#fff3e0
    style E fill:#ffebee
```
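To make the tradeoff concrete, the following sketch fits polynomials of increasing degree to noisy samples of a smooth function (the sine target, noise level, and degrees here are arbitrary illustrative choices, not from any particular experiment). Training error falls monotonically with model capacity, while held-out error follows the bias-variance U-curve:

```python
import numpy as np

np.random.seed(0)

def f(x):
    """Underlying smooth function the data is sampled from."""
    return np.sin(2 * np.pi * x)

# Small noisy training set, larger held-out validation set
x_train = np.random.rand(20)
y_train = f(x_train) + np.random.normal(0, 0.2, size=20)
x_val = np.random.rand(200)
y_val = f(x_val) + np.random.normal(0, 0.2, size=200)

errors = {}
for degree in [1, 3, 9]:
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    errors[degree] = (mse_train, mse_val)
    print(f"degree {degree}: train MSE = {mse_train:.4f}, val MSE = {mse_val:.4f}")
```

The degree-1 fit underfits (high error everywhere); higher degrees always reduce training error, but past the sweet spot the validation error stops improving — the gap between the two is the overfitting signal from the table above.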
## 2. Regularization Techniques

### 2.1 L1 Regularization (Lasso)
L1 regularization adds the sum of absolute weights to the loss function:
$$\mathcal{L}_{L1} = \mathcal{L}_{original} + \lambda \sum_{i} |w_i|$$
Characteristics:
- Promotes sparsity (drives some weights to exactly zero)
- Acts as feature selection
- Useful when many features are irrelevant
```python
import numpy as np

def l1_regularization_loss(weights, lambda_l1=0.01):
    """
    Compute L1 regularization term

    Parameters:
    -----------
    weights : list of ndarray
        Network weights
    lambda_l1 : float
        Regularization strength

    Returns:
    --------
    l1_loss : float
        L1 regularization term
    """
    l1_loss = 0
    for W in weights:
        l1_loss += np.sum(np.abs(W))
    return lambda_l1 * l1_loss

def l1_gradient(W, lambda_l1=0.01):
    """
    Gradient of L1 regularization
    """
    return lambda_l1 * np.sign(W)

# Example
np.random.seed(42)
W = np.random.randn(10, 5)
l1_loss = l1_regularization_loss([W], lambda_l1=0.01)
l1_grad = l1_gradient(W, lambda_l1=0.01)
print(f"L1 Loss: {l1_loss:.4f}")
print(f"L1 Gradient mean: {l1_grad.mean():.4f}")
print(f"Non-zero gradient elements: {np.sum(l1_grad != 0)}")
```
### 2.2 L2 Regularization (Ridge / Weight Decay)
L2 regularization adds the sum of squared weights to the loss:
$$\mathcal{L}_{L2} = \mathcal{L}_{original} + \frac{\lambda}{2} \sum_{i} w_i^2$$
Characteristics:
- Penalizes large weights more heavily
- Encourages smaller, distributed weights
- Most commonly used regularization in deep learning
```python
import numpy as np

def l2_regularization_loss(weights, lambda_l2=0.01):
    """
    Compute L2 regularization term

    Parameters:
    -----------
    weights : list of ndarray
        Network weights
    lambda_l2 : float
        Regularization strength (weight decay)

    Returns:
    --------
    l2_loss : float
        L2 regularization term
    """
    l2_loss = 0
    for W in weights:
        l2_loss += np.sum(W ** 2)
    return 0.5 * lambda_l2 * l2_loss

def l2_gradient(W, lambda_l2=0.01):
    """
    Gradient of L2 regularization
    """
    return lambda_l2 * W

# Example: Compare weight distributions with/without L2
np.random.seed(42)

# Simulate training effect
W_original = np.random.randn(100, 50) * 2  # Large weights

# With L2, weights would be smaller
W_l2 = W_original * 0.5  # Simulated effect

print("Weight statistics:")
print(f"  Original: mean={W_original.mean():.4f}, std={W_original.std():.4f}")
print(f"  With L2:  mean={W_l2.mean():.4f}, std={W_l2.std():.4f}")
print(f"\nL2 loss (original): {l2_regularization_loss([W_original]):.4f}")
print(f"L2 loss (with L2): {l2_regularization_loss([W_l2]):.4f}")
```
### 2.3 Dropout
Dropout randomly sets a fraction of neurons to zero during training, forcing the network to learn redundant representations.
During training:
$$\tilde{h} = h \odot m, \quad m_i \sim \text{Bernoulli}(1-p)$$
During inference:
$$h_{test} = h \cdot (1-p)$$
Or equivalently, scale during training (inverted dropout).
```python
import numpy as np

class Dropout:
    """
    Dropout layer implementation
    """
    def __init__(self, drop_rate=0.5):
        """
        Parameters:
        -----------
        drop_rate : float
            Probability of dropping a neuron (0 to 1)
        """
        self.drop_rate = drop_rate
        self.mask = None
        self.training = True

    def forward(self, X):
        """
        Forward pass with dropout

        Parameters:
        -----------
        X : ndarray
            Input activations

        Returns:
        --------
        output : ndarray
            Dropped out activations
        """
        if self.training and self.drop_rate > 0:
            # Create binary mask
            self.mask = np.random.binomial(1, 1 - self.drop_rate, X.shape)
            # Apply mask and scale (inverted dropout)
            return X * self.mask / (1 - self.drop_rate)
        else:
            return X

    def backward(self, dout):
        """
        Backward pass
        """
        if self.training and self.drop_rate > 0:
            return dout * self.mask / (1 - self.drop_rate)
        else:
            return dout

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

# Example usage
np.random.seed(42)
dropout = Dropout(drop_rate=0.5)
X = np.ones((3, 5))  # All ones for clarity

print("Dropout Example (drop_rate=0.5)")
print("=" * 40)
print(f"Input:\n{X}")

dropout.train()
output_train = dropout.forward(X)
print(f"\nTraining mode output:\n{output_train}")
print(f"Active neurons: {np.sum(output_train > 0)}/{X.size}")

dropout.eval()
output_eval = dropout.forward(X)
print(f"\nEval mode output:\n{output_eval}")
```
Important: Always switch to evaluation mode during inference (e.g. `model.eval()` in PyTorch):
- Training: Dropout active, Batch Norm uses batch statistics
- Inference: Dropout disabled, Batch Norm uses running statistics
## 3. Batch Normalization

### 3.1 Internal Covariate Shift
Internal covariate shift refers to the change in the distribution of layer inputs during training. Batch Normalization addresses this by normalizing layer inputs.
### 3.2 How Batch Normalization Works
For a mini-batch of activations, Batch Norm performs:
1. Compute mean and variance:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$$
2. Normalize:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
3. Scale and shift (learnable parameters):
$$y_i = \gamma \hat{x}_i + \beta$$
```python
import numpy as np

class BatchNorm:
    """
    Batch Normalization layer
    """
    def __init__(self, n_features, momentum=0.9, epsilon=1e-5):
        self.gamma = np.ones((1, n_features))
        self.beta = np.zeros((1, n_features))
        # Running statistics for inference
        self.running_mean = np.zeros((1, n_features))
        self.running_var = np.ones((1, n_features))
        self.momentum = momentum
        self.epsilon = epsilon
        self.training = True
        # Cache for backprop
        self.cache = {}

    def forward(self, X):
        if self.training:
            # Compute batch statistics
            mean = np.mean(X, axis=0, keepdims=True)
            var = np.var(X, axis=0, keepdims=True)
            # Update running statistics
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
            # Normalize
            X_norm = (X - mean) / np.sqrt(var + self.epsilon)
            # Cache for backprop
            self.cache = {'X': X, 'mean': mean, 'var': var, 'X_norm': X_norm}
        else:
            # Use running statistics
            X_norm = (X - self.running_mean) / np.sqrt(self.running_var + self.epsilon)
        # Scale and shift
        return self.gamma * X_norm + self.beta

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

# Example
np.random.seed(42)
bn = BatchNorm(n_features=4)

# Simulated activations (batch_size=8, features=4)
X = np.random.randn(8, 4) * 5 + 10  # Mean ~10, std ~5

print("Batch Normalization Example")
print("=" * 40)
print(f"Input stats: mean={X.mean():.2f}, std={X.std():.2f}")

bn.train()
output = bn.forward(X)
print(f"Output stats: mean={output.mean():.4f}, std={output.std():.4f}")
```
### 3.3 Layer Normalization Comparison
| Aspect | Batch Normalization | Layer Normalization |
|---|---|---|
| Normalizes over | Batch dimension | Feature dimension |
| Batch size dependency | Yes (needs large batches) | No |
| Best for | CNNs, feedforward | RNNs, Transformers |
| Inference behavior | Uses running stats | Same as training |
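To see the difference in code, here is a minimal Layer Normalization sketch (an illustrative NumPy function, not a drop-in layer): each sample is normalized over its own features, so no batch statistics or running averages are needed, and training and inference behave identically.

```python
import numpy as np

def layer_norm(X, gamma, beta, epsilon=1e-5):
    """Normalize each sample over its feature dimension (axis=1)."""
    mean = X.mean(axis=1, keepdims=True)   # per-sample mean
    var = X.var(axis=1, keepdims=True)     # per-sample variance
    X_norm = (X - mean) / np.sqrt(var + epsilon)
    return gamma * X_norm + beta           # learnable scale and shift

np.random.seed(0)
X = np.random.randn(4, 8) * 3 + 5  # batch of 4 samples, 8 features

out = layer_norm(X, gamma=np.ones((1, 8)), beta=np.zeros((1, 8)))
print("Per-sample means:", np.round(out.mean(axis=1), 6))
print("Per-sample stds: ", np.round(out.std(axis=1), 4))
```

Because the statistics come from a single sample, the result is the same for a batch of one — which is why Layer Norm has no batch-size dependency, unlike the `BatchNorm` class above.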
## 4. Optimization Algorithms

### 4.1 SGD with Momentum
Covered in Chapter 3. Accumulates velocity to accelerate consistent gradient directions.
### 4.2 AdaGrad
AdaGrad adapts learning rates for each parameter based on historical gradients:
$$G_t = G_{t-1} + g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$$
Problem: because the accumulated sum $G_t$ only ever grows, the effective learning rate monotonically decreases and can become vanishingly small before training finishes.
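The two update equations translate directly into code. Below is an illustrative single-vector sketch (`adagrad_update` and the quadratic test function are ad-hoc choices for this demo, not a library API); note how the accumulated $G_t$ shrinks the effective step size over time:

```python
import numpy as np

def adagrad_update(theta, grad, G, lr=1.0, epsilon=1e-8):
    """One AdaGrad step: accumulate squared gradients, scale each step by 1/sqrt(G)."""
    G = G + grad ** 2
    theta = theta - lr * grad / (np.sqrt(G) + epsilon)
    return theta, G

# Minimize f(theta) = theta^2, whose gradient is 2*theta
theta = np.array([5.0])
G = np.zeros_like(theta)
for step in range(100):
    grad = 2 * theta
    theta, G = adagrad_update(theta, grad, G, lr=1.0)

print(f"theta after 100 steps: {theta[0]:.6f}")
print(f"accumulated G: {G[0]:.2f} (effective lr = {1.0 / np.sqrt(G[0]):.4f})")
```

On this simple convex problem the shrinking step size is harmless, but in long deep-learning runs the same mechanism can stall progress — the motivation for RMSprop below.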
### 4.3 RMSprop
RMSprop fixes AdaGrad's diminishing learning rate by replacing the running sum with an exponential moving average of squared gradients:
$$E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$$
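The change from AdaGrad is a single line: the squared-gradient accumulator decays instead of growing without bound. A sketch under the same illustrative setup as the AdaGrad demo (`rmsprop_update` is an ad-hoc helper name):

```python
import numpy as np

def rmsprop_update(theta, grad, Eg2, lr=0.01, beta=0.9, epsilon=1e-8):
    """One RMSprop step: EMA of squared gradients instead of a running sum."""
    Eg2 = beta * Eg2 + (1 - beta) * grad ** 2   # decaying average, never saturates
    theta = theta - lr * grad / (np.sqrt(Eg2) + epsilon)
    return theta, Eg2

# Minimize f(theta) = theta^2, gradient 2*theta
theta = np.array([5.0])
Eg2 = np.zeros_like(theta)
for step in range(200):
    grad = 2 * theta
    theta, Eg2 = rmsprop_update(theta, grad, Eg2, lr=0.1)

print(f"theta after 200 steps: {theta[0]:.4f}")
```

Because the average forgets old gradients, the effective step size tracks the recent gradient scale rather than decaying to zero, which also suits non-stationary objectives.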
### 4.4 Adam / AdamW
Adam (Adaptive Moment Estimation) combines momentum and RMSprop:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(First moment)}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(Second moment)}$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad \text{(Bias correction)}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
```python
import numpy as np

class Adam:
    """
    Adam optimizer implementation
    """
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = 0   # Time step

    def update(self, params, grads):
        """
        Update parameters using Adam

        Parameters:
        -----------
        params : dict
            Parameter name -> parameter array
        grads : dict
            Parameter name -> gradient array
        """
        self.t += 1
        for name in params:
            if name not in self.m:
                self.m[name] = np.zeros_like(params[name])
                self.v[name] = np.zeros_like(params[name])
            # Update biased first moment estimate
            self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * grads[name]
            # Update biased second raw moment estimate
            self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * (grads[name] ** 2)
            # Compute bias-corrected first moment estimate
            m_hat = self.m[name] / (1 - self.beta1 ** self.t)
            # Compute bias-corrected second raw moment estimate
            v_hat = self.v[name] / (1 - self.beta2 ** self.t)
            # Update parameters
            params[name] -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

# Compare optimizers
def compare_optimizers():
    """
    Compare SGD, Momentum, and Adam on a test function
    """
    def rosenbrock(x, y):
        return (1 - x)**2 + 100 * (y - x**2)**2

    def rosenbrock_grad(x, y):
        dx = -2 * (1 - x) - 400 * x * (y - x**2)
        dy = 200 * (y - x**2)
        return dx, dy

    optimizers = {
        'SGD': {'lr': 0.0001, 'momentum': 0},
        'Momentum': {'lr': 0.0001, 'momentum': 0.9},
        'Adam': {'lr': 0.01}
    }

    results = {}
    for name, config in optimizers.items():
        x, y = -1.0, 1.0
        history = [(x, y, rosenbrock(x, y))]

        if name == 'Adam':
            m_x, m_y = 0, 0
            v_x, v_y = 0, 0
            beta1, beta2 = 0.9, 0.999
            epsilon = 1e-8
            lr = config['lr']
            for t in range(1, 1001):
                dx, dy = rosenbrock_grad(x, y)
                m_x = beta1 * m_x + (1 - beta1) * dx
                m_y = beta1 * m_y + (1 - beta1) * dy
                v_x = beta2 * v_x + (1 - beta2) * dx**2
                v_y = beta2 * v_y + (1 - beta2) * dy**2
                m_x_hat = m_x / (1 - beta1**t)
                m_y_hat = m_y / (1 - beta1**t)
                v_x_hat = v_x / (1 - beta2**t)
                v_y_hat = v_y / (1 - beta2**t)
                x -= lr * m_x_hat / (np.sqrt(v_x_hat) + epsilon)
                y -= lr * m_y_hat / (np.sqrt(v_y_hat) + epsilon)
                history.append((x, y, rosenbrock(x, y)))
        else:
            v_x, v_y = 0, 0
            lr = config['lr']
            momentum = config['momentum']
            for _ in range(1000):
                dx, dy = rosenbrock_grad(x, y)
                v_x = momentum * v_x + dx
                v_y = momentum * v_y + dy
                x -= lr * v_x
                y -= lr * v_y
                history.append((x, y, rosenbrock(x, y)))

        results[name] = history
        print(f"{name}: Final (x, y) = ({x:.4f}, {y:.4f}), loss = {rosenbrock(x, y):.6f}")

    return results

print("Optimizer Comparison on Rosenbrock Function")
print("=" * 50)
print("Minimum at (1, 1) with value 0")
print()
results = compare_optimizers()
```
Optimizer Selection Guidelines
- Adam: Good default, works well for most problems
- SGD + Momentum: Often better final performance with tuning
- AdamW: Adam with proper weight decay, recommended for transformers
- RMSprop: Good for RNNs and non-stationary objectives
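AdamW's difference from Adam-with-L2 is where the weight decay enters: with L2 added to the loss, the decay gradient passes through the adaptive denominator; in AdamW it multiplies the weights directly (decoupled weight decay). A hedged sketch of the modified update (`adamw_update` is an ad-hoc helper; driving it with a zero gradient simply isolates the decay term):

```python
import numpy as np

def adamw_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
                 epsilon=1e-8, weight_decay=0.01):
    """One AdamW step: the standard Adam update plus decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decay acts on the weights directly, outside the adaptive scaling
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + epsilon) + weight_decay * theta)
    return theta, m, v

np.random.seed(0)
theta = np.random.randn(5)
initial_norm = np.linalg.norm(theta)
m, v = np.zeros_like(theta), np.zeros_like(theta)

for t in range(1, 101):
    grad = np.zeros_like(theta)  # zero gradient: only the decay term acts
    theta, m, v = adamw_update(theta, grad, m, v, t, weight_decay=0.1)

print(f"||theta||: {initial_norm:.4f} -> {np.linalg.norm(theta):.4f}")
```

With a zero gradient, each step multiplies the weights by $(1 - \eta \lambda)$ regardless of gradient history — exactly the behavior L2-through-Adam does not have, since its decay term would be rescaled per parameter by $\sqrt{\hat{v}_t}$.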
## 5. Learning Rate Scheduling

### 5.1 Step Decay
Reduce learning rate by a factor at specific epochs:
$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$$
```python
def step_decay(epoch, initial_lr=0.1, drop_factor=0.5, epochs_drop=10):
    """
    Step decay learning rate schedule
    """
    return initial_lr * (drop_factor ** (epoch // epochs_drop))

# Example
print("Step Decay Schedule:")
for epoch in [0, 10, 20, 30, 40]:
    lr = step_decay(epoch)
    print(f"  Epoch {epoch}: lr = {lr:.4f}")
```
### 5.2 Cosine Annealing
Smoothly decrease learning rate following a cosine curve:
$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{t \pi}{T}))$$
```python
import numpy as np

def cosine_annealing(epoch, total_epochs, lr_max=0.1, lr_min=0.0001):
    """
    Cosine annealing learning rate schedule
    """
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * epoch / total_epochs))

# Example
print("\nCosine Annealing Schedule (50 epochs):")
for epoch in [0, 10, 25, 40, 50]:
    lr = cosine_annealing(epoch, total_epochs=50)
    print(f"  Epoch {epoch}: lr = {lr:.6f}")
```
### 5.3 Warmup
Warmup gradually increases learning rate at the start of training:
```python
import numpy as np

def warmup_schedule(epoch, warmup_epochs=5, base_lr=0.1):
    """
    Linear warmup schedule
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

# Example with warmup + cosine decay
def warmup_cosine(epoch, total_epochs=100, warmup_epochs=10, lr_max=0.1, lr_min=0.0001):
    if epoch < warmup_epochs:
        return lr_max * (epoch + 1) / warmup_epochs
    else:
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * progress))

print("\nWarmup + Cosine Schedule:")
for epoch in [0, 5, 10, 50, 100]:
    lr = warmup_cosine(epoch)
    print(f"  Epoch {epoch}: lr = {lr:.6f}")
```
## 6. Early Stopping
Early stopping monitors validation loss and stops training when it stops improving, preventing overfitting.
```python
import numpy as np

class EarlyStopping:
    """
    Early stopping to prevent overfitting
    """
    def __init__(self, patience=10, min_delta=0.0001, restore_best=True):
        """
        Parameters:
        -----------
        patience : int
            Number of epochs to wait for improvement
        min_delta : float
            Minimum change to qualify as improvement
        restore_best : bool
            Whether to restore best weights
        """
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss, model_weights=None):
        """
        Check if training should stop

        Returns:
        --------
        should_stop : bool
            True if training should stop
        """
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            if self.restore_best and model_weights is not None:
                self.best_weights = [w.copy() for w in model_weights]
            return False
        else:
            self.counter += 1
            if self.counter >= self.patience:
                print(f"Early stopping triggered after {self.patience} epochs without improvement")
                return True
            return False

# Simulate training with early stopping
np.random.seed(42)
early_stopping = EarlyStopping(patience=5)

print("Simulated Training with Early Stopping")
print("=" * 50)
for epoch in range(100):
    # Simulate val loss: decreases then increases (overfitting)
    if epoch < 20:
        val_loss = 1.0 - epoch * 0.03 + np.random.normal(0, 0.02)
    else:
        val_loss = 0.4 + (epoch - 20) * 0.02 + np.random.normal(0, 0.02)

    if epoch % 5 == 0:
        print(f"Epoch {epoch}: val_loss = {val_loss:.4f}")

    if early_stopping(val_loss):
        print(f"Stopped at epoch {epoch}")
        print(f"Best validation loss: {early_stopping.best_loss:.4f}")
        break
```
## Exercises

### Exercise 1: L1 vs L2 Regularization
Problem: Train two identical networks on the same data, one with L1 and one with L2 regularization:
- Compare the weight distributions (histogram)
- Count the number of near-zero weights (|w| < 0.01)
- Explain the difference in terms of sparsity
### Exercise 2: Dropout Rate Experiment
Problem: Train networks with dropout rates of [0, 0.1, 0.3, 0.5, 0.7]:
- Plot training and validation accuracy for each
- Find the optimal dropout rate
- Observe what happens with very high dropout
### Exercise 3: Batch Normalization Ablation
Problem: Compare training with and without Batch Normalization:
- Plot loss curves for both
- Try different learning rates
- Measure the sensitivity to initialization
### Exercise 4: Optimizer Comparison
Problem: Train the same network with SGD, Momentum, RMSprop, and Adam:
- Use the same learning rate initially, then tune each
- Plot convergence curves
- Measure final test accuracy
### Exercise 5: Learning Rate Schedule Design
Problem: Implement and compare:
- Constant learning rate
- Step decay (halve every 20 epochs)
- Cosine annealing
- Warmup + cosine
Which achieves the best final accuracy?
### Exercise 6: Early Stopping Analysis
Problem: Train a network prone to overfitting:
- Plot train/val loss without early stopping
- Implement early stopping with patience=5, 10, 20
- Compare final test accuracy for each patience value
## Summary
In this chapter, we learned techniques to improve model training:
- Overfitting: When models memorize training data; detected by gap between train/val loss
- L1/L2 Regularization: Penalize large weights; L1 promotes sparsity, L2 is most common
- Dropout: Randomly disable neurons during training for redundant representations
- Batch Normalization: Normalize layer inputs for stable, faster training
- Adam: Adaptive optimizer combining momentum and per-parameter learning rates
- Learning Rate Scheduling: Adjust learning rate during training (warmup, cosine decay)
- Early Stopping: Stop training when validation loss stops improving
Next Chapter Preview: In Chapter 5, we will apply everything we've learned to practical projects, including MNIST classification, data preprocessing pipelines, model evaluation, and hyperparameter tuning.