Chapter 4: Regularization and Optimization

Preventing Overfitting and Improving Training Efficiency

Reading Time: 30-35 minutes · Code Examples: 8 · Exercises: 6 · Difficulty: Intermediate
In this chapter, we will learn techniques to improve model generalization and training efficiency. We will cover regularization methods to prevent overfitting, normalization techniques for stable training, and advanced optimization algorithms that adapt learning rates automatically.

Learning Objectives

After completing this chapter, you will be able to:

  1. Diagnose underfitting and overfitting from training and validation error
  2. Apply L1/L2 regularization and dropout to improve generalization
  3. Explain and implement Batch Normalization
  4. Choose among SGD, Momentum, RMSprop, and Adam for a given problem
  5. Design learning rate schedules and use early stopping

1. Understanding Overfitting

1.1 Training Error vs Generalization Error

Overfitting occurs when a model learns the training data too well, including its noise and peculiarities, resulting in poor performance on new, unseen data.

| Scenario | Training Error | Validation Error | Status |
|----------|----------------|------------------|--------|
| Underfitting | High | High | Model too simple |
| Good Fit | Low | Low | Optimal |
| Overfitting | Very Low | High | Model memorizing |

1.2 Bias-Variance Tradeoff

The bias-variance tradeoff describes the relationship between model complexity and error:

$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

```mermaid
graph LR
    A[Model Complexity] --> B{Balance?}
    B -->|Too Simple| C[High Bias<br>Underfitting]
    B -->|Just Right| D[Optimal<br>Good Generalization]
    B -->|Too Complex| E[High Variance<br>Overfitting]
    style D fill:#e8f5e9
    style C fill:#fff3e0
    style E fill:#ffebee
```

2. Regularization Techniques

2.1 L1 Regularization (Lasso)

L1 regularization adds the sum of absolute weights to the loss function:

$$\mathcal{L}_{L1} = \mathcal{L}_{original} + \lambda \sum_{i} |w_i|$$

Characteristics:

  - Drives many weights to exactly zero, producing sparse models
  - Acts as implicit feature selection: uninformative inputs receive zero weight
  - The gradient $\lambda \cdot \text{sign}(w)$ has constant magnitude, so small weights are pushed all the way to zero rather than merely shrunk

import numpy as np

def l1_regularization_loss(weights, lambda_l1=0.01):
    """
    Compute L1 regularization term

    Parameters:
    -----------
    weights : list of ndarray
        Network weights
    lambda_l1 : float
        Regularization strength

    Returns:
    --------
    l1_loss : float
        L1 regularization term
    """
    l1_loss = 0
    for W in weights:
        l1_loss += np.sum(np.abs(W))
    return lambda_l1 * l1_loss

def l1_gradient(W, lambda_l1=0.01):
    """
    Gradient of L1 regularization
    """
    return lambda_l1 * np.sign(W)

# Example
np.random.seed(42)
W = np.random.randn(10, 5)

l1_loss = l1_regularization_loss([W], lambda_l1=0.01)
l1_grad = l1_gradient(W, lambda_l1=0.01)

print(f"L1 Loss: {l1_loss:.4f}")
print(f"L1 Gradient mean: {l1_grad.mean():.4f}")
print(f"Non-zero gradient elements: {np.sum(l1_grad != 0)}")

2.2 L2 Regularization (Ridge / Weight Decay)

L2 regularization adds the sum of squared weights to the loss:

$$\mathcal{L}_{L2} = \mathcal{L}_{original} + \frac{\lambda}{2} \sum_{i} w_i^2$$

Characteristics:

  - Shrinks all weights toward zero in proportion to their magnitude; weights rarely become exactly zero
  - Penalizes large weights quadratically, favoring many small weights over a few large ones
  - With plain SGD, the gradient $\lambda w$ makes L2 equivalent to weight decay (the equivalence breaks for adaptive optimizers, which motivates AdamW)

import numpy as np

def l2_regularization_loss(weights, lambda_l2=0.01):
    """
    Compute L2 regularization term

    Parameters:
    -----------
    weights : list of ndarray
        Network weights
    lambda_l2 : float
        Regularization strength (weight decay)

    Returns:
    --------
    l2_loss : float
        L2 regularization term
    """
    l2_loss = 0
    for W in weights:
        l2_loss += np.sum(W ** 2)
    return 0.5 * lambda_l2 * l2_loss

def l2_gradient(W, lambda_l2=0.01):
    """
    Gradient of L2 regularization
    """
    return lambda_l2 * W

# Example: Compare weight distributions with/without L2
np.random.seed(42)

# Simulate training effect
W_original = np.random.randn(100, 50) * 2  # Large weights

# With L2, weights would be smaller
W_l2 = W_original * 0.5  # Simulated effect

print("Weight statistics:")
print(f"  Original: mean={W_original.mean():.4f}, std={W_original.std():.4f}")
print(f"  With L2:  mean={W_l2.mean():.4f}, std={W_l2.std():.4f}")
print(f"\nL2 loss (original): {l2_regularization_loss([W_original]):.4f}")
print(f"L2 loss (with L2):  {l2_regularization_loss([W_l2]):.4f}")

2.3 Dropout

Dropout randomly sets a fraction of neurons to zero during training, forcing the network to learn redundant representations.

During training:

$$\tilde{h} = h \odot m, \quad m_i \sim \text{Bernoulli}(1-p)$$

During inference:

$$h_{test} = h \cdot (1-p)$$

Or, equivalently, scale by $\frac{1}{1-p}$ during training so that no scaling is needed at inference (inverted dropout, implemented below).

import numpy as np

class Dropout:
    """
    Dropout layer implementation
    """

    def __init__(self, drop_rate=0.5):
        """
        Parameters:
        -----------
        drop_rate : float
            Probability of dropping a neuron (0 to 1)
        """
        self.drop_rate = drop_rate
        self.mask = None
        self.training = True

    def forward(self, X):
        """
        Forward pass with dropout

        Parameters:
        -----------
        X : ndarray
            Input activations

        Returns:
        --------
        output : ndarray
            Dropped out activations
        """
        if self.training and self.drop_rate > 0:
            # Create binary mask
            self.mask = np.random.binomial(1, 1 - self.drop_rate, X.shape)
            # Apply mask and scale (inverted dropout)
            return X * self.mask / (1 - self.drop_rate)
        else:
            return X

    def backward(self, dout):
        """
        Backward pass
        """
        if self.training and self.drop_rate > 0:
            return dout * self.mask / (1 - self.drop_rate)
        else:
            return dout

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

# Example usage
np.random.seed(42)
dropout = Dropout(drop_rate=0.5)

X = np.ones((3, 5))  # All ones for clarity

print("Dropout Example (drop_rate=0.5)")
print("=" * 40)
print(f"Input:\n{X}")

dropout.train()
output_train = dropout.forward(X)
print(f"\nTraining mode output:\n{output_train}")
print(f"Active neurons: {np.sum(output_train > 0)}/{X.size}")

dropout.eval()
output_eval = dropout.forward(X)
print(f"\nEval mode output:\n{output_eval}")

Important: Always switch to evaluation mode (`dropout.eval()`) during inference; leaving the layer in training mode randomly masks activations and makes predictions non-deterministic.

3. Batch Normalization

3.1 Internal Covariate Shift

Internal covariate shift refers to the change in the distribution of each layer's inputs as the parameters of preceding layers update during training. Batch Normalization was introduced to address this by normalizing layer inputs, though later analyses attribute much of its benefit to smoothing the optimization landscape.

3.2 How Batch Normalization Works

For a mini-batch of activations, Batch Norm performs:

  1. Compute mean and variance:

    $$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$$

  2. Normalize:

    $$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

  3. Scale and shift (learnable parameters):

    $$y_i = \gamma \hat{x}_i + \beta$$

import numpy as np

class BatchNorm:
    """
    Batch Normalization layer
    """

    def __init__(self, n_features, momentum=0.9, epsilon=1e-5):
        self.gamma = np.ones((1, n_features))
        self.beta = np.zeros((1, n_features))

        # Running statistics for inference
        self.running_mean = np.zeros((1, n_features))
        self.running_var = np.ones((1, n_features))

        self.momentum = momentum
        self.epsilon = epsilon
        self.training = True

        # Cache for backprop
        self.cache = {}

    def forward(self, X):
        if self.training:
            # Compute batch statistics
            mean = np.mean(X, axis=0, keepdims=True)
            var = np.var(X, axis=0, keepdims=True)

            # Update running statistics
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var

            # Normalize
            X_norm = (X - mean) / np.sqrt(var + self.epsilon)

            # Cache for backprop
            self.cache = {'X': X, 'mean': mean, 'var': var, 'X_norm': X_norm}
        else:
            # Use running statistics
            X_norm = (X - self.running_mean) / np.sqrt(self.running_var + self.epsilon)

        # Scale and shift
        return self.gamma * X_norm + self.beta

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

# Example
np.random.seed(42)
bn = BatchNorm(n_features=4)

# Simulated activations (batch_size=8, features=4)
X = np.random.randn(8, 4) * 5 + 10  # Mean ~10, std ~5

print("Batch Normalization Example")
print("=" * 40)
print(f"Input stats: mean={X.mean():.2f}, std={X.std():.2f}")

bn.train()
output = bn.forward(X)
print(f"Output stats: mean={output.mean():.4f}, std={output.std():.4f}")

3.3 Layer Normalization Comparison

| Aspect | Batch Normalization | Layer Normalization |
|--------|---------------------|---------------------|
| Normalizes over | Batch dimension | Feature dimension |
| Batch size dependency | Yes (needs large batches) | No |
| Best for | CNNs, feedforward | RNNs, Transformers |
| Inference behavior | Uses running stats | Same as training |
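For contrast with the `BatchNorm` class above, here is a minimal layer normalization sketch (function name and defaults are my own): each sample is normalized over its own features, so the result is identical for any batch size and no running statistics are needed.

```python
import numpy as np

def layer_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each row (sample) across the feature dimension."""
    mean = X.mean(axis=1, keepdims=True)   # per-sample mean
    var = X.var(axis=1, keepdims=True)     # per-sample variance
    return gamma * (X - mean) / np.sqrt(var + eps) + beta

np.random.seed(42)
X = np.random.randn(4, 6) * 5 + 10  # mean ~10, std ~5

Y = layer_norm(X)
# Each row is normalized independently: per-row mean ~0, std ~1
print("Per-row means:", np.round(Y.mean(axis=1), 6))
print("Per-row stds: ", np.round(Y.std(axis=1), 4))
```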

4. Optimization Algorithms

4.1 SGD with Momentum

Covered in Chapter 3. Accumulates velocity to accelerate consistent gradient directions.
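As a quick refresher, one momentum update accumulates a velocity and then moves the parameter along it. A minimal sketch on a 1-D quadratic (variable names and constants are my own):

```python
def momentum_step(param, grad, velocity, lr=0.05, momentum=0.9):
    """One SGD-with-momentum update: accumulate velocity, then move."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

# Minimize f(x) = x^2 (gradient 2x), starting from x = 5
x, v = 5.0, 0.0
for _ in range(300):
    x, v = momentum_step(x, 2 * x, v)

print(f"x after 300 steps: {x:.6f}")  # approaches the minimum at 0
```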

4.2 AdaGrad

AdaGrad adapts learning rates for each parameter based on historical gradients:

$$G_t = G_{t-1} + g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$$

Problem: because $G_t$ only accumulates, the effective learning rate monotonically decreases and can become vanishingly small, stalling training.
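A minimal AdaGrad sketch (class name and demo values are my own) makes the shrinking step size visible: since $G_t$ only grows, the effective learning rate $\eta / \sqrt{G_t + \epsilon}$ can only decrease.

```python
import numpy as np

class AdaGrad:
    """Per-parameter adaptive learning rate via accumulated squared gradients."""

    def __init__(self, lr=0.5, epsilon=1e-8):
        self.lr = lr
        self.epsilon = epsilon
        self.G = None  # accumulated squared gradients

    def update(self, param, grad):
        if self.G is None:
            self.G = np.zeros_like(param)
        self.G += grad ** 2  # G_t = G_{t-1} + g_t^2
        return param - self.lr * grad / np.sqrt(self.G + self.epsilon)

# Minimize f(x) = x^2 and watch the effective learning rate decay
opt = AdaGrad(lr=0.5)
x = np.array([5.0])
for t in range(1, 501):
    x = opt.update(x, 2 * x)
    if t in (1, 10, 100, 500):
        eff_lr = opt.lr / np.sqrt(opt.G[0] + opt.epsilon)
        print(f"step {t:3d}: x = {x[0]: .4f}, effective lr = {eff_lr:.5f}")
```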

4.3 RMSprop

RMSprop fixes AdaGrad's diminishing learning rate by using exponential moving average:

$$E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$$
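The change from AdaGrad is a single line: replace the running sum with an exponential moving average, so the denominator can shrink again when gradients get small. A minimal sketch (class and demo are my own, not a reference implementation):

```python
import numpy as np

class RMSprop:
    """AdaGrad variant: EMA of squared gradients instead of a running sum."""

    def __init__(self, lr=0.05, beta=0.9, epsilon=1e-8):
        self.lr = lr
        self.beta = beta
        self.epsilon = epsilon
        self.Eg2 = None  # E[g^2], exponential moving average

    def update(self, param, grad):
        if self.Eg2 is None:
            self.Eg2 = np.zeros_like(param)
        # E[g^2]_t = beta * E[g^2]_{t-1} + (1 - beta) * g_t^2
        self.Eg2 = self.beta * self.Eg2 + (1 - self.beta) * grad ** 2
        return param - self.lr * grad / np.sqrt(self.Eg2 + self.epsilon)

# Minimize f(x) = x^2 from x = 5
opt = RMSprop(lr=0.05)
x = np.array([5.0])
for _ in range(500):
    x = opt.update(x, 2 * x)

print(f"x after 500 steps: {x[0]:.6f}")  # settles near the minimum at 0
```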

4.4 Adam / AdamW

Adam (Adaptive Moment Estimation) combines momentum and RMSprop:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(First moment)}$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(Second moment)}$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad \text{(Bias correction)}$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

import numpy as np

class Adam:
    """
    Adam optimizer implementation
    """

    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = 0   # Time step

    def update(self, params, grads):
        """
        Update parameters using Adam

        Parameters:
        -----------
        params : dict
            Parameter name -> parameter array
        grads : dict
            Parameter name -> gradient array
        """
        self.t += 1

        for name in params:
            if name not in self.m:
                self.m[name] = np.zeros_like(params[name])
                self.v[name] = np.zeros_like(params[name])

            # Update biased first moment estimate
            self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * grads[name]

            # Update biased second raw moment estimate
            self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * (grads[name] ** 2)

            # Compute bias-corrected first moment estimate
            m_hat = self.m[name] / (1 - self.beta1 ** self.t)

            # Compute bias-corrected second raw moment estimate
            v_hat = self.v[name] / (1 - self.beta2 ** self.t)

            # Update parameters
            params[name] -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

# Compare optimizers
def compare_optimizers():
    """
    Compare SGD, Momentum, and Adam on a test function
    """
    def rosenbrock(x, y):
        return (1 - x)**2 + 100 * (y - x**2)**2

    def rosenbrock_grad(x, y):
        dx = -2 * (1 - x) - 400 * x * (y - x**2)
        dy = 200 * (y - x**2)
        return dx, dy

    optimizers = {
        'SGD': {'lr': 0.0001, 'momentum': 0},
        'Momentum': {'lr': 0.0001, 'momentum': 0.9},
        'Adam': {'lr': 0.01}
    }

    results = {}

    for name, config in optimizers.items():
        x, y = -1.0, 1.0
        history = [(x, y, rosenbrock(x, y))]

        if name == 'Adam':
            m_x, m_y = 0, 0
            v_x, v_y = 0, 0
            beta1, beta2 = 0.9, 0.999
            epsilon = 1e-8
            lr = config['lr']

            for t in range(1, 1001):
                dx, dy = rosenbrock_grad(x, y)

                m_x = beta1 * m_x + (1 - beta1) * dx
                m_y = beta1 * m_y + (1 - beta1) * dy

                v_x = beta2 * v_x + (1 - beta2) * dx**2
                v_y = beta2 * v_y + (1 - beta2) * dy**2

                m_x_hat = m_x / (1 - beta1**t)
                m_y_hat = m_y / (1 - beta1**t)
                v_x_hat = v_x / (1 - beta2**t)
                v_y_hat = v_y / (1 - beta2**t)

                x -= lr * m_x_hat / (np.sqrt(v_x_hat) + epsilon)
                y -= lr * m_y_hat / (np.sqrt(v_y_hat) + epsilon)

                history.append((x, y, rosenbrock(x, y)))
        else:
            v_x, v_y = 0, 0
            lr = config['lr']
            momentum = config['momentum']

            for _ in range(1000):
                dx, dy = rosenbrock_grad(x, y)

                v_x = momentum * v_x + dx
                v_y = momentum * v_y + dy

                x -= lr * v_x
                y -= lr * v_y

                history.append((x, y, rosenbrock(x, y)))

        results[name] = history
        print(f"{name}: Final (x, y) = ({x:.4f}, {y:.4f}), loss = {rosenbrock(x, y):.6f}")

    return results

print("Optimizer Comparison on Rosenbrock Function")
print("=" * 50)
print("Minimum at (1, 1) with value 0")
print()
results = compare_optimizers()

Optimizer Selection Guidelines

  - Adam is a robust default: it needs little tuning and handles sparse gradients well
  - SGD with momentum, when carefully tuned with a schedule, often reaches the best final accuracy on vision tasks
  - RMSprop remains a common choice for recurrent networks
  - AdamW (Adam with decoupled weight decay) is the standard choice for Transformers

5. Learning Rate Scheduling

5.1 Step Decay

Reduce learning rate by a factor at specific epochs:

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$$

def step_decay(epoch, initial_lr=0.1, drop_factor=0.5, epochs_drop=10):
    """
    Step decay learning rate schedule
    """
    return initial_lr * (drop_factor ** (epoch // epochs_drop))

# Example
print("Step Decay Schedule:")
for epoch in [0, 10, 20, 30, 40]:
    lr = step_decay(epoch)
    print(f"  Epoch {epoch}: lr = {lr:.4f}")

5.2 Cosine Annealing

Smoothly decrease learning rate following a cosine curve:

$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{t \pi}{T}))$$

import numpy as np

def cosine_annealing(epoch, total_epochs, lr_max=0.1, lr_min=0.0001):
    """
    Cosine annealing learning rate schedule
    """
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * epoch / total_epochs))

# Example
print("\nCosine Annealing Schedule (50 epochs):")
for epoch in [0, 10, 25, 40, 50]:
    lr = cosine_annealing(epoch, total_epochs=50)
    print(f"  Epoch {epoch}: lr = {lr:.6f}")

5.3 Warmup

Warmup gradually increases learning rate at the start of training:

def warmup_schedule(epoch, warmup_epochs=5, base_lr=0.1):
    """
    Linear warmup schedule
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

# Example with warmup + cosine decay
def warmup_cosine(epoch, total_epochs=100, warmup_epochs=10, lr_max=0.1, lr_min=0.0001):
    if epoch < warmup_epochs:
        return lr_max * (epoch + 1) / warmup_epochs
    else:
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * progress))

print("\nWarmup + Cosine Schedule:")
for epoch in [0, 5, 10, 50, 100]:
    lr = warmup_cosine(epoch)
    print(f"  Epoch {epoch}: lr = {lr:.6f}")

6. Early Stopping

Early stopping monitors validation loss and stops training when it stops improving, preventing overfitting.

class EarlyStopping:
    """
    Early stopping to prevent overfitting
    """

    def __init__(self, patience=10, min_delta=0.0001, restore_best=True):
        """
        Parameters:
        -----------
        patience : int
            Number of epochs to wait for improvement
        min_delta : float
            Minimum change to qualify as improvement
        restore_best : bool
            Whether to restore best weights
        """
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best

        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss, model_weights=None):
        """
        Check if training should stop

        Returns:
        --------
        should_stop : bool
            True if training should stop
        """
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            if self.restore_best and model_weights is not None:
                self.best_weights = [w.copy() for w in model_weights]
            return False
        else:
            self.counter += 1
            if self.counter >= self.patience:
                print(f"Early stopping triggered after {self.patience} epochs without improvement")
                return True
            return False

# Simulate training with early stopping
np.random.seed(42)
early_stopping = EarlyStopping(patience=5)

print("Simulated Training with Early Stopping")
print("=" * 50)

for epoch in range(100):
    # Simulate val loss: decreases then increases (overfitting)
    if epoch < 20:
        val_loss = 1.0 - epoch * 0.03 + np.random.normal(0, 0.02)
    else:
        val_loss = 0.4 + (epoch - 20) * 0.02 + np.random.normal(0, 0.02)

    if epoch % 5 == 0:
        print(f"Epoch {epoch}: val_loss = {val_loss:.4f}")

    if early_stopping(val_loss):
        print(f"Stopped at epoch {epoch}")
        print(f"Best validation loss: {early_stopping.best_loss:.4f}")
        break

Exercises

Exercise 1: L1 vs L2 Regularization

Problem: Train two identical networks on the same data, one with L1 and one with L2 regularization:

  1. Compare the weight distributions (histogram)
  2. Count the number of near-zero weights (|w| < 0.01)
  3. Explain the difference in terms of sparsity
Exercise 2: Dropout Rate Experiment

Problem: Train networks with dropout rates of [0, 0.1, 0.3, 0.5, 0.7]:

  1. Plot training and validation accuracy for each
  2. Find the optimal dropout rate
  3. Observe what happens with very high dropout
Exercise 3: Batch Normalization Ablation

Problem: Compare training with and without Batch Normalization:

  1. Plot loss curves for both
  2. Try different learning rates
  3. Measure the sensitivity to initialization
Exercise 4: Optimizer Comparison

Problem: Train the same network with SGD, Momentum, RMSprop, and Adam:

  1. Use the same learning rate initially, then tune each
  2. Plot convergence curves
  3. Measure final test accuracy
Exercise 5: Learning Rate Schedule Design

Problem: Implement and compare:

  1. Constant learning rate
  2. Step decay (halve every 20 epochs)
  3. Cosine annealing
  4. Warmup + cosine

Which achieves the best final accuracy?

Exercise 6: Early Stopping Analysis

Problem: Train a network prone to overfitting:

  1. Plot train/val loss without early stopping
  2. Implement early stopping with patience=5, 10, 20
  3. Compare final test accuracy for each patience value

Summary

In this chapter, we learned techniques to improve model training:

  - Regularization: L1 induces sparsity, L2 (weight decay) shrinks weights, and dropout forces redundant representations
  - Batch Normalization stabilizes training by normalizing layer inputs, using running statistics at inference
  - Adaptive optimizers (AdaGrad, RMSprop, Adam) tune per-parameter learning rates automatically
  - Learning rate schedules (step decay, cosine annealing, warmup) and early stopping round out the training toolbox

Next Chapter Preview: In Chapter 5, we will apply everything we've learned to practical projects, including MNIST classification, data preprocessing pipelines, model evaluation, and hyperparameter tuning.
