## Learning Objectives
- Understand overfitting and the bias-variance tradeoff
- Master L1 and L2 regularization techniques
- Implement and apply Dropout correctly
- Understand Batch Normalization and its variants
- Know when to use different optimization algorithms (Adam, RMSprop)
- Apply learning rate scheduling and early stopping
## 1. Understanding Overfitting

### 1.1 Training Error vs Generalization Error
Overfitting occurs when a model learns the training data too well, including its noise and peculiarities, resulting in poor performance on new, unseen data.
| Scenario | Training Error | Validation Error | Status |
|---|---|---|---|
| Underfitting | High | High | Model too simple |
| Good Fit | Low | Low | Optimal |
| Overfitting | Very Low | High | Model memorizing |
### 1.2 Bias-Variance Tradeoff
The bias-variance tradeoff describes the relationship between model complexity and error:
$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$
- Bias: Error from simplifying assumptions (underfitting)
- Variance: Error from sensitivity to training data (overfitting)
- Irreducible Error: Noise inherent in the problem
```mermaid
graph LR
    B{Model Complexity} -->|Too Simple| C[High Bias<br/>Underfitting]
    B -->|Just Right| D[Optimal<br/>Good Generalization]
    B -->|Too Complex| E[High Variance<br/>Overfitting]
    style D fill:#e8f5e9
    style C fill:#fff3e0
    style E fill:#ffebee
```
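To make the tradeoff concrete, the following sketch fits polynomials of increasing degree to noisy samples of a smooth function (the sine target, noise level, and degrees here are arbitrary illustrative choices, not from any particular experiment). Training error falls monotonically with model capacity, while held-out error follows the bias-variance U-curve:

```python
import numpy as np

np.random.seed(0)

def f(x):
    """Underlying smooth function the data is sampled from."""
    return np.sin(2 * np.pi * x)

# Small noisy training set, larger held-out validation set
x_train = np.random.rand(20)
y_train = f(x_train) + np.random.normal(0, 0.2, size=20)
x_val = np.random.rand(200)
y_val = f(x_val) + np.random.normal(0, 0.2, size=200)

errors = {}
for degree in [1, 3, 9]:
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    errors[degree] = (mse_train, mse_val)
    print(f"degree {degree}: train MSE = {mse_train:.4f}, val MSE = {mse_val:.4f}")
```

The degree-1 fit underfits (high error everywhere); higher degrees always reduce training error, but past the sweet spot the validation error stops improving — the gap between the two is the overfitting signal from the table above.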
## 2. Regularization Techniques

### 2.1 L1 Regularization (Lasso)
L1 regularization adds the sum of absolute weights to the loss function:
$$\mathcal{L}_{L1} = \mathcal{L}_{original} + \lambda \sum_{i} |w_i|$$
Characteristics:
- Promotes sparsity (drives some weights to exactly zero)
- Acts as feature selection
- Useful when many features are irrelevant
```python
import numpy as np

def l1_regularization_loss(weights, lambda_l1=0.01):
    """
    Compute L1 regularization term

    Parameters:
    -----------
    weights : list of ndarray
        Network weights
    lambda_l1 : float
        Regularization strength

    Returns:
    --------
    l1_loss : float
        L1 regularization term
    """
    l1_loss = 0
    for W in weights:
        l1_loss += np.sum(np.abs(W))
    return lambda_l1 * l1_loss

def l1_gradient(W, lambda_l1=0.01):
    """
    Gradient of L1 regularization
    """
    return lambda_l1 * np.sign(W)

# Example
np.random.seed(42)
W = np.random.randn(10, 5)
l1_loss = l1_regularization_loss([W], lambda_l1=0.01)
l1_grad = l1_gradient(W, lambda_l1=0.01)
print(f"L1 Loss: {l1_loss:.4f}")
print(f"L1 Gradient mean: {l1_grad.mean():.4f}")
print(f"Non-zero gradient elements: {np.sum(l1_grad != 0)}")
```
### 2.2 L2 Regularization (Ridge / Weight Decay)
L2 regularization adds the sum of squared weights to the loss:
$$\mathcal{L}_{L2} = \mathcal{L}_{original} + \frac{\lambda}{2} \sum_{i} w_i^2$$
Characteristics:
- Penalizes large weights more heavily
- Encourages smaller, distributed weights
- Most commonly used regularization in deep learning
```python
import numpy as np

def l2_regularization_loss(weights, lambda_l2=0.01):
    """
    Compute L2 regularization term

    Parameters:
    -----------
    weights : list of ndarray
        Network weights
    lambda_l2 : float
        Regularization strength (weight decay)

    Returns:
    --------
    l2_loss : float
        L2 regularization term
    """
    l2_loss = 0
    for W in weights:
        l2_loss += np.sum(W ** 2)
    return 0.5 * lambda_l2 * l2_loss

def l2_gradient(W, lambda_l2=0.01):
    """
    Gradient of L2 regularization
    """
    return lambda_l2 * W

# Example: Compare weight distributions with/without L2
np.random.seed(42)

# Simulate training effect
W_original = np.random.randn(100, 50) * 2  # Large weights

# With L2, weights would be smaller
W_l2 = W_original * 0.5  # Simulated effect

print("Weight statistics:")
print(f"  Original: mean={W_original.mean():.4f}, std={W_original.std():.4f}")
print(f"  With L2:  mean={W_l2.mean():.4f}, std={W_l2.std():.4f}")
print(f"\nL2 loss (original): {l2_regularization_loss([W_original]):.4f}")
print(f"L2 loss (with L2): {l2_regularization_loss([W_l2]):.4f}")
```
### 2.3 Dropout
Dropout randomly sets a fraction of neurons to zero during training, forcing the network to learn redundant representations.
During training:
$$\tilde{h} = h \odot m, \quad m_i \sim \text{Bernoulli}(1-p)$$
During inference:
$$h_{test} = h \cdot (1-p)$$
Or equivalently, scale during training (inverted dropout).
```python
import numpy as np

class Dropout:
    """
    Dropout layer implementation
    """
    def __init__(self, drop_rate=0.5):
        """
        Parameters:
        -----------
        drop_rate : float
            Probability of dropping a neuron (0 to 1)
        """
        self.drop_rate = drop_rate
        self.mask = None
        self.training = True

    def forward(self, X):
        """
        Forward pass with dropout

        Parameters:
        -----------
        X : ndarray
            Input activations

        Returns:
        --------
        output : ndarray
            Dropped out activations
        """
        if self.training and self.drop_rate > 0:
            # Create binary mask
            self.mask = np.random.binomial(1, 1 - self.drop_rate, X.shape)
            # Apply mask and scale (inverted dropout)
            return X * self.mask / (1 - self.drop_rate)
        else:
            return X

    def backward(self, dout):
        """
        Backward pass
        """
        if self.training and self.drop_rate > 0:
            return dout * self.mask / (1 - self.drop_rate)
        else:
            return dout

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

# Example usage
np.random.seed(42)
dropout = Dropout(drop_rate=0.5)
X = np.ones((3, 5))  # All ones for clarity

print("Dropout Example (drop_rate=0.5)")
print("=" * 40)
print(f"Input:\n{X}")

dropout.train()
output_train = dropout.forward(X)
print(f"\nTraining mode output:\n{output_train}")
print(f"Active neurons: {np.sum(output_train > 0)}/{X.size}")

dropout.eval()
output_eval = dropout.forward(X)
print(f"\nEval mode output:\n{output_eval}")
```
Important: Always switch to evaluation mode during inference (e.g. `model.eval()` in PyTorch):
- Training: Dropout active, Batch Norm uses batch statistics
- Inference: Dropout disabled, Batch Norm uses running statistics
## 3. Batch Normalization

### 3.1 Internal Covariate Shift
Internal covariate shift refers to the change in the distribution of layer inputs during training. Batch Normalization addresses this by normalizing layer inputs.
### 3.2 How Batch Normalization Works
For a mini-batch of activations, Batch Norm performs:
1. Compute mean and variance:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$$
2. Normalize:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
3. Scale and shift (learnable parameters):
$$y_i = \gamma \hat{x}_i + \beta$$
```python
import numpy as np

class BatchNorm:
    """
    Batch Normalization layer
    """
    def __init__(self, n_features, momentum=0.9, epsilon=1e-5):
        self.gamma = np.ones((1, n_features))
        self.beta = np.zeros((1, n_features))
        # Running statistics for inference
        self.running_mean = np.zeros((1, n_features))
        self.running_var = np.ones((1, n_features))
        self.momentum = momentum
        self.epsilon = epsilon
        self.training = True
        # Cache for backprop
        self.cache = {}

    def forward(self, X):
        if self.training:
            # Compute batch statistics
            mean = np.mean(X, axis=0, keepdims=True)
            var = np.var(X, axis=0, keepdims=True)
            # Update running statistics
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
            # Normalize
            X_norm = (X - mean) / np.sqrt(var + self.epsilon)
            # Cache for backprop
            self.cache = {'X': X, 'mean': mean, 'var': var, 'X_norm': X_norm}
        else:
            # Use running statistics
            X_norm = (X - self.running_mean) / np.sqrt(self.running_var + self.epsilon)
        # Scale and shift
        return self.gamma * X_norm + self.beta

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

# Example
np.random.seed(42)
bn = BatchNorm(n_features=4)

# Simulated activations (batch_size=8, features=4)
X = np.random.randn(8, 4) * 5 + 10  # Mean ~10, std ~5

print("Batch Normalization Example")
print("=" * 40)
print(f"Input stats: mean={X.mean():.2f}, std={X.std():.2f}")

bn.train()
output = bn.forward(X)
print(f"Output stats: mean={output.mean():.4f}, std={output.std():.4f}")
```
### 3.3 Layer Normalization Comparison
| Aspect | Batch Normalization | Layer Normalization |
|---|---|---|
| Normalizes over | Batch dimension | Feature dimension |
| Batch size dependency | Yes (needs large batches) | No |
| Best for | CNNs, feedforward | RNNs, Transformers |
| Inference behavior | Uses running stats | Same as training |
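To see the difference in code, here is a minimal Layer Normalization sketch (an illustrative NumPy function, not a drop-in layer): each sample is normalized over its own features, so no batch statistics or running averages are needed, and training and inference behave identically.

```python
import numpy as np

def layer_norm(X, gamma, beta, epsilon=1e-5):
    """Normalize each sample over its feature dimension (axis=1)."""
    mean = X.mean(axis=1, keepdims=True)   # per-sample mean
    var = X.var(axis=1, keepdims=True)     # per-sample variance
    X_norm = (X - mean) / np.sqrt(var + epsilon)
    return gamma * X_norm + beta           # learnable scale and shift

np.random.seed(0)
X = np.random.randn(4, 8) * 3 + 5  # batch of 4 samples, 8 features

out = layer_norm(X, gamma=np.ones((1, 8)), beta=np.zeros((1, 8)))
print("Per-sample means:", np.round(out.mean(axis=1), 6))
print("Per-sample stds: ", np.round(out.std(axis=1), 4))
```

Because the statistics come from a single sample, the result is the same for a batch of one — which is why Layer Norm has no batch-size dependency, unlike the `BatchNorm` class above.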
## 4. Optimization Algorithms

### 4.1 SGD with Momentum
Covered in Chapter 3. Accumulates velocity to accelerate consistent gradient directions.
### 4.2 AdaGrad
AdaGrad adapts learning rates for each parameter based on historical gradients:
$$G_t = G_{t-1} + g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$$
Problem: because the accumulated sum $G_t$ only ever grows, the effective learning rate monotonically decreases and can become vanishingly small before training finishes.
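The two update equations translate directly into code. Below is an illustrative single-vector sketch (`adagrad_update` and the quadratic test function are ad-hoc choices for this demo, not a library API); note how the accumulated $G_t$ shrinks the effective step size over time:

```python
import numpy as np

def adagrad_update(theta, grad, G, lr=1.0, epsilon=1e-8):
    """One AdaGrad step: accumulate squared gradients, scale each step by 1/sqrt(G)."""
    G = G + grad ** 2
    theta = theta - lr * grad / (np.sqrt(G) + epsilon)
    return theta, G

# Minimize f(theta) = theta^2, whose gradient is 2*theta
theta = np.array([5.0])
G = np.zeros_like(theta)
for step in range(100):
    grad = 2 * theta
    theta, G = adagrad_update(theta, grad, G, lr=1.0)

print(f"theta after 100 steps: {theta[0]:.6f}")
print(f"accumulated G: {G[0]:.2f} (effective lr = {1.0 / np.sqrt(G[0]):.4f})")
```

On this simple convex problem the shrinking step size is harmless, but in long deep-learning runs the same mechanism can stall progress — the motivation for RMSprop below.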
### 4.3 RMSprop
RMSprop fixes AdaGrad's diminishing learning rate by replacing the running sum with an exponential moving average of squared gradients:
$$E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$$
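The change from AdaGrad is a single line: the squared-gradient accumulator decays instead of growing without bound. A sketch under the same illustrative setup as the AdaGrad demo (`rmsprop_update` is an ad-hoc helper name):

```python
import numpy as np

def rmsprop_update(theta, grad, Eg2, lr=0.01, beta=0.9, epsilon=1e-8):
    """One RMSprop step: EMA of squared gradients instead of a running sum."""
    Eg2 = beta * Eg2 + (1 - beta) * grad ** 2   # decaying average, never saturates
    theta = theta - lr * grad / (np.sqrt(Eg2) + epsilon)
    return theta, Eg2

# Minimize f(theta) = theta^2, gradient 2*theta
theta = np.array([5.0])
Eg2 = np.zeros_like(theta)
for step in range(200):
    grad = 2 * theta
    theta, Eg2 = rmsprop_update(theta, grad, Eg2, lr=0.1)

print(f"theta after 200 steps: {theta[0]:.4f}")
```

Because the average forgets old gradients, the effective step size tracks the recent gradient scale rather than decaying to zero, which also suits non-stationary objectives.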
### 4.4 Adam / AdamW
Adam (Adaptive Moment Estimation) combines momentum and RMSprop:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(First moment)}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(Second moment)}$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad \text{(Bias correction)}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
```python
import numpy as np

class Adam:
    """
    Adam optimizer implementation
    """
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = 0   # Time step

    def update(self, params, grads):
        """
        Update parameters using Adam

        Parameters:
        -----------
        params : dict
            Parameter name -> parameter array
        grads : dict
            Parameter name -> gradient array
        """
        self.t += 1
        for name in params:
            if name not in self.m:
                self.m[name] = np.zeros_like(params[name])
                self.v[name] = np.zeros_like(params[name])
            # Update biased first moment estimate
            self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * grads[name]
            # Update biased second raw moment estimate
            self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * (grads[name] ** 2)
            # Compute bias-corrected first moment estimate
            m_hat = self.m[name] / (1 - self.beta1 ** self.t)
            # Compute bias-corrected second raw moment estimate
            v_hat = self.v[name] / (1 - self.beta2 ** self.t)
            # Update parameters
            params[name] -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

# Compare optimizers
def compare_optimizers():
    """
    Compare SGD, Momentum, and Adam on a test function
    """
    def rosenbrock(x, y):
        return (1 - x)**2 + 100 * (y - x**2)**2

    def rosenbrock_grad(x, y):
        dx = -2 * (1 - x) - 400 * x * (y - x**2)
        dy = 200 * (y - x**2)
        return dx, dy

    optimizers = {
        'SGD': {'lr': 0.0001, 'momentum': 0},
        'Momentum': {'lr': 0.0001, 'momentum': 0.9},
        'Adam': {'lr': 0.01}
    }

    results = {}
    for name, config in optimizers.items():
        x, y = -1.0, 1.0
        history = [(x, y, rosenbrock(x, y))]

        if name == 'Adam':
            m_x, m_y = 0, 0
            v_x, v_y = 0, 0
            beta1, beta2 = 0.9, 0.999
            epsilon = 1e-8
            lr = config['lr']
            for t in range(1, 1001):
                dx, dy = rosenbrock_grad(x, y)
                m_x = beta1 * m_x + (1 - beta1) * dx
                m_y = beta1 * m_y + (1 - beta1) * dy
                v_x = beta2 * v_x + (1 - beta2) * dx**2
                v_y = beta2 * v_y + (1 - beta2) * dy**2
                m_x_hat = m_x / (1 - beta1**t)
                m_y_hat = m_y / (1 - beta1**t)
                v_x_hat = v_x / (1 - beta2**t)
                v_y_hat = v_y / (1 - beta2**t)
                x -= lr * m_x_hat / (np.sqrt(v_x_hat) + epsilon)
                y -= lr * m_y_hat / (np.sqrt(v_y_hat) + epsilon)
                history.append((x, y, rosenbrock(x, y)))
        else:
            v_x, v_y = 0, 0
            lr = config['lr']
            momentum = config['momentum']
            for _ in range(1000):
                dx, dy = rosenbrock_grad(x, y)
                v_x = momentum * v_x + dx
                v_y = momentum * v_y + dy
                x -= lr * v_x
                y -= lr * v_y
                history.append((x, y, rosenbrock(x, y)))

        results[name] = history
        print(f"{name}: Final (x, y) = ({x:.4f}, {y:.4f}), loss = {rosenbrock(x, y):.6f}")

    return results

print("Optimizer Comparison on Rosenbrock Function")
print("=" * 50)
print("Minimum at (1, 1) with value 0")
print()
results = compare_optimizers()
```
Optimizer Selection Guidelines
- Adam: Good default, works well for most problems
- SGD + Momentum: Often better final performance with tuning
- AdamW: Adam with proper weight decay, recommended for transformers
- RMSprop: Good for RNNs and non-stationary objectives
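AdamW's difference from Adam-with-L2 is where the weight decay enters: with L2 added to the loss, the decay gradient passes through the adaptive denominator; in AdamW it multiplies the weights directly (decoupled weight decay). A hedged sketch of the modified update (`adamw_update` is an ad-hoc helper; driving it with a zero gradient simply isolates the decay term):

```python
import numpy as np

def adamw_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
                 epsilon=1e-8, weight_decay=0.01):
    """One AdamW step: the standard Adam update plus decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decay acts on the weights directly, outside the adaptive scaling
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + epsilon) + weight_decay * theta)
    return theta, m, v

np.random.seed(0)
theta = np.random.randn(5)
initial_norm = np.linalg.norm(theta)
m, v = np.zeros_like(theta), np.zeros_like(theta)

for t in range(1, 101):
    grad = np.zeros_like(theta)  # zero gradient: only the decay term acts
    theta, m, v = adamw_update(theta, grad, m, v, t, weight_decay=0.1)

print(f"||theta||: {initial_norm:.4f} -> {np.linalg.norm(theta):.4f}")
```

With a zero gradient, each step multiplies the weights by $(1 - \eta \lambda)$ regardless of gradient history — exactly the behavior L2-through-Adam does not have, since its decay term would be rescaled per parameter by $\sqrt{\hat{v}_t}$.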
## 5. Learning Rate Scheduling

### 5.1 Step Decay
Reduce learning rate by a factor at specific epochs:
$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$$
```python
def step_decay(epoch, initial_lr=0.1, drop_factor=0.5, epochs_drop=10):
    """
    Step decay learning rate schedule
    """
    return initial_lr * (drop_factor ** (epoch // epochs_drop))

# Example
print("Step Decay Schedule:")
for epoch in [0, 10, 20, 30, 40]:
    lr = step_decay(epoch)
    print(f"  Epoch {epoch}: lr = {lr:.4f}")
```
### 5.2 Cosine Annealing
Smoothly decrease learning rate following a cosine curve:
$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{t \pi}{T}))$$
```python
import numpy as np

def cosine_annealing(epoch, total_epochs, lr_max=0.1, lr_min=0.0001):
    """
    Cosine annealing learning rate schedule
    """
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * epoch / total_epochs))

# Example
print("\nCosine Annealing Schedule (50 epochs):")
for epoch in [0, 10, 25, 40, 50]:
    lr = cosine_annealing(epoch, total_epochs=50)
    print(f"  Epoch {epoch}: lr = {lr:.6f}")
```
### 5.3 Warmup
Warmup gradually increases learning rate at the start of training:
```python
import numpy as np

def warmup_schedule(epoch, warmup_epochs=5, base_lr=0.1):
    """
    Linear warmup schedule
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

# Example with warmup + cosine decay
def warmup_cosine(epoch, total_epochs=100, warmup_epochs=10, lr_max=0.1, lr_min=0.0001):
    if epoch < warmup_epochs:
        return lr_max * (epoch + 1) / warmup_epochs
    else:
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * progress))

print("\nWarmup + Cosine Schedule:")
for epoch in [0, 5, 10, 50, 100]:
    lr = warmup_cosine(epoch)
    print(f"  Epoch {epoch}: lr = {lr:.6f}")
```
## 6. Early Stopping
Early stopping monitors validation loss and stops training when it stops improving, preventing overfitting.
```python
import numpy as np

class EarlyStopping:
    """
    Early stopping to prevent overfitting
    """
    def __init__(self, patience=10, min_delta=0.0001, restore_best=True):
        """
        Parameters:
        -----------
        patience : int
            Number of epochs to wait for improvement
        min_delta : float
            Minimum change to qualify as improvement
        restore_best : bool
            Whether to restore best weights
        """
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss, model_weights=None):
        """
        Check if training should stop

        Returns:
        --------
        should_stop : bool
            True if training should stop
        """
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            if self.restore_best and model_weights is not None:
                self.best_weights = [w.copy() for w in model_weights]
            return False
        else:
            self.counter += 1
            if self.counter >= self.patience:
                print(f"Early stopping triggered after {self.patience} epochs without improvement")
                return True
            return False

# Simulate training with early stopping
np.random.seed(42)
early_stopping = EarlyStopping(patience=5)

print("Simulated Training with Early Stopping")
print("=" * 50)
for epoch in range(100):
    # Simulate val loss: decreases then increases (overfitting)
    if epoch < 20:
        val_loss = 1.0 - epoch * 0.03 + np.random.normal(0, 0.02)
    else:
        val_loss = 0.4 + (epoch - 20) * 0.02 + np.random.normal(0, 0.02)

    if epoch % 5 == 0:
        print(f"Epoch {epoch}: val_loss = {val_loss:.4f}")

    if early_stopping(val_loss):
        print(f"Stopped at epoch {epoch}")
        print(f"Best validation loss: {early_stopping.best_loss:.4f}")
        break
```
## Exercises

### Exercise 1: L1 vs L2 Regularization
Problem: Train two identical networks on the same data, one with L1 and one with L2 regularization:
- Compare the weight distributions (histogram)
- Count the number of near-zero weights (|w| < 0.01)
- Explain the difference in terms of sparsity
### Exercise 2: Dropout Rate Experiment
Problem: Train networks with dropout rates of [0, 0.1, 0.3, 0.5, 0.7]:
- Plot training and validation accuracy for each
- Find the optimal dropout rate
- Observe what happens with very high dropout
### Exercise 3: Batch Normalization Ablation
Problem: Compare training with and without Batch Normalization:
- Plot loss curves for both
- Try different learning rates
- Measure the sensitivity to initialization
### Exercise 4: Optimizer Comparison
Problem: Train the same network with SGD, Momentum, RMSprop, and Adam:
- Use the same learning rate initially, then tune each
- Plot convergence curves
- Measure final test accuracy
### Exercise 5: Learning Rate Schedule Design
Problem: Implement and compare:
- Constant learning rate
- Step decay (halve every 20 epochs)
- Cosine annealing
- Warmup + cosine
Which achieves the best final accuracy?
### Exercise 6: Early Stopping Analysis
Problem: Train a network prone to overfitting:
- Plot train/val loss without early stopping
- Implement early stopping with patience=5, 10, 20
- Compare final test accuracy for each patience value
## Summary
In this chapter, we learned techniques to improve model training:
- Overfitting: When models memorize training data; detected by gap between train/val loss
- L1/L2 Regularization: Penalize large weights; L1 promotes sparsity, L2 is most common
- Dropout: Randomly disable neurons during training for redundant representations
- Batch Normalization: Normalize layer inputs for stable, faster training
- Adam: Adaptive optimizer combining momentum and per-parameter learning rates
- Learning Rate Scheduling: Adjust learning rate during training (warmup, cosine decay)
- Early Stopping: Stop training when validation loss stops improving
Next Chapter Preview: In Chapter 5, we will apply everything we've learned to practical projects, including MNIST classification, data preprocessing pipelines, model evaluation, and hyperparameter tuning.