Chapter 2: How Neural Networks Work

Understanding Forward Propagation, Weight Matrices, and Loss Functions

Reading Time: 35-40 minutes | Code Examples: 10 | Exercises: 6 | Difficulty: Beginner-Intermediate
In this chapter, we will learn how neural networks process data from input to output. We will understand forward propagation calculations, the role of weight matrices and bias vectors, computational graphs, and loss functions used to measure prediction errors.

Learning Objectives

By the end of this chapter, you will be able to:

  1. Trace how data flows from input to output during forward propagation
  2. Explain the roles of weight matrices and bias vectors, and why initialization matters
  3. Read a computational graph and apply the chain rule to it
  4. Choose an appropriate loss function for regression and classification tasks

1. Forward Propagation

1.1 Data Flow from Input to Output

Forward propagation is the process of computing the output of a neural network given an input. Data flows through the network layer by layer, with each layer transforming the data until the final output is produced.

```mermaid
graph LR
    subgraph Input Layer
        I1[x1]
        I2[x2]
        I3[x3]
    end
    subgraph Hidden Layer
        H1[h1]
        H2[h2]
    end
    subgraph Output Layer
        O1[y1]
        O2[y2]
    end
    I1 --> H1
    I1 --> H2
    I2 --> H1
    I2 --> H2
    I3 --> H1
    I3 --> H2
    H1 --> O1
    H1 --> O2
    H2 --> O1
    H2 --> O2
    style I1 fill:#e3f2fd
    style I2 fill:#e3f2fd
    style I3 fill:#e3f2fd
    style H1 fill:#fff3e0
    style H2 fill:#fff3e0
    style O1 fill:#e8f5e9
    style O2 fill:#e8f5e9
```

For each layer, the computation follows these steps:

  1. Linear transformation: Multiply input by weights and add bias
  2. Activation: Apply non-linear activation function

The mathematical expression for a single layer:

$$\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$$

$$\mathbf{a} = f(\mathbf{z})$$

Where:

- $\mathbf{x}$ is the input vector
- $\mathbf{W}$ is the weight matrix
- $\mathbf{b}$ is the bias vector
- $\mathbf{z}$ is the pre-activation (the linear output)
- $f$ is the activation function and $\mathbf{a}$ is the layer's activation output

1.2 Efficient Computation with Matrix Operations

Instead of computing neuron outputs one by one, we use matrix operations to process all neurons in a layer simultaneously. This is not only more concise but also significantly faster due to optimized linear algebra libraries.

Why Matrix Operations? Computing each neuron output in a Python loop is slow: every iteration pays interpreter overhead. A single matrix multiplication replaces the nested loops over neurons and samples, and NumPy dispatches it to optimized BLAS routines that exploit vectorized CPU instructions and cache-friendly memory access. The same code then works unchanged for any batch size.
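To see the difference concretely, the sketch below (with arbitrary layer sizes) computes the same pre-activations once with explicit loops and once with a single matrix multiplication, and times both:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 512))   # batch of 256 samples, 512 features
W = rng.standard_normal((512, 128))   # 512 inputs -> 128 outputs
b = np.zeros(128)

# Naive version: one neuron at a time, one sample at a time
start = time.perf_counter()
Z_loop = np.empty((256, 128))
for i in range(256):
    for j in range(128):
        Z_loop[i, j] = X[i] @ W[:, j] + b[j]
t_loop = time.perf_counter() - start

# Vectorized version: one matrix multiplication for the whole batch
start = time.perf_counter()
Z_mat = X @ W + b
t_mat = time.perf_counter() - start

print(f"loop: {t_loop * 1e3:.1f} ms, matmul: {t_mat * 1e3:.3f} ms")
print("results match:", np.allclose(Z_loop, Z_mat))
```

On typical hardware the vectorized version is orders of magnitude faster; the exact ratio depends on the BLAS backend.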

Example: Forward pass for a single layer

import numpy as np

def forward_layer(X, W, b, activation_fn):
    """
    Forward pass for a single layer

    Parameters:
    -----------
    X : ndarray, shape (batch_size, n_input)
        Input data
    W : ndarray, shape (n_input, n_output)
        Weight matrix
    b : ndarray, shape (1, n_output)
        Bias vector
    activation_fn : function
        Activation function

    Returns:
    --------
    A : ndarray, shape (batch_size, n_output)
        Activation output
    Z : ndarray, shape (batch_size, n_output)
        Pre-activation (for backpropagation)
    """
    # Linear transformation: Z = X @ W + b
    Z = np.dot(X, W) + b

    # Apply activation function
    A = activation_fn(Z)

    return A, Z

# Example usage
np.random.seed(42)

# Input: 4 samples, 3 features
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0]
])

# Weights: 3 inputs -> 2 outputs
W = np.random.randn(3, 2) * 0.01
b = np.zeros((1, 2))

# ReLU activation
relu = lambda x: np.maximum(0, x)

A, Z = forward_layer(X, W, b, relu)
print("Input shape:", X.shape)
print("Output shape:", A.shape)
print("Pre-activation:\n", Z)
print("Activation:\n", A)

1.3 Batch Processing

In practice, we process multiple samples (a batch) at once. This is achieved by stacking samples as rows in a matrix:

$$\mathbf{Z} = \mathbf{X}\mathbf{W} + \mathbf{b}$$

Where:

- $\mathbf{X}$ is the input matrix of shape (batch_size, n_input), one sample per row
- $\mathbf{W}$ is the weight matrix of shape (n_input, n_output)
- $\mathbf{b}$ is the bias row vector of shape (1, n_output), broadcast across the batch

| Batch Size | Advantages | Disadvantages |
|---|---|---|
| 1 (SGD) | Frequent updates, can escape local minima | Noisy gradients, slow (no parallelization) |
| 32-256 | Good balance of speed and gradient quality | May need to tune learning rate |
| Full batch | Accurate gradients | Memory intensive, can get stuck in local minima |
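In $\mathbf{Z} = \mathbf{X}\mathbf{W} + \mathbf{b}$, the product $\mathbf{X}\mathbf{W}$ has shape (batch_size, n_output) while $\mathbf{b}$ has shape (1, n_output); NumPy broadcasting adds the bias row to every sample in the batch. A small check with made-up values:

```python
import numpy as np

X = np.arange(6.0).reshape(3, 2)            # batch of 3 samples, 2 features
W = np.ones((2, 4))                         # 2 inputs -> 4 outputs
b = np.array([[10.0, 20.0, 30.0, 40.0]])    # shape (1, 4)

Z = X @ W + b                 # b is broadcast over the 3 rows
print(Z.shape)                # (3, 4)

# Broadcasting is equivalent to explicitly tiling the bias per sample:
Z_tiled = X @ W + np.repeat(b, 3, axis=0)
print(np.allclose(Z, Z_tiled))  # True
```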

2. Weight Matrices and Bias Vectors

2.1 The Role of Weights

The weight matrix $\mathbf{W}$ determines how strongly each input feature influences each output neuron. Each element $W_{ij}$ represents the connection strength from input $i$ to output $j$.

```mermaid
graph LR
    subgraph Inputs
        x1[x1]
        x2[x2]
    end
    subgraph Outputs
        h1[h1]
        h2[h2]
        h3[h3]
    end
    x1 -->|w11| h1
    x1 -->|w12| h2
    x1 -->|w13| h3
    x2 -->|w21| h1
    x2 -->|w22| h2
    x2 -->|w23| h3
    style x1 fill:#e3f2fd
    style x2 fill:#e3f2fd
    style h1 fill:#fff3e0
    style h2 fill:#fff3e0
    style h3 fill:#fff3e0
```

For 2 inputs and 3 outputs, the weight matrix is:

$$\mathbf{W} = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix}$$
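As a quick shape check (with made-up weight values), multiplying a 2-feature input by this 2×3 matrix yields the 3 hidden pre-activations:

```python
import numpy as np

x = np.array([[1.0, 2.0]])            # 1 sample, 2 features
W = np.array([[0.1, 0.2, 0.3],        # w11, w12, w13
              [0.4, 0.5, 0.6]])       # w21, w22, w23

h = x @ W                             # shape (1, 3)
print(h)                              # [[0.9 1.2 1.5]]
# Each h[0, j] = w1j * x1 + w2j * x2, e.g. h1 = 0.1*1 + 0.4*2 = 0.9
```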

2.2 Weight Initialization

Proper weight initialization is crucial for effective training. Poor initialization can lead to vanishing or exploding gradients.

All random methods below draw $W \sim \mathcal{N}(0, \sigma^2)$; the table gives the standard deviation $\sigma$:

| Method | Standard Deviation $\sigma$ | Best For |
|---|---|---|
| Zero | $W = 0$ | Never use (symmetry problem) |
| Random Small | $\sigma = 0.01$ | Simple networks |
| Xavier/Glorot | $\sigma = \sqrt{\frac{2}{n_{in} + n_{out}}}$ | Sigmoid, tanh |
| He | $\sigma = \sqrt{\frac{2}{n_{in}}}$ | ReLU |
import numpy as np

def initialize_weights(n_in, n_out, method='he'):
    """
    Initialize weight matrix using different methods

    Parameters:
    -----------
    n_in : int
        Number of input neurons
    n_out : int
        Number of output neurons
    method : str
        Initialization method ('zeros', 'random', 'xavier', 'he')

    Returns:
    --------
    W : ndarray
        Initialized weight matrix
    """
    if method == 'zeros':
        # Bad: all neurons learn the same thing
        return np.zeros((n_in, n_out))

    elif method == 'random':
        # Small random values
        return np.random.randn(n_in, n_out) * 0.01

    elif method == 'xavier':
        # Good for sigmoid/tanh
        std = np.sqrt(2.0 / (n_in + n_out))
        return np.random.randn(n_in, n_out) * std

    elif method == 'he':
        # Good for ReLU
        std = np.sqrt(2.0 / n_in)
        return np.random.randn(n_in, n_out) * std

    else:
        raise ValueError(f"Unknown method: {method}")

# Compare initialization methods
np.random.seed(42)
n_in, n_out = 784, 256

for method in ['zeros', 'random', 'xavier', 'he']:
    W = initialize_weights(n_in, n_out, method)
    print(f"{method:8s}: mean={W.mean():.6f}, std={W.std():.6f}")

2.3 The Importance of Bias

The bias vector $\mathbf{b}$ allows the activation function to shift horizontally. Without bias, the pre-activation $z$ would be zero when inputs are zero, limiting the network's expressiveness.

Geometric Interpretation

- The weights $\mathbf{W}$ control the slope and orientation of the decision boundary.
- The bias $\mathbf{b}$ translates the boundary away from the origin.

Together, they allow the network to fit any linear function passing through any point in space.

import numpy as np
import matplotlib.pyplot as plt

# Demonstrate the effect of bias
x = np.linspace(-5, 5, 100)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Different bias values
biases = [-2, 0, 2]
w = 1  # Fixed weight

plt.figure(figsize=(10, 4))
for b in biases:
    z = w * x + b
    y = sigmoid(z)
    plt.plot(x, y, label=f'bias = {b}')

plt.xlabel('Input x')
plt.ylabel('Output sigmoid(wx + b)')
plt.title('Effect of Bias on Sigmoid Activation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.show()

3. Layer Connections and Computational Graphs

3.1 The Concept of Computational Graphs

A computational graph is a directed graph that represents the sequence of operations in a neural network. Each node represents an operation (addition, multiplication, activation function), and edges represent data flow.

```mermaid
graph LR
    x1((x1)) --> mul1[x]
    w1((w1)) --> mul1
    mul1 --> add1[+]
    x2((x2)) --> mul2[x]
    w2((w2)) --> mul2
    mul2 --> add1
    b((b)) --> add1
    add1 --> z((z))
    z --> relu[ReLU]
    relu --> a((a))
    style x1 fill:#e3f2fd
    style x2 fill:#e3f2fd
    style w1 fill:#fff3e0
    style w2 fill:#fff3e0
    style b fill:#fff3e0
    style z fill:#f3e5f5
    style a fill:#e8f5e9
```

The computational graph for a single neuron with 2 inputs:

$$z = w_1 x_1 + w_2 x_2 + b$$

$$a = \text{ReLU}(z)$$

3.2 Foundations of Automatic Differentiation

The key benefit of computational graphs is enabling automatic differentiation. By storing intermediate values and applying the chain rule, we can efficiently compute gradients for all parameters.

Chain Rule: If $y = f(g(x))$, then:

$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$

This allows us to compute gradients layer by layer from output to input.
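As a small numeric illustration (values chosen arbitrarily), consider $a = \text{ReLU}(wx + b)$. When $z = wx + b > 0$, the chain rule gives $\frac{da}{dw} = \frac{da}{dz} \cdot \frac{dz}{dw} = 1 \cdot x$, which we can verify against a finite-difference approximation:

```python
import numpy as np

def f(w, x, b):
    return max(0.0, w * x + b)        # a = ReLU(z), with z = w*x + b

w, x, b = 0.5, 2.0, 0.1               # z = 1.1 > 0, so the ReLU is active

# Chain rule: da/dw = (da/dz) * (dz/dw) = 1 * x when z > 0
grad_analytic = x

# Finite-difference approximation of da/dw
eps = 1e-6
grad_numeric = (f(w + eps, x, b) - f(w - eps, x, b)) / (2 * eps)

print(grad_analytic, grad_numeric)    # both approximately 2.0
```

This "analytic vs numeric gradient" comparison is the standard sanity check used when implementing backpropagation by hand.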

import numpy as np

class ComputationalNode:
    """
    Simple computational node that stores value and gradient
    """
    def __init__(self, value, requires_grad=False):
        self.value = value
        self.grad = 0 if requires_grad else None
        self.requires_grad = requires_grad

def forward_with_cache(X, W, b):
    """
    Forward pass that caches intermediate values for backpropagation

    Returns:
    --------
    output : ndarray
        Final output
    cache : dict
        Intermediate values needed for backpropagation
    """
    # Store inputs
    cache = {'X': X, 'W': W, 'b': b}

    # Linear transformation
    Z = np.dot(X, W) + b
    cache['Z'] = Z

    # ReLU activation
    A = np.maximum(0, Z)
    cache['A'] = A

    return A, cache

# Example
X = np.array([[1.0, 2.0]])
W = np.array([[0.5, -0.5], [0.3, 0.7]])
b = np.array([[0.1, -0.1]])

output, cache = forward_with_cache(X, W, b)
print("Input X:", X)
print("Pre-activation Z:", cache['Z'])
print("Output A:", output)

3.3 Multi-Layer Forward Pass

For a network with multiple layers, we chain the forward passes together:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

class MultiLayerNetwork:
    """
    Multi-layer neural network with forward propagation
    """

    def __init__(self, layer_sizes):
        """
        Parameters:
        -----------
        layer_sizes : list
            Number of neurons in each layer [input, hidden1, hidden2, ..., output]
        """
        self.n_layers = len(layer_sizes) - 1
        self.weights = []
        self.biases = []

        # He initialization for ReLU
        for i in range(self.n_layers):
            n_in = layer_sizes[i]
            n_out = layer_sizes[i + 1]
            W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
            b = np.zeros((1, n_out))
            self.weights.append(W)
            self.biases.append(b)

    def forward(self, X):
        """
        Forward propagation through all layers

        Parameters:
        -----------
        X : ndarray, shape (batch_size, n_input)
            Input data

        Returns:
        --------
        output : ndarray
            Network output
        cache : list
            Intermediate values for each layer
        """
        cache = [{'A': X}]  # Store input as first "activation"
        A = X

        for i in range(self.n_layers):
            Z = np.dot(A, self.weights[i]) + self.biases[i]

            # Use ReLU for hidden layers, softmax for output
            if i < self.n_layers - 1:
                A = relu(Z)
            else:
                A = softmax(Z)

            cache.append({'Z': Z, 'A': A})

        return A, cache

# Example: 4 inputs -> 8 hidden -> 4 hidden -> 3 outputs
np.random.seed(42)
network = MultiLayerNetwork([4, 8, 4, 3])

# Test data
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.3, 2.9, 5.6, 1.8],
    [5.8, 2.7, 5.1, 1.9]
])

output, cache = network.forward(X)
print("Input shape:", X.shape)
print("Output shape:", output.shape)
print("Predictions (probabilities):\n", output)
print("Predicted classes:", np.argmax(output, axis=1))

4. Loss Functions

A loss function (also called cost function or objective function) measures how well the network's predictions match the true targets. The goal of training is to minimize this loss.

4.1 Mean Squared Error (MSE)

MSE is the standard loss function for regression problems. It measures the average squared difference between predictions and true values.

Formula:

$$\mathcal{L}_{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:

- $n$ is the number of samples
- $y_i$ is the true value for sample $i$
- $\hat{y}_i$ is the predicted value for sample $i$

Characteristics:

- Penalizes large errors quadratically, so outliers can dominate the loss
- Smooth and differentiable everywhere, which suits gradient-based optimization
- Measured in squared units of the target; take the square root (RMSE) for an interpretable scale

import numpy as np

def mse_loss(y_true, y_pred):
    """
    Mean Squared Error loss

    Parameters:
    -----------
    y_true : ndarray
        True values
    y_pred : ndarray
        Predicted values

    Returns:
    --------
    loss : float
        MSE loss value
    """
    return np.mean((y_true - y_pred) ** 2)

def mse_gradient(y_true, y_pred):
    """
    Gradient of MSE loss with respect to predictions

    Returns:
    --------
    gradient : ndarray
        dL/dy_pred
    """
    n = y_true.shape[0]
    return 2 * (y_pred - y_true) / n

# Example: regression task
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.2, 1.8, 3.1, 3.9, 5.2])

loss = mse_loss(y_true, y_pred)
grad = mse_gradient(y_true, y_pred)

print(f"True values: {y_true}")
print(f"Predictions: {y_pred}")
print(f"MSE Loss: {loss:.4f}")
print(f"Gradient: {grad}")

4.2 Cross Entropy Loss

Cross Entropy is the standard loss function for classification problems. It measures the difference between the predicted probability distribution and the true distribution.

Binary Cross Entropy (for binary classification)

$$\mathcal{L}_{BCE} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$$

Categorical Cross Entropy (for multi-class classification)

$$\mathcal{L}_{CE} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$

Where:

- $C$ is the number of classes
- $y_{i,c}$ is 1 if sample $i$ belongs to class $c$ and 0 otherwise (one-hot encoding)
- $\hat{y}_{i,c}$ is the predicted probability that sample $i$ belongs to class $c$

import numpy as np

def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Binary Cross Entropy loss

    Parameters:
    -----------
    y_true : ndarray
        True binary labels (0 or 1)
    y_pred : ndarray
        Predicted probabilities
    epsilon : float
        Small value to avoid log(0)

    Returns:
    --------
    loss : float
        BCE loss value
    """
    # Clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Categorical Cross Entropy loss

    Parameters:
    -----------
    y_true : ndarray, shape (n_samples, n_classes)
        True labels (one-hot encoded)
    y_pred : ndarray, shape (n_samples, n_classes)
        Predicted probabilities

    Returns:
    --------
    loss : float
        CCE loss value
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Example: multi-class classification
# 3 samples, 4 classes
y_true = np.array([
    [1, 0, 0, 0],  # Class 0
    [0, 1, 0, 0],  # Class 1
    [0, 0, 0, 1]   # Class 3
])

# Good predictions
y_pred_good = np.array([
    [0.9, 0.05, 0.03, 0.02],
    [0.1, 0.7, 0.1, 0.1],
    [0.05, 0.05, 0.1, 0.8]
])

# Bad predictions
y_pred_bad = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.1, 0.3, 0.3, 0.3],
    [0.4, 0.3, 0.2, 0.1]
])

loss_good = categorical_cross_entropy(y_true, y_pred_good)
loss_bad = categorical_cross_entropy(y_true, y_pred_bad)

print(f"Good predictions - Loss: {loss_good:.4f}")
print(f"Bad predictions - Loss: {loss_bad:.4f}")

4.3 Choosing the Right Loss Function

| Task Type | Output Activation | Loss Function |
|---|---|---|
| Regression | Linear (none) | MSE, MAE, Huber |
| Binary Classification | Sigmoid | Binary Cross Entropy |
| Multi-class (single label) | Softmax | Categorical Cross Entropy |
| Multi-label | Sigmoid | Binary Cross Entropy (per label) |
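For reference, a sketch of the PyTorch counterparts of the losses in the table. Note that `nn.CrossEntropyLoss` expects raw logits and class indices (it applies softmax internally), and `nn.BCEWithLogitsLoss` similarly folds the sigmoid into the loss for numerical stability:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()             # regression
bce = nn.BCEWithLogitsLoss()   # binary / multi-label (takes logits)
ce = nn.CrossEntropyLoss()     # multi-class (takes logits, not softmax output)

logits = torch.tensor([[2.0, 0.5, -1.0]])  # one sample, three classes
target = torch.tensor([0])                 # class index, not one-hot
print(ce(logits, target).item())           # about 0.2413
```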

Why Cross Entropy for Classification? Cross entropy treats the network output as a probability distribution and heavily penalizes confident wrong predictions: as $\hat{y} \to 0$ for the true class, $-\log(\hat{y}) \to \infty$. Combined with sigmoid or softmax outputs, it also yields well-behaved gradients, whereas MSE gradients shrink precisely where the prediction is most wrong (in the saturated regions of the sigmoid).

5. Implementing Fully Connected Layers

5.1 NumPy Implementation

import numpy as np

class DenseLayer:
    """
    Fully connected (dense) layer implementation
    """

    def __init__(self, n_input, n_output, activation='relu'):
        """
        Parameters:
        -----------
        n_input : int
            Number of input features
        n_output : int
            Number of output neurons
        activation : str
            Activation function ('relu', 'sigmoid', 'softmax', 'none')
        """
        self.n_input = n_input
        self.n_output = n_output
        self.activation = activation

        # He initialization
        self.W = np.random.randn(n_input, n_output) * np.sqrt(2.0 / n_input)
        self.b = np.zeros((1, n_output))

        # Cache for backpropagation
        self.cache = {}

    def _activate(self, Z):
        """Apply activation function"""
        if self.activation == 'relu':
            return np.maximum(0, Z)
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-Z))
        elif self.activation == 'softmax':
            exp_Z = np.exp(Z - np.max(Z, axis=1, keepdims=True))
            return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)
        elif self.activation == 'none':
            return Z
        else:
            raise ValueError(f"Unknown activation: {self.activation}")

    def forward(self, X):
        """
        Forward pass

        Parameters:
        -----------
        X : ndarray, shape (batch_size, n_input)
            Input data

        Returns:
        --------
        A : ndarray, shape (batch_size, n_output)
            Layer output
        """
        self.cache['X'] = X

        # Linear transformation
        Z = np.dot(X, self.W) + self.b
        self.cache['Z'] = Z

        # Activation
        A = self._activate(Z)
        self.cache['A'] = A

        return A

# Build a simple network
np.random.seed(42)

# Layer 1: 4 inputs -> 8 neurons (ReLU)
layer1 = DenseLayer(4, 8, activation='relu')

# Layer 2: 8 neurons -> 4 neurons (ReLU)
layer2 = DenseLayer(8, 4, activation='relu')

# Layer 3: 4 neurons -> 3 outputs (Softmax)
layer3 = DenseLayer(4, 3, activation='softmax')

# Test forward pass
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.3, 2.9, 5.6, 1.8],
    [5.8, 2.7, 5.1, 1.9],
    [5.0, 3.4, 1.5, 0.2],
    [6.7, 3.1, 5.6, 2.4]
])

# Forward through all layers
A1 = layer1.forward(X)
A2 = layer2.forward(A1)
output = layer3.forward(A2)

print("Input shape:", X.shape)
print("After Layer 1:", A1.shape)
print("After Layer 2:", A2.shape)
print("Output shape:", output.shape)
print("\nPredicted probabilities:\n", output)
print("\nPredicted classes:", np.argmax(output, axis=1))

5.2 PyTorch Implementation

import torch
import torch.nn as nn

class SimpleNetwork(nn.Module):
    """
    Simple fully connected network in PyTorch
    """

    def __init__(self, input_size, hidden_sizes, output_size):
        """
        Parameters:
        -----------
        input_size : int
            Number of input features
        hidden_sizes : list
            Number of neurons in each hidden layer
        output_size : int
            Number of output classes
        """
        super(SimpleNetwork, self).__init__()

        layers = []
        prev_size = input_size

        # Hidden layers with ReLU
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            prev_size = hidden_size

        # Output layer (no activation - use with CrossEntropyLoss)
        layers.append(nn.Linear(prev_size, output_size))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# Create network
model = SimpleNetwork(input_size=4, hidden_sizes=[8, 4], output_size=3)
print("Model architecture:")
print(model)

# Test data
X_torch = torch.tensor([
    [5.1, 3.5, 1.4, 0.2],
    [6.3, 2.9, 5.6, 1.8],
    [5.8, 2.7, 5.1, 1.9],
    [5.0, 3.4, 1.5, 0.2],
    [6.7, 3.1, 5.6, 2.4]
], dtype=torch.float32)

# Forward pass
with torch.no_grad():
    logits = model(X_torch)
    probabilities = torch.softmax(logits, dim=1)

print("\nLogits shape:", logits.shape)
print("Probabilities:\n", probabilities.numpy())
print("Predicted classes:", torch.argmax(probabilities, dim=1).numpy())

# Loss calculation example
y_true = torch.tensor([0, 2, 2, 0, 2])  # True class labels
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, y_true)
print(f"\nCross Entropy Loss: {loss.item():.4f}")

5.3 Complete Forward Pass Example

import numpy as np

def complete_forward_pass(X, weights, biases):
    """
    Complete forward pass through a neural network

    Parameters:
    -----------
    X : ndarray
        Input data
    weights : list of ndarray
        Weight matrices for each layer
    biases : list of ndarray
        Bias vectors for each layer

    Returns:
    --------
    output : ndarray
        Network output
    activations : list
        Activation at each layer (for visualization)
    """
    def relu(x):
        return np.maximum(0, x)

    def softmax(x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)

    activations = [X]
    A = X

    for i, (W, b) in enumerate(zip(weights, biases)):
        # Linear transformation
        Z = np.dot(A, W) + b

        # Activation (softmax for last layer, ReLU for others)
        if i == len(weights) - 1:
            A = softmax(Z)
        else:
            A = relu(Z)

        activations.append(A)

    return A, activations

# Initialize network
np.random.seed(42)
layer_sizes = [4, 8, 4, 3]

weights = []
biases = []
for i in range(len(layer_sizes) - 1):
    n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
    W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
    b = np.zeros((1, n_out))
    weights.append(W)
    biases.append(b)

# Input data (Iris-like)
X = np.array([
    [5.1, 3.5, 1.4, 0.2],  # setosa
    [7.0, 3.2, 4.7, 1.4],  # versicolor
    [6.3, 3.3, 6.0, 2.5],  # virginica
])

# Run forward pass
output, activations = complete_forward_pass(X, weights, biases)

print("Layer-by-layer activations:")
for i, A in enumerate(activations):
    print(f"  Layer {i}: shape {A.shape}, mean={A.mean():.4f}, std={A.std():.4f}")

print("\nFinal output (probabilities):")
print(output)
print("\nPredicted classes:", np.argmax(output, axis=1))

Exercises

Exercise 1: Manual Forward Pass Calculation

Problem: Given the following network with 2 inputs, 2 hidden neurons, and 2 outputs:

Calculate the output by hand, then verify with code.

Exercise 2: Loss Function Comparison

Problem: For a binary classification problem, compare how MSE and Binary Cross Entropy losses change as the prediction varies from 0 to 1 for a true label of 1.

  1. Plot both loss functions for $\hat{y} \in [0.01, 0.99]$ with $y = 1$
  2. Plot the gradients $\frac{dL}{d\hat{y}}$ for both losses
  3. Explain why Cross Entropy is preferred for classification
Exercise 3: Implement Huber Loss

Problem: Implement the Huber loss function, which combines MSE and MAE:

$$L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta |y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$

Compare its behavior to MSE and MAE for inputs with outliers.

Exercise 4: Batch Processing

Problem: Implement a function that processes data in mini-batches:

  1. Create a dataset of 1000 samples with 10 features
  2. Implement batch iteration with configurable batch size
  3. Measure the time to process all samples with batch sizes 1, 32, 128, and 1000
  4. Plot the relationship between batch size and processing time
Exercise 5: Weight Initialization Experiment

Problem: Investigate the effect of weight initialization on activation distributions:

  1. Create a 5-layer network (100 -> 100 -> 100 -> 100 -> 10)
  2. Initialize with zeros, random small values, Xavier, and He methods
  3. Pass random input through the network and record activation statistics (mean, std) at each layer
  4. Plot histograms of activations for each initialization method
Exercise 6: Computational Graph Visualization

Problem: Draw the computational graph for a 2-layer network and trace through a forward pass:

  1. Network: 2 inputs -> 3 hidden (ReLU) -> 2 outputs (Softmax)
  2. Draw all operations (multiply, add, activation) as nodes
  3. Label edges with tensor shapes
  4. Calculate the total number of parameters

Summary

In this chapter, we learned how neural networks process data:

- Forward propagation transforms input to output layer by layer: a linear step $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$ followed by a non-linear activation
- Matrix operations process whole batches at once and map onto optimized linear algebra routines
- Weight matrices encode connection strengths, biases shift activations, and proper initialization (Xavier, He) keeps signals from vanishing or exploding
- Computational graphs record operations and intermediate values, enabling automatic differentiation via the chain rule
- Loss functions quantify prediction error: MSE for regression, cross entropy for classification

Next Chapter Preview: In Chapter 3, we will learn about learning algorithms - how neural networks adjust their weights to minimize loss using gradient descent and backpropagation.
