Learning Objectives
- Understand the data flow from input to output in forward propagation
- Master matrix operations for efficient neural network computation
- Comprehend the roles of weight matrices and bias vectors
- Learn about computational graphs and their importance for automatic differentiation
- Understand and implement MSE and Cross Entropy loss functions
- Implement fully connected layers in NumPy and PyTorch
1. Forward Propagation
1.1 Data Flow from Input to Output
Forward propagation is the process of computing the output of a neural network given an input. Data flows through the network layer by layer, with each layer transforming the data until the final output is produced.
For each layer, the computation follows these steps:
- Linear transformation: Multiply input by weights and add bias
- Activation: Apply non-linear activation function
The mathematical expression for a single layer:
$$\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$$
$$\mathbf{a} = f(\mathbf{z})$$
Where:
- $\mathbf{x}$: input vector (or activation from previous layer)
- $\mathbf{W}$: weight matrix
- $\mathbf{b}$: bias vector
- $\mathbf{z}$: pre-activation (linear combination)
- $f$: activation function
- $\mathbf{a}$: activation output
1.2 Efficient Computation with Matrix Operations
Instead of computing neuron outputs one by one, we use matrix operations to process all neurons in a layer simultaneously. This is not only more concise but also significantly faster due to optimized linear algebra libraries.
Why Matrix Operations?
- Parallelization: GPUs can perform thousands of matrix operations simultaneously
- Optimization: Libraries like BLAS are highly optimized for matrix multiplication
- Batch Processing: Multiple samples can be processed at once
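To see the speed difference directly, here is a small, machine-dependent comparison (timings will vary; the exact sizes are illustrative) of an explicit Python loop over neurons against a single matrix multiply computing the same pre-activations:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 512))   # batch of 256 samples, 512 features
W = rng.standard_normal((512, 128))   # 512 inputs -> 128 neurons
b = np.zeros((1, 128))

# Naive version: one neuron output at a time
t0 = time.perf_counter()
Z_loop = np.empty((256, 128))
for i in range(256):
    for j in range(128):
        Z_loop[i, j] = X[i] @ W[:, j] + b[0, j]
t_loop = time.perf_counter() - t0

# Vectorized version: one matrix multiply for the whole batch
t0 = time.perf_counter()
Z_mat = X @ W + b
t_mat = time.perf_counter() - t0

print(f"loop: {t_loop * 1e3:.1f} ms, matmul: {t_mat * 1e3:.3f} ms")
print("results match:", np.allclose(Z_loop, Z_mat))
```

Both versions compute identical values; the matrix version hands the whole computation to an optimized BLAS routine in one call.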
Example: Forward pass for a single layer
import numpy as np

def forward_layer(X, W, b, activation_fn):
    """
    Forward pass for a single layer

    Parameters:
    -----------
    X : ndarray, shape (batch_size, n_input)
        Input data
    W : ndarray, shape (n_input, n_output)
        Weight matrix
    b : ndarray, shape (1, n_output)
        Bias vector
    activation_fn : function
        Activation function

    Returns:
    --------
    A : ndarray, shape (batch_size, n_output)
        Activation output
    Z : ndarray, shape (batch_size, n_output)
        Pre-activation (for backpropagation)
    """
    # Linear transformation: Z = X @ W + b
    Z = np.dot(X, W) + b
    # Apply activation function
    A = activation_fn(Z)
    return A, Z

# Example usage
np.random.seed(42)

# Input: 4 samples, 3 features
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0]
])

# Weights: 3 inputs -> 2 outputs
W = np.random.randn(3, 2) * 0.01
b = np.zeros((1, 2))

# ReLU activation
relu = lambda x: np.maximum(0, x)

A, Z = forward_layer(X, W, b, relu)
print("Input shape:", X.shape)
print("Output shape:", A.shape)
print("Pre-activation:\n", Z)
print("Activation:\n", A)
1.3 Batch Processing
In practice, we process multiple samples (a batch) at once. This is achieved by stacking samples as rows in a matrix:
$$\mathbf{Z} = \mathbf{X}\mathbf{W} + \mathbf{b}$$
Where:
- $\mathbf{X}$ has shape $(m, n_{in})$ - $m$ samples, $n_{in}$ input features
- $\mathbf{W}$ has shape $(n_{in}, n_{out})$ - weight matrix
- $\mathbf{b}$ has shape $(1, n_{out})$ - bias vector (broadcasted)
- $\mathbf{Z}$ has shape $(m, n_{out})$ - pre-activations for all samples
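The broadcasting of $\mathbf{b}$ across the batch can be checked directly with small toy matrices:

```python
import numpy as np

m, n_in, n_out = 4, 3, 2
X = np.arange(m * n_in, dtype=float).reshape(m, n_in)  # shape (4, 3)
W = np.ones((n_in, n_out))                             # shape (3, 2)
b = np.array([[10.0, -10.0]])                          # shape (1, 2)

Z = X @ W + b   # b is broadcast across all m rows
print(Z.shape)  # (4, 2)
# Every row received the same bias offset
print(Z[0])     # [13. -7.]
print(Z[1])     # [22.  2.]
```

NumPy stretches the $(1, n_{out})$ bias to match the $(m, n_{out})$ product, so no explicit loop over samples is needed.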
| Batch Size | Advantages | Disadvantages |
|---|---|---|
| 1 (SGD) | Frequent updates, can escape local minima | Noisy gradients, slow (no parallelization) |
| 32-256 | Good balance of speed and gradient quality | May need to tune learning rate |
| Full batch | Accurate gradients | Memory intensive, can get stuck in local minima |
2. Weight Matrices and Bias Vectors
2.1 The Role of Weights
The weight matrix $\mathbf{W}$ determines how strongly each input feature influences each output neuron. Each element $W_{ij}$ represents the connection strength from input $i$ to output $j$.
For 2 inputs and 3 outputs, the weight matrix is:
$$\mathbf{W} = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix}$$
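The indexing convention can be verified numerically: output $j$ is the weighted sum over inputs, $z_j = \sum_i x_i W_{ij}$ (weight values here are arbitrary examples):

```python
import numpy as np

# 2 inputs -> 3 outputs; W[i, j] connects input i to output j
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])
x = np.array([1.0, 2.0])

z = x @ W
# Same result computed element by element from the definition
z_manual = np.array([sum(x[i] * W[i, j] for i in range(2)) for j in range(3)])
print(z)                          # [0.9 1.2 1.5]
print(np.allclose(z, z_manual))   # True
```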
2.2 Weight Initialization
Proper weight initialization is crucial for effective training. Poor initialization can lead to vanishing or exploding gradients.
| Method | Formula | Best For |
|---|---|---|
| Zero | $W = 0$ | Never use (all neurons stay identical — the symmetry problem) |
| Random Small | $W \sim \mathcal{N}(0,\ 0.01^2)$ | Simple, shallow networks |
| Xavier/Glorot | $W \sim \mathcal{N}\left(0,\ \frac{2}{n_{in} + n_{out}}\right)$ | Sigmoid, tanh |
| He | $W \sim \mathcal{N}\left(0,\ \frac{2}{n_{in}}\right)$ | ReLU |
import numpy as np

def initialize_weights(n_in, n_out, method='he'):
    """
    Initialize weight matrix using different methods

    Parameters:
    -----------
    n_in : int
        Number of input neurons
    n_out : int
        Number of output neurons
    method : str
        Initialization method ('zeros', 'random', 'xavier', 'he')

    Returns:
    --------
    W : ndarray
        Initialized weight matrix
    """
    if method == 'zeros':
        # Bad: all neurons learn the same thing
        return np.zeros((n_in, n_out))
    elif method == 'random':
        # Small random values
        return np.random.randn(n_in, n_out) * 0.01
    elif method == 'xavier':
        # Good for sigmoid/tanh
        std = np.sqrt(2.0 / (n_in + n_out))
        return np.random.randn(n_in, n_out) * std
    elif method == 'he':
        # Good for ReLU
        std = np.sqrt(2.0 / n_in)
        return np.random.randn(n_in, n_out) * std
    else:
        raise ValueError(f"Unknown method: {method}")

# Compare initialization methods
np.random.seed(42)
n_in, n_out = 784, 256

for method in ['zeros', 'random', 'xavier', 'he']:
    W = initialize_weights(n_in, n_out, method)
    print(f"{method:8s}: mean={W.mean():.6f}, std={W.std():.6f}")
2.3 The Importance of Bias
The bias vector $\mathbf{b}$ allows the activation function to shift horizontally. Without bias, the pre-activation $z$ would be zero when inputs are zero, limiting the network's expressiveness.
Geometric Interpretation
- Weight: Controls the slope/rotation of the decision boundary
- Bias: Controls the position/offset of the decision boundary
Together, they allow each neuron to realize any affine function — a line or hyperplane that does not have to pass through the origin.
import numpy as np
import matplotlib.pyplot as plt

# Demonstrate the effect of bias
x = np.linspace(-5, 5, 100)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Different bias values
biases = [-2, 0, 2]
w = 1  # Fixed weight

plt.figure(figsize=(10, 4))
for b in biases:
    z = w * x + b
    y = sigmoid(z)
    plt.plot(x, y, label=f'bias = {b}')

plt.xlabel('Input x')
plt.ylabel('Output sigmoid(wx + b)')
plt.title('Effect of Bias on Sigmoid Activation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.show()
3. Layer Connections and Computational Graphs
3.1 The Concept of Computational Graphs
A computational graph is a directed graph that represents the sequence of operations in a neural network. Each node represents an operation (addition, multiplication, activation function), and edges represent data flow.
The computational graph for a single neuron with 2 inputs:
$$z = w_1 x_1 + w_2 x_2 + b$$
$$a = \text{ReLU}(z)$$
3.2 Foundations of Automatic Differentiation
The key benefit of computational graphs is enabling automatic differentiation. By storing intermediate values and applying the chain rule, we can efficiently compute gradients for all parameters.
Chain Rule: If $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$
This allows us to compute gradients layer by layer from output to input.
import numpy as np

class ComputationalNode:
    """
    Simple computational node that stores value and gradient
    """
    def __init__(self, value, requires_grad=False):
        self.value = value
        self.grad = 0 if requires_grad else None
        self.requires_grad = requires_grad

def forward_with_cache(X, W, b):
    """
    Forward pass that caches intermediate values for backpropagation

    Returns:
    --------
    output : ndarray
        Final output
    cache : dict
        Intermediate values needed for backpropagation
    """
    # Store inputs
    cache = {'X': X, 'W': W, 'b': b}
    # Linear transformation
    Z = np.dot(X, W) + b
    cache['Z'] = Z
    # ReLU activation
    A = np.maximum(0, Z)
    cache['A'] = A
    return A, cache

# Example
X = np.array([[1.0, 2.0]])
W = np.array([[0.5, -0.5], [0.3, 0.7]])
b = np.array([[0.1, -0.1]])

output, cache = forward_with_cache(X, W, b)
print("Input X:", X)
print("Pre-activation Z:", cache['Z'])
print("Output A:", output)
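To make the purpose of the cache concrete, here is a sketch of a matching backward pass that consumes the cached values via the chain rule (this anticipates backpropagation, covered fully in the next chapter; `backward_from_cache` and the toy scalar loss are illustrative, not part of the chapter's API). A finite-difference check confirms the gradient:

```python
import numpy as np

def forward_with_cache(X, W, b):
    # Same forward pass as above, caching intermediates
    cache = {'X': X, 'W': W, 'b': b}
    Z = X @ W + b
    cache['Z'] = Z
    A = np.maximum(0, Z)
    return A, cache

def backward_from_cache(dA, cache):
    # Chain rule through ReLU: gradient flows only where Z > 0
    dZ = dA * (cache['Z'] > 0)
    # Gradients of the linear step Z = X @ W + b
    dW = cache['X'].T @ dZ
    db = dZ.sum(axis=0, keepdims=True)
    dX = dZ @ cache['W'].T
    return dX, dW, db

X = np.array([[1.0, 2.0]])
W = np.array([[0.5, -0.5], [0.3, 0.7]])
b = np.array([[0.1, -0.1]])

A, cache = forward_with_cache(X, W, b)
loss = A.sum()   # toy scalar loss: sum of activations
dX, dW, db = backward_from_cache(np.ones_like(A), cache)

# Numerical check of dW[0, 0] via finite differences
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
A_pert, _ = forward_with_cache(X, W_pert, b)
print(dW[0, 0], (A_pert.sum() - loss) / eps)  # should agree closely
```

Without the cached `Z`, the backward pass could not decide where the ReLU blocked the gradient; this is exactly why frameworks store intermediate values during the forward pass.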
3.3 Multi-Layer Forward Pass
For a network with multiple layers, we chain the forward passes together:
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

class MultiLayerNetwork:
    """
    Multi-layer neural network with forward propagation
    """
    def __init__(self, layer_sizes):
        """
        Parameters:
        -----------
        layer_sizes : list
            Number of neurons in each layer [input, hidden1, hidden2, ..., output]
        """
        self.n_layers = len(layer_sizes) - 1
        self.weights = []
        self.biases = []

        # He initialization for ReLU
        for i in range(self.n_layers):
            n_in = layer_sizes[i]
            n_out = layer_sizes[i + 1]
            W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
            b = np.zeros((1, n_out))
            self.weights.append(W)
            self.biases.append(b)

    def forward(self, X):
        """
        Forward propagation through all layers

        Parameters:
        -----------
        X : ndarray, shape (batch_size, n_input)
            Input data

        Returns:
        --------
        output : ndarray
            Network output
        cache : list
            Intermediate values for each layer
        """
        cache = [{'A': X}]  # Store input as first "activation"
        A = X
        for i in range(self.n_layers):
            Z = np.dot(A, self.weights[i]) + self.biases[i]
            # Use ReLU for hidden layers, softmax for output
            if i < self.n_layers - 1:
                A = relu(Z)
            else:
                A = softmax(Z)
            cache.append({'Z': Z, 'A': A})
        return A, cache

# Example: 4 inputs -> 8 hidden -> 4 hidden -> 3 outputs
np.random.seed(42)
network = MultiLayerNetwork([4, 8, 4, 3])

# Test data
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.3, 2.9, 5.6, 1.8],
    [5.8, 2.7, 5.1, 1.9]
])

output, cache = network.forward(X)
print("Input shape:", X.shape)
print("Output shape:", output.shape)
print("Predictions (probabilities):\n", output)
print("Predicted classes:", np.argmax(output, axis=1))
4. Loss Functions
A loss function (also called cost function or objective function) measures how well the network's predictions match the true targets. The goal of training is to minimize this loss.
4.1 Mean Squared Error (MSE)
MSE is the standard loss function for regression problems. It measures the average squared difference between predictions and true values.
Formula:
$$\mathcal{L}_{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Where:
- $y_i$: true value
- $\hat{y}_i$: predicted value
- $n$: number of samples
Characteristics:
- Always non-negative (squared values)
- Penalizes large errors more than small ones (quadratic)
- Differentiable everywhere (smooth gradient)
- Sensitive to outliers
import numpy as np

def mse_loss(y_true, y_pred):
    """
    Mean Squared Error loss

    Parameters:
    -----------
    y_true : ndarray
        True values
    y_pred : ndarray
        Predicted values

    Returns:
    --------
    loss : float
        MSE loss value
    """
    return np.mean((y_true - y_pred) ** 2)

def mse_gradient(y_true, y_pred):
    """
    Gradient of MSE loss with respect to predictions

    Returns:
    --------
    gradient : ndarray
        dL/dy_pred
    """
    n = y_true.shape[0]
    return 2 * (y_pred - y_true) / n

# Example: regression task
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.2, 1.8, 3.1, 3.9, 5.2])

loss = mse_loss(y_true, y_pred)
grad = mse_gradient(y_true, y_pred)

print(f"True values: {y_true}")
print(f"Predictions: {y_pred}")
print(f"MSE Loss: {loss:.4f}")
print(f"Gradient: {grad}")
4.2 Cross Entropy Loss
Cross Entropy is the standard loss function for classification problems. It measures the difference between the predicted probability distribution and the true distribution.
Binary Cross Entropy (for binary classification)
$$\mathcal{L}_{BCE} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$$
Categorical Cross Entropy (for multi-class classification)
$$\mathcal{L}_{CE} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$
Where:
- $y_{i,c}$: true label (one-hot encoded)
- $\hat{y}_{i,c}$: predicted probability for class $c$
- $C$: number of classes
import numpy as np

def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Binary Cross Entropy loss

    Parameters:
    -----------
    y_true : ndarray
        True binary labels (0 or 1)
    y_pred : ndarray
        Predicted probabilities
    epsilon : float
        Small value to avoid log(0)

    Returns:
    --------
    loss : float
        BCE loss value
    """
    # Clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Categorical Cross Entropy loss

    Parameters:
    -----------
    y_true : ndarray, shape (n_samples, n_classes)
        True labels (one-hot encoded)
    y_pred : ndarray, shape (n_samples, n_classes)
        Predicted probabilities
    epsilon : float
        Small value to avoid log(0)

    Returns:
    --------
    loss : float
        CCE loss value
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Example: multi-class classification
# 3 samples, 4 classes
y_true = np.array([
    [1, 0, 0, 0],  # Class 0
    [0, 1, 0, 0],  # Class 1
    [0, 0, 0, 1]   # Class 3
])

# Good predictions
y_pred_good = np.array([
    [0.9, 0.05, 0.03, 0.02],
    [0.1, 0.7, 0.1, 0.1],
    [0.05, 0.05, 0.1, 0.8]
])

# Bad predictions
y_pred_bad = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.1, 0.3, 0.3, 0.3],
    [0.4, 0.3, 0.2, 0.1]
])

loss_good = categorical_cross_entropy(y_true, y_pred_good)
loss_bad = categorical_cross_entropy(y_true, y_pred_bad)

print(f"Good predictions - Loss: {loss_good:.4f}")
print(f"Bad predictions - Loss: {loss_bad:.4f}")
4.3 Choosing the Right Loss Function
| Task Type | Output Activation | Loss Function |
|---|---|---|
| Regression | Linear (none) | MSE, MAE, Huber |
| Binary Classification | Sigmoid | Binary Cross Entropy |
| Multi-class (single label) | Softmax | Categorical Cross Entropy |
| Multi-label | Sigmoid | Binary Cross Entropy (per label) |
Why Cross Entropy for Classification?
- Provides stronger gradients when predictions are wrong
- Naturally pairs with softmax/sigmoid outputs
- Has information-theoretic interpretation (relative entropy)
- Leads to faster convergence than MSE for classification
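The first bullet can be made concrete. For a sigmoid output, the gradient of BCE with respect to the logit $z$ simplifies to $\hat{y} - y$, whereas MSE's gradient carries an extra $\hat{y}(1-\hat{y})$ factor that vanishes when the unit saturates. A small illustrative computation (the logit value is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# True label is 1, but the network is confidently wrong (large negative logit)
y, z = 1.0, -6.0
y_hat = sigmoid(z)

# Gradients of the loss with respect to the logit z
grad_bce = y_hat - y                           # sigmoid + BCE
grad_mse = (y_hat - y) * y_hat * (1 - y_hat)   # sigmoid + MSE

print(f"prediction:   {y_hat:.4f}")
print(f"BCE gradient: {grad_bce:.4f}")    # large -> strong learning signal
print(f"MSE gradient: {grad_mse:.6f}")    # tiny -> learning stalls
```

With MSE, a saturated-but-wrong neuron receives almost no gradient; with Cross Entropy, the worse the prediction, the stronger the corrective signal.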
5. Implementing Fully Connected Layers
5.1 NumPy Implementation
import numpy as np

class DenseLayer:
    """
    Fully connected (dense) layer implementation
    """
    def __init__(self, n_input, n_output, activation='relu'):
        """
        Parameters:
        -----------
        n_input : int
            Number of input features
        n_output : int
            Number of output neurons
        activation : str
            Activation function ('relu', 'sigmoid', 'softmax', 'none')
        """
        self.n_input = n_input
        self.n_output = n_output
        self.activation = activation

        # He initialization
        self.W = np.random.randn(n_input, n_output) * np.sqrt(2.0 / n_input)
        self.b = np.zeros((1, n_output))

        # Cache for backpropagation
        self.cache = {}

    def _activate(self, Z):
        """Apply activation function"""
        if self.activation == 'relu':
            return np.maximum(0, Z)
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-Z))
        elif self.activation == 'softmax':
            exp_Z = np.exp(Z - np.max(Z, axis=1, keepdims=True))
            return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)
        elif self.activation == 'none':
            return Z
        else:
            raise ValueError(f"Unknown activation: {self.activation}")

    def forward(self, X):
        """
        Forward pass

        Parameters:
        -----------
        X : ndarray, shape (batch_size, n_input)
            Input data

        Returns:
        --------
        A : ndarray, shape (batch_size, n_output)
            Layer output
        """
        self.cache['X'] = X
        # Linear transformation
        Z = np.dot(X, self.W) + self.b
        self.cache['Z'] = Z
        # Activation
        A = self._activate(Z)
        self.cache['A'] = A
        return A

# Build a simple network
np.random.seed(42)

# Layer 1: 4 inputs -> 8 neurons (ReLU)
layer1 = DenseLayer(4, 8, activation='relu')
# Layer 2: 8 neurons -> 4 neurons (ReLU)
layer2 = DenseLayer(8, 4, activation='relu')
# Layer 3: 4 neurons -> 3 outputs (Softmax)
layer3 = DenseLayer(4, 3, activation='softmax')

# Test forward pass
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.3, 2.9, 5.6, 1.8],
    [5.8, 2.7, 5.1, 1.9],
    [5.0, 3.4, 1.5, 0.2],
    [6.7, 3.1, 5.6, 2.4]
])

# Forward through all layers
A1 = layer1.forward(X)
A2 = layer2.forward(A1)
output = layer3.forward(A2)

print("Input shape:", X.shape)
print("After Layer 1:", A1.shape)
print("After Layer 2:", A2.shape)
print("Output shape:", output.shape)
print("\nPredicted probabilities:\n", output)
print("\nPredicted classes:", np.argmax(output, axis=1))
5.2 PyTorch Implementation
import torch
import torch.nn as nn

class SimpleNetwork(nn.Module):
    """
    Simple fully connected network in PyTorch
    """
    def __init__(self, input_size, hidden_sizes, output_size):
        """
        Parameters:
        -----------
        input_size : int
            Number of input features
        hidden_sizes : list
            Number of neurons in each hidden layer
        output_size : int
            Number of output classes
        """
        super(SimpleNetwork, self).__init__()

        layers = []
        prev_size = input_size

        # Hidden layers with ReLU
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            prev_size = hidden_size

        # Output layer (no activation - use with CrossEntropyLoss)
        layers.append(nn.Linear(prev_size, output_size))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# Create network
model = SimpleNetwork(input_size=4, hidden_sizes=[8, 4], output_size=3)
print("Model architecture:")
print(model)

# Test data
X_torch = torch.tensor([
    [5.1, 3.5, 1.4, 0.2],
    [6.3, 2.9, 5.6, 1.8],
    [5.8, 2.7, 5.1, 1.9],
    [5.0, 3.4, 1.5, 0.2],
    [6.7, 3.1, 5.6, 2.4]
], dtype=torch.float32)

# Forward pass
with torch.no_grad():
    logits = model(X_torch)
    probabilities = torch.softmax(logits, dim=1)

print("\nLogits shape:", logits.shape)
print("Probabilities:\n", probabilities.numpy())
print("Predicted classes:", torch.argmax(probabilities, dim=1).numpy())

# Loss calculation example
y_true = torch.tensor([0, 2, 2, 0, 2])  # True class labels
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, y_true)
print(f"\nCross Entropy Loss: {loss.item():.4f}")
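One handy sanity check on such a model is counting its parameters, which also previews Exercise 6. Assuming the same 4 → [8, 4] → 3 architecture as above, each `nn.Linear(n_in, n_out)` holds `n_in * n_out` weights plus `n_out` biases:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, 4), nn.ReLU(),
    nn.Linear(4, 3),
)

# Hand count: (4*8 + 8) + (8*4 + 4) + (4*3 + 3)
expected = (4 * 8 + 8) + (8 * 4 + 4) + (4 * 3 + 3)
total = sum(p.numel() for p in model.parameters())
print(total, expected)  # 91 91
```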
5.3 Complete Forward Pass Example
import numpy as np

def complete_forward_pass(X, weights, biases):
    """
    Complete forward pass through a neural network

    Parameters:
    -----------
    X : ndarray
        Input data
    weights : list of ndarray
        Weight matrices for each layer
    biases : list of ndarray
        Bias vectors for each layer

    Returns:
    --------
    output : ndarray
        Network output
    activations : list
        Activation at each layer (for visualization)
    """
    def relu(x):
        return np.maximum(0, x)

    def softmax(x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)

    activations = [X]
    A = X

    for i, (W, b) in enumerate(zip(weights, biases)):
        # Linear transformation
        Z = np.dot(A, W) + b
        # Activation (softmax for last layer, ReLU for others)
        if i == len(weights) - 1:
            A = softmax(Z)
        else:
            A = relu(Z)
        activations.append(A)

    return A, activations

# Initialize network
np.random.seed(42)
layer_sizes = [4, 8, 4, 3]
weights = []
biases = []

for i in range(len(layer_sizes) - 1):
    n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
    W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
    b = np.zeros((1, n_out))
    weights.append(W)
    biases.append(b)

# Input data (Iris-like)
X = np.array([
    [5.1, 3.5, 1.4, 0.2],  # setosa
    [7.0, 3.2, 4.7, 1.4],  # versicolor
    [6.3, 3.3, 6.0, 2.5],  # virginica
])

# Run forward pass
output, activations = complete_forward_pass(X, weights, biases)

print("Layer-by-layer activations:")
for i, A in enumerate(activations):
    print(f"  Layer {i}: shape {A.shape}, mean={A.mean():.4f}, std={A.std():.4f}")

print("\nFinal output (probabilities):")
print(output)
print("\nPredicted classes:", np.argmax(output, axis=1))
Exercises
Exercise 1: Manual Forward Pass Calculation
Problem: Given the following network with 2 inputs, 2 hidden neurons, and 2 outputs:
- Input: $\mathbf{x} = [1, 2]$
- Weights layer 1: $\mathbf{W}^{(1)} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}$
- Bias layer 1: $\mathbf{b}^{(1)} = [0.1, 0.1]$
- Weights layer 2: $\mathbf{W}^{(2)} = \begin{bmatrix} 0.5 & 0.6 \\ 0.7 & 0.8 \end{bmatrix}$
- Bias layer 2: $\mathbf{b}^{(2)} = [0.2, 0.2]$
- Activation: ReLU for hidden layer, Softmax for output
Calculate the output by hand, then verify with code.
Exercise 2: Loss Function Comparison
Problem: For a binary classification problem, compare how MSE and Binary Cross Entropy losses change as the prediction varies from 0 to 1 for a true label of 1.
- Plot both loss functions for $\hat{y} \in [0.01, 0.99]$ with $y = 1$
- Plot the gradients $\frac{dL}{d\hat{y}}$ for both losses
- Explain why Cross Entropy is preferred for classification
Exercise 3: Implement Huber Loss
Problem: Implement the Huber loss function, which combines MSE and MAE:
$$L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta |y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$
Compare its behavior to MSE and MAE for inputs with outliers.
Exercise 4: Batch Processing
Problem: Implement a function that processes data in mini-batches:
- Create a dataset of 1000 samples with 10 features
- Implement batch iteration with configurable batch size
- Measure the time to process all samples with batch sizes 1, 32, 128, and 1000
- Plot the relationship between batch size and processing time
Exercise 5: Weight Initialization Experiment
Problem: Investigate the effect of weight initialization on activation distributions:
- Create a 5-layer network (100 -> 100 -> 100 -> 100 -> 10)
- Initialize with zeros, random small values, Xavier, and He methods
- Pass random input through the network and record activation statistics (mean, std) at each layer
- Plot histograms of activations for each initialization method
Exercise 6: Computational Graph Visualization
Problem: Draw the computational graph for a 2-layer network and trace through a forward pass:
- Network: 2 inputs -> 3 hidden (ReLU) -> 2 outputs (Softmax)
- Draw all operations (multiply, add, activation) as nodes
- Label edges with tensor shapes
- Calculate the total number of parameters
Summary
In this chapter, we learned how neural networks process data:
- Forward Propagation: Data flows layer by layer, with each layer performing linear transformation followed by activation
- Matrix Operations: Enable efficient batch processing and parallelization
- Weight Matrices: Control connection strengths; proper initialization (Xavier, He) is crucial
- Bias Vectors: Allow shifting the activation function
- Computational Graphs: Represent operations as a graph, enabling automatic differentiation
- Loss Functions: MSE for regression, Cross Entropy for classification
Next Chapter Preview: In Chapter 3, we will learn about learning algorithms - how neural networks adjust their weights to minimize loss using gradient descent and backpropagation.