
Chapter 2: Pooling Layers and CNN Architectures

Understanding the Evolution and Design Principles of Representative CNN Models

Reading Time: 25-30 minutes | Difficulty: Beginner to Intermediate | Code Examples: 10 | Exercises: 5

This chapter covers pooling layers and representative CNN architectures. You will learn how pooling performs spatial dimensionality reduction, how architectures from LeNet-5 to ResNet are designed, and how Batch Normalization and Dropout stabilize training and prevent overfitting.

Learning Objectives

By reading this chapter, you will be able to:

- Explain the role of pooling layers and the difference between Max Pooling and Average Pooling
- Describe the design ideas behind representative CNN architectures (LeNet-5, AlexNet, VGGNet, ResNet)
- Explain why skip connections make very deep networks trainable
- Use Batch Normalization and Dropout appropriately to stabilize training and prevent overfitting
- Build and train a CNN image classifier for CIFAR-10 in PyTorch

2.1 The Role of Pooling Layers

What is Pooling?

A Pooling Layer performs spatial downsampling of the output of convolutional layers. Its main purposes are the following three:

1. Dimensionality reduction: shrinking the feature maps reduces computation and memory
2. Translation invariance: small positional shifts in the input barely change the output
3. Overfitting suppression: discarding fine spatial detail keeps only the robust features

"Pooling is an operation that preserves important features of an image while discarding unnecessary details"

Max Pooling vs Average Pooling

There are mainly two types of pooling:

| Type | Operation | Characteristics | Use Cases |
|---|---|---|---|
| Max Pooling | Takes the maximum value in the region | Preserves the strongest features | Object detection, general image classification |
| Average Pooling | Takes the average value in the region | Preserves overall features | Global Average Pooling, segmentation |
graph LR
    A["Input Feature Map<br/>4Ɨ4"] --> B["Max Pooling<br/>2Ɨ2, stride=2"]
    A --> C["Average Pooling<br/>2Ɨ2, stride=2"]
    B --> D["Output<br/>2Ɨ2<br/>Max values retained"]
    C --> E["Output<br/>2Ɨ2<br/>Average values retained"]
    style A fill:#e1f5ff
    style B fill:#b3e5fc
    style C fill:#81d4fa
    style D fill:#4fc3f7
    style E fill:#29b6f6

Example of Max Pooling Operation

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0

"""
Example: Example of Max Pooling Operation

Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""

import numpy as np
import torch
import torch.nn as nn

# Input data (1 channel, 4Ɨ4)
input_data = torch.tensor([[
    [1.0, 3.0, 2.0, 4.0],
    [5.0, 6.0, 1.0, 2.0],
    [7.0, 2.0, 8.0, 3.0],
    [1.0, 4.0, 6.0, 9.0]
]], dtype=torch.float32).unsqueeze(0)  # (1, 1, 4, 4)

print("Input Feature Map:")
print(input_data.squeeze().numpy())

# Max Pooling (2Ɨ2, stride=2)
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
output_max = max_pool(input_data)

print("\nMax Pooling (2Ɨ2) Output:")
print(output_max.squeeze().numpy())

# Average Pooling (2Ɨ2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
output_avg = avg_pool(input_data)

print("\nAverage Pooling (2Ɨ2) Output:")
print(output_avg.squeeze().numpy())

# Manual calculation verification (top-left region)
print("\nManual calculation (top-left 2Ɨ2 region):")
region = input_data[0, 0, 0:2, 0:2].numpy()
print(f"Region: \n{region}")
print(f"Max: {region.max()}")
print(f"Average: {region.mean()}")

Effect of Pooling: Translation Invariance

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: Effect of Pooling: Translation Invariance

Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""

import torch
import torch.nn as nn

# Original feature map
original = torch.tensor([[
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0]
]], dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # (1, 1, 4, 4)

# Slightly shifted feature map
shifted = torch.tensor([[
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0]
]], dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # (1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

print("Max Pooling of original feature map:")
print(max_pool(original).squeeze().numpy())

print("\nMax Pooling of shifted feature map:")
print(max_pool(shifted).squeeze().numpy())

print("\n→ Even with minor positional changes, Max Pooling output has 1 appearing in the same region")
print("  This is 'translation invariance'")

Pooling Parameters

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: Pooling Parameters

Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""

import torch
import torch.nn as nn

# Comparison of different pooling configurations
input_data = torch.randn(1, 1, 8, 8)  # (batch, channels, height, width)

# Configuration 1: 2Ɨ2, stride=2 (standard)
pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
output1 = pool1(input_data)

# Configuration 2: 3Ɨ3, stride=2 (with overlap)
pool2 = nn.MaxPool2d(kernel_size=3, stride=2)
output2 = pool2(input_data)

# Configuration 3: 2Ɨ2, stride=1 (high overlap)
pool3 = nn.MaxPool2d(kernel_size=2, stride=1)
output3 = pool3(input_data)

# Output spatial size (no padding): floor((input_size - kernel_size) / stride) + 1
print(f"Input size: {input_data.shape}")
print(f"2Ɨ2, stride=2 output size: {output1.shape}")
print(f"3Ɨ3, stride=2 output size: {output2.shape}")
print(f"2Ɨ2, stride=1 output size: {output3.shape}")

# Calculate dimensionality reduction rate
reduction1 = (input_data.numel() - output1.numel()) / input_data.numel() * 100
reduction2 = (input_data.numel() - output2.numel()) / input_data.numel() * 100

print(f"\n2Ɨ2, stride=2 dimensionality reduction rate: {reduction1:.1f}%")
print(f"3Ɨ3, stride=2 dimensionality reduction rate: {reduction2:.1f}%")

Global Average Pooling

Global Average Pooling (GAP) is a special type of pooling that takes the average of the entire feature map. In modern CNNs, it is increasingly used as a replacement for the final fully connected layer.

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: Global Average Pooling (GAP)

Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""

import torch
import torch.nn as nn

# Input: (batch_size, channels, height, width)
input_features = torch.randn(2, 512, 7, 7)  # 2 samples, 512 channels, 7Ɨ7

# Global Average Pooling
gap = nn.AdaptiveAvgPool2d((1, 1))  # Specify output size as (1, 1)
output = gap(input_features)

print(f"Input size: {input_features.shape}")
print(f"GAP output size: {output.shape}")

# Flatten
output_flat = output.view(output.size(0), -1)
print(f"After flattening: {output_flat.shape}")

# Benefits of GAP
print("\nAdvantages of Global Average Pooling:")
print("1. Zero parameters (compared to fully connected layers)")
print("2. Independent of input size (works with any size)")
print("3. Reduced risk of overfitting")
print("4. Spatial average of each channel = intensity of the concept that channel represents")

2.2 Representative CNN Architectures

Evolution of CNNs: Historical Overview

graph LR
    A["LeNet-5<br/>1998"] --> B["AlexNet<br/>2012"]
    B --> C["VGGNet<br/>2014"]
    C --> D["GoogLeNet<br/>2014"]
    D --> E["ResNet<br/>2015"]
    E --> F["DenseNet<br/>2017"]
    F --> G["EfficientNet<br/>2019"]
    G --> H["Vision Transformer<br/>2020+"]
    style A fill:#e1f5ff
    style B fill:#b3e5fc
    style C fill:#81d4fa
    style D fill:#4fc3f7
    style E fill:#29b6f6
    style F fill:#03a9f4
    style G fill:#039be5
    style H fill:#0288d1

LeNet-5 (1998): The Origin of CNNs

LeNet-5 was developed by Yann LeCun for handwritten digit recognition (MNIST). It is the foundational architecture of modern CNNs.

| Layer | Output Size | Parameters |
|---|---|---|
| Input | 1Ɨ28Ɨ28 | - |
| Conv1 (5Ɨ5, 6ch) | 6Ɨ24Ɨ24 | 156 |
| AvgPool (2Ɨ2) | 6Ɨ12Ɨ12 | 0 |
| Conv2 (5Ɨ5, 16ch) | 16Ɨ8Ɨ8 | 2,416 |
| AvgPool (2Ɨ2) | 16Ɨ4Ɨ4 | 0 |
| FC1 (120) | 120 | 30,840 |
| FC2 (84) | 84 | 10,164 |
| FC3 (10) | 10 | 850 |

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: LeNet-5 implementation in PyTorch

Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super(LeNet5, self).__init__()

        # Feature extraction layers
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)    # 28Ɨ28 → 24Ɨ24
        self.pool1 = nn.AvgPool2d(kernel_size=2)        # 24Ɨ24 → 12Ɨ12
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)   # 12Ɨ12 → 8Ɨ8
        self.pool2 = nn.AvgPool2d(kernel_size=2)        # 8Ɨ8 → 4Ɨ4

        # Classification layers
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)

    def forward(self, x):
        # Feature extraction
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)

        # Flatten
        x = x.view(x.size(0), -1)

        # Classification
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x

# Create model and summary
model = LeNet5(num_classes=10)
print(model)

# Calculate parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Test run
x = torch.randn(1, 1, 28, 28)
output = model(x)
print(f"\nInput size: {x.shape}")
print(f"Output size: {output.shape}")

AlexNet (2012): The Dawn of Deep Learning

AlexNet demonstrated overwhelming performance in the 2012 ImageNet competition and sparked the deep learning boom.

Key features:

- ReLU activations instead of sigmoid/tanh, enabling much faster training
- Overlapping Max Pooling (3Ɨ3 windows with stride 2)
- Dropout (p=0.5) in the fully connected layers to reduce overfitting
- Large-scale training on GPUs with heavy data augmentation

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: AlexNet implementation in PyTorch

Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: 5-10 seconds
Dependencies: None
"""

import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()

        # Feature extraction layers
        self.features = nn.Sequential(
            # Conv1: 96 filters, 11Ɨ11, stride=4
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),

            # Conv2: 256 filters, 5Ɨ5
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),

            # Conv3: 384 filters, 3Ɨ3
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),

            # Conv4: 384 filters, 3Ɨ3
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),

            # Conv5: 256 filters, 3Ɨ3
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )

        # Classification layers
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # Flatten
        x = self.classifier(x)
        return x

# Create model
model = AlexNet(num_classes=1000)

# Parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"AlexNet total parameters: {total_params:,}")

# Check layer sizes
x = torch.randn(1, 3, 224, 224)
print(f"\nInput: {x.shape}")

for i, layer in enumerate(model.features):
    x = layer(x)
    if isinstance(layer, (nn.Conv2d, nn.MaxPool2d)):
        print(f"Layer {i} ({layer.__class__.__name__}): {x.shape}")

VGGNet (2014): The Aesthetics of Simplicity

VGGNet demonstrated the effectiveness of deep networks with a simple design that repeatedly uses small 3Ɨ3 filters.

Design principles:

- Use only small 3Ɨ3 convolutions, stacked to emulate larger receptive fields
- Halve the spatial resolution with 2Ɨ2 Max Pooling after each block
- Double the number of channels after each pooling stage (64 → 128 → 256 → 512)
- Keep the overall structure simple and uniform

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

import torch
import torch.nn as nn

class VGGBlock(nn.Module):
    """Basic VGG block: repeat Conv → ReLU"""
    def __init__(self, in_channels, out_channels, num_convs):
        super(VGGBlock, self).__init__()

        layers = []
        for i in range(num_convs):
            layers.append(nn.Conv2d(
                in_channels if i == 0 else out_channels,
                out_channels,
                kernel_size=3,
                padding=1
            ))
            layers.append(nn.ReLU(inplace=True))

        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super(VGG16, self).__init__()

        # Feature extraction part
        self.features = nn.Sequential(
            VGGBlock(3, 64, 2),      # Block 1: 64 channels, 2 convs
            nn.MaxPool2d(2, 2),

            VGGBlock(64, 128, 2),    # Block 2: 128 channels, 2 convs
            nn.MaxPool2d(2, 2),

            VGGBlock(128, 256, 3),   # Block 3: 256 channels, 3 convs
            nn.MaxPool2d(2, 2),

            VGGBlock(256, 512, 3),   # Block 4: 512 channels, 3 convs
            nn.MaxPool2d(2, 2),

            VGGBlock(512, 512, 3),   # Block 5: 512 channels, 3 convs
            nn.MaxPool2d(2, 2),
        )

        # Classification part
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

# Create model
model = VGG16(num_classes=1000)
total_params = sum(p.numel() for p in model.parameters())
print(f"VGG-16 total parameters: {total_params:,}")

# Why stack 3Ɨ3 filters twice?
print("\nTwo 3Ɨ3 filters vs one 5Ɨ5 filter:")
print("Receptive field: Same 5Ɨ5")
print("Parameters (per input-output channel pair): 3Ɨ3Ɨ2 = 18 < 5Ɨ5 = 25")
print("Non-linearity: 2 ReLUs > 1 ReLU (higher expressiveness)")

ResNet (2015): The Revolution of Skip Connections

ResNet (Residual Network) introduced Skip Connections (residual connections), enabling the training of very deep networks (over 100 layers).

Problem: In theory, adding layers should not hurt performance, but in practice plain deep networks become hard to train: gradients vanish and accuracy degrades as depth increases.

Solution: Introduce Residual Blocks

graph TD
    A["Input x"] --> B["Conv + ReLU"]
    B --> C["Conv"]
    A --> D["Identity<br/>(as is)"]
    C --> E["Addition +"]
    D --> E
    E --> F["ReLU"]
    F --> G["Output"]
    style A fill:#e1f5ff
    style D fill:#fff9c4
    style E fill:#c8e6c9
    style G fill:#4fc3f7

Mathematical expression:

$$ \mathbf{y} = F(\mathbf{x}) + \mathbf{x} $$

Where $F(\mathbf{x})$ is the residual function (the part to be learned) and $\mathbf{x}$ is the shortcut connection.

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic ResNet block"""
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResidualBlock, self).__init__()

        # Main path
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)

        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut path (adjustment when input/output channels differ)
        self.downsample = downsample

    def forward(self, x):
        identity = x

        # Main path
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        # Shortcut connection
        if self.downsample is not None:
            identity = self.downsample(x)

        # Addition
        out += identity
        out = F.relu(out)

        return out

class SimpleResNet(nn.Module):
    """Simplified ResNet (for CIFAR-10)"""
    def __init__(self, num_classes=10):
        super(SimpleResNet, self).__init__()

        # Initial layer
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)

        # Residual blocks
        self.layer1 = self._make_layer(64, 64, num_blocks=2, stride=1)
        self.layer2 = self._make_layer(64, 128, num_blocks=2, stride=2)
        self.layer3 = self._make_layer(128, 256, num_blocks=2, stride=2)

        # Classification layer
        self.avg_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(256, num_classes)

    def _make_layer(self, in_channels, out_channels, num_blocks, stride):
        downsample = None
        if stride != 1 or in_channels != out_channels:
            downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                         stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

        layers = []
        layers.append(ResidualBlock(in_channels, out_channels, stride, downsample))

        for _ in range(1, num_blocks):
            layers.append(ResidualBlock(out_channels, out_channels))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.relu(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)

        x = self.avg_pool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x

# Create model
model = SimpleResNet(num_classes=10)
total_params = sum(p.numel() for p in model.parameters())
print(f"ResNet total parameters: {total_params:,}")

# Visualize Skip Connection effect
x = torch.randn(1, 3, 32, 32)
output = model(x)
print(f"\nInput: {x.shape} → Output: {output.shape}")

Why Skip Connections are Effective

Theoretical Background of Skip Connections

In conventional networks, deepening causes the vanishing gradient problem:

$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x} $$

With deep layers, $\frac{\partial y}{\partial x}$ is multiplied many times, causing gradients to vanish.

With Skip Connections:

$$ \frac{\partial}{\partial x}(F(x) + x) = \frac{\partial F(x)}{\partial x} + 1 $$

Because of the "+1" term, gradients always flow!

Furthermore, each block only has to learn the residual $F(\mathbf{x})$, i.e. the deviation from the identity mapping; when the identity mapping is already close to optimal, pushing $F(\mathbf{x})$ toward zero is much easier than learning the full mapping from scratch.
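
This effect can be checked numerically. The sketch below (an illustrative comparison, not part of the chapter's reference code) stacks 30 small layers with and without identity skip connections and compares the gradient norm that reaches the input after one backward pass.

import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 30, 64

# Two identical stacks of Linear + Tanh blocks
plain_blocks = nn.ModuleList([nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)])
res_blocks = nn.ModuleList([nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)])

def input_gradient_norm(blocks, use_skip):
    x = torch.randn(8, dim, requires_grad=True)
    h = x
    for block in blocks:
        h = h + block(h) if use_skip else block(h)  # skip connection adds the identity path
    h.sum().backward()
    return x.grad.norm().item()

print(f"Gradient norm at input, plain stack ({depth} layers):    {input_gradient_norm(plain_blocks, False):.6f}")
print(f"Gradient norm at input, residual stack ({depth} layers): {input_gradient_norm(res_blocks, True):.6f}")
print("→ The '+1' path keeps gradients from shrinking layer after layer")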


2.3 Batch Normalization

What is Batch Normalization?

Batch Normalization (BN) is a technique that normalizes the output of each mini-batch to stabilize learning.

Normalize the output of each layer to mean 0 and variance 1 across the mini-batch:

$$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$ $$ y_i = \gamma \hat{x}_i + \beta $$

Where:

- $\mu_B$: mean of the mini-batch
- $\sigma_B^2$: variance of the mini-batch
- $\epsilon$: a small constant for numerical stability
- $\gamma$, $\beta$: learnable scale and shift parameters
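
As a quick sanity check, the following sketch (illustrative; it assumes the default $\epsilon$ and the untrained initial values $\gamma = 1$, $\beta = 0$) computes the formula by hand and compares the result with nn.BatchNorm2d in training mode.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 3, 8, 8) * 2.0 + 3.0   # deliberately shifted, non-normalized input

bn = nn.BatchNorm2d(3)   # gamma starts at 1, beta at 0
bn.train()
out_bn = bn(x)

# Manual computation: per-channel mean and (biased) variance over (batch, height, width)
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
out_manual = (x - mu) / torch.sqrt(var + bn.eps)

print(f"Max difference between manual formula and nn.BatchNorm2d: {(out_bn - out_manual).abs().max().item():.2e}")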

Effects of Batch Normalization

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlockWithoutBN(nn.Module):
    """Without Batch Normalization"""
    def __init__(self, in_channels, out_channels):
        super(ConvBlockWithoutBN, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x))

class ConvBlockWithBN(nn.Module):
    """With Batch Normalization"""
    def __init__(self, in_channels, out_channels):
        super(ConvBlockWithBN, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

# Comparison experiment
x = torch.randn(32, 3, 32, 32)  # Batch size 32

# Without BN
block_without_bn = ConvBlockWithoutBN(3, 64)
output_without_bn = block_without_bn(x)

# With BN
block_with_bn = ConvBlockWithBN(3, 64)
output_with_bn = block_with_bn(x)

print("=== Effect of Batch Normalization ===")
print(f"Without BN - Mean: {output_without_bn.mean():.4f}, Std: {output_without_bn.std():.4f}")
print(f"With BN - Mean: {output_with_bn.mean():.4f}, Std: {output_with_bn.std():.4f}")

# Visualize distribution
print("\nStatistics per channel (with BN):")
for i in range(min(5, output_with_bn.size(1))):
    channel_data = output_with_bn[:, i, :, :]
    print(f"  Channel {i}: Mean={channel_data.mean():.4f}, Std={channel_data.std():.4f}")

Placement of Batch Normalization

BN is typically placed in the order: Conv → BN → Activation:

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: Conv → BN → ReLU ordering

Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""

import torch.nn as nn

# Recommended order
class StandardConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(StandardConvBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)    # 1. Convolution
        x = self.bn(x)      # 2. Batch Normalization
        x = self.relu(x)    # 3. Activation function
        return x

# Note: Specify bias=False in Conv
# Reason: BN subtracts the per-channel mean, so any constant bias added by the conv
# would be cancelled anyway; BN's learnable beta plays that role instead
block = StandardConvBlock(3, 64)
print("Conv → BN → ReLU order")
print(block)

2.4 Overfitting Prevention with Dropout

What is Dropout?

Dropout is a regularization technique that prevents overfitting by randomly disabling (dropping out) neurons during training.

graph TD
    A["During Training"] --> B["All Neurons"]
    B --> C["Drop 50% Randomly"]
    C --> D["Learn with Remaining 50%"]
    E["During Testing"] --> F["Use All Neurons"]
    F --> G["Scale Weights by 0.5"]
    style A fill:#e1f5ff
    style E fill:#c8e6c9
    style D fill:#b3e5fc
    style G fill:#81d4fa

Why is Dropout Effective?

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: Why is Dropout Effective?

Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import torch
import torch.nn as nn
import torch.nn.functional as F

# Verify Dropout behavior
x = torch.ones(4, 10)  # Tensor of all 1s

dropout = nn.Dropout(p=0.5)  # 50% probability of dropping each element
# Note: PyTorch uses 'inverted dropout': surviving values are scaled by 1/(1-p) = 2.0
# during training, so no rescaling is needed at test time

# Training mode
dropout.train()
print("=== Training Mode (Dropout Enabled) ===")
for i in range(3):
    output = dropout(x)
    print(f"Trial {i+1}: {output[0, :5].numpy()}")  # Display first 5 elements

# Evaluation mode
dropout.eval()
print("\n=== Evaluation Mode (Dropout Disabled) ===")
output = dropout(x)
print(f"Output: {output[0, :5].numpy()}")

How to Use Dropout in CNNs

In CNNs, Dropout is typically placed before fully connected layers. It's not commonly used in convolutional layers.

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: CNN with Dropout before the fully connected layers

Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import torch
import torch.nn as nn

class CNNWithDropout(nn.Module):
    def __init__(self, num_classes=10):
        super(CNNWithDropout, self).__init__()

        # Convolutional layers (without Dropout)
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
        )

        # Fully connected layers (with Dropout)
        self.classifier = nn.Sequential(
            nn.Linear(128 * 8 * 8, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),  # Dropout

            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),  # Dropout

            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

model = CNNWithDropout(num_classes=10)
print(model)

# Experiment with Dropout effect
model.train()
x = torch.randn(2, 3, 32, 32)
output1 = model(x)
output2 = model(x)
print(f"\nTraining mode: Same input produces different outputs = {not torch.allclose(output1, output2)}")

model.eval()
output3 = model(x)
output4 = model(x)
print(f"Evaluation mode: Same input produces same output = {torch.allclose(output3, output4)}")

Dropout vs Batch Normalization

| Item | Dropout | Batch Normalization |
|---|---|---|
| Main Purpose | Overfitting prevention | Learning stabilization & acceleration |
| Where Used | Fully connected layers | Convolutional layers |
| Train/Test | Different behavior | Different behavior |
| Combined Use | Possible (though BN may make it unnecessary) | - |

Modern Best Practice: Use Batch Normalization in convolutional layers and Dropout in fully connected layers (as needed). However, due to BN's regularization effect, Dropout often becomes unnecessary.
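
The "different behavior" row in the table above can be observed directly for Batch Normalization as well: in train() mode it normalizes with the statistics of the current mini-batch, while in eval() mode it uses the running statistics accumulated during training. A minimal sketch (illustrative only):

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(8)
x = torch.randn(32, 8, 4, 4) * 3.0 + 5.0   # deliberately non-normalized input

bn.train()
out_train = bn(x)   # uses this batch's mean/variance (and updates the running statistics)

bn.eval()
out_eval = bn(x)    # uses the running statistics, still close to their initial values here

print(f"train() output: mean={out_train.mean():.3f}, std={out_train.std():.3f}")
print(f"eval()  output: mean={out_eval.mean():.3f}, std={out_eval.std():.3f}")
print("→ Only train() mode fully normalizes this batch; eval() relies on accumulated statistics")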


2.5 Practical: CIFAR-10 Image Classification

CIFAR-10 Dataset

CIFAR-10 is a dataset of 60,000 32Ɨ32 color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), split into 50,000 training and 10,000 test images.

Complete CNN Classifier Implementation

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0

"""
Example: Complete CNN Classifier Implementation

Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import numpy as np

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Data augmentation and normalization
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

# Load dataset
train_dataset = datasets.CIFAR10(root='./data', train=True,
                                 download=True, transform=transform_train)
test_dataset = datasets.CIFAR10(root='./data', train=False,
                                download=True, transform=transform_test)

train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False, num_workers=2)

print(f"Training data: {len(train_dataset)} images")
print(f"Test data: {len(test_dataset)} images")

# Class names
classes = ('plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')

Modern CNN Architecture

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

import torch.nn as nn
import torch.nn.functional as F

class CIFAR10Net(nn.Module):
    """Modern CNN for CIFAR-10"""
    def __init__(self, num_classes=10):
        super(CIFAR10Net, self).__init__()

        # Block 1
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)

        # Block 2
        self.conv2 = nn.Conv2d(64, 128, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(128)

        # Block 3
        self.conv3 = nn.Conv2d(128, 256, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(256)
        self.conv4 = nn.Conv2d(256, 256, 3, padding=1, bias=False)
        self.bn4 = nn.BatchNorm2d(256)

        # Block 4
        self.conv5 = nn.Conv2d(256, 512, 3, padding=1, bias=False)
        self.bn5 = nn.BatchNorm2d(512)
        self.conv6 = nn.Conv2d(512, 512, 3, padding=1, bias=False)
        self.bn6 = nn.BatchNorm2d(512)

        # Global Average Pooling
        self.gap = nn.AdaptiveAvgPool2d((1, 1))

        # Classification layer
        self.fc = nn.Linear(512, num_classes)

        # Dropout
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # Block 1
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.max_pool2d(x, 2)

        # Block 2
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.max_pool2d(x, 2)

        # Block 3
        x = F.relu(self.bn3(self.conv3(x)))
        x = F.relu(self.bn4(self.conv4(x)))
        x = F.max_pool2d(x, 2)

        # Block 4
        x = F.relu(self.bn5(self.conv5(x)))
        x = F.relu(self.bn6(self.conv6(x)))

        # Global Average Pooling
        x = self.gap(x)
        x = x.view(x.size(0), -1)

        # Dropout + classification
        x = self.dropout(x)
        x = self.fc(x)

        return x

# Create model
model = CIFAR10Net(num_classes=10).to(device)
print(model)

# Parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {total_params:,}")

Training Loop

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: Training Loop

Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import torch.optim as optim

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)

        # Forward
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward
        loss.backward()
        optimizer.step()

        # Statistics
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    epoch_loss = running_loss / len(loader)
    epoch_acc = 100. * correct / total
    return epoch_loss, epoch_acc

def test_epoch(model, loader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, labels)

            running_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

    epoch_loss = running_loss / len(loader)
    epoch_acc = 100. * correct / total
    return epoch_loss, epoch_acc

# Execute training
num_epochs = 50
best_acc = 0

print("\nStarting training...")
for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = test_epoch(model, test_loader, criterion, device)

    scheduler.step()

    if (epoch + 1) % 5 == 0 or epoch == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}]")
        print(f"  Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")
        print(f"  Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")

    # Save best model
    if test_acc > best_acc:
        best_acc = test_acc
        torch.save(model.state_dict(), 'best_cifar10_model.pth')

print(f"\nTraining complete! Best accuracy: {best_acc:.2f}%")

Prediction and Visualization

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

import matplotlib.pyplot as plt
import numpy as np

def imshow(img, title=None):
    """Helper function for image display"""
    img = img.numpy().transpose((1, 2, 0))
    mean = np.array([0.4914, 0.4822, 0.4465])
    std = np.array([0.2023, 0.1994, 0.2010])
    img = std * img + mean
    img = np.clip(img, 0, 1)
    plt.imshow(img)
    if title:
        plt.title(title)
    plt.axis('off')

# Get samples from test data
dataiter = iter(test_loader)
images, labels = next(dataiter)
images, labels = images.to(device), labels.to(device)

# Prediction
model.eval()
with torch.no_grad():
    outputs = model(images)
    _, predicted = outputs.max(1)

# Display first 8 images
fig, axes = plt.subplots(2, 4, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    plt.sca(ax)  # make this subplot the current axes before drawing
    imshow(images[i].cpu(), title=f"True: {classes[labels[i]]}\nPred: {classes[predicted[i]]}")

plt.tight_layout()
plt.savefig('cifar10_predictions.png', dpi=150, bbox_inches='tight')
print("Saved prediction results: cifar10_predictions.png")

2.6 Overview of Modern Architectures

EfficientNet (2019): Efficient Scaling

EfficientNet proposed a method to scale network depth, width, and resolution in a balanced way.

Compound Scaling:

$$ \text{depth} = \alpha^\phi, \quad \text{width} = \beta^\phi, \quad \text{resolution} = \gamma^\phi $$

Constraint: $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$
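
To make the rule concrete, the sketch below plugs in the base coefficients reported for EfficientNet-B0 (α = 1.2, β = 1.1, γ = 1.15; treat the exact values as an assumption here) and shows how the multipliers grow with the compound coefficient φ.

# Compound scaling: depth Ɨ alpha^phi, width Ɨ beta^phi, resolution Ɨ gamma^phi
alpha, beta, gamma = 1.2, 1.1, 1.15   # assumed EfficientNet base coefficients

print(f"alpha * beta^2 * gamma^2 = {alpha * beta**2 * gamma**2:.3f}  (constraint: approximately 2)")
for phi in range(4):
    d, w, r = alpha**phi, beta**phi, gamma**phi
    flops = d * w**2 * r**2            # FLOPs grow roughly with this product, i.e. about 2^phi
    print(f"phi={phi}: depth Ɨ{d:.2f}, width Ɨ{w:.2f}, resolution Ɨ{r:.2f}, ~FLOPs Ɨ{flops:.2f}")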

Vision Transformer (2020+): Surpassing CNNs

Vision Transformer (ViT) is a new approach that divides images into patches and processes them with Transformers.

graph LR
    A["Image<br/>224Ɨ224"] --> B["Patch Division<br/>16Ɨ16 patches"]
    B --> C["Linear Projection"]
    C --> D["Transformer Encoder"]
    D --> E["Classification Head"]
    E --> F["Class Prediction"]
    style A fill:#e1f5ff
    style D fill:#b3e5fc
    style F fill:#4fc3f7

Features:

- The image is split into fixed-size patches (e.g., 16Ɨ16), each treated as a token
- Self-attention gives every layer a global receptive field, unlike the local receptive field of convolutions
- Few built-in inductive biases, so large-scale pre-training data is typically needed to outperform CNNs
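
The patch division and linear projection steps are commonly implemented with a single strided convolution, as in the sketch below (the embedding dimension 768 corresponds to ViT-Base and is an assumption, not something specified in this chapter).

import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)             # one 224Ɨ224 RGB image
patch_size, embed_dim = 16, 768                # 16Ɨ16 patches; 768 = ViT-Base embedding size (assumed)

# A convolution with kernel = stride = patch size projects each patch to one embedding vector
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

tokens = patch_embed(img)                      # (1, 768, 14, 14): one vector per patch
tokens = tokens.flatten(2).transpose(1, 2)     # (1, 196, 768): a sequence of 196 patch tokens

print(f"Number of patch tokens: {tokens.shape[1]}")   # (224 / 16)^2 = 196
print(f"Embedding dimension:    {tokens.shape[2]}")   # 768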

Architecture Selection Guidelines

| Architecture | Features | Recommended Cases |
|---|---|---|
| LeNet-5 | Simple, lightweight | MNIST, educational purposes |
| VGGNet | Easy-to-understand structure | Transfer learning base, education |
| ResNet | Deep network, stable training | General image classification, standard choice |
| EfficientNet | Efficient, high accuracy | Resource constraints, mobile |
| Vision Transformer | State-of-the-art, large-scale data | Large datasets, research |

Exercises

Exercise 1: Effects of Pooling Layers

Apply Max Pooling and Average Pooling to the same input and verify the differences in output. For what types of image features is each advantageous?

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: Exercise 1 - Max Pooling vs Average Pooling

Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""

import torch
import torch.nn as nn

# Create feature map with edges
edge_feature = torch.tensor([[
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0]
]], dtype=torch.float32).unsqueeze(0).unsqueeze(0)

# Exercise: Apply Max Pooling and Average Pooling and compare results
# Hint: Use nn.MaxPool2d and nn.AvgPool2d

Exercise 2: Skip Connection in Residual Block

Compare Residual blocks with and without Skip Connections to verify differences in gradient flow.

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: Exercise 2 - Residual blocks with and without Skip Connections

Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""

import torch
import torch.nn as nn

# Exercise: Implement blocks with and without Skip Connection
# Hint: Same structure, only difference is presence of Skip Connection
# To compare gradients, check grad attribute after backward()

Exercise 3: Effect of Batch Normalization

Train the same network with and without Batch Normalization and compare convergence speed and final accuracy.

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0

"""
Example: Exercise 3 - Effect of Batch Normalization on training

Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Exercise: Create models with and without BN
# Exercise: Train on MNIST or CIFAR-10
# Exercise: Compare training curves

Exercise 4: Regularization Effect of Dropout

Vary Dropout probability (p=0.0, 0.3, 0.5, 0.7) and investigate the impact on overfitting.

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: Exercise 4 - Regularization effect of Dropout

Purpose: Demonstrate machine learning model training and evaluation
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import torch
import torch.nn as nn

# Exercise: Train models with different Dropout probabilities
# Exercise: Compare difference between training and test errors (degree of overfitting)
# Which Dropout probability is optimal?

Exercise 5: Architecture Comparison on CIFAR-10

Compare three architectures on CIFAR-10: LeNet-5, VGG-style, and ResNet-style.

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

"""
Example: Exercise 5 - Architecture comparison on CIFAR-10

Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import torch
import torch.nn as nn

# Exercise: Implement three different architectures
# Exercise: Compare performance with same training settings
# Exercise: Record parameter count, training time, and accuracy

# Evaluation metrics:
# - Test accuracy
# - Parameter count
# - Training time (per epoch)
# - Epochs until convergence

Summary

In this chapter, we learned about pooling layers and representative CNN architectures.

Key Points

- Pooling layers downsample feature maps, providing dimensionality reduction and translation invariance; Global Average Pooling can replace the final fully connected layers
- Representative architectures evolved from LeNet-5 and AlexNet through VGGNet to ResNet, each introducing new design principles
- Skip connections keep gradients flowing through very deep networks, which is why ResNet-style models remain a standard choice
- Batch Normalization (Conv → BN → ReLU) stabilizes and accelerates training; Dropout regularizes the fully connected layers
- These components combine into a compact CNN that can be trained end-to-end on CIFAR-10

Next Steps

In the next chapter, we will learn about Transfer Learning and Fine-tuning. We will master practical techniques for building high-accuracy models with limited data by leveraging pre-trained models.
