Chapter 4: Segmentation

This chapter covers Segmentation. You will learn types of segmentation, U-Net architecture, and Utilize advanced architectures such as DeepLab.

Learning Objectives

By reading this chapter, you will be able to:

✅ Understand types of segmentation and evaluation metrics
✅ Implement U-Net architecture and its applications
✅ Utilize advanced architectures such as DeepLab and PSPNet
✅ Implement Instance Segmentation using Mask R-CNN
✅ Complete practical segmentation projects
✅ Master the Detectron2 framework

4.1 Types of Segmentation

What is Segmentation?

Image Segmentation is the task of assigning a class label to each pixel in an image. While object detection identifies objects with rectangular bounding boxes, segmentation identifies precise boundaries at the pixel level.

"Segmentation is a technology that divides an image into meaningful regions and gives meaning to each pixel."

1. Semantic Segmentation

Semantic Segmentation classifies each pixel into classes but does not distinguish between different instances of the same class.

Feature	Description
Purpose	Classify each pixel
Output	Class label map
Instance Distinction	None
Applications	Autonomous driving, medical imaging, satellite imagery

2. Instance Segmentation

Instance Segmentation distinguishes between different object instances of the same class.

Feature	Description
Purpose	Separate each instance
Output	Mask per instance
Instance Distinction	Yes
Applications	Robotics, image editing, cell counting

3. Panoptic Segmentation

Panoptic Segmentation is an integrated task combining Semantic Segmentation and Instance Segmentation.

Feature	Description
Purpose	Complete understanding of entire scene
Output	Class + Instance ID for all pixels
Target	Things (individual objects) + Stuff (background regions)
Applications	Environmental understanding in autonomous driving

graph LR A[Image Segmentation] --> B[Semantic Segmentation] A --> C[Instance Segmentation] A --> D[Panoptic Segmentation] B --> E[Classify all pixels
No instance distinction] C --> F[Separate instances
Individual masks] D --> G[Semantic + Instance
Complete understanding] style A fill:#e3f2fd style B fill:#fff3e0 style C fill:#f3e5f5 style D fill:#e8f5e9

Evaluation Metrics

1. IoU (Intersection over Union)

IoU measures the overlap between predicted and ground truth regions.

$$ \text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \frac{TP}{TP + FP + FN} $$

TP: True Positive (correctly predicted pixels)
FP: False Positive (incorrectly predicted pixels)
FN: False Negative (missed pixels)

2. Dice Coefficient (F1-Score)

The Dice coefficient is widely used in medical image segmentation.

$$ \text{Dice} = \frac{2 \times TP}{2 \times TP + FP + FN} $$

3. Mean IoU (mIoU)

The average IoU across all classes.

$$ \text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \text{IoU}_i $$

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

import numpy as np
import matplotlib.pyplot as plt

def calculate_iou(pred_mask, true_mask):
    """
    Calculate IoU

    Args:
        pred_mask: Predicted mask (H, W)
        true_mask: Ground truth mask (H, W)

    Returns:
        float: IoU value
    """
    intersection = np.logical_and(pred_mask, true_mask).sum()
    union = np.logical_or(pred_mask, true_mask).sum()

    if union == 0:
        return 0.0

    iou = intersection / union
    return iou

def calculate_dice(pred_mask, true_mask):
    """
    Calculate Dice coefficient

    Args:
        pred_mask: Predicted mask (H, W)
        true_mask: Ground truth mask (H, W)

    Returns:
        float: Dice coefficient
    """
    intersection = np.logical_and(pred_mask, true_mask).sum()

    dice = (2.0 * intersection) / (pred_mask.sum() + true_mask.sum())
    return dice

# Create sample masks
np.random.seed(42)
H, W = 100, 100

# Ground truth mask (circle)
y, x = np.ogrid[:H, :W]
true_mask = ((x - 50)**2 + (y - 50)**2) <= 20**2

# Predicted mask (slightly shifted circle)
pred_mask = ((x - 55)**2 + (y - 55)**2) <= 20**2

# Calculate IoU and Dice coefficient
iou = calculate_iou(pred_mask, true_mask)
dice = calculate_dice(pred_mask, true_mask)

print("=== Segmentation Evaluation Metrics ===")
print(f"IoU: {iou:.4f}")
print(f"Dice Coefficient: {dice:.4f}")

# Visualization
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

axes[0].imshow(true_mask, cmap='gray')
axes[0].set_title('Ground Truth Mask', fontsize=12)
axes[0].axis('off')

axes[1].imshow(pred_mask, cmap='gray')
axes[1].set_title('Predicted Mask', fontsize=12)
axes[1].axis('off')

# Intersection
intersection = np.logical_and(pred_mask, true_mask)
axes[2].imshow(intersection, cmap='Greens')
axes[2].set_title(f'Intersection\nArea: {intersection.sum()}', fontsize=12)
axes[2].axis('off')

# Union
union = np.logical_or(pred_mask, true_mask)
axes[3].imshow(union, cmap='Blues')
axes[3].set_title(f'Union\nArea: {union.sum()}\nIoU: {iou:.4f}', fontsize=12)
axes[3].axis('off')

plt.tight_layout()
plt.show()

Output:

=== Segmentation Evaluation Metrics ===
IoU: 0.6667
Dice Coefficient: 0.8000

Important: IoU and Dice coefficient are related, but the Dice coefficient is a more lenient metric (higher values for the same overlap).

4.2 U-Net Architecture

U-Net Overview

U-Net is an architecture proposed by Ronneberger et al. in 2015 for medical image segmentation. It achieves high-precision segmentation through its Encoder-Decoder structure and characteristic Skip Connections.

U-Net Features

Feature	Description
Encoder-Decoder	Downsampling → Upsampling
Skip Connections	Preserve high-resolution information
Data Efficiency	High accuracy with small datasets
Symmetric Structure	U-shaped architecture

U-Net Structure

graph TB A[Input Image
572x572] --> B[Conv + ReLU
568x568x64] B --> C[Conv + ReLU
564x564x64] C --> D[MaxPool
282x282x64] D --> E[Conv + ReLU
280x280x128] E --> F[Conv + ReLU
276x276x128] F --> G[MaxPool
138x138x128] G --> H[Bottleneck
Deepest Layer] H --> I[UpConv
276x276x128] I --> J[Concat
Skip Connection] F --> J J --> K[Conv + ReLU
272x272x128] K --> L[Conv + ReLU
268x268x64] L --> M[UpConv
536x536x64] M --> N[Concat
Skip Connection] C --> N N --> O[Conv + ReLU
388x388x64] O --> P[Output
388x388xC] style A fill:#e3f2fd style H fill:#ffebee style P fill:#c8e6c9 style J fill:#fff3e0 style N fill:#fff3e0

Complete U-Net Implementation

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleConv(nn.Module):
    """(Conv2d => BatchNorm => ReLU) x 2"""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.double_conv(x)

class Down(nn.Module):
    """Downscaling with maxpool then double conv"""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.maxpool_conv = nn.Sequential(
            nn.MaxPool2d(2),
            DoubleConv(in_channels, out_channels)
        )

    def forward(self, x):
        return self.maxpool_conv(x)

class Up(nn.Module):
    """Upscaling then double conv"""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_channels, in_channels // 2,
                                     kernel_size=2, stride=2)
        self.conv = DoubleConv(in_channels, out_channels)

    def forward(self, x1, x2):
        x1 = self.up(x1)

        # Adjust size for concatenation with skip connection
        diffY = x2.size()[2] - x1.size()[2]
        diffX = x2.size()[3] - x1.size()[3]

        x1 = F.pad(x1, [diffX // 2, diffX - diffX // 2,
                        diffY // 2, diffY - diffY // 2])

        # Concatenate with skip connection
        x = torch.cat([x2, x1], dim=1)
        return self.conv(x)

class UNet(nn.Module):
    """
    Complete U-Net model

    Args:
        n_channels: Number of input channels
        n_classes: Number of output classes
    """

    def __init__(self, n_channels=3, n_classes=1):
        super(UNet, self).__init__()
        self.n_channels = n_channels
        self.n_classes = n_classes

        # Encoder
        self.inc = DoubleConv(n_channels, 64)
        self.down1 = Down(64, 128)
        self.down2 = Down(128, 256)
        self.down3 = Down(256, 512)
        self.down4 = Down(512, 1024)

        # Decoder
        self.up1 = Up(1024, 512)
        self.up2 = Up(512, 256)
        self.up3 = Up(256, 128)
        self.up4 = Up(128, 64)

        # Output layer
        self.outc = nn.Conv2d(64, n_classes, kernel_size=1)

    def forward(self, x):
        # Encoder
        x1 = self.inc(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)

        # Decoder with skip connections
        x = self.up1(x5, x4)
        x = self.up2(x, x3)
        x = self.up3(x, x2)
        x = self.up4(x, x1)

        # Output
        logits = self.outc(x)
        return logits

# Model verification
model = UNet(n_channels=3, n_classes=2)

# Dummy input
dummy_input = torch.randn(1, 3, 256, 256)
output = model(dummy_input)

print("=== U-Net Model Structure ===")
print(f"Input size: {dummy_input.shape}")
print(f"Output size: {output.shape}")
print(f"\nNumber of parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

# Model summary (more detailed)
print("\n=== Layer Structure ===")
for name, module in model.named_children():
    print(f"{name}: {module.__class__.__name__}")

Output:

=== U-Net Model Structure ===
Input size: torch.Size([1, 3, 256, 256])
Output size: torch.Size([1, 2, 256, 256])

Number of parameters: 31,042,434
Trainable parameters: 31,042,434

=== Layer Structure ===
inc: DoubleConv
down1: Down
down2: Down
down3: Down
down4: Down
up1: Up
up2: Up
up3: Up
up4: Up
outc: Conv2d

Application to Medical Image Segmentation

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt

class DiceLoss(nn.Module):
    """Dice Loss for segmentation"""

    def __init__(self, smooth=1.0):
        super(DiceLoss, self).__init__()
        self.smooth = smooth

    def forward(self, pred, target):
        pred = torch.sigmoid(pred)

        # Flatten
        pred_flat = pred.view(-1)
        target_flat = target.view(-1)

        intersection = (pred_flat * target_flat).sum()

        dice = (2. * intersection + self.smooth) / (
            pred_flat.sum() + target_flat.sum() + self.smooth
        )

        return 1 - dice

# Simple dataset (for demo)
class SimpleSegmentationDataset(Dataset):
    """Simple segmentation dataset"""

    def __init__(self, num_samples=100, img_size=256):
        self.num_samples = num_samples
        self.img_size = img_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Generate random image and mask (in practice, use data loaders)
        np.random.seed(idx)

        # Image (grayscale → RGB)
        image = np.random.rand(self.img_size, self.img_size).astype(np.float32)
        image = np.stack([image] * 3, axis=0)  # (3, H, W)

        # Mask (place a circle)
        mask = np.zeros((self.img_size, self.img_size), dtype=np.float32)
        center_x, center_y = np.random.randint(50, 206, 2)
        radius = np.random.randint(20, 40)

        y, x = np.ogrid[:self.img_size, :self.img_size]
        mask_circle = ((x - center_x)**2 + (y - center_y)**2) <= radius**2
        mask[mask_circle] = 1.0
        mask = mask[np.newaxis, ...]  # (1, H, W)

        return torch.from_numpy(image), torch.from_numpy(mask)

# Dataset and dataloader
dataset = SimpleSegmentationDataset(num_samples=100)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# Model, loss, optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = UNet(n_channels=3, n_classes=1).to(device)
criterion = DiceLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

print("=== Training Started ===")
print(f"Device: {device}")
print(f"Dataset size: {len(dataset)}")

# Simple training loop (for demo)
num_epochs = 3
model.train()

for epoch in range(num_epochs):
    epoch_loss = 0.0

    for batch_idx, (images, masks) in enumerate(dataloader):
        images = images.to(device)
        masks = masks.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, masks)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    avg_loss = epoch_loss / len(dataloader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

print("\n=== Training Complete ===")

# Inference visualization
model.eval()
with torch.no_grad():
    sample_image, sample_mask = dataset[0]
    sample_image = sample_image.unsqueeze(0).to(device)

    pred_mask = model(sample_image)
    pred_mask = torch.sigmoid(pred_mask)
    pred_mask = pred_mask.cpu().squeeze().numpy()

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].imshow(sample_image.cpu().squeeze().permute(1, 2, 0).numpy()[:, :, 0], cmap='gray')
axes[0].set_title('Input Image', fontsize=14)
axes[0].axis('off')

axes[1].imshow(sample_mask.squeeze().numpy(), cmap='viridis')
axes[1].set_title('Ground Truth Mask', fontsize=14)
axes[1].axis('off')

axes[2].imshow(pred_mask, cmap='viridis')
axes[2].set_title('Predicted Mask', fontsize=14)
axes[2].axis('off')

plt.tight_layout()
plt.show()

Output:

=== Training Started ===
Device: cpu
Dataset size: 100
Epoch [1/3], Loss: 0.3245
Epoch [2/3], Loss: 0.2156
Epoch [3/3], Loss: 0.1487

=== Training Complete ===

Important: U-Net can achieve high-precision segmentation even with small datasets. It is especially effective in medical image analysis.

4.3 Advanced Architectures

1. DeepLab (v3/v3+)

DeepLab is an advanced segmentation model using Atrous Convolution (dilated convolution) and ASPP (Atrous Spatial Pyramid Pooling).

Key Technologies

Technology	Description
Atrous Convolution	Expand receptive field while maintaining resolution
ASPP	Integrate multi-scale features
Encoder-Decoder	Improve boundary accuracy (v3+)

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0

import torch
import torch.nn as nn
import torchvision.models.segmentation as segmentation

class DeepLabV3Wrapper:
    """
    DeepLabV3 wrapper class
    """

    def __init__(self, num_classes=21, pretrained=True):
        """
        Args:
            num_classes: Number of classes
            pretrained: Use pretrained model
        """
        # Load DeepLabV3 model
        if pretrained:
            self.model = segmentation.deeplabv3_resnet50(
                pretrained=True,
                progress=True
            )

            # Customize output layer
            self.model.classifier[4] = nn.Conv2d(
                256, num_classes, kernel_size=1
            )
        else:
            self.model = segmentation.deeplabv3_resnet50(
                pretrained=False,
                num_classes=num_classes
            )

        self.num_classes = num_classes

    def get_model(self):
        return self.model

    def predict(self, image, device='cpu'):
        """
        Perform prediction

        Args:
            image: Input image (C, H, W) or (B, C, H, W)
            device: Device

        Returns:
            Predicted mask
        """
        self.model.eval()
        self.model.to(device)

        if len(image.shape) == 3:
            image = image.unsqueeze(0)

        image = image.to(device)

        with torch.no_grad():
            output = self.model(image)['out']
            pred = torch.argmax(output, dim=1)

        return pred.cpu()

# DeepLabV3 model usage example
print("=== DeepLabV3 Model ===")

# Model initialization
deeplab_wrapper = DeepLabV3Wrapper(num_classes=21, pretrained=True)
model = deeplab_wrapper.get_model()

# Dummy input
dummy_input = torch.randn(2, 3, 256, 256)
output = model(dummy_input)['out']

print(f"Input size: {dummy_input.shape}")
print(f"Output size: {output.shape}")
print(f"Number of classes: {output.shape[1]}")

# Number of parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"\nNumber of parameters: {total_params:,}")

# Prediction demo
pred_mask = deeplab_wrapper.predict(dummy_input[0])
print(f"\nPredicted mask shape: {pred_mask.shape}")
print(f"Unique classes: {torch.unique(pred_mask).tolist()}")

Output:

=== DeepLabV3 Model ===
Input size: torch.Size([2, 3, 256, 256])
Output size: torch.Size([2, 21, 256, 256])
Number of classes: 21

Number of parameters: 39,639,617

Predicted mask shape: torch.Size([1, 256, 256])
Unique classes: [0, 2, 5, 8, 12, 15]

2. PSPNet (Pyramid Scene Parsing Network)

PSPNet uses a Pyramid Pooling Module to integrate contextual information at different scales.

Key Technologies

Technology	Description
Pyramid Pooling	Grid pooling at 1x1, 2x2, 3x3, 6x6
Global Context	Leverage information from entire image
Auxiliary Loss	Stabilize training

3. HRNet (High-Resolution Network)

HRNet is an architecture that maintains high-resolution representations during training.

Key Technologies

Technology	Description
Parallel Branches	Process multiple resolutions simultaneously
Iterative Fusion	Exchange information between resolutions
High-Resolution Maintenance	Detailed boundary detection

4. Transformer-based Segmentation (SegFormer)

SegFormer is a segmentation model based on Vision Transformers.

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

import torch
import torch.nn as nn

class SegFormerWrapper:
    """
    SegFormer-style Transformer-based segmentation
    (Simplified demo version)
    """

    def __init__(self, num_classes=19):
        self.num_classes = num_classes

        # In practice, use transformers library
        # from transformers import SegformerForSemanticSegmentation
        # self.model = SegformerForSemanticSegmentation.from_pretrained(
        #     "nvidia/segformer-b0-finetuned-ade-512-512",
        #     num_labels=num_classes
        # )

        print("=== SegFormer Features ===")
        print("1. Hierarchical Transformer Encoder")
        print("2. Lightweight MLP Decoder")
        print("3. Efficient Self-Attention")
        print("4. Multi-scale Feature Fusion")

    def describe_architecture(self):
        print("\n=== SegFormer Architecture ===")
        print("Encoder:")
        print("  - Patch Embedding (Overlapping)")
        print("  - Efficient Self-Attention")
        print("  - Mix-FFN (No Position Encoding needed)")
        print("  - Hierarchical Structure (4 stages)")
        print("\nDecoder:")
        print("  - Lightweight All-MLP")
        print("  - Multi-level Feature Aggregation")
        print("  - Simple Upsampling")

# SegFormer description
segformer_wrapper = SegFormerWrapper(num_classes=19)
segformer_wrapper.describe_architecture()

print("\n=== Advantages of Transformer-based Models ===")
advantages = {
    "Long-range Dependencies": "Capture global relationships with Self-Attention",
    "Efficiency": "High accuracy with fewer parameters than CNNs",
    "Flexibility": "Handle various input sizes",
    "Scalability": "Easy to adjust model size"
}

for key, value in advantages.items():
    print(f"• {key}: {value}")

Output:

=== SegFormer Features ===
1. Hierarchical Transformer Encoder
2. Lightweight MLP Decoder
3. Efficient Self-Attention
4. Multi-scale Feature Fusion

=== SegFormer Architecture ===
Encoder:
  - Patch Embedding (Overlapping)
  - Efficient Self-Attention
  - Mix-FFN (No Position Encoding needed)
  - Hierarchical Structure (4 stages)

Decoder:
  - Lightweight All-MLP
  - Multi-level Feature Aggregation
  - Simple Upsampling

=== Advantages of Transformer-based Models ===
• Long-range Dependencies: Capture global relationships with Self-Attention
• Efficiency: High accuracy with fewer parameters than CNNs
• Flexibility: Handle various input sizes
• Scalability: Easy to adjust model size

4.4 Instance Segmentation

Mask R-CNN

Mask R-CNN extends Faster R-CNN to perform both object detection and instance segmentation simultaneously.

Architecture

graph TB A[Input Image] --> B[Backbone
ResNet/FPN] B --> C[RPN
Region Proposal] C --> D[RoI Align] D --> E[Classification
Head] D --> F[Bounding Box
Head] D --> G[Mask
Head] E --> H[Class Prediction] F --> I[BBox Prediction] G --> J[Mask Prediction] style A fill:#e3f2fd style B fill:#fff3e0 style D fill:#f3e5f5 style H fill:#c8e6c9 style I fill:#c8e6c9 style J fill:#c8e6c9

Key Technologies

Technology	Description
RoI Align	Accurate pixel correspondence (more precise than RoI Pooling)
Mask Branch	Predict mask for each RoI
Multi-task Loss	Integrated loss of classification + BBox + mask

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0

import torch
import torchvision
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.transforms import functional as F
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches

class MaskRCNNWrapper:
    """
    Mask R-CNN wrapper class
    """

    def __init__(self, pretrained=True, num_classes=91):
        """
        Args:
            pretrained: Use pretrained model
            num_classes: Number of classes (COCO has 91 classes)
        """
        if pretrained:
            self.model = maskrcnn_resnet50_fpn(pretrained=True)
        else:
            self.model = maskrcnn_resnet50_fpn(pretrained=False,
                                               num_classes=num_classes)

        self.model.eval()
        self.device = torch.device('cuda' if torch.cuda.is_available()
                                   else 'cpu')
        self.model.to(self.device)

    def predict(self, image, threshold=0.5):
        """
        Instance segmentation prediction

        Args:
            image: PIL Image or Tensor (C, H, W)
            threshold: Confidence threshold

        Returns:
            predictions: Dictionary of predictions
        """
        # Image preprocessing
        if not isinstance(image, torch.Tensor):
            image = F.to_tensor(image)

        image = image.to(self.device)

        with torch.no_grad():
            predictions = self.model([image])

        # Filter by threshold
        pred = predictions[0]
        keep = pred['scores'] > threshold

        filtered_pred = {
            'boxes': pred['boxes'][keep].cpu(),
            'labels': pred['labels'][keep].cpu(),
            'scores': pred['scores'][keep].cpu(),
            'masks': pred['masks'][keep].cpu()
        }

        return filtered_pred

    def visualize_predictions(self, image, predictions, coco_names=None):
        """
        Visualize predictions

        Args:
            image: Original image (Tensor)
            predictions: Prediction results
            coco_names: List of class names
        """
        # Convert image to numpy array
        if isinstance(image, torch.Tensor):
            image_np = image.permute(1, 2, 0).cpu().numpy()
        else:
            image_np = image

        fig, ax = plt.subplots(1, figsize=(12, 8))
        ax.imshow(image_np)

        # Draw each instance
        for i in range(len(predictions['boxes'])):
            box = predictions['boxes'][i].numpy()
            label = predictions['labels'][i].item()
            score = predictions['scores'][i].item()
            mask = predictions['masks'][i, 0].numpy()

            # Bounding box
            rect = patches.Rectangle(
                (box[0], box[1]), box[2] - box[0], box[3] - box[1],
                linewidth=2, edgecolor='red', facecolor='none'
            )
            ax.add_patch(rect)

            # Mask (semi-transparent)
            colored_mask = np.zeros_like(image_np)
            colored_mask[:, :, 0] = mask  # Red channel
            ax.imshow(colored_mask, alpha=0.3)

            # Label
            class_name = coco_names[label] if coco_names else f"Class {label}"
            ax.text(box[0], box[1] - 5,
                   f"{class_name}: {score:.2f}",
                   bbox=dict(facecolor='red', alpha=0.5),
                   fontsize=10, color='white')

        ax.axis('off')
        plt.tight_layout()
        plt.show()

# COCO class names (abbreviated)
COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
    'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

# Mask R-CNN usage example
print("=== Mask R-CNN ===")

# Model initialization
mask_rcnn = MaskRCNNWrapper(pretrained=True)

# Dummy image (in practice, use real images)
dummy_image = torch.randn(3, 480, 640)

# Prediction
predictions = mask_rcnn.predict(dummy_image, threshold=0.7)

print(f"Number of detected instances: {len(predictions['boxes'])}")
print(f"Prediction shapes:")
print(f"  - Boxes: {predictions['boxes'].shape}")
print(f"  - Labels: {predictions['labels'].shape}")
print(f"  - Scores: {predictions['scores'].shape}")
print(f"  - Masks: {predictions['masks'].shape}")

# Model statistics
total_params = sum(p.numel() for p in mask_rcnn.model.parameters())
print(f"\nNumber of parameters: {total_params:,}")

Output:

=== Mask R-CNN ===
Number of detected instances: 0
Prediction shapes:
  - Boxes: torch.Size([0, 4])
  - Labels: torch.Size([0])
  - Scores: torch.Size([0])
  - Masks: torch.Size([0, 1, 480, 640])

Number of parameters: 44,177,097

Other Instance Segmentation Methods

1. YOLACT (You Only Look At CoefficienTs)

Feature	Description
Fast	Real-time processing possible (33 FPS)
Prototype Masks	Use shared mask bases
Coefficient Prediction	Predict coefficients for each instance

2. SOLOv2 (Segmenting Objects by Locations)

Feature	Description
Category + Location	Location-based instance separation
Dynamic Head	Dynamic prediction head
High Accuracy	Equal to or better than Mask R-CNN

4.5 Detectron2 Framework

What is Detectron2?

Detectron2 is an object detection and segmentation library developed by Facebook AI Research.

Key Features

Feature	Description
Modularity	Flexible architecture design
Fast	Optimized implementation
Rich Models	Mask R-CNN, Panoptic FPN, etc.
Customizable	Easy adaptation to custom datasets

# Basic Detectron2 usage example (if installed)
# pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html

"""
import detectron2
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog
import cv2

class Detectron2Segmentation:
    \"\"\"
    Segmentation using Detectron2
    \"\"\"

    def __init__(self, model_name="COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"):
        \"\"\"
        Args:
            model_name: Model configuration file name
        \"\"\"
        # Initialize configuration
        self.cfg = get_cfg()
        self.cfg.merge_from_file(model_zoo.get_config_file(model_name))
        self.cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
        self.cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(model_name)

        # Create Predictor
        self.predictor = DefaultPredictor(self.cfg)
        self.metadata = MetadataCatalog.get(self.cfg.DATASETS.TRAIN[0])

    def predict(self, image_path):
        \"\"\"
        Perform prediction on image

        Args:
            image_path: Image path

        Returns:
            outputs: Prediction results
        \"\"\"
        image = cv2.imread(image_path)
        outputs = self.predictor(image)
        return outputs, image

    def visualize(self, image, outputs):
        \"\"\"
        Visualize predictions

        Args:
            image: Original image
            outputs: Prediction results

        Returns:
            Visualization image
        \"\"\"
        v = Visualizer(image[:, :, ::-1],
                      metadata=self.metadata,
                      scale=0.8)
        out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
        return out.get_image()[:, :, ::-1]

# Usage example (commented - run in actual environment)
# detector = Detectron2Segmentation()
# outputs, image = detector.predict("sample_image.jpg")
# result = detector.visualize(image, outputs)
# cv2.imshow("Detectron2 Result", result)
# cv2.waitKey(0)
"""

print("=== Detectron2 Framework ===")
print("\nMajor Model Configurations:")
models = {
    "Mask R-CNN": "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml",
    "Panoptic FPN": "COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml",
    "Semantic FPN": "COCO-Stuff-10K-SemanticSegmentation/sem_seg_R_50_FPN_1x.yaml"
}

for name, config in models.items():
    print(f"  • {name}: {config}")

print("\nMajor APIs:")
apis = {
    "get_cfg()": "Get configuration object",
    "DefaultPredictor": "Simple API for inference",
    "DefaultTrainer": "Trainer for training",
    "build_model()": "Build custom model"
}

for api, desc in apis.items():
    print(f"  • {api}: {desc}")

Output:

=== Detectron2 Framework ===

Major Model Configurations:
  • Mask R-CNN: COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml
  • Panoptic FPN: COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml
  • Semantic FPN: COCO-Stuff-10K-SemanticSegmentation/sem_seg_R_50_FPN_1x.yaml

Major APIs:
  • get_cfg(): Get configuration object
  • DefaultPredictor: Simple API for inference
  • DefaultTrainer: Trainer for training
  • build_model(): Build custom model

4.6 Practical Project

Project: Semantic Segmentation Pipeline

Here, we build a complete segmentation pipeline.

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Use the U-Net model from above
# (Class definition omitted for brevity, use the UNet class from above)

class SegmentationPipeline:
    """
    Complete segmentation pipeline
    """

    def __init__(self, model, device='cpu'):
        """
        Args:
            model: Segmentation model
            device: Device to use
        """
        self.model = model.to(device)
        self.device = device
        self.train_losses = []
        self.val_losses = []

    def train_epoch(self, dataloader, criterion, optimizer):
        """
        Train for one epoch

        Args:
            dataloader: Data loader
            criterion: Loss function
            optimizer: Optimizer

        Returns:
            Average loss
        """
        self.model.train()
        total_loss = 0.0

        for images, masks in dataloader:
            images = images.to(self.device)
            masks = masks.to(self.device)

            # Forward
            outputs = self.model(images)
            loss = criterion(outputs, masks)

            # Backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        return total_loss / len(dataloader)

    def validate(self, dataloader, criterion):
        """
        Validation

        Args:
            dataloader: Validation data loader
            criterion: Loss function

        Returns:
            Average loss
        """
        self.model.eval()
        total_loss = 0.0

        with torch.no_grad():
            for images, masks in dataloader:
                images = images.to(self.device)
                masks = masks.to(self.device)

                outputs = self.model(images)
                loss = criterion(outputs, masks)

                total_loss += loss.item()

        return total_loss / len(dataloader)

    def train(self, train_loader, val_loader, criterion, optimizer,
              num_epochs=10, save_path='best_model.pth'):
        """
        Complete training loop

        Args:
            train_loader: Training data loader
            val_loader: Validation data loader
            criterion: Loss function
            optimizer: Optimizer
            num_epochs: Number of epochs
            save_path: Model save path
        """
        best_val_loss = float('inf')

        for epoch in range(num_epochs):
            # Training
            train_loss = self.train_epoch(train_loader, criterion, optimizer)
            self.train_losses.append(train_loss)

            # Validation
            val_loss = self.validate(val_loader, criterion)
            self.val_losses.append(val_loss)

            print(f"Epoch [{epoch+1}/{num_epochs}] "
                  f"Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")

            # Save best model
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                torch.save(self.model.state_dict(), save_path)
                print(f"  → Model saved (Val Loss: {val_loss:.4f})")

    def plot_training_history(self):
        """Plot training history"""
        plt.figure(figsize=(10, 6))
        plt.plot(self.train_losses, label='Train Loss', marker='o')
        plt.plot(self.val_losses, label='Val Loss', marker='s')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title('Training History')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()

    def predict(self, image):
        """
        Prediction

        Args:
            image: Input image (C, H, W) or (B, C, H, W)

        Returns:
            Predicted mask
        """
        self.model.eval()

        if len(image.shape) == 3:
            image = image.unsqueeze(0)

        image = image.to(self.device)

        with torch.no_grad():
            output = self.model(image)
            pred = torch.sigmoid(output)

        return pred.cpu()

# Pipeline demo
print("=== Segmentation Pipeline ===")

# Dataset and dataloader (using SimpleSegmentationDataset from above)
train_dataset = SimpleSegmentationDataset(num_samples=80)
val_dataset = SimpleSegmentationDataset(num_samples=20)

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=False)

# Model, loss, optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = UNet(n_channels=3, n_classes=1)
criterion = DiceLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Initialize and train pipeline
pipeline = SegmentationPipeline(model, device=device)
pipeline.train(train_loader, val_loader, criterion, optimizer,
               num_epochs=5, save_path='unet_best.pth')

# Plot training history
pipeline.plot_training_history()

print("\n=== Pipeline Complete ===")

Output:

=== Segmentation Pipeline ===
Epoch [1/5] Train Loss: 0.2856, Val Loss: 0.2134
  → Model saved (Val Loss: 0.2134)
Epoch [2/5] Train Loss: 0.1923, Val Loss: 0.1678
  → Model saved (Val Loss: 0.1678)
Epoch [3/5] Train Loss: 0.1456, Val Loss: 0.1345
  → Model saved (Val Loss: 0.1345)
Epoch [4/5] Train Loss: 0.1189, Val Loss: 0.1123
  → Model saved (Val Loss: 0.1123)
Epoch [5/5] Train Loss: 0.0987, Val Loss: 0.0945
  → Model saved (Val Loss: 0.0945)

=== Pipeline Complete ===

Post-processing

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - opencv-python>=4.8.0
# - scipy>=1.11.0

import cv2
import numpy as np
from scipy import ndimage

def post_process_mask(pred_mask, threshold=0.5, min_area=100):
    """
    Post-process predicted mask

    Args:
        pred_mask: Predicted mask (H, W) value range [0, 1]
        threshold: Binarization threshold
        min_area: Minimum region area (remove regions smaller than this)

    Returns:
        Processed mask
    """
    # Binarization
    binary_mask = (pred_mask > threshold).astype(np.uint8)

    # Morphological processing (noise removal)
    kernel = np.ones((3, 3), np.uint8)
    binary_mask = cv2.morphologyEx(binary_mask, cv2.MORPH_OPEN, kernel)
    binary_mask = cv2.morphologyEx(binary_mask, cv2.MORPH_CLOSE, kernel)

    # Remove small regions
    labeled_mask, num_features = ndimage.label(binary_mask)

    for i in range(1, num_features + 1):
        region = (labeled_mask == i)
        if region.sum() < min_area:
            binary_mask[region] = 0

    return binary_mask

# Post-processing demo
print("=== Post-processing Demo ===")

# Sample predicted mask (with noise)
np.random.seed(42)
H, W = 256, 256
pred_mask = np.random.rand(H, W) * 0.3  # Noise

# Add true regions
y, x = np.ogrid[:H, :W]
circle1 = ((x - 80)**2 + (y - 80)**2) <= 30**2
circle2 = ((x - 180)**2 + (y - 180)**2) <= 25**2
pred_mask[circle1] = 0.9
pred_mask[circle2] = 0.85

# Add random noise
noise_points = np.random.rand(H, W) > 0.98
pred_mask[noise_points] = 0.7

# Post-processing
processed_mask = post_process_mask(pred_mask, threshold=0.5, min_area=50)

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].imshow(pred_mask, cmap='viridis')
axes[0].set_title('Predicted Mask (Raw)', fontsize=14)
axes[0].axis('off')

axes[1].imshow(pred_mask > 0.5, cmap='gray')
axes[1].set_title('Binarization Only', fontsize=14)
axes[1].axis('off')

axes[2].imshow(processed_mask, cmap='gray')
axes[2].set_title('After Post-processing', fontsize=14)
axes[2].axis('off')

plt.tight_layout()
plt.show()

print(f"Number of regions before post-processing: {ndimage.label(pred_mask > 0.5)[1]}")
print(f"Number of regions after post-processing: {ndimage.label(processed_mask)[1]}")

Output:

=== Post-processing Demo ===
Number of regions before post-processing: 15
Number of regions after post-processing: 2

Important: Post-processing can remove noise and improve the quality of segmentation results.

4.7 Chapter Summary

What We Learned

Types of Segmentation
- Semantic Segmentation: Pixel classification
- Instance Segmentation: Instance separation
- Panoptic Segmentation: Integrated understanding
- Evaluation metrics: IoU, Dice coefficient, mIoU
U-Net
- Encoder-Decoder structure
- Preserve high-resolution information with Skip Connections
- High accuracy in medical image segmentation
- Effective even with small datasets
Advanced Architectures
- DeepLab: Atrous Convolution and ASPP
- PSPNet: Pyramid Pooling Module
- HRNet: Maintain high-resolution representation
- SegFormer: Efficient Transformer-based model
Instance Segmentation
- Mask R-CNN: RoI Align and Mask Branch
- YOLACT: Real-time processing
- SOLOv2: Location-based separation
- Detectron2: Powerful framework
Practical Pipeline
- Data preparation and preprocessing
- Training and validation loop
- Noise removal with post-processing
- Model saving and reuse

Segmentation Method Selection Guide

Task	Recommended Method	Reason
Medical Imaging	U-Net	High accuracy with small datasets
Autonomous Driving (Scene Understanding)	DeepLab, PSPNet	Multi-scale processing
Instance Separation	Mask R-CNN	High accuracy, flexibility
Real-time Processing	YOLACT	Fast
Boundary Accuracy Priority	HRNet	Maintain high resolution
Efficiency Priority	SegFormer	Parameter efficient

Next Chapter

In Chapter 5, we will learn about Object Tracking:

Single Object Tracking (SOT)
Multiple Object Tracking (MOT)
DeepSORT, FairMOT
Real-time tracking systems

Exercises

Problem 1 (Difficulty: Easy)

Explain the difference between Semantic Segmentation and Instance Segmentation, and give examples of applications for each.

Sample Answer

Answer:

Semantic Segmentation:

Explanation: Classifies each pixel into classes but does not distinguish different instances of the same class
Output: Class label map (assign class ID to each pixel)
Application examples:
- Autonomous driving: Scene understanding of roads, sidewalks, buildings, etc.
- Satellite imagery: Classification of forests, cities, water bodies
- Medical imaging: Region identification of organs and lesions

Instance Segmentation:

Explanation: Distinguishes different object instances even of the same class
Output: Individual mask per instance
Application examples:
- Robotics: Recognize and grasp individual objects
- Cell counting: Separate individual cells in microscope images
- Image editing: Extract only specific persons or objects

Main Differences:

Item	Semantic Segmentation	Instance Segmentation
Instance Distinction	None	Yes
Output	Class map	Individual masks
Complexity	Low	High

Problem 2 (Difficulty: Medium)

Implement functions to calculate IoU and Dice coefficient, and compute both metrics for the following two masks.

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

"""
Example: Implement functions to calculate IoU and Dice coefficient, a

Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""

import numpy as np

# Mask 1 (ground truth)
mask1 = np.array([
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0]
])

# Mask 2 (prediction)
mask2 = np.array([
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0]
])

Sample Answer

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

import numpy as np

def calculate_iou(mask1, mask2):
    """IoU calculation"""
    intersection = np.logical_and(mask1, mask2).sum()
    union = np.logical_or(mask1, mask2).sum()

    if union == 0:
        return 0.0

    iou = intersection / union
    return iou

def calculate_dice(mask1, mask2):
    """Dice coefficient calculation"""
    intersection = np.logical_and(mask1, mask2).sum()

    dice = (2.0 * intersection) / (mask1.sum() + mask2.sum())
    return dice

# Mask 1 (ground truth)
mask1 = np.array([
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0]
])

# Mask 2 (prediction)
mask2 = np.array([
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0]
])

# Calculate
iou = calculate_iou(mask1, mask2)
dice = calculate_dice(mask1, mask2)

print("=== Evaluation Metrics Calculation ===")
print(f"IoU: {iou:.4f}")
print(f"Dice Coefficient: {dice:.4f}")

# Detailed analysis
intersection = np.logical_and(mask1, mask2).sum()
union = np.logical_or(mask1, mask2).sum()

print(f"\nDetails:")
print(f"  Intersection (overlap): {intersection} pixels")
print(f"  Union: {union} pixels")
print(f"  Mask1 area: {mask1.sum()} pixels")
print(f"  Mask2 area: {mask2.sum()} pixels")

Output:

=== Evaluation Metrics Calculation ===
IoU: 0.6111
Dice Coefficient: 0.7586

Details:
  Intersection (overlap): 11 pixels
  Union: 18 pixels
  Mask1 area: 13 pixels
  Mask2 area: 14 pixels

Problem 3 (Difficulty: Medium)

Explain the role of Skip Connections in U-Net and describe what problems would occur if they were absent.

Sample Answer

Answer:

Role of Skip Connections:

Preserve High-Resolution Information
- Directly transmit features from shallow layers of Encoder to corresponding layers of Decoder
- Compensate for detailed information lost in downsampling
Improve Gradient Flow
- Mitigate gradient vanishing problem in deep networks
- Stabilize and accelerate training
Improve Positional Accuracy
- Preserve spatial position information from original image
- Enable accurate boundary detection
Integrate Multi-Scale Features
- Combine features at different levels of abstraction
- Capture both large and small objects

Problems Without Skip Connections:

Problem	Description
Blurred Boundaries	Object contours become unclear
Loss of Small Structures	Fine details not reproduced
Reduced Positional Accuracy	Prediction positions become misaligned
Training Difficulty	Gradient vanishing in deep networks

Experimental Verification:

# Comparison with and without Skip Connections (concept)

# With: U-Net standard
# → Sharp boundaries, preservation of detailed structures

# Without: Simple Encoder-Decoder
# → Blurred boundaries, loss of details

Problem 4 (Difficulty: Hard)

Explain the loss functions for each of the three output branches (classification, BBox, mask) in Mask R-CNN, and describe how the overall loss function is defined.

Sample Answer

Answer:

Mask R-CNN trains three tasks simultaneously within a Multi-task Learning framework.

1. Classification Branch:

Purpose: Class classification of RoI (Region of Interest)
Loss Function: Cross Entropy Loss

$$ L_{\text{cls}} = -\log p_{\text{true\_class}} $$

2. Bounding Box Branch (BBox Regression):

Purpose: Regression of BBox position and size
Loss Function: Smooth L1 Loss

$$ L_{\text{box}} = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L1}(t_i - t_i^*) $$

where: $$ \text{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} $$

3. Mask Branch (Mask Prediction):

Purpose: Binary mask prediction per pixel
Loss Function: Binary Cross Entropy Loss (per pixel)

$$ L_{\text{mask}} = -\frac{1}{m^2} \sum_{i,j} [y_{ij} \log \hat{y}_{ij} + (1-y_{ij}) \log(1-\hat{y}_{ij})] $$

where $m \times m$ is the mask resolution

Overall Loss Function:

$$ L = L_{\text{cls}} + L_{\text{box}} + L_{\text{mask}} $$

Implementation Example:

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskRCNNLoss:
    """Mask R-CNN Loss Function"""

    def __init__(self):
        self.cls_loss_fn = nn.CrossEntropyLoss()
        self.mask_loss_fn = nn.BCEWithLogitsLoss()

    def smooth_l1_loss(self, pred, target, beta=1.0):
        """Smooth L1 Loss"""
        diff = torch.abs(pred - target)
        loss = torch.where(
            diff < beta,
            0.5 * diff ** 2 / beta,
            diff - 0.5 * beta
        )
        return loss.mean()

    def __call__(self, cls_pred, bbox_pred, mask_pred,
                 cls_target, bbox_target, mask_target):
        """
        Calculate overall loss

        Args:
            cls_pred: Class prediction (N, num_classes)
            bbox_pred: BBox prediction (N, 4)
            mask_pred: Mask prediction (N, num_classes, H, W)
            cls_target: Class ground truth (N,)
            bbox_target: BBox ground truth (N, 4)
            mask_target: Mask ground truth (N, H, W)

        Returns:
            total_loss, loss_dict
        """
        # Classification loss
        loss_cls = self.cls_loss_fn(cls_pred, cls_target)

        # BBox loss
        loss_bbox = self.smooth_l1_loss(bbox_pred, bbox_target)

        # Mask loss (only for the correct class mask)
        # In practice, predict mask per class and use only the correct class mask
        loss_mask = self.mask_loss_fn(mask_pred, mask_target)

        # Total loss
        total_loss = loss_cls + loss_bbox + loss_mask

        loss_dict = {
            'loss_cls': loss_cls.item(),
            'loss_bbox': loss_bbox.item(),
            'loss_mask': loss_mask.item(),
            'total_loss': total_loss.item()
        }

        return total_loss, loss_dict

# Usage example (dummy data)
loss_fn = MaskRCNNLoss()

# Dummy predictions and ground truth
cls_pred = torch.randn(10, 80)  # 10 RoIs, 80 classes
bbox_pred = torch.randn(10, 4)
mask_pred = torch.randn(10, 1, 28, 28)
cls_target = torch.randint(0, 80, (10,))
bbox_target = torch.randn(10, 4)
mask_target = torch.randint(0, 2, (10, 1, 28, 28)).float()

total_loss, loss_dict = loss_fn(
    cls_pred, bbox_pred, mask_pred,
    cls_target, bbox_target, mask_target
)

print("=== Mask R-CNN Loss ===")
for key, value in loss_dict.items():
    print(f"{key}: {value:.4f}")

Output:

=== Mask R-CNN Loss ===
loss_cls: 4.3821
loss_bbox: 0.8234
loss_mask: 0.6931
total_loss: 5.8986

Important Points:

The three losses are simply added together (weighting is also possible)
Mask loss is calculated only for the correct class mask
During training, all losses are minimized simultaneously

Problem 5 (Difficulty: Hard)

In medical image segmentation, when class imbalance occurs (foreground pixels are less than 5% of total), what loss functions and training strategies should be used? Propose your approach.

Sample Answer

Answer:

In medical image segmentation, lesion regions are often very small, making class imbalance a serious problem. Combining the following strategies is effective.

1. Improving Loss Functions:

(a) Focal Loss

Suppresses loss for easy examples (background) and focuses on hard examples (boundaries).

$$ \text{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t) $$

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, pred, target):
        pred = torch.sigmoid(pred)

        # Focal weight
        pt = torch.where(target == 1, pred, 1 - pred)
        focal_weight = (1 - pt) ** self.gamma

        # BCE with focal weight
        bce = F.binary_cross_entropy(pred, target, reduction='none')
        focal_loss = self.alpha * focal_weight * bce

        return focal_loss.mean()

(b) Tversky Loss

Can adjust the balance between False Positives and False Negatives.

$$ \text{TL} = 1 - \frac{TP}{TP + \alpha FP + \beta FN} $$

class TverskyLoss(nn.Module):
    def __init__(self, alpha=0.3, beta=0.7, smooth=1.0):
        super().__init__()
        self.alpha = alpha  # FP weight
        self.beta = beta    # FN weight
        self.smooth = smooth

    def forward(self, pred, target):
        pred = torch.sigmoid(pred)

        TP = (pred * target).sum()
        FP = (pred * (1 - target)).sum()
        FN = ((1 - pred) * target).sum()

        tversky = (TP + self.smooth) / (
            TP + self.alpha * FP + self.beta * FN + self.smooth
        )

        return 1 - tversky

(c) Combination of Dice Loss + BCE

class CombinedLoss(nn.Module):
    def __init__(self, dice_weight=0.5, bce_weight=0.5):
        super().__init__()
        self.dice_loss = DiceLoss()
        self.bce_loss = nn.BCEWithLogitsLoss()
        self.dice_weight = dice_weight
        self.bce_weight = bce_weight

    def forward(self, pred, target):
        dice = self.dice_loss(pred, target)
        bce = self.bce_loss(pred, target)

        return self.dice_weight * dice + self.bce_weight * bce

2. Data Augmentation Strategy:

# Requirements:
# - Python 3.9+
# - albumentations>=1.3.0

"""
Example: 2. Data Augmentation Strategy:

Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner
Execution time: 5-10 seconds
Dependencies: None
"""

import albumentations as A

# Strong augmentation for medical images
transform = A.Compose([
    A.RandomRotate90(p=0.5),
    A.Flip(p=0.5),
    A.ShiftScaleRotate(
        shift_limit=0.1,
        scale_limit=0.2,
        rotate_limit=45,
        p=0.5
    ),
    A.ElasticTransform(p=0.3),  # Effective for medical images
    A.GridDistortion(p=0.3),
    A.RandomBrightnessContrast(p=0.5),
])

3. Sampling Strategy:

class BalancedSampler:
    """Preferentially sample patches containing lesion regions"""

    def __init__(self, image, mask, patch_size=256,
                 positive_ratio=0.7):
        self.image = image
        self.mask = mask
        self.patch_size = patch_size
        self.positive_ratio = positive_ratio

    def sample_patch(self):
        H, W = self.image.shape[:2]

        if np.random.rand() < self.positive_ratio:
            # Patch containing lesion region
            positive_coords = np.argwhere(self.mask > 0)
            if len(positive_coords) > 0:
                center = positive_coords[
                    np.random.randint(len(positive_coords))
                ]
                y, x = center
            else:
                y = np.random.randint(0, H)
                x = np.random.randint(0, W)
        else:
            # Random patch
            y = np.random.randint(0, H)
            x = np.random.randint(0, W)

        # Extract patch (including boundary handling)
        # ... (implementation omitted)

        return patch_image, patch_mask

4. Post-processing:

def post_process_with_threshold_optimization(pred_mask, true_mask):
    """Optimal threshold search"""
    best_threshold = 0.5
    best_dice = 0.0

    for threshold in np.arange(0.1, 0.9, 0.05):
        binary_pred = (pred_mask > threshold).astype(int)
        dice = calculate_dice(binary_pred, true_mask)

        if dice > best_dice:
            best_dice = dice
            best_threshold = threshold

    return best_threshold, best_dice

Recommended Integration Strategy:

Method	Priority	Reason
Dice + BCE Loss	High	Robust to imbalance
Focal Loss	High	Focus on hard examples
Tversky Loss	Medium	FP/FN adjustable
Lesion-centered Sampling	High	Improve training efficiency
Strong Data Augmentation	High	Improve generalization
Threshold Optimization	Medium	Improve inference performance

References

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI.
Chen, L. C., et al. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV.
He, K., et al. (2017). Mask R-CNN. ICCV.
Zhao, H., et al. (2017). Pyramid Scene Parsing Network. CVPR.
Wang, J., et al. (2020). Deep High-Resolution Representation Learning for Visual Recognition. TPAMI.
Xie, E., et al. (2021). SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. NeurIPS.
Bolya, D., et al. (2019). YOLACT: Real-time Instance Segmentation. ICCV.
Wu, Y., et al. (2019). Detectron2. Facebook AI Research.

Learning Objectives

4.1 Types of Segmentation

What is Segmentation?

1. Semantic Segmentation

2. Instance Segmentation

3. Panoptic Segmentation

Evaluation Metrics

1. IoU (Intersection over Union)

2. Dice Coefficient (F1-Score)

3. Mean IoU (mIoU)

4.2 U-Net Architecture

U-Net Overview

U-Net Features

U-Net Structure

Complete U-Net Implementation

Application to Medical Image Segmentation

4.3 Advanced Architectures

1. DeepLab (v3/v3+)

Key Technologies

2. PSPNet (Pyramid Scene Parsing Network)

Key Technologies

3. HRNet (High-Resolution Network)

Key Technologies

4. Transformer-based Segmentation (SegFormer)

4.4 Instance Segmentation

Mask R-CNN

Architecture

Key Technologies

Other Instance Segmentation Methods

1. YOLACT (You Only Look At CoefficienTs)

2. SOLOv2 (Segmenting Objects by Locations)

4.5 Detectron2 Framework

What is Detectron2?

Key Features

4.6 Practical Project

Project: Semantic Segmentation Pipeline

Post-processing

4.7 Chapter Summary

What We Learned

Segmentation Method Selection Guide

Next Chapter

Exercises

Problem 1 (Difficulty: Easy)

Problem 2 (Difficulty: Medium)

Problem 3 (Difficulty: Medium)

Problem 4 (Difficulty: Hard)

Problem 5 (Difficulty: Hard)

(a) Focal Loss

(b) Tversky Loss

(c) Combination of Dice Loss + BCE

References

Disclaimer