This chapter introduces the basics of object detection. You will learn how detection differs from image classification, the design philosophies of two-stage and one-stage detectors, and how to compute and interpret evaluation metrics such as IoU.
Learning Objectives
By reading this chapter, you will be able to:
- Understand the differences between image classification and object detection and define appropriate tasks
- Explain the architecture and evolution of two-stage detectors (R-CNN family)
- Understand the design philosophy and advantages of one-stage detectors (YOLO, SSD)
- Implement and interpret evaluation metrics such as IoU, NMS, and mAP
- Implement and perform inference with object detection models using PyTorch
- Achieve practical object detection with COCO-format datasets
5.1 What is Object Detection
Types of Image Recognition Tasks
Image recognition tasks in computer vision are primarily classified into three categories based on their objectives.
(Figure: image recognition tasks branch into Classification ("What is this image?": class label only), Object Detection ("What and where?": location + class label), and Segmentation ("Which pixel is what?": pixel-level classification).)
| Task | Purpose | Output | Applications |
|---|---|---|---|
| Classification | Classify the entire image | Class label (e.g., "cat") | Image search, content filtering |
| Detection | Identify object location and class | Bounding Box + class label | Autonomous driving, surveillance, medical imaging |
| Segmentation | Pixel-level region division | Segmentation mask | Background removal, 3D reconstruction, medical diagnosis |
Basic Concepts of Object Detection
Bounding Box
A bounding box is a rectangular region that encloses a detected object, containing the following information:
- Coordinate representation: $(x, y, w, h)$ or $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ (see the conversion sketch after this list)
- Class label: Object category (e.g., "person", "car")
- Confidence score: Detection confidence $[0, 1]$
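The two coordinate conventions convert into each other with simple arithmetic; here is a minimal sketch (the helper names are illustrative, and $(x, y)$ is taken as the top-left corner as in the COCO format, whereas YOLO uses the box center):
def xywh_to_xyxy(box):
    """(x_min, y_min, width, height) -> (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

def xyxy_to_xywh(box):
    """(x_min, y_min, x_max, y_max) -> (x_min, y_min, width, height)."""
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]

print(xywh_to_xyxy([50, 50, 100, 250]))  # [50, 50, 150, 300]
print(xyxy_to_xywh([50, 50, 150, 300]))  # [50, 50, 100, 250]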
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pillow>=10.0.0
# - torch>=2.0.0, <2.3.0
import torch
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
import numpy as np
def visualize_bounding_boxes(image_path, boxes, labels, scores, class_names):
"""
Visualize Bounding Boxes
Args:
image_path: Image file path
boxes: Bounding box coordinates [[x_min, y_min, x_max, y_max], ...]
labels: Class labels [0, 1, 2, ...]
scores: Confidence scores [0.95, 0.87, ...]
class_names: List of class names ['person', 'car', ...]
"""
# Load image
img = Image.open(image_path)
fig, ax = plt.subplots(1, figsize=(12, 8))
ax.imshow(img)
# Draw each bounding box
colors = ['red', 'blue', 'green', 'yellow', 'purple', 'orange']
for box, label, score in zip(boxes, labels, scores):
x_min, y_min, x_max, y_max = box
width = x_max - x_min
height = y_max - y_min
# Draw rectangle
color = colors[label % len(colors)]
rect = patches.Rectangle(
(x_min, y_min), width, height,
linewidth=2, edgecolor=color, facecolor='none'
)
ax.add_patch(rect)
# Display label and score
label_text = f'{class_names[label]}: {score:.2f}'
ax.text(
x_min, y_min - 5,
label_text,
bbox=dict(facecolor=color, alpha=0.7),
fontsize=10, color='white'
)
ax.axis('off')
plt.tight_layout()
plt.show()
# Usage example
# boxes = [[50, 50, 200, 300], [250, 100, 400, 350]]
# labels = [0, 1] # 0: person, 1: car
# scores = [0.95, 0.87]
# class_names = ['person', 'car', 'dog', 'cat']
# visualize_bounding_boxes('sample.jpg', boxes, labels, scores, class_names)
IoU (Intersection over Union)
IoU is a metric that measures the overlap between two bounding boxes and is essential for evaluating object detection.
$$ \text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \frac{|A \cap B|}{|A \cup B|} $$
(Figure: the predicted box and the ground-truth box form an overlap (intersection) region and a union region; IoU = Intersection / Union.)
def calculate_iou(box1, box2):
"""
Calculate IoU between two bounding boxes
Args:
box1, box2: [x_min, y_min, x_max, y_max]
Returns:
iou: IoU value [0, 1]
"""
# Intersection region coordinates
x_min_inter = max(box1[0], box2[0])
y_min_inter = max(box1[1], box2[1])
x_max_inter = min(box1[2], box2[2])
y_max_inter = min(box1[3], box2[3])
# Intersection area
inter_width = max(0, x_max_inter - x_min_inter)
inter_height = max(0, y_max_inter - y_min_inter)
intersection = inter_width * inter_height
# Area of each box
area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
# Union area
union = area1 + area2 - intersection
# Calculate IoU (avoid division by zero)
iou = intersection / union if union > 0 else 0
return iou
# Usage examples and tests
box1 = [50, 50, 150, 150] # Ground truth box
box2 = [100, 100, 200, 200] # Predicted box (partial overlap)
box3 = [50, 50, 150, 150] # Predicted box (perfect match)
box4 = [200, 200, 300, 300] # Predicted box (no overlap)
print(f"Partial overlap IoU: {calculate_iou(box1, box2):.4f}") # ~0.14
print(f"Perfect match IoU: {calculate_iou(box1, box3):.4f}") # 1.00
print(f"No overlap IoU: {calculate_iou(box1, box4):.4f}") # 0.00
# Vectorized batch IoU calculation
def batch_iou(boxes1, boxes2):
"""
Efficiently calculate IoU between multiple bounding boxes (PyTorch version)
Args:
boxes1: Tensor of shape [N, 4]
boxes2: Tensor of shape [M, 4]
Returns:
iou: Tensor of shape [N, M]
"""
# Calculate intersection
x_min = torch.max(boxes1[:, None, 0], boxes2[:, 0])
y_min = torch.max(boxes1[:, None, 1], boxes2[:, 1])
x_max = torch.min(boxes1[:, None, 2], boxes2[:, 2])
y_max = torch.min(boxes1[:, None, 3], boxes2[:, 3])
inter_width = torch.clamp(x_max - x_min, min=0)
inter_height = torch.clamp(y_max - y_min, min=0)
intersection = inter_width * inter_height
# Calculate areas
area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
# Union and IoU
union = area1[:, None] + area2 - intersection
iou = intersection / union
return iou
# Usage example
boxes1 = torch.tensor([[50, 50, 150, 150], [100, 100, 200, 200]], dtype=torch.float32)
boxes2 = torch.tensor([[50, 50, 150, 150], [200, 200, 300, 300]], dtype=torch.float32)
iou_matrix = batch_iou(boxes1, boxes2)
print("\nBatch IoU Matrix:")
print(iou_matrix)
# Output:
# tensor([[1.0000, 0.0000],
# [0.1429, 0.0000]])
IoU criteria:
- IoU ≥ 0.5: Typically treated as a correct detection (PASCAL VOC standard)
- IoU ≥ 0.75: A stricter criterion (used in COCO evaluation)
- IoU < 0.5: Treated as a false positive
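Applying these criteria is a single comparison against the chosen threshold; the small sketch below reuses calculate_iou and the example boxes defined above:
iou_threshold = 0.5
for name, pred_box in [("partial overlap", box2), ("perfect match", box3), ("no overlap", box4)]:
    iou = calculate_iou(box1, pred_box)
    verdict = "correct detection (TP)" if iou >= iou_threshold else "false positive (FP)"
    print(f"{name}: IoU = {iou:.2f} -> {verdict}")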
5.2 Two-Stage Detectors
Evolution of R-CNN Family
Two-stage detectors perform object detection in two stages: (1) region proposal and (2) object classification.
(Figure: input image → Stage 1: region proposal → ~2000 candidate regions → Stage 2: classification → final detection results (box + class).)
R-CNN (2014)
R-CNN (Regions with CNN features) is a pioneering deep learning-based object detection approach.
| Step | Process | Features |
|---|---|---|
| 1. Region Proposal | Generate candidate regions with Selective Search (~2000) | Traditional image processing method |
| 2. CNN Feature Extraction | Extract features from each region with AlexNet | Requires 2000 forward passes |
| 3. SVM Classification | Classify with SVM | Trained separately from CNN |
| 4. Bounding Box Regression | Fine-tune box coordinates | Improves accuracy |
Problems:
- Very slow inference (47 seconds per image)
- Complex training (three separate learning stages)
- Many redundant feature extraction computations
Fast R-CNN (2015)
Fast R-CNN significantly improved R-CNN's computational efficiency.
(Figure: Fast R-CNN architecture: the CNN produces a feature map once; RoI Pooling combines it with the region proposals; shared FC layers feed a softmax classifier and a bounding-box regressor.)
Improvements:
- Run CNN only once on the entire image
- Extract fixed-size features from candidate regions with RoI Pooling (see the sketch after this list)
- End-to-end learning with multi-task loss (classification + box regression)
- Inference speed: 47 seconds → 2 seconds (roughly 23x speedup)
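RoI Pooling is available directly as torchvision.ops.roi_pool; below is a minimal sketch of how proposals of different sizes are mapped to fixed 7×7 features (the feature-map size, boxes, and spatial_scale are illustrative values):
import torch
from torchvision.ops import roi_pool

# Hypothetical stride-16 feature map of an 800x800 image
feature_map = torch.randn(1, 256, 50, 50)          # [B, C, H, W]
# Region proposals in image coordinates (x_min, y_min, x_max, y_max), one tensor per image
proposals = [torch.tensor([[64.0, 64.0, 320.0, 320.0],
                           [160.0, 96.0, 480.0, 400.0]])]
# spatial_scale maps image coordinates onto the feature map (50 / 800 = 1/16)
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=50 / 800)
print(pooled.shape)   # torch.Size([2, 256, 7, 7]) - one fixed-size feature per proposal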
Faster R-CNN (2016)
Faster R-CNN made region proposals learnable with CNNs, achieving complete end-to-end detection.
# Requirements:
# - Python 3.9+
# - pillow>=10.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
import torch
import torch.nn as nn
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
def create_faster_rcnn(num_classes, pretrained=True):
"""
Create Faster R-CNN model
Args:
num_classes: Number of detection classes (including background)
pretrained: Use COCO pre-trained weights
Returns:
model: Faster R-CNN model
"""
# Load COCO pre-trained model
model = fasterrcnn_resnet50_fpn(pretrained=pretrained)
# Replace the classifier head (note: this re-initializes the head, so it is mainly needed for fine-tuning;
# for direct COCO inference, the original pre-trained head can be kept as-is)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
return model
# Create model (torchvision's COCO-pretrained detectors use 91 class slots: background + 90 COCO category IDs)
model = create_faster_rcnn(num_classes=91, pretrained=True)
model.eval()
print("Faster R-CNN model structure:")
print(f"- Backbone: ResNet-50 + FPN")
print(f"- RPN: Region Proposal Network")
print(f"- RoI Heads: Box Head + Class Predictor")
# Run inference
def run_faster_rcnn_inference(model, image_path, threshold=0.5):
"""
Object detection inference with Faster R-CNN
Args:
model: Faster R-CNN model
image_path: Input image path
threshold: Detection score threshold
Returns:
boxes, labels, scores: Detection results
"""
from PIL import Image
from torchvision import transforms
# Load and preprocess image
img = Image.open(image_path).convert('RGB')
transform = transforms.Compose([transforms.ToTensor()])
img_tensor = transform(img).unsqueeze(0) # [1, 3, H, W]
# Inference
model.eval()
with torch.no_grad():
predictions = model(img_tensor)
# Extract results above threshold
pred = predictions[0]
keep = pred['scores'] > threshold
boxes = pred['boxes'][keep].cpu().numpy()
labels = pred['labels'][keep].cpu().numpy()
scores = pred['scores'][keep].cpu().numpy()
print(f"\nNumber of detected objects: {len(boxes)}")
for i, (box, label, score) in enumerate(zip(boxes, labels, scores)):
print(f" {i+1}. Label: {label}, Score: {score:.3f}, Box: {box}")
return boxes, labels, scores
# COCO class names (partial)
COCO_INSTANCE_CATEGORY_NAMES = [
'__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow'
# ... all 91 classes
]
# Usage example
# boxes, labels, scores = run_faster_rcnn_inference(model, 'test_image.jpg', threshold=0.7)
# visualize_bounding_boxes('test_image.jpg', boxes, labels, scores, COCO_INSTANCE_CATEGORY_NAMES)
Region Proposal Network (RPN)
RPN is the core technology of Faster R-CNN, proposing candidate regions with a learning-based approach.
How RPN works:
- Place multiple anchor boxes (different sizes and aspect ratios) at each location on the feature map
- Score each anchor for "objectness" (object likelihood)
- Regress bounding box coordinate offsets
- Pass high-score proposals to RoI Pooling
class SimpleRPN(nn.Module):
"""
Simplified Region Proposal Network (for educational purposes)
"""
def __init__(self, in_channels=512, num_anchors=9):
"""
Args:
in_channels: Number of input feature map channels
num_anchors: Number of anchors per location (typically 3 scales × 3 aspect ratios = 9)
"""
super(SimpleRPN, self).__init__()
# Shared convolutional layer
self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
# Objectness score (2 classes: object or background)
self.cls_logits = nn.Conv2d(512, num_anchors * 2, kernel_size=1)
# Bounding box regression (4 coordinates Γ num_anchors)
self.bbox_pred = nn.Conv2d(512, num_anchors * 4, kernel_size=1)
def forward(self, feature_map):
"""
Args:
feature_map: [B, C, H, W] feature map
Returns:
objectness: [B, num_anchors*2, H, W] object scores
bbox_deltas: [B, num_anchors*4, H, W] box coordinate offsets
"""
# Shared feature extraction
x = torch.relu(self.conv(feature_map))
# Objectness classification
objectness = self.cls_logits(x)
# Bounding box regression
bbox_deltas = self.bbox_pred(x)
return objectness, bbox_deltas
# Test RPN operation
rpn = SimpleRPN(in_channels=512, num_anchors=9)
feature_map = torch.randn(1, 512, 38, 38) # e.g., ResNet feature map
objectness, bbox_deltas = rpn(feature_map)
print(f"Objectness shape: {objectness.shape}") # [1, 18, 38, 38]
print(f"BBox Deltas shape: {bbox_deltas.shape}") # [1, 36, 38, 38]
print(f"Total Proposals: {38 * 38 * 9} anchors") # 12,996 anchors
5.3 One-Stage Detectors
YOLO (You Only Look Once)
YOLO formulates object detection as a "regression problem," directly predicting bounding boxes and classes end-to-end with a single CNN.
(Figure: YOLO pipeline: input image (448×448) → CNN backbone feature extraction → 7×7 grid division → per-cell box + class prediction → NMS to remove duplicates → final detection results.)
YOLO Design Philosophy
- Speed-focused: Aims for real-time inference (45+ FPS)
- Global context: Better context understanding by looking at the entire image at once
- Simple: No complex pipeline, end-to-end learning
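In the original YOLO, the whole prediction is a single S × S × (B·5 + C) tensor, which is why detection reduces to one regression pass; a short sketch with the paper's values (S=7, B=2, C=20 for PASCAL VOC):
S, B, C = 7, 2, 20                 # grid size, boxes per cell, classes (original YOLO on PASCAL VOC)
per_cell = B * 5 + C               # each box predicts (x, y, w, h, confidence); plus C class probabilities
print(f"Output tensor: {S} x {S} x {per_cell}")      # 7 x 7 x 30
print(f"Boxes predicted before NMS: {S * S * B}")    # 98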
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
# Using YOLOv5 (Ultralytics implementation)
def load_yolov5(model_size='yolov5s', pretrained=True):
"""
Load YOLOv5 model
Args:
model_size: Model size ('yolov5n', 'yolov5s', 'yolov5m', 'yolov5l', 'yolov5x')
pretrained: Use COCO pre-trained weights
Returns:
model: YOLOv5 model
"""
# Load from PyTorch Hub (Ultralytics implementation)
model = torch.hub.load('ultralytics/yolov5', model_size, pretrained=pretrained)
return model
# Load model
model = load_yolov5('yolov5s', pretrained=True)
model.eval()
print("YOLOv5s model information:")
print(f"- Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"- Inference speed: ~140 FPS (GPU)")
print(f"- Input size: 640Γ640 (default)")
def run_yolo_inference(model, image_path, conf_threshold=0.25, iou_threshold=0.45):
"""
Object detection inference with YOLOv5
Args:
model: YOLOv5 model
image_path: Input image path
conf_threshold: Confidence score threshold
iou_threshold: NMS IoU threshold
Returns:
results: Detection results (pandas DataFrame)
"""
# Inference settings
model.conf = conf_threshold
model.iou = iou_threshold
# Run inference
results = model(image_path)
# Display results
results.print() # Print to console
# Visualize results
results.show() # Display image
# Get results as DataFrame
detections = results.pandas().xyxy[0]
print(f"\nNumber of detected objects: {len(detections)}")
print(detections)
return results
# Usage example
# results = run_yolo_inference(model, 'test_image.jpg', conf_threshold=0.5)
# Batch inference
def run_yolo_batch_inference(model, image_paths, save_dir='results/'):
"""
Batch inference for multiple images
Args:
model: YOLOv5 model
image_paths: List of image paths
save_dir: Directory to save results
"""
import os
os.makedirs(save_dir, exist_ok=True)
# Batch inference
results = model(image_paths)
# Save results
results.save(save_dir=save_dir)
print(f"Batch inference complete: {len(image_paths)} images")
print(f"Results saved to: {save_dir}")
return results
# Usage example
# image_list = ['img1.jpg', 'img2.jpg', 'img3.jpg']
# batch_results = run_yolo_batch_inference(model, image_list)
YOLO Loss Function
YOLO combines three losses for training:
$$ \mathcal{L}_{\text{YOLO}} = \lambda_{\text{box}} \mathcal{L}_{\text{box}} + \lambda_{\text{obj}} \mathcal{L}_{\text{obj}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}} $$
- $\mathcal{L}_{\text{box}}$: Bounding box coordinate regression loss (CIoU Loss)
- $\mathcal{L}_{\text{obj}}$: Objectness binary classification loss
- $\mathcal{L}_{\text{cls}}$: Multi-class classification loss
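A minimal sketch of how the three terms are combined is shown below; the weights are illustrative, plain (1 − IoU) stands in for the CIoU box loss, and real YOLO implementations add anchor matching and per-scale balancing:
import torch
import torch.nn.functional as F

# Illustrative loss weights (YOLOv5's defaults differ)
lambda_box, lambda_obj, lambda_cls = 0.05, 1.0, 0.5

def yolo_style_loss(pred_iou, pred_obj_logits, pred_cls_logits, target_obj, target_cls):
    """pred_iou: IoU of matched boxes with their targets [N]
    pred_obj_logits / target_obj: objectness logits and targets for all anchors [M]
    pred_cls_logits / target_cls: class logits and one-hot targets for matched boxes [N, C]"""
    box_loss = (1.0 - pred_iou).mean()                                    # stand-in for CIoU loss
    obj_loss = F.binary_cross_entropy_with_logits(pred_obj_logits, target_obj)
    cls_loss = F.binary_cross_entropy_with_logits(pred_cls_logits, target_cls)
    return lambda_box * box_loss + lambda_obj * obj_loss + lambda_cls * cls_loss

# Dummy tensors just to show the call
loss = yolo_style_loss(torch.rand(5), torch.randn(100), torch.randn(5, 80),
                       torch.rand(100), torch.randint(0, 2, (5, 80)).float())
print(f"total loss: {loss.item():.4f}")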
SSD (Single Shot Detector)
SSD performs detection on feature maps at different scales, balancing speed and accuracy.
SSD features:
- Multi-scale feature maps (detection at multiple resolutions)
- Uses default boxes (equivalent to anchors) on each feature map (counted in the sketch after this list)
- More accurate than YOLO, faster than Faster R-CNN
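The effect of multi-scale detection is easy to see by counting SSD300's default boxes; a short sketch using the feature-map sizes and boxes-per-location from the original paper:
# SSD300 predicts from six feature maps of decreasing resolution
feature_map_sizes = [38, 19, 10, 5, 3, 1]      # spatial size of each detection layer
boxes_per_location = [4, 6, 6, 6, 4, 4]        # default boxes placed at every location
total = sum(s * s * b for s, b in zip(feature_map_sizes, boxes_per_location))
print(f"Total default boxes: {total}")          # 8732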
# Requirements:
# - Python 3.9+
# - pillow>=10.0.0
# - torchvision>=0.15.0
from torchvision.models.detection import ssd300_vgg16
def create_ssd_model(num_classes=91, pretrained=True):
"""
Create SSD300 model
Args:
num_classes: Number of detection classes
pretrained: Use pre-trained weights
Returns:
model: SSD300 model
"""
# SSD300 with VGG16 backbone
model = ssd300_vgg16(pretrained=pretrained, num_classes=num_classes)
return model
# Load model
ssd_model = create_ssd_model(num_classes=91, pretrained=True)
ssd_model.eval()
print("SSD300 model information:")
print(f"- Input size: 300Γ300")
print(f"- Backbone: VGG16")
print(f"- Feature maps: 6 layers (different scales)")
def run_ssd_inference(model, image_path, threshold=0.5):
"""
Object detection inference with SSD
"""
from PIL import Image
from torchvision import transforms
# Load and preprocess image
img = Image.open(image_path).convert('RGB')
transform = transforms.Compose([transforms.ToTensor()])
img_tensor = transform(img).unsqueeze(0)
# Inference
model.eval()
with torch.no_grad():
predictions = model(img_tensor)
# Extract results
pred = predictions[0]
keep = pred['scores'] > threshold
boxes = pred['boxes'][keep].cpu().numpy()
labels = pred['labels'][keep].cpu().numpy()
scores = pred['scores'][keep].cpu().numpy()
print(f"\nNumber of detected objects: {len(boxes)}")
for i, (box, label, score) in enumerate(zip(boxes, labels, scores)):
print(f" {i+1}. Label: {label}, Score: {score:.3f}")
return boxes, labels, scores
# Usage example
# boxes, labels, scores = run_ssd_inference(ssd_model, 'test_image.jpg', threshold=0.6)
5.4 Evaluation Metrics
Precision and Recall
Object detection evaluation uses metrics similar to information retrieval.
$$ \text{Precision} = \frac{TP}{TP + FP} \quad \text{(Detection accuracy)} $$
$$ \text{Recall} = \frac{TP}{TP + FN} \quad \text{(Detection coverage)} $$
- TP (True Positive): Correctly detected objects (IoU ≥ threshold)
- FP (False Positive): False detections (IoU < threshold or background misclassified as object)
- FN (False Negative): Missed detections (existing objects not detected)
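As a quick numeric example (hypothetical counts): suppose a detector outputs 8 boxes on an image containing 10 objects, and 6 of those boxes match a ground-truth object at IoU ≥ 0.5:
tp, fp, fn = 6, 2, 4            # 6 correct detections, 2 false detections, 4 missed objects
precision = tp / (tp + fp)      # 6 / 8  = 0.75: how many detections are correct
recall = tp / (tp + fn)         # 6 / 10 = 0.60: how many objects were found
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")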
NMS (Non-Maximum Suppression)
NMS is an algorithm that removes overlapping detection results, keeping only one box per object.
(Figure: NMS loop: sort boxes by score → select the highest-scoring box → remove overlapping boxes with IoU > threshold → repeat while boxes remain → final detection results.)
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
def non_max_suppression(boxes, scores, iou_threshold=0.5):
"""
Non-Maximum Suppression (NMS) implementation
Args:
boxes: Bounding box coordinates [[x_min, y_min, x_max, y_max], ...] (numpy array)
scores: Confidence scores [0.9, 0.8, ...]
iou_threshold: IoU threshold (boxes with more overlap are removed)
Returns:
keep_indices: Indices of boxes to keep
"""
import numpy as np
# Sort by descending score
sorted_indices = np.argsort(scores)[::-1]
keep_indices = []
while len(sorted_indices) > 0:
# Keep highest score box
current = sorted_indices[0]
keep_indices.append(current)
if len(sorted_indices) == 1:
break
# Calculate IoU with remaining boxes
current_box = boxes[current]
remaining_boxes = boxes[sorted_indices[1:]]
ious = np.array([calculate_iou(current_box, box) for box in remaining_boxes])
# Keep only boxes below IoU threshold
keep_mask = ious < iou_threshold
sorted_indices = sorted_indices[1:][keep_mask]
return np.array(keep_indices)
# Usage example
boxes = np.array([
[50, 50, 150, 150],
[55, 55, 155, 155], # Large overlap with first box
[200, 200, 300, 300],
[205, 205, 305, 305] # Large overlap with third box
])
scores = np.array([0.9, 0.85, 0.95, 0.88])
keep_indices = non_max_suppression(boxes, scores, iou_threshold=0.5)
print(f"Original number of boxes: {len(boxes)}")
print(f"Number of boxes after NMS: {len(keep_indices)}")
print(f"Kept indices: {keep_indices}")
print(f"Kept boxes:\n{boxes[keep_indices]}")
# PyTorch official NMS implementation (faster)
from torchvision.ops import nms
def nms_torch(boxes, scores, iou_threshold=0.5):
"""
PyTorch NMS (fast C++ implementation)
Args:
boxes: Tensor of shape [N, 4]
scores: Tensor of shape [N]
iou_threshold: IoU threshold
Returns:
keep: Indices of boxes to keep (Tensor)
"""
keep = nms(boxes, scores, iou_threshold)
return keep
# Usage example
boxes_tensor = torch.tensor(boxes, dtype=torch.float32)
scores_tensor = torch.tensor(scores, dtype=torch.float32)
keep_torch = nms_torch(boxes_tensor, scores_tensor, iou_threshold=0.5)
print(f"\nPyTorch NMS result: {keep_torch}")
mAP (mean Average Precision)
mAP is the standard evaluation metric for object detection, representing the average precision across all classes.
Calculation Steps
- For each class, draw a Precision-Recall curve
- Calculate AP (Average Precision) as the area under the curve
- Average AP across all classes to obtain mAP
$$ \text{AP} = \int_0^1 P(r) \, dr $$
$$ \text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c $$
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
def calculate_precision_recall(pred_boxes, pred_scores, true_boxes, iou_threshold=0.5):
"""
Calculate values for Precision-Recall curve
Args:
pred_boxes: Predicted boxes [N, 4]
pred_scores: Predicted scores [N]
true_boxes: Ground truth boxes [M, 4]
iou_threshold: IoU threshold
Returns:
precisions, recalls: Lists of Precision-Recall values
"""
import numpy as np
# Sort by descending score
sorted_indices = np.argsort(pred_scores)[::-1]
pred_boxes = pred_boxes[sorted_indices]
pred_scores = pred_scores[sorted_indices]
num_true = len(true_boxes)
matched_true = np.zeros(num_true, dtype=bool)
tp = np.zeros(len(pred_boxes))
fp = np.zeros(len(pred_boxes))
for i, pred_box in enumerate(pred_boxes):
# Calculate maximum IoU with ground truth boxes
if len(true_boxes) == 0:
fp[i] = 1
continue
ious = np.array([calculate_iou(pred_box, true_box) for true_box in true_boxes])
max_iou_idx = np.argmax(ious)
max_iou = ious[max_iou_idx]
# TP if exceeds IoU threshold and not yet matched
if max_iou >= iou_threshold and not matched_true[max_iou_idx]:
tp[i] = 1
matched_true[max_iou_idx] = True
else:
fp[i] = 1
# Cumulative sum
tp_cumsum = np.cumsum(tp)
fp_cumsum = np.cumsum(fp)
# Precision and Recall
recalls = tp_cumsum / num_true if num_true > 0 else np.zeros_like(tp_cumsum)
precisions = tp_cumsum / (tp_cumsum + fp_cumsum + 1e-10)
return precisions, recalls
def calculate_ap(precisions, recalls):
"""
Calculate Average Precision (AP) using 11-point interpolation
Args:
precisions: List of precision values
recalls: List of recall values
Returns:
ap: Average Precision
"""
import numpy as np
# 11-point interpolation
ap = 0.0
for t in np.linspace(0, 1, 11):
# Maximum precision at recall β₯ t
if np.sum(recalls >= t) == 0:
p = 0
else:
p = np.max(precisions[recalls >= t])
ap += p / 11
return ap
# Usage example
pred_boxes = np.array([
[50, 50, 150, 150],
[55, 55, 155, 155],
[200, 200, 300, 300]
])
pred_scores = np.array([0.9, 0.7, 0.85])
true_boxes = np.array([
[52, 52, 152, 152],
[205, 205, 305, 305]
])
precisions, recalls = calculate_precision_recall(
pred_boxes, pred_scores, true_boxes, iou_threshold=0.5
)
ap = calculate_ap(precisions, recalls)
print(f"Average Precision: {ap:.4f}")
# Visualize Precision-Recall curve
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(recalls, precisions, marker='o', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(True, alpha=0.3)
plt.xlim([0, 1.05])
plt.ylim([0, 1.05])
plt.fill_between(recalls, precisions, alpha=0.2)
plt.text(0.5, 0.5, f'AP = {ap:.4f}', fontsize=14, bbox=dict(facecolor='white', alpha=0.8))
plt.tight_layout()
plt.show()
def calculate_map(all_precisions, all_recalls, num_classes):
"""
Calculate mean Average Precision (mAP)
Args:
all_precisions: List of precision lists per class [[p1, p2, ...], ...]
all_recalls: List of recall lists per class [[r1, r2, ...], ...]
num_classes: Number of classes
Returns:
mAP: mean Average Precision
"""
aps = []
for i in range(num_classes):
ap = calculate_ap(all_precisions[i], all_recalls[i])
aps.append(ap)
print(f"Class {i}: AP = {ap:.4f}")
mAP = np.mean(aps)
print(f"\nmAP: {mAP:.4f}")
return mAP
COCO mAP: The COCO dataset uses a stricter evaluation by calculating AP at multiple IoU thresholds (0.5, 0.55, ..., 0.95) and averaging them.
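A minimal sketch of this averaging for a single class, reusing calculate_precision_recall and calculate_ap from above (real COCO evaluation uses 101-point interpolation and pycocotools, so the numbers will differ):
# Average AP over IoU thresholds 0.50, 0.55, ..., 0.95
iou_thresholds = np.arange(0.5, 1.0, 0.05)
aps_per_threshold = []
for t in iou_thresholds:
    p, r = calculate_precision_recall(pred_boxes, pred_scores, true_boxes, iou_threshold=t)
    aps_per_threshold.append(calculate_ap(p, r))
print(f"AP@[0.50:0.95]: {np.mean(aps_per_threshold):.4f}")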
5.5 Object Detection with PyTorch
Using torchvision.models.detection
PyTorch's torchvision provides a rich collection of pre-trained object detection models.
# Requirements:
# - Python 3.9+
# - pillow>=10.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
import torch
import torchvision
from torchvision.models.detection import (
fasterrcnn_resnet50_fpn,
fasterrcnn_mobilenet_v3_large_fpn,
retinanet_resnet50_fpn,
ssd300_vgg16
)
def compare_detection_models():
"""
Compare various object detection models
"""
models_info = {
'Faster R-CNN (ResNet-50)': {
'model': fasterrcnn_resnet50_fpn,
'type': 'Two-Stage',
'speed': 'Slow',
'accuracy': 'High'
},
'Faster R-CNN (MobileNetV3)': {
'model': fasterrcnn_mobilenet_v3_large_fpn,
'type': 'Two-Stage',
'speed': 'Medium',
'accuracy': 'Medium'
},
'RetinaNet (ResNet-50)': {
'model': retinanet_resnet50_fpn,
'type': 'One-Stage',
'speed': 'Medium',
'accuracy': 'High'
},
'SSD300 (VGG16)': {
'model': ssd300_vgg16,
'type': 'One-Stage',
'speed': 'Fast',
'accuracy': 'Medium'
}
}
print("Object Detection Model Comparison:")
print("-" * 80)
for name, info in models_info.items():
print(f"{name:35s} | Type: {info['type']:10s} | "
f"Speed: {info['speed']:6s} | Accuracy: {info['accuracy']:6s}")
print("-" * 80)
compare_detection_models()
# Fine-tuning on custom dataset
from torch.utils.data import Dataset, DataLoader
import json
class CustomDetectionDataset(Dataset):
"""
Custom object detection dataset (COCO format)
"""
def __init__(self, image_dir, annotation_file, transforms=None):
"""
Args:
image_dir: Image directory path
annotation_file: Annotation file in COCO format
transforms: Data augmentation
"""
self.image_dir = image_dir
self.transforms = transforms
# Load annotations
with open(annotation_file, 'r') as f:
self.coco_data = json.load(f)
self.images = self.coco_data['images']
self.annotations = self.coco_data['annotations']
# Group annotations by image ID
self.image_to_annotations = {}
for ann in self.annotations:
image_id = ann['image_id']
if image_id not in self.image_to_annotations:
self.image_to_annotations[image_id] = []
self.image_to_annotations[image_id].append(ann)
def __len__(self):
return len(self.images)
def __getitem__(self, idx):
# Image information
img_info = self.images[idx]
image_id = img_info['id']
img_path = f"{self.image_dir}/{img_info['file_name']}"
# Load image
from PIL import Image
img = Image.open(img_path).convert('RGB')
# Get annotations
anns = self.image_to_annotations.get(image_id, [])
boxes = []
labels = []
for ann in anns:
# COCO format: [x, y, width, height] → [x_min, y_min, x_max, y_max]
x, y, w, h = ann['bbox']
boxes.append([x, y, x + w, y + h])
labels.append(ann['category_id'])
# Convert to tensors
boxes = torch.as_tensor(boxes, dtype=torch.float32)
labels = torch.as_tensor(labels, dtype=torch.int64)
target = {
'boxes': boxes,
'labels': labels,
'image_id': torch.tensor([image_id])
}
# Data augmentation
if self.transforms:
img = self.transforms(img)
return img, target
# Dataset usage example
# dataset = CustomDetectionDataset(
# image_dir='data/images',
# annotation_file='data/annotations.json',
# transforms=torchvision.transforms.ToTensor()
# )
#
# data_loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=lambda x: tuple(zip(*x)))
Training Loop Implementation
def train_detection_model(model, data_loader, optimizer, device, epoch):
"""
Train object detection model (one epoch)
Args:
model: Object detection model
data_loader: Data loader
optimizer: Optimizer
device: Execution device
epoch: Current epoch number
"""
model.train()
total_loss = 0
for batch_idx, (images, targets) in enumerate(data_loader):
# Transfer data to device
images = [img.to(device) for img in images]
targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
# Forward pass (torchvision models return loss during training)
loss_dict = model(images, targets)
# Sum all losses
losses = sum(loss for loss in loss_dict.values())
# Backward pass
optimizer.zero_grad()
losses.backward()
optimizer.step()
total_loss += losses.item()
# Progress display
if batch_idx % 10 == 0:
print(f'Epoch {epoch}, Batch {batch_idx}/{len(data_loader)}, '
f'Loss: {losses.item():.4f}')
print(f' Details: {", ".join([f"{k}: {v.item():.4f}" for k, v in loss_dict.items()])}')
avg_loss = total_loss / len(data_loader)
print(f'Epoch {epoch} - Average Loss: {avg_loss:.4f}\n')
return avg_loss
def evaluate_detection_model(model, data_loader, device):
"""
Evaluate object detection model
Args:
model: Object detection model
data_loader: Data loader
device: Execution device
Returns:
metrics: Dictionary of evaluation metrics
"""
model.eval()
all_predictions = []
all_targets = []
with torch.no_grad():
for images, targets in data_loader:
images = [img.to(device) for img in images]
# Inference
predictions = model(images)
all_predictions.extend([{k: v.cpu() for k, v in p.items()} for p in predictions])
all_targets.extend([{k: v.cpu() for k, v in t.items()} for t in targets])
# Calculate evaluation metrics (simplified version)
print("Evaluation results:")
print(f" Total samples: {len(all_predictions)}")
# Average number of detections
avg_detections = sum(len(p['boxes']) for p in all_predictions) / len(all_predictions)
print(f" Average detections: {avg_detections:.2f}")
return {'avg_detections': avg_detections}
# Training execution example
def full_training_pipeline(num_epochs=10):
"""
Complete training pipeline
"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Create model
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.to(device)
# Optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
# Learning rate scheduler
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
# Training loop
for epoch in range(1, num_epochs + 1):
# Training
# train_loss = train_detection_model(model, train_loader, optimizer, device, epoch)
# Evaluation
# metrics = evaluate_detection_model(model, val_loader, device)
# Update learning rate
lr_scheduler.step()
# Save model
# torch.save(model.state_dict(), f'detection_model_epoch_{epoch}.pth')
print(f"Epoch {epoch} completed.\n")
# Usage example
# full_training_pipeline(num_epochs=10)
5.6 Practice: Detection with COCO Format Data
COCO Dataset Overview
COCO (Common Objects in Context) is the standard benchmark dataset for object detection.
| Item | Details |
|---|---|
| Number of images | Training: 118K, Validation: 5K, Test: 41K |
| Number of classes | 80 classes (person, car, dog, etc.) |
| Annotations | Bounding boxes, segmentation, keypoints |
| Evaluation metric | mAP @ IoU=[0.50:0.05:0.95] |
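For reference, a minimal COCO-format annotation file contains three top-level lists; the values below are illustrative, and these are exactly the keys that the CustomDetectionDataset class in Section 5.5 reads:
coco_annotation_example = {
    "images": [
        {"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels, with (x, y) at the top-left corner
        {"id": 100, "image_id": 1, "category_id": 1,
         "bbox": [50, 50, 100, 250], "area": 25000, "iscrowd": 0}
    ],
    "categories": [
        {"id": 1, "name": "person"}
    ]
}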
Complete Object Detection Pipeline
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pillow>=10.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches
class ObjectDetectionPipeline:
"""
Complete object detection pipeline
"""
def __init__(self, num_classes, pretrained=True, device=None):
"""
Args:
num_classes: Number of detection classes (including background)
pretrained: Use pre-trained weights
device: Execution device
"""
self.device = device or torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.num_classes = num_classes
# Build model
self.model = self._build_model(pretrained)
self.model.to(self.device)
print(f"Object detection pipeline initialization complete")
print(f" Device: {self.device}")
print(f" Number of classes: {num_classes}")
def _build_model(self, pretrained):
"""Build model"""
model = fasterrcnn_resnet50_fpn(pretrained=pretrained)
# Replace final layer
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, self.num_classes)
return model
def predict(self, image_path, conf_threshold=0.5, nms_threshold=0.5):
"""
Detect objects from image
Args:
image_path: Input image path
conf_threshold: Confidence score threshold
nms_threshold: NMS IoU threshold
Returns:
detections: Detection results dictionary
"""
# Load image
img = Image.open(image_path).convert('RGB')
img_tensor = torchvision.transforms.ToTensor()(img).unsqueeze(0).to(self.device)
# Inference
self.model.eval()
with torch.no_grad():
predictions = self.model(img_tensor)
# Post-processing
pred = predictions[0]
# NMS (torchvision models run NMS internally, but can apply additionally)
keep = torchvision.ops.nms(pred['boxes'], pred['scores'], nms_threshold)
# Threshold filtering
keep = keep[pred['scores'][keep] > conf_threshold]
detections = {
'boxes': pred['boxes'][keep].cpu().numpy(),
'labels': pred['labels'][keep].cpu().numpy(),
'scores': pred['scores'][keep].cpu().numpy()
}
return detections, img
def visualize(self, image, detections, class_names, save_path=None):
"""
Visualize detection results
Args:
image: PIL Image
detections: Return value from predict()
class_names: List of class names
save_path: Save path (display only if None)
"""
fig, ax = plt.subplots(1, figsize=(12, 8))
ax.imshow(image)
colors = plt.cm.hsv(np.linspace(0, 1, len(class_names))).tolist()
for box, label, score in zip(detections['boxes'], detections['labels'], detections['scores']):
x_min, y_min, x_max, y_max = box
width = x_max - x_min
height = y_max - y_min
color = colors[label % len(colors)]
# Bounding box
rect = patches.Rectangle(
(x_min, y_min), width, height,
linewidth=2, edgecolor=color, facecolor='none'
)
ax.add_patch(rect)
# Label
label_text = f'{class_names[label]}: {score:.2f}'
ax.text(
x_min, y_min - 5,
label_text,
bbox=dict(facecolor=color, alpha=0.7),
fontsize=10, color='white', weight='bold'
)
ax.axis('off')
plt.tight_layout()
if save_path:
plt.savefig(save_path, dpi=150, bbox_inches='tight')
print(f"Results saved: {save_path}")
else:
plt.show()
def batch_predict(self, image_paths, conf_threshold=0.5):
"""
Batch inference
Args:
image_paths: List of image paths
conf_threshold: Confidence threshold
Returns:
all_detections: List of detection results per image
"""
all_detections = []
for img_path in image_paths:
detections, img = self.predict(img_path, conf_threshold)
all_detections.append({
'path': img_path,
'detections': detections,
'image': img
})
return all_detections
def evaluate_coco(self, data_loader, coco_gt):
"""
COCO format evaluation
Args:
data_loader: Data loader
coco_gt: COCO ground truth annotations
Returns:
metrics: Evaluation metrics
"""
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
self.model.eval()
coco_results = []
with torch.no_grad():
for images, targets in data_loader:
images = [img.to(self.device) for img in images]
predictions = self.model(images)
# Convert to COCO format
for target, pred in zip(targets, predictions):
image_id = target['image_id'].item()
for box, label, score in zip(pred['boxes'], pred['labels'], pred['scores']):
x_min, y_min, x_max, y_max = box.tolist()
coco_results.append({
'image_id': image_id,
'category_id': label.item(),
'bbox': [x_min, y_min, x_max - x_min, y_max - y_min],
'score': score.item()
})
# COCO evaluation
coco_dt = coco_gt.loadRes(coco_results)
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
metrics = {
'mAP': coco_eval.stats[0],
'mAP_50': coco_eval.stats[1],
'mAP_75': coco_eval.stats[2]
}
return metrics
# Usage example
if __name__ == '__main__':
# COCO class names (abbreviated)
COCO_CLASSES = [
'__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag'
# ... all 91 classes
]
# Initialize pipeline
pipeline = ObjectDetectionPipeline(num_classes=91, pretrained=True)
# Single image inference
# detections, img = pipeline.predict('test_image.jpg', conf_threshold=0.7)
# pipeline.visualize(img, detections, COCO_CLASSES, save_path='result.jpg')
# Batch inference
# image_list = ['img1.jpg', 'img2.jpg', 'img3.jpg']
# results = pipeline.batch_predict(image_list, conf_threshold=0.6)
print("Object detection pipeline ready")
Chapter Summary
What We Learned
Object Detection Fundamentals
- Differences between classification, detection, and segmentation
- How to calculate bounding boxes and IoU
- Challenges and evaluation metrics for object detection
Two-Stage Detectors
- Evolution of R-CNN, Fast R-CNN, Faster R-CNN
- How Region Proposal Networks work
- Accuracy-focused approaches
One-Stage Detectors
- Design philosophy of YOLO and SSD
- Trade-offs between speed and accuracy
- Achieving real-time detection
Evaluation Metrics
- Implementation of NMS (Non-Maximum Suppression)
- Precision-Recall curves and AP
- Calculating mAP (mean Average Precision)
Implementation Skills
- Object detection with PyTorch torchvision
- Building training and evaluation pipelines
- Handling COCO format data
Model Selection Guide
| Requirement | Recommended Model | Reason |
|---|---|---|
| Highest accuracy | Faster R-CNN (ResNet-101) | Precise detection with two-stage |
| Real-time | YOLOv5s / YOLOv8 | 140+ FPS, lightweight |
| Balanced | YOLOv5m / RetinaNet | Balance of speed and accuracy |
| Edge devices | MobileNet SSD | Low computation, memory efficient |
| Small object detection | Faster R-CNN + FPN | Multi-scale feature extraction |
Exercises
Problem 1 (Difficulty: medium)
Implement an IoU calculation function in NumPy and verify with the following test cases:
- Box1: [0, 0, 100, 100], Box2: [50, 50, 150, 150] → IoU ≈ 0.143
- Box1: [0, 0, 100, 100], Box2: [0, 0, 100, 100] → IoU = 1.0
- Box1: [0, 0, 50, 50], Box2: [60, 60, 100, 100] → IoU = 0.0
Solution Example
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
def calculate_iou_numpy(box1, box2):
"""IoU calculation with NumPy"""
# Intersection region
x_min_inter = max(box1[0], box2[0])
y_min_inter = max(box1[1], box2[1])
x_max_inter = min(box1[2], box2[2])
y_max_inter = min(box1[3], box2[3])
inter_area = max(0, x_max_inter - x_min_inter) * max(0, y_max_inter - y_min_inter)
# Area of each box
box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
# IoU
union_area = box1_area + box2_area - inter_area
iou = inter_area / union_area if union_area > 0 else 0
return iou
# Test
test_cases = [
([0, 0, 100, 100], [50, 50, 150, 150], 0.143),
([0, 0, 100, 100], [0, 0, 100, 100], 1.0),
([0, 0, 50, 50], [60, 60, 100, 100], 0.0)
]
for box1, box2, expected in test_cases:
iou = calculate_iou_numpy(box1, box2)
print(f"Box1: {box1}, Box2: {box2}")
print(f" Calculated IoU: {iou:.4f}, Expected: {expected:.4f}, Match: {abs(iou - expected) < 0.001}")
Problem 2 (Difficulty: hard)
Implement the NMS (Non-Maximum Suppression) algorithm from scratch and test with the following data:
boxes = [[50, 50, 150, 150], [55, 55, 155, 155], [200, 200, 300, 300], [205, 205, 305, 305]]
scores = [0.9, 0.85, 0.95, 0.88]
iou_threshold = 0.5
Expected output: Indices [2, 0] (in score order) are kept
Hint
- Sort by descending score
- Keep highest score box and remove overlapping boxes
- Repeat until all boxes are processed
Problem 3 (Difficulty: medium)
Use Faster R-CNN to perform object detection on a custom image and visualize the results. Display the class names and scores of detected objects.
Solution Example
# Requirements:
# - Python 3.9+
# - pillow>=10.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
"""
Example: Use Faster R-CNN to perform object detection on a custom image
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: torch, torchvision, pillow
"""
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from PIL import Image
import torchvision.transforms as T
# Load model
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()
# Load image
img = Image.open('your_image.jpg').convert('RGB')
img_tensor = T.ToTensor()(img).unsqueeze(0)
# Inference
with torch.no_grad():
predictions = model(img_tensor)
# Display results (COCO_CLASSES is the COCO class-name list defined earlier in this chapter)
pred = predictions[0]
for i, (box, label, score) in enumerate(zip(pred['boxes'], pred['labels'], pred['scores'])):
if score > 0.5:
print(f"Detection {i+1}: Class={COCO_CLASSES[label]}, Score={score:.3f}, Box={box.tolist()}")
# Visualization
# Use visualize_bounding_boxes function
Problem 4 (Difficulty: hard)
Create a script to perform real-time object detection from a video file (or webcam) using YOLOv5. Display detection results frame by frame and measure FPS.
Hint
- Load video with OpenCV (cv2.VideoCapture)
- Run YOLOv5 inference on each frame
- Measure FPS with time.time()
- Display results with cv2.imshow()
References
- Girshick, R., et al. (2014). "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR.
- Girshick, R. (2015). "Fast R-CNN." ICCV.
- Ren, S., et al. (2016). "Faster R-CNN: Towards real-time object detection with region proposal networks." TPAMI.
- Redmon, J., et al. (2016). "You only look once: Unified, real-time object detection." CVPR.
- Liu, W., et al. (2016). "SSD: Single shot multibox detector." ECCV.
- Lin, T.-Y., et al. (2014). "Microsoft COCO: Common objects in context." ECCV.
- Lin, T.-Y., et al. (2017). "Focal loss for dense object detection." ICCV. (RetinaNet)
- Jocher, G., et al. (2022). "YOLOv5: State-of-the-art object detection." Ultralytics.