This chapter introduces the basics of object detection. You will learn how detection differs from image classification, the design philosophies of two-stage and one-stage detectors, and how to compute and interpret evaluation metrics such as IoU.
Learning Objectives
By reading this chapter, you will be able to:
- Understand the differences between image classification and object detection and define appropriate tasks
- Explain the architecture and evolution of two-stage detectors (R-CNN family)
- Understand the design philosophy and advantages of one-stage detectors (YOLO, SSD)
- Implement and interpret evaluation metrics such as IoU, NMS, and mAP
- Implement and perform inference with object detection models using PyTorch
- Achieve practical object detection with COCO-format datasets
5.1 What is Object Detection
Types of Image Recognition Tasks
Image recognition tasks in computer vision are primarily classified into three categories based on their objectives.
(Figure: image recognition tasks branch into Classification ("What is this image?": class label only), Object Detection ("What and where?": location + class label), and Segmentation ("Which pixel is what?": pixel-level classification).)
| Task | Purpose | Output | Applications |
|---|---|---|---|
| Classification | Classify the entire image | Class label (e.g., "cat") | Image search, content filtering |
| Detection | Identify object location and class | Bounding Box + class label | Autonomous driving, surveillance, medical imaging |
| Segmentation | Pixel-level region division | Segmentation mask | Background removal, 3D reconstruction, medical diagnosis |
Basic Concepts of Object Detection
Bounding Box
A bounding box is a rectangular region that encloses a detected object, containing the following information:
- Coordinate representation: $(x, y, w, h)$ or $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ (see the conversion sketch after this list)
- Class label: Object category (e.g., "person", "car")
- Confidence score: Detection confidence $[0, 1]$
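The two coordinate conventions convert into each other with simple arithmetic; here is a minimal sketch (the helper names are illustrative, and $(x, y)$ is taken as the top-left corner as in the COCO format, whereas YOLO uses the box center):
def xywh_to_xyxy(box):
    """(x_min, y_min, width, height) -> (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

def xyxy_to_xywh(box):
    """(x_min, y_min, x_max, y_max) -> (x_min, y_min, width, height)."""
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]

print(xywh_to_xyxy([50, 50, 100, 250]))  # [50, 50, 150, 300]
print(xyxy_to_xywh([50, 50, 150, 300]))  # [50, 50, 100, 250]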
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pillow>=10.0.0
# - torch>=2.0.0, <2.3.0
import torch
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
import numpy as np
def visualize_bounding_boxes(image_path, boxes, labels, scores, class_names):
"""
Visualize Bounding Boxes
Args:
image_path: Image file path
boxes: Bounding box coordinates [[x_min, y_min, x_max, y_max], ...]
labels: Class labels [0, 1, 2, ...]
scores: Confidence scores [0.95, 0.87, ...]
class_names: List of class names ['person', 'car', ...]
"""
# Load image
img = Image.open(image_path)
fig, ax = plt.subplots(1, figsize=(12, 8))
ax.imshow(img)
# Draw each bounding box
colors = ['red', 'blue', 'green', 'yellow', 'purple', 'orange']
for box, label, score in zip(boxes, labels, scores):
x_min, y_min, x_max, y_max = box
width = x_max - x_min
height = y_max - y_min
# Draw rectangle
color = colors[label % len(colors)]
rect = patches.Rectangle(
(x_min, y_min), width, height,
linewidth=2, edgecolor=color, facecolor='none'
)
ax.add_patch(rect)
# Display label and score
label_text = f'{class_names[label]}: {score:.2f}'
ax.text(
x_min, y_min - 5,
label_text,
bbox=dict(facecolor=color, alpha=0.7),
fontsize=10, color='white'
)
ax.axis('off')
plt.tight_layout()
plt.show()
# Usage example
# boxes = [[50, 50, 200, 300], [250, 100, 400, 350]]
# labels = [0, 1] # 0: person, 1: car
# scores = [0.95, 0.87]
# class_names = ['person', 'car', 'dog', 'cat']
# visualize_bounding_boxes('sample.jpg', boxes, labels, scores, class_names)
IoU (Intersection over Union)
IoU is a metric that measures the overlap between two bounding boxes and is essential for evaluating object detection.
$$ \text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \frac{|A \cap B|}{|A \cup B|} $$
(Figure: the predicted box and the ground-truth box form an overlap (intersection) region and a union region; IoU = Intersection / Union.)
def calculate_iou(box1, box2):
"""
Calculate IoU between two bounding boxes
Args:
box1, box2: [x_min, y_min, x_max, y_max]
Returns:
iou: IoU value [0, 1]
"""
# Intersection region coordinates
x_min_inter = max(box1[0], box2[0])
y_min_inter = max(box1[1], box2[1])
x_max_inter = min(box1[2], box2[2])
y_max_inter = min(box1[3], box2[3])
# Intersection area
inter_width = max(0, x_max_inter - x_min_inter)
inter_height = max(0, y_max_inter - y_min_inter)
intersection = inter_width * inter_height
# Area of each box
area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
# Union area
union = area1 + area2 - intersection
# Calculate IoU (avoid division by zero)
iou = intersection / union if union > 0 else 0
return iou
# Usage examples and tests
box1 = [50, 50, 150, 150] # Ground truth box
box2 = [100, 100, 200, 200] # Predicted box (partial overlap)
box3 = [50, 50, 150, 150] # Predicted box (perfect match)
box4 = [200, 200, 300, 300] # Predicted box (no overlap)
print(f"Partial overlap IoU: {calculate_iou(box1, box2):.4f}") # ~0.14
print(f"Perfect match IoU: {calculate_iou(box1, box3):.4f}") # 1.00
print(f"No overlap IoU: {calculate_iou(box1, box4):.4f}") # 0.00
# Vectorized batch IoU calculation
def batch_iou(boxes1, boxes2):
"""
Efficiently calculate IoU between multiple bounding boxes (PyTorch version)
Args:
boxes1: Tensor of shape [N, 4]
boxes2: Tensor of shape [M, 4]
Returns:
iou: Tensor of shape [N, M]
"""
# Calculate intersection
x_min = torch.max(boxes1[:, None, 0], boxes2[:, 0])
y_min = torch.max(boxes1[:, None, 1], boxes2[:, 1])
x_max = torch.min(boxes1[:, None, 2], boxes2[:, 2])
y_max = torch.min(boxes1[:, None, 3], boxes2[:, 3])
inter_width = torch.clamp(x_max - x_min, min=0)
inter_height = torch.clamp(y_max - y_min, min=0)
intersection = inter_width * inter_height
# Calculate areas
area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
# Union and IoU
union = area1[:, None] + area2 - intersection
iou = intersection / union
return iou
# Usage example
boxes1 = torch.tensor([[50, 50, 150, 150], [100, 100, 200, 200]], dtype=torch.float32)
boxes2 = torch.tensor([[50, 50, 150, 150], [200, 200, 300, 300]], dtype=torch.float32)
iou_matrix = batch_iou(boxes1, boxes2)
print("\nBatch IoU Matrix:")
print(iou_matrix)
# Output:
# tensor([[1.0000, 0.0000],
# [0.1429, 0.0000]])
IoU criteria:
- IoU ≥ 0.5: Typically treated as a correct detection (PASCAL VOC standard)
- IoU ≥ 0.75: A stricter criterion (used in COCO evaluation)
- IoU < 0.5: Treated as a false positive
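Applying these criteria is a single comparison against the chosen threshold; the small sketch below reuses calculate_iou and the example boxes defined above:
iou_threshold = 0.5
for name, pred_box in [("partial overlap", box2), ("perfect match", box3), ("no overlap", box4)]:
    iou = calculate_iou(box1, pred_box)
    verdict = "correct detection (TP)" if iou >= iou_threshold else "false positive (FP)"
    print(f"{name}: IoU = {iou:.2f} -> {verdict}")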
5.2 Two-Stage Detectors
Evolution of R-CNN Family
Two-stage detectors perform object detection in two stages: (1) region proposal and (2) object classification.
(Figure: input image → Stage 1: region proposal → ~2000 candidate regions → Stage 2: classification → final detection results (box + class).)
R-CNN (2014)
R-CNN (Regions with CNN features) is a pioneering deep learning-based object detection approach.
| Step | Process | Features |
|---|---|---|
| 1. Region Proposal | Generate candidate regions with Selective Search (~2000) | Traditional image processing method |
| 2. CNN Feature Extraction | Extract features from each region with AlexNet | Requires 2000 forward passes |
| 3. SVM Classification | Classify with SVM | Trained separately from CNN |
| 4. Bounding Box Regression | Fine-tune box coordinates | Improves accuracy |
Problems:
- Very slow inference (47 seconds per image)
- Complex training (three separate learning stages)
- Many redundant feature extraction computations
Fast R-CNN (2015)
Fast R-CNN significantly improved R-CNN's computational efficiency.
(Figure: Fast R-CNN architecture: the CNN produces a feature map once; RoI Pooling combines it with the region proposals; shared FC layers feed a softmax classifier and a bounding-box regressor.)
Improvements:
- Run CNN only once on the entire image
- Extract fixed-size features from candidate regions with RoI Pooling (see the sketch after this list)
- End-to-end learning with multi-task loss (classification + box regression)
- Inference speed: 47 seconds → 2 seconds (roughly 23x speedup)
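RoI Pooling is available directly as torchvision.ops.roi_pool; below is a minimal sketch of how proposals of different sizes are mapped to fixed 7×7 features (the feature-map size, boxes, and spatial_scale are illustrative values):
import torch
from torchvision.ops import roi_pool

# Hypothetical stride-16 feature map of an 800x800 image
feature_map = torch.randn(1, 256, 50, 50)          # [B, C, H, W]
# Region proposals in image coordinates (x_min, y_min, x_max, y_max), one tensor per image
proposals = [torch.tensor([[64.0, 64.0, 320.0, 320.0],
                           [160.0, 96.0, 480.0, 400.0]])]
# spatial_scale maps image coordinates onto the feature map (50 / 800 = 1/16)
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=50 / 800)
print(pooled.shape)   # torch.Size([2, 256, 7, 7]) - one fixed-size feature per proposal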
Faster R-CNN (2016)
Faster R-CNN made region proposals learnable with CNNs, achieving complete end-to-end detection.
# Requirements:
# - Python 3.9+
# - pillow>=10.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
import torch
import torch.nn as nn
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
def create_faster_rcnn(num_classes, pretrained=True):
"""
Create Faster R-CNN model
Args:
num_classes: Number of detection classes (including background)
pretrained: Use COCO pre-trained weights
Returns:
model: Faster R-CNN model
"""
# Load COCO pre-trained model
model = fasterrcnn_resnet50_fpn(pretrained=pretrained)
# Replace the classifier head (note: this re-initializes the head, so it is mainly needed for fine-tuning;
# for direct COCO inference, the original pre-trained head can be kept as-is)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
return model
# Create model (torchvision's COCO-pretrained detectors use 91 class slots: background + 90 COCO category IDs)
model = create_faster_rcnn(num_classes=91, pretrained=True)
model.eval()
print("Faster R-CNN model structure:")
print(f"- Backbone: ResNet-50 + FPN")
print(f"- RPN: Region Proposal Network")
print(f"- RoI Heads: Box Head + Class Predictor")
# Run inference
def run_faster_rcnn_inference(model, image_path, threshold=0.5):
"""
Object detection inference with Faster R-CNN
Args:
model: Faster R-CNN model
image_path: Input image path
threshold: Detection score threshold
Returns:
boxes, labels, scores: Detection results
"""
from PIL import Image
from torchvision import transforms
# Load and preprocess image
img = Image.open(image_path).convert('RGB')
transform = transforms.Compose([transforms.ToTensor()])
img_tensor = transform(img).unsqueeze(0) # [1, 3, H, W]
# Inference
model.eval()
with torch.no_grad():
predictions = model(img_tensor)
# Extract results above threshold
pred = predictions[0]
keep = pred['scores'] > threshold
boxes = pred['boxes'][keep].cpu().numpy()
labels = pred['labels'][keep].cpu().numpy()
scores = pred['scores'][keep].cpu().numpy()
print(f"\nNumber of detected objects: {len(boxes)}")
for i, (box, label, score) in enumerate(zip(boxes, labels, scores)):
print(f" {i+1}. Label: {label}, Score: {score:.3f}, Box: {box}")
return boxes, labels, scores
# COCO class names (partial)
COCO_INSTANCE_CATEGORY_NAMES = [
'__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow'
# ... all 91 classes
]
# Usage example
# boxes, labels, scores = run_faster_rcnn_inference(model, 'test_image.jpg', threshold=0.7)
# visualize_bounding_boxes('test_image.jpg', boxes, labels, scores, COCO_INSTANCE_CATEGORY_NAMES)
Region Proposal Network (RPN)
RPN is the core technology of Faster R-CNN, proposing candidate regions with a learning-based approach.
How RPN works:
- Place multiple anchor boxes (different sizes and aspect ratios) at each location on the feature map
- Score each anchor for "objectness" (object likelihood)
- Regress bounding box coordinate offsets
- Pass high-score proposals to RoI Pooling
class SimpleRPN(nn.Module):
"""
Simplified Region Proposal Network (for educational purposes)
"""
def __init__(self, in_channels=512, num_anchors=9):
"""
Args:
in_channels: Number of input feature map channels
num_anchors: Number of anchors per location (typically 3 scales × 3 aspect ratios = 9)
"""
super(SimpleRPN, self).__init__()
# Shared convolutional layer
self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
# Objectness score (2 classes: object or background)
self.cls_logits = nn.Conv2d(512, num_anchors * 2, kernel_size=1)
# Bounding box regression (4 coordinates Γ num_anchors)
self.bbox_pred = nn.Conv2d(512, num_anchors * 4, kernel_size=1)
def forward(self, feature_map):
"""
Args:
feature_map: [B, C, H, W] feature map
Returns:
objectness: [B, num_anchors*2, H, W] object scores
bbox_deltas: [B, num_anchors*4, H, W] box coordinate offsets
"""
# Shared feature extraction
x = torch.relu(self.conv(feature_map))
# Objectness classification
objectness = self.cls_logits(x)
# Bounding box regression
bbox_deltas = self.bbox_pred(x)
return objectness, bbox_deltas
# Test RPN operation
rpn = SimpleRPN(in_channels=512, num_anchors=9)
feature_map = torch.randn(1, 512, 38, 38) # e.g., ResNet feature map
objectness, bbox_deltas = rpn(feature_map)
print(f"Objectness shape: {objectness.shape}") # [1, 18, 38, 38]
print(f"BBox Deltas shape: {bbox_deltas.shape}") # [1, 36, 38, 38]
print(f"Total Proposals: {38 * 38 * 9} anchors") # 12,996 anchors
5.3 One-Stage Detectors
YOLO (You Only Look Once)
YOLO formulates object detection as a "regression problem," directly predicting bounding boxes and classes end-to-end with a single CNN.
(Figure: YOLO pipeline: input image (448×448) → CNN backbone feature extraction → 7×7 grid division → per-cell box + class prediction → NMS to remove duplicates → final detection results.)
YOLO Design Philosophy
- Speed-focused: Aims for real-time inference (45+ FPS)
- Global context: Better context understanding by looking at the entire image at once
- Simple: No complex pipeline, end-to-end learning
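In the original YOLO, the whole prediction is a single S × S × (B·5 + C) tensor, which is why detection reduces to one regression pass; a short sketch with the paper's values (S=7, B=2, C=20 for PASCAL VOC):
S, B, C = 7, 2, 20                 # grid size, boxes per cell, classes (original YOLO on PASCAL VOC)
per_cell = B * 5 + C               # each box predicts (x, y, w, h, confidence); plus C class probabilities
print(f"Output tensor: {S} x {S} x {per_cell}")      # 7 x 7 x 30
print(f"Boxes predicted before NMS: {S * S * B}")    # 98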
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
# Using YOLOv5 (Ultralytics implementation)
def load_yolov5(model_size='yolov5s', pretrained=True):
"""
Load YOLOv5 model
Args:
model_size: Model size ('yolov5n', 'yolov5s', 'yolov5m', 'yolov5l', 'yolov5x')
pretrained: Use COCO pre-trained weights
Returns:
model: YOLOv5 model
"""
# Load from PyTorch Hub (Ultralytics implementation)
model = torch.hub.load('ultralytics/yolov5', model_size, pretrained=pretrained)
return model
# Load model
model = load_yolov5('yolov5s', pretrained=True)
model.eval()
print("YOLOv5s model information:")
print(f"- Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"- Inference speed: ~140 FPS (GPU)")
print(f"- Input size: 640Γ640 (default)")
def run_yolo_inference(model, image_path, conf_threshold=0.25, iou_threshold=0.45):
"""
Object detection inference with YOLOv5
Args:
model: YOLOv5 model
image_path: Input image path
conf_threshold: Confidence score threshold
iou_threshold: NMS IoU threshold
Returns:
results: Detection results (pandas DataFrame)
"""
# Inference settings
model.conf = conf_threshold
model.iou = iou_threshold
# Run inference
results = model(image_path)
# Display results
results.print() # Print to console
# Visualize results
results.show() # Display image
# Get results as DataFrame
detections = results.pandas().xyxy[0]
print(f"\nNumber of detected objects: {len(detections)}")
print(detections)
return results
# Usage example
# results = run_yolo_inference(model, 'test_image.jpg', conf_threshold=0.5)
# Batch inference
def run_yolo_batch_inference(model, image_paths, save_dir='results/'):
"""
Batch inference for multiple images
Args:
model: YOLOv5 model
image_paths: List of image paths
save_dir: Directory to save results
"""
import os
os.makedirs(save_dir, exist_ok=True)
# Batch inference
results = model(image_paths)
# Save results
results.save(save_dir=save_dir)
print(f"Batch inference complete: {len(image_paths)} images")
print(f"Results saved to: {save_dir}")
return results
# Usage example
# image_list = ['img1.jpg', 'img2.jpg', 'img3.jpg']
# batch_results = run_yolo_batch_inference(model, image_list)
YOLO Loss Function
YOLO combines three losses for training:
$$ \mathcal{L}_{\text{YOLO}} = \lambda_{\text{box}} \mathcal{L}_{\text{box}} + \lambda_{\text{obj}} \mathcal{L}_{\text{obj}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}} $$
- $\mathcal{L}_{\text{box}}$: Bounding box coordinate regression loss (CIoU Loss)
- $\mathcal{L}_{\text{obj}}$: Objectness binary classification loss
- $\mathcal{L}_{\text{cls}}$: Multi-class classification loss
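A minimal sketch of how the three terms are combined is shown below; the weights are illustrative, plain (1 − IoU) stands in for the CIoU box loss, and real YOLO implementations add anchor matching and per-scale balancing:
import torch
import torch.nn.functional as F

# Illustrative loss weights (YOLOv5's defaults differ)
lambda_box, lambda_obj, lambda_cls = 0.05, 1.0, 0.5

def yolo_style_loss(pred_iou, pred_obj_logits, pred_cls_logits, target_obj, target_cls):
    """pred_iou: IoU of matched boxes with their targets [N]
    pred_obj_logits / target_obj: objectness logits and targets for all anchors [M]
    pred_cls_logits / target_cls: class logits and one-hot targets for matched boxes [N, C]"""
    box_loss = (1.0 - pred_iou).mean()                                    # stand-in for CIoU loss
    obj_loss = F.binary_cross_entropy_with_logits(pred_obj_logits, target_obj)
    cls_loss = F.binary_cross_entropy_with_logits(pred_cls_logits, target_cls)
    return lambda_box * box_loss + lambda_obj * obj_loss + lambda_cls * cls_loss

# Dummy tensors just to show the call
loss = yolo_style_loss(torch.rand(5), torch.randn(100), torch.randn(5, 80),
                       torch.rand(100), torch.randint(0, 2, (5, 80)).float())
print(f"total loss: {loss.item():.4f}")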
SSD (Single Shot Detector)
SSD performs detection on feature maps at different scales, balancing speed and accuracy.
SSD features:
- Multi-scale feature maps (detection at multiple resolutions)
- Uses default boxes (equivalent to anchors) on each feature map (counted in the sketch after this list)
- More accurate than YOLO, faster than Faster R-CNN
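The effect of multi-scale detection is easy to see by counting SSD300's default boxes; a short sketch using the feature-map sizes and boxes-per-location from the original paper:
# SSD300 predicts from six feature maps of decreasing resolution
feature_map_sizes = [38, 19, 10, 5, 3, 1]      # spatial size of each detection layer
boxes_per_location = [4, 6, 6, 6, 4, 4]        # default boxes placed at every location
total = sum(s * s * b for s, b in zip(feature_map_sizes, boxes_per_location))
print(f"Total default boxes: {total}")          # 8732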
# Requirements:
# - Python 3.9+
# - pillow>=10.0.0
# - torchvision>=0.15.0
from torchvision.models.detection import ssd300_vgg16
def create_ssd_model(num_classes=91, pretrained=True):
"""
Create SSD300 model
Args:
num_classes: Number of detection classes
pretrained: Use pre-trained weights
Returns:
model: SSD300 model
"""
# SSD300 with VGG16 backbone
model = ssd300_vgg16(pretrained=pretrained, num_classes=num_classes)
return model
# Load model
ssd_model = create_ssd_model(num_classes=91, pretrained=True)
ssd_model.eval()
print("SSD300 model information:")
print(f"- Input size: 300Γ300")
print(f"- Backbone: VGG16")
print(f"- Feature maps: 6 layers (different scales)")
def run_ssd_inference(model, image_path, threshold=0.5):
"""
Object detection inference with SSD
"""
from PIL import Image
from torchvision import transforms
# Load and preprocess image
img = Image.open(image_path).convert('RGB')
transform = transforms.Compose([transforms.ToTensor()])
img_tensor = transform(img).unsqueeze(0)
# Inference
model.eval()
with torch.no_grad():
predictions = model(img_tensor)
# Extract results
pred = predictions[0]
keep = pred['scores'] > threshold
boxes = pred['boxes'][keep].cpu().numpy()
labels = pred['labels'][keep].cpu().numpy()
scores = pred['scores'][keep].cpu().numpy()
print(f"\nNumber of detected objects: {len(boxes)}")
for i, (box, label, score) in enumerate(zip(boxes, labels, scores)):
print(f" {i+1}. Label: {label}, Score: {score:.3f}")
return boxes, labels, scores
# Usage example
# boxes, labels, scores = run_ssd_inference(ssd_model, 'test_image.jpg', threshold=0.6)
5.4 Evaluation Metrics
Precision and Recall
Object detection evaluation uses metrics similar to information retrieval.
$$ \text{Precision} = \frac{TP}{TP + FP} \quad \text{(Detection accuracy)} $$
$$ \text{Recall} = \frac{TP}{TP + FN} \quad \text{(Detection coverage)} $$
- TP (True Positive): Correctly detected objects (IoU ≥ threshold)
- FP (False Positive): False detections (IoU < threshold or background misclassified as object)
- FN (False Negative): Missed detections (existing objects not detected)
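As a quick numeric example (hypothetical counts): suppose a detector outputs 8 boxes on an image containing 10 objects, and 6 of those boxes match a ground-truth object at IoU ≥ 0.5:
tp, fp, fn = 6, 2, 4            # 6 correct detections, 2 false detections, 4 missed objects
precision = tp / (tp + fp)      # 6 / 8  = 0.75: how many detections are correct
recall = tp / (tp + fn)         # 6 / 10 = 0.60: how many objects were found
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")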
NMS (Non-Maximum Suppression)
NMS is an algorithm that removes overlapping detection results, keeping only one box per object.
(Figure: NMS loop: sort boxes by score → select the highest-scoring box → remove overlapping boxes with IoU > threshold → repeat while boxes remain → final detection results.)
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
def non_max_suppression(boxes, scores, iou_threshold=0.5):
"""
Non-Maximum Suppression (NMS) implementation
Args:
boxes: Bounding box coordinates [[x_min, y_min, x_max, y_max], ...] (numpy array)
scores: Confidence scores [0.9, 0.8, ...]
iou_threshold: IoU threshold (boxes with more overlap are removed)
Returns:
keep_indices: Indices of boxes to keep
"""
import numpy as np
# Sort by descending score
sorted_indices = np.argsort(scores)[::-1]
keep_indices = []
while len(sorted_indices) > 0:
# Keep highest score box
current = sorted_indices[0]
keep_indices.append(current)
if len(sorted_indices) == 1:
break
# Calculate IoU with remaining boxes
current_box = boxes[current]
remaining_boxes = boxes[sorted_indices[1:]]
ious = np.array([calculate_iou(current_box, box) for box in remaining_boxes])
# Keep only boxes below IoU threshold
keep_mask = ious < iou_threshold
sorted_indices = sorted_indices[1:][keep_mask]
return np.array(keep_indices)
# Usage example
boxes = np.array([
[50, 50, 150, 150],
[55, 55, 155, 155], # Large overlap with first box
[200, 200, 300, 300],
[205, 205, 305, 305] # Large overlap with third box
])
scores = np.array([0.9, 0.85, 0.95, 0.88])
keep_indices = non_max_suppression(boxes, scores, iou_threshold=0.5)
print(f"Original number of boxes: {len(boxes)}")
print(f"Number of boxes after NMS: {len(keep_indices)}")
print(f"Kept indices: {keep_indices}")
print(f"Kept boxes:\n{boxes[keep_indices]}")
# PyTorch official NMS implementation (faster)
from torchvision.ops import nms
def nms_torch(boxes, scores, iou_threshold=0.5):
"""
PyTorch NMS (fast C++ implementation)
Args:
boxes: Tensor of shape [N, 4]
scores: Tensor of shape [N]
iou_threshold: IoU threshold
Returns:
keep: Indices of boxes to keep (Tensor)
"""
keep = nms(boxes, scores, iou_threshold)
return keep
# Usage example
boxes_tensor = torch.tensor(boxes, dtype=torch.float32)
scores_tensor = torch.tensor(scores, dtype=torch.float32)
keep_torch = nms_torch(boxes_tensor, scores_tensor, iou_threshold=0.5)
print(f"\nPyTorch NMS result: {keep_torch}")
mAP (mean Average Precision)
mAP is the standard evaluation metric for object detection, representing the average precision across all classes.
Calculation Steps
- For each class, draw a Precision-Recall curve
- Calculate AP (Average Precision) as the area under the curve
- Average AP across all classes to obtain mAP
$$ \text{AP} = \int_0^1 P(r) \, dr $$
$$ \text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c $$
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
def calculate_precision_recall(pred_boxes, pred_scores, true_boxes, iou_threshold=0.5):
"""
Calculate values for Precision-Recall curve
Args:
pred_boxes: Predicted boxes [N, 4]
pred_scores: Predicted scores [N]
true_boxes: Ground truth boxes [M, 4]
iou_threshold: IoU threshold
Returns:
precisions, recalls: Lists of Precision-Recall values
"""
import numpy as np
# Sort by descending score
sorted_indices = np.argsort(pred_scores)[::-1]
pred_boxes = pred_boxes[sorted_indices]
pred_scores = pred_scores[sorted_indices]
num_true = len(true_boxes)
matched_true = np.zeros(num_true, dtype=bool)
tp = np.zeros(len(pred_boxes))
fp = np.zeros(len(pred_boxes))
for i, pred_box in enumerate(pred_boxes):
# Calculate maximum IoU with ground truth boxes
if len(true_boxes) == 0:
fp[i] = 1
continue
ious = np.array([calculate_iou(pred_box, true_box) for true_box in true_boxes])
max_iou_idx = np.argmax(ious)
max_iou = ious[max_iou_idx]
# TP if exceeds IoU threshold and not yet matched
if max_iou >= iou_threshold and not matched_true[max_iou_idx]:
tp[i] = 1
matched_true[max_iou_idx] = True
else:
fp[i] = 1
# Cumulative sum
tp_cumsum = np.cumsum(tp)
fp_cumsum = np.cumsum(fp)
# Precision and Recall
recalls = tp_cumsum / num_true if num_true > 0 else np.zeros_like(tp_cumsum)
precisions = tp_cumsum / (tp_cumsum + fp_cumsum + 1e-10)
return precisions, recalls
def calculate_ap(precisions, recalls):
"""
Calculate Average Precision (AP) using 11-point interpolation
Args:
precisions: List of precision values
recalls: List of recall values
Returns:
ap: Average Precision
"""
import numpy as np
# 11-point interpolation
ap = 0.0
for t in np.linspace(0, 1, 11):
# Maximum precision at recall β₯ t
if np.sum(recalls >= t) == 0:
p = 0
else:
p = np.max(precisions[recalls >= t])
ap += p / 11
return ap
# Usage example
pred_boxes = np.array([
[50, 50, 150, 150],
[55, 55, 155, 155],
[200, 200, 300, 300]
])
pred_scores = np.array([0.9, 0.7, 0.85])
true_boxes = np.array([
[52, 52, 152, 152],
[205, 205, 305, 305]
])
precisions, recalls = calculate_precision_recall(
pred_boxes, pred_scores, true_boxes, iou_threshold=0.5
)
ap = calculate_ap(precisions, recalls)
print(f"Average Precision: {ap:.4f}")
# Visualize Precision-Recall curve
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(recalls, precisions, marker='o', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(True, alpha=0.3)
plt.xlim([0, 1.05])
plt.ylim([0, 1.05])
plt.fill_between(recalls, precisions, alpha=0.2)
plt.text(0.5, 0.5, f'AP = {ap:.4f}', fontsize=14, bbox=dict(facecolor='white', alpha=0.8))
plt.tight_layout()
plt.show()
def calculate_map(all_precisions, all_recalls, num_classes):
"""
Calculate mean Average Precision (mAP)
Args:
all_precisions: List of precision lists per class [[p1, p2, ...], ...]
all_recalls: List of recall lists per class [[r1, r2, ...], ...]
num_classes: Number of classes
Returns:
mAP: mean Average Precision
"""
aps = []
for i in range(num_classes):
ap = calculate_ap(all_precisions[i], all_recalls[i])
aps.append(ap)
print(f"Class {i}: AP = {ap:.4f}")
mAP = np.mean(aps)
print(f"\nmAP: {mAP:.4f}")
return mAP
COCO mAP: The COCO dataset uses a stricter evaluation by calculating AP at multiple IoU thresholds (0.5, 0.55, ..., 0.95) and averaging them.
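A minimal sketch of this averaging for a single class, reusing calculate_precision_recall and calculate_ap from above (real COCO evaluation uses 101-point interpolation and pycocotools, so the numbers will differ):
# Average AP over IoU thresholds 0.50, 0.55, ..., 0.95
iou_thresholds = np.arange(0.5, 1.0, 0.05)
aps_per_threshold = []
for t in iou_thresholds:
    p, r = calculate_precision_recall(pred_boxes, pred_scores, true_boxes, iou_threshold=t)
    aps_per_threshold.append(calculate_ap(p, r))
print(f"AP@[0.50:0.95]: {np.mean(aps_per_threshold):.4f}")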
5.5 Object Detection with PyTorch
Using torchvision.models.detection
PyTorch's torchvision provides a rich collection of pre-trained object detection models.
# Requirements:
# - Python 3.9+
# - pillow>=10.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
import torch
import torchvision
from torchvision.models.detection import (
fasterrcnn_resnet50_fpn,
fasterrcnn_mobilenet_v3_large_fpn,
retinanet_resnet50_fpn,
ssd300_vgg16
)
def compare_detection_models():
"""
Compare various object detection models
"""
models_info = {
'Faster R-CNN (ResNet-50)': {
'model': fasterrcnn_resnet50_fpn,
'type': 'Two-Stage',
'speed': 'Slow',
'accuracy': 'High'
},
'Faster R-CNN (MobileNetV3)': {
'model': fasterrcnn_mobilenet_v3_large_fpn,
'type': 'Two-Stage',
'speed': 'Medium',
'accuracy': 'Medium'
},
'RetinaNet (ResNet-50)': {
'model': retinanet_resnet50_fpn,
'type': 'One-Stage',
'speed': 'Medium',
'accuracy': 'High'
},
'SSD300 (VGG16)': {
'model': ssd300_vgg16,
'type': 'One-Stage',
'speed': 'Fast',
'accuracy': 'Medium'
}
}
print("Object Detection Model Comparison:")
print("-" * 80)
for name, info in models_info.items():
print(f"{name:35s} | Type: {info['type']:10s} | "
f"Speed: {info['speed']:6s} | Accuracy: {info['accuracy']:6s}")
print("-" * 80)
compare_detection_models()
# Fine-tuning on custom dataset
from torch.utils.data import Dataset, DataLoader
import json
class CustomDetectionDataset(Dataset):
"""
Custom object detection dataset (COCO format)
"""
def __init__(self, image_dir, annotation_file, transforms=None):
"""
Args:
image_dir: Image directory path
annotation_file: Annotation file in COCO format
transforms: Data augmentation
"""
self.image_dir = image_dir
self.transforms = transforms
# Load annotations
with open(annotation_file, 'r') as f:
self.coco_data = json.load(f)
self.images = self.coco_data['images']
self.annotations = self.coco_data['annotations']
# Group annotations by image ID
self.image_to_annotations = {}
for ann in self.annotations:
image_id = ann['image_id']
if image_id not in self.image_to_annotations:
self.image_to_annotations[image_id] = []
self.image_to_annotations[image_id].append(ann)
def __len__(self):
return len(self.images)
def __getitem__(self, idx):
# Image information
img_info = self.images[idx]
image_id = img_info['id']
img_path = f"{self.image_dir}/{img_info['file_name']}"
# Load image
from PIL import Image
img = Image.open(img_path).convert('RGB')
# Get annotations
anns = self.image_to_annotations.get(image_id, [])
boxes = []
labels = []
for ann in anns:
# COCO format: [x, y, width, height] → [x_min, y_min, x_max, y_max]
x, y, w, h = ann['bbox']
boxes.append([x, y, x + w, y + h])
labels.append(ann['category_id'])
# Convert to tensors
boxes = torch.as_tensor(boxes, dtype=torch.float32)
labels = torch.as_tensor(labels, dtype=torch.int64)
target = {
'boxes': boxes,
'labels': labels,
'image_id': torch.tensor([image_id])
}
# Data augmentation
if self.transforms:
img = self.transforms(img)
return img, target
# Dataset usage example
# dataset = CustomDetectionDataset(
# image_dir='data/images',
# annotation_file='data/annotations.json',
# transforms=torchvision.transforms.ToTensor()
# )
#
# data_loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=lambda x: tuple(zip(*x)))
Training Loop Implementation
def train_detection_model(model, data_loader, optimizer, device, epoch):
"""
Train object detection model (one epoch)
Args:
model: Object detection model
data_loader: Data loader
optimizer: Optimizer
device: Execution device
epoch: Current epoch number
"""
model.train()
total_loss = 0
for batch_idx, (images, targets) in enumerate(data_loader):
# Transfer data to device
images = [img.to(device) for img in images]
targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
# Forward pass (torchvision models return loss during training)
loss_dict = model(images, targets)
# Sum all losses
losses = sum(loss for loss in loss_dict.values())
# Backward pass
optimizer.zero_grad()
losses.backward()
optimizer.step()
total_loss += losses.item()
# Progress display
if batch_idx % 10 == 0:
print(f'Epoch {epoch}, Batch {batch_idx}/{len(data_loader)}, '
f'Loss: {losses.item():.4f}')
print(f' Details: {", ".join([f"{k}: {v.item():.4f}" for k, v in loss_dict.items()])}')
avg_loss = total_loss / len(data_loader)
print(f'Epoch {epoch} - Average Loss: {avg_loss:.4f}\n')
return avg_loss
def evaluate_detection_model(model, data_loader, device):
"""
Evaluate object detection model
Args:
model: Object detection model
data_loader: Data loader
device: Execution device
Returns:
metrics: Dictionary of evaluation metrics
"""
model.eval()
all_predictions = []
all_targets = []
with torch.no_grad():
for images, targets in data_loader:
images = [img.to(device) for img in images]
# Inference
predictions = model(images)
all_predictions.extend([{k: v.cpu() for k, v in p.items()} for p in predictions])
all_targets.extend([{k: v.cpu() for k, v in t.items()} for t in targets])
# Calculate evaluation metrics (simplified version)
print("Evaluation results:")
print(f" Total samples: {len(all_predictions)}")
# Average number of detections
avg_detections = sum(len(p['boxes']) for p in all_predictions) / len(all_predictions)
print(f" Average detections: {avg_detections:.2f}")
return {'avg_detections': avg_detections}
# Training execution example
def full_training_pipeline(num_epochs=10):
"""
Complete training pipeline
"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Create model
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.to(device)
# Optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
# Learning rate scheduler
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
# Training loop
for epoch in range(1, num_epochs + 1):
# Training
# train_loss = train_detection_model(model, train_loader, optimizer, device, epoch)
# Evaluation
# metrics = evaluate_detection_model(model, val_loader, device)
# Update learning rate
lr_scheduler.step()
# Save model
# torch.save(model.state_dict(), f'detection_model_epoch_{epoch}.pth')
print(f"Epoch {epoch} completed.\n")
# Usage example
# full_training_pipeline(num_epochs=10)
5.6 Practice: Detection with COCO Format Data
COCO Dataset Overview
COCO (Common Objects in Context) is the standard benchmark dataset for object detection.
| Item | Details |
|---|---|
| Number of images | Training: 118K, Validation: 5K, Test: 41K |
| Number of classes | 80 classes (person, car, dog, etc.) |
| Annotations | Bounding boxes, segmentation, keypoints |
| Evaluation metric | mAP @ IoU=[0.50:0.05:0.95] |
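For reference, a minimal COCO-format annotation file contains three top-level lists; the values below are illustrative, and these are exactly the keys that the CustomDetectionDataset class in Section 5.5 reads:
coco_annotation_example = {
    "images": [
        {"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels, with (x, y) at the top-left corner
        {"id": 100, "image_id": 1, "category_id": 1,
         "bbox": [50, 50, 100, 250], "area": 25000, "iscrowd": 0}
    ],
    "categories": [
        {"id": 1, "name": "person"}
    ]
}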
Complete Object Detection Pipeline
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pillow>=10.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches
class ObjectDetectionPipeline:
"""
Complete object detection pipeline
"""
def __init__(self, num_classes, pretrained=True, device=None):
"""
Args:
num_classes: Number of detection classes (including background)
pretrained: Use pre-trained weights
device: Execution device
"""
self.device = device or torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.num_classes = num_classes
# Build model
self.model = self._build_model(pretrained)
self.model.to(self.device)
print(f"Object detection pipeline initialization complete")
print(f" Device: {self.device}")
print(f" Number of classes: {num_classes}")
def _build_model(self, pretrained):
"""Build model"""
model = fasterrcnn_resnet50_fpn(pretrained=pretrained)
# Replace final layer
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, self.num_classes)
return model
def predict(self, image_path, conf_threshold=0.5, nms_threshold=0.5):
"""
Detect objects from image
Args:
image_path: Input image path
conf_threshold: Confidence score threshold
nms_threshold: NMS IoU threshold
Returns:
detections: Detection results dictionary
"""
# Load image
img = Image.open(image_path).convert('RGB')
img_tensor = torchvision.transforms.ToTensor()(img).unsqueeze(0).to(self.device)
# Inference
self.model.eval()
with torch.no_grad():
predictions = self.model(img_tensor)
# Post-processing
pred = predictions[0]
# NMS (torchvision models run NMS internally, but can apply additionally)
keep = torchvision.ops.nms(pred['boxes'], pred['scores'], nms_threshold)
# Threshold filtering
keep = keep[pred['scores'][keep] > conf_threshold]
detections = {
'boxes': pred['boxes'][keep].cpu().numpy(),
'labels': pred['labels'][keep].cpu().numpy(),
'scores': pred['scores'][keep].cpu().numpy()
}
return detections, img
def visualize(self, image, detections, class_names, save_path=None):
"""
Visualize detection results
Args:
image: PIL Image
detections: Return value from predict()
class_names: List of class names
save_path: Save path (display only if None)
"""
fig, ax = plt.subplots(1, figsize=(12, 8))
ax.imshow(image)
colors = plt.cm.hsv(np.linspace(0, 1, len(class_names))).tolist()
for box, label, score in zip(detections['boxes'], detections['labels'], detections['scores']):
x_min, y_min, x_max, y_max = box
width = x_max - x_min
height = y_max - y_min
color = colors[label % len(colors)]
# Bounding box
rect = patches.Rectangle(
(x_min, y_min), width, height,
linewidth=2, edgecolor=color, facecolor='none'
)
ax.add_patch(rect)
# Label
label_text = f'{class_names[label]}: {score:.2f}'
ax.text(
x_min, y_min - 5,
label_text,
bbox=dict(facecolor=color, alpha=0.7),
fontsize=10, color='white', weight='bold'
)
ax.axis('off')
plt.tight_layout()
if save_path:
plt.savefig(save_path, dpi=150, bbox_inches='tight')
print(f"Results saved: {save_path}")
else:
plt.show()
def batch_predict(self, image_paths, conf_threshold=0.5):
"""
Batch inference
Args:
image_paths: List of image paths
conf_threshold: Confidence threshold
Returns:
all_detections: List of detection results per image
"""
all_detections = []
for img_path in image_paths:
detections, img = self.predict(img_path, conf_threshold)
all_detections.append({
'path': img_path,
'detections': detections,
'image': img
})
return all_detections
def evaluate_coco(self, data_loader, coco_gt):
"""
COCO format evaluation
Args:
data_loader: Data loader
coco_gt: COCO ground truth annotations
Returns:
metrics: Evaluation metrics
"""
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
self.model.eval()
coco_results = []
with torch.no_grad():
for images, targets in data_loader:
images = [img.to(self.device) for img in images]
predictions = self.model(images)
# Convert to COCO format
for target, pred in zip(targets, predictions):
image_id = target['image_id'].item()
for box, label, score in zip(pred['boxes'], pred['labels'], pred['scores']):
x_min, y_min, x_max, y_max = box.tolist()
coco_results.append({
'image_id': image_id,
'category_id': label.item(),
'bbox': [x_min, y_min, x_max - x_min, y_max - y_min],
'score': score.item()
})
# COCO evaluation
coco_dt = coco_gt.loadRes(coco_results)
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
metrics = {
'mAP': coco_eval.stats[0],
'mAP_50': coco_eval.stats[1],
'mAP_75': coco_eval.stats[2]
}
return metrics
# Usage example
if __name__ == '__main__':
# COCO class names (abbreviated)
COCO_CLASSES = [
'__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag'
# ... all 91 classes
]
# Initialize pipeline
pipeline = ObjectDetectionPipeline(num_classes=91, pretrained=True)
# Single image inference
# detections, img = pipeline.predict('test_image.jpg', conf_threshold=0.7)
# pipeline.visualize(img, detections, COCO_CLASSES, save_path='result.jpg')
# Batch inference
# image_list = ['img1.jpg', 'img2.jpg', 'img3.jpg']
# results = pipeline.batch_predict(image_list, conf_threshold=0.6)
print("Object detection pipeline ready")
Chapter Summary
What We Learned
Object Detection Fundamentals
- Differences between classification, detection, and segmentation
- How to calculate bounding boxes and IoU
- Challenges and evaluation metrics for object detection
Two-Stage Detectors
- Evolution of R-CNN, Fast R-CNN, Faster R-CNN
- How Region Proposal Networks work
- Accuracy-focused approaches
One-Stage Detectors
- Design philosophy of YOLO and SSD
- Trade-offs between speed and accuracy
- Achieving real-time detection
Evaluation Metrics
- Implementation of NMS (Non-Maximum Suppression)
- Precision-Recall curves and AP
- Calculating mAP (mean Average Precision)
Implementation Skills
- Object detection with PyTorch torchvision
- Building training and evaluation pipelines
- Handling COCO format data
Model Selection Guide
| Requirement | Recommended Model | Reason |
|---|---|---|
| Highest accuracy | Faster R-CNN (ResNet-101) | Precise detection with two-stage |
| Real-time | YOLOv5s / YOLOv8 | 140+ FPS, lightweight |
| Balanced | YOLOv5m / RetinaNet | Balance of speed and accuracy |
| Edge devices | MobileNet SSD | Low computation, memory efficient |
| Small object detection | Faster R-CNN + FPN | Multi-scale feature extraction |
Exercises
Problem 1 (Difficulty: medium)
Implement an IoU calculation function in NumPy and verify with the following test cases:
- Box1: [0, 0, 100, 100], Box2: [50, 50, 150, 150] → IoU ≈ 0.143
- Box1: [0, 0, 100, 100], Box2: [0, 0, 100, 100] → IoU = 1.0
- Box1: [0, 0, 50, 50], Box2: [60, 60, 100, 100] → IoU = 0.0
Solution Example
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
def calculate_iou_numpy(box1, box2):
"""IoU calculation with NumPy"""
# Intersection region
x_min_inter = max(box1[0], box2[0])
y_min_inter = max(box1[1], box2[1])
x_max_inter = min(box1[2], box2[2])
y_max_inter = min(box1[3], box2[3])
inter_area = max(0, x_max_inter - x_min_inter) * max(0, y_max_inter - y_min_inter)
# Area of each box
box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
# IoU
union_area = box1_area + box2_area - inter_area
iou = inter_area / union_area if union_area > 0 else 0
return iou
# Test
test_cases = [
([0, 0, 100, 100], [50, 50, 150, 150], 0.143),
([0, 0, 100, 100], [0, 0, 100, 100], 1.0),
([0, 0, 50, 50], [60, 60, 100, 100], 0.0)
]
for box1, box2, expected in test_cases:
iou = calculate_iou_numpy(box1, box2)
print(f"Box1: {box1}, Box2: {box2}")
print(f" Calculated IoU: {iou:.4f}, Expected: {expected:.4f}, Match: {abs(iou - expected) < 0.001}")
Problem 2 (Difficulty: hard)
Implement the NMS (Non-Maximum Suppression) algorithm from scratch and test with the following data:
boxes = [[50, 50, 150, 150], [55, 55, 155, 155], [200, 200, 300, 300], [205, 205, 305, 305]]
scores = [0.9, 0.85, 0.95, 0.88]
iou_threshold = 0.5
Expected output: Indices [2, 0] (in score order) are kept
Hint
- Sort by descending score
- Keep highest score box and remove overlapping boxes
- Repeat until all boxes are processed
Problem 3 (Difficulty: medium)
Use Faster R-CNN to perform object detection on a custom image and visualize the results. Display the class names and scores of detected objects.
Solution Example
# Requirements:
# - Python 3.9+
# - pillow>=10.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
"""
Example: Use Faster R-CNN to perform object detection on a custom image
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: torch, torchvision, pillow
"""
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from PIL import Image
import torchvision.transforms as T
# Load model
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()
# Load image
img = Image.open('your_image.jpg').convert('RGB')
img_tensor = T.ToTensor()(img).unsqueeze(0)
# Inference
with torch.no_grad():
predictions = model(img_tensor)
# Display results (COCO_CLASSES is the COCO class-name list defined earlier in this chapter)
pred = predictions[0]
for i, (box, label, score) in enumerate(zip(pred['boxes'], pred['labels'], pred['scores'])):
if score > 0.5:
print(f"Detection {i+1}: Class={COCO_CLASSES[label]}, Score={score:.3f}, Box={box.tolist()}")
# Visualization
# Use visualize_bounding_boxes function
Problem 4 (Difficulty: hard)
Create a script to perform real-time object detection from a video file (or webcam) using YOLOv5. Display detection results frame by frame and measure FPS.
Hint
- Load video with OpenCV (cv2.VideoCapture)
- Run YOLOv5 inference on each frame
- Measure FPS with time.time()
- Display results with cv2.imshow()
References
- Girshick, R., et al. (2014). "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR.
- Girshick, R. (2015). "Fast R-CNN." ICCV.
- Ren, S., et al. (2016). "Faster R-CNN: Towards real-time object detection with region proposal networks." TPAMI.
- Redmon, J., et al. (2016). "You only look once: Unified, real-time object detection." CVPR.
- Liu, W., et al. (2016). "SSD: Single shot multibox detector." ECCV.
- Lin, T.-Y., et al. (2014). "Microsoft COCO: Common objects in context." ECCV.
- Lin, T.-Y., et al. (2017). "Focal loss for dense object detection." ICCV. (RetinaNet)
- Jocher, G., et al. (2022). "YOLOv5: State-of-the-art object detection." Ultralytics.