This chapter covers Object Detection: how Bounding Boxes are represented and compared with IoU, how raw detections are post-processed (NMS) and evaluated (mAP), how Two-Stage and One-Stage detectors work, and how to build practical detection and tracking applications with YOLOv8.
Learning Objectives
By reading this chapter, you will master:
- ✅ Understanding Bounding Box representation methods and IoU calculation
- ✅ Mastering the principles and implementation of Non-Maximum Suppression (NMS)
- ✅ Understanding evaluation metrics for object detection (mAP, Precision-Recall)
- ✅ Explaining the mechanisms of Two-Stage detectors (R-CNN, Fast R-CNN, Faster R-CNN)
- ✅ Understanding the characteristics of One-Stage detectors (YOLO, SSD, RetinaNet)
- ✅ Implementing practical object detection using YOLOv8
- ✅ Training object detection models with custom datasets
- ✅ Implementing real-time detection and tracking applications
3.1 Fundamentals of Object Detection
What is Object Detection?
Object Detection is the task of detecting multiple objects in an image and predicting both their locations (Bounding Boxes) and their classes. While image classification answers "what is in the image as a whole," object detection answers "what is where."
Object Detection = Localization + Classification
From an input image, it outputs (x, y, width, height, class, confidence) for each object.
Bounding Box Representation Methods
A Bounding Box is a rectangular region that encloses an object. There are mainly four representation methods:
| Representation | Format | Description |
|---|---|---|
| XYXY | (x1, y1, x2, y2) | Top-left and bottom-right coordinates |
| XYWH | (x, y, w, h) | Top-left coordinates and width/height |
| CXCYWH | (cx, cy, w, h) | Center coordinates and width/height |
| Normalized Coordinates | (normalized to 0~1) | Coordinates normalized by image size |
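These formats are straightforward to convert between. The following is a minimal sketch (the function names are illustrative, not from any library) that converts an XYXY box to XYWH and CXCYWH and normalizes coordinates by the image size; normalized CXCYWH is the layout used by YOLO label files later in this chapter.
def xyxy_to_xywh(box):
    """(x1, y1, x2, y2) -> (x, y, w, h): top-left corner plus width/height."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)
def xyxy_to_cxcywh(box):
    """(x1, y1, x2, y2) -> (cx, cy, w, h): box center plus width/height."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
def normalize_box(box, img_w, img_h):
    """Scale pixel coordinates to the 0~1 range by dividing by the image size."""
    a, b, c, d = box
    return (a / img_w, b / img_h, c / img_w, d / img_h)
box = (50, 50, 150, 150)                              # XYXY, pixels
print(xyxy_to_xywh(box))                              # (50, 50, 100, 100)
print(xyxy_to_cxcywh(box))                            # (100.0, 100.0, 100, 100)
print(normalize_box(xyxy_to_cxcywh(box), 640, 480))   # normalized CXCYWH for a 640x480 image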
IoU (Intersection over Union)
IoU is a metric that measures the degree of overlap between predicted and ground truth Bounding Boxes. It is one of the most important evaluation metrics in object detection.
IoU = Area(Prediction ∩ Ground Truth) / Area(Prediction ∪ Ground Truth) = Intersection Area / Union Area
IoU ranges from 0 (no overlap) to 1 (perfect match).
Code Example 1: IoU Calculation Implementation
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
def calculate_iou(box1, box2):
"""
Calculate IoU (Intersection over Union)
Args:
box1, box2: Bounding Boxes in [x1, y1, x2, y2] format
Returns:
float: IoU value (0~1)
"""
# Calculate intersection area coordinates
x1_inter = max(box1[0], box2[0])
y1_inter = max(box1[1], box2[1])
x2_inter = min(box1[2], box2[2])
y2_inter = min(box1[3], box2[3])
# Intersection area
inter_width = max(0, x2_inter - x1_inter)
inter_height = max(0, y2_inter - y1_inter)
intersection = inter_width * inter_height
# Area of each box
box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
# Union area
union = box1_area + box2_area - intersection
# Calculate IoU (avoid division by zero)
iou = intersection / union if union > 0 else 0
return iou
# Usage example
box_pred = [50, 50, 150, 150] # Predicted Box
box_gt = [60, 60, 160, 160] # Ground Truth Box
iou = calculate_iou(box_pred, box_gt)
print(f"IoU: {iou:.4f}")
# Calculate IoU for multiple boxes
boxes_pred = np.array([
[50, 50, 150, 150],
[100, 100, 200, 200],
[30, 30, 130, 130]
])
boxes_gt = np.array([[60, 60, 160, 160]])
for i, box_pred in enumerate(boxes_pred):
iou = calculate_iou(box_pred, boxes_gt[0])
print(f"Box {i+1} IoU: {iou:.4f}")
Output Example:
IoU: 0.6807
Box 1 IoU: 0.6807
Box 2 IoU: 0.2195
Box 3 IoU: 0.3245
Non-Maximum Suppression (NMS)
Object detection models may predict multiple Bounding Boxes for the same object. NMS is a technique that removes duplicate detections and keeps only the box with the highest confidence score.
NMS Algorithm
1. Sort Bounding Boxes in descending order by confidence score
2. Select the box with the highest confidence and add it to the output list
3. Remove boxes from the remaining set whose IoU with the selected box is above a threshold
4. Repeat steps 2-3 for the remaining boxes
Code Example 2: NMS Implementation
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
import numpy as np
def non_max_suppression(boxes, scores, iou_threshold=0.5):
"""
Implement Non-Maximum Suppression (NMS)
Args:
boxes: numpy array, shape (N, 4) [x1, y1, x2, y2]
scores: numpy array, shape (N,) confidence scores
iou_threshold: float, IoU threshold
Returns:
list: Indices of boxes to keep
"""
# If boxes is empty
if len(boxes) == 0:
return []
# Convert to float32
boxes = boxes.astype(np.float32)
# Calculate area of each box
x1 = boxes[:, 0]
y1 = boxes[:, 1]
x2 = boxes[:, 2]
y2 = boxes[:, 3]
areas = (x2 - x1) * (y2 - y1)
# Sort by scores in descending order
order = scores.argsort()[::-1]
keep = []
while len(order) > 0:
# Select box with highest confidence
idx = order[0]
keep.append(idx)
if len(order) == 1:
break
# Calculate IoU with remaining boxes
xx1 = np.maximum(x1[idx], x1[order[1:]])
yy1 = np.maximum(y1[idx], y1[order[1:]])
xx2 = np.minimum(x2[idx], x2[order[1:]])
yy2 = np.minimum(y2[idx], y2[order[1:]])
w = np.maximum(0, xx2 - xx1)
h = np.maximum(0, yy2 - yy1)
intersection = w * h
union = areas[idx] + areas[order[1:]] - intersection
iou = intersection / union
# Keep only boxes with IoU below threshold
inds = np.where(iou <= iou_threshold)[0]
order = order[inds + 1]
return keep
# Usage example
boxes = np.array([
[50, 50, 150, 150],
[55, 55, 155, 155],
[60, 60, 160, 160],
[200, 200, 300, 300],
[205, 205, 305, 305]
])
scores = np.array([0.9, 0.85, 0.88, 0.95, 0.92])
keep_indices = non_max_suppression(boxes, scores, iou_threshold=0.5)
print(f"Original number of boxes: {len(boxes)}")
print(f"Number of boxes after NMS: {len(keep_indices)}")
print(f"Indices of kept boxes: {keep_indices}")
print(f"\nKept Boxes:")
for idx in keep_indices:
print(f" Box {idx}: {boxes[idx]}, Score: {scores[idx]:.2f}")
Output Example:
Original number of boxes: 5
Number of boxes after NMS: 2
Indices of kept boxes: [3, 0]
Kept Boxes:
  Box 3: [200 200 300 300], Score: 0.95
  Box 0: [ 50  50 150 150], Score: 0.90
Evaluation Metrics (mAP)
mAP (mean Average Precision) is widely used for evaluating object detection performance.
Main Evaluation Metrics
- Precision: Ratio of correct predictions among all detections
- Recall: Ratio of detected objects among all ground truth objects
- AP (Average Precision): Area under the Precision-Recall curve for one class
- mAP (mean Average Precision): Average of AP across all classes
mAP@0.5: mAP at IoU threshold 0.5
mAP@[0.5:0.95]: Average mAP over IoU thresholds from 0.5 to 0.95 in 0.05 increments (COCO evaluation)
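To make the AP definition concrete, the following is a minimal sketch that assumes detections for one class have already been matched against ground truth (TP/FP at a fixed IoU threshold, e.g. 0.5). It builds the Precision-Recall curve and integrates it with all-point interpolation; mAP is simply this value averaged over classes.
import numpy as np
def average_precision(scores, is_tp, num_gt):
    """
    Compute AP for one class from already-matched detections.
    scores: confidence of each detection
    is_tp:  1 if the detection matched a ground-truth box at the chosen IoU, else 0
    num_gt: total number of ground-truth boxes for this class
    """
    order = np.argsort(scores)[::-1]          # rank detections by confidence
    tp = np.array(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(fp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # All-point interpolation: make precision monotonically decreasing,
    # then integrate over the recall axis
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1])
# Toy example: 5 detections for one class, 3 ground-truth boxes
scores = [0.95, 0.90, 0.80, 0.70, 0.60]
is_tp  = [1,    1,    0,    1,    0]
print(f"AP: {average_precision(scores, is_tp, num_gt=3):.4f}")  # ~0.9167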
3.2 Two-Stage Detectors
Evolution of the R-CNN Family
Two-Stage detectors perform object detection in two stages: "Region Proposal" and "Classification & Localization Refinement."
R-CNN (2014)
- Extract ~2000 region candidates using Selective Search
- Extract features from each region using CNN (AlexNet)
- Classify with SVM, refine location with regression
Problem: Very slow due to 2000 CNN forward passes (47 seconds per image)
Fast R-CNN (2015)
- Process entire image once with CNN
- Extract region candidates from feature map using RoI Pooling
- Perform classification and localization simultaneously with fully connected layers
Improvement: roughly 25x faster than R-CNN at test time (about 2 seconds per image, most of which is spent on Selective Search proposals)
Faster R-CNN (2015)
- Generate region proposals with RPN (Region Proposal Network)
- Extract features with RoI Pooling
- Perform classification and localization
Improvement: Eliminates Selective Search, enables full end-to-end learning (0.2 seconds per image)
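The key idea of the RPN is to score a fixed set of anchor boxes at every feature-map location (the paper uses 3 scales x 3 aspect ratios, i.e. 9 anchors per location). The snippet below is a minimal sketch of generating those anchors for a single location; the scales and ratios are common defaults, not taken from any specific implementation.
import numpy as np
def anchors_for_cell(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate len(scales) x len(ratios) anchor boxes (XYXY, pixels) centered at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # keep area ~ s*s while varying aspect ratio w/h = r
            h = s / np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)
# Center of the first cell of a stride-16 feature map lies at pixel (8, 8)
print(anchors_for_cell(8, 8).shape)  # (9, 4) -> 9 anchors, each scored by the RPN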
Faster R-CNN Implementation
Code Example 3: Object Detection with Faster R-CNN (torchvision)
# Requirements:
# - Python 3.9+
# - pillow>=10.0.0
# - requests>=2.31.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
"""
Example: Faster R-CNN Implementation
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: listed in the Requirements block above
"""
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms import functional as F
from PIL import Image, ImageDraw, ImageFont
import requests
from io import BytesIO
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load pre-trained Faster R-CNN model
model = fasterrcnn_resnet50_fpn(weights='DEFAULT')  # 'pretrained=True' is deprecated in recent torchvision
model = model.to(device)
model.eval()
# COCO class names (91 classes)
COCO_CLASSES = [
'__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
def detect_objects(image_path, threshold=0.5):
"""
Perform object detection with Faster R-CNN
Args:
image_path: Image path or URL
threshold: Confidence threshold
"""
# Load image
if image_path.startswith('http'):
response = requests.get(image_path)
img = Image.open(BytesIO(response.content)).convert('RGB')
else:
img = Image.open(image_path).convert('RGB')
# Convert to tensor
img_tensor = F.to_tensor(img).to(device)
# Inference
with torch.no_grad():
predictions = model([img_tensor])[0]
# Filter results
keep = predictions['scores'] > threshold
boxes = predictions['boxes'][keep].cpu().numpy()
labels = predictions['labels'][keep].cpu().numpy()
scores = predictions['scores'][keep].cpu().numpy()
# Draw results
draw = ImageDraw.Draw(img)
for box, label, score in zip(boxes, labels, scores):
x1, y1, x2, y2 = box
class_name = COCO_CLASSES[label]
# Draw Bounding Box
draw.rectangle([x1, y1, x2, y2], outline='red', width=3)
# Draw label and score
text = f"{class_name}: {score:.2f}"
draw.text((x1, y1 - 15), text, fill='red')
# Display results
print(f"Number of detected objects: {len(boxes)}")
for label, score in zip(labels, scores):
print(f" - {COCO_CLASSES[label]}: {score:.3f}")
return img, boxes, labels, scores
# Usage example
image_url = "https://images.unsplash.com/photo-1544568100-847a948585b9?w=800"
result_img, boxes, labels, scores = detect_objects(image_url, threshold=0.7)
# Display image (for Jupyter Notebook)
# display(result_img)
# Save image
result_img.save('faster_rcnn_result.jpg')
print("Results saved to faster_rcnn_result.jpg")
Feature Pyramid Networks (FPN)
FPN is an architecture that effectively utilizes multi-scale features. It combines feature maps at multiple resolutions to detect objects of different sizes.
FPN Features:
- Bottom-up pathway: Standard CNN forward pass
- Top-down pathway: Propagate high-level features from low to high resolution
- Lateral connections: Merge features at each level
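To make the three pathways concrete, here is a minimal PyTorch sketch of an FPN neck (this is not the torchvision implementation; the channel counts and feature-map sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F
class TinyFPN(nn.Module):
    """Minimal FPN neck: lateral 1x1 convs + top-down upsampling + 3x3 smoothing convs."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)
    def forward(self, feats):  # feats: [C3, C4, C5], highest resolution first
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down pathway: add upsampled coarse features into the finer levels
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode='nearest')
        return [s(p) for s, p in zip(self.smooth, laterals)]  # [P3, P4, P5]
# Dummy multi-scale backbone features (batch size 1)
c3 = torch.randn(1, 256, 80, 80)
c4 = torch.randn(1, 512, 40, 40)
c5 = torch.randn(1, 1024, 20, 20)
fpn = TinyFPN()
p3, p4, p5 = fpn([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)  # all 256 channels, at 80/40/20 resolution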
3.3 One-Stage Detectors
YOLO Family
YOLO (You Only Look Once) reframes detection as a single regression problem: the network looks at the image only once and predicts all Bounding Boxes and classes in one forward pass. This makes it much faster than Two-Stage detectors and enables real-time detection.
Basic Principles of YOLO
- Divide image into a grid (e.g., 13×13)
- Each grid cell predicts Bounding Boxes and confidence
- Predict class probabilities for each box
- Remove duplicates with NMS
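As an illustration of step 2, the sketch below decodes one grid cell's raw outputs into image coordinates in the YOLOv2/v3 style (sigmoid offsets within the cell plus anchor dimensions scaled by an exponential); all the numbers are made up.
import numpy as np
def decode_cell(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, grid=13, img_size=416):
    """Decode a YOLOv2/v3-style raw prediction (tx, ty, tw, th) for one grid cell."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    stride = img_size / grid                      # pixels per grid cell
    cx = (cell_x + sigmoid(tx)) * stride          # box center in image pixels
    cy = (cell_y + sigmoid(ty)) * stride
    w = anchor_w * np.exp(tw)                     # anchor size scaled by exp of the raw output
    h = anchor_h * np.exp(th)
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2  # XYXY
# Made-up raw outputs for the cell at (6, 6) with a 100x80 pixel anchor
print(decode_cell(0.2, -0.1, 0.3, 0.1, cell_x=6, cell_y=6, anchor_w=100, anchor_h=80))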
Evolution of YOLO
| Version | Year | Main Improvements |
|---|---|---|
| YOLOv1 | 2016 | Proposed One-Stage detection, real-time processing |
| YOLOv2 | 2017 | Batch Normalization, introduced Anchor Boxes |
| YOLOv3 | 2018 | Multi-scale predictions, Darknet-53 |
| YOLOv4 | 2020 | CSPDarknet53, Mosaic augmentation |
| YOLOv5 | 2020 | PyTorch implementation, improved usability |
| YOLOv8 | 2023 | Anchor-free, improved architecture |
Code Example 4: Object Detection with YOLOv8
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - opencv-python>=4.8.0
# - pillow>=10.0.0
# - ultralytics>=8.0.0
from ultralytics import YOLO
from PIL import Image
import cv2
import numpy as np
# Load YOLOv8 model
# Sizes: n (nano), s (small), m (medium), l (large), x (extra large)
model = YOLO('yolov8n.pt') # nano model (lightest)
def detect_with_yolo(image_path, conf_threshold=0.5):
"""
Perform object detection with YOLOv8
Args:
image_path: Image path or URL
conf_threshold: Confidence threshold
"""
# Run inference
results = model(image_path, conf=conf_threshold)
# Get results
result = results[0]
# Display detected object information
print(f"Number of detected objects: {len(result.boxes)}")
for box in result.boxes:
# Get class ID, confidence, coordinates
class_id = int(box.cls[0])
confidence = float(box.conf[0])
x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
class_name = model.names[class_id]
print(f" - {class_name}: {confidence:.3f} at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")
# Get result image (with annotations)
annotated_img = result.plot()
return annotated_img, result
# Usage example 1: Detect from image file
image_path = "path/to/your/image.jpg"
annotated_img, result = detect_with_yolo(image_path, conf_threshold=0.5)
# Save results
cv2.imwrite('yolov8_result.jpg', annotated_img)
print("Results saved to yolov8_result.jpg")
# Usage example 2: Detect from video file or Webcam
def detect_video(source=0, conf_threshold=0.5):
"""
Real-time detection from video or Webcam
Args:
source: Video file path or 0 (Webcam)
conf_threshold: Confidence threshold
"""
# Inference on video stream
results = model(source, stream=True, conf=conf_threshold)
for result in results:
# Process each frame
annotated_frame = result.plot()
# Display
cv2.imshow('YOLOv8 Detection', annotated_frame)
# Press 'q' to quit
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cv2.destroyAllWindows()
# Real-time detection with Webcam (uncomment to run)
# detect_video(source=0, conf_threshold=0.5)
# Detect from video file
# detect_video(source='path/to/video.mp4', conf_threshold=0.5)
SSD (Single Shot Detector)
SSD is a One-Stage detector like YOLO but performs detection from feature maps at multiple scales.
SSD Features
- Multi-scale feature maps: Detect from layers at different resolutions
- Default boxes: Predict boxes with multiple aspect ratios at each location
- Fast: Faster than YOLOv1 with higher mAP
RetinaNet (Focal Loss)
RetinaNet solved the class imbalance problem by introducing Focal Loss.
What is Focal Loss?
FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the model's predicted probability for the true class
It reduces loss for easy examples (like background) and focuses learning on hard examples.
Code Example 5: Focal Loss Implementation
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import torch.nn.functional as F
class FocalLoss(nn.Module):
"""
Focal Loss for Object Detection
Args:
alpha: Class weight (default: 0.25)
gamma: Focusing parameter (default: 2.0)
"""
def __init__(self, alpha=0.25, gamma=2.0):
super(FocalLoss, self).__init__()
self.alpha = alpha
self.gamma = gamma
def forward(self, predictions, targets):
"""
Args:
predictions: (N, num_classes) predicted probabilities
targets: (N,) ground truth labels
"""
# Cross Entropy Loss
ce_loss = F.cross_entropy(predictions, targets, reduction='none')
# Calculate p_t (predicted probability of correct class)
p = torch.exp(-ce_loss)
# Focal Loss
focal_loss = self.alpha * (1 - p) ** self.gamma * ce_loss
return focal_loss.mean()
# Usage example
num_classes = 91 # COCO
batch_size = 32
# Dummy data
predictions = torch.randn(batch_size, num_classes)
targets = torch.randint(0, num_classes, (batch_size,))
# Standard Cross Entropy Loss
ce_loss = F.cross_entropy(predictions, targets)
print(f"Cross Entropy Loss: {ce_loss.item():.4f}")
# Focal Loss
focal_loss_fn = FocalLoss(alpha=0.25, gamma=2.0)
focal_loss = focal_loss_fn(predictions, targets)
print(f"Focal Loss: {focal_loss.item():.4f}")
# Compare loss for easy vs hard examples
easy_predictions = torch.tensor([[10.0, 0.0, 0.0]]) # High confidence for correct class 0
hard_predictions = torch.tensor([[1.0, 0.9, 0.8]]) # Low confidence for correct class 0
targets_test = torch.tensor([0])
easy_loss = focal_loss_fn(easy_predictions, targets_test)
hard_loss = focal_loss_fn(hard_predictions, targets_test)
print(f"\nEasy example loss: {easy_loss.item():.4f}")
print(f"Hard example loss: {hard_loss.item():.4f}")
print(f"Hard example loss is {hard_loss.item() / easy_loss.item():.1f}x the easy example loss")
EfficientDet
EfficientDet is an efficient detector using EfficientNet as backbone and BiFPN (Bi-directional Feature Pyramid Network).
- Compound Scaling: Scale resolution, depth, and width simultaneously
- BiFPN: Bidirectional feature fusion
- High efficiency: Higher accuracy with fewer parameters than YOLOv3 or RetinaNet
3.4 Implementation and Training
COCO Dataset
COCO (Common Objects in Context) is the standard benchmark dataset for object detection.
- Number of images: 330K (train: 118K, val: 5K, test: 41K)
- Categories: 80 classes (people, animals, vehicles, furniture, etc.)
- Annotations: Bounding Boxes, segmentation, keypoints
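Annotations are distributed as a JSON file (e.g., instances_val2017.json). The sketch below reads it with the standard json module; the file path is an assumption, and COCO's [x, y, width, height] boxes are converted to XYXY here.
import json
from collections import defaultdict
# Path is an assumption -- point it at your local copy of the annotation file
with open('annotations/instances_val2017.json') as f:
    coco = json.load(f)
# id -> class name lookup
categories = {c['id']: c['name'] for c in coco['categories']}
# Group annotations per image; the COCO bbox format is [x, y, width, height]
boxes_per_image = defaultdict(list)
for ann in coco['annotations']:
    x, y, w, h = ann['bbox']
    boxes_per_image[ann['image_id']].append({
        'box_xyxy': [x, y, x + w, y + h],
        'class': categories[ann['category_id']],
    })
image_id = coco['images'][0]['id']
print(f"Image {image_id}: {len(boxes_per_image[image_id])} annotated objects")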
Code Example 6: Training PyTorch Object Detection
# Requirements:
# - Python 3.9+
# - pillow>=10.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torch.utils.data import DataLoader
import torchvision.transforms as T
# Custom dataset class
class CustomObjectDetectionDataset(torch.utils.data.Dataset):
"""
Custom object detection dataset
Returns images and annotations (boxes, labels)
"""
def __init__(self, image_paths, annotations, transforms=None):
self.image_paths = image_paths
self.annotations = annotations
self.transforms = transforms
def __len__(self):
return len(self.image_paths)
def __getitem__(self, idx):
# Load image
from PIL import Image
img = Image.open(self.image_paths[idx]).convert("RGB")
# Get annotations
boxes = self.annotations[idx]['boxes'] # [[x1,y1,x2,y2], ...]
labels = self.annotations[idx]['labels'] # [1, 2, 1, ...]
# Convert to tensors
boxes = torch.as_tensor(boxes, dtype=torch.float32)
labels = torch.as_tensor(labels, dtype=torch.int64)
target = {
'boxes': boxes,
'labels': labels,
'image_id': torch.tensor([idx])
}
if self.transforms:
img = self.transforms(img)
return img, target
def get_model(num_classes):
"""
Build Faster R-CNN model
Args:
num_classes: Number of classes (background + object classes)
"""
# Load pre-trained model
    model = fasterrcnn_resnet50_fpn(weights='DEFAULT')  # 'pretrained=True' is deprecated in recent torchvision
# Replace classification head
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
return model
def train_one_epoch(model, optimizer, data_loader, device):
"""
Train for one epoch
"""
model.train()
total_loss = 0
for images, targets in data_loader:
images = [img.to(device) for img in images]
targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
# Forward pass
loss_dict = model(images, targets)
losses = sum(loss for loss in loss_dict.values())
# Backward pass
optimizer.zero_grad()
losses.backward()
optimizer.step()
total_loss += losses.item()
return total_loss / len(data_loader)
# Training configuration
num_classes = 3 # background + 2 classes
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Model, optimizer, scheduler
model = get_model(num_classes)
model.to(device)
optimizer = torch.optim.SGD(
model.parameters(),
lr=0.005,
momentum=0.9,
weight_decay=0.0005
)
lr_scheduler = torch.optim.lr_scheduler.StepLR(
optimizer,
step_size=3,
gamma=0.1
)
# Dataset (dummy)
# In practice, prepare image paths and annotations
image_paths = ['img1.jpg', 'img2.jpg', 'img3.jpg']
annotations = [
{'boxes': [[10, 10, 50, 50]], 'labels': [1]},
{'boxes': [[20, 20, 60, 60], [70, 70, 100, 100]], 'labels': [1, 2]},
{'boxes': [[30, 30, 80, 80]], 'labels': [2]}
]
# transforms = T.Compose([T.ToTensor()])
# dataset = CustomObjectDetectionDataset(image_paths, annotations, transforms)
# data_loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=lambda x: tuple(zip(*x)))
# Training loop (if actual data is available)
# num_epochs = 10
# for epoch in range(num_epochs):
# train_loss = train_one_epoch(model, optimizer, data_loader, device)
# lr_scheduler.step()
# print(f"Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}")
# Save model
# torch.save(model.state_dict(), 'object_detection_model.pth')
print("Training script ready")
Training with Custom Dataset
Code Example 7: Training YOLOv8 with Custom Dataset
# Requirements:
# - Python 3.9+
# - pyyaml>=6.0.0
# - ultralytics>=8.0.0
"""
Example: Training with Custom Dataset
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: depends on dataset size and GPU (a full 100-epoch run can take hours)
Dependencies: listed in the Requirements block above
"""
from ultralytics import YOLO
import yaml
import os
# Create dataset configuration file
dataset_yaml = """
# Dataset path
path: ./custom_dataset # Root directory
train: images/train # Training images (relative to path)
val: images/val # Validation images
# Class definitions
names:
0: cat
1: dog
2: bird
"""
# Save dataset.yaml
with open('custom_dataset.yaml', 'w') as f:
f.write(dataset_yaml)
# Directory structure example:
# custom_dataset/
# ├── images/
# │ ├── train/
# │ │ ├── img1.jpg
# │ │ ├── img2.jpg
# │ │ └── ...
# │ └── val/
# │ ├── img1.jpg
# │ └── ...
# └── labels/
# ├── train/
# │ ├── img1.txt # YOLO format (class x_center y_center width height)
# │ ├── img2.txt
# │ └── ...
# └── val/
# ├── img1.txt
# └── ...
# Initialize YOLOv8 model
model = YOLO('yolov8n.pt') # Start from pre-trained weights
# Run training
results = model.train(
data='custom_dataset.yaml',
epochs=100,
imgsz=640,
batch=16,
name='custom_yolo',
# Other hyperparameters
lr0=0.01, # Initial learning rate
momentum=0.937, # SGD momentum
weight_decay=0.0005,
warmup_epochs=3,
patience=50, # Early stopping
# Data Augmentation
degrees=10.0, # Rotation
translate=0.1, # Translation
scale=0.5, # Scale
flipud=0.0, # Vertical flip
fliplr=0.5, # Horizontal flip
mosaic=1.0, # Mosaic augmentation
)
# Validation
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
# Inference
results = model('path/to/test/image.jpg')
# Save model (automatically saved, but can also save manually)
# model.save('custom_yolo_best.pt')
# Export model (ONNX, TensorRT, etc.)
# model.export(format='onnx')
print("\nTraining complete!")
print(f"Weights: runs/detect/custom_yolo/weights/best.pt")
print(f"Metrics: runs/detect/custom_yolo/results.csv")
Annotation file format (YOLO):
# img1.txt example (each line is one object)
0 0.5 0.5 0.3 0.2 # class=0, center=(0.5, 0.5), size=(0.3, 0.2)
1 0.7 0.3 0.2 0.15 # class=1, center=(0.7, 0.3), size=(0.2, 0.15)
# Coordinates are normalized by image size (0~1)
# class x_center y_center width height
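Labels are usually exported directly by an annotation tool, but the conversion itself is simple. Below is a minimal sketch that turns a pixel-space XYXY box into a YOLO-format label line; the class id, box, and image size are made up for illustration.
def xyxy_to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel XYXY box into a 'class x_center y_center width height' label line."""
    x1, y1, x2, y2 = box
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
# Example: a cat (class 0) at pixels (100, 150)-(300, 400) in a 640x480 image
print(xyxy_to_yolo_line(0, (100, 150, 300, 400), 640, 480))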
3.5 Advanced Techniques
Anchor-Free Detection
Traditional detectors rely on Anchor Boxes (pre-defined boxes), but Anchor-Free approaches eliminate this requirement.
Main Anchor-Free Methods
- FCOS (Fully Convolutional One-Stage): Predict distance from each pixel to object center
- CenterNet: Detect object center points, regress size and location
- YOLOv8: Adopts Anchor-Free approach
Benefits: No need for Anchor hyperparameter tuning, more flexible detection
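As an illustration of the FCOS parameterization: each feature-map location (x, y) predicts its distances (l, t, r, b) to the left, top, right, and bottom edges of the box, which decode directly into a box without any anchor.
def decode_fcos(x, y, l, t, r, b):
    """FCOS-style decoding: location (x, y) plus distances to the left/top/right/bottom edges."""
    return (x - l, y - t, x + r, y + b)  # XYXY box
# A location at (120, 80) predicting distances of 30/20/50/40 pixels
print(decode_fcos(120, 80, 30, 20, 50, 40))  # (90, 60, 170, 120)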
Object Tracking
Object Tracking is the task of continuously tracking objects across video frames. Used in combination with detectors.
SORT (Simple Online and Realtime Tracking)
- Detect objects in each frame
- Predict next frame positions with Kalman filter
- Match detections to tracks using Hungarian Algorithm
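The matching step can be sketched with SciPy's linear_sum_assignment on a cost matrix of 1 - IoU between the tracks' predicted boxes and the current detections. Note that SciPy is an extra dependency not listed in this chapter's requirements, the Kalman prediction step is omitted, and calculate_iou is reused from Code Example 1.
import numpy as np
from scipy.optimize import linear_sum_assignment  # extra dependency: scipy
def iou_matrix(tracks, detections):
    """Pairwise IoU between track boxes and detection boxes (both XYXY)."""
    ious = np.zeros((len(tracks), len(detections)))
    for i, tb in enumerate(tracks):
        for j, db in enumerate(detections):
            ious[i, j] = calculate_iou(tb, db)  # IoU function from Code Example 1
    return ious
def match(tracks, detections, iou_min=0.3):
    """Assign detections to tracks by minimizing the total (1 - IoU) cost."""
    ious = iou_matrix(tracks, detections)
    rows, cols = linear_sum_assignment(1.0 - ious)
    # Drop assignments whose overlap is too small to be the same object
    return [(int(r), int(c)) for r, c in zip(rows, cols) if ious[r, c] >= iou_min]
tracks = [[50, 50, 150, 150], [300, 300, 400, 400]]      # predicted track positions
detections = [[305, 295, 405, 395], [55, 48, 152, 149]]  # current-frame detections
print(match(tracks, detections))  # [(0, 1), (1, 0)]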
DeepSORT
Adds appearance features (Deep features) to SORT for more robust tracking.
Code Example 8: YOLOv8 + Object Tracking
# Requirements:
# - Python 3.9+
# - opencv-python>=4.8.0
# - ultralytics>=8.0.0
from ultralytics import YOLO
import cv2
# Load YOLOv8 model
model = YOLO('yolov8n.pt')
def track_objects_video(video_path, output_path='tracking_result.mp4'):
"""
Detect and track objects in video
Args:
video_path: Input video path
output_path: Output video path
"""
# Video capture
cap = cv2.VideoCapture(video_path)
# Output settings
fps = int(cap.get(cv2.CAP_PROP_FPS))
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter(output_path, fourcc, fps, (width, height))
frame_count = 0
# Inference in tracking mode
results = model.track(video_path, stream=True, persist=True, conf=0.5)
for result in results:
frame_count += 1
# Annotated frame
annotated_frame = result.plot()
# Display tracking IDs
if result.boxes.id is not None:
for box, track_id in zip(result.boxes.xyxy, result.boxes.id):
x1, y1, x2, y2 = box.cpu().numpy()
track_id = int(track_id.cpu().numpy())
# Draw tracking ID
cv2.putText(
annotated_frame,
f"ID: {track_id}",
(int(x1), int(y1) - 30),
cv2.FONT_HERSHEY_SIMPLEX,
0.9,
(0, 255, 0),
2
)
# Write frame
out.write(annotated_frame)
# Display
cv2.imshow('Object Tracking', annotated_frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
out.release()
cv2.destroyAllWindows()
print(f"Processing complete: {frame_count} frames")
print(f"Output: {output_path}")
# Usage example
# track_objects_video('input_video.mp4', 'output_tracking.mp4')
# Real-time tracking with Webcam
def track_webcam():
"""
Real-time object tracking with Webcam
"""
results = model.track(source=0, stream=True, persist=True, conf=0.5)
for result in results:
annotated_frame = result.plot()
cv2.imshow('Real-time Tracking', annotated_frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cv2.destroyAllWindows()
# Real-time tracking (uncomment to run)
# track_webcam()
print("Object tracking script ready")
Multi-Scale Detection
Multi-scale detection is important because object sizes vary greatly across images.
Techniques
- Image Pyramid: Resize image to multiple scales and detect
- Feature Pyramid: Detect at multiple feature map levels (FPN)
- Multi-scale Training: Use different input sizes during training
Real-Time Optimization
Speed-Up Techniques
- Model lightweighting: YOLOv8n, MobileNet-SSD
- Quantization: FP32 → FP16 → INT8
- TensorRT: NVIDIA inference optimization engine
- ONNX Runtime: Cross-platform inference
- Resolution adjustment: Reduce input image size (e.g., 640→416)
Speed vs Accuracy Tradeoff:
- Real-time requirements: YOLOv8n/s (30+ FPS)
- High accuracy requirements: YOLOv8x, Faster R-CNN with FPN
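As a starting point for such profiling, the sketch below times YOLOv8n inference at a few input resolutions and then exports the model to ONNX. The test image path is a placeholder, and absolute FPS numbers depend entirely on your hardware.
import time
from ultralytics import YOLO
model = YOLO('yolov8n.pt')
image = 'path/to/test/image.jpg'  # placeholder
for size in (640, 416, 320):
    model(image, imgsz=size, verbose=False)  # warm-up run
    start = time.perf_counter()
    for _ in range(20):
        model(image, imgsz=size, verbose=False)
    fps = 20 / (time.perf_counter() - start)
    print(f"imgsz={size}: {fps:.1f} FPS")
# Export to ONNX for deployment with ONNX Runtime or TensorRT
model.export(format='onnx', imgsz=640)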
Exercises
Exercise 1: Understanding IoU and NMS
Problem: Calculate IoU and apply NMS to the following set of Bounding Boxes.
boxes = np.array([
[100, 100, 200, 200],
[110, 110, 210, 210],
[105, 105, 205, 205],
[300, 300, 400, 400]
])
scores = np.array([0.9, 0.85, 0.95, 0.8])
- Calculate IoU of each box with Box 0
- Apply NMS with IoU threshold 0.5
- Identify which boxes remain
Exercise 2: Detection with Faster R-CNN
Problem: Create a script that detects only specific classes (e.g., person, car) from multiple images using pre-trained Faster R-CNN.
- Load multiple images
- Filter only specified classes
- Display only detections with confidence >= 0.7
- Aggregate detection counts
Exercise 3: YOLOv8 Model Size Comparison
Problem: Detect the same image with different YOLOv8 sizes (n, s, m, l) and compare accuracy and speed.
- Measure inference time for each model
- Compare number of detected objects
- Analyze distribution of confidence scores
- Determine which model is optimal
Exercise 4: Preparing Custom Dataset
Problem: Prepare your own image dataset (10+ images) and create YOLO format annotation files.
- Create annotations using tools like LabelImg
- Convert to YOLO format (class x_center y_center width height)
- Split into train/val (80/20)
- Create dataset.yaml
Exercise 5: Object Tracking Implementation
Problem: Track objects from video file or Webcam and draw trajectory for each object.
- Use YOLOv8 tracking feature
- Save trajectory for each tracking ID
- Draw trajectories as lines
- Verify ID stability across frames
Exercise 6: Real-Time Detection Optimization
Problem: Optimize the model to maximize detection speed.
- Export YOLOv8 to ONNX format
- Measure FPS at different input resolutions (320, 416, 640)
- Adjust confidence threshold to improve speed
- Analyze speed vs accuracy tradeoff
Summary
In this chapter, we learned about object detection from fundamentals to practice:
- ✅ Object Detection Fundamentals: Bounding Box representation, IoU, NMS, evaluation metrics (mAP)
- ✅ Two-Stage Detectors: Evolution and features of R-CNN, Fast R-CNN, Faster R-CNN, FPN
- ✅ One-Stage Detectors: Principles and comparison of YOLO, SSD, RetinaNet, EfficientDet
- ✅ Implementation and Training: Practical detection with PyTorch and YOLOv8, training with custom datasets
- ✅ Advanced Techniques: Anchor-free detection, object tracking, multi-scale detection, real-time optimization
In the next chapter, we will learn about Semantic Segmentation. We will understand more detailed image understanding methods including pixel-level classification, U-Net, DeepLab, and Mask R-CNN.
Key Point: In object detection, the tradeoff between real-time performance and accuracy is crucial. Select appropriate models and parameters based on application requirements (speed priority or accuracy priority).
References
- Girshick et al. (2014). "Rich feature hierarchies for accurate object detection and semantic segmentation" (R-CNN)
- Girshick (2015). "Fast R-CNN"
- Ren et al. (2015). "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks"
- Redmon et al. (2016). "You Only Look Once: Unified, Real-Time Object Detection" (YOLO)
- Liu et al. (2016). "SSD: Single Shot MultiBox Detector"
- Lin et al. (2017). "Focal Loss for Dense Object Detection" (RetinaNet)
- Lin et al. (2017). "Feature Pyramid Networks for Object Detection"
- Bochkovskiy et al. (2020). "YOLOv4: Optimal Speed and Accuracy of Object Detection"
- Ultralytics YOLOv8: https://github.com/ultralytics/ultralytics
- COCO Dataset: https://cocodataset.org/