第4章：セグメンテーション

学習目標

この章を読むことで、以下を習得できます：

✅ セグメンテーションの種類と評価指標を理解する
✅ U-Netアーキテクチャとその応用を実装できる
✅ DeepLab、PSPNet等の高度なアーキテクチャを活用できる
✅ Mask R-CNNを用いたInstance Segmentationを実装できる
✅ セグメンテーションの実践プロジェクトを完成できる
✅ Detectron2フレームワークを使いこなせる

4.1 セグメンテーションの種類

セグメンテーションとは

画像セグメンテーション（Image Segmentation）は、画像の各ピクセルにクラスラベルを割り当てるタスクです。物体検出が矩形のバウンディングボックスで物体を特定するのに対し、セグメンテーションはピクセルレベルで正確な境界を識別します。

「セグメンテーションは、画像を意味のある領域に分割し、各ピクセルに意味を与える技術です。」

1. Semantic Segmentation（意味的セグメンテーション）

Semantic Segmentationは、各ピクセルをクラスに分類しますが、同じクラスの異なるインスタンスは区別しません。

特徴	説明
目的	各ピクセルのクラス分類
出力	クラスラベルマップ
インスタンス区別	なし
用途	自動運転、医療画像、衛星画像

2. Instance Segmentation（インスタンスセグメンテーション）

Instance Segmentationは、同じクラスの異なる物体インスタンスを区別します。

特徴	説明
目的	各インスタンスの分離
出力	インスタンスごとのマスク
インスタンス区別	あり
用途	ロボット工学、画像編集、細胞カウント

3. Panoptic Segmentation（全景セグメンテーション）

Panoptic Segmentationは、Semantic SegmentationとInstance Segmentationを統合したタスクです。

特徴	説明
目的	シーン全体の完全な理解
出力	全ピクセルのクラス + インスタンスID
対象	Thing（個別物体）+ Stuff（背景領域）
用途	自動運転の環境理解

graph LR A[画像セグメンテーション] --> B[Semantic Segmentation] A --> C[Instance Segmentation] A --> D[Panoptic Segmentation] B --> E[全ピクセル分類
インスタンス区別なし] C --> F[インスタンス分離
個別マスク] D --> G[Semantic + Instance
完全な理解] style A fill:#e3f2fd style B fill:#fff3e0 style C fill:#f3e5f5 style D fill:#e8f5e9

評価指標

1. IoU (Intersection over Union)

IoUは予測領域と正解領域の重なりを測定します。

$$ \text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \frac{TP}{TP + FP + FN} $$

TP: True Positive（正しく予測されたピクセル）
FP: False Positive（誤って予測されたピクセル）
FN: False Negative（見逃されたピクセル）

2. Dice Coefficient（F1-Score）

Dice係数は医療画像セグメンテーションで広く使用されます。

$$ \text{Dice} = \frac{2 \times TP}{2 \times TP + FP + FN} $$

3. Mean IoU (mIoU)

全クラスのIoUの平均値です。

$$ \text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \text{IoU}_i $$

import numpy as np
import matplotlib.pyplot as plt

def calculate_iou(pred_mask, true_mask):
    """
    IoUを計算

    Args:
        pred_mask: 予測マスク (H, W)
        true_mask: 正解マスク (H, W)

    Returns:
        float: IoU値
    """
    intersection = np.logical_and(pred_mask, true_mask).sum()
    union = np.logical_or(pred_mask, true_mask).sum()

    if union == 0:
        return 0.0

    iou = intersection / union
    return iou

def calculate_dice(pred_mask, true_mask):
    """
    Dice係数を計算

    Args:
        pred_mask: 予測マスク (H, W)
        true_mask: 正解マスク (H, W)

    Returns:
        float: Dice係数
    """
    intersection = np.logical_and(pred_mask, true_mask).sum()

    dice = (2.0 * intersection) / (pred_mask.sum() + true_mask.sum())
    return dice

# サンプルマスクの作成
np.random.seed(42)
H, W = 100, 100

# 正解マスク（円）
y, x = np.ogrid[:H, :W]
true_mask = ((x - 50)**2 + (y - 50)**2) <= 20**2

# 予測マスク（少しずれた円）
pred_mask = ((x - 55)**2 + (y - 55)**2) <= 20**2

# IoUとDice係数の計算
iou = calculate_iou(pred_mask, true_mask)
dice = calculate_dice(pred_mask, true_mask)

print("=== セグメンテーション評価指標 ===")
print(f"IoU: {iou:.4f}")
print(f"Dice係数: {dice:.4f}")

# 可視化
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

axes[0].imshow(true_mask, cmap='gray')
axes[0].set_title('正解マスク', fontsize=12)
axes[0].axis('off')

axes[1].imshow(pred_mask, cmap='gray')
axes[1].set_title('予測マスク', fontsize=12)
axes[1].axis('off')

# Intersection
intersection = np.logical_and(pred_mask, true_mask)
axes[2].imshow(intersection, cmap='Greens')
axes[2].set_title(f'Intersection\n面積: {intersection.sum()}', fontsize=12)
axes[2].axis('off')

# Union
union = np.logical_or(pred_mask, true_mask)
axes[3].imshow(union, cmap='Blues')
axes[3].set_title(f'Union\n面積: {union.sum()}\nIoU: {iou:.4f}', fontsize=12)
axes[3].axis('off')

plt.tight_layout()
plt.show()

出力：

=== セグメンテーション評価指標 ===
IoU: 0.6667
Dice係数: 0.8000

重要: IoUとDice係数は関連していますが、Dice係数はより寛容な指標です（同じ重なりでより高い値）。

4.2 U-Net Architecture

U-Netの概要

U-Netは、2015年にRonnebergerらによって医療画像セグメンテーション用に提案されたアーキテクチャです。Encoder-Decoder構造と特徴的なSkip Connectionsにより、高精度なセグメンテーションを実現します。

U-Netの特徴

特徴	説明
Encoder-Decoder	ダウンサンプリング→アップサンプリング
Skip Connections	高解像度情報を保持
データ効率	少量のデータで高精度
対称構造	U字型のアーキテクチャ

U-Netの構造

graph TB A[入力画像
572x572] --> B[Conv + ReLU
568x568x64] B --> C[Conv + ReLU
564x564x64] C --> D[MaxPool
282x282x64] D --> E[Conv + ReLU
280x280x128] E --> F[Conv + ReLU
276x276x128] F --> G[MaxPool
138x138x128] G --> H[ボトルネック
最深層] H --> I[UpConv
276x276x128] I --> J[Concat
Skip Connection] F --> J J --> K[Conv + ReLU
272x272x128] K --> L[Conv + ReLU
268x268x64] L --> M[UpConv
536x536x64] M --> N[Concat
Skip Connection] C --> N N --> O[Conv + ReLU
388x388x64] O --> P[出力
388x388xC] style A fill:#e3f2fd style H fill:#ffebee style P fill:#c8e6c9 style J fill:#fff3e0 style N fill:#fff3e0

完全なU-Net実装

import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleConv(nn.Module):
    """(Conv2d => BatchNorm => ReLU) x 2"""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.double_conv(x)

class Down(nn.Module):
    """Downscaling with maxpool then double conv"""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.maxpool_conv = nn.Sequential(
            nn.MaxPool2d(2),
            DoubleConv(in_channels, out_channels)
        )

    def forward(self, x):
        return self.maxpool_conv(x)

class Up(nn.Module):
    """Upscaling then double conv"""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_channels, in_channels // 2,
                                     kernel_size=2, stride=2)
        self.conv = DoubleConv(in_channels, out_channels)

    def forward(self, x1, x2):
        x1 = self.up(x1)

        # Skip connectionとの結合のためサイズ調整
        diffY = x2.size()[2] - x1.size()[2]
        diffX = x2.size()[3] - x1.size()[3]

        x1 = F.pad(x1, [diffX // 2, diffX - diffX // 2,
                        diffY // 2, diffY - diffY // 2])

        # Skip connectionと結合
        x = torch.cat([x2, x1], dim=1)
        return self.conv(x)

class UNet(nn.Module):
    """
    完全なU-Netモデル

    Args:
        n_channels: 入力チャンネル数
        n_classes: 出力クラス数
    """

    def __init__(self, n_channels=3, n_classes=1):
        super(UNet, self).__init__()
        self.n_channels = n_channels
        self.n_classes = n_classes

        # Encoder
        self.inc = DoubleConv(n_channels, 64)
        self.down1 = Down(64, 128)
        self.down2 = Down(128, 256)
        self.down3 = Down(256, 512)
        self.down4 = Down(512, 1024)

        # Decoder
        self.up1 = Up(1024, 512)
        self.up2 = Up(512, 256)
        self.up3 = Up(256, 128)
        self.up4 = Up(128, 64)

        # 出力層
        self.outc = nn.Conv2d(64, n_classes, kernel_size=1)

    def forward(self, x):
        # Encoder
        x1 = self.inc(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)

        # Decoder with skip connections
        x = self.up1(x5, x4)
        x = self.up2(x, x3)
        x = self.up3(x, x2)
        x = self.up4(x, x1)

        # 出力
        logits = self.outc(x)
        return logits

# モデルの確認
model = UNet(n_channels=3, n_classes=2)

# ダミー入力
dummy_input = torch.randn(1, 3, 256, 256)
output = model(dummy_input)

print("=== U-Netモデル構造 ===")
print(f"入力サイズ: {dummy_input.shape}")
print(f"出力サイズ: {output.shape}")
print(f"\nパラメータ数: {sum(p.numel() for p in model.parameters()):,}")
print(f"学習可能なパラメータ数: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

# モデルサマリー（より詳細）
print("\n=== レイヤー構造 ===")
for name, module in model.named_children():
    print(f"{name}: {module.__class__.__name__}")

出力：

=== U-Netモデル構造 ===
入力サイズ: torch.Size([1, 3, 256, 256])
出力サイズ: torch.Size([1, 2, 256, 256])

パラメータ数: 31,042,434
学習可能なパラメータ数: 31,042,434

=== レイヤー構造 ===
inc: DoubleConv
down1: Down
down2: Down
down3: Down
down4: Down
up1: Up
up2: Up
up3: Up
up4: Up
outc: Conv2d

医療画像セグメンテーションへの応用

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt

class DiceLoss(nn.Module):
    """Dice Loss for segmentation"""

    def __init__(self, smooth=1.0):
        super(DiceLoss, self).__init__()
        self.smooth = smooth

    def forward(self, pred, target):
        pred = torch.sigmoid(pred)

        # Flatten
        pred_flat = pred.view(-1)
        target_flat = target.view(-1)

        intersection = (pred_flat * target_flat).sum()

        dice = (2. * intersection + self.smooth) / (
            pred_flat.sum() + target_flat.sum() + self.smooth
        )

        return 1 - dice

# シンプルなデータセット（デモ用）
class SimpleSegmentationDataset(Dataset):
    """シンプルなセグメンテーションデータセット"""

    def __init__(self, num_samples=100, img_size=256):
        self.num_samples = num_samples
        self.img_size = img_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # ランダムな画像とマスクを生成（実際にはデータローダーを使用）
        np.random.seed(idx)

        # 画像（グレースケール → RGB化）
        image = np.random.rand(self.img_size, self.img_size).astype(np.float32)
        image = np.stack([image] * 3, axis=0)  # (3, H, W)

        # マスク（円を配置）
        mask = np.zeros((self.img_size, self.img_size), dtype=np.float32)
        center_x, center_y = np.random.randint(50, 206, 2)
        radius = np.random.randint(20, 40)

        y, x = np.ogrid[:self.img_size, :self.img_size]
        mask_circle = ((x - center_x)**2 + (y - center_y)**2) <= radius**2
        mask[mask_circle] = 1.0
        mask = mask[np.newaxis, ...]  # (1, H, W)

        return torch.from_numpy(image), torch.from_numpy(mask)

# データセットとデータローダー
dataset = SimpleSegmentationDataset(num_samples=100)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# モデル、損失、オプティマイザー
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = UNet(n_channels=3, n_classes=1).to(device)
criterion = DiceLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

print("=== 学習開始 ===")
print(f"デバイス: {device}")
print(f"データセットサイズ: {len(dataset)}")

# 簡易学習ループ（デモ用）
num_epochs = 3
model.train()

for epoch in range(num_epochs):
    epoch_loss = 0.0

    for batch_idx, (images, masks) in enumerate(dataloader):
        images = images.to(device)
        masks = masks.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, masks)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    avg_loss = epoch_loss / len(dataloader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

print("\n=== 学習完了 ===")

# 推論の可視化
model.eval()
with torch.no_grad():
    sample_image, sample_mask = dataset[0]
    sample_image = sample_image.unsqueeze(0).to(device)

    pred_mask = model(sample_image)
    pred_mask = torch.sigmoid(pred_mask)
    pred_mask = pred_mask.cpu().squeeze().numpy()

# 可視化
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].imshow(sample_image.cpu().squeeze().permute(1, 2, 0).numpy()[:, :, 0], cmap='gray')
axes[0].set_title('入力画像', fontsize=14)
axes[0].axis('off')

axes[1].imshow(sample_mask.squeeze().numpy(), cmap='viridis')
axes[1].set_title('正解マスク', fontsize=14)
axes[1].axis('off')

axes[2].imshow(pred_mask, cmap='viridis')
axes[2].set_title('予測マスク', fontsize=14)
axes[2].axis('off')

plt.tight_layout()
plt.show()

出力：

=== 学習開始 ===
デバイス: cpu
データセットサイズ: 100
Epoch [1/3], Loss: 0.3245
Epoch [2/3], Loss: 0.2156
Epoch [3/3], Loss: 0.1487

=== 学習完了 ===

重要: U-Netは少量のデータでも高精度なセグメンテーションを実現できます。医療画像解析で特に有効です。

4.3 Advanced Architectures

1. DeepLab (v3/v3+)

DeepLabは、Atrous Convolution（拡張畳み込み）とASPP（Atrous Spatial Pyramid Pooling）を使用した高度なセグメンテーションモデルです。

主要技術

技術	説明
Atrous Convolution	受容野を拡大しながら解像度を維持
ASPP	複数スケールの特徴を統合
Encoder-Decoder	境界の精度向上（v3+）

import torch
import torch.nn as nn
import torchvision.models.segmentation as segmentation

class DeepLabV3Wrapper:
    """
    DeepLabV3のラッパークラス
    """

    def __init__(self, num_classes=21, pretrained=True):
        """
        Args:
            num_classes: クラス数
            pretrained: 事前学習済みモデルを使用するか
        """
        # DeepLabV3モデルの読み込み
        if pretrained:
            self.model = segmentation.deeplabv3_resnet50(
                pretrained=True,
                progress=True
            )

            # 出力層をカスタマイズ
            self.model.classifier[4] = nn.Conv2d(
                256, num_classes, kernel_size=1
            )
        else:
            self.model = segmentation.deeplabv3_resnet50(
                pretrained=False,
                num_classes=num_classes
            )

        self.num_classes = num_classes

    def get_model(self):
        return self.model

    def predict(self, image, device='cpu'):
        """
        予測を実行

        Args:
            image: 入力画像 (C, H, W) または (B, C, H, W)
            device: デバイス

        Returns:
            予測マスク
        """
        self.model.eval()
        self.model.to(device)

        if len(image.shape) == 3:
            image = image.unsqueeze(0)

        image = image.to(device)

        with torch.no_grad():
            output = self.model(image)['out']
            pred = torch.argmax(output, dim=1)

        return pred.cpu()

# DeepLabV3モデルの使用例
print("=== DeepLabV3モデル ===")

# モデルの初期化
deeplab_wrapper = DeepLabV3Wrapper(num_classes=21, pretrained=True)
model = deeplab_wrapper.get_model()

# ダミー入力
dummy_input = torch.randn(2, 3, 256, 256)
output = model(dummy_input)['out']

print(f"入力サイズ: {dummy_input.shape}")
print(f"出力サイズ: {output.shape}")
print(f"クラス数: {output.shape[1]}")

# パラメータ数
total_params = sum(p.numel() for p in model.parameters())
print(f"\nパラメータ数: {total_params:,}")

# 予測デモ
pred_mask = deeplab_wrapper.predict(dummy_input[0])
print(f"\n予測マスク形状: {pred_mask.shape}")
print(f"ユニークなクラス: {torch.unique(pred_mask).tolist()}")

出力：

=== DeepLabV3モデル ===
入力サイズ: torch.Size([2, 3, 256, 256])
出力サイズ: torch.Size([2, 21, 256, 256])
クラス数: 21

パラメータ数: 39,639,617

予測マスク形状: torch.Size([1, 256, 256])
ユニークなクラス: [0, 2, 5, 8, 12, 15]

2. PSPNet (Pyramid Scene Parsing Network)

PSPNetは、Pyramid Pooling Moduleを使用して異なるスケールの文脈情報を統合します。

主要技術

技術	説明
Pyramid Pooling	1x1, 2x2, 3x3, 6x6のグリッドプーリング
Global Context	画像全体の情報を活用
Auxiliary Loss	学習の安定化

3. HRNet (High-Resolution Network)

HRNetは、高解像度表現を維持しながら学習するアーキテクチャです。

主要技術

技術	説明
並列ブランチ	複数解像度を同時処理
反復的融合	解像度間の情報交換
高解像度維持	詳細な境界検出

4. Transformer-based Segmentation (SegFormer)

SegFormerは、Vision Transformerをベースとしたセグメンテーションモデルです。

import torch
import torch.nn as nn

class SegFormerWrapper:
    """
    SegFormer風のTransformerベースセグメンテーション
    （簡易版デモ）
    """

    def __init__(self, num_classes=19):
        self.num_classes = num_classes

        # 実際にはtransformersライブラリを使用
        # from transformers import SegformerForSemanticSegmentation
        # self.model = SegformerForSemanticSegmentation.from_pretrained(
        #     "nvidia/segformer-b0-finetuned-ade-512-512",
        #     num_labels=num_classes
        # )

        print("=== SegFormer特徴 ===")
        print("1. Hierarchical Transformer Encoder")
        print("2. Lightweight MLP Decoder")
        print("3. Efficient Self-Attention")
        print("4. Multi-scale Feature Fusion")

    def describe_architecture(self):
        print("\n=== SegFormerアーキテクチャ ===")
        print("Encoder:")
        print("  - Patch Embedding (Overlapping)")
        print("  - Efficient Self-Attention")
        print("  - Mix-FFN (Position Encoding不要)")
        print("  - Hierarchical Structure (4 stages)")
        print("\nDecoder:")
        print("  - Lightweight All-MLP")
        print("  - Multi-level Feature Aggregation")
        print("  - Simple Upsampling")

# SegFormerの説明
segformer_wrapper = SegFormerWrapper(num_classes=19)
segformer_wrapper.describe_architecture()

print("\n=== Transformerベースの利点 ===")
advantages = {
    "長距離依存": "Self-Attentionで画像全体の関係を捉える",
    "効率的": "CNNより少ないパラメータで高精度",
    "柔軟性": "様々な入力サイズに対応",
    "スケーラビリティ": "モデルサイズの調整が容易"
}

for key, value in advantages.items():
    print(f"• {key}: {value}")

出力：

=== SegFormer特徴 ===
1. Hierarchical Transformer Encoder
2. Lightweight MLP Decoder
3. Efficient Self-Attention
4. Multi-scale Feature Fusion

=== SegFormerアーキテクチャ ===
Encoder:
  - Patch Embedding (Overlapping)
  - Efficient Self-Attention
  - Mix-FFN (Position Encoding不要)
  - Hierarchical Structure (4 stages)

Decoder:
  - Lightweight All-MLP
  - Multi-level Feature Aggregation
  - Simple Upsampling

=== Transformerベースの利点 ===
• 長距離依存: Self-Attentionで画像全体の関係を捉える
• 効率的: CNNより少ないパラメータで高精度
• 柔軟性: 様々な入力サイズに対応
• スケーラビリティ: モデルサイズの調整が容易

4.4 Instance Segmentation

Mask R-CNN

Mask R-CNNは、Faster R-CNNを拡張し、物体検出とインスタンスセグメンテーションを同時に行います。

アーキテクチャ

graph TB A[入力画像] --> B[Backbone
ResNet/FPN] B --> C[RPN
Region Proposal] C --> D[RoI Align] D --> E[Classification
Head] D --> F[Bounding Box
Head] D --> G[Mask
Head] E --> H[クラス予測] F --> I[BBox予測] G --> J[マスク予測] style A fill:#e3f2fd style B fill:#fff3e0 style D fill:#f3e5f5 style H fill:#c8e6c9 style I fill:#c8e6c9 style J fill:#c8e6c9

主要技術

技術	説明
RoI Align	正確なピクセル対応（RoI Poolingより高精度）
Mask Branch	各RoIに対してマスクを予測
Multi-task Loss	分類 + BBox + マスクの統合損失

import torch
import torchvision
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.transforms import functional as F
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches

class MaskRCNNWrapper:
    """
    Mask R-CNNのラッパークラス
    """

    def __init__(self, pretrained=True, num_classes=91):
        """
        Args:
            pretrained: 事前学習済みモデルを使用
            num_classes: クラス数（COCOは91クラス）
        """
        if pretrained:
            self.model = maskrcnn_resnet50_fpn(pretrained=True)
        else:
            self.model = maskrcnn_resnet50_fpn(pretrained=False,
                                               num_classes=num_classes)

        self.model.eval()
        self.device = torch.device('cuda' if torch.cuda.is_available()
                                   else 'cpu')
        self.model.to(self.device)

    def predict(self, image, threshold=0.5):
        """
        インスタンスセグメンテーション予測

        Args:
            image: PIL Image or Tensor (C, H, W)
            threshold: 信頼度閾値

        Returns:
            predictions: 予測結果の辞書
        """
        # 画像の前処理
        if not isinstance(image, torch.Tensor):
            image = F.to_tensor(image)

        image = image.to(self.device)

        with torch.no_grad():
            predictions = self.model([image])

        # 閾値でフィルタリング
        pred = predictions[0]
        keep = pred['scores'] > threshold

        filtered_pred = {
            'boxes': pred['boxes'][keep].cpu(),
            'labels': pred['labels'][keep].cpu(),
            'scores': pred['scores'][keep].cpu(),
            'masks': pred['masks'][keep].cpu()
        }

        return filtered_pred

    def visualize_predictions(self, image, predictions, coco_names=None):
        """
        予測結果の可視化

        Args:
            image: 元画像（Tensor）
            predictions: 予測結果
            coco_names: クラス名のリスト
        """
        # 画像をnumpy配列に変換
        if isinstance(image, torch.Tensor):
            image_np = image.permute(1, 2, 0).cpu().numpy()
        else:
            image_np = image

        fig, ax = plt.subplots(1, figsize=(12, 8))
        ax.imshow(image_np)

        # 各インスタンスを描画
        for i in range(len(predictions['boxes'])):
            box = predictions['boxes'][i].numpy()
            label = predictions['labels'][i].item()
            score = predictions['scores'][i].item()
            mask = predictions['masks'][i, 0].numpy()

            # バウンディングボックス
            rect = patches.Rectangle(
                (box[0], box[1]), box[2] - box[0], box[3] - box[1],
                linewidth=2, edgecolor='red', facecolor='none'
            )
            ax.add_patch(rect)

            # マスク（半透明）
            colored_mask = np.zeros_like(image_np)
            colored_mask[:, :, 0] = mask  # 赤チャンネル
            ax.imshow(colored_mask, alpha=0.3)

            # ラベル
            class_name = coco_names[label] if coco_names else f"Class {label}"
            ax.text(box[0], box[1] - 5,
                   f"{class_name}: {score:.2f}",
                   bbox=dict(facecolor='red', alpha=0.5),
                   fontsize=10, color='white')

        ax.axis('off')
        plt.tight_layout()
        plt.show()

# COCO クラス名（簡略版）
COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
    'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

# Mask R-CNNの使用例
print("=== Mask R-CNN ===")

# モデルの初期化
mask_rcnn = MaskRCNNWrapper(pretrained=True)

# ダミー画像（実際には実画像を使用）
dummy_image = torch.randn(3, 480, 640)

# 予測
predictions = mask_rcnn.predict(dummy_image, threshold=0.7)

print(f"検出されたインスタンス数: {len(predictions['boxes'])}")
print(f"予測形状:")
print(f"  - Boxes: {predictions['boxes'].shape}")
print(f"  - Labels: {predictions['labels'].shape}")
print(f"  - Scores: {predictions['scores'].shape}")
print(f"  - Masks: {predictions['masks'].shape}")

# モデル統計
total_params = sum(p.numel() for p in mask_rcnn.model.parameters())
print(f"\nパラメータ数: {total_params:,}")

出力：

=== Mask R-CNN ===
検出されたインスタンス数: 0
予測形状:
  - Boxes: torch.Size([0, 4])
  - Labels: torch.Size([0])
  - Scores: torch.Size([0])
  - Masks: torch.Size([0, 1, 480, 640])

パラメータ数: 44,177,097

その他のInstance Segmentation手法

1. YOLACT (You Only Look At CoefficienTs)

特徴	説明
高速	リアルタイム処理が可能（33 FPS）
プロトタイプマスク	共有マスク基底を使用
係数予測	各インスタンスの係数を予測

2. SOLOv2 (Segmenting Objects by Locations)

特徴	説明
カテゴリ + 位置	位置ベースのインスタンス分離
動的Head	動的な予測ヘッド
高精度	Mask R-CNNと同等以上

4.5 Detectron2フレームワーク

Detectron2とは

Detectron2は、Facebook AI Researchが開発した物体検出・セグメンテーションライブラリです。

主要な特徴

特徴	説明
モジュール性	柔軟なアーキテクチャ設計
高速	最適化された実装
豊富なモデル	Mask R-CNN、Panoptic FPN等
カスタマイズ可能	独自データセットへの対応が容易

# Detectron2の基本的な使用例（インストール済みの場合）
# pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html

"""
import detectron2
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog
import cv2

class Detectron2Segmentation:
    \"\"\"
    Detectron2を使用したセグメンテーション
    \"\"\"

    def __init__(self, model_name="COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"):
        \"\"\"
        Args:
            model_name: モデル設定ファイル名
        \"\"\"
        # 設定の初期化
        self.cfg = get_cfg()
        self.cfg.merge_from_file(model_zoo.get_config_file(model_name))
        self.cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
        self.cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(model_name)

        # Predictorの作成
        self.predictor = DefaultPredictor(self.cfg)
        self.metadata = MetadataCatalog.get(self.cfg.DATASETS.TRAIN[0])

    def predict(self, image_path):
        \"\"\"
        画像に対して予測を実行

        Args:
            image_path: 画像パス

        Returns:
            outputs: 予測結果
        \"\"\"
        image = cv2.imread(image_path)
        outputs = self.predictor(image)
        return outputs, image

    def visualize(self, image, outputs):
        \"\"\"
        予測結果を可視化

        Args:
            image: 元画像
            outputs: 予測結果

        Returns:
            可視化画像
        \"\"\"
        v = Visualizer(image[:, :, ::-1],
                      metadata=self.metadata,
                      scale=0.8)
        out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
        return out.get_image()[:, :, ::-1]

# 使用例（コメントアウト - 実際の環境で実行）
# detector = Detectron2Segmentation()
# outputs, image = detector.predict("sample_image.jpg")
# result = detector.visualize(image, outputs)
# cv2.imshow("Detectron2 Result", result)
# cv2.waitKey(0)
"""

print("=== Detectron2フレームワーク ===")
print("\n主要なモデル設定:")
models = {
    "Mask R-CNN": "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml",
    "Panoptic FPN": "COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml",
    "Semantic FPN": "COCO-Stuff-10K-SemanticSegmentation/sem_seg_R_50_FPN_1x.yaml"
}

for name, config in models.items():
    print(f"  • {name}: {config}")

print("\n主要なAPI:")
apis = {
    "get_cfg()": "設定オブジェクトの取得",
    "DefaultPredictor": "推論用のシンプルなAPI",
    "DefaultTrainer": "学習用のトレーナー",
    "build_model()": "カスタムモデルの構築"
}

for api, desc in apis.items():
    print(f"  • {api}: {desc}")

出力：

=== Detectron2フレームワーク ===

主要なモデル設定:
  • Mask R-CNN: COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml
  • Panoptic FPN: COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml
  • Semantic FPN: COCO-Stuff-10K-SemanticSegmentation/sem_seg_R_50_FPN_1x.yaml

主要なAPI:
  • get_cfg(): 設定オブジェクトの取得
  • DefaultPredictor: 推論用のシンプルなAPI
  • DefaultTrainer: 学習用のトレーナー
  • build_model(): カスタムモデルの構築

4.6 実践プロジェクト

プロジェクト：Semantic Segmentationパイプライン

ここでは、完全なセグメンテーションパイプラインを構築します。

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# 前述のU-Netモデルを使用
# （簡略のためクラス定義は省略、上記のUNetクラスを使用）

class SegmentationPipeline:
    """
    完全なセグメンテーションパイプライン
    """

    def __init__(self, model, device='cpu'):
        """
        Args:
            model: セグメンテーションモデル
            device: 使用デバイス
        """
        self.model = model.to(device)
        self.device = device
        self.train_losses = []
        self.val_losses = []

    def train_epoch(self, dataloader, criterion, optimizer):
        """
        1エポックの学習

        Args:
            dataloader: データローダー
            criterion: 損失関数
            optimizer: オプティマイザー

        Returns:
            平均損失
        """
        self.model.train()
        total_loss = 0.0

        for images, masks in dataloader:
            images = images.to(self.device)
            masks = masks.to(self.device)

            # Forward
            outputs = self.model(images)
            loss = criterion(outputs, masks)

            # Backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        return total_loss / len(dataloader)

    def validate(self, dataloader, criterion):
        """
        検証

        Args:
            dataloader: 検証データローダー
            criterion: 損失関数

        Returns:
            平均損失
        """
        self.model.eval()
        total_loss = 0.0

        with torch.no_grad():
            for images, masks in dataloader:
                images = images.to(self.device)
                masks = masks.to(self.device)

                outputs = self.model(images)
                loss = criterion(outputs, masks)

                total_loss += loss.item()

        return total_loss / len(dataloader)

    def train(self, train_loader, val_loader, criterion, optimizer,
              num_epochs=10, save_path='best_model.pth'):
        """
        完全な学習ループ

        Args:
            train_loader: 訓練データローダー
            val_loader: 検証データローダー
            criterion: 損失関数
            optimizer: オプティマイザー
            num_epochs: エポック数
            save_path: モデル保存パス
        """
        best_val_loss = float('inf')

        for epoch in range(num_epochs):
            # 学習
            train_loss = self.train_epoch(train_loader, criterion, optimizer)
            self.train_losses.append(train_loss)

            # 検証
            val_loss = self.validate(val_loader, criterion)
            self.val_losses.append(val_loss)

            print(f"Epoch [{epoch+1}/{num_epochs}] "
                  f"Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")

            # ベストモデルの保存
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                torch.save(self.model.state_dict(), save_path)
                print(f"  → モデル保存（Val Loss: {val_loss:.4f}）")

    def plot_training_history(self):
        """学習履歴のプロット"""
        plt.figure(figsize=(10, 6))
        plt.plot(self.train_losses, label='Train Loss', marker='o')
        plt.plot(self.val_losses, label='Val Loss', marker='s')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title('Training History')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()

    def predict(self, image):
        """
        予測

        Args:
            image: 入力画像 (C, H, W) or (B, C, H, W)

        Returns:
            予測マスク
        """
        self.model.eval()

        if len(image.shape) == 3:
            image = image.unsqueeze(0)

        image = image.to(self.device)

        with torch.no_grad():
            output = self.model(image)
            pred = torch.sigmoid(output)

        return pred.cpu()

# パイプラインのデモ
print("=== セグメンテーションパイプライン ===")

# データセットとデータローダー（前述のSimpleSegmentationDatasetを使用）
train_dataset = SimpleSegmentationDataset(num_samples=80)
val_dataset = SimpleSegmentationDataset(num_samples=20)

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=False)

# モデル、損失、オプティマイザー
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = UNet(n_channels=3, n_classes=1)
criterion = DiceLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# パイプラインの初期化と学習
pipeline = SegmentationPipeline(model, device=device)
pipeline.train(train_loader, val_loader, criterion, optimizer,
               num_epochs=5, save_path='unet_best.pth')

# 学習履歴のプロット
pipeline.plot_training_history()

print("\n=== パイプライン完了 ===")

出力：

=== セグメンテーションパイプライン ===
Epoch [1/5] Train Loss: 0.2856, Val Loss: 0.2134
  → モデル保存（Val Loss: 0.2134）
Epoch [2/5] Train Loss: 0.1923, Val Loss: 0.1678
  → モデル保存（Val Loss: 0.1678）
Epoch [3/5] Train Loss: 0.1456, Val Loss: 0.1345
  → モデル保存（Val Loss: 0.1345）
Epoch [4/5] Train Loss: 0.1189, Val Loss: 0.1123
  → モデル保存（Val Loss: 0.1123）
Epoch [5/5] Train Loss: 0.0987, Val Loss: 0.0945
  → モデル保存（Val Loss: 0.0945）

=== パイプライン完了 ===

Post-processing（後処理）

import cv2
import numpy as np
from scipy import ndimage

def post_process_mask(pred_mask, threshold=0.5, min_area=100):
    """
    予測マスクの後処理

    Args:
        pred_mask: 予測マスク (H, W) 値範囲 [0, 1]
        threshold: 二値化閾値
        min_area: 最小領域面積（これより小さい領域を除去）

    Returns:
        処理後のマスク
    """
    # 二値化
    binary_mask = (pred_mask > threshold).astype(np.uint8)

    # モルフォロジー処理（ノイズ除去）
    kernel = np.ones((3, 3), np.uint8)
    binary_mask = cv2.morphologyEx(binary_mask, cv2.MORPH_OPEN, kernel)
    binary_mask = cv2.morphologyEx(binary_mask, cv2.MORPH_CLOSE, kernel)

    # 小さな領域を除去
    labeled_mask, num_features = ndimage.label(binary_mask)

    for i in range(1, num_features + 1):
        region = (labeled_mask == i)
        if region.sum() < min_area:
            binary_mask[region] = 0

    return binary_mask

# 後処理のデモ
print("=== 後処理デモ ===")

# サンプル予測マスク（ノイズを含む）
np.random.seed(42)
H, W = 256, 256
pred_mask = np.random.rand(H, W) * 0.3  # ノイズ

# 真の領域を追加
y, x = np.ogrid[:H, :W]
circle1 = ((x - 80)**2 + (y - 80)**2) <= 30**2
circle2 = ((x - 180)**2 + (y - 180)**2) <= 25**2
pred_mask[circle1] = 0.9
pred_mask[circle2] = 0.85

# ランダムノイズを追加
noise_points = np.random.rand(H, W) > 0.98
pred_mask[noise_points] = 0.7

# 後処理
processed_mask = post_process_mask(pred_mask, threshold=0.5, min_area=50)

# 可視化
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].imshow(pred_mask, cmap='viridis')
axes[0].set_title('予測マスク（生）', fontsize=14)
axes[0].axis('off')

axes[1].imshow(pred_mask > 0.5, cmap='gray')
axes[1].set_title('二値化のみ', fontsize=14)
axes[1].axis('off')

axes[2].imshow(processed_mask, cmap='gray')
axes[2].set_title('後処理後', fontsize=14)
axes[2].axis('off')

plt.tight_layout()
plt.show()

print(f"後処理前の領域数: {ndimage.label(pred_mask > 0.5)[1]}")
print(f"後処理後の領域数: {ndimage.label(processed_mask)[1]}")

出力：

=== 後処理デモ ===
後処理前の領域数: 15
後処理後の領域数: 2

重要: 後処理により、ノイズを除去し、セグメンテーション結果の品質を向上できます。

4.7 本章のまとめ

学んだこと

セグメンテーションの種類
- Semantic Segmentation: ピクセル分類
- Instance Segmentation: インスタンス分離
- Panoptic Segmentation: 統合的理解
- 評価指標: IoU、Dice係数、mIoU
U-Net
- Encoder-Decoder構造
- Skip Connectionsで高解像度情報を保持
- 医療画像セグメンテーションで高精度
- 少量データでも効果的
高度なアーキテクチャ
- DeepLab: Atrous ConvolutionとASPP
- PSPNet: Pyramid Pooling Module
- HRNet: 高解像度表現の維持
- SegFormer: Transformerベースの効率的モデル
Instance Segmentation
- Mask R-CNN: RoI AlignとMask Branch
- YOLACT: リアルタイム処理
- SOLOv2: 位置ベース分離
- Detectron2: 強力なフレームワーク
実践パイプライン
- データ準備と前処理
- 学習と検証のループ
- 後処理によるノイズ除去
- モデルの保存と再利用

セグメンテーション手法の選択ガイド

タスク	推奨手法	理由
医療画像	U-Net	少量データで高精度
自動運転（シーン理解）	DeepLab、PSPNet	マルチスケール処理
インスタンス分離	Mask R-CNN	高精度、柔軟性
リアルタイム処理	YOLACT	高速
境界の精度重視	HRNet	高解像度維持
効率性重視	SegFormer	パラメータ効率

次の章へ

第5章では、物体追跡（Object Tracking）を学びます：

Single Object Tracking (SOT)
Multiple Object Tracking (MOT)
DeepSORT、FairMOT
リアルタイム追跡システム

演習問題

問題1（難易度：easy）

Semantic SegmentationとInstance Segmentationの違いを説明し、それぞれの用途例を挙げてください。

解答例

解答：

Semantic Segmentation：

説明: 各ピクセルをクラスに分類するが、同じクラスの異なるインスタンスは区別しない
出力: クラスラベルマップ（各ピクセルにクラスIDを割り当て）
用途例:
- 自動運転: 道路、歩道、建物などのシーン理解
- 衛星画像: 森林、都市、水域の分類
- 医療画像: 臓器や病変の領域特定

Instance Segmentation：

説明: 同じクラスでも異なる物体インスタンスを区別する
出力: インスタンスごとの個別マスク
用途例:
- ロボット工学: 個別の物体を認識して把持
- 細胞カウント: 顕微鏡画像で個々の細胞を分離
- 画像編集: 特定の人物や物体のみを抽出

主な違い：

項目	Semantic Segmentation	Instance Segmentation
インスタンス区別	なし	あり
出力	クラスマップ	個別マスク
複雑度	低	高

問題2（難易度：medium）

IoUとDice係数を計算する関数を実装し、以下の2つのマスクに対して両方の指標を計算してください。

import numpy as np

# マスク1（正解）
mask1 = np.array([
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0]
])

# マスク2（予測）
mask2 = np.array([
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0]
])

解答例

import numpy as np

def calculate_iou(mask1, mask2):
    """IoU計算"""
    intersection = np.logical_and(mask1, mask2).sum()
    union = np.logical_or(mask1, mask2).sum()

    if union == 0:
        return 0.0

    iou = intersection / union
    return iou

def calculate_dice(mask1, mask2):
    """Dice係数計算"""
    intersection = np.logical_and(mask1, mask2).sum()

    dice = (2.0 * intersection) / (mask1.sum() + mask2.sum())
    return dice

# マスク1（正解）
mask1 = np.array([
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0]
])

# マスク2（予測）
mask2 = np.array([
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0]
])

# 計算
iou = calculate_iou(mask1, mask2)
dice = calculate_dice(mask1, mask2)

print("=== 評価指標の計算 ===")
print(f"IoU: {iou:.4f}")
print(f"Dice係数: {dice:.4f}")

# 詳細分析
intersection = np.logical_and(mask1, mask2).sum()
union = np.logical_or(mask1, mask2).sum()

print(f"\n詳細:")
print(f"  Intersection（重なり）: {intersection} ピクセル")
print(f"  Union（和集合）: {union} ピクセル")
print(f"  Mask1の面積: {mask1.sum()} ピクセル")
print(f"  Mask2の面積: {mask2.sum()} ピクセル")

出力：

=== 評価指標の計算 ===
IoU: 0.6111
Dice係数: 0.7586

詳細:
  Intersection（重なり）: 11 ピクセル
  Union（和集合）: 18 ピクセル
  Mask1の面積: 13 ピクセル
  Mask2の面積: 14 ピクセル

問題3（難易度：medium）

U-NetにおけるSkip Connectionsの役割を説明し、それがなかった場合にどのような問題が発生するか述べてください。

解答例

解答：

Skip Connectionsの役割：

高解像度情報の保持
- Encoderの浅い層からDecoderの対応する層へ特徴を直接伝達
- ダウンサンプリングで失われる詳細情報を補完
勾配の流れの改善
- 深いネットワークでの勾配消失問題を緩和
- 学習の安定化と高速化
位置情報の精度向上
- 元画像の空間的位置情報を保持
- 正確な境界検出を可能にする
マルチスケール特徴の統合
- 異なる抽象度の特徴を組み合わせる
- 大きな物体と小さな物体の両方を捉える

Skip Connectionsがない場合の問題：

問題	説明
境界の曖昧化	物体の輪郭が不明瞭になる
小さな構造の消失	細かい詳細が再現されない
位置精度の低下	予測位置がずれる
学習の困難	深いネットワークでの勾配消失

実験的検証：

# Skip Connectionsありとなしの比較（概念）

# あり: U-Net標準
# → 鮮明な境界、詳細な構造の保持

# なし: 単純なEncoder-Decoder
# → ぼやけた境界、詳細の損失

問題4（難易度：hard）

Mask R-CNNの3つの出力ブランチ（分類、BBox、マスク）それぞれの損失関数を説明し、全体の損失関数がどのように定義されるか述べてください。

解答例

解答：

Mask R-CNNは、Multi-task Learningの枠組みで3つのタスクを同時に学習します。

1. 分類ブランチ（Classification）：

目的: RoI（Region of Interest）のクラス分類
損失関数: Cross Entropy Loss

$$ L_{\text{cls}} = -\log p_{\text{true\_class}} $$

2. バウンディングボックスブランチ（BBox Regression）：

目的: BBoxの位置とサイズの回帰
損失関数: Smooth L1 Loss

$$ L_{\text{box}} = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L1}(t_i - t_i^*) $$

where: $$ \text{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} $$

3. マスクブランチ（Mask Prediction）：

目的: 各ピクセルの二値マスク予測
損失関数: Binary Cross Entropy Loss（ピクセル単位）

$$ L_{\text{mask}} = -\frac{1}{m^2} \sum_{i,j} [y_{ij} \log \hat{y}_{ij} + (1-y_{ij}) \log(1-\hat{y}_{ij})] $$

where $m \times m$ はマスクの解像度

全体の損失関数：

$$ L = L_{\text{cls}} + L_{\text{box}} + L_{\text{mask}} $$

実装例：

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskRCNNLoss:
    """Mask R-CNNの損失関数"""

    def __init__(self):
        self.cls_loss_fn = nn.CrossEntropyLoss()
        self.mask_loss_fn = nn.BCEWithLogitsLoss()

    def smooth_l1_loss(self, pred, target, beta=1.0):
        """Smooth L1 Loss"""
        diff = torch.abs(pred - target)
        loss = torch.where(
            diff < beta,
            0.5 * diff ** 2 / beta,
            diff - 0.5 * beta
        )
        return loss.mean()

    def __call__(self, cls_pred, bbox_pred, mask_pred,
                 cls_target, bbox_target, mask_target):
        """
        全体の損失計算

        Args:
            cls_pred: クラス予測 (N, num_classes)
            bbox_pred: BBox予測 (N, 4)
            mask_pred: マスク予測 (N, num_classes, H, W)
            cls_target: クラス正解 (N,)
            bbox_target: BBox正解 (N, 4)
            mask_target: マスク正解 (N, H, W)

        Returns:
            total_loss, loss_dict
        """
        # 分類損失
        loss_cls = self.cls_loss_fn(cls_pred, cls_target)

        # BBox損失
        loss_bbox = self.smooth_l1_loss(bbox_pred, bbox_target)

        # マスク損失（該当クラスのマスクのみ）
        # 実際にはクラスごとにマスクを予測し、正解クラスのマスクのみ使用
        loss_mask = self.mask_loss_fn(mask_pred, mask_target)

        # 全体損失
        total_loss = loss_cls + loss_bbox + loss_mask

        loss_dict = {
            'loss_cls': loss_cls.item(),
            'loss_bbox': loss_bbox.item(),
            'loss_mask': loss_mask.item(),
            'total_loss': total_loss.item()
        }

        return total_loss, loss_dict

# 使用例（ダミーデータ）
loss_fn = MaskRCNNLoss()

# ダミー予測と正解
cls_pred = torch.randn(10, 80)  # 10 RoIs, 80 classes
bbox_pred = torch.randn(10, 4)
mask_pred = torch.randn(10, 1, 28, 28)
cls_target = torch.randint(0, 80, (10,))
bbox_target = torch.randn(10, 4)
mask_target = torch.randint(0, 2, (10, 1, 28, 28)).float()

total_loss, loss_dict = loss_fn(
    cls_pred, bbox_pred, mask_pred,
    cls_target, bbox_target, mask_target
)

print("=== Mask R-CNN損失 ===")
for key, value in loss_dict.items():
    print(f"{key}: {value:.4f}")

出力：

=== Mask R-CNN損失 ===
loss_cls: 4.3821
loss_bbox: 0.8234
loss_mask: 0.6931
total_loss: 5.8986

重要なポイント：

3つの損失は単純に加算される（重み付けも可能）
マスク損失は該当クラスのマスクのみ計算される
学習時は全ての損失を同時に最小化

問題5（難易度：hard）

医療画像セグメンテーションで、クラス不均衡（前景ピクセルが全体の5%以下）が発生している場合、どのような損失関数や学習戦略を用いるべきか提案してください。

解答例

解答：

医療画像セグメンテーションでは、病変領域が非常に小さいことが多く、クラス不均衡が深刻な問題です。以下の戦略を組み合わせることが効果的です。

1. 損失関数の改善：

（a）Focal Loss

簡単な例（背景）の損失を抑制し、難しい例（境界）に焦点を当てます。

$$ \text{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t) $$

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, pred, target):
        pred = torch.sigmoid(pred)

        # Focal weight
        pt = torch.where(target == 1, pred, 1 - pred)
        focal_weight = (1 - pt) ** self.gamma

        # BCE with focal weight
        bce = F.binary_cross_entropy(pred, target, reduction='none')
        focal_loss = self.alpha * focal_weight * bce

        return focal_loss.mean()

（b）Tversky Loss

FalsePositiveとFalseNegativeのバランスを調整できます。

$$ \text{TL} = 1 - \frac{TP}{TP + \alpha FP + \beta FN} $$

class TverskyLoss(nn.Module):
    def __init__(self, alpha=0.3, beta=0.7, smooth=1.0):
        super().__init__()
        self.alpha = alpha  # FPの重み
        self.beta = beta    # FNの重み
        self.smooth = smooth

    def forward(self, pred, target):
        pred = torch.sigmoid(pred)

        TP = (pred * target).sum()
        FP = (pred * (1 - target)).sum()
        FN = ((1 - pred) * target).sum()

        tversky = (TP + self.smooth) / (
            TP + self.alpha * FP + self.beta * FN + self.smooth
        )

        return 1 - tversky

（c）Dice Loss + BCE の組み合わせ

class CombinedLoss(nn.Module):
    def __init__(self, dice_weight=0.5, bce_weight=0.5):
        super().__init__()
        self.dice_loss = DiceLoss()
        self.bce_loss = nn.BCEWithLogitsLoss()
        self.dice_weight = dice_weight
        self.bce_weight = bce_weight

    def forward(self, pred, target):
        dice = self.dice_loss(pred, target)
        bce = self.bce_loss(pred, target)

        return self.dice_weight * dice + self.bce_weight * bce

2. データ拡張戦略：

import albumentations as A

# 医療画像向けの強力な拡張
transform = A.Compose([
    A.RandomRotate90(p=0.5),
    A.Flip(p=0.5),
    A.ShiftScaleRotate(
        shift_limit=0.1,
        scale_limit=0.2,
        rotate_limit=45,
        p=0.5
    ),
    A.ElasticTransform(p=0.3),  # 医療画像で有効
    A.GridDistortion(p=0.3),
    A.RandomBrightnessContrast(p=0.5),
])

3. サンプリング戦略：

class BalancedSampler:
    """病変領域を含むパッチを優先的にサンプリング"""

    def __init__(self, image, mask, patch_size=256,
                 positive_ratio=0.7):
        self.image = image
        self.mask = mask
        self.patch_size = patch_size
        self.positive_ratio = positive_ratio

    def sample_patch(self):
        H, W = self.image.shape[:2]

        if np.random.rand() < self.positive_ratio:
            # 病変領域を含むパッチ
            positive_coords = np.argwhere(self.mask > 0)
            if len(positive_coords) > 0:
                center = positive_coords[
                    np.random.randint(len(positive_coords))
                ]
                y, x = center
            else:
                y = np.random.randint(0, H)
                x = np.random.randint(0, W)
        else:
            # ランダムパッチ
            y = np.random.randint(0, H)
            x = np.random.randint(0, W)

        # パッチ抽出（境界処理含む）
        # ... (実装省略)

        return patch_image, patch_mask

4. 後処理：

def post_process_with_threshold_optimization(pred_mask, true_mask):
    """最適閾値の探索"""
    best_threshold = 0.5
    best_dice = 0.0

    for threshold in np.arange(0.1, 0.9, 0.05):
        binary_pred = (pred_mask > threshold).astype(int)
        dice = calculate_dice(binary_pred, true_mask)

        if dice > best_dice:
            best_dice = dice
            best_threshold = threshold

    return best_threshold, best_dice

推奨される統合戦略：

手法	優先度	理由
Dice + BCE Loss	高	不均衡に頑健
Focal Loss	高	難しい例に焦点
Tversky Loss	中	FP/FN調整可
病変中心サンプリング	高	学習効率向上
強力なデータ拡張	高	汎化性能向上
閾値最適化	中	推論時の性能向上

参考文献

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI.
Chen, L. C., et al. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV.
He, K., et al. (2017). Mask R-CNN. ICCV.
Zhao, H., et al. (2017). Pyramid Scene Parsing Network. CVPR.
Wang, J., et al. (2020). Deep High-Resolution Representation Learning for Visual Recognition. TPAMI.
Xie, E., et al. (2021). SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. NeurIPS.
Bolya, D., et al. (2019). YOLACT: Real-time Instance Segmentation. ICCV.
Wu, Y., et al. (2019). Detectron2. Facebook AI Research.

学習目標

4.1 セグメンテーションの種類

セグメンテーションとは

1. Semantic Segmentation（意味的セグメンテーション）

2. Instance Segmentation（インスタンスセグメンテーション）

3. Panoptic Segmentation（全景セグメンテーション）

評価指標

1. IoU (Intersection over Union)

2. Dice Coefficient（F1-Score）

3. Mean IoU (mIoU)

4.2 U-Net Architecture

U-Netの概要

U-Netの特徴

U-Netの構造

完全なU-Net実装

医療画像セグメンテーションへの応用

4.3 Advanced Architectures

1. DeepLab (v3/v3+)

主要技術

2. PSPNet (Pyramid Scene Parsing Network)

主要技術

3. HRNet (High-Resolution Network)

主要技術

4. Transformer-based Segmentation (SegFormer)

4.4 Instance Segmentation

Mask R-CNN

アーキテクチャ

主要技術

その他のInstance Segmentation手法

1. YOLACT (You Only Look At CoefficienTs)

2. SOLOv2 (Segmenting Objects by Locations)

4.5 Detectron2フレームワーク

Detectron2とは

主要な特徴

4.6 実践プロジェクト

プロジェクト：Semantic Segmentationパイプライン

Post-processing（後処理）

4.7 本章のまとめ

学んだこと

セグメンテーション手法の選択ガイド

次の章へ

演習問題

問題1（難易度：easy）

問題2（難易度：medium）

問題3（難易度：medium）

問題4（難易度：hard）

問題5（難易度：hard）

（a）Focal Loss

（b）Tversky Loss

（c）Dice Loss + BCE の組み合わせ

参考文献

免責事項