
Chapter 4: Transfer Learning

Efficient learning and domain adaptation techniques using pre-trained models

📖 Reading time: 30 minutes 📊 Difficulty: Intermediate to Advanced 💻 Code examples: 8 📝 Exercises: 5

This chapter covers transfer learning. You will learn the fundamental concepts of transfer learning, how to design and optimize fine-tuning strategies, and how to build practical transfer learning projects.

Learning Objectives

By reading this chapter, you will be able to:

- Explain the fundamentals of transfer learning and evaluate transferability between source and target domains
- Choose an appropriate fine-tuning strategy (layer freezing, discriminative learning rates) for a given amount of data
- Apply domain adaptation methods such as DANN and MMD to mitigate distribution shift
- Compress models with knowledge distillation and combine distillation with meta-learning
- Build an end-to-end transfer learning pipeline from a pre-trained ImageNet model to a new domain

1. Fundamentals of Transfer Learning

1.1 What is Transfer Learning?

Transfer learning is a machine learning approach that leverages knowledge learned from one task (source task) for another task (target task). While meta-learning aims for "learning to learn", transfer learning aims for "knowledge reuse".
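A common way to make this precise (following the standard formulation in Pan and Yang's transfer learning survey) is to define a domain as D = {X, P(X)}, a feature space plus a marginal distribution, and a task as T = {Y, f(·)}, a label space plus a predictive function. Transfer learning then uses knowledge from a source pair (D_S, T_S) to improve learning of the target predictor f_T on (D_T, T_T), where D_S ≠ D_T or T_S ≠ T_T.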

graph LR
    A[Source Domain<br>ImageNet] -->|Pre-training| B[Pre-trained Model]
    B -->|Transfer| C[Target Domain<br>Medical Images]
    C -->|Fine-tuning| D[Specialized Model]
    style A fill:#e3f2fd
    style D fill:#c8e6c9

1.2 Types of Transfer Learning

| Transfer Type | Description | Example |
|---------------|-------------|---------|
| Domain Transfer | Transfer between different data distributions | Natural images → Medical images |
| Task Transfer | Transfer between different tasks | Classification → Object detection |
| Model Transfer | Reuse of model architecture | ResNet → Custom architecture |

1.3 Evaluating Transferability

The success of transfer learning depends on the relationship between the source and target tasks:

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0

import torch
import torch.nn as nn
from torchvision import models
from scipy.stats import spearmanr

def compute_transferability_score(source_features, target_features):
    """
    Compute transferability score

    Args:
        source_features: Feature representations from source domain
        target_features: Feature representations from target domain

    Returns:
        transferability_score: Transferability score
    """
    # Rank correlation between the flattened feature matrices
    correlation, _ = spearmanr(
        source_features.flatten().cpu().numpy(),
        target_features.flatten().cpu().numpy()
    )

    # Distance between feature distributions (linear-kernel MMD)
    def compute_mmd(x, y):
        xx = torch.mm(x, x.t())
        yy = torch.mm(y, y.t())
        xy = torch.mm(x, y.t())
        return xx.mean() + yy.mean() - 2 * xy.mean()

    mmd_distance = compute_mmd(
        source_features.float(),
        target_features.float()
    )

    # Overall score (higher is better for transfer)
    transferability_score = correlation - 0.1 * mmd_distance.item()

    return transferability_score

# Usage example
source_feats = torch.randn(100, 512)
target_feats = torch.randn(100, 512)
score = compute_transferability_score(source_feats, target_feats)
print(f"Transferability Score: {score:.4f}")

2. Fine-Tuning Strategies

2.1 Full Layer vs Partial Layer Updates

The decision of which layers of a pre-trained model to update depends on the amount of target data and on how similar the target task is to the source task. As a rough guideline: with little data and a similar task, freeze the backbone and train only the new head; with little data and a dissimilar task, fine-tune from an intermediate layer onward; with plenty of data, fine-tuning most or all layers is usually safe:

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0

import torch
import torch.nn as nn
from torchvision import models

class TransferLearningModel(nn.Module):
    def __init__(self, num_classes, freeze_strategy='partial'):
        """
        Transfer learning model

        Args:
            num_classes: Number of classes for target task
            freeze_strategy: 'all', 'partial', 'none'
        """
        super().__init__()
        # Load ResNet50 pre-trained on ImageNet
        # (the `weights` argument replaces the deprecated `pretrained=True`)
        self.backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

        # Apply freezing strategy
        if freeze_strategy == 'all':
            # Freeze all layers (except classification layer)
            for param in self.backbone.parameters():
                param.requires_grad = False

        elif freeze_strategy == 'partial':
            # Freeze only early layers (use as feature extractor)
            for name, param in self.backbone.named_parameters():
                if 'layer4' not in name and 'fc' not in name:
                    param.requires_grad = False

        # Replace classification layer
        num_features = self.backbone.fc.in_features
        self.backbone.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(num_features, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        return self.backbone(x)

# Create models with different strategies
model_frozen = TransferLearningModel(num_classes=10, freeze_strategy='all')
model_partial = TransferLearningModel(num_classes=10, freeze_strategy='partial')
model_full = TransferLearningModel(num_classes=10, freeze_strategy='none')

# Check parameter counts
for name, model in [('Frozen', model_frozen), ('Partial', model_partial), ('Full', model_full)]:
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"{name}: {trainable:,} / {total:,} parameters trainable")

2.2 Discriminative Fine-Tuning

More effective fine-tuning is possible by setting different learning rates for each layer:
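A common heuristic, popularized by ULMFiT and mirrored by the multiplier of 2.6 in the code below, decays the learning rate geometrically from the last layer group toward earlier ones:

η_{l-1} = η_l / 2.6, so layer group l receives η_L / 2.6^(L − l) when the final group L uses the base rate η_L.

Earlier layers encode more generic features and are therefore updated more conservatively than the task-specific head.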

def get_discriminative_params(model, base_lr=1e-4, multiplier=2.6):
    """
    Set different learning rates per layer

    Args:
        model: Neural network model
        base_lr: Learning rate for the final layer
        multiplier: Learning rate multiplier between layers

    Returns:
        param_groups: Layer-wise parameter groups
    """
    param_groups = []

    # Layer groups for ResNet
    layer_groups = [
        ('layer1', model.backbone.layer1),
        ('layer2', model.backbone.layer2),
        ('layer3', model.backbone.layer3),
        ('layer4', model.backbone.layer4),
        ('fc', model.backbone.fc)
    ]

    # Set learning rate per layer (deeper layers get higher learning rates)
    for i, (name, layer) in enumerate(layer_groups):
        lr = base_lr / (multiplier ** (len(layer_groups) - i - 1))
        param_groups.append({
            'params': layer.parameters(),
            'lr': lr,
            'name': name
        })
        print(f"{name}: lr = {lr:.2e}")

    return param_groups

# Apply Discriminative Fine-Tuning
model = TransferLearningModel(num_classes=10, freeze_strategy='none')
param_groups = get_discriminative_params(model, base_lr=1e-3)
optimizer = torch.optim.Adam(param_groups)

# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2
)

3. Domain Adaptation

3.1 The Domain Shift Problem

When the distributions of the source and target domains differ, model performance degrades. Domain Adaptation is a technique that mitigates this distribution shift.

graph TB
    A["Source Domain P_s(X, Y)"] -->|Training Data| B[Model]
    C["Target Domain P_t(X, Y)"] -.->|No Labels| B
    B -->|Adapt| D[Domain-Invariant Features]
    D -->|Predict| E[Target Predictions]
    style A fill:#e3f2fd
    style C fill:#fff3e0
    style D fill:#c8e6c9
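To see the effect of domain shift concretely, here is a minimal, self-contained sketch (a toy illustration with synthetic Gaussian data, not part of the chapter's project code): a linear classifier is trained on a source distribution and then evaluated on a shifted target distribution. The accuracy gap is exactly the degradation that domain adaptation methods try to close.

# Toy illustration of domain shift (synthetic data; names and values are illustrative)
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_domain(n, shift):
    """Two Gaussian classes; `shift` translates the whole domain (covariate shift)."""
    x0 = torch.randn(n, 2) + torch.tensor([-2.0, 0.0]) + shift
    x1 = torch.randn(n, 2) + torch.tensor([2.0, 0.0]) + shift
    x = torch.cat([x0, x1])
    y = torch.cat([torch.zeros(n, dtype=torch.long), torch.ones(n, dtype=torch.long)])
    return x, y

source_x, source_y = make_domain(500, shift=torch.tensor([0.0, 0.0]))
target_x, target_y = make_domain(500, shift=torch.tensor([3.0, 3.0]))  # shifted distribution

# Train a linear classifier on the source domain only
clf = nn.Linear(2, 2)
optimizer = torch.optim.Adam(clf.parameters(), lr=0.1)
for _ in range(200):
    loss = F.cross_entropy(clf(source_x), source_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def accuracy(x, y):
    return (clf(x).argmax(dim=1) == y).float().mean().item()

print(f"Source accuracy: {accuracy(source_x, source_y):.2%}")
print(f"Target accuracy: {accuracy(target_x, target_y):.2%}")  # typically much lower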

3.2 Domain Adversarial Neural Networks (DANN)

DANN uses adversarial training to ensure that the feature extractor learns domain-invariant representations:

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0

import numpy as np  # used for the alpha schedule in train_dann
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradientReversalLayer(torch.autograd.Function):
    """Gradient Reversal Layer (for DANN)"""
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None

class DANN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        # Feature extractor (domain-invariant)
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(3, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(128 * 5 * 5, 1024), nn.ReLU(),
            nn.Dropout(0.5)
        )

        # Class classifier
        self.class_classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )

        # Domain classifier
        self.domain_classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 2)  # Source vs Target
        )

    def forward(self, x, alpha=1.0):
        features = self.feature_extractor(x)

        # Class prediction
        class_output = self.class_classifier(features)

        # Domain prediction (gradient reversal)
        reversed_features = GradientReversalLayer.apply(features, alpha)
        domain_output = self.domain_classifier(reversed_features)

        return class_output, domain_output

# DANN training
def train_dann(model, source_loader, target_loader, epochs=50):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(epochs):
        model.train()
        # Gradually increase alpha value (gradient reversal strength)
        alpha = 2 / (1 + np.exp(-10 * epoch / epochs)) - 1

        for (source_data, source_labels), (target_data, _) in zip(source_loader, target_loader):
            # Source domain loss
            source_class, source_domain = model(source_data, alpha)
            class_loss = F.cross_entropy(source_class, source_labels)
            source_domain_loss = F.cross_entropy(
                source_domain,
                torch.zeros(len(source_data), dtype=torch.long)
            )

            # Target domain loss
            _, target_domain = model(target_data, alpha)
            target_domain_loss = F.cross_entropy(
                target_domain,
                torch.ones(len(target_data), dtype=torch.long)
            )

            # Total loss
            loss = class_loss + source_domain_loss + target_domain_loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}: Loss = {loss.item():.4f}, Alpha = {alpha:.4f}")

# Usage example
model = DANN(num_classes=10)
print("DANN model created with gradient reversal layer")

3.3 Maximum Mean Discrepancy (MMD)

MMD achieves domain adaptation by minimizing the distance between source and target distributions:
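For source samples {x_i}, target samples {y_j}, and a kernel k, the code below uses the standard biased empirical estimator of the squared MMD:

MMD²(X, Y) = mean_{i,i'} k(x_i, x_{i'}) + mean_{j,j'} k(y_j, y_{j'}) − 2 · mean_{i,j} k(x_i, y_j)

which corresponds exactly to xx.mean() + yy.mean() − 2 · xy.mean() in the implementation.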

def compute_mmd_loss(source_features, target_features, kernel='rbf'):
    """
    Compute Maximum Mean Discrepancy loss

    Args:
        source_features: Source features (N_s, D)
        target_features: Target features (N_t, D)
        kernel: Kernel type ('rbf', 'linear')

    Returns:
        mmd_loss: MMD loss
    """
    def gaussian_kernel(x, y, sigma=1.0):
        x_size = x.size(0)
        y_size = y.size(0)
        dim = x.size(1)

        x = x.unsqueeze(1)  # (N_s, 1, D)
        y = y.unsqueeze(0)  # (1, N_t, D)

        diff = x - y  # (N_s, N_t, D)
        dist_sq = torch.sum(diff ** 2, dim=2)  # (N_s, N_t)

        return torch.exp(-dist_sq / (2 * sigma ** 2))

    if kernel == 'rbf':
        # RBF kernel
        xx = gaussian_kernel(source_features, source_features).mean()
        yy = gaussian_kernel(target_features, target_features).mean()
        xy = gaussian_kernel(source_features, target_features).mean()
    else:
        # Linear kernel
        xx = torch.mm(source_features, source_features.t()).mean()
        yy = torch.mm(target_features, target_features.t()).mean()
        xy = torch.mm(source_features, target_features.t()).mean()

    mmd_loss = xx + yy - 2 * xy
    return mmd_loss

class MMDNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Linear(784, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5)
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        features = self.feature_extractor(x)
        output = self.classifier(features)
        return output, features

# Training with MMD
def train_with_mmd(model, source_loader, target_loader, lambda_mmd=0.1):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for (source_data, source_labels), (target_data, _) in zip(source_loader, target_loader):
        # Get features and predictions
        source_pred, source_feat = model(source_data)
        _, target_feat = model(target_data)

        # Classification loss
        class_loss = F.cross_entropy(source_pred, source_labels)

        # MMD loss
        mmd_loss = compute_mmd_loss(source_feat, target_feat)

        # Total loss
        total_loss = class_loss + lambda_mmd * mmd_loss

        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

    return class_loss.item(), mmd_loss.item()

4. Knowledge Distillation

4.1 Fundamentals of Teacher-Student Learning

Knowledge distillation is a method that transfers knowledge from a large Teacher model to a small Student model. When combined with meta-learning, it enables efficient few-shot learning.

graph LR
    A[Large Teacher Model<br>High Accuracy] -->|Soft Targets| B[Knowledge Distillation]
    C[Training Data] --> B
    B -->|Transfer| D[Small Student Model<br>Fast Inference]
    style A fill:#e3f2fd
    style D fill:#c8e6c9

4.2 Implementation of Temperature-based Distillation
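The loss implemented below follows the standard temperature-scaled (Hinton-style) formulation: both teacher and student logits are softened with a temperature T, the KL term is scaled by T² to keep gradient magnitudes comparable across temperatures, and it is mixed with the ordinary cross-entropy on the ground-truth labels:

L = α · T² · KL( softmax(z_teacher / T) ‖ softmax(z_student / T) ) + (1 − α) · CE(z_student, y)

For intuition: with logits [4, 1, 0], the ordinary softmax gives roughly [0.94, 0.05, 0.02], while T = 3 softens this to roughly [0.61, 0.23, 0.16], exposing the relative similarity between the non-target classes.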

# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0

import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, temperature=3.0, alpha=0.7):
        """
        Knowledge distillation loss function

        Args:
            temperature: Temperature parameter for softmax
            alpha: Balance between hard loss and soft loss
        """
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha
        self.ce_loss = nn.CrossEntropyLoss()

    def forward(self, student_logits, teacher_logits, targets):
        # Soft target loss (knowledge distillation)
        soft_targets = F.softmax(teacher_logits / self.temperature, dim=1)
        soft_student = F.log_softmax(student_logits / self.temperature, dim=1)
        distillation_loss = F.kl_div(
            soft_student, soft_targets, reduction='batchmean'
        ) * (self.temperature ** 2)

        # Hard target loss (normal classification)
        student_loss = self.ce_loss(student_logits, targets)

        # Total loss
        total_loss = (
            self.alpha * distillation_loss +
            (1 - self.alpha) * student_loss
        )

        return total_loss, distillation_loss, student_loss

# Define Teacher and Student models
class TeacherModel(nn.Module):
    """Large Teacher model"""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, 512), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        return self.features(x)

class StudentModel(nn.Module):
    """Lightweight Student model"""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        return self.features(x)

# Knowledge distillation training
def train_distillation(teacher, student, train_loader, epochs=50):
    # Teacher in evaluation mode
    teacher.eval()
    student.train()

    criterion = DistillationLoss(temperature=3.0, alpha=0.7)
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

    for epoch in range(epochs):
        total_loss = 0
        for images, labels in train_loader:
            # Teacher prediction (no gradients)
            with torch.no_grad():
                teacher_logits = teacher(images)

            # Student prediction
            student_logits = student(images)

            # Distillation loss
            loss, dist_loss, student_loss = criterion(
                student_logits, teacher_logits, labels
            )

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}: Loss = {total_loss/len(train_loader):.4f}")

# Compare model sizes
teacher = TeacherModel()
student = StudentModel()
teacher_params = sum(p.numel() for p in teacher.parameters())
student_params = sum(p.numel() for p in student.parameters())
print(f"Teacher: {teacher_params:,} parameters")
print(f"Student: {student_params:,} parameters")
print(f"Compression ratio: {teacher_params/student_params:.2f}x")

4.3 Combining with Meta-Learning

class MetaDistillation(nn.Module):
    def __init__(self, teacher_model, student_model, inner_lr=0.01):
        """
        Combination of meta-learning and knowledge distillation

        Args:
            teacher_model: Large Teacher model
            student_model: Lightweight Student model
            inner_lr: Inner loop learning rate
        """
        super().__init__()
        self.teacher = teacher_model
        self.student = student_model
        self.inner_lr = inner_lr
        self.temperature = 3.0

    def inner_loop(self, support_x, support_y, steps=5):
        """
        Inner loop: adapt the Student model to a task via distillation
        """
        # Start from copies of the Student's current parameters
        param_names = [name for name, _ in self.student.named_parameters()]
        adapted_params = [p.clone() for p in self.student.parameters()]

        for _ in range(steps):
            # Teacher prediction (soft targets, no gradients)
            with torch.no_grad():
                teacher_logits = self.teacher(support_x)

            # Student prediction using the current adapted parameters
            adapted_state = dict(zip(param_names, adapted_params))
            student_logits = torch.func.functional_call(
                self.student, adapted_state, (support_x,)
            )

            # Distillation loss on the support set
            soft_targets = F.softmax(teacher_logits / self.temperature, dim=1)
            soft_student = F.log_softmax(student_logits / self.temperature, dim=1)
            loss = F.kl_div(soft_student, soft_targets, reduction='batchmean')

            # Gradient step on the adapted parameters
            # (create_graph=True keeps the graph for meta-gradients in the outer loop)
            grads = torch.autograd.grad(loss, adapted_params, create_graph=True)
            adapted_params = [
                p - self.inner_lr * g
                for p, g in zip(adapted_params, grads)
            ]

        return adapted_params

    def forward(self, support_x, support_y, query_x, query_y):
        """
        Forward pass for meta-distillation
        """
        # Adapt in inner loop
        adapted_params = self.inner_loop(support_x, support_y)

        # Evaluate on the query set with the adapted parameters
        param_names = [name for name, _ in self.student.named_parameters()]
        adapted_state = dict(zip(param_names, adapted_params))
        query_logits = torch.func.functional_call(self.student, adapted_state, (query_x,))
        loss = F.cross_entropy(query_logits, query_y)

        return loss

# Usage example
teacher = TeacherModel(num_classes=5)
student = StudentModel(num_classes=5)
meta_distill = MetaDistillation(teacher, student, inner_lr=0.01)
print("Meta-Distillation model initialized")

5. Practice: Transfer Learning Project

Project: Transfer Learning from ImageNet to Medical Image Classification

Goal: Build a high-accuracy diagnostic model with a small amount of medical image data using a model pre-trained on ImageNet.

5.1 Complete Transfer Learning Pipeline

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0

import torch
import torch.nn as nn
from torchvision import models, transforms
from torch.utils.data import DataLoader, Dataset
import numpy as np

class MedicalImageDataset(Dataset):
    """Medical image dataset"""
    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        label = self.labels[idx]

        if self.transform:
            image = self.transform(image)

        return image, label

class TransferLearningPipeline:
    def __init__(self, num_classes, device='cuda'):
        self.device = device
        self.num_classes = num_classes

        # Data augmentation (for Domain Adaptation)
        self.train_transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.RandomRotation(15),
            transforms.ColorJitter(brightness=0.2, contrast=0.2),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])

        self.val_transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])

        # Load pre-trained model
        self.model = self._create_model()

    def _create_model(self):
        """Create and customize pre-trained model"""
        # ResNet50 pre-trained on ImageNet (multi-weight API; `pretrained=True` is deprecated)
        model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

        # Freeze early layers
        for name, param in model.named_parameters():
            if 'layer4' not in name and 'fc' not in name:
                param.requires_grad = False

        # Customize classification layer
        num_features = model.fc.in_features
        model.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(num_features, 512),
            nn.ReLU(),
            nn.BatchNorm1d(512),
            nn.Dropout(0.3),
            nn.Linear(512, self.num_classes)
        )

        return model.to(self.device)

    def train(self, train_data, val_data, epochs=50, use_distillation=False):
        """
        Execute training

        Args:
            train_data: Training data (images, labels)
            val_data: Validation data (images, labels)
            epochs: Number of epochs
            use_distillation: Whether to use knowledge distillation
        """
        # Data loaders
        train_dataset = MedicalImageDataset(*train_data, self.train_transform)
        val_dataset = MedicalImageDataset(*val_data, self.val_transform)

        train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

        # Discriminative Learning Rates
        optimizer = torch.optim.Adam([
            {'params': self.model.layer4.parameters(), 'lr': 1e-3},
            {'params': self.model.fc.parameters(), 'lr': 1e-2}
        ])

        scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
            optimizer, T_0=10, T_mult=2
        )

        criterion = nn.CrossEntropyLoss()

        best_val_acc = 0.0
        history = {'train_loss': [], 'val_loss': [], 'val_acc': []}

        for epoch in range(epochs):
            # Training phase
            self.model.train()
            train_loss = 0.0

            for images, labels in train_loader:
                images, labels = images.to(self.device), labels.to(self.device)

                optimizer.zero_grad()
                outputs = self.model(images)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()

                train_loss += loss.item()

            # Validation phase
            val_loss, val_acc = self._validate(val_loader, criterion)

            # Save history
            history['train_loss'].append(train_loss / len(train_loader))
            history['val_loss'].append(val_loss)
            history['val_acc'].append(val_acc)

            # Save best model
            if val_acc > best_val_acc:
                best_val_acc = val_acc
                torch.save(self.model.state_dict(), 'best_model.pth')

            scheduler.step()

            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch+1}/{epochs}")
                print(f"  Train Loss: {history['train_loss'][-1]:.4f}")
                print(f"  Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

        return history

    def _validate(self, val_loader, criterion):
        """Validation"""
        self.model.eval()
        val_loss = 0.0
        correct = 0
        total = 0

        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(self.device), labels.to(self.device)
                outputs = self.model(images)
                loss = criterion(outputs, labels)

                val_loss += loss.item()
                _, predicted = outputs.max(1)
                total += labels.size(0)
                correct += predicted.eq(labels).sum().item()

        val_loss /= len(val_loader)
        val_acc = correct / total

        return val_loss, val_acc

    def evaluate_transferability(self, source_data, target_data):
        """Evaluate transferability"""
        self.model.eval()

        def extract_features(data):
            # Run the full ResNet backbone (everything except the final fc layer),
            # i.e. conv1 -> bn1 -> relu -> maxpool -> layer1..layer4 -> avgpool
            backbone = nn.Sequential(*list(self.model.children())[:-1])
            features = []
            with torch.no_grad():
                for images, _ in DataLoader(data, batch_size=32):
                    images = images.to(self.device)
                    feat = backbone(images)
                    features.append(feat.cpu().flatten(1))
            return torch.cat(features, dim=0)

        source_feats = extract_features(source_data)
        target_feats = extract_features(target_data)

        # Calculate MMD score
        score = compute_transferability_score(source_feats, target_feats)
        print(f"Transferability Score: {score:.4f}")

        return score

# Usage example
if __name__ == "__main__":
    # Dummy data (use actual medical image data in practice)
    num_samples = 1000
    train_images = np.random.randint(0, 255, (num_samples, 224, 224, 3), dtype=np.uint8)
    train_labels = np.random.randint(0, 3, num_samples)

    val_images = np.random.randint(0, 255, (200, 224, 224, 3), dtype=np.uint8)
    val_labels = np.random.randint(0, 3, 200)

    # Execute pipeline
    pipeline = TransferLearningPipeline(num_classes=3, device='cpu')

    print("Starting transfer learning training...")
    history = pipeline.train(
        train_data=(train_images, train_labels),
        val_data=(val_images, val_labels),
        epochs=20
    )

    print("\nTraining completed!")
    print(f"Best validation accuracy: {max(history['val_acc']):.4f}")

Summary

In this chapter, we learned comprehensive transfer learning techniques:

- Fundamentals of transfer learning and how to evaluate transferability between source and target domains
- Fine-tuning strategies, including layer freezing and discriminative (layer-wise) learning rates
- Domain adaptation with DANN and MMD to mitigate distribution shift
- Knowledge distillation with temperature scaling, and its combination with meta-learning
- A complete transfer learning pipeline from ImageNet pre-training to medical image classification

Key Point: Transfer learning, when combined with meta-learning, can build even more powerful few-shot learning systems. The key to success in the real world is integrating general feature representations acquired through pre-training with the rapid adaptation capabilities of meta-learning.

Exercises

Exercise 1: Analyzing Transferability

Evaluate the transferability from different source domains (ImageNet, Places365, COCO) to a target domain (medical images) and select the optimal pre-trained model. Compare MMD scores with actual performance.

Exercise 2: Comparing Fine-Tuning Strategies

Implement and compare the following three strategies:
1) Full layer freezing (train only classification layer)
2) Partial freezing (train only Layer4 and classification layer)
3) Discriminative Fine-Tuning
Evaluate the performance of each strategy by varying the amount of data.

Exercise 3: DANN Implementation and Evaluation

Implement Domain Adaptation from source domain (MNIST) to target domain (MNIST-M). Vary the alpha value of the gradient reversal layer and analyze the trade-off between domain invariance and classification performance.

Exercise 4: Optimizing Knowledge Distillation

Vary the temperature parameter (T=1, 3, 5, 10) and alpha value (α=0.3, 0.5, 0.7, 0.9) to evaluate their impact on Student model performance. Find the optimal hyperparameters.

Exercise 5: Meta-Distillation Implementation

Implement a meta-distillation system combining MAML and knowledge distillation. Compare three approaches: regular MAML, regular distillation, and meta-distillation, and verify their effectiveness in few-shot learning.
