This chapter covers transfer learning. You will learn the fundamental concepts of transfer learning, implement and optimize fine-tuning strategies, and build practical transfer learning projects.
Learning Objectives
By reading this chapter, you will be able to:
- ✅ Understand the fundamental concepts of transfer learning and the effects of pre-training
- ✅ Implement and optimize fine-tuning strategies
- ✅ Solve distribution shift problems using Domain Adaptation
- ✅ Combine knowledge distillation for model compression with meta-learning
- ✅ Build practical transfer learning projects
1. Fundamentals of Transfer Learning
1.1 What is Transfer Learning?
Transfer learning is a machine learning approach that leverages knowledge learned from one task (source task) for another task (target task). While meta-learning aims for "learning to learn", transfer learning aims for "knowledge reuse".
(Diagram) Source domain (ImageNet) → pre-training → pre-trained model → transfer → target domain (medical images) → fine-tuning → specialized model.
1.2 Types of Transfer Learning
| Transfer Type | Description | Example |
|---|---|---|
| Domain Transfer | Transfer between different data distributions | Natural images → Medical images |
| Task Transfer | Transfer between different tasks | Classification → Object detection |
| Model Transfer | Reuse of model architecture | ResNet → Custom architecture |
1.3 Evaluating Transferability
The success of transfer learning depends on the relationship between the source and target tasks:
# Requirements:
# - Python 3.9+
# - scipy>=1.10.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
import torch
import torch.nn as nn
from torchvision import models
from scipy.stats import spearmanr
def compute_transferability_score(source_features, target_features):
"""
Compute transferability score
Args:
source_features: Feature representations from source domain
target_features: Feature representations from target domain
Returns:
transferability_score: Transferability score
"""
# Calculate correlation coefficient between features
correlation, _ = spearmanr(
source_features.flatten(),
target_features.flatten()
)
# Calculate distance between feature distributions (MMD)
def compute_mmd(x, y):
xx = torch.mm(x, x.t())
yy = torch.mm(y, y.t())
xy = torch.mm(x, y.t())
return xx.mean() + yy.mean() - 2 * xy.mean()
    mmd_distance = compute_mmd(
        torch.as_tensor(source_features, dtype=torch.float32),
        torch.as_tensor(target_features, dtype=torch.float32)
    )
# Overall score (higher is better for transfer)
transferability_score = correlation - 0.1 * mmd_distance.item()
return transferability_score
# Usage example
source_feats = torch.randn(100, 512)
target_feats = torch.randn(100, 512)
score = compute_transferability_score(source_feats, target_feats)
print(f"Transferability Score: {score:.4f}")
2. Fine-Tuning Strategies
2.1 Full Layer vs Partial Layer Updates
The decision of which layers to update in a pre-trained model depends on the amount of data and task similarity:
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
import torch
import torch.nn as nn
from torchvision import models
class TransferLearningModel(nn.Module):
def __init__(self, num_classes, freeze_strategy='partial'):
"""
Transfer learning model
Args:
num_classes: Number of classes for target task
freeze_strategy: 'all', 'partial', 'none'
"""
super().__init__()
# Load ResNet50 pre-trained model
        self.backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
# Apply freezing strategy
if freeze_strategy == 'all':
# Freeze all layers (except classification layer)
for param in self.backbone.parameters():
param.requires_grad = False
elif freeze_strategy == 'partial':
# Freeze only early layers (use as feature extractor)
for name, param in self.backbone.named_parameters():
if 'layer4' not in name and 'fc' not in name:
param.requires_grad = False
# Replace classification layer
num_features = self.backbone.fc.in_features
self.backbone.fc = nn.Sequential(
nn.Dropout(0.5),
nn.Linear(num_features, 512),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(512, num_classes)
)
def forward(self, x):
return self.backbone(x)
# Create models with different strategies
model_frozen = TransferLearningModel(num_classes=10, freeze_strategy='all')
model_partial = TransferLearningModel(num_classes=10, freeze_strategy='partial')
model_full = TransferLearningModel(num_classes=10, freeze_strategy='none')
# Check parameter counts
for name, model in [('Frozen', model_frozen), ('Partial', model_partial), ('Full', model_full)]:
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{name}: {trainable:,} / {total:,} parameters trainable")
2.2 Discriminative Fine-Tuning
More effective fine-tuning is possible by setting different learning rates for each layer:
def get_discriminative_params(model, base_lr=1e-4, multiplier=2.6):
"""
Set different learning rates per layer
Args:
model: Neural network model
base_lr: Learning rate for the final layer
multiplier: Learning rate multiplier between layers
Returns:
param_groups: Layer-wise parameter groups
"""
param_groups = []
# Layer groups for ResNet
layer_groups = [
('layer1', model.backbone.layer1),
('layer2', model.backbone.layer2),
('layer3', model.backbone.layer3),
('layer4', model.backbone.layer4),
('fc', model.backbone.fc)
]
# Set learning rate per layer (deeper layers get higher learning rates)
for i, (name, layer) in enumerate(layer_groups):
lr = base_lr / (multiplier ** (len(layer_groups) - i - 1))
param_groups.append({
'params': layer.parameters(),
'lr': lr,
'name': name
})
print(f"{name}: lr = {lr:.2e}")
return param_groups
# Apply Discriminative Fine-Tuning
model = TransferLearningModel(num_classes=10, freeze_strategy='none')
param_groups = get_discriminative_params(model, base_lr=1e-3)
optimizer = torch.optim.Adam(param_groups)
# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
optimizer, T_0=10, T_mult=2
)
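For illustration, a single training step with this optimizer and scheduler on a dummy batch (batch size, image size, and label range are placeholder values):

# One training step on a random batch of 224x224 RGB images
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()  # with CosineAnnealingWarmRestarts, typically called once per epoch
print(f"Step loss: {loss.item():.4f}")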
3. Domain Adaptation
3.1 The Domain Shift Problem
When the distributions of the source and target domains differ, model performance degrades. Domain Adaptation is a technique that mitigates this distribution shift.
(Diagram) The labeled source domain P_s(X,Y) and the unlabeled target domain P_t(X,Y) both feed the model, which is adapted to learn domain-invariant features that are then used for target predictions.
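The effect of domain shift is visible even in a toy setting. The following self-contained sketch (entirely synthetic 2D data with an assumed shift of the target distribution) trains a linear classifier on the source domain and evaluates it on both domains:

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_domain(n, shift):
    """Two Gaussian classes centered at (-2, 0) and (2, 0), optionally shifted."""
    x0 = torch.randn(n, 2) + torch.tensor([-2.0, 0.0]) + shift
    x1 = torch.randn(n, 2) + torch.tensor([2.0, 0.0]) + shift
    x = torch.cat([x0, x1])
    y = torch.cat([torch.zeros(n), torch.ones(n)]).long()
    return x, y

source_x, source_y = make_domain(500, torch.zeros(2))
target_x, target_y = make_domain(500, torch.tensor([1.5, 0.0]))  # covariate shift along x

# Train a linear classifier on the source domain only
clf = nn.Linear(2, 2)
opt = torch.optim.SGD(clf.parameters(), lr=0.1)
for _ in range(200):
    loss = F.cross_entropy(clf(source_x), source_y)
    opt.zero_grad()
    loss.backward()
    opt.step()

def accuracy(x, y):
    return (clf(x).argmax(dim=1) == y).float().mean().item()

print(f"Source accuracy: {accuracy(source_x, source_y):.3f}")  # high
print(f"Target accuracy: {accuracy(target_x, target_y):.3f}")  # noticeably lower due to the shift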
3.2 Domain Adversarial Neural Networks (DANN)
DANN uses adversarial training to ensure that the feature extractor learns domain-invariant representations:
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
class GradientReversalLayer(torch.autograd.Function):
"""Gradient Reversal Layer (for DANN)"""
@staticmethod
def forward(ctx, x, alpha):
ctx.alpha = alpha
return x.view_as(x)
@staticmethod
def backward(ctx, grad_output):
return -ctx.alpha * grad_output, None
class DANN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
# Feature extractor (domain-invariant)
self.feature_extractor = nn.Sequential(
nn.Conv2d(3, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
nn.Conv2d(64, 128, 5), nn.ReLU(), nn.MaxPool2d(2),
nn.Flatten(),
nn.Linear(128 * 5 * 5, 1024), nn.ReLU(),
nn.Dropout(0.5)
)
# Class classifier
self.class_classifier = nn.Sequential(
nn.Linear(1024, 256), nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(256, num_classes)
)
# Domain classifier
self.domain_classifier = nn.Sequential(
nn.Linear(1024, 256), nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(256, 2) # Source vs Target
)
def forward(self, x, alpha=1.0):
features = self.feature_extractor(x)
# Class prediction
class_output = self.class_classifier(features)
# Domain prediction (gradient reversal)
reversed_features = GradientReversalLayer.apply(features, alpha)
domain_output = self.domain_classifier(reversed_features)
return class_output, domain_output
# DANN training
def train_dann(model, source_loader, target_loader, epochs=50):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(epochs):
model.train()
# Gradually increase alpha value (gradient reversal strength)
alpha = 2 / (1 + np.exp(-10 * epoch / epochs)) - 1
for (source_data, source_labels), (target_data, _) in zip(source_loader, target_loader):
# Source domain loss
source_class, source_domain = model(source_data, alpha)
class_loss = F.cross_entropy(source_class, source_labels)
source_domain_loss = F.cross_entropy(
source_domain,
torch.zeros(len(source_data), dtype=torch.long)
)
# Target domain loss
_, target_domain = model(target_data, alpha)
target_domain_loss = F.cross_entropy(
target_domain,
torch.ones(len(target_data), dtype=torch.long)
)
# Total loss
loss = class_loss + source_domain_loss + target_domain_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
if (epoch + 1) % 10 == 0:
print(f"Epoch {epoch+1}: Loss = {loss.item():.4f}, Alpha = {alpha:.4f}")
# Usage example
model = DANN(num_classes=10)
print("DANN model created with gradient reversal layer")
3.3 Maximum Mean Discrepancy (MMD)
MMD achieves domain adaptation by minimizing a kernel-based distance between the source and target feature distributions.
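For a kernel k, the squared MMD between the source distribution P_s and the target distribution P_t is

$$\mathrm{MMD}^2(P_s, P_t) = \mathbb{E}_{x,x' \sim P_s}[k(x, x')] + \mathbb{E}_{y,y' \sim P_t}[k(y, y')] - 2\,\mathbb{E}_{x \sim P_s,\, y \sim P_t}[k(x, y)]$$

The implementation below estimates the three expectations with (biased) empirical means of the corresponding kernel matrices, which is exactly the xx + yy - 2 * xy pattern in the code: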
def compute_mmd_loss(source_features, target_features, kernel='rbf'):
"""
Compute Maximum Mean Discrepancy loss
Args:
source_features: Source features (N_s, D)
target_features: Target features (N_t, D)
kernel: Kernel type ('rbf', 'linear')
Returns:
mmd_loss: MMD loss
"""
def gaussian_kernel(x, y, sigma=1.0):
x_size = x.size(0)
y_size = y.size(0)
dim = x.size(1)
x = x.unsqueeze(1) # (N_s, 1, D)
y = y.unsqueeze(0) # (1, N_t, D)
diff = x - y # (N_s, N_t, D)
dist_sq = torch.sum(diff ** 2, dim=2) # (N_s, N_t)
return torch.exp(-dist_sq / (2 * sigma ** 2))
if kernel == 'rbf':
# RBF kernel
xx = gaussian_kernel(source_features, source_features).mean()
yy = gaussian_kernel(target_features, target_features).mean()
xy = gaussian_kernel(source_features, target_features).mean()
else:
# Linear kernel
xx = torch.mm(source_features, source_features.t()).mean()
yy = torch.mm(target_features, target_features.t()).mean()
xy = torch.mm(source_features, target_features.t()).mean()
mmd_loss = xx + yy - 2 * xy
return mmd_loss
class MMDNet(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
self.feature_extractor = nn.Sequential(
nn.Linear(784, 512), nn.ReLU(), nn.Dropout(0.5),
nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5)
)
self.classifier = nn.Linear(256, num_classes)
def forward(self, x):
features = self.feature_extractor(x)
output = self.classifier(features)
return output, features
# Training with MMD
def train_with_mmd(model, source_loader, target_loader, lambda_mmd=0.1):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for (source_data, source_labels), (target_data, _) in zip(source_loader, target_loader):
# Get features and predictions
source_pred, source_feat = model(source_data)
_, target_feat = model(target_data)
# Classification loss
class_loss = F.cross_entropy(source_pred, source_labels)
# MMD loss
mmd_loss = compute_mmd_loss(source_feat, target_feat)
# Total loss
total_loss = class_loss + lambda_mmd * mmd_loss
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
return class_loss.item(), mmd_loss.item()
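A minimal usage sketch with random data (the Linear(784, 512) layer above assumes flattened 28x28 inputs; the tensors here are placeholders for real source and target batches):

model = MMDNet(num_classes=10)
source_x = torch.randn(64, 784)
source_y = torch.randint(0, 10, (64,))
target_x = torch.randn(64, 784)
# Joint objective: supervised loss on the source plus the MMD alignment term
source_pred, source_feat = model(source_x)
_, target_feat = model(target_x)
loss = F.cross_entropy(source_pred, source_y) + 0.1 * compute_mmd_loss(source_feat, target_feat)
print(f"Combined loss: {loss.item():.4f}")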
4. Knowledge Distillation
4.1 Fundamentals of Teacher-Student Learning
Knowledge distillation is a method that transfers knowledge from a large Teacher model to a small Student model. When combined with meta-learning, it enables efficient few-shot learning.
(Diagram) A large teacher model (high accuracy) produces soft targets from the training data; knowledge distillation transfers this knowledge to a small student model (fast inference).
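The "soft targets" above come from a temperature-scaled softmax: dividing the logits by a temperature T > 1 flattens the teacher's output distribution so that the relative probabilities of the wrong classes (the "dark knowledge") become visible to the student. A quick illustration (printed values are approximate):

import torch
import torch.nn.functional as F
teacher_logits = torch.tensor([[8.0, 2.0, 1.0]])
print(F.softmax(teacher_logits, dim=1))        # ~[0.997, 0.002, 0.001]  (T=1, nearly one-hot)
print(F.softmax(teacher_logits / 3.0, dim=1))  # ~[0.81, 0.11, 0.08]     (T=3, softened)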
4.2 Implementation of Temperature-based Distillation
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import torch.nn.functional as F
class DistillationLoss(nn.Module):
def __init__(self, temperature=3.0, alpha=0.7):
"""
Knowledge distillation loss function
Args:
temperature: Temperature parameter for softmax
alpha: Balance between hard loss and soft loss
"""
super().__init__()
self.temperature = temperature
self.alpha = alpha
self.ce_loss = nn.CrossEntropyLoss()
def forward(self, student_logits, teacher_logits, targets):
# Soft target loss (knowledge distillation)
soft_targets = F.softmax(teacher_logits / self.temperature, dim=1)
soft_student = F.log_softmax(student_logits / self.temperature, dim=1)
distillation_loss = F.kl_div(
soft_student, soft_targets, reduction='batchmean'
) * (self.temperature ** 2)
# Hard target loss (normal classification)
student_loss = self.ce_loss(student_logits, targets)
# Total loss
total_loss = (
self.alpha * distillation_loss +
(1 - self.alpha) * student_loss
)
return total_loss, distillation_loss, student_loss
# Define Teacher and Student models
class TeacherModel(nn.Module):
"""Large Teacher model"""
def __init__(self, num_classes=10):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 128, 3, padding=1), nn.ReLU(),
nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
nn.MaxPool2d(2),
nn.Flatten(),
nn.Linear(256 * 8 * 8, 512), nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(512, num_classes)
)
def forward(self, x):
return self.features(x)
class StudentModel(nn.Module):
"""Lightweight Student model"""
def __init__(self, num_classes=10):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
nn.MaxPool2d(2),
nn.Flatten(),
nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
nn.Linear(128, num_classes)
)
def forward(self, x):
return self.features(x)
# Knowledge distillation training
def train_distillation(teacher, student, train_loader, epochs=50):
# Teacher in evaluation mode
teacher.eval()
student.train()
criterion = DistillationLoss(temperature=3.0, alpha=0.7)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
for epoch in range(epochs):
total_loss = 0
for images, labels in train_loader:
# Teacher prediction (no gradients)
with torch.no_grad():
teacher_logits = teacher(images)
# Student prediction
student_logits = student(images)
# Distillation loss
loss, dist_loss, student_loss = criterion(
student_logits, teacher_logits, labels
)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
if (epoch + 1) % 10 == 0:
print(f"Epoch {epoch+1}: Loss = {total_loss/len(train_loader):.4f}")
# Compare model sizes
teacher = TeacherModel()
student = StudentModel()
teacher_params = sum(p.numel() for p in teacher.parameters())
student_params = sum(p.numel() for p in student.parameters())
print(f"Teacher: {teacher_params:,} parameters")
print(f"Student: {student_params:,} parameters")
print(f"Compression ratio: {teacher_params/student_params:.2f}x")
4.3 Combining with Meta-Learning
class MetaDistillation(nn.Module):
def __init__(self, teacher_model, student_model, inner_lr=0.01):
"""
Combination of meta-learning and knowledge distillation
Args:
teacher_model: Large Teacher model
student_model: Lightweight Student model
inner_lr: Inner loop learning rate
"""
super().__init__()
self.teacher = teacher_model
self.student = student_model
self.inner_lr = inner_lr
self.temperature = 3.0
    def inner_loop(self, support_x, support_y, steps=5):
        """
        Inner loop: Task adaptation of the Student model via distillation
        """
        # Start from the current Student parameters (clones stay differentiable w.r.t. the originals)
        adapted_params = {name: p.clone() for name, p in self.student.named_parameters()}
        # Teacher prediction (fixed soft targets for this task)
        with torch.no_grad():
            teacher_logits = self.teacher(support_x)
        soft_targets = F.softmax(teacher_logits / self.temperature, dim=1)
        for _ in range(steps):
            # Student prediction using the current adapted parameters
            student_logits = torch.func.functional_call(
                self.student, adapted_params, (support_x,)
            )
            # Distillation loss on the support set
            soft_student = F.log_softmax(student_logits / self.temperature, dim=1)
            loss = F.kl_div(soft_student, soft_targets, reduction='batchmean')
            # Gradient step on the adapted parameters (create_graph keeps the outer-loop path)
            grads = torch.autograd.grad(
                loss, list(adapted_params.values()), create_graph=True
            )
            adapted_params = {
                name: p - self.inner_lr * g
                for (name, p), g in zip(adapted_params.items(), grads)
            }
        return adapted_params

    def forward(self, support_x, support_y, query_x, query_y):
        """
        Forward pass for meta-distillation
        """
        # Adapt in inner loop
        adapted_params = self.inner_loop(support_x, support_y)
        # Evaluate on the query set with the adapted parameters
        query_logits = torch.func.functional_call(
            self.student, adapted_params, (query_x,)
        )
        loss = F.cross_entropy(query_logits, query_y)
        return loss
# Usage example
teacher = TeacherModel(num_classes=5)
student = StudentModel(num_classes=5)
meta_distill = MetaDistillation(teacher, student, inner_lr=0.01)
print("Meta-Distillation model initialized")
5. Practice: Transfer Learning Project
Project: Transfer Learning from ImageNet to Medical Image Classification
Goal: Build a high-accuracy diagnostic model with a small amount of medical image data using a model pre-trained on ImageNet.
5.1 Complete Transfer Learning Pipeline
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
# - torchvision>=0.15.0
import torch
import torch.nn as nn
from torchvision import models, transforms
from torch.utils.data import DataLoader, Dataset
import numpy as np
class MedicalImageDataset(Dataset):
"""Medical image dataset"""
def __init__(self, images, labels, transform=None):
self.images = images
self.labels = labels
self.transform = transform
def __len__(self):
return len(self.images)
def __getitem__(self, idx):
image = self.images[idx]
label = self.labels[idx]
if self.transform:
image = self.transform(image)
return image, label
class TransferLearningPipeline:
def __init__(self, num_classes, device='cuda'):
self.device = device
self.num_classes = num_classes
# Data augmentation (for Domain Adaptation)
self.train_transform = transforms.Compose([
transforms.ToPILImage(),
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.RandomRotation(15),
transforms.ColorJitter(brightness=0.2, contrast=0.2),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
self.val_transform = transforms.Compose([
transforms.ToPILImage(),
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
# Load pre-trained model
self.model = self._create_model()
def _create_model(self):
"""Create and customize pre-trained model"""
# ResNet50 (pre-trained on ImageNet)
        model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
# Freeze early layers
for name, param in model.named_parameters():
if 'layer4' not in name and 'fc' not in name:
param.requires_grad = False
# Customize classification layer
num_features = model.fc.in_features
model.fc = nn.Sequential(
nn.Dropout(0.5),
nn.Linear(num_features, 512),
nn.ReLU(),
nn.BatchNorm1d(512),
nn.Dropout(0.3),
nn.Linear(512, self.num_classes)
)
return model.to(self.device)
def train(self, train_data, val_data, epochs=50, use_distillation=False):
"""
Execute training
Args:
train_data: Training data (images, labels)
val_data: Validation data (images, labels)
epochs: Number of epochs
            use_distillation: Whether to use knowledge distillation (reserved; not used in this simplified pipeline)
"""
# Data loaders
train_dataset = MedicalImageDataset(*train_data, self.train_transform)
val_dataset = MedicalImageDataset(*val_data, self.val_transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
# Discriminative Learning Rates
optimizer = torch.optim.Adam([
{'params': self.model.layer4.parameters(), 'lr': 1e-3},
{'params': self.model.fc.parameters(), 'lr': 1e-2}
])
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
optimizer, T_0=10, T_mult=2
)
criterion = nn.CrossEntropyLoss()
best_val_acc = 0.0
history = {'train_loss': [], 'val_loss': [], 'val_acc': []}
for epoch in range(epochs):
# Training phase
self.model.train()
train_loss = 0.0
for images, labels in train_loader:
images, labels = images.to(self.device), labels.to(self.device)
optimizer.zero_grad()
outputs = self.model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
train_loss += loss.item()
# Validation phase
val_loss, val_acc = self._validate(val_loader, criterion)
# Save history
history['train_loss'].append(train_loss / len(train_loader))
history['val_loss'].append(val_loss)
history['val_acc'].append(val_acc)
# Save best model
if val_acc > best_val_acc:
best_val_acc = val_acc
torch.save(self.model.state_dict(), 'best_model.pth')
scheduler.step()
if (epoch + 1) % 10 == 0:
print(f"Epoch {epoch+1}/{epochs}")
print(f" Train Loss: {history['train_loss'][-1]:.4f}")
print(f" Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")
return history
def _validate(self, val_loader, criterion):
"""Validation"""
self.model.eval()
val_loss = 0.0
correct = 0
total = 0
with torch.no_grad():
for images, labels in val_loader:
images, labels = images.to(self.device), labels.to(self.device)
outputs = self.model(images)
loss = criterion(outputs, labels)
val_loss += loss.item()
_, predicted = outputs.max(1)
total += labels.size(0)
correct += predicted.eq(labels).sum().item()
val_loss /= len(val_loader)
val_acc = correct / total
return val_loss, val_acc
def evaluate_transferability(self, source_data, target_data):
"""Evaluate transferability"""
self.model.eval()
        def extract_features(data):
            features = []
            loader = DataLoader(data, batch_size=32)
            with torch.no_grad():
                for images, _ in loader:
                    images = images.to(self.device)
                    # Run the ResNet stem and residual stages to get pre-classifier features
                    x = self.model.conv1(images)
                    x = self.model.bn1(x)
                    x = self.model.relu(x)
                    x = self.model.maxpool(x)
                    x = self.model.layer1(x)
                    x = self.model.layer2(x)
                    x = self.model.layer3(x)
                    x = self.model.layer4(x)
                    x = self.model.avgpool(x)
                    features.append(x.cpu().flatten(1))
            return torch.cat(features, dim=0)
source_feats = extract_features(source_data)
target_feats = extract_features(target_data)
# Calculate MMD score
score = compute_transferability_score(source_feats, target_feats)
print(f"Transferability Score: {score:.4f}")
return score
# Usage example
if __name__ == "__main__":
# Dummy data (use actual medical image data in practice)
num_samples = 1000
train_images = np.random.randint(0, 255, (num_samples, 224, 224, 3), dtype=np.uint8)
train_labels = np.random.randint(0, 3, num_samples)
val_images = np.random.randint(0, 255, (200, 224, 224, 3), dtype=np.uint8)
val_labels = np.random.randint(0, 3, 200)
# Execute pipeline
pipeline = TransferLearningPipeline(num_classes=3, device='cpu')
print("Starting transfer learning training...")
history = pipeline.train(
train_data=(train_images, train_labels),
val_data=(val_images, val_labels),
epochs=20
)
print("\nTraining completed!")
print(f"Best validation accuracy: {max(history['val_acc']):.4f}")
Summary
In this chapter, we learned comprehensive transfer learning techniques:
- Fundamentals of transfer learning: High-accuracy models can be built even with small datasets by leveraging pre-trained models
- Fine-tuning strategies: Efficient knowledge transfer through layer-wise learning rate adjustment
- Domain Adaptation: Solving domain shift problems using DANN and MMD
- Knowledge distillation: Transferring knowledge from large models to lightweight models to improve inference efficiency
- Practical project: Achieving practical applications through transfer from ImageNet to medical images
Key Point: Transfer learning, when combined with meta-learning, can build even more powerful few-shot learning systems. The key to success in the real world is integrating general feature representations acquired through pre-training with the rapid adaptation capabilities of meta-learning.
Exercises
Exercise 1: Analyzing Transferability
Evaluate the transferability from different source domains (ImageNet, Places365, COCO) to a target domain (medical images) and select the optimal pre-trained model. Compare MMD scores with actual performance.
Exercise 2: Comparing Fine-Tuning Strategies
Implement and compare the following three strategies:
1) Full layer freezing (train only classification layer)
2) Partial freezing (train only Layer4 and classification layer)
3) Discriminative Fine-Tuning
Evaluate the performance of each strategy by varying the amount of data.
Exercise 3: DANN Implementation and Evaluation
Implement Domain Adaptation from source domain (MNIST) to target domain (MNIST-M). Vary the alpha value of the gradient reversal layer and analyze the trade-off between domain invariance and classification performance.
Exercise 4: Optimizing Knowledge Distillation
Vary the temperature parameter (T=1, 3, 5, 10) and alpha value (α=0.3, 0.5, 0.7, 0.9) to evaluate their impact on Student model performance. Find the optimal hyperparameters.
Exercise 5: Meta-Distillation Implementation
Implement a meta-distillation system combining MAML and knowledge distillation. Compare three approaches: regular MAML, regular distillation, and meta-distillation, and verify their effectiveness in few-shot learning.
References
- Pan, S. J., & Yang, Q. (2009). "A Survey on Transfer Learning". IEEE Transactions on Knowledge and Data Engineering.
- Ganin, Y., et al. (2016). "Domain-Adversarial Training of Neural Networks". JMLR.
- Hinton, G., Vinyals, O., & Dean, J. (2015). "Distilling the Knowledge in a Neural Network". NIPS Deep Learning Workshop.
- Yosinski, J., et al. (2014). "How transferable are features in deep neural networks?". NIPS.
- Howard, J., & Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification". ACL.