This chapter covers advanced hyperparameter tuning methods. You will learn the principles of Successive Halving and Hyperband, apply BOHB's fusion of Bayesian optimization and Hyperband, and run large-scale distributed tuning with Ray Tune.
Learning Objectives
By reading this chapter, you will be able to:
- ✅ Understand the principles of Successive Halving and Hyperband
- ✅ Utilize BOHB's fusion of Bayesian optimization and Hyperband
- ✅ Optimize parallel training with Population-based Training (PBT)
- ✅ Understand the characteristics of major libraries including Hyperopt, SMAC, and Ax/BoTorch
- ✅ Implement large-scale distributed tuning with Ray Tune
3.1 Hyperband
Principles of Successive Halving
Successive Halving is a method for efficiently allocating limited computational resources. The basic idea is as follows:
- Start training with many configurations using a small amount of resources
- Progressively eliminate poorly performing configurations, keeping only the top fraction each round (the best half in the simplest case)
- Allocate more resources to the remaining promising configurations
Important: By eliminating poorly performing configurations early, computational costs can be significantly reduced.
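The three steps above can be sketched in a few lines of plain Python. The toy objective (a score that peaks near lr = 0.1 and improves with more resource) and the population size of 16 are illustrative assumptions, not part of the algorithm:

```python
import random

random.seed(0)

def successive_halving(configs, evaluate, min_resource=1, eta=2):
    """Keep the best 1/eta of configurations each round, giving the
    survivors eta times more resource in the next round."""
    resource = min_resource
    while len(configs) > 1:
        # Evaluate every surviving configuration with the current budget
        scored = sorted(((evaluate(c, resource), c) for c in configs), reverse=True)
        # Keep the top 1/eta and grow the budget
        configs = [c for _, c in scored[: max(1, len(configs) // eta)]]
        resource *= eta
    return configs[0]

# Toy objective (assumption): improves with resource, peaks near lr = 0.1
def evaluate(lr, resource):
    return -abs(lr - 0.1) + 0.01 * resource

candidates = [random.uniform(0.001, 1.0) for _ in range(16)]
best_lr = successive_halving(candidates, evaluate)
print(f"best lr: {best_lr:.3f}")
```

With η = 2, the 16 candidates shrink to 8, 4, 2, then 1 while the per-candidate budget doubles each round.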
Algorithm Flow
Hyperband Algorithm
Hyperband runs Successive Halving with multiple different configurations to optimize resource allocation strategies.
Parameters:
- R: Maximum resources to allocate to one configuration (e.g., number of epochs)
- η: Reduction rate at each round (typically 3 or 4)
$$ s_{\max} = \lfloor \log_\eta(R) \rfloor $$
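Here s_max is the number of Successive Halving brackets Hyperband runs. For R = 81 and η = 3 the full bracket schedule can be computed directly; this sketch follows the schedule from Li et al. (2018), listing each bracket as (number of configurations, resource per configuration) per round:

```python
import math

def hyperband_schedule(R, eta):
    """Return, per bracket s, the (n_configs, resource) pairs across
    the successive-halving rounds inside that bracket."""
    # s_max = floor(log_eta(R)), computed with exact integer arithmetic
    s_max = 0
    while eta ** (s_max + 1) <= R:
        s_max += 1
    schedule = {}
    for s in range(s_max, -1, -1):
        # Initial number of configurations for this bracket
        n = math.ceil((s_max + 1) * eta**s / (s + 1))
        rounds = []
        for i in range(s + 1):
            # Survivors shrink by eta; resource per survivor grows by eta
            rounds.append((n // eta**i, round(R / eta**(s - i))))
        schedule[s] = rounds
    return schedule

for s, rounds in hyperband_schedule(R=81, eta=3).items():
    print(f"bracket s={s}: {rounds}")
```

The most aggressive bracket (s = 4) starts 81 configurations with 1 epoch each; the most conservative (s = 0) runs just 5 configurations for the full 81 epochs.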
Implementation in Optuna (HyperbandPruner)
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - optuna>=3.2.0
"""
Example: Implementation in Optuna (HyperbandPruner)
Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 10-30 seconds
Dependencies: None
"""
import optuna
from optuna.pruners import HyperbandPruner
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
# Hyperband configuration
pruner = HyperbandPruner(
min_resource=1, # Minimum resources (epochs)
max_resource=100, # Maximum resources
reduction_factor=3 # Reduction rate η
)
def objective(trial):
# Hyperparameter suggestions
n_estimators = trial.suggest_int('n_estimators', 10, 200)
max_depth = trial.suggest_int('max_depth', 2, 32)
min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)
# Data preparation
X, y = load_iris(return_X_y=True)
# Gradually increase n_estimators for evaluation (Hyperband compatible)
for step in range(1, 6):
# Number of trees according to current step
current_n_estimators = int(n_estimators * step / 5)
model = RandomForestClassifier(
n_estimators=current_n_estimators,
max_depth=max_depth,
min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
random_state=42
)
# Cross-validation score
score = cross_val_score(model, X, y, cv=3, n_jobs=-1).mean()
# Report intermediate value to Optuna
trial.report(score, step)
# Pruning decision
if trial.should_prune():
raise optuna.TrialPruned()
return score
# Study execution
study = optuna.create_study(
direction='maximize',
pruner=pruner,
study_name='hyperband_example'
)
study.optimize(objective, n_trials=100, timeout=300)
print("\n=== Hyperband Optimization Results ===")
print(f"Best Score: {study.best_value:.4f}")
print(f"Best Parameters: {study.best_params}")
print(f"\nCompleted Trials: {len([t for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE])}")
print(f"Pruned Trials: {len([t for t in study.trials if t.state == optuna.trial.TrialState.PRUNED])}")
Example Output:
=== Hyperband Optimization Results ===
Best Score: 0.9733
Best Parameters: {'n_estimators': 142, 'max_depth': 8, 'min_samples_split': 3, 'min_samples_leaf': 1}
Completed Trials: 28
Pruned Trials: 72
Effect: Out of 100 trials, 72 were pruned early, significantly reducing computation time.
3.2 BOHB (Bayesian Optimization and HyperBand)
Fusion of Hyperband and Bayesian Optimization
BOHB is a method that combines Hyperband's efficient resource allocation with Bayesian optimization's intelligent search.
| Method | Strengths | Weaknesses |
|---|---|---|
| Hyperband | Efficient resource allocation | Random sampling |
| Bayesian Optimization | Intelligent search | Allocates full resources to every trial |
| BOHB | Efficient + Intelligent search | Complex implementation |
BOHB Operating Principles
- Manage resource allocation with the Hyperband framework
- At each round, use TPE (Tree-structured Parzen Estimator) to propose hyperparameters
- Learn from past trial results and preferentially explore promising regions
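The TPE step at the heart of BOHB can be illustrated with a minimal sketch: split past observations into "good" and "bad" groups, fit a density to each, and sample where the ratio l(x)/g(x) is largest. The quadratic objective, the 25% split, and the candidate count here are illustrative choices, not BOHB's actual defaults:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)

def objective(x):
    return (x - 2.0) ** 2          # toy objective, minimum at x = 2

# Warm-up: random observations
xs = list(rng.uniform(-5, 5, 20))
ys = [objective(x) for x in xs]

for _ in range(30):
    # Split observations into "good" (best 25%) and "bad" (the rest)
    order = np.argsort(ys)
    n_good = max(2, len(xs) // 4)
    good = np.array(xs)[order[:n_good]]
    bad = np.array(xs)[order[n_good:]]
    # Model each group with a kernel density estimate
    l_kde, g_kde = gaussian_kde(good), gaussian_kde(bad)
    # Propose candidates and keep the one maximizing l(x)/g(x)
    cand = rng.uniform(-5, 5, 64)
    best = cand[np.argmax(l_kde(cand) / (g_kde(cand) + 1e-12))]
    xs.append(float(best))
    ys.append(objective(best))

best_x = xs[int(np.argmin(ys))]
print(f"best x: {best_x:.3f}")
```

Because new samples concentrate where the "good" density dominates, the search quickly clusters around the optimum at x = 2.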
Implementation and Use Cases
# Requirements:
# - Python 3.9+
# - optuna>=3.2.0
"""
Example: Implementation and Use Cases
Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: 10-30 seconds
Dependencies: None
"""
import optuna
from optuna.samplers import TPESampler
from optuna.pruners import HyperbandPruner
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
# BOHB configuration (TPE + Hyperband)
sampler = TPESampler(seed=42, n_startup_trials=10)
pruner = HyperbandPruner(
min_resource=5,
max_resource=100,
reduction_factor=3
)
def objective_bohb(trial):
# Hyperparameter proposals (TPE selects intelligently)
hidden_layer_size = trial.suggest_int('hidden_layer_size', 50, 200)
alpha = trial.suggest_float('alpha', 1e-5, 1e-1, log=True)
learning_rate_init = trial.suggest_float('learning_rate_init', 1e-4, 1e-1, log=True)
X, y = load_digits(return_X_y=True)
# Hyperband: gradually increase max_iter
for step in range(1, 6):
max_iter = int(100 * step / 5)
model = MLPClassifier(
hidden_layer_sizes=(hidden_layer_size,),
alpha=alpha,
learning_rate_init=learning_rate_init,
max_iter=max_iter,
random_state=42
)
score = cross_val_score(model, X, y, cv=3, n_jobs=-1).mean()
trial.report(score, step)
if trial.should_prune():
raise optuna.TrialPruned()
return score
# BOHB study
study_bohb = optuna.create_study(
direction='maximize',
sampler=sampler,
pruner=pruner,
study_name='bohb_example'
)
study_bohb.optimize(objective_bohb, n_trials=50, timeout=180)
print("\n=== BOHB Optimization Results ===")
print(f"Best Score: {study_bohb.best_value:.4f}")
print(f"Best Parameters:")
for key, value in study_bohb.best_params.items():
print(f" {key}: {value}")
print(f"\nCompleted/Pruned: {len([t for t in study_bohb.trials if t.state == optuna.trial.TrialState.COMPLETE])}/{len([t for t in study_bohb.trials if t.state == optuna.trial.TrialState.PRUNED])}")
Use Cases
- Neural Networks: Gradually increase the number of epochs
- Ensemble Learning: Gradually increase the number of weak learners
- Large-scale Data: Gradually increase the number of data samples
3.3 Population-based Training (PBT)
Principles of PBT
Population-based Training trains multiple models in parallel and periodically performs the following:
- Exploit: Replace poorly performing models with well-performing ones
- Explore: Perturb hyperparameters to try new configurations
Feature: The ability to dynamically adjust hyperparameters during training is the major difference from traditional methods.
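The exploit/explore cycle can be simulated without any framework. In this toy sketch the "model" is just a learning rate scored by closeness to an assumed optimum of 0.01, and the ×0.8 / ×1.2 perturbation factors follow the style of the original PBT paper:

```python
import random

random.seed(1)

def score(lr):
    return -abs(lr - 0.01)          # toy score, best at lr = 0.01 (assumption)

# Population of 8 "workers", each with its own hyperparameter
population = [{"lr": random.uniform(0.0001, 0.1)} for _ in range(8)]

for _ in range(20):
    for w in population:
        w["score"] = score(w["lr"])               # one "training" interval
    population.sort(key=lambda w: w["score"], reverse=True)
    top, bottom = population[:2], population[-2:]
    for loser in bottom:
        winner = random.choice(top)
        loser["lr"] = winner["lr"]                # Exploit: copy the winner
        loser["lr"] *= random.choice([0.8, 1.2])  # Explore: perturb it

best = max(population, key=lambda w: w["score"])
print(f"best lr: {best['lr']:.4f}")
```

The bottom performers keep restarting from the current leaders with small perturbations, so the whole population drifts toward the optimum while training continues.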
PBT Workflow
- Initialize n models with different hyperparameters
- Train each model in parallel
- At periodic evaluation points, identify poorly performing models
- Exploit: copy weights from well-performing models into the poor ones
- Explore: perturb the copied hyperparameters to try new configurations
- Repeat until training is complete, then select the best model
Implementation with Ray Tune
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - ray>=2.5.0
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining
import numpy as np
def train_function(config):
"""Training function (simulation)"""
# Initial configuration
learning_rate = config["lr"]
momentum = config["momentum"]
# Training simulation
for step in range(100):
# Dummy performance metric (actual model training in practice)
# Good performance when learning rate and momentum are in appropriate ranges
optimal_lr = 0.01
optimal_momentum = 0.9
score = 1.0 - (
abs(learning_rate - optimal_lr) / optimal_lr +
abs(momentum - optimal_momentum) / optimal_momentum
) / 2
# Add noise to mimic realistic training
score += np.random.normal(0, 0.05)
        # Report results to Ray Tune
        # (newer Ray versions prefer the dict form: ray.train.report({"score": score}))
        tune.report(score=score, lr=learning_rate, momentum=momentum)
# PBT scheduler configuration
pbt_scheduler = PopulationBasedTraining(
time_attr="training_iteration",
metric="score",
mode="max",
perturbation_interval=10, # Perturb every 10 iterations
hyperparam_mutations={
"lr": lambda: np.random.uniform(0.001, 0.1),
"momentum": lambda: np.random.uniform(0.8, 0.99)
}
)
# Ray Tune execution
analysis = tune.run(
train_function,
name="pbt_example",
scheduler=pbt_scheduler,
num_samples=8, # Run 8 models in parallel
config={
"lr": tune.uniform(0.001, 0.1),
"momentum": tune.uniform(0.8, 0.99)
},
stop={"training_iteration": 100},
verbose=1
)
print("\n=== PBT Optimization Results ===")
best_config = analysis.get_best_config(metric="score", mode="max")
print(f"Best Configuration:")
print(f" Learning Rate: {best_config['lr']:.4f}")
print(f" Momentum: {best_config['momentum']:.4f}")
print(f"\nBest Score: {analysis.best_result['score']:.4f}")
Combination with Parallel Training
The greatest advantage of PBT is its ability to fully utilize parallel computational resources:
| Scenario | Traditional Methods | PBT |
|---|---|---|
| 8 GPUs for 100 epochs | Try 8 configurations sequentially (800 epochs worth of time) | Train 8 configurations simultaneously (100 epochs worth of time) |
| Dynamic adjustment | Not possible | Optimized during training |
| Resource efficiency | Poor configs run to completion | Early convergence to good configs |
3.4 Other Advanced Methods
Hyperopt (TPE Implementation)
Hyperopt is a popular library that implements Tree-structured Parzen Estimator (TPE).
# Requirements:
# - Python 3.9+
# - hyperopt>=0.2.7
# - numpy>=1.24.0, <2.0.0
"""
Example: Hyperparameter optimization with Hyperopt's TPE implementation
Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 5-10 seconds
Dependencies: None
"""
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
# Define search space
space = {
'n_estimators': hp.quniform('n_estimators', 50, 300, 1),
'max_depth': hp.quniform('max_depth', 3, 15, 1),
'learning_rate': hp.loguniform('learning_rate', np.log(0.001), np.log(0.3)),
'subsample': hp.uniform('subsample', 0.5, 1.0),
'min_samples_split': hp.quniform('min_samples_split', 2, 20, 1)
}
# Data preparation
X, y = load_breast_cancer(return_X_y=True)
def objective_hyperopt(params):
"""Objective function for Hyperopt"""
# Convert to integer types
params['n_estimators'] = int(params['n_estimators'])
params['max_depth'] = int(params['max_depth'])
params['min_samples_split'] = int(params['min_samples_split'])
model = GradientBoostingClassifier(**params, random_state=42)
score = cross_val_score(model, X, y, cv=5, n_jobs=-1).mean()
# Hyperopt minimizes, so return negative value
return {'loss': -score, 'status': STATUS_OK}
# Optimization execution
trials = Trials()
best = fmin(
fn=objective_hyperopt,
space=space,
algo=tpe.suggest, # TPE algorithm
max_evals=50,
trials=trials,
rstate=np.random.default_rng(42)
)
print("\n=== Hyperopt (TPE) Optimization Results ===")
print("Best Parameters:")
for key, value in best.items():
print(f" {key}: {value}")
print(f"\nBest Score: {-min(trials.losses()):.4f}")
SMAC (Random Forest based)
SMAC (Sequential Model-based Algorithm Configuration) uses random forests as surrogate models.
Features:
- Strong with categorical variables and conditional parameters
- Excellent uncertainty estimation
- Robust to noisy objective functions
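SMAC's own API lives in the `smac` package, but its core idea, a random-forest surrogate whose per-tree disagreement provides the uncertainty estimate, can be sketched with scikit-learn. Everything below (the noisy 1-D objective, the lower-confidence-bound rule, the candidate counts) is an illustrative toy, not SMAC's actual implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def objective(x):
    # Noisy toy objective with its minimum at x = 3 (assumption)
    return (x - 3.0) ** 2 + rng.normal(0, 0.1)

# Warm-up: a few random observations
X = list(rng.uniform(0, 10, 8))
y = [objective(x) for x in X]

for _ in range(25):
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(np.array(X).reshape(-1, 1), y)
    cand = rng.uniform(0, 10, 200).reshape(-1, 1)
    # Per-tree predictions: mean = surrogate estimate, std = uncertainty
    per_tree = np.stack([tree.predict(cand) for tree in rf.estimators_])
    mu, sigma = per_tree.mean(axis=0), per_tree.std(axis=0)
    # Lower confidence bound: prefer low predicted loss OR high uncertainty
    pick = float(cand[np.argmin(mu - sigma)][0])
    X.append(pick)
    y.append(objective(pick))

best_x = X[int(np.argmin(y))]
print(f"best x: {best_x:.2f}")
```

Because the forest averages noisy observations and the trees disagree most in unexplored regions, this kind of surrogate stays robust to noise while still exploring, which is exactly the combination the feature list above describes.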
Ax/BoTorch (Facebook Research)
Ax and BoTorch are next-generation Bayesian optimization frameworks developed by Facebook Research (Meta).
# Requirements:
# - Python 3.9+
# - ax-platform
from ax.service.ax_client import AxClient
from sklearn.datasets import load_wine
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
# Create Ax client
ax_client = AxClient()
# Define search space
ax_client.create_experiment(
name="svm_optimization",
parameters=[
{"name": "C", "type": "range", "bounds": [0.1, 100.0], "log_scale": True},
{"name": "gamma", "type": "range", "bounds": [0.0001, 1.0], "log_scale": True},
{"name": "kernel", "type": "choice", "values": ["rbf", "poly", "sigmoid"]}
],
objective_name="accuracy",
minimize=False
)
# Data preparation
X, y = load_wine(return_X_y=True)
# Optimization loop
for i in range(30):
# Propose next configuration
parameters, trial_index = ax_client.get_next_trial()
# Model evaluation
model = SVC(**parameters, random_state=42)
score = cross_val_score(model, X, y, cv=5, n_jobs=-1).mean()
# Report results
ax_client.complete_trial(trial_index=trial_index, raw_data=score)
# Get best configuration
best_parameters, metrics = ax_client.get_best_parameters()
print("\n=== Ax/BoTorch Optimization Results ===")
print("Best Parameters:")
for key, value in best_parameters.items():
print(f" {key}: {value}")
print(f"\nBest Accuracy: {metrics[0]['accuracy']:.4f}")
# metrics[1] holds the covariance; take the square root for a standard error
sem = metrics[1]['accuracy']['accuracy'] ** 0.5
print(f"95% Confidence Interval: [{metrics[0]['accuracy'] - 1.96 * sem:.4f}, "
      f"{metrics[0]['accuracy'] + 1.96 * sem:.4f}]")
Method Comparison Table
| Method | Surrogate Model | Strengths | Application Scenarios |
|---|---|---|---|
| Hyperopt (TPE) | Kernel density estimation | Simple, fast | General optimization |
| SMAC | Random Forest | Conditional parameters | Complex search spaces |
| Ax/BoTorch | Gaussian Process | Uncertainty estimation, multi-task | Research & experiments |
| Optuna | TPE/GP/CMA-ES | Flexible, pruning | Practical optimization |
3.5 Practical Application: Large-scale Tuning with Ray Tune
Ray Tune Setup
Ray Tune is a unified framework for distributed hyperparameter tuning.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - ray>=2.5.0
# - torch>=2.0.0, <2.3.0
"""
Example: Large-scale distributed tuning with Ray Tune
Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.bayesopt import BayesOptSearch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np
# Initialize Ray
ray.init(ignore_reinit_error=True)
# Data preparation
X, y = make_classification(
n_samples=10000, n_features=20, n_informative=15,
n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# PyTorch datasets
train_dataset = TensorDataset(
torch.FloatTensor(X_train),
torch.LongTensor(y_train)
)
test_dataset = TensorDataset(
torch.FloatTensor(X_test),
torch.LongTensor(y_test)
)
def train_model(config):
"""Training function for Ray Tune"""
# Model definition
model = nn.Sequential(
nn.Linear(20, config["hidden_size_1"]),
nn.ReLU(),
nn.Dropout(config["dropout"]),
nn.Linear(config["hidden_size_1"], config["hidden_size_2"]),
nn.ReLU(),
nn.Dropout(config["dropout"]),
nn.Linear(config["hidden_size_2"], 2)
)
# Optimizer
optimizer = optim.Adam(model.parameters(), lr=config["lr"])
criterion = nn.CrossEntropyLoss()
# Data loaders
train_loader = DataLoader(
train_dataset,
batch_size=config["batch_size"],
shuffle=True
)
test_loader = DataLoader(test_dataset, batch_size=256)
# Training loop
for epoch in range(50):
model.train()
for batch_X, batch_y in train_loader:
optimizer.zero_grad()
outputs = model(batch_X)
loss = criterion(outputs, batch_y)
loss.backward()
optimizer.step()
# Validation
model.eval()
correct = 0
total = 0
with torch.no_grad():
for batch_X, batch_y in test_loader:
outputs = model(batch_X)
_, predicted = torch.max(outputs.data, 1)
total += batch_y.size(0)
correct += (predicted == batch_y).sum().item()
accuracy = correct / total
        # Report to Ray Tune
        # (newer Ray versions prefer the dict form: ray.train.report({"accuracy": accuracy}))
        tune.report(accuracy=accuracy, epoch=epoch)
# Search space
search_space = {
"hidden_size_1": tune.choice([32, 64, 128, 256]),
"hidden_size_2": tune.choice([16, 32, 64, 128]),
"lr": tune.loguniform(1e-4, 1e-1),
"batch_size": tune.choice([32, 64, 128]),
"dropout": tune.uniform(0.1, 0.5)
}
print("=== Ray Tune Setup Complete ===")
print(f"Search Space: {len(search_space)} dimensions")
Utilizing PBT Scheduler
from ray.tune.schedulers import PopulationBasedTraining
# PBT scheduler
pbt = PopulationBasedTraining(
time_attr="epoch",
metric="accuracy",
mode="max",
perturbation_interval=5,
hyperparam_mutations={
"lr": lambda: 10 ** np.random.uniform(-4, -1),
"dropout": lambda: np.random.uniform(0.1, 0.5)
}
)
# Ray Tune execution (PBT)
analysis_pbt = tune.run(
train_model,
name="pbt_neural_net",
scheduler=pbt,
num_samples=8, # Run 8 models in parallel
config=search_space,
resources_per_trial={"cpu": 2, "gpu": 0}, # Change when using GPU
verbose=1
)
print("\n=== PBT Execution Results ===")
best_trial_pbt = analysis_pbt.get_best_trial("accuracy", "max", "last")
print(f"Best Accuracy: {best_trial_pbt.last_result['accuracy']:.4f}")
print(f"Best Configuration:")
for key, value in best_trial_pbt.config.items():
print(f" {key}: {value}")
Distributed Execution
Ray Tune supports distributed execution across multiple machines:
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - ray>=2.5.0
"""
Example: Distributed execution across multiple machines with Ray Tune
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
# ASHA scheduler + Bayesian-style search algorithm
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch
# ASHA scheduler (improved version of Hyperband)
asha_scheduler = ASHAScheduler(
    max_t=50,            # Maximum epochs
    grace_period=5,      # Minimum epochs
    reduction_factor=3   # Reduction rate
)
# Search algorithm (requires the optuna package)
# Note: BayesOptSearch handles only continuous parameters, but search_space
# contains tune.choice() entries, so the TPE-based OptunaSearch (also a
# Bayesian method) is used here instead.
bayesopt = OptunaSearch(
    metric="accuracy",
    mode="max"
)
# Distributed execution
analysis_distributed = tune.run(
train_model,
name="distributed_tuning",
scheduler=asha_scheduler,
search_alg=bayesopt,
num_samples=100, # 100 trials
config=search_space,
resources_per_trial={"cpu": 2},
verbose=1
)
print("\n=== Distributed Tuning Results ===")
best_trial = analysis_distributed.get_best_trial("accuracy", "max", "last")
print(f"Best Accuracy: {best_trial.last_result['accuracy']:.4f}")
print(f"\nTrial Statistics:")
print(f" Completed Trials: {len(analysis_distributed.trials)}")
print(f" Average Accuracy: {np.mean([t.last_result['accuracy'] for t in analysis_distributed.trials if 'accuracy' in t.last_result]):.4f}")
# Visualize results
import pandas as pd
df = analysis_distributed.results_df
print(f"\n=== Top 5 Configurations ===")
top5 = df.nlargest(5, 'accuracy')[['accuracy', 'config/hidden_size_1', 'config/lr', 'config/dropout']]
print(top5)
# Shutdown Ray
ray.shutdown()
Advantages of Ray Tune
| Feature | Description | Benefits |
|---|---|---|
| Unified API | Multiple schedulers/searchers with unified interface | Easy method switching |
| Distributed Execution | Automatic scaling across machines | Large-scale exploration possible |
| Early Stopping | ASHA, Hyperband, Median, etc. | Resource savings |
| Checkpointing | Interruption and resumption support | Safety for long-running tasks |
| Visualization | TensorBoard integration | Real-time monitoring |
3.6 Chapter Summary
What We Learned
Hyperband
- Efficient resource allocation with Successive Halving
- Early elimination of poorly performing configurations
- Easy implementation with Optuna
BOHB
- Fusion of Hyperband and TPE
- Balances efficient resource allocation and intelligent search
- Especially effective for neural networks
Population-based Training
- Dynamically adjusts hyperparameters during parallel training
- Balances Exploit and Explore
- Delivers true value in large-scale parallel environments
Other Methods
- Hyperopt: Simple and fast TPE implementation
- SMAC: Strong with conditional parameters
- Ax/BoTorch: State-of-the-art Bayesian optimization
Ray Tune
- Unified framework utilizing multiple methods
- Large-scale tuning in distributed environments
- Integration with practical tools
Method Selection Guidelines
| Scenario | Recommended Method | Reason |
|---|---|---|
| Limited compute resources | Hyperband | Efficient resource allocation |
| Neural networks | BOHB, PBT | Progressive learning and dynamic adjustment |
| Large-scale parallel environment | PBT, Ray Tune | Maximizes parallel resources |
| Conditional parameters | SMAC | Handles complex search spaces |
| Research & experiments | Ax/BoTorch | Cutting-edge methods and customizability |
| Practical projects | Optuna, Ray Tune | Usability and proven track record |
To the Next Chapter
In Chapter 4, we will learn practical optimization strategies:
- Best practices for search space design
- Optimization of parallelization and distributed execution
- Result analysis and visualization
- Deployment to production environments
References
- Li, L., et al. (2018). "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization". Journal of Machine Learning Research, 18(185), 1-52.
- Falkner, S., Klein, A., & Hutter, F. (2018). "BOHB: Robust and Efficient Hyperparameter Optimization at Scale". ICML 2018.
- Jaderberg, M., et al. (2017). "Population Based Training of Neural Networks". arXiv:1711.09846.
- Liaw, R., et al. (2018). "Tune: A Research Platform for Distributed Model Selection and Training". arXiv:1807.05118.
- Bergstra, J., et al. (2013). "Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures". ICML 2013.