
Chapter 3: Advanced Tuning Methods

Efficient Search with Hyperband, BOHB, and Population-based Training

📖 Reading Time: 25-30 minutes 📊 Difficulty: Intermediate-Advanced 💻 Code Examples: 6 🚀 Practical Methods

This chapter covers advanced hyperparameter tuning methods. You will master the principles of Successive Halving and Hyperband, see how BOHB fuses Bayesian optimization with Hyperband's resource allocation, and learn large-scale distributed tuning with Ray Tune.

Learning Objectives

By reading this chapter, you will be able to:

  • Explain how Successive Halving and Hyperband allocate limited computational resources
  • Combine Bayesian optimization with Hyperband using BOHB
  • Apply Population-based Training to adjust hyperparameters during training
  • Run large-scale distributed tuning with Ray Tune

3.1 Hyperband

Principles of Successive Halving

Successive Halving is a method for efficiently allocating limited computational resources. The basic idea is as follows:

  1. Start training with many configurations using a small amount of resources
  2. Progressively eliminate poorly performing configurations (by half)
  3. Allocate more resources to the remaining promising configurations

Important: By eliminating poorly performing configurations early, computational costs can be significantly reduced.
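
The three steps above can be sketched in a few lines of Python. This is a toy illustration, not library code: `train` is a made-up objective over a single `lr` hyperparameter, and all names are invented for the example.

```python
import random

def train(config, budget):
    """Toy objective: higher budget and lr closer to 0.01 give a better score."""
    return 1.0 - abs(config["lr"] - 0.01) - 1.0 / (budget + 1)

def successive_halving(n_configs=16, min_budget=1):
    # Step 1: many random configurations, each with a small budget
    configs = [{"lr": random.uniform(0.001, 0.1)} for _ in range(n_configs)]
    budget = min_budget
    while len(configs) > 1:
        # Evaluate all surviving configurations with the current budget
        ranked = sorted(configs, key=lambda c: train(c, budget), reverse=True)
        # Step 2: keep the better half; Step 3: double their budget
        configs = ranked[: max(1, len(ranked) // 2)]
        budget *= 2
    return configs[0]

random.seed(0)
print(successive_halving())
```

Starting from 16 configurations, the survivors shrink 16 → 8 → 4 → 2 → 1 while the per-configuration budget doubles each round, so most of the compute goes to the few promising candidates.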

Algorithm Flow

graph TD
    A[Generate n random configurations] --> B[Evaluate each configuration with r resources]
    B --> C{Select top n/2 by performance}
    C --> D[Double the resources]
    D --> E{Further select top n/4}
    E --> F[Double the resources]
    F --> G[The best configuration remains]
    style A fill:#ffebee
    style B fill:#fff3e0
    style C fill:#e3f2fd
    style D fill:#f3e5f5
    style E fill:#e3f2fd
    style F fill:#f3e5f5
    style G fill:#c8e6c9

Hyperband Algorithm

Hyperband runs Successive Halving with multiple different configurations to optimize resource allocation strategies.

Parameters:

$$ s_{\max} = \lfloor \log_\eta(R) \rfloor $$

where $R$ is the maximum resource (e.g., epochs) that can be allocated to a single configuration and $\eta$ is the reduction factor. Hyperband runs $s_{\max} + 1$ brackets of Successive Halving, each trading off the number of configurations against the initial resource per configuration.
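
Plugging in concrete numbers makes the schedule tangible. The sketch below assumes $R = 81$ and $\eta = 3$ and computes $s_{\max}$ plus the initial number of configurations $n$ and resource $r$ for each Successive Halving bracket, following the formulas in Li et al. (2018):

```python
import math

R, eta = 81, 3                          # max resource and reduction factor
s_max = math.floor(math.log(R, eta))    # number of brackets minus one
B = (s_max + 1) * R                     # budget assigned to each bracket

for s in range(s_max, -1, -1):
    n = math.ceil((B / R) * eta**s / (s + 1))  # initial number of configurations
    r = R / eta**s                             # initial resource per configuration
    print(f"bracket s={s}: n={n}, r={r:g}")
```

For these values, $s_{\max} = 4$ and the five brackets run from many cheap trials (n=81, r=1) down to a few fully budgeted ones (n=5, r=81).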

Implementation in Optuna (HyperbandPruner)

# Requirements:
# - Python 3.9+
# - optuna>=3.2.0

"""
Example: Implementation in Optuna (HyperbandPruner)

Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 10-30 seconds
Dependencies: None
"""

import optuna
from optuna.pruners import HyperbandPruner
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hyperband configuration
pruner = HyperbandPruner(
    min_resource=1,      # Minimum resources (epochs)
    max_resource=100,    # Maximum resources
    reduction_factor=3   # Reduction rate η
)

def objective(trial):
    # Hyperparameter suggestions
    n_estimators = trial.suggest_int('n_estimators', 10, 200)
    max_depth = trial.suggest_int('max_depth', 2, 32)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)

    # Data preparation
    X, y = load_iris(return_X_y=True)

    # Gradually increase n_estimators for evaluation (Hyperband compatible)
    for step in range(1, 6):
        # Number of trees according to current step
        current_n_estimators = int(n_estimators * step / 5)

        model = RandomForestClassifier(
            n_estimators=current_n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=42
        )

        # Cross-validation score
        score = cross_val_score(model, X, y, cv=3, n_jobs=-1).mean()

        # Report intermediate value to Optuna
        trial.report(score, step)

        # Pruning decision
        if trial.should_prune():
            raise optuna.TrialPruned()

    return score

# Study execution
study = optuna.create_study(
    direction='maximize',
    pruner=pruner,
    study_name='hyperband_example'
)

study.optimize(objective, n_trials=100, timeout=300)

print("\n=== Hyperband Optimization Results ===")
print(f"Best Score: {study.best_value:.4f}")
print(f"Best Parameters: {study.best_params}")
print(f"\nCompleted Trials: {len([t for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE])}")
print(f"Pruned Trials: {len([t for t in study.trials if t.state == optuna.trial.TrialState.PRUNED])}")

Example Output:

=== Hyperband Optimization Results ===
Best Score: 0.9733
Best Parameters: {'n_estimators': 142, 'max_depth': 8, 'min_samples_split': 3, 'min_samples_leaf': 1}

Completed Trials: 28
Pruned Trials: 72

Effect: Out of 100 trials, 72 were pruned early, significantly reducing computation time.


3.2 BOHB (Bayesian Optimization and HyperBand)

Fusion of Hyperband and Bayesian Optimization

BOHB is a method that combines Hyperband's efficient resource allocation with Bayesian optimization's intelligent search.

| Method | Strengths | Weaknesses |
|---|---|---|
| Hyperband | Efficient resource allocation | Random sampling |
| Bayesian Optimization | Intelligent search | Allocates full resources to every trial |
| BOHB | Efficient + intelligent search | Complex implementation |

BOHB Operating Principles

  1. Manage resource allocation with the Hyperband framework
  2. At each round, use TPE (Tree-structured Parzen Estimator) to propose hyperparameters
  3. Learn from past trial results and preferentially explore promising regions

graph LR
    A[Past trial data] --> B[Build TPE model]
    B --> C[Propose promising configurations]
    C --> D[Evaluate with Successive Halving]
    D --> E[Feedback results]
    E --> A
    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#fff3e0
    style D fill:#ffebee
    style E fill:#e8f5e9
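
The TPE step in this loop can be illustrated with a minimal sketch: split observed trials into "good" and "bad" groups at a loss quantile, fit a density to each group, and propose the candidate with the largest ratio l(x)/g(x). Here `gaussian_kde` stands in for TPE's Parzen estimators, and the data are synthetic:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)

# Synthetic history of (hyperparameter, loss) observations; optimum near x = 0.3
xs = rng.uniform(0, 1, 100)
losses = (xs - 0.3) ** 2 + rng.normal(0, 0.01, 100)

# Split observations at the gamma quantile of the losses
gamma = 0.25
threshold = np.quantile(losses, gamma)
good, bad = xs[losses <= threshold], xs[losses > threshold]

l = gaussian_kde(good)  # density of good configurations, l(x)
g = gaussian_kde(bad)   # density of bad configurations, g(x)

# Propose the candidate that maximizes the ratio l(x)/g(x)
candidates = rng.uniform(0, 1, 64)
best = candidates[int(np.argmax(l(candidates) / g(candidates)))]
print(f"proposed x = {best:.3f}")
```

Because l(x) peaks where past trials did well and g(x) dips there, the ratio steers proposals toward the promising region without evaluating the expensive objective.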

Implementation and Use Cases

# Requirements:
# - Python 3.9+
# - optuna>=3.2.0

"""
Example: Implementation and Use Cases

Purpose: Demonstrate BOHB-style search (TPE sampler + Hyperband pruner)
Target: Advanced
Execution time: 10-30 seconds
Dependencies: None
"""

import optuna
from optuna.samplers import TPESampler
from optuna.pruners import HyperbandPruner
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# BOHB configuration (TPE + Hyperband)
sampler = TPESampler(seed=42, n_startup_trials=10)
pruner = HyperbandPruner(
    min_resource=5,
    max_resource=100,
    reduction_factor=3
)

def objective_bohb(trial):
    # Hyperparameter proposals (TPE selects intelligently)
    hidden_layer_size = trial.suggest_int('hidden_layer_size', 50, 200)
    alpha = trial.suggest_float('alpha', 1e-5, 1e-1, log=True)
    learning_rate_init = trial.suggest_float('learning_rate_init', 1e-4, 1e-1, log=True)

    X, y = load_digits(return_X_y=True)

    # Hyperband: gradually increase max_iter
    for step in range(1, 6):
        max_iter = int(100 * step / 5)

        model = MLPClassifier(
            hidden_layer_sizes=(hidden_layer_size,),
            alpha=alpha,
            learning_rate_init=learning_rate_init,
            max_iter=max_iter,
            random_state=42
        )

        score = cross_val_score(model, X, y, cv=3, n_jobs=-1).mean()

        trial.report(score, step)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return score

# BOHB study
study_bohb = optuna.create_study(
    direction='maximize',
    sampler=sampler,
    pruner=pruner,
    study_name='bohb_example'
)

study_bohb.optimize(objective_bohb, n_trials=50, timeout=180)

print("\n=== BOHB Optimization Results ===")
print(f"Best Score: {study_bohb.best_value:.4f}")
print(f"Best Parameters:")
for key, value in study_bohb.best_params.items():
    print(f"  {key}: {value}")
print(f"\nCompleted/Pruned: {len([t for t in study_bohb.trials if t.state == optuna.trial.TrialState.COMPLETE])}/{len([t for t in study_bohb.trials if t.state == optuna.trial.TrialState.PRUNED])}")

Use Cases

BOHB works best when model quality can be assessed at intermediate budgets, such as neural networks evaluated every few epochs, and when individual trials are expensive enough that both pruning and informed sampling pay off.

3.3 Population-based Training (PBT)

Principles of PBT

Population-based Training trains multiple models in parallel and periodically performs the following:

  1. Exploit: Replace poorly performing models with well-performing ones
  2. Explore: Perturb hyperparameters to try new configurations

Feature: The ability to dynamically adjust hyperparameters during training is the major difference from traditional methods.
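
The Exploit/Explore cycle can be sketched as a single-machine simulation. Everything here is illustrative: "training" is a toy score function of a single learning rate, and real PBT would also copy model weights between workers, not just hyperparameters.

```python
import random

random.seed(1)

# Population of 8 "workers", each with its own learning rate
population = [{"lr": random.uniform(0.001, 0.1), "score": 0.0} for _ in range(8)]

def evaluate(worker):
    """Toy metric: best score at lr = 0.01; real PBT would train a model here."""
    worker["score"] = 1.0 - 10.0 * abs(worker["lr"] - 0.01)

for generation in range(20):
    for w in population:
        evaluate(w)
    population.sort(key=lambda w: w["score"], reverse=True)
    # Exploit: the two worst workers copy hyperparameters from the two best
    for loser, winner in zip(population[-2:], population[:2]):
        loser["lr"] = winner["lr"]
        # Explore: perturb the copied hyperparameter by a random factor
        loser["lr"] *= random.choice([0.8, 1.2])

best = max(population, key=lambda w: w["score"])
print(f"best lr after PBT: {best['lr']:.4f}")
```

Over the generations the population drifts toward good learning rates: selection (exploit) concentrates workers near the current best, while the random perturbation (explore) keeps probing nearby values.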

PBT Workflow

graph TD
    A[Initialize Population<br/>n models] --> B[Train each model in parallel]
    B --> C{Periodic evaluation point}
    C --> D[Identify poorly performing models]
    D --> E[Copy weights from good models<br/>Exploit]
    E --> F[Perturb hyperparameters<br/>Explore]
    F --> G{Training complete?}
    G -->|No| B
    G -->|Yes| H[Select best model]
    style A fill:#ffebee
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style E fill:#e3f2fd
    style F fill:#e8f5e9
    style H fill:#c8e6c9

Implementation with Ray Tune

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - ray>=2.5.0

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining
import numpy as np

def train_function(config):
    """Training function (simulation)"""
    # Initial configuration
    learning_rate = config["lr"]
    momentum = config["momentum"]

    # Training simulation
    for step in range(100):
        # Dummy performance metric (actual model training in practice)
        # Good performance when learning rate and momentum are in appropriate ranges
        optimal_lr = 0.01
        optimal_momentum = 0.9

        score = 1.0 - (
            abs(learning_rate - optimal_lr) / optimal_lr +
            abs(momentum - optimal_momentum) / optimal_momentum
        ) / 2

        # Add noise to mimic realistic training
        score += np.random.normal(0, 0.05)

        # Report results to Ray Tune
        tune.report(score=score, lr=learning_rate, momentum=momentum)

# PBT scheduler configuration
pbt_scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="score",
    mode="max",
    perturbation_interval=10,  # Perturb every 10 iterations
    hyperparam_mutations={
        "lr": lambda: np.random.uniform(0.001, 0.1),
        "momentum": lambda: np.random.uniform(0.8, 0.99)
    }
)

# Ray Tune execution
analysis = tune.run(
    train_function,
    name="pbt_example",
    scheduler=pbt_scheduler,
    num_samples=8,  # Run 8 models in parallel
    config={
        "lr": tune.uniform(0.001, 0.1),
        "momentum": tune.uniform(0.8, 0.99)
    },
    stop={"training_iteration": 100},
    verbose=1
)

print("\n=== PBT Optimization Results ===")
best_config = analysis.get_best_config(metric="score", mode="max")
print(f"Best Configuration:")
print(f"  Learning Rate: {best_config['lr']:.4f}")
print(f"  Momentum: {best_config['momentum']:.4f}")
print(f"\nBest Score: {analysis.best_result['score']:.4f}")

Combination with Parallel Training

The greatest advantage of PBT is its ability to fully utilize parallel computational resources:

| Scenario | Traditional Methods | PBT |
|---|---|---|
| 8 GPUs for 100 epochs | Try 8 configurations sequentially (800 epochs worth of time) | Train 8 configurations simultaneously (100 epochs worth of time) |
| Dynamic adjustment | Not possible | Optimized during training |
| Resource efficiency | Poor configs run to completion | Early convergence to good configs |

3.4 Other Advanced Methods

Hyperopt (TPE Implementation)

Hyperopt is a popular library that implements Tree-structured Parzen Estimator (TPE).

# Requirements:
# - Python 3.9+
# - hyperopt>=0.2.7
# - numpy>=1.24.0, <2.0.0

"""
Example: Hyperopt (TPE) optimization of a GradientBoostingClassifier

Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 5-10 seconds
Dependencies: None
"""

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Define search space
space = {
    'n_estimators': hp.quniform('n_estimators', 50, 300, 1),
    'max_depth': hp.quniform('max_depth', 3, 15, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.001), np.log(0.3)),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
    'min_samples_split': hp.quniform('min_samples_split', 2, 20, 1)
}

# Data preparation
X, y = load_breast_cancer(return_X_y=True)

def objective_hyperopt(params):
    """Objective function for Hyperopt"""
    # Convert to integer types
    params['n_estimators'] = int(params['n_estimators'])
    params['max_depth'] = int(params['max_depth'])
    params['min_samples_split'] = int(params['min_samples_split'])

    model = GradientBoostingClassifier(**params, random_state=42)
    score = cross_val_score(model, X, y, cv=5, n_jobs=-1).mean()

    # Hyperopt minimizes, so return negative value
    return {'loss': -score, 'status': STATUS_OK}

# Optimization execution
trials = Trials()
best = fmin(
    fn=objective_hyperopt,
    space=space,
    algo=tpe.suggest,  # TPE algorithm
    max_evals=50,
    trials=trials,
    rstate=np.random.default_rng(42)
)

print("\n=== Hyperopt (TPE) Optimization Results ===")
print("Best Parameters:")
for key, value in best.items():
    print(f"  {key}: {value}")
print(f"\nBest Score: {-min(trials.losses()):.4f}")

SMAC (Random Forest based)

SMAC (Sequential Model-based Algorithm Configuration) uses random forests as surrogate models.
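
The random-forest surrogate idea can be sketched with scikit-learn. This is a conceptual illustration of SMAC's mechanism, not the SMAC library itself (whose actual API is built around ConfigSpace objects and optimization facades); the data, the UCB-style acquisition, and all names are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Observed trials: a 1-D hyperparameter x with a noisy score peaking near x = 0.6
X_obs = rng.uniform(0, 1, (20, 1))
y_obs = -(X_obs[:, 0] - 0.6) ** 2 + rng.normal(0, 0.01, 20)

# Random forest surrogate fitted to the trial history
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_obs, y_obs)

# Rank fresh candidates by predicted mean plus cross-tree spread (uncertainty)
candidates = rng.uniform(0, 1, (200, 1))
per_tree = np.stack([tree.predict(candidates) for tree in forest.estimators_])
acquisition = per_tree.mean(axis=0) + per_tree.std(axis=0)
next_x = candidates[int(np.argmax(acquisition)), 0]
print(f"next candidate: {next_x:.3f}")
```

Unlike a Gaussian process, a forest surrogate handles categorical and conditional parameters naturally, which is why SMAC is the recommended choice for such search spaces in the comparison table below.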

Features:

  • Random forest surrogates handle conditional and categorical parameters well
  • Suited to complex, structured search spaces
  • Forms the backbone of tools such as auto-sklearn

Ax/BoTorch (Facebook Research)

Ax and BoTorch are next-generation Bayesian optimization frameworks developed by Facebook Research.

# Requirements:
# - Python 3.9+
# - ax-platform>=0.2.0

from ax.service.ax_client import AxClient
from sklearn.datasets import load_wine
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Create Ax client
ax_client = AxClient()

# Define search space
ax_client.create_experiment(
    name="svm_optimization",
    parameters=[
        {"name": "C", "type": "range", "bounds": [0.1, 100.0], "log_scale": True},
        {"name": "gamma", "type": "range", "bounds": [0.0001, 1.0], "log_scale": True},
        {"name": "kernel", "type": "choice", "values": ["rbf", "poly", "sigmoid"]}
    ],
    objective_name="accuracy",
    minimize=False
)

# Data preparation
X, y = load_wine(return_X_y=True)

# Optimization loop
for i in range(30):
    # Propose next configuration
    parameters, trial_index = ax_client.get_next_trial()

    # Model evaluation
    model = SVC(**parameters, random_state=42)
    score = cross_val_score(model, X, y, cv=5, n_jobs=-1).mean()

    # Report results
    ax_client.complete_trial(trial_index=trial_index, raw_data=score)

# Get best configuration
best_parameters, metrics = ax_client.get_best_parameters()

print("\n=== Ax/BoTorch Optimization Results ===")
print("Best Parameters:")
for key, value in best_parameters.items():
    print(f"  {key}: {value}")
print(f"\nBest Accuracy: {metrics[0]['accuracy']:.4f}")
sem = metrics[1]['accuracy']['accuracy'] ** 0.5  # covariance -> standard error
print(f"Confidence Interval: [{metrics[0]['accuracy'] - sem:.4f}, "
      f"{metrics[0]['accuracy'] + sem:.4f}]")

Method Comparison Table

| Method | Surrogate Model | Strengths | Application Scenarios |
|---|---|---|---|
| Hyperopt (TPE) | Kernel density estimation | Simple, fast | General optimization |
| SMAC | Random forest | Conditional parameters | Complex search spaces |
| Ax/BoTorch | Gaussian process | Uncertainty estimation, multi-task | Research & experiments |
| Optuna | TPE/GP/CMA-ES | Flexible, pruning | Practical optimization |

3.5 Practical Application: Large-scale Tuning with Ray Tune

Ray Tune Setup

Ray Tune is a unified framework for distributed hyperparameter tuning.

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - ray>=2.5.0
# - torch>=2.0.0, <2.3.0

"""
Example: Ray Tune setup for distributed hyperparameter tuning

Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.bayesopt import BayesOptSearch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

# Initialize Ray
ray.init(ignore_reinit_error=True)

# Data preparation
X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=15,
    n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# PyTorch datasets
train_dataset = TensorDataset(
    torch.FloatTensor(X_train),
    torch.LongTensor(y_train)
)
test_dataset = TensorDataset(
    torch.FloatTensor(X_test),
    torch.LongTensor(y_test)
)

def train_model(config):
    """Training function for Ray Tune"""
    # Model definition
    model = nn.Sequential(
        nn.Linear(20, config["hidden_size_1"]),
        nn.ReLU(),
        nn.Dropout(config["dropout"]),
        nn.Linear(config["hidden_size_1"], config["hidden_size_2"]),
        nn.ReLU(),
        nn.Dropout(config["dropout"]),
        nn.Linear(config["hidden_size_2"], 2)
    )

    # Optimizer
    optimizer = optim.Adam(model.parameters(), lr=config["lr"])
    criterion = nn.CrossEntropyLoss()

    # Data loaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=config["batch_size"],
        shuffle=True
    )
    test_loader = DataLoader(test_dataset, batch_size=256)

    # Training loop
    for epoch in range(50):
        model.train()
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()

        # Validation
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for batch_X, batch_y in test_loader:
                outputs = model(batch_X)
                _, predicted = torch.max(outputs.data, 1)
                total += batch_y.size(0)
                correct += (predicted == batch_y).sum().item()

        accuracy = correct / total

        # Report to Ray Tune
        tune.report(accuracy=accuracy, epoch=epoch)

# Search space
search_space = {
    "hidden_size_1": tune.choice([32, 64, 128, 256]),
    "hidden_size_2": tune.choice([16, 32, 64, 128]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
    "dropout": tune.uniform(0.1, 0.5)
}

print("=== Ray Tune Setup Complete ===")
print(f"Search Space: {len(search_space)} dimensions")

Utilizing PBT Scheduler

from ray.tune.schedulers import PopulationBasedTraining

# PBT scheduler
pbt = PopulationBasedTraining(
    time_attr="epoch",
    metric="accuracy",
    mode="max",
    perturbation_interval=5,
    hyperparam_mutations={
        "lr": lambda: 10 ** np.random.uniform(-4, -1),
        "dropout": lambda: np.random.uniform(0.1, 0.5)
    }
)

# Ray Tune execution (PBT)
analysis_pbt = tune.run(
    train_model,
    name="pbt_neural_net",
    scheduler=pbt,
    num_samples=8,  # Run 8 models in parallel
    config=search_space,
    resources_per_trial={"cpu": 2, "gpu": 0},  # Change when using GPU
    verbose=1
)

print("\n=== PBT Execution Results ===")
best_trial_pbt = analysis_pbt.get_best_trial("accuracy", "max", "last")
print(f"Best Accuracy: {best_trial_pbt.last_result['accuracy']:.4f}")
print(f"Best Configuration:")
for key, value in best_trial_pbt.config.items():
    print(f"  {key}: {value}")

Distributed Execution

Ray Tune supports distributed execution across multiple machines:

# Requirements:
# - Python 3.9+
# - ray>=2.5.0
# - pandas>=2.0.0, <2.2.0

"""
Example: Distributed tuning with the ASHA scheduler and a search algorithm

Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

# ASHA scheduler + search algorithm
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch

# ASHA scheduler (asynchronous improvement of Hyperband)
asha_scheduler = ASHAScheduler(
    time_attr="epoch",
    max_t=50,              # Maximum epochs
    grace_period=5,        # Minimum epochs before a trial can be stopped
    reduction_factor=3     # Reduction rate
)

# Note: BayesOptSearch only supports continuous parameters, but the search
# space above contains tune.choice entries, so a TPE-based searcher
# (OptunaSearch, which handles categorical parameters) is used instead.
# It picks up metric/mode from tune.run below.
searcher = OptunaSearch()

# Distributed execution
analysis_distributed = tune.run(
    train_model,
    name="distributed_tuning",
    metric="accuracy",
    mode="max",
    scheduler=asha_scheduler,
    search_alg=searcher,
    num_samples=100,  # 100 trials
    config=search_space,
    resources_per_trial={"cpu": 2},
    verbose=1
)

print("\n=== Distributed Tuning Results ===")
best_trial = analysis_distributed.get_best_trial("accuracy", "max", "last")
print(f"Best Accuracy: {best_trial.last_result['accuracy']:.4f}")
print(f"\nTrial Statistics:")
print(f"  Completed Trials: {len(analysis_distributed.trials)}")
print(f"  Average Accuracy: {np.mean([t.last_result['accuracy'] for t in analysis_distributed.trials if 'accuracy' in t.last_result]):.4f}")

# Visualize results
import pandas as pd

df = analysis_distributed.results_df
print(f"\n=== Top 5 Configurations ===")
top5 = df.nlargest(5, 'accuracy')[['accuracy', 'config/hidden_size_1', 'config/lr', 'config/dropout']]
print(top5)

# Shutdown Ray
ray.shutdown()

Advantages of Ray Tune

| Feature | Description | Benefits |
|---|---|---|
| Unified API | Multiple schedulers/searchers behind one interface | Easy method switching |
| Distributed Execution | Automatic scaling across machines | Large-scale exploration possible |
| Early Stopping | ASHA, Hyperband, Median, etc. | Resource savings |
| Checkpointing | Interruption and resumption support | Safety for long-running tasks |
| Visualization | TensorBoard integration | Real-time monitoring |

3.6 Chapter Summary

What We Learned

  1. Hyperband

    • Efficient resource allocation with Successive Halving
    • Early elimination of poorly performing configurations
    • Easy implementation with Optuna
  2. BOHB

    • Fusion of Hyperband and TPE
    • Balances efficient resource allocation and intelligent search
    • Especially effective for neural networks
  3. Population-based Training

    • Dynamically adjusts hyperparameters during parallel training
    • Balances Exploit and Explore
    • Delivers true value in large-scale parallel environments
  4. Other Methods

    • Hyperopt: Simple and fast TPE implementation
    • SMAC: Strong with conditional parameters
    • Ax/BoTorch: State-of-the-art Bayesian optimization
  5. Ray Tune

    • Unified framework utilizing multiple methods
    • Large-scale tuning in distributed environments
    • Integration with practical tools

Method Selection Guidelines

| Scenario | Recommended Method | Reason |
|---|---|---|
| Limited compute resources | Hyperband | Efficient resource allocation |
| Neural networks | BOHB, PBT | Progressive learning and dynamic adjustment |
| Large-scale parallel environment | PBT, Ray Tune | Maximizes parallel resources |
| Conditional parameters | SMAC | Handles complex search spaces |
| Research & experiments | Ax/BoTorch | Cutting-edge methods and customizability |
| Practical projects | Optuna, Ray Tune | Usability and proven track record |

To the Next Chapter

In Chapter 4, we will learn practical optimization strategies.


References

  1. Li, L., et al. (2018). "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization". Journal of Machine Learning Research, 18(185), 1-52.
  2. Falkner, S., Klein, A., & Hutter, F. (2018). "BOHB: Robust and Efficient Hyperparameter Optimization at Scale". ICML 2018.
  3. Jaderberg, M., et al. (2017). "Population Based Training of Neural Networks". arXiv:1711.09846.
  4. Liaw, R., et al. (2018). "Tune: A Research Platform for Distributed Model Selection and Training". arXiv:1807.05118.
  5. Bergstra, J., et al. (2013). "Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures". ICML 2013.
