This chapter covers advanced hyperparameter tuning methods. You will learn the principles of Successive Halving and Hyperband, apply BOHB's fusion of Bayesian optimization and Hyperband, and run large-scale distributed tuning with Ray Tune.
Learning Objectives
By reading this chapter, you will be able to:
- ✅ Understand the principles of Successive Halving and Hyperband
- ✅ Utilize BOHB's fusion of Bayesian optimization and Hyperband
- ✅ Optimize parallel training with Population-based Training (PBT)
- ✅ Understand the characteristics of major libraries including Hyperopt, SMAC, and Ax/BoTorch
- ✅ Implement large-scale distributed tuning with Ray Tune
3.1 Hyperband
Principles of Successive Halving
Successive Halving is a method for efficiently allocating limited computational resources. The basic idea is as follows:
- Start training with many configurations using a small amount of resources
- Progressively eliminate poorly performing configurations, keeping only the top fraction each round (the best half in the simplest case)
- Allocate more resources to the remaining promising configurations
Important: By eliminating poorly performing configurations early, computational costs can be significantly reduced.
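The three steps above can be sketched in a few lines of plain Python. The toy objective (a score that peaks near lr = 0.1 and improves with more resource) and the population size of 16 are illustrative assumptions, not part of the algorithm:

```python
import random

random.seed(0)

def successive_halving(configs, evaluate, min_resource=1, eta=2):
    """Keep the best 1/eta of configurations each round, giving the
    survivors eta times more resource in the next round."""
    resource = min_resource
    while len(configs) > 1:
        # Evaluate every surviving configuration with the current budget
        scored = sorted(((evaluate(c, resource), c) for c in configs), reverse=True)
        # Keep the top 1/eta and grow the budget
        configs = [c for _, c in scored[: max(1, len(configs) // eta)]]
        resource *= eta
    return configs[0]

# Toy objective (assumption): improves with resource, peaks near lr = 0.1
def evaluate(lr, resource):
    return -abs(lr - 0.1) + 0.01 * resource

candidates = [random.uniform(0.001, 1.0) for _ in range(16)]
best_lr = successive_halving(candidates, evaluate)
print(f"best lr: {best_lr:.3f}")
```

With η = 2, the 16 candidates shrink to 8, 4, 2, then 1 while the per-candidate budget doubles each round.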
Algorithm Flow
Hyperband Algorithm
Hyperband runs Successive Halving with multiple different configurations to optimize resource allocation strategies.
Parameters:
- R: Maximum resources to allocate to one configuration (e.g., number of epochs)
- η: Reduction rate at each round (typically 3 or 4)
$$ s_{\max} = \lfloor \log_\eta(R) \rfloor $$
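Here s_max is the number of Successive Halving brackets Hyperband runs. For R = 81 and η = 3 the full bracket schedule can be computed directly; this sketch follows the schedule from Li et al. (2018), listing each bracket as (number of configurations, resource per configuration) per round:

```python
import math

def hyperband_schedule(R, eta):
    """Return, per bracket s, the (n_configs, resource) pairs across
    the successive-halving rounds inside that bracket."""
    # s_max = floor(log_eta(R)), computed with exact integer arithmetic
    s_max = 0
    while eta ** (s_max + 1) <= R:
        s_max += 1
    schedule = {}
    for s in range(s_max, -1, -1):
        # Initial number of configurations for this bracket
        n = math.ceil((s_max + 1) * eta**s / (s + 1))
        rounds = []
        for i in range(s + 1):
            # Survivors shrink by eta; resource per survivor grows by eta
            rounds.append((n // eta**i, round(R / eta**(s - i))))
        schedule[s] = rounds
    return schedule

for s, rounds in hyperband_schedule(R=81, eta=3).items():
    print(f"bracket s={s}: {rounds}")
```

The most aggressive bracket (s = 4) starts 81 configurations with 1 epoch each; the most conservative (s = 0) runs just 5 configurations for the full 81 epochs.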
Implementation in Optuna (HyperbandPruner)
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - optuna>=3.2.0
"""
Example: Implementation in Optuna (HyperbandPruner)
Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 10-30 seconds
Dependencies: None
"""
import optuna
from optuna.pruners import HyperbandPruner
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
# Hyperband configuration
pruner = HyperbandPruner(
min_resource=1, # Minimum resources (epochs)
max_resource=100, # Maximum resources
reduction_factor=3 # Reduction rate η
)
def objective(trial):
# Hyperparameter suggestions
n_estimators = trial.suggest_int('n_estimators', 10, 200)
max_depth = trial.suggest_int('max_depth', 2, 32)
min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)
# Data preparation
X, y = load_iris(return_X_y=True)
# Gradually increase n_estimators for evaluation (Hyperband compatible)
for step in range(1, 6):
# Number of trees according to current step
current_n_estimators = int(n_estimators * step / 5)
model = RandomForestClassifier(
n_estimators=current_n_estimators,
max_depth=max_depth,
min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
random_state=42
)
# Cross-validation score
score = cross_val_score(model, X, y, cv=3, n_jobs=-1).mean()
# Report intermediate value to Optuna
trial.report(score, step)
# Pruning decision
if trial.should_prune():
raise optuna.TrialPruned()
return score
# Study execution
study = optuna.create_study(
direction='maximize',
pruner=pruner,
study_name='hyperband_example'
)
study.optimize(objective, n_trials=100, timeout=300)
print("\n=== Hyperband Optimization Results ===")
print(f"Best Score: {study.best_value:.4f}")
print(f"Best Parameters: {study.best_params}")
print(f"\nCompleted Trials: {len([t for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE])}")
print(f"Pruned Trials: {len([t for t in study.trials if t.state == optuna.trial.TrialState.PRUNED])}")
Example Output:
=== Hyperband Optimization Results ===
Best Score: 0.9733
Best Parameters: {'n_estimators': 142, 'max_depth': 8, 'min_samples_split': 3, 'min_samples_leaf': 1}
Completed Trials: 28
Pruned Trials: 72
Effect: Out of 100 trials, 72 were pruned early, significantly reducing computation time.
3.2 BOHB (Bayesian Optimization and HyperBand)
Fusion of Hyperband and Bayesian Optimization
BOHB is a method that combines Hyperband's efficient resource allocation with Bayesian optimization's intelligent search.
| Method | Strengths | Weaknesses |
|---|---|---|
| Hyperband | Efficient resource allocation | Random sampling |
| Bayesian Optimization | Intelligent search | Allocates full resources to every trial |
| BOHB | Efficient + Intelligent search | Complex implementation |
BOHB Operating Principles
- Manage resource allocation with the Hyperband framework
- At each round, use TPE (Tree-structured Parzen Estimator) to propose hyperparameters
- Learn from past trial results and preferentially explore promising regions
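The TPE step at the heart of BOHB can be illustrated with a minimal sketch: split past observations into "good" and "bad" groups, fit a density to each, and sample where the ratio l(x)/g(x) is largest. The quadratic objective, the 25% split, and the candidate count here are illustrative choices, not BOHB's actual defaults:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)

def objective(x):
    return (x - 2.0) ** 2          # toy objective, minimum at x = 2

# Warm-up: random observations
xs = list(rng.uniform(-5, 5, 20))
ys = [objective(x) for x in xs]

for _ in range(30):
    # Split observations into "good" (best 25%) and "bad" (the rest)
    order = np.argsort(ys)
    n_good = max(2, len(xs) // 4)
    good = np.array(xs)[order[:n_good]]
    bad = np.array(xs)[order[n_good:]]
    # Model each group with a kernel density estimate
    l_kde, g_kde = gaussian_kde(good), gaussian_kde(bad)
    # Propose candidates and keep the one maximizing l(x)/g(x)
    cand = rng.uniform(-5, 5, 64)
    best = cand[np.argmax(l_kde(cand) / (g_kde(cand) + 1e-12))]
    xs.append(float(best))
    ys.append(objective(best))

best_x = xs[int(np.argmin(ys))]
print(f"best x: {best_x:.3f}")
```

Because new samples concentrate where the "good" density dominates, the search quickly clusters around the optimum at x = 2.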
Implementation and Use Cases
# Requirements:
# - Python 3.9+
# - optuna>=3.2.0
"""
Example: Implementation and Use Cases
Purpose: Demonstrate neural network implementation
Target: Advanced
Execution time: 10-30 seconds
Dependencies: None
"""
import optuna
from optuna.samplers import TPESampler
from optuna.pruners import HyperbandPruner
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
# BOHB configuration (TPE + Hyperband)
sampler = TPESampler(seed=42, n_startup_trials=10)
pruner = HyperbandPruner(
min_resource=5,
max_resource=100,
reduction_factor=3
)
def objective_bohb(trial):
# Hyperparameter proposals (TPE selects intelligently)
hidden_layer_size = trial.suggest_int('hidden_layer_size', 50, 200)
alpha = trial.suggest_float('alpha', 1e-5, 1e-1, log=True)
learning_rate_init = trial.suggest_float('learning_rate_init', 1e-4, 1e-1, log=True)
X, y = load_digits(return_X_y=True)
# Hyperband: gradually increase max_iter
for step in range(1, 6):
max_iter = int(100 * step / 5)
model = MLPClassifier(
hidden_layer_sizes=(hidden_layer_size,),
alpha=alpha,
learning_rate_init=learning_rate_init,
max_iter=max_iter,
random_state=42
)
score = cross_val_score(model, X, y, cv=3, n_jobs=-1).mean()
trial.report(score, step)
if trial.should_prune():
raise optuna.TrialPruned()
return score
# BOHB study
study_bohb = optuna.create_study(
direction='maximize',
sampler=sampler,
pruner=pruner,
study_name='bohb_example'
)
study_bohb.optimize(objective_bohb, n_trials=50, timeout=180)
print("\n=== BOHB Optimization Results ===")
print(f"Best Score: {study_bohb.best_value:.4f}")
print(f"Best Parameters:")
for key, value in study_bohb.best_params.items():
print(f" {key}: {value}")
print(f"\nCompleted/Pruned: {len([t for t in study_bohb.trials if t.state == optuna.trial.TrialState.COMPLETE])}/{len([t for t in study_bohb.trials if t.state == optuna.trial.TrialState.PRUNED])}")
Use Cases
- Neural Networks: Gradually increase the number of epochs
- Ensemble Learning: Gradually increase the number of weak learners
- Large-scale Data: Gradually increase the number of data samples
3.3 Population-based Training (PBT)
Principles of PBT
Population-based Training trains multiple models in parallel and periodically performs the following:
- Exploit: Replace poorly performing models with well-performing ones
- Explore: Perturb hyperparameters to try new configurations
Feature: The ability to dynamically adjust hyperparameters during training is the major difference from traditional methods.
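The exploit/explore cycle can be simulated without any framework. In this toy sketch the "model" is just a learning rate scored by closeness to an assumed optimum of 0.01, and the ×0.8 / ×1.2 perturbation factors follow the style of the original PBT paper:

```python
import random

random.seed(1)

def score(lr):
    return -abs(lr - 0.01)          # toy score, best at lr = 0.01 (assumption)

# Population of 8 "workers", each with its own hyperparameter
population = [{"lr": random.uniform(0.0001, 0.1)} for _ in range(8)]

for _ in range(20):
    for w in population:
        w["score"] = score(w["lr"])               # one "training" interval
    population.sort(key=lambda w: w["score"], reverse=True)
    top, bottom = population[:2], population[-2:]
    for loser in bottom:
        winner = random.choice(top)
        loser["lr"] = winner["lr"]                # Exploit: copy the winner
        loser["lr"] *= random.choice([0.8, 1.2])  # Explore: perturb it

best = max(population, key=lambda w: w["score"])
print(f"best lr: {best['lr']:.4f}")
```

The bottom performers keep restarting from the current leaders with small perturbations, so the whole population drifts toward the optimum while training continues.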
PBT Workflow
- Initialize n models with different hyperparameters
- Train each model in parallel
- At periodic evaluation points, identify poorly performing models
- Exploit: copy weights from well-performing models into the poor ones
- Explore: perturb the copied hyperparameters to try new configurations
- Repeat until training is complete, then select the best model
Implementation with Ray Tune
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - ray>=2.5.0
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining
import numpy as np
def train_function(config):
"""Training function (simulation)"""
# Initial configuration
learning_rate = config["lr"]
momentum = config["momentum"]
# Training simulation
for step in range(100):
# Dummy performance metric (actual model training in practice)
# Good performance when learning rate and momentum are in appropriate ranges
optimal_lr = 0.01
optimal_momentum = 0.9
score = 1.0 - (
abs(learning_rate - optimal_lr) / optimal_lr +
abs(momentum - optimal_momentum) / optimal_momentum
) / 2
# Add noise to mimic realistic training
score += np.random.normal(0, 0.05)
        # Report results to Ray Tune
        # (newer Ray versions prefer the dict form: ray.train.report({"score": score}))
        tune.report(score=score, lr=learning_rate, momentum=momentum)
# PBT scheduler configuration
pbt_scheduler = PopulationBasedTraining(
time_attr="training_iteration",
metric="score",
mode="max",
perturbation_interval=10, # Perturb every 10 iterations
hyperparam_mutations={
"lr": lambda: np.random.uniform(0.001, 0.1),
"momentum": lambda: np.random.uniform(0.8, 0.99)
}
)
# Ray Tune execution
analysis = tune.run(
train_function,
name="pbt_example",
scheduler=pbt_scheduler,
num_samples=8, # Run 8 models in parallel
config={
"lr": tune.uniform(0.001, 0.1),
"momentum": tune.uniform(0.8, 0.99)
},
stop={"training_iteration": 100},
verbose=1
)
print("\n=== PBT Optimization Results ===")
best_config = analysis.get_best_config(metric="score", mode="max")
print(f"Best Configuration:")
print(f" Learning Rate: {best_config['lr']:.4f}")
print(f" Momentum: {best_config['momentum']:.4f}")
print(f"\nBest Score: {analysis.best_result['score']:.4f}")
Combination with Parallel Training
The greatest advantage of PBT is its ability to fully utilize parallel computational resources:
| Scenario | Traditional Methods | PBT |
|---|---|---|
| 8 GPUs for 100 epochs | Try 8 configurations sequentially (800 epochs worth of time) | Train 8 configurations simultaneously (100 epochs worth of time) |
| Dynamic adjustment | Not possible | Optimized during training |
| Resource efficiency | Poor configs run to completion | Early convergence to good configs |
3.4 Other Advanced Methods
Hyperopt (TPE Implementation)
Hyperopt is a popular library that implements Tree-structured Parzen Estimator (TPE).
# Requirements:
# - Python 3.9+
# - hyperopt>=0.2.7
# - numpy>=1.24.0, <2.0.0
"""
Example: Hyperparameter optimization with Hyperopt's TPE implementation
Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 5-10 seconds
Dependencies: None
"""
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
# Define search space
space = {
'n_estimators': hp.quniform('n_estimators', 50, 300, 1),
'max_depth': hp.quniform('max_depth', 3, 15, 1),
'learning_rate': hp.loguniform('learning_rate', np.log(0.001), np.log(0.3)),
'subsample': hp.uniform('subsample', 0.5, 1.0),
'min_samples_split': hp.quniform('min_samples_split', 2, 20, 1)
}
# Data preparation
X, y = load_breast_cancer(return_X_y=True)
def objective_hyperopt(params):
"""Objective function for Hyperopt"""
# Convert to integer types
params['n_estimators'] = int(params['n_estimators'])
params['max_depth'] = int(params['max_depth'])
params['min_samples_split'] = int(params['min_samples_split'])
model = GradientBoostingClassifier(**params, random_state=42)
score = cross_val_score(model, X, y, cv=5, n_jobs=-1).mean()
# Hyperopt minimizes, so return negative value
return {'loss': -score, 'status': STATUS_OK}
# Optimization execution
trials = Trials()
best = fmin(
fn=objective_hyperopt,
space=space,
algo=tpe.suggest, # TPE algorithm
max_evals=50,
trials=trials,
rstate=np.random.default_rng(42)
)
print("\n=== Hyperopt (TPE) Optimization Results ===")
print("Best Parameters:")
for key, value in best.items():
print(f" {key}: {value}")
print(f"\nBest Score: {-min(trials.losses()):.4f}")
SMAC (Random Forest based)
SMAC (Sequential Model-based Algorithm Configuration) uses random forests as surrogate models.
Features:
- Strong with categorical variables and conditional parameters
- Excellent uncertainty estimation
- Robust to noisy objective functions
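SMAC's own API lives in the `smac` package, but its core idea, a random-forest surrogate whose per-tree disagreement provides the uncertainty estimate, can be sketched with scikit-learn. Everything below (the noisy 1-D objective, the lower-confidence-bound rule, the candidate counts) is an illustrative toy, not SMAC's actual implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def objective(x):
    # Noisy toy objective with its minimum at x = 3 (assumption)
    return (x - 3.0) ** 2 + rng.normal(0, 0.1)

# Warm-up: a few random observations
X = list(rng.uniform(0, 10, 8))
y = [objective(x) for x in X]

for _ in range(25):
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(np.array(X).reshape(-1, 1), y)
    cand = rng.uniform(0, 10, 200).reshape(-1, 1)
    # Per-tree predictions: mean = surrogate estimate, std = uncertainty
    per_tree = np.stack([tree.predict(cand) for tree in rf.estimators_])
    mu, sigma = per_tree.mean(axis=0), per_tree.std(axis=0)
    # Lower confidence bound: prefer low predicted loss OR high uncertainty
    pick = float(cand[np.argmin(mu - sigma)][0])
    X.append(pick)
    y.append(objective(pick))

best_x = X[int(np.argmin(y))]
print(f"best x: {best_x:.2f}")
```

Because the forest averages noisy observations and the trees disagree most in unexplored regions, this kind of surrogate stays robust to noise while still exploring, which is exactly the combination the feature list above describes.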
Ax/BoTorch (Facebook Research)
Ax and BoTorch are next-generation Bayesian optimization frameworks developed by Facebook Research (Meta).
# Requirements:
# - Python 3.9+
# - ax-platform
from ax.service.ax_client import AxClient
from sklearn.datasets import load_wine
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
# Create Ax client
ax_client = AxClient()
# Define search space
ax_client.create_experiment(
name="svm_optimization",
parameters=[
{"name": "C", "type": "range", "bounds": [0.1, 100.0], "log_scale": True},
{"name": "gamma", "type": "range", "bounds": [0.0001, 1.0], "log_scale": True},
{"name": "kernel", "type": "choice", "values": ["rbf", "poly", "sigmoid"]}
],
objective_name="accuracy",
minimize=False
)
# Data preparation
X, y = load_wine(return_X_y=True)
# Optimization loop
for i in range(30):
# Propose next configuration
parameters, trial_index = ax_client.get_next_trial()
# Model evaluation
model = SVC(**parameters, random_state=42)
score = cross_val_score(model, X, y, cv=5, n_jobs=-1).mean()
# Report results
ax_client.complete_trial(trial_index=trial_index, raw_data=score)
# Get best configuration
best_parameters, metrics = ax_client.get_best_parameters()
print("\n=== Ax/BoTorch Optimization Results ===")
print("Best Parameters:")
for key, value in best_parameters.items():
print(f" {key}: {value}")
print(f"\nBest Accuracy: {metrics[0]['accuracy']:.4f}")
# metrics[1] holds the covariance; take the square root for a standard error
sem = metrics[1]['accuracy']['accuracy'] ** 0.5
print(f"95% Confidence Interval: [{metrics[0]['accuracy'] - 1.96 * sem:.4f}, "
      f"{metrics[0]['accuracy'] + 1.96 * sem:.4f}]")
Method Comparison Table
| Method | Surrogate Model | Strengths | Application Scenarios |
|---|---|---|---|
| Hyperopt (TPE) | Kernel density estimation | Simple, fast | General optimization |
| SMAC | Random Forest | Conditional parameters | Complex search spaces |
| Ax/BoTorch | Gaussian Process | Uncertainty estimation, multi-task | Research & experiments |
| Optuna | TPE/GP/CMA-ES | Flexible, pruning | Practical optimization |
3.5 Practical Application: Large-scale Tuning with Ray Tune
Ray Tune Setup
Ray Tune is a unified framework for distributed hyperparameter tuning.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - ray>=2.5.0
# - torch>=2.0.0, <2.3.0
"""
Example: Large-scale distributed tuning with Ray Tune
Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.bayesopt import BayesOptSearch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np
# Initialize Ray
ray.init(ignore_reinit_error=True)
# Data preparation
X, y = make_classification(
n_samples=10000, n_features=20, n_informative=15,
n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# PyTorch datasets
train_dataset = TensorDataset(
torch.FloatTensor(X_train),
torch.LongTensor(y_train)
)
test_dataset = TensorDataset(
torch.FloatTensor(X_test),
torch.LongTensor(y_test)
)
def train_model(config):
"""Training function for Ray Tune"""
# Model definition
model = nn.Sequential(
nn.Linear(20, config["hidden_size_1"]),
nn.ReLU(),
nn.Dropout(config["dropout"]),
nn.Linear(config["hidden_size_1"], config["hidden_size_2"]),
nn.ReLU(),
nn.Dropout(config["dropout"]),
nn.Linear(config["hidden_size_2"], 2)
)
# Optimizer
optimizer = optim.Adam(model.parameters(), lr=config["lr"])
criterion = nn.CrossEntropyLoss()
# Data loaders
train_loader = DataLoader(
train_dataset,
batch_size=config["batch_size"],
shuffle=True
)
test_loader = DataLoader(test_dataset, batch_size=256)
# Training loop
for epoch in range(50):
model.train()
for batch_X, batch_y in train_loader:
optimizer.zero_grad()
outputs = model(batch_X)
loss = criterion(outputs, batch_y)
loss.backward()
optimizer.step()
# Validation
model.eval()
correct = 0
total = 0
with torch.no_grad():
for batch_X, batch_y in test_loader:
outputs = model(batch_X)
_, predicted = torch.max(outputs.data, 1)
total += batch_y.size(0)
correct += (predicted == batch_y).sum().item()
accuracy = correct / total
        # Report to Ray Tune
        # (newer Ray versions prefer the dict form: ray.train.report({"accuracy": accuracy}))
        tune.report(accuracy=accuracy, epoch=epoch)
# Search space
search_space = {
"hidden_size_1": tune.choice([32, 64, 128, 256]),
"hidden_size_2": tune.choice([16, 32, 64, 128]),
"lr": tune.loguniform(1e-4, 1e-1),
"batch_size": tune.choice([32, 64, 128]),
"dropout": tune.uniform(0.1, 0.5)
}
print("=== Ray Tune Setup Complete ===")
print(f"Search Space: {len(search_space)} dimensions")
Utilizing PBT Scheduler
from ray.tune.schedulers import PopulationBasedTraining
# PBT scheduler
pbt = PopulationBasedTraining(
time_attr="epoch",
metric="accuracy",
mode="max",
perturbation_interval=5,
hyperparam_mutations={
"lr": lambda: 10 ** np.random.uniform(-4, -1),
"dropout": lambda: np.random.uniform(0.1, 0.5)
}
)
# Ray Tune execution (PBT)
analysis_pbt = tune.run(
train_model,
name="pbt_neural_net",
scheduler=pbt,
num_samples=8, # Run 8 models in parallel
config=search_space,
resources_per_trial={"cpu": 2, "gpu": 0}, # Change when using GPU
verbose=1
)
print("\n=== PBT Execution Results ===")
best_trial_pbt = analysis_pbt.get_best_trial("accuracy", "max", "last")
print(f"Best Accuracy: {best_trial_pbt.last_result['accuracy']:.4f}")
print(f"Best Configuration:")
for key, value in best_trial_pbt.config.items():
print(f" {key}: {value}")
Distributed Execution
Ray Tune supports distributed execution across multiple machines:
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - ray>=2.5.0
"""
Example: Distributed execution across multiple machines with Ray Tune
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
# ASHA scheduler + Bayesian-style search algorithm
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch
# ASHA scheduler (improved version of Hyperband)
asha_scheduler = ASHAScheduler(
    max_t=50,            # Maximum epochs
    grace_period=5,      # Minimum epochs
    reduction_factor=3   # Reduction rate
)
# Search algorithm (requires the optuna package)
# Note: BayesOptSearch handles only continuous parameters, but search_space
# contains tune.choice() entries, so the TPE-based OptunaSearch (also a
# Bayesian method) is used here instead.
bayesopt = OptunaSearch(
    metric="accuracy",
    mode="max"
)
# Distributed execution
analysis_distributed = tune.run(
train_model,
name="distributed_tuning",
scheduler=asha_scheduler,
search_alg=bayesopt,
num_samples=100, # 100 trials
config=search_space,
resources_per_trial={"cpu": 2},
verbose=1
)
print("\n=== Distributed Tuning Results ===")
best_trial = analysis_distributed.get_best_trial("accuracy", "max", "last")
print(f"Best Accuracy: {best_trial.last_result['accuracy']:.4f}")
print(f"\nTrial Statistics:")
print(f" Completed Trials: {len(analysis_distributed.trials)}")
print(f" Average Accuracy: {np.mean([t.last_result['accuracy'] for t in analysis_distributed.trials if 'accuracy' in t.last_result]):.4f}")
# Visualize results
import pandas as pd
df = analysis_distributed.results_df
print(f"\n=== Top 5 Configurations ===")
top5 = df.nlargest(5, 'accuracy')[['accuracy', 'config/hidden_size_1', 'config/lr', 'config/dropout']]
print(top5)
# Shutdown Ray
ray.shutdown()
Advantages of Ray Tune
| Feature | Description | Benefits |
|---|---|---|
| Unified API | Multiple schedulers/searchers with unified interface | Easy method switching |
| Distributed Execution | Automatic scaling across machines | Large-scale exploration possible |
| Early Stopping | ASHA, Hyperband, Median, etc. | Resource savings |
| Checkpointing | Interruption and resumption support | Safety for long-running tasks |
| Visualization | TensorBoard integration | Real-time monitoring |
3.6 Chapter Summary
What We Learned
Hyperband
- Efficient resource allocation with Successive Halving
- Early elimination of poorly performing configurations
- Easy implementation with Optuna
BOHB
- Fusion of Hyperband and TPE
- Balances efficient resource allocation and intelligent search
- Especially effective for neural networks
Population-based Training
- Dynamically adjusts hyperparameters during parallel training
- Balances Exploit and Explore
- Delivers true value in large-scale parallel environments
Other Methods
- Hyperopt: Simple and fast TPE implementation
- SMAC: Strong with conditional parameters
- Ax/BoTorch: State-of-the-art Bayesian optimization
Ray Tune
- Unified framework utilizing multiple methods
- Large-scale tuning in distributed environments
- Integration with practical tools
Method Selection Guidelines
| Scenario | Recommended Method | Reason |
|---|---|---|
| Limited compute resources | Hyperband | Efficient resource allocation |
| Neural networks | BOHB, PBT | Progressive learning and dynamic adjustment |
| Large-scale parallel environment | PBT, Ray Tune | Maximizes parallel resources |
| Conditional parameters | SMAC | Handles complex search spaces |
| Research & experiments | Ax/BoTorch | Cutting-edge methods and customizability |
| Practical projects | Optuna, Ray Tune | Usability and proven track record |
To the Next Chapter
In Chapter 4, we will learn practical optimization strategies:
- Best practices for search space design
- Optimization of parallelization and distributed execution
- Result analysis and visualization
- Deployment to production environments
References
- Li, L., et al. (2018). "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization". Journal of Machine Learning Research, 18(185), 1-52.
- Falkner, S., Klein, A., & Hutter, F. (2018). "BOHB: Robust and Efficient Hyperparameter Optimization at Scale". ICML 2018.
- Jaderberg, M., et al. (2017). "Population Based Training of Neural Networks". arXiv:1711.09846.
- Liaw, R., et al. (2018). "Tune: A Research Platform for Distributed Model Selection and Training". arXiv:1807.05118.
- Bergstra, J., et al. (2013). "Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures". ICML 2013.