This chapter introduces the basics of hyperparameter tuning. You will learn the difference between hyperparameters and model parameters, and why tuning matters.
Learning Objectives
By reading this chapter, you will master the following:
- Understand the difference between hyperparameters and model parameters
- Learn the importance of tuning and designing search spaces
- Master the mechanism and implementation of grid search
- Understand the advantages and usage of random search
- Combine cross-validation with hyperparameter search
- Execute practical tuning with scikit-learn
1.1 What Are Hyperparameters?
Difference from Model Parameters
Hyperparameters are values set by humans before training that control the structure and learning process of a model.
| Type | Definition | Examples | How Determined |
|---|---|---|---|
| Model Parameters | Automatically optimized through learning | Linear regression coefficients, neural network weights | Learned from training data |
| Hyperparameters | Set by humans before learning | Learning rate, tree depth, regularization coefficient | Trial and error, search algorithms |
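To make the distinction concrete, here is a minimal sketch (assuming scikit-learn is installed): the regularization strength C of a LogisticRegression is a hyperparameter we choose before fitting, while the coefficients stored in coef_ are model parameters learned from the data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# C is a hyperparameter: we set it before training
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X, y)

# coef_ and intercept_ are model parameters: learned from the training data
print(f"Hyperparameter C: {model.C}")
print(f"Learned coefficient matrix shape: {model.coef_.shape}")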
Key Hyperparameters
| Algorithm | Key Hyperparameters | Role |
|---|---|---|
| Random Forest | n_estimators, max_depth, min_samples_split | Number of trees, depth, split conditions |
| XGBoost | learning_rate, max_depth, n_estimators, subsample | Learning speed, complexity, sampling |
| SVM | C, kernel, gamma | Regularization, kernel, influence range |
| Neural Network | learning_rate, batch_size, hidden_layers | Learning speed, batch size, structure |
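Each scikit-learn estimator exposes its hyperparameters and their current values through get_params(); a quick way to check what is tunable before designing a search space is sketched below.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# get_params() returns a dict of hyperparameter names and current values
print(sorted(RandomForestClassifier().get_params().keys()))
print(sorted(SVC().get_params().keys()))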
Importance of Tuning
Depending on the algorithm and dataset, proper hyperparameter settings can improve model performance substantially over the defaults, in some cases by 10-30% or more.
Designing the Search Space
The search space is the range of candidate values for each hyperparameter. Proper design is crucial.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Defining a search space (the range of candidate values for each hyperparameter)
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 5-10 seconds
Dependencies: None
"""
import numpy as np
# Example search space definition
param_space = {
'n_estimators': [50, 100, 200, 300], # Number of trees
'max_depth': [5, 10, 15, 20, None], # Maximum depth
'min_samples_split': [2, 5, 10], # Minimum samples to split
'min_samples_leaf': [1, 2, 4], # Minimum samples per leaf
'max_features': ['sqrt', 'log2', None] # Number of features for split
}
print("=== Search Space Overview ===")
print(f"n_estimators: {len(param_space['n_estimators'])} options")
print(f"max_depth: {len(param_space['max_depth'])} options")
print(f"min_samples_split: {len(param_space['min_samples_split'])} options")
print(f"min_samples_leaf: {len(param_space['min_samples_leaf'])} options")
print(f"max_features: {len(param_space['max_features'])} options")
total_combinations = np.prod([len(v) for v in param_space.values()])
print(f"\nTotal combinations: {total_combinations:,}")
Output:
=== Search Space Overview ===
n_estimators: 4 options
max_depth: 5 options
min_samples_split: 3 options
min_samples_leaf: 3 options
max_features: 3 options
Total combinations: 540
Important: If the search space is too wide, computational costs become enormous. Utilize domain knowledge and empirical ranges.
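For hyperparameters that span several orders of magnitude, such as a learning rate or a regularization coefficient, log-spaced candidates usually cover the range more sensibly than linearly spaced ones. A small sketch (the specific range 0.001-0.3 is only an illustrative assumption):
import numpy as np

# Linearly spaced candidates cluster at the upper end of the range
linear_lr = np.linspace(0.001, 0.3, 5)

# Log-spaced candidates cover each order of magnitude evenly
log_lr = np.logspace(-3, np.log10(0.3), 5)

print("Linear spacing:", np.round(linear_lr, 4))
print("Log spacing:   ", np.round(log_lr, 4))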
1.2 Grid Search
Mechanism and Implementation
Grid Search exhaustively explores all combinations of specified hyperparameters.
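Conceptually, grid search is a nested loop over every combination of candidate values, scoring each one with cross-validation. The sketch below, using a deliberately tiny two-parameter grid, mirrors what GridSearchCV automates in the next example.
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

param_grid = {'n_estimators': [50, 100], 'max_depth': [5, None]}

best_score, best_params = -1.0, None
for n_est, depth in product(param_grid['n_estimators'], param_grid['max_depth']):
    model = RandomForestClassifier(n_estimators=n_est, max_depth=depth, random_state=42)
    score = cross_val_score(model, X, y, cv=3).mean()  # mean CV accuracy for this combination
    if score > best_score:
        best_score, best_params = score, {'n_estimators': n_est, 'max_depth': depth}

print("Best:", best_params, f"CV accuracy: {best_score:.4f}")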
scikit-learn GridSearchCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import time
# Data preparation
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Parameter grid for grid search
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10]
}
# GridSearchCV configuration
grid_search = GridSearchCV(
estimator=RandomForestClassifier(random_state=42),
param_grid=param_grid,
cv=5, # 5-fold cross-validation
scoring='accuracy', # Evaluation metric
n_jobs=-1, # Use all CPU cores
verbose=2 # Detailed output
)
# Execute grid search
print("=== Starting Grid Search ===")
start_time = time.time()
grid_search.fit(X_train, y_train)
elapsed_time = time.time() - start_time
# Display results
print(f"\nExecution time: {elapsed_time:.2f} seconds")
print(f"\nBest parameters:")
print(grid_search.best_params_)
print(f"\nBest score (cross-validation): {grid_search.best_score_:.4f}")
# Evaluate on test data
y_pred = grid_search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test data accuracy: {test_accuracy:.4f}")
Example Output:
=== Starting Grid Search ===
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Execution time: 12.34 seconds
Best parameters:
{'max_depth': 15, 'min_samples_split': 2, 'n_estimators': 200}
Best score (cross-validation): 0.9648
Test data accuracy: 0.9737
Detailed Analysis of Search Results
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Detailed Analysis of Search Results
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: grid_search fitted in the previous example
"""
import pandas as pd
import matplotlib.pyplot as plt
# Convert results to DataFrame
results_df = pd.DataFrame(grid_search.cv_results_)
# Extract important columns only
results_summary = results_df[[
'param_n_estimators',
'param_max_depth',
'param_min_samples_split',
'mean_test_score',
'std_test_score',
'rank_test_score'
]].sort_values('rank_test_score')
print("\n=== Top 5 Combinations ===")
print(results_summary.head(5))
# Visualization: Parameter influence analysis
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Influence of n_estimators
results_df.groupby('param_n_estimators')['mean_test_score'].mean().plot(
kind='bar', ax=axes[0], color='steelblue'
)
axes[0].set_title('Influence of n_estimators', fontsize=12)
axes[0].set_ylabel('Average Score')
axes[0].grid(True, alpha=0.3)
# Influence of max_depth
results_df.groupby('param_max_depth')['mean_test_score'].mean().plot(
kind='bar', ax=axes[1], color='forestgreen'
)
axes[1].set_title('Influence of max_depth', fontsize=12)
axes[1].set_ylabel('Average Score')
axes[1].grid(True, alpha=0.3)
# Influence of min_samples_split
results_df.groupby('param_min_samples_split')['mean_test_score'].mean().plot(
kind='bar', ax=axes[2], color='coral'
)
axes[2].set_title('Influence of min_samples_split', fontsize=12)
axes[2].set_ylabel('Average Score')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Pros and Cons
| Aspect | Details |
|---|---|
| Pros | Exhaustive search guarantees the best combination within the grid is found; simple to implement and understand; easy to parallelize |
| Cons | Computational cost grows exponentially with the number of parameters; unsuitable for high-dimensional search; continuous-valued parameters must be discretized |
| Use Cases | Few parameters (around 2-4); few candidates per parameter; sufficient computational resources |
1.3 Random Search
Benefits of Probabilistic Search
Random Search randomly samples parameter combinations from the search space.
Bergstra & Bengio (2012) showed that when only a few hyperparameters strongly affect performance, random search finds good configurations more efficiently than grid search with the same evaluation budget.
(Diagram: Grid Search performs an exhaustive search with high computational cost, while Random Search uses random sampling with low computational cost.)
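Before using RandomizedSearchCV, the sampling idea itself can be seen with scikit-learn's ParameterSampler, which draws a fixed number of configurations from the given distributions (the ranges below are illustrative):
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(5, 30),
    'max_features': uniform(0.1, 0.9)
}

# Draw 5 random configurations; n_iter controls the budget directly
for params in ParameterSampler(param_distributions, n_iter=5, random_state=42):
    print(params)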
RandomizedSearchCV
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: RandomizedSearchCV
Purpose: Demonstrate machine learning model training and evaluation
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: X_train, y_train, and imports from the grid search example above
"""
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
import numpy as np
# Distribution definition for random search
param_distributions = {
'n_estimators': randint(50, 500), # Integer 50-500
'max_depth': randint(5, 30), # Integer 5-30
'min_samples_split': randint(2, 20), # Integer 2-20
'min_samples_leaf': randint(1, 10), # Integer 1-10
'max_features': uniform(0.1, 0.9) # Real 0.1-1.0
}
# RandomizedSearchCV configuration
random_search = RandomizedSearchCV(
estimator=RandomForestClassifier(random_state=42),
param_distributions=param_distributions,
n_iter=100, # 100 random samplings
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=2,
random_state=42,
return_train_score=True  # keep train scores for the overfitting analysis in Section 1.4
)
# Execute random search
print("=== Starting Random Search ===")
start_time = time.time()
random_search.fit(X_train, y_train)
elapsed_time = time.time() - start_time
print(f"\nExecution time: {elapsed_time:.2f} seconds")
print(f"\nBest parameters:")
print(random_search.best_params_)
print(f"\nBest score (cross-validation): {random_search.best_score_:.4f}")
# Evaluate on test data
y_pred_random = random_search.predict(X_test)
test_accuracy_random = accuracy_score(y_test, y_pred_random)
print(f"Test data accuracy: {test_accuracy_random:.4f}")
Example Output:
=== Starting Random Search ===
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Execution time: 18.56 seconds
Best parameters:
{'max_depth': 18, 'max_features': 0.7234, 'min_samples_leaf': 1,
'min_samples_split': 2, 'n_estimators': 387}
Best score (cross-validation): 0.9692
Test data accuracy: 0.9825
Comparison with Grid Search
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Comparison with Grid Search
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: grid_search, random_search, pandas, and the test accuracies from the earlier examples
"""
import matplotlib.pyplot as plt
# Build the comparison table (execution times are the example values from the runs above)
comparison_data = {
'Grid Search': {
'Search Count': len(grid_search.cv_results_['params']),
'Execution Time': 12.34,
'CV Accuracy': grid_search.best_score_,
'Test Accuracy': test_accuracy
},
'Random Search': {
'Search Count': len(random_search.cv_results_['params']),
'Execution Time': 18.56,
'CV Accuracy': random_search.best_score_,
'Test Accuracy': test_accuracy_random
}
}
# Convert to DataFrame
comparison_df = pd.DataFrame(comparison_data).T
print("\n=== Grid Search vs Random Search ===")
print(comparison_df)
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Search count
comparison_df['Search Count'].plot(kind='bar', ax=axes[0], color=['steelblue', 'coral'])
axes[0].set_title('Search Count Comparison', fontsize=12)
axes[0].set_ylabel('Count')
axes[0].grid(True, alpha=0.3)
# Execution time
comparison_df['Execution Time'].plot(kind='bar', ax=axes[1], color=['steelblue', 'coral'])
axes[1].set_title('Execution Time Comparison', fontsize=12)
axes[1].set_ylabel('Seconds')
axes[1].grid(True, alpha=0.3)
# Accuracy
comparison_df[['CV Accuracy', 'Test Accuracy']].plot(kind='bar', ax=axes[2])
axes[2].set_title('Accuracy Comparison', fontsize=12)
axes[2].set_ylabel('Accuracy')
axes[2].set_ylim([0.95, 1.0])
axes[2].legend(['CV Accuracy', 'Test Accuracy'])
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Advantages of Random Search
| Aspect | Grid Search | Random Search |
|---|---|---|
| Computational Efficiency | Search count = all combinations | Search count can be specified |
| Continuous Value Support | Discrete values only | Direct sampling from continuous distributions |
| Handling Importance | All explored equally | Can explore important parameter ranges widely |
| High-Dimensional Search | Exponential growth with dimensions | Cost fixed by the chosen number of iterations |
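When a continuous hyperparameter spans orders of magnitude, a log-uniform distribution is often a better sampling choice than a plain uniform one. A brief sketch using scipy.stats.loguniform (available in SciPy 1.4 and later), with an illustrative learning-rate range:
from scipy.stats import loguniform

# Sample learning rates spread evenly across orders of magnitude (1e-4 to 1e-1)
lr_dist = loguniform(1e-4, 1e-1)
print(lr_dist.rvs(size=5, random_state=42))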
1.4 Cross-Validation and Hyperparameter Search
Choosing CV Strategy
Cross-validation is essential for evaluating the generalization performance of hyperparameters.
| CV Method | Description | Use Case |
|---|---|---|
| K-Fold CV | Split data into K parts, evaluate K times | Standard scenarios (K=5 or 10) |
| Stratified K-Fold | Split while preserving class ratios | Classification problems, imbalanced data |
| Time Series Split | Preserve temporal ordering | Time series data |
| Leave-One-Out | Test one sample at a time | Small datasets (high computational cost) |
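The sketch below illustrates how the split strategy changes the folds on a tiny toy array: StratifiedKFold keeps the class ratio in every fold, while TimeSeriesSplit only ever tests on samples that come after the training ones.
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
y = np.array([0, 1] * 10)  # balanced binary labels

print("StratifiedKFold (class ratios preserved in each fold):")
for train_idx, test_idx in StratifiedKFold(n_splits=4).split(X, y):
    print("  test:", test_idx)

print("TimeSeriesSplit (test folds always come after the training folds):")
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("  train up to index", train_idx[-1], "-> test:", test_idx)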
Setting Evaluation Metrics
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score
# Compare with multiple evaluation metrics
scoring_metrics = {
'accuracy': 'accuracy',
'precision': make_scorer(precision_score, average='weighted'),
'recall': make_scorer(recall_score, average='weighted'),
'f1': make_scorer(f1_score, average='weighted')
}
# Cross-validation with Stratified K-Fold
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Apply multiple evaluation metrics to RandomizedSearchCV
random_search_multi = RandomizedSearchCV(
estimator=RandomForestClassifier(random_state=42),
param_distributions=param_distributions,
n_iter=50,
cv=cv_strategy,
scoring=scoring_metrics,
refit='f1', # Select best model based on F1 score
n_jobs=-1,
verbose=1,
random_state=42
)
random_search_multi.fit(X_train, y_train)
print("=== Results with Multiple Evaluation Metrics ===")
print(f"Best parameters (F1 criterion):")
print(random_search_multi.best_params_)
# Score for each metric
results = random_search_multi.cv_results_
best_index = random_search_multi.best_index_
print(f"\nBest model scores:")
for metric in scoring_metrics.keys():
score = results[f'mean_test_{metric}'][best_index]
std = results[f'std_test_{metric}'][best_index]
print(f" {metric}: {score:.4f} (Β±{std:.4f})")
Example Output:
=== Results with Multiple Evaluation Metrics ===
Best parameters (F1 criterion):
{'max_depth': 22, 'max_features': 0.6543, 'min_samples_leaf': 1,
'min_samples_split': 3, 'n_estimators': 298}
Best model scores:
accuracy: 0.9670 (±0.0123)
precision: 0.9678 (±0.0118)
recall: 0.9670 (±0.0123)
f1: 0.9672 (±0.0121)
Preventing Overfitting
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Preventing Overfitting
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: random_search from Section 1.3 (fitted with return_train_score=True) and pandas
"""
import matplotlib.pyplot as plt
# Compare mean train and CV test scores (requires return_train_score=True, set in Section 1.3)
results = random_search.cv_results_
train_scores = results['mean_train_score']
test_scores = results['mean_test_score']
# Detect overfitting
overfit_gap = train_scores - test_scores
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Score distribution
axes[0].scatter(train_scores, test_scores, alpha=0.6, s=50)
axes[0].plot([0.9, 1.0], [0.9, 1.0], 'r--', label='Ideal Line')
axes[0].set_xlabel('Train Score')
axes[0].set_ylabel('Test Score (CV)')
axes[0].set_title('Train vs Test Score', fontsize=12)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Overfitting gap
axes[1].hist(overfit_gap, bins=30, alpha=0.7, edgecolor='black')
axes[1].axvline(x=overfit_gap.mean(), color='r', linestyle='--',
label=f'Average Gap: {overfit_gap.mean():.4f}')
axes[1].set_xlabel('Overfitting Gap (Train - Test)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Degree of Overfitting', fontsize=12)
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Top 5 models with least overfitting
results_df = pd.DataFrame({
'rank': results['rank_test_score'],
'train_score': train_scores,
'test_score': test_scores,
'overfit_gap': overfit_gap
})
print("\n=== Top 5 Models with Least Overfitting ===")
print(results_df.nsmallest(5, 'overfit_gap'))
1.5 Practice: Basic Tuning with scikit-learn
Random Forest Tuning Example
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import time
# Generate data
X, y = make_classification(
n_samples=1000,
n_features=20,
n_informative=15,
n_redundant=5,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Performance with default settings
print("=== Random Forest Tuning ===\n")
rf_default = RandomForestClassifier(random_state=42)
rf_default.fit(X_train, y_train)
default_score = accuracy_score(y_test, rf_default.predict(X_test))
print(f"Default settings accuracy: {default_score:.4f}")
# Grid search
param_grid_rf = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
grid_rf = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid_rf,
cv=5,
scoring='accuracy',
n_jobs=-1
)
start = time.time()
grid_rf.fit(X_train, y_train)
elapsed = time.time() - start
# Performance after tuning
tuned_score = accuracy_score(y_test, grid_rf.predict(X_test))
print(f"\nBest parameters: {grid_rf.best_params_}")
print(f"Accuracy after tuning: {tuned_score:.4f}")
print(f"Improvement: {(tuned_score - default_score) * 100:.2f}%")
print(f"Execution time: {elapsed:.2f} seconds")
Example Output:
=== Random Forest Tuning ===
Default settings accuracy: 0.8700
Best parameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
Accuracy after tuning: 0.9250
Improvement: 5.50%
Execution time: 24.56 seconds
XGBoost Tuning Example
# Requirements:
# - Python 3.9+
# - xgboost>=2.0.0
"""
Example: XGBoost Tuning Example
Purpose: Demonstrate machine learning model training and evaluation
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: X_train, y_train, and imports from the Random Forest example above
"""
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
# XGBoost parameter distributions
param_dist_xgb = {
'n_estimators': randint(100, 500),
'max_depth': randint(3, 10),
'learning_rate': uniform(0.01, 0.3),
'subsample': uniform(0.6, 0.4),
'colsample_bytree': uniform(0.6, 0.4),
'gamma': uniform(0, 0.5)
}
# Default settings
print("\n=== XGBoost Tuning ===\n")
xgb_default = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
xgb_default.fit(X_train, y_train)
default_score_xgb = accuracy_score(y_test, xgb_default.predict(X_test))
print(f"Default settings accuracy: {default_score_xgb:.4f}")
# Random search
random_xgb = RandomizedSearchCV(
xgb.XGBClassifier(random_state=42, eval_metric='logloss'),
param_dist_xgb,
n_iter=100,
cv=5,
scoring='accuracy',
n_jobs=-1,
random_state=42
)
start = time.time()
random_xgb.fit(X_train, y_train)
elapsed = time.time() - start
# Performance after tuning
tuned_score_xgb = accuracy_score(y_test, random_xgb.predict(X_test))
print(f"\nBest parameters:")
for param, value in random_xgb.best_params_.items():
print(f" {param}: {value:.4f}" if isinstance(value, float) else f" {param}: {value}")
print(f"\nAccuracy after tuning: {tuned_score_xgb:.4f}")
print(f"Improvement: {(tuned_score_xgb - default_score_xgb) * 100:.2f}%")
print(f"Execution time: {elapsed:.2f} seconds")
Example Output:
=== XGBoost Tuning ===
Default settings accuracy: 0.9000
Best parameters:
colsample_bytree: 0.8234
gamma: 0.1234
learning_rate: 0.0876
max_depth: 7
n_estimators: 387
subsample: 0.8567
Accuracy after tuning: 0.9400
Improvement: 4.00%
Execution time: 42.18 seconds
Result Visualization
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Result Visualization
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: accuracy scores from the Random Forest and XGBoost examples above
"""
import matplotlib.pyplot as plt
import numpy as np
# Model comparison
models_comparison = {
'RF (Default)': default_score,
'RF (Tuned)': tuned_score,
'XGB (Default)': default_score_xgb,
'XGB (Tuned)': tuned_score_xgb
}
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Accuracy comparison
models = list(models_comparison.keys())
scores = list(models_comparison.values())
colors = ['lightcoral', 'lightgreen', 'lightcoral', 'lightgreen']
axes[0].bar(models, scores, color=colors, edgecolor='black', alpha=0.7)
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Model Performance Comparison', fontsize=14)
axes[0].set_ylim([0.8, 1.0])
axes[0].grid(True, alpha=0.3, axis='y')
for i, score in enumerate(scores):
axes[0].text(i, score + 0.01, f'{score:.4f}', ha='center', fontsize=10)
# Improvement rate
improvements = [
0,
(tuned_score - default_score) * 100,
0,
(tuned_score_xgb - default_score_xgb) * 100
]
axes[1].bar(models, improvements, color=colors, edgecolor='black', alpha=0.7)
axes[1].set_ylabel('Improvement (%)')
axes[1].set_title('Improvement from Tuning', fontsize=14)
axes[1].grid(True, alpha=0.3, axis='y')
for i, imp in enumerate(improvements):
if imp > 0:
axes[1].text(i, imp + 0.2, f'{imp:.2f}%', ha='center', fontsize=10)
plt.tight_layout()
plt.show()
1.6 Chapter Summary
What We Learned
Understanding Hyperparameters
- Difference from model parameters
- Key hyperparameters and their roles
- Proper design of search spaces
Grid Search
- Optimization through exhaustive search
- Using scikit-learn GridSearchCV
- Trade-off between computational cost and search efficiency
Random Search
- Efficiency of probabilistic sampling
- Direct search from continuous distributions
- Advantages over grid search
Importance of Cross-Validation
- Choosing appropriate CV strategies
- Comprehensive evaluation with multiple metrics
- Detecting and preventing overfitting
Practical Tuning
- Optimizing Random Forest and XGBoost
- Improvement from default settings
- Visualization and interpretation of results
Method Selection Guidelines
| Situation | Recommended Method | Reason |
|---|---|---|
| Few parameters (2-3) | Grid Search | Exhaustive search is practical |
| Many parameters (4+) | Random Search | Better computational efficiency |
| Continuous-valued parameters | Random Search | Direct sampling from distributions |
| Limited computational resources | Random Search | Search count can be controlled |
| Highest accuracy needed | Combine both | Two-stage: coarse to fine search |
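A common way to combine both methods is a two-stage, coarse-to-fine search: random search first locates a promising region, then a small grid search refines around the best configuration. A minimal sketch under the assumption that X_train and y_train from the earlier examples are available (the ranges and window sizes are illustrative):
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Stage 1: coarse random search over wide ranges
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': randint(50, 500), 'max_depth': randint(3, 30)},
    n_iter=20, cv=3, n_jobs=-1, random_state=42
)
coarse.fit(X_train, y_train)
best = coarse.best_params_

# Stage 2: fine grid search in a narrow window around the coarse optimum
n_grid = sorted({max(50, best['n_estimators'] - 50), best['n_estimators'], best['n_estimators'] + 50})
d_grid = sorted({max(2, best['max_depth'] - 2), best['max_depth'], best['max_depth'] + 2})
fine = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': n_grid, 'max_depth': d_grid},
    cv=5, n_jobs=-1
)
fine.fit(X_train, y_train)

print("Coarse best:", best)
print("Fine best:  ", fine.best_params_, f"(CV accuracy: {fine.best_score_:.4f})")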
To the Next Chapter
In Chapter 2, we will learn about Bayesian Optimization:
- Surrogate models using Gaussian processes
- Acquisition function design
- Implementation using Optuna
- Performance comparison with traditional methods
- Practical application examples
Exercises
Exercise 1 (Difficulty: easy)
Explain the difference between hyperparameters and model parameters from three perspectives (definition, determination method, examples).
Sample Answer
Answer:
| Perspective | Hyperparameters | Model Parameters |
|---|---|---|
| Definition | Values set by humans before training | Values automatically optimized through training |
| Determination Method | Trial and error, search algorithms, experience | Learned from training data via gradient descent, etc. |
| Examples | Learning rate, tree depth, regularization coefficient | Linear regression coefficients, neural network weights |
Additional Explanation:
- Hyperparameters control model structure and learning process
- Model parameters represent data patterns
- Appropriate hyperparameter selection makes model parameter learning more efficient
Exercise 2 (Difficulty: medium)
Calculate the total number of combinations for the following parameter grid and discuss the computational cost of grid search.
param_grid = {
'n_estimators': [100, 200, 300, 400, 500],
'max_depth': [5, 10, 15, 20, 25, 30],
'min_samples_split': [2, 5, 10, 15],
'learning_rate': [0.01, 0.05, 0.1, 0.2]
}
# Using 5-fold cross-validation
Sample Answer
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Calculate the total number of combinations for the given parameter grid
Purpose: Demonstrate core concepts and implementation patterns
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
import numpy as np
param_grid = {
'n_estimators': [100, 200, 300, 400, 500],
'max_depth': [5, 10, 15, 20, 25, 30],
'min_samples_split': [2, 5, 10, 15],
'learning_rate': [0.01, 0.05, 0.1, 0.2]
}
# Candidate count for each parameter
param_counts = [len(v) for v in param_grid.values()]
print("Candidate count per parameter:")
for param, count in zip(param_grid.keys(), param_counts):
print(f" {param}: {count}")
# Total combinations
total_combinations = np.prod(param_counts)
print(f"\nTotal combinations: {total_combinations:,}")
# Total training runs with 5-fold cross-validation
cv_folds = 5
total_fits = total_combinations * cv_folds
print(f"Total training runs with 5-fold CV: {total_fits:,}")
# Assuming 1 minute per training run
time_per_fit = 1 # minutes
total_time_minutes = total_fits * time_per_fit
total_time_hours = total_time_minutes / 60
print(f"\nComputation time (assuming 1 minute per training):")
print(f" {total_time_minutes:,} minutes")
print(f" {total_time_hours:.1f} hours")
Output:
Candidate count per parameter:
n_estimators: 5
max_depth: 6
min_samples_split: 4
learning_rate: 4
Total combinations: 480
Total training runs with 5-fold CV: 2,400
Computation time (assuming 1 minute per training):
2,400 minutes
40.0 hours
Discussion:
- Combinations increase exponentially as parameter count grows
- Cross-validation further increases computational cost
- This example requires approximately 40 hours of computation
- Random search with search count limited to 100 would take approximately 8.3 hours (500 training runs)
Exercise 3 (Difficulty: medium)
Compare the pros and cons of grid search and random search, and explain in which scenarios random search is advantageous.
Sample Answer
Answer:
| Aspect | Grid Search | Random Search |
|---|---|---|
| Search Method | Exhaustive all combinations | Random sampling |
| Computational Cost | Increases exponentially | Search count can be controlled |
| Optimal Solution Guarantee | Guaranteed within search space | Probabilistic (no guarantee) |
| Continuous Value Support | Requires discretization | Direct sampling from continuous distributions |
| High-Dimensional Search | Difficult (combinatorial explosion) | Cost fixed by the chosen number of iterations |
Scenarios Where Random Search is Advantageous:
- Many parameters (4 or more)
  - Grid search suffers from combinatorial explosion
  - Random search lets you fix the number of evaluations
- Optimizing continuous-valued parameters
  - Continuous values such as learning rate or regularization coefficient
  - Can be sampled directly from distributions
- When some parameters matter more than others
  - As Bergstra & Bengio (2012) showed, random search explores important parameter ranges more widely
  - Grid search is limited to uniformly spaced candidate values
- Limited computational resources
  - When time constraints exist
  - The search count can be kept within budget
Exercise 4 (Difficulty: hard)
Implement hyperparameter tuning for RandomForestClassifier on the following dataset and report the improvement rate from default settings.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
Sample Answer
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Hyperparameter tuning for RandomForestClassifier on the Wine dataset
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from scipy.stats import randint, uniform
import time
# Data preparation
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
print("=== Tuning on Wine Dataset ===\n")
print(f"Training data: {X_train.shape}")
print(f"Test data: {X_test.shape}")
print(f"Number of classes: {len(data.target_names)}")
# 1. Performance with default settings
print("\n1. Evaluation with Default Settings")
rf_default = RandomForestClassifier(random_state=42)
rf_default.fit(X_train, y_train)
y_pred_default = rf_default.predict(X_test)
default_accuracy = accuracy_score(y_test, y_pred_default)
print(f"Accuracy: {default_accuracy:.4f}")
# 2. Tuning with random search
print("\n2. Tuning with Random Search")
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': randint(3, 30),
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': uniform(0.1, 0.9)
}
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions=param_distributions,
n_iter=100,
cv=5,
scoring='accuracy',
n_jobs=-1,
random_state=42,
verbose=1
)
start_time = time.time()
random_search.fit(X_train, y_train)
elapsed_time = time.time() - start_time
# Evaluate with best model
y_pred_tuned = random_search.predict(X_test)
tuned_accuracy = accuracy_score(y_test, y_pred_tuned)
print(f"\nBest parameters:")
for param, value in random_search.best_params_.items():
if isinstance(value, float):
print(f" {param}: {value:.4f}")
else:
print(f" {param}: {value}")
print(f"\nCV accuracy: {random_search.best_score_:.4f}")
print(f"Test accuracy: {tuned_accuracy:.4f}")
print(f"Execution time: {elapsed_time:.2f} seconds")
# 3. Calculate improvement rate
improvement = (tuned_accuracy - default_accuracy) * 100
improvement_pct = (tuned_accuracy / default_accuracy - 1) * 100
print(f"\n=== Summary of Results ===")
print(f"Default settings: {default_accuracy:.4f}")
print(f"After tuning: {tuned_accuracy:.4f}")
print(f"Absolute improvement: {improvement:.2f} points")
print(f"Relative improvement: {improvement_pct:.2f}%")
# 4. Detailed classification report
print(f"\n=== Classification Report (After Tuning) ===")
print(classification_report(y_test, y_pred_tuned,
target_names=data.target_names))
# 5. Visualization
import matplotlib.pyplot as plt
import pandas as pd
results_df = pd.DataFrame(random_search.cv_results_)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Score distribution
axes[0, 0].hist(results_df['mean_test_score'], bins=20,
alpha=0.7, edgecolor='black')
axes[0, 0].axvline(x=random_search.best_score_, color='r',
linestyle='--', label='Best Score')
axes[0, 0].set_xlabel('CV Accuracy')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Score Distribution')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Parameter influence: n_estimators
axes[0, 1].scatter(results_df['param_n_estimators'],
results_df['mean_test_score'], alpha=0.5)
axes[0, 1].set_xlabel('n_estimators')
axes[0, 1].set_ylabel('CV Accuracy')
axes[0, 1].set_title('Influence of n_estimators')
axes[0, 1].grid(True, alpha=0.3)
# Parameter influence: max_depth
axes[1, 0].scatter(results_df['param_max_depth'],
results_df['mean_test_score'], alpha=0.5)
axes[1, 0].set_xlabel('max_depth')
axes[1, 0].set_ylabel('CV Accuracy')
axes[1, 0].set_title('Influence of max_depth')
axes[1, 0].grid(True, alpha=0.3)
# Default vs Tuned
comparison = ['Default', 'Tuned']
scores = [default_accuracy, tuned_accuracy]
colors = ['lightcoral', 'lightgreen']
axes[1, 1].bar(comparison, scores, color=colors,
edgecolor='black', alpha=0.7)
axes[1, 1].set_ylabel('Accuracy')
axes[1, 1].set_title('Performance Comparison')
axes[1, 1].set_ylim([0.9, 1.0])
axes[1, 1].grid(True, alpha=0.3, axis='y')
for i, score in enumerate(scores):
axes[1, 1].text(i, score + 0.005, f'{score:.4f}',
ha='center', fontsize=12)
plt.tight_layout()
plt.show()
Example Output:
=== Tuning on Wine Dataset ===
Training data: (142, 13)
Test data: (36, 13)
Number of classes: 3
1. Evaluation with Default Settings
Accuracy: 0.9722
2. Tuning with Random Search
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best parameters:
max_depth: 18
max_features: 0.3456
min_samples_leaf: 1
min_samples_split: 2
n_estimators: 287
CV accuracy: 0.9859
Test accuracy: 1.0000
Execution time: 15.23 seconds
=== Summary of Results ===
Default settings: 0.9722
After tuning: 1.0000
Absolute improvement: 2.78 points
Relative improvement: 2.86%
=== Classification Report (After Tuning) ===
              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        14
     class_1       1.00      1.00      1.00        15
     class_2       1.00      1.00      1.00         7

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36
Exercise 5 (Difficulty: hard)
Explain the dangers of data leakage in cross-validation and show the correct implementation method. Consider especially in the context of scaling and hyperparameter search.
Sample Answer
Answer:
What is Data Leakage:
Information leaking across the boundary between training and test data, causing model performance to be overestimated.
Specific Dangers:
- Leakage in scaling
  - Scaling all data before the train/test split lets test data statistics leak into training
  - The test data's mean and standard deviation are used during fitting
- Leakage in feature selection
  - Selecting features on all data before the split lets test data information influence the selection
- Leakage in cross-validation
  - Preprocessing outside the CV loop lets each test fold's information leak into its training folds
Incorrect Implementation Example:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Wrong: scale all data first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Fit on all data
# Then cross-validation
scores = cross_val_score(RandomForestClassifier(), X_scaled, y, cv=5)
# Test fold information leaks into the training folds
Correct Implementation Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Correct: use a Pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier())
])
# Cross-validation with Pipeline
# Scaler is fit on training data only for each fold
scores = cross_val_score(pipeline, X, y, cv=5)
# Hyperparameter search similarly
param_grid = {
'classifier__n_estimators': [100, 200, 300],
'classifier__max_depth': [10, 20, None]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
Demonstration Experiment:
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Demonstration Experiment:
Purpose: Demonstrate machine learning model training and evaluation
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
import numpy as np
# Generate data (features with different scales)
X, y = make_classification(n_samples=1000, n_features=20,
n_informative=10, random_state=42)
# Intentionally change scale
X[:, :10] = X[:, :10] * 1000 # Multiply first 10 features by 1000
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print("=== Demonstration of Data Leakage ===\n")
# 1. Incorrect method (with data leakage)
scaler_wrong = StandardScaler()
X_train_wrong = scaler_wrong.fit_transform(X_train)
X_test_wrong = scaler_wrong.transform(X_test)
# Leakage also occurs in CV
X_all_scaled = StandardScaler().fit_transform(X)
cv_scores_wrong = cross_val_score(
RandomForestClassifier(random_state=42),
X_all_scaled, y, cv=5
)
print("β Incorrect Method (CV after scaling all data)")
print(f"CV accuracy: {cv_scores_wrong.mean():.4f} (Β±{cv_scores_wrong.std():.4f})")
# 2. Correct method (prevent leakage with Pipeline)
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))
])
cv_scores_correct = cross_val_score(pipeline, X, y, cv=5)
print(f"\nβ
Correct Method (using Pipeline)")
print(f"CV accuracy: {cv_scores_correct.mean():.4f} (Β±{cv_scores_correct.std():.4f})")
# Calculate difference
difference = cv_scores_wrong.mean() - cv_scores_correct.mean()
print(f"\nDegree of overestimation: {difference:.4f} ({difference*100:.2f}% points)")
print("\n=== Conclusion ===")
print("Performance is overestimated due to data leakage")
print("Correct evaluation is possible using Pipeline")
Example Output:
=== Demonstration of Data Leakage ===
Incorrect method (CV after scaling all data)
CV accuracy: 0.9120 (±0.0234)
Correct method (using Pipeline)
CV accuracy: 0.9050 (±0.0287)
Degree of overestimation: 0.0070 (0.70 percentage points)
=== Conclusion ===
Performance is overestimated due to data leakage
Correct evaluation is possible using Pipeline
Best Practices:
- Always use Pipeline to integrate preprocessing and model
- Execute cross-validation on the entire pipeline including preprocessing
- Fit on training data, only transform on test data
- Also perform hyperparameter search on the entire Pipeline
References
- Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1), 281-305.
- Feurer, M., & Hutter, F. (2019). Hyperparameter optimization. In Automated Machine Learning (pp. 3-33). Springer.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
- GΓ©ron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly Media.