Chapter 3: Model Selection and Hyperparameter Optimization

This chapter covers Model Selection and Hyperparameter Optimization. You will learn models based on data size (linear, cross-validation techniques (K-Fold, and ensemble learning methods (Bagging.

Learning Objectives

After completing this chapter, you will be able to:

✅ Select appropriate models based on data size (linear, tree-based, neural networks, GNNs)
✅ Apply cross-validation techniques (K-Fold, Stratified, Time Series Split)
✅ Perform automated hyperparameter optimization using Bayesian optimization with Optuna
✅ Implement ensemble learning methods (Bagging, Boosting, Stacking)
✅ Execute practical workflows for Li-ion battery capacity prediction

3.1 Model Selection Strategy

In materials science machine learning, selecting appropriate models based on data characteristics is crucial for achieving good performance.

Data Size and Model Complexity

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - seaborn>=0.12.0

"""
Example: Data Size and Model Complexity

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import learning_curve
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

# Generate sample materials data
np.random.seed(42)

def generate_material_data(n_samples, n_features=20):
    """Simulate materials data"""
    X = np.random.randn(n_samples, n_features)
    # Nonlinear relationships
    y = (
        2 * X[:, 0]**2 +
        3 * X[:, 1] * X[:, 2] -
        1.5 * X[:, 3] +
        np.random.normal(0, 0.5, n_samples)
    )
    return X, y

# Relationship between model complexity and sample size
sample_sizes = [50, 100, 200, 500, 1000]
models = {
    'Ridge': Ridge(),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=50, random_state=42),
    'Neural Network': MLPRegressor(hidden_layers=(50, 50), max_iter=1000, random_state=42)
}

# Learning curves
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, (model_name, model) in enumerate(models.items()):
    X, y = generate_material_data(1000, n_features=20)

    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )

    train_mean = -train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = -val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    axes[idx].plot(train_sizes, train_mean, 'o-',
                   color='steelblue', label='Training Error')
    axes[idx].fill_between(train_sizes,
                           train_mean - train_std,
                           train_mean + train_std,
                           alpha=0.2, color='steelblue')

    axes[idx].plot(train_sizes, val_mean, 'o-',
                   color='coral', label='Validation Error')
    axes[idx].fill_between(train_sizes,
                           val_mean - val_std,
                           val_mean + val_std,
                           alpha=0.2, color='coral')

    axes[idx].set_xlabel('Training Size', fontsize=11)
    axes[idx].set_ylabel('MSE', fontsize=11)
    axes[idx].set_title(f'{model_name}', fontsize=12, fontweight='bold')
    axes[idx].legend(loc='upper right')
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("Model Selection Guidelines:")
print("- Small data (<100): Ridge, Lasso (regularized linear models)")
print("- Medium data (100-1000): Random Forest, Gradient Boosting")
print("- Large data (>1000): Neural Network, Deep Learning")

Interpretability vs. Accuracy Trade-off

# Model interpretability and accuracy comparison
model_comparison = pd.DataFrame({
    'Model': [
        'Linear Regression',
        'Ridge/Lasso',
        'Decision Tree',
        'Random Forest',
        'Gradient Boosting',
        'Neural Network',
        'GNN'
    ],
    'Interpretability': [10, 9, 7, 4, 3, 2, 1],
    'Accuracy': [4, 5, 5, 8, 9, 9, 10],
    'Training Speed': [10, 9, 8, 6, 5, 3, 2],
    'Inference Speed': [10, 10, 9, 7, 6, 8, 4]
})

# Radar chart
from math import pi

categories = ['Interpretability', 'Accuracy', 'Training Speed', 'Inference Speed']
N = len(categories)

angles = [n / float(N) * 2 * pi for n in range(N)]
angles += angles[:1]

fig, axes = plt.subplots(2, 4, figsize=(18, 10),
                         subplot_kw=dict(projection='polar'))
axes = axes.flatten()

for idx, row in model_comparison.iterrows():
    values = row[categories].tolist()
    values += values[:1]

    axes[idx].plot(angles, values, 'o-', linewidth=2)
    axes[idx].fill(angles, values, alpha=0.25)
    axes[idx].set_xticks(angles[:-1])
    axes[idx].set_xticklabels(categories, size=9)
    axes[idx].set_ylim(0, 10)
    axes[idx].set_title(row['Model'], size=11, fontweight='bold', pad=20)
    axes[idx].grid(True)

plt.tight_layout()
plt.show()

print("\nModel Selection Criteria:")
print("- Interpretability prioritized: Ridge, Lasso, Decision Tree")
print("- Accuracy prioritized: Gradient Boosting, Neural Network, GNN")
print("- Balanced approach: Random Forest")

Linear Models, Tree-based, Neural Networks, and GNNs Comparison

# Performance comparison on real data
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, r2_score

X, y = generate_material_data(500, n_features=20)

models_benchmark = {
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'MLP': MLPRegressor(hidden_layers=(100, 50), max_iter=1000, random_state=42)
}

results = []

for model_name, model in models_benchmark.items():
    # Cross-validation
    cv_scores = cross_val_score(model, X, y, cv=5,
                                scoring='neg_mean_absolute_error')
    mae = -cv_scores.mean()
    mae_std = cv_scores.std()

    # R²
    cv_r2 = cross_val_score(model, X, y, cv=5, scoring='r2')
    r2 = cv_r2.mean()

    results.append({
        'Model': model_name,
        'MAE': mae,
        'MAE_std': mae_std,
        'R²': r2
    })

results_df = pd.DataFrame(results)
print("\nModel Performance Comparison:")
print(results_df.to_string(index=False))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# MAE
axes[0].barh(results_df['Model'], results_df['MAE'],
             xerr=results_df['MAE_std'],
             color='steelblue', alpha=0.7)
axes[0].set_xlabel('MAE (lower is better)', fontsize=11)
axes[0].set_title('Prediction Error (MAE)', fontsize=12, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# R²
axes[1].barh(results_df['Model'], results_df['R²'],
             color='coral', alpha=0.7)
axes[1].set_xlabel('R² (higher is better)', fontsize=11)
axes[1].set_title('Coefficient of Determination (R²)', fontsize=12, fontweight='bold')
axes[1].set_xlim(0, 1)
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

3.2 Cross-Validation Techniques

Cross-validation is a fundamental technique for properly evaluating model generalization performance.

K-Fold Cross-Validation

from sklearn.model_selection import KFold, cross_validate

def kfold_cv_demo(X, y, model, k=5):
    """
    K-Fold cross-validation demo
    """
    kf = KFold(n_splits=k, shuffle=True, random_state=42)

    fold_results = []

    for fold_idx, (train_idx, val_idx) in enumerate(kf.split(X)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)

        mae = mean_absolute_error(y_val, y_pred)
        r2 = r2_score(y_val, y_pred)

        fold_results.append({
            'Fold': fold_idx + 1,
            'MAE': mae,
            'R²': r2
        })

    return pd.DataFrame(fold_results)

# Run K-Fold
model = RandomForestRegressor(n_estimators=100, random_state=42)
fold_results = kfold_cv_demo(X, y, model, k=5)

print("K-Fold CV Results:")
print(fold_results.to_string(index=False))
print(f"\nMean MAE: {fold_results['MAE'].mean():.4f} ± {fold_results['MAE'].std():.4f}")
print(f"Mean R²: {fold_results['R²'].mean():.4f} ± {fold_results['R²'].std():.4f}")

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
x_pos = np.arange(len(fold_results))

ax.bar(x_pos, fold_results['MAE'], color='steelblue', alpha=0.7,
       label='MAE per fold')
ax.axhline(y=fold_results['MAE'].mean(), color='red',
           linestyle='--', linewidth=2, label='Mean MAE')
ax.fill_between(x_pos,
                fold_results['MAE'].mean() - fold_results['MAE'].std(),
                fold_results['MAE'].mean() + fold_results['MAE'].std(),
                color='red', alpha=0.2, label='±1 Std')

ax.set_xlabel('Fold', fontsize=12)
ax.set_ylabel('MAE', fontsize=12)
ax.set_title('K-Fold Cross-Validation Results', fontsize=13, fontweight='bold')
ax.set_xticks(x_pos)
ax.set_xticklabels(fold_results['Fold'])
ax.legend()
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

Stratified K-Fold

from sklearn.model_selection import StratifiedKFold

# Classification data
X_class, _ = generate_material_data(500, n_features=20)
# 3-class classification
y_class = np.digitize(y, bins=np.percentile(y, [33, 67]))

def compare_kfold_strategies(X, y):
    """
    Standard K-Fold vs Stratified K-Fold comparison
    """
    # Standard K-Fold
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    normal_distributions = []

    for train_idx, _ in kf.split(X):
        y_train = y[train_idx]
        class_dist = np.bincount(y_train) / len(y_train)
        normal_distributions.append(class_dist)

    # Stratified K-Fold
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    stratified_distributions = []

    for train_idx, _ in skf.split(X, y):
        y_train = y[train_idx]
        class_dist = np.bincount(y_train) / len(y_train)
        stratified_distributions.append(class_dist)

    return normal_distributions, stratified_distributions

normal_dist, stratified_dist = compare_kfold_strategies(X_class, y_class)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Standard K-Fold
normal_array = np.array(normal_dist)
axes[0].bar(range(len(normal_array)), normal_array[:, 0],
            label='Class 0', alpha=0.7)
axes[0].bar(range(len(normal_array)), normal_array[:, 1],
            bottom=normal_array[:, 0],
            label='Class 1', alpha=0.7)
axes[0].bar(range(len(normal_array)), normal_array[:, 2],
            bottom=normal_array[:, 0] + normal_array[:, 1],
            label='Class 2', alpha=0.7)
axes[0].set_xlabel('Fold', fontsize=11)
axes[0].set_ylabel('Class Distribution', fontsize=11)
axes[0].set_title('Standard K-Fold', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].set_ylim(0, 1)

# Stratified K-Fold
stratified_array = np.array(stratified_dist)
axes[1].bar(range(len(stratified_array)), stratified_array[:, 0],
            label='Class 0', alpha=0.7)
axes[1].bar(range(len(stratified_array)), stratified_array[:, 1],
            bottom=stratified_array[:, 0],
            label='Class 1', alpha=0.7)
axes[1].bar(range(len(stratified_array)), stratified_array[:, 2],
            bottom=stratified_array[:, 0] + stratified_array[:, 1],
            label='Class 2', alpha=0.7)
axes[1].set_xlabel('Fold', fontsize=11)
axes[1].set_ylabel('Class Distribution', fontsize=11)
axes[1].set_title('Stratified K-Fold', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].set_ylim(0, 1)

plt.tight_layout()
plt.show()

print("Stratified K-Fold Advantages:")
print("- Uniform class distribution across folds")
print("- Stable evaluation even with imbalanced data")

Time Series Split (for Sequential Data)

from sklearn.model_selection import TimeSeriesSplit

# Time series data simulation
n_time_points = 200
time = np.arange(n_time_points)
# Trend + seasonality + noise
y_timeseries = (
    0.05 * time +
    10 * np.sin(2 * np.pi * time / 50) +
    np.random.normal(0, 2, n_time_points)
)
X_timeseries = np.column_stack([time, np.sin(2 * np.pi * time / 50)])

# Time Series Split
tscv = TimeSeriesSplit(n_splits=5)

# Visualization
fig, ax = plt.subplots(figsize=(14, 6))

colors = plt.cm.viridis(np.linspace(0, 1, 5))

for fold_idx, (train_idx, test_idx) in enumerate(tscv.split(X_timeseries)):
    # Train
    ax.scatter(time[train_idx], y_timeseries[train_idx],
               c=[colors[fold_idx]], s=20, alpha=0.3,
               label=f'Fold {fold_idx+1} Train')
    # Test
    ax.scatter(time[test_idx], y_timeseries[test_idx],
               c=[colors[fold_idx]], s=50, marker='s',
               label=f'Fold {fold_idx+1} Test')

ax.plot(time, y_timeseries, 'k-', alpha=0.3, linewidth=1)
ax.set_xlabel('Time', fontsize=12)
ax.set_ylabel('Value', fontsize=12)
ax.set_title('Time Series Split', fontsize=13, fontweight='bold')
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=8)
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("Time Series Split Characteristics:")
print("- Training data always precedes test data")
print("- Prevents future data leakage (data leakage prevention)")

Leave-One-Out CV (for Small Datasets)

from sklearn.model_selection import LeaveOneOut

def loo_cv_demo(X, y, model):
    """
    Leave-One-Out CV
    For small-scale datasets
    """
    loo = LeaveOneOut()
    predictions = []
    actuals = []

    for train_idx, test_idx in loo.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        predictions.append(y_pred[0])
        actuals.append(y_test[0])

    return np.array(actuals), np.array(predictions)

# Small dataset
X_small, y_small = generate_material_data(50, n_features=10)
model_small = Ridge(alpha=1.0)

y_actual, y_pred_loo = loo_cv_demo(X_small, y_small, model_small)

mae_loo = mean_absolute_error(y_actual, y_pred_loo)
r2_loo = r2_score(y_actual, y_pred_loo)

print(f"LOO CV Results (n={len(X_small)}):")
print(f"MAE: {mae_loo:.4f}")
print(f"R²: {r2_loo:.4f}")

# Predicted vs actual
plt.figure(figsize=(8, 8))
plt.scatter(y_actual, y_pred_loo, c='steelblue', s=50, alpha=0.6)
plt.plot([y_actual.min(), y_actual.max()],
         [y_actual.min(), y_actual.max()],
         'r--', linewidth=2, label='Perfect Prediction')
plt.xlabel('Actual', fontsize=12)
plt.ylabel('Predicted', fontsize=12)
plt.title('Leave-One-Out CV Results', fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

3.3 Hyperparameter Optimization

Hyperparameter optimization explores the parameter space to maximize model performance.

Grid Search (Exhaustive Search)

from sklearn.model_selection import GridSearchCV

def grid_search_demo(X, y):
    """
    Grid Search exhaustive search
    """
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [5, 10, 15, None],
        'min_samples_split': [2, 5, 10]
    }

    model = RandomForestRegressor(random_state=42)

    grid_search = GridSearchCV(
        model,
        param_grid,
        cv=5,
        scoring='neg_mean_absolute_error',
        n_jobs=-1,
        verbose=1
    )

    grid_search.fit(X, y)

    return grid_search

# Execute
grid_result = grid_search_demo(X, y)

print("Grid Search Results:")
print(f"Best parameters: {grid_result.best_params_}")
print(f"Best score (MAE): {-grid_result.best_score_:.4f}")
print(f"\nSearch space size: {len(grid_result.cv_results_['params'])}")

# Visualization (2D heatmap)
results = pd.DataFrame(grid_result.cv_results_)

# n_estimators vs max_depth
pivot_table = results.pivot_table(
    values='mean_test_score',
    index='param_max_depth',
    columns='param_n_estimators',
    aggfunc='mean'
)

plt.figure(figsize=(10, 8))
sns.heatmap(-pivot_table, annot=True, fmt='.3f',
            cmap='YlOrRd', cbar_kws={'label': 'MAE'})
plt.xlabel('n_estimators', fontsize=12)
plt.ylabel('max_depth', fontsize=12)
plt.title('Grid Search Results (MAE)', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

Random Search

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

def random_search_demo(X, y, n_iter=50):
    """
    Random Search random exploration
    """
    param_distributions = {
        'n_estimators': randint(50, 300),
        'max_depth': randint(5, 30),
        'min_samples_split': randint(2, 20),
        'min_samples_leaf': randint(1, 10),
        'max_features': uniform(0.3, 0.7)
    }

    model = RandomForestRegressor(random_state=42)

    random_search = RandomizedSearchCV(
        model,
        param_distributions,
        n_iter=n_iter,
        cv=5,
        scoring='neg_mean_absolute_error',
        n_jobs=-1,
        random_state=42,
        verbose=1
    )

    random_search.fit(X, y)

    return random_search

# Execute
random_result = random_search_demo(X, y, n_iter=50)

print("\nRandom Search Results:")
print(f"Best parameters: {random_result.best_params_}")
print(f"Best score (MAE): {-random_result.best_score_:.4f}")

# Grid vs Random comparison
print(f"\nGrid Search best score: {-grid_result.best_score_:.4f}")
print(f"Random Search best score: {-random_result.best_score_:.4f}")
print(f"\nRandom Search efficiency: Explored {50 / len(grid_result.cv_results_['params']) * 100:.1f}% of search space with comparable performance")

Bayesian Optimization (Optuna, Hyperopt)

# Requirements:
# - Python 3.9+
# - optuna>=3.2.0

import optuna

def objective(trial, X, y):
    """
    Optuna objective function
    """
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 5, 30),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'max_features': trial.suggest_float('max_features', 0.3, 1.0),
        'random_state': 42
    }

    model = RandomForestRegressor(**params)

    # Cross-validation
    cv_scores = cross_val_score(
        model, X, y, cv=5,
        scoring='neg_mean_absolute_error'
    )

    return -cv_scores.mean()  # Negate for minimization

# Optuna optimization
study = optuna.create_study(direction='minimize')
study.optimize(lambda trial: objective(trial, X, y), n_trials=100, show_progress_bar=True)

print("\nOptuna Bayesian Optimization Results:")
print(f"Best parameters: {study.best_params}")
print(f"Best score (MAE): {study.best_value:.4f}")

# Optimization history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Optimization history
axes[0].plot(study.trials_dataframe()['number'],
             study.trials_dataframe()['value'],
             'o-', color='steelblue', alpha=0.6, label='Trial Score')
axes[0].plot(study.trials_dataframe()['number'],
             study.trials_dataframe()['value'].cummin(),
             'r-', linewidth=2, label='Best Score')
axes[0].set_xlabel('Trial', fontsize=11)
axes[0].set_ylabel('MAE', fontsize=11)
axes[0].set_title('Optimization History', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Parameter importance
importances = optuna.importance.get_param_importances(study)
axes[1].barh(list(importances.keys()), list(importances.values()),
             color='coral', alpha=0.7)
axes[1].set_xlabel('Importance', fontsize=11)
axes[1].set_title('Hyperparameter Importance', fontsize=12, fontweight='bold')
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

Early Stopping

from sklearn.ensemble import GradientBoostingRegressor

def early_stopping_demo(X, y):
    """
    Early Stopping demo
    """
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Track performance for each boosting stage
    model = GradientBoostingRegressor(
        n_estimators=500,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    )

    model.fit(X_train, y_train)

    # Predictions for each stage
    train_scores = []
    val_scores = []

    for i, y_pred_train in enumerate(model.staged_predict(X_train)):
        y_pred_val = list(model.staged_predict(X_val))[i]

        train_mae = mean_absolute_error(y_train, y_pred_train)
        val_mae = mean_absolute_error(y_val, y_pred_val)

        train_scores.append(train_mae)
        val_scores.append(val_mae)

    # Optimal n_estimators (minimum validation error)
    best_n_estimators = np.argmin(val_scores) + 1

    return train_scores, val_scores, best_n_estimators

train_curve, val_curve, best_n = early_stopping_demo(X, y)

# Visualization
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(train_curve)+1), train_curve,
         'b-', label='Training Error', linewidth=2)
plt.plot(range(1, len(val_curve)+1), val_curve,
         'r-', label='Validation Error', linewidth=2)
plt.axvline(x=best_n, color='green', linestyle='--',
            label=f'Best n_estimators={best_n}', linewidth=2)
plt.xlabel('n_estimators', fontsize=12)
plt.ylabel('MAE', fontsize=12)
plt.title('Early Stopping', fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Early Stopping Results:")
print(f"Optimal n_estimators: {best_n}")
print(f"Validation MAE: {val_curve[best_n-1]:.4f}")
print(f"Overfitting prevention: Reduced {500 - best_n} iterations")

3.4 Ensemble Learning

Combining multiple models improves prediction accuracy through ensemble methods.

Bagging (Bootstrap Aggregating)

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Bagging
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=10),
    n_estimators=50,
    max_samples=0.8,
    random_state=42
)

# Single Decision Tree for comparison
single_tree = DecisionTreeRegressor(max_depth=10, random_state=42)

# Evaluation
cv_bagging = cross_val_score(bagging, X, y, cv=5,
                             scoring='neg_mean_absolute_error')
cv_single = cross_val_score(single_tree, X, y, cv=5,
                            scoring='neg_mean_absolute_error')

print("Bagging Results:")
print(f"Single Decision Tree MAE: {-cv_single.mean():.4f} ± {cv_single.std():.4f}")
print(f"Bagging MAE: {-cv_bagging.mean():.4f} ± {cv_bagging.std():.4f}")
print(f"Improvement: {(cv_single.mean() - cv_bagging.mean()) / cv_single.mean() * 100:.1f}%")

Boosting (AdaBoost, Gradient Boosting, LightGBM, XGBoost)

# Requirements:
# - Python 3.9+
# - lightgbm>=4.0.0

"""
Example: Boosting (AdaBoost, Gradient Boosting, LightGBM, XGBoost)

Purpose: Demonstrate data manipulation and preprocessing
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""

from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
import lightgbm as lgb
# import xgboost as xgb  # Optional

# Various Boosting algorithms
boosting_models = {
    'AdaBoost': AdaBoostRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'LightGBM': lgb.LGBMRegressor(n_estimators=100, random_state=42, verbose=-1),
    # 'XGBoost': xgb.XGBRegressor(n_estimators=100, random_state=42)
}

boosting_results = []

for model_name, model in boosting_models.items():
    cv_scores = cross_val_score(model, X, y, cv=5,
                                scoring='neg_mean_absolute_error')
    mae = -cv_scores.mean()
    mae_std = cv_scores.std()

    boosting_results.append({
        'Model': model_name,
        'MAE': mae,
        'MAE_std': mae_std
    })

boosting_df = pd.DataFrame(boosting_results)

print("\nBoosting Methods Comparison:")
print(boosting_df.to_string(index=False))

# Visualization
plt.figure(figsize=(10, 6))
plt.barh(boosting_df['Model'], boosting_df['MAE'],
         xerr=boosting_df['MAE_std'],
         color='steelblue', alpha=0.7)
plt.xlabel('MAE', fontsize=12)
plt.title('Boosting Methods Performance Comparison', fontsize=13, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

Stacking

from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge

# Base models
base_models = [
    ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingRegressor(n_estimators=100, random_state=42)),
    ('lgbm', lgb.LGBMRegressor(n_estimators=100, random_state=42, verbose=-1))
]

# Meta model
meta_model = Ridge(alpha=1.0)

# Stacking
stacking = StackingRegressor(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5
)

# Evaluation
cv_stacking = cross_val_score(stacking, X, y, cv=5,
                              scoring='neg_mean_absolute_error')

print("\nStacking Results:")
for name, _ in base_models:
    model_cv = cross_val_score(dict(base_models)[name], X, y, cv=5,
                               scoring='neg_mean_absolute_error')
    print(f"{name} MAE: {-model_cv.mean():.4f}")

print(f"Stacking Ensemble MAE: {-cv_stacking.mean():.4f}")
print(f"\nImprovement: Stacking shows "
      f"{((-cv_stacking.mean() / min([cross_val_score(m, X, y, cv=5, scoring='neg_mean_absolute_error').mean() for _, m in base_models])) - 1) * -100:.1f}% improvement over best single model")

Voting

from sklearn.ensemble import VotingRegressor

# Voting Ensemble
voting = VotingRegressor(
    estimators=base_models,
    weights=[1, 1.5, 1]  # Higher weight for GB
)

cv_voting = cross_val_score(voting, X, y, cv=5,
                            scoring='neg_mean_absolute_error')

print(f"\nVoting Ensemble MAE: {-cv_voting.mean():.4f}")

# Ensemble methods comparison
ensemble_comparison = pd.DataFrame({
    'Method': ['Single Best', 'Bagging', 'Boosting (LightGBM)',
               'Stacking', 'Voting'],
    'MAE': [
        min([cross_val_score(m, X, y, cv=5, scoring='neg_mean_absolute_error').mean() for _, m in base_models]),
        -cv_bagging.mean(),
        boosting_df[boosting_df['Model'] == 'LightGBM']['MAE'].values[0],
        -cv_stacking.mean(),
        -cv_voting.mean()
    ]
})

plt.figure(figsize=(10, 6))
plt.barh(ensemble_comparison['Method'], ensemble_comparison['MAE'],
         color='coral', alpha=0.7)
plt.xlabel('MAE (lower is better)', fontsize=12)
plt.title('Ensemble Methods Performance Comparison', fontsize=13, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

3.5 Case Study: Li-ion Battery Capacity Prediction

We implement the complete workflow of model selection and optimization using real Li-ion battery data.

# Li-ion battery capacity dataset (simulation)
np.random.seed(42)
n_batteries = 300

battery_data = pd.DataFrame({
    'Cathode_Composition_Li': np.random.uniform(0.9, 1.1, n_batteries),
    'Cathode_Composition_Co': np.random.uniform(0, 0.6, n_batteries),
    'Cathode_Composition_Ni': np.random.uniform(0, 0.8, n_batteries),
    'Cathode_Composition_Mn': np.random.uniform(0, 0.4, n_batteries),
    'Anode_Carbon_Fraction': np.random.uniform(0.8, 1.0, n_batteries),
    'Electrolyte_Concentration': np.random.uniform(0.5, 2.0, n_batteries),
    'Electrode_Thickness': np.random.uniform(50, 200, n_batteries),
    'Particle_Size': np.random.uniform(1, 20, n_batteries),
    'Sintering_Temperature': np.random.uniform(700, 1000, n_batteries),
    'BET_Surface_Area': np.random.uniform(1, 50, n_batteries)
})

# Capacity (complex nonlinear relationship)
capacity = (
    150 * battery_data['Cathode_Composition_Ni'] +
    120 * battery_data['Cathode_Composition_Co'] +
    80 * battery_data['Cathode_Composition_Mn'] +
    30 * battery_data['Electrolyte_Concentration'] -
    0.5 * battery_data['Electrode_Thickness'] +
    2 * battery_data['BET_Surface_Area'] +
    0.1 * battery_data['Sintering_Temperature'] +
    20 * battery_data['Cathode_Composition_Ni'] * battery_data['Electrolyte_Concentration'] +
    np.random.normal(0, 5, n_batteries)
)

battery_data['Capacity_mAh_g'] = capacity

print("=== Li-ion Battery Capacity Prediction Dataset ===")
print(f"Number of samples: {len(battery_data)}")
print(f"Number of features: {battery_data.shape[1] - 1}")
print(f"\nCapacity Statistics:")
print(battery_data['Capacity_mAh_g'].describe())

X_battery = battery_data.drop('Capacity_mAh_g', axis=1)
y_battery = battery_data['Capacity_mAh_g']

Step 1: Comparing Five Models

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_battery, y_battery, test_size=0.2, random_state=42
)

# Five models for comparison
models_to_compare = {
    'Ridge': Ridge(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'LightGBM': lgb.LGBMRegressor(n_estimators=100, random_state=42, verbose=-1),
    'Stacking': StackingRegressor(
        estimators=[
            ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
            ('gb', GradientBoostingRegressor(n_estimators=100, random_state=42)),
            ('lgbm', lgb.LGBMRegressor(n_estimators=100, random_state=42, verbose=-1))
        ],
        final_estimator=Ridge(alpha=1.0),
        cv=5
    )
}

comparison_results = []

for model_name, model in models_to_compare.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    comparison_results.append({
        'Model': model_name,
        'MAE': mae,
        'RMSE': rmse,
        'R²': r2
    })

comparison_df = pd.DataFrame(comparison_results)
print("\n=== Step 1: Model Comparison Results ===")
print(comparison_df.to_string(index=False))

Step 2: Automated Hyperparameter Optimization with Optuna

def objective_lightgbm(trial, X, y):
    """
    LightGBM hyperparameter optimization
    """
    param = {
        'objective': 'regression',
        'metric': 'mae',
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'num_leaves': trial.suggest_int('num_leaves', 20, 100),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
        'random_state': 42
    }

    model = lgb.LGBMRegressor(**param)

    cv_scores = cross_val_score(
        model, X, y, cv=5, scoring='neg_mean_absolute_error'
    )

    return -cv_scores.mean()

# Optuna optimization
study_battery = optuna.create_study(direction='minimize')
study_battery.optimize(
    lambda trial: objective_lightgbm(trial, X_battery, y_battery),
    n_trials=100,
    show_progress_bar=True
)

print("\n=== Step 2: Optuna Optimization ===")
print(f"Best parameters:")
for key, value in study_battery.best_params.items():
    print(f"  {key}: {value}")
print(f"\nBest MAE: {study_battery.best_value:.4f} mAh/g")

# Re-evaluate with optimized model
best_model = lgb.LGBMRegressor(**study_battery.best_params, random_state=42)
best_model.fit(X_train, y_train)
y_pred_best = best_model.predict(X_test)

mae_best = mean_absolute_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)

print(f"\nOptimized Model Performance:")
print(f"MAE: {mae_best:.4f} mAh/g")
print(f"R²: {r2_best:.4f}")

Step 3: Final Stacking Ensemble for Peak Performance

# Stacking with optimized LightGBM
optimized_stacking = StackingRegressor(
    estimators=[
        ('rf_tuned', RandomForestRegressor(
            n_estimators=200, max_depth=15, min_samples_split=5,
            random_state=42
        )),
        ('lgbm_tuned', lgb.LGBMRegressor(**study_battery.best_params, random_state=42)),
        ('gb_tuned', GradientBoostingRegressor(
            n_estimators=150, learning_rate=0.1, max_depth=7,
            random_state=42
        ))
    ],
    final_estimator=Ridge(alpha=0.5),
    cv=5
)

optimized_stacking.fit(X_train, y_train)
y_pred_stack = optimized_stacking.predict(X_test)

mae_stack = mean_absolute_error(y_test, y_pred_stack)
r2_stack = r2_score(y_test, y_pred_stack)

print("\n=== Step 3: Final Stacking Ensemble ===")
print(f"MAE: {mae_stack:.4f} mAh/g")
print(f"R²: {r2_stack:.4f}")

# Comprehensive results comparison
final_comparison = pd.DataFrame({
    'Stage': [
        'Baseline (Ridge)',
        'Best Single Model',
        'Optuna Optimized',
        'Final Stacking'
    ],
    'MAE': [
        comparison_df[comparison_df['Model'] == 'Ridge']['MAE'].values[0],
        comparison_df['MAE'].min(),
        mae_best,
        mae_stack
    ],
    'R²': [
        comparison_df[comparison_df['Model'] == 'Ridge']['R²'].values[0],
        comparison_df['R²'].max(),
        r2_best,
        r2_stack
    ]
})

print("\n=== Overall Performance Progression ===")
print(final_comparison.to_string(index=False))

# Predicted vs actual
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Best single model before optimization
best_single_idx = comparison_df['MAE'].idxmin()
best_single_model = list(models_to_compare.values())[best_single_idx]
y_pred_single = best_single_model.predict(X_test)

axes[0].scatter(y_test, y_pred_single, c='steelblue', s=50, alpha=0.6)
axes[0].plot([y_test.min(), y_test.max()],
             [y_test.min(), y_test.max()],
             'r--', linewidth=2, label='Perfect')
axes[0].set_xlabel('Actual Capacity (mAh/g)', fontsize=11)
axes[0].set_ylabel('Predicted Capacity (mAh/g)', fontsize=11)
axes[0].set_title(f'Best Single Model (MAE={comparison_df["MAE"].min():.2f})',
                  fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Final Stacking
axes[1].scatter(y_test, y_pred_stack, c='coral', s=50, alpha=0.6)
axes[1].plot([y_test.min(), y_test.max()],
             [y_test.min(), y_test.max()],
             'r--', linewidth=2, label='Perfect')
axes[1].set_xlabel('Actual Capacity (mAh/g)', fontsize=11)
axes[1].set_ylabel('Predicted Capacity (mAh/g)', fontsize=11)
axes[1].set_title(f'Final Stacking (MAE={mae_stack:.2f})',
                  fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

improvement = (comparison_df['MAE'].min() - mae_stack) / comparison_df['MAE'].min() * 100
print(f"\nFinal Improvement Rate: {improvement:.1f}%")

Exercises

Exercise 1 (Difficulty: Easy)

Compare K-Fold CV and Stratified K-Fold CV using imbalanced classification data to evaluate performance.

Solution Example

from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Generate imbalanced data
X, y = make_classification(n_samples=200, n_features=20,
                          weights=[0.9, 0.1], random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kfold = cross_val_score(model, X, y, cv=kf, scoring='f1')

# Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_stratified = cross_val_score(model, X, y, cv=skf, scoring='f1')

print(f"K-Fold F1: {scores_kfold.mean():.4f} ± {scores_kfold.std():.4f}")
print(f"Stratified K-Fold F1: {scores_stratified.mean():.4f} ± {scores_stratified.std():.4f}")

Exercise 2 (Difficulty: Medium)

Use Optuna to optimize Random Forest hyperparameters. Explore four parameters: n_estimators, max_depth, min_samples_split, and max_features.

Solution Example

# Requirements:
# - Python 3.9+
# - optuna>=3.2.0

"""
Example: Use Optuna to optimize Random Forest hyperparameters. Explor

Purpose: Demonstrate optimization techniques
Target: Advanced
Execution time: 10-30 seconds
Dependencies: None
"""

import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def objective_rf(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 5, 30),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'max_features': trial.suggest_float('max_features', 0.3, 1.0),
        'random_state': 42
    }

    model = RandomForestRegressor(**params)
    cv_scores = cross_val_score(model, X, y, cv=5,
                                scoring='neg_mean_absolute_error')

    return -cv_scores.mean()

study = optuna.create_study(direction='minimize')
study.optimize(objective_rf, n_trials=50)

print(f"Best parameters: {study.best_params}")
print(f"Best score: {study.best_value:.4f}")

Exercise 3 (Difficulty: Hard)

Build a Stacking Ensemble with three base models (Ridge, Random Forest, LightGBM) and compare two meta-models (Ridge, Lasso). Which meta-model performs better?

Solution Example

# Requirements:
# - Python 3.9+
# - lightgbm>=4.0.0

"""
Example: Build a Stacking Ensemble with three base models (Ridge, Ran

Purpose: Demonstrate data manipulation and preprocessing
Target: Advanced
Execution time: ~5 seconds
Dependencies: None
"""

from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge, Lasso
import lightgbm as lgb

base_models = [
    ('ridge', Ridge(alpha=1.0)),
    ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
    ('lgbm', lgb.LGBMRegressor(n_estimators=100, random_state=42, verbose=-1))
]

meta_models = {
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1)
}

results = []

for meta_name, meta_model in meta_models.items():
    stacking = StackingRegressor(
        estimators=base_models,
        final_estimator=meta_model,
        cv=5
    )

    cv_scores = cross_val_score(stacking, X, y, cv=5,
                                scoring='neg_mean_absolute_error')
    mae = -cv_scores.mean()

    results.append({
        'Meta Model': meta_name,
        'MAE': mae
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

3.6 Model Optimization Environment and Reproducibility

Library Version Management

# Requirements:
# - Python 3.9+
# - lightgbm>=4.0.0
# - numpy>=1.24.0, <2.0.0
# - optuna>=3.2.0
# - pandas>=2.0.0, <2.2.0
# - scikit-learn>=1.3.0, <1.5.0

"""
Example: Library Version Management

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""

# Libraries required for model selection and optimization
import sys
import sklearn
import optuna
import lightgbm
import pandas as pd
import numpy as np

reproducibility_info = {
    'Python': sys.version,
    'NumPy': np.__version__,
    'Pandas': pd.__version__,
    'scikit-learn': sklearn.__version__,
    'Optuna': optuna.__version__,
    'LightGBM': lightgbm.__version__,
    'Date': '2025-10-19'
}

print("=== Model Optimization Environment ===")
for key, value in reproducibility_info.items():
    print(f"{key}: {value}")

# Recommended versions
print("\n【Recommended Environment】")
recommended = """
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
optuna==3.3.0
lightgbm==4.0.0
xgboost==2.0.0  # Optional
matplotlib==3.7.2
seaborn==0.12.2
"""
print(recommended)

print("\n【Installation Command】")
print("```bash")
print("pip install numpy==1.24.3 pandas==2.0.3 scikit-learn==1.3.0")
print("pip install optuna==3.3.0 lightgbm==4.0.0")
print("```")

print("\n【Important Notes】")
print("⚠️ Optuna's optimization algorithm may differ across versions")
print("⚠️ LightGBM/XGBoost may have OS-dependent issues → validate in consistent environment")
print("⚠️ Always specify library versions in papers and reports")

Random Seed Management

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0

# Seed setting for reproducibility
def set_all_seeds(seed=42):
    """
    Set all random number generators
    """
    import random
    import numpy as np

    random.seed(seed)
    np.random.seed(seed)

    # scikit-learn models specify via random_state parameter
    print(f"✅ Random seed {seed} set")
    print("Specify random_state={seed} explicitly in model training")

set_all_seeds(42)

# Optuna seed management
print("\n【Optuna Seed Management】")
print("```python")
print("study = optuna.create_study(")
print("    direction='minimize',")
print("    sampler=optuna.samplers.TPESampler(seed=42)  # Ensure reproducibility")
print(")")
print("```")

Practical Pitfalls (Model Selection and Optimization)

print("=== Model Selection and Optimization Pitfalls ===\n")

print("【Pitfall 1: Hyperparameter Tuning on Test Data】")
print("❌ Bad practice: Optimize on test set")
print("```python")
print("X_train, X_test = train_test_split(X, y)")
print("# Evaluate multiple models and select best on test set")
print("for model in models:")
print("    score = model.score(X_test, y_test)  # WRONG!")
print("best_model = models[best_idx]")
print("```")

print("\n✅ Correct approach: Train/Validation/Test split")
print("```python")
print("X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3)")
print("X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)")
print("# Optimize on Validation, evaluate on Test only")
print("```")

print("\n【Pitfall 2: Data Leakage in Cross-Validation】")
print("⚠️ Running preprocessing (StandardScaler, Imputer) on all data")
print("→ Test data information leaks into training")

print("\n✅ Solution: Use Pipeline to include preprocessing")
print("```python")
print("from sklearn.pipeline import Pipeline")
print("from sklearn.preprocessing import StandardScaler")
print("")
print("pipeline = Pipeline([")
print("    ('scaler', StandardScaler()),")
print("    ('model', RandomForestRegressor())")
print("])")
print("cv_scores = cross_val_score(pipeline, X, y, cv=5)")
print("```")

print("\n【Pitfall 3: Excessive Optuna Iterations】")
print("⚠️ Using n_trials=1000 for optimization → overfitting risk")
print("→ Overfit to validation set")

print("\n✅ Solution: Early Stopping + Test Evaluation")
print("```python")
print("study.optimize(objective, n_trials=100,")
print("    callbacks=[optuna.study.MaxTrialsCallback(100)])")
print("# Evaluate final model on test set")
print("```")

print("\n【Pitfall 4: Excessive Ensemble Complexity】")
print("⚠️ Combining Stacking + Voting + Blending in multiple stages")
print("→ Increased training time, lost interpretability, overfitting")

print("\n✅ Solution: Simple ensemble (2-3 models)")
print("```python")
print("# Simple Stacking (3 base models + 1 meta model)")
print("stacking = StackingRegressor(")
print("    estimators=[('rf', RF), ('gb', GB), ('lgbm', LGBM)],")
print("    final_estimator=Ridge(),")
print("    cv=5")
print(")")
print("```")

print("\n【Pitfall 5: Complex Models on Small Datasets】")
print("⚠️ Training Neural Network on 50 samples")
print("→ Overfitting, unstable")

print("\n✅ Solution: Select model based on data size")
print("```python")
print("if len(X) < 100:")
print("    model = Ridge()  # Linear model")
print("elif len(X) < 1000:")
print("    model = RandomForest()  # Tree-based")
print("else:")
print("    model = NeuralNetwork()  # Deep learning")
print("```")

Summary

In this chapter, we learned model selection and hyperparameter optimization.

Key Takeaways:

Model Selection: Choose appropriate models based on data size (small: Ridge, medium: RF, large: NN)
Cross-Validation: Apply K-Fold, Stratified, Time Series appropriately
Optimization Methods: Progression from Grid < Random < Bayesian (Optuna) for efficiency
Ensemble Methods: Stacking > Voting > Boosting > Bagging in performance hierarchy
Real-world Case Study: Li-ion battery capacity prediction showed 30%+ performance improvement
Environment Management: Strict library versioning and random seed control
Common Pitfalls: Avoid test set tuning, data leakage, overfitting, and oversized models on small data

Next Chapter Preview: Chapter 4 covers Explainable AI (XAI). We'll use SHAP, LIME, and attention visualization to understand prediction physical meanings and explore real-world applications and career paths.

Chapter 3 Checklist

Model Selection

[ ] Evaluate Data Size
[ ] Samples < 100 → Ridge/Lasso (regularized linear models)
[ ] Samples 100-1000 → Random Forest/Gradient Boosting
[ ] Samples > 1000 → Neural Network/Deep Learning
[ ] Verify sample-to-feature ratio > 10:1
[ ] Balance Interpretability vs. Accuracy
[ ] Interpretability priority (material design guidelines) → Ridge, Decision Tree
[ ] Accuracy priority (maximize prediction) → Gradient Boosting, Stacking
[ ] Balanced approach → Random Forest
[ ] Verify Model Complexity
[ ] Check learning curves for overfitting (training error << validation error)
[ ] Tune regularization (alpha, lambda)
[ ] Set appropriate iteration count via Early Stopping

Cross-Validation

[ ] K-Fold CV (Standard)
[ ] Use K=5 for regression tasks
[ ] Use K=10 when computational resources available for better accuracy
[ ] Set shuffle=True to remove order effects
[ ] Stratified K-Fold (Classification/Imbalanced)
[ ] Essential for imbalanced datasets
[ ] Verify class distribution uniformity across folds
[ ] Verify minority class samples > K
[ ] Time Series Split (Sequential Data)
[ ] Apply to sequential material experiment data
[ ] Verify training always precedes test temporally
[ ] Absolutely prevent future data training (data leakage prevention)
[ ] Leave-One-Out CV (Small Datasets)
[ ] Use when samples < 50
[ ] Recognize high computational cost (n iterations)
[ ] Be aware of overly optimistic evaluation

Hyperparameter Optimization

[ ] Grid Search
[ ] Use when search space small (<100 combinations)
[ ] Focus on 2-3 important parameters
[ ] Exhaustively evaluate all combinations
[ ] Random Search
[ ] Use for large search spaces
[ ] Achieves Grid Search performance with 10-20% trials
[ ] Efficiently explore continuous parameters
[ ] Bayesian Optimization (Optuna)
[ ] Most efficient (recommended)
[ ] n_trials=50-100 sufficient for good performance
[ ] Parameter importance analysis improves interpretability
[ ] Set sampler algorithm (TPESampler) seed for reproducibility
[ ] Early Stopping
[ ] Essential for Boosting models
[ ] Stop at minimum validation error point
[ ] Prevents overfitting and reduces computation

Ensemble Learning

[ ] Bagging
[ ] Random Forest implements Bagging automatically
[ ] Stabilizes high-variance models
[ ] Expect 20-30% improvement with Decision Tree ensemble
[ ] Boosting
[ ] Test Gradient Boosting, LightGBM, XGBoost
[ ] Trade-off between learning_rate and n_estimators
[ ] Use Early Stopping to prevent overfitting
[ ] Stacking
[ ] Use 3-5 diverse base models
[ ] Keep meta-model simple (Ridge, Lasso)
[ ] Generate meta-features with cross-validation (cv=5)
[ ] Target 5-10% improvement over single best model
[ ] Voting
[ ] Simpler than Stacking
[ ] Weight models by performance
[ ] 3 models typically sufficient

Avoiding Practical Pitfalls

[ ] No Test Set Tuning
[ ] Use Train/Validation/Test split
[ ] Optimize on Validation, evaluate on Test only
[ ] Use test set exactly once
[ ] Prevent Cross-Validation Data Leakage
[ ] Embed preprocessing in Pipeline
[ ] Perform preprocessing independently per fold (StandardScaler.fit)
[ ] Execute feature selection within cross-validation
[ ] Avoid Excessive Optimization
[ ] Use Optuna n_trials < 200 (recommend 100)
[ ] Monitor validation vs. test performance gap
[ ] Start with simple models
[ ] Avoid Ensemble Overcomplexity
[ ] Limit Stacking to single level (no multi-stage)
[ ] Keep base models ≤ 5
[ ] Evaluate training time vs. performance trade-off
[ ] Avoid Complex Models on Small Data
[ ] Samples < 100 → linear models
[ ] Samples < 1000 → tree-based models
[ ] Deep learning only for large data (>1000)

Reproducibility Assurance

[ ] Version Management
[ ] Record scikit-learn, Optuna, LightGBM versions
[ ] Create requirements.txt
[ ] Consider Docker for environment consistency (recommended)
[ ] Random Seed Configuration
[ ] NumPy: np.random.seed(42)
[ ] scikit-learn: random_state=42
[ ] Optuna: TPESampler(seed=42)
[ ] Use consistent seed across all random elements
[ ] Save Optimization History
[ ] Save Optuna study.trials_dataframe() to CSV
[ ] Save best parameters as JSON
[ ] Save learning curves as images
[ ] Model Persistence
[ ] Save trained model with joblib.dump()
[ ] Apply identical preprocessing during prediction
[ ] Version control models (Git LFS recommended)

Li-ion Battery Capacity Prediction Case Study

[ ] Data Preparation
[ ] Generate composition, structure, process condition features
[ ] Split Train/Test (80/20)
[ ] Handle missing values and outliers pre-training
[ ] Step 1: Model Comparison
[ ] Evaluate 5 models (Ridge, RF, GB, LGBM, Stacking)
[ ] Compare using MAE, RMSE, R²
[ ] Identify best single model
[ ] Step 2: Optuna Optimization
[ ] Optimize 8 LightGBM hyperparameters
[ ] Run with n_trials=100
[ ] Analyze parameter importance
[ ] Step 3: Final Stacking
[ ] Use 3 optimized models as base
[ ] Select Ridge as meta-model
[ ] Achieve >30% improvement from Baseline (Ridge)

Model Selection and Optimization Quality Metrics

[ ] Prediction Accuracy
[ ] MAE (regression) < 20% of data standard deviation
[ ] R² > 0.8 (R² > 0.7 acceptable for materials science)
[ ] Verify RMSE (evaluate outlier influence)
[ ] Generalization Performance
[ ] Training-test error gap < 10%
[ ] Small cross-validation std (stability)
[ ] Verify performance maintenance on new data
[ ] Computational Efficiency
[ ] Training time < 1 hour (medium-scale data)
[ ] Inference time < 1 second/sample
[ ] Optuna optimization time < 10 minutes (n_trials=100)
[ ] Interpretability
[ ] Identify important variables via feature_importances_
[ ] Use SHAP values (next chapter) for physical interpretation
[ ] Verify validity with domain expert

References

Akiba, T., Sano, S., Yanase, T., et al. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD, 2623-2631. DOI: 10.1145/3292500.3330701
Bergstra, J. & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281-305.
Dietterich, T. G. (2000). Ensemble methods in machine learning. International Workshop on Multiple Classifier Systems, 1-15. Springer.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241-259. DOI: 10.1016/S0893-6080(05)80023-1

← Chapter 2 | Chapter 4 →

Chapter 3: Model Selection and Hyperparameter Optimization

Learning Objectives

3.1 Model Selection Strategy

Data Size and Model Complexity

Interpretability vs. Accuracy Trade-off

Linear Models, Tree-based, Neural Networks, and GNNs Comparison

3.2 Cross-Validation Techniques

K-Fold Cross-Validation

Stratified K-Fold

Time Series Split (for Sequential Data)

Leave-One-Out CV (for Small Datasets)

3.3 Hyperparameter Optimization

Grid Search (Exhaustive Search)

Random Search

Bayesian Optimization (Optuna, Hyperopt)

Early Stopping

3.4 Ensemble Learning

Bagging (Bootstrap Aggregating)

Boosting (AdaBoost, Gradient Boosting, LightGBM, XGBoost)

Stacking

Voting

3.5 Case Study: Li-ion Battery Capacity Prediction

Step 1: Comparing Five Models

Step 2: Automated Hyperparameter Optimization with Optuna

Step 3: Final Stacking Ensemble for Peak Performance

Exercises

Exercise 1 (Difficulty: Easy)

Exercise 2 (Difficulty: Medium)

Exercise 3 (Difficulty: Hard)

3.6 Model Optimization Environment and Reproducibility

Library Version Management

Random Seed Management

Practical Pitfalls (Model Selection and Optimization)

Summary

Chapter 3 Checklist

Model Selection

Cross-Validation

Hyperparameter Optimization

Ensemble Learning

Avoiding Practical Pitfalls

Reproducibility Assurance

Li-ion Battery Capacity Prediction Case Study

Model Selection and Optimization Quality Metrics

References

Disclaimer