第1章：AutoML基礎

学習目標

この章を読むことで、以下を習得できます：

✅ AutoMLの概念と目的を理解する
✅ 従来のMLワークフローとの違いを説明できる
✅ AutoMLの構成要素とその役割を把握する
✅ Neural Architecture Search（NAS）の基本を理解する
✅ Meta-Learningの概念と応用を学ぶ
✅ AutoMLの評価方法を習得する

1.1 AutoMLとは

機械学習の民主化

AutoML（Automated Machine Learning）は、機械学習モデルの開発プロセスを自動化する技術です。データサイエンティストでなくても、高品質な機械学習モデルを構築できるようにすることを目指しています。

「AutoMLは機械学習の民主化を実現し、より多くの人々がAI技術を活用できるようにする」

AutoMLの目的

目的	説明	効果
効率化	手作業のプロセスを自動化	開発時間を短縮
専門知識の軽減	機械学習の深い知識が不要に	参入障壁を下げる
性能向上	体系的な探索で最適解を発見	人間のバイアスを排除
再現性	標準化されたプロセス	結果の信頼性向上

従来のMLワークフローとの比較

graph TD subgraph "従来のワークフロー" A1[データ収集] --> B1[手動前処理] B1 --> C1[特徴量エンジニアリング] C1 --> D1[モデル選択] D1 --> E1[ハイパーパラメータ調整] E1 --> F1[評価] F1 -->|試行錯誤| C1 end subgraph "AutoMLワークフロー" A2[データ収集] --> B2[自動前処理] B2 --> C2[自動特徴量生成] C2 --> D2[自動モデル選択] D2 --> E2[自動ハイパーパラメータ最適化] E2 --> F2[評価] end style A1 fill:#ffebee style A2 fill:#ffebee style B1 fill:#fff3e0 style B2 fill:#e8f5e9 style C1 fill:#f3e5f5 style C2 fill:#e8f5e9 style D1 fill:#e3f2fd style D2 fill:#e8f5e9 style E1 fill:#fce4ec style E2 fill:#e8f5e9

AutoMLのメリット・デメリット

メリット

時間短縮: 数週間かかる作業を数時間に短縮
アクセシビリティ: 専門知識が少ない人でも利用可能
最適化: 人間が見落とす組み合わせを発見
ベストプラクティス: 自動的に適用される

デメリット

計算コスト: 大規模な探索には多くのリソースが必要
ブラックボックス化: プロセスの透明性が低下
柔軟性の制約: カスタマイズが困難な場合がある
ドメイン知識の軽視: データの背景知識が活かせない

実例：AutoMLの効果

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import time

# データ準備
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 従来の手法（固定パラメータ）
start_time = time.time()
model_manual = RandomForestClassifier(n_estimators=100, random_state=42)
model_manual.fit(X_train, y_train)
y_pred_manual = model_manual.predict(X_test)
acc_manual = accuracy_score(y_test, y_pred_manual)
time_manual = time.time() - start_time

# AutoML風の簡易実装（グリッドサーチ）
from sklearn.model_selection import GridSearchCV

start_time = time.time()
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
model_auto = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    n_jobs=-1
)
model_auto.fit(X_train, y_train)
y_pred_auto = model_auto.predict(X_test)
acc_auto = accuracy_score(y_test, y_pred_auto)
time_auto = time.time() - start_time

print("=== 従来手法 vs AutoML的手法 ===")
print(f"\n従来手法:")
print(f"  精度: {acc_manual:.4f}")
print(f"  時間: {time_manual:.2f}秒")

print(f"\nAutoML的手法:")
print(f"  精度: {acc_auto:.4f}")
print(f"  時間: {time_auto:.2f}秒")
print(f"  最適パラメータ: {model_auto.best_params_}")

print(f"\n改善:")
print(f"  精度向上: {(acc_auto - acc_manual) * 100:.2f}%")

出力例：

=== 従来手法 vs AutoML的手法 ===

従来手法:
  精度: 0.9649
  時間: 0.15秒

AutoML的手法:
  精度: 0.9737
  時間: 12.34秒
  最適パラメータ: {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 100}

改善:
  精度向上: 0.88%

1.2 AutoMLの構成要素

AutoMLシステムは、機械学習パイプライン全体を自動化するために、複数の構成要素から成り立っています。

データ前処理の自動化

生データから学習可能な形式への変換を自動化します：

欠損値処理: 自動検出と補完戦略の選択
外れ値検出: 統計的手法やIsolation Forestによる検出
スケーリング: StandardScaler、MinMaxScalerの自動選択
エンコーディング: カテゴリカル変数の自動変換

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# サンプルデータ（欠損値を含む）
np.random.seed(42)
data = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 50, 35],
    'salary': [50000, 60000, 55000, np.nan, 80000, 65000],
    'department': ['Sales', 'IT', 'HR', 'IT', 'Sales', np.nan]
})

print("=== 元のデータ ===")
print(data)

# 自動前処理パイプライン
numeric_features = ['age', 'salary']
categorical_features = ['department']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# 前処理の実行
data_transformed = preprocessor.fit_transform(data)

print("\n=== 前処理後のデータ形状 ===")
print(f"形状: {data_transformed.shape}")
print(f"欠損値: 0（すべて処理済み）")

特徴量エンジニアリング

新しい特徴量を自動生成します：

多項式特徴量: 既存特徴量の組み合わせ
集約特徴量: グループごとの統計量
時系列特徴量: ラグ、移動平均、季節性
テキスト特徴量: TF-IDF、埋め込み表現

from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

# サンプルデータ
X, y = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)

# 多項式特徴量の生成
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print("=== 特徴量エンジニアリング ===")
print(f"元の特徴量数: {X.shape[1]}")
print(f"生成後の特徴量数: {X_poly.shape[1]}")
print(f"\n生成された特徴量:")
print(poly.get_feature_names_out(['x1', 'x2']))

# 性能比較
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 元の特徴量
model_original = LinearRegression()
model_original.fit(X, y)
y_pred_original = model_original.predict(X)
r2_original = r2_score(y, y_pred_original)

# 多項式特徴量
model_poly = LinearRegression()
model_poly.fit(X_poly, y)
y_pred_poly = model_poly.predict(X_poly)
r2_poly = r2_score(y, y_pred_poly)

print(f"\n=== 性能比較 ===")
print(f"元の特徴量のR²: {r2_original:.4f}")
print(f"多項式特徴量のR²: {r2_poly:.4f}")
print(f"改善: {(r2_poly - r2_original) * 100:.2f}%")

モデル選択

タスクとデータに最適なアルゴリズムを自動選択します：

線形モデル: Logistic Regression, Ridge, Lasso
ツリーベース: Decision Tree, Random Forest, XGBoost
サポートベクターマシン: SVC, SVR
ニューラルネットワーク: MLP, CNN, RNN

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# データ準備
iris = load_iris()
X, y = iris.data, iris.target

# 複数のモデルを評価
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier()
}

print("=== 自動モデル選択 ===")
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    results[name] = scores.mean()
    print(f"{name:20s}: {scores.mean():.4f} (+/- {scores.std():.4f})")

# 最良モデルの選択
best_model = max(results, key=results.get)
print(f"\n最良モデル: {best_model} (精度: {results[best_model]:.4f})")

ハイパーパラメータ最適化

モデルのパラメータを自動調整します：

グリッドサーチ: 全組み合わせを探索
ランダムサーチ: ランダムサンプリング
ベイズ最適化: 効率的な探索
進化的アルゴリズム: 遺伝的アルゴリズム

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# データ準備
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ハイパーパラメータ探索空間
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': [None] + list(range(5, 50, 5)),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)
}

# ランダムサーチ
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,
    cv=3,
    random_state=42,
    n_jobs=-1,
    verbose=0
)

print("=== ハイパーパラメータ最適化 ===")
random_search.fit(X_train, y_train)

print(f"最良スコア (CV): {random_search.best_score_:.4f}")
print(f"最良パラメータ:")
for param, value in random_search.best_params_.items():
    print(f"  {param}: {value}")

# テストセットでの評価
test_score = random_search.score(X_test, y_test)
print(f"\nテストセット精度: {test_score:.4f}")

AutoMLワークフロー図

graph TD A[生データ] --> B[データ前処理の自動化] B --> C[特徴量エンジニアリング] C --> D[モデル選択] D --> E[ハイパーパラメータ最適化] E --> F[アンサンブル] F --> G[最終モデル] B --> B1[欠損値処理] B --> B2[外れ値検出] B --> B3[スケーリング] C --> C1[多項式特徴量] C --> C2[集約特徴量] C --> C3[特徴選択] D --> D1[線形モデル] D --> D2[ツリーベース] D --> D3[ニューラルネット] E --> E1[グリッドサーチ] E --> E2[ベイズ最適化] E --> E3[進化的手法] style A fill:#ffebee style B fill:#fff3e0 style C fill:#f3e5f5 style D fill:#e3f2fd style E fill:#fce4ec style F fill:#e8f5e9 style G fill:#c8e6c9

1.3 Neural Architecture Search (NAS)

NASの概念

Neural Architecture Search（NAS）は、ニューラルネットワークのアーキテクチャを自動的に設計する技術です。人間が手作業で設計していたネットワーク構造を、アルゴリズムが自動的に探索します。

NASは「ニューラルネットワークを設計するニューラルネットワーク」とも言えます

探索空間

NASが探索する設計要素：

層の種類: 畳み込み層、全結合層、プーリング層など
層の数: ネットワークの深さ
層のパラメータ: フィルタ数、カーネルサイズ、ストライドなど
接続パターン: スキップ接続、残差接続など
活性化関数: ReLU、Sigmoid、Tanhなど

探索戦略

1. ランダムサーチ

アーキテクチャをランダムにサンプリングして評価します。シンプルですが、効率は低いです。

2. 強化学習ベース

コントローラ（RNN）がアーキテクチャを生成し、その性能を報酬として学習します。

報酬関数：

$$ R = \text{Accuracy} - \lambda \cdot \text{Complexity} $$

$\text{Accuracy}$: 検証精度
$\text{Complexity}$: モデルの複雑さ（パラメータ数など）
$\lambda$: 複雑さのペナルティ係数

3. 進化的アルゴリズム

遺伝的アルゴリズムを用いて、優れたアーキテクチャを進化させます。

突然変異: 層の追加・削除、パラメータ変更
交叉: 2つのアーキテクチャの組み合わせ
選択: 性能の高いアーキテクチャを残す

4. 勾配ベース手法（DARTS）

探索空間を連続緩和し、勾配降下法で最適化します。計算効率が高いです。

NAS実装例（簡易版）

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

# データ準備
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# 簡易NAS: ランダムサーチでアーキテクチャを探索
def random_architecture_search(n_trials=10):
    best_score = 0
    best_architecture = None

    print("=== Neural Architecture Search ===")
    for i in range(n_trials):
        # ランダムにアーキテクチャを生成
        n_layers = np.random.randint(1, 4)  # 1-3層
        hidden_layer_sizes = tuple(
            np.random.choice([32, 64, 128, 256]) for _ in range(n_layers)
        )
        activation = np.random.choice(['relu', 'tanh', 'logistic'])

        # モデルの訓練と評価
        model = MLPClassifier(
            hidden_layer_sizes=hidden_layer_sizes,
            activation=activation,
            max_iter=100,
            random_state=42
        )
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)

        print(f"Trial {i+1}: layers={hidden_layer_sizes}, "
              f"activation={activation}, score={score:.4f}")

        if score > best_score:
            best_score = score
            best_architecture = {
                'hidden_layer_sizes': hidden_layer_sizes,
                'activation': activation,
                'score': score
            }

    return best_architecture

# NASの実行
best_arch = random_architecture_search(n_trials=10)

print(f"\n=== 最良アーキテクチャ ===")
print(f"層構成: {best_arch['hidden_layer_sizes']}")
print(f"活性化関数: {best_arch['activation']}")
print(f"精度: {best_arch['score']:.4f}")

NASの課題

課題	説明	対策
計算コスト	数千のアーキテクチャを評価	早期停止、プロキシタスク使用
探索空間の広さ	組み合わせ爆発	探索空間の制約、階層的探索
転移性の欠如	タスクごとに探索が必要	転移学習、メタラーニング活用
過学習	検証データへの過適合	正則化、複数データセット使用

1.4 Meta-Learning

Learning to Learn

Meta-Learning（メタ学習）は、「学習の仕方を学習する」手法です。過去のタスクでの経験を活用して、新しいタスクを効率的に学習します。

「学習アルゴリズム自体を学習する」- メタ学習の本質

Few-shot Learning

少数のサンプルから効率的に学習する手法です。

N-way K-shot学習：

N: クラス数
K: 各クラスのサンプル数
例: 5-way 1-shot = 5クラス、各1サンプル

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Few-shot学習のシミュレーション
def few_shot_learning_demo(n_way=5, k_shot=3):
    # データ準備
    digits = load_digits()
    X, y = digits.data, digits.target

    # タスクの選択（n_wayクラス）
    selected_classes = np.random.choice(10, n_way, replace=False)

    # サポートセット（学習用: k_shot × n_way サンプル）
    support_X, support_y = [], []
    # クエリセット（テスト用）
    query_X, query_y = [], []

    for cls in selected_classes:
        cls_indices = np.where(y == cls)[0]
        selected = np.random.choice(cls_indices, k_shot + 10, replace=False)

        # k_shotサンプルをサポートセットに
        support_X.extend(X[selected[:k_shot]])
        support_y.extend([cls] * k_shot)

        # 残りをクエリセットに
        query_X.extend(X[selected[k_shot:]])
        query_y.extend([cls] * 10)

    support_X = np.array(support_X)
    support_y = np.array(support_y)
    query_X = np.array(query_X)
    query_y = np.array(query_y)

    # Few-shot学習（KNN使用）
    model = KNeighborsClassifier(n_neighbors=min(3, k_shot))
    model.fit(support_X, support_y)

    # 評価
    accuracy = model.score(query_X, query_y)

    print(f"=== {n_way}-way {k_shot}-shot学習 ===")
    print(f"サポートセット: {len(support_X)}サンプル")
    print(f"クエリセット: {len(query_X)}サンプル")
    print(f"精度: {accuracy:.4f}")

    return accuracy

# 異なる設定で実験
for k in [1, 3, 5]:
    few_shot_learning_demo(n_way=5, k_shot=k)
    print()

Transfer Learning

あるタスクで学習した知識を別のタスクに転移させます。

事前学習済みモデル: ImageNet等で学習したモデルを利用
ファインチューニング: 新しいタスクで調整
ドメイン適応: ドメイン間の差を縮小

Warm-starting

過去のタスクでの最適パラメータを初期値として使用し、新しいタスクの学習を高速化します。

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

# タスク1とタスク2（類似したタスク）
X1, y1 = make_classification(n_samples=1000, n_features=20,
                             n_informative=15, random_state=42)
X2, y2 = make_classification(n_samples=1000, n_features=20,
                             n_informative=15, random_state=43)

print("=== Warm-starting効果の検証 ===")

# コールドスタート（タスク2を最初から学習）
model_cold = SGDClassifier(max_iter=100, random_state=42)
model_cold.fit(X2[:100], y2[:100])  # 少ないデータで学習
score_cold = model_cold.score(X2[100:], y2[100:])

# ウォームスタート（タスク1で事前学習）
model_warm = SGDClassifier(max_iter=100, random_state=42)
model_warm.fit(X1, y1)  # タスク1で学習
model_warm.partial_fit(X2[:100], y2[:100])  # タスク2で追加学習
score_warm = model_warm.score(X2[100:], y2[100:])

print(f"コールドスタート精度: {score_cold:.4f}")
print(f"ウォームスタート精度: {score_warm:.4f}")
print(f"改善: {(score_warm - score_cold) * 100:.2f}%")

1.5 AutoMLの評価

Performance Metrics

AutoMLシステムの性能を評価する指標：

指標	説明	重要性
予測精度	モデルの予測性能	最も重要
探索時間	最適モデルを見つけるまでの時間	実用上重要
計算コスト	必要なリソース（CPU、GPU、メモリ）	スケーラビリティ
ロバスト性	異なるデータセットでの安定性	汎用性

計算コスト

AutoMLの計算コストを定量化：

$$ \text{Total Cost} = \sum_{i=1}^{n} C_i \times T_i $$

$C_i$: i番目のモデルの計算コスト（FLOPS等）
$T_i$: i番目のモデルの訓練時間
$n$: 評価したモデルの総数

再現性

同じ入力で同じ結果が得られるか：

乱数シード固定: 再現可能な実験
パイプラインの保存: 学習済みモデルと前処理の保存
バージョン管理: ライブラリバージョンの記録

解釈可能性

AutoMLの決定プロセスを理解する：

特徴量重要度: どの特徴量が重要か
モデル選択理由: なぜそのモデルが選ばれたか
ハイパーパラメータの影響: 各パラメータの寄与度

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt

# データ準備
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# モデル訓練
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 特徴量重要度
feature_importance = model.feature_importances_
feature_names = iris.feature_names

# Permutation Importance
perm_importance = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

# 可視化
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 特徴量重要度
axes[0].barh(feature_names, feature_importance)
axes[0].set_xlabel('重要度')
axes[0].set_title('特徴量重要度（Gini）')
axes[0].grid(True, alpha=0.3)

# Permutation Importance
axes[1].barh(feature_names, perm_importance.importances_mean)
axes[1].set_xlabel('重要度')
axes[1].set_title('Permutation Importance')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("=== 解釈可能性分析 ===")
for name, importance in zip(feature_names, feature_importance):
    print(f"{name:20s}: {importance:.4f}")

1.6 本章のまとめ

学んだこと

AutoMLの概念
- 機械学習の民主化を実現
- 効率化と専門知識の軽減
- 従来手法との違いと利点
AutoMLの構成要素
- データ前処理の自動化
- 特徴量エンジニアリング
- モデル選択とハイパーパラメータ最適化
Neural Architecture Search
- ネットワーク構造の自動設計
- 探索戦略（RL、進化的、勾配ベース）
- 計算コストとの戦い
Meta-Learning
- 学習の仕方を学習
- Few-shot learning、Transfer learning
- Warm-startingによる高速化
AutoMLの評価
- 性能指標（精度、時間、コスト）
- 再現性と解釈可能性の重要性

AutoMLの原則

原則	説明
自動化と透明性のバランス	ブラックボックス化を避け、解釈可能性を維持
効率性	計算リソースを考慮した探索戦略
汎用性	様々なタスクとデータに適用可能
ドメイン知識の活用	自動化と専門知識の組み合わせ
継続的改善	メタ学習による学習効率の向上

次の章へ

第2章では、AutoMLツールとフレームワークを学びます：

Auto-sklearn
TPOT
H2O AutoML
Google Cloud AutoML
AutoKeras

演習問題

問題1（難易度：easy）

AutoMLの主な目的を3つ挙げ、それぞれを説明してください。

解答例

解答：

効率化
- 説明: 手作業で行っていたモデル開発プロセスを自動化し、開発時間を大幅に短縮する
- 効果: 数週間かかる作業を数時間に短縮可能
専門知識の軽減
- 説明: 機械学習の深い専門知識がなくても、高品質なモデルを構築できるようにする
- 効果: より多くの人々がAI技術を活用可能になる（民主化）
性能向上
- 説明: 体系的な探索により、人間が見落としがちな最適な組み合わせを発見する
- 効果: 人間のバイアスを排除し、客観的に最良のモデルを見つける

問題2（難易度：medium）

Neural Architecture Search（NAS）の4つの探索戦略を説明し、それぞれの長所と短所を述べてください。

解答例

解答：

探索戦略	説明	長所	短所
ランダムサーチ	アーキテクチャをランダムにサンプリング	実装が簡単、並列化が容易	効率が低い、大規模探索に不向き
強化学習ベース	RNNコントローラがアーキテクチャを生成	有望な領域を効率的に探索	計算コストが高い、安定性に課題
進化的アルゴリズム	遺伝的操作で優れたアーキテクチャを進化	多様性を保持、局所最適を回避	収束が遅い、大規模な集団が必要
勾配ベース（DARTS）	探索空間を連続緩和し勾配降下法で最適化	計算効率が高い、高速	離散化誤差、探索空間に制約

問題3（難易度：medium）

Few-shot learningにおける「5-way 3-shot学習」とは何を意味するか説明し、この設定での学習サンプル数を計算してください。

解答例

解答：

「5-way 3-shot学習」の意味：

5-way: 5つのクラスを分類するタスク
3-shot: 各クラスにつき3個のサンプルのみを学習に使用

学習サンプル数：

$$ \text{サンプル数} = \text{クラス数} \times \text{各クラスのサンプル数} = 5 \times 3 = 15 $$

つまり、わずか15サンプルで5クラス分類を学習します。

具体例：

5種類の動物（犬、猫、鳥、魚、馬）を分類
各動物の画像を3枚ずつ（合計15枚）だけ学習に使用
新しい動物の画像を正しく分類できるようになる

問題4（難易度：hard）

以下のコードを完成させて、簡易的なAutoMLシステムを実装してください。データ前処理、モデル選択、ハイパーパラメータ最適化を含めること。

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# データ準備
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ここにAutoMLシステムを実装
# TODO: 前処理パイプライン、モデル選択、ハイパーパラメータ最適化

解答例

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import numpy as np

# データ準備
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("=== 簡易AutoMLシステム ===\n")

# ステップ1: モデル候補とハイパーパラメータ空間の定義
models = {
    'Logistic Regression': {
        'model': LogisticRegression(max_iter=1000),
        'params': {
            'classifier__C': [0.1, 1.0, 10.0],
            'classifier__penalty': ['l2']
        }
    },
    'Random Forest': {
        'model': RandomForestClassifier(random_state=42),
        'params': {
            'classifier__n_estimators': [50, 100, 200],
            'classifier__max_depth': [None, 10, 20],
            'classifier__min_samples_split': [2, 5]
        }
    },
    'SVM': {
        'model': SVC(),
        'params': {
            'classifier__C': [0.1, 1.0, 10.0],
            'classifier__kernel': ['rbf', 'linear']
        }
    }
}

# ステップ2: 各モデルで前処理パイプライン + ハイパーパラメータ最適化
best_overall_score = 0
best_overall_model = None
best_overall_name = None

for name, config in models.items():
    print(f"--- {name} ---")

    # パイプライン構築（前処理 + モデル）
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', config['model'])
    ])

    # グリッドサーチでハイパーパラメータ最適化
    grid_search = GridSearchCV(
        pipeline,
        param_grid=config['params'],
        cv=5,
        scoring='accuracy',
        n_jobs=-1
    )

    grid_search.fit(X_train, y_train)

    # 結果
    cv_score = grid_search.best_score_
    test_score = grid_search.score(X_test, y_test)

    print(f"  最良CVスコア: {cv_score:.4f}")
    print(f"  テストスコア: {test_score:.4f}")
    print(f"  最良パラメータ: {grid_search.best_params_}")
    print()

    # 最良モデルの更新
    if cv_score > best_overall_score:
        best_overall_score = cv_score
        best_overall_model = grid_search.best_estimator_
        best_overall_name = name

# ステップ3: 最終結果
print("=" * 50)
print(f"最良モデル: {best_overall_name}")
print(f"CVスコア: {best_overall_score:.4f}")
print(f"テストスコア: {best_overall_model.score(X_test, y_test):.4f}")
print("=" * 50)