This chapter covers the fundamentals of AutoML: what AutoML is, how it differs from traditional ML workflows, the components of an AutoML system, and the basics of Neural Architecture Search (NAS), meta-learning, and AutoML evaluation.
Learning Objectives
By reading this chapter, you will master the following:
- ✅ Understand the concept and purpose of AutoML
- ✅ Explain the differences from traditional ML workflows
- ✅ Comprehend the components of AutoML and their roles
- ✅ Understand the fundamentals of Neural Architecture Search (NAS)
- ✅ Learn the concepts and applications of Meta-Learning
- ✅ Master AutoML evaluation methods
1.1 What is AutoML
Democratization of Machine Learning
AutoML (Automated Machine Learning) is a technology that automates the machine learning model development process. It aims to enable even non-data scientists to build high-quality machine learning models.
"AutoML realizes the democratization of machine learning, allowing more people to utilize AI technology"
Purpose of AutoML
| Purpose | Description | Effect |
|---|---|---|
| Efficiency | Automate manual processes | Reduce development time |
| Reduce Expertise Requirements | Deep machine learning knowledge not required | Lower barriers to entry |
| Performance Improvement | Discover optimal solutions through systematic search | Eliminate human bias |
| Reproducibility | Standardized processes | Improve result reliability |
Comparison with Traditional ML Workflow
In a traditional workflow, a practitioner iterates manually through data preprocessing, feature engineering, model selection, and hyperparameter tuning, each round guided by expertise and trial and error. AutoML automates these stages end to end, turning the manual loop into a systematic search.
Advantages and Disadvantages of AutoML
Advantages
- Time Savings: Reduce work from weeks to hours
- Accessibility: Usable by people with less expertise
- Optimization: Discover combinations humans might overlook
- Best Practices: Automatically applied
Disadvantages
- Computational Cost: Large-scale searches require many resources
- Black Box Nature: Reduced process transparency
- Flexibility Constraints: Customization can be difficult
- Neglect of Domain Knowledge: Background knowledge about the data may go unused
Example: Effects of AutoML
# Requirements:
# - Python 3.9+
# - scikit-learn>=1.0.0
"""
Example: Effects of AutoML
Purpose: Compare a fixed-parameter model with a simple AutoML-style grid search
Target: Advanced
Execution time: 1-5 minutes
Dependencies: scikit-learn
"""
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import time
# Data preparation
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Traditional approach (fixed parameters)
start_time = time.time()
model_manual = RandomForestClassifier(n_estimators=100, random_state=42)
model_manual.fit(X_train, y_train)
y_pred_manual = model_manual.predict(X_test)
acc_manual = accuracy_score(y_test, y_pred_manual)
time_manual = time.time() - start_time
# AutoML-style simple implementation (grid search)
from sklearn.model_selection import GridSearchCV
start_time = time.time()
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10]
}
model_auto = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=3,
n_jobs=-1
)
model_auto.fit(X_train, y_train)
y_pred_auto = model_auto.predict(X_test)
acc_auto = accuracy_score(y_test, y_pred_auto)
time_auto = time.time() - start_time
print("=== Traditional vs AutoML-style Approach ===")
print(f"\nTraditional Approach:")
print(f" Accuracy: {acc_manual:.4f}")
print(f" Time: {time_manual:.2f}s")
print(f"\nAutoML-style Approach:")
print(f" Accuracy: {acc_auto:.4f}")
print(f" Time: {time_auto:.2f}s")
print(f" Best Parameters: {model_auto.best_params_}")
print(f"\nImprovement:")
print(f" Accuracy Gain: {(acc_auto - acc_manual) * 100:.2f}%")
Example Output:
=== Traditional vs AutoML-style Approach ===
Traditional Approach:
Accuracy: 0.9649
Time: 0.15s
AutoML-style Approach:
Accuracy: 0.9737
Time: 12.34s
Best Parameters: {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 100}
Improvement:
Accuracy Gain: 0.88%
1.2 Components of AutoML
AutoML systems consist of multiple components to automate the entire machine learning pipeline.
Data Preprocessing Automation
Automates the transformation of raw data into a learnable format:
- Missing Value Handling: Automatic detection and imputation strategy selection
- Outlier Detection: Detection using statistical methods or Isolation Forest
- Scaling: Automatic selection of StandardScaler, MinMaxScaler
- Encoding: Automatic conversion of categorical variables
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - scikit-learn>=1.0.0
"""
Example: Automated data preprocessing pipeline
Purpose: Impute missing values, scale numeric columns, and encode categoricals
Target: Intermediate
Execution time: ~5 seconds
Dependencies: numpy, pandas, scikit-learn
"""
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# Sample data (with missing values)
np.random.seed(42)
data = pd.DataFrame({
'age': [25, 30, np.nan, 45, 50, 35],
'salary': [50000, 60000, 55000, np.nan, 80000, 65000],
'department': ['Sales', 'IT', 'HR', 'IT', 'Sales', np.nan]
})
print("=== Original Data ===")
print(data)
# Automatic preprocessing pipeline
numeric_features = ['age', 'salary']
categorical_features = ['department']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Execute preprocessing
data_transformed = preprocessor.fit_transform(data)
print("\n=== Data Shape After Preprocessing ===")
print(f"Shape: {data_transformed.shape}")
print(f"Missing Values: 0 (all handled)")
Feature Engineering
Automatically generates new features:
- Polynomial Features: Combinations of existing features
- Aggregate Features: Statistics by group
- Time Series Features: Lags, moving averages, seasonality
- Text Features: TF-IDF, embedding representations
# Requirements:
# - Python 3.9+
# - scikit-learn>=1.0.0
"""
Example: Polynomial feature generation
Purpose: Generate new features automatically and measure the performance gain
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: scikit-learn
"""
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import make_regression
# Sample data
X, y = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)
# Polynomial feature generation
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print("=== Feature Engineering ===")
print(f"Original feature count: {X.shape[1]}")
print(f"Feature count after generation: {X_poly.shape[1]}")
print(f"\nGenerated features:")
print(poly.get_feature_names_out(['x1', 'x2']))
# Performance comparison
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Original features
model_original = LinearRegression()
model_original.fit(X, y)
y_pred_original = model_original.predict(X)
r2_original = r2_score(y, y_pred_original)
# Polynomial features
model_poly = LinearRegression()
model_poly.fit(X_poly, y)
y_pred_poly = model_poly.predict(X_poly)
r2_poly = r2_score(y, y_pred_poly)
print(f"\n=== Performance Comparison ===")
print(f"Original Features R²: {r2_original:.4f}")
print(f"Polynomial Features R²: {r2_poly:.4f}")
print(f"Improvement: {(r2_poly - r2_original) * 100:.2f}%")
Model Selection
Automatically selects the optimal algorithm for the task and data:
- Linear Models: Logistic Regression, Ridge, Lasso
- Tree-based: Decision Tree, Random Forest, XGBoost
- Support Vector Machines: SVC, SVR
- Neural Networks: MLP, CNN, RNN
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
# Data preparation
iris = load_iris()
X, y = iris.data, iris.target
# Evaluate multiple models
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Decision Tree': DecisionTreeClassifier(),
'Random Forest': RandomForestClassifier(),
'SVM': SVC(),
'KNN': KNeighborsClassifier()
}
print("=== Automatic Model Selection ===")
results = {}
for name, model in models.items():
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
results[name] = scores.mean()
print(f"{name:20s}: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Select best model
best_model = max(results, key=results.get)
print(f"\nBest Model: {best_model} (Accuracy: {results[best_model]:.4f})")
Hyperparameter Optimization
Automatically tunes model parameters:
- Grid Search: Explore all combinations
- Random Search: Random sampling
- Bayesian Optimization: Efficient search that models past trial results (a sketch follows the random-search example below)
- Evolutionary Algorithms: Genetic algorithms
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from scipy.stats import randint, uniform
# Data preparation
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Hyperparameter search space
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': [None] + list(range(5, 50, 5)),
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': uniform(0.1, 0.9)
}
# Random search
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions=param_distributions,
n_iter=50,
cv=3,
random_state=42,
n_jobs=-1,
verbose=0
)
print("=== Hyperparameter Optimization ===")
random_search.fit(X_train, y_train)
print(f"Best Score (CV): {random_search.best_score_:.4f}")
print(f"Best Parameters:")
for param, value in random_search.best_params_.items():
print(f" {param}: {value}")
# Evaluate on test set
test_score = random_search.score(X_test, y_test)
print(f"\nTest Set Accuracy: {test_score:.4f}")
AutoML Workflow Diagram
The full workflow chains the components above: data preprocessing → feature engineering → model selection → hyperparameter optimization → final evaluation, with the search loop iterating over the later stages until a time or resource budget is exhausted.
1.3 Neural Architecture Search (NAS)
Concept of NAS
Neural Architecture Search (NAS) is a technique that automatically designs neural network architectures: network structures that once had to be hand-crafted by human experts are instead discovered by a search algorithm.
NAS can be described as "a neural network that designs neural networks"
Search Space
Design elements explored by NAS:
- Layer Types: Convolutional layers, fully connected layers, pooling layers, etc.
- Number of Layers: Network depth
- Layer Parameters: Number of filters, kernel size, stride, etc.
- Connection Patterns: Skip connections, residual connections, etc.
- Activation Functions: ReLU, Sigmoid, Tanh, etc.
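To make this concrete, the sketch below encodes the design elements above as a plain dictionary and draws one random candidate architecture from it (the encoding and names are illustrative, not a standard API):
import random

# Illustrative NAS search space covering the design elements listed above
search_space = {
    'n_layers': [2, 3, 4, 5],                         # network depth
    'layer_type': ['conv3x3', 'conv5x5', 'maxpool'],  # layer types
    'n_filters': [16, 32, 64, 128],                   # layer parameters
    'skip_connection': [True, False],                 # connection pattern
    'activation': ['relu', 'sigmoid', 'tanh'],        # activation function
}

def sample_architecture(space):
    """Draw one candidate architecture uniformly at random."""
    depth = random.choice(space['n_layers'])
    return {
        'layers': [{'type': random.choice(space['layer_type']),
                    'filters': random.choice(space['n_filters'])}
                   for _ in range(depth)],
        'skip_connection': random.choice(space['skip_connection']),
        'activation': random.choice(space['activation']),
    }

random.seed(0)
print(sample_architecture(search_space))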
Search Strategies
1. Random Search
Randomly samples and evaluates architectures. Simple but inefficient.
2. Reinforcement Learning-based
A controller (RNN) generates architectures and learns using their performance as rewards.
Reward function:
$$ R = \text{Accuracy} - \lambda \cdot \text{Complexity} $$
- $\text{Accuracy}$: Validation accuracy
- $\text{Complexity}$: Model complexity (number of parameters, etc.)
- $\lambda$: Complexity penalty coefficient
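As a worked example of this reward with assumed values (validation accuracy 0.92, complexity 3.5 million parameters, λ = 0.01 per million parameters):
# Illustrative reward calculation: R = Accuracy - lambda * Complexity
accuracy = 0.92    # validation accuracy (assumed)
complexity = 3.5   # model complexity in millions of parameters (assumed)
lam = 0.01         # complexity penalty coefficient (assumed)
reward = accuracy - lam * complexity
print(f"R = {accuracy} - {lam} * {complexity} = {reward:.3f}")  # R = 0.885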
3. Evolutionary Algorithms
Uses genetic algorithms to evolve superior architectures.
- Mutation: Add/remove layers, change parameters
- Crossover: Combine two architectures
- Selection: Retain high-performing architectures
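A toy sketch of these three operations, with architectures encoded simply as tuples of hidden-layer widths (real NAS encodings are much richer):
import random

def mutate(arch):
    """Mutation: add, remove, or resize a layer."""
    arch = list(arch)
    op = random.choice(['add', 'remove', 'resize'])
    if op == 'add':
        arch.insert(random.randrange(len(arch) + 1), random.choice([32, 64, 128]))
    elif op == 'remove' and len(arch) > 1:
        arch.pop(random.randrange(len(arch)))
    else:
        arch[random.randrange(len(arch))] = random.choice([32, 64, 128, 256])
    return tuple(arch)

def crossover(a, b):
    """Crossover: splice the front of one parent onto the back of the other."""
    return a[:len(a) // 2] + b[len(b) // 2:]

def select(population, fitness, k):
    """Selection: retain the k highest-scoring architectures."""
    return sorted(population, key=fitness, reverse=True)[:k]

random.seed(42)
population = [(64,), (128, 64), (256, 128, 64)]
child = mutate(crossover(population[0], population[2]))
# Placeholder fitness preferring smaller networks; a real NAS would train
# and evaluate each architecture to obtain its fitness
survivors = select(population + [child], fitness=lambda a: -sum(a), k=3)
print(f"child={child}, survivors={survivors}")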
4. Gradient-based Methods (DARTS)
Relaxes the search space to be continuous and optimizes using gradient descent. Computationally efficient.
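The core trick can be sketched in a few lines: rather than choosing one operation per edge, DARTS computes a softmax-weighted mixture of all candidate operations, making the architecture parameters α differentiable (a conceptual NumPy sketch, not the actual DARTS implementation):
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Candidate operations on one edge of the network
ops = [lambda x: np.maximum(x, 0),  # ReLU-like operation
       np.tanh,                     # tanh operation
       lambda x: x]                 # identity (skip connection)

alpha = np.array([0.2, -0.1, 0.5])  # architecture parameters, learned by gradient descent
weights = softmax(alpha)

x = np.array([-1.0, 0.5, 2.0])
# Mixed output: weighted sum over all ops, differentiable with respect to alpha
mixed = sum(w * op(x) for w, op in zip(weights, ops))
print(f"Operation weights: {weights.round(3)}")
print(f"Mixed output: {mixed.round(3)}")
# After search, each edge is discretized to its highest-weighted operation
print(f"Selected operation index: {int(np.argmax(weights))}")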
NAS Implementation Example (Simplified Version)
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.0.0
"""
Example: NAS Implementation Example (Simplified Version)
Purpose: Search MLP architectures with random search
Target: Advanced
Execution time: 30-60 seconds
Dependencies: numpy, scikit-learn
"""
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
# Data preparation
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
digits.data, digits.target, test_size=0.2, random_state=42
)
# Simple NAS: Explore architectures with random search
def random_architecture_search(n_trials=10):
best_score = 0
best_architecture = None
print("=== Neural Architecture Search ===")
for i in range(n_trials):
# Randomly generate architecture
n_layers = np.random.randint(1, 4) # 1-3 layers
hidden_layer_sizes = tuple(
np.random.choice([32, 64, 128, 256]) for _ in range(n_layers)
)
activation = np.random.choice(['relu', 'tanh', 'logistic'])
# Train and evaluate model
model = MLPClassifier(
hidden_layer_sizes=hidden_layer_sizes,
activation=activation,
max_iter=100,
random_state=42
)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"Trial {i+1}: layers={hidden_layer_sizes}, "
f"activation={activation}, score={score:.4f}")
if score > best_score:
best_score = score
best_architecture = {
'hidden_layer_sizes': hidden_layer_sizes,
'activation': activation,
'score': score
}
return best_architecture
# Run NAS
best_arch = random_architecture_search(n_trials=10)
print(f"\n=== Best Architecture ===")
print(f"Layer Configuration: {best_arch['hidden_layer_sizes']}")
print(f"Activation Function: {best_arch['activation']}")
print(f"Accuracy: {best_arch['score']:.4f}")
NAS Challenges
| Challenge | Description | Countermeasure |
|---|---|---|
| Computational Cost | Evaluating thousands of architectures | Early stopping, proxy task usage |
| Search Space Size | Combinatorial explosion | Search space constraints, hierarchical search |
| Lack of Transferability | Search needed for each task | Transfer learning, meta-learning utilization |
| Overfitting | Overfitting to validation data | Regularization, use multiple datasets |
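One concrete form of the early-stopping countermeasure listed above is successive halving, which gives many candidates a small budget and repeatedly discards the worst performers. scikit-learn ships this search (still behind an experimental import as of recent versions); a sketch applying the same budget-allocation idea to hyperparameter search:
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from scipy.stats import randint

X, y = load_breast_cancer(return_X_y=True)

search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={'n_estimators': randint(50, 500),
                         'max_depth': randint(5, 50)},
    factor=3,  # each round keeps roughly the top 1/3 of candidates
    random_state=42,
    n_jobs=-1
)
search.fit(X, y)
print(f"Best CV Score: {search.best_score_:.4f}")
print(f"Best Parameters: {search.best_params_}")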
1.4 Meta-Learning
Learning to Learn
Meta-Learning is a method that "learns how to learn." It leverages experience from past tasks to efficiently learn new tasks.
"Learning the learning algorithm itself" - The essence of meta-learning
Few-shot Learning
A method for efficiently learning from a small number of samples.
N-way K-shot learning:
- N: Number of classes
- K: Number of samples per class
- Example: 5-way 1-shot = 5 classes, 1 sample each
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.0.0
"""
Example: N-way K-shot learning
Purpose: Simulate few-shot classification with a nearest-neighbor learner
Target: Intermediate
Execution time: 30-60 seconds
Dependencies: numpy, scikit-learn
"""
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
# Few-shot learning simulation
def few_shot_learning_demo(n_way=5, k_shot=3):
# Data preparation
digits = load_digits()
X, y = digits.data, digits.target
# Select task (n_way classes)
selected_classes = np.random.choice(10, n_way, replace=False)
# Support set (training: k_shot × n_way samples)
support_X, support_y = [], []
# Query set (testing)
query_X, query_y = [], []
for cls in selected_classes:
cls_indices = np.where(y == cls)[0]
selected = np.random.choice(cls_indices, k_shot + 10, replace=False)
# k_shot samples to support set
support_X.extend(X[selected[:k_shot]])
support_y.extend([cls] * k_shot)
# Remaining to query set
query_X.extend(X[selected[k_shot:]])
query_y.extend([cls] * 10)
support_X = np.array(support_X)
support_y = np.array(support_y)
query_X = np.array(query_X)
query_y = np.array(query_y)
# Few-shot learning (using KNN)
model = KNeighborsClassifier(n_neighbors=min(3, k_shot))
model.fit(support_X, support_y)
# Evaluate
accuracy = model.score(query_X, query_y)
print(f"=== {n_way}-way {k_shot}-shot Learning ===")
print(f"Support Set: {len(support_X)} samples")
print(f"Query Set: {len(query_X)} samples")
print(f"Accuracy: {accuracy:.4f}")
return accuracy
# Experiment with different settings
for k in [1, 3, 5]:
few_shot_learning_demo(n_way=5, k_shot=k)
print()
Transfer Learning
Transfer knowledge learned on one task to another task.
- Pre-trained Models: Use models trained on ImageNet, etc.
- Fine-tuning: Adjust for new tasks
- Domain Adaptation: Reduce inter-domain differences
Warm-starting
Use optimal parameters from past tasks as initial values to accelerate learning on new tasks.
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
# Task 1 and Task 2 (similar tasks)
X1, y1 = make_classification(n_samples=1000, n_features=20,
n_informative=15, random_state=42)
X2, y2 = make_classification(n_samples=1000, n_features=20,
n_informative=15, random_state=43)
print("=== Warm-starting Effect Verification ===")
# Cold start (learn Task 2 from scratch)
model_cold = SGDClassifier(max_iter=100, random_state=42)
model_cold.fit(X2[:100], y2[:100]) # Learn with small data
score_cold = model_cold.score(X2[100:], y2[100:])
# Warm start (pre-train on Task 1)
model_warm = SGDClassifier(max_iter=100, random_state=42)
model_warm.fit(X1, y1) # Learn on Task 1
model_warm.partial_fit(X2[:100], y2[:100]) # Additional learning on Task 2
score_warm = model_warm.score(X2[100:], y2[100:])
print(f"Cold Start Accuracy: {score_cold:.4f}")
print(f"Warm Start Accuracy: {score_warm:.4f}")
print(f"Improvement: {(score_warm - score_cold) * 100:.2f}%")
1.5 AutoML Evaluation
Performance Metrics
Metrics for evaluating AutoML system performance:
| Metric | Description | Importance |
|---|---|---|
| Prediction Accuracy | Model prediction performance | Most important |
| Search Time | Time to find optimal model | Practically important |
| Computational Cost | Required resources (CPU, GPU, memory) | Scalability |
| Robustness | Stability across different datasets | Generality |
Computational Cost
Quantifying AutoML computational cost:
$$ \text{Total Cost} = \sum_{i=1}^{n} C_i \times T_i $$
- $C_i$: Computational cost of i-th model (FLOPS, etc.)
- $T_i$: Training time of i-th model
- $n$: Total number of models evaluated
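A worked example with assumed values: three evaluated models with costs C = (2, 5, 3) GFLOPs and training times T = (10, 4, 6) seconds give a total cost of 2×10 + 5×4 + 3×6 = 58.
# Illustrative total-cost calculation: Total Cost = sum_i C_i * T_i
C = [2.0, 5.0, 3.0]   # per-model computational cost in GFLOPs (assumed)
T = [10.0, 4.0, 6.0]  # per-model training time in seconds (assumed)
total_cost = sum(c * t for c, t in zip(C, T))
print(f"Total Cost: {total_cost}")  # 58.0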
Reproducibility
Whether the same input produces the same results:
- Fixed Random Seeds: Reproducible experiments
- Pipeline Saving: Save trained models and preprocessing
- Version Control: Record library versions
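A minimal sketch of these three practices with scikit-learn and joblib (the file name is illustrative):
import sklearn
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Fixed random seed: the same input reproduces the same trained model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])
pipeline.fit(X, y)

# Pipeline saving: persist preprocessing and model together
joblib.dump(pipeline, 'automl_pipeline.joblib')  # illustrative file name
restored = joblib.load('automl_pipeline.joblib')

# Version control: record library versions alongside the artifact
print(f"scikit-learn version: {sklearn.__version__}")
print(f"Restored pipeline accuracy: {restored.score(X, y):.4f}")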
Interpretability
Understanding AutoML's decision process:
- Feature Importance: Which features are important
- Model Selection Reasoning: Why that model was chosen
- Hyperparameter Impact: Contribution of each parameter
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - scikit-learn>=1.0.0
"""
Example: Interpreting an AutoML-selected model
Purpose: Compare Gini and permutation feature importances
Target: Advanced
Execution time: 30-60 seconds
Dependencies: matplotlib, scikit-learn
"""
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Data preparation
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Model training
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Feature importance
feature_importance = model.feature_importances_
feature_names = iris.feature_names
# Permutation Importance
perm_importance = permutation_importance(
model, X_test, y_test, n_repeats=10, random_state=42
)
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Feature importance
axes[0].barh(feature_names, feature_importance)
axes[0].set_xlabel('Importance')
axes[0].set_title('Feature Importance (Gini)')
axes[0].grid(True, alpha=0.3)
# Permutation Importance
axes[1].barh(feature_names, perm_importance.importances_mean)
axes[1].set_xlabel('Importance')
axes[1].set_title('Permutation Importance')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("=== Interpretability Analysis ===")
for name, importance in zip(feature_names, feature_importance):
print(f"{name:20s}: {importance:.4f}")
1.6 Chapter Summary
What We Learned
AutoML Concepts
- Realizing democratization of machine learning
- Efficiency and reduced expertise requirements
- Differences and advantages over traditional methods
AutoML Components
- Data preprocessing automation
- Feature engineering
- Model selection and hyperparameter optimization
Neural Architecture Search
- Automatic network structure design
- Search strategies (RL, evolutionary, gradient-based)
- Strategies for managing computational cost
Meta-Learning
- Learning how to learn
- Few-shot learning, Transfer learning
- Acceleration through warm-starting
AutoML Evaluation
- Performance metrics (accuracy, time, cost)
- Importance of reproducibility and interpretability
AutoML Principles
| Principle | Description |
|---|---|
| Balance Automation and Transparency | Avoid black boxes, maintain interpretability |
| Efficiency | Search strategies considering computational resources |
| Generality | Applicable to various tasks and data |
| Leverage Domain Knowledge | Combination of automation and expertise |
| Continuous Improvement | Improve learning efficiency through meta-learning |
Next Chapter
In Chapter 2, we will learn about AutoML Tools and Frameworks:
- Auto-sklearn
- TPOT
- H2O AutoML
- Google Cloud AutoML
- AutoKeras
Exercises
Question 1 (Difficulty: easy)
List three main purposes of AutoML and explain each.
Answer Example
Answer:
Efficiency
- Description: Automate manual model development processes and significantly reduce development time
- Effect: Can reduce work from weeks to hours
Reduce Expertise Requirements
- Description: Enable building high-quality models without deep machine learning expertise
- Effect: More people can utilize AI technology (democratization)
Performance Improvement
- Description: Discover optimal combinations that humans might overlook through systematic search
- Effect: Eliminate human bias and objectively find the best model
Question 2 (Difficulty: medium)
Explain four search strategies for Neural Architecture Search (NAS) and describe the advantages and disadvantages of each.
Answer Example
Answer:
| Search Strategy | Description | Advantages | Disadvantages |
|---|---|---|---|
| Random Search | Randomly sample architectures | Simple implementation, easy parallelization | Inefficient, unsuitable for large-scale search |
| Reinforcement Learning-based | RNN controller generates architectures | Efficiently explores promising regions | High computational cost, stability issues |
| Evolutionary Algorithms | Evolve superior architectures through genetic operations | Maintains diversity, avoids local optima | Slow convergence, requires large populations |
| Gradient-based (DARTS) | Relax search space and optimize with gradient descent | Computationally efficient, fast | Discretization errors, search space constraints |
Question 3 (Difficulty: medium)
Explain what "5-way 3-shot learning" means in few-shot learning and calculate the number of training samples in this setting.
Answer Example
Answer:
Meaning of "5-way 3-shot learning":
- 5-way: A task to classify 5 classes
- 3-shot: Use only 3 samples per class for training
Number of training samples:
$$ \text{Number of samples} = \text{Number of classes} \times \text{Samples per class} = 5 \times 3 = 15 $$
That is, learning 5-class classification with only 15 samples.
Concrete example:
- Classify 5 types of animals (dog, cat, bird, fish, horse)
- Use only 3 images of each animal (total 15 images) for training
- Become able to correctly classify new animal images
Question 4 (Difficulty: hard)
Complete the following code to implement a simple AutoML system. Include data preprocessing, model selection, and hyperparameter optimization.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Data preparation
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Implement AutoML system here
Answer Example
# Requirements:
# - Python 3.9+
# - scikit-learn>=1.0.0
"""
Example: A simple AutoML system
Purpose: Combine preprocessing, model selection, and hyperparameter optimization
Target: Advanced
Execution time: 1-5 minutes
Dependencies: scikit-learn
"""
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Data preparation
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print("=== Simple AutoML System ===\n")
# Step 1: Define model candidates and hyperparameter space
models = {
'Logistic Regression': {
'model': LogisticRegression(max_iter=1000),
'params': {
'classifier__C': [0.1, 1.0, 10.0],
'classifier__penalty': ['l2']
}
},
'Random Forest': {
'model': RandomForestClassifier(random_state=42),
'params': {
'classifier__n_estimators': [50, 100, 200],
'classifier__max_depth': [None, 10, 20],
'classifier__min_samples_split': [2, 5]
}
},
'SVM': {
'model': SVC(),
'params': {
'classifier__C': [0.1, 1.0, 10.0],
'classifier__kernel': ['rbf', 'linear']
}
}
}
# Step 2: Preprocessing pipeline + hyperparameter optimization for each model
best_overall_score = 0
best_overall_model = None
best_overall_name = None
for name, config in models.items():
print(f"--- {name} ---")
# Build pipeline (preprocessing + model)
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', config['model'])
])
# Hyperparameter optimization with grid search
grid_search = GridSearchCV(
pipeline,
param_grid=config['params'],
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
# Results
cv_score = grid_search.best_score_
test_score = grid_search.score(X_test, y_test)
print(f" Best CV Score: {cv_score:.4f}")
print(f" Test Score: {test_score:.4f}")
print(f" Best Parameters: {grid_search.best_params_}")
print()
# Update best model
if cv_score > best_overall_score:
best_overall_score = cv_score
best_overall_model = grid_search.best_estimator_
best_overall_name = name
# Step 3: Final results
print("=" * 50)
print(f"Best Model: {best_overall_name}")
print(f"CV Score: {best_overall_score:.4f}")
print(f"Test Score: {best_overall_model.score(X_test, y_test):.4f}")
print("=" * 50)
Example Output:
=== Simple AutoML System ===
--- Logistic Regression ---
Best CV Score: 0.9780
Test Score: 0.9825
Best Parameters: {'classifier__C': 1.0, 'classifier__penalty': 'l2'}
--- Random Forest ---
Best CV Score: 0.9648
Test Score: 0.9649
Best Parameters: {'classifier__max_depth': None, ...}
--- SVM ---
Best CV Score: 0.9758
Test Score: 0.9737
Best Parameters: {'classifier__C': 1.0, 'classifier__kernel': 'linear'}
==================================================
Best Model: Logistic Regression
CV Score: 0.9780
Test Score: 0.9825
==================================================
Question 5 (Difficulty: hard)
Explain the tradeoff between "computational cost" and "prediction accuracy" in AutoML, and describe how to balance them practically.
Answer Example
Answer:
Essence of Tradeoff:
| Aspect | High Accuracy Pursuit | Low Cost Pursuit |
|---|---|---|
| Search Range | Extensive search (thousands of models) | Limited search (dozens of models) |
| Time | Days to weeks | Hours to days |
| Resources | Large-scale GPU/cluster | Single machine |
| Accuracy Improvement | +1-2% improvement | Baseline achievement |
Strategies for Balance:
Staged Approach
- Phase 1: Fast search to narrow down promising model candidates (hours)
- Phase 2: Detailed optimization on candidates (days)
Early Stopping
- Terminate search if validation accuracy doesn't improve
- Set computational budget limits (time/cost)
Efficient Search Methods
- Use Bayesian optimization instead of random search
- Improve initial state with transfer learning or meta-learning
Task-dependent Priorities
- Production systems: Accuracy priority (allow high cost)
- Prototypes: Speed priority (emphasize low cost)
- Research: Balance both
Multi-objective Optimization
- Include computational cost in objective function
$$ \text{Objective} = \alpha \cdot \text{Accuracy} - (1-\alpha) \cdot \log(\text{Cost}) $$
- $\alpha$: Weight for accuracy and cost (0 to 1)
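A quick illustration with assumed numbers (α = 0.8; an expensive model with accuracy 0.97 and cost 100 versus a cheap one with accuracy 0.95 and cost 10):
# Illustrative multi-objective scores: alpha * Accuracy - (1 - alpha) * log(Cost)
import math
alpha = 0.8  # weight on accuracy (assumed)
for name, acc, cost in [('expensive', 0.97, 100), ('cheap', 0.95, 10)]:
    objective = alpha * acc - (1 - alpha) * math.log(cost)
    print(f"{name}: {objective:.4f}")
# expensive: 0.776 - 0.921 ~ -0.145; cheap: 0.760 - 0.461 ~ 0.299
# With this alpha, the cheaper model wins despite its lower accuracy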
Practical Recommendations:
- First explore with low cost to understand baseline performance
- Only perform high-cost search when business value is high
- Quantitatively evaluate cost and effect of 1% accuracy improvement
References
- Hutter, F., Kotthoff, L., & Vanschoren, J. (Eds.). (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer.
- Elsken, T., Metzen, J. H., & Hutter, F. (2019). Neural Architecture Search: A Survey. Journal of Machine Learning Research, 20(55), 1-21.
- Hospedales, T., Antoniou, A., Micaelli, P., & Storkey, A. (2021). Meta-Learning in Neural Networks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Feurer, M., & Hutter, F. (2019). Hyperparameter Optimization. In Automated Machine Learning (pp. 3-33). Springer.
- He, X., Zhao, K., & Chu, X. (2021). AutoML: A survey of the state-of-the-art. Knowledge-Based Systems, 212, 106622.