Chapter 4 Quality Enhancements
This file collects quality enhancements to be integrated into chapter-4.md: a code reproducibility section, a practical pitfalls section with runnable fixes, and an end-of-chapter checklist.
Code Reproducibility Section (add after section 4.1)
Ensuring Code Reproducibility
Environment Setup:
# Chapter 4: Active Learning Strategies
# Required Library Versions
"""
Python: 3.9+
numpy: >=1.24.0, <2.0.0
scikit-learn: >=1.0.0
scipy: >=1.7.0
matplotlib: >=3.5.0
"""
import numpy as np
import random
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, Matern
# Ensure reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
# Recommended kernel configuration (for Active Learning)
kernel_default = ConstantKernel(1.0, constant_value_bounds=(1e-3, 1e3)) * \
Matern(length_scale=0.2, length_scale_bounds=(1e-2, 1e0), nu=2.5)
print("Environment setup complete (for Active Learning)")
Practical Pitfalls Section (add after section 4.2)
4.3 Practical Pitfalls and Solutions
Pitfall 1: Bias in Uncertainty Sampling
Problem: Uncertainty sampling tends to concentrate queries at the boundaries of the search space, where the GP's predictive variance is highest
Symptoms:
- Sampling concentrated near the boundaries
- Insufficient information about interior regions
- Uneven prediction accuracy across the space
Solution: Combine uncertainty sampling with an epsilon-greedy strategy
def epsilon_greedy_uncertainty_sampling(gp, X_candidate, epsilon=0.1):
"""
Uncertainty sampling with epsilon-greedy strategy
Parameters:
-----------
gp : GaussianProcessRegressor
Trained GP model
X_candidate : array (n_candidates, n_features)
Candidate points
epsilon : float
Probability of random search (0~1)
Returns:
--------
next_x : array
Next sampling point
"""
if np.random.rand() < epsilon:
# Random sampling with epsilon probability
next_idx = np.random.randint(len(X_candidate))
print(f" random search (ε={epsilon})")
else:
# Uncertainty sampling with (1-epsilon) probability
_, sigma = gp.predict(X_candidate, return_std=True)
next_idx = np.argmax(sigma)
print(f" uncertainty sampling (σ={sigma[next_idx]:.4f})")
next_x = X_candidate[next_idx]
return next_x, next_idx
# Usage example
np.random.seed(42)
X_train = np.array([[0.1], [0.5], [0.9]])
y_train = np.sin(5 * X_train).ravel()
kernel = ConstantKernel(1.0) * RBF(length_scale=0.15)
gp = GaussianProcessRegressor(kernel=kernel)
gp.fit(X_train, y_train)
X_candidate = np.linspace(0, 1, 100).reshape(-1, 1)
# Epsilon-greedy uncertainty sampling (demonstration only: the GP is not refit between iterations)
for i in range(5):
print(f"\nIteration {i+1}:")
next_x, idx = epsilon_greedy_uncertainty_sampling(
gp, X_candidate, epsilon=0.2 # 20% random
)
print(f" Selected point: x={next_x[0]:.3f}")
Pitfall 2: Computational Cost of Diversity Sampling
Problem: Pairwise distance calculations become slow and memory-hungry for large candidate sets
Symptoms:
- Sampling becomes time-consuming
- High memory usage
- Does not scale to large candidate pools
Solution: Approximation using k-means clustering
from sklearn.cluster import KMeans
def fast_diversity_sampling(X_sampled, X_candidate, n_clusters=10):
"""
Fast diversity sampling using k-means clustering
Parameters:
-----------
X_sampled : array (n_sampled, n_features)
Existing samples
X_candidate : array (n_candidates, n_features)
Candidate points
n_clusters : int
Number of clusters
Returns:
--------
next_x : array
Next sampling point
"""
# Cluster candidate points
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans.fit(X_candidate)
    # Compute each cluster center's minimum distance to the already-sampled points
cluster_centers = kmeans.cluster_centers_
distances_from_sampled = np.min(
np.linalg.norm(
cluster_centers[:, np.newaxis, :] -
X_sampled[np.newaxis, :, :],
axis=2
),
axis=1
)
# Select the representative point of the farthest cluster
farthest_cluster = np.argmax(distances_from_sampled)
cluster_mask = (kmeans.labels_ == farthest_cluster)
candidates_in_cluster = X_candidate[cluster_mask]
# Select the point closest to the cluster center within the cluster
distances_to_center = np.linalg.norm(
candidates_in_cluster - cluster_centers[farthest_cluster],
axis=1
)
next_idx_in_cluster = np.argmin(distances_to_center)
next_x = candidates_in_cluster[next_idx_in_cluster]
return next_x
# Benchmark
import time
n_sampled = 100
n_candidates = 10000
X_sampled = np.random.rand(n_sampled, 4)
X_candidate = np.random.rand(n_candidates, 4)
# Traditional method (full distance calculation)
start = time.time()
from scipy.spatial.distance import cdist
distances = cdist(X_candidate, X_sampled)
min_distances = np.min(distances, axis=1)
next_idx_naive = np.argmax(min_distances)
time_naive = time.time() - start
# k-means approximation method
start = time.time()
next_x_fast = fast_diversity_sampling(X_sampled, X_candidate, n_clusters=20)
time_fast = time.time() - start
print(f"Traditional method: {time_naive:.4f} seconds")
print(f"k-means method: {time_fast:.4f} seconds")
print(f"Speedup ratio: {time_naive/time_fast:.1f}x")
Pitfall 3: Handling Experimental Failures in Closed-Loop Systems
Problem: The optimization loop does not account for experimental failures
Symptoms:
- The loop stops when an experiment fails
- Failure data cannot be exploited
- Low robustness in practice
Solution: A closed-loop optimizer that tolerates experimental failures
class RobustClosedLoopOptimizer:
"""
Closed-loop optimization handling experimental failures
"""
def __init__(self, objective_function, total_budget=50, failure_rate=0.1):
"""
Parameters:
-----------
objective_function : callable
Objective function (experiment simulator)
total_budget : int
Total experiment budget
failure_rate : float
Experimental failure rate (0~1)
"""
self.objective_function = objective_function
self.total_budget = total_budget
self.failure_rate = failure_rate
self.X_sampled = []
self.y_observed = []
self.failures = []
def execute_experiment(self, x):
"""
Execute experiment (with possibility of failure)
Returns:
--------
success : bool
Experiment success flag
result : float or None
Measured value on success, None on failure
"""
# Failure simulation
if np.random.rand() < self.failure_rate:
print(f" Experiment failed: x={x}")
return False, None
        # Evaluate objective function on success
        # (flatten x and cast to float so 1-D and 2-D query points behave the same)
        y = float(self.objective_function(np.ravel(x)))
        return True, y
def run(self):
"""Execute closed-loop optimization"""
# Initialization
X_init = np.random.uniform(0, 1, (5, 1))
for x in X_init:
success, y = self.execute_experiment(x)
if success:
self.X_sampled.append(x)
self.y_observed.append(y)
self.failures.append(False)
else:
self.failures.append(True)
# Main loop
experiments_done = len(X_init)
while len(self.y_observed) < self.total_budget:
if experiments_done >= self.total_budget * 1.5:
print("Experiment budget exceeded (many failures)")
break
            # Choose the next point: random sampling until enough data, then GP + Expected Improvement
if len(self.y_observed) < 3:
# Random sampling when data is insufficient
next_x = np.random.uniform(0, 1, (1, 1))
print(f"Insufficient data: random sampling")
else:
kernel = ConstantKernel(1.0) * RBF(length_scale=0.15)
gp = GaussianProcessRegressor(kernel=kernel)
X_array = np.array(self.X_sampled)
y_array = np.array(self.y_observed)
gp.fit(X_array, y_array)
# Maximize EI
X_candidate = np.linspace(0, 1, 500).reshape(-1, 1)
mu, sigma = gp.predict(X_candidate, return_std=True)
f_best = np.max(y_array)
from scipy.stats import norm
                improvement = mu - f_best - 0.01  # 0.01 = exploration margin (xi)
Z = improvement / (sigma + 1e-9)
ei = improvement * norm.cdf(Z) + sigma * norm.pdf(Z)
next_idx = np.argmax(ei)
next_x = X_candidate[next_idx:next_idx+1]
# Execute experiment
success, y = self.execute_experiment(next_x)
experiments_done += 1
if success:
                self.X_sampled.append(next_x.ravel())  # store 1-D points so all entries share the same shape
self.y_observed.append(y)
self.failures.append(False)
print(f"Success {len(self.y_observed)}/{self.total_budget}: "
f"x={next_x[0][0]:.3f}, y={y:.3f}")
else:
self.failures.append(True)
print(f"Failed: Retrying")
# Results summary
success_rate = len(self.y_observed) / experiments_done
print(f"\nFinal results:")
print(f" Total experiments: {experiments_done}")
print(f" Successful experiments: {len(self.y_observed)}")
print(f" Success rate: {success_rate:.1%}")
print(f" Best value: {np.max(self.y_observed):.4f}")
# Usage example
def noisy_objective(x):
"""Noisy objective function"""
return np.sin(5 * x[0]) * np.exp(-x[0]) + 0.1 * np.random.randn()
np.random.seed(42)
optimizer = RobustClosedLoopOptimizer(
objective_function=noisy_objective,
total_budget=20,
failure_rate=0.2 # 20% failure rate
)
optimizer.run()
End-of-Chapter Checklist (add before "Exercises")
4.7 End-of-Chapter Checklist
✅ Understanding Active Learning
- [ ] Can explain the difference between Active Learning and Bayesian Optimization
- [ ] Understand the three main strategies (uncertainty, diversity, model change)
- [ ] Can explain the advantages and disadvantages of each strategy
- [ ] Can select strategies according to the problem
- [ ] Know how to combine strategies
Selection Guide:
Understanding search space → Diversity sampling
Improving prediction accuracy → Uncertainty sampling
Improving model generalization → Expected model change
Finding optimal solutions → Bayesian Optimization (EI/UCB)
Discovering diverse candidate materials → Combination of diversity + uncertainty
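The last entry in the guide combines two criteria. Below is a minimal sketch of one way to do this, assuming a trained GaussianProcessRegressor as in the rest of the chapter; the weighted-sum scoring and the weight parameter are illustrative choices, not a prescribed method.
import numpy as np
from scipy.spatial.distance import cdist

def combined_uncertainty_diversity(gp, X_sampled, X_candidate, weight=0.5):
    """
    Score candidates by a weighted sum of predictive uncertainty and
    distance to the already-sampled points (illustrative sketch).
    """
    _, sigma = gp.predict(X_candidate, return_std=True)
    # Diversity term: minimum distance from each candidate to the existing samples
    min_dist = cdist(X_candidate, X_sampled).min(axis=1)
    # Normalize both terms to [0, 1] so the weight is meaningful
    sigma_n = (sigma - sigma.min()) / (sigma.max() - sigma.min() + 1e-12)
    dist_n = (min_dist - min_dist.min()) / (min_dist.max() - min_dist.min() + 1e-12)
    score = weight * sigma_n + (1 - weight) * dist_n
    next_idx = np.argmax(score)
    return X_candidate[next_idx], next_idx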
✅ Uncertainty Sampling
- [ ] Understand the meaning of prediction standard deviation σ
- [ ] Know how to identify regions with high uncertainty
- [ ] Can implement combination with epsilon-greedy method
- [ ] Understand application to classification problems (margin, entropy; see the sketch after the implementation check)
- [ ] Know the limitations of uncertainty sampling
Implementation Check:
# Can you complete this code?
def uncertainty_sampling(gp, X_candidate):
"""
Select the point with maximum uncertainty
Returns:
--------
next_x : array
Next sampling point
uncertainty : float
Uncertainty at that point
"""
# Your implementation
_, sigma = gp.predict(X_candidate, return_std=True)
next_idx = np.argmax(sigma)
next_x = X_candidate[next_idx]
uncertainty = sigma[next_idx]
return next_x, uncertainty
# Correct!
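For classification problems (see the checklist item on margin and entropy above), uncertainty is measured on predicted class probabilities rather than a predictive standard deviation. A minimal sketch, assuming any scikit-learn classifier that implements predict_proba; the function name and criterion argument are illustrative.
import numpy as np

def classification_uncertainty_sampling(clf, X_candidate, criterion="entropy"):
    """
    Select the candidate the classifier is least certain about.
    criterion="entropy": maximize predictive entropy -sum(p * log p)
    criterion="margin" : minimize the gap between the top two class probabilities
    """
    proba = clf.predict_proba(X_candidate)
    if criterion == "entropy":
        scores = -np.sum(proba * np.log(proba + 1e-12), axis=1)
        next_idx = np.argmax(scores)  # highest entropy = most uncertain
    else:
        sorted_proba = np.sort(proba, axis=1)
        margins = sorted_proba[:, -1] - sorted_proba[:, -2]
        next_idx = np.argmin(margins)  # smallest margin = most uncertain
    return X_candidate[next_idx], next_idx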
✅ Diversity Sampling
- [ ] Understand the concept of MaxMin distance (see the sketch below)
- [ ] Can implement approximation using k-means clustering
- [ ] Know the basics of Determinantal Point Process (DPP)
- [ ] Can evaluate search space coverage
- [ ] Know speedup methods for large-scale data
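The MaxMin idea from the first item above can be written as a greedy farthest-point loop; the k-means version in section 4.3 approximates it for large candidate pools. A minimal sketch (the function name and the n_select default are illustrative):
import numpy as np
from scipy.spatial.distance import cdist

def maxmin_diversity_sampling(X_sampled, X_candidate, n_select=5):
    """
    Greedy MaxMin (farthest-point) selection: repeatedly pick the candidate
    whose minimum distance to everything selected so far is largest.
    """
    selected = []
    current = np.asarray(X_sampled, dtype=float)
    for _ in range(n_select):
        # Minimum distance from each candidate to the current sample set
        min_dist = cdist(X_candidate, current).min(axis=1)
        idx = np.argmax(min_dist)
        selected.append(X_candidate[idx])
        current = np.vstack([current, X_candidate[idx:idx + 1]])
    return np.array(selected)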
Diversity Evaluation Metrics:
def evaluate_diversity(X_sampled, bounds):
"""
Evaluate sampling diversity
Returns:
--------
coverage_score : float
Search space coverage (0~1)
"""
# Divide search space into 10 parts and calculate coverage
n_dims = X_sampled.shape[1]
n_bins = 10
coverage_count = 0
    total_bins = n_bins ** n_dims  # full-grid cell count (not used by the simplified per-dimension check below)
# Simplified version: coverage per dimension
for dim in range(n_dims):
hist, _ = np.histogram(
X_sampled[:, dim],
bins=n_bins,
range=(bounds[dim, 0], bounds[dim, 1])
)
coverage_count += np.sum(hist > 0)
coverage_score = coverage_count / (n_bins * n_dims)
return coverage_score
# Usage example
bounds = np.array([[0, 1], [0, 1], [0, 1], [0, 1]])
X_sampled = np.random.rand(20, 4)
coverage = evaluate_diversity(X_sampled, bounds)
print(f"Coverage: {coverage:.1%}")
✅ Closed-Loop Optimization
- [ ] Understand the components of a closed-loop system
- [ ] Know how to integrate AI engine, experimental equipment, and data management
- [ ] Can implement methods for handling experimental failures
- [ ] Can design real-time monitoring
- [ ] Understand the role of human researchers
System Design Checklist:
□ Definition of objective function and evaluation method
□ Explicit constraints
□ Initial sampling strategy
□ Selection of acquisition function
□ Determination of batch size
□ Retry logic for experimental failures
□ Anomaly detection and human notification
□ Automatic data saving and backup
□ Progress visualization
□ Setting termination conditions
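One way to make several of these checklist items concrete is to collect them in a single configuration object that the loop reads from. A minimal sketch; every field name and default value below is an illustrative assumption, not an interface defined in this chapter.
from dataclasses import dataclass

@dataclass
class ClosedLoopConfig:
    """Illustrative settings for a closed-loop optimization run."""
    total_budget: int = 50              # maximum number of successful experiments
    batch_size: int = 1                 # experiments proposed per iteration
    max_retries: int = 3                # retries allowed after a failed experiment
    acquisition: str = "EI"             # acquisition function to use
    checkpoint_path: str = "loop_state.json"  # where progress is saved automatically
    patience: int = 10                  # stop after this many iterations without improvement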
✅ Understanding Real-World Applications
- [ ] Can explain the achievements of Berkeley A-Lab
- [ ] Understand the approach of RoboRXN
- [ ] Know the features of Materials Acceleration Platform
- [ ] Can evaluate ROI of industrial applications
- [ ] Can analyze success factors and challenges
ROI Calculation Template:
Traditional method:
Number of experiments: ________ times
Experiment time: ________ hours/time
Labor cost: ________ $/hour
Total cost: ________ $
Development period: ________ months
AI-driven method (closed-loop):
Number of experiments: ________ times (__% reduction)
Experiment time: ________ hours/time (automated)
Labor cost: ________ $/hour (monitoring only)
System construction: ________ $ (initial investment)
Total cost: ________ $
Development period: ________ months (__% reduction)
Payback period: ________ months
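The payback period at the bottom of the template is the initial system investment divided by the monthly cost savings. A minimal sketch of that arithmetic; the function and all numbers in the commented example are placeholders, not figures from the chapter.
def payback_period_months(system_cost, monthly_cost_traditional, monthly_cost_ai):
    """
    Months needed for the monthly savings of the closed-loop setup to
    recover the initial system investment (illustrative arithmetic only).
    """
    monthly_saving = monthly_cost_traditional - monthly_cost_ai
    if monthly_saving <= 0:
        return float("inf")  # the closed-loop setup never pays for itself
    return system_cost / monthly_saving

# Placeholder example (all values made up for illustration):
# payback_period_months(system_cost=120_000,
#                       monthly_cost_traditional=30_000,
#                       monthly_cost_ai=10_000)  # -> 6.0 months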
✅ Human-AI Collaboration
- [ ] Understand human intuition and AI strengths
- [ ] Can design hybrid approaches
- [ ] Can determine when humans should intervene
- [ ] Can build decision support systems
- [ ] Can design feedback loops
Collaboration Protocol:
Phase 1: Problem formulation (human-led)
→ Define objective function, constraints, and search space
→ AI checks feasibility
Phase 2: Initial exploration (AI-led)
→ AI explores data-efficiently
→ Human validates anomalies
Phase 3: Refinement (hybrid)
→ AI proposes
→ Human evaluates physical validity
→ Collaborative decision-making
Phase 4: Implementation (human-led)
→ Human selects final candidates
→ AI quantifies uncertainty
✅ Understanding Career Paths
- [ ] Understand the academic researcher path
- [ ] Know the industry R&D engineer path
- [ ] Can consider the autonomous experimentation specialist path
- [ ] Can identify skills to learn next
- [ ] Have clarified your own career goals
Next Steps Selection Guide:
Theory research orientation
→ GNN Beginner + Reinforcement Learning Beginner
→ Paper writing, conference presentations
Implementation/application orientation
→ Robotics Experiment Automation Beginner
→ Original projects, portfolio creation
Industrial application orientation
→ Deep dive into industrial case studies
→ Internships, practical experience
System building orientation
→ Closed-loop system construction
→ API design, hardware integration
Pass Criteria
If you have achieved the following, you have completed the series:
- Theoretical Understanding: Can check off at least 80% of the items in each checklist
- Implementation Skills: Can solve all exercises
- Application Ability: Can formulate new materials exploration problems
- Career: Next steps are clear
Final Confirmation Questions:
1. Can you implement and compare the performance of the three Active Learning strategies?
2. Can you design a closed-loop optimization system?
3. Can you extract lessons from real-world application success stories?
4. Can you explain the next steps toward your career goals?
If all are YES, congratulations! You have completed the Bayesian Optimization & Active Learning Beginner series!
To the Next Series:
- Robotics Experiment Automation Beginner
- Reinforcement Learning Beginner (Materials Science Specialized)
- GNN Beginner
Continuous Learning:
- Paper reading (1 per week)
- Open source contributions
- Community participation
- Application to real projects
We wish you success!