Chapter 1: The Need for Active Learning
This chapter explains the need for active learning. You will learn the four main query strategy techniques and the exploration-exploitation tradeoff.
Dramatically Reduce Experiment Count Through Active Data Selection
Learning Objectives
By completing this chapter, you will be able to:
- Explain the definition and advantages of Active Learning
- Understand the four main query strategy techniques
- Explain the exploration-exploitation tradeoff
- Provide three or more successful examples in materials science
- Perform quantitative comparisons with random sampling
Reading Time: 20-25 minutes | Code Examples: 7 | Exercises: 3
1.1 What is Active Learning?
Definition: Efficient Learning Through Active Data Selection
Active Learning is a method where machine learning models actively select "which data to acquire next," enabling the construction of high-accuracy models with minimal training data.
Differences from Passive Learning:
| Aspect | Passive Learning | Active Learning |
|---|---|---|
| Data Selection | Random or existing datasets | Actively selected by model |
| Learning Efficiency | Low (requires large data) | High (high accuracy with small data) |
| Data Acquisition Cost | Not considered | Considered |
| Application Scenarios | Data is inexpensive | Data is expensive |
Importance in Materials Science:
- Single experiments take days to weeks
- High experimental costs (catalyst synthesis, DFT calculations, etc.)
- Vast search spaces (10^6 to 10^60 candidates)
Basic Active Learning Cycle
Few samples"] --> B["Model Training
Build prediction model"] B --> C["Candidate Evaluation
Query Strategy"] C --> D["Select most
informative sample"] D --> E["Experiment Execution
Data Acquisition"] E --> F{"Stopping criteria?
Goal achieved or
budget limit"} F -->|No| B F -->|Yes| G["Final Model"] style A fill:#e3f2fd style B fill:#fff3e0 style C fill:#f3e5f5 style D fill:#e8f5e9 style E fill:#ffebee style G fill:#4CAF50,color:#fff
Key Points:
1. Start with small initial data (typically 10-20 samples)
2. Intelligently select the next sample using a query strategy
3. Execute experiments, adding data one at a time
4. Repeat model updates
5. Continue until the goal is achieved
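The cycle above can be sketched as a short loop. This is a minimal illustration, not a library API: `run_experiment` (your experimental or simulation interface) and `query_strategy` (any acquisition function from Section 1.2) are placeholder callables, and the stopping criterion is simplified to a fixed budget.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def active_learning_loop(X_pool, run_experiment, query_strategy,
                         n_initial=10, budget=20, seed=0):
    """Minimal active-learning loop sketch.

    X_pool         : (N, d) array of candidate feature vectors
    run_experiment : callable, X_pool[i] -> measured target value
    query_strategy : callable, (model, X_candidates) -> local index of next sample
    """
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X_pool), size=n_initial, replace=False))
    y = {i: run_experiment(X_pool[i]) for i in labeled}        # initial experiments

    model = None
    for _ in range(budget):
        # Retrain the model on all labeled data
        model = RandomForestRegressor(random_state=seed)
        model.fit(X_pool[labeled], [y[i] for i in labeled])

        # Evaluate remaining candidates and pick the most informative one
        unlabeled = [i for i in range(len(X_pool)) if i not in y]
        next_i = unlabeled[query_strategy(model, X_pool[unlabeled])]

        # Run the experiment and add the new data point
        y[next_i] = run_experiment(X_pool[next_i])
        labeled.append(next_i)

    return model, labeled, y
```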
1.2 Query Strategy Fundamentals
1.2.1 Uncertainty Sampling
Principle: Select samples where the model's prediction is most uncertain
Formula: $$ x^* = \arg\max_{x \in \mathcal{U}} \text{Uncertainty}(x) $$
where $\mathcal{U}$ is the set of unlabeled samples
Uncertainty Measurement Methods:
Regression Problems: $$ \text{Uncertainty}(x) = \sigma(x) $$ (standard deviation of prediction)
Classification Problems (2-class): $$ \text{Uncertainty}(x) = 1 - |P(y=1|x) - P(y=0|x)| $$ (one minus the absolute difference of class probabilities; the closer $P(y=1|x)$ is to 0.5, the higher the uncertainty)
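For the 2-class case, this measure can be computed directly from predicted class probabilities. The sketch below assumes an already fitted binary scikit-learn classifier with `predict_proba`; the function name is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def most_uncertain_index(clf: RandomForestClassifier, X_unlabeled: np.ndarray) -> int:
    """Return the index of the unlabeled sample with the highest
    2-class uncertainty, 1 - |P(y=1|x) - P(y=0|x)|."""
    proba = clf.predict_proba(X_unlabeled)                 # shape (n_samples, 2)
    uncertainty = 1.0 - np.abs(proba[:, 1] - proba[:, 0])
    return int(np.argmax(uncertainty))
```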
Code Example 1: Uncertainty Sampling Implementation
__PROTECTED_CODE_0__
Output:
__PROTECTED_CODE_1__
Advantages:
- Simple and intuitive
- Low computational cost
- Effective for many problems

Disadvantages:
- Does not consider the diversity of the search space
- May be biased toward local regions
1.2.2 Diversity Sampling
Principle: Select samples that are different (diverse) from existing data
Formula: $$ x^* = \arg\max_{x \in \mathcal{U}} \min_{x_i \in \mathcal{L}} d(x, x_i) $$
where $\mathcal{L}$ is the set of labeled samples, and $d(\cdot, \cdot)$ is a distance function
Distance Measurement Methods:
- Euclidean distance: $d(x_i, x_j) = \|x_i - x_j\|_2$
- Mahalanobis distance: $d(x_i, x_j) = \sqrt{(x_i - x_j)^T \Sigma^{-1} (x_i - x_j)}$
- Cosine distance: $d(x_i, x_j) = 1 - \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$
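These distances can be computed directly with NumPy and SciPy. The sketch below is illustrative only; the toy vectors are made-up values, and each line evaluates the inner $\min_{x_i \in \mathcal{L}} d(x, x_i)$ term of the diversity criterion above.

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis, cosine

# Toy labeled set and one candidate (illustrative values only)
X_labeled = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 3.0]])
x_new = np.array([2.5, 1.0])

# Euclidean distance to the nearest labeled sample
d_euc = min(euclidean(x_new, xi) for xi in X_labeled)

# Mahalanobis distance uses the inverse covariance of the labeled data
VI = np.linalg.inv(np.cov(X_labeled, rowvar=False))
d_mah = min(mahalanobis(x_new, xi, VI) for xi in X_labeled)

# Cosine distance = 1 - cosine similarity
d_cos = min(cosine(x_new, xi) for xi in X_labeled)

print(f"Euclidean: {d_euc:.3f}, Mahalanobis: {d_mah:.3f}, Cosine: {d_cos:.3f}")
```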
Code Example 2: Diversity Sampling Implementation
__PROTECTED_CODE_2__
Output:
__PROTECTED_CODE_3__
Advantages:
- Covers a wide range of the search space
- Prevents bias toward local optima
- Works well with clustering

Disadvantages:
- Does not consider model uncertainty
- Slightly higher computational cost
1.2.3 Query-by-Committee
Principle: Select samples where multiple models (committee) disagree the most
Formula: $$ x^* = \arg\max_{x \in \mathcal{U}} \text{Disagreement}(C, x) $$
where $C = \{M_1, M_2, \dots, M_K\}$ is a set of models (committee)
Disagreement Measurement:
Regression Problems (Variance): $$ \text{Disagreement}(C, x) = \frac{1}{K} \sum_{k=1}^K (M_k(x) - \bar{M}(x))^2 $$
Classification Problems (Kullback-Leibler Divergence): $$ \text{Disagreement}(C, x) = \frac{1}{K} \sum_{k=1}^K \mathrm{KL}\left(P_k(\cdot|x) \,\|\, P_C(\cdot|x)\right) $$ where $P_C(\cdot|x) = \frac{1}{K}\sum_{k=1}^K P_k(\cdot|x)$ is the committee's consensus distribution
Code Example 3: Query-by-Committee Implementation
__PROTECTED_CODE_4__
Output:
__PROTECTED_CODE_5__
Advantages:
- Leverages knowledge from diverse models
- Reduces model bias
- Robust uncertainty estimation

Disadvantages:
- High computational cost (training multiple models)
- Depends on model selection
1.2.4 Expected Model Change
Principle: Select samples that cause the largest change in model parameters
Formula (gradient-based): $$ x^* = \arg\max_{x \in \mathcal{U}} \left\| \nabla_\theta \mathcal{L}(\theta; x, \hat{y}) \right\| $$
where $\theta$ are the model parameters, $\mathcal{L}$ is the loss function, and $\hat{y}$ is the predicted label
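Most scikit-learn models do not expose per-sample gradients, so in practice this criterion is approximated. The sketch below is one such approximation (not the gradient formula itself): it refits a linear ridge model with each candidate labeled at plausible values, $\mu(x) \pm \sigma(x)$ estimated from a random forest, and scores the candidate by the resulting shift of the weight vector. All function names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

def expected_model_change_index(X_labeled, y_labeled, X_pool, seed=0):
    """Retraining-based approximation of Expected Model Change for a
    linear (ridge) model: refit with each candidate labeled at
    mu(x) +/- sigma(x) and measure the shift of the weight vector."""
    base = Ridge().fit(X_labeled, y_labeled)

    # Approximate mu(x) and sigma(x) from the spread of individual trees
    rf = RandomForestRegressor(n_estimators=100, random_state=seed)
    rf.fit(X_labeled, y_labeled)
    tree_preds = np.stack([t.predict(X_pool) for t in rf.estimators_])
    mu, sigma = tree_preds.mean(axis=0), tree_preds.std(axis=0)

    changes = []
    for i, x in enumerate(X_pool):
        deltas = []
        for y_hyp in (mu[i] - sigma[i], mu[i] + sigma[i]):   # plausible labels
            X_aug = np.vstack([X_labeled, x])
            y_aug = np.append(y_labeled, y_hyp)
            refit = Ridge().fit(X_aug, y_aug)
            deltas.append(np.linalg.norm(refit.coef_ - base.coef_))
        changes.append(np.mean(deltas))                       # expected change

    return int(np.argmax(changes))                            # most impactful candidate
```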
Advantages:
- Directly evaluates the impact on model improvement
- Enables efficient learning

Disadvantages:
- High computational cost
- Limited to models with computable gradients
1.3 Exploration vs Exploitation
The Tradeoff Concept
One of the most important concepts in active learning is the exploration-exploitation tradeoff.
Exploration:
- Investigate unknown regions
- Collect diverse samples
- Acquire new information
- Accept risk

Exploitation:
- Intensively investigate regions already known to be good
- Prioritize regions with high predicted performance
- Make maximum use of existing knowledge
- Improve safely
ε-greedy Approach
Principle: Explore with probability $\epsilon$, exploit with probability $1-\epsilon$
Algorithm:
__PROTECTED_CODE_6__
Code Example 4: ε-greedy Active Learning
__PROTECTED_CODE_7__
Output:
__PROTECTED_CODE_8__
Choosing ε:
- $\epsilon = 0$: Full exploitation (risk of local optima)
- $\epsilon = 1$: Full exploration (equivalent to random sampling)
- $\epsilon = 0.1 \sim 0.2$: Well-balanced (recommended)
Upper Confidence Bound (UCB)
Principle: Prediction mean + uncertainty bonus
Formula: $$ \text{UCB}(x) = \mu(x) + \kappa \sigma(x) $$
- $\mu(x)$: Prediction mean
- $\sigma(x)$: Prediction standard deviation
- $\kappa$: Exploration parameter (typically 1.0-3.0)
Code Example 5: Sample Selection Using UCB
__PROTECTED_CODE_9__
Output:
__PROTECTED_CODE_10__
Impact of κ:
- Large $\kappa$ → exploration-focused
- Small $\kappa$ → exploitation-focused
- Recommended: $\kappa = 2.0 \sim 2.5$
1.4 Case Study: Catalyst Activity Prediction
Problem Setup
Objective: Predict catalyst reaction activity and discover the most active catalyst in 10 experiments
Dataset:
- Candidate catalysts: 500 types
- Features: metal composition (3 elements), loading, calcination temperature
- Target variable: reaction rate constant (k)

Constraints:
- A single experiment takes 3 days
- Budget limited to a maximum of 10 experiments
Random Sampling vs Active Learning
Code Example 6: Comparative Experiment for Catalyst Activity Prediction
__PROTECTED_CODE_11__
Expected Output:
__PROTECTED_CODE_12__
Important Observations:
- Active Learning reaches 97.5% of the true optimal value in 10 experiments
- Random Sampling reaches only 79.3%
- A 23% performance improvement
- The R² score steadily improves (0.512 → 0.843)
1.5 Chapter Summary
What We Learned
1. Active Learning Definition
   - Efficient learning through active data selection
   - Differences from passive learning
   - Importance in materials science (experimental cost reduction)

2. Query Strategies
   - Uncertainty Sampling: select samples with uncertain predictions
   - Diversity Sampling: select diverse samples
   - Query-by-Committee: leverage disagreement between models
   - Expected Model Change: select by impact on model updates

3. Exploration-Exploitation
   - ε-greedy: probabilistically switch between exploration and exploitation
   - UCB: prediction mean + uncertainty bonus
   - Importance of balance

4. Practical Examples
   - 23% performance improvement in catalyst activity prediction
   - 97.5% achievement rate in 10 experiments
   - 1.3× efficiency over random sampling
Key Takeaways
- Active learning excels in problems with high data acquisition costs
- Query strategy selection greatly affects exploration efficiency
- Balancing exploration and exploitation is crucial
- Experiments in materials science can be reduced by 50-90%
- Significant improvements are achievable within 10-20 experiments
Next Chapter
In Chapter 2, we will learn the core uncertainty estimation techniques for active learning:
- Ensemble methods (Random Forest, LightGBM)
- Dropout methods (Bayesian Neural Networks)
- Gaussian Processes (rigorous uncertainty quantification)

Chapter 2: Uncertainty Estimation Techniques →
Exercises
Problem 1 (Difficulty: Easy)
For the following situations, determine which query strategy is most appropriate and explain your reasoning.
Situation A: Predicting tensile strength of alloys. 10,000 candidate materials, 50 initial data samples, budget allows 20 additional experiments. Search space is vast, but strength varies relatively smoothly with composition.
Situation B: Discovery of novel organic semiconductor materials. 100,000 candidate molecules, 10 initial data samples, budget allows 10 additional experiments. Properties vary very complexly with molecular structure.
Hint
- Situation A: Vast search space → ?
- Situation B: Little data, complex function → ?
- Review the characteristics of each query strategy

Example Solution
**Situation A: Diversity Sampling is optimal**

**Reasoning**:
1. The search space is vast (10,000 types), so it is difficult to cover entirely with 20 experiments
2. 50 initial samples are available, sufficient for building a reasonable model
3. Strength varies smoothly, so covering a wide range captures the overall picture
4. Diversity sampling provides even coverage of the search space

**Alternative**: UCB sampling (with a large exploration parameter κ)

**Situation B: Uncertainty Sampling (or Query-by-Committee) is optimal**

**Reasoning**:
1. Very few initial data samples (10 samples)
2. Properties vary complexly, so high-uncertainty regions should be prioritized
3. The budget is limited (10 experiments), requiring efficient learning
4. Uncertainty sampling selects the most informative samples

**Alternative**: Query-by-Committee (handles complex functions through model diversity)

Problem 2 (Difficulty: Medium)
Implement ε-greedy Active Learning and compare exploration efficiency for different ε values (0.0, 0.1, 0.2, 0.5).
Tasks:
1. Generate a synthetic material properties dataset (500 samples)
2. Execute ε-greedy AL with 10 initial samples and 15 additional experiments
3. Plot the best value discovered for each ε
4. Select the optimal ε and explain your reasoning
Hint
- Reference Code Example 4 to implement ε-greedy
- Run 5 trials for each ε and average the results
- Plot: x-axis = experiment count, y-axis = best value discovered

Example Solution
__PROTECTED_CODE_13__

**Expected Output**:
__PROTECTED_CODE_14__

**Conclusion**:
- **ε = 0.2 is optimal** (99.0% achievement rate)
- ε = 0.0 is prone to local optima (89.2%)
- ε = 0.5 over-explores and is inefficient (93.4%)
- **Moderate exploration (ε = 0.1-0.2) provides a good balance**
Problem 3 (Difficulty: Hard)
Compare three query strategies (Uncertainty, Diversity, Query-by-Committee) on the same dataset and select the most efficient method.
Requirements:
1. Generate synthetic multi-objective material data (1,000 samples, 10 dimensions)
2. Execute each method with 20 initial samples and 30 additional experiments
3. Evaluate using these metrics:
   - Best value discovered
   - R² score (prediction accuracy on all data)
   - Computation time
4. Select the most efficient method overall
Hint
- Implement each method independently
- Run 5 trials and average the results
- Measure computation time with `time.time()`
- Consider the tradeoffs (accuracy vs computation time)

Example Solution
__PROTECTED_CODE_15__

**Expected Output**:

__PROTECTED_CODE_16__

**Conclusion**:
1. **Query-by-Committee (QBC)** achieves the highest performance (97.4% achievement, R² = 0.856)
2. However, its computation time is more than 3× longer (38.12 s vs 12.34 s)
3. **Uncertainty Sampling provides the best overall balance**
   - 96.2% achievement (only 1.2% below QBC)
   - R² = 0.834 (only 0.022 below QBC)
   - Computation time is about one third

**Recommendations**:
- **No time constraints**: QBC
- **Balance priority**: Uncertainty Sampling
- **Computation cost priority**: Diversity Sampling
Data Licenses and Citations
Benchmark Datasets
Benchmark datasets for active learning that can be used in this chapter's code examples:
1. UCI Machine Learning Repository
- License: CC BY 4.0
- Citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
- Recommended Datasets:
  - `make_regression()` (built-in scikit-learn)
  - Wine Quality Dataset
  - Boston Housing Dataset
2. Materials Project API
- License: CC BY 4.0
- Citation: Jain, A. et al. (2013). "Commentary: The Materials Project: A materials genome approach to accelerating materials innovation." APL Materials, 1(1), 011002.
- Usage: Active learning experiments in materials science (band gap, formation energy)
- API Access: https://materialsproject.org/api
3. Matbench Datasets
- License: MIT License
- Citation: Dunn, A. et al. (2020). "Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm." npj Computational Materials, 6(1), 138.
- Usage: Active learning for material property prediction
Library Licenses
Licenses of major libraries used in this chapter:
| Library | Version | License | Purpose |
|---|---|---|---|
| modAL | 0.4.1 | MIT | Active Learning Framework |
| scikit-learn | 1.3.0 | BSD-3-Clause | Machine Learning & Preprocessing |
| numpy | 1.24.3 | BSD-3-Clause | Numerical Computing |
| matplotlib | 3.7.1 | PSF (BSD-like) | Visualization |
License Compliance:
- All libraries are available for commercial use
- Maintain the original license notices when redistributing
- Cite appropriately in academic publications
Ensuring Reproducibility
Random Seed Configuration
To make active learning experiments reproducible, set the following random seeds in all code:
__PROTECTED_CODE_18__
Important Points:
- Data splitting: `train_test_split(..., random_state=SEED)`
- Model initialization: `RandomForestRegressor(..., random_state=SEED)`
- Initial sample selection: set the seed before `np.random.choice(..., replace=False)`
Library Version Management
To fully reproduce the experimental environment, create a `requirements.txt`:
__PROTECTED_CODE_23__
Recording Experimental Logs
Record all active learning iterations:
__PROTECTED_CODE_24__
Common Pitfalls and Solutions
1. Cold Start Problem (Insufficient Initial Data)
Problem: Too few initial labeled samples lead to unstable uncertainty estimation
Symptoms:
__PROTECTED_CODE_25__
Solution:
__PROTECTED_CODE_26__
Recommended Rules:
- Minimum 10 samples
- Ideally 3-5× the number of features
- More complex models (NN, GP) require more initial data
2. Query Selection Bias
Problem: Using only uncertainty sampling leads to repeatedly selecting samples from the same regions
Symptoms:
__PROTECTED_CODE_27__
Solution 1: ε-greedy:
__PROTECTED_CODE_28__
Solution 2: Batch Diversity:
__PROTECTED_CODE_29__
3. Stopping Criteria Errors
Problem: Unclear when to stop active learning
Bad Example: Fixed iteration count only
__PROTECTED_CODE_30__
Solution: Multiple stopping criteria:
__PROTECTED_CODE_31__
4. Distribution Shift
Problem: Distribution differs between labeled and unlabeled pools
Symptoms:
__PROTECTED_CODE_32__
Solution:
__PROTECTED_CODE_33__
5. Label Noise Handling
Problem: When experimental measurements contain noise, the model learns from incorrect labels
Solution 1: Uncertainty threshold:
__PROTECTED_CODE_34__
Solution 2: Ensemble Robustness:
__PROTECTED_CODE_35__
6. Computational Cost of Uncertainty Estimation
Problem: Uncertainty estimation takes too long (e.g., the O(N³) computational complexity of Gaussian Processes)
Solution: Pre-filter candidate pool:
__PROTECTED_CODE_36__
Quality Checklist
Experimental Design Checklist
Initialization Phase
- [ ] Random seed configured (`np.random.seed(SEED)`)
- [ ] Appropriate initial sample count (minimum 10, ideally 3-5× the number of features)
- [ ] Data split uses stratified sampling (avoid distribution shift)
- [ ] Library versions recorded in `requirements.txt`
Query Strategy Selection
- [ ] Select an appropriate method for the task
  - Wide exploration → Diversity Sampling
  - Efficient convergence → Uncertainty Sampling
  - Model robustness → Query-by-Committee
- [ ] Set exploration-exploitation balance (ε-greedy, UCB)
- [ ] Consider diversity when batch selecting
Stopping Criteria Design
- [ ] Set maximum iteration count
- [ ] Define target performance metrics (R², RMSE, etc.)
- [ ] Set early stopping conditions (patience=5-10)
- [ ] Clarify budget limits (experimental cost, time)
Model Selection
- [ ] Select models capable of uncertainty estimation
- Ensemble methods (RF, LightGBM)
- MC Dropout (NN)
- Gaussian Process
- [ ] Select model based on data size
  - Small (<1,000) → GP
  - Medium (1,000-10,000) → RF, LightGBM
  - Large (>10,000) → MC Dropout
Implementation Quality Checklist
Data Preprocessing
- [ ] Missing values handled (deletion or imputation)
- [ ] Outliers detected and addressed (IQR method, etc.)
- [ ] Feature scaling applied (standardization or normalization)
- [ ] No data leakage (test data separated)
Code Quality
- [ ] Type hints added to functions (`def func(x: np.ndarray) -> float:`)
- [ ] Docstrings written (arguments, return values, purpose)
- [ ] Error handling implemented (try-except)
- [ ] Logging output implemented (experiment tracking)
Evaluation and Validation
- [ ] Multiple evaluation metrics calculated (R², RMSE, MAE)
- [ ] Learning curves plotted (iteration count vs performance)
- [ ] Compared with random sampling
- [ ] Statistical significance verified (mean ± std over multiple trials; see the sketch below)
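For the last item, a short aggregation over repeated runs is usually sufficient. In the sketch below, `run_active_learning(seed)` is a placeholder for your own single-trial experiment function returning, for example, the best value found.

```python
import numpy as np

def evaluate_over_trials(run_active_learning, seeds=(0, 1, 2, 3, 4)):
    """Run the same active-learning experiment with several random seeds
    and report mean +/- sample standard deviation of the results."""
    results = np.array([run_active_learning(seed) for seed in seeds])
    print(f"Result: {results.mean():.3f} +/- {results.std(ddof=1):.3f} (n={len(results)})")
    return results.mean(), results.std(ddof=1)
```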
Materials Science-Specific Checklist
Physical Constraints
- [ ] Check physical validity of search space
- Temperature range: 0-1500°C
- Composition ratio: Total 100%
- pH range: 0-14
- [ ] Verify unit consistency (nm, eV, GPa, etc.)
- [ ] Synthesizability constraints (experimental feasibility)
Domain Knowledge Integration
- [ ] Leverage physical prior knowledge
- Kernel selection (periodicity, smoothness)
- Feature engineering (descriptors)
- [ ] Verify consistency with known physical laws
- Band Gap > 0
- Density > 0
Experimental Integration
- [ ] Account for measurement errors (noise terms)
- [ ] Define experimental cost function
- [ ] Design batch experiments (parallelization potential)
Additional Practice Exercise Guide
Complete Solution Example for Exercise 1 (CNT Electrical Conductivity Prediction)
__PROTECTED_CODE_40__
References
1. Settles, B. (2009). "Active Learning Literature Survey." Computer Sciences Technical Report 1648, University of Wisconsin-Madison.
2. Lookman, T. et al. (2019). "Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design." npj Computational Materials, 5(1), 1-17. DOI: 10.1038/s41524-019-0153-8
3. Raccuglia, P. et al. (2016). "Machine-learning-assisted materials discovery using failed experiments." Nature, 533(7601), 73-76. DOI: 10.1038/nature17439
4. Ren, F. et al. (2018). "Accelerated discovery of metallic glasses through iteration of machine learning and high-throughput experiments." Science Advances, 4(4), eaaq1566. DOI: 10.1126/sciadv.aaq1566
5. Kusne, A. G. et al. (2020). "On-the-fly closed-loop materials discovery via Bayesian active learning." Nature Communications, 11(1), 5966. DOI: 10.1038/s41467-020-19597-w
Author Information
Creator: AI Terakoya Content Team
Created: 2025-10-18
Version: 1.0

Update History:
- 2025-10-18: v1.0 Initial release

Feedback:
- GitHub Issues: AI_Homepage/issues
- Email: yusuke.hashimoto.b8@tohoku.ac.jp
License: Creative Commons BY 4.0
Let's learn the details of Uncertainty Estimation in the next chapter!