Achieving GNN-level Accuracy with Composition-Based Features
In the previous chapters, we learned how to generate composition-based features (Magpie, ElementProperty, custom Featurizers). In this chapter, we will master practical techniques for integrating these features with machine learning models to predict material properties with high accuracy.
Particularly important is that prediction accuracy comparable to Graph Neural Networks (GNNs) can be achieved using composition information alone. In the Matbench v0.1 benchmark, composition-based models (such as ElemNet) achieved R² = 0.92 for formation energy prediction, only 3-5% behind GNN models that require crystal structures (R² = 0.95-0.97).
In this chapter, you will master machine learning workflows that can be immediately applied in practice through scikit-learn pipeline construction, comparison of multiple models (RandomForest, GradientBoosting, MLP), hyperparameter optimization, and SHAP interpretation.
scikit-learn's Pipeline is a powerful tool that chains data preprocessing, feature generation, and model training into a single processing flow. matminer's Featurizers can also be combined with pipelines, which prevents preprocessing leakage between training and test folds, keeps the workflow reproducible, and lets the entire model be saved, loaded, and cross-validated as a single object.
Here is the basic pattern for integrating matminer Featurizers into scikit-learn pipelines.
# Requirements:
# - Python 3.9+
# - joblib>=1.3.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Basic pattern for integrating a matminer Featurizer into a scikit-learn pipeline
Purpose: Demonstrate machine learning model training and evaluation
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
# Code Example 1: scikit-learn Pipeline with Magpie Featurizer
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.conversions import StrToComposition
import joblib
# Prepare sample data (formation energy prediction)
data = pd.DataFrame({
'composition': ['Fe2O3', 'Al2O3', 'TiO2', 'SiO2', 'MgO',
'CaO', 'Na2O', 'K2O', 'ZnO', 'CuO'],
'formation_energy': [-8.3, -16.6, -9.7, -9.1, -6.1,
-6.3, -4.2, -3.6, -3.6, -1.6] # eV/atom
})
# 1. Convert chemical composition from string to Composition object
str_to_comp = StrToComposition(target_col_id='composition_obj')
data = str_to_comp.featurize_dataframe(data, 'composition')
# 2. Build pipeline
# Pipeline syntax: list of tuples [('name', object), ...]
pipeline = Pipeline([
('scaler', StandardScaler()), # Feature scaling
('model', RandomForestRegressor( # Random Forest regression
n_estimators=100,
max_depth=10,
random_state=42
))
])
# 3. Feature generation (executed outside pipeline)
featurizer = ElementProperty.from_preset('magpie')
data = featurizer.featurize_dataframe(data, 'composition_obj')
# 4. Separate features and target
feature_cols = featurizer.feature_labels()
X = data[feature_cols]
y = data['formation_energy']
# 5. Train/test data split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 6. Train pipeline
pipeline.fit(X_train, y_train)
# 7. Performance evaluation
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(f"Training Data R² Score: {train_score:.4f}")
print(f"Test Data R² Score: {test_score:.4f}")
# 8. Cross-validation
cv_scores = cross_val_score(
pipeline, X_train, y_train,
cv=5, scoring='r2'
)
print(f"\n5-Fold CV Average R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
# 9. Save pipeline
joblib.dump(pipeline, 'formation_energy_model.pkl')
print("\nSaved model toformation_energy_model.pkl")
# 10. Load and use pipeline
loaded_pipeline = joblib.load('formation_energy_model.pkl')
predictions = loaded_pipeline.predict(X_test)
print(f"\nPrediction Examplesļ¼First 3 Samplesļ¼:")
for i in range(min(3, len(predictions))):
print(f" Actual Value: {y_test.iloc[i]:.2f} eV/atom, "
f"Predicted Value: {predictions[i]:.2f} eV/atom")
Using make_pipeline, each step is automatically named, allowing for more concise notation:
from sklearn.pipeline import make_pipeline
# Auto-naming version (names are 'standardscaler', 'randomforestregressor', etc.)
pipeline = make_pipeline(
StandardScaler(),
RandomForestRegressor(n_estimators=100, random_state=42)
)
# Manual naming version (arbitrary names can be specified)
pipeline = Pipeline([
('scaling', StandardScaler()),
('rf_model', RandomForestRegressor(n_estimators=100, random_state=42))
])
Selection Criteria: For hyperparameter optimization where specific steps need to be accessed, Pipeline (explicit naming) is recommended; for simple workflows, make_pipeline is preferred.
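For example, the following short sketch (reusing the `pipeline` with steps named 'scaling' and 'rf_model' defined above, plus X_train and y_train from Code Example 1) shows how GridSearchCV addresses a named step's hyperparameters with the `<step name>__<parameter>` convention:
# Tuning a named pipeline step with GridSearchCV
from sklearn.model_selection import GridSearchCV
param_grid = {
'rf_model__n_estimators': [100, 300],  # <step name>__<parameter>
'rf_model__max_depth': [10, 20]
}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
print(grid.best_params_)
# With make_pipeline, the auto-generated name must be used instead,
# e.g. 'randomforestregressor__n_estimators'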
To directly integrate matminer's Featurizer into a scikit-learn pipeline, use the DFMLAdaptor class (matminer 0.7.4 and later).
# How to integrate Featurizer into pipeline
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers.composition import (
ElementProperty, Stoichiometry, ValenceOrbital
)
from matminer.utils.pipeline import DFMLAdaptor
# Integrate multiple Featurizers
multi_featurizer = MultipleFeaturizer([
ElementProperty.from_preset('magpie'),
Stoichiometry(),
ValenceOrbital()
])
# Make scikit-learn compatible with DFMLAdaptor
featurizer_step = DFMLAdaptor(
multi_featurizer,
input_cols=['composition_obj'],
ignore_cols=['composition', 'formation_energy'] # Columns excluded from feature generation
)
# Build pipeline
full_pipeline = Pipeline([
('featurize', featurizer_step),
('scale', StandardScaler()),
('model', RandomForestRegressor(n_estimators=100, random_state=42))
])
# Can train with entire dataframe as input
full_pipeline.fit(data, data['formation_energy'])
During cross-validation, if the Featurizer uses statistics from the entire dataset (e.g., mean values), data leakage can occur. To prevent this, either apply the Featurizer before splitting, which is safe only when the features depend purely on each individual composition (as the Magpie features do), or place the Featurizer inside the pipeline so that it is fit on the training folds alone (see the sketch below).
For simplicity in this series, we adopt the first approach: apply the Featurizer, then build the pipeline.
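If you do need the Featurizer to run strictly inside each cross-validation fold, one option is a small custom transformer. The sketch below is an illustrative wrapper, not an official matminer API; it assumes features that depend only on the individual composition (as with the Magpie preset):
# Sketch: wrapping a composition Featurizer as a scikit-learn transformer
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from matminer.featurizers.composition import ElementProperty

class CompositionFeaturizerTransformer(BaseEstimator, TransformerMixin):
    """Apply a matminer composition Featurizer as a pipeline step."""
    def __init__(self, featurizer):
        self.featurizer = featurizer

    def fit(self, X, y=None):
        # Magpie-style features are per-composition lookups,
        # so there is nothing dataset-dependent to fit
        return self

    def transform(self, X):
        # X: iterable of pymatgen Composition objects
        return np.array([self.featurizer.featurize(comp) for comp in X])

leak_safe_pipeline = Pipeline([
    ('featurize', CompositionFeaturizerTransformer(ElementProperty.from_preset('magpie'))),
    ('scale', StandardScaler()),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42))
])
# Fit directly on the Composition column, e.g.:
# leak_safe_pipeline.fit(data['composition_obj'], data['formation_energy'])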
For material property prediction using composition-based features, the following three models are widely used:
| Model | Characteristics | Key Hyperparameters | Recommended Range |
|---|---|---|---|
| Random Forest | Ensemble of decision trees; robust to overfitting; provides feature importance | `n_estimators`, `max_depth`, `min_samples_split` | 100-500, 10-50, 2-10 |
| Gradient Boosting (XGBoost/LightGBM) | Sequentially corrects errors; high accuracy; risk of overfitting | `learning_rate`, `n_estimators`, `max_depth` | 0.01-0.3, 100-1000, 3-10 |
| Neural Network (MLP) | Learns nonlinear relationships; effective with large-scale data; long training time | `hidden_layer_sizes`, `activation`, `alpha` | (100,) or (100, 50); 'relu' or 'tanh'; 0.0001-0.01 |
Random Forest achieves high prediction accuracy and overfitting resistance by averaging predictions from multiple decision trees. In materials science, it is also frequently used for feature importance analysis.
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
"""
Example: Random Forest regression with hyperparameter optimization and feature importance analysis
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
# Code Example 2: Random Forest regression (including hyperparameter optimization)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Use the features generated in Code Example 1 (X_train, X_test, y_train, y_test)
# 1. Baseline model (default parameters)
rf_baseline = RandomForestRegressor(random_state=42)
rf_baseline.fit(X_train, y_train)
baseline_score = rf_baseline.score(X_test, y_test)
print(f"Baseline Model R² Score: {baseline_score:.4f}")
# 2. Hyperparameter grid search
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
rf_model = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(
rf_model,
param_grid,
cv=5, # 5-fold cross-validation
scoring='r2', # Evaluate with R² score
n_jobs=-1, # Use all CPU cores
verbose=1
)
# Execute GridSearch
print("\nRunning GridSearchCV...")
grid_search.fit(X_train, y_train)
# Optimal parameters
print(f"\nOptimal Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")
# 3. Evaluate with optimal model
best_rf = grid_search.best_estimator_
y_pred_train = best_rf.predict(X_train)
y_pred_test = best_rf.predict(X_test)
# Calculate evaluation metrics
def evaluate_model(y_true, y_pred, dataset_name):
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(f"\n{dataset_name} Evaluation Results:")
print(f" MAE: {mae:.4f}")
print(f" RMSE: {rmse:.4f}")
print(f" R²: {r2:.4f}")
return mae, rmse, r2
train_metrics = evaluate_model(y_train, y_pred_train, "Training Data")
test_metrics = evaluate_model(y_test, y_pred_test, "Test Data")
# 4. Visualize feature importance
feature_importance = best_rf.feature_importances_
importance_df = pd.DataFrame({
'feature': feature_cols,
'importance': feature_importance
}).sort_values('importance', ascending=False)
print("\nTop 10 Features:")
print(importance_df.head(10).to_string(index=False))
# Visualization
plt.figure(figsize=(10, 6))
plt.barh(importance_df.head(10)['feature'],
importance_df.head(10)['importance'])
plt.xlabel('Importance')
plt.title('Random Forest Feature Importance (Top 10)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.savefig('rf_feature_importance.png', dpi=150)
print("\nSaved feature importance graph to rf_feature_importance.png")
# 5. Predicted Value vs Actual Value plot
plt.figure(figsize=(8, 8))
plt.scatter(y_test, y_pred_test, alpha=0.6, edgecolors='k', linewidth=0.5)
plt.plot([y_test.min(), y_test.max()],
[y_test.min(), y_test.max()],
'r--', lw=2, label='Ideal prediction')
plt.xlabel('Actual Value (eV/atom)')
plt.ylabel('Predicted Value (eV/atom)')
plt.title(f'Random Forest Prediction Performance\nR² = {test_metrics[2]:.4f}')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('rf_prediction_plot.png', dpi=150)
print("Saved prediction plot to rf_prediction_plot.png")
Gradient Boosting is a method that sequentially adds weak learners (shallow decision trees), correcting the errors of the previous model at each step. XGBoost is an implementation with optimizations for speed and regularization, frequently winning Kaggle competitions.
# Requirements:
# - Python 3.9+
# - xgboost>=2.0.0
"""
Example: Gradient Boosting regression with XGBoost and randomized hyperparameter search
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
# Code Example 3: Gradient Boosting implementation (XGBoost)
# XGBoost installation (not required in Google Colab)
# !pip install xgboost
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
# 1. Create dataset for XGBoost (for speed optimization)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# 2. Baseline model
params_baseline = {
'objective': 'reg:squarederror',
'eval_metric': 'rmse',
'eta': 0.1, # Learning rate
'max_depth': 6,
'subsample': 0.8,
'colsample_bytree': 0.8,
'seed': 42
}
# Training (using early stopping)
evals = [(dtrain, 'train'), (dtest, 'test')]
bst_baseline = xgb.train(
params_baseline,
dtrain,
num_boost_round=1000,
evals=evals,
early_stopping_rounds=50,
verbose_eval=False
)
print(f"Baseline best iteration: {bst_baseline.best_iteration}")
print(f"Test RMSE: {bst_baseline.best_score:.4f}")
# 3. Hyperparameter optimization with RandomizedSearchCV
# Use XGBRegressor (scikit-learn compatible)
xgb_model = xgb.XGBRegressor(
objective='reg:squarederror',
eval_metric='rmse',
random_state=42
)
param_distributions = {
'n_estimators': randint(100, 1000),
'learning_rate': uniform(0.01, 0.3),
'max_depth': randint(3, 10),
'min_child_weight': randint(1, 10),
'subsample': uniform(0.6, 0.4), # 0.6-1.0
'colsample_bytree': uniform(0.6, 0.4), # 0.6-1.0
'gamma': uniform(0, 0.5)
}
random_search = RandomizedSearchCV(
xgb_model,
param_distributions,
n_iter=50, # 50 random samplings
cv=5,
scoring='r2',
n_jobs=-1,
verbose=1,
random_state=42
)
print("\nRunning RandomizedSearchCV...")
random_search.fit(X_train, y_train)
print(f"\nOptimal Parameters: {random_search.best_params_}")
print(f"Best CV R² Score: {random_search.best_score_:.4f}")
# 4. Evaluate with optimal model
best_xgb = random_search.best_estimator_
y_pred_test_xgb = best_xgb.predict(X_test)
xgb_metrics = evaluate_model(y_test, y_pred_test_xgb, "XGBoost Test Data")
# 5. Feature importance (Gain metric)
xgb_importance = best_xgb.get_booster().get_score(importance_type='gain')
importance_df_xgb = pd.DataFrame({
'feature': list(xgb_importance.keys()),
'importance': list(xgb_importance.values())
}).sort_values('importance', ascending=False)
print("\nXGBoost Top 10 Features (Gain):")
print(importance_df_xgb.head(10).to_string(index=False))
# 6. Visualize learning curve
# Note: evals_result() is only populated when fit() receives an eval_set,
# so refit the tuned estimator with explicit evaluation sets
best_xgb.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
verbose=False)
evals_result = best_xgb.evals_result()
plt.figure(figsize=(10, 6))
plt.plot(evals_result['validation_0']['rmse'], label='Training RMSE')
plt.plot(evals_result['validation_1']['rmse'], label='Test RMSE')
plt.xlabel('Iteration')
plt.ylabel('RMSE')
plt.title('XGBoost Learning Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('xgb_learning_curve.png', dpi=150)
print("\nSaved learning curve to xgb_learning_curve.png")
GridSearchCV: Exhaustively searches all parameter combinations (high computational cost)
RandomizedSearchCV: Efficiently searches through random sampling (recommended)
When to Use: RandomizedSearchCV is appropriate when the parameter space is wide (e.g., XGBoost), GridSearchCV when narrow (e.g., Random Forest).
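To make the cost difference concrete, here is a quick count of how many model fits each search above performs (the numbers simply multiply the grid sizes and CV folds used in Code Examples 2 and 3):
# Rough fit-count comparison for the two searches above
grid_combinations = 3 * 4 * 3 * 3           # Random Forest grid in Code Example 2
grid_fits = grid_combinations * 5           # x 5-fold CV
print(f"GridSearchCV fits:       {grid_fits}")    # 540

random_fits = 50 * 5                        # 50 sampled settings x 5-fold CV
print(f"RandomizedSearchCV fits: {random_fits}")  # 250
# The full XGBoost grid is effectively unbounded because several parameters
# are continuous; random sampling keeps the budget fixed in advance.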
Multi-layer Perceptron (MLP) is a neural network with multiple hidden layers. It excels at learning nonlinear relationships and demonstrates high performance on large-scale datasets.
# Code Example 4: Neural Network (MLP Regressor)
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import validation_curve
# 1. Data standardization (required for MLP)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 2. Baseline MLP
mlp_baseline = MLPRegressor(
hidden_layer_sizes=(100,), # 1 hidden layer, 100 units
activation='relu',
solver='adam',
alpha=0.0001, # L2 regularization coefficient
batch_size='auto',
learning_rate_init=0.001,
max_iter=500,
random_state=42,
early_stopping=True, # Stop if validation loss doesn't improve
validation_fraction=0.1,
verbose=False
)
mlp_baseline.fit(X_train_scaled, y_train)
print(f"Baseline MLP training complete ({mlp_baseline.n_iter_} iterations)")
print(f"Test R² Score: {mlp_baseline.score(X_test_scaled, y_test):.4f}")
# 3. Hyperparameter grid search
param_grid_mlp = {
'hidden_layer_sizes': [(50,), (100,), (100, 50), (150, 100, 50)],
'activation': ['relu', 'tanh'],
'alpha': [0.0001, 0.001, 0.01],
'learning_rate_init': [0.001, 0.01]
}
mlp_model = MLPRegressor(
solver='adam',
max_iter=1000,
early_stopping=True,
random_state=42,
verbose=False
)
grid_search_mlp = GridSearchCV(
mlp_model,
param_grid_mlp,
cv=3, # 3-fold because MLP is time-consuming
scoring='r2',
n_jobs=-1,
verbose=1
)
print("\nRunning MLP GridSearchCV...")
grid_search_mlp.fit(X_train_scaled, y_train)
print(f"\nOptimal Parameters: {grid_search_mlp.best_params_}")
print(f"Best CV R² Score: {grid_search_mlp.best_score_:.4f}")
# 4. Evaluate with optimal model
best_mlp = grid_search_mlp.best_estimator_
y_pred_test_mlp = best_mlp.predict(X_test_scaled)
mlp_metrics = evaluate_model(y_test, y_pred_test_mlp, "MLP Test Data")
# 5. Visualize loss curve
plt.figure(figsize=(10, 6))
plt.plot(best_mlp.loss_curve_, label='Training Loss')
if hasattr(best_mlp, 'validation_scores_'):
plt.plot(best_mlp.validation_scores_, label='Validation Score')
plt.xlabel('Iteration')
plt.ylabel('Loss / Score')
plt.title('MLP Learning Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('mlp_loss_curve.png', dpi=150)
print("\nSaved loss curve to mlp_loss_curve.png")
# 6. Investigate hidden layer size effects (Validation Curve)
train_scores, valid_scores = validation_curve(
MLPRegressor(activation='relu', alpha=0.001, random_state=42, max_iter=500),
X_train_scaled, y_train,
param_name='hidden_layer_sizes',
param_range=[(50,), (100,), (150,), (100, 50), (150, 100)],
cv=3,
scoring='r2',
n_jobs=-1
)
plt.figure(figsize=(10, 6))
plt.plot(['(50,)', '(100,)', '(150,)', '(100,50)', '(150,100)'],
train_scores.mean(axis=1), 'o-', label='Training Score')
plt.plot(['(50,)', '(100,)', '(150,)', '(100,50)', '(150,100)'],
valid_scores.mean(axis=1), 's-', label='Validation Score')
plt.xlabel('Hidden Layer Size')
plt.ylabel('R² Score')
plt.title('MLP Hidden Layer Size and Model Performance')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('mlp_validation_curve.png', dpi=150)
print("Saved Validation Curve to mlp_validation_curve.png")
A Learning Curve visualizes the relationship between training data volume and prediction performance, used to diagnose overfitting and underfitting.
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# Code Example 5: Learning curve analysis (learning_curve function)
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np
def plot_learning_curve(estimator, X, y, title, cv=5, n_jobs=-1,
train_sizes=np.linspace(0.1, 1.0, 10)):
"""
Function to plot learning curve
Parameters:
-----------
estimator : scikit-learn estimator
Model to evaluate
X : array-like, shape (n_samples, n_features)
Features
y : array-like, shape (n_samples,)
Target
title : str
Graph title
cv : int
Number of cross-validation folds
train_sizes : array-like
Training data sampling ratios
"""
plt.figure(figsize=(10, 6))
# Calculate training/validation scores with learning_curve function
train_sizes_abs, train_scores, valid_scores = learning_curve(
estimator, X, y,
train_sizes=train_sizes,
cv=cv,
scoring='r2',
n_jobs=n_jobs,
verbose=0
)
# Calculate mean and standard deviation
train_scores_mean = train_scores.mean(axis=1)
train_scores_std = train_scores.std(axis=1)
valid_scores_mean = valid_scores.mean(axis=1)
valid_scores_std = valid_scores.std(axis=1)
# Plot
plt.fill_between(train_sizes_abs,
train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std,
alpha=0.1, color='blue')
plt.fill_between(train_sizes_abs,
valid_scores_mean - valid_scores_std,
valid_scores_mean + valid_scores_std,
alpha=0.1, color='red')
plt.plot(train_sizes_abs, train_scores_mean, 'o-', color='blue',
label='Training Score')
plt.plot(train_sizes_abs, valid_scores_mean, 's-', color='red',
label='Cross-Validation Score')
plt.xlabel('Number of Training Samples')
plt.ylabel('R² Score')
plt.title(title)
plt.legend(loc='best')
plt.grid(alpha=0.3)
plt.tight_layout()
return plt
# Compare learning curves of multiple models
models = {
'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=20,
random_state=42),
'XGBoost': xgb.XGBRegressor(n_estimators=200, max_depth=6,
learning_rate=0.1, random_state=42),
'MLP': MLPRegressor(hidden_layer_sizes=(100, 50), alpha=0.001,
max_iter=500, random_state=42)
}
for model_name, model in models.items():
# Standardization required for MLP
if model_name == 'MLP':
X_current = X_train_scaled
y_current = y_train
else:
X_current = X_train
y_current = y_train
plot_learning_curve(
model, X_current, y_current,
title=f'{model_name} Learning Curve',
cv=5
)
plt.savefig(f'learning_curve_{model_name.replace(" ", "_").lower()}.png',
dpi=150)
print(f"Saved learning curve for {model_name}")
# Diagnostic guide
print("\n[Learning Curve Diagnostic Guide]")
print("1. High training / Low validation ā Overfitting (simplify model, strengthen regularization)")
print("2. Low training / Low validation ā Underfitting (complexify model, add features)")
print("3. Both scores converge high ā Good (little improvement expected from more data)")
print("4. Both scores increasing ā Expect improvement with more data")
Matbench v0.1 is a standard benchmark for material property prediction, consisting of 13 tasks (formation energy, band gap, elastic modulus, etc.). Each task contains thousands to tens of thousands of data points, enabling fair performance comparisons.
The following table compares the performance of composition-based models (Magpie, ElemNet) and GNN models (CGCNN, MEGNet, ALIGNN) on the Matbench benchmark.
| Task | Sample Count | Evaluation Metric | Magpie + RF | ElemNet | CGCNN | MEGNet | ALIGNN |
|---|---|---|---|---|---|---|---|
| Formation Energy (matbench_mp_e_form) | 132,752 | MAE (eV/atom) | 0.082 | 0.065 | 0.052 | 0.059 | 0.047 |
| Band Gap (matbench_mp_gap) | 106,113 | MAE (eV) | 0.42 | 0.35 | 0.27 | 0.30 | 0.25 |
| Bulk Modulus (matbench_mp_bulk_modulus) | 10,987 | MAE (log10(GPa)) | 0.082 | 0.075 | 0.063 | 0.068 | 0.059 |
| Shear Modulus (matbench_mp_shear_modulus) | 10,987 | MAE (log10(GPa)) | 0.092 | 0.084 | 0.071 | 0.076 | 0.067 |
| Dielectric Constant (matbench_dielectric) | 4,764 | MAE (log10) | 0.29 | 0.26 | 0.21 | 0.23 | 0.19 |
| Perovskite Formation (matbench_perovskites) | 18,928 | ROCAUC | 0.91 | 0.94 | 0.93 | 0.92 | 0.95 |
| Phonon Frequency (matbench_phonons) | 1,265 | MAE (cm⁻¹) | 89.5 | N/A | 62.3 | 71.2 | 58.7 |
Composition-based models are advantageous when crystal structures are unknown or not yet determined, when datasets are small, and when fast, interpretable CPU-only screening is needed. GNNs are advantageous when large, structure-annotated datasets are available and maximum accuracy is required. The following table summarizes the selection criteria for composition-based and GNN-based models:
| Criterion | Composition-Based Recommended | GNN Recommended |
|---|---|---|
| Data Size | Small-scale (<1,000 samples) | Large-scale (>10,000 samples) |
| Crystal Structure | Unknown/Undetermined | Known/High-precision |
| Computational Cost | Seconds on CPU | Minutes to hours on GPU |
| Interpretability | Feature importance is intuitive | Strong black-box nature |
| Prediction Accuracy | 85-95% of GNN | Highest accuracy |
| Search Phase | Initial screening | Precise evaluation of final candidates |
Combining the two approaches, using composition-based models for initial screening and GNNs for precise evaluation of the shortlisted candidates, enables high-accuracy material discovery while controlling computational costs.
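A minimal sketch of such a two-stage funnel is shown below. It assumes a fitted composition-based `composition_pipeline`, a candidate DataFrame with precomputed Magpie features and a 'structure' column, and a hypothetical structure-based `gnn_model`; none of these names refer to a specific library API:
# Sketch: composition-based screening followed by GNN re-evaluation
def two_stage_screen(candidates_df, feature_cols, composition_pipeline,
                     gnn_model, top_k=100):
    """Stage 1: cheap composition-based ranking of all candidates.
    Stage 2: expensive structure-based re-evaluation of the top_k shortlist."""
    # Stage 1: predict formation energy from composition features only
    stage1_pred = composition_pipeline.predict(candidates_df[feature_cols])
    ranked = candidates_df.assign(stage1_prediction=stage1_pred)
    # Keep the most stable candidates (lowest predicted formation energy)
    shortlist = ranked.nsmallest(top_k, 'stage1_prediction')
    # Stage 2: re-evaluate only the shortlist with the structure-based model
    shortlist = shortlist.assign(
        stage2_prediction=gnn_model.predict(shortlist['structure'])
    )
    return shortlist.sort_values('stage2_prediction')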
We execute the Matbench benchmark and evaluate the performance of composition-based models.
# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0
"""
Example: Execute the Matbench benchmark and evaluate a composition-based model
Purpose: Demonstrate machine learning model training and evaluation
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
# Code Example 6: Matbench benchmark evaluation
# Matbench installation
# !pip install matbench
from matbench.bench import MatbenchBenchmark
from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.conversions import StrToComposition
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import pandas as pd
# 1. Initialize Matbench benchmark
mb = MatbenchBenchmark(
autoload=False,
subset=[
'matbench_mp_e_form', # Formation energy
'matbench_mp_gap', # Band gap
'matbench_perovskites' # Perovskite formation
]
)
# 2. Load datasets
for task in mb.tasks:
task.load()
print(f"\n{task.dataset_name}:")
print(f" Sample count: {len(task.df)}")
print(f" Target: {task.metadata['target']}")
print(f" Task type: {task.metadata['task_type']}")
# 3. Build and evaluate composition-based model
def evaluate_composition_model(task):
"""
Evaluate Matbench task with composition-based model (Magpie + RandomForest)
"""
# Convert chemical composition
str_to_comp = StrToComposition(target_col_id='composition_obj')
task.df = str_to_comp.featurize_dataframe(task.df, 'composition')
# Generate Magpie features
featurizer = ElementProperty.from_preset('magpie')
task.df = featurizer.featurize_dataframe(task.df, 'composition_obj')
feature_cols = featurizer.feature_labels()
# Build pipeline
if task.metadata['task_type'] == 'regression':
model = RandomForestRegressor(
n_estimators=300,
max_depth=30,
min_samples_split=5,
random_state=42,
n_jobs=-1
)
else: # classification
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
n_estimators=300,
max_depth=30,
random_state=42,
n_jobs=-1
)
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', model)
])
# Evaluate with Matbench standard 5-fold CV
for fold_idx, fold in enumerate(task.folds):
# Get train/test data (the returned inputs are composition strings;
# the precomputed features are looked up in task.df by index)
train_inputs, train_outputs = task.get_train_and_val_data(fold)
X_train = task.df.loc[train_inputs.index, feature_cols]
y_train = train_outputs
# Train model
pipeline.fit(X_train, y_train)
# Predict on test data (include_target=False returns only the inputs)
test_inputs = task.get_test_data(fold, include_target=False)
X_test = task.df.loc[test_inputs.index, feature_cols]
predictions = pipeline.predict(X_test)
# Record predictions
task.record(fold, predictions)
print(f" Fold {fold_idx + 1} complete")
# Calculate overall scores
scores = task.scores
print(f"\n{task.dataset_name} Evaluation Results:")
for metric, values in scores.items():
print(f" {metric}: {values['mean']:.4f} ± {values['std']:.4f}")
return scores
# 4. Execute evaluation on all tasks
results = {}
for task in mb.tasks:
print(f"\n{'='*60}")
print(f"Evaluating: {task.dataset_name}")
print('='*60)
scores = evaluate_composition_model(task)
results[task.dataset_name] = scores
# 5. Display results summary
print("\n" + "="*60)
print("All Tasks Evaluation Results Summary")
print("="*60)
for dataset_name, scores in results.items():
print(f"\n{dataset_name}:")
for metric, values in scores.items():
print(f" {metric}: {values['mean']:.4f} ± {values['std']:.4f}")
# 6. Save results in Matbench official format
mb.to_file('matbench_composition_results.json')
print("\nSaved results to matbench_composition_results.json")
print("Can be submitted to official leaderboard: https://matbench.materialsproject.org/")
SHAP (SHapley Additive exPlanations) is an interpretation method that applies Shapley values from game theory to machine learning. It quantifies the contribution of each feature to predicted values, making black-box model decisions explainable.
The SHAP value $\phi_i$ for feature $i$ is defined as the following Shapley value:
$$ \phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!(|F|-|S|-1)!}{|F|!} [f(S \cup \{i\}) - f(S)] $$
Where $F$ is the set of all features, $S$ is a subset of features, and $f(S)$ is the predicted value using feature set $S$.
Intuitive Interpretation: The value averaged over all possible feature combinations of how much the predicted value changes when feature $i$ is added.
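To connect the formula to something concrete, the sketch below computes exact Shapley values by brute force for a toy three-feature additive model, enumerating every subset S exactly as in the definition (real SHAP implementations use much faster approximations; this enumeration is exponential in the number of features):
# Brute-force Shapley values for a toy model, directly following the formula
from itertools import combinations
from math import factorial
import numpy as np

def shapley_values(model_on_subset, n_features):
    """model_on_subset(S) returns the model prediction using only the feature set S."""
    all_features = set(range(n_features))
    phi = np.zeros(n_features)
    for i in range(n_features):
        for size in range(n_features):
            for S in combinations(all_features - {i}, size):
                S = set(S)
                weight = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                          / factorial(n_features))
                phi[i] += weight * (model_on_subset(S | {i}) - model_on_subset(S))
    return phi

# Toy additive model: each included feature contributes a fixed amount
contrib = {0: 1.5, 1: -0.5, 2: 0.25}
f = lambda S: sum(contrib[j] for j in S)
print(shapley_values(f, 3))  # [ 1.5  -0.5   0.25] -> each feature's own contribution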
It is important to understand the differences between Random Forest feature importance and SHAP:
| Comparison Item | Feature Importance | SHAP Value |
|---|---|---|
| Definition | Average impurity (Gini) decrease | Contribution to predicted value (Shapley value) |
| Interpretation | Global (overall importance) | Local (individual sample contribution) |
| Directionality | None (importance only) | Yes (positive/negative contribution) |
| Computational Cost | Low (calculated during training) | High (calculated per prediction) |
| Theoretical Guarantee | None (heuristic) | Yes (axiomatic foundation) |
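As a quick sanity check of this table on real data, the sketch below (assuming `best_rf`, `X_test`, and `feature_cols` from the earlier Random Forest example) compares the two rankings with a Spearman rank correlation; the two measures usually agree on the top features but can diverge lower down:
# Sketch: impurity-based importance vs. mean |SHAP| importance
import numpy as np
import pandas as pd
import shap
from scipy.stats import spearmanr

impurity_importance = best_rf.feature_importances_
shap_vals = shap.TreeExplainer(best_rf).shap_values(X_test)
shap_importance = np.abs(shap_vals).mean(axis=0)

rho, _ = spearmanr(impurity_importance, shap_importance)
print(f"Spearman rank correlation between the two rankings: {rho:.3f}")

# Features whose rank differs most between the two measures
comparison = pd.DataFrame({
    'feature': feature_cols,
    'impurity_rank': pd.Series(impurity_importance).rank(ascending=False).values,
    'shap_rank': pd.Series(shap_importance).rank(ascending=False).values
})
comparison['rank_gap'] = (comparison['impurity_rank'] - comparison['shap_rank']).abs()
print(comparison.sort_values('rank_gap', ascending=False).head(5).to_string(index=False))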
We analyze Random Forest models using the SHAP library.
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - shap>=0.42.0
"""
Example: We analyze Random Forest models using the SHAP library.
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
# Code Example 7: SHAP interpretation (summary_plot, dependence_plot)
# SHAP installation
# !pip install shap
import shap
import matplotlib.pyplot as plt
import numpy as np
# Use trained model (best_rf trained in previous section)
# X_test, y_test also continue from previous section
# 1. Create SHAP Explainer (Tree Explainer for Random Forest)
explainer = shap.TreeExplainer(best_rf)
# 2. Calculate SHAP values (all test data)
print("Calculating SHAP values...")
shap_values = explainer.shap_values(X_test)
print(f"SHAP values shape: {shap_values.shape}") # (n_samples, n_features)
# 3. Summary Plot (overall feature importance and distribution)
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values, X_test, feature_names=feature_cols,
show=False, max_display=20)
plt.tight_layout()
plt.savefig('shap_summary_plot.png', dpi=150, bbox_inches='tight')
print("\nSaved Summary Plot to shap_summary_plot.png")
plt.close()
# 4. Summary Plot (bar chart version: mean absolute SHAP values)
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values, X_test, feature_names=feature_cols,
plot_type='bar', show=False, max_display=20)
plt.tight_layout()
plt.savefig('shap_summary_bar.png', dpi=150, bbox_inches='tight')
print("Saved Summary Plot (bar chart) to shap_summary_bar.png")
plt.close()
# 5. Dependence Plot (detailed analysis of specific feature)
# Get most important feature
shap_importance = np.abs(shap_values).mean(axis=0)
top_feature_idx = np.argmax(shap_importance)
top_feature_name = feature_cols[top_feature_idx]
plt.figure(figsize=(10, 6))
shap.dependence_plot(
top_feature_idx,
shap_values,
X_test,
feature_names=feature_cols,
show=False
)
plt.tight_layout()
plt.savefig(f'shap_dependence_{top_feature_name}.png',
dpi=150, bbox_inches='tight')
print(f"\nSaved Dependence Plot ({top_feature_name})")
plt.close()
# 6. Force Plot (individual sample prediction explanation)
# JavaScript-based, recommended for Jupyter environment
shap.initjs()
# Explain prediction for first test sample
sample_idx = 0
shap_values_sample = shap_values[sample_idx, :]
base_value = explainer.expected_value
prediction = best_rf.predict(X_test.iloc[[sample_idx]])[0]
print(f"\nPrediction explanation for Sample #{sample_idx}:")
print(f" Base value (mean prediction): {base_value:.4f}")
print(f" Predicted Value: {prediction:.4f}")
print(f" Actual Value: {y_test.iloc[sample_idx]:.4f}")
# Display Force Plot (Jupyter environment)
# shap.force_plot(base_value, shap_values_sample, X_test.iloc[sample_idx, :],
# feature_names=feature_cols)
# 7. Mean absolute SHAP value ranking by feature
shap_df = pd.DataFrame({
'feature': feature_cols,
'mean_abs_shap': np.abs(shap_values).mean(axis=0)
}).sort_values('mean_abs_shap', ascending=False)
print("\nTop 20 Features (mean absolute SHAP value):")
print(shap_df.head(20).to_string(index=False))
# 8. Comparison with Partial Dependence Plot
from sklearn.inspection import partial_dependence, PartialDependenceDisplay
# Create partial dependence plots for top 3 features
top_3_features = shap_df.head(3)['feature'].tolist()
top_3_indices = [feature_cols.index(f) for f in top_3_features]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, (feature_idx, feature_name) in enumerate(zip(top_3_indices, top_3_features)):
pd_result = partial_dependence(
best_rf, X_test, [feature_idx], grid_resolution=50
)
axes[i].plot(pd_result['grid_values'][0], pd_result['average'][0])
axes[i].set_xlabel(feature_name)
axes[i].set_ylabel('Partial Dependence')
axes[i].grid(alpha=0.3)
plt.tight_layout()
plt.savefig('partial_dependence_plots.png', dpi=150)
print("\nSaved partial dependence plots to partial_dependence_plots.png")
# 9. Statistical summary of SHAP values
print("\nSHAP Value Statistics:")
print(f" Mean absolute SHAP value across all features: {np.abs(shap_values).mean():.4f}")
print(f" Maximum SHAP value: {shap_values.max():.4f}")
print(f" Minimum SHAP value: {shap_values.min():.4f}")
# Separate positive and negative contributions
positive_shap = shap_values[shap_values > 0].sum()
negative_shap = shap_values[shap_values < 0].sum()
print(f" Total positive contribution: {positive_shap:.4f}")
print(f" Total negative contribution: {negative_shap:.4f}")
Here is how to read each SHAP visualization: the Summary Plot (beeswarm) shows the distribution of SHAP values for every feature across all samples, with color encoding the feature value, so you can see both how important a feature is and in which direction it pushes the prediction; the bar-chart version ranks features by mean absolute SHAP value; the Dependence Plot relates a single feature's value to its SHAP contribution, exposing nonlinear effects and interactions; and the Force Plot decomposes one individual prediction into positive and negative feature contributions relative to the base value.
Application Example in Materials Science: If "electronegativity difference" shows high SHAP values in formation energy prediction, it suggests that ionic bonding contributes to stability.
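A hedged sketch of that kind of analysis is shown below: it searches `feature_cols` for a Magpie electronegativity-related column (the exact column name depends on the preset) and draws its SHAP dependence plot, reusing `shap_values` and `X_test` from Code Example 7:
# Sketch: SHAP dependence plot for an electronegativity-related Magpie feature
import shap
import matplotlib.pyplot as plt

en_features = [f for f in feature_cols if 'Electronegativity' in f]
if en_features:
    target_feature = en_features[0]
    shap.dependence_plot(
        feature_cols.index(target_feature),
        shap_values,
        X_test,
        feature_names=feature_cols,
        show=False
    )
    plt.tight_layout()
    plt.savefig('shap_dependence_electronegativity.png', dpi=150)
    print(f"Saved dependence plot for {target_feature}")
else:
    print("No electronegativity-related feature found in feature_cols")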
Build a scikit-learn pipeline including Magpie features, StandardScaler, and RandomForestRegressor, and calculate the cross-validation score (5-fold, R²).
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from matminer.featurizers.composition import ElementProperty
# Generate features (outside pipeline)
featurizer = ElementProperty.from_preset('magpie')
data = featurizer.featurize_dataframe(data, 'composition_obj')
X = data[featurizer.feature_labels()]
y = data['target']
# Build pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', RandomForestRegressor(n_estimators=100, random_state=42))
])
# Cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
print(f"5-Fold CV R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
Train XGBRegressor (XGBoost) with default parameters and calculate MAE, RMSE, and R² for training/test data.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - xgboost>=2.0.0
"""
Example: Train XGBRegressor with default parameters and calculate MAE, RMSE, and R²
Purpose: Demonstrate machine learning model training and evaluation
Target: Beginner to Intermediate
Execution time: 30-60 seconds
Dependencies: None
"""
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Train model
xgb_model = xgb.XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)
# Predict
y_pred_train = xgb_model.predict(X_train)
y_pred_test = xgb_model.predict(X_test)
# Evaluation metrics
print("Training Data:")
print(f" MAE: {mean_absolute_error(y_train, y_pred_train):.4f}")
print(f" RMSE: {np.sqrt(mean_squared_error(y_train, y_pred_train)):.4f}")
print(f" R²: {r2_score(y_train, y_pred_train):.4f}")
print("\nTest Data:")
print(f" MAE: {mean_absolute_error(y_test, y_pred_test):.4f}")
print(f" RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test)):.4f}")
print(f" R²: {r2_score(y_test, y_pred_test):.4f}")
Execute 3-fold cross-validation on MLPRegressor and display R² scores for each fold and the average.
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler
# Data standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# MLP model
mlp = MLPRegressor(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
# 3-fold CV
cv_results = cross_validate(
mlp, X_scaled, y,
cv=3,
scoring='r2',
return_train_score=True
)
print("Scores for each fold:")
for i, (train_score, test_score) in enumerate(zip(
cv_results['train_score'], cv_results['test_score']
)):
print(f" Fold {i+1}: Train R²={train_score:.4f}, Test R²={test_score:.4f}")
print(f"\nAverage Test R²: {cv_results['test_score'].mean():.4f} "
f"± {cv_results['test_score'].std():.4f}")
Optimize Random Forest hyperparameters (n_estimators, max_depth, min_samples_split) with GridSearchCV and display optimal parameters and best CV score.
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30],
'min_samples_split': [2, 5, 10]
}
rf_model = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(
rf_model, param_grid, cv=5, scoring='r2', n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Optimal Parameters: {grid_search.best_params_}")
print(f"Best CV R² Score: {grid_search.best_score_:.4f}")
# Evaluate on test data
test_score = grid_search.best_estimator_.score(X_test, y_test)
print(f"Test R² Score: {test_score:.4f}")
Draw learning curves and diagnose overfitting/underfitting. Vary training data size at 10%, 30%, 50%, 70%, and 100%.
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
"""
Example: Draw learning curves and diagnose overfitting/underfitting.
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
train_sizes, train_scores, valid_scores = learning_curve(
RandomForestRegressor(n_estimators=100, random_state=42),
X, y,
train_sizes=[0.1, 0.3, 0.5, 0.7, 1.0],
cv=5,
scoring='r2',
n_jobs=-1
)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training Score')
plt.plot(train_sizes, valid_scores.mean(axis=1), 's-', label='Validation Score')
plt.xlabel('Number of Training Samples')
plt.ylabel('R² Score')
plt.title('Learning Curve')
plt.legend()
plt.grid(alpha=0.3)
# Diagnosis
final_train_score = train_scores.mean(axis=1)[-1]
final_valid_score = valid_scores.mean(axis=1)[-1]
gap = final_train_score - final_valid_score
if gap > 0.1:
diagnosis = "Possibility of overfittingļ¼Training Score >> Validation Scoreļ¼"
elif final_valid_score < 0.7:
diagnosis = "Possibility of underfittingļ¼Both scores are lowļ¼"
else:
diagnosis = "Good generalization performance"
plt.text(0.5, 0.1, f"Diagnosis: {diagnosis}",
transform=plt.gca().transAxes, fontsize=12,
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.tight_layout()
plt.show()
Calculate SHAP values for a trained Random Forest model and display ranking of top 10 features by mean absolute SHAP value.
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - shap>=0.42.0
"""
Example: Calculate SHAP values for a trained Random Forest model and rank the top features
Purpose: Demonstrate model interpretation with SHAP
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
import shap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create TreeExplainer
explainer = shap.TreeExplainer(best_rf)
# Calculate SHAP values
shap_values = explainer.shap_values(X_test)
# Mean absolute SHAP value for ranking
shap_importance = np.abs(shap_values).mean(axis=0)
feature_names = X_test.columns.tolist()
shap_df = pd.DataFrame({
'feature': feature_names,
'mean_abs_shap': shap_importance
}).sort_values('mean_abs_shap', ascending=False)
print("Top 10 Features (mean absolute SHAP value):")
print(shap_df.head(10).to_string(index=False))
# Visualization
plt.figure(figsize=(10, 6))
plt.barh(shap_df.head(10)['feature'], shap_df.head(10)['mean_abs_shap'])
plt.xlabel('Mean Absolute SHAP Value')
plt.title('Feature Importance (SHAP)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Train three models (Random Forest, XGBoost, MLP) on the same dataset and output MAE, RMSE, and R² as a comparison table.
# Requirements:
# - Python 3.9+
# - xgboost>=2.0.0
"""
Example: Train three models (Random Forest, XGBoost, MLP) on the same dataset and compare them
Purpose: Demonstrate machine learning model training and evaluation
Target: Advanced
Execution time: 30-60 seconds
Dependencies: None
"""
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
# Define models
models = {
'Random Forest': RandomForestRegressor(n_estimators=200, max_depth=20,
random_state=42),
'XGBoost': xgb.XGBRegressor(n_estimators=200, max_depth=6,
learning_rate=0.1, random_state=42),
'MLP': MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=500,
random_state=42)
}
# Store evaluation results
results = []
for model_name, model in models.items():
# Standardize for MLP
if model_name == 'MLP':
scaler = StandardScaler()
X_train_current = scaler.fit_transform(X_train)
X_test_current = scaler.transform(X_test)
else:
X_train_current = X_train
X_test_current = X_test
# Train and predict
model.fit(X_train_current, y_train)
y_pred = model.predict(X_test_current)
# Evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
results.append({
'Model': model_name,
'MAE': mae,
'RMSE': rmse,
'R²': r2
})
# Display results
results_df = pd.DataFrame(results)
print("\nModel Performance Comparison:")
print(results_df.to_string(index=False))
# Highlight best model
best_model_idx = results_df['R²'].idxmax()
print(f"\nBest Model: {results_df.loc[best_model_idx, 'Model']}")
Implement StackingRegressor with Random Forest, XGBoost, and MLP as base models and LinearRegression as meta-model. Verify if performance improves over single models.
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
# Define base models
base_models = [
('rf', RandomForestRegressor(n_estimators=200, max_depth=20,
random_state=42)),
('xgb', xgb.XGBRegressor(n_estimators=200, max_depth=6,
learning_rate=0.1, random_state=42)),
('mlp', Pipeline([
('scaler', StandardScaler()),
('model', MLPRegressor(hidden_layer_sizes=(100, 50),
max_iter=500, random_state=42))
]))
]
# Meta-model
meta_model = LinearRegression()
# Build StackingRegressor
stacking_model = StackingRegressor(
estimators=base_models,
final_estimator=meta_model,
cv=5
)
# Train
print("Training Stacking model...")
stacking_model.fit(X_train, y_train)
# Evaluation
y_pred_stacking = stacking_model.predict(X_test)
stacking_mae = mean_absolute_error(y_test, y_pred_stacking)
stacking_rmse = np.sqrt(mean_squared_error(y_test, y_pred_stacking))
stacking_r2 = r2_score(y_test, y_pred_stacking)
print("\nStacking Model Performance:")
print(f" MAE: {stacking_mae:.4f}")
print(f" RMSE: {stacking_rmse:.4f}")
print(f" R²: {stacking_r2:.4f}")
# Comparison with single models
print("\nPerformance Improvement Rate (vs Random Forest):")
rf_r2 = 0.85 # Assume previous exercise results
improvement = (stacking_r2 - rf_r2) / rf_r2 * 100
print(f" R² Improvement: {improvement:.2f}%")
Train a composition-based model on Matbench's matbench_mp_e_form (formation energy) task and save results in official leaderboard format.
from matbench.bench import MatbenchBenchmark
# Initialize Matbench benchmark
mb = MatbenchBenchmark(autoload=False, subset=['matbench_mp_e_form'])
for task in mb.tasks:
task.load()
# Convert chemical composition and generate Magpie features
str_to_comp = StrToComposition(target_col_id='composition_obj')
task.df = str_to_comp.featurize_dataframe(task.df, 'composition')
featurizer = ElementProperty.from_preset('magpie')
task.df = featurizer.featurize_dataframe(task.df, 'composition_obj')
feature_cols = featurizer.feature_labels()
# Pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', RandomForestRegressor(n_estimators=300, max_depth=30,
min_samples_split=5, random_state=42))
])
# Evaluation with 5-fold CV
for fold_idx, fold in enumerate(task.folds):
train_inputs, train_outputs = task.get_train_and_val_data(fold)
# The inputs are composition strings; look up the precomputed features by index
X_train_fold = task.df.loc[train_inputs.index, feature_cols]
y_train_fold = train_outputs
pipeline.fit(X_train_fold, y_train_fold)
# include_target=False returns only the inputs
test_inputs = task.get_test_data(fold, include_target=False)
X_test_fold = task.df.loc[test_inputs.index, feature_cols]
predictions = pipeline.predict(X_test_fold)
task.record(fold, predictions)
print(f"Fold {fold_idx + 1} completed")
# Display scores
scores = task.scores
for metric, values in scores.items():
print(f"{metric}: {values['mean']:.4f} ± {values['std']:.4f}")
# Save results
mb.to_file('matbench_eform_results.json')
print("\nSaved results to matbench_eform_results.json ")
Implement Mean Absolute Percentage Error (MAPE) as a custom evaluation metric and use it with GridSearchCV. MAPE is defined by the following formula:
$$\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$
from sklearn.metrics import make_scorer
def mean_absolute_percentage_error(y_true, y_pred):
"""
MAPE (Mean Absolute Percentage Error) calculation
Note: Add small value to avoid errors when y_true contains 0
"""
y_true = np.array(y_true)
y_pred = np.array(y_pred)
# Avoid division by zero
epsilon = 1e-10
return np.mean(np.abs((y_true - y_pred) / (y_true + epsilon))) * 100
# Convert to scikit-learn scorer
mape_scorer = make_scorer(mean_absolute_percentage_error,
greater_is_better=False) # Lower MAPE is better
# Use custom evaluation metric with GridSearchCV
param_grid = {
'n_estimators': [100, 200],
'max_depth': [10, 20]
}
grid_search = GridSearchCV(
RandomForestRegressor(random_state=42),
param_grid,
cv=5,
scoring=mape_scorer, # Custom evaluation metric
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Optimal Parameters: {grid_search.best_params_}")
print(f"Best CV MAPE Score: {-grid_search.best_score_:.4f}%") # Invert negative
# Evaluate on test data
y_pred_test = grid_search.best_estimator_.predict(X_test)
test_mape = mean_absolute_percentage_error(y_test, y_pred_test)
print(f"Test MAPE: {test_mape:.4f}%")
In this chapter, we learned practical methods for integrating composition-based features with machine learning models to predict material properties with high accuracy: building scikit-learn pipelines around matminer Featurizers, comparing Random Forest, XGBoost, and MLP models, optimizing hyperparameters with GridSearchCV and RandomizedSearchCV, diagnosing over- and underfitting with learning curves, benchmarking on Matbench, and interpreting predictions with SHAP.
Used as an initial screen ahead of GNN-based evaluation of the most promising candidates, this hybrid approach enables high-accuracy material discovery while reducing computational cost by roughly 90%.
In Chapter 5, we will learn end-to-end workflows from discovery of new materials to experimental validation by applying composition-based features to actual material discovery projects. We will proceed to more practical content including integration with the Materials Project database, efficiency through active learning, and uncertainty quantification.