Chapter 2: Materials Informatics Methods for Catalyst Design

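This chapter covers materials informatics methods for catalyst design: descriptors, activity and selectivity prediction models, Bayesian optimization, integration with DFT calculations, and the main databases and tools, including Catalysis-Hub (https://www.catalysis-hub.org/), Materials Project (https://materialsproject.org/), and ASE (https://wiki.fysik.dtu.dk/ase/).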

Learning Objectives

Descriptor Design: Understand the 4 types of catalyst descriptors and how to use them appropriately
Prediction Models: Explain the construction procedure for activity and selectivity prediction models
Bayesian Optimization: Understand principles and application methods
DFT Integration: Comprehend integration techniques for first-principles calculations and machine learning
Databases: Understand characteristics and appropriate use of major catalyst databases

2.1 Catalyst Descriptors

2.1.1 Role of Descriptors

Descriptors are numerical representations of catalyst properties. They are used as inputs for machine learning models.

Requirements for Good Descriptors:
- ✅ Clear physical meaning
- ✅ Easy to compute
- ✅ Correlated with activity
- ✅ General applicability (applicable to different reaction systems)

2.1.2 Classification of Descriptors

1. Electronic Descriptors

Descriptor	Definition	Relationship to Catalytic Activity
d-Orbital Occupancy	Number of electrons in d-orbitals of transition metals	Determines adsorption energy
d-band Center	Center of gravity of d-orbital energy levels	Higher values lead to stronger adsorption
Work Function	Energy required to remove electrons from surface	Affects electron transfer reactions
Bader Charge	Localized atomic charge	Correlates with redox activity

d-band Theory (Nørskov):

Adsorption strength correlates with d-band center position

d-band center close to Fermi level
  → Antibonding adsorbate-metal states pushed above the Fermi level (lower occupation)
  → Strong adsorption
  → Facile bond activation (but slow desorption)

Optimal d-band Center: Intermediate value (Sabatier principle)
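
The d-band center is commonly computed as the first moment of the d-projected density of states (DOS). The sketch below illustrates that calculation on a synthetic Gaussian DOS. The function name `d_band_center`, the energy grid, and the DOS shape are assumptions for illustration; a real workflow would read the projected DOS from a DFT output instead.

```python
import numpy as np

def d_band_center(energies, dos):
    """First moment of the d-projected DOS.

    Assumes `energies` is a uniform grid in eV, referenced to the Fermi level.
    """
    energies = np.asarray(energies, dtype=float)
    dos = np.asarray(dos, dtype=float)
    # On a uniform grid the spacing cancels, so a ratio of sums is enough
    return np.sum(energies * dos) / np.sum(dos)

# Synthetic d-band: Gaussian centered 2.5 eV below the Fermi level (made-up values)
energies = np.linspace(-10.0, 5.0, 1501)
dos = np.exp(-((energies + 2.5) ** 2) / (2 * 1.0 ** 2))

center = d_band_center(energies, dos)
print(f"d-band center: {center:.2f} eV relative to the Fermi level")
```

Shifting the synthetic peak toward 0 eV moves the computed center toward the Fermi level, which is the regime the text associates with stronger adsorption.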

2. Geometrical Descriptors

Descriptor	Description	Example
Coordination Number (CN)	Number of neighboring atoms	Lower CN (edge, corner) is more active
Atomic Radius	Size of metal atom	Correlates with lattice strain
Surface Area (BET)	Specific surface area of catalyst	Larger area increases active sites
Pore Diameter	Pore size of zeolites	Determines shape selectivity
Crystal Facet	(111), (100), (110), etc.	Different active site densities
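
As a rough illustration of the coordination number descriptor, the sketch below counts neighbors within a distance cutoff. The toy 3x3x3 simple-cubic cluster, the cutoff value, and the function name `coordination_numbers` are assumptions chosen for clarity; periodic boundaries are not handled.

```python
import numpy as np

def coordination_numbers(positions, cutoff):
    """Count neighbors within `cutoff` for every atom (no periodic boundaries)."""
    positions = np.asarray(positions, dtype=float)
    # Pairwise distance matrix via broadcasting
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Exclude self-distances (zero) and count the remaining pairs below the cutoff
    neighbors = (dist > 1e-8) & (dist < cutoff)
    return neighbors.sum(axis=1)

# Toy cluster: 3x3x3 simple-cubic block with unit spacing (not a real catalyst)
grid = np.array(
    [[i, j, k] for i in range(3) for j in range(3) for k in range(3)],
    dtype=float,
)
cn = coordination_numbers(grid, cutoff=1.1)
center_cn = cn[13]  # atom at (1, 1, 1), fully coordinated
corner_cn = cn[0]   # atom at (0, 0, 0), under-coordinated
print("center CN:", center_cn, "corner CN:", corner_cn)
```

The corner atom has fewer neighbors than the interior atom, which mirrors the edge and corner sites highlighted in the table.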

3. Compositional Descriptors

Descriptor	Definition	Application Example
Elemental Composition	Molar fraction of each element	Composition optimization
Electronegativity	Strength of electron attraction	Redox activity
Ionic Radius	Atomic size in ionic state	Interaction with support
Melting Point	Melting point of metal	Indicator of thermal stability
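
One simple way to turn a composition into numbers is to combine molar fractions with tabulated elemental properties. The sketch below does this for Pauling electronegativity; the helper name `composition_features` and the choice of mean and range as summary statistics are assumptions for illustration.

```python
# Pauling electronegativities for a few metals (standard tabulated values)
PAULING_EN = {"Pt": 2.28, "Pd": 2.20, "Ni": 1.91, "Cu": 1.90, "Au": 2.54}

def composition_features(composition):
    """Molar fractions plus fraction-weighted mean and range of electronegativity."""
    total = sum(composition.values())
    fractions = {el: n / total for el, n in composition.items()}
    mean_en = sum(frac * PAULING_EN[el] for el, frac in fractions.items())
    values = [PAULING_EN[el] for el in fractions]
    return {
        "fractions": fractions,
        "mean_en": mean_en,
        "en_range": max(values) - min(values),
    }

features = composition_features({"Pt": 3, "Ni": 1})  # Pt3Ni alloy
print(features)
```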

4. Reaction Descriptors

Descriptor	Definition	Usage
Adsorption Energy	Energy when molecules adsorb on surface	Direct indicator of activity
Activation Energy	Reaction barrier	Prediction of reaction rate
Transition State Energy	Stability of transition state	Identification of rate-determining step
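
The two most common reaction descriptors can be written as short formulas. The sketch below assumes the usual sign convention, E_ads = E(slab + adsorbate) - E(slab) - E(gas), where a negative value means binding, and an Arrhenius rate constant k = A·exp(-Ea / (kB·T)). The total energies and the prefactor are made-up numbers.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def adsorption_energy(e_slab_ads, e_slab, e_gas):
    """E_ads = E(slab + adsorbate) - E(slab) - E(gas); negative means binding."""
    return e_slab_ads - e_slab - e_gas

def arrhenius_rate(prefactor, e_act, temperature):
    """k = A * exp(-Ea / (kB * T)), with Ea in eV and T in K."""
    return prefactor * math.exp(-e_act / (K_B_EV * temperature))

# Made-up DFT total energies in eV
e_ads = adsorption_energy(e_slab_ads=-105.2, e_slab=-100.0, e_gas=-4.5)

# Effect of lowering the barrier by 0.2 eV at 500 K (prefactor is illustrative)
k_high_barrier = arrhenius_rate(1e13, 0.8, 500.0)
k_low_barrier = arrhenius_rate(1e13, 0.6, 500.0)
print(f"E_ads = {e_ads:.2f} eV, speed-up = {k_low_barrier / k_high_barrier:.0f}x")
```

Because the barrier sits in the exponent, a 0.2 eV reduction changes the rate by roughly two orders of magnitude at 500 K.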

2.2 Sabatier Principle and d-band Theory

2.2.1 Sabatier Principle

Definition: The optimal catalyst interacts with reaction intermediates with "just the right strength."

Adsorption too weak:
  → Reactants don't stay on surface
  → Low activity

Adsorption too strong:
  → Products can't desorb from surface
  → Low activity

Optimal adsorption strength:
  → Peak of Volcano Plot

Volcano Plot:

Activity (TOF)
    |
    |        *
    |      /   \
    |     /     \
    |    /       \
    |   /         \
    |  /           \
    |_________________ Adsorption Energy
   Weak  Optimal  Strong
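
A volcano curve can be sketched as the minimum of two linear branches, one limited by weak adsorption and one limited by slow desorption. The optimum `e_opt` and both slopes below are hypothetical values picked for illustration, not fitted parameters.

```python
import numpy as np

def volcano_log_activity(e_ads, e_opt=-0.5, slope_weak=2.0, slope_strong=1.5):
    """Toy Sabatier volcano: log-activity relative to the peak (all parameters assumed)."""
    e_ads = np.asarray(e_ads, dtype=float)
    # Weak-binding side: activity rises as adsorption gets stronger (more negative e_ads)
    weak_branch = -slope_weak * (e_ads - e_opt)
    # Strong-binding side: activity falls as adsorption gets stronger still
    strong_branch = slope_strong * (e_ads - e_opt)
    # The slower of the two limiting steps sets the overall activity
    return np.minimum(weak_branch, strong_branch)

e_grid = np.linspace(-1.5, 0.5, 201)           # adsorption energies in eV
log_activity = volcano_log_activity(e_grid)
best_e = e_grid[np.argmax(log_activity)]
print(f"Peak of the volcano at E_ads = {best_e:.2f} eV")
```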

2.2.2 Scaling Relations

Adsorption energies of chemically related intermediates often scale linearly with one another (the coefficients below are illustrative):

E(OH*) = 0.5 * E(O*) + 0.25 eV

E(CHO*) = E(CO*) + 0.8 eV

⇒ One descriptor (e.g., E(O*)) can predict multiple adsorbates
⇒ Dimensionality reduction of descriptors
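
A scaling relation is obtained in practice by a linear fit across many surfaces. The sketch below generates synthetic E(O*) and E(OH*) values from the illustrative relation above, adds noise, and recovers the slope and intercept with `numpy.polyfit`. The data are synthetic, not DFT results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "surfaces": E(O*) in eV, and E(OH*) generated from the illustrative relation
e_o = rng.uniform(-2.0, 2.0, size=30)
e_oh = 0.5 * e_o + 0.25 + rng.normal(0.0, 0.05, size=30)

# Least-squares line: E(OH*) ≈ slope * E(O*) + intercept
slope, intercept = np.polyfit(e_o, e_oh, deg=1)

# Use the fitted relation to estimate E(OH*) on a new surface from E(O*) alone
e_oh_new = slope * 1.0 + intercept
print(f"slope = {slope:.2f}, intercept = {intercept:.2f} eV")
```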

2.3 Activity and Selectivity Prediction Models

2.3.1 Regression Models (Activity Prediction)

Objective: Predict catalyst activity (TOF, conversion)

Workflow:

1. Data Collection
   - Experimental data: Activity measurements
   - DFT data: Adsorption energies

2. Descriptor Calculation
   - Electronic: d-band center
   - Geometrical: Coordination number
   - Compositional: Elemental composition

3. Model Training
   - Random Forest
   - Gradient Boosting
   - Neural Network

4. Performance Evaluation
   - R², RMSE, MAE
   - Cross-validation

5. Prediction
   - Activity prediction for unknown catalysts
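
The workflow above can be sketched with scikit-learn on synthetic data. The three feature columns stand in for a d-band center, a coordination number, and a dopant fraction, and the target function is invented for illustration. None of this is real catalyst data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_samples = 200

# Steps 1-2: synthetic descriptors (d-band center in eV, coordination number, dopant fraction)
X = np.column_stack([
    rng.uniform(-3.5, -1.0, n_samples),
    rng.integers(6, 13, n_samples),
    rng.uniform(0.0, 0.5, n_samples),
])
# Invented volcano-like target: log10(TOF) peaks at a d-band center of -2.2 eV
y = (-2.0 * (X[:, 0] + 2.2) ** 2 - 0.1 * (X[:, 1] - 9) + X[:, 2]
     + rng.normal(0.0, 0.1, n_samples))

# Steps 3-4: train and evaluate with 5-fold cross-validation
model = RandomForestRegressor(n_estimators=200, random_state=0)
mean_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
mean_mae = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_absolute_error").mean()
print(f"CV R2 = {mean_r2:.2f}, CV MAE = {mean_mae:.2f}")

# Step 5: predict an unseen candidate
model.fit(X, y)
prediction = model.predict([[-2.2, 9, 0.25]])
```

Cross-validated scores are reported rather than training scores because tree ensembles can fit the training set almost perfectly.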

Recommended Models:

Model	Advantages	Disadvantages	Recommended Data Size
Random Forest	Interpretable, stable	Weak extrapolation	100+
XGBoost	High accuracy, fast	Many hyperparameters	200+
Neural Network	Learns complex relationships	Prone to overfitting	500+
Gaussian Process	Uncertainty quantification	Doesn't scale	<500

2.3.2 Classification Models (Active Catalyst Screening)

Objective: Classify active vs. inactive catalysts

Class Definition:

Active catalyst: TOF > Threshold (e.g., 1 s⁻¹)
Inactive catalyst: TOF ≤ Threshold

Evaluation Metrics:
- Precision: Fraction of predicted active catalysts that are actually active
- Recall: Fraction of actual active catalysts correctly predicted
- F1 Score: Harmonic mean of Precision and Recall
- ROC-AUC: Overall evaluation of classification performance
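
To make the metric definitions concrete, the sketch below labels catalysts as active or inactive with a TOF threshold of 1 s⁻¹ and computes precision, recall, and F1 from the confusion counts. The TOF values are made up, and ROC-AUC is left out because it needs continuous scores rather than hard labels.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 for boolean labels (True = active)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))    # predicted active, actually active
    fp = int(np.sum(~y_true & y_pred))   # predicted active, actually inactive
    fn = int(np.sum(y_true & ~y_pred))   # predicted inactive, actually active
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

threshold = 1.0  # s^-1
tof_measured = np.array([0.2, 1.5, 3.0, 0.8, 2.2, 0.1])    # made-up values
tof_predicted = np.array([0.5, 1.2, 0.7, 1.4, 2.9, 0.05])  # made-up values

precision, recall, f1 = classification_metrics(
    tof_measured > threshold, tof_predicted > threshold
)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.2f}")
```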

2.4 Catalyst Discovery via Bayesian Optimization

2.4.1 Principles of Bayesian Optimization

Objective: Discover optimal catalyst with minimum number of experiments

Components:
1. Surrogate Model: Gaussian Process
2. Acquisition Function: Selects the next candidate to test

Algorithm:

1. Initial experiments (10-20 samples)
   → Obtain composition and activity data

2. Train surrogate model with Gaussian Process
   → Predict activity for unknown compositions (mean + uncertainty)

3. Select next experiment using acquisition function
   - EI (Expected Improvement)
   - UCB (Upper Confidence Bound)
   - PI (Probability of Improvement)

4. Conduct experiment with selected composition

5. Update data and return to step 2

6. Iterate until convergence criteria met
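
A minimal sketch of this loop, assuming a one-dimensional composition fraction and a hypothetical `run_experiment` function that stands in for a real measurement. It uses scikit-learn's `GaussianProcessRegressor` as the surrogate and Expected Improvement evaluated on a grid of candidates. The initial sample count is reduced to 5 because the toy search space is so small.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def run_experiment(x):
    """Hypothetical stand-in for measuring activity at composition fraction x."""
    return np.exp(-((x - 0.3) / 0.15) ** 2)

def expected_improvement(mu, sigma, best, xi=0.01):
    sigma = np.maximum(sigma, 1e-12)  # guard against zero uncertainty
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 1.0, 201).reshape(-1, 1)

# Step 1: initial experiments
X = rng.uniform(0.0, 1.0, size=(5, 1))
y = run_experiment(X).ravel()

for _ in range(15):
    # Step 2: fit the surrogate and predict mean and uncertainty
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True, random_state=0)
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Step 3: pick the candidate with the highest acquisition value
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, y.max()))]
    # Steps 4-5: run the "experiment" and add it to the data
    X = np.vstack([X, x_next])
    y = np.append(y, run_experiment(x_next[0]))

best_x, best_y = X[np.argmax(y), 0], y.max()
print(f"Best composition fraction: {best_x:.3f} (activity {best_y:.3f})")
```

A fixed iteration budget replaces the convergence criterion here; in practice one might stop when the maximum acquisition value falls below a tolerance.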

2.4.2 Comparison of Acquisition Functions

Acquisition Function	Formula	Characteristics	Recommended Scenario
EI	E[max(f(x) - f(x⁺), 0)]	Balanced	General purpose
UCB	μ(x) + β·σ(x)	Exploration-focused	Broad exploration
PI	P(f(x) > f(x⁺))	Exploitation-focused	Local optimization

Parameter Tuning:
- β (UCB exploration degree): Large initially (3.0), smaller later (1.0)
- ξ (EI trade-off, added to f(x⁺) to encourage exploration): Typically 0.01-0.1
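
The three acquisition functions can be compared on two hypothetical candidates, one with a slightly better mean and low uncertainty and one with a worse mean and high uncertainty. The closed forms below assume a Gaussian predictive distribution. The linear decay in `beta_schedule` is one possible way to go from 3.0 to 1.0 and is an assumption, not a prescribed schedule.

```python
import math

def _pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best, xi=0.01):
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * _cdf(z) + sigma * _pdf(z)

def upper_confidence_bound(mu, sigma, beta):
    return mu + beta * sigma

def probability_of_improvement(mu, sigma, best, xi=0.0):
    if sigma <= 0.0:
        return float(mu > best + xi)
    return _cdf((mu - best - xi) / sigma)

def beta_schedule(iteration, n_iterations, beta_start=3.0, beta_end=1.0):
    """Linear decay of beta over the run (assumed schedule)."""
    frac = iteration / max(n_iterations - 1, 1)
    return beta_start + frac * (beta_end - beta_start)

best = 1.0
safe = (1.05, 0.05)   # (mean, std): small, nearly certain gain
risky = (0.90, 0.50)  # (mean, std): worse mean, large uncertainty

for name, (mu, sigma) in {"safe": safe, "risky": risky}.items():
    print(name,
          f"EI={expected_improvement(mu, sigma, best):.3f}",
          f"UCB={upper_confidence_bound(mu, sigma, beta=3.0):.2f}",
          f"PI={probability_of_improvement(mu, sigma, best):.2f}")
```

In this example PI favors the safe candidate while UCB with β = 3.0 and EI favor the uncertain one, which matches the exploitation and exploration tendencies listed in the table.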

2.4.3 Multi-Objective Bayesian Optimization

Objective: Simultaneously optimize activity and selectivity

Pareto Front:

Selectivity
    |
100%|      * (ideal)
    |    *   *
    |  *       *
    | *         *
    |*___________*___ Activity (TOF)
   0%           High

Pareto Front: Set of non-dominated solutions, i.e., the boundary where improving one objective worsens the other

Methods:
- ParEGO (Pareto Efficient Global Optimization)
- NSGA-II (Non-dominated Sorting Genetic Algorithm II; an evolutionary algorithm rather than a Bayesian method, often used as a baseline)
- EHVI (Expected Hypervolume Improvement)
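
Whatever method is used, the result is judged by which candidates are non-dominated. The sketch below extracts the Pareto front from a small table of made-up (activity, selectivity) pairs, assuming both objectives are maximized. It is a plain dominance filter, not an implementation of ParEGO, NSGA-II, or EHVI.

```python
import numpy as np

def pareto_mask(objectives):
    """Boolean mask of non-dominated rows, assuming every column is maximized."""
    objectives = np.asarray(objectives, dtype=float)
    mask = np.ones(len(objectives), dtype=bool)
    for i, point in enumerate(objectives):
        # A row dominates `point` if it is >= in all objectives and > in at least one
        dominators = (np.all(objectives >= point, axis=1)
                      & np.any(objectives > point, axis=1))
        mask[i] = not dominators.any()
    return mask

# Columns: activity (TOF, s^-1), selectivity (%); made-up values
catalysts = np.array([
    [5.0, 60.0],
    [3.0, 90.0],
    [4.0, 80.0],
    [2.0, 70.0],   # dominated by [3.0, 90.0]
    [4.5, 55.0],   # dominated by [5.0, 60.0]
])
front = pareto_mask(catalysts)
print(catalysts[front])
```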

2.5 Integration with DFT Calculations

2.5.1 What is DFT (Density Functional Theory)?

Objective: Calculate electronic structure at atomic level based on quantum mechanics

Calculable Properties:
- Adsorption energy
- Activation energy (transition state)
- Electron density distribution
- Band structure

Computational Cost:
- 1 structure: Several hours to days (depends on CPU cores)
- Transition state search: Days to weeks

2.5.2 Multi-Fidelity Optimization

Strategy: Combine inexpensive low-fidelity calculations with expensive high-fidelity calculations

Low-Fidelity:
- Empirical models (bond-order force fields)
- Small k-point mesh
- Low cutoff energy
- Cost: 1 minute/structure

High-Fidelity:
- Converged DFT calculation
- Dense k-point mesh
- High cutoff energy
- Cost: 10 hours/structure

Multi-Fidelity:
1. Screen 10,000 structures with Low-Fidelity (~7 days)
2. Calculate top 100 structures with High-Fidelity (~42 days)
3. Train ML with both datasets
4. Prediction accuracy: Can approach that of a High-Fidelity-only model
5. Total cost: ~49 days, versus ~4,200 days for High-Fidelity on all 10,000 structures (~1/85, assuming serial execution)
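
The day counts follow directly from the per-structure costs listed above. The short calculation below reproduces them, assuming calculations run one after another, and compares the total with running High-Fidelity on every candidate.

```python
MINUTES_PER_DAY = 24 * 60

n_candidates = 10_000
n_top = 100
low_cost_min = 1         # minutes per structure (Low-Fidelity)
high_cost_min = 10 * 60  # minutes per structure (High-Fidelity)

low_days = n_candidates * low_cost_min / MINUTES_PER_DAY    # screening stage
high_days = n_top * high_cost_min / MINUTES_PER_DAY         # refinement stage
multi_fidelity_days = low_days + high_days

# Reference: High-Fidelity on all candidates, run serially
brute_force_days = n_candidates * high_cost_min / MINUTES_PER_DAY
cost_ratio = multi_fidelity_days / brute_force_days

print(f"Low-Fidelity stage:  {low_days:.1f} days")
print(f"High-Fidelity stage: {high_days:.1f} days")
print(f"Total: {multi_fidelity_days:.1f} days vs {brute_force_days:.0f} days "
      f"(ratio {cost_ratio:.4f})")
```

Parallel execution shortens the wall-clock time of both strategies but leaves the ratio of total compute unchanged.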

2.5.3 Transfer Learning

Idea: Transfer knowledge from existing reaction systems to new reaction systems

Example:

Source Task: CO oxidation (large dataset available)
Target Task: NO reduction (limited data)

Procedure:
1. Train DNN on Source Task
2. Transfer learning on Target Task
   - Lower layers (general features): Fixed
   - Upper layers (task-specific): Retrain
3. Required target data: Can drop to roughly 1/5 to 1/10 of what training from scratch would need

2.6 Major Databases and Tools

2.6.1 Catalyst Databases

1. Catalysis-Hub.org
- Content: 20,000+ catalyst reaction energies
- Data: DFT calculation results (adsorption energies, transition states)
- Format: JSON API, Python API
- URL: https://www.catalysis-hub.org/

2. Materials Project
- Content: 140,000+ inorganic materials
- Data: Crystal structures, band gaps, formation energies
- API: Python (pymatgen)
- URL: https://materialsproject.org/

3. NIST Kinetics Database
- Content: Chemical reaction rate constants
- Data: Arrhenius parameters (A, Ea)
- Format: Web search
- URL: https://kinetics.nist.gov/

2.6.2 Computational Tools

1. ASE (Atomic Simulation Environment)
- Language: Python
- Functions:
  - Structure optimization
  - Vibrational analysis
  - NEB (transition state search)
  - Integration with various calculation engines (VASP, Quantum ESPRESSO)
- Installation: conda install -c conda-forge ase

2. Pymatgen
- Functions:
  - Read/write crystal structures
  - Symmetry analysis
  - Phase diagram calculation
  - Integration with Materials Project
- Installation: pip install pymatgen

3. matminer
- Functions:
  - Automatic calculation of descriptors (200+ types)
  - Data retrieval from databases
  - Feature engineering
- Installation: pip install matminer

2.7 Catalyst MI Workflow

Integrated Workflow

flowchart TD
    A[Set Target Reaction] --> B[Initial Data Collection]
    B --> C[Descriptor Calculation]
    C --> D[ML Model Training]
    D --> E[Bayesian Optimization]
    E --> F[Candidate Catalyst Selection]
    F --> G{DFT Validation}
    G -->|Low Activity| E
    G -->|High Activity| H[Experimental Validation]
    H --> I{Goal Achieved?}
    I -->|No| C
    I -->|Yes| J[Optimal Catalyst]

Implementation Example (Pseudocode)

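# NOTE: pseudocode. load_catalysis_hub_data, calculate_descriptors,
# BayesianOptimization, and run_dft are hypothetical placeholders,
# not functions from a real library.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Step 1: Data Collection
data = load_catalysis_hub_data(reaction='CO_oxidation')

# Step 2: Descriptor Calculation
descriptors = calculate_descriptors(data['structures'])

# Step 3: ML Model Training (hold out a test set to check generalization)
X_train, X_test, y_train, y_test = train_test_split(
    descriptors, data['activity'], test_size=0.2, random_state=0)
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)
print('Test R²:', model.score(X_test, y_test))

# Step 4: Bayesian Optimization (each suggestion is validated with DFT)
optimizer = BayesianOptimization(model, acquisition='EI')
for _ in range(50):
    next_candidate = optimizer.suggest()
    dft_energy = run_dft(next_candidate)  # DFT calculation
    optimizer.update(next_candidate, dft_energy)

# Step 5: Optimal Catalyst
best_catalyst = optimizer.get_best()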

Summary

In this chapter, we learned Materials Informatics methods specialized for catalyst design:

What We Learned

Descriptors: 4 types - electronic, geometrical, compositional, and reaction descriptors
Sabatier Principle: Optimal adsorption strength, volcano plots
Prediction Models: When to use Random Forest, XGBoost, Neural Networks
Bayesian Optimization: Efficient exploration, acquisition functions (EI, UCB, PI)
DFT Integration: Multi-Fidelity, Transfer Learning
Databases and Tools: Catalysis-Hub, Materials Project, NIST Kinetics Database; ASE, Pymatgen, matminer

Next Steps

In Chapter 3, we'll implement catalyst MI in Python:
- Structure manipulation with ASE
- Building activity prediction models
- Composition exploration via Bayesian optimization
- Integration with DFT calculations
- 30 executable code examples

Exercises

Basic Level

Problem 1: Explain why adsorption becomes stronger when the d-band center is close to the Fermi level in d-band theory.

Problem 2: Classify the following descriptors into electronic, geometrical, compositional, and reaction descriptors:
- Coordination number
- Adsorption energy
- Electronegativity
- Work function

Problem 3: Explain the Sabatier principle in three cases: "adsorption too weak," "too strong," and "optimal."

Intermediate Level

Problem 4: Compare the three acquisition functions (EI, UCB, PI) in Bayesian optimization and explain in which situations each should be used.

Problem 5: Explain why Multi-Fidelity Optimization can reduce computational costs, including the characteristics of Low-Fidelity and High-Fidelity approaches.

Advanced Level

Problem 6: In designing CO2 reduction catalysts, you need to simultaneously optimize activity and selectivity. Propose an exploration strategy using multi-objective Bayesian optimization. Include:
- Definition of objective functions
- Concept of Pareto front
- Specific acquisition functions

Problem 7: Design a strategy for using Transfer Learning to design catalysts for a new reaction system with limited data (e.g., ammonia decomposition). Explain what to choose as the Source Task and why.

References

Important Papers

Nørskov, J. K., et al. (2009). "Towards the computational design of solid catalysts." Nature Chemistry, 1, 37-46.
Ulissi, Z. W., et al. (2017). "To address surface reaction network complexity using scaling relations machine learning and DFT calculations." Nature Communications, 8, 14621.
Wertheim, M. K., et al. (2020). "Bayesian optimization for catalysis." ACS Catalysis, 10(20), 12186-12200.

Databases & Tools

Catalysis-Hub: https://www.catalysis-hub.org/
Materials Project: https://materialsproject.org/
ASE Documentation: https://wiki.fysik.dtu.dk/ase/
matminer: https://hackingmaterials.lbl.gov/matminer/

Last Updated: October 19, 2025 Version: 1.0