🎯 What You Will Learn in This Series
Composition-based features are classical yet powerful methods for predicting material properties from chemical composition (types and ratios of elements). Centered on Magpie descriptors, this series systematically covers everything from utilizing elemental property databases to Python implementation with matminer.
Series Overview
In materials discovery, chemical composition is the most fundamental and important information. However, a composition formula like "Fe2O3" alone cannot be input into machine learning models. This is where composition-based features play a crucial role by combining periodic table information (ionization energy, electronegativity, atomic radius, etc.) to convert composition into numerical vectors.
This series comprehensively covers the following, centered on the widely-used Magpie descriptors:
- ✅ Theoretical Foundation: Mathematical definitions and materials science significance of composition-based features
- ✅ Practical Skills: Feature generation workflows using the matminer library
- ✅ Comparative Analysis: When to use composition-based vs structure-based features (CGCNN/MPNN and other GNNs)
- ✅ Latest Trends: Limitations of Magpie and evolution to GNN methods
Why Composition-Based Features Are Important
💡 Composition-Based vs Structure-Based
Material features have two main approaches:
- Composition-Based (this series): Generate features from chemical composition only (no structure information required)
- Structure-Based (GNN Introduction Series): Learn from 3D structures including atomic coordinates and bonding information
Strengths of Composition-Based: Effective for exploring new materials with unknown structures, high-speed screening, and cases with limited data
Typical Applications of Composition-Based Features
- High-Speed Materials Screening: Formation energy prediction for 1 million compounds (10-100× faster than GNNs)
- Experimental Data-Driven Exploration: Property prediction from limited experimental data (combined with transfer learning)
- Hybrid Models: Improved accuracy by combining composition features + GNN features
How to Study
Recommended Learning Flow
For Beginners (Learning composition features for the first time):
- Chapter 1 → Chapter 2 → Chapter 3 → Chapter 4 → Chapter 5 (all chapters recommended)
- Time required: 150-180 minutes
For Intermediate Learners (Machine learning experience, want to use matminer):
- Chapter 2 → Chapter 3 → Chapter 5
- Time required: 90-120 minutes
For GNN Learners (Want to compare composition vs structure):
- Chapter 1 → Chapter 2 → Chapter 5
- Time required: 120-150 minutes
Chapter Details
Chapter 1: Fundamentals of Composition-Based Features
Difficulty: Introductory
Reading Time: 25-30 minutes
Code Examples: 5
Learning Content
- What Are Composition-Based Features - Converting chemical composition to numerical vectors
- Historical Background - Before and after the Ward et al. (2016) Magpie paper
- Limitations of Conventional Descriptors - Why density, symmetry, and lattice parameters alone are insufficient
- Utilizing Elemental Properties - The power of periodic table databases
- Success Stories - Applications in OQMD and Materials Project
Learning Objectives
- ✅ Explain the definition and role of composition-based features
- ✅ Demonstrate differences from conventional material descriptors with examples
- ✅ Understand why Magpie is widely used
Chapter 2: Magpie and Statistical Descriptors
Difficulty: Beginner to Intermediate
Reading Time: 30-35 minutes
Code Examples: 8
Learning Content
- Mathematical Definition of Magpie Descriptors - 145-dimensional statistics
- Types of Statistical Descriptors - Mean, variance, maximum, minimum, range, mode
- Weighted vs Unweighted - Effect of composition ratio weighting
- 22 Types of Elemental Properties - Ionization energy, electronegativity, atomic radius, etc.
- Implementation Example - Manual calculation using NumPy
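As a preview of the manual NumPy calculation, here is a minimal sketch, assuming Pauling electronegativity as the single elemental property: it computes the weighted statistics listed above for Fe2O3. Full Magpie features repeat this over all 22 properties.

```python
import numpy as np

# Pauling electronegativities of the constituent elements (example property)
values = np.array([1.83, 3.44])      # [Fe, O]
fractions = np.array([2, 3]) / 5.0   # atomic fractions in Fe2O3

mean = np.sum(fractions * values)                    # weighted mean
variance = np.sum(fractions * (values - mean) ** 2)  # weighted variance
stats = {
    "mean": mean,              # 2.796
    "variance": variance,      # ~0.622
    "min": values.min(),
    "max": values.max(),
    "range": values.max() - values.min(),
    "mode": values[np.argmax(fractions)],  # property of the most abundant element
}
print(stats)
```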
Learning Objectives
- ✅ Understand Magpie descriptor calculation methods with formulas
- ✅ List the 22 types of elemental properties
- ✅ Explain the significance of weighted statistics
Chapter 3: Elemental Property Databases and Featurizers
Difficulty: Intermediate
Reading Time: 30-35 minutes
Code Examples: 10
Learning Content
- Types of Elemental Property Databases - Magpie, Deml, Jarvis, Matscholar
- matminer Featurizer API - ElementProperty, Stoichiometry, OxidationStates (see the sketch after this list)
- Choosing Featurizers - Selection criteria based on application
- Custom Featurizer Creation - How to add original elemental properties
- Feature Preprocessing - Standardization, missing value handling
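As a hedged preview of the Featurizer API, here is a minimal sketch that chains two composition Featurizers with matminer's MultipleFeaturizer; the specific ElementProperty + Stoichiometry pairing is illustrative, not prescriptive.

```python
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers.composition import ElementProperty, Stoichiometry
from pymatgen.core import Composition

# Chain two composition featurizers into a single feature vector
featurizer = MultipleFeaturizer([
    ElementProperty.from_preset("magpie"),
    Stoichiometry(),
])

vec = featurizer.featurize(Composition("Fe2O3"))
print(len(vec), "features")  # Magpie statistics plus stoichiometric p-norms
```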
Learning Objectives
- ✅ Choose among 3+ elemental property databases appropriately
- ✅ Select matminer Featurizers based on application
- ✅ Implement custom Featurizers
Chapter 4: Integration with Machine Learning Models
Difficulty: Intermediate to Advanced
Reading Time: 30-35 minutes
Code Examples: 12
Learning Content
- Model Selection Criteria - Random Forest, XGBoost, LightGBM, Neural Networks
- Hyperparameter Optimization - Optuna, GridSearchCV, BayesSearchCV (Optuna sketched after this list)
- Feature Importance Analysis - SHAP, LIME, Permutation Importance
- Ensemble Methods - Bagging, Boosting, Stacking
- Performance Evaluation Metrics - MAE, RMSE, R², Cross-validation
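As a preview of the Optuna workflow, here is a minimal sketch, assuming placeholder arrays in place of real Magpie features and labels; it tunes two Random Forest hyperparameters against cross-validated MAE.

```python
import numpy as np
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for a Magpie feature matrix and a target property
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 145))
y = rng.normal(size=300)

def objective(trial):
    model = RandomForestRegressor(
        n_estimators=trial.suggest_int("n_estimators", 100, 500),
        max_depth=trial.suggest_int("max_depth", 3, 20),
        random_state=0,
    )
    # Minimize cross-validated MAE
    return -cross_val_score(model, X, y, cv=3,
                            scoring="neg_mean_absolute_error").mean()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```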
Learning Objectives
- ✅ Select appropriate machine learning models based on tasks
- ✅ Execute hyperparameter optimization with Optuna
- ✅ Interpret feature importance using SHAP values
Chapter 5: Python Practice - matminer Workflow
Difficulty: Intermediate to Advanced
Reading Time: 35-45 minutes
Code Examples: 15 (all executable)
Learning Content
- Environment Setup - Anaconda, pip, Google Colab
- Data Preparation - Materials Project API, OQMD dataset
- Feature Generation Pipeline - Composition formula → Magpie descriptors → Standardization (sketched after this list)
- Model Training and Evaluation - Formation energy prediction, band gap prediction
- Performance Comparison with GNNs - Accuracy, speed, interpretability
- Hybrid Approach - Composition features + Structure features
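A minimal end-to-end sketch of the feature generation pipeline above; toy formulas stand in for data pulled from the Materials Project API or OQMD.

```python
import pandas as pd
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.composition import ElementProperty
from sklearn.preprocessing import StandardScaler

# Toy input; in Chapter 5 this comes from the Materials Project API or OQMD
df = pd.DataFrame({"formula": ["Fe2O3", "NaCl", "SiO2"]})

# Formula string -> pymatgen Composition -> Magpie descriptor columns
df = StrToComposition().featurize_dataframe(df, "formula")
ep = ElementProperty.from_preset("magpie")
df = ep.featurize_dataframe(df, col_id="composition")

# Standardize only the generated feature columns
X = StandardScaler().fit_transform(df[ep.feature_labels()])
print(X.shape)
```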
Learning Objectives
- ✅ Build end-to-end prediction workflows with matminer
- ✅ Perform property prediction with Materials Project data (R² > 0.85)
- ✅ Quantitatively compare composition-based and GNN performance
- ✅ Achieve improved accuracy with hybrid models
Overall Learning Outcomes
Upon completing this series, you will acquire the following skills and knowledge:
Knowledge Level (Understanding)
- ✅ Explain the theoretical foundation and history of composition-based features
- ✅ Understand the mathematical definition of Magpie descriptors
- ✅ Know the types and characteristics of elemental property databases
- ✅ Understand criteria for choosing between composition-based and GNN approaches
Practical Skills (Doing)
- ✅ Generate feature vectors from composition using matminer
- ✅ Extend features by combining multiple Featurizers
- ✅ Perform property prediction with machine learning models (RF/XGBoost/NN)
- ✅ Interpret prediction rationale using SHAP values
- ✅ Quantitatively compare performance with GNNs
Application Ability (Applying)
- ✅ Design appropriate features for new materials discovery tasks
- ✅ Build hybrid models (composition + structure)
- ✅ Design prediction workflows combining experimental data
- ✅ Apply feature engineering to industrial applications (battery materials, catalysts)
Frequently Asked Questions (FAQ)
Q1: Should I use composition-based features or GNNs (structure-based)?
A: It depends on the task and data characteristics:
- Composition-based advantages: (1) Materials discovery with unknown structures, (2) High-speed screening (1 million compound scale), (3) Limited data cases (<1000 samples)
- GNN advantages: (1) Properties with strong structure dependence (elastic modulus, thermal conductivity), (2) Accuracy priority, (3) Sufficient data available (>10000 samples)
- Hybrid is strongest: Using both composition and GNN features improves accuracy (implemented in Chapter 5)
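Mechanically, the hybrid option usually amounts to concatenating the two feature blocks before training a single downstream model; a minimal sketch with placeholder arrays (the 145/64 widths are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_comp = rng.normal(size=(200, 145))  # placeholder Magpie features
X_gnn = rng.normal(size=(200, 64))    # placeholder GNN embeddings
y = rng.normal(size=200)              # placeholder target property

# Hybrid = column-wise concatenation of the two feature blocks
X_hybrid = np.hstack([X_comp, X_gnn])
model = RandomForestRegressor(random_state=0).fit(X_hybrid, y)
```

Any regressor from Chapter 4 can sit on top of the concatenated matrix.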
Q2: Aren't 145 dimensions of Magpie descriptors too many? Concerns about overfitting?
A: This is rarely a problem in practice:
- 145 dimensions is modest by modern machine-learning standards (GNN embeddings often run to hundreds or thousands of dimensions)
- Elemental properties have physical meaning, unlike random high dimensions
- Dimensionality reduction possible with regularization (L1/L2) or feature selection
- Good performance reported experimentally even with 100-1000 samples
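To illustrate the regularization point, here is a minimal L1 feature-selection sketch on placeholder data; the 145-column width mirrors the Magpie dimension discussed above.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

# Placeholder stand-ins for a (samples x 145) Magpie matrix and a target property
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 145))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=300)

# L1 regularization drives uninformative coefficients to zero
lasso = LassoCV(cv=5).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)
print("features kept:", selector.transform(X).shape[1])
```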
Q3: Are there libraries other than matminer?
A: Yes, these are available:
- DScribe: SOAP, MBTR, ACSF descriptors (supports molecules and crystals)
- CFID (JARVIS): Classical Force-field Inspired Descriptors
- Pymatgen: matminer's foundation library (low-level API)
- XenonPy: Integrates deep learning and feature generation
This series focuses on the most widely-used matminer.
Q4: Can Magpie descriptors be calculated automatically from chemical formulas (e.g., Fe2O3)?
A: Yes, it's easy with matminer:
```python
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.composition import ElementProperty

# Convert formula strings ("Fe2O3", ...) into pymatgen Composition objects first
df = StrToComposition().featurize_dataframe(df, "formula")

# Then generate the Magpie descriptor columns from df["composition"]
featurizer = ElementProperty.from_preset("magpie")
features = featurizer.featurize_dataframe(df, col_id="composition")
```
Chapter 5 provides detailed code examples.
Q5: What is the prediction accuracy difference between GNNs (CGCNN, MPNN) and composition-based features?
A: Depends on the dataset and task, but representative benchmark results:
| Task | Magpie + RF | CGCNN | Hybrid |
|---|---|---|---|
| Formation Energy (OQMD) | MAE 0.12 eV | MAE 0.039 eV | MAE 0.035 eV |
| Band Gap (Materials Project) | MAE 0.45 eV | MAE 0.39 eV | MAE 0.36 eV |
| Inference Speed (1M compounds) | 10 min | 100 min | 110 min |
Detailed comparison experiments are conducted in Chapter 5.
Q6: What are the differences between elemental property databases (Magpie/Deml/Jarvis)?
A: Characteristics of each database:
- Magpie (Ward+ 2016): 22 elemental properties, most widely used, proven track record in Materials Project
- Deml (Deml+ 2016): Considers oxidation states, particularly strong for oxides
- Jarvis (Choudhary+ 2020): Includes DFT calculation values, latest elemental properties
- Matscholar (Tshitoyan+ 2019): Element embeddings extracted from 2 million papers using NLP
Chapter 3 implements usage of each database.
Q7: Can composition-based features be used for transfer learning?
A: Yes, the following approaches are effective:
- Pre-training: Train on Materials Project (60k compounds) → Fine-tune on experimental data (100 samples)
- Domain Adaptation: Train on inorganic materials → Apply to organic-inorganic hybrids
- Meta-learning: Learn common feature representations across multiple tasks (formation energy, band gap, elastic modulus)
Chapter 4 implements transfer learning using XGBoost and neural networks.
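A minimal sketch of the pre-train-then-fine-tune idea with XGBoost, assuming placeholder arrays; the xgb_model argument continues boosting from an existing booster rather than starting fresh.

```python
import numpy as np
import xgboost as xgb

# Placeholders: large computed dataset vs. small experimental dataset
rng = np.random.default_rng(0)
X_big, y_big = rng.normal(size=(5000, 145)), rng.normal(size=5000)
X_exp, y_exp = rng.normal(size=(100, 145)), rng.normal(size=100)

# Pre-train on the large dataset
pre = xgb.XGBRegressor(n_estimators=300).fit(X_big, y_big)

# Fine-tune: continue boosting from the pre-trained booster on experimental data
fine = xgb.XGBRegressor(n_estimators=50)
fine.fit(X_exp, y_exp, xgb_model=pre.get_booster())
```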
Q8: Can custom elemental properties (e.g., rarity, cost) be added?
A: Yes, you can create custom Featurizers by inheriting matminer's BaseFeaturizer:
```python
from matminer.featurizers.base import BaseFeaturizer

class CustomElementProperty(BaseFeaturizer):
    def featurize(self, comp):
        # Look up a custom elemental property table
        # (get_element_rarity / get_element_cost are user-defined helpers)
        rarity = get_element_rarity(comp)
        cost = get_element_cost(comp)
        return [rarity, cost]

    def feature_labels(self):
        # BaseFeaturizer also expects citations() and implementors()
        return ["rarity", "cost"]
```
Chapter 3 provides detailed custom Featurizer implementation examples.
Q9: How is interpretability (Explainability) with composition-based features?
A: There are aspects that are easier to interpret than GNNs:
- SHAP Values: Clear which elemental properties (e.g., average electronegativity) contributed to predictions
- Physical Meaning: Chemical interpretations possible, such as "higher ionization energy leads to lower formation energy"
- Integration with Domain Knowledge: Direct utilization of periodic table knowledge
Chapter 4 implements interpretability analysis using SHAP/LIME.
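A minimal SHAP sketch for a tree model on placeholder data; with real Magpie features, each attribution maps to a named elemental-property statistic such as mean electronegativity.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Placeholder data; in practice X holds named Magpie columns
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(random_state=0).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)

# Mean |SHAP| per feature gives a global importance ranking
print(np.abs(shap_values).mean(axis=0))
```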
Q10: After completing this series, what learning resources should I pursue next?
A: The following learning paths are recommended:
- Comparison with GNNs: Composition vs GNN Comparison Series (coming soon) → Quantitatively compare both methods
- Extension to Deep Learning: GNN Introduction Series → Learn CGCNN, MPNN
- Practical Application: Materials Screening Workflow (coming soon) → Apply to actual projects
- Paper Implementation: Reproduce Ward et al. (2016), "A general-purpose machine learning framework for predicting properties of inorganic materials"
Prerequisites
To effectively learn this series, the following prerequisites are recommended:
Required (Must Have)
- ✅ Python Basics: Basic operations with NumPy, Pandas, Matplotlib
- ✅ Materials Science Basics: Concepts of chemical composition, periodic table, elemental properties
- ✅ Machine Learning Basics: Concepts of supervised learning, regression/classification, cross-validation
Recommended (Nice to Have)
- 📚 scikit-learn: Experience with Random Forest, model evaluation
- 📚 Statistics: Understanding of mean, variance, standard deviation
- 📚 Crystallography: Basics of lattices, symmetry (reviewed in Chapter 1)
Not Required
- ❌ Deep Learning: GNN knowledge not required (composition-based focuses on classical machine learning)
- ❌ Quantum Chemistry: DFT calculation experience not required
Related Series
🔗 Integrated Learning with GNN Series
By learning this series together with the GNN Introduction Series, you can grasp the complete picture of material features:
- Introduction to Composition-Based Features (this series): Property prediction from chemical composition
- GNN Introduction Series: Property prediction from 3D structures (CGCNN, MPNN, SchNet)
- Composition vs GNN Comparison Series (coming soon): Quantitative benchmarking of both methods
Recommended Learning Order
- Introduction to Composition-Based Features (this series) → Build fundamentals
- GNN Introduction Series → Learn structure-based methods
- Composition vs GNN Comparison Series (coming soon) → Master when to use which
- Materials Screening Workflow (coming soon) → Practical application
Let's Get Started!
Are you ready? Start with Chapter 1 and begin your journey into the world of composition-based features!
Chapter 1: Fundamentals of Composition-Based Features →
Update History
- 2025-11-02: v1.0 Initial release
Your materials discovery journey starts here!