
Introduction to Composition-Based Features Series v1.0

Accelerate Materials Discovery with Magpie and Machine Learning

📖 Total Study Time: 150-180 minutes
📊 Level: Beginner to Intermediate
👥 Target Audience: Readers with Python basics and materials science fundamentals

🎯 What You Will Learn in This Series

Composition-based features are a classical yet powerful approach to predicting material properties from chemical composition (the types and ratios of elements). Centered on Magpie descriptors, this series systematically covers everything from utilizing elemental property databases to Python implementation with matminer.

Series Overview

In materials discovery, chemical composition is the most fundamental and important information. However, a composition formula like "Fe2O3" alone cannot be input into machine learning models. This is where composition-based features play a crucial role by combining periodic table information (ionization energy, electronegativity, atomic radius, etc.) to convert composition into numerical vectors.
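As a toy illustration of this conversion, a single numerical feature can be computed by hand as the composition-weighted mean of an elemental property. The sketch below uses approximate Pauling electronegativities and a hypothetical helper function (not part of any library):

```python
# Sketch: turn "Fe2O3" into one numeric feature, assuming a small
# hand-coded table of Pauling electronegativities (values approximate).
electronegativity = {"Fe": 1.83, "O": 3.44}

def weighted_mean_feature(composition, prop):
    """Composition-weighted mean of an elemental property.

    `composition` maps element -> subscript in the formula,
    e.g. {"Fe": 2, "O": 3} for Fe2O3.
    """
    total = sum(composition.values())
    return sum(prop[el] * n / total for el, n in composition.items())

fe2o3 = {"Fe": 2, "O": 3}
x = weighted_mean_feature(fe2o3, electronegativity)
# (2*1.83 + 3*3.44) / 5 = 2.796
```

Magpie extends this idea to many elemental properties and several statistics per property, producing a fixed-length vector for any formula.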

This series comprehensively covers the following, centered on the widely-used Magpie descriptors:

Why Composition-Based Features Are Important

💡 Composition-Based vs Structure-Based

Material features fall into two main approaches: composition-based (using only the chemical formula) and structure-based (using the crystal structure, e.g., GNNs).

Strengths of Composition-Based: effective for exploring new materials with unknown structures, for high-speed screening, and for cases with limited data

Typical Applications of Composition-Based Features

  1. High-Speed Materials Screening: Formation energy prediction for 1 million compounds (10-100× faster than GNNs)
  2. Experimental Data-Driven Exploration: Property prediction from limited experimental data (combined with transfer learning)
  3. Hybrid Models: Improved accuracy by combining composition features + GNN features

How to Study

Recommended Learning Flow

```mermaid
timeline
    title Introduction to Composition-Based Features Learning Flow
    section Chapter 1 : Fundamentals
        What are Composition-Based Features : Definitions and history
        Limitations of Conventional Descriptors : Density and symmetry are insufficient
        Background of Magpie : Utilizing elemental properties
    section Chapter 2 : Magpie Details
        Types of Statistical Descriptors : Mean, variance, max, min
        145 Elemental Properties : Periodic table database
        Mathematical Implementation : Weighted statistics
    section Chapter 3 : Databases
        Elemental Property Databases : Magpie/Deml/Jarvis
        Choosing Featurizers : matminer API
        Custom Featurizer Creation : Adding original descriptors
    section Chapter 4 : Machine Learning Integration
        Model Selection : Random Forest/XGBoost/NN
        Hyperparameter Optimization : Optuna/GridSearch
        Feature Importance Analysis : SHAP/LIME
    section Chapter 5 : Python Practice
        matminer Workflow : Data prep → Feature generation → Model training
        Materials Project Data : Property prediction with real data
        Performance Evaluation and Benchmarking : Comparison with GNNs
```

For Beginners (Learning composition features for the first time):
- Chapter 1 → Chapter 2 → Chapter 3 → Chapter 4 → Chapter 5 (all chapters recommended)
- Time required: 150-180 minutes

For Intermediate Learners (Machine learning experience, want to use matminer):
- Chapter 2 → Chapter 3 → Chapter 5
- Time required: 90-120 minutes

For GNN Learners (Want to compare composition vs structure):
- Chapter 1 → Chapter 2 → Chapter 5
- Time required: 120-150 minutes

Chapter Details

Chapter 1: Fundamentals of Composition-Based Features

Difficulty: Introductory
Reading Time: 25-30 minutes
Code Examples: 5

Learning Content

  1. What Are Composition-Based Features - Converting chemical composition to numerical vectors
  2. Historical Background - Before and after Ward (2016) Magpie paper
  3. Limitations of Conventional Descriptors - Density, symmetry, and lattice parameters are insufficient
  4. Utilizing Elemental Properties - The power of periodic table databases
  5. Success Stories - Applications in OQMD and Materials Project

Learning Objectives

Read Chapter 1 →


Chapter 2: Magpie and Statistical Descriptors

Difficulty: Beginner to Intermediate
Reading Time: 30-35 minutes
Code Examples: 8

Learning Content

  1. Mathematical Definition of Magpie Descriptors - 145-dimensional statistics
  2. Types of Statistical Descriptors - Mean, variance, maximum, minimum, range, mode
  3. Weighted vs Unweighted - Effect of composition ratio weighting
  4. 22 Types of Elemental Properties - Ionization energy, electronegativity, atomic radius, etc.
  5. Implementation Example - Manual calculation using NumPy
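The weighted statistics in item 5 can be sketched directly in NumPy. The composition (Fe2O3) and the covalent-radius values below are illustrative, and the six statistics mirror the Magpie convention (mean, average deviation, min, max, range, mode):

```python
import numpy as np

# One elemental property (covalent radius, pm; values approximate)
# for Fe2O3: atomic fractions x_i and property values p_i.
x = np.array([2.0, 3.0]) / 5.0      # atomic fractions of Fe, O
p = np.array([132.0, 66.0])         # property values for Fe, O

mean = np.dot(x, p)                 # composition-weighted mean
avg_dev = np.dot(x, np.abs(p - mean))  # weighted mean absolute deviation
stats = {
    "mean": mean,
    "avg_dev": avg_dev,
    "min": p.min(),
    "max": p.max(),
    "range": p.max() - p.min(),
    "mode": p[np.argmax(x)],        # property of the most abundant element
}
```

Repeating this for each elemental property and concatenating the results yields the full Magpie feature vector.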

Learning Objectives

Read Chapter 2 →


Chapter 3: Elemental Property Databases and Featurizers

Difficulty: Intermediate
Reading Time: 30-35 minutes
Code Examples: 10

Learning Content

  1. Types of Elemental Property Databases - Magpie, Deml, Jarvis, Matscholar
  2. matminer Featurizer API - ElementProperty, Stoichiometry, OxidationStates
  3. Choosing Featurizers - Selection criteria based on application
  4. Custom Featurizer Creation - How to add original elemental properties
  5. Feature Preprocessing - Standardization, missing value handling
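The preprocessing in item 5 amounts to two small operations that can be sketched without any ML library: mean imputation of missing values, then standardization to zero mean and unit variance (the feature matrix below is made up):

```python
import numpy as np

# Hypothetical 3-sample, 2-feature matrix with one missing value
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, 6.0]])

# Impute missing entries with the per-column mean (ignoring NaNs)
col_mean = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_mean, X)

# Standardize each column to zero mean / unit variance
mu = X_imputed.mean(axis=0)
sigma = X_imputed.std(axis=0)
X_std = (X_imputed - mu) / sigma
```

In practice the same steps are usually delegated to scikit-learn's SimpleImputer and StandardScaler inside a Pipeline, as covered in the chapter.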

Learning Objectives

Read Chapter 3 →


Chapter 4: Integration with Machine Learning Models

Difficulty: Intermediate to Advanced
Reading Time: 30-35 minutes
Code Examples: 12

Learning Content

  1. Model Selection Criteria - Random Forest, XGBoost, LightGBM, Neural Networks
  2. Hyperparameter Optimization - Optuna, GridSearchCV, BayesSearchCV
  3. Feature Importance Analysis - SHAP, LIME, Permutation Importance
  4. Ensemble Methods - Bagging, Boosting, Stacking
  5. Performance Evaluation Metrics - MAE, RMSE, R², Cross-validation
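The metrics in item 5 are simple enough to implement directly, which makes their definitions concrete (the target values below are made up):

```python
import numpy as np

# Hypothetical true vs predicted formation energies (eV/atom)
y_true = np.array([0.10, 0.25, 0.40, 0.55])
y_pred = np.array([0.12, 0.20, 0.43, 0.50])

mae = np.mean(np.abs(y_true - y_pred))          # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean squared error

# R²: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

Cross-validation simply repeats this evaluation over several train/test splits and averages the scores.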

Learning Objectives

Read Chapter 4 →


Chapter 5: Python Practice - matminer Workflow

Difficulty: Intermediate to Advanced
Reading Time: 35-45 minutes
Code Examples: 15 (all executable)

Learning Content

  1. Environment Setup - Anaconda, pip, Google Colab
  2. Data Preparation - Materials Project API, OQMD dataset
  3. Feature Generation Pipeline - Composition formula → Magpie descriptors → Standardization
  4. Model Training and Evaluation - Formation energy prediction, band gap prediction
  5. Performance Comparison with GNNs - Accuracy, speed, interpretability
  6. Hybrid Approach - Composition features + Structure features
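The hybrid approach in item 6 reduces, at its simplest, to concatenating the two feature matrices along the feature axis before model training. A minimal sketch with made-up matrices:

```python
import numpy as np

# Hypothetical feature matrices: 4 materials with 3 composition
# features and 2 structure features each (values are random).
rng = np.random.default_rng(0)
X_comp = rng.normal(size=(4, 3))
X_struct = rng.normal(size=(4, 2))

# Concatenate along the feature axis; the combined matrix can be fed
# to any tabular model (Random Forest, XGBoost, ...).
X_hybrid = np.hstack([X_comp, X_struct])
print(X_hybrid.shape)  # (4, 5)
```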

Learning Objectives

Read Chapter 5 →


Overall Learning Outcomes

Upon completing this series, you will acquire the following skills and knowledge:

Knowledge Level (Understanding)

Practical Skills (Doing)

Application Ability (Applying)


Frequently Asked Questions (FAQ)

Q1: Should I use composition-based features or GNNs (structure-based)?

A: It depends on the task and data characteristics:

Q2: Aren't 145 dimensions of Magpie descriptors too many? Concerns about overfitting?

A: This is rarely a problem in practice:

Q3: Are there libraries other than matminer?

A: Yes, these are available:

This series focuses on the most widely-used matminer.

Q4: Can Magpie descriptors be calculated automatically from chemical formulas (e.g., Fe2O3)?

A: Yes, it's easy with matminer:

```python
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.composition import ElementProperty

# Convert formula strings (e.g., "Fe2O3") into pymatgen Composition objects
df = StrToComposition().featurize_dataframe(df, "formula")

# Append the Magpie descriptor columns computed from df["composition"]
featurizer = ElementProperty.from_preset("magpie")
df = featurizer.featurize_dataframe(df, col_id="composition")
```

Chapter 5 provides detailed code examples.

Q5: What is the prediction accuracy difference between GNNs (CGCNN, MPNN) and composition-based features?

A: Depends on the dataset and task, but representative benchmark results:

| Task | Magpie + RF | CGCNN | Hybrid |
|---|---|---|---|
| Formation Energy (OQMD) | MAE 0.12 eV | MAE 0.039 eV | MAE 0.035 eV |
| Band Gap (Materials Project) | MAE 0.45 eV | MAE 0.39 eV | MAE 0.36 eV |
| Inference Speed (1M compounds) | 10 min | 100 min | 110 min |

Detailed comparison experiments are conducted in Chapter 5.

Q6: What are the differences between elemental property databases (Magpie/Deml/Jarvis)?

A: Characteristics of each database:

Chapter 3 implements usage of each database.

Q7: Can composition-based features be used for transfer learning?

A: Yes, the following approaches are effective:

Chapter 4 implements transfer learning using XGBoost and neural networks.

Q8: Can custom elemental properties (e.g., rarity, cost) be added?

A: Yes, you can create custom Featurizers by inheriting matminer's BaseFeaturizer:

```python
from matminer.featurizers.base import BaseFeaturizer

class CustomElementProperty(BaseFeaturizer):
    def featurize(self, comp):
        # Look up your own elemental-property database
        # (get_element_rarity / get_element_cost are user-defined helpers)
        rarity = get_element_rarity(comp)
        cost = get_element_cost(comp)
        return [rarity, cost]

    def feature_labels(self):
        # Column names for the generated features (required by BaseFeaturizer)
        return ["rarity", "cost"]
```

Chapter 3 provides detailed custom Featurizer implementation examples.

Q9: How is interpretability (Explainability) with composition-based features?

A: There are aspects that are easier to interpret than GNNs:

Chapter 4 implements interpretability analysis using SHAP/LIME.

Q10: After completing this series, what learning resources should I pursue next?

A: The following learning paths are recommended:


Prerequisites

To effectively learn this series, the following prerequisites are recommended:

Required (Must Have)

Recommended (Nice to Have)

Not Required


🔗 Integrated Learning with GNN Series

By learning this series together with the GNN Introduction Series, you can grasp the complete picture of material features:

Recommended Learning Order

  1. Introduction to Composition-Based Features (this series) → Build fundamentals
  2. GNN Introduction Series → Learn structure-based methods
  3. Composition vs GNN Comparison Series (coming soon) → Master when to use which
  4. Materials Screening Workflow (coming soon) → Practical application

Let's Get Started!

Are you ready? Start with Chapter 1 and begin your journey into the world of composition-based features!

Chapter 1: Fundamentals of Composition-Based Features →


Update History


Your materials discovery journey starts here!

Disclaimer