About This Series
This series provides a systematic educational resource for learning practical data analysis methods and workflow design in Data-Driven Materials Science. You'll acquire essential data analysis skills for materials research, from data collection strategies and feature engineering through model selection to explainable AI.
Target Audience
- Graduate students and researchers (materials science, chemistry, physics, engineering)
- Materials researchers looking to enhance practical data analysis skills
- Engineers seeking to apply machine learning to materials development
- Those pursuing careers in Materials Informatics
Learning Objectives
Upon completing this series, you will have acquired the following skills:
- ✅ Understanding data collection strategies and design of experiments for materials data
- ✅ Mastery of appropriate missing value and outlier handling techniques
- ✅ Design of materials descriptors and feature engineering
- ✅ Practical application of model selection and hyperparameter optimization
- ✅ Interpretation of predictions and physical meaning extraction using SHAP/LIME
- ✅ Workflow construction capabilities using real datasets
Prerequisites
- Required: Python basics, fundamental machine learning concepts, university-level mathematics (statistics, linear algebra)
- Recommended: Experience with scikit-learn, fundamental materials science knowledge
Series Structure
📘 Chapter 1: Data Collection Strategy and Cleaning
Understand the characteristics of materials data (small-scale, imbalanced, noisy) and learn effective data collection strategies and preprocessing techniques. Practice Design of Experiments (DOE), Latin Hypercube Sampling, missing value imputation (MICE), and outlier detection (Isolation Forest). A minimal preprocessing sketch follows the chapter outline below.
Learning Content:
- Characteristics and challenges of materials data
- Data collection strategies (DOE, Active Learning)
- Missing value handling (Simple/KNN/MICE Imputation)
- Outlier detection (Z-score, IQR, Isolation Forest, LOF)
- Case Study: Thermoelectric materials dataset
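As a taste of Chapter 1, the sketch below combines MICE-style imputation (scikit-learn's IterativeImputer) with Isolation Forest outlier detection on synthetic data; the column names and contamination rate are illustrative assumptions, not values from the chapter.

```python
# Minimal preprocessing sketch (synthetic data; column names are illustrative only).
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["band_gap", "density", "melting_point"])
df.iloc[rng.choice(100, size=10, replace=False), 0] = np.nan  # inject missing values

# MICE-style imputation: each feature is iteratively modeled from the others
imputed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                       columns=df.columns)

# Unsupervised outlier detection: fit_predict returns -1 for flagged samples
labels = IsolationForest(contamination=0.05, random_state=0).fit_predict(imputed)
clean = imputed[labels == 1]
print(f"kept {len(clean)} / {len(imputed)} samples after outlier removal")
```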
📗 Chapter 2: Feature Engineering
Learn materials descriptor selection and design, feature transformations, dimensionality reduction, and feature selection. Master materials science-specific feature engineering, from matminer-based composition/structure descriptor generation to SHAP-based selection. A short descriptor-generation sketch follows the chapter outline below.
Learning Content:
- Materials descriptors (composition, structure, electronic structure)
- Feature transformations (normalization, logarithmic transformation, polynomial features)
- Dimensionality reduction (PCA, t-SNE, UMAP)
- Feature selection (Filter, Wrapper, Embedded, SHAP-based)
- Case Study: Band gap prediction (200 dimensions → 20 dimensions)
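As a taste of Chapter 2, the sketch below generates composition descriptors with matminer's Magpie ElementProperty preset and projects them with PCA; the example formulas and the choice of three components are illustrative assumptions, not the band gap case study itself.

```python
# Minimal descriptor sketch (assumes matminer and pymatgen from the setup below are installed).
import numpy as np
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementProperty
from sklearn.decomposition import PCA

# Magpie preset: statistics of elemental properties for each composition
featurizer = ElementProperty.from_preset("magpie")
formulas = ["SiO2", "GaAs", "LiFePO4", "ZnO", "TiO2"]
X = np.array([featurizer.featurize(Composition(f)) for f in formulas])
print(X.shape)  # (number of compositions, number of Magpie descriptors)

# Dimensionality reduction: project the descriptor matrix onto a few principal components
X_reduced = PCA(n_components=3).fit_transform(X)
print(X_reduced.shape)
```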
💻 Chapter 3: Model Selection and Hyperparameter Optimization
Practice model selection based on data size, cross-validation, hyperparameter optimization (Optuna), and ensemble learning. Master the appropriate use of linear models, tree-based models, neural networks, and GNNs, along with automated optimization using Bayesian Optimization. A short Optuna sketch follows the chapter outline below.
Learning Content:
- Model selection strategies (interpretability vs. accuracy)
- Cross-validation (K-Fold, Stratified, Time Series Split)
- Hyperparameter optimization (Grid/Random/Bayesian)
- Ensemble learning (Bagging, Boosting, Stacking)
- Case Study: Li-ion battery capacity prediction
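As a taste of Chapter 3, the sketch below uses Optuna's default TPE sampler (a Bayesian-style sequential optimizer) to tune a random forest under 5-fold cross-validation; the synthetic dataset and search ranges are illustrative assumptions, not the Li-ion battery case study.

```python
# Minimal Optuna sketch: tune a random forest with 5-fold CV on synthetic data (illustrative only).
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

def objective(trial):
    # Search space: number of trees and tree depth (ranges chosen for illustration)
    model = RandomForestRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 300),
        max_depth=trial.suggest_int("max_depth", 2, 16),
        random_state=0,
    )
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=cv, scoring="r2").mean()

# Optuna's default TPE sampler performs Bayesian-style sequential optimization
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```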
🔍 Chapter 4: Explainable AI (XAI)
Learn prediction interpretation methods using SHAP, LIME, and Attention visualization. Understand why physical interpretation matters in materials science, review real-world applications from Toyota, IBM, Citrine, and others, and get an overview of XAI-related career paths. A short SHAP sketch follows the chapter outline below.
Learning Content:
- Importance of interpretability (black box problem)
- SHAP (Shapley values, Tree SHAP, Global/Local interpretation)
- LIME (local linear approximation, Tabular LIME)
- Attention visualization (for NN/GNN)
- Real-world applications and career paths (including salary information)
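As a taste of Chapter 4, the sketch below computes Tree SHAP values for a random forest and summarizes global feature importance as mean absolute SHAP values; the synthetic data stands in for a real materials dataset.

```python
# Minimal SHAP sketch: Tree SHAP on a random forest trained on synthetic data (illustrative only).
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Tree SHAP computes exact Shapley values for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Global interpretation: mean absolute SHAP value per feature
print(np.abs(shap_values).mean(axis=0))
# Local interpretation: shap_values[0] decomposes the prediction for sample 0
```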
How to Study
Recommended Learning Path
Chapter 1 → Chapter 2 → Chapter 3 → Chapter 4 (the chapters build on each other, so studying them in order is recommended)
Study Methods
- Chapters 1-2 (Fundamentals): Master techniques to improve data quality
  - Design of experiments and data collection strategies
  - Feature design and dimensionality reduction
- Chapter 3 (Optimization): Maximize model performance
  - Cross-validation and hyperparameter optimization
  - Accuracy improvement through ensemble learning
- Chapter 4 (Interpretation): Understand the physical meaning of predictions
  - Interpretation using SHAP/LIME
  - Real-world application examples and career information
Environment Setup
The following environment is required for practice:
Recommended Environment:
- Python 3.8 or higher
- Jupyter Notebook or Google Colab
- Main libraries:

```bash
pip install pandas numpy matplotlib seaborn scikit-learn
pip install lightgbm xgboost optuna shap lime
pip install matminer pymatgen scipy scikit-optimize
```
For Google Colab users:

```bash
!pip install matminer optuna shap lime
# Other libraries are pre-installed
```
Series Features
🎯 Practice-Oriented
Acquire practical skills applicable to real materials research through 35-43 executable Python code examples. All code is Google Colab compatible.
📊 Materials Science Focused
Learn materials science-specific data analysis methods, including materials descriptor generation with matminer and Materials Project integration.
🔬 Real Datasets
Acquire practical skills through exercises using actual materials datasets including thermoelectric materials, band gap prediction, and Li-ion batteries.
🌐 Latest Technologies
Learn the latest tools and methods as of 2024-2025, including Optuna, SHAP, and LIME.
Overall Workflow
The data-driven materials science workflow you'll learn in this series runs from data collection and cleaning (Chapter 1), through feature engineering (Chapter 2) and model selection and optimization (Chapter 3), to prediction interpretation with explainable AI (Chapter 4).
Related Series
We also publish the following series on this site:
- Introduction to Materials Informatics - From fundamentals to practice across MI
- Introduction to Bayesian Optimization - Efficient materials search
- Introduction to Graph Neural Networks - Crystal structure prediction
- Introduction to Active Learning - Efficient data collection
References and Resources
Key Papers
- Ramprasad, R., Batra, R., Pilania, G., Mannodi-Kanakkithodi, A., & Kim, C. (2017). Machine learning in materials informatics: recent applications and prospects. npj Computational Materials, 3(1), 54. DOI: 10.1038/s41524-017-0056-5
- Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O., & Walsh, A. (2018). Machine learning for molecular and materials science. Nature, 559(7715), 547-555. DOI: 10.1038/s41586-018-0337-2
- Ward, L., Agrawal, A., Choudhary, A., & Wolverton, C. (2016). A general-purpose machine learning framework for predicting properties of inorganic materials. npj Computational Materials, 2(1), 16028. DOI: 10.1038/npjcompumats.2016.28
Online Resources
- matminer: Materials descriptor generation library (https://hackingmaterials.lbl.gov/matminer/)
- Materials Project: Materials database (https://materialsproject.org)
- SHAP Documentation: Explainable AI (https://shap.readthedocs.io/)
- Optuna: Hyperparameter optimization (https://optuna.org/)
Total Code Examples and Reading Time
| Chapter | Reading Time | Code Examples | Practice Problems |
|---|---|---|---|
| Chapter 1 | 25-30 min | 9-11 | 5-7 |
| Chapter 2 | 25-30 min | 10-12 | 6-8 |
| Chapter 3 | 25-30 min | 8-10 | 5-7 |
| Chapter 4 | 20-25 min | 8-10 | 4-6 |
| Total | 95-115 min | 35-43 | 20-28 |
Feedback and Questions
For questions and feedback regarding this series, please contact:
Dr. Yusuke Hashimoto
Institute of Multidisciplinary Research for Advanced Materials (IMRAM)
Tohoku University
Email: yusuke.hashimoto.b8@tohoku.ac.jp
License
This content is published under the Creative Commons Attribution 4.0 International License.
Free use for educational and research purposes is welcome. When citing, please use the following format:
Hashimoto, Yusuke (2025) 'Introduction to Data-Driven Materials Science Series v1.0' Tohoku University
https://yusukehashimotolab.github.io/wp/knowledge/en/data-driven-materials-introduction/
Last Updated: October 18, 2025 | Version: 1.0