About This Series
This series provides a systematic educational resource for learning practical data analysis methods and workflow design in Data-Driven Materials Science. You'll acquire essential data analysis skills for materials research, from data collection strategies and feature engineering through model selection to explainable AI.
Target Audience
- Graduate students and researchers (materials science, chemistry, physics, engineering)
- Materials researchers looking to enhance practical data analysis skills
- Engineers seeking to apply machine learning to materials development
- Those pursuing careers in Materials Informatics
Learning Objectives
Upon completing this series, you will have acquired the following skills:
- ✅ Understanding data collection strategies and design of experiments for materials data
- ✅ Mastery of appropriate missing value and outlier handling techniques
- ✅ Design of materials descriptors and feature engineering
- ✅ Practical application of model selection and hyperparameter optimization
- ✅ Interpretation of predictions and physical meaning extraction using SHAP/LIME
- ✅ Workflow construction capabilities using real datasets
Prerequisites
- Required: Python basics, fundamental machine learning concepts, university-level mathematics (statistics, linear algebra)
- Recommended: Experience with scikit-learn, fundamental materials science knowledge
Series Structure
📘 Chapter 1: Data Collection Strategy and Cleaning
Understand the characteristics of materials data (small-scale, imbalanced, noisy) and learn effective data collection strategies and preprocessing techniques. Practice Design of Experiments (DOE), Latin Hypercube Sampling, missing value imputation (MICE), and outlier detection (Isolation Forest). A minimal preprocessing sketch follows the chapter outline below.
Learning Content:
- Characteristics and challenges of materials data
- Data collection strategies (DOE, Active Learning)
- Missing value handling (Simple/KNN/MICE Imputation)
- Outlier detection (Z-score, IQR, Isolation Forest, LOF)
- Case Study: Thermoelectric materials dataset
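As a taste of Chapter 1, the sketch below combines MICE-style imputation (scikit-learn's IterativeImputer) with Isolation Forest outlier detection on synthetic data; the column names and contamination rate are illustrative assumptions, not values from the chapter.

```python
# Minimal preprocessing sketch (synthetic data; column names are illustrative only).
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["band_gap", "density", "melting_point"])
df.iloc[rng.choice(100, size=10, replace=False), 0] = np.nan  # inject missing values

# MICE-style imputation: each feature is iteratively modeled from the others
imputed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                       columns=df.columns)

# Unsupervised outlier detection: fit_predict returns -1 for flagged samples
labels = IsolationForest(contamination=0.05, random_state=0).fit_predict(imputed)
clean = imputed[labels == 1]
print(f"kept {len(clean)} / {len(imputed)} samples after outlier removal")
```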
📗 Chapter 2: Feature Engineering
Learn materials descriptor selection and design, feature transformations, dimensionality reduction, and feature selection. Master materials science-specific feature engineering, from matminer-based composition/structure descriptor generation to SHAP-based selection. A short descriptor-generation sketch follows the chapter outline below.
Learning Content:
- Materials descriptors (composition, structure, electronic structure)
- Feature transformations (normalization, logarithmic transformation, polynomial features)
- Dimensionality reduction (PCA, t-SNE, UMAP)
- Feature selection (Filter, Wrapper, Embedded, SHAP-based)
- Case Study: Band gap prediction (200 dimensions → 20 dimensions)
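As a taste of Chapter 2, the sketch below generates composition descriptors with matminer's Magpie ElementProperty preset and projects them with PCA; the example formulas and the choice of three components are illustrative assumptions, not the band gap case study itself.

```python
# Minimal descriptor sketch (assumes matminer and pymatgen from the setup below are installed).
import numpy as np
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementProperty
from sklearn.decomposition import PCA

# Magpie preset: statistics of elemental properties for each composition
featurizer = ElementProperty.from_preset("magpie")
formulas = ["SiO2", "GaAs", "LiFePO4", "ZnO", "TiO2"]
X = np.array([featurizer.featurize(Composition(f)) for f in formulas])
print(X.shape)  # (number of compositions, number of Magpie descriptors)

# Dimensionality reduction: project the descriptor matrix onto a few principal components
X_reduced = PCA(n_components=3).fit_transform(X)
print(X_reduced.shape)
```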
💻 Chapter 3: Model Selection and Hyperparameter Optimization
Practice model selection based on data size, cross-validation, hyperparameter optimization (Optuna), and ensemble learning. Master the appropriate use of linear models, tree-based models, neural networks, and GNNs, along with automated optimization using Bayesian Optimization. A short Optuna sketch follows the chapter outline below.
Learning Content:
- Model selection strategies (interpretability vs. accuracy)
- Cross-validation (K-Fold, Stratified, Time Series Split)
- Hyperparameter optimization (Grid/Random/Bayesian)
- Ensemble learning (Bagging, Boosting, Stacking)
- Case Study: Li-ion battery capacity prediction
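As a taste of Chapter 3, the sketch below uses Optuna's default TPE sampler (a Bayesian-style sequential optimizer) to tune a random forest under 5-fold cross-validation; the synthetic dataset and search ranges are illustrative assumptions, not the Li-ion battery case study.

```python
# Minimal Optuna sketch: tune a random forest with 5-fold CV on synthetic data (illustrative only).
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

def objective(trial):
    # Search space: number of trees and tree depth (ranges chosen for illustration)
    model = RandomForestRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 300),
        max_depth=trial.suggest_int("max_depth", 2, 16),
        random_state=0,
    )
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=cv, scoring="r2").mean()

# Optuna's default TPE sampler performs Bayesian-style sequential optimization
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```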
🔍 Chapter 4: Explainable AI (XAI)
Learn prediction interpretation methods using SHAP, LIME, and Attention visualization. Understand why physical interpretation matters in materials science, review real-world applications from Toyota, IBM, Citrine, and others, and get an overview of XAI-related career paths. A short SHAP sketch follows the chapter outline below.
Learning Content:
- Importance of interpretability (black box problem)
- SHAP (Shapley values, Tree SHAP, Global/Local interpretation)
- LIME (local linear approximation, Tabular LIME)
- Attention visualization (for NN/GNN)
- Real-world applications and career paths (including salary information)
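As a taste of Chapter 4, the sketch below computes Tree SHAP values for a random forest and summarizes global feature importance as mean absolute SHAP values; the synthetic data stands in for a real materials dataset.

```python
# Minimal SHAP sketch: Tree SHAP on a random forest trained on synthetic data (illustrative only).
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Tree SHAP computes exact Shapley values for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Global interpretation: mean absolute SHAP value per feature
print(np.abs(shap_values).mean(axis=0))
# Local interpretation: shap_values[0] decomposes the prediction for sample 0
```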
How to Study
Recommended Learning Path
Chapter 1 → Chapter 2 → Chapter 3 → Chapter 4 (the chapters build on each other, so studying them in order is recommended)
Study Methods
- Chapters 1-2 (Fundamentals): Master techniques to improve data quality
  - Design of experiments and data collection strategies
  - Feature design and dimensionality reduction
- Chapter 3 (Optimization): Maximize model performance
  - Cross-validation and hyperparameter optimization
  - Accuracy improvement through ensemble learning
- Chapter 4 (Interpretation): Understand the physical meaning of predictions
  - Interpretation using SHAP/LIME
  - Real-world application examples and career information
Environment Setup
The following environment is required for practice:
Recommended Environment:
- Python 3.8 or higher
- Jupyter Notebook or Google Colab
- Main libraries:

```bash
pip install pandas numpy matplotlib seaborn scikit-learn
pip install lightgbm xgboost optuna shap lime
pip install matminer pymatgen scipy scikit-optimize
```
For Google Colab users:

```bash
!pip install matminer optuna shap lime
# Other libraries are pre-installed
```
Series Features
🎯 Practice-Oriented
Acquire practical skills applicable to real materials research through 35-43 executable Python code examples. All code is Google Colab compatible.
📊 Materials Science Focused
Learn materials science-specific data analysis methods, including materials descriptor generation with matminer and Materials Project integration.
🔬 Real Datasets
Acquire practical skills through exercises using actual materials datasets including thermoelectric materials, band gap prediction, and Li-ion batteries.
🌐 Latest Technologies
Learn the latest tools and methods as of 2024-2025, including Optuna, SHAP, and LIME.
Overall Workflow
The data-driven materials science workflow you'll learn in this series runs from data collection and cleaning (Chapter 1), through feature engineering (Chapter 2) and model selection and optimization (Chapter 3), to prediction interpretation with explainable AI (Chapter 4).
Related Series
We also publish the following series on this site:
- Introduction to Materials Informatics - From fundamentals to practice across MI
- Introduction to Bayesian Optimization - Efficient materials search
- Introduction to Graph Neural Networks - Crystal structure prediction
- Introduction to Active Learning - Efficient data collection
References and Resources
Key Papers
- Ramprasad, R., Batra, R., Pilania, G., Mannodi-Kanakkithodi, A., & Kim, C. (2017). Machine learning in materials informatics: recent applications and prospects. npj Computational Materials, 3(1), 54. DOI: 10.1038/s41524-017-0056-5
- Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O., & Walsh, A. (2018). Machine learning for molecular and materials science. Nature, 559(7715), 547-555. DOI: 10.1038/s41586-018-0337-2
- Ward, L., Agrawal, A., Choudhary, A., & Wolverton, C. (2016). A general-purpose machine learning framework for predicting properties of inorganic materials. npj Computational Materials, 2(1), 16028. DOI: 10.1038/npjcompumats.2016.28
Online Resources
- matminer: Materials descriptor generation library (https://hackingmaterials.lbl.gov/matminer/)
- Materials Project: Materials database (https://materialsproject.org)
- SHAP Documentation: Explainable AI (https://shap.readthedocs.io/)
- Optuna: Hyperparameter optimization (https://optuna.org/)
Total Code Examples and Reading Time
| Chapter | Reading Time | Code Examples | Practice Problems |
|---|---|---|---|
| Chapter 1 | 25-30 min | 9-11 | 5-7 |
| Chapter 2 | 25-30 min | 10-12 | 6-8 |
| Chapter 3 | 25-30 min | 8-10 | 5-7 |
| Chapter 4 | 20-25 min | 8-10 | 4-6 |
| Total | 95-115 min | 35-43 | 20-28 |
Feedback and Questions
For questions and feedback regarding this series, please contact:
Dr. Yusuke Hashimoto
Institute of Multidisciplinary Research for Advanced Materials (IMRAM)
Tohoku University
Email: yusuke.hashimoto.b8@tohoku.ac.jp
License
This content is published under the Creative Commons Attribution 4.0 International License.
Free use for educational and research purposes is welcome. When citing, please use the following format:
Hashimoto, Yusuke (2025) 'Introduction to Data-Driven Materials Science Series v1.0' Tohoku University
https://yusukehashimotolab.github.io/wp/knowledge/en/data-driven-materials-introduction/
Last Updated: October 18, 2025 | Version: 1.0