Last sync: 2025-11-16

Introduction to Data-Driven Materials Science Series

Practical Data Analysis Methods and Workflow Design

📖 Total Study Time: 100-120 minutes 📊 Difficulty: Beginner to Intermediate 💻 Code Examples: 35-43 📝 Practice Problems: 20-28

About This Series

This series is a systematic educational resource for learning practical data analysis methods and workflow design in data-driven materials science. You'll acquire the essential data analysis skills for materials research, from data collection strategies, feature engineering, and model selection through to explainable AI.

Target Audience

Learning Objectives

Upon completing this series, you will have acquired the following skills:

Prerequisites


Series Structure

📘 Chapter 1: Data Collection Strategy and Cleaning

📖 Reading Time: 25-30 min 📊 Level: Beginner to Intermediate 💻 Code Examples: 9-11

Understand the characteristics of materials data (small-scale, imbalanced, noisy) and learn effective data collection strategies and preprocessing techniques. Practice Design of Experiments (DOE), Latin Hypercube Sampling, missing value imputation (MICE), and outlier detection (Isolation Forest).
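
As a rough illustration of two of the techniques named above, the following sketch draws a Latin Hypercube design with SciPy and then flags an implausible record with scikit-learn's Isolation Forest. The two process variables, their bounds, and the measurement values are hypothetical placeholders, not data from the series.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.ensemble import IsolationForest

# Hypothetical two-variable design space (assumed for illustration):
# annealing temperature in K and dopant fraction.
lower_bounds = [600.0, 0.00]
upper_bounds = [1200.0, 0.20]

# Latin Hypercube Sampling: each variable's range is split into n strata,
# and every stratum is sampled exactly once.
sampler = qmc.LatinHypercube(d=2, seed=0)
unit_samples = sampler.random(n=20)  # shape (20, 2), values in [0, 1)
design = qmc.scale(unit_samples, lower_bounds, upper_bounds)

# Synthetic "measurements": the design plus a made-up property column.
rng = np.random.default_rng(0)
measurements = np.column_stack([design, rng.normal(1.0, 0.05, size=20)])
measurements[3] = [5000.0, 0.9, 10.0]  # one deliberately implausible record

# Isolation Forest labels inliers as +1 and outliers as -1.
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(measurements)
cleaned = measurements[labels == 1]

print(f"Kept {len(cleaned)} of {len(measurements)} records")
```

The `contamination` value is a guess at the outlier fraction and usually needs tuning for a real dataset.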

Learning Content:

👉 Read Chapter 1

📗 Chapter 2: Feature Engineering

📖 Reading Time: 25-30 min 📊 Level: Intermediate 💻 Code Examples: 10-12

Learn materials descriptor selection and design, feature transformations, dimensionality reduction, and feature selection. Master materials science-specific feature engineering from matminer-based composition/structure descriptor generation to SHAP-based selection.
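
The three steps named above (transformation, dimensionality reduction, selection) can be sketched with scikit-learn alone. The descriptor table here is synthetic and only stands in for what a descriptor generator such as matminer would produce; the univariate F-test is one simple selection method chosen for illustration, not the SHAP-based selection mentioned above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic descriptor table: 60 "materials" x 8 descriptors (placeholder values).
X = rng.normal(size=(60, 8))
# Synthetic target that depends only on descriptors 0 and 3, plus small noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=60)

# Feature transformation: put every descriptor on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: keep enough principal components
# to explain at least 90% of the variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_scaled)

# Feature selection: keep the 2 descriptors most associated with the
# target according to a univariate F-test.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X_scaled, y)
selected_columns = np.flatnonzero(selector.get_support())

print("PCA components kept:", X_reduced.shape[1])
print("Selected descriptor columns:", selected_columns.tolist())
```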

Learning Content:

👉 Read Chapter 2

💻 Chapter 3: Model Selection and Hyperparameter Optimization

📖 Reading Time: 25-30 min 📊 Level: Intermediate 💻 Code Examples: 8-10

Practice model selection based on data size, cross-validation, hyperparameter optimization (Optuna), and ensemble learning. Master the appropriate use of linear models, tree-based models, neural networks, and GNNs, along with automated optimization using Bayesian Optimization.
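
A minimal sketch of cross-validated model comparison on a small dataset, assuming synthetic data and two arbitrarily chosen candidates (a linear model and a tree ensemble). Hyperparameter search with Optuna is not shown here; the model settings are fixed placeholder values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Small synthetic dataset (80 samples, 5 features) with a mildly nonlinear target.
X = rng.uniform(-1, 1, size=(80, 5))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=80)

# Candidate models with placeholder hyperparameters.
candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Use the same shuffled 5-fold split for every candidate so scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: R2 = {scores.mean():.3f} +/- {scores.std():.3f}")

# Pick the candidate with the highest mean cross-validated R2.
best_name = max(cv_scores, key=lambda name: cv_scores[name][0])
```

With datasets this small, the spread across folds is often as informative as the mean, so it is worth reporting both.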

Learning Content:

👉 Read Chapter 3

🔍 Chapter 4: Explainable AI (XAI)

📖 Reading Time: 20-25 min 📊 Level: Intermediate 💻 Code Examples: 8-10

Learn prediction interpretation methods using SHAP, LIME, and Attention visualization. Understand why physical interpretation matters in materials science, and explore real-world application examples from Toyota, IBM, Citrine, and others, along with XAI-related career paths.
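
SHAP and LIME are separate packages, so as a dependency-light illustration of the same underlying question (which inputs does a trained model rely on?), the sketch below uses scikit-learn's permutation importance instead. This is a swapped-in, simpler technique, not SHAP or LIME, and the data and descriptor names are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Hypothetical descriptor names, used only to label the output.
feature_names = ["descriptor_a", "descriptor_b", "descriptor_c", "descriptor_d"]

# Synthetic data: the target is driven almost entirely by the first descriptor.
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance: shuffle one column at a time and measure
# how much the model's score drops.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Rank descriptors from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"{feature_names[idx]}: {result.importances_mean[idx]:.3f}")

top_feature = feature_names[ranking[0]]
```

A ranking like this is only a starting point; whether the top descriptors make physical sense still has to be judged against domain knowledge.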

Learning Content:

👉 Read Chapter 4


How to Study

Recommended Learning Path

flowchart LR
    A[Chapter 1<br>Data Collection<br>and Cleaning] --> B[Chapter 2<br>Feature<br>Engineering]
    B --> C[Chapter 3<br>Model Selection<br>and Optimization]
    C --> D[Chapter 4<br>Explainable AI]
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#e8f5e9
    style D fill:#fce4ec

Study Methods

  1. Chapters 1-2 (Fundamentals): Master techniques to improve data quality
    • Design of experiments and data collection strategies
    • Feature design and dimensionality reduction
  2. Chapter 3 (Optimization): Maximize model performance
    • Cross-validation and hyperparameter optimization
    • Accuracy improvement through ensemble learning
  3. Chapter 4 (Interpretation): Understand physical meaning of predictions
    • Interpretation using SHAP/LIME
    • Real-world application examples and career information

Environment Setup

The following environment is required for practice:

Recommended Environment:

pip install pandas numpy matplotlib seaborn scikit-learn
pip install lightgbm xgboost optuna shap lime
pip install matminer pymatgen scipy scikit-optimize

For Google Colab users:

!pip install matminer optuna shap lime
# Other libraries are pre-installed
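
To confirm that the environment is ready, a small check like the following can report which of the packages from the install commands above are not yet importable. It uses only the standard library; the mapping from pip names to import names is written out by hand, and the helper name `find_missing` is an illustrative choice.

```python
from importlib.util import find_spec

# Map pip package names (from the install commands above) to their import names.
PACKAGES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "lightgbm": "lightgbm",
    "xgboost": "xgboost",
    "optuna": "optuna",
    "shap": "shap",
    "lime": "lime",
    "matminer": "matminer",
    "pymatgen": "pymatgen",
    "scipy": "scipy",
    "scikit-optimize": "skopt",
}


def find_missing(packages):
    """Return the pip names whose import name cannot be located."""
    return [
        pip_name
        for pip_name, module_name in packages.items()
        if find_spec(module_name) is None
    ]


missing = find_missing(PACKAGES)
if missing:
    print("Missing packages: pip install " + " ".join(missing))
else:
    print("All listed packages are importable.")
```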

Series Features

🎯 Practice-Oriented

Acquire practical skills applicable to real materials research through 35-43 executable Python code examples. All code is Google Colab compatible.

📊 Materials Science Focused

Learn materials science-specific data analysis methods, including materials descriptor generation with matminer and Materials Project integration.

🔬 Real Datasets

Acquire practical skills through exercises using actual materials datasets including thermoelectric materials, band gap prediction, and Li-ion batteries.

🌐 Latest Technologies

Learn the latest tools and methods as of 2024-2025, including Optuna, SHAP, and LIME.


Overall Workflow

The data-driven materials science workflow you'll learn in this series is as follows:

flowchart TD
    A[Problem Definition] --> B[Data Collection Strategy<br>Chapter 1]
    B --> C[Data Cleaning<br>Chapter 1]
    C --> D[Feature Engineering<br>Chapter 2]
    D --> E[Model Selection<br>Chapter 3]
    E --> F[Hyperparameter Optimization<br>Chapter 3]
    F --> G[Model Evaluation]
    G --> H{Performance OK?}
    H -->|No| D
    H -->|Yes| I[Prediction Interpretation<br>Chapter 4]
    I --> J[Materials Design/Experimental Validation]
    J --> K[New Data Acquisition]
    K --> C
    style A fill:#f9f9f9
    style B fill:#e1f5ff
    style C fill:#e1f5ff
    style D fill:#fff4e1
    style E fill:#e8f5e9
    style F fill:#e8f5e9
    style I fill:#fce4ec

Related Series

We also publish the following series on this site:


References and Resources

Key References

  1. Ramprasad, R., Batra, R., Pilania, G., Mannodi-Kanakkithodi, A., & Kim, C. (2017). Machine learning in materials informatics: recent applications and prospects. npj Computational Materials, 3(1), 54. DOI: 10.1038/s41524-017-0056-5
  2. Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O., & Walsh, A. (2018). Machine learning for molecular and materials science. Nature, 559(7715), 547-555. DOI: 10.1038/s41586-018-0337-2
  3. Ward, L., Agrawal, A., Choudhary, A., & Wolverton, C. (2016). A general-purpose machine learning framework for predicting properties of inorganic materials. npj Computational Materials, 2(1), 16028. DOI: 10.1038/npjcompumats.2016.28

Online Resources


Total Code Examples and Reading Time

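| Chapter   | Reading Time | Code Examples | Practice Problems |
|-----------|--------------|---------------|-------------------|
| Chapter 1 | 25-30 min    | 9-11          | 5-7               |
| Chapter 2 | 25-30 min    | 10-12         | 6-8               |
| Chapter 3 | 25-30 min    | 8-10          | 5-7               |
| Chapter 4 | 20-25 min    | 8-10          | 4-6               |
| Total     | 100-120 min  | 35-43         | 20-28             |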

Feedback and Questions

For questions and feedback regarding this series, please contact:

Dr. Yusuke Hashimoto
Institute of Multidisciplinary Research for Advanced Materials (IMRAM)
Tohoku University
Email: yusuke.hashimoto.b8@tohoku.ac.jp


License

This content is published under the Creative Commons Attribution 4.0 International License.

Free use for educational and research purposes is welcome. When citing, please use the following format:

Hashimoto, Yusuke (2025) 'Introduction to Data-Driven Materials Science Series v1.0' Tohoku University
https://yusukehashimotolab.github.io/wp/knowledge/en/data-driven-materials-introduction/

Last Updated: October 18, 2025 | Version: 1.0

Start with Chapter 1 →
