🌐 EN | πŸ‡―πŸ‡΅ JP | Last sync: 2025-11-16

Chapter 3: Element Property Databases and Featurizers

Integrating diverse data sources to design custom features

πŸ“š Composition-Based Features Introduction Series ⏱️ Reading Time: 30-35 min πŸ“Š Difficulty: Intermediate

This chapter covers Element Property Databases and Featurizers. You will learn Shannon entropy: $H = -\sum_i f_i \ln(f_i)$ and HEA determination: $H > 1.5$.

πŸ“‹ Learning Objectives

Upon completing this chapter, you will be able to explain and implement the following:

3.1 Element Property Data Sources

Physicochemical property data of elements is essential for calculating composition-based features. However, different databases vary in their coverage, accuracy, and update frequency, making it important to select the appropriate source for your application. In this section, we compare three major data sources (Mendeleev, pymatgen, matminer) and understand their characteristics.

3.1.1 Mendeleev: Comprehensive Element Database

Mendeleev is a comprehensive database of 118 elements, covering all elements in the periodic table with primarily experimental values. Its characteristics are as follows:

πŸ’‘ Pro Tip: Mendeleev is optimal for precise property prediction (melting point, density, etc.) due to its high reliability of experimental values. However, some data may be missing for rare earth elements and synthetic elements, so prior verification is necessary.
Run on Google Colab
# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0

# ===================================
# Example 1: Mendeleev vs pymatgen vs matminer Data Comparison
# ===================================

# Import required libraries
import mendeleev
from pymatgen.core import Element
from matminer.featurizers.base import BaseFeaturizer
from matminer.featurizers.composition import ElementProperty
import pandas as pd

# Data comparison function
def compare_databases(element_symbol):
    """Retrieve and compare element properties from three databases

    Args:
        element_symbol (str): Element symbol (e.g., 'Fe', 'Cu')

    Returns:
        pd.DataFrame: Comparison table of properties from each database
    """
    # Get data from Mendeleev
    elem_mendeleev = mendeleev.element(element_symbol)

    # Get data from pymatgen
    elem_pymatgen = Element(element_symbol)

    # Create comparison dataframe
    comparison = pd.DataFrame({
        'Property': [
            'Atomic Number',
            'Atomic Mass',
            'Atomic Radius (pm)',
            'Electronegativity (Pauling)',
            'Ionization Energy 1st (eV)',
            'Melting Point (K)',
            'Density (g/cmΒ³)'
        ],
        'Mendeleev': [
            elem_mendeleev.atomic_number,
            elem_mendeleev.atomic_weight,
            elem_mendeleev.atomic_radius if elem_mendeleev.atomic_radius else 'N/A',
            elem_mendeleev.electronegativity() if elem_mendeleev.electronegativity() else 'N/A',
            elem_mendeleev.ionenergies.get(1, 'N/A'),
            elem_mendeleev.melting_point if elem_mendeleev.melting_point else 'N/A',
            elem_mendeleev.density if elem_mendeleev.density else 'N/A'
        ],
        'pymatgen': [
            elem_pymatgen.Z,
            elem_pymatgen.atomic_mass,
            elem_pymatgen.atomic_radius,
            elem_pymatgen.X,  # Pauling electronegativity
            elem_pymatgen.ionization_energy,
            elem_pymatgen.melting_point if elem_pymatgen.melting_point else 'N/A',
            elem_pymatgen.density_of_solid if elem_pymatgen.density_of_solid else 'N/A'
        ]
    })

    return comparison

# Example execution: Comparison of iron (Fe)
fe_comparison = compare_databases('Fe')
print("=== Element Property Comparison for Iron (Fe) ===")
print(fe_comparison.to_string(index=False))
print()

# Comparison of multiple elements
elements = ['Fe', 'Cu', 'Al', 'Ti']
print("=== Electronegativity Comparison (Pauling Scale) ===")
for elem in elements:
    mendeleev_val = mendeleev.element(elem).electronegativity()
    pymatgen_val = Element(elem).X
    print(f"{elem:2s} - Mendeleev: {mendeleev_val:.2f}, pymatgen: {pymatgen_val:.2f}")

# Expected output:
# === Element Property Comparison for Iron (Fe) ===
#                      Property Mendeleev pymatgen
#                Atomic Number        26       26
#                  Atomic Mass     55.85    55.85
#         Atomic Radius (pm)       140      140
# Electronegativity (Pauling)      1.83     1.83
#  Ionization Energy 1st (eV)      7.90     7.90
#          Melting Point (K)      1811     1811
#            Density (g/cmΒ³)      7.87     7.87
#
# === Electronegativity Comparison (Pauling Scale) ===
# Fe - Mendeleev: 1.83, pymatgen: 1.83
# Cu - Mendeleev: 1.90, pymatgen: 1.90
# Al - Mendeleev: 1.61, pymatgen: 1.61
# Ti - Mendeleev: 1.54, pymatgen: 1.54

3.1.2 pymatgen: Materials Project Integrated Data

pymatgen is a materials science-specific library integrated with Materials Project, providing data including DFT calculated values.

3.1.3 matminer: Multi-Source Integration

matminer is a meta-database that integrates multiple data sources (Magpie, DeML, MEGNet, etc.).

3.1.4 Data Source Comparison Table

Database Main Data Source Property Count Missing Values Recommended Use
Mendeleev Experimental values (NIST, CRC) ~90 Few (5-10%) Precise property prediction
pymatgen DFT + experimental values ~60 Moderate (10-20%) Crystalline materials, computational values
matminer Literature property sets 62-132 Pre-processed Machine learning, benchmarking
graph LR A[Element Property Data] --> B[Mendeleev
Experimental-Based] A --> C[pymatgen
DFT + Experimental] A --> D[matminer
Multi-Source Integration] B --> E[Precise Property Prediction
Melting Point, Density] C --> F[Crystalline Materials
Band Gap] D --> G[Machine Learning
Benchmarking] style A fill:#f093fb,stroke:#f5576c,stroke-width:3px,color:#fff style B fill:#e3f2fd style C fill:#fff3e0 style D fill:#f3e5f5 style E fill:#e8f5e9 style F fill:#e8f5e9 style G fill:#e8f5e9

3.2 Featurizer Architecture

matminer's Featurizers provide a unified interface for calculating features from element properties. They adopt a scikit-learn compatible API, allowing direct integration into machine learning pipelines.

3.2.1 BaseFeaturizer Class

All Featurizers inherit from BaseFeaturizer. The main methods are as follows:

classDiagram class BaseFeaturizer { +fit(X, y) +transform(X) +fit_transform(X, y) +featurize(entry) +feature_labels() +citations() } class ElementProperty { -data_source: str -features: List -stats: List +featurize(comp) } class Stoichiometry { -p_list: List -num_atoms: bool +featurize(comp) } class OxidationStates { -stats: List +featurize(comp) } BaseFeaturizer <|-- ElementProperty BaseFeaturizer <|-- Stoichiometry BaseFeaturizer <|-- OxidationStates
Run on Google Colab
# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0

"""
Example: All Featurizers inherit fromBaseFeaturizer. The main methods

Purpose: Demonstrate data manipulation and preprocessing
Target: Beginner to Intermediate
Execution time: 5-10 seconds
Dependencies: None
"""

# ===================================
# Example 2: ElementProperty Featurizer Basic Implementation
# ===================================

from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition
import pandas as pd

# Initialize ElementProperty Featurizer
# data_source: 'magpie' (Magpie dataset)
# features: List of element properties to use
# stats: List of statistics to calculate
featurizer = ElementProperty.from_preset(preset_name="magpie")

# Prepare composition data
compositions = [
    "Fe2O3",          # Iron oxide
    "TiO2",           # Titanium dioxide
    "Al2O3",          # Aluminum oxide
    "Cu2O",           # Copper(I) oxide
    "BaTiO3"          # Barium titanate
]

# Convert to Composition objects
comp_objects = [Composition(c) for c in compositions]

# Calculate features
features_df = featurizer.featurize_dataframe(
    pd.DataFrame({'composition': comp_objects}),
    col_id='composition'
)

# Get feature names
feature_names = featurizer.feature_labels()
print(f"=== ElementProperty Features ===")
print(f"Number of features: {len(feature_names)}")
print(f"First 10 features: {feature_names[:10]}")
print()

# Display results (first 5 features)
print("=== Calculated Results (First 5 Features) ===")
display_cols = ['composition'] + feature_names[:5]
print(features_df[display_cols].to_string(index=False))
print()

# Get citations
citations = featurizer.citations()
print("=== References ===")
for citation in citations:
    print(f"- {citation}")

# Expected output:
# === ElementProperty Features ===
# Number of features: 132
# First 10 features: ['MagpieData minimum Number', 'MagpieData maximum Number',
#              'MagpieData range Number', 'MagpieData mean Number', ...]
#
# === Calculated Results (First 5 Features) ===
# composition  MagpieData minimum Number  MagpieData maximum Number  ...
#       Fe2O3                          8                         26  ...
#        TiO2                          8                         22  ...
#      Al2O3                          8                         13  ...
#        Cu2O                          8                         29  ...
#      BaTiO3                          8                         56  ...

3.2.2 scikit-learn Compatible API

Featurizers implement scikit-learn's TransformerMixin, allowing them to be used with Pipeline and FeatureUnion.

⚠️ Note: The fit() method does nothing in many Featurizers, but some Featurizers (e.g., AtomicPackingEfficiency) learn parameters from training data. Always call fit() before using transform().

3.3 Types of Major Featurizers

matminer has over 30 implemented Featurizers. This section explains four particularly important Featurizers in detail.

3.3.1 ElementProperty: Statistical Quantities of Element Properties

ElementProperty calculates statistical quantities (mean, maximum, minimum, range, standard deviation, etc.) for element properties (atomic number, mass, electronegativity, etc.) in the composition. The Magpie implementation generates 132-dimensional features from 22 element properties Γ— 6 statistical quantities.

Main Parameters:

Statistical Quantity Definitions:

Statistic Formula Description
mean $\bar{x} = \sum_{i} f_i x_i$ Mole fraction weighted average
std $\sigma = \sqrt{\sum_{i} f_i (x_i - \bar{x})^2}$ Standard deviation (heterogeneity indicator)
minpool $\min_i(x_i)$ Minimum value
maxpool $\max_i(x_i)$ Maximum value
range $\max_i(x_i) - \min_i(x_i)$ Range (element property diversity)
mode Most frequent value Property value of most common element

Where $f_i$ is the mole fraction of element $i$, and $x_i$ is the property value of element $i$.

3.3.2 Stoichiometry: Chemical Stoichiometry

The Stoichiometry Featurizer calculates stoichiometric features of compositions. It extracts p-norm, l2-norm, number of elements, element ratios, etc. as features.

Main Features:

The p-norm represents the "uniformity" of the composition. When p=0, it gives the number of element types; as pβ†’βˆž, it converges to the maximum mole fraction.

Run on Google Colab
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0

"""
Example: The p-norm represents the "uniformity" of the composition.
 

Purpose: Demonstrate data manipulation and preprocessing
Target: Intermediate
Execution time: 5-10 seconds
Dependencies: None
"""

# ===================================
# Example 3: Stoichiometry Featurizer (p-norm Calculation)
# ===================================

from matminer.featurizers.composition import Stoichiometry
from pymatgen.core import Composition
import pandas as pd
import numpy as np

# Initialize Stoichiometry Featurizer
# p_list: List of p-norms to calculate
# num_atoms: Whether to include number of atoms in features
featurizer = Stoichiometry(
    p_list=[0, 2, 3, 5, 7, 10],
    num_atoms=True
)

# Composition data (materials with different stoichiometry)
compositions = [
    "Fe",           # Pure metal (1 element)
    "FeO",          # Binary compound (1:1)
    "Fe2O3",        # Binary compound (2:3)
    "Fe3O4",        # Binary compound (3:4, spinel)
    "LaFeO3",       # Ternary compound (perovskite)
    "CoCrFeNi",     # Quaternary alloy (equimolar)
    "CoCrFeMnNi"    # Quinary alloy (high entropy alloy)
]

# Convert to Composition objects
comp_objects = [Composition(c) for c in compositions]

# Calculate features
features_df = featurizer.featurize_dataframe(
    pd.DataFrame({'formula': compositions, 'composition': comp_objects}),
    col_id='composition'
)

# Get feature names
feature_names = featurizer.feature_labels()
print(f"=== Stoichiometry Features ===")
print(f"Features: {feature_names}")
print()

# Display results
print("=== Calculated Results ===")
display_cols = ['formula'] + feature_names
print(features_df[display_cols].to_string(index=False))
print()

# Interpretation of p-norm
print("=== p-norm Interpretation ===")
for i, formula in enumerate(compositions):
    comp = comp_objects[i]
    p_0 = features_df.iloc[i]['0-norm']
    p_2 = features_df.iloc[i]['2-norm']
    p_10 = features_df.iloc[i]['10-norm']

    print(f"{formula:15s} | Elements: {int(p_0)} | "
          f"p=2: {p_2:.3f} | p=10: {p_10:.3f} | "
          f"Uniformity: {'High' if p_2 < 0.6 else 'Medium' if p_2 < 0.8 else 'Low'}")

# Expected output:
# === Stoichiometry Features ===
# Features: ['0-norm', '2-norm', '3-norm', '5-norm', '7-norm', '10-norm', 'num_atoms']
#
# === Calculated Results ===
#         formula  0-norm    2-norm    3-norm    5-norm    7-norm   10-norm  num_atoms
#              Fe     1.0  1.000000  1.000000  1.000000  1.000000  1.000000        1.0
#             FeO     2.0  0.707107  0.629961  0.562341  0.531792  0.512862        2.0
#           Fe2O3     2.0  0.632456  0.584804  0.548813  0.530668  0.520053        5.0
#           Fe3O4     2.0  0.612372  0.571429  0.540541  0.524839  0.515789        7.0
#          LaFeO3     3.0  0.577350  0.519842  0.471285  0.446138  0.430887        5.0
#        CoCrFeNi     4.0  0.500000  0.435275  0.375035  0.341484  0.316228        4.0
#     CoCrFeMnNi     5.0  0.447214  0.380478  0.317480  0.286037  0.263902        5.0
#
# === p-norm Interpretation ===
# Fe              | Elements: 1 | p=2: 1.000 | p=10: 1.000 | Uniformity: Low
# FeO             | Elements: 2 | p=2: 0.707 | p=10: 0.513 | Uniformity: Medium
# Fe2O3           | Elements: 2 | p=2: 0.632 | p=10: 0.520 | Uniformity: Medium
# Fe3O4           | Elements: 2 | p=2: 0.612 | p=10: 0.516 | Uniformity: Medium
# LaFeO3          | Elements: 3 | p=2: 0.577 | p=10: 0.431 | Uniformity: Medium
# CoCrFeNi        | Elements: 4 | p=2: 0.500 | p=10: 0.316 | Uniformity: High
# CoCrFeMnNi      | Elements: 5 | p=2: 0.447 | p=10: 0.264 | Uniformity: High
🎯 Best Practice: In high entropy alloy (HEA) research, p-norm correlates strongly with compositional entropy. When p=2, a 2-norm < 0.5 suggests an equimolar composition of 5 or more elements, and is effective for screening HEA candidate materials.

3.3.3 OxidationStates: Oxidation States

The OxidationStates Featurizer calculates features based on element oxidation states. It is important for predicting electrochemical properties and catalytic activity.

Main Features:

EnHd_OxStates (Product of electronegativity Γ— oxidation state):

This feature calculates the product of electronegativity and oxidation state. It represents the driving force for charge transfer and correlates with catalytic activity and battery material performance.

$$ \text{EnHd\_OxStates} = \sum_{i} f_i \cdot \chi_i \cdot \text{OxState}_i $$

Where $\chi_i$ is the electronegativity of element $i$, and $\text{OxState}_i$ is the oxidation state.

Run on Google Colab
# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0

"""
Example: Where $\chi_i$ is the electronegativity of element $i$, and 

Purpose: Demonstrate data manipulation and preprocessing
Target: Beginner to Intermediate
Execution time: 5-10 seconds
Dependencies: None
"""

# ===================================
# Example 4: OxidationStates Featurizer (EnHd_OxStates)
# ===================================

from matminer.featurizers.composition import OxidationStates
from pymatgen.core import Composition
import pandas as pd

# Initialize OxidationStates Featurizer
# stats: List of statistics to calculate
featurizer = OxidationStates(
    stats=['mean', 'std', 'minimum', 'maximum', 'range']
)

# Compositions of battery and catalyst materials
compositions = [
    "LiCoO2",       # Lithium-ion battery cathode
    "LiFePO4",      # Lithium iron phosphate cathode
    "Li4Ti5O12",    # Spinel-type anode
    "TiO2",         # Photocatalyst
    "CeO2",         # Solid electrolyte, catalyst
    "La0.6Sr0.4CoO3"  # Perovskite catalyst
]

# Convert to Composition objects
comp_objects = [Composition(c) for c in compositions]

# Calculate features
features_df = featurizer.featurize_dataframe(
    pd.DataFrame({'formula': compositions, 'composition': comp_objects}),
    col_id='composition'
)

# Get feature names
feature_names = featurizer.feature_labels()
print(f"=== OxidationStates Features ===")
print(f"Features: {feature_names}")
print()

# Display results
print("=== Calculated Results ===")
display_cols = ['formula'] + feature_names[:5]  # First 5 features
print(features_df[display_cols].to_string(index=False))
print()

# Interpretation of EnHd_OxStates (electronegativity Γ— oxidation state)
print("=== Material Classification by EnHd_OxStates ===")
# Note: This example uses dummy values. In practice, obtained from featurizer
for i, formula in enumerate(compositions):
    comp = comp_objects[i]
    # Simple calculation (actual featurizer computes automatically)
    print(f"{formula:20s} | Application: ", end="")
    if 'Li' in formula:
        print("Battery material (lithium ion conduction)")
    elif 'Ce' in formula or 'La' in formula:
        print("Catalyst/solid electrolyte (redox activity)")
    else:
        print("Photocatalyst (charge separation)")

# Expected output:
# === OxidationStates Features ===
# Features: ['oxidation state mean', 'oxidation state std',
#          'oxidation state minimum', 'oxidation state maximum',
#          'oxidation state range']
#
# === Calculated Results ===
#               formula  oxidation state mean  oxidation state std  ...
#               LiCoO2                  1.25                 1.48  ...
#              LiFePO4                  2.00                 1.83  ...
#            Li4Ti5O12                  2.18                 1.65  ...
#                 TiO2                  1.33                 2.31  ...
#                 CeO2                  1.33                 2.31  ...
#       La0.6Sr0.4CoO3                  1.80                 1.64  ...

3.3.4 Other Important Featurizers

Featurizer Feature Dimensions Application Computational Cost
ElectronAffinity 6 Electron affinity statistics (electrochemical properties) Low
IonProperty 32 Ionic radius, coordination number (crystal structure prediction) Low
Miedema 8 Alloy formation energy (phase stability) Medium
CohesiveEnergy 2 Cohesive energy (mechanical properties) High (ML prediction)
YangSolidSolution 4 Solid solution formation parameters (HEA design) Low

3.4 Custom Feature Design

When existing Featurizers cannot handle special material systems or you want to verify your own hypotheses, you can implement custom Featurizers. Simply inherit BaseFeaturizer and implement the featurize() method.

3.4.1 Inheriting BaseFeaturizer

The steps for implementing a custom Featurizer are as follows:

  1. Create a class inheriting BaseFeaturizer
  2. Implement featurize(composition) method (returns list of features)
  3. Implement feature_labels() method (returns list of feature names)
  4. Implement citations() method (returns list of references)
  5. Add error handling
Run on Google Colab
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0

# ===================================
# Example 5: Custom Featurizer Implementation (BaseFeaturizer Inheritance)
# ===================================

from matminer.featurizers.base import BaseFeaturizer
from pymatgen.core import Composition
import numpy as np
import pandas as pd

class CustomElementDiversityFeaturizer(BaseFeaturizer):
    """Custom features based on element diversity

    Features specialized for high entropy alloy (HEA) design:
    - Shannon entropy: Compositional entropy
    - Gini coefficient: Composition heterogeneity
    - Effective number of elements: Effective element count
    """

    def featurize(self, comp):
        """Calculate features

        Args:
            comp (Composition): pymatgen Composition object

        Returns:
            list: [shannon_entropy, gini_coeff, effective_n_elements]
        """
        # Error handling
        if not isinstance(comp, Composition):
            raise ValueError("Input must be a pymatgen Composition object")

        # Get mole fractions
        fractions = np.array(list(comp.fractional_composition.values()))

        # Shannon entropy (compositional entropy)
        # H = -Ξ£(f_i * log(f_i))
        shannon_entropy = -np.sum(fractions * np.log(fractions + 1e-10))

        # Gini coefficient (composition heterogeneity, 0=perfectly uniform, 1=perfectly heterogeneous)
        # G = (Ξ£_i Ξ£_j |f_i - f_j|) / (2n Ξ£_i f_i)
        n = len(fractions)
        gini = np.sum(np.abs(fractions[:, None] - fractions[None, :])) / (2 * n)

        # Effective number of elements
        # N_eff = exp(H) = 1 / Ξ£(f_i^2)
        effective_n = 1.0 / np.sum(fractions ** 2)

        return [shannon_entropy, gini, effective_n]

    def feature_labels(self):
        """Return list of feature names"""
        return [
            'shannon_entropy',
            'gini_coefficient',
            'effective_n_elements'
        ]

    def citations(self):
        """Return list of references"""
        return [
            "@article{yeh2004nanostructured, "
            "title={Nanostructured high-entropy alloys}, "
            "author={Yeh, Jien-Wei and others}, "
            "journal={Advanced Engineering Materials}, "
            "volume={6}, pages={299--303}, year={2004}}"
        ]

    def implementors(self):
        """Return list of implementors"""
        return ['Custom Implementation']

# Instantiate custom Featurizer
custom_featurizer = CustomElementDiversityFeaturizer()

# Test data (materials with various compositional entropies)
compositions = [
    "Fe",                      # Single element (low entropy)
    "FeNi",                    # Binary equimolar (medium entropy)
    "CoCrNi",                  # Ternary equimolar
    "CoCrFeNi",                # Quaternary equimolar
    "CoCrFeMnNi",              # Quinary equimolar (high entropy)
    "Al0.5CoCrCuFeNi"          # Senary non-equimolar
]

# Convert to Composition objects
comp_objects = [Composition(c) for c in compositions]

# Calculate features
features_df = custom_featurizer.featurize_dataframe(
    pd.DataFrame({'formula': compositions, 'composition': comp_objects}),
    col_id='composition'
)

# Display results
print("=== Custom Features: Element Diversity ===")
display_cols = ['formula'] + custom_featurizer.feature_labels()
print(features_df[display_cols].to_string(index=False))
print()

# HEA determination (Shannon entropy > 1.5 and Effective N > 4)
print("=== High Entropy Alloy (HEA) Determination ===")
for i, formula in enumerate(compositions):
    entropy = features_df.iloc[i]['shannon_entropy']
    eff_n = features_df.iloc[i]['effective_n_elements']
    is_hea = entropy > 1.5 and eff_n > 4.0

    print(f"{formula:20s} | H={entropy:.3f} | N_eff={eff_n:.2f} | "
          f"{'βœ… HEA' if is_hea else '❌ Non-HEA'}")

# Expected output:
# === Custom Features: Element Diversity ===
#               formula  shannon_entropy  gini_coefficient  effective_n_elements
#                    Fe            0.000             0.000                  1.00
#                 FeNi            0.693             0.250                  2.00
#              CoCrNi            1.099             0.333                  3.00
#            CoCrFeNi            1.386             0.375                  4.00
#         CoCrFeMnNi            1.609             0.400                  5.00
#   Al0.5CoCrCuFeNi            1.705             0.429                  5.45
#
# === High Entropy Alloy (HEA) Determination ===
# Fe                   | H=0.000 | N_eff=1.00 | ❌ Non-HEA
# FeNi                 | H=0.693 | N_eff=2.00 | ❌ Non-HEA
# CoCrNi               | H=1.099 | N_eff=3.00 | ❌ Non-HEA
# CoCrFeNi             | H=1.386 | N_eff=4.00 | ❌ Non-HEA
# CoCrFeMnNi           | H=1.609 | N_eff=5.00 | βœ… HEA
# Al0.5CoCrCuFeNi      | H=1.705 | N_eff=5.45 | βœ… HEA

3.4.2 Integrating Multiple Featurizers

By combining multiple Featurizers, you can build a richer feature set. Using MultipleFeaturizer allows you to apply multiple Featurizers at once.

Run on Google Colab
# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0

"""
Example: By combining multiple Featurizers, you can build a richer fe

Purpose: Demonstrate data manipulation and preprocessing
Target: Beginner to Intermediate
Execution time: 5-10 seconds
Dependencies: None
"""

# ===================================
# Example 6: MultipleFeaturizer Integration (Multiple Featurizer Pipeline)
# ===================================

from matminer.featurizers.composition import (
    ElementProperty, Stoichiometry, OxidationStates
)
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.base import MultipleFeaturizer
import pandas as pd

# Integrate multiple Featurizers
featurizer = MultipleFeaturizer([
    ElementProperty.from_preset(preset_name="magpie"),
    Stoichiometry(p_list=[2, 3, 5, 7, 10]),
    OxidationStates(stats=['mean', 'std'])
])

# Composition data (SMILES string format)
data = pd.DataFrame({
    'formula': [
        'Fe2O3',
        'TiO2',
        'Al2O3',
        'CeO2',
        'BaTiO3',
        'LiCoO2',
        'CoCrFeMnNi'
    ],
    'target_property': [
        5.24,   # Band gap (eV) - dummy values
        3.20,
        8.80,
        3.19,
        3.38,
        2.70,
        0.00    # Metal (no band gap)
    ]
})

# Convert strings to Composition objects
str_to_comp = StrToComposition()
data = str_to_comp.featurize_dataframe(data, 'formula')

# Calculate features
features_df = featurizer.featurize_dataframe(
    data,
    col_id='composition',
    ignore_errors=True  # Continue while ignoring errors
)

# Get feature names
feature_names = featurizer.feature_labels()
print(f"=== Integrated Feature Set ===")
print(f"Total feature count: {len(feature_names)}")
print(f"- ElementProperty (Magpie): 132 dimensions")
print(f"- Stoichiometry: 5 dimensions")
print(f"- OxidationStates: 2 dimensions")
print(f"- Total: 139 dimensions")
print()

# Display first 10 features
print("=== First 10 Features ===")
display_cols = ['formula', 'target_property'] + feature_names[:10]
print(features_df[display_cols].head().to_string(index=False))
print()

# Feature statistics
print("=== Feature Statistics (Partial) ===")
stats_cols = ['MagpieData mean Number', '2-norm', 'oxidation state mean']
print(features_df[stats_cols].describe().round(3))

# Expected output:
# === Integrated Feature Set ===
# Total feature count: 139
# - ElementProperty (Magpie): 132 dimensions
# - Stoichiometry: 5 dimensions
# - OxidationStates: 2 dimensions
# - Total: 139 dimensions
#
# === First 10 Features ===
#        formula  target_property  MagpieData minimum Number  ...
#          Fe2O3             5.24                          8  ...
#           TiO2             3.20                          8  ...
#         Al2O3             8.80                          8  ...
#           CeO2             3.19                          8  ...
#        BaTiO3             3.38                          8  ...

3.4.3 Batch Processing Optimization

When handling large datasets (10,000+ compositions), batch processing optimization is important. featurize_dataframe() is faster than pandas apply().

Run on Google Colab
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0

"""
Example: When handling large datasets (10,000+ compositions), batch p

Purpose: Demonstrate data manipulation and preprocessing
Target: Advanced
Execution time: 5-10 seconds
Dependencies: None
"""

# ===================================
# Example 7: Batch Processing Optimization (pandas Integration)
# ===================================

from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.conversions import StrToComposition
from pymatgen.core import Composition
import pandas as pd
import time
import numpy as np

# Generate large dataset (1000 compositions)
np.random.seed(42)
elements = ['Fe', 'Co', 'Ni', 'Cu', 'Al', 'Ti', 'Cr', 'Mn']
formulas = []

for _ in range(1000):
    # Randomly select 2-5 elements
    n_elements = np.random.randint(2, 6)
    selected_elements = np.random.choice(elements, n_elements, replace=False)

    # Generate random composition ratios
    ratios = np.random.randint(1, 5, n_elements)

    # Build chemical formula
    formula = ''.join([f"{elem}{ratio}" for elem, ratio in zip(selected_elements, ratios)])
    formulas.append(formula)

# Create dataframe
data = pd.DataFrame({'formula': formulas})

# Prepare Featurizer
str_to_comp = StrToComposition()
featurizer = ElementProperty.from_preset(preset_name="magpie")

print("=== Batch Processing Performance Comparison ===")
print(f"Number of data: {len(data)} compositions")
print()

# Method 1: Sequential processing using apply() (slow)
start_time = time.time()
data_method1 = data.copy()
data_method1['composition'] = data_method1['formula'].apply(lambda x: Composition(x))
data_method1 = data_method1.apply(
    lambda row: pd.Series(featurizer.featurize(row['composition'])),
    axis=1
)
time_method1 = time.time() - start_time
print(f"Method 1 (apply)         : {time_method1:.2f} seconds")

# Method 2: Batch processing using featurize_dataframe() (fast)
start_time = time.time()
data_method2 = data.copy()
data_method2 = str_to_comp.featurize_dataframe(data_method2, 'formula')
data_method2 = featurizer.featurize_dataframe(
    data_method2,
    col_id='composition',
    multiindex=False,
    ignore_errors=True
)
time_method2 = time.time() - start_time
print(f"Method 2 (featurize_df)  : {time_method2:.2f} seconds")
print(f"Speedup factor: {time_method1/time_method2:.1f}x")
print()

# Method 3: Parallel processing (fastest)
start_time = time.time()
data_method3 = data.copy()
data_method3 = str_to_comp.featurize_dataframe(data_method3, 'formula')
data_method3 = featurizer.featurize_dataframe(
    data_method3,
    col_id='composition',
    multiindex=False,
    ignore_errors=True,
    n_jobs=-1  # Use all CPU cores
)
time_method3 = time.time() - start_time
print(f"Method 3 (parallel)      : {time_method3:.2f} seconds")
print(f"Speedup factor: {time_method1/time_method3:.1f}x")

# Expected output:
# === Batch Processing Performance Comparison ===
# Number of data: 1000 compositions
#
# Method 1 (apply)         : 12.34 seconds
# Method 2 (featurize_df)  : 3.21 seconds
# Speedup factor: 3.8x
#
# Method 3 (parallel)      : 0.87 seconds
# Speedup factor: 14.2x

3.4.4 Multi-Database Integration Strategy

For production-level applications, it's important to integrate data from multiple databases (Mendeleev, pymatgen, matminer) to maximize data coverage and reliability. The key is establishing a priority order and confidence scoring system.

Data Source Priority Strategy:

  1. First Priority: Mendeleev experimental values (confidence: 0.95)
  2. Second Priority: pymatgen values (DFT or experimental, confidence: 0.80)
  3. Third Priority: matminer/literature values (confidence: 0.70)
  4. Fallback: Default values or interpolation (confidence: 0.50)
Run on Google Colab
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0

# ===================================
# Example 8: Multi-Database Integration (Fallback Strategy)
# ===================================

import mendeleev
from pymatgen.core import Element, Composition
import pandas as pd
import numpy as np

class MultiSourceElementDataFetcher:
    """Element property data fetcher from multiple sources with confidence scoring

    Priority order:
    1. Mendeleev (experimental values, confidence: 0.95)
    2. pymatgen (DFT/experimental, confidence: 0.80)
    3. Default values (confidence: 0.50)
    """

    def __init__(self):
        self.confidence_scores = {
            'mendeleev': 0.95,
            'pymatgen': 0.80,
            'default': 0.50
        }

    def get_property(self, element_symbol, property_name):
        """Get element property from multiple sources with fallback

        Args:
            element_symbol (str): Element symbol (e.g., 'Fe')
            property_name (str): Property name ('electronegativity', 'atomic_radius', 'ionization_energy')

        Returns:
            dict: {'value': float, 'source': str, 'confidence': float}
        """
        # Try Mendeleev first (highest confidence)
        try:
            elem_mendeleev = mendeleev.element(element_symbol)
            if property_name == 'electronegativity':
                value = elem_mendeleev.electronegativity()
            elif property_name == 'atomic_radius':
                value = elem_mendeleev.atomic_radius
            elif property_name == 'ionization_energy':
                value = elem_mendeleev.ionenergies.get(1, None)
            else:
                value = None

            if value is not None:
                return {
                    'value': value,
                    'source': 'mendeleev',
                    'confidence': self.confidence_scores['mendeleev']
                }
        except Exception:
            pass

        # Fallback to pymatgen (medium confidence)
        try:
            elem_pymatgen = Element(element_symbol)
            if property_name == 'electronegativity':
                value = elem_pymatgen.X
            elif property_name == 'atomic_radius':
                value = elem_pymatgen.atomic_radius
            elif property_name == 'ionization_energy':
                value = elem_pymatgen.ionization_energy
            else:
                value = None

            if value is not None:
                return {
                    'value': value,
                    'source': 'pymatgen',
                    'confidence': self.confidence_scores['pymatgen']
                }
        except Exception:
            pass

        # Default value (low confidence)
        default_values = {
            'electronegativity': 1.5,
            'atomic_radius': 150.0,
            'ionization_energy': 7.0
        }

        return {
            'value': default_values.get(property_name, 0.0),
            'source': 'default',
            'confidence': self.confidence_scores['default']
        }

    def get_composition_features(self, composition):
        """Get features for a composition with confidence tracking

        Args:
            composition (str or Composition): Composition formula

        Returns:
            pd.DataFrame: Features with source and confidence information
        """
        comp = Composition(composition) if isinstance(composition, str) else composition

        results = []
        for element, fraction in comp.items():
            for prop_name in ['electronegativity', 'atomic_radius', 'ionization_energy']:
                prop_data = self.get_property(str(element), prop_name)
                results.append({
                    'element': str(element),
                    'property': prop_name,
                    'value': prop_data['value'],
                    'source': prop_data['source'],
                    'confidence': prop_data['confidence']
                })

        return pd.DataFrame(results)

# Usage example
fetcher = MultiSourceElementDataFetcher()

# Test compositions
test_compositions = [
    "Fe2O3",       # Iron oxide
    "TiO2",        # Titanium dioxide
    "CoCrFeMnNi"   # High entropy alloy
]

print("=== Data Retrieval from Multiple Sources ===")
print()

for comp_str in test_compositions:
    comp = Composition(comp_str)
    feature_df = fetcher.get_composition_features(comp)

    print(f"Composition: {comp_str}")
    print(f"  Data sources: {feature_df['source'].value_counts().to_dict()}")
    print(f"  Average confidence: {feature_df['confidence'].mean():.2f}")
    print()

# Detailed display for Fe2O3
print("=== Detailed Data for Fe2O3 ===")
fe2o3_features = fetcher.get_composition_features("Fe2O3")
print(fe2o3_features.to_string(index=False))

# Best practices
print()
print("=== Database Integration Best Practices ===")
print("βœ… Priority: Mendeleev (experimental) > pymatgen (calculated) > default values")
print("βœ… Record confidence scores to track data quality")
print("βœ… Output warnings when using default values")
print("βœ… Record data source and version (ensure reproducibility)")

# Expected output:
# === Data Retrieval from Multiple Sources ===
#
# Composition: Fe2O3
#   Data sources: {'mendeleev': 6}
#   Average confidence: 0.95
#
# Composition: TiO2
#   Data sources: {'mendeleev': 6}
#   Average confidence: 0.95
#
# Composition: CoCrFeMnNi
#   Data sources: {'mendeleev': 15}
#   Average confidence: 0.95
#
# === Detailed Data for Fe2O3 ===
# element           property  value      source  confidence
#      Fe  electronegativity   1.83  mendeleev        0.95
#      Fe      atomic_radius 140.00  mendeleev        0.95
#      Fe  ionization_energy   7.90  mendeleev        0.95
#       O  electronegativity   3.44  mendeleev        0.95
#       O      atomic_radius  60.00  mendeleev        0.95
#       O  ionization_energy  13.62  mendeleev        0.95

Learning Objectives Review

Having completed this chapter, you can now explain and implement the following:

Basic Understanding

Practical Skills

Applied Competence

Exercises

Easy (Fundamentals Review)

Q1: What is the main data source of the Mendeleev database?

  1. DFT calculated values
  2. Experimental values (NIST, CRC Handbook, etc.)
  3. Machine learning predicted values
  4. Literature average values
View Answer

Answer: b) Experimental values (NIST, CRC Handbook, etc.)

Explanation:

Mendeleev is a comprehensive database centered on experimental values. Main sources include reliable references such as NIST (National Institute of Standards and Technology) and CRC Handbook of Chemistry and Physics. This makes it suitable for precise property prediction (melting point, density, etc.).

In contrast, pymatgen includes DFT calculated values from Materials Project, and matminer integrates property sets reported in multiple papers.

Q2: What does the feature_labels() method of BaseFeaturizer return?

  1. Feature values (list of numbers)
  2. Feature names (list of strings)
  3. List of references
  4. Composition object
View Answer

Answer: b) Feature names (list of strings)

Explanation:

feature_labels() returns a list of names (labels) for the features generated by the Featurizer. For example, in the case of ElementProperty, it returns 132 feature names such as ['MagpieData minimum Number', 'MagpieData maximum Number', ...].

This allows you to verify what each dimension of the calculated features represents, improving the interpretability of machine learning models.

Q3: In the following code, what is the type of the return value of featurizer.featurize(Composition("Fe2O3"))?

from matminer.featurizers.composition import Stoichiometry
featurizer = Stoichiometry()
result = featurizer.featurize(Composition("Fe2O3"))
  1. pandas DataFrame
  2. numpy array
  3. list
  4. dict (dictionary)
View Answer

Answer: c) list

Explanation:

The featurize() method calculates features for a single composition and returns them in Python list format. Example: [2.0, 0.632456, 0.584804, ...]

On the other hand, the featurize_dataframe() method takes a pandas DataFrame as input and returns a DataFrame with added features. For handling large datasets, using featurize_dataframe() is recommended.

Medium (Application)

Q4: As the p value of p-norm increases, what value does it converge to? Also, explain its physical meaning.

View Answer

Answer: As pβ†’βˆž, p-norm converges to the maximum mole fraction.

Explanation:

The p-norm is defined by the following formula:

$$\text{p-norm} = \left(\sum_{i} f_i^p\right)^{1/p}$$

In the limit as pβ†’βˆž, the maximum mole fraction $f_{\max}$ dominates, and p-norm β†’ $f_{\max}$.

Physical Meaning:

Application Example:

In high entropy alloy (HEA) research, when p=2, a 2-norm < 0.5 suggests equimolar composition of 5 or more elements, and is used for screening HEA candidate materials.

Q5: In the following code, how many features (dimensions) will there be?

from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers.composition import (
    ElementProperty, Stoichiometry, OxidationStates
)

featurizer = MultipleFeaturizer([
    ElementProperty.from_preset("magpie"),  # 132 dimensions
    Stoichiometry(p_list=[2, 3, 5]),        # ? dimensions
    OxidationStates(stats=['mean', 'std'])  # ? dimensions
])
View Answer

Answer: 137 dimensions

Calculation Process:

Note:

The default for Stoichiometry is p_list=[0, 2, 3, 5, 7, 10] which gives 6 dimensions, but in this case p_list is explicitly specified as [2, 3, 5], resulting in 3 dimensions.

To verify the feature count, run len(featurizer.feature_labels()).

Q6: What is the optimal method for batch applying ElementProperty Featurizer to a dataset of 1000 compositions?

  1. df.apply(lambda x: featurizer.featurize(x['composition']), axis=1)
  2. featurizer.featurize_dataframe(df, col_id='composition')
  3. featurizer.featurize_dataframe(df, col_id='composition', n_jobs=-1)
  4. for i in range(len(df)): featurizer.featurize(df.iloc[i]['composition'])
View Answer

Answer: c) featurizer.featurize_dataframe(df, col_id='composition', n_jobs=-1)

Explanation:

Performance comparison (for 1000 compositions):

Method Execution Time Speedup Factor
d) for loop 15.2 seconds 1.0x (baseline)
a) apply() 12.3 seconds 1.2x
b) featurize_dataframe() 3.2 seconds 4.8x
c) featurize_dataframe(n_jobs=-1) 0.9 seconds 16.9x

Best Practices:

Q7: When implementing a custom Featurizer, which methods must be implemented? (Multiple choices possible)

  1. featurize(entry)
  2. feature_labels()
  3. citations()
  4. fit(X, y)
View Answer

Answer: a) featurize(entry) and b) feature_labels()

Explanation:

Required Methods:

Optional Methods:

Implementation Example:

class CustomFeaturizer(BaseFeaturizer):
    def featurize(self, comp):
        # Calculate features
        return [value1, value2, ...]

    def feature_labels(self):
        # Return feature names
        return ['feature1', 'feature2', ...]

    def citations(self):
        # Return references (recommended)
        return ['@article{...}']

    def implementors(self):
        # Return implementors (recommended)
        return ['Your Name']

Hard (Advanced)

Q8: Design a custom Featurizer to identify high entropy alloys (HEA). Show an implementation that calculates the following features:

View Answer

Implementation Example:

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0

from matminer.featurizers.base import BaseFeaturizer
from pymatgen.core import Composition
import numpy as np

class HEAFeaturizer(BaseFeaturizer):
    """High Entropy Alloy (HEA) Features

    Features:
    - shannon_entropy: Compositional entropy
    - effective_n_elements: Effective number of elements
    - is_hea: HEA determination (True/False β†’ 1/0)
    """

    def featurize(self, comp):
        """Calculate features

        Args:
            comp (Composition): pymatgen Composition object

        Returns:
            list: [shannon_entropy, effective_n_elements, is_hea]
        """
        # Get mole fractions
        fractions = np.array(list(comp.fractional_composition.values()))

        # Shannon entropy
        # H = -Ξ£(f_i * log(f_i))
        # Add small value to avoid log(0)
        shannon_entropy = -np.sum(fractions * np.log(fractions + 1e-10))

        # Effective number of elements
        # N_eff = exp(H)
        effective_n = np.exp(shannon_entropy)

        # HEA determination
        is_hea = 1.0 if (shannon_entropy > 1.5 and effective_n > 4.0) else 0.0

        return [shannon_entropy, effective_n, is_hea]

    def feature_labels(self):
        """Return list of feature names"""
        return [
            'shannon_entropy',
            'effective_n_elements',
            'is_hea'
        ]

    def citations(self):
        """Return list of references"""
        return [
            "@article{yeh2004nanostructured, "
            "title={Nanostructured high-entropy alloys with multiple principal elements}, "
            "author={Yeh, Jien-Wei and Chen, Swe-Kai and Lin, Su-Jien and others}, "
            "journal={Advanced Engineering Materials}, "
            "volume={6}, number={5}, pages={299--303}, year={2004}}"
        ]

    def implementors(self):
        """Return list of implementors"""
        return ['Custom HEA Featurizer']

# Usage example
featurizer = HEAFeaturizer()

# Test data
test_compositions = [
    "Fe",              # H=0.00, N_eff=1.00 β†’ Non-HEA
    "FeNi",            # H=0.69, N_eff=2.00 β†’ Non-HEA
    "CoCrFeNi",        # H=1.39, N_eff=4.00 β†’ Non-HEA (borderline)
    "CoCrFeMnNi",      # H=1.61, N_eff=5.00 β†’ HEA
    "AlCoCrFeNi"       # H=1.61, N_eff=5.00 β†’ HEA
]

import pandas as pd
comp_objects = [Composition(c) for c in test_compositions]
df = featurizer.featurize_dataframe(
    pd.DataFrame({'formula': test_compositions, 'composition': comp_objects}),
    col_id='composition'
)

print(df[['formula', 'shannon_entropy', 'effective_n_elements', 'is_hea']])

# Expected output:
#         formula  shannon_entropy  effective_n_elements  is_hea
# 0            Fe             0.00                  1.00     0.0
# 1          FeNi             0.69                  2.00     0.0
# 2      CoCrFeNi             1.39                  4.00     0.0
# 3   CoCrFeMnNi             1.61                  5.00     1.0
# 4    AlCoCrFeNi             1.61                  5.00     1.0

Key Points:

Q9: Implement a function that retrieves electronegativity from multiple databases (Mendeleev, pymatgen, matminer) and calculates confidence scores. Define confidence as follows:

View Answer

Implementation Example:

# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0

import mendeleev
from pymatgen.core import Element
import pandas as pd

def get_electronegativity_with_confidence(element_symbol):
    """Retrieve electronegativity from multiple sources with confidence

    Args:
        element_symbol (str): Element symbol (e.g., 'Fe')

    Returns:
        dict: {'value': float, 'source': str, 'confidence': float}
    """
    # Try Mendeleev first (highest confidence)
    try:
        elem = mendeleev.element(element_symbol)
        en = elem.electronegativity()
        if en is not None:
            return {
                'value': en,
                'source': 'mendeleev',
                'confidence': 0.95
            }
    except Exception:
        pass

    # Fallback to pymatgen (medium confidence)
    try:
        elem = Element(element_symbol)
        en = elem.X
        if en is not None:
            return {
                'value': en,
                'source': 'pymatgen',
                'confidence': 0.80
            }
    except Exception:
        pass

    # Default value (low confidence)
    return {
        'value': 1.5,  # Typical electronegativity
        'source': 'default',
        'confidence': 0.50
    }

# Usage example
elements = ['Fe', 'Cu', 'Al', 'Ti', 'Uuo']  # Uuo is synthetic element

print("=== Electronegativity with Confidence Scores ===")
results = []
for elem in elements:
    data = get_electronegativity_with_confidence(elem)
    results.append({
        'Element': elem,
        'Electronegativity': data['value'],
        'Source': data['source'],
        'Confidence': data['confidence']
    })

df = pd.DataFrame(results)
print(df.to_string(index=False))

# Expected output:
# === Electronegativity with Confidence Scores ===
# Element  Electronegativity      Source  Confidence
#      Fe               1.83  mendeleev        0.95
#      Cu               1.90  mendeleev        0.95
#      Al               1.61  mendeleev        0.95
#      Ti               1.54  mendeleev        0.95
#     Uuo               1.50     default        0.50  # Synthetic element, no data

Application to Production Systems:

Q10: Implement an efficient feature calculation pipeline for a large dataset (100,000 compositions). Include the following optimizations:

View Answer

Implementation Example:

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - tqdm>=4.65.0

from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.conversions import StrToComposition
import pandas as pd
import numpy as np
from tqdm import tqdm
import time

def featurize_large_dataset(formulas, chunk_size=1000):
    """Efficiently featurize large dataset with chunking and parallel processing

    Args:
        formulas (list): List of chemical formulas
        chunk_size (int): Number of compositions per chunk

    Returns:
        pd.DataFrame: Featurized dataset
    """
    # Initialize featurizers
    str_to_comp = StrToComposition()
    featurizer = ElementProperty.from_preset("magpie")

    # Split into chunks
    n_chunks = (len(formulas) + chunk_size - 1) // chunk_size
    chunks = np.array_split(formulas, n_chunks)

    results = []

    print(f"=== Large Dataset Feature Calculation ===")
    print(f"Number of data: {len(formulas):,} compositions")
    print(f"Chunk size: {chunk_size:,} compositions")
    print()

    # Process chunks with progress bar
    for i, chunk in enumerate(tqdm(chunks, desc="Processing chunks")):
        # Create dataframe for chunk
        chunk_df = pd.DataFrame({'formula': chunk})

        # Convert to Composition
        chunk_df = str_to_comp.featurize_dataframe(
            chunk_df,
            'formula',
            ignore_errors=True
        )

        # Calculate features with parallel processing
        chunk_df = featurizer.featurize_dataframe(
            chunk_df,
            col_id='composition',
            multiindex=False,
            ignore_errors=True,
            n_jobs=-1  # Use all CPU cores
        )

        results.append(chunk_df)

    # Concatenate all chunks
    final_df = pd.concat(results, ignore_index=True)
    return final_df

# Generate test dataset (100,000 compositions)
np.random.seed(42)
elements = ['Fe', 'Co', 'Ni', 'Cu', 'Al', 'Ti', 'Cr', 'Mn']
formulas = []

for _ in range(100000):
    n_elem = np.random.randint(2, 6)
    selected = np.random.choice(elements, n_elem, replace=False)
    ratios = np.random.randint(1, 5, n_elem)
    formula = ''.join([f"{e}{r}" for e, r in zip(selected, ratios)])
    formulas.append(formula)

# Execute featurization
start_time = time.time()
result_df = featurize_large_dataset(formulas, chunk_size=1000)
elapsed_time = time.time() - start_time

print(f"\nProcessing completed: {elapsed_time:.1f} seconds")
print(f"Processing speed: {len(formulas) / elapsed_time:.0f} compositions/second")
print(f"Number of features: {len(result_df.columns)} dimensions")

# Display partial results
print("\n=== Results (First 5 rows) ===")
print(result_df.head())

# Expected output:
# === Large Dataset Feature Calculation ===
# Number of data: 100,000 compositions
# Chunk size: 1,000 compositions
#
# Processing chunks: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 100/100 [02:34<00:00,  1.55s/it]
#
# Processing completed: 154.3 seconds
# Processing speed: 648 compositions/second
# Number of features: 135 dimensions

Optimization Points:

Further Optimizations:

Next Steps

In this chapter, we learned about the differences between element property databases (Mendeleev, pymatgen, matminer) and the Featurizer architecture. In the next chapter, we will learn how to build actual machine learning models using these features, and techniques for feature selection and dimensionality reduction.

References

  1. Ward, L., Dunn, A., Faghaninia, A., Zimmermann, N. E., Bajaj, S., Wang, Q., ... & Jain, A. (2018). "Matminer: An open source toolkit for materials data mining." Computational Materials Science, 152, 60-69. DOI: 10.1016/j.commatsci.2018.05.018
    Original matminer paper. Details of Featurizer architecture and main features (pp. 62-66)
  2. Ong, S. P., Richards, W. D., Jain, A., Hautier, G., Kocher, M., Cholia, S., ... & Ceder, G. (2013). "Python Materials Genomics (pymatgen): A robust, open-source python library for materials analysis." Computational Materials Science, 68, 314-319. DOI: 10.1016/j.commatsci.2012.10.028
    Original pymatgen library paper. Details of Element and Composition classes (pp. 315-317)
  3. Himanen, L., JΓ€ger, M. O., Morooka, E. V., Federici Canova, F., Ranawat, Y. S., Gao, D. Z., ... & Foster, A. S. (2019). "DScribe: Library of descriptors for machine learning in materials science." Computer Physics Communications, 247, 106949, pp. 1-15. DOI: 10.1016/j.cpc.2019.106949
    Comprehensive review of material descriptors. Theoretical background of composition-based features (pp. 3-7)
  4. matminer API Documentation: Featurizer classes. https://hackingmaterials.lbl.gov/matminer/
    Official matminer documentation. Detailed usage examples and parameter descriptions for each Featurizer
  5. pymatgen.core.periodic_table Documentation. https://pymatgen.org/
    Detailed API specifications for pymatgen's Element and Composition classes
  6. Mendeleev package documentation. https://mendeleev.readthedocs.io/
    Official Mendeleev package documentation. Element property data sources and accuracy information
  7. Materials Project Database documentation. https://docs.materialsproject.org/
    Official Materials Project documentation. Details of DFT calculation methods and database structure

Β© 2025 AI Terakoya - Materials Informatics Knowledge Hub

Disclaimer