Chapter 2: Feature Engineering
This chapter covers feature engineering for materials data. You will learn how to design material descriptors, apply feature transformations (normalization, log transformation, polynomial features), reduce dimensionality (PCA, t-SNE, UMAP), and select informative features.
Learning Objectives
After reading this chapter, you will master:
- ✅ Selection and design of material descriptors (composition, structure, electronic structure)
- ✅ Automated generation of material features using matminer
- ✅ Feature transformations (normalization, log transformation, polynomial features)
- ✅ Dimensionality reduction (PCA, t-SNE, UMAP) for visualization and interpretation
- ✅ Feature selection techniques (Filter/Wrapper/Embedded/SHAP-based)
- ✅ Effective reduction from 200 to 20 dimensions in bandgap prediction
2.1 Selection and Design of Material Descriptors
To predict material properties using machine learning, appropriate material descriptors are essential.
Composition Descriptors
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - seaborn>=0.12.0
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def calculate_composition_descriptors(formula_dict):
"""
Calculate composition descriptors
Parameters:
-----------
formula_dict : dict
{'element_symbol': fraction} e.g., {'Fe': 0.7, 'Ni': 0.3}
Returns:
--------
dict : composition descriptors
"""
# Element properties (simplified)
element_properties = {
'Fe': {'atomic_mass': 55.845, 'electronegativity': 1.83,
'atomic_radius': 1.26},
'Ni': {'atomic_mass': 58.693, 'electronegativity': 1.91,
'atomic_radius': 1.24},
'Cu': {'atomic_mass': 63.546, 'electronegativity': 1.90,
'atomic_radius': 1.28},
'Zn': {'atomic_mass': 65.38, 'electronegativity': 1.65,
'atomic_radius': 1.34}
}
descriptors = {}
# Average atomic mass
avg_mass = sum(
element_properties[el]['atomic_mass'] * frac
for el, frac in formula_dict.items()
)
descriptors['average_atomic_mass'] = avg_mass
# Average electronegativity
avg_electronegativity = sum(
element_properties[el]['electronegativity'] * frac
for el, frac in formula_dict.items()
)
descriptors['average_electronegativity'] = avg_electronegativity
# Electronegativity difference (max - min)
electronegativities = [
element_properties[el]['electronegativity']
for el in formula_dict.keys()
]
descriptors['electronegativity_difference'] = max(electronegativities) - min(electronegativities)
# Average atomic radius
avg_radius = sum(
element_properties[el]['atomic_radius'] * frac
for el, frac in formula_dict.items()
)
descriptors['average_atomic_radius'] = avg_radius
return descriptors
# Example: Fe-Ni alloy
formula = {'Fe': 0.7, 'Ni': 0.3}
descriptors = calculate_composition_descriptors(formula)
print("Composition descriptors (Fe₀.₇Ni₀.₃):")
for key, value in descriptors.items():
print(f" {key}: {value:.4f}")
Output:
Composition descriptors (Fe₀.₇Ni₀.₃):
average_atomic_mass: 56.6994
average_electronegativity: 1.8540
electronegativity_difference: 0.0800
average_atomic_radius: 1.2540
Utilizing matminer
# Automated material descriptor generation using matminer
# !pip install matminer pymatgen
from matminer.featurizers.composition import (
ElementProperty,
Stoichiometry,
ValenceOrbital,
IonProperty
)
from pymatgen.core import Composition
def generate_matminer_features(formula_str):
"""
Generate material descriptors using matminer
Parameters:
-----------
formula_str : str
Chemical formula (e.g., "Fe2O3")
Returns:
--------
pd.DataFrame : features
"""
comp = Composition(formula_str)
# Element property descriptors
ep_feat = ElementProperty.from_preset("magpie")
features_ep = ep_feat.featurize(comp)
# Stoichiometry descriptors
stoich_feat = Stoichiometry()
features_stoich = stoich_feat.featurize(comp)
# Valence orbital descriptors
valence_feat = ValenceOrbital()
features_valence = valence_feat.featurize(comp)
# Obtain feature names
feature_names = (
ep_feat.feature_labels() +
stoich_feat.feature_labels() +
valence_feat.feature_labels()
)
# Convert to DataFrame
all_features = features_ep + features_stoich + features_valence
df = pd.DataFrame([all_features], columns=feature_names)
return df
# Example: Iron oxide
formula = "Fe2O3"
features = generate_matminer_features(formula)
print(f"Feature generation using matminer ({formula}):")
print(f"Number of features: {features.shape[1]}")
print(f"\nFirst 10 features:")
print(features.iloc[:, :10].T)
Main matminer descriptors:
# Comparison of descriptor types
descriptor_types = pd.DataFrame({
'Descriptor Type': [
'ElementProperty',
'Stoichiometry',
'ValenceOrbital',
'IonProperty',
'OxidationStates',
'ElectronAffinity'
],
'Number of Features': [132, 7, 10, 32, 3, 1],
'Purpose': [
'Physical-chemical properties of elements',
'Stoichiometric ratio',
'Valence orbitals',
'Ionic properties',
'Oxidation states',
'Electron affinity'
]
})
# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(descriptor_types['Descriptor Type'],
descriptor_types['Number of Features'],
color='steelblue', alpha=0.7)
ax.set_xlabel('Number of Features', fontsize=12)
ax.set_title('matminer Descriptor Types', fontsize=13, fontweight='bold')
ax.grid(axis='x', alpha=0.3)
for idx, row in descriptor_types.iterrows():
ax.text(row['Number of Features'] + 5, idx, row['Purpose'],
va='center', fontsize=9, style='italic')
plt.tight_layout()
plt.show()
print(descriptor_types.to_string(index=False))
Structure Descriptors
def calculate_structure_descriptors(lattice_params):
"""
Calculate crystal structure descriptors
Parameters:
-----------
lattice_params : dict
{'a': float, 'b': float, 'c': float,
'alpha': float, 'beta': float, 'gamma': float}
Returns:
--------
dict : structure descriptors
"""
a = lattice_params['a']
b = lattice_params['b']
c = lattice_params['c']
alpha = np.radians(lattice_params['alpha'])
beta = np.radians(lattice_params['beta'])
gamma = np.radians(lattice_params['gamma'])
# Volume
volume = a * b * c * np.sqrt(
1 - np.cos(alpha)**2 - np.cos(beta)**2 - np.cos(gamma)**2 +
2 * np.cos(alpha) * np.cos(beta) * np.cos(gamma)
)
# Packing density (simplified)
packing_density = 0.74 # Example: FCC case
descriptors = {
'lattice_constant_a': a,
'lattice_constant_b': b,
'lattice_constant_c': c,
'volume': volume,
'packing_density': packing_density
}
return descriptors
# Example: Cubic crystal
lattice = {'a': 5.43, 'b': 5.43, 'c': 5.43,
'alpha': 90, 'beta': 90, 'gamma': 90}
struct_desc = calculate_structure_descriptors(lattice)
print("Structure descriptors (cubic crystal):")
for key, value in struct_desc.items():
print(f" {key}: {value:.4f}")
Electronic Structure Descriptors
# Simulate electronic structure descriptors
def simulate_electronic_descriptors(n_samples=100):
"""
Generate sample electronic structure descriptor data
"""
np.random.seed(42)
data = pd.DataFrame({
'bandgap': np.random.uniform(0, 5, n_samples),
'fermi_energy': np.random.uniform(-5, 5, n_samples),
'dos_valence': np.random.uniform(10, 100, n_samples),
'dos_conduction': np.random.uniform(5, 50, n_samples),
'effective_mass_electron': np.random.uniform(0.1, 2, n_samples),
'effective_mass_hole': np.random.uniform(0.1, 2, n_samples)
})
return data
# Generate
electronic_data = simulate_electronic_descriptors(100)
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()
for idx, col in enumerate(electronic_data.columns):
axes[idx].hist(electronic_data[col], bins=20,
color='steelblue', alpha=0.7, edgecolor='black')
axes[idx].set_xlabel(col, fontsize=11)
axes[idx].set_ylabel('Frequency', fontsize=11)
axes[idx].set_title(f'{col} distribution', fontsize=12, fontweight='bold')
axes[idx].grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("Electronic structure descriptor statistics:")
print(electronic_data.describe())
2.2 Feature Transformations
Transform raw features into forms suitable for machine learning models.
Normalization (Min-Max, Z-score)
from sklearn.preprocessing import MinMaxScaler, StandardScaler
def compare_normalization(data):
"""
Compare normalization techniques
"""
# Min-Max normalization (0-1)
minmax_scaler = MinMaxScaler()
data_minmax = pd.DataFrame(
minmax_scaler.fit_transform(data),
columns=data.columns
)
# Z-score normalization (mean 0, std 1)
standard_scaler = StandardScaler()
data_standard = pd.DataFrame(
standard_scaler.fit_transform(data),
columns=data.columns
)
return data_minmax, data_standard
# Sample data
np.random.seed(42)
sample_data = pd.DataFrame({
'lattice_constant': np.random.uniform(3, 7, 100),
'electrical_conductivity': np.random.lognormal(10, 2, 100),
'bandgap': np.random.uniform(0, 3, 100)
})
# Normalize
data_minmax, data_standard = compare_normalization(sample_data)
# Visualization
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
for idx, col in enumerate(sample_data.columns):
# Original data
axes[idx, 0].hist(sample_data[col], bins=20,
color='gray', alpha=0.7, edgecolor='black')
axes[idx, 0].set_title(f'Original: {col}', fontsize=11, fontweight='bold')
axes[idx, 0].set_ylabel('Frequency', fontsize=10)
# Min-Max
axes[idx, 1].hist(data_minmax[col], bins=20,
color='steelblue', alpha=0.7, edgecolor='black')
axes[idx, 1].set_title(f'Min-Max: {col}', fontsize=11, fontweight='bold')
# Z-score
axes[idx, 2].hist(data_standard[col], bins=20,
color='coral', alpha=0.7, edgecolor='black')
axes[idx, 2].set_title(f'Z-score: {col}', fontsize=11, fontweight='bold')
plt.tight_layout()
plt.show()
print("Statistics after normalization:")
print("\nMin-Max normalization:")
print(data_minmax.describe())
print("\nZ-score normalization:")
print(data_standard.describe())
Log and Box-Cox Transformations
from scipy.stats import boxcox
def apply_transformations(data):
"""
Apply various transformations
"""
# Log transformation
data_log = np.log1p(data) # log(1+x) handles zero values
# Box-Cox transformation (positive values only)
data_boxcox, lambda_param = boxcox(data + 1) # +1 to avoid zero
return data_log, data_boxcox, lambda_param
# Skewed data (electrical conductivity)
np.random.seed(42)
conductivity = np.random.lognormal(10, 2, 100)
# Transform
cond_log, cond_boxcox, lambda_val = apply_transformations(conductivity)
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Original data
axes[0].hist(conductivity, bins=30, color='gray',
alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Electrical conductivity (S/m)', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].set_title('Original data (skewed)', fontsize=12, fontweight='bold')
# Log transformation
axes[1].hist(cond_log, bins=30, color='steelblue',
alpha=0.7, edgecolor='black')
axes[1].set_xlabel('log(conductivity+1)', fontsize=11)
axes[1].set_ylabel('Frequency', fontsize=11)
axes[1].set_title('Log transformation', fontsize=12, fontweight='bold')
# Box-Cox transformation
axes[2].hist(cond_boxcox, bins=30, color='coral',
alpha=0.7, edgecolor='black')
axes[2].set_xlabel(f'Box-Cox (λ={lambda_val:.3f})', fontsize=11)
axes[2].set_ylabel('Frequency', fontsize=11)
axes[2].set_title('Box-Cox transformation', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()
# Compare skewness
from scipy.stats import skew
print(f"Original data skewness: {skew(conductivity):.3f}")
print(f"After log transformation: {skew(cond_log):.3f}")
print(f"After Box-Cox transformation: {skew(cond_boxcox):.3f}")
Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
def create_polynomial_features(X, degree=2):
"""
Generate polynomial features
Parameters:
-----------
X : array-like, shape (n_samples, n_features)
degree : int
Polynomial degree
Returns:
--------
X_poly : polynomial features
feature_names : feature names
"""
poly = PolynomialFeatures(degree=degree, include_bias=False)
X_poly = poly.fit_transform(X)
feature_names = poly.get_feature_names_out()
return X_poly, feature_names
# Sample data
np.random.seed(42)
X_original = pd.DataFrame({
'x1': np.random.uniform(0, 1, 50),
'x2': np.random.uniform(0, 1, 50)
})
# Generate 2nd degree polynomial features
X_poly, feature_names = create_polynomial_features(
X_original.values, degree=2
)
print(f"Original number of features: {X_original.shape[1]}")
print(f"Number of polynomial features: {X_poly.shape[1]}")
print(f"\nGenerated features:")
for name in feature_names:
print(f" {name}")
# Nonlinear relationship learning example
# y = 2*x1^2 + 3*x1*x2 - x2^2 + noise
y_true = (
2 * X_original['x1']**2 +
3 * X_original['x1'] * X_original['x2'] -
X_original['x2']**2 +
np.random.normal(0, 0.1, 50)
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Linear model with original features
model_linear = LinearRegression()
model_linear.fit(X_original, y_true)
y_pred_linear = model_linear.predict(X_original)
r2_linear = r2_score(y_true, y_pred_linear)
# Linear model with polynomial features
model_poly = LinearRegression()
model_poly.fit(X_poly, y_true)
y_pred_poly = model_poly.predict(X_poly)
r2_poly = r2_score(y_true, y_pred_poly)
print(f"\nLinear model (original features) R²: {r2_linear:.4f}")
print(f"Linear model (polynomial features) R²: {r2_poly:.4f}")
print(f"Improvement: {(r2_poly - r2_linear) / r2_linear * 100:.1f}%")
Generating Interaction Terms
from itertools import combinations
def create_interaction_features(df):
"""
Generate interaction terms
Parameters:
-----------
df : pd.DataFrame
Original features
Returns:
--------
df_with_interactions : DataFrame with interaction terms added
"""
df_new = df.copy()
# Two-variable interactions (products)
for col1, col2 in combinations(df.columns, 2):
interaction_name = f"{col1}×{col2}"
df_new[interaction_name] = df[col1] * df[col2]
return df_new
# Example: Thermoelectric material features
thermoelectric_features = pd.DataFrame({
'electrical_conductivity': np.random.lognormal(8, 1, 50),
'seebeck_coefficient': np.random.normal(200, 50, 50),
'thermal_conductivity': np.random.uniform(1, 10, 50)
})
# Add interaction terms
features_with_interactions = create_interaction_features(
thermoelectric_features
)
print(f"Original features: {thermoelectric_features.columns.tolist()}")
print(f"\nAdded interaction terms:")
new_cols = [col for col in features_with_interactions.columns
if col not in thermoelectric_features.columns]
for col in new_cols:
print(f" {col}")
# ZT figure of merit prediction (similar to ZT = σ*S²/κ)
thermoelectric_features['ZT'] = (
thermoelectric_features['electrical_conductivity'] *
thermoelectric_features['seebeck_coefficient']**2 /
thermoelectric_features['thermal_conductivity'] / 1e6 +
np.random.normal(0, 0.1, 50)
)
print(f"\nCorrelation analysis (correlation with interaction terms):")
correlations = features_with_interactions.corrwith(
thermoelectric_features['ZT']
).sort_values(ascending=False)
print(correlations)
2.3 Dimensionality Reduction
Compress high-dimensional data to lower dimensions for easier visualization and interpretation.
PCA (Principal Component Analysis)
from sklearn.decomposition import PCA
def apply_pca(X, n_components=2):
"""
Apply PCA for dimensionality reduction
Parameters:
-----------
X : array-like
Original features
n_components : int
Reduced dimensionality
Returns:
--------
X_pca : principal component scores
pca : PCA object
"""
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)
return X_pca, pca
# Generate high-dimensional data (100 dimensions)
np.random.seed(42)
n_samples = 200
n_features = 100
# Data with latent 2D structure
latent = np.random.randn(n_samples, 2)
X_high_dim = latent @ np.random.randn(2, n_features) + np.random.randn(n_samples, n_features) * 0.5
# Apply PCA
X_pca, pca_model = apply_pca(X_high_dim, n_components=10)
# Explained variance
explained_var = pca_model.explained_variance_ratio_
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Explained variance ratio
axes[0].bar(range(1, 11), explained_var * 100,
color='steelblue', alpha=0.7)
axes[0].plot(range(1, 11), np.cumsum(explained_var) * 100,
'ro-', linewidth=2, label='Cumulative explained variance')
axes[0].set_xlabel('Principal component', fontsize=12)
axes[0].set_ylabel('Explained variance (%)', fontsize=12)
axes[0].set_title('PCA Explained Variance Ratio', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
# 2D plot
axes[1].scatter(X_pca[:, 0], X_pca[:, 1],
c='steelblue', s=50, alpha=0.6, edgecolors='k')
axes[1].set_xlabel('PC1', fontsize=12)
axes[1].set_ylabel('PC2', fontsize=12)
axes[1].set_title('PCA Visualization (2D)', fontsize=13, fontweight='bold')
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Original dimensionality: {n_features}")
print(f"Reduced dimensionality: {X_pca.shape[1]}")
print(f"Cumulative explained variance (PC1-PC2): {np.sum(explained_var[:2]) * 100:.2f}%")
print(f"Cumulative explained variance (PC1-PC10): {np.sum(explained_var) * 100:.2f}%")
t-SNE and UMAP
from sklearn.manifold import TSNE
# !pip install umap-learn
from umap import UMAP
def compare_dimensionality_reduction(X, labels=None):
"""
Compare PCA, t-SNE, and UMAP
"""
# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
# UMAP
umap_model = UMAP(n_components=2, random_state=42)
X_umap = umap_model.fit_transform(X)
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# PCA
axes[0].scatter(X_pca[:, 0], X_pca[:, 1],
c=labels, cmap='viridis', s=50, alpha=0.6)
axes[0].set_xlabel('PC1', fontsize=11)
axes[0].set_ylabel('PC2', fontsize=11)
axes[0].set_title('PCA', fontsize=12, fontweight='bold')
axes[0].grid(alpha=0.3)
# t-SNE
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1],
c=labels, cmap='viridis', s=50, alpha=0.6)
axes[1].set_xlabel('t-SNE1', fontsize=11)
axes[1].set_ylabel('t-SNE2', fontsize=11)
axes[1].set_title('t-SNE', fontsize=12, fontweight='bold')
axes[1].grid(alpha=0.3)
# UMAP
im = axes[2].scatter(X_umap[:, 0], X_umap[:, 1],
c=labels, cmap='viridis', s=50, alpha=0.6)
axes[2].set_xlabel('UMAP1', fontsize=11)
axes[2].set_ylabel('UMAP2', fontsize=11)
axes[2].set_title('UMAP', fontsize=12, fontweight='bold')
axes[2].grid(alpha=0.3)
if labels is not None:
plt.colorbar(im, ax=axes[2], label='Label')
plt.tight_layout()
plt.show()
# Sample data (3 classes)
np.random.seed(42)
# Shift the first two dimensions of each class so the clusters are separable
class1 = np.random.randn(100, 50) + np.concatenate([[2, 2], np.zeros(48)])
class2 = np.random.randn(100, 50) + np.concatenate([[-2, 2], np.zeros(48)])
class3 = np.random.randn(100, 50) + np.concatenate([[0, -2], np.zeros(48)])
X_multi_class = np.vstack([class1, class2, class3])
labels = np.array([0]*100 + [1]*100 + [2]*100)
# Compare
compare_dimensionality_reduction(X_multi_class, labels)
print("Dimensionality reduction method characteristics:")
print("PCA: linear transformation, global structure preservation, fast")
print("t-SNE: nonlinear transformation, local structure preservation, slow")
print("UMAP: nonlinear transformation, global+local structure preservation, medium speed")
LDA (Linear Discriminant Analysis)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
def apply_lda(X, y, n_components=2):
"""
Apply LDA for dimensionality reduction (supervised)
Parameters:
-----------
X : array-like
Features
y : array-like
Labels
Returns:
--------
X_lda : LDA-transformed features
lda : LDA model
"""
lda = LinearDiscriminantAnalysis(n_components=n_components)
X_lda = lda.fit_transform(X, y)
return X_lda, lda
# Apply LDA
X_lda, lda_model = apply_lda(X_multi_class, labels, n_components=2)
# Compare PCA vs LDA
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# PCA (unsupervised)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_multi_class)
axes[0].scatter(X_pca[:, 0], X_pca[:, 1],
c=labels, cmap='viridis', s=50, alpha=0.6)
axes[0].set_xlabel('PC1', fontsize=11)
axes[0].set_ylabel('PC2', fontsize=11)
axes[0].set_title('PCA (unsupervised)', fontsize=12, fontweight='bold')
axes[0].grid(alpha=0.3)
# LDA (supervised)
im = axes[1].scatter(X_lda[:, 0], X_lda[:, 1],
c=labels, cmap='viridis', s=50, alpha=0.6)
axes[1].set_xlabel('LD1', fontsize=11)
axes[1].set_ylabel('LD2', fontsize=11)
axes[1].set_title('LDA (supervised)', fontsize=12, fontweight='bold')
axes[1].grid(alpha=0.3)
plt.colorbar(im, ax=axes[1], label='Class')
plt.tight_layout()
plt.show()
print("LDA advantages:")
print("- Finds projection axes that maximize class separation")
print("- Suitable for classification problems")
print(f"- Maximum dimensionality: min(n_features, n_classes-1) = {lda_model.n_components}")
2.4 Feature Selection
Select only important features to improve model accuracy and interpretability.
Filter Method: Correlation and ANOVA
from sklearn.feature_selection import (
VarianceThreshold,
SelectKBest,
f_regression,
mutual_info_regression
)
def filter_method_selection(X, y, k=10):
"""
Feature selection using Filter method
Parameters:
-----------
X : pd.DataFrame
Features
y : array-like
Target variable
Returns:
--------
selected_features : selected feature names
scores : scores for each feature
"""
# Remove low-variance features
var_threshold = VarianceThreshold(threshold=0.01)
X_var = var_threshold.fit_transform(X)
selected_by_var = X.columns[var_threshold.get_support()]
# F-statistic
selector_f = SelectKBest(f_regression, k=k)
selector_f.fit(X, y)
scores_f = selector_f.scores_
selected_by_f = X.columns[selector_f.get_support()]
# Mutual information
selector_mi = SelectKBest(mutual_info_regression, k=k)
selector_mi.fit(X, y)
scores_mi = selector_mi.scores_
selected_by_mi = X.columns[selector_mi.get_support()]
return {
'variance': selected_by_var,
'f_stat': selected_by_f,
'mutual_info': selected_by_mi,
'scores_f': scores_f,
'scores_mi': scores_mi
}
# Sample data
np.random.seed(42)
n_samples = 200
X_data = pd.DataFrame(
np.random.randn(n_samples, 30),
columns=[f'feature_{i}' for i in range(30)]
)
# Target variable (related to only some features)
y_data = (
2 * X_data['feature_0'] +
3 * X_data['feature_5'] -
1.5 * X_data['feature_10'] +
np.random.normal(0, 0.5, n_samples)
)
# Execute Filter method
selection_results = filter_method_selection(X_data, y_data, k=10)
# Visualize scores
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# F-statistic scores
axes[0].bar(range(len(selection_results['scores_f'])),
selection_results['scores_f'],
color='steelblue', alpha=0.7)
axes[0].set_xlabel('Feature index', fontsize=11)
axes[0].set_ylabel('F-statistic score', fontsize=11)
axes[0].set_title('Feature Evaluation by F-statistic', fontsize=12, fontweight='bold')
axes[0].grid(alpha=0.3)
# Mutual information scores
axes[1].bar(range(len(selection_results['scores_mi'])),
selection_results['scores_mi'],
color='coral', alpha=0.7)
axes[1].set_xlabel('Feature index', fontsize=11)
axes[1].set_ylabel('Mutual information', fontsize=11)
axes[1].set_title('Feature Evaluation by Mutual Information', fontsize=12, fontweight='bold')
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("Features selected by F-statistic:")
print(selection_results['f_stat'].tolist())
print("\nFeatures selected by mutual information:")
print(selection_results['mutual_info'].tolist())
Wrapper Method: RFE
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
def rfe_selection(X, y, n_features_to_select=10):
"""
RFE (Recursive Feature Elimination)
"""
estimator = RandomForestRegressor(n_estimators=50, random_state=42)
selector = RFE(estimator, n_features_to_select=n_features_to_select)
selector.fit(X, y)
selected_features = X.columns[selector.support_]
feature_ranking = selector.ranking_
return selected_features, feature_ranking
# Execute RFE
selected_rfe, ranking_rfe = rfe_selection(X_data, y_data, n_features_to_select=10)
# Visualize rankings
plt.figure(figsize=(12, 6))
plt.bar(range(len(ranking_rfe)), ranking_rfe,
color='steelblue', alpha=0.7)
plt.axhline(y=1, color='red', linestyle='--',
label='Selected features (rank=1)', linewidth=2)
plt.xlabel('Feature index', fontsize=12)
plt.ylabel('Ranking (lower is more important)', fontsize=12)
plt.title('Feature Ranking by RFE', fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("Features selected by RFE:")
print(selected_rfe.tolist())
Embedded Method: Lasso and Random Forest Importances
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
def embedded_selection(X, y):
"""
Embedded method (Lasso + Random Forest)
"""
# Lasso (L1 regularization)
lasso = Lasso(alpha=0.1, random_state=42)
lasso.fit(X, y)
lasso_coefs = np.abs(lasso.coef_)
selected_lasso = X.columns[lasso_coefs > 0]
# Random Forest importances
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)
rf_importances = rf.feature_importances_
# Select top 10
top_10_idx = np.argsort(rf_importances)[-10:]
selected_rf = X.columns[top_10_idx]
return {
'lasso': selected_lasso,
'lasso_coefs': lasso_coefs,
'rf': selected_rf,
'rf_importances': rf_importances
}
# Execute Embedded method
embedded_results = embedded_selection(X_data, y_data)
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Lasso coefficients
axes[0].bar(range(len(embedded_results['lasso_coefs'])),
embedded_results['lasso_coefs'],
color='steelblue', alpha=0.7)
axes[0].set_xlabel('Feature index', fontsize=11)
axes[0].set_ylabel('|Lasso coefficient|', fontsize=11)
axes[0].set_title('Feature Selection by Lasso', fontsize=12, fontweight='bold')
axes[0].grid(alpha=0.3)
# Random Forest importances
axes[1].bar(range(len(embedded_results['rf_importances'])),
embedded_results['rf_importances'],
color='coral', alpha=0.7)
axes[1].set_xlabel('Feature index', fontsize=11)
axes[1].set_ylabel('Feature Importance', fontsize=11)
axes[1].set_title('Feature Importances (Random Forest)', fontsize=12, fontweight='bold')
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("Features selected by Lasso:")
print(embedded_results['lasso'].tolist())
print("\nTop 10 features selected by Random Forest:")
print(embedded_results['rf'].tolist())
SHAP-based Selection
# Requirements:
# - Python 3.9+
# - shap>=0.42.0
import shap
def shap_based_selection(X, y, top_k=10):
"""
Feature selection using SHAP values
"""
# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)
# Calculate SHAP values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# Evaluate importance by mean absolute SHAP values
mean_abs_shap = np.abs(shap_values).mean(axis=0)
# Select top k features
top_k_idx = np.argsort(mean_abs_shap)[-top_k:]
selected_features = X.columns[top_k_idx]
return selected_features, mean_abs_shap, shap_values
# SHAP selection
selected_shap, mean_shap, shap_vals = shap_based_selection(X_data, y_data, top_k=10)
# SHAP Summary Plot
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_vals, X_data, plot_type="bar", show=False)
plt.title('SHAP Feature Importance', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
print("Top 10 features selected by SHAP:")
print(selected_shap.tolist())
# Compare methods
print("\nComparison of feature selection methods:")
print(f"Filter method (F-statistic): {len(selection_results['f_stat'])} features")
print(f"Wrapper method (RFE): {len(selected_rfe)} features")
print(f"Embedded method (Lasso): {len(embedded_results['lasso'])} features")
print(f"SHAP-based: {len(selected_shap)} features")
2.5 Case Study: Bandgap Prediction
Practice effective reduction from 200 to 20 dimensions in a real bandgap prediction task.
# Bandgap dataset (simulated)
np.random.seed(42)
n_materials = 500
# 200-dimensional material descriptors (composition, structure, electronic structure)
descriptor_names = (
[f'composition_{i}' for i in range(80)] +
[f'structure_{i}' for i in range(60)] +
[f'electronic_{i}' for i in range(60)]
)
X_bandgap = pd.DataFrame(
np.random.randn(n_materials, 200),
columns=descriptor_names
)
# Bandgap (depends on only some descriptors)
important_features = [
'composition_5', 'composition_12', 'composition_25',
'structure_10', 'structure_23',
'electronic_8', 'electronic_15', 'electronic_30'
]
y_bandgap = np.zeros(n_materials)
for feat in important_features:
idx = descriptor_names.index(feat)
y_bandgap += np.random.uniform(0.5, 1.5) * X_bandgap[feat]
y_bandgap = np.abs(y_bandgap) + np.random.normal(0, 0.3, n_materials)
y_bandgap = np.clip(y_bandgap, 0, 6) # Range: 0-6 eV
print("=== Bandgap Prediction Dataset ===")
print(f"Number of materials: {n_materials}")
print(f"Number of features: {X_bandgap.shape[1]}")
print(f"Objective: Reduce from 200 to 20 dimensions")
Step 1: Filter Method to Reduce to 100 Dimensions
# Select top 100 using mutual information
selector_mi = SelectKBest(mutual_info_regression, k=100)
X_filtered = selector_mi.fit_transform(X_bandgap, y_bandgap)
selected_features_100 = X_bandgap.columns[selector_mi.get_support()]
print(f"\nStep 1: Filter method (mutual information)")
print(f"200 dimensions → {X_filtered.shape[1]} dimensions")
Step 2: PCA to Reduce to 50 Dimensions
# Apply PCA
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_filtered)
# Cumulative explained variance
cumsum_var = np.cumsum(pca.explained_variance_ratio_)
# Find number of components for 90% cumulative explained variance
n_components_90 = np.argmax(cumsum_var >= 0.90) + 1
plt.figure(figsize=(10, 6))
plt.plot(range(1, 51), cumsum_var * 100, 'b-', linewidth=2)
plt.axhline(y=90, color='red', linestyle='--',
label='90% cumulative explained variance', linewidth=2)
plt.axvline(x=n_components_90, color='green', linestyle='--',
label=f'{n_components_90} components for 90%', linewidth=2)
plt.xlabel('Number of principal components', fontsize=12)
plt.ylabel('Cumulative explained variance (%)', fontsize=12)
plt.title('Dimensionality Reduction by PCA', fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nStep 2: PCA")
print(f"100 dimensions → {X_pca.shape[1]} dimensions")
print(f"90% cumulative explained variance achieved at: {n_components_90} dimensions")
Step 3: Random Forest Importance to Reduce to 20 Dimensions
# Train Random Forest on 50-dimensional PCA features
X_pca_df = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(50)])
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_pca_df, y_bandgap)
# Select top 20
importances = rf.feature_importances_
top_20_idx = np.argsort(importances)[-20:]
X_final = X_pca_df.iloc[:, top_20_idx]
# Visualize importances
plt.figure(figsize=(12, 6))
plt.bar(range(50), importances, color='steelblue', alpha=0.7)
plt.bar(top_20_idx, importances[top_20_idx],
color='coral', alpha=0.9, label='Selected 20 dimensions')
plt.xlabel('Principal component index', fontsize=12)
plt.ylabel('Feature Importance', fontsize=12)
plt.title('Final Selection by Random Forest', fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nStep 3: Random Forest Importance")
print(f"50 dimensions → {X_final.shape[1]} dimensions")
print(f"\nFinal selected principal components:")
print(X_final.columns.tolist())
Step 4: Validate Prediction Performance
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, r2_score
def evaluate_dimension_reduction(X, y, name):
"""
Evaluate prediction performance after dimensionality reduction
"""
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Cross-validation
cv_scores = cross_val_score(
model, X, y, cv=5,
scoring='neg_mean_absolute_error'
)
cv_mae = -cv_scores.mean()
return {
'name': name,
'dimensions': X.shape[1],
'MAE': mae,
'R2': r2,
'CV_MAE': cv_mae
}
# Evaluate performance at each stage
results = []
# Original data (200 dimensions)
results.append(evaluate_dimension_reduction(X_bandgap, y_bandgap, 'Original'))
# After Filter (100 dimensions)
X_filtered_df = pd.DataFrame(X_filtered, columns=selected_features_100)
results.append(evaluate_dimension_reduction(X_filtered_df, y_bandgap, 'After Filter'))
# After PCA (50 dimensions)
results.append(evaluate_dimension_reduction(X_pca_df, y_bandgap, 'After PCA'))
# Final (20 dimensions)
results.append(evaluate_dimension_reduction(X_final, y_bandgap, 'Final'))
# Display results
results_df = pd.DataFrame(results)
print("\n=== Impact of Dimensionality Reduction ===")
print(results_df.to_string(index=False))
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# MAE comparison
axes[0].bar(results_df['name'], results_df['MAE'],
color=['gray', 'steelblue', 'coral', 'green'], alpha=0.7)
axes[0].set_ylabel('MAE (eV)', fontsize=12)
axes[0].set_title('Prediction Error (MAE)', fontsize=13, fontweight='bold')
axes[0].tick_params(axis='x', rotation=15)
axes[0].grid(axis='y', alpha=0.3)
# R² comparison
axes[1].bar(results_df['name'], results_df['R2'],
color=['gray', 'steelblue', 'coral', 'green'], alpha=0.7)
axes[1].set_ylabel('R²', fontsize=12)
axes[1].set_ylim(0, 1)
axes[1].set_title('Coefficient of Determination (R²)', fontsize=13, fontweight='bold')
axes[1].tick_params(axis='x', rotation=15)
axes[1].grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nPerformance retention rate (20 dims vs 200 dims):")
print(f"R² retention: {results_df.iloc[3]['R2'] / results_df.iloc[0]['R2'] * 100:.1f}%")
print(f"Dimensionality reduction: {(1 - 20/200) * 100:.0f}%")
Physical Interpretation
# Analyze correspondence between selected 20 dims and original descriptors
def interpret_selected_components(pca_model, original_features, selected_pcs):
"""
Provide physical interpretation of selected principal components
"""
# PCA loadings
loadings = pca_model.components_.T
interpretations = []
for pc_name in selected_pcs:
pc_idx = int(pc_name.replace('PC', '')) - 1
# Original features with large contributions to this component
loading_vector = np.abs(loadings[:, pc_idx])
top_5_idx = np.argsort(loading_vector)[-5:]
top_features = [original_features[i] for i in top_5_idx]
top_loadings = loading_vector[top_5_idx]
interpretations.append({
'PC': pc_name,
'Top_Features': top_features,
'Loadings': top_loadings
})
return interpretations
# Execute interpretation
selected_pc_names = X_final.columns.tolist()
interpretations = interpret_selected_components(
pca,
selected_features_100.tolist(),
selected_pc_names[:5] # Display first 5 only
)
print("\n=== Physical Interpretation of Selected Components ===")
for interp in interpretations:
print(f"\n{interp['PC']}:")
for feat, loading in zip(interp['Top_Features'], interp['Loadings']):
print(f" {feat}: {loading:.3f}")
2.6 Practical Guide for Feature Engineering
Version Control for matminer Library
# Check matminer and pymatgen versions
import matminer
import pymatgen
print("=== Feature Engineering Environment ===")
print(f"matminer: {matminer.__version__}")
print(f"pymatgen: {pymatgen.__version__}")
# Recommended versions
print("\n【Recommended Environment】")
recommended = {
'matminer': '0.9.0',
'pymatgen': '2023.9.10',
'scikit-learn': '1.3.0',
'numpy': '1.24.3'
}
for package, version in recommended.items():
print(f"{package}>={version}")
print("\n【Installation Command】")
print("```bash")
print("pip install matminer==0.9.0 pymatgen==2023.9.10")
print("```")
print("\n【Important Notes】")
print("⚠️ matminer presets (magpie, deml, etc.) change feature count by version")
print("⚠️ Always specify version when reproducing papers")
Using Benchmark Datasets
# Common benchmark datasets in materials science
benchmark_datasets = pd.DataFrame({
'Dataset': [
'Matbench',
'JARVIS-DFT',
'Materials Project',
'OQMD',
'Expt Gap (Zhuo et al.)'
],
'Task': [
'13 types of regression/classification',
'Property prediction for 55,000 materials',
'Bandgap, formation energy',
'Stability prediction, phase diagrams',
'Experimental bandgap (6,354 materials)'
],
'Sample Count': [
'Hundreds to tens of thousands',
'55,000+',
'150,000+',
'1,000,000+',
'6,354'
],
'Purpose': [
'Model performance comparison',
'GNN, deep learning',
'General materials exploration',
'Stability assessment',
'Experimental data validation'
],
'URL': [
'https://matbench.materialsproject.org/',
'https://jarvis.nist.gov/',
'https://materialsproject.org/',
'http://oqmd.org/',
'DOI: 10.1021/acs.jpclett.8b00124'
]
})
print("=== Benchmark Datasets ===")
print(benchmark_datasets.to_string(index=False))
print("\n【Usage Example: Matbench】")
print("```python")
print("from matbench.bench import MatbenchBenchmark")
print("mb = MatbenchBenchmark(autoload=False)")
print("for task in mb.tasks:")
print(" task.load()")
print(" print(f'{task.dataset_name}: {len(task.df)} samples')")
print("```")
Practical Pitfalls in Feature Engineering
print("=== Pitfalls in Feature Engineering ===\n")
print("【Pitfall 1: Target Leakage】")
print("❌ Bad practice: Generate features from target variable")
print("```python")
print("# In bandgap prediction, using values directly related to bandgap as features")
print("X['bandgap_proxy'] = y_bandgap * 0.9 + noise # NG!")
print("```")
print("\n✅ Correct practice: Use only independent physical properties")
print("```python")
print("# Use composition, structure, electronic structure descriptors independent of target")
print("X_features = generate_composition_features(formulas)")
print("```")
print("\n【Pitfall 2: Inconsistent Feature Scales】")
print("⚠️ Mixing lattice constants (3-7 Å) and electrical conductivity (10³-10⁶ S/m)")
print("→ Performance degradation in distance-based algorithms (KNN, SVM)")
print("\n✅ Solution: Normalize with StandardScaler")
print("```python")
print("from sklearn.preprocessing import StandardScaler")
print("scaler = StandardScaler()")
print("X_scaled = scaler.fit_transform(X)")
print("```")
print("\n【Pitfall 3: Information Loss in Dimensionality Reduction】")
print("⚠️ PCA with 95% explained variance → remaining 5% may contain important information")
print("\n✅ Solution: Compare performance at multiple variance levels")
print("```python")
print("for var_ratio in [0.90, 0.95, 0.99]:")
print(" pca = PCA(n_components=var_ratio)")
print(" X_pca = pca.fit_transform(X)")
print(" # Evaluate model performance")
print("```")
print("\n【Pitfall 4: Uncritical Use of matminer Presets】")
print("⚠️ Using magpie preset (132 features) directly")
print("→ Many redundant and irrelevant features")
print("\n✅ Solution: Carefully select features based on domain knowledge")
print("```python")
print("# Select only relevant features for bandgap prediction")
print("relevant_features = [")
print(" 'MagpieData mean Electronegativity',")
print(" 'MagpieData range Electronegativity',")
print(" 'MagpieData mean NValence'")
print("]")
print("X_selected = X_all[relevant_features]")
print("```")
print("\n【Pitfall 5: Non-normalized Composition Descriptors】")
print("⚠️ Li₀.₉CoO₂ and LiCoO₂ have different feature values")
print("→ Composition normalization is necessary")
print("\n✅ Solution: Normalize composition to sum = 1")
print("```python")
print("from pymatgen.core import Composition")
print("comp = Composition('Li0.9CoO2')")
print("comp_normalized = comp.fractional_composition # Li₀.₃₁Co₀.₃₄O₀.₆₉")
print("```")
Exercises
Problem 1 (Difficulty: Easy)
Using matminer, generate material descriptors for chemical formulas "Fe2O3" and "TiO2" and compare which descriptors differ the most.
Hints
1. Use `ElementProperty.from_preset("magpie")`
2. Generate features for each formula
3. Calculate absolute differences and display the top 10
Sample Solution
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition
# Generate descriptors
ep_feat = ElementProperty.from_preset("magpie")
comp1 = Composition("Fe2O3")
comp2 = Composition("TiO2")
features1 = ep_feat.featurize(comp1)
features2 = ep_feat.featurize(comp2)
feature_names = ep_feat.feature_labels()
# Calculate differences
df_comparison = pd.DataFrame({
'Feature': feature_names,
'Fe2O3': features1,
'TiO2': features2,
'Difference': np.abs(np.array(features1) - np.array(features2))
})
df_comparison_sorted = df_comparison.sort_values(
'Difference', ascending=False
)
print("Most different descriptors (top 10):")
print(df_comparison_sorted.head(10).to_string(index=False))
Problem 2 (Difficulty: Medium)
Compare PCA and UMAP for reducing high-dimensional material data to 2D, and analyze which method better visualizes cluster structure.
Sample Solution
from sklearn.decomposition import PCA
from umap import UMAP
from sklearn.datasets import make_blobs
# Generate high-dimensional data with cluster structure
X, y = make_blobs(n_samples=300, n_features=50,
centers=3, random_state=42)
# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# UMAP
umap_model = UMAP(n_components=2, random_state=42)
X_umap = umap_model.fit_transform(X)
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1],
c=y, cmap='viridis', s=50, alpha=0.6)
axes[0].set_title('PCA', fontsize=12, fontweight='bold')
axes[1].scatter(X_umap[:, 0], X_umap[:, 1],
c=y, cmap='viridis', s=50, alpha=0.6)
axes[1].set_title('UMAP', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()
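To go beyond visual inspection, cluster separation in each 2D embedding can be quantified, for example with the silhouette score. A minimal sketch continuing from the solution above (higher is better); this is one possible metric, not the only valid choice.
```python
from sklearn.metrics import silhouette_score

# Compare cluster separation in the two embeddings
score_pca = silhouette_score(X_pca, y)
score_umap = silhouette_score(X_umap, y)
print(f"Silhouette score (PCA):  {score_pca:.3f}")
print(f"Silhouette score (UMAP): {score_umap:.3f}")
```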
Problem 3 (Difficulty: Hard)
Apply four feature selection methods (Filter, Wrapper, Embedded, SHAP-based) to the same dataset, select top 10 features with each, and analyze the overlap between selections.
Sample Solution
# Requirements:
# - Python 3.9+
# - shap>=0.42.0
"""
Example: Apply four feature selection methods (Filter, Wrapper, Embedded, SHAP-based)
Purpose: Demonstrate machine learning model training and evaluation
Target: Advanced
Execution time: 5-10 seconds
Dependencies: numpy, pandas, scikit-learn, shap
"""
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
import shap
# Sample data
np.random.seed(42)
X = pd.DataFrame(np.random.randn(200, 30),
columns=[f'feat_{i}' for i in range(30)])
y = (2*X['feat_0'] + 3*X['feat_5'] - X['feat_10'] +
np.random.normal(0, 0.5, 200))
# 1. Filter method
selector_filter = SelectKBest(f_regression, k=10)
selector_filter.fit(X, y)
selected_filter = set(X.columns[selector_filter.get_support()])
# 2. Wrapper method (RFE)
model_rfe = RandomForestRegressor(n_estimators=50, random_state=42)
selector_rfe = RFE(model_rfe, n_features_to_select=10)
selector_rfe.fit(X, y)
selected_rfe = set(X.columns[selector_rfe.support_])
# 3. Embedded method (Lasso)
lasso = Lasso(alpha=0.1, random_state=42)
lasso.fit(X, y)
lasso_coefs = np.abs(lasso.coef_)
top_10_lasso_idx = np.argsort(lasso_coefs)[-10:]
selected_lasso = set(X.columns[top_10_lasso_idx])
# 4. SHAP-based
rf_shap = RandomForestRegressor(n_estimators=100, random_state=42)
rf_shap.fit(X, y)
explainer = shap.TreeExplainer(rf_shap)
shap_values = explainer.shap_values(X)
mean_abs_shap = np.abs(shap_values).mean(axis=0)
top_10_shap_idx = np.argsort(mean_abs_shap)[-10:]
selected_shap = set(X.columns[top_10_shap_idx])
# Overlap analysis
all_methods = {
'Filter': selected_filter,
'Wrapper': selected_rfe,
'Embedded': selected_lasso,
'SHAP': selected_shap
}
# Compute overlap
common_all = selected_filter & selected_rfe & selected_lasso & selected_shap
print("Features selected by each method:")
for method, features in all_methods.items():
print(f"{method}: {sorted(features)}")
print(f"\nCommon features across all methods: {sorted(common_all)}")
print(f"Overlap rate: {len(common_all)} / 10")
Summary
In this chapter, we learned practical feature engineering techniques.
Key Points:
- Material Descriptors: Efficiently generate composition, structure, and electronic structure descriptors using matminer
- Feature Transformations: Capture nonlinear relationships through normalization, log transformation, and polynomial features
- Dimensionality Reduction: Achieve visualization and computational efficiency with PCA, t-SNE, and UMAP
- Feature Selection: Improve accuracy and interpretability by combining Filter, Wrapper, Embedded, and SHAP-based methods, trading computational cost for selection quality
- Case Study: Effectively reduce 200 dimensions to 20 while maintaining performance and improving interpretability
- Library Management: Ensure reproducibility through matminer and pymatgen version control
- Benchmark Data: Leverage standard datasets like Matbench and JARVIS-DFT
- Practical Pitfalls: Avoid target leakage, scale inconsistency, information loss, and composition normalization issues
Next Chapter Preview: In Chapter 3, we will learn optimal model selection and hyperparameter optimization. We will maximize prediction accuracy through automated optimization using Optuna and ensemble learning.
Chapter 2 Checklist
Feature Generation (Composition Descriptors)
- [ ] matminer Usage
- [ ] ElementProperty.from_preset("magpie") generates 132 features
- [ ] Stoichiometry() for stoichiometric descriptors (7 features)
- [ ] ValenceOrbital() for valence orbital descriptors (10 features)
- [ ] Record versions (matminer 0.9.0, pymatgen 2023.9.10)
- [ ] Composition Normalization
- [ ] Normalize to sum = 1 using Composition.fractional_composition
- [ ] Standardize treatment of Li₀.₉CoO₂ and LiCoO₂
- [ ] Proper representation of dopant concentrations
- [ ] Domain Knowledge Integration
- [ ] Understand chemical significance of electronegativity, atomic radius, valence electrons
- [ ] Prioritize electronic structure descriptors for bandgap prediction
- [ ] Prioritize crystal structure descriptors for mechanical property prediction
Feature Generation (Structure and Electronic Structure Descriptors)
- [ ] Crystal Structure Descriptors
- [ ] Lattice constants (a, b, c), lattice angles (α, β, γ)
- [ ] Volume, packing density
- [ ] Space group, symmetry
- [ ] Electronic Structure Descriptors
- [ ] Bandgap, Fermi energy
- [ ] Density of states (valence band, conduction band)
- [ ] Effective mass (electron, hole)
- [ ] DFT Calculation Data Integration
- [ ] Fetch formation energy from Materials Project API
- [ ] Fetch stability data from OQMD
- [ ] Record computational conditions (functional, cutoff energy)
Feature Transformations
- [ ] Normalization
- [ ] Use StandardScaler (mean 0, std 1) by default
- [ ] MinMaxScaler (0-1 range) when interpretability is priority
- [ ] Fit **after** train/test split to prevent data leakage (see the sketch after this checklist)
- [ ] Log Transformation
- [ ] Apply to features with large order of magnitude (electrical conductivity)
- [ ] Use log1p(log(1+x)) to handle zero values
- [ ] Verify skewness is reduced to -0.5~0.5 after transformation
- [ ] Polynomial Features
- [ ] 2nd degree captures nonlinear relationships
- [ ] Interaction terms (x₁×x₂) make interactions explicit
- [ ] 3rd degree and higher risk overfitting → use carefully
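As referenced in the checklist above, here is a minimal sketch of leakage-free scaling: the scaler is fitted on the training split only, and its statistics are reused for the test split. The data here are synthetic placeholders.
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))   # placeholder feature matrix
y = rng.normal(size=200)        # placeholder target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std
```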
Dimensionality Reduction
- [ ] PCA (Principal Component Analysis)
- [ ] Target cumulative explained variance 90-95%
- [ ] Interpret through principal component loadings
- [ ] Visualize scree plot (explained variance decay)
- [ ] t-SNE / UMAP
- [ ] Effective for cluster visualization (2D plots)
- [ ] Do not use as features for prediction models (nonlinear, irreversible)
- [ ] Tune perplexity (t-SNE) and n_neighbors (UMAP); see the sketch after this checklist
- [ ] LDA (Linear Discriminant Analysis)
- [ ] Maximizes class separation for classification
- [ ] Maximum dimensions: min(n_features, n_classes-1)
- [ ] Pay attention to generalization in supervised setting
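As noted in the checklist, embeddings are sensitive to perplexity (t-SNE) and n_neighbors (UMAP). A minimal parameter-sweep sketch on placeholder data; in practice each embedding would be plotted or scored rather than just printed.
```python
import numpy as np
from sklearn.manifold import TSNE
from umap import UMAP

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 50))  # placeholder high-dimensional data

# t-SNE: larger perplexity emphasizes more global structure
for perplexity in [5, 30, 50]:
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=42).fit_transform(X)
    print(f"t-SNE perplexity={perplexity}: embedding shape {emb.shape}")

# UMAP: larger n_neighbors emphasizes more global structure
for n_neighbors in [5, 15, 50]:
    emb = UMAP(n_components=2, n_neighbors=n_neighbors, random_state=42).fit_transform(X)
    print(f"UMAP n_neighbors={n_neighbors}: embedding shape {emb.shape}")
```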
Feature Selection
- [ ] Filter Method
- [ ] VarianceThreshold to remove low-variance features
- [ ] SelectKBest + f_regression (F-statistic)
- [ ] SelectKBest + mutual_info_regression (Mutual information)
- [ ] Fast computation but does not consider feature interactions
- [ ] Wrapper Method
- [ ] RFE (Recursive Feature Elimination)
- [ ] High computational cost but high accuracy
- [ ] Effective for reduction to 10-20 features
- [ ] Embedded Method
- [ ] Lasso (L1 regularization) automatically removes zero coefficients
- [ ] Random Forest feature_importances_
- [ ] LightGBM feature_importances_
- [ ] Feature selection integrated into training process
- [ ] SHAP-based Selection
- [ ] Rank by mean(|SHAP values|)
- [ ] Most reliable for global interpretation
- [ ] Moderate computational cost
Avoiding Practical Pitfalls (Feature Engineering)
- [ ] Prevent Target Leakage
- [ ] Do not generate features from target variable
- [ ] Do not directly use DFT values (bandgap) as features
- [ ] Use only composition, structure, and other independent descriptors
- [ ] Unify Scales
- [ ] Normalize lattice constants (Å) and electrical conductivity (S/m)
- [ ] Especially critical for distance-based models (KNN, SVM)
- [ ] Recommended even for tree models (RF, XGBoost), though not strictly required
- [ ] Monitor Information Loss
- [ ] Compare PCA at multiple variance levels (90%, 95%, 99%)
- [ ] Evaluate model performance before and after reduction
- [ ] Increase dimensions if performance drops from over-reduction
- [ ] Consider matminer Presets Carefully
- [ ] magpie (132 features) may be excessive in some cases
- [ ] Select relevant features using domain knowledge
- [ ] Remove redundant features using the correlation matrix (see the sketch after this checklist)
- [ ] Composition Normalization
- [ ] Normalize so composition sums to 1
- [ ] Standardize chemical formula notation (Li₀.₉CoO₂ vs LiCoO₂)
- [ ] Use Composition.fractional_composition
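A minimal sketch of the correlation-based pruning mentioned in the checklist above: drop one feature from every pair whose absolute correlation exceeds a threshold. The DataFrame and the 0.9 threshold are illustrative.
```python
import numpy as np
import pandas as pd

def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from each pair whose |correlation| exceeds the threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

# Synthetic example with one nearly duplicated column
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
X_demo["a_copy"] = X_demo["a"] + rng.normal(scale=0.01, size=100)
print(drop_highly_correlated(X_demo).columns.tolist())  # "a_copy" is removed
```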
Using Benchmark Datasets
- [ ] Matbench
- [ ] 13 tasks for standard performance evaluation
- [ ] Common baseline for paper comparisons
- [ ] Validate model selection appropriateness
- [ ] JARVIS-DFT
- [ ] 55,000 materials with DFT-computed properties
- [ ] Optimal for GNN and Transformer training
- [ ] Rich structure descriptors
- [ ] Experimental Datasets
- [ ] Expt Gap (experimental bandgap, 6,354 materials)
- [ ] Validate DFT vs experimental gap
- [ ] Assess practical model applicability
Feature Engineering Quality Metrics
- [ ] Feature Quality
- [ ] Missing rate < 5% (per feature)
- [ ] Variance > 0.01 (remove low-variance features)
- [ ] Correlation |r| < 0.9 (reduce highly correlated pairs; see the sketch after this checklist)
- [ ] Dimensionality Reduction Effectiveness
- [ ] PCA explained variance ≥ 90%
- [ ] Dimensionality reduction ratio 50-90%
- [ ] Performance retention ratio ≥ 95% (MAE, R², etc.)
- [ ] Feature Selection Effectiveness
- [ ] Selected features ≤ 20% of original count
- [ ] Maintain or improve performance (prevent overfitting)
- [ ] Training time reduction ≥ 30%
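The thresholds in this checklist can be verified programmatically. A minimal sketch of a quality report over a feature DataFrame; the synthetic `X_demo` and the exact thresholds are placeholders.
```python
import numpy as np
import pandas as pd

def feature_quality_report(X, max_missing=0.05, min_variance=0.01, max_corr=0.9):
    """Count features/pairs violating the missing-rate, variance, and correlation thresholds."""
    missing_rate = X.isna().mean()
    variances = X.var()
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    n_high_corr_pairs = int((upper > max_corr).sum().sum())

    print(f"Features with missing rate > {max_missing:.0%}: {(missing_rate > max_missing).sum()}")
    print(f"Features with variance < {min_variance}: {(variances < min_variance).sum()}")
    print(f"Feature pairs with |r| > {max_corr}: {n_high_corr_pairs}")

rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(200, 10)),
                      columns=[f"feat_{i}" for i in range(10)])
feature_quality_report(X_demo)
```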
Ensuring Reproducibility
- [ ] Version Management
- [ ] Document matminer and pymatgen versions
- [ ] Record preset names (magpie, deml, etc.)
- [ ] Timestamp feature generation
- [ ] Feature List Preservation
- [ ] Save all generated feature names to CSV
- [ ] Document feature meanings and descriptions
- [ ] Record selected feature subsets
- [ ] Save Transformation Parameters (see the sketch after this checklist)
- [ ] Pickle StandardScaler mean, std for reuse
- [ ] Save PCA components_
- [ ] Apply identical transformations to new data
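A minimal persistence sketch for the items above, using joblib (installed alongside scikit-learn): the fitted scaler and PCA objects are saved so that exactly the same transformation can be applied to new data. File names and data are illustrative.
```python
import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 30))  # placeholder training features

# Fit the transformations on training data
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=10).fit(scaler.transform(X_train))

# Persist the fitted objects (mean_, scale_, components_, etc. are stored inside)
joblib.dump(scaler, "scaler.joblib")
joblib.dump(pca, "pca.joblib")

# Later: reload and apply the identical transformation to new data
scaler_loaded = joblib.load("scaler.joblib")
pca_loaded = joblib.load("pca.joblib")
X_new = rng.normal(size=(5, 30))
X_new_reduced = pca_loaded.transform(scaler_loaded.transform(X_new))
print(X_new_reduced.shape)  # (5, 10)
```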
References
- Ward, L., Dunn, A., Faghaninia, A., et al. (2018). Matminer: An open source toolkit for materials data mining. Computational Materials Science, 152, 60-69. DOI: 10.1016/j.commatsci.2018.05.018
- Jolliffe, I. T. & Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202. DOI: 10.1098/rsta.2015.0202
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
- Guyon, I. & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.