Chapter 1: Data Collection Strategy and Cleaning
This chapter covers data collection strategy and cleaning for materials informatics: from understanding the characteristics of material data (small scale, imbalanced, noisy) to building a complete cleaning workflow.
Learning Objectives
By reading this chapter, you will learn:
- ✅ Understanding material data characteristics (small scale, imbalanced, noisy) and challenges
- ✅ Practical application of Design of Experiments (DOE) and Latin Hypercube Sampling
- ✅ Appropriate selection of missing value imputation methods (Simple/KNN/MICE)
- ✅ Application of outlier detection algorithms (Isolation Forest, LOF, DBSCAN)
- ✅ Practical data cleaning using thermoelectric material datasets
1.1 Characteristics of Material Data
Data in materials science has characteristics that differ from general big data.
Small Scale and Imbalanced Data Problems
Characteristics:
- Limited sample size: Experimental work is time-consuming and expensive, resulting in datasets with typically tens to thousands of samples
- Class imbalance: Data distribution skewed toward specific compositions or conditions
- Curse of dimensionality: Few samples relative to the number of descriptors (features)
💻 Code Example 1: Material Dataset Scale Analysis
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - seaborn>=0.12.0
"""
Example: 💻 Code Example 1: Material Dataset Scale Analysis
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: numpy, pandas, matplotlib, seaborn
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Typical sizes of material datasets
datasets_info = {
'Material Type': ['Thermoelectric', 'Band Gap', 'Superconductor',
'Catalyst', 'Battery Materials'],
'Sample Count': [312, 1563, 89, 487, 253],
'Feature Count': [45, 128, 67, 93, 112]
}
df_info = pd.DataFrame(datasets_info)
df_info['Sample/Feature Ratio'] = (
df_info['Sample Count'] / df_info['Feature Count']
)
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Sample Count vs Feature Count
axes[0].scatter(df_info['Feature Count'], df_info['Sample Count'],
s=100, alpha=0.6, c='steelblue')
for idx, row in df_info.iterrows():
axes[0].annotate(row['Material Type'],
(row['Feature Count'], row['Sample Count']),
fontsize=9, ha='right')
axes[0].plot([0, 150], [0, 150], 'r--',
label='Sample Count = Feature Count', alpha=0.5)
axes[0].set_xlabel('Feature Count', fontsize=12)
axes[0].set_ylabel('Sample Count', fontsize=12)
axes[0].set_title('Scale of Material Data', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
# Sample/Feature Ratio
axes[1].barh(df_info['Material Type'],
df_info['Sample/Feature Ratio'],
color='coral', alpha=0.7)
axes[1].axvline(x=10, color='red', linestyle='--',
label='Recommended Minimum Ratio (10:1)', linewidth=2)
axes[1].set_xlabel('Sample Count / Feature Count', fontsize=12)
axes[1].set_title('Data Adequacy', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
print("Typical characteristics of material data:")
print(f"Average sample count: {df_info['Sample Count'].mean():.0f}")
print(f"Average feature count: {df_info['Feature Count'].mean():.0f}")
print(f"Average sample/feature ratio: {df_info['Sample/Feature Ratio'].mean():.2f}")
print("\n⚠️ Many material datasets fall below the recommended 10:1 ratio")
Output:
Typical characteristics of material data:
Average sample count: 541
Average feature count: 89
Average sample/feature ratio: 7.36
⚠️ Many material datasets fall below the recommended 10:1 ratio
Noise and Outliers
Material experimental data contains several noise sources: random measurement noise, systematic instrument bias, and sporadic gross errors that appear as outliers. The following example simulates all three:
💻 Code Example 2: Noise and Outliers Visualization
# Visualize types of noise and their effects
np.random.seed(42)
# True relationship (Band Gap vs Lattice Constant)
n_samples = 100
lattice_constant = np.linspace(3.5, 6.5, n_samples)
bandgap_true = 2.5 * np.exp(-0.3 * (lattice_constant - 4))
# Add various types of noise
measurement_noise = np.random.normal(0, 0.1, n_samples)
systematic_bias = 0.2 # Systematic error in measurement device
outliers_idx = np.random.choice(n_samples, 5, replace=False)
bandgap_measured = bandgap_true + measurement_noise + systematic_bias
bandgap_measured[outliers_idx] += np.random.uniform(0.5, 1.5, 5)
# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(lattice_constant, bandgap_true, 'b-',
linewidth=2, label='True Relationship', alpha=0.7)
ax.scatter(lattice_constant, bandgap_measured,
c='gray', s=50, alpha=0.5, label='Measured Values (with noise)')
ax.scatter(lattice_constant[outliers_idx],
bandgap_measured[outliers_idx],
c='red', s=100, marker='X',
label='Outliers', zorder=10)
ax.set_xlabel('Lattice Constant (Å)', fontsize=12)
ax.set_ylabel('Band Gap (eV)', fontsize=12)
ax.set_title('Noise and Outliers in Material Data',
fontsize=13, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Noise statistics
print("Noise analysis:")
print(f"Measurement noise standard deviation: {measurement_noise.std():.3f} eV")
print(f"Systematic bias: {systematic_bias:.3f} eV")
print(f"Number of outliers: {len(outliers_idx)} / {n_samples}")
print(f"Mean outlier deviation: "
f"{(bandgap_measured[outliers_idx] - bandgap_true[outliers_idx]).mean():.3f} eV")
Data Reliability Assessment
Quantitative metrics for assessing data quality:
💻 Code Example 3: Data Quality Assessment Function
def assess_data_quality(data, true_values=None):
"""
Assess data quality
Parameters:
-----------
data : array-like
Measured data
true_values : array-like, optional
True values (if known)
Returns:
--------
dict : Quality metrics
"""
quality_metrics = {}
# Basic statistics
quality_metrics['mean'] = np.mean(data)
quality_metrics['std'] = np.std(data)
quality_metrics['cv'] = np.std(data) / np.mean(data) # Coefficient of variation
# Outlier ratio (IQR method)
Q1, Q3 = np.percentile(data, [25, 75])
IQR = Q3 - Q1
outliers = (data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)
quality_metrics['outlier_ratio'] = outliers.sum() / len(data)
# Comparison with true values (if known)
if true_values is not None:
quality_metrics['mae'] = np.mean(np.abs(data - true_values))
quality_metrics['rmse'] = np.sqrt(
np.mean((data - true_values)**2)
)
quality_metrics['r2'] = 1 - (
np.sum((data - true_values)**2) /
np.sum((true_values - np.mean(true_values))**2)
)
return quality_metrics
# Run evaluation
quality = assess_data_quality(bandgap_measured, bandgap_true)
print("Data quality assessment:")
print(f"Mean: {quality['mean']:.3f} eV")
print(f"Standard deviation: {quality['std']:.3f} eV")
print(f"Coefficient of variation: {quality['cv']:.3f}")
print(f"Outlier ratio: {quality['outlier_ratio']:.1%}")
print(f"\nComparison with true values:")
print(f"MAE: {quality['mae']:.3f} eV")
print(f"RMSE: {quality['rmse']:.3f} eV")
print(f"R²: {quality['r2']:.3f}")
Data Types: Experimental, Computational, and Literature
💻 Code Example 4: Data Source Characteristics Comparison
# Characteristics of different data sources
data_sources = pd.DataFrame({
'Data Source': ['Experimental', 'DFT Calculation', 'Literature', 'Integrated'],
'Sample Count': [150, 500, 300, 950],
'Accuracy': [0.85, 0.95, 0.75, 0.80],
'Cost (Relative)': [10, 3, 1, 4],
'Acquisition Time (Days)': [30, 7, 3, 15]
})
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Sample Count
axes[0,0].bar(data_sources['Data Source'],
data_sources['Sample Count'],
color=['#FF6B6B', '#4ECDC4', '#FFE66D', '#95E1D3'])
axes[0,0].set_ylabel('Sample Count', fontsize=11)
axes[0,0].set_title('Data Volume', fontsize=12, fontweight='bold')
axes[0,0].grid(axis='y', alpha=0.3)
# Accuracy
axes[0,1].bar(data_sources['Data Source'],
data_sources['Accuracy'],
color=['#FF6B6B', '#4ECDC4', '#FFE66D', '#95E1D3'])
axes[0,1].set_ylabel('Accuracy', fontsize=11)
axes[0,1].set_ylim(0, 1)
axes[0,1].set_title('Data Accuracy', fontsize=12, fontweight='bold')
axes[0,1].grid(axis='y', alpha=0.3)
# Cost
axes[1,0].bar(data_sources['Data Source'],
data_sources['Cost (Relative)'],
color=['#FF6B6B', '#4ECDC4', '#FFE66D', '#95E1D3'])
axes[1,0].set_ylabel('Relative Cost', fontsize=11)
axes[1,0].set_title('Acquisition Cost', fontsize=12, fontweight='bold')
axes[1,0].grid(axis='y', alpha=0.3)
# Acquisition Time
axes[1,1].bar(data_sources['Data Source'],
data_sources['Acquisition Time (Days)'],
color=['#FF6B6B', '#4ECDC4', '#FFE66D', '#95E1D3'])
axes[1,1].set_ylabel('Days', fontsize=11)
axes[1,1].set_title('Acquisition Time', fontsize=12, fontweight='bold')
axes[1,1].grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
print("\nCharacteristics of each data source:")
print(data_sources.to_string(index=False))
1.2 Data Collection Strategy
Strategic approaches for efficient data collection.
Design of Experiments (DOE)
Purpose: Extract maximum information from limited experimental runs
💻 Code Example 5: Full and Fractional Factorial Design
from scipy.stats import qmc
def full_factorial_design(factors, levels):
"""
Full Factorial Design
Parameters:
-----------
factors : list of str
List of factor names
levels : list of list
List of levels for each factor
Returns:
--------
pd.DataFrame : Experimental design table
"""
import itertools
# Generate all combinations
combinations = list(itertools.product(*levels))
df = pd.DataFrame(combinations, columns=factors)
return df
# Example: Thermoelectric material synthesis condition optimization
factors = ['Temperature(°C)', 'Pressure(GPa)', 'Time(h)']
levels = [
[600, 800, 1000], # Temperature
[1, 3, 5], # Pressure
[2, 6, 12] # Time
]
design_full = full_factorial_design(factors, levels)
print(f"Full factorial design: {len(design_full)} experiments")
print("\nFirst 10 experiments:")
print(design_full.head(10))
# Fractional Factorial Design (reduces experiment count)
def fractional_factorial_design(factors, levels, fraction=0.5):
"""
Fractional Factorial Design (reduces experiment count)
"""
full_design = full_factorial_design(factors, levels)
n_experiments = int(len(full_design) * fraction)
# Random sampling (more sophisticated selection methods exist)
sampled_idx = np.random.choice(
len(full_design), n_experiments, replace=False
)
return full_design.iloc[sampled_idx].reset_index(drop=True)
design_frac = fractional_factorial_design(factors, levels, fraction=0.33)
print(f"\nFractional factorial design: {len(design_frac)} experiments "
f"(reduction: {(1-len(design_frac)/len(design_full)):.1%})")
print(design_frac.head(10))
Output:
Full factorial design: 27 experiments
First 10 experiments:
Temperature(°C) Pressure(GPa) Time(h)
0 600 1 2
1 600 1 6
2 600 1 12
3 600 3 2
...
Fractional factorial design: 9 experiments (reduction: 66.7%)
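Random subsampling, as used above, does not preserve the balance properties of a true fractional design. For real studies, a generator-based 2-level fractional factorial is preferable. The sketch below is a minimal illustration assuming the third-party pyDOE2 package (`pip install pyDOE2`) and hypothetical low/high levels taken from the factors above.
from pyDOE2 import fracfact

# 2^(3-1) fractional factorial: the third factor is aliased with the
# AB interaction (generator string 'a b ab'), giving 4 runs instead of 8
design_coded = fracfact('a b ab')  # coded levels -1 / +1

# Map coded levels to hypothetical low/high physical values per factor
level_map = {
    'Temperature(°C)': (600, 1000),
    'Pressure(GPa)': (1, 5),
    'Time(h)': (2, 12)
}
df_2level = pd.DataFrame(design_coded, columns=list(level_map.keys()))
for col, (low, high) in level_map.items():
    df_2level[col] = np.where(df_2level[col] < 0, low, high)
print(df_2level)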
Latin Hypercube Sampling
Advantage: Efficiently covers the entire exploration space
💻 Code Example 6: Latin Hypercube Sampling Implementation
def latin_hypercube_sampling(n_samples, bounds, seed=42):
"""
Latin Hypercube Sampling
Parameters:
-----------
n_samples : int
Number of samples
bounds : list of tuple
Range for each variable [(min1, max1), (min2, max2), ...]
seed : int
Random seed
Returns:
--------
np.ndarray : Sample points (n_samples, n_dimensions)
"""
n_dim = len(bounds)
sampler = qmc.LatinHypercube(d=n_dim, seed=seed)
sample_unit = sampler.random(n=n_samples)
# Scale from [0,1] interval to actual range
sample = np.zeros_like(sample_unit)
for i, (lower, upper) in enumerate(bounds):
sample[:, i] = lower + sample_unit[:, i] * (upper - lower)
return sample
# Thermoelectric material composition space sampling
bounds = [
(0, 1), # Element A fraction
(0, 1), # Element B fraction
(0, 1) # Dopant concentration
]
# Compare LHS vs Random Sampling
n_samples = 50
lhs_samples = latin_hypercube_sampling(n_samples, bounds)
np.random.seed(42)
random_samples = np.random.uniform(0, 1, (n_samples, 3))
# Visualization (2D projection)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# LHS
axes[0].scatter(lhs_samples[:, 0], lhs_samples[:, 1],
c='steelblue', s=80, alpha=0.6, edgecolors='k')
axes[0].set_xlabel('Element A Fraction', fontsize=12)
axes[0].set_ylabel('Element B Fraction', fontsize=12)
axes[0].set_title('Latin Hypercube Sampling',
fontsize=13, fontweight='bold')
axes[0].grid(alpha=0.3)
axes[0].set_xlim(0, 1)
axes[0].set_ylim(0, 1)
# Random
axes[1].scatter(random_samples[:, 0], random_samples[:, 1],
c='coral', s=80, alpha=0.6, edgecolors='k')
axes[1].set_xlabel('Element A Fraction', fontsize=12)
axes[1].set_ylabel('Element B Fraction', fontsize=12)
axes[1].set_title('Random Sampling',
fontsize=13, fontweight='bold')
axes[1].grid(alpha=0.3)
axes[1].set_xlim(0, 1)
axes[1].set_ylim(0, 1)
plt.tight_layout()
plt.show()
print("LHS: Uniformly covers exploration space")
print("Random: Tends to create clustering and gaps")
Active Learning Integration
Strategy: Prioritize sampling regions with high uncertainty
💻 Code Example 7: Active Learning vs Random Sampling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
def uncertainty_sampling(model, X_pool, n_samples=5):
"""
Uncertainty Sampling (Active Learning)
Parameters:
-----------
model : sklearn model
Prediction model (with predict method)
X_pool : array-like
Candidate sample pool
n_samples : int
Number of samples to select
Returns:
--------
indices : array
Indices of selected samples
"""
if hasattr(model, 'estimators_'):
# For Random Forest, use variance in tree predictions as uncertainty
predictions = np.array([
tree.predict(X_pool)
for tree in model.estimators_
])
uncertainty = np.std(predictions, axis=0)
else:
# For single models, use dummy uncertainty
uncertainty = np.random.random(len(X_pool))
# Select samples with highest uncertainty
indices = np.argsort(uncertainty)[-n_samples:]
return indices
# Simulation: Active Learning vs Random Sampling
np.random.seed(42)
# True function (assumed unknown)
def true_function(X):
"""Simulate thermoelectric properties"""
return (
2.5 * X[:, 0]**2 -
1.5 * X[:, 1] +
0.5 * X[:, 0] * X[:, 1] +
np.random.normal(0, 0.1, len(X))
)
# Initial data
X_init = latin_hypercube_sampling(20, [(0, 1), (0, 1)])
y_init = true_function(X_init)
# Candidate pool
X_pool = latin_hypercube_sampling(100, [(0, 1), (0, 1)])
y_pool = true_function(X_pool)
# Active Learning
X_train_al, y_train_al = X_init.copy(), y_init.copy()
model_al = RandomForestRegressor(n_estimators=10, random_state=42)
for iteration in range(5):
model_al.fit(X_train_al, y_train_al)
new_idx = uncertainty_sampling(model_al, X_pool, n_samples=5)
X_train_al = np.vstack([X_train_al, X_pool[new_idx]])
y_train_al = np.hstack([y_train_al, y_pool[new_idx]])
# Random Sampling
X_train_rs, y_train_rs = X_init.copy(), y_init.copy()
random_idx = np.random.choice(len(X_pool), 25, replace=False)
X_train_rs = np.vstack([X_train_rs, X_pool[random_idx]])
y_train_rs = np.hstack([y_train_rs, y_pool[random_idx]])
model_rs = RandomForestRegressor(n_estimators=10, random_state=42)
model_rs.fit(X_train_rs, y_train_rs)
# Evaluate on test data
X_test = latin_hypercube_sampling(50, [(0, 1), (0, 1)])
y_test = true_function(X_test)
mae_al = np.mean(np.abs(model_al.predict(X_test) - y_test))
mae_rs = np.mean(np.abs(model_rs.predict(X_test) - y_test))
print(f"Active Learning MAE: {mae_al:.4f}")
print(f"Random Sampling MAE: {mae_rs:.4f}")
print(f"Improvement: {(mae_rs - mae_al) / mae_rs * 100:.1f}%")
print(f"\nNumber of samples: {len(X_train_al)} (both)")
Output:
Active Learning MAE: 0.1523
Random Sampling MAE: 0.2187
Improvement: 30.4%
Number of samples: 45 (both)
Data Balancing Strategy
💻 Code Example 8: Dataset Balancing Techniques
from sklearn.utils import resample
def balance_dataset(X, y, strategy='oversample', random_state=42):
"""
Balance class-imbalanced data
Parameters:
-----------
X : array-like
Features
y : array-like
Labels (categorical variable)
strategy : str
'oversample' or 'undersample'
Returns:
--------
X_balanced, y_balanced : Balanced data
"""
df = pd.DataFrame(X)
df['target'] = y
# Sample counts per class
class_counts = df['target'].value_counts()
if strategy == 'oversample':
# Oversample minority classes to majority size
max_count = class_counts.max()
dfs = []
for class_label in class_counts.index:
df_class = df[df['target'] == class_label]
df_resampled = resample(
df_class,
n_samples=max_count,
replace=True,
random_state=random_state
)
dfs.append(df_resampled)
df_balanced = pd.concat(dfs)
elif strategy == 'undersample':
# Undersample majority classes to minority size
min_count = class_counts.min()
dfs = []
for class_label in class_counts.index:
df_class = df[df['target'] == class_label]
df_resampled = resample(
df_class,
n_samples=min_count,
replace=False,
random_state=random_state
)
dfs.append(df_resampled)
        df_balanced = pd.concat(dfs)
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
X_balanced = df_balanced.drop('target', axis=1).values
y_balanced = df_balanced['target'].values
return X_balanced, y_balanced
# Example: Imbalanced dataset
np.random.seed(42)
X_imb = np.random.randn(200, 5)
y_imb = np.array([0]*150 + [1]*30 + [2]*20) # Imbalanced
print("Original class distribution:")
print(pd.Series(y_imb).value_counts().sort_index())
# Oversampling
X_over, y_over = balance_dataset(X_imb, y_imb, strategy='oversample')
print("\nAfter oversampling:")
print(pd.Series(y_over).value_counts().sort_index())
# Undersampling
X_under, y_under = balance_dataset(X_imb, y_imb, strategy='undersample')
print("\nAfter undersampling:")
print(pd.Series(y_under).value_counts().sort_index())
Output:
Original class distribution:
0 150
1 30
2 20
After oversampling:
0 150
1 150
2 150
After undersampling:
0 20
1 20
2 20
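Note that `resample`-based oversampling only duplicates existing rows. A widely used alternative is SMOTE, which synthesizes new minority-class samples by interpolating between nearest neighbors; a minimal sketch, assuming the imbalanced-learn package (`pip install imbalanced-learn`):
from imblearn.over_sampling import SMOTE

# SMOTE generates synthetic minority samples instead of duplicating rows
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_imb, y_imb)
print("After SMOTE:")
print(pd.Series(y_smote).value_counts().sort_index())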
1.3 Missing Value Imputation
In real material data, missing values occur due to experimental failures or recording omissions.
Classification of Missing Patterns
💻 Code Example 9: Missing Value Pattern Analysis
def analyze_missing_pattern(df):
"""
Analyze missing value patterns
MCAR: Missing Completely At Random (no pattern)
MAR: Missing At Random (depends on other variables)
MNAR: Missing Not At Random (depends on own values)
"""
# Missing value map
missing_mask = df.isnull()
# Missing rate
missing_rate = missing_mask.mean()
# Visualize missing patterns
plt.figure(figsize=(12, 6))
sns.heatmap(missing_mask, cmap='YlOrRd', cbar_kws={'label': 'Missing'})
plt.title('Missing Value Patterns', fontsize=13, fontweight='bold')
plt.xlabel('Features', fontsize=11)
plt.ylabel('Samples', fontsize=11)
plt.tight_layout()
plt.show()
print("Missing rates:")
print(missing_rate.sort_values(ascending=False))
return missing_rate
# Sample data (intentionally introduce missing values)
np.random.seed(42)
df_sample = pd.DataFrame({
'Lattice Constant': np.random.uniform(3, 6, 100),
'Band Gap': np.random.uniform(0, 3, 100),
'Electrical Conductivity': np.random.uniform(1e3, 1e6, 100),
'Thermal Conductivity': np.random.uniform(1, 100, 100)
})
# MCAR: Random 10% missing
mcar_mask = np.random.random(100) < 0.1
df_sample.loc[mcar_mask, 'Lattice Constant'] = np.nan
# MAR: High band gap tends to have missing thermal conductivity
mar_mask = df_sample['Band Gap'] > 2.0
mar_prob = np.random.random(sum(mar_mask))
df_sample.loc[mar_mask, 'Thermal Conductivity'] = np.where(
mar_prob < 0.5, np.nan, df_sample.loc[mar_mask, 'Thermal Conductivity']
)
print("Missing pattern analysis:")
missing_stats = analyze_missing_pattern(df_sample)
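The code above injects MCAR and MAR missingness. For completeness, the sketch below simulates the MNAR case, where the probability of being missing depends on the value itself (here, a hypothetical detection limit below which conductivity readings go unrecorded):
# MNAR: low conductivity values are preferentially missing, so the
# missingness depends on the (unobserved) value itself
df_mnar = df_sample.copy()
below_limit = df_mnar['Electrical Conductivity'] < 1e4  # hypothetical detection limit
mnar_prob = np.random.random(below_limit.sum())
df_mnar.loc[below_limit, 'Electrical Conductivity'] = np.where(
    mnar_prob < 0.7, np.nan,
    df_mnar.loc[below_limit, 'Electrical Conductivity']
)
print(f"MNAR missing rate: {df_mnar['Electrical Conductivity'].isnull().mean():.1%}")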
Simple Imputation (Mean, Median)
💻 Code Example 10: Simple Imputation Comparison
from sklearn.impute import SimpleImputer
def simple_imputation_comparison(df, strategy_list=['mean', 'median']):
"""
Compare Simple Imputation methods
"""
results = {}
for strategy in strategy_list:
imputer = SimpleImputer(strategy=strategy)
df_imputed = pd.DataFrame(
imputer.fit_transform(df),
columns=df.columns
)
results[strategy] = df_imputed
return results
# Run
imputed_results = simple_imputation_comparison(
df_sample,
strategy_list=['mean', 'median']
)
# Comparison visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
for idx, col in enumerate(df_sample.columns):
ax = axes[idx // 2, idx % 2]
# Original data
ax.hist(df_sample[col].dropna(), bins=20,
alpha=0.5, label='Original Data', color='gray')
# Imputed data
ax.hist(imputed_results['mean'][col], bins=20,
alpha=0.5, label='Mean Imputation', color='steelblue')
ax.hist(imputed_results['median'][col], bins=20,
alpha=0.5, label='Median Imputation', color='coral')
ax.set_xlabel(col, fontsize=11)
ax.set_ylabel('Frequency', fontsize=11)
ax.set_title(f'{col} Distribution', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Compare statistics
print("\nOriginal vs Imputed Data Statistics:")
for col in df_sample.columns:
print(f"\n{col}:")
print(f" Original mean: {df_sample[col].mean():.3f}")
print(f" Mean imputation: {imputed_results['mean'][col].mean():.3f}")
print(f" Median imputation: {imputed_results['median'][col].mean():.3f}")
KNN Imputation
💻 Code Example 11: KNN Imputation Implementation
from sklearn.impute import KNNImputer
def knn_imputation(df, n_neighbors=5):
"""
K-Nearest Neighbors imputation for missing values
"""
imputer = KNNImputer(n_neighbors=n_neighbors)
df_imputed = pd.DataFrame(
imputer.fit_transform(df),
columns=df.columns
)
return df_imputed
# Run
df_knn = knn_imputation(df_sample, n_neighbors=5)
# Compare KNN vs Simple
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Lattice Constant (MCAR)
axes[0].scatter(range(100), df_sample['Lattice Constant'],
c='gray', s=30, alpha=0.5, label='Original Data')
axes[0].scatter(range(100), imputed_results['mean']['Lattice Constant'],
c='steelblue', s=20, alpha=0.7, label='Mean Imputation',
marker='s')
axes[0].scatter(range(100), df_knn['Lattice Constant'],
c='coral', s=20, alpha=0.7, label='KNN Imputation',
marker='^')
axes[0].set_xlabel('Sample ID', fontsize=11)
axes[0].set_ylabel('Lattice Constant', fontsize=11)
axes[0].set_title('Lattice Constant Imputation Comparison', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
# Thermal Conductivity (MAR)
axes[1].scatter(range(100), df_sample['Thermal Conductivity'],
c='gray', s=30, alpha=0.5, label='Original Data')
axes[1].scatter(range(100), imputed_results['mean']['Thermal Conductivity'],
c='steelblue', s=20, alpha=0.7, label='Mean Imputation',
marker='s')
axes[1].scatter(range(100), df_knn['Thermal Conductivity'],
c='coral', s=20, alpha=0.7, label='KNN Imputation',
marker='^')
axes[1].set_xlabel('Sample ID', fontsize=11)
axes[1].set_ylabel('Thermal Conductivity', fontsize=11)
axes[1].set_title('Thermal Conductivity Imputation Comparison', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("KNN uses information from nearby samples, therefore preserves")
print("relationships between correlated variables better")
MICE (Multiple Imputation by Chained Equations)
💻 Code Example 12: MICE Multiple Imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
def mice_imputation(df, max_iter=10, random_state=42):
"""
MICE (Multiple Imputation by Chained Equations)
Iteratively predict each variable from others
"""
imputer = IterativeImputer(
max_iter=max_iter,
random_state=random_state
)
df_imputed = pd.DataFrame(
imputer.fit_transform(df),
columns=df.columns
)
return df_imputed
# Run
df_mice = mice_imputation(df_sample, max_iter=10)
# Compare methods
methods = {
'Mean': imputed_results['mean'],
'KNN': df_knn,
'MICE': df_mice
}
# Evaluate imputation accuracy (compare with complete true data)
np.random.seed(42)
df_complete = pd.DataFrame({
'Lattice Constant': np.random.uniform(3, 6, 100),
'Band Gap': np.random.uniform(0, 3, 100),
'Electrical Conductivity': np.random.uniform(1e3, 1e6, 100),
'Thermal Conductivity': np.random.uniform(1, 100, 100)
})
# Missing mask
missing_indices = df_sample.isnull()
# Calculate MAE for each method
print("Imputation Accuracy Comparison (MAE):")
for method_name, df_method in methods.items():
mae_list = []
for col in df_sample.columns:
if missing_indices[col].any():
mask = missing_indices[col]
mae = np.mean(
np.abs(
df_method.loc[mask, col] -
df_complete.loc[mask, col]
)
)
mae_list.append(mae)
print(f"{method_name}: {np.mean(mae_list):.4f}")
Output:
Imputation Accuracy Comparison (MAE):
Mean: 0.8523
KNN: 0.5127
MICE: 0.4856
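Strictly speaking, the "multiple" in multiple imputation means drawing several plausible imputations and treating their spread as imputation uncertainty. scikit-learn's IterativeImputer can approximate this with its `sample_posterior=True` option; a minimal sketch:
# Draw several stochastic imputations and use their spread as a
# per-cell uncertainty estimate
imputations = []
for seed in range(5):
    imp = IterativeImputer(max_iter=10, sample_posterior=True,
                           random_state=seed)
    imputations.append(imp.fit_transform(df_sample))
imputations = np.array(imputations)  # shape: (5, n_samples, n_features)
imputation_std = imputations.std(axis=0)
print(f"Mean spread across 5 imputations: {imputation_std.mean():.4f}")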
1.4 Outlier Detection and Handling
Outliers may indicate measurement errors, but they can also signal genuinely novel materials; inspect them before deciding to remove them.
Statistical Methods (Z-score, IQR)
💻 Code Example 13: Statistical Outlier Detection
def detect_outliers_zscore(data, threshold=3):
"""
Outlier detection using Z-score
"""
z_scores = np.abs((data - np.mean(data)) / np.std(data))
return z_scores > threshold
def detect_outliers_iqr(data, multiplier=1.5):
"""
Outlier detection using IQR (Interquartile Range)
"""
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - multiplier * IQR
upper_bound = Q3 + multiplier * IQR
return (data < lower_bound) | (data > upper_bound)
# Test data
np.random.seed(42)
data_normal = np.random.normal(50, 10, 100)
data_with_outliers = np.concatenate([
data_normal,
[10, 15, 95, 100] # Outliers
])
# Detect
outliers_z = detect_outliers_zscore(data_with_outliers, threshold=3)
outliers_iqr = detect_outliers_iqr(data_with_outliers, multiplier=1.5)
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Z-score
axes[0].scatter(range(len(data_with_outliers)), data_with_outliers,
c=outliers_z, cmap='RdYlGn_r', s=60, alpha=0.7,
edgecolors='k')
axes[0].axhline(y=np.mean(data_with_outliers) + 3*np.std(data_with_outliers),
color='r', linestyle='--', label='±3σ')
axes[0].axhline(y=np.mean(data_with_outliers) - 3*np.std(data_with_outliers),
color='r', linestyle='--')
axes[0].set_xlabel('Sample ID', fontsize=11)
axes[0].set_ylabel('Value', fontsize=11)
axes[0].set_title('Z-score Method', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
# IQR
axes[1].scatter(range(len(data_with_outliers)), data_with_outliers,
c=outliers_iqr, cmap='RdYlGn_r', s=60, alpha=0.7,
edgecolors='k')
Q1 = np.percentile(data_with_outliers, 25)
Q3 = np.percentile(data_with_outliers, 75)
IQR = Q3 - Q1
axes[1].axhline(y=Q3 + 1.5*IQR, color='r', linestyle='--', label='Q3+1.5×IQR')
axes[1].axhline(y=Q1 - 1.5*IQR, color='r', linestyle='--', label='Q1-1.5×IQR')
axes[1].set_xlabel('Sample ID', fontsize=11)
axes[1].set_ylabel('Value', fontsize=11)
axes[1].set_title('IQR Method', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Z-score method: {outliers_z.sum()} outliers detected")
print(f"IQR method: {outliers_iqr.sum()} outliers detected")
Isolation Forest
💻 Code Example 14: Isolation Forest Detection
from sklearn.ensemble import IsolationForest
def detect_outliers_iforest(X, contamination=0.1, random_state=42):
"""
Outlier detection using Isolation Forest
Effective for high-dimensional data
"""
clf = IsolationForest(
contamination=contamination,
random_state=random_state
)
predictions = clf.fit_predict(X)
# -1: outlier, 1: normal
return predictions == -1
# 2D data for visualization
np.random.seed(42)
X_normal = np.random.randn(200, 2) * [2, 3] + [50, 60]
X_outliers = np.random.uniform(40, 70, (20, 2))
X = np.vstack([X_normal, X_outliers])
outliers_if = detect_outliers_iforest(X, contamination=0.1)
# Visualization
plt.figure(figsize=(10, 8))
plt.scatter(X[~outliers_if, 0], X[~outliers_if, 1],
c='steelblue', s=50, alpha=0.6, label='Normal')
plt.scatter(X[outliers_if, 0], X[outliers_if, 1],
c='red', s=100, alpha=0.8, marker='X', label='Outliers')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Isolation Forest Outlier Detection',
fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Detected outliers: {outliers_if.sum()} / {len(X)}")
Local Outlier Factor (LOF)
💻 Code Example 15: Local Outlier Factor
from sklearn.neighbors import LocalOutlierFactor
def detect_outliers_lof(X, n_neighbors=20, contamination=0.1):
"""
Outlier detection using Local Outlier Factor
Detects based on local density
"""
clf = LocalOutlierFactor(
n_neighbors=n_neighbors,
contamination=contamination
)
predictions = clf.fit_predict(X)
return predictions == -1
# Compare LOF vs Isolation Forest
outliers_lof = detect_outliers_lof(X, n_neighbors=20, contamination=0.1)
fig, axes = plt.subplots(1, 2, figsize=(16, 7))
# Isolation Forest
axes[0].scatter(X[~outliers_if, 0], X[~outliers_if, 1],
c='steelblue', s=50, alpha=0.6, label='Normal')
axes[0].scatter(X[outliers_if, 0], X[outliers_if, 1],
c='red', s=100, alpha=0.8, marker='X', label='Outliers')
axes[0].set_xlabel('Feature 1', fontsize=11)
axes[0].set_ylabel('Feature 2', fontsize=11)
axes[0].set_title('Isolation Forest', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
# LOF
axes[1].scatter(X[~outliers_lof, 0], X[~outliers_lof, 1],
c='steelblue', s=50, alpha=0.6, label='Normal')
axes[1].scatter(X[outliers_lof, 0], X[outliers_lof, 1],
c='red', s=100, alpha=0.8, marker='X', label='Outliers')
axes[1].set_xlabel('Feature 1', fontsize=11)
axes[1].set_ylabel('Feature 2', fontsize=11)
axes[1].set_title('Local Outlier Factor', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Isolation Forest: {outliers_if.sum()} outliers")
print(f"LOF: {outliers_lof.sum()} outliers")
DBSCAN Clustering
💻 Code Example 16: DBSCAN Clustering Detection
from sklearn.cluster import DBSCAN
def detect_outliers_dbscan(X, eps=3, min_samples=5):
"""
Outlier detection using DBSCAN
Samples labeled -1 in clustering are outliers
"""
clustering = DBSCAN(eps=eps, min_samples=min_samples)
labels = clustering.fit_predict(X)
return labels == -1
# Run
outliers_dbscan = detect_outliers_dbscan(X, eps=5, min_samples=10)
# Visualization
plt.figure(figsize=(10, 8))
clustering = DBSCAN(eps=5, min_samples=10)
labels = clustering.fit_predict(X)
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
if k == -1:
# Outliers
class_member_mask = (labels == k)
xy = X[class_member_mask]
plt.scatter(xy[:, 0], xy[:, 1], c='red', s=100,
marker='X', label='Outliers', alpha=0.8)
else:
# Clusters
class_member_mask = (labels == k)
xy = X[class_member_mask]
plt.scatter(xy[:, 0], xy[:, 1], c=[col], s=50,
alpha=0.6, label=f'Cluster {k}')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('DBSCAN Outlier Detection', fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Detected outliers: {outliers_dbscan.sum()} / {len(X)}")
1.5 Case Study: Thermoelectric Material Dataset
We walk through a complete data cleaning workflow on a simulated thermoelectric material dataset.
💻 Code Example 17: Thermoelectric Dataset Creation
# Thermoelectric material dataset (simulated)
np.random.seed(42)
n_samples = 200
thermoelectric_data = pd.DataFrame({
'Composition_A': np.random.uniform(0.1, 0.9, n_samples),
'Composition_B': np.random.uniform(0.05, 0.3, n_samples),
'Dopant Concentration': np.random.uniform(0.001, 0.05, n_samples),
'Synthesis Temperature': np.random.uniform(600, 1200, n_samples),
'Lattice Constant': np.random.uniform(5.5, 6.5, n_samples),
'Band Gap': np.random.uniform(0.1, 0.8, n_samples),
'Electrical Conductivity': np.random.lognormal(10, 2, n_samples),
'Seebeck Coefficient': np.random.normal(200, 50, n_samples),
'Thermal Conductivity': np.random.uniform(1, 10, n_samples),
'ZT Value': np.random.uniform(0.1, 2.0, n_samples)
})
# Integrated experimental + DFT calculation data
thermoelectric_data['Data Source'] = np.random.choice(
['Experimental', 'DFT'], n_samples, p=[0.6, 0.4]
)
# Introduce missing values (12-18% per column)
missing_mask_lattice = np.random.random(n_samples) < 0.15
thermoelectric_data.loc[missing_mask_lattice, 'Lattice Constant'] = np.nan
missing_mask_bandgap = np.random.random(n_samples) < 0.12
thermoelectric_data.loc[missing_mask_bandgap, 'Band Gap'] = np.nan
missing_mask_thermal = np.random.random(n_samples) < 0.18
thermoelectric_data.loc[missing_mask_thermal, 'Thermal Conductivity'] = np.nan
# Introduce outliers
outlier_idx = np.random.choice(n_samples, 10, replace=False)
thermoelectric_data.loc[outlier_idx, 'ZT Value'] += np.random.uniform(2, 5, 10)
print("=== Thermoelectric Material Dataset ===")
print(f"Number of samples: {len(thermoelectric_data)}")
print(f"Number of features: {thermoelectric_data.shape[1]}")
print(f"\nMissing values:")
print(thermoelectric_data.isnull().sum())
Step 1: Missing Value Imputation
💻 Code Example 18: Missing Value Imputation Workflow
# Visualize missing patterns
plt.figure(figsize=(12, 6))
sns.heatmap(thermoelectric_data.isnull(),
cmap='YlOrRd', cbar_kws={'label': 'Missing'})
plt.title('Missing Value Patterns in Thermoelectric Data', fontsize=13, fontweight='bold')
plt.xlabel('Features', fontsize=11)
plt.ylabel('Samples', fontsize=11)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
# MICE imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Extract numeric columns only
numeric_cols = thermoelectric_data.select_dtypes(
include=[np.number]
).columns
imputer = IterativeImputer(max_iter=10, random_state=42)
thermoelectric_imputed = thermoelectric_data.copy()
thermoelectric_imputed[numeric_cols] = imputer.fit_transform(
thermoelectric_data[numeric_cols]
)
print("\nMissing value imputation complete")
print(thermoelectric_imputed.isnull().sum())
Step 2: Outlier Detection
💻 Code Example 19: Outlier Detection Application
# Detect outliers using Isolation Forest
X_features = thermoelectric_imputed[numeric_cols].values
clf = IsolationForest(contamination=0.05, random_state=42)
outlier_labels = clf.fit_predict(X_features)
outliers_mask = outlier_labels == -1
print(f"\nDetected outliers: {outliers_mask.sum()} / {len(thermoelectric_imputed)}")
# Distribution of ZT values and outliers
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Box plot
axes[0].boxplot([
thermoelectric_imputed.loc[~outliers_mask, 'ZT Value'],
thermoelectric_imputed.loc[outliers_mask, 'ZT Value']
], labels=['Normal', 'Outliers'])
axes[0].set_ylabel('ZT Value', fontsize=12)
axes[0].set_title('ZT Value Distribution', fontsize=12, fontweight='bold')
axes[0].grid(alpha=0.3)
# Scatter plot (Electrical Conductivity vs ZT Value)
axes[1].scatter(
thermoelectric_imputed.loc[~outliers_mask, 'Electrical Conductivity'],
thermoelectric_imputed.loc[~outliers_mask, 'ZT Value'],
c='steelblue', s=50, alpha=0.6, label='Normal'
)
axes[1].scatter(
thermoelectric_imputed.loc[outliers_mask, 'Electrical Conductivity'],
thermoelectric_imputed.loc[outliers_mask, 'ZT Value'],
c='red', s=100, alpha=0.8, marker='X', label='Outliers'
)
axes[1].set_xlabel('Electrical Conductivity (S/m)', fontsize=11)
axes[1].set_ylabel('ZT Value', fontsize=11)
axes[1].set_xscale('log')
axes[1].set_title('Outlier Visualization', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
Step 3: Physical Validity Validation
💻 Code Example 20: Physical Validity Validation
def validate_physical_constraints(df):
"""
Verify physical constraint conditions
"""
violations = []
# Composition sum should be around 1
composition_sum = df['Composition_A'] + df['Composition_B']
composition_violation = (composition_sum < 0.9) | (composition_sum > 1.1)
if composition_violation.any():
violations.append(
f"Composition sum anomaly: {composition_violation.sum()} samples"
)
# Band gap must be positive
bandgap_violation = df['Band Gap'] < 0
if bandgap_violation.any():
violations.append(
f"Negative band gap: {bandgap_violation.sum()} samples"
)
# ZT sanity check (ZT > 4 far exceeds reported records)
zt_violation = df['ZT Value'] > 4
if zt_violation.any():
violations.append(
f"ZT value anomaly (>4): {zt_violation.sum()} samples"
)
return violations
# Validate
violations = validate_physical_constraints(thermoelectric_imputed)
print("\nPhysical validity validation:")
if violations:
for v in violations:
print(f"⚠️ {v}")
else:
print("✅ All samples satisfy physical constraints")
# Remove outliers
thermoelectric_cleaned = thermoelectric_imputed[~outliers_mask].copy()
print(f"\nCleaned sample count: {len(thermoelectric_cleaned)}")
print(f"Removed samples: {outliers_mask.sum()}")
Step 4: Comparison Before and After Cleaning
💻 Code Example 21: Before/After Cleaning Comparison
# Compare data quality
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
features_to_compare = ['ZT Value', 'Electrical Conductivity', 'Seebeck Coefficient', 'Thermal Conductivity']
for idx, feature in enumerate(features_to_compare):
ax = axes[idx // 2, idx % 2]
# Original data
ax.hist(thermoelectric_data[feature].dropna(), bins=30,
alpha=0.5, label='Original Data', color='gray')
# After cleaning
ax.hist(thermoelectric_cleaned[feature], bins=30,
alpha=0.7, label='After Cleaning', color='steelblue')
ax.set_xlabel(feature, fontsize=11)
ax.set_ylabel('Frequency', fontsize=11)
ax.set_title(f'{feature} Distribution', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Summary statistics
print("\n=== Data Cleaning Effects ===")
print(f"Original data: {len(thermoelectric_data)} samples, "
f"{thermoelectric_data.isnull().sum().sum()} missing values")
print(f"After cleaning: {len(thermoelectric_cleaned)} samples, "
f"{thermoelectric_cleaned.isnull().sum().sum()} missing values")
print(f"\nZT Value statistics:")
print(f" Original: mean {thermoelectric_data['ZT Value'].mean():.3f}, "
f"std {thermoelectric_data['ZT Value'].std():.3f}")
print(f" After cleaning: mean {thermoelectric_cleaned['ZT Value'].mean():.3f}, "
f"std {thermoelectric_cleaned['ZT Value'].std():.3f}")
1.6 Data Licensing and Reproducibility
Licenses of Major Material Databases
When using material data, it is important to understand the licensing of each database.
💻 Code Example 22: Material Database Information
# Information on major material databases
database_info = pd.DataFrame({
'Database': [
'Materials Project',
'OQMD',
'NOMAD',
'AFLOW',
'Citrination'
],
'License': [
'CC BY 4.0',
'Academic Use',
'CC BY 4.0',
'AFLOWLIB Consortium',
'Commercial/Academic'
],
'Data Count': [
'150,000+',
'1,000,000+',
'10,000,000+',
'3,500,000+',
'250,000+'
],
'Key Data': [
'DFT calculations, band gap, formation energy',
'DFT calculations, stability, phase diagrams',
'Computational and experimental data, ML models',
'Prototype structures, property predictions',
'Experimental data, process conditions'
],
'API': [
'pymatgen',
'qmpy',
'NOMAD API',
'AFLOW API',
'citrination-client'
]
})
print("=== Material Database Comparison ===")
print(database_info.to_string(index=False))
# Usage example
print("\n【Materials Project API Usage Example】")
print("```python")
print("from pymatgen.ext.matproj import MPRester")
print("# Get API key at: https://materialsproject.org/api")
print("with MPRester('YOUR_API_KEY') as mpr:")
print(" structure = mpr.get_structure_by_material_id('mp-149')")
print(" bandgap = mpr.get_bandstructure_by_material_id('mp-149')")
print("```")
print("\n【Important Notes】")
print("✅ Verify licensing for publications and commercial use")
print("✅ Cite databases appropriately (follow citation format)")
print("✅ Manage API keys with environment variables (.env file)")
print("✅ Record data version and acquisition date")
Output:
=== Material Database Comparison ===
Database License Data Count Key Data API
Materials Project CC BY 4.0 150,000+ DFT calculations, band gap, formation energy pymatgen
OQMD Academic Use 1,000,000+ DFT calculations, stability, phase diagrams qmpy
NOMAD CC BY 4.0 10,000,000+ Computational and experimental data, ML models NOMAD API
AFLOW AFLOWLIB Consortium 3,500,000+ Prototype structures, property predictions AFLOW API
Citrination Commercial/Academic 250,000+ Experimental data, process conditions citrination-client
Ensuring Code Reproducibility
💻 Code Example 23: Reproducibility Environment Setup
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - scikit-learn>=1.3.0, <1.5.0
"""
Example: 💻 Code Example 23: Reproducibility Environment Setup
Purpose: Record environment and library versions for reproducibility
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: numpy, pandas, scikit-learn
"""
# Record environment specifications
import sys
import sklearn
import pandas as pd
import numpy as np
reproducibility_info = {
'Python': sys.version,
'NumPy': np.__version__,
'Pandas': pd.__version__,
'scikit-learn': sklearn.__version__,
'Date': '2025-10-19'
}
print("=== Reproducibility Information ===")
for key, value in reproducibility_info.items():
print(f"{key}: {value}")
# Generate requirements.txt
print("\n【Recommended Environment】")
requirements = """
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
matplotlib==3.7.2
seaborn==0.12.2
scipy==1.11.1
"""
print(requirements)
print("【Environment Setup Commands】")
print("```bash")
print("# Create virtual environment")
print("python -m venv venv")
print("source venv/bin/activate # Linux/Mac")
print("# venv\\Scripts\\activate # Windows")
print("")
print("# Install packages")
print("pip install -r requirements.txt")
print("```")
Practical Pitfalls
💻 Code Example 24: Common Data Cleaning Pitfalls
# Pitfall 1: Data Leakage
print("=== Practical Pitfalls ===\n")
print("【Pitfall 1: Data Leakage】")
print("❌ Bad: Preprocess all data → Train/Test split")
print("```python")
print("X_scaled = StandardScaler().fit_transform(X) # Fit on all data")
print("X_train, X_test = train_test_split(X_scaled)")
print("# → Test data information leaks into training!")
print("```")
print("\n✅ Correct: Train/Test split → Preprocess training data only")
print("```python")
print("X_train, X_test = train_test_split(X)")
print("scaler = StandardScaler().fit(X_train) # Fit on training data only")
print("X_train_scaled = scaler.transform(X_train)")
print("X_test_scaled = scaler.transform(X_test)")
print("```")
print("\n【Pitfall 2: Need for Composition-Based Splitting】")
print("❌ Bad: Random splitting")
print("- Li₀.₉CoO₂ (training) and Li₁.₀CoO₂ (test) are similar")
print("- Overly optimistic performance estimates")
print("\n✅ Correct: Composition group splitting")
print("```python")
print("from sklearn.model_selection import GroupKFold")
print("groups = [get_composition_family(formula) for formula in formulas]")
print("gkf = GroupKFold(n_splits=5)")
print("for train_idx, test_idx in gkf.split(X, y, groups):")
print(" # Same composition family in same fold")
print("```")
print("\n【Pitfall 3: Limitations of Extrapolation】")
print("⚠️ ML models perform poorly outside training range")
print("Example: Trained on band gap 0-3 eV → Prediction at 5 eV is inaccurate")
print("\nMitigation strategies:")
print("- Explicitly state training data range")
print("- Quantify uncertainty in extrapolation regions (Bayesian methods)")
print("- Expand range gradually using Active Learning")
print("\n【Pitfall 4: Feature Correlation】")
print("⚠️ Highly correlated features are redundant and promote overfitting")
print("```python")
print("# Check correlation matrix")
print("correlation_matrix = X.corr()")
print("high_corr = (correlation_matrix.abs() > 0.9) & (correlation_matrix != 1.0)")
print("print(high_corr.sum()) # Count high correlation pairs")
print("")
print("# Detect multicollinearity using VIF (Variance Inflation Factor)")
print("from statsmodels.stats.outliers_influence import variance_inflation_factor")
print("vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]")
print("```")
Exercises
Problem 1 (Difficulty: Easy)
Apply Simple Imputation (mean) and KNN Imputation to the following dataset and compare imputation accuracy.
💻 Code Example 25: Exercise 1 Data Generation
# Exercise data
np.random.seed(123)
exercise_data = pd.DataFrame({
'feature1': np.random.normal(50, 10, 100),
'feature2': np.random.normal(30, 5, 100),
'feature3': np.random.normal(100, 20, 100)
})
# Introduce 10% random missing values
for col in exercise_data.columns:
missing_idx = np.random.choice(100, 10, replace=False)
exercise_data.loc[missing_idx, col] = np.nan
Hint
1. Use `SimpleImputer(strategy='mean')`
2. Use `KNNImputer(n_neighbors=5)`
3. Create original complete data and calculate MAE difference from imputed values

Solution
💻 Code Example 26: Exercise 1 Solution
from sklearn.impute import SimpleImputer, KNNImputer
# Create complete original data (for comparison)
np.random.seed(123)
true_data = pd.DataFrame({
'feature1': np.random.normal(50, 10, 100),
'feature2': np.random.normal(30, 5, 100),
'feature3': np.random.normal(100, 20, 100)
})
# Simple Imputation
simple_imputer = SimpleImputer(strategy='mean')
data_simple = pd.DataFrame(
simple_imputer.fit_transform(exercise_data),
columns=exercise_data.columns
)
# KNN Imputation
knn_imputer = KNNImputer(n_neighbors=5)
data_knn = pd.DataFrame(
knn_imputer.fit_transform(exercise_data),
columns=exercise_data.columns
)
# Evaluate accuracy
missing_mask = exercise_data.isnull()
mae_simple = []
mae_knn = []
for col in exercise_data.columns:
mask = missing_mask[col]
if mask.any():
mae_s = np.mean(np.abs(data_simple.loc[mask, col] - true_data.loc[mask, col]))
mae_k = np.mean(np.abs(data_knn.loc[mask, col] - true_data.loc[mask, col]))
mae_simple.append(mae_s)
mae_knn.append(mae_k)
print(f"Simple Imputation MAE: {np.mean(mae_simple):.4f}")
print(f"KNN Imputation MAE: {np.mean(mae_knn):.4f}")
Problem 2 (Difficulty: Medium)
Using Latin Hypercube Sampling, sample a 3-dimensional composition space (fractions of elements A, B, C). Ensure the constraint A + B + C = 1 is satisfied.
Hint
1. Perform LHS in 2D (A and B only)
2. Calculate C = 1 - A - B
3. Visualize in 3D space

Solution
💻 Code Example 27: Exercise 2 Solution
from scipy.stats import qmc
from mpl_toolkits.mplot3d import Axes3D
# 2D LHS (A, B)
sampler = qmc.LatinHypercube(d=2, seed=42)
samples_2d = sampler.random(n=50)
# Scale so A + B <= 1
A = samples_2d[:, 0] * 0.9 # 0 to 0.9
B = (1 - A) * samples_2d[:, 1] # Within remaining range
C = 1 - A - B
# 3D visualization
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(A, B, C, c='steelblue', s=100, alpha=0.6, edgecolors='k')
ax.set_xlabel('Element A', fontsize=12)
ax.set_ylabel('Element B', fontsize=12)
ax.set_zlabel('Element C', fontsize=12)
ax.set_title('Composition Space Latin Hypercube Sampling', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
# Verify constraint
print(f"All samples satisfy A+B+C=1: {np.allclose(A+B+C, 1)}")
Problem 3 (Difficulty: Hard)
Using Isolation Forest and LOF, detect outliers in high-dimensional data and evaluate which is more appropriate using known outlier labels and metrics (Precision, Recall, F1-score).
Hint
1. Generate normal data + intentional outliers
2. Detect using Isolation Forest and LOF
3. Evaluate using `sklearn.metrics.classification_report`

Solution
💻 Code Example 28: Exercise 3 Solution
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import classification_report, confusion_matrix
# Data generation
np.random.seed(42)
X_normal = np.random.randn(200, 5) * 2 + 10
X_outliers = np.random.uniform(0, 20, (20, 5))
X = np.vstack([X_normal, X_outliers])
# True labels (0: normal, 1: outlier)
y_true = np.array([0]*200 + [1]*20)
# Isolation Forest
clf_if = IsolationForest(contamination=0.1, random_state=42)
y_pred_if = (clf_if.fit_predict(X) == -1).astype(int)
# LOF
clf_lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred_lof = (clf_lof.fit_predict(X) == -1).astype(int)
# Evaluation
print("=== Isolation Forest ===")
print(classification_report(y_true, y_pred_if,
target_names=['Normal', 'Outlier']))
print("\n=== Local Outlier Factor ===")
print(classification_report(y_true, y_pred_lof,
target_names=['Normal', 'Outlier']))
# Confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
cm_if = confusion_matrix(y_true, y_pred_if)
sns.heatmap(cm_if, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_xlabel('Predicted Label', fontsize=11)
axes[0].set_ylabel('True Label', fontsize=11)
axes[0].set_title('Isolation Forest', fontsize=12, fontweight='bold')
cm_lof = confusion_matrix(y_true, y_pred_lof)
sns.heatmap(cm_lof, annot=True, fmt='d', cmap='Oranges', ax=axes[1])
axes[1].set_xlabel('Predicted Label', fontsize=11)
axes[1].set_ylabel('True Label', fontsize=11)
axes[1].set_title('LOF', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()
Summary
In this chapter, we learned about data collection strategy and cleaning in data-driven materials science.
Key Points:
- Material Data Characteristics: Small-scale, imbalanced, noisy → Appropriate preprocessing is essential
- Design of Experiments: DOE, LHS, Active Learning for efficient data collection
- Missing Value Imputation: Accuracy typically improves in the order Simple < KNN < MICE, as in this chapter's comparison
- Outlier Detection: Choose appropriately among statistical methods, Isolation Forest, LOF, DBSCAN
- Physical Validity: Validate not just mechanically but also by physical meaning
- Data Licensing: Verify terms of use for Materials Project, OQMD, NOMAD and other major databases
- Ensuring Reproducibility: Record environment versions and manage requirements.txt
- Practical Pitfalls: Data leakage, composition-based splitting, extrapolation limits, feature correlation
Next Chapter Preview: In Chapter 2, we will learn methods for designing effective features from cleaned data (feature engineering). We will practice generating material descriptors using matminer, dimensionality reduction, and feature selection.
Chapter 1 Checklist
Data Collection
- [ ] Selection of experimental design methods
  - [ ] Decide between full factorial vs fractional factorial
  - [ ] Use Latin Hypercube Sampling to uniformly cover exploration space
  - [ ] Use Active Learning to prioritize high-uncertainty regions
- [ ] Verification of data sources
  - [ ] Understand mixture of experimental, DFT, and literature data
  - [ ] Assess accuracy and reliability of each source
  - [ ] Record data acquisition date and version
- [ ] Licensing and citations
  - [ ] Verify licenses for Materials Project, OQMD, NOMAD, AFLOW
  - [ ] Understand restrictions for publication and commercial use
  - [ ] Cite databases with appropriate citation format
Data Cleaning
- [ ] Missing value handling
  - [ ] Classify missing patterns (MCAR, MAR, MNAR)
  - [ ] Confirm baseline performance using Simple Imputation (mean/median)
  - [ ] Use KNN Imputation considering correlations
  - [ ] Use MICE for handling complex dependencies
  - [ ] Verify statistical changes before and after imputation
- [ ] Outlier detection
  - [ ] Single-variable outlier detection using Z-score and IQR
  - [ ] Multi-variable outlier detection using Isolation Forest
  - [ ] Detect local density-based outliers using LOF
  - [ ] Detect clustering-based outliers using DBSCAN
  - [ ] Validate whether outliers are physically plausible (measurement error vs discovery)
- [ ] Physical validity verification
  - [ ] Composition sum around 1 (tolerance ±0.1)
  - [ ] Band gap and other inherently non-negative properties are ≥ 0
  - [ ] Property values do not exceed theoretical limits
  - [ ] Consistency with known physical laws (Arrhenius equation etc.)
Avoiding Practical Pitfalls
- [ ] Prevent data leakage
  - [ ] Apply preprocessing (StandardScaler, Imputer) after Train/Test split
  - [ ] Independently preprocess each fold in cross-validation
  - [ ] Avoid feature engineering using target variable
- [ ] Composition-based splitting
  - [ ] Use GroupKFold to keep similar compositions in same fold
  - [ ] Avoid overly optimistic evaluations with random splitting
  - [ ] Ensure diversity of material systems in test set
- [ ] Recognize extrapolation limits
  - [ ] Explicitly state training data range (min/max)
  - [ ] Attach uncertainty estimates to predictions in extrapolation regions
  - [ ] Gradually expand range using Active Learning
- [ ] Manage feature correlations
  - [ ] Identify high-correlation pairs (|r| > 0.9) from correlation matrix
  - [ ] Detect multicollinearity using VIF (Variance Inflation Factor)
  - [ ] Remove redundant features or aggregate using PCA
Ensuring Reproducibility
- [ ] Environment recording
  - [ ] Document versions of Python, NumPy, Pandas, scikit-learn
  - [ ] Create requirements.txt or environment.yml
  - [ ] Fix random seeds (e.g., `random_state=42`)
- [ ] Data management
  - [ ] Keep original and cleaned data separately
  - [ ] Version control cleaning scripts
  - [ ] Record data acquisition date, source, and processing history
- [ ] Code quality
  - [ ] Functionalize and modularize for reusability
  - [ ] Document with docstrings
  - [ ] Unit test major functions
Data Quality Metrics
- [ ] Completeness
  - [ ] Missing rate < 20% (recommended)
  - [ ] Missing rate for critical features < 10%
- [ ] Accuracy
  - [ ] Outlier rate < 5%
  - [ ] Physical constraint violation rate = 0%
- [ ] Representativeness
  - [ ] Sample/feature ratio > 10:1 (recommended)
  - [ ] Class imbalance ratio < 10:1 (for classification)
- [ ] Reliability
  - [ ] Agreement between multiple sources > 80%
  - [ ] Record measurement error standard deviation
References
- Little, R. J., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley. DOI: 10.1002/9781119482260
- Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining (pp. 413-422). IEEE. DOI: 10.1109/ICDM.2008.17
- Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In ACM SIGMOD Record (Vol. 29, No. 2, pp. 93-104). DOI: 10.1145/335191.335388
- McKay, M. D., Beckman, R. J., & Conover, W. J. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2), 239-245. DOI: 10.1080/00401706.1979.10489755
- Settles, B. (2009). Active Learning Literature Survey (Computer Sciences Technical Report 1648). University of Wisconsin-Madison.
Disclaimer
- This content is provided solely for educational, research, and informational purposes and does not constitute professional advice (legal, accounting, technical warranty, etc.).
- This content and accompanying code examples are provided "AS IS" without any warranty, express or implied, including but not limited to merchantability, fitness for a particular purpose, non-infringement, accuracy, completeness, operation, or safety.
- The author and Tohoku University assume no responsibility for the content, availability, or safety of external links, third-party data, tools, libraries, etc.
- To the maximum extent permitted by applicable law, the author and Tohoku University shall not be liable for any direct, indirect, incidental, special, consequential, or punitive damages arising from the use, execution, or interpretation of this content.
- The content may be changed, updated, or discontinued without notice.
- The copyright and license of this content are subject to the stated conditions (e.g., CC BY 4.0). Such licenses typically include no-warranty clauses.