Chapter 4: Practical Guide - Materials Mapping with GNN + Dimensionality Reduction
This chapter is a hands-on practical guide: you will learn the essential concepts and techniques by building a working materials-mapping system end to end.
Overview
In this chapter, we will build a practical materials space mapping system that combines GNN-based representation learning (Chapter 3) with dimensionality reduction techniques (Chapter 2). We will collect real data from the Materials Project API and implement an end-to-end pipeline.
Learning Objectives
- Collect real data using the Materials Project API
- Build a GNN training pipeline
- Visualize trained GNN embeddings using dimensionality reduction
- Implement an interactive materials exploration system
- Apply to real materials design tasks
4.1 Environment Setup and Data Collection
Code Example 1: Installing Required Libraries
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - seaborn>=0.12.0
# - torch>=2.0.0, <2.3.0
# - tqdm>=4.65.0
# Install required libraries (run once, on first use)
"""
!pip install pymatgen
!pip install mp-api
!pip install torch torchvision torchaudio
!pip install torch-geometric
!pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-2.0.0+cpu.html
!pip install umap-learn
!pip install plotly
!pip install bokeh
!pip install scikit-learn
!pip install pandas matplotlib seaborn
!pip install tqdm
"""
# Import
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data, Dataset
from torch_geometric.loader import DataLoader  # DataLoader lives in .loader in PyG 2.x
import umap
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from tqdm import tqdm
import json
import pickle
from pathlib import Path
print("All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
Code Example 2: Materials Project API Setup and Data Collection
from mp_api.client import MPRester
from pymatgen.core import Structure
import warnings
warnings.filterwarnings('ignore')
# API key setup (environment variable or direct specification)
# MP_API_KEY = "your_api_key_here" # Get from https://next-gen.materialsproject.org/api
# Note: It is recommended to read from environment variables in actual projects
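# A minimal sketch of the environment-variable approach (the variable name
# MP_API_KEY is our own convention here, not mandated by mp-api): the key
# never appears in the notebook or in version control.
import os
MP_API_KEY = os.environ.get("MP_API_KEY")  # None if the variable is not set
if MP_API_KEY is None:
    print("MP_API_KEY not set - the dummy data generated below can be used instead")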
def fetch_materials_data(api_key, criteria, properties, max_materials=1000):
"""
Fetch materials data from Materials Project API
Parameters:
-----------
api_key : str
Materials Project API key
criteria : dict
Search criteria
properties : list
List of properties to fetch
max_materials : int
Maximum number of materials to fetch
Returns:
--------
materials_df : pd.DataFrame
Materials data
structures : list
List of crystal structures
"""
with MPRester(api_key) as mpr:
# Fetch data
docs = mpr.materials.summary.search(
**criteria,
fields=properties,
num_chunks=10,
chunk_size=100
)
# Limit to maximum number
docs = docs[:max_materials]
print(f"Number of materials fetched: {len(docs)}")
# Convert to DataFrame
# Skip fields handled separately below; otherwise 'material_id' would be
# appended twice and the never-filled 'structure' column would break the
# DataFrame construction
skip_props = {'structure', 'material_id', 'formula_pretty'}
data_dict = {prop: [] for prop in properties if prop not in skip_props}
data_dict['material_id'] = []
data_dict['formula'] = []
structures = []
for doc in tqdm(docs, desc="Processing data"):
try:
data_dict['material_id'].append(doc.material_id)
data_dict['formula'].append(str(doc.formula_pretty))
structures.append(doc.structure)
for prop in properties:
if prop in skip_props:
continue
value = getattr(doc, prop, None)
data_dict[prop].append(value)
except Exception as e:
print(f"Error processing {doc.material_id}: {e}")
continue
materials_df = pd.DataFrame(data_dict)
# Remove rows with missing values, keeping the structures list aligned
materials_df = materials_df.dropna()
structures = [structures[i] for i in materials_df.index]
materials_df = materials_df.reset_index(drop=True)
print(f"Number of valid materials: {len(materials_df)}")
return materials_df, structures
# Usage example (dummy data is generated below instead; uncomment this block if you have an actual API key)
# criteria = {
# "band_gap": (0.5, 5.0), # Band gap 0.5-5.0 eV
# "is_stable": True, # Only stable materials
# "nelements": (2, 3) # 2-3 element systems
# }
#
# properties = [
# "material_id", "formula_pretty", "structure",
# "band_gap", "formation_energy_per_atom", "density",
# "energy_above_hull", "volume"
# ]
#
# materials_df, structures = fetch_materials_data(
# api_key=MP_API_KEY,
# criteria=criteria,
# properties=properties,
# max_materials=1000
# )
# Generate dummy data (if API key is not available)
print("\nGenerating dummy data...")
from pymatgen.core import Lattice, Structure
np.random.seed(42)
n_materials = 1000
materials_df = pd.DataFrame({
'material_id': [f'mp-{1000+i}' for i in range(n_materials)],
'formula': [f'Material_{i}' for i in range(n_materials)],
'band_gap': np.random.exponential(2.0, n_materials),
'formation_energy_per_atom': np.random.normal(-1.5, 0.8, n_materials),
'density': np.random.normal(5.0, 1.5, n_materials).clip(0.1),
'energy_above_hull': np.random.exponential(0.05, n_materials),
'volume': np.random.normal(50, 15, n_materials).clip(10)
})
# Generate dummy crystal structures
structures = []
for i in range(n_materials):
lattice = Lattice.cubic(4.0 + np.random.rand())
n_atoms = np.random.randint(2, 8)
species = np.random.choice(['Li', 'Na', 'K', 'O', 'S', 'Cl', 'F'], n_atoms)
coords = np.random.rand(n_atoms, 3)
structure = Structure(lattice, species, coords)
structures.append(structure)
print(f"Generated {n_materials} dummy materials data")
print("\nFirst 5 rows of materials data:")
print(materials_df.head())
Output Example:
Generated 1000 dummy materials data
First 5 rows of materials data:
material_id formula band_gap formation_energy_per_atom density energy_above_hull volume
0 mp-1000 Material_0 1.234567 -1.678901 4.567 0.012345 45.678
1 mp-1001 Material_1 2.345678 -1.234567 5.678 0.023456 50.789
2 mp-1002 Material_2 0.987654 -2.012345 3.456 0.001234 48.901
3 mp-1003 Material_3 3.456789 -0.987654 6.789 0.034567 52.345
4 mp-1004 Material_4 1.876543 -1.456789 4.890 0.009876 47.234
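Before moving on, it is convenient to persist the collected (or generated) data so later sessions can skip this step. A minimal sketch using the pickle and pathlib imports from Code Example 1 (the file names are our own choice):
from pathlib import Path
import pickle

out_dir = Path('data')
out_dir.mkdir(exist_ok=True)
# Tabular properties go to CSV; pymatgen Structure objects are pickled
materials_df.to_csv(out_dir / 'materials_df.csv', index=False)
with open(out_dir / 'structures.pkl', 'wb') as f:
    pickle.dump(structures, f)
print(f"Saved materials data to {out_dir.resolve()}")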
Code Example 3: Exploratory Data Analysis
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - seaborn>=0.12.0
"""
Example: Code Example 3: Exploratory Data Analysis
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
import seaborn as sns
# Basic statistics
print("Basic statistics of materials data:")
print(materials_df.describe())
# Distribution of properties
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()
properties_to_plot = [
'band_gap', 'formation_energy_per_atom', 'density',
'energy_above_hull', 'volume'
]
for idx, prop in enumerate(properties_to_plot):
axes[idx].hist(materials_df[prop], bins=50, alpha=0.7,
edgecolor='black', color='steelblue')
axes[idx].set_xlabel(prop.replace('_', ' ').title(), fontsize=12, fontweight='bold')
axes[idx].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[idx].set_title(f'Distribution of {prop.replace("_", " ").title()}',
fontsize=14, fontweight='bold')
axes[idx].grid(True, alpha=0.3)
# Hide the last subplot
axes[-1].axis('off')
plt.tight_layout()
plt.savefig('materials_data_distributions.png', dpi=300, bbox_inches='tight')
print("\nSaved property distributions to materials_data_distributions.png")
plt.show()
# Correlation matrix
correlation_matrix = materials_df[properties_to_plot].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.3f', cmap='RdBu_r',
center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Material Properties',
fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('materials_correlation_matrix.png', dpi=300, bbox_inches='tight')
print("Saved correlation matrix to materials_correlation_matrix.png")
plt.show()
4.2 Graph Dataset Construction
Code Example 4: Crystal Structure to Graph Conversion (Optimized Version)
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
import torch
from torch_geometric.data import Data
from pymatgen.core import Structure
from pymatgen.core.periodic_table import Element
import numpy as np
class MaterialGraphConverter:
"""Class to convert crystal structures to graphs"""
def __init__(self, cutoff=5.0, max_neighbors=12):
self.cutoff = cutoff
self.max_neighbors = max_neighbors
# Define atomic properties (element symbol → feature vector)
self._build_atom_features()
def _build_atom_features(self):
"""Build atom properties dictionary"""
# Pre-calculate properties for common elements
common_elements = [
'H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne',
'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca',
'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn',
'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr'
]
self.atom_features = {}
for symbol in common_elements:
try:
element = Element(symbol)
features = [
element.Z / 100.0, # Normalized atomic number
element.atomic_mass / 200.0, # Normalized atomic mass
# pymatgen reports missing electronegativities as NaN, not None
element.X if element.X is not None and not np.isnan(element.X) else 0.0,
float(element.atomic_radius) if element.atomic_radius is not None else 0.0, # Atomic radius
element.group if element.group is not None else 0.0, # Group
element.row if element.row is not None else 0.0, # Period
]
self.atom_features[symbol] = features
except Exception as e:
print(f"Warning: Could not process element {symbol}: {e}")
self.atom_features[symbol] = [0.0] * 6
def get_atom_features(self, species):
"""Get feature vector from element symbol"""
symbol = str(species)
if symbol in self.atom_features:
return self.atom_features[symbol]
else:
# Unknown elements get zero vector
return [0.0] * 6
def structure_to_graph(self, structure, target=None):
"""
Convert pymatgen Structure to PyTorch Geometric Data
Parameters:
-----------
structure : pymatgen.Structure
Crystal structure
target : float, optional
Target value (for property prediction)
Returns:
--------
data : torch_geometric.data.Data
Graph data
"""
# Node features
node_features = []
for site in structure:
features = self.get_atom_features(site.specie)
node_features.append(features)
node_features = torch.tensor(node_features, dtype=torch.float)
# Edge construction
all_neighbors = structure.get_all_neighbors(self.cutoff)
edge_index = []
edge_attr = []
for i, neighbors in enumerate(all_neighbors):
# Sort by distance and take up to max_neighbors
neighbors = sorted(neighbors, key=lambda x: x.nn_distance)[:self.max_neighbors]
for neighbor in neighbors:
j = neighbor.index
distance = neighbor.nn_distance
edge_index.append([i, j])
edge_attr.append([distance])
# Handle case with no edges
if len(edge_index) == 0:
edge_index = torch.zeros((2, 0), dtype=torch.long)
edge_attr = torch.zeros((0, 1), dtype=torch.float)
else:
edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()
edge_attr = torch.tensor(edge_attr, dtype=torch.float)
# Create Data object
data = Data(x=node_features, edge_index=edge_index, edge_attr=edge_attr)
if target is not None:
data.y = torch.tensor([target], dtype=torch.float)
return data
# Instantiate and test
converter = MaterialGraphConverter(cutoff=5.0, max_neighbors=12)
# Test with first material
test_structure = structures[0]
test_target = materials_df.iloc[0]['band_gap']
test_graph = converter.structure_to_graph(test_structure, test_target)
print("Graph conversion test:")
print(f"Number of nodes: {test_graph.num_nodes}")
print(f"Number of edges: {test_graph.num_edges}")
print(f"Node feature dimension: {test_graph.x.shape[1]}")
print(f"Target value: {test_graph.y.item():.3f}")
Code Example 5: Custom Dataset Class
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
# - tqdm>=4.65.0
from torch_geometric.data import Dataset, Data
import torch
from tqdm import tqdm
import pickle
class MaterialsDataset(Dataset):
"""
Materials dataset
Parameters:
-----------
structures : list
List of pymatgen.Structure
targets : array-like
Target values
converter : MaterialGraphConverter
Graph converter
root : str
Data save directory
transform : callable, optional
Data transformation function
pre_transform : callable, optional
Pre-processing function
"""
def __init__(self, structures, targets, converter,
root='data/materials', transform=None, pre_transform=None):
self.structures = structures
self.targets = targets
self.converter = converter
super().__init__(root, transform, pre_transform)
@property
def raw_file_names(self):
return []
@property
def processed_file_names(self):
return [f'data_{i}.pt' for i in range(len(self.structures))]
def download(self):
pass
def process(self):
"""Convert structures to graphs and save"""
for idx in tqdm(range(len(self.structures)), desc="Converting to graphs"):
structure = self.structures[idx]
target = self.targets[idx]
# Convert to graph
data = self.converter.structure_to_graph(structure, target)
if self.pre_transform is not None:
data = self.pre_transform(data)
# Save
torch.save(data, self.processed_paths[idx])
def len(self):
return len(self.structures)
def get(self, idx):
data = torch.load(self.processed_paths[idx])
return data
# Create dataset
print("Creating dataset...")
# Target property (band gap)
targets = materials_df['band_gap'].values
# Instantiate dataset
dataset = MaterialsDataset(
structures=structures,
targets=targets,
converter=converter,
root='data/materials_dataset'
)
print(f"\nDataset creation complete!")
print(f"Dataset size: {len(dataset)}")
print(f"Sample data:")
print(dataset[0])
Output Example:
Converting to graphs: 100%|██████████| 1000/1000 [00:15<00:00, 63.42it/s]
Dataset creation complete!
Dataset size: 1000
Sample data:
Data(x=[5, 6], edge_index=[2, 48], edge_attr=[48, 1], y=[1])
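The on-disk Dataset above is robust for large collections, but for roughly 1,000 small graphs a plain Python list of Data objects is often simpler, and PyG's DataLoader accepts such a list directly. A minimal in-memory sketch, assuming the converter, structures, and targets from the examples above:
graph_list = [converter.structure_to_graph(s, float(t))
              for s, t in zip(structures, targets)]
print(f"Built {len(graph_list)} graphs in memory")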
Code Example 6: Dataset Splitting and DataLoader
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Code Example 6: Dataset Splitting and DataLoader
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
from torch_geometric.loader import DataLoader
from sklearn.model_selection import train_test_split
import torch
# Split dataset indices
train_idx, temp_idx = train_test_split(
range(len(dataset)), test_size=0.3, random_state=42
)
val_idx, test_idx = train_test_split(
temp_idx, test_size=0.5, random_state=42
)
# Create subsets
train_dataset = dataset[train_idx]
val_dataset = dataset[val_idx]
test_dataset = dataset[test_idx]
print(f"Data split:")
print(f" Training data: {len(train_dataset)}")
print(f" Validation data: {len(val_dataset)}")
print(f" Test data: {len(test_dataset)}")
# Create DataLoaders
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
# Test DataLoader
batch = next(iter(train_loader))
print(f"\nBatch data:")
print(f" Batch size: {batch.num_graphs}")
print(f" Total nodes: {batch.num_nodes}")
print(f" Total edges: {batch.num_edges}")
print(f" Node feature shape: {batch.x.shape}")
print(f" Target shape: {batch.y.shape}")
Output Example:
Data split:
Training data: 700
Validation data: 150
Test data: 150
Batch data:
Batch size: 32
Total nodes: 168
Total edges: 1568
Node feature shape: torch.Size([168, 6])
Target shape: torch.Size([32])
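PyTorch Geometric batches graphs by merging them into one large disconnected graph; batch.batch maps every node to its source graph, which is also what the pooling layer in the next section relies on. A quick check:
import torch

# Per-graph node counts recovered from the batch assignment vector
nodes_per_graph = torch.bincount(batch.batch)
print(f"Graphs in batch: {len(nodes_per_graph)}")
print(f"Nodes per graph (first 5): {nodes_per_graph[:5].tolist()}")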
4.3 GNN Model Training
Code Example 7: Improved CGCNN Model
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing, global_mean_pool, global_add_pool, global_max_pool
class ImprovedCGConv(MessagePassing):
"""Improved CGCNN convolution layer"""
def __init__(self, node_dim, edge_dim, hidden_dim=128):
super().__init__(aggr='add')
# Do not overwrite self.node_dim: MessagePassing uses that attribute
# internally to select the node dimension during propagate()
self.in_dim = node_dim
self.edge_dim = edge_dim
self.hidden_dim = hidden_dim
# Edge network
self.edge_network = nn.Sequential(
nn.Linear(2 * node_dim + edge_dim, hidden_dim),
nn.BatchNorm1d(hidden_dim),
nn.Softplus(),
nn.Linear(hidden_dim, hidden_dim),
nn.BatchNorm1d(hidden_dim),
nn.Softplus()
)
# Gate network
self.gate_network = nn.Sequential(
nn.Linear(2 * node_dim + edge_dim, hidden_dim),
nn.Sigmoid()
)
# Node update network
self.node_update = nn.Sequential(
nn.Linear(node_dim + hidden_dim, hidden_dim),
nn.BatchNorm1d(hidden_dim),
nn.Softplus(),
nn.Linear(hidden_dim, hidden_dim)
)
def forward(self, x, edge_index, edge_attr):
return self.propagate(edge_index, x=x, edge_attr=edge_attr)
def message(self, x_i, x_j, edge_attr):
"""Message function"""
# Concatenate source, target, and edge features
z = torch.cat([x_i, x_j, edge_attr], dim=1)
# Gating mechanism
gate = self.gate_network(z)
message = self.edge_network(z)
return gate * message
def update(self, aggr_out, x):
"""Node update"""
combined = torch.cat([x, aggr_out], dim=1)
return self.node_update(combined)
class ImprovedCGCNN(nn.Module):
"""Improved CGCNN model"""
def __init__(self, node_dim, edge_dim, hidden_dim=128,
num_conv=4, dropout=0.2, pooling='mean'):
super().__init__()
self.node_dim = node_dim
self.edge_dim = edge_dim
self.hidden_dim = hidden_dim
self.num_conv = num_conv
self.pooling = pooling
# Input embedding
self.node_embedding = nn.Sequential(
nn.Linear(node_dim, hidden_dim),
nn.BatchNorm1d(hidden_dim),
nn.Softplus()
)
# CGConv layers
self.conv_layers = nn.ModuleList([
ImprovedCGConv(hidden_dim, edge_dim, hidden_dim)
for _ in range(num_conv)
])
# Batch Normalization
self.bn_layers = nn.ModuleList([
nn.BatchNorm1d(hidden_dim)
for _ in range(num_conv)
])
# Dropout
self.dropout = nn.Dropout(dropout)
# Pooling
if pooling == 'mean':
self.pool = global_mean_pool
elif pooling == 'add':
self.pool = global_add_pool
elif pooling == 'max':
self.pool = global_max_pool
else:
raise ValueError(f"Unknown pooling: {pooling}")
# Output network
self.fc = nn.Sequential(
nn.Linear(hidden_dim, hidden_dim // 2),
nn.BatchNorm1d(hidden_dim // 2),
nn.Softplus(),
nn.Dropout(dropout),
nn.Linear(hidden_dim // 2, hidden_dim // 4),
nn.BatchNorm1d(hidden_dim // 4),
nn.Softplus(),
nn.Dropout(dropout),
nn.Linear(hidden_dim // 4, 1)
)
def forward(self, data, return_embedding=False):
"""
Forward pass
Parameters:
-----------
data : torch_geometric.data.Batch
Batch data
return_embedding : bool
Whether to return embedding
Returns:
--------
out : Tensor
Predictions
embedding : Tensor (optional)
Graph embedding
"""
x, edge_index, edge_attr, batch = data.x, data.edge_index, data.edge_attr, data.batch
# Input embedding
x = self.node_embedding(x)
# Apply CGConv layers
for conv, bn in zip(self.conv_layers, self.bn_layers):
x_new = conv(x, edge_index, edge_attr)
x_new = bn(x_new)
x_new = F.softplus(x_new)
x_new = self.dropout(x_new)
# Residual connection
x = x + x_new
# Graph-level pooling
graph_embedding = self.pool(x, batch)
# Output
out = self.fc(graph_embedding)
if return_embedding:
return out, graph_embedding
else:
return out
# Model instantiation
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ImprovedCGCNN(
node_dim=6,
edge_dim=1,
hidden_dim=128,
num_conv=4,
dropout=0.2,
pooling='mean'
).to(device)
print(f"Total model parameters: {sum(p.numel() for p in model.parameters()):,}")
# Test
batch = next(iter(train_loader)).to(device)
predictions, embeddings = model(batch, return_embedding=True)
print(f"\nPrediction shape: {predictions.shape}")
print(f"Embedding shape: {embeddings.shape}")
Code Example 8: Training Loop (with Early Stopping)
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
# - tqdm>=4.65.0
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau
import numpy as np
from tqdm import tqdm
import time
class EarlyStopping:
"""Early Stopping class"""
def __init__(self, patience=20, delta=0.001, path='best_model.pt'):
self.patience = patience
self.delta = delta
self.path = path
self.counter = 0
self.best_score = None
self.early_stop = False
self.val_loss_min = np.inf  # np.Inf was removed in NumPy 2.0
def __call__(self, val_loss, model):
score = -val_loss
if self.best_score is None:
self.best_score = score
self.save_checkpoint(val_loss, model)
elif score < self.best_score + self.delta:
self.counter += 1
if self.counter >= self.patience:
self.early_stop = True
else:
self.best_score = score
self.save_checkpoint(val_loss, model)
self.counter = 0
def save_checkpoint(self, val_loss, model):
torch.save(model.state_dict(), self.path)
self.val_loss_min = val_loss
def train_epoch(model, loader, criterion, optimizer, device):
"""Train one epoch"""
model.train()
total_loss = 0
total_samples = 0
for batch in loader:
batch = batch.to(device)
optimizer.zero_grad()
predictions = model(batch)
loss = criterion(predictions.view(-1), batch.y.view(-1))  # align [B, 1] predictions with [B] targets
loss.backward()
optimizer.step()
total_loss += loss.item() * batch.num_graphs
total_samples += batch.num_graphs
return total_loss / total_samples
def validate_epoch(model, loader, criterion, device):
"""Validate one epoch"""
model.eval()
total_loss = 0
total_samples = 0
with torch.no_grad():
for batch in loader:
batch = batch.to(device)
predictions = model(batch)
loss = criterion(predictions.view(-1), batch.y.view(-1))  # align [B, 1] predictions with [B] targets
total_loss += loss.item() * batch.num_graphs
total_samples += batch.num_graphs
return total_loss / total_samples
# Training setup
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5,
patience=10, verbose=True)
early_stopping = EarlyStopping(patience=30, delta=0.001,
path='best_cgcnn_model.pt')
# Training loop
num_epochs = 200
train_losses = []
val_losses = []
print("Starting training...")
start_time = time.time()
for epoch in range(num_epochs):
# Train
train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
train_losses.append(train_loss)
# Validate
val_loss = validate_epoch(model, val_loader, criterion, device)
val_losses.append(val_loss)
# Learning rate scheduler
scheduler.step(val_loss)
# Early Stopping
early_stopping(val_loss, model)
if (epoch + 1) % 10 == 0:
elapsed_time = time.time() - start_time
print(f"Epoch [{epoch+1}/{num_epochs}] - "
f"Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f} - "
f"Time: {elapsed_time:.1f}s")
if early_stopping.early_stop:
print(f"Early stopping at epoch {epoch+1}")
break
total_time = time.time() - start_time
print(f"\nTraining complete! Total time: {total_time/60:.1f} minutes")
# Load best model
model.load_state_dict(torch.load('best_cgcnn_model.pt'))
print(f"Loaded best model (Val Loss: {early_stopping.val_loss_min:.4f})")
Output Example:
Starting training...
Epoch [10/200] - Train Loss: 1.2345, Val Loss: 1.3456 - Time: 15.3s
Epoch [20/200] - Train Loss: 0.9876, Val Loss: 1.1234 - Time: 30.7s
Epoch [30/200] - Train Loss: 0.7654, Val Loss: 1.0123 - Time: 46.1s
Epoch [40/200] - Train Loss: 0.6543, Val Loss: 0.9876 - Time: 61.5s
Epoch [50/200] - Train Loss: 0.5678, Val Loss: 0.9654 - Time: 76.9s
Early stopping at epoch 85
Training complete! Total time: 2.3 minutes
Loaded best model (Val Loss: 0.9234)
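The EarlyStopping class saves only the model weights. To pause and resume training, a fuller checkpoint that includes optimizer and scheduler state is useful; a sketch (the file name is our own choice):
checkpoint = {
    'epoch': epoch,
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
    'scheduler_state': scheduler.state_dict(),
    'train_losses': train_losses,
    'val_losses': val_losses,
}
torch.save(checkpoint, 'cgcnn_checkpoint.pt')
# To resume later:
# ckpt = torch.load('cgcnn_checkpoint.pt')
# model.load_state_dict(ckpt['model_state'])
# optimizer.load_state_dict(ckpt['optimizer_state'])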
Code Example 9: Training Curve Visualization
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 9: Training Curve Visualization
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
import matplotlib.pyplot as plt
import numpy as np
# Training curve
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Loss curve
epochs = range(1, len(train_losses) + 1)
ax1.plot(epochs, train_losses, label='Train Loss', linewidth=2, color='blue')
ax1.plot(epochs, val_losses, label='Validation Loss', linewidth=2, color='red')
ax1.set_xlabel('Epoch', fontsize=14, fontweight='bold')
ax1.set_ylabel('MSE Loss', fontsize=14, fontweight='bold')
ax1.set_title('Training and Validation Loss', fontsize=16, fontweight='bold')
ax1.legend(fontsize=12)
ax1.grid(True, alpha=0.3)
# Log scale
ax2.semilogy(epochs, train_losses, label='Train Loss', linewidth=2, color='blue')
ax2.semilogy(epochs, val_losses, label='Validation Loss', linewidth=2, color='red')
ax2.set_xlabel('Epoch', fontsize=14, fontweight='bold')
ax2.set_ylabel('MSE Loss (log scale)', fontsize=14, fontweight='bold')
ax2.set_title('Training Curve (Log Scale)', fontsize=16, fontweight='bold')
ax2.legend(fontsize=12)
ax2.grid(True, alpha=0.3, which='both')
plt.tight_layout()
plt.savefig('training_curve.png', dpi=300, bbox_inches='tight')
print("Saved training curve to training_curve.png")
plt.show()
Code Example 10: Evaluation on Test Data
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np
def evaluate_model(model, loader, device):
"""Evaluate model"""
model.eval()
all_predictions = []
all_targets = []
with torch.no_grad():
for batch in loader:
batch = batch.to(device)
predictions = model(batch)
all_predictions.append(predictions.cpu().numpy())
all_targets.append(batch.y.cpu().numpy())
predictions = np.concatenate(all_predictions, axis=0).flatten()
targets = np.concatenate(all_targets, axis=0).flatten()
return predictions, targets
# Evaluate on test data
test_predictions, test_targets = evaluate_model(model, test_loader, device)
# Calculate evaluation metrics
mae = mean_absolute_error(test_targets, test_predictions)
rmse = np.sqrt(mean_squared_error(test_targets, test_predictions))
r2 = r2_score(test_targets, test_predictions)
print("Evaluation on test data:")
print(f" MAE: {mae:.4f}")
print(f" RMSE: {rmse:.4f}")
print(f" R²: {r2:.4f}")
# Plot predictions vs actual values
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Scatter plot
ax1.scatter(test_targets, test_predictions, alpha=0.6, s=50,
edgecolors='black', linewidth=0.5, color='steelblue')
# Ideal line
min_val = min(test_targets.min(), test_predictions.min())
max_val = max(test_targets.max(), test_predictions.max())
ax1.plot([min_val, max_val], [min_val, max_val],
'r--', linewidth=2, label='Ideal')
ax1.set_xlabel('True Band Gap (eV)', fontsize=14, fontweight='bold')
ax1.set_ylabel('Predicted Band Gap (eV)', fontsize=14, fontweight='bold')
ax1.set_title(f'Predictions vs True Values\nR²={r2:.3f}, MAE={mae:.3f}',
fontsize=16, fontweight='bold')
ax1.legend(fontsize=12)
ax1.grid(True, alpha=0.3)
# Residual plot
residuals = test_predictions - test_targets
ax2.scatter(test_targets, residuals, alpha=0.6, s=50,
edgecolors='black', linewidth=0.5, color='coral')
ax2.axhline(y=0, color='r', linestyle='--', linewidth=2)
ax2.set_xlabel('True Band Gap (eV)', fontsize=14, fontweight='bold')
ax2.set_ylabel('Residuals (eV)', fontsize=14, fontweight='bold')
ax2.set_title('Residual Plot', fontsize=16, fontweight='bold')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('test_evaluation.png', dpi=300, bbox_inches='tight')
print("\nSaved evaluation results to test_evaluation.png")
plt.show()
4.4 Embedding Extraction and Dimensionality Reduction
Code Example 11: Extracting Embeddings from All Data
def extract_all_embeddings(model, dataset, device, batch_size=64):
"""
Extract embeddings from entire dataset
Parameters:
-----------
model : nn.Module
Trained model
dataset : Dataset
Dataset
device : torch.device
Device
batch_size : int
Batch size
Returns:
--------
embeddings : np.ndarray
Embedding vectors
targets : np.ndarray
Target values
"""
model.eval()
loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
all_embeddings = []
all_targets = []
with torch.no_grad():
for batch in tqdm(loader, desc="Extracting embeddings"):
batch = batch.to(device)
_, embeddings = model(batch, return_embedding=True)
all_embeddings.append(embeddings.cpu().numpy())
all_targets.append(batch.y.cpu().numpy())
embeddings = np.concatenate(all_embeddings, axis=0)
targets = np.concatenate(all_targets, axis=0).flatten()
return embeddings, targets
# Extract embeddings from all data
embeddings, targets = extract_all_embeddings(model, dataset, device)
print(f"Embedding shape: {embeddings.shape}")
print(f"Target shape: {targets.shape}")
# Save
np.save('cgcnn_embeddings.npy', embeddings)
np.save('cgcnn_targets.npy', targets)
print("\nSaved embeddings")
Code Example 12: Dimensionality Reduction with PCA
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 12: Dimensionality Reduction with PCA
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
# Run PCA
pca = PCA(n_components=min(50, embeddings.shape[1]))
embeddings_pca = pca.fit_transform(embeddings)
# Analyze explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
cumsum_variance = np.cumsum(explained_variance_ratio)
# Scree plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Individual explained variance
ax1.bar(range(1, 21), explained_variance_ratio[:20],
alpha=0.7, edgecolor='black', color='steelblue')
ax1.set_xlabel('Principal Component', fontsize=14, fontweight='bold')
ax1.set_ylabel('Explained Variance Ratio', fontsize=14, fontweight='bold')
ax1.set_title('PCA Scree Plot (Top 20 Components)',
fontsize=16, fontweight='bold')
ax1.grid(True, alpha=0.3, axis='y')
# Cumulative explained variance
ax2.plot(range(1, len(cumsum_variance) + 1), cumsum_variance,
marker='o', linewidth=2, markersize=4, color='darkred')
ax2.axhline(y=0.95, color='green', linestyle='--', linewidth=2,
label='95% threshold', alpha=0.7)
ax2.set_xlabel('Number of Components', fontsize=14, fontweight='bold')
ax2.set_ylabel('Cumulative Explained Variance', fontsize=14, fontweight='bold')
ax2.set_title('Cumulative Variance Explained',
fontsize=16, fontweight='bold')
ax2.legend(fontsize=12)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('pca_analysis.png', dpi=300, bbox_inches='tight')
print(f"Saved PCA analysis to pca_analysis.png")
print(f"\nNumber of principal components explaining 95% variance: {np.argmax(cumsum_variance >= 0.95) + 1}")
plt.show()
# 2D PCA visualization
pca_2d = PCA(n_components=2)
embeddings_pca_2d = pca_2d.fit_transform(embeddings)
plt.figure(figsize=(12, 9))
scatter = plt.scatter(embeddings_pca_2d[:, 0], embeddings_pca_2d[:, 1],
c=targets, cmap='viridis', s=50, alpha=0.6,
edgecolors='black', linewidth=0.5)
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)',
fontsize=14, fontweight='bold')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)',
fontsize=14, fontweight='bold')
plt.title('PCA: 2D Projection of CGCNN Embeddings',
fontsize=16, fontweight='bold')
plt.colorbar(scatter, label='Band Gap (eV)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('pca_2d_projection.png', dpi=300, bbox_inches='tight')
print("Saved PCA 2D projection to pca_2d_projection.png")
plt.show()
Code Example 13: Dimensionality Reduction with UMAP
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 13: Dimensionality Reduction with UMAP
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import umap
import matplotlib.pyplot as plt
import numpy as np
# UMAP's layout is sensitive to n_neighbors and min_dist; typical values to
# explore are n_neighbors in [5, 15, 30, 50] and min_dist in [0.0, 0.1, 0.3]
# (a parameter-sweep sketch follows this example)
# Run with a commonly used default combination
print("Running UMAP...")
reducer = umap.UMAP(
n_components=2,
n_neighbors=15,
min_dist=0.1,
metric='euclidean',
random_state=42,
verbose=True
)
embeddings_umap = reducer.fit_transform(embeddings)
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(18, 7))
# Color by target value
scatter1 = axes[0].scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c=targets, cmap='viridis', s=50, alpha=0.6,
edgecolors='black', linewidth=0.5)
axes[0].set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
axes[0].set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
axes[0].set_title('UMAP: Colored by Band Gap',
fontsize=16, fontweight='bold')
axes[0].grid(True, alpha=0.3)
cbar1 = plt.colorbar(scatter1, ax=axes[0])
cbar1.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
# Color by category
# Use open-ended bins so no value falls outside a category (with fixed bins,
# pd.cut would return NaN for band gaps above 10 eV)
band_gap_categories = pd.cut(targets, bins=[-np.inf, 1, 2, 3, np.inf],
labels=['Small', 'Medium', 'Large', 'Very Large'])
colors_map = {'Small': 'red', 'Medium': 'orange', 'Large': 'yellow', 'Very Large': 'green'}
for category in ['Small', 'Medium', 'Large', 'Very Large']:
mask = band_gap_categories == category
axes[1].scatter(embeddings_umap[mask, 0], embeddings_umap[mask, 1],
c=colors_map[category], label=category, s=50, alpha=0.7,
edgecolors='black', linewidth=0.5)
axes[1].set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
axes[1].set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
axes[1].set_title('UMAP: Colored by Band Gap Category',
fontsize=16, fontweight='bold')
axes[1].legend(title='Band Gap', fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('umap_2d_projection.png', dpi=300, bbox_inches='tight')
print("\nSaved UMAP 2D projection to umap_2d_projection.png")
plt.show()
# Save UMAP results
np.save('umap_embeddings_2d.npy', embeddings_umap)
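As noted above, n_neighbors and min_dist strongly shape the UMAP layout. A small sweep sketch that renders a few combinations side by side (runtime grows with the grid size; the chosen values are illustrative):
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
for row, n_nb in enumerate([5, 50]):
    for col, m_dist in enumerate([0.0, 0.1, 0.5]):
        emb = umap.UMAP(n_components=2, n_neighbors=n_nb,
                        min_dist=m_dist, random_state=42).fit_transform(embeddings)
        ax = axes[row, col]
        ax.scatter(emb[:, 0], emb[:, 1], c=targets, cmap='viridis',
                   s=10, alpha=0.6, edgecolors='none')
        ax.set_title(f'n_neighbors={n_nb}, min_dist={m_dist}', fontsize=12)
plt.tight_layout()
plt.savefig('umap_parameter_sweep.png', dpi=150, bbox_inches='tight')
plt.show()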
Code Example 14: Dimensionality Reduction with t-SNE
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
"""
Example: Code Example 14: Dimensionality Reduction with t-SNE
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import time
# Run t-SNE (speed up with subset)
subset_size = min(1000, len(embeddings))
subset_indices = np.random.choice(len(embeddings), subset_size, replace=False)
embeddings_subset = embeddings[subset_indices]
targets_subset = targets[subset_indices]
print(f"Running t-SNE (subset size: {subset_size})...")
start_time = time.time()
tsne = TSNE(
n_components=2,
perplexity=30,
n_iter=1000,  # renamed to max_iter in scikit-learn >= 1.5
random_state=42,
verbose=1
)
embeddings_tsne = tsne.fit_transform(embeddings_subset)
elapsed_time = time.time() - start_time
print(f"t-SNE complete (elapsed time: {elapsed_time:.1f} seconds)")
# Visualization
plt.figure(figsize=(12, 9))
scatter = plt.scatter(embeddings_tsne[:, 0], embeddings_tsne[:, 1],
c=targets_subset, cmap='plasma', s=50, alpha=0.6,
edgecolors='black', linewidth=0.5)
plt.xlabel('t-SNE 1', fontsize=14, fontweight='bold')
plt.ylabel('t-SNE 2', fontsize=14, fontweight='bold')
plt.title('t-SNE: 2D Projection of CGCNN Embeddings',
fontsize=16, fontweight='bold')
plt.colorbar(scatter, label='Band Gap (eV)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('tsne_2d_projection.png', dpi=300, bbox_inches='tight')
print("Saved t-SNE 2D projection to tsne_2d_projection.png")
plt.show()
Code Example 15: Comparison of Dimensionality Reduction Methods
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 15: Comparison of Dimensionality Reduction Methods
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
import numpy as np
# Display results from three methods side by side
fig, axes = plt.subplots(1, 3, figsize=(21, 6))
# PCA
scatter = axes[0].scatter(embeddings_pca_2d[:, 0], embeddings_pca_2d[:, 1],
c=targets, cmap='viridis', s=30, alpha=0.6,
edgecolors='none')
axes[0].set_xlabel('PC1', fontsize=12, fontweight='bold')
axes[0].set_ylabel('PC2', fontsize=12, fontweight='bold')
axes[0].set_title('PCA', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
# UMAP
scatter = axes[1].scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c=targets, cmap='viridis', s=30, alpha=0.6,
edgecolors='none')
axes[1].set_xlabel('UMAP 1', fontsize=12, fontweight='bold')
axes[1].set_ylabel('UMAP 2', fontsize=12, fontweight='bold')
axes[1].set_title('UMAP', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
# t-SNE
scatter = axes[2].scatter(embeddings_tsne[:, 0], embeddings_tsne[:, 1],
c=targets_subset, cmap='viridis', s=30, alpha=0.6,
edgecolors='none')
axes[2].set_xlabel('t-SNE 1', fontsize=12, fontweight='bold')
axes[2].set_ylabel('t-SNE 2', fontsize=12, fontweight='bold')
axes[2].set_title('t-SNE', fontsize=14, fontweight='bold')
axes[2].grid(True, alpha=0.3)
# Common colorbar
fig.subplots_adjust(right=0.9)
cbar_ax = fig.add_axes([0.92, 0.15, 0.02, 0.7])
cbar = fig.colorbar(scatter, cax=cbar_ax)
cbar.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
plt.savefig('dimensionality_reduction_comparison.png', dpi=300, bbox_inches='tight')
print("Saved dimensionality reduction comparison to dimensionality_reduction_comparison.png")
plt.show()
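The visual comparison can be complemented with a quantitative check: scikit-learn's trustworthiness score measures how well local neighborhoods of the high-dimensional space are preserved in the 2D projection (1.0 is perfect). A sketch; n_neighbors=15 is an arbitrary but common choice:
from sklearn.manifold import trustworthiness

for name, emb2d in [('PCA', embeddings_pca_2d), ('UMAP', embeddings_umap)]:
    score = trustworthiness(embeddings, emb2d, n_neighbors=15)
    print(f"{name} trustworthiness: {score:.3f}")
# t-SNE was run on a subset, so compare against the matching subset
score = trustworthiness(embeddings_subset, embeddings_tsne, n_neighbors=15)
print(f"t-SNE trustworthiness: {score:.3f}")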
4.5 Materials Space Analysis
Code Example 16: Clustering and Property Analysis
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - pandas>=2.0.0, <2.2.0
# - seaborn>=0.12.0
"""
Example: Code Example 16: Clustering and Property Analysis
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
from sklearn.cluster import KMeans
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# K-Means clustering (on UMAP space)
n_clusters = 6
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=20)
cluster_labels = kmeans.fit_predict(embeddings_umap)
# Visualization by cluster
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))
# Color by cluster label
colors = plt.cm.Set3(np.linspace(0, 1, n_clusters))
for cluster_id in range(n_clusters):
mask = cluster_labels == cluster_id
ax1.scatter(embeddings_umap[mask, 0], embeddings_umap[mask, 1],
c=[colors[cluster_id]], label=f'Cluster {cluster_id}',
s=60, alpha=0.7, edgecolors='black', linewidth=0.5)
# Cluster centers: k-means was fit directly in the 2D UMAP space, so the
# centroids are already UMAP coordinates and can be plotted as-is
# (transforming them through a UMAP fitted on the 128-D embeddings would fail)
centers_umap = kmeans.cluster_centers_
ax1.scatter(centers_umap[:, 0], centers_umap[:, 1],
c='red', marker='X', s=300, edgecolors='black',
linewidth=2, label='Centroids', zorder=10)
ax1.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax1.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax1.set_title(f'K-Means Clustering (k={n_clusters})',
fontsize=16, fontweight='bold')
ax1.legend(fontsize=10, loc='best', ncol=2)
ax1.grid(True, alpha=0.3)
# Property distribution by cluster
cluster_df = pd.DataFrame({
'cluster': cluster_labels,
'band_gap': targets
})
sns.boxplot(data=cluster_df, x='cluster', y='band_gap',
ax=ax2, palette='Set3')
sns.swarmplot(data=cluster_df, x='cluster', y='band_gap',
ax=ax2, color='black', alpha=0.3, size=2)
ax2.set_xlabel('Cluster ID', fontsize=14, fontweight='bold')
ax2.set_ylabel('Band Gap (eV)', fontsize=14, fontweight='bold')
ax2.set_title('Band Gap Distribution by Cluster',
fontsize=16, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.savefig('clustering_analysis.png', dpi=300, bbox_inches='tight')
print("Saved clustering analysis to clustering_analysis.png")
plt.show()
# Cluster statistics
print("\nBand gap statistics by cluster:")
cluster_stats = cluster_df.groupby('cluster')['band_gap'].agg([
'count', 'mean', 'std', 'min', 'max'
])
print(cluster_stats.round(3))
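The choice k=6 above was ad hoc. The silhouette score offers a quick way to compare candidate cluster counts (higher is better), though the resulting maps should always be inspected visually as well; a sketch:
from sklearn.metrics import silhouette_score

for k in range(3, 10):
    labels_k = KMeans(n_clusters=k, random_state=42, n_init=20).fit_predict(embeddings_umap)
    score = silhouette_score(embeddings_umap, labels_k)
    print(f"k={k}: silhouette score = {score:.3f}")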
Code Example 17: Material Property and Cluster Relationship Analysis
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - pandas>=2.0.0, <2.2.0
# - seaborn>=0.12.0
"""
Example: Code Example 17: Material Property and Cluster Relationship Analysis
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Integrate additional material properties into DataFrame
materials_analysis_df = materials_df.copy()
materials_analysis_df['cluster'] = cluster_labels
materials_analysis_df['umap1'] = embeddings_umap[:, 0]
materials_analysis_df['umap2'] = embeddings_umap[:, 1]
# Statistics of multiple properties by cluster
properties_to_analyze = [
'band_gap', 'formation_energy_per_atom',
'density', 'energy_above_hull', 'volume'
]
cluster_property_stats = materials_analysis_df.groupby('cluster')[
properties_to_analyze
].mean()
# Visualize with heatmap; z-score each property across clusters first so that
# quantities on very different scales share one meaningful color map
stats_z = (cluster_property_stats - cluster_property_stats.mean()) / cluster_property_stats.std()
plt.figure(figsize=(12, 8))
sns.heatmap(stats_z.T, annot=True, fmt='.2f',
cmap='RdYlGn_r', center=0, linewidths=1,
cbar_kws={"label": "Z-scored Mean Value"})
plt.title('Average Material Properties by Cluster',
fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Cluster ID', fontsize=14, fontweight='bold')
plt.ylabel('Property', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('cluster_properties_heatmap.png', dpi=300, bbox_inches='tight')
print("Saved cluster properties heatmap to cluster_properties_heatmap.png")
plt.show()
Code Example 18: Creating Density Maps
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 18: Creating Density Maps
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
import numpy as np
# Density estimation in UMAP space
xy = np.vstack([embeddings_umap[:, 0], embeddings_umap[:, 1]])
density = gaussian_kde(xy)(xy)
# Sort by density (draw high-density points on top)
idx = density.argsort()
x, y, z = embeddings_umap[idx, 0], embeddings_umap[idx, 1], density[idx]
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))
# Density map
scatter1 = ax1.scatter(x, y, c=z, cmap='hot', s=50, alpha=0.7,
edgecolors='black', linewidth=0.3)
ax1.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax1.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax1.set_title('Materials Space Density Map',
fontsize=16, fontweight='bold')
ax1.grid(True, alpha=0.3)
cbar1 = plt.colorbar(scatter1, ax=ax1)
cbar1.set_label('Point Density', fontsize=12, fontweight='bold')
# Density map + Band gap
scatter2 = ax2.scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c=targets, s=50 + density*1000, alpha=0.6,
cmap='viridis', edgecolors='black', linewidth=0.5)
ax2.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax2.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax2.set_title('Band Gap with Density (size = density)',
fontsize=16, fontweight='bold')
ax2.grid(True, alpha=0.3)
cbar2 = plt.colorbar(scatter2, ax=ax2)
cbar2.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig('density_map.png', dpi=300, bbox_inches='tight')
print("Saved density map to density_map.png")
plt.show()
Code Example 19: Searching for Similar Materials
# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0
from sklearn.neighbors import NearestNeighbors
import pandas as pd
def find_similar_materials_in_embedding_space(
query_idx, embeddings, materials_df, k=10
):
"""
Search for similar materials in embedding space
Parameters:
-----------
query_idx : int
Query material index
embeddings : np.ndarray
Embedding vectors
materials_df : pd.DataFrame
Materials data
k : int
Number of neighbors to search
Returns:
--------
results_df : pd.DataFrame
Information on neighboring materials
"""
nbrs = NearestNeighbors(n_neighbors=k+1, metric='cosine').fit(embeddings)
distances, indices = nbrs.kneighbors(embeddings[query_idx:query_idx+1])
# Create results DataFrame
results = []
for i, (idx, dist) in enumerate(zip(indices[0], distances[0])):
if i == 0: # Skip query itself
continue
result = {
'rank': i,
'material_id': materials_df.iloc[idx]['material_id'],
'formula': materials_df.iloc[idx]['formula'],
'band_gap': materials_df.iloc[idx]['band_gap'],
'formation_energy': materials_df.iloc[idx]['formation_energy_per_atom'],
'density': materials_df.iloc[idx]['density'],
'distance': dist
}
results.append(result)
results_df = pd.DataFrame(results)
return results_df
# Usage example: Randomly select materials and search for similar materials
np.random.seed(42)
query_indices = np.random.choice(len(materials_df), 3, replace=False)
print("Similar materials search examples:\n")
for query_idx in query_indices:
query_material = materials_df.iloc[query_idx]
print(f"Query material:")
print(f" Material ID: {query_material['material_id']}")
print(f" Formula: {query_material['formula']}")
print(f" Band Gap: {query_material['band_gap']:.3f} eV")
print(f"\nSimilar materials (Top 5):")
similar_materials = find_similar_materials_in_embedding_space(
query_idx, embeddings, materials_df, k=5
)
print(similar_materials[['rank', 'material_id', 'formula', 'band_gap', 'distance']].to_string(index=False))
print("\n" + "="*80 + "\n")
Code Example 20: Material Recommendation System
def recommend_materials_by_property(
target_property_value,
property_name,
embeddings,
materials_df,
n_recommendations=10,
property_weight=0.7,
embedding_weight=0.3
):
"""
Recommend materials based on target property value
Parameters:
-----------
target_property_value : float
Target property value
property_name : str
Property name
embeddings : np.ndarray
Embedding vectors
materials_df : pd.DataFrame
Materials data
n_recommendations : int
Number of recommendations
property_weight : float
Weight for property value
embedding_weight : float
Weight for embedding distance
Returns:
--------
recommendations_df : pd.DataFrame
List of recommended materials
"""
# Property value difference
property_diff = np.abs(materials_df[property_name].values - target_property_value)
property_diff_normalized = property_diff / property_diff.max()
# Index of material closest to target property value
closest_idx = property_diff.argmin()
# Distance in embedding space
from sklearn.metrics.pairwise import cosine_distances
embedding_distances = cosine_distances(
embeddings[closest_idx:closest_idx+1],
embeddings
).flatten()
embedding_distances_normalized = embedding_distances / embedding_distances.max()
# Combined score (smaller is better)
combined_score = (
property_weight * property_diff_normalized +
embedding_weight * embedding_distances_normalized
)
# Select top n_recommendations
top_indices = combined_score.argsort()[:n_recommendations]
# Create results DataFrame
recommendations = []
for rank, idx in enumerate(top_indices, 1):
rec = {
'rank': rank,
'material_id': materials_df.iloc[idx]['material_id'],
'formula': materials_df.iloc[idx]['formula'],
property_name: materials_df.iloc[idx][property_name],
'property_diff': property_diff[idx],
'embedding_distance': embedding_distances[idx],
'combined_score': combined_score[idx]
}
recommendations.append(rec)
recommendations_df = pd.DataFrame(recommendations)
return recommendations_df
# Usage example: Recommend materials close to band gap 2.0 eV
target_bandgap = 2.0
print(f"Target band gap: {target_bandgap} eV")
print("\nRecommended materials (Top 10):\n")
recommendations = recommend_materials_by_property(
target_property_value=target_bandgap,
property_name='band_gap',
embeddings=embeddings,
materials_df=materials_df,
n_recommendations=10,
property_weight=0.7,
embedding_weight=0.3
)
print(recommendations[['rank', 'material_id', 'formula', 'band_gap',
'property_diff', 'combined_score']].to_string(index=False))
4.6 Interactive Visualization
Code Example 21: 3D UMAP with Plotly
# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0
# - plotly>=5.14.0
"""
Example: Code Example 21: 3D UMAP with Plotly
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import plotly.express as px
import plotly.graph_objects as go
import umap
import pandas as pd
# 3D UMAP
print("Running 3D UMAP...")
reducer_3d = umap.UMAP(
n_components=3,
n_neighbors=15,
min_dist=0.1,
random_state=42,
verbose=False
)
embeddings_umap_3d = reducer_3d.fit_transform(embeddings)
# Create DataFrame
df_3d = pd.DataFrame({
'UMAP1': embeddings_umap_3d[:, 0],
'UMAP2': embeddings_umap_3d[:, 1],
'UMAP3': embeddings_umap_3d[:, 2],
'Band_Gap': targets,
'Material_ID': materials_df['material_id'].values,
'Formula': materials_df['formula'].values,
'Formation_Energy': materials_df['formation_energy_per_atom'].values,
'Density': materials_df['density'].values,
'Cluster': cluster_labels
})
# Interactive 3D plot
fig = px.scatter_3d(
df_3d,
x='UMAP1', y='UMAP2', z='UMAP3',
color='Band_Gap',
size='Density',
hover_data=['Material_ID', 'Formula', 'Formation_Energy', 'Cluster'],
color_continuous_scale='Viridis',
title='Interactive 3D UMAP: Materials Space Explorer'
)
fig.update_traces(marker=dict(line=dict(width=0.5, color='DarkSlateGrey')))
fig.update_layout(
scene=dict(
xaxis_title='UMAP 1',
yaxis_title='UMAP 2',
zaxis_title='UMAP 3',
xaxis=dict(backgroundcolor="rgb(230, 230, 230)", gridcolor="white"),
yaxis=dict(backgroundcolor="rgb(230, 230, 230)", gridcolor="white"),
zaxis=dict(backgroundcolor="rgb(230, 230, 230)", gridcolor="white"),
),
width=1000,
height=800,
font=dict(size=12)
)
fig.write_html('materials_3d_interactive.html')
print("Saved interactive 3D visualization to materials_3d_interactive.html")
fig.show()
Code Example 22: Interactive Scatter Plot with Bokeh
from bokeh.plotting import figure, output_file, save
from bokeh.models import HoverTool, ColorBar, LinearColorMapper, ColumnDataSource
from bokeh.palettes import Viridis256
from bokeh.io import show
from bokeh.layouts import column, row
from bokeh.models.widgets import Select
# Color mapper
color_mapper = LinearColorMapper(
palette=Viridis256,
low=targets.min(),
high=targets.max()
)
# Data source
source = ColumnDataSource(data=dict(
x=embeddings_umap[:, 0],
y=embeddings_umap[:, 1],
material_id=materials_df['material_id'].values,
formula=materials_df['formula'].values,
band_gap=targets,
formation_energy=materials_df['formation_energy_per_atom'].values,
density=materials_df['density'].values,
volume=materials_df['volume'].values,
cluster=cluster_labels
))
# Create plot
output_file('materials_interactive.html')
p = figure(
width=1000,
height=800,
title='Interactive Materials Space Explorer (UMAP)',
tools='pan,wheel_zoom,box_zoom,box_select,lasso_select,reset,save'
)
# Scatter plot (p.scatter works in both Bokeh 2.x and 3.x; p.circle with a
# size argument is deprecated in newer Bokeh releases)
circles = p.scatter(
'x', 'y',
size=8,
source=source,
fill_color={'field': 'band_gap', 'transform': color_mapper},
fill_alpha=0.7,
line_color='black',
line_width=0.5
)
# Hover tool
hover = HoverTool(tooltips=[
('Material ID', '@material_id'),
('Formula', '@formula'),
('Band Gap', '@band_gap{0.00} eV'),
('Formation E', '@formation_energy{0.00} eV/atom'),
('Density', '@density{0.00} g/cm³'),
('Volume', '@volume{0.0} Å³'),
('Cluster', '@cluster')
])
p.add_tools(hover)
# Color bar
color_bar = ColorBar(
color_mapper=color_mapper,
label_standoff=12,
title='Band Gap (eV)',
location=(0, 0)
)
p.add_layout(color_bar, 'right')
# Axis labels
p.xaxis.axis_label = 'UMAP 1'
p.yaxis.axis_label = 'UMAP 2'
p.title.text_font_size = '16pt'
p.xaxis.axis_label_text_font_size = '14pt'
p.yaxis.axis_label_text_font_size = '14pt'
save(p)
print("Saved interactive UMAP to materials_interactive.html")
show(p)
Code Example 23: Dashboard with Dash (Optional)
# Requirements:
# - Python 3.9+
# - dash>=2.11.0
# - pandas>=2.0.0, <2.2.0
# - plotly>=5.14.0
# Install Dash (first time only)
# !pip install dash
"""
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
# Create Dash app
app = dash.Dash(__name__)
# Prepare DataFrame
df_dash = pd.DataFrame({
'umap1': embeddings_umap[:, 0],
'umap2': embeddings_umap[:, 1],
'material_id': materials_df['material_id'].values,
'formula': materials_df['formula'].values,
'band_gap': targets,
'formation_energy': materials_df['formation_energy_per_atom'].values,
'density': materials_df['density'].values,
'cluster': cluster_labels
})
# Layout
app.layout = html.Div([
html.H1('Materials Space Explorer Dashboard'),
html.Div([
html.Label('Color by:'),
dcc.Dropdown(
id='color-dropdown',
options=[
{'label': 'Band Gap', 'value': 'band_gap'},
{'label': 'Formation Energy', 'value': 'formation_energy'},
{'label': 'Density', 'value': 'density'},
{'label': 'Cluster', 'value': 'cluster'}
],
value='band_gap'
),
], style={'width': '30%', 'display': 'inline-block'}),
dcc.Graph(id='umap-scatter'),
html.Div(id='material-info')
])
# Callback
@app.callback(
Output('umap-scatter', 'figure'),
Input('color-dropdown', 'value')
)
def update_scatter(color_by):
fig = px.scatter(
df_dash,
x='umap1',
y='umap2',
color=color_by,
hover_data=['material_id', 'formula', 'band_gap', 'formation_energy'],
color_continuous_scale='Viridis',
title=f'UMAP colored by {color_by}'
)
fig.update_traces(marker=dict(size=8, line=dict(width=0.5, color='black')))
return fig
# Run app
if __name__ == '__main__':
app.run(debug=True)  # use app.run_server(debug=True) on older Dash releases
"""
print("Displayed Dash dashboard code example")
print("Uncomment and run to execute")
4.7 Advanced Analysis and Applications
Code Example 24: Voronoi Tessellation of Materials Space
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 24: Voronoi Tessellation of Materials Space
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
from scipy.spatial import Voronoi, voronoi_plot_2d
import matplotlib.pyplot as plt
import numpy as np
# Create Voronoi diagram with subset (reduce computation)
subset_size = 100
subset_indices = np.random.choice(len(embeddings_umap), subset_size, replace=False)
points = embeddings_umap[subset_indices]
targets_subset = targets[subset_indices]
# Compute Voronoi diagram
vor = Voronoi(points)
# Plot
fig, ax = plt.subplots(figsize=(14, 12))
# Voronoi cells
voronoi_plot_2d(vor, ax=ax, show_vertices=False, line_colors='gray',
line_width=1, line_alpha=0.6, point_size=0)
# Material points
scatter = ax.scatter(points[:, 0], points[:, 1],
c=targets_subset, cmap='viridis',
s=100, alpha=0.8, edgecolors='black', linewidth=1.5,
zorder=5)
ax.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax.set_title('Voronoi Tessellation of Materials Space',
fontsize=16, fontweight='bold')
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
# Limit axis range (exclude infinity points)
ax.set_xlim([points[:, 0].min() - 1, points[:, 0].max() + 1])
ax.set_ylim([points[:, 1].min() - 1, points[:, 1].max() + 1])
plt.tight_layout()
plt.savefig('voronoi_tessellation.png', dpi=300, bbox_inches='tight')
print("Saved Voronoi tessellation to voronoi_tessellation.png")
plt.show()
Code Example 25: Property Gradient Visualization
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 25: Property Gradient Visualization
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import griddata
# Create grid
grid_x, grid_y = np.mgrid[
embeddings_umap[:, 0].min():embeddings_umap[:, 0].max():100j,
embeddings_umap[:, 1].min():embeddings_umap[:, 1].max():100j
]
# Interpolation
grid_z = griddata(
embeddings_umap,
targets,
(grid_x, grid_y),
method='cubic'
)
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))
# Contour plot
contour = ax1.contourf(grid_x, grid_y, grid_z, levels=20, cmap='viridis', alpha=0.8)
ax1.scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c='white', s=5, alpha=0.5, edgecolors='none')
ax1.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax1.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax1.set_title('Band Gap Contour Map', fontsize=16, fontweight='bold')
cbar1 = plt.colorbar(contour, ax=ax1)
cbar1.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
# Gradient vectors (np.mgrid uses 'ij' indexing: axis 0 of grid_z runs along
# x and axis 1 along y, so np.gradient returns (d/dx, d/dy) in that order)
gradient_x, gradient_y = np.gradient(grid_z)
# Subsampling (adjust arrow density)
skip = 5
ax2.contourf(grid_x, grid_y, grid_z, levels=20, cmap='viridis', alpha=0.6)
ax2.quiver(grid_x[::skip, ::skip], grid_y[::skip, ::skip],
gradient_x[::skip, ::skip], gradient_y[::skip, ::skip],
color='white', alpha=0.8, scale=50)
ax2.scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c='black', s=5, alpha=0.3, edgecolors='none')
ax2.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax2.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax2.set_title('Band Gap Gradient Field', fontsize=16, fontweight='bold')
cbar2 = plt.colorbar(contour, ax=ax2)
cbar2.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig('property_gradient.png', dpi=300, bbox_inches='tight')
print("Saved property gradient to property_gradient.png")
plt.show()
Code Example 26: Embedding Stability Evaluation
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.2.0
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
def evaluate_embedding_stability(model, dataset, device, n_splits=5):
"""
Evaluate embedding stability with cross-validation
Parameters:
-----------
model : nn.Module
Model (for re-training)
dataset : Dataset
Dataset
device : torch.device
Device
n_splits : int
Number of splits
    Returns:
    --------
    similarities : list
        Pairwise mean cosine similarities between fold embeddings (stability scores)
"""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    embeddings_list = []
    for fold, (train_idx, val_idx) in enumerate(kfold.split(np.arange(len(dataset)))):
        print(f"Fold {fold+1}/{n_splits}")
        # Omitted: re-train the model for each fold.
        # In practice, create a fresh model instance, train it on train_idx, and
        # extract embeddings from the re-trained model (see the sketch below).
        # This demo reuses the existing embeddings, so similarities are trivially ~1.0.
        embeddings_list.append(embeddings)
# Calculate similarity between embeddings
from sklearn.metrics.pairwise import cosine_similarity
similarities = []
for i in range(len(embeddings_list)):
for j in range(i+1, len(embeddings_list)):
sim_matrix = cosine_similarity(embeddings_list[i], embeddings_list[j])
# Average of diagonal elements (similarity between corresponding embeddings for each point)
sim_score = np.mean(np.diag(sim_matrix))
similarities.append(sim_score)
return similarities
# Execute (demo)
print("Embedding stability evaluation (demo):")
stability_scores = evaluate_embedding_stability(model, dataset, device, n_splits=3)
print(f"Average similarity: {np.mean(stability_scores):.3f}")
print(f"Standard deviation: {np.std(stability_scores):.3f}")
Code Example 27: Ensemble Model Embedding Integration
import numpy as np

def create_ensemble_embeddings(models_list, dataset, device):
"""
Integrate embeddings from multiple models
Parameters:
-----------
models_list : list
List of trained models
dataset : Dataset
Dataset
device : torch.device
Device
Returns:
--------
ensemble_embeddings : np.ndarray
Integrated embeddings
"""
all_embeddings = []
for model in models_list:
emb, _ = extract_all_embeddings(model, dataset, device)
all_embeddings.append(emb)
# Take average
ensemble_embeddings = np.mean(all_embeddings, axis=0)
return ensemble_embeddings
# Demo: reuse a single model three times, so the average equals the original
# embeddings (in practice, train independent models; see the sketch below)
print("Ensemble embedding demo:")
models_list = [model, model, model]  # in practice, independently trained models
ensemble_emb = create_ensemble_embeddings(models_list, dataset, device)
print(f"Ensemble embedding shape: {ensemble_emb.shape}")
Code Example 28: Time-series Materials Space Changes (Extensibility)
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
"""
Track changes in materials space with time-series data or multiple model versions
def visualize_materials_space_evolution(
embeddings_timeline,
targets_timeline,
timestamps
):
'''
Visualize time-series changes in materials space
Parameters:
-----------
embeddings_timeline : list of np.ndarray
Embeddings at each time point
targets_timeline : list of np.ndarray
Targets at each time point
timestamps : list
Timestamps
Returns:
--------
animation or multiple plots
'''
import matplotlib.animation as animation
fig, ax = plt.subplots(figsize=(10, 8))
def update(frame):
ax.clear()
embeddings = embeddings_timeline[frame]
targets = targets_timeline[frame]
scatter = ax.scatter(embeddings[:, 0], embeddings[:, 1],
c=targets, cmap='viridis', s=50, alpha=0.6)
ax.set_title(f'Materials Space at {timestamps[frame]}')
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
return scatter,
anim = animation.FuncAnimation(fig, update, frames=len(timestamps),
interval=500, blit=False)
return anim
"""
print("Time-series visualization sketch shown above (extensibility example; not executed)")
Code Example 29: Extrapolation Region Detection
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.2.0
"""
Example: Code Example 29: Extrapolation Region Detection
Purpose: Flag embedding-space regions outside the training distribution with Isolation Forest
Target: Advanced
Execution time: 1-5 minutes
Dependencies: scikit-learn
"""
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt
import numpy as np
# Learn embedding range from training data
train_embeddings = embeddings[train_idx]
# Detect outliers (extrapolation regions) with Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
iso_forest.fit(train_embeddings)
# Predict on all data
outlier_labels = iso_forest.predict(embeddings)
outlier_scores = iso_forest.score_samples(embeddings)
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))
# Color by outlier label
colors_outlier = ['red' if label == -1 else 'blue' for label in outlier_labels]
ax1.scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c=colors_outlier, s=50, alpha=0.6, edgecolors='black', linewidth=0.5)
ax1.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax1.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax1.set_title('Extrapolation Region Detection (Red = Outlier)',
fontsize=16, fontweight='bold')
ax1.grid(True, alpha=0.3)
# Outlier score
scatter = ax2.scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c=outlier_scores, cmap='RdYlGn', s=50, alpha=0.6,
edgecolors='black', linewidth=0.5)
ax2.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax2.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax2.set_title('Anomaly Score (Green = Normal, Red = Outlier)',
fontsize=16, fontweight='bold')
ax2.grid(True, alpha=0.3)
cbar = plt.colorbar(scatter, ax=ax2)
cbar.set_label('Anomaly Score', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig('extrapolation_detection.png', dpi=300, bbox_inches='tight')
print("Saved extrapolation detection to extrapolation_detection.png")
plt.show()
print(f"\nNumber of outliers: {np.sum(outlier_labels == -1)} / {len(outlier_labels)}")
Code Example 30: Comprehensive Report Generation
import json
import numpy as np
from datetime import datetime
def generate_comprehensive_report(
model, dataset, embeddings, targets,
materials_df, cluster_labels, test_predictions, test_targets
):
"""
Generate comprehensive analysis report
Parameters:
-----------
model : nn.Module
Trained model
dataset : Dataset
Dataset
embeddings : np.ndarray
Embeddings
targets : np.ndarray
Targets
materials_df : pd.DataFrame
Materials data
cluster_labels : np.ndarray
Cluster labels
test_predictions : np.ndarray
Test predictions
test_targets : np.ndarray
Test true values
Returns:
--------
report : dict
Report dictionary
"""
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Model performance
mae = mean_absolute_error(test_targets, test_predictions)
rmse = np.sqrt(mean_squared_error(test_targets, test_predictions))
r2 = r2_score(test_targets, test_predictions)
# Embedding statistics
embedding_stats = {
'dimension': embeddings.shape[1],
'mean_norm': float(np.mean(np.linalg.norm(embeddings, axis=1))),
'std_norm': float(np.std(np.linalg.norm(embeddings, axis=1))),
}
# Cluster statistics
cluster_stats = {}
for cluster_id in np.unique(cluster_labels):
mask = cluster_labels == cluster_id
cluster_stats[f'cluster_{cluster_id}'] = {
'size': int(np.sum(mask)),
'mean_band_gap': float(np.mean(targets[mask])),
'std_band_gap': float(np.std(targets[mask]))
}
# Report
report = {
'metadata': {
'timestamp': datetime.now().isoformat(),
'dataset_size': len(dataset),
'model_parameters': sum(p.numel() for p in model.parameters())
},
'model_performance': {
'MAE': float(mae),
'RMSE': float(rmse),
'R2': float(r2)
},
'embedding_statistics': embedding_stats,
'cluster_statistics': cluster_stats,
'target_statistics': {
'mean': float(np.mean(targets)),
'std': float(np.std(targets)),
'min': float(np.min(targets)),
'max': float(np.max(targets))
}
}
return report
# Generate report
report = generate_comprehensive_report(
model, dataset, embeddings, targets,
materials_df, cluster_labels,
test_predictions, test_targets
)
# Save in JSON format
with open('comprehensive_report.json', 'w') as f:
json.dump(report, f, indent=2)
print("Saved comprehensive report to comprehensive_report.json\n")
print("Report summary:")
print(json.dumps(report, indent=2)[:1000] + "...")
# Generate Markdown report
markdown_report = f"""
# Materials Space Mapping - Comprehensive Report
Generated: {report['metadata']['timestamp']}
## 1. Dataset Information
- Dataset size: {report['metadata']['dataset_size']}
- Model parameters: {report['metadata']['model_parameters']:,}
## 2. Model Performance
| Metric | Value |
|--------|-------|
| MAE | {report['model_performance']['MAE']:.4f} |
| RMSE | {report['model_performance']['RMSE']:.4f} |
| R² | {report['model_performance']['R2']:.4f} |
## 3. Embedding Statistics
- Dimensions: {report['embedding_statistics']['dimension']}
- Mean norm: {report['embedding_statistics']['mean_norm']:.3f}
- Norm standard deviation: {report['embedding_statistics']['std_norm']:.3f}
## 4. Target Property Statistics
| Statistic | Value (eV) |
|-----------|------------|
| Mean | {report['target_statistics']['mean']:.3f} |
| Std Dev | {report['target_statistics']['std']:.3f} |
| Min | {report['target_statistics']['min']:.3f} |
| Max | {report['target_statistics']['max']:.3f} |
## 5. Cluster Analysis
"""
for cluster_id, stats in report['cluster_statistics'].items():
markdown_report += f"- **{cluster_id}**: Size={stats['size']}, Mean band gap={stats['mean_band_gap']:.3f} eV\n"
markdown_report += """
## 6. Generated Visualizations
- `training_curve.png`: Training curve
- `test_evaluation.png`: Test evaluation
- `pca_2d_projection.png`: PCA 2D projection
- `umap_2d_projection.png`: UMAP 2D projection
- `tsne_2d_projection.png`: t-SNE 2D projection
- `clustering_analysis.png`: Clustering analysis
- `materials_3d_interactive.html`: Interactive 3D visualization
- `materials_interactive.html`: Interactive 2D visualization
## 7. Summary
In this analysis, by combining GNN-based materials representation learning with dimensionality
reduction, we effectively visualized the high-dimensional materials space, revealing similarities
between materials as well as trends in their properties.
"""
# Save Markdown report
with open('comprehensive_report.md', 'w') as f:
f.write(markdown_report)
print("\nSaved Markdown report to comprehensive_report.md")
4.8 Summary
In this chapter, we built a practical materials mapping system combining GNN and dimensionality reduction.
Implemented Features
| Feature | Code Examples | Main Content |
|---|---|---|
| Data Collection & Preparation | Examples 1-6 | MP API, graph conversion, dataset |
| Model Training | Examples 7-10 | CGCNN, training loop, evaluation |
| Embedding Extraction & Dimensionality Reduction | Examples 11-15 | PCA, UMAP, t-SNE, comparison |
| Materials Space Analysis | Examples 16-20 | Clustering, neighbor search, recommendation |
| Interactive Visualization | Examples 21-23 | Plotly 3D, Bokeh, Dash |
| Advanced Analysis | Examples 24-30 | Voronoi, gradient, extrapolation, report |
Best Practices
- Data Preprocessing: Standardization, missing value handling, outlier removal
- Model Training: Early stopping, learning rate scheduling, regularization
- Dimensionality Reduction: Comparison of multiple methods, parameter tuning
- Visualization: Interactivity, multiple perspectives, interpretability
- Reproducibility: Fixed random seeds, saved configurations, version control (see the sketch after this list)
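As a concrete example of the reproducibility point above, a minimal seed-fixing helper such as the following (a common pattern; adapt the seed handling to your own setup) can be called once at the top of the pipeline:
import random
import numpy as np
import torch

def set_seed(seed=42):
    """Fix the random seeds of every library used in this chapter's pipeline."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)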
Applications
- Materials Discovery: Recommendation of materials with target properties
- Property Prediction: Prediction of properties for novel materials
- Structure-Property Relationships: Elucidating relationships between structure and properties in materials space
- Experimental Design: Suggestions for next materials to synthesize
Future Developments
- Multi-task Learning: Simultaneous prediction of multiple properties
- Transfer Learning: Application to small datasets
- Active Learning: Efficient data collection strategies
- Explainable AI: Improving interpretability of GNN predictions
Previous Chapter: Chapter 3: Materials Representation Learning with GNN
Series Top: Introduction to Materials Property Mapping
Further Learning
Recommended Literature
- GNN for Materials: "Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties" (Xie & Grossman, 2018)
- UMAP: "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction" (McInnes et al., 2018)
- Materials Informatics: "Machine Learning in Materials Informatics: Recent Applications and Prospects" (Ramprasad et al., 2017)
Congratulations! Through this series, you have learned the fundamentals and practical applications of materials property mapping.