Chapter 4: Practical Guide - Materials Mapping with GNN + Dimensionality Reduction
This chapter is a hands-on practical guide: you will learn the essential concepts and techniques by building a working materials-mapping system end to end.
Overview
In this chapter, we will build a practical materials space mapping system that combines GNN-based representation learning (Chapter 3) with dimensionality reduction techniques (Chapter 2). We will collect real data from the Materials Project API and implement an end-to-end pipeline.
Learning Objectives
- Collect real data using the Materials Project API
- Build a GNN training pipeline
- Visualize trained GNN embeddings using dimensionality reduction
- Implement an interactive materials exploration system
- Apply to real materials design tasks
4.1 Environment Setup and Data Collection
Code Example 1: Installing Required Libraries
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - seaborn>=0.12.0
# - torch>=2.0.0, <2.3.0
# - tqdm>=4.65.0
# Install required libraries (run once, on first use)
"""
!pip install pymatgen
!pip install mp-api
!pip install torch torchvision torchaudio
!pip install torch-geometric
!pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-2.0.0+cpu.html
!pip install umap-learn
!pip install plotly
!pip install bokeh
!pip install scikit-learn
!pip install pandas matplotlib seaborn
!pip install tqdm
"""
# Import
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data, Dataset
from torch_geometric.loader import DataLoader  # DataLoader lives in .loader in PyG 2.x
import umap
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from tqdm import tqdm
import json
import pickle
from pathlib import Path
print("All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
Code Example 2: Materials Project API Setup and Data Collection
from mp_api.client import MPRester
from pymatgen.core import Structure
import warnings
warnings.filterwarnings('ignore')
# API key setup (environment variable or direct specification)
# MP_API_KEY = "your_api_key_here" # Get from https://next-gen.materialsproject.org/api
# Note: It is recommended to read from environment variables in actual projects
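# A minimal sketch of the environment-variable approach (the variable name
# MP_API_KEY is our own convention here, not mandated by mp-api): the key
# never appears in the notebook or in version control.
import os
MP_API_KEY = os.environ.get("MP_API_KEY")  # None if the variable is not set
if MP_API_KEY is None:
    print("MP_API_KEY not set - the dummy data generated below can be used instead")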
def fetch_materials_data(api_key, criteria, properties, max_materials=1000):
"""
Fetch materials data from Materials Project API
Parameters:
-----------
api_key : str
Materials Project API key
criteria : dict
Search criteria
properties : list
List of properties to fetch
max_materials : int
Maximum number of materials to fetch
Returns:
--------
materials_df : pd.DataFrame
Materials data
structures : list
List of crystal structures
"""
with MPRester(api_key) as mpr:
# Fetch data
docs = mpr.materials.summary.search(
**criteria,
fields=properties,
num_chunks=10,
chunk_size=100
)
# Limit to maximum number
docs = docs[:max_materials]
print(f"Number of materials fetched: {len(docs)}")
# Convert to DataFrame
# Skip fields handled separately below; otherwise 'material_id' would be
# appended twice and the never-filled 'structure' column would break the
# DataFrame construction
skip_props = {'structure', 'material_id', 'formula_pretty'}
data_dict = {prop: [] for prop in properties if prop not in skip_props}
data_dict['material_id'] = []
data_dict['formula'] = []
structures = []
for doc in tqdm(docs, desc="Processing data"):
try:
data_dict['material_id'].append(doc.material_id)
data_dict['formula'].append(str(doc.formula_pretty))
structures.append(doc.structure)
for prop in properties:
if prop in skip_props:
continue
value = getattr(doc, prop, None)
data_dict[prop].append(value)
except Exception as e:
print(f"Error processing {doc.material_id}: {e}")
continue
materials_df = pd.DataFrame(data_dict)
# Remove rows with missing values, keeping the structures list aligned
materials_df = materials_df.dropna()
structures = [structures[i] for i in materials_df.index]
materials_df = materials_df.reset_index(drop=True)
print(f"Number of valid materials: {len(materials_df)}")
return materials_df, structures
# Usage example (dummy data is generated below instead; uncomment this block if you have an actual API key)
# criteria = {
# "band_gap": (0.5, 5.0), # Band gap 0.5-5.0 eV
# "is_stable": True, # Only stable materials
# "nelements": (2, 3) # 2-3 element systems
# }
#
# properties = [
# "material_id", "formula_pretty", "structure",
# "band_gap", "formation_energy_per_atom", "density",
# "energy_above_hull", "volume"
# ]
#
# materials_df, structures = fetch_materials_data(
# api_key=MP_API_KEY,
# criteria=criteria,
# properties=properties,
# max_materials=1000
# )
# Generate dummy data (if API key is not available)
print("\nGenerating dummy data...")
from pymatgen.core import Lattice, Structure
np.random.seed(42)
n_materials = 1000
materials_df = pd.DataFrame({
'material_id': [f'mp-{1000+i}' for i in range(n_materials)],
'formula': [f'Material_{i}' for i in range(n_materials)],
'band_gap': np.random.exponential(2.0, n_materials),
'formation_energy_per_atom': np.random.normal(-1.5, 0.8, n_materials),
'density': np.random.normal(5.0, 1.5, n_materials).clip(0.1),
'energy_above_hull': np.random.exponential(0.05, n_materials),
'volume': np.random.normal(50, 15, n_materials).clip(10)
})
# Generate dummy crystal structures
structures = []
for i in range(n_materials):
lattice = Lattice.cubic(4.0 + np.random.rand())
n_atoms = np.random.randint(2, 8)
species = np.random.choice(['Li', 'Na', 'K', 'O', 'S', 'Cl', 'F'], n_atoms)
coords = np.random.rand(n_atoms, 3)
structure = Structure(lattice, species, coords)
structures.append(structure)
print(f"Generated {n_materials} dummy materials data")
print("\nFirst 5 rows of materials data:")
print(materials_df.head())
Output Example:
Generated 1000 dummy materials data
First 5 rows of materials data:
material_id formula band_gap formation_energy_per_atom density energy_above_hull volume
0 mp-1000 Material_0 1.234567 -1.678901 4.567 0.012345 45.678
1 mp-1001 Material_1 2.345678 -1.234567 5.678 0.023456 50.789
2 mp-1002 Material_2 0.987654 -2.012345 3.456 0.001234 48.901
3 mp-1003 Material_3 3.456789 -0.987654 6.789 0.034567 52.345
4 mp-1004 Material_4 1.876543 -1.456789 4.890 0.009876 47.234
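Before moving on, it is convenient to persist the collected (or generated) data so later sessions can skip this step. A minimal sketch using the pickle and pathlib imports from Code Example 1 (the file names are our own choice):
from pathlib import Path
import pickle

out_dir = Path('data')
out_dir.mkdir(exist_ok=True)
# Tabular properties go to CSV; pymatgen Structure objects are pickled
materials_df.to_csv(out_dir / 'materials_df.csv', index=False)
with open(out_dir / 'structures.pkl', 'wb') as f:
    pickle.dump(structures, f)
print(f"Saved materials data to {out_dir.resolve()}")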
Code Example 3: Exploratory Data Analysis
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - seaborn>=0.12.0
"""
Example: Code Example 3: Exploratory Data Analysis
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
import seaborn as sns
# Basic statistics
print("Basic statistics of materials data:")
print(materials_df.describe())
# Distribution of properties
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()
properties_to_plot = [
'band_gap', 'formation_energy_per_atom', 'density',
'energy_above_hull', 'volume'
]
for idx, prop in enumerate(properties_to_plot):
axes[idx].hist(materials_df[prop], bins=50, alpha=0.7,
edgecolor='black', color='steelblue')
axes[idx].set_xlabel(prop.replace('_', ' ').title(), fontsize=12, fontweight='bold')
axes[idx].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[idx].set_title(f'Distribution of {prop.replace("_", " ").title()}',
fontsize=14, fontweight='bold')
axes[idx].grid(True, alpha=0.3)
# Hide the last subplot
axes[-1].axis('off')
plt.tight_layout()
plt.savefig('materials_data_distributions.png', dpi=300, bbox_inches='tight')
print("\nSaved property distributions to materials_data_distributions.png")
plt.show()
# Correlation matrix
correlation_matrix = materials_df[properties_to_plot].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.3f', cmap='RdBu_r',
center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Material Properties',
fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('materials_correlation_matrix.png', dpi=300, bbox_inches='tight')
print("Saved correlation matrix to materials_correlation_matrix.png")
plt.show()
4.2 Graph Dataset Construction
Code Example 4: Crystal Structure to Graph Conversion (Optimized Version)
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
import torch
from torch_geometric.data import Data
from pymatgen.core import Structure
from pymatgen.core.periodic_table import Element
import numpy as np
class MaterialGraphConverter:
"""Class to convert crystal structures to graphs"""
def __init__(self, cutoff=5.0, max_neighbors=12):
self.cutoff = cutoff
self.max_neighbors = max_neighbors
# Define atomic properties (element symbol → feature vector)
self._build_atom_features()
def _build_atom_features(self):
"""Build atom properties dictionary"""
# Pre-calculate properties for common elements
common_elements = [
'H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne',
'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca',
'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn',
'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr'
]
self.atom_features = {}
for symbol in common_elements:
try:
element = Element(symbol)
features = [
element.Z / 100.0, # Normalized atomic number
element.atomic_mass / 200.0, # Normalized atomic mass
# pymatgen reports missing electronegativities as NaN, not None
element.X if element.X is not None and not np.isnan(element.X) else 0.0,
float(element.atomic_radius) if element.atomic_radius is not None else 0.0, # Atomic radius
element.group if element.group is not None else 0.0, # Group
element.row if element.row is not None else 0.0, # Period
]
self.atom_features[symbol] = features
except Exception as e:
print(f"Warning: Could not process element {symbol}: {e}")
self.atom_features[symbol] = [0.0] * 6
def get_atom_features(self, species):
"""Get feature vector from element symbol"""
symbol = str(species)
if symbol in self.atom_features:
return self.atom_features[symbol]
else:
# Unknown elements get zero vector
return [0.0] * 6
def structure_to_graph(self, structure, target=None):
"""
Convert pymatgen Structure to PyTorch Geometric Data
Parameters:
-----------
structure : pymatgen.Structure
Crystal structure
target : float, optional
Target value (for property prediction)
Returns:
--------
data : torch_geometric.data.Data
Graph data
"""
# Node features
node_features = []
for site in structure:
features = self.get_atom_features(site.specie)
node_features.append(features)
node_features = torch.tensor(node_features, dtype=torch.float)
# Edge construction
all_neighbors = structure.get_all_neighbors(self.cutoff)
edge_index = []
edge_attr = []
for i, neighbors in enumerate(all_neighbors):
# Sort by distance and take up to max_neighbors
neighbors = sorted(neighbors, key=lambda x: x.nn_distance)[:self.max_neighbors]
for neighbor in neighbors:
j = neighbor.index
distance = neighbor.nn_distance
edge_index.append([i, j])
edge_attr.append([distance])
# Handle case with no edges
if len(edge_index) == 0:
edge_index = torch.zeros((2, 0), dtype=torch.long)
edge_attr = torch.zeros((0, 1), dtype=torch.float)
else:
edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()
edge_attr = torch.tensor(edge_attr, dtype=torch.float)
# Create Data object
data = Data(x=node_features, edge_index=edge_index, edge_attr=edge_attr)
if target is not None:
data.y = torch.tensor([target], dtype=torch.float)
return data
# Instantiate and test
converter = MaterialGraphConverter(cutoff=5.0, max_neighbors=12)
# Test with first material
test_structure = structures[0]
test_target = materials_df.iloc[0]['band_gap']
test_graph = converter.structure_to_graph(test_structure, test_target)
print("Graph conversion test:")
print(f"Number of nodes: {test_graph.num_nodes}")
print(f"Number of edges: {test_graph.num_edges}")
print(f"Node feature dimension: {test_graph.x.shape[1]}")
print(f"Target value: {test_graph.y.item():.3f}")
Code Example 5: Custom Dataset Class
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
# - tqdm>=4.65.0
from torch_geometric.data import Dataset, Data
import torch
from tqdm import tqdm
import pickle
class MaterialsDataset(Dataset):
"""
Materials dataset
Parameters:
-----------
structures : list
List of pymatgen.Structure
targets : array-like
Target values
converter : MaterialGraphConverter
Graph converter
root : str
Data save directory
transform : callable, optional
Data transformation function
pre_transform : callable, optional
Pre-processing function
"""
def __init__(self, structures, targets, converter,
root='data/materials', transform=None, pre_transform=None):
self.structures = structures
self.targets = targets
self.converter = converter
super().__init__(root, transform, pre_transform)
@property
def raw_file_names(self):
return []
@property
def processed_file_names(self):
return [f'data_{i}.pt' for i in range(len(self.structures))]
def download(self):
pass
def process(self):
"""Convert structures to graphs and save"""
for idx in tqdm(range(len(self.structures)), desc="Converting to graphs"):
structure = self.structures[idx]
target = self.targets[idx]
# Convert to graph
data = self.converter.structure_to_graph(structure, target)
if self.pre_transform is not None:
data = self.pre_transform(data)
# Save
torch.save(data, self.processed_paths[idx])
def len(self):
return len(self.structures)
def get(self, idx):
data = torch.load(self.processed_paths[idx])
return data
# Create dataset
print("Creating dataset...")
# Target property (band gap)
targets = materials_df['band_gap'].values
# Instantiate dataset
dataset = MaterialsDataset(
structures=structures,
targets=targets,
converter=converter,
root='data/materials_dataset'
)
print(f"\nDataset creation complete!")
print(f"Dataset size: {len(dataset)}")
print(f"Sample data:")
print(dataset[0])
Output Example:
Converting to graphs: 100%|██████████| 1000/1000 [00:15<00:00, 63.42it/s]
Dataset creation complete!
Dataset size: 1000
Sample data:
Data(x=[5, 6], edge_index=[2, 48], edge_attr=[48, 1], y=[1])
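The on-disk Dataset above is robust for large collections, but for roughly 1,000 small graphs a plain Python list of Data objects is often simpler, and PyG's DataLoader accepts such a list directly. A minimal in-memory sketch, assuming the converter, structures, and targets from the examples above:
graph_list = [converter.structure_to_graph(s, float(t))
              for s, t in zip(structures, targets)]
print(f"Built {len(graph_list)} graphs in memory")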
Code Example 6: Dataset Splitting and DataLoader
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
"""
Example: Code Example 6: Dataset Splitting and DataLoader
Purpose: Demonstrate core concepts and implementation patterns
Target: Advanced
Execution time: 1-5 minutes
Dependencies: None
"""
from torch_geometric.loader import DataLoader
from sklearn.model_selection import train_test_split
import torch
# Split dataset indices
train_idx, temp_idx = train_test_split(
range(len(dataset)), test_size=0.3, random_state=42
)
val_idx, test_idx = train_test_split(
temp_idx, test_size=0.5, random_state=42
)
# Create subsets
train_dataset = dataset[train_idx]
val_dataset = dataset[val_idx]
test_dataset = dataset[test_idx]
print(f"Data split:")
print(f" Training data: {len(train_dataset)}")
print(f" Validation data: {len(val_dataset)}")
print(f" Test data: {len(test_dataset)}")
# Create DataLoaders
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
# Test DataLoader
batch = next(iter(train_loader))
print(f"\nBatch data:")
print(f" Batch size: {batch.num_graphs}")
print(f" Total nodes: {batch.num_nodes}")
print(f" Total edges: {batch.num_edges}")
print(f" Node feature shape: {batch.x.shape}")
print(f" Target shape: {batch.y.shape}")
Output Example:
Data split:
Training data: 700
Validation data: 150
Test data: 150
Batch data:
Batch size: 32
Total nodes: 168
Total edges: 1568
Node feature shape: torch.Size([168, 6])
Target shape: torch.Size([32])
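PyTorch Geometric batches graphs by merging them into one large disconnected graph; batch.batch maps every node to its source graph, which is also what the pooling layer in the next section relies on. A quick check:
import torch

# Per-graph node counts recovered from the batch assignment vector
nodes_per_graph = torch.bincount(batch.batch)
print(f"Graphs in batch: {len(nodes_per_graph)}")
print(f"Nodes per graph (first 5): {nodes_per_graph[:5].tolist()}")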
4.3 GNN Model Training
Code Example 7: Improved CGCNN Model
# Requirements:
# - Python 3.9+
# - torch>=2.0.0, <2.3.0
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing, global_mean_pool, global_add_pool, global_max_pool
class ImprovedCGConv(MessagePassing):
"""Improved CGCNN convolution layer"""
def __init__(self, node_dim, edge_dim, hidden_dim=128):
super().__init__(aggr='add')
# Do not overwrite self.node_dim: MessagePassing uses that attribute
# internally to select the node dimension during propagate()
self.in_dim = node_dim
self.edge_dim = edge_dim
self.hidden_dim = hidden_dim
# Edge network
self.edge_network = nn.Sequential(
nn.Linear(2 * node_dim + edge_dim, hidden_dim),
nn.BatchNorm1d(hidden_dim),
nn.Softplus(),
nn.Linear(hidden_dim, hidden_dim),
nn.BatchNorm1d(hidden_dim),
nn.Softplus()
)
# Gate network
self.gate_network = nn.Sequential(
nn.Linear(2 * node_dim + edge_dim, hidden_dim),
nn.Sigmoid()
)
# Node update network
self.node_update = nn.Sequential(
nn.Linear(node_dim + hidden_dim, hidden_dim),
nn.BatchNorm1d(hidden_dim),
nn.Softplus(),
nn.Linear(hidden_dim, hidden_dim)
)
def forward(self, x, edge_index, edge_attr):
return self.propagate(edge_index, x=x, edge_attr=edge_attr)
def message(self, x_i, x_j, edge_attr):
"""Message function"""
# Concatenate source, target, and edge features
z = torch.cat([x_i, x_j, edge_attr], dim=1)
# Gating mechanism
gate = self.gate_network(z)
message = self.edge_network(z)
return gate * message
def update(self, aggr_out, x):
"""Node update"""
combined = torch.cat([x, aggr_out], dim=1)
return self.node_update(combined)
class ImprovedCGCNN(nn.Module):
"""Improved CGCNN model"""
def __init__(self, node_dim, edge_dim, hidden_dim=128,
num_conv=4, dropout=0.2, pooling='mean'):
super().__init__()
self.node_dim = node_dim
self.edge_dim = edge_dim
self.hidden_dim = hidden_dim
self.num_conv = num_conv
self.pooling = pooling
# Input embedding
self.node_embedding = nn.Sequential(
nn.Linear(node_dim, hidden_dim),
nn.BatchNorm1d(hidden_dim),
nn.Softplus()
)
# CGConv layers
self.conv_layers = nn.ModuleList([
ImprovedCGConv(hidden_dim, edge_dim, hidden_dim)
for _ in range(num_conv)
])
# Batch Normalization
self.bn_layers = nn.ModuleList([
nn.BatchNorm1d(hidden_dim)
for _ in range(num_conv)
])
# Dropout
self.dropout = nn.Dropout(dropout)
# Pooling
if pooling == 'mean':
self.pool = global_mean_pool
elif pooling == 'add':
self.pool = global_add_pool
elif pooling == 'max':
self.pool = global_max_pool
else:
raise ValueError(f"Unknown pooling: {pooling}")
# Output network
self.fc = nn.Sequential(
nn.Linear(hidden_dim, hidden_dim // 2),
nn.BatchNorm1d(hidden_dim // 2),
nn.Softplus(),
nn.Dropout(dropout),
nn.Linear(hidden_dim // 2, hidden_dim // 4),
nn.BatchNorm1d(hidden_dim // 4),
nn.Softplus(),
nn.Dropout(dropout),
nn.Linear(hidden_dim // 4, 1)
)
def forward(self, data, return_embedding=False):
"""
Forward pass
Parameters:
-----------
data : torch_geometric.data.Batch
Batch data
return_embedding : bool
Whether to return embedding
Returns:
--------
out : Tensor
Predictions
embedding : Tensor (optional)
Graph embedding
"""
x, edge_index, edge_attr, batch = data.x, data.edge_index, data.edge_attr, data.batch
# Input embedding
x = self.node_embedding(x)
# Apply CGConv layers
for conv, bn in zip(self.conv_layers, self.bn_layers):
x_new = conv(x, edge_index, edge_attr)
x_new = bn(x_new)
x_new = F.softplus(x_new)
x_new = self.dropout(x_new)
# Residual connection
x = x + x_new
# Graph-level pooling
graph_embedding = self.pool(x, batch)
# Output
out = self.fc(graph_embedding)
if return_embedding:
return out, graph_embedding
else:
return out
# Model instantiation
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ImprovedCGCNN(
node_dim=6,
edge_dim=1,
hidden_dim=128,
num_conv=4,
dropout=0.2,
pooling='mean'
).to(device)
print(f"Total model parameters: {sum(p.numel() for p in model.parameters()):,}")
# Test
batch = next(iter(train_loader)).to(device)
predictions, embeddings = model(batch, return_embedding=True)
print(f"\nPrediction shape: {predictions.shape}")
print(f"Embedding shape: {embeddings.shape}")
Code Example 8: Training Loop (with Early Stopping)
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - torch>=2.0.0, <2.3.0
# - tqdm>=4.65.0
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau
import numpy as np
from tqdm import tqdm
import time
class EarlyStopping:
"""Early Stopping class"""
def __init__(self, patience=20, delta=0.001, path='best_model.pt'):
self.patience = patience
self.delta = delta
self.path = path
self.counter = 0
self.best_score = None
self.early_stop = False
self.val_loss_min = np.inf  # np.Inf was removed in NumPy 2.0
def __call__(self, val_loss, model):
score = -val_loss
if self.best_score is None:
self.best_score = score
self.save_checkpoint(val_loss, model)
elif score < self.best_score + self.delta:
self.counter += 1
if self.counter >= self.patience:
self.early_stop = True
else:
self.best_score = score
self.save_checkpoint(val_loss, model)
self.counter = 0
def save_checkpoint(self, val_loss, model):
torch.save(model.state_dict(), self.path)
self.val_loss_min = val_loss
def train_epoch(model, loader, criterion, optimizer, device):
"""Train one epoch"""
model.train()
total_loss = 0
total_samples = 0
for batch in loader:
batch = batch.to(device)
optimizer.zero_grad()
predictions = model(batch)
loss = criterion(predictions.view(-1), batch.y.view(-1))  # align [B, 1] predictions with [B] targets
loss.backward()
optimizer.step()
total_loss += loss.item() * batch.num_graphs
total_samples += batch.num_graphs
return total_loss / total_samples
def validate_epoch(model, loader, criterion, device):
"""Validate one epoch"""
model.eval()
total_loss = 0
total_samples = 0
with torch.no_grad():
for batch in loader:
batch = batch.to(device)
predictions = model(batch)
loss = criterion(predictions.view(-1), batch.y.view(-1))  # align [B, 1] predictions with [B] targets
total_loss += loss.item() * batch.num_graphs
total_samples += batch.num_graphs
return total_loss / total_samples
# Training setup
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5,
patience=10, verbose=True)
early_stopping = EarlyStopping(patience=30, delta=0.001,
path='best_cgcnn_model.pt')
# Training loop
num_epochs = 200
train_losses = []
val_losses = []
print("Starting training...")
start_time = time.time()
for epoch in range(num_epochs):
# Train
train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
train_losses.append(train_loss)
# Validate
val_loss = validate_epoch(model, val_loader, criterion, device)
val_losses.append(val_loss)
# Learning rate scheduler
scheduler.step(val_loss)
# Early Stopping
early_stopping(val_loss, model)
if (epoch + 1) % 10 == 0:
elapsed_time = time.time() - start_time
print(f"Epoch [{epoch+1}/{num_epochs}] - "
f"Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f} - "
f"Time: {elapsed_time:.1f}s")
if early_stopping.early_stop:
print(f"Early stopping at epoch {epoch+1}")
break
total_time = time.time() - start_time
print(f"\nTraining complete! Total time: {total_time/60:.1f} minutes")
# Load best model
model.load_state_dict(torch.load('best_cgcnn_model.pt'))
print(f"Loaded best model (Val Loss: {early_stopping.val_loss_min:.4f})")
Output Example:
Starting training...
Epoch [10/200] - Train Loss: 1.2345, Val Loss: 1.3456 - Time: 15.3s
Epoch [20/200] - Train Loss: 0.9876, Val Loss: 1.1234 - Time: 30.7s
Epoch [30/200] - Train Loss: 0.7654, Val Loss: 1.0123 - Time: 46.1s
Epoch [40/200] - Train Loss: 0.6543, Val Loss: 0.9876 - Time: 61.5s
Epoch [50/200] - Train Loss: 0.5678, Val Loss: 0.9654 - Time: 76.9s
Early stopping at epoch 85
Training complete! Total time: 2.3 minutes
Loaded best model (Val Loss: 0.9234)
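The EarlyStopping class saves only the model weights. To pause and resume training, a fuller checkpoint that includes optimizer and scheduler state is useful; a sketch (the file name is our own choice):
checkpoint = {
    'epoch': epoch,
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
    'scheduler_state': scheduler.state_dict(),
    'train_losses': train_losses,
    'val_losses': val_losses,
}
torch.save(checkpoint, 'cgcnn_checkpoint.pt')
# To resume later:
# ckpt = torch.load('cgcnn_checkpoint.pt')
# model.load_state_dict(ckpt['model_state'])
# optimizer.load_state_dict(ckpt['optimizer_state'])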
Code Example 9: Training Curve Visualization
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 9: Training Curve Visualization
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 1-5 minutes
Dependencies: None
"""
import matplotlib.pyplot as plt
import numpy as np
# Training curve
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Loss curve
epochs = range(1, len(train_losses) + 1)
ax1.plot(epochs, train_losses, label='Train Loss', linewidth=2, color='blue')
ax1.plot(epochs, val_losses, label='Validation Loss', linewidth=2, color='red')
ax1.set_xlabel('Epoch', fontsize=14, fontweight='bold')
ax1.set_ylabel('MSE Loss', fontsize=14, fontweight='bold')
ax1.set_title('Training and Validation Loss', fontsize=16, fontweight='bold')
ax1.legend(fontsize=12)
ax1.grid(True, alpha=0.3)
# Log scale
ax2.semilogy(epochs, train_losses, label='Train Loss', linewidth=2, color='blue')
ax2.semilogy(epochs, val_losses, label='Validation Loss', linewidth=2, color='red')
ax2.set_xlabel('Epoch', fontsize=14, fontweight='bold')
ax2.set_ylabel('MSE Loss (log scale)', fontsize=14, fontweight='bold')
ax2.set_title('Training Curve (Log Scale)', fontsize=16, fontweight='bold')
ax2.legend(fontsize=12)
ax2.grid(True, alpha=0.3, which='both')
plt.tight_layout()
plt.savefig('training_curve.png', dpi=300, bbox_inches='tight')
print("Saved training curve to training_curve.png")
plt.show()
Code Example 10: Evaluation on Test Data
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np
def evaluate_model(model, loader, device):
"""Evaluate model"""
model.eval()
all_predictions = []
all_targets = []
with torch.no_grad():
for batch in loader:
batch = batch.to(device)
predictions = model(batch)
all_predictions.append(predictions.cpu().numpy())
all_targets.append(batch.y.cpu().numpy())
predictions = np.concatenate(all_predictions, axis=0).flatten()
targets = np.concatenate(all_targets, axis=0).flatten()
return predictions, targets
# Evaluate on test data
test_predictions, test_targets = evaluate_model(model, test_loader, device)
# Calculate evaluation metrics
mae = mean_absolute_error(test_targets, test_predictions)
rmse = np.sqrt(mean_squared_error(test_targets, test_predictions))
r2 = r2_score(test_targets, test_predictions)
print("Evaluation on test data:")
print(f" MAE: {mae:.4f}")
print(f" RMSE: {rmse:.4f}")
print(f" R²: {r2:.4f}")
# Plot predictions vs actual values
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Scatter plot
ax1.scatter(test_targets, test_predictions, alpha=0.6, s=50,
edgecolors='black', linewidth=0.5, color='steelblue')
# Ideal line
min_val = min(test_targets.min(), test_predictions.min())
max_val = max(test_targets.max(), test_predictions.max())
ax1.plot([min_val, max_val], [min_val, max_val],
'r--', linewidth=2, label='Ideal')
ax1.set_xlabel('True Band Gap (eV)', fontsize=14, fontweight='bold')
ax1.set_ylabel('Predicted Band Gap (eV)', fontsize=14, fontweight='bold')
ax1.set_title(f'Predictions vs True Values\nR²={r2:.3f}, MAE={mae:.3f}',
fontsize=16, fontweight='bold')
ax1.legend(fontsize=12)
ax1.grid(True, alpha=0.3)
# Residual plot
residuals = test_predictions - test_targets
ax2.scatter(test_targets, residuals, alpha=0.6, s=50,
edgecolors='black', linewidth=0.5, color='coral')
ax2.axhline(y=0, color='r', linestyle='--', linewidth=2)
ax2.set_xlabel('True Band Gap (eV)', fontsize=14, fontweight='bold')
ax2.set_ylabel('Residuals (eV)', fontsize=14, fontweight='bold')
ax2.set_title('Residual Plot', fontsize=16, fontweight='bold')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('test_evaluation.png', dpi=300, bbox_inches='tight')
print("\nSaved evaluation results to test_evaluation.png")
plt.show()
4.4 Embedding Extraction and Dimensionality Reduction
Code Example 11: Extracting Embeddings from All Data
def extract_all_embeddings(model, dataset, device, batch_size=64):
"""
Extract embeddings from entire dataset
Parameters:
-----------
model : nn.Module
Trained model
dataset : Dataset
Dataset
device : torch.device
Device
batch_size : int
Batch size
Returns:
--------
embeddings : np.ndarray
Embedding vectors
targets : np.ndarray
Target values
"""
model.eval()
loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
all_embeddings = []
all_targets = []
with torch.no_grad():
for batch in tqdm(loader, desc="Extracting embeddings"):
batch = batch.to(device)
_, embeddings = model(batch, return_embedding=True)
all_embeddings.append(embeddings.cpu().numpy())
all_targets.append(batch.y.cpu().numpy())
embeddings = np.concatenate(all_embeddings, axis=0)
targets = np.concatenate(all_targets, axis=0).flatten()
return embeddings, targets
# Extract embeddings from all data
embeddings, targets = extract_all_embeddings(model, dataset, device)
print(f"Embedding shape: {embeddings.shape}")
print(f"Target shape: {targets.shape}")
# Save
np.save('cgcnn_embeddings.npy', embeddings)
np.save('cgcnn_targets.npy', targets)
print("\nSaved embeddings")
Code Example 12: Dimensionality Reduction with PCA
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 12: Dimensionality Reduction with PCA
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
# Run PCA
pca = PCA(n_components=min(50, embeddings.shape[1]))
embeddings_pca = pca.fit_transform(embeddings)
# Analyze explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
cumsum_variance = np.cumsum(explained_variance_ratio)
# Scree plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Individual explained variance
ax1.bar(range(1, 21), explained_variance_ratio[:20],
alpha=0.7, edgecolor='black', color='steelblue')
ax1.set_xlabel('Principal Component', fontsize=14, fontweight='bold')
ax1.set_ylabel('Explained Variance Ratio', fontsize=14, fontweight='bold')
ax1.set_title('PCA Scree Plot (Top 20 Components)',
fontsize=16, fontweight='bold')
ax1.grid(True, alpha=0.3, axis='y')
# Cumulative explained variance
ax2.plot(range(1, len(cumsum_variance) + 1), cumsum_variance,
marker='o', linewidth=2, markersize=4, color='darkred')
ax2.axhline(y=0.95, color='green', linestyle='--', linewidth=2,
label='95% threshold', alpha=0.7)
ax2.set_xlabel('Number of Components', fontsize=14, fontweight='bold')
ax2.set_ylabel('Cumulative Explained Variance', fontsize=14, fontweight='bold')
ax2.set_title('Cumulative Variance Explained',
fontsize=16, fontweight='bold')
ax2.legend(fontsize=12)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('pca_analysis.png', dpi=300, bbox_inches='tight')
print(f"Saved PCA analysis to pca_analysis.png")
print(f"\nNumber of principal components explaining 95% variance: {np.argmax(cumsum_variance >= 0.95) + 1}")
plt.show()
# 2D PCA visualization
pca_2d = PCA(n_components=2)
embeddings_pca_2d = pca_2d.fit_transform(embeddings)
plt.figure(figsize=(12, 9))
scatter = plt.scatter(embeddings_pca_2d[:, 0], embeddings_pca_2d[:, 1],
c=targets, cmap='viridis', s=50, alpha=0.6,
edgecolors='black', linewidth=0.5)
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)',
fontsize=14, fontweight='bold')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)',
fontsize=14, fontweight='bold')
plt.title('PCA: 2D Projection of CGCNN Embeddings',
fontsize=16, fontweight='bold')
plt.colorbar(scatter, label='Band Gap (eV)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('pca_2d_projection.png', dpi=300, bbox_inches='tight')
print("Saved PCA 2D projection to pca_2d_projection.png")
plt.show()
Code Example 13: Dimensionality Reduction with UMAP
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 13: Dimensionality Reduction with UMAP
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import umap
import matplotlib.pyplot as plt
import numpy as np
# UMAP's layout is sensitive to n_neighbors and min_dist; typical values to
# explore are n_neighbors in [5, 15, 30, 50] and min_dist in [0.0, 0.1, 0.3]
# (a parameter-sweep sketch follows this example)
# Run with a commonly used default combination
print("Running UMAP...")
reducer = umap.UMAP(
n_components=2,
n_neighbors=15,
min_dist=0.1,
metric='euclidean',
random_state=42,
verbose=True
)
embeddings_umap = reducer.fit_transform(embeddings)
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(18, 7))
# Color by target value
scatter1 = axes[0].scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c=targets, cmap='viridis', s=50, alpha=0.6,
edgecolors='black', linewidth=0.5)
axes[0].set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
axes[0].set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
axes[0].set_title('UMAP: Colored by Band Gap',
fontsize=16, fontweight='bold')
axes[0].grid(True, alpha=0.3)
cbar1 = plt.colorbar(scatter1, ax=axes[0])
cbar1.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
# Color by category
# Use open-ended bins so no value falls outside a category (with fixed bins,
# pd.cut would return NaN for band gaps above 10 eV)
band_gap_categories = pd.cut(targets, bins=[-np.inf, 1, 2, 3, np.inf],
labels=['Small', 'Medium', 'Large', 'Very Large'])
colors_map = {'Small': 'red', 'Medium': 'orange', 'Large': 'yellow', 'Very Large': 'green'}
for category in ['Small', 'Medium', 'Large', 'Very Large']:
mask = band_gap_categories == category
axes[1].scatter(embeddings_umap[mask, 0], embeddings_umap[mask, 1],
c=colors_map[category], label=category, s=50, alpha=0.7,
edgecolors='black', linewidth=0.5)
axes[1].set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
axes[1].set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
axes[1].set_title('UMAP: Colored by Band Gap Category',
fontsize=16, fontweight='bold')
axes[1].legend(title='Band Gap', fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('umap_2d_projection.png', dpi=300, bbox_inches='tight')
print("\nSaved UMAP 2D projection to umap_2d_projection.png")
plt.show()
# Save UMAP results
np.save('umap_embeddings_2d.npy', embeddings_umap)
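As noted above, n_neighbors and min_dist strongly shape the UMAP layout. A small sweep sketch that renders a few combinations side by side (runtime grows with the grid size; the chosen values are illustrative):
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
for row, n_nb in enumerate([5, 50]):
    for col, m_dist in enumerate([0.0, 0.1, 0.5]):
        emb = umap.UMAP(n_components=2, n_neighbors=n_nb,
                        min_dist=m_dist, random_state=42).fit_transform(embeddings)
        ax = axes[row, col]
        ax.scatter(emb[:, 0], emb[:, 1], c=targets, cmap='viridis',
                   s=10, alpha=0.6, edgecolors='none')
        ax.set_title(f'n_neighbors={n_nb}, min_dist={m_dist}', fontsize=12)
plt.tight_layout()
plt.savefig('umap_parameter_sweep.png', dpi=150, bbox_inches='tight')
plt.show()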
Code Example 14: Dimensionality Reduction with t-SNE
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
"""
Example: Code Example 14: Dimensionality Reduction with t-SNE
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import time
# Run t-SNE (speed up with subset)
subset_size = min(1000, len(embeddings))
subset_indices = np.random.choice(len(embeddings), subset_size, replace=False)
embeddings_subset = embeddings[subset_indices]
targets_subset = targets[subset_indices]
print(f"Running t-SNE (subset size: {subset_size})...")
start_time = time.time()
tsne = TSNE(
n_components=2,
perplexity=30,
n_iter=1000,  # renamed to max_iter in scikit-learn >= 1.5
random_state=42,
verbose=1
)
embeddings_tsne = tsne.fit_transform(embeddings_subset)
elapsed_time = time.time() - start_time
print(f"t-SNE complete (elapsed time: {elapsed_time:.1f} seconds)")
# Visualization
plt.figure(figsize=(12, 9))
scatter = plt.scatter(embeddings_tsne[:, 0], embeddings_tsne[:, 1],
c=targets_subset, cmap='plasma', s=50, alpha=0.6,
edgecolors='black', linewidth=0.5)
plt.xlabel('t-SNE 1', fontsize=14, fontweight='bold')
plt.ylabel('t-SNE 2', fontsize=14, fontweight='bold')
plt.title('t-SNE: 2D Projection of CGCNN Embeddings',
fontsize=16, fontweight='bold')
plt.colorbar(scatter, label='Band Gap (eV)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('tsne_2d_projection.png', dpi=300, bbox_inches='tight')
print("Saved t-SNE 2D projection to tsne_2d_projection.png")
plt.show()
Code Example 15: Comparison of Dimensionality Reduction Methods
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 15: Comparison of Dimensionality Reduction Methods
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
import numpy as np
# Display results from three methods side by side
fig, axes = plt.subplots(1, 3, figsize=(21, 6))
# PCA
scatter = axes[0].scatter(embeddings_pca_2d[:, 0], embeddings_pca_2d[:, 1],
c=targets, cmap='viridis', s=30, alpha=0.6,
edgecolors='none')
axes[0].set_xlabel('PC1', fontsize=12, fontweight='bold')
axes[0].set_ylabel('PC2', fontsize=12, fontweight='bold')
axes[0].set_title('PCA', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
# UMAP
scatter = axes[1].scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c=targets, cmap='viridis', s=30, alpha=0.6,
edgecolors='none')
axes[1].set_xlabel('UMAP 1', fontsize=12, fontweight='bold')
axes[1].set_ylabel('UMAP 2', fontsize=12, fontweight='bold')
axes[1].set_title('UMAP', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
# t-SNE
scatter = axes[2].scatter(embeddings_tsne[:, 0], embeddings_tsne[:, 1],
c=targets_subset, cmap='viridis', s=30, alpha=0.6,
edgecolors='none')
axes[2].set_xlabel('t-SNE 1', fontsize=12, fontweight='bold')
axes[2].set_ylabel('t-SNE 2', fontsize=12, fontweight='bold')
axes[2].set_title('t-SNE', fontsize=14, fontweight='bold')
axes[2].grid(True, alpha=0.3)
# Common colorbar
fig.subplots_adjust(right=0.9)
cbar_ax = fig.add_axes([0.92, 0.15, 0.02, 0.7])
cbar = fig.colorbar(scatter, cax=cbar_ax)
cbar.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
plt.savefig('dimensionality_reduction_comparison.png', dpi=300, bbox_inches='tight')
print("Saved dimensionality reduction comparison to dimensionality_reduction_comparison.png")
plt.show()
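The visual comparison can be complemented with a quantitative check: scikit-learn's trustworthiness score measures how well local neighborhoods of the high-dimensional space are preserved in the 2D projection (1.0 is perfect). A sketch; n_neighbors=15 is an arbitrary but common choice:
from sklearn.manifold import trustworthiness

for name, emb2d in [('PCA', embeddings_pca_2d), ('UMAP', embeddings_umap)]:
    score = trustworthiness(embeddings, emb2d, n_neighbors=15)
    print(f"{name} trustworthiness: {score:.3f}")
# t-SNE was run on a subset, so compare against the matching subset
score = trustworthiness(embeddings_subset, embeddings_tsne, n_neighbors=15)
print(f"t-SNE trustworthiness: {score:.3f}")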
4.5 Materials Space Analysis
Code Example 16: Clustering and Property Analysis
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - pandas>=2.0.0, <2.2.0
# - seaborn>=0.12.0
"""
Example: Code Example 16: Clustering and Property Analysis
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
from sklearn.cluster import KMeans
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# K-Means clustering (on UMAP space)
n_clusters = 6
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=20)
cluster_labels = kmeans.fit_predict(embeddings_umap)
# Visualization by cluster
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))
# Color by cluster label
colors = plt.cm.Set3(np.linspace(0, 1, n_clusters))
for cluster_id in range(n_clusters):
mask = cluster_labels == cluster_id
ax1.scatter(embeddings_umap[mask, 0], embeddings_umap[mask, 1],
c=[colors[cluster_id]], label=f'Cluster {cluster_id}',
s=60, alpha=0.7, edgecolors='black', linewidth=0.5)
# Cluster centers: k-means was fit directly in the 2D UMAP space, so the
# centroids are already UMAP coordinates and can be plotted as-is
# (transforming them through a UMAP fitted on the 128-D embeddings would fail)
centers_umap = kmeans.cluster_centers_
ax1.scatter(centers_umap[:, 0], centers_umap[:, 1],
c='red', marker='X', s=300, edgecolors='black',
linewidth=2, label='Centroids', zorder=10)
ax1.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax1.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax1.set_title(f'K-Means Clustering (k={n_clusters})',
fontsize=16, fontweight='bold')
ax1.legend(fontsize=10, loc='best', ncol=2)
ax1.grid(True, alpha=0.3)
# Property distribution by cluster
cluster_df = pd.DataFrame({
'cluster': cluster_labels,
'band_gap': targets
})
sns.boxplot(data=cluster_df, x='cluster', y='band_gap',
ax=ax2, palette='Set3')
sns.swarmplot(data=cluster_df, x='cluster', y='band_gap',
ax=ax2, color='black', alpha=0.3, size=2)
ax2.set_xlabel('Cluster ID', fontsize=14, fontweight='bold')
ax2.set_ylabel('Band Gap (eV)', fontsize=14, fontweight='bold')
ax2.set_title('Band Gap Distribution by Cluster',
fontsize=16, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.savefig('clustering_analysis.png', dpi=300, bbox_inches='tight')
print("Saved clustering analysis to clustering_analysis.png")
plt.show()
# Cluster statistics
print("\nBand gap statistics by cluster:")
cluster_stats = cluster_df.groupby('cluster')['band_gap'].agg([
'count', 'mean', 'std', 'min', 'max'
])
print(cluster_stats.round(3))
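The choice k=6 above was ad hoc. The silhouette score offers a quick way to compare candidate cluster counts (higher is better), though the resulting maps should always be inspected visually as well; a sketch:
from sklearn.metrics import silhouette_score

for k in range(3, 10):
    labels_k = KMeans(n_clusters=k, random_state=42, n_init=20).fit_predict(embeddings_umap)
    score = silhouette_score(embeddings_umap, labels_k)
    print(f"k={k}: silhouette score = {score:.3f}")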
Code Example 17: Material Property and Cluster Relationship Analysis
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - pandas>=2.0.0, <2.2.0
# - seaborn>=0.12.0
"""
Example: Code Example 17: Material Property and Cluster Relationship Analysis
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Integrate additional material properties into DataFrame
materials_analysis_df = materials_df.copy()
materials_analysis_df['cluster'] = cluster_labels
materials_analysis_df['umap1'] = embeddings_umap[:, 0]
materials_analysis_df['umap2'] = embeddings_umap[:, 1]
# Statistics of multiple properties by cluster
properties_to_analyze = [
'band_gap', 'formation_energy_per_atom',
'density', 'energy_above_hull', 'volume'
]
cluster_property_stats = materials_analysis_df.groupby('cluster')[
properties_to_analyze
].mean()
# Visualize with heatmap; z-score each property across clusters first so that
# quantities on very different scales share one meaningful color map
stats_z = (cluster_property_stats - cluster_property_stats.mean()) / cluster_property_stats.std()
plt.figure(figsize=(12, 8))
sns.heatmap(stats_z.T, annot=True, fmt='.2f',
cmap='RdYlGn_r', center=0, linewidths=1,
cbar_kws={"label": "Z-scored Mean Value"})
plt.title('Average Material Properties by Cluster',
fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Cluster ID', fontsize=14, fontweight='bold')
plt.ylabel('Property', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('cluster_properties_heatmap.png', dpi=300, bbox_inches='tight')
print("Saved cluster properties heatmap to cluster_properties_heatmap.png")
plt.show()
Code Example 18: Creating Density Maps
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 18: Creating Density Maps
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
import numpy as np
# Density estimation in UMAP space
xy = np.vstack([embeddings_umap[:, 0], embeddings_umap[:, 1]])
density = gaussian_kde(xy)(xy)
# Sort by density (draw high-density points on top)
idx = density.argsort()
x, y, z = embeddings_umap[idx, 0], embeddings_umap[idx, 1], density[idx]
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))
# Density map
scatter1 = ax1.scatter(x, y, c=z, cmap='hot', s=50, alpha=0.7,
edgecolors='black', linewidth=0.3)
ax1.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax1.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax1.set_title('Materials Space Density Map',
fontsize=16, fontweight='bold')
ax1.grid(True, alpha=0.3)
cbar1 = plt.colorbar(scatter1, ax=ax1)
cbar1.set_label('Point Density', fontsize=12, fontweight='bold')
# Density map + Band gap
scatter2 = ax2.scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c=targets, s=50 + density*1000, alpha=0.6,
cmap='viridis', edgecolors='black', linewidth=0.5)
ax2.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax2.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax2.set_title('Band Gap with Density (size = density)',
fontsize=16, fontweight='bold')
ax2.grid(True, alpha=0.3)
cbar2 = plt.colorbar(scatter2, ax=ax2)
cbar2.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig('density_map.png', dpi=300, bbox_inches='tight')
print("Saved density map to density_map.png")
plt.show()
Code Example 19: Searching for Similar Materials
# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0
from sklearn.neighbors import NearestNeighbors
import pandas as pd
def find_similar_materials_in_embedding_space(
query_idx, embeddings, materials_df, k=10
):
"""
Search for similar materials in embedding space
Parameters:
-----------
query_idx : int
Query material index
embeddings : np.ndarray
Embedding vectors
materials_df : pd.DataFrame
Materials data
k : int
Number of neighbors to search
Returns:
--------
results_df : pd.DataFrame
Information on neighboring materials
"""
nbrs = NearestNeighbors(n_neighbors=k+1, metric='cosine').fit(embeddings)
distances, indices = nbrs.kneighbors(embeddings[query_idx:query_idx+1])
# Create results DataFrame
results = []
for i, (idx, dist) in enumerate(zip(indices[0], distances[0])):
if i == 0: # Skip query itself
continue
result = {
'rank': i,
'material_id': materials_df.iloc[idx]['material_id'],
'formula': materials_df.iloc[idx]['formula'],
'band_gap': materials_df.iloc[idx]['band_gap'],
'formation_energy': materials_df.iloc[idx]['formation_energy_per_atom'],
'density': materials_df.iloc[idx]['density'],
'distance': dist
}
results.append(result)
results_df = pd.DataFrame(results)
return results_df
# Usage example: Randomly select materials and search for similar materials
np.random.seed(42)
query_indices = np.random.choice(len(materials_df), 3, replace=False)
print("Similar materials search examples:\n")
for query_idx in query_indices:
query_material = materials_df.iloc[query_idx]
print(f"Query material:")
print(f" Material ID: {query_material['material_id']}")
print(f" Formula: {query_material['formula']}")
print(f" Band Gap: {query_material['band_gap']:.3f} eV")
print(f"\nSimilar materials (Top 5):")
similar_materials = find_similar_materials_in_embedding_space(
query_idx, embeddings, materials_df, k=5
)
print(similar_materials[['rank', 'material_id', 'formula', 'band_gap', 'distance']].to_string(index=False))
print("\n" + "="*80 + "\n")
Code Example 20: Material Recommendation System
def recommend_materials_by_property(
target_property_value,
property_name,
embeddings,
materials_df,
n_recommendations=10,
property_weight=0.7,
embedding_weight=0.3
):
"""
Recommend materials based on target property value
Parameters:
-----------
target_property_value : float
Target property value
property_name : str
Property name
embeddings : np.ndarray
Embedding vectors
materials_df : pd.DataFrame
Materials data
n_recommendations : int
Number of recommendations
property_weight : float
Weight for property value
embedding_weight : float
Weight for embedding distance
Returns:
--------
recommendations_df : pd.DataFrame
List of recommended materials
"""
# Property value difference
property_diff = np.abs(materials_df[property_name].values - target_property_value)
property_diff_normalized = property_diff / property_diff.max()
# Index of material closest to target property value
closest_idx = property_diff.argmin()
# Distance in embedding space
from sklearn.metrics.pairwise import cosine_distances
embedding_distances = cosine_distances(
embeddings[closest_idx:closest_idx+1],
embeddings
).flatten()
embedding_distances_normalized = embedding_distances / embedding_distances.max()
# Combined score (smaller is better)
combined_score = (
property_weight * property_diff_normalized +
embedding_weight * embedding_distances_normalized
)
# Select top n_recommendations
top_indices = combined_score.argsort()[:n_recommendations]
# Create results DataFrame
recommendations = []
for rank, idx in enumerate(top_indices, 1):
rec = {
'rank': rank,
'material_id': materials_df.iloc[idx]['material_id'],
'formula': materials_df.iloc[idx]['formula'],
property_name: materials_df.iloc[idx][property_name],
'property_diff': property_diff[idx],
'embedding_distance': embedding_distances[idx],
'combined_score': combined_score[idx]
}
recommendations.append(rec)
recommendations_df = pd.DataFrame(recommendations)
return recommendations_df
# Usage example: Recommend materials close to band gap 2.0 eV
target_bandgap = 2.0
print(f"Target band gap: {target_bandgap} eV")
print("\nRecommended materials (Top 10):\n")
recommendations = recommend_materials_by_property(
target_property_value=target_bandgap,
property_name='band_gap',
embeddings=embeddings,
materials_df=materials_df,
n_recommendations=10,
property_weight=0.7,
embedding_weight=0.3
)
print(recommendations[['rank', 'material_id', 'formula', 'band_gap',
'property_diff', 'combined_score']].to_string(index=False))
4.6 Interactive Visualization
Code Example 21: 3D UMAP with Plotly
# Requirements:
# - Python 3.9+
# - pandas>=2.0.0, <2.2.0
# - plotly>=5.14.0
"""
Example: Code Example 21: 3D UMAP with Plotly
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import plotly.express as px
import plotly.graph_objects as go
import umap
import pandas as pd
# 3D UMAP
print("Running 3D UMAP...")
reducer_3d = umap.UMAP(
n_components=3,
n_neighbors=15,
min_dist=0.1,
random_state=42,
verbose=False
)
embeddings_umap_3d = reducer_3d.fit_transform(embeddings)
# Create DataFrame
df_3d = pd.DataFrame({
'UMAP1': embeddings_umap_3d[:, 0],
'UMAP2': embeddings_umap_3d[:, 1],
'UMAP3': embeddings_umap_3d[:, 2],
'Band_Gap': targets,
'Material_ID': materials_df['material_id'].values,
'Formula': materials_df['formula'].values,
'Formation_Energy': materials_df['formation_energy_per_atom'].values,
'Density': materials_df['density'].values,
'Cluster': cluster_labels
})
# Interactive 3D plot
fig = px.scatter_3d(
df_3d,
x='UMAP1', y='UMAP2', z='UMAP3',
color='Band_Gap',
size='Density',
hover_data=['Material_ID', 'Formula', 'Formation_Energy', 'Cluster'],
color_continuous_scale='Viridis',
title='Interactive 3D UMAP: Materials Space Explorer'
)
fig.update_traces(marker=dict(line=dict(width=0.5, color='DarkSlateGrey')))
fig.update_layout(
scene=dict(
xaxis_title='UMAP 1',
yaxis_title='UMAP 2',
zaxis_title='UMAP 3',
xaxis=dict(backgroundcolor="rgb(230, 230, 230)", gridcolor="white"),
yaxis=dict(backgroundcolor="rgb(230, 230, 230)", gridcolor="white"),
zaxis=dict(backgroundcolor="rgb(230, 230, 230)", gridcolor="white"),
),
width=1000,
height=800,
font=dict(size=12)
)
fig.write_html('materials_3d_interactive.html')
print("Saved interactive 3D visualization to materials_3d_interactive.html")
fig.show()
Code Example 22: Interactive Scatter Plot with Bokeh
from bokeh.plotting import figure, output_file, save
from bokeh.models import HoverTool, ColorBar, LinearColorMapper, ColumnDataSource
from bokeh.palettes import Viridis256
from bokeh.io import show
from bokeh.layouts import column, row
from bokeh.models.widgets import Select
# Color mapper
color_mapper = LinearColorMapper(
palette=Viridis256,
low=targets.min(),
high=targets.max()
)
# Data source
source = ColumnDataSource(data=dict(
x=embeddings_umap[:, 0],
y=embeddings_umap[:, 1],
material_id=materials_df['material_id'].values,
formula=materials_df['formula'].values,
band_gap=targets,
formation_energy=materials_df['formation_energy_per_atom'].values,
density=materials_df['density'].values,
volume=materials_df['volume'].values,
cluster=cluster_labels
))
# Create plot
output_file('materials_interactive.html')
p = figure(
width=1000,
height=800,
title='Interactive Materials Space Explorer (UMAP)',
tools='pan,wheel_zoom,box_zoom,box_select,lasso_select,reset,save'
)
# Scatter plot (p.scatter works in both Bokeh 2.x and 3.x; p.circle with a
# size argument is deprecated in newer Bokeh releases)
circles = p.scatter(
'x', 'y',
size=8,
source=source,
fill_color={'field': 'band_gap', 'transform': color_mapper},
fill_alpha=0.7,
line_color='black',
line_width=0.5
)
# Hover tool
hover = HoverTool(tooltips=[
('Material ID', '@material_id'),
('Formula', '@formula'),
('Band Gap', '@band_gap{0.00} eV'),
('Formation E', '@formation_energy{0.00} eV/atom'),
('Density', '@density{0.00} g/cm³'),
('Volume', '@volume{0.0} Å³'),
('Cluster', '@cluster')
])
p.add_tools(hover)
# Color bar
color_bar = ColorBar(
color_mapper=color_mapper,
label_standoff=12,
title='Band Gap (eV)',
location=(0, 0)
)
p.add_layout(color_bar, 'right')
# Axis labels
p.xaxis.axis_label = 'UMAP 1'
p.yaxis.axis_label = 'UMAP 2'
p.title.text_font_size = '16pt'
p.xaxis.axis_label_text_font_size = '14pt'
p.yaxis.axis_label_text_font_size = '14pt'
save(p)
print("Saved interactive UMAP to materials_interactive.html")
show(p)
Code Example 23: Dashboard with Dash (Optional)
# Requirements:
# - Python 3.9+
# - dash>=2.11.0
# - pandas>=2.0.0, <2.2.0
# - plotly>=5.14.0
# Install Dash (first time only)
# !pip install dash
"""
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
# Create Dash app
app = dash.Dash(__name__)
# Prepare DataFrame
df_dash = pd.DataFrame({
'umap1': embeddings_umap[:, 0],
'umap2': embeddings_umap[:, 1],
'material_id': materials_df['material_id'].values,
'formula': materials_df['formula'].values,
'band_gap': targets,
'formation_energy': materials_df['formation_energy_per_atom'].values,
'density': materials_df['density'].values,
'cluster': cluster_labels
})
# Layout
app.layout = html.Div([
html.H1('Materials Space Explorer Dashboard'),
html.Div([
html.Label('Color by:'),
dcc.Dropdown(
id='color-dropdown',
options=[
{'label': 'Band Gap', 'value': 'band_gap'},
{'label': 'Formation Energy', 'value': 'formation_energy'},
{'label': 'Density', 'value': 'density'},
{'label': 'Cluster', 'value': 'cluster'}
],
value='band_gap'
),
], style={'width': '30%', 'display': 'inline-block'}),
dcc.Graph(id='umap-scatter'),
html.Div(id='material-info')
])
# Callback
@app.callback(
Output('umap-scatter', 'figure'),
Input('color-dropdown', 'value')
)
def update_scatter(color_by):
fig = px.scatter(
df_dash,
x='umap1',
y='umap2',
color=color_by,
hover_data=['material_id', 'formula', 'band_gap', 'formation_energy'],
color_continuous_scale='Viridis',
title=f'UMAP colored by {color_by}'
)
fig.update_traces(marker=dict(size=8, line=dict(width=0.5, color='black')))
return fig
# Run app
if __name__ == '__main__':
app.run(debug=True)  # use app.run_server(debug=True) on older Dash releases
"""
print("Displayed Dash dashboard code example")
print("Uncomment and run to execute")
4.7 Advanced Analysis and Applications
Code Example 24: Voronoi Tessellation of Materials Space
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 24: Voronoi Tessellation of Materials Space
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
from scipy.spatial import Voronoi, voronoi_plot_2d
import matplotlib.pyplot as plt
import numpy as np
# Create Voronoi diagram with subset (reduce computation)
subset_size = 100
subset_indices = np.random.choice(len(embeddings_umap), subset_size, replace=False)
points = embeddings_umap[subset_indices]
targets_subset = targets[subset_indices]
# Compute Voronoi diagram
vor = Voronoi(points)
# Plot
fig, ax = plt.subplots(figsize=(14, 12))
# Voronoi cells
voronoi_plot_2d(vor, ax=ax, show_vertices=False, line_colors='gray',
line_width=1, line_alpha=0.6, point_size=0)
# Material points
scatter = ax.scatter(points[:, 0], points[:, 1],
c=targets_subset, cmap='viridis',
s=100, alpha=0.8, edgecolors='black', linewidth=1.5,
zorder=5)
ax.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax.set_title('Voronoi Tessellation of Materials Space',
fontsize=16, fontweight='bold')
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
# Limit axis range (exclude infinity points)
ax.set_xlim([points[:, 0].min() - 1, points[:, 0].max() + 1])
ax.set_ylim([points[:, 1].min() - 1, points[:, 1].max() + 1])
plt.tight_layout()
plt.savefig('voronoi_tessellation.png', dpi=300, bbox_inches='tight')
print("Saved Voronoi tessellation to voronoi_tessellation.png")
plt.show()
Code Example 25: Property Gradient Visualization
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 25: Property Gradient Visualization
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import griddata
# Create grid
grid_x, grid_y = np.mgrid[
embeddings_umap[:, 0].min():embeddings_umap[:, 0].max():100j,
embeddings_umap[:, 1].min():embeddings_umap[:, 1].max():100j
]
# Interpolation
grid_z = griddata(
embeddings_umap,
targets,
(grid_x, grid_y),
method='cubic'
)
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))
# Contour plot
contour = ax1.contourf(grid_x, grid_y, grid_z, levels=20, cmap='viridis', alpha=0.8)
ax1.scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c='white', s=5, alpha=0.5, edgecolors='none')
ax1.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax1.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax1.set_title('Band Gap Contour Map', fontsize=16, fontweight='bold')
cbar1 = plt.colorbar(contour, ax=ax1)
cbar1.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
# Gradient vectors (np.mgrid uses 'ij' indexing: axis 0 of grid_z runs along
# x and axis 1 along y, so np.gradient returns (d/dx, d/dy) in that order)
gradient_x, gradient_y = np.gradient(grid_z)
# Subsampling (adjust arrow density)
skip = 5
ax2.contourf(grid_x, grid_y, grid_z, levels=20, cmap='viridis', alpha=0.6)
ax2.quiver(grid_x[::skip, ::skip], grid_y[::skip, ::skip],
gradient_x[::skip, ::skip], gradient_y[::skip, ::skip],
color='white', alpha=0.8, scale=50)
ax2.scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c='black', s=5, alpha=0.3, edgecolors='none')
ax2.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax2.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax2.set_title('Band Gap Gradient Field', fontsize=16, fontweight='bold')
cbar2 = plt.colorbar(contour, ax=ax2)
cbar2.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig('property_gradient.png', dpi=300, bbox_inches='tight')
print("Saved property gradient to property_gradient.png")
plt.show()
Code Example 26: Embedding Stability Evaluation
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.2.0
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
def evaluate_embedding_stability(model, dataset, device, n_splits=5):
"""
Evaluate embedding stability with cross-validation
Parameters:
-----------
model : nn.Module
Model (for re-training)
dataset : Dataset
Dataset
device : torch.device
Device
n_splits : int
Number of splits
    Returns:
    --------
    similarities : list
        Pairwise mean cosine similarities between fold embeddings (stability scores)
"""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    embeddings_list = []
    for fold, (train_idx, val_idx) in enumerate(kfold.split(np.arange(len(dataset)))):
        print(f"Fold {fold+1}/{n_splits}")
        # Omitted: re-train the model for each fold.
        # In practice, create a fresh model instance, train it on train_idx, and
        # extract embeddings from the re-trained model (see the sketch below).
        # This demo reuses the existing embeddings, so similarities are trivially ~1.0.
        embeddings_list.append(embeddings)
# Calculate similarity between embeddings
from sklearn.metrics.pairwise import cosine_similarity
similarities = []
for i in range(len(embeddings_list)):
for j in range(i+1, len(embeddings_list)):
sim_matrix = cosine_similarity(embeddings_list[i], embeddings_list[j])
# Average of diagonal elements (similarity between corresponding embeddings for each point)
sim_score = np.mean(np.diag(sim_matrix))
similarities.append(sim_score)
return similarities
# Execute (demo)
print("Embedding stability evaluation (demo):")
stability_scores = evaluate_embedding_stability(model, dataset, device, n_splits=3)
print(f"Average similarity: {np.mean(stability_scores):.3f}")
print(f"Standard deviation: {np.std(stability_scores):.3f}")
Code Example 27: Ensemble Model Embedding Integration
import numpy as np

def create_ensemble_embeddings(models_list, dataset, device):
"""
Integrate embeddings from multiple models
Parameters:
-----------
models_list : list
List of trained models
dataset : Dataset
Dataset
device : torch.device
Device
Returns:
--------
ensemble_embeddings : np.ndarray
Integrated embeddings
"""
all_embeddings = []
for model in models_list:
emb, _ = extract_all_embeddings(model, dataset, device)
all_embeddings.append(emb)
# Take average
ensemble_embeddings = np.mean(all_embeddings, axis=0)
return ensemble_embeddings
# Demo: reuse a single model three times, so the average equals the original
# embeddings (in practice, train independent models; see the sketch below)
print("Ensemble embedding demo:")
models_list = [model, model, model]  # in practice, independently trained models
ensemble_emb = create_ensemble_embeddings(models_list, dataset, device)
print(f"Ensemble embedding shape: {ensemble_emb.shape}")
Code Example 28: Time-series Materials Space Changes (Extensibility)
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
"""
Track changes in materials space with time-series data or multiple model versions
def visualize_materials_space_evolution(
embeddings_timeline,
targets_timeline,
timestamps
):
'''
Visualize time-series changes in materials space
Parameters:
-----------
embeddings_timeline : list of np.ndarray
Embeddings at each time point
targets_timeline : list of np.ndarray
Targets at each time point
timestamps : list
Timestamps
Returns:
--------
animation or multiple plots
'''
import matplotlib.animation as animation
fig, ax = plt.subplots(figsize=(10, 8))
def update(frame):
ax.clear()
embeddings = embeddings_timeline[frame]
targets = targets_timeline[frame]
scatter = ax.scatter(embeddings[:, 0], embeddings[:, 1],
c=targets, cmap='viridis', s=50, alpha=0.6)
ax.set_title(f'Materials Space at {timestamps[frame]}')
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
return scatter,
anim = animation.FuncAnimation(fig, update, frames=len(timestamps),
interval=500, blit=False)
return anim
"""
print("Time-series visualization sketch shown above (extensibility example; not executed)")
Code Example 29: Extrapolation Region Detection
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scikit-learn>=1.2.0
"""
Example: Code Example 29: Extrapolation Region Detection
Purpose: Flag embedding-space regions outside the training distribution with Isolation Forest
Target: Advanced
Execution time: 1-5 minutes
Dependencies: scikit-learn
"""
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt
import numpy as np
# Learn embedding range from training data
train_embeddings = embeddings[train_idx]
# Detect outliers (extrapolation regions) with Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
iso_forest.fit(train_embeddings)
# Predict on all data
outlier_labels = iso_forest.predict(embeddings)
outlier_scores = iso_forest.score_samples(embeddings)
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))
# Color by outlier label
colors_outlier = ['red' if label == -1 else 'blue' for label in outlier_labels]
ax1.scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c=colors_outlier, s=50, alpha=0.6, edgecolors='black', linewidth=0.5)
ax1.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax1.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax1.set_title('Extrapolation Region Detection (Red = Outlier)',
fontsize=16, fontweight='bold')
ax1.grid(True, alpha=0.3)
# Outlier score
scatter = ax2.scatter(embeddings_umap[:, 0], embeddings_umap[:, 1],
c=outlier_scores, cmap='RdYlGn', s=50, alpha=0.6,
edgecolors='black', linewidth=0.5)
ax2.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax2.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax2.set_title('Anomaly Score (Green = Normal, Red = Outlier)',
fontsize=16, fontweight='bold')
ax2.grid(True, alpha=0.3)
cbar = plt.colorbar(scatter, ax=ax2)
cbar.set_label('Anomaly Score', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig('extrapolation_detection.png', dpi=300, bbox_inches='tight')
print("Saved extrapolation detection to extrapolation_detection.png")
plt.show()
print(f"\nNumber of outliers: {np.sum(outlier_labels == -1)} / {len(outlier_labels)}")
Code Example 30: Comprehensive Report Generation
import json
import numpy as np
from datetime import datetime
def generate_comprehensive_report(
model, dataset, embeddings, targets,
materials_df, cluster_labels, test_predictions, test_targets
):
"""
Generate comprehensive analysis report
Parameters:
-----------
model : nn.Module
Trained model
dataset : Dataset
Dataset
embeddings : np.ndarray
Embeddings
targets : np.ndarray
Targets
materials_df : pd.DataFrame
Materials data
cluster_labels : np.ndarray
Cluster labels
test_predictions : np.ndarray
Test predictions
test_targets : np.ndarray
Test true values
Returns:
--------
report : dict
Report dictionary
"""
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Model performance
mae = mean_absolute_error(test_targets, test_predictions)
rmse = np.sqrt(mean_squared_error(test_targets, test_predictions))
r2 = r2_score(test_targets, test_predictions)
# Embedding statistics
embedding_stats = {
'dimension': embeddings.shape[1],
'mean_norm': float(np.mean(np.linalg.norm(embeddings, axis=1))),
'std_norm': float(np.std(np.linalg.norm(embeddings, axis=1))),
}
# Cluster statistics
cluster_stats = {}
for cluster_id in np.unique(cluster_labels):
mask = cluster_labels == cluster_id
cluster_stats[f'cluster_{cluster_id}'] = {
'size': int(np.sum(mask)),
'mean_band_gap': float(np.mean(targets[mask])),
'std_band_gap': float(np.std(targets[mask]))
}
# Report
report = {
'metadata': {
'timestamp': datetime.now().isoformat(),
'dataset_size': len(dataset),
'model_parameters': sum(p.numel() for p in model.parameters())
},
'model_performance': {
'MAE': float(mae),
'RMSE': float(rmse),
'R2': float(r2)
},
'embedding_statistics': embedding_stats,
'cluster_statistics': cluster_stats,
'target_statistics': {
'mean': float(np.mean(targets)),
'std': float(np.std(targets)),
'min': float(np.min(targets)),
'max': float(np.max(targets))
}
}
return report
# Generate report
report = generate_comprehensive_report(
model, dataset, embeddings, targets,
materials_df, cluster_labels,
test_predictions, test_targets
)
# Save in JSON format
with open('comprehensive_report.json', 'w') as f:
json.dump(report, f, indent=2)
print("Saved comprehensive report to comprehensive_report.json\n")
print("Report summary:")
print(json.dumps(report, indent=2)[:1000] + "...")
# Generate Markdown report
markdown_report = f"""
# Materials Space Mapping - Comprehensive Report
Generated: {report['metadata']['timestamp']}
## 1. Dataset Information
- Dataset size: {report['metadata']['dataset_size']}
- Model parameters: {report['metadata']['model_parameters']:,}
## 2. Model Performance
| Metric | Value |
|--------|-------|
| MAE | {report['model_performance']['MAE']:.4f} |
| RMSE | {report['model_performance']['RMSE']:.4f} |
| R² | {report['model_performance']['R2']:.4f} |
## 3. Embedding Statistics
- Dimensions: {report['embedding_statistics']['dimension']}
- Mean norm: {report['embedding_statistics']['mean_norm']:.3f}
- Norm standard deviation: {report['embedding_statistics']['std_norm']:.3f}
## 4. Target Property Statistics
| Statistic | Value (eV) |
|-----------|------------|
| Mean | {report['target_statistics']['mean']:.3f} |
| Std Dev | {report['target_statistics']['std']:.3f} |
| Min | {report['target_statistics']['min']:.3f} |
| Max | {report['target_statistics']['max']:.3f} |
## 5. Cluster Analysis
"""
for cluster_id, stats in report['cluster_statistics'].items():
markdown_report += f"- **{cluster_id}**: Size={stats['size']}, Mean band gap={stats['mean_band_gap']:.3f} eV\n"
markdown_report += """
## 6. Generated Visualizations
- `training_curve.png`: Training curve
- `test_evaluation.png`: Test evaluation
- `pca_2d_projection.png`: PCA 2D projection
- `umap_2d_projection.png`: UMAP 2D projection
- `tsne_2d_projection.png`: t-SNE 2D projection
- `clustering_analysis.png`: Clustering analysis
- `materials_3d_interactive.html`: Interactive 3D visualization
- `materials_interactive.html`: Interactive 2D visualization
## 7. Summary
In this analysis, by combining GNN-based materials representation learning with dimensionality
reduction, we effectively visualized the high-dimensional materials space, revealing similarities
between materials as well as trends in their properties.
"""
# Save Markdown report
with open('comprehensive_report.md', 'w') as f:
f.write(markdown_report)
print("\nSaved Markdown report to comprehensive_report.md")
4.8 Summary
In this chapter, we built a practical materials mapping system combining GNN and dimensionality reduction.
Implemented Features
| Feature | Code Examples | Main Content |
|---|---|---|
| Data Collection & Preparation | Examples 1-6 | MP API, graph conversion, dataset |
| Model Training | Examples 7-10 | CGCNN, training loop, evaluation |
| Embedding Extraction & Dimensionality Reduction | Examples 11-15 | PCA, UMAP, t-SNE, comparison |
| Materials Space Analysis | Examples 16-20 | Clustering, neighbor search, recommendation |
| Interactive Visualization | Examples 21-23 | Plotly 3D, Bokeh, Dash |
| Advanced Analysis | Examples 24-30 | Voronoi, gradient, extrapolation, report |
Best Practices
- Data Preprocessing: Standardization, missing value handling, outlier removal
- Model Training: Early stopping, learning rate scheduling, regularization
- Dimensionality Reduction: Comparison of multiple methods, parameter tuning
- Visualization: Interactivity, multiple perspectives, interpretability
- Reproducibility: Fixed random seeds, saved configurations, version control (see the sketch after this list)
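As a concrete example of the reproducibility point above, a minimal seed-fixing helper such as the following (a common pattern; adapt the seed handling to your own setup) can be called once at the top of the pipeline:
import random
import numpy as np
import torch

def set_seed(seed=42):
    """Fix the random seeds of every library used in this chapter's pipeline."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)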
Applications
- Materials Discovery: Recommendation of materials with target properties
- Property Prediction: Prediction of properties for novel materials
- Structure-Property Relationships: Elucidating relationships between structure and properties in materials space
- Experimental Design: Suggestions for next materials to synthesize
Future Developments
- Multi-task Learning: Simultaneous prediction of multiple properties
- Transfer Learning: Application to small datasets
- Active Learning: Efficient data collection strategies
- Explainable AI: Improving interpretability of GNN predictions
Previous Chapter: Chapter 3: Materials Representation Learning with GNN
Series Top: Introduction to Materials Property Mapping
Further Learning
Recommended Literature
- GNN for Materials: "Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties" (Xie & Grossman, 2018)
- UMAP: "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction" (McInnes et al., 2018)
- Materials Informatics: "Machine Learning in Materials Informatics: Recent Applications and Prospects" (Ramprasad et al., 2017)
Congratulations! Through this series, you have learned the fundamentals and practical applications of materials property mapping.