Chapter 2: Materials Space Mapping Using Dimensionality Reduction Techniques

This chapter shows how to map materials space with dimensionality reduction: you will apply PCA, t-SNE, and UMAP to materials property data, compare the resulting maps, and build interactive visualizations.

Overview

By projecting high-dimensional materials property data into 2D or 3D space, we can visually grasp similarities and structures among materials. This chapter applies dimensionality reduction techniques such as Principal Component Analysis (PCA), t-SNE, and UMAP to materials data to achieve effective materials space mapping.

Learning Objectives

Understand the principles and characteristics of PCA, t-SNE, and UMAP
Apply each method to materials data and compare results
Evaluate the quality of dimensionality reduction results
Implement interactive visualizations

2.1 Principal Component Analysis (PCA)

PCA is a linear dimensionality reduction technique that defines new orthogonal axes (principal components) along the directions of greatest variance in the data. Because each component is a linear combination of the original features, PCA reduces dimensionality while retaining the correlation structure among material properties.
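
To make the variance-maximization idea concrete, the following minimal sketch computes principal components directly from the covariance matrix and compares the explained variance ratios with scikit-learn's. The random data and the variable names (X_demo, manual_ratio, sk_ratio) are illustrative assumptions, not the chapter's dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative synthetic data (assumption): 200 samples, 4 correlated features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X_centered = X_demo - X_demo.mean(axis=0)

# Eigen-decomposition of the sample covariance matrix
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort them descending
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
manual_ratio = eigvals / eigvals.sum()

# Same quantity from scikit-learn
sk_ratio = PCA().fit(X_demo).explained_variance_ratio_

print("Manual :", manual_ratio.round(4))
print("sklearn:", sk_ratio.round(4))
```

The two printed rows should agree: each eigenvalue of the covariance matrix is the variance captured along the corresponding principal axis.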

2.1.1 Basic PCA Implementation

Code Example 1: Dimensionality Reduction with PCA

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - scikit-learn

"""
Example: Code Example 1: Dimensionality Reduction with PCA

Purpose: Standardize materials features and reduce them to two principal components
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: materials_properties.csv (created in Chapter 1)
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load materials data (created in Chapter 1)
materials_data = pd.read_csv('materials_properties.csv')

# Extract feature columns
feature_cols = ['band_gap', 'formation_energy', 'density',
                'bulk_modulus', 'shear_modulus', 'melting_point']
X = materials_data[feature_cols].values

# Standardization (PCA is sensitive to feature scales)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Execute PCA (reduce to 2 dimensions)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Store results in DataFrame
materials_data['PC1'] = X_pca[:, 0]
materials_data['PC2'] = X_pca[:, 1]

# Explained variance ratio of principal components
explained_variance = pca.explained_variance_ratio_
print("PCA Results:")
print(f"PC1 explained variance: {explained_variance[0]:.3f}")
print(f"PC2 explained variance: {explained_variance[1]:.3f}")
print(f"Cumulative explained variance: {sum(explained_variance):.3f}")

# Principal component loadings (weights of each feature)
components_df = pd.DataFrame(
    pca.components_.T,
    columns=['PC1', 'PC2'],
    index=feature_cols
)
print("\nPrincipal component loadings (contribution of each feature):")
print(components_df.round(3))

Sample Output:

PCA Results:
PC1 explained variance: 0.342
PC2 explained variance: 0.234
Cumulative explained variance: 0.576

Principal component loadings (contribution of each feature):
                       PC1     PC2
band_gap            -0.245   0.512
formation_energy     0.387  -0.298
density              0.456   0.321
bulk_modulus         0.498   0.145
shear_modulus        0.445   0.087
melting_point        0.312  -0.687
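
As a reading aid for loading tables of this kind: each row of pca.components_ is a unit-length vector, so the squared entries of one component sum to 1 and can be read as each feature's share of that component. The sketch below checks this on synthetic data. The feature names (feat_a, feat_b, feat_c) and values are placeholders, not the chapter's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data (assumption): 300 samples, 3 hypothetical features
rng = np.random.default_rng(1)
demo_cols = ['feat_a', 'feat_b', 'feat_c']
X_demo = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))

pca_demo = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

# Squared loadings: share of each feature within a component
share = pd.DataFrame(pca_demo.components_.T ** 2,
                     columns=['PC1', 'PC2'],
                     index=demo_cols)
print(share.round(3))
print("Column sums:", share.sum(axis=0).round(3).tolist())
```

The sign of a loading still matters for interpretation (it gives the direction of the relationship), so the squared table complements the signed one rather than replacing it.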

Code Example 2: PCA Results Visualization

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Code Example 2: PCA Results Visualization

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: materials_data and explained_variance from Code Example 1
"""

import matplotlib.pyplot as plt
import numpy as np

# PCA score plot
fig, ax = plt.subplots(figsize=(12, 9))

# Categorize by stability
colors = materials_data['formation_energy'].apply(
    lambda x: 'green' if x < -1.0 else 'orange' if x < 0 else 'red'
)

scatter = ax.scatter(materials_data['PC1'],
                     materials_data['PC2'],
                     c=colors,
                     s=50,
                     alpha=0.6,
                     edgecolors='black',
                     linewidth=0.5)

# Axis labels (including explained variance)
ax.set_xlabel(f'PC1 ({explained_variance[0]*100:.1f}% variance)',
              fontsize=14, fontweight='bold')
ax.set_ylabel(f'PC2 ({explained_variance[1]*100:.1f}% variance)',
              fontsize=14, fontweight='bold')
ax.set_title('PCA: Materials Space Visualization',
             fontsize=16, fontweight='bold')

# Grid
ax.grid(True, alpha=0.3, linestyle='--')
ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5, alpha=0.5)
ax.axvline(x=0, color='k', linestyle='-', linewidth=0.5, alpha=0.5)

# Legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='green', edgecolor='black', label='Stable (E < -1 eV/atom)'),
    Patch(facecolor='orange', edgecolor='black', label='Metastable (-1 ≤ E < 0 eV/atom)'),
    Patch(facecolor='red', edgecolor='black', label='Unstable (E ≥ 0 eV/atom)')
]
ax.legend(handles=legend_elements, loc='best', fontsize=12)

plt.tight_layout()
plt.savefig('pca_materials_space.png', dpi=300, bbox_inches='tight')
print("PCA score plot saved to pca_materials_space.png")
plt.show()

Code Example 3: PCA Scree Plot

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scikit-learn

"""
Example: Code Example 3: PCA Scree Plot

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: X_scaled from Code Example 1
"""

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Calculate all principal components
pca_full = PCA()
pca_full.fit(X_scaled)

# Scree plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Left: Explained variance ratio of each principal component
n_components = len(pca_full.explained_variance_ratio_)
ax1.bar(range(1, n_components + 1),
        pca_full.explained_variance_ratio_,
        alpha=0.7,
        edgecolor='black',
        color='steelblue')
ax1.set_xlabel('Principal Component', fontsize=14, fontweight='bold')
ax1.set_ylabel('Explained Variance Ratio', fontsize=14, fontweight='bold')
ax1.set_title('Scree Plot: Individual Variance', fontsize=16, fontweight='bold')
ax1.grid(True, alpha=0.3, axis='y')

# Right: Cumulative explained variance
cumsum_variance = np.cumsum(pca_full.explained_variance_ratio_)
ax2.plot(range(1, n_components + 1),
         cumsum_variance,
         marker='o',
         linewidth=2,
         markersize=8,
         color='darkred')
ax2.axhline(y=0.95, color='green', linestyle='--', linewidth=2,
            label='95% variance threshold', alpha=0.7)
ax2.set_xlabel('Number of Components', fontsize=14, fontweight='bold')
ax2.set_ylabel('Cumulative Explained Variance', fontsize=14, fontweight='bold')
ax2.set_title('Cumulative Variance Explained', fontsize=16, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.legend(fontsize=12)

plt.tight_layout()
plt.savefig('pca_scree_plot.png', dpi=300, bbox_inches='tight')
print("Scree plot saved to pca_scree_plot.png")
n_components_95 = np.argmax(cumsum_variance >= 0.95) + 1
print(f"\nNumber of principal components needed to explain 95% variance: {n_components_95}")
plt.show()
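
As an alternative to reading the component count off the cumulative curve, scikit-learn's PCA accepts a float between 0 and 1 for n_components (with svd_solver='full') and keeps just enough components to exceed that share of variance. The sketch below compares both routes on synthetic data. The data and the names X_demo, n_manual, and pca_auto are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data (assumption): 6 features driven by 3 latent factors plus small noise
rng = np.random.default_rng(2)
latent = rng.normal(size=(250, 3))
X_demo = latent @ rng.normal(size=(3, 6)) + 0.05 * rng.normal(size=(250, 6))

# Manual rule, as in the scree-plot example
cumsum_demo = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
n_manual = int(np.argmax(cumsum_demo >= 0.95) + 1)

# Let scikit-learn choose: a float in (0, 1) is treated as a variance target
pca_auto = PCA(n_components=0.95, svd_solver='full').fit(X_demo)

print("Manual count:", n_manual)
print("PCA(n_components=0.95) kept:", pca_auto.n_components_)
```

The float form is convenient inside pipelines, while the scree plot remains useful for judging whether 95% is a sensible threshold for the data at hand.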

Code Example 4: PCA Loading Plot (Biplot)

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Code Example 4: PCA Loading Plot (Biplot)

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: materials_data, pca, feature_cols, and explained_variance from Code Example 1
"""

import matplotlib.pyplot as plt
import numpy as np

# Biplot
fig, ax = plt.subplots(figsize=(12, 10))

# Score plot (samples)
ax.scatter(materials_data['PC1'],
           materials_data['PC2'],
           alpha=0.3,
           s=20,
           color='lightblue',
           edgecolors='none',
           label='Materials')

# Loading vectors (variables)
scale_factor = 3.0  # Vector scaling
for i, feature in enumerate(feature_cols):
    ax.arrow(0, 0,
             pca.components_[0, i] * scale_factor,
             pca.components_[1, i] * scale_factor,
             head_width=0.15,
             head_length=0.15,
             fc='red',
             ec='darkred',
             linewidth=2,
             alpha=0.8)

    # Label
    ax.text(pca.components_[0, i] * scale_factor * 1.15,
            pca.components_[1, i] * scale_factor * 1.15,
            feature.replace('_', ' ').title(),
            fontsize=11,
            fontweight='bold',
            ha='center',
            va='center',
            bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.7))

ax.set_xlabel(f'PC1 ({explained_variance[0]*100:.1f}% variance)',
              fontsize=14, fontweight='bold')
ax.set_ylabel(f'PC2 ({explained_variance[1]*100:.1f}% variance)',
              fontsize=14, fontweight='bold')
ax.set_title('PCA Biplot: Materials and Features',
             fontsize=16, fontweight='bold')

ax.grid(True, alpha=0.3, linestyle='--')
ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
ax.axvline(x=0, color='k', linestyle='-', linewidth=0.5)
ax.legend(fontsize=12, loc='upper right')

plt.tight_layout()
plt.savefig('pca_biplot.png', dpi=300, bbox_inches='tight')
print("PCA biplot saved to pca_biplot.png")
plt.show()

2.2 t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE is a non-linear dimensionality reduction technique that projects high-dimensional data into low dimensions while preserving local structure (neighborhood relationships). It excels at visualizing cluster structures.
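
The perplexity parameter used in the examples that follow can be read as an effective number of neighbors. t-SNE turns distances from a point into Gaussian neighbor probabilities and defines perplexity as 2 raised to the Shannon entropy of that distribution. The sketch below is a simplified illustration of that definition for a single point, not scikit-learn's implementation. The helper names and the random distances are assumptions.

```python
import numpy as np

def perplexity(p):
    """Perplexity 2**H(p) of a discrete distribution p (entropy H in bits)."""
    p = p[p > 0]
    return 2.0 ** (-(p * np.log2(p)).sum())

def neighbor_probs(sq_dists, sigma):
    """Gaussian neighbor probabilities p(j|i) for one point i.

    sq_dists holds the squared distances from point i to every other point.
    """
    w = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return w / w.sum()

# Illustrative squared distances (assumption): 50 other points around one point
rng = np.random.default_rng(3)
sq_dists = rng.uniform(0.1, 10.0, size=50)

for sigma in [0.3, 1.0, 3.0]:
    p = neighbor_probs(sq_dists, sigma)
    print(f"sigma={sigma:>4}: perplexity = {perplexity(p):.1f}")
```

A wider Gaussian spreads probability over more points and raises the perplexity. Setting a perplexity in t-SNE therefore amounts to choosing, per point, the kernel width that gives roughly that many effective neighbors.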

2.2.1 Basic t-SNE Implementation

Code Example 5: Dimensionality Reduction with t-SNE

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - scikit-learn

"""
Example: Code Example 5: Dimensionality Reduction with t-SNE

Purpose: Embed standardized materials features in 2D with t-SNE at several perplexities
Target: Beginner to Intermediate
Execution time: ~10 seconds (three t-SNE runs)
Dependencies: X_scaled and materials_data from Code Example 1
"""

from sklearn.manifold import TSNE
import numpy as np
import time

# Execute t-SNE (experiment with multiple perplexity values)
perplexities = [5, 30, 50]
tsne_results = {}

for perplexity in perplexities:
    print(f"\nRunning t-SNE (perplexity={perplexity})...")
    start_time = time.time()

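    # n_iter is omitted: 1000 iterations is the default, and the argument was
    # renamed to max_iter in recent scikit-learn releases
    tsne = TSNE(n_components=2,
                perplexity=perplexity,
                random_state=42,
                verbose=0)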

    X_tsne = tsne.fit_transform(X_scaled)
    tsne_results[perplexity] = X_tsne

    elapsed_time = time.time() - start_time
    print(f"Completed (elapsed time: {elapsed_time:.2f}s)")
    print(f"KL divergence: {tsne.kl_divergence_:.3f}")

# Save results (perplexity=30 case)
materials_data['tsne1'] = tsne_results[30][:, 0]
materials_data['tsne2'] = tsne_results[30][:, 1]

Sample Output:

Running t-SNE (perplexity=5)...
Completed (elapsed time: 3.45s)
KL divergence: 1.234

Running t-SNE (perplexity=30)...
Completed (elapsed time: 3.67s)
KL divergence: 0.987

Running t-SNE (perplexity=50)...
Completed (elapsed time: 3.89s)
KL divergence: 1.056

Code Example 6: Comparison of Different Perplexities

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0

"""
Example: Code Example 6: Comparison of Different Perplexities

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: tsne_results and perplexities from Code Example 5; materials_data from Code Example 1
"""

import matplotlib.pyplot as plt

# Display results for three perplexity values side by side
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for idx, perplexity in enumerate(perplexities):
    ax = axes[idx]
    X_tsne = tsne_results[perplexity]

    scatter = ax.scatter(X_tsne[:, 0],
                         X_tsne[:, 1],
                         c=materials_data['band_gap'],
                         cmap='viridis',
                         s=50,
                         alpha=0.6,
                         edgecolors='black',
                         linewidth=0.5)

    ax.set_title(f't-SNE (perplexity={perplexity})',
                 fontsize=14, fontweight='bold')
    ax.set_xlabel('t-SNE 1', fontsize=12, fontweight='bold')
    ax.set_ylabel('t-SNE 2', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3, linestyle='--')

    # Colorbar
    cbar = plt.colorbar(scatter, ax=ax)
    cbar.set_label('Band Gap (eV)', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.savefig('tsne_perplexity_comparison.png', dpi=300, bbox_inches='tight')
print("t-SNE perplexity comparison saved to tsne_perplexity_comparison.png")
plt.show()

Code Example 7: t-SNE Clustering Results Visualization

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scikit-learn

"""
Example: Code Example 7: t-SNE Clustering Results Visualization

Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: tsne_results (Code Example 5); materials_data and feature_cols (Code Example 1)
"""

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Clustering on t-SNE results
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(tsne_results[30])

materials_data['cluster'] = cluster_labels

# Visualization by cluster
fig, ax = plt.subplots(figsize=(12, 9))

colors = plt.cm.Set3(np.linspace(0, 1, n_clusters))

for cluster_id in range(n_clusters):
    mask = cluster_labels == cluster_id
    ax.scatter(tsne_results[30][mask, 0],
               tsne_results[30][mask, 1],
               c=[colors[cluster_id]],
               label=f'Cluster {cluster_id}',
               s=60,
               alpha=0.7,
               edgecolors='black',
               linewidth=0.5)

# Cluster centers
centers_tsne = kmeans.cluster_centers_
ax.scatter(centers_tsne[:, 0],
           centers_tsne[:, 1],
           c='red',
           marker='X',
           s=300,
           edgecolors='black',
           linewidth=2,
           label='Cluster Centers',
           zorder=10)

ax.set_xlabel('t-SNE 1', fontsize=14, fontweight='bold')
ax.set_ylabel('t-SNE 2', fontsize=14, fontweight='bold')
ax.set_title(f't-SNE with K-Means Clustering (k={n_clusters})',
             fontsize=16, fontweight='bold')
ax.legend(fontsize=11, loc='best')
ax.grid(True, alpha=0.3, linestyle='--')

plt.tight_layout()
plt.savefig('tsne_clustering.png', dpi=300, bbox_inches='tight')
print("t-SNE clustering results saved to tsne_clustering.png")
plt.show()

# Average property values per cluster
print("\nAverage property values per cluster:")
cluster_stats = materials_data.groupby('cluster')[feature_cols].mean()
print(cluster_stats.round(2))
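
Because t-SNE is designed to preserve neighborhoods rather than distances, clusters found in the embedding are worth cross-checking against clusters found in the original feature space. One possible check, sketched below, runs K-Means in both spaces and compares the label assignments with the adjusted Rand index (1.0 means identical partitions). The helper cluster_agreement and the blob data are assumptions for illustration and are not part of the chapter's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score

def cluster_agreement(X_high, X_low, n_clusters, seed=42):
    """ARI between K-Means labels found in two representations of the same samples."""
    labels_high = KMeans(n_clusters=n_clusters, random_state=seed,
                         n_init=10).fit_predict(X_high)
    labels_low = KMeans(n_clusters=n_clusters, random_state=seed,
                        n_init=10).fit_predict(X_low)
    return adjusted_rand_score(labels_high, labels_low)

# Illustrative data (assumption): 3 well-separated blobs in 6 dimensions
X_demo, _ = make_blobs(n_samples=150, n_features=6, centers=3,
                       cluster_std=0.5, random_state=0)
X_demo_tsne = TSNE(n_components=2, perplexity=30,
                   random_state=0).fit_transform(X_demo)

ari = cluster_agreement(X_demo, X_demo_tsne, n_clusters=3)
print(f"Adjusted Rand index (original vs t-SNE space): {ari:.3f}")
```

With the chapter's variables, the analogous call would be cluster_agreement(X_scaled, tsne_results[30], n_clusters). A low score suggests the embedding-space clusters should be interpreted with caution.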

2.3 UMAP (Uniform Manifold Approximation and Projection)

UMAP is a non-linear dimensionality reduction technique that is typically faster than t-SNE and tends to preserve more of the global structure. It scales well to large datasets.

2.3.1 UMAP Installation and Basic Implementation

Code Example 8: Dimensionality Reduction with UMAP

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - umap-learn

"""
Example: Code Example 8: Dimensionality Reduction with UMAP

Purpose: Embed standardized materials features in 2D with UMAP at several n_neighbors values
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: X_scaled and materials_data from Code Example 1
"""

# Install UMAP (first time only)
# !pip install umap-learn

import umap
import numpy as np
import time

# Execute UMAP (experiment with multiple n_neighbors values)
n_neighbors_list = [5, 15, 50]
umap_results = {}

for n_neighbors in n_neighbors_list:
    print(f"\nRunning UMAP (n_neighbors={n_neighbors})...")
    start_time = time.time()

    reducer = umap.UMAP(n_components=2,
                        n_neighbors=n_neighbors,
                        min_dist=0.1,
                        metric='euclidean',
                        random_state=42)

    X_umap = reducer.fit_transform(X_scaled)
    umap_results[n_neighbors] = X_umap

    elapsed_time = time.time() - start_time
    print(f"Completed (elapsed time: {elapsed_time:.2f}s)")

# Save results (n_neighbors=15 case)
materials_data['umap1'] = umap_results[15][:, 0]
materials_data['umap2'] = umap_results[15][:, 1]

print("\nUMAP execution complete. Results saved to DataFrame.")

Sample Output:

Running UMAP (n_neighbors=5)...
Completed (elapsed time: 1.23s)

Running UMAP (n_neighbors=15)...
Completed (elapsed time: 1.34s)

Running UMAP (n_neighbors=50)...
Completed (elapsed time: 1.45s)

UMAP execution complete. Results saved to DataFrame.
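
The min_dist=0.1 setting in the example above controls how tightly points may pack in the embedding. A rough way to see its effect: UMAP models low-dimensional similarity with a curve of the form 1 / (1 + a * d^(2b)), and a and b are chosen so the curve stays near 1 up to min_dist and decays beyond it. The sketch below fits such a curve with SciPy. It is modeled on the curve-fitting approach described for UMAP and does not call umap-learn. The function names and the spread default are assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def low_dim_similarity(d, a, b):
    """Smooth similarity curve 1 / (1 + a * d^(2b)) for embedding distance d."""
    return 1.0 / (1.0 + a * d ** (2 * b))

def fit_ab(min_dist, spread=1.0):
    """Fit a, b so the curve approximates a plateau of width min_dist
    followed by exponential decay with scale `spread`."""
    d = np.linspace(0, spread * 3, 300)
    target = np.where(d < min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    (a, b), _ = curve_fit(low_dim_similarity, d, target)
    return a, b

for min_dist in [0.01, 0.1, 0.5]:
    a, b = fit_ab(min_dist)
    print(f"min_dist={min_dist}: a={a:.3f}, b={b:.3f}, "
          f"similarity at d=0.25: {low_dim_similarity(0.25, a, b):.3f}")
```

Larger min_dist values keep similarity high over a longer distance, so points need not sit on top of each other to count as close. This is why larger values produce more evenly spread maps.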

Code Example 9: Comparison of Different n_neighbors

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0

"""
Example: Code Example 9: Comparison of Different n_neighbors

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: umap_results and n_neighbors_list from Code Example 8; materials_data from Code Example 1
"""

import matplotlib.pyplot as plt

# Display results for three n_neighbors values side by side
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for idx, n_neighbors in enumerate(n_neighbors_list):
    ax = axes[idx]
    X_umap = umap_results[n_neighbors]

    scatter = ax.scatter(X_umap[:, 0],
                         X_umap[:, 1],
                         c=materials_data['formation_energy'],
                         cmap='RdYlGn_r',
                         s=50,
                         alpha=0.6,
                         edgecolors='black',
                         linewidth=0.5)

    ax.set_title(f'UMAP (n_neighbors={n_neighbors})',
                 fontsize=14, fontweight='bold')
    ax.set_xlabel('UMAP 1', fontsize=12, fontweight='bold')
    ax.set_ylabel('UMAP 2', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3, linestyle='--')

    # Colorbar
    cbar = plt.colorbar(scatter, ax=ax)
    cbar.set_label('Formation Energy (eV/atom)', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.savefig('umap_neighbors_comparison.png', dpi=300, bbox_inches='tight')
print("UMAP n_neighbors comparison saved to umap_neighbors_comparison.png")
plt.show()

Code Example 10: UMAP Density Map

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scipy

"""
Example: Code Example 10: UMAP Density Map

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: umap_results from Code Example 8
"""

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

# Density estimation of UMAP results
X_umap = umap_results[15]

# KDE (Kernel Density Estimation)
xy = np.vstack([X_umap[:, 0], X_umap[:, 1]])
density = gaussian_kde(xy)(xy)

# Sort by density (draw high-density points on top)
idx = density.argsort()
x, y, z = X_umap[idx, 0], X_umap[idx, 1], density[idx]

# Plot
fig, ax = plt.subplots(figsize=(12, 9))

scatter = ax.scatter(x, y, c=z, cmap='hot', s=50, alpha=0.7,
                     edgecolors='black', linewidth=0.3)

ax.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax.set_title('UMAP: Materials Space Density Map',
             fontsize=16, fontweight='bold')
ax.grid(True, alpha=0.3, linestyle='--')

cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Point Density', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.savefig('umap_density_map.png', dpi=300, bbox_inches='tight')
print("UMAP density map saved to umap_density_map.png")
plt.show()

2.4 Comparison of Methods

Code Example 11: Parallel Comparison of PCA vs t-SNE vs UMAP

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0

"""
Example: Code Example 11: Parallel Comparison of PCA vs t-SNE vs UMAP

Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
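Dependencies: materials_data with PC1/PC2 (Code Example 1), tsne1/tsne2 (Code Example 5), and umap1/umap2 (Code Example 8)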
"""

import matplotlib.pyplot as plt

# Display results of three methods side by side
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Common colormap (colored by band gap)
vmin = materials_data['band_gap'].min()
vmax = materials_data['band_gap'].max()

# PCA
ax = axes[0]
scatter = ax.scatter(materials_data['PC1'],
                     materials_data['PC2'],
                     c=materials_data['band_gap'],
                     cmap='viridis',
                     s=50,
                     alpha=0.6,
                     edgecolors='black',
                     linewidth=0.5,
                     vmin=vmin,
                     vmax=vmax)
ax.set_title('PCA', fontsize=16, fontweight='bold')
ax.set_xlabel('PC1', fontsize=12, fontweight='bold')
ax.set_ylabel('PC2', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3, linestyle='--')

# t-SNE
ax = axes[1]
scatter = ax.scatter(materials_data['tsne1'],
                     materials_data['tsne2'],
                     c=materials_data['band_gap'],
                     cmap='viridis',
                     s=50,
                     alpha=0.6,
                     edgecolors='black',
                     linewidth=0.5,
                     vmin=vmin,
                     vmax=vmax)
ax.set_title('t-SNE', fontsize=16, fontweight='bold')
ax.set_xlabel('t-SNE 1', fontsize=12, fontweight='bold')
ax.set_ylabel('t-SNE 2', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3, linestyle='--')

# UMAP
ax = axes[2]
scatter = ax.scatter(materials_data['umap1'],
                     materials_data['umap2'],
                     c=materials_data['band_gap'],
                     cmap='viridis',
                     s=50,
                     alpha=0.6,
                     edgecolors='black',
                     linewidth=0.5,
                     vmin=vmin,
                     vmax=vmax)
ax.set_title('UMAP', fontsize=16, fontweight='bold')
ax.set_xlabel('UMAP 1', fontsize=12, fontweight='bold')
ax.set_ylabel('UMAP 2', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3, linestyle='--')

# Common colorbar
fig.subplots_adjust(right=0.9)
cbar_ax = fig.add_axes([0.92, 0.15, 0.02, 0.7])
cbar = fig.colorbar(scatter, cax=cbar_ax)
cbar.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')

plt.savefig('dimensionality_reduction_comparison.png', dpi=300, bbox_inches='tight')
print("Dimensionality reduction method comparison saved to dimensionality_reduction_comparison.png")
plt.show()

Code Example 12: Evaluation of Neighborhood Preservation Rate

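# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - scikit-learn
#
# Depends on X_scaled and X_pca (Code Example 1), tsne_results (Code Example 5),
# and umap_results (Code Example 8)

from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import numpy as np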

def calculate_neighborhood_preservation(X_high, X_low, k=10):
    """
    Calculate neighborhood preservation rate between high-dimensional and low-dimensional space

    Parameters:
    -----------
    X_high : array-like
        Data in high-dimensional space
    X_low : array-like
        Data in low-dimensional space
    k : int
        Number of neighbors

    Returns:
    --------
    preservation_rate : float
        Neighborhood preservation rate (0-1)
    """
    # k-nearest neighbors in high-dimensional space
    nbrs_high = NearestNeighbors(n_neighbors=k+1).fit(X_high)
    _, indices_high = nbrs_high.kneighbors(X_high)

    # k-nearest neighbors in low-dimensional space
    nbrs_low = NearestNeighbors(n_neighbors=k+1).fit(X_low)
    _, indices_low = nbrs_low.kneighbors(X_low)

    # Calculate neighborhood preservation rate
    preservation_scores = []
    for i in range(len(X_high)):
        # Exclude self
        neighbors_high = set(indices_high[i, 1:])
        neighbors_low = set(indices_low[i, 1:])

        # Proportion of common neighbors
        intersection = len(neighbors_high & neighbors_low)
        preservation_scores.append(intersection / k)

    return np.mean(preservation_scores)

# Evaluate neighborhood preservation rate for each method
k_values = [5, 10, 20, 50]
results = {
    'PCA': [],
    't-SNE': [],
    'UMAP': []
}

for k in k_values:
    pca_preservation = calculate_neighborhood_preservation(
        X_scaled, X_pca, k=k
    )
    tsne_preservation = calculate_neighborhood_preservation(
        X_scaled, tsne_results[30], k=k
    )
    umap_preservation = calculate_neighborhood_preservation(
        X_scaled, umap_results[15], k=k
    )

    results['PCA'].append(pca_preservation)
    results['t-SNE'].append(tsne_preservation)
    results['UMAP'].append(umap_preservation)

    print(f"Neighborhood preservation rate at k={k}:")
    print(f"  PCA:   {pca_preservation:.3f}")
    print(f"  t-SNE: {tsne_preservation:.3f}")
    print(f"  UMAP:  {umap_preservation:.3f}")
    print()

# Plot results
fig, ax = plt.subplots(figsize=(10, 7))

for method, scores in results.items():
    ax.plot(k_values, scores, marker='o', linewidth=2,
            markersize=8, label=method)

ax.set_xlabel('Number of Neighbors (k)', fontsize=14, fontweight='bold')
ax.set_ylabel('Neighborhood Preservation Rate', fontsize=14, fontweight='bold')
ax.set_title('Comparison of Neighborhood Preservation',
             fontsize=16, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3, linestyle='--')
ax.set_ylim([0, 1])

plt.tight_layout()
plt.savefig('neighborhood_preservation.png', dpi=300, bbox_inches='tight')
print("Neighborhood preservation rate comparison saved to neighborhood_preservation.png")
plt.show()
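
scikit-learn also ships a related ready-made metric, sklearn.manifold.trustworthiness, which penalizes points that appear as neighbors in the embedding without being neighbors in the original space. The sketch below shows one way to use it as a cross-check on a hand-written preservation rate. The synthetic data and variable names are assumptions, and PCA stands in for any of the three methods.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

# Illustrative data (assumption): 200 samples, 6 correlated features
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
X_demo_2d = PCA(n_components=2).fit_transform(X_demo)

for k in [5, 10, 20]:
    score = trustworthiness(X_demo, X_demo_2d, n_neighbors=k)
    print(f"k={k:>2}: trustworthiness = {score:.3f}")

# Sanity check: comparing a space with itself gives a perfect score
self_score = trustworthiness(X_demo, X_demo, n_neighbors=5)
print("Self-comparison:", self_score)
```

With the chapter's variables, the same call would take X_scaled as the first argument and X_pca, tsne_results[30], or umap_results[15] as the second.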

2.5 Interactive Visualization

Code Example 13: 3D UMAP with Plotly

# Requirements:
# - Python 3.9+
# - plotly>=5.14.0
# - umap-learn

"""
Example: Code Example 13: 3D UMAP with Plotly

Purpose: Demonstrate data visualization techniques
Target: Beginner
Execution time: 2-5 seconds
Dependencies: X_scaled and materials_data (including a 'formula' column) from Code Example 1
"""

# Install Plotly (first time only)
# !pip install plotly

import plotly.express as px
import plotly.graph_objects as go
import umap

# Execute 3D UMAP
reducer_3d = umap.UMAP(n_components=3,
                       n_neighbors=15,
                       min_dist=0.1,
                       random_state=42)

X_umap_3d = reducer_3d.fit_transform(X_scaled)

materials_data['umap1_3d'] = X_umap_3d[:, 0]
materials_data['umap2_3d'] = X_umap_3d[:, 1]
materials_data['umap3_3d'] = X_umap_3d[:, 2]

# Interactive 3D plot
fig = px.scatter_3d(materials_data,
                    x='umap1_3d',
                    y='umap2_3d',
                    z='umap3_3d',
                    color='band_gap',
                    size='density',
                    hover_data=['formula', 'formation_energy', 'bulk_modulus'],
                    color_continuous_scale='Viridis',
                    title='Interactive 3D UMAP: Materials Space')

fig.update_traces(marker=dict(line=dict(width=0.5, color='black')))

fig.update_layout(
    scene=dict(
        xaxis_title='UMAP 1',
        yaxis_title='UMAP 2',
        zaxis_title='UMAP 3',
        xaxis=dict(backgroundcolor="rgb(230, 230,230)",
                   gridcolor="white"),
        yaxis=dict(backgroundcolor="rgb(230, 230,230)",
                   gridcolor="white"),
        zaxis=dict(backgroundcolor="rgb(230, 230,230)",
                   gridcolor="white"),
    ),
    width=900,
    height=700,
    font=dict(size=12)
)

fig.write_html('umap_3d_interactive.html')
print("Interactive 3D UMAP saved to umap_3d_interactive.html")
fig.show()

Code Example 14: Interactive Scatter Plot with Bokeh

# Install Bokeh (first time only)
# !pip install bokeh

from bokeh.plotting import figure, output_file, save
from bokeh.models import HoverTool, ColorBar, LinearColorMapper
from bokeh.palettes import Viridis256
from bokeh.io import show

# Color mapper
color_mapper = LinearColorMapper(palette=Viridis256,
                                 low=materials_data['band_gap'].min(),
                                 high=materials_data['band_gap'].max())

# Create plot
output_file('umap_interactive.html')

p = figure(width=900,
           height=700,
           title='Interactive UMAP: Materials Space',
           tools='pan,wheel_zoom,box_zoom,reset,save')

# Data source
source_data = dict(
    x=materials_data['umap1'],
    y=materials_data['umap2'],
    formula=materials_data['formula'],
    band_gap=materials_data['band_gap'],
    formation_energy=materials_data['formation_energy'],
    density=materials_data['density'],
    bulk_modulus=materials_data['bulk_modulus']
)

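# Scatter plot (scatter() is used because circle() with a size argument
# is deprecated in recent Bokeh releases)
circles = p.scatter('x', 'y',
                    size=8,
                    source=source_data,
                    fill_color={'field': 'band_gap', 'transform': color_mapper},
                    fill_alpha=0.7,
                    line_color='black',
                    line_width=0.5)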

# Hover tool
hover = HoverTool(tooltips=[
    ('Formula', '@formula'),
    ('Band Gap', '@band_gap{0.00} eV'),
    ('Formation E', '@formation_energy{0.00} eV/atom'),
    ('Density', '@density{0.00} g/cm³'),
    ('Bulk Modulus', '@bulk_modulus{0.0} GPa')
])
p.add_tools(hover)

# Colorbar
color_bar = ColorBar(color_mapper=color_mapper,
                     label_standoff=12,
                     title='Band Gap (eV)',
                     location=(0, 0))
p.add_layout(color_bar, 'right')

# Axis labels
p.xaxis.axis_label = 'UMAP 1'
p.yaxis.axis_label = 'UMAP 2'
p.title.text_font_size = '16pt'
p.xaxis.axis_label_text_font_size = '14pt'
p.yaxis.axis_label_text_font_size = '14pt'

save(p)
print("Interactive UMAP saved to umap_interactive.html")
show(p)

Code Example 15: Animation of Dimensionality Reduction Process

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example: Code Example 15: Animation of Dimensionality Reduction Process

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: X_scaled and materials_data from Code Example 1; Pillow for GIF output
"""

import matplotlib.pyplot as plt
import matplotlib.animation as animation
from sklearn.decomposition import PCA
import numpy as np

# Animation: rotate a 3D PCA embedding and show its 2D projection
n_frames = 20
fig, ax = plt.subplots(figsize=(10, 8))

# Initial 3D PCA
pca_3d = PCA(n_components=3)
X_pca_3d = pca_3d.fit_transform(X_scaled)

def update(frame):
    ax.clear()

    # Rotation angle
    angle = frame * (360 / n_frames)
    angle_rad = np.radians(angle)

    # Apply rotation matrix
    rotation_matrix = np.array([
        [np.cos(angle_rad), -np.sin(angle_rad), 0],
        [np.sin(angle_rad), np.cos(angle_rad), 0],
        [0, 0, 1]
    ])

    X_rotated = X_pca_3d @ rotation_matrix

    # 2D projection
    scatter = ax.scatter(X_rotated[:, 0],
                         X_rotated[:, 1],
                         c=materials_data['band_gap'],
                         cmap='viridis',
                         s=50,
                         alpha=0.6,
                         edgecolors='black',
                         linewidth=0.5)

    ax.set_xlabel('Dimension 1', fontsize=12, fontweight='bold')
    ax.set_ylabel('Dimension 2', fontsize=12, fontweight='bold')
    ax.set_title(f'3D PCA Rotation (angle={angle:.0f}°)',
                 fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3, linestyle='--')

    # Fix axis ranges: a rotation about the z-axis preserves each point's
    # in-plane radius, so the largest radius bounds every frame
    lim = np.linalg.norm(X_pca_3d[:, :2], axis=1).max() + 1
    ax.set_xlim(-lim, lim)
    ax.set_ylim(-lim, lim)

    return scatter,

# Create animation
anim = animation.FuncAnimation(fig, update, frames=n_frames,
                               interval=200, blit=False)

# Save as GIF
anim.save('pca_rotation_animation.gif', writer='pillow', fps=5)
print("PCA rotation animation saved to pca_rotation_animation.gif")
plt.close()

2.6 Summary

In this chapter, we learned about materials space mapping using dimensionality reduction techniques:

Main Dimensionality Reduction Techniques

Method	Characteristics	Advantages	Disadvantages
PCA	Linear, variance maximization	Fast, high interpretability	Weak for non-linear structures
t-SNE	Non-linear, neighborhood preservation	Excellent for cluster visualization	Slow, loses global structure
UMAP	Non-linear, topology preservation	Fast, balances global and local	Requires parameter tuning

Implemented Code

Code Example	Content	Method
Example 1-4	Basic PCA implementation, visualization	PCA
Example 5-7	t-SNE implementation, parameter comparison	t-SNE
Example 8-10	UMAP implementation, density maps	UMAP
Example 11-12	Method comparison, evaluation metrics	Comparison
Example 13-15	Interactive visualization and animation	Plotly, Bokeh, Matplotlib

Best Practices

Preprocessing: Standardization (StandardScaler) is essential
Method Selection: Interpretability focus → PCA; cluster discovery → t-SNE; balanced approach → UMAP
Parameter Tuning: Experiment with multiple settings to find optimal values
Evaluation: Assess quality with quantitative metrics like neighborhood preservation rate
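
The standardization advice above can be illustrated with a small sketch: when features live on very different scales, unscaled PCA assigns nearly all variance to the largest-scale feature. The two synthetic features below loosely mimic a band gap in eV and a melting point in K. The values are invented for illustration and are not taken from the chapter's dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): two independent features on very different scales
rng = np.random.default_rng(5)
X_demo = np.column_stack([
    rng.normal(2.0, 1.0, size=500),       # small-scale feature
    rng.normal(1500.0, 400.0, size=500),  # large-scale feature
])

raw_ratio = PCA().fit(X_demo).explained_variance_ratio_
scaled_ratio = PCA().fit(
    StandardScaler().fit_transform(X_demo)
).explained_variance_ratio_

print("Without scaling:", raw_ratio.round(4))
print("With scaling:   ", scaled_ratio.round(4))
```

Without scaling, PC1 is essentially the large-scale feature alone. After standardization, the two independent features contribute roughly equally, which is the behavior the chapter's examples rely on.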

Looking Ahead to the Next Chapter

In Chapter 3, we will use Graph Neural Networks (GNN) to learn feature representations from material structural information, achieving more advanced materials space mapping. We will implement state-of-the-art GNN models such as CGCNN, MEGNet, and SchNet, performing dimensionality reduction with crystal structures as direct input.

