Chapter 2: Materials Space Mapping Using Dimensionality Reduction Techniques
This chapter covers Materials Space Mapping Using Dimensionality Reduction Techniques. You will learn essential concepts and techniques.
Overview
By projecting high-dimensional materials property data into 2D or 3D space, we can visually grasp similarities and structures among materials. This chapter applies dimensionality reduction techniques such as Principal Component Analysis (PCA), t-SNE, and UMAP to materials data to achieve effective materials space mapping.
Learning Objectives
- Understand the principles and characteristics of PCA, t-SNE, and UMAP
- Apply each method to materials data and compare results
- Evaluate the quality of dimensionality reduction results
- Implement interactive visualizations
2.1 Principal Component Analysis (PCA)
PCA is a linear dimensionality reduction technique that establishes new axes (principal components) in the direction that maximizes data variance. It can reduce dimensions while preserving correlation structures among material properties.
2.1.1 Basic PCA Implementation
Code Example 1: Dimensionality Reduction with PCA
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
"""
Example: Code Example 1: Dimensionality Reduction with PCA
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Load materials data (created in Chapter 1)
materials_data = pd.read_csv('materials_properties.csv')
# Extract feature columns
feature_cols = ['band_gap', 'formation_energy', 'density',
'bulk_modulus', 'shear_modulus', 'melting_point']
X = materials_data[feature_cols].values
# Standardization (PCA is sensitive to feature scales)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Execute PCA (reduce to 2 dimensions)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Store results in DataFrame
materials_data['PC1'] = X_pca[:, 0]
materials_data['PC2'] = X_pca[:, 1]
# Explained variance ratio of principal components
explained_variance = pca.explained_variance_ratio_
print("PCA Results:")
print(f"PC1 explained variance: {explained_variance[0]:.3f}")
print(f"PC2 explained variance: {explained_variance[1]:.3f}")
print(f"Cumulative explained variance: {sum(explained_variance):.3f}")
# Principal component loadings (weights of each feature)
components_df = pd.DataFrame(
pca.components_.T,
columns=['PC1', 'PC2'],
index=feature_cols
)
print("\nPrincipal component loadings (contribution of each feature):")
print(components_df.round(3))
Sample Output:
PCA Results:
PC1 explained variance: 0.342
PC2 explained variance: 0.234
Cumulative explained variance: 0.576
Principal component loadings (contribution of each feature):
PC1 PC2
band_gap -0.245 0.512
formation_energy 0.387 -0.298
density 0.456 0.321
bulk_modulus 0.498 0.145
shear_modulus 0.445 0.087
melting_point 0.312 -0.687
Code Example 2: PCA Results Visualization
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 2: PCA Results Visualization
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
import numpy as np
# PCA score plot
fig, ax = plt.subplots(figsize=(12, 9))
# Categorize by stability
colors = materials_data['formation_energy'].apply(
lambda x: 'green' if x < -1.0 else 'orange' if x < 0 else 'red'
)
scatter = ax.scatter(materials_data['PC1'],
materials_data['PC2'],
c=colors,
s=50,
alpha=0.6,
edgecolors='black',
linewidth=0.5)
# Axis labels (including explained variance)
ax.set_xlabel(f'PC1 ({explained_variance[0]*100:.1f}% variance)',
fontsize=14, fontweight='bold')
ax.set_ylabel(f'PC2 ({explained_variance[1]*100:.1f}% variance)',
fontsize=14, fontweight='bold')
ax.set_title('PCA: Materials Space Visualization',
fontsize=16, fontweight='bold')
# Grid
ax.grid(True, alpha=0.3, linestyle='--')
ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5, alpha=0.5)
ax.axvline(x=0, color='k', linestyle='-', linewidth=0.5, alpha=0.5)
# Legend
from matplotlib.patches import Patch
legend_elements = [
Patch(facecolor='green', edgecolor='black', label='Stable (E < -1 eV)'),
Patch(facecolor='orange', edgecolor='black', label='Metastable (-1 < E < 0 eV)'),
Patch(facecolor='red', edgecolor='black', label='Unstable (E > 0 eV)')
]
ax.legend(handles=legend_elements, loc='best', fontsize=12)
plt.tight_layout()
plt.savefig('pca_materials_space.png', dpi=300, bbox_inches='tight')
print("PCA score plot saved to pca_materials_space.png")
plt.show()
Code Example 3: PCA Scree Plot
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
"""
Example: Code Example 3: PCA Scree Plot
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Calculate all principal components
pca_full = PCA()
pca_full.fit(X_scaled)
# Scree plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Left: Explained variance ratio of each principal component
n_components = len(pca_full.explained_variance_ratio_)
ax1.bar(range(1, n_components + 1),
pca_full.explained_variance_ratio_,
alpha=0.7,
edgecolor='black',
color='steelblue')
ax1.set_xlabel('Principal Component', fontsize=14, fontweight='bold')
ax1.set_ylabel('Explained Variance Ratio', fontsize=14, fontweight='bold')
ax1.set_title('Scree Plot: Individual Variance', fontsize=16, fontweight='bold')
ax1.grid(True, alpha=0.3, axis='y')
# Right: Cumulative explained variance
cumsum_variance = np.cumsum(pca_full.explained_variance_ratio_)
ax2.plot(range(1, n_components + 1),
cumsum_variance,
marker='o',
linewidth=2,
markersize=8,
color='darkred')
ax2.axhline(y=0.95, color='green', linestyle='--', linewidth=2,
label='95% variance threshold', alpha=0.7)
ax2.set_xlabel('Number of Components', fontsize=14, fontweight='bold')
ax2.set_ylabel('Cumulative Explained Variance', fontsize=14, fontweight='bold')
ax2.set_title('Cumulative Variance Explained', fontsize=16, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.legend(fontsize=12)
plt.tight_layout()
plt.savefig('pca_scree_plot.png', dpi=300, bbox_inches='tight')
print(f"Scree plot saved to pca_scree_plot.png")
print(f"\nNumber of principal components needed to explain 95% variance: {np.argmax(cumsum_variance >= 0.95) + 1}")
plt.show()
Code Example 4: PCA Loading Plot (Biplot)
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 4: PCA Loading Plot (Biplot)
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
import numpy as np
# Biplot
fig, ax = plt.subplots(figsize=(12, 10))
# Score plot (samples)
ax.scatter(materials_data['PC1'],
materials_data['PC2'],
alpha=0.3,
s=20,
color='lightblue',
edgecolors='none',
label='Materials')
# Loading vectors (variables)
scale_factor = 3.0 # Vector scaling
for i, feature in enumerate(feature_cols):
ax.arrow(0, 0,
pca.components_[0, i] * scale_factor,
pca.components_[1, i] * scale_factor,
head_width=0.15,
head_length=0.15,
fc='red',
ec='darkred',
linewidth=2,
alpha=0.8)
# Label
ax.text(pca.components_[0, i] * scale_factor * 1.15,
pca.components_[1, i] * scale_factor * 1.15,
feature.replace('_', ' ').title(),
fontsize=11,
fontweight='bold',
ha='center',
va='center',
bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.7))
ax.set_xlabel(f'PC1 ({explained_variance[0]*100:.1f}% variance)',
fontsize=14, fontweight='bold')
ax.set_ylabel(f'PC2 ({explained_variance[1]*100:.1f}% variance)',
fontsize=14, fontweight='bold')
ax.set_title('PCA Biplot: Materials and Features',
fontsize=16, fontweight='bold')
ax.grid(True, alpha=0.3, linestyle='--')
ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
ax.axvline(x=0, color='k', linestyle='-', linewidth=0.5)
ax.legend(fontsize=12, loc='upper right')
plt.tight_layout()
plt.savefig('pca_biplot.png', dpi=300, bbox_inches='tight')
print("PCA biplot saved to pca_biplot.png")
plt.show()
2.2 t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE is a non-linear dimensionality reduction technique that projects high-dimensional data into low dimensions while preserving local structure (neighborhood relationships). It excels at visualizing cluster structures.
2.2.1 Basic t-SNE Implementation
Code Example 5: Dimensionality Reduction with t-SNE
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 5: Dimensionality Reduction with t-SNE
Purpose: Demonstrate machine learning model training and evaluation
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
from sklearn.manifold import TSNE
import numpy as np
import time
# Execute t-SNE (experiment with multiple perplexity values)
perplexities = [5, 30, 50]
tsne_results = {}
for perplexity in perplexities:
print(f"\nRunning t-SNE (perplexity={perplexity})...")
start_time = time.time()
tsne = TSNE(n_components=2,
perplexity=perplexity,
n_iter=1000,
random_state=42,
verbose=0)
X_tsne = tsne.fit_transform(X_scaled)
tsne_results[perplexity] = X_tsne
elapsed_time = time.time() - start_time
print(f"Completed (elapsed time: {elapsed_time:.2f}s)")
print(f"KL divergence: {tsne.kl_divergence_:.3f}")
# Save results (perplexity=30 case)
materials_data['tsne1'] = tsne_results[30][:, 0]
materials_data['tsne2'] = tsne_results[30][:, 1]
Sample Output:
Running t-SNE (perplexity=5)...
Completed (elapsed time: 3.45s)
KL divergence: 1.234
Running t-SNE (perplexity=30)...
Completed (elapsed time: 3.67s)
KL divergence: 0.987
Running t-SNE (perplexity=50)...
Completed (elapsed time: 3.89s)
KL divergence: 1.056
Code Example 6: Comparison of Different Perplexities
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
"""
Example: Code Example 6: Comparison of Different Perplexities
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
# Display results for three perplexity values side by side
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for idx, perplexity in enumerate(perplexities):
ax = axes[idx]
X_tsne = tsne_results[perplexity]
scatter = ax.scatter(X_tsne[:, 0],
X_tsne[:, 1],
c=materials_data['band_gap'],
cmap='viridis',
s=50,
alpha=0.6,
edgecolors='black',
linewidth=0.5)
ax.set_title(f't-SNE (perplexity={perplexity})',
fontsize=14, fontweight='bold')
ax.set_xlabel('t-SNE 1', fontsize=12, fontweight='bold')
ax.set_ylabel('t-SNE 2', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3, linestyle='--')
# Colorbar
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Band Gap (eV)', fontsize=10, fontweight='bold')
plt.tight_layout()
plt.savefig('tsne_perplexity_comparison.png', dpi=300, bbox_inches='tight')
print("t-SNE perplexity comparison saved to tsne_perplexity_comparison.png")
plt.show()
Code Example 7: t-SNE Clustering Results Visualization
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
"""
Example: Code Example 7: t-SNE Clustering Results Visualization
Purpose: Demonstrate data visualization techniques
Target: Advanced
Execution time: 2-5 seconds
Dependencies: None
"""
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Clustering on t-SNE results
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(tsne_results[30])
materials_data['cluster'] = cluster_labels
# Visualization by cluster
fig, ax = plt.subplots(figsize=(12, 9))
colors = plt.cm.Set3(np.linspace(0, 1, n_clusters))
for cluster_id in range(n_clusters):
mask = cluster_labels == cluster_id
ax.scatter(tsne_results[30][mask, 0],
tsne_results[30][mask, 1],
c=[colors[cluster_id]],
label=f'Cluster {cluster_id}',
s=60,
alpha=0.7,
edgecolors='black',
linewidth=0.5)
# Cluster centers
centers_tsne = kmeans.cluster_centers_
ax.scatter(centers_tsne[:, 0],
centers_tsne[:, 1],
c='red',
marker='X',
s=300,
edgecolors='black',
linewidth=2,
label='Cluster Centers',
zorder=10)
ax.set_xlabel('t-SNE 1', fontsize=14, fontweight='bold')
ax.set_ylabel('t-SNE 2', fontsize=14, fontweight='bold')
ax.set_title(f't-SNE with K-Means Clustering (k={n_clusters})',
fontsize=16, fontweight='bold')
ax.legend(fontsize=11, loc='best')
ax.grid(True, alpha=0.3, linestyle='--')
plt.tight_layout()
plt.savefig('tsne_clustering.png', dpi=300, bbox_inches='tight')
print(f"t-SNE clustering results saved to tsne_clustering.png")
plt.show()
# Average property values per cluster
print("\nAverage property values per cluster:")
cluster_stats = materials_data.groupby('cluster')[feature_cols].mean()
print(cluster_stats.round(2))
2.3 UMAP (Uniform Manifold Approximation and Projection)
UMAP is a state-of-the-art dimensionality reduction technique that is faster than t-SNE and also preserves global structure. It operates efficiently even on large-scale datasets.
2.3.1 UMAP Installation and Basic Implementation
Code Example 8: Dimensionality Reduction with UMAP
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 8: Dimensionality Reduction with UMAP
Purpose: Demonstrate data manipulation and preprocessing
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: None
"""
# Install UMAP (first time only)
# !pip install umap-learn
import umap
import numpy as np
import time
# Execute UMAP (experiment with multiple n_neighbors values)
n_neighbors_list = [5, 15, 50]
umap_results = {}
for n_neighbors in n_neighbors_list:
print(f"\nRunning UMAP (n_neighbors={n_neighbors})...")
start_time = time.time()
reducer = umap.UMAP(n_components=2,
n_neighbors=n_neighbors,
min_dist=0.1,
metric='euclidean',
random_state=42)
X_umap = reducer.fit_transform(X_scaled)
umap_results[n_neighbors] = X_umap
elapsed_time = time.time() - start_time
print(f"Completed (elapsed time: {elapsed_time:.2f}s)")
# Save results (n_neighbors=15 case)
materials_data['umap1'] = umap_results[15][:, 0]
materials_data['umap2'] = umap_results[15][:, 1]
print("\nUMAP execution complete. Results saved to DataFrame.")
Sample Output:
Running UMAP (n_neighbors=5)...
Completed (elapsed time: 1.23s)
Running UMAP (n_neighbors=15)...
Completed (elapsed time: 1.34s)
Running UMAP (n_neighbors=50)...
Completed (elapsed time: 1.45s)
UMAP execution complete. Results saved to DataFrame.
Code Example 9: Comparison of Different n_neighbors
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
"""
Example: Code Example 9: Comparison of Different n_neighbors
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
# Display results for three n_neighbors values side by side
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for idx, n_neighbors in enumerate(n_neighbors_list):
ax = axes[idx]
X_umap = umap_results[n_neighbors]
scatter = ax.scatter(X_umap[:, 0],
X_umap[:, 1],
c=materials_data['formation_energy'],
cmap='RdYlGn_r',
s=50,
alpha=0.6,
edgecolors='black',
linewidth=0.5)
ax.set_title(f'UMAP (n_neighbors={n_neighbors})',
fontsize=14, fontweight='bold')
ax.set_xlabel('UMAP 1', fontsize=12, fontweight='bold')
ax.set_ylabel('UMAP 2', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3, linestyle='--')
# Colorbar
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Formation Energy (eV/atom)', fontsize=10, fontweight='bold')
plt.tight_layout()
plt.savefig('umap_neighbors_comparison.png', dpi=300, bbox_inches='tight')
print("UMAP n_neighbors comparison saved to umap_neighbors_comparison.png")
plt.show()
Code Example 10: UMAP Density Map
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
"""
Example: Code Example 10: UMAP Density Map
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# Density estimation of UMAP results
X_umap = umap_results[15]
# KDE (Kernel Density Estimation)
xy = np.vstack([X_umap[:, 0], X_umap[:, 1]])
density = gaussian_kde(xy)(xy)
# Sort by density (draw high-density points on top)
idx = density.argsort()
x, y, z = X_umap[idx, 0], X_umap[idx, 1], density[idx]
# Plot
fig, ax = plt.subplots(figsize=(12, 9))
scatter = ax.scatter(x, y, c=z, cmap='hot', s=50, alpha=0.7,
edgecolors='black', linewidth=0.3)
ax.set_xlabel('UMAP 1', fontsize=14, fontweight='bold')
ax.set_ylabel('UMAP 2', fontsize=14, fontweight='bold')
ax.set_title('UMAP: Materials Space Density Map',
fontsize=16, fontweight='bold')
ax.grid(True, alpha=0.3, linestyle='--')
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Point Density', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig('umap_density_map.png', dpi=300, bbox_inches='tight')
print("UMAP density map saved to umap_density_map.png")
plt.show()
2.4 Comparison of Methods
Code Example 11: Parallel Comparison of PCA vs t-SNE vs UMAP
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
"""
Example: Code Example 11: Parallel Comparison of PCA vs t-SNE vs UMAP
Purpose: Demonstrate data visualization techniques
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
# Display results of three methods side by side
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Common colormap (colored by band gap)
vmin = materials_data['band_gap'].min()
vmax = materials_data['band_gap'].max()
# PCA
ax = axes[0]
scatter = ax.scatter(materials_data['PC1'],
materials_data['PC2'],
c=materials_data['band_gap'],
cmap='viridis',
s=50,
alpha=0.6,
edgecolors='black',
linewidth=0.5,
vmin=vmin,
vmax=vmax)
ax.set_title('PCA', fontsize=16, fontweight='bold')
ax.set_xlabel('PC1', fontsize=12, fontweight='bold')
ax.set_ylabel('PC2', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3, linestyle='--')
# t-SNE
ax = axes[1]
scatter = ax.scatter(materials_data['tsne1'],
materials_data['tsne2'],
c=materials_data['band_gap'],
cmap='viridis',
s=50,
alpha=0.6,
edgecolors='black',
linewidth=0.5,
vmin=vmin,
vmax=vmax)
ax.set_title('t-SNE', fontsize=16, fontweight='bold')
ax.set_xlabel('t-SNE 1', fontsize=12, fontweight='bold')
ax.set_ylabel('t-SNE 2', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3, linestyle='--')
# UMAP
ax = axes[2]
scatter = ax.scatter(materials_data['umap1'],
materials_data['umap2'],
c=materials_data['band_gap'],
cmap='viridis',
s=50,
alpha=0.6,
edgecolors='black',
linewidth=0.5,
vmin=vmin,
vmax=vmax)
ax.set_title('UMAP', fontsize=16, fontweight='bold')
ax.set_xlabel('UMAP 1', fontsize=12, fontweight='bold')
ax.set_ylabel('UMAP 2', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3, linestyle='--')
# Common colorbar
fig.subplots_adjust(right=0.9)
cbar_ax = fig.add_axes([0.92, 0.15, 0.02, 0.7])
cbar = fig.colorbar(scatter, cax=cbar_ax)
cbar.set_label('Band Gap (eV)', fontsize=12, fontweight='bold')
plt.savefig('dimensionality_reduction_comparison.png', dpi=300, bbox_inches='tight')
print("Dimensionality reduction method comparison saved to dimensionality_reduction_comparison.png")
plt.show()
Code Example 12: Evaluation of Neighborhood Preservation Rate
# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
from sklearn.neighbors import NearestNeighbors
import numpy as np
def calculate_neighborhood_preservation(X_high, X_low, k=10):
"""
Calculate neighborhood preservation rate between high-dimensional and low-dimensional space
Parameters:
-----------
X_high : array-like
Data in high-dimensional space
X_low : array-like
Data in low-dimensional space
k : int
Number of neighbors
Returns:
--------
preservation_rate : float
Neighborhood preservation rate (0-1)
"""
# k-nearest neighbors in high-dimensional space
nbrs_high = NearestNeighbors(n_neighbors=k+1).fit(X_high)
_, indices_high = nbrs_high.kneighbors(X_high)
# k-nearest neighbors in low-dimensional space
nbrs_low = NearestNeighbors(n_neighbors=k+1).fit(X_low)
_, indices_low = nbrs_low.kneighbors(X_low)
# Calculate neighborhood preservation rate
preservation_scores = []
for i in range(len(X_high)):
# Exclude self
neighbors_high = set(indices_high[i, 1:])
neighbors_low = set(indices_low[i, 1:])
# Proportion of common neighbors
intersection = len(neighbors_high & neighbors_low)
preservation_scores.append(intersection / k)
return np.mean(preservation_scores)
# Evaluate neighborhood preservation rate for each method
k_values = [5, 10, 20, 50]
results = {
'PCA': [],
't-SNE': [],
'UMAP': []
}
for k in k_values:
pca_preservation = calculate_neighborhood_preservation(
X_scaled, X_pca, k=k
)
tsne_preservation = calculate_neighborhood_preservation(
X_scaled, tsne_results[30], k=k
)
umap_preservation = calculate_neighborhood_preservation(
X_scaled, umap_results[15], k=k
)
results['PCA'].append(pca_preservation)
results['t-SNE'].append(tsne_preservation)
results['UMAP'].append(umap_preservation)
print(f"Neighborhood preservation rate at k={k}:")
print(f" PCA: {pca_preservation:.3f}")
print(f" t-SNE: {tsne_preservation:.3f}")
print(f" UMAP: {umap_preservation:.3f}")
print()
# Plot results
fig, ax = plt.subplots(figsize=(10, 7))
for method, scores in results.items():
ax.plot(k_values, scores, marker='o', linewidth=2,
markersize=8, label=method)
ax.set_xlabel('Number of Neighbors (k)', fontsize=14, fontweight='bold')
ax.set_ylabel('Neighborhood Preservation Rate', fontsize=14, fontweight='bold')
ax.set_title('Comparison of Neighborhood Preservation',
fontsize=16, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3, linestyle='--')
ax.set_ylim([0, 1])
plt.tight_layout()
plt.savefig('neighborhood_preservation.png', dpi=300, bbox_inches='tight')
print("Neighborhood preservation rate comparison saved to neighborhood_preservation.png")
plt.show()
2.5 Interactive Visualization
Code Example 13: 3D UMAP with Plotly
# Requirements:
# - Python 3.9+
# - plotly>=5.14.0
"""
Example: Code Example 13: 3D UMAP with Plotly
Purpose: Demonstrate data visualization techniques
Target: Beginner
Execution time: 2-5 seconds
Dependencies: None
"""
# Install Plotly (first time only)
# !pip install plotly
import plotly.express as px
import plotly.graph_objects as go
import umap
# Execute 3D UMAP
reducer_3d = umap.UMAP(n_components=3,
n_neighbors=15,
min_dist=0.1,
random_state=42)
X_umap_3d = reducer_3d.fit_transform(X_scaled)
materials_data['umap1_3d'] = X_umap_3d[:, 0]
materials_data['umap2_3d'] = X_umap_3d[:, 1]
materials_data['umap3_3d'] = X_umap_3d[:, 2]
# Interactive 3D plot
fig = px.scatter_3d(materials_data,
x='umap1_3d',
y='umap2_3d',
z='umap3_3d',
color='band_gap',
size='density',
hover_data=['formula', 'formation_energy', 'bulk_modulus'],
color_continuous_scale='Viridis',
title='Interactive 3D UMAP: Materials Space')
fig.update_traces(marker=dict(line=dict(width=0.5, color='black')))
fig.update_layout(
scene=dict(
xaxis_title='UMAP 1',
yaxis_title='UMAP 2',
zaxis_title='UMAP 3',
xaxis=dict(backgroundcolor="rgb(230, 230,230)",
gridcolor="white"),
yaxis=dict(backgroundcolor="rgb(230, 230,230)",
gridcolor="white"),
zaxis=dict(backgroundcolor="rgb(230, 230,230)",
gridcolor="white"),
),
width=900,
height=700,
font=dict(size=12)
)
fig.write_html('umap_3d_interactive.html')
print("Interactive 3D UMAP saved to umap_3d_interactive.html")
fig.show()
Code Example 14: Interactive Scatter Plot with Bokeh
# Install Bokeh (first time only)
# !pip install bokeh
from bokeh.plotting import figure, output_file, save
from bokeh.models import HoverTool, ColorBar, LinearColorMapper
from bokeh.palettes import Viridis256
from bokeh.io import show
# Color mapper
color_mapper = LinearColorMapper(palette=Viridis256,
low=materials_data['band_gap'].min(),
high=materials_data['band_gap'].max())
# Create plot
output_file('umap_interactive.html')
p = figure(width=900,
height=700,
title='Interactive UMAP: Materials Space',
tools='pan,wheel_zoom,box_zoom,reset,save')
# Data source
source_data = dict(
x=materials_data['umap1'],
y=materials_data['umap2'],
formula=materials_data['formula'],
band_gap=materials_data['band_gap'],
formation_energy=materials_data['formation_energy'],
density=materials_data['density'],
bulk_modulus=materials_data['bulk_modulus']
)
# Scatter plot
circles = p.circle('x', 'y',
size=8,
source=source_data,
fill_color={'field': 'band_gap', 'transform': color_mapper},
fill_alpha=0.7,
line_color='black',
line_width=0.5)
# Hover tool
hover = HoverTool(tooltips=[
('Formula', '@formula'),
('Band Gap', '@band_gap{0.00} eV'),
('Formation E', '@formation_energy{0.00} eV/atom'),
('Density', '@density{0.00} g/cm³'),
('Bulk Modulus', '@bulk_modulus{0.0} GPa')
])
p.add_tools(hover)
# Colorbar
color_bar = ColorBar(color_mapper=color_mapper,
label_standoff=12,
title='Band Gap (eV)',
location=(0, 0))
p.add_layout(color_bar, 'right')
# Axis labels
p.xaxis.axis_label = 'UMAP 1'
p.yaxis.axis_label = 'UMAP 2'
p.title.text_font_size = '16pt'
p.xaxis.axis_label_text_font_size = '14pt'
p.yaxis.axis_label_text_font_size = '14pt'
save(p)
print("Interactive UMAP saved to umap_interactive.html")
show(p)
Code Example 15: Animation of Dimensionality Reduction Process
# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
"""
Example: Code Example 15: Animation of Dimensionality Reduction Proce
Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from sklearn.decomposition import PCA
import numpy as np
# Animation with multi-stage PCA
n_frames = 20
fig, ax = plt.subplots(figsize=(10, 8))
# Initial 3D PCA
pca_3d = PCA(n_components=3)
X_pca_3d = pca_3d.fit_transform(X_scaled)
def update(frame):
ax.clear()
# Rotation angle
angle = frame * (360 / n_frames)
angle_rad = np.radians(angle)
# Apply rotation matrix
rotation_matrix = np.array([
[np.cos(angle_rad), -np.sin(angle_rad), 0],
[np.sin(angle_rad), np.cos(angle_rad), 0],
[0, 0, 1]
])
X_rotated = X_pca_3d @ rotation_matrix
# 2D projection
scatter = ax.scatter(X_rotated[:, 0],
X_rotated[:, 1],
c=materials_data['band_gap'],
cmap='viridis',
s=50,
alpha=0.6,
edgecolors='black',
linewidth=0.5)
ax.set_xlabel('Dimension 1', fontsize=12, fontweight='bold')
ax.set_ylabel('Dimension 2', fontsize=12, fontweight='bold')
ax.set_title(f'3D PCA Rotation (angle={angle:.0f}°)',
fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, linestyle='--')
# Fix axis ranges
ax.set_xlim(X_rotated[:, 0].min() - 1, X_rotated[:, 0].max() + 1)
ax.set_ylim(X_rotated[:, 1].min() - 1, X_rotated[:, 1].max() + 1)
return scatter,
# Create animation
anim = animation.FuncAnimation(fig, update, frames=n_frames,
interval=200, blit=False)
# Save as GIF
anim.save('pca_rotation_animation.gif', writer='pillow', fps=5)
print("PCA rotation animation saved to pca_rotation_animation.gif")
plt.close()
2.6 Summary
In this chapter, we learned about materials space mapping using dimensionality reduction techniques:
Main Dimensionality Reduction Techniques
| Method | Characteristics | Advantages | Disadvantages |
|---|---|---|---|
| PCA | Linear, variance maximization | Fast, high interpretability | Weak for non-linear structures |
| t-SNE | Non-linear, neighborhood preservation | Excellent for cluster visualization | Slow, loses global structure |
| UMAP | Non-linear, topology preservation | Fast, balances global and local | Requires parameter tuning |
Implemented Code
| Code Example | Content | Method |
|---|---|---|
| Example 1-4 | Basic PCA implementation, visualization | PCA |
| Example 5-7 | t-SNE implementation, parameter comparison | t-SNE |
| Example 8-10 | UMAP implementation, density maps | UMAP |
| Example 11-12 | Method comparison, evaluation metrics | Comparison |
| Example 13-15 | Interactive visualization | Plotly, Bokeh |
Best Practices
- Preprocessing: Standardization (StandardScaler) is essential
- Method Selection: - Interpretability focus → PCA - Cluster discovery → t-SNE - Balanced approach → UMAP
- Parameter Tuning: Experiment with multiple settings to find optimal values
- Evaluation: Assess quality with quantitative metrics like neighborhood preservation rate
Looking Ahead to the Next Chapter
In Chapter 3, we will use Graph Neural Networks (GNN) to learn feature representations from material structural information, achieving more advanced materials space mapping. We will implement state-of-the-art GNN models such as CGCNN, MEGNet, and SchNet, performing dimensionality reduction with crystal structures as direct input.
Previous Chapter: Chapter 1: Fundamentals of Materials Space Visualization
Next Chapter: Chapter 3: Materials Representation Learning with GNN
Series Top: Introduction to Materials Property Mapping