
Chapter 3: Hands-On MLP with Python - SchNetPack Tutorial


Complete the entire training and evaluation workflow with a small dataset. Clarify checkpoints for reproducibility and overfitting prevention.

💡 Supplement: The basic three-piece set includes seed fixing, data splitting, and learning curve recording. Stabilize with early stopping and weight decay.
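
A minimal sketch of that three-piece set plus the two stabilizers, in the PyTorch Lightning style used throughout this chapter (the values shown are this chapter's defaults, not requirements; Sections 3.3 and 3.11 give the full context):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

pl.seed_everything(42)                               # 1) seed fixing
# 2) data splitting: see split_file='split.npz' in Example 3
# 3) learning-curve recording: see the CSVLogger sketch in Section 3.3
optimizer_args = {'lr': 1e-4, 'weight_decay': 0.01}  # weight decay (AdamW)
early_stop = EarlyStopping(monitor='val_loss', patience=10, mode='min')
trainer = pl.Trainer(max_epochs=50, callbacks=[early_stop])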

Learning Objectives

By reading this chapter, you will master the following:
- Install SchNetPack in Python environment and set up the environment
- Train an MLP model using a small-scale dataset (aspirin molecule from MD17)
- Evaluate trained model accuracy and verify energy/force prediction errors
- Execute MLP-MD simulations and analyze trajectories
- Understand common errors and their solutions


3.1 Environment Setup: Installing Required Tools

To practice MLP, you need to set up a Python environment and SchNetPack.

Required Software

| Tool | Version | Purpose |
|---|---|---|
| Python | 3.9-3.11 | Base language |
| PyTorch | 2.0+ | Deep learning framework |
| SchNetPack | 2.0+ | MLP training and inference |
| ASE | 3.22+ | Atomic structure manipulation, MD execution |
| NumPy/Matplotlib | Latest | Data analysis and visualization |

Installation Steps

Step 1: Create Conda Environment

# Create new Conda environment (Python 3.10)
conda create -n mlp-tutorial python=3.10 -y
conda activate mlp-tutorial

Step 2: Install PyTorch

# CPU version (local machine, lightweight)
conda install pytorch cpuonly -c pytorch

# GPU version (if CUDA available)
conda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia

Step 3: Install SchNetPack and ASE

# SchNetPack (pip recommended)
pip install schnetpack

# ASE (Atomic Simulation Environment)
pip install ase

# Visualization tools
pip install matplotlib seaborn

Step 4: Verify Installation

# Example 1: Environment verification script (5 lines)
import torch
import schnetpack as spk
print(f"PyTorch: {torch.__version__}")
print(f"SchNetPack: {spk.__version__}")
print(f"GPU available: {torch.cuda.is_available()}")

Expected Output:

PyTorch: 2.1.0
SchNetPack: 2.0.3
GPU available: False  # For CPU

3.2 Data Preparation: Obtaining MD17 Dataset

SchNetPack includes the MD17 benchmark dataset for small molecules.

What is MD17 Dataset

MD17 is a benchmark of ab initio MD trajectories (DFT energies and forces) for ten small organic molecules, including aspirin and benzene (Chmiela et al., 2017). The aspirin subset used here contains 211,762 configurations; license details are given in Section 3.10.1.

Downloading and Loading Data

Example 2: Loading MD17 Dataset

from schnetpack.datasets import MD17

# In SchNetPack 2.x, MD17 is an AtomsDataModule subclass; the aspirin
# trajectory (211,762 configurations) is downloaded automatically on
# first use. The split sizes are required arguments (see Example 3).
data_module = MD17(
    datapath='./data/aspirin.db',
    molecule='aspirin',
    batch_size=32,
    num_train=100000,
    num_val=10000,
    num_test=10000,
)
data_module.prepare_data()
data_module.setup()

print(f"Total samples: {len(data_module.dataset)}")
print(f"Properties: {data_module.dataset.available_properties}")
print(f"First sample: {data_module.dataset[0]}")

Output:

Total samples: 211762
Properties: ['energy', 'forces']
First sample: {'_atomic_numbers': tensor([...]), 'energy': tensor(-1234.5), 'forces': tensor([...])}

Data Splitting

Example 3: Reproducible Train/Validation/Test Split

import schnetpack.transform as trn

# Re-create the data module with a saved split file and the preprocessing
# transforms SchNet needs for training
data_module = MD17(
    datapath='./data/aspirin.db',
    molecule='aspirin',
    batch_size=32,
    num_train=100000,       # Number of training configurations
    num_val=10000,          # Number of validation configurations
    num_test=10000,         # Number of test configurations
    split_file='split.npz', # Save split indices (reproducibility)
    transforms=[
        trn.ASENeighborList(cutoff=5.0),                # neighbor list for SchNet
        trn.RemoveOffsets('energy', remove_mean=True),  # subtract mean energy
        trn.CastTo32(),                                 # float32 training
    ],
)
data_module.prepare_data()
data_module.setup()

Explanation:
- batch_size=32: Process 32 configurations at once (memory efficiency)
- num_train=100000: Large data improves generalization
- split_file: Save the split indices to a file for reproducibility (see the inspection sketch below)
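
To confirm that later runs reuse exactly the same split, the saved file can be inspected directly. This is a minimal sketch; the key names are an assumption about SchNetPack's NPZ layout, so read them from split.files rather than hard-coding:

import numpy as np

split = np.load('split.npz')
print(split.files)               # e.g. ['train_idx', 'val_idx', 'test_idx']
for key in split.files:
    print(key, len(split[key]))  # size of each subset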


3.3 Model Training with SchNetPack

Train a SchNet model to learn energy and forces.

SchNet Architecture Configuration

Example 4: Defining SchNet Model

import torch
import torchmetrics
import schnetpack as spk
from schnetpack.representation import SchNet
from schnetpack.model import NeuralNetworkPotential
from schnetpack.task import ModelOutput

cutoff = 5.0  # Cutoff radius (Å)

# 1. SchNet representation layer (atomic configuration → feature vectors)
representation = SchNet(
    n_atom_basis=128,      # Dimension of atomic feature vectors
    n_interactions=6,      # Number of message passing layers
    radial_basis=spk.nn.GaussianRBF(n_rbf=20, cutoff=cutoff),
    cutoff_fn=spk.nn.CosineCutoff(cutoff),
    n_filters=128          # Number of filters
)

# 2. Atomistic model: distances → SchNet → atomwise energy; forces as -dE/dR
model = NeuralNetworkPotential(
    representation=representation,
    input_modules=[spk.atomistic.PairwiseDistances()],
    output_modules=[
        spk.atomistic.Atomwise(n_in=128, output_key='energy'),
        spk.atomistic.Forces(energy_key='energy', force_key='forces'),
    ],
)

# 3. Output definitions (loss and metrics per predicted property)
output_energy = ModelOutput(
    name='energy', loss_fn=torch.nn.MSELoss(), loss_weight=0.01,
    metrics={'MAE': torchmetrics.MeanAbsoluteError()}
)
output_forces = ModelOutput(
    name='forces', loss_fn=torch.nn.MSELoss(), loss_weight=0.99,
    metrics={'MAE': torchmetrics.MeanAbsoluteError()}
)

Parameter Explanation:
- n_atom_basis=128: Each atom's feature vector is 128-dimensional (typical value)
- n_interactions=6: 6 layers of message passing (deeper layers capture longer-range interactions)
- cutoff=5.0 Å: Interactions beyond this distance are ignored (computational efficiency); the sketch below counts how many neighbors fall inside the cutoff
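
To get a feel for what the cutoff does, the following sketch counts neighbors within 5 Å for one aspirin configuration using ASE's neighbor list (it assumes the database file written in Example 3):

from ase.db import connect
from ase.neighborlist import neighbor_list

# Count neighbor pairs within the training cutoff for one configuration
atoms = connect('./data/aspirin.db').get_atoms(id=1)
i, j = neighbor_list('ij', atoms, cutoff=5.0)
print(f"Average neighbors per atom within 5 Å: {len(i) / len(atoms):.1f}")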

Running Training

Example 5: Setting Up Training Loop (15 lines)

import pytorch_lightning as pl
from schnetpack.task import AtomisticTask

# Define training task (model and outputs from Example 4)
task = AtomisticTask(
    model=model,
    outputs=[output_energy, output_forces],
    optimizer_cls=torch.optim.AdamW,
    optimizer_args={'lr': 1e-4}  # Learning rate
)

# Configure Trainer
trainer = pl.Trainer(
    max_epochs=50,               # Maximum 50 epochs
    accelerator='cpu',           # Use CPU (GPU: 'gpu')
    devices=1,
    default_root_dir='./training'
)

# Start training
trainer.fit(task, datamodule=data_module)

Training Time Estimate:
- CPU (4 cores): Approximately 2-3 hours (100,000 configurations)
- GPU (RTX 3090): Approximately 15-20 minutes

Monitoring Training Progress

Example 6: Visualization with TensorBoard (10 lines)

# Launch TensorBoard (separate terminal)
# tensorboard --logdir=./training/lightning_logs

# Check logs from Python
import pandas as pd

metrics = pd.read_csv('./training/lightning_logs/version_0/metrics.csv')
print(metrics[['epoch', 'train_loss', 'val_loss']].tail(10))

Expected Output:

   epoch  train_loss  val_loss
40    40      0.0023    0.0031
41    41      0.0022    0.0030
42    42      0.0021    0.0029
...

Observation Points:
- Both train_loss and val_loss decreasing → Normal learning progress
- val_loss starts increasing → Sign of overfitting → Consider Early Stopping
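
One caveat about the log file above: Lightning writes metrics.csv only when a CSVLogger is attached (the default logger is TensorBoard). A minimal sketch that matches the path read above:

from pytorch_lightning.loggers import CSVLogger

# Writes metrics.csv under ./training/lightning_logs/version_*/
trainer = pl.Trainer(max_epochs=50, logger=CSVLogger(save_dir='./training'))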


3.4 Accuracy Verification: Energy and Force Prediction Accuracy

Evaluate whether the trained model achieves DFT accuracy.

Test Set Evaluation

Example 7: Test Set Evaluation

# Evaluate on the test set
test_results = trainer.test(task, datamodule=data_module)

# Display results (the metric keys follow the ModelOutput names and
# metric names configured in Example 4)
print(f"Energy MAE: {test_results[0]['test_energy_MAE']:.4f} eV")
print(f"Force MAE: {test_results[0]['test_forces_MAE']:.4f} eV/Å")

Good Accuracy Benchmarks (aspirin molecule, 21 atoms):
- Energy MAE: < 1 kcal/mol (< 0.043 eV)
- Force MAE: < 1 kcal/mol/Å (< 0.043 eV/Å)

Correlation Plot of Predictions vs True Values

Example 8: Visualizing Prediction Accuracy (15 lines)

import matplotlib.pyplot as plt
import numpy as np

# Predict on test data
model = task.model
predictions, targets = [], []

for batch in data_module.test_dataloader():
    pred = model(batch)['energy'].detach().numpy()
    true = batch['energy'].numpy()
    predictions.extend(pred)
    targets.extend(true)

# Scatter plot
plt.scatter(targets, predictions, alpha=0.5, s=1)
plt.plot([min(targets), max(targets)], [min(targets), max(targets)], 'r--')
plt.xlabel('DFT Energy (eV)')
plt.ylabel('MLP Predicted Energy (eV)')
plt.title('Energy Prediction Accuracy')
plt.show()

Ideal Result:
- Points concentrate along red diagonal line (y=x)
- R² > 0.99 (coefficient of determination; computed in the sketch below)
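
The coefficient of determination can be computed directly from the arrays collected in Example 8:

# R² from the energy predictions gathered above
targets_arr = np.array(targets)
pred_arr = np.array(predictions)
ss_res = np.sum((targets_arr - pred_arr) ** 2)
ss_tot = np.sum((targets_arr - targets_arr.mean()) ** 2)
print(f"R² = {1 - ss_res / ss_tot:.4f}")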


3.5 MLP-MD Simulation: Running Molecular Dynamics

Execute MD simulations 10⁴ times faster than DFT using trained MLP.

MLP-MD Setup with ASE

Example 9: Preparing MLP-MD Calculation (10 lines)

from ase import units
from ase.db import connect
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution
from ase.md.verlet import VelocityVerlet
import schnetpack.transform as trn
from schnetpack.interfaces import SpkCalculator

# Wrap MLP as ASE Calculator; the neighbor-list transform must match
# the training cutoff
calculator = SpkCalculator(
    model_file='./training/best_model.ckpt',
    neighbor_list=trn.ASENeighborList(cutoff=5.0),
    energy_key='energy',
    force_key='forces',
    device='cpu'
)

# Prepare initial structure (first configuration from the MD17 database)
atoms = connect('./data/aspirin.db').get_atoms(id=1)
atoms.calc = calculator

Initial Velocity Setup and Equilibration

Example 10: Temperature Initialization (10 lines)

# Set velocity distribution at 300K
temperature = 300  # K
MaxwellBoltzmannDistribution(atoms, temperature_K=temperature)

# Zero momentum (remove overall translation)
from ase.md.velocitydistribution import Stationary
Stationary(atoms)

print(f"Initial kinetic energy: {atoms.get_kinetic_energy():.3f} eV")
print(f"Initial potential energy: {atoms.get_potential_energy():.3f} eV")

Running MD Simulation

Example 11: MD Execution and Trajectory Saving (12 lines)

from ase.io.trajectory import Trajectory

# Configure MD simulator
timestep = 0.5 * units.fs  # 0.5 femtoseconds
dyn = VelocityVerlet(atoms, timestep=timestep)

# Output trajectory file
traj = Trajectory('aspirin_md.traj', 'w', atoms)
dyn.attach(traj.write, interval=10)  # Save every 10 steps

# Run MD for 10,000 steps (5 picoseconds)
dyn.run(10000)
print("MD simulation completed!")

Computational Time Estimate:
- CPU (4 cores): Approximately 5 minutes (10,000 steps)
- With DFT: Approximately 1 week (10,000 steps)
- Achieved 10⁴× speedup!

Trajectory Analysis

Example 12: Energy Conservation and RDF Calculation (15 lines)

from ase.io import read
import numpy as np

# Read trajectory
traj_data = read('aspirin_md.traj', index=':')

# Check energy conservation
energies = [a.get_total_energy() for a in traj_data]
plt.plot(energies)
plt.xlabel('Time step')
plt.ylabel('Total Energy (eV)')
plt.title('Energy Conservation Check')
plt.show()

# Calculate energy drift (monotonic increase/decrease)
drift = (energies[-1] - energies[0]) / len(energies)
print(f"Energy drift: {drift:.6f} eV/step")

Good Simulation Indicators:
- Energy drift: < 0.001 eV/step
- Total energy oscillates around a constant mean rather than trending up or down (energy conservation)


3.6 Physical Property Calculations: Vibrational Spectra and Diffusion Coefficients

Calculate physical properties from MLP-MD.

Vibrational Spectrum (Power Spectrum)

Example 13: Vibrational Spectrum Calculation (15 lines)

from scipy.fft import fft, fftfreq

# Extract velocity time series for one atom
atom_idx = 0  # First atom
velocities = np.array([a.get_velocities()[atom_idx] for a in traj_data])

# Fourier transform of x-direction velocity.
# Frames were saved every 10 MD steps (Example 11), so the sampling
# interval in femtoseconds is:
dt_fs = 10 * timestep / units.fs
vx = velocities[:, 0]
freq = fftfreq(len(vx), d=dt_fs)   # frequency in cycles/fs
spectrum = np.abs(fft(vx))**2

# Plot positive frequencies only (1 cycle/fs = 1000 THz)
mask = freq > 0
freq_thz = freq[mask] * 1e3
plt.plot(freq_thz, spectrum[mask])
plt.xlabel('Frequency (THz)')
plt.ylabel('Power Spectrum')
plt.title('Vibrational Spectrum')
plt.xlim(0, 100)
plt.show()

Interpretation:
- Peaks correspond to molecular vibrational modes
- Compare with a DFT-calculated vibrational spectrum for accuracy verification (the sketch below lists the peak positions)
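
To read off peak positions programmatically, SciPy's peak finder can be applied to the spectrum from Example 13 (the 5% height threshold is an arbitrary choice for illustration):

from scipy.signal import find_peaks

# List vibrational peak positions from the power spectrum above
peaks, _ = find_peaks(spectrum[mask], height=0.05 * spectrum[mask].max())
print("Peak frequencies (THz):", np.round(freq_thz[peaks], 1))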

Mean Square Displacement (MSD) and Diffusion Coefficient

Example 14: MSD Calculation (15 lines)

def calculate_msd(traj, atom_idx=0):
    """Calculate mean square displacement"""
    positions = np.array([a.positions[atom_idx] for a in traj])
    msd = np.zeros(len(positions))

    for t in range(len(positions)):
        displacement = positions[t:] - positions[:-t or None]
        msd[t] = np.mean(np.sum(displacement**2, axis=1))

    return msd

# Calculate and plot MSD
msd = calculate_msd(traj_data)
time_ps = np.arange(len(msd)) * 10 * timestep / units.fs * 1e-3  # ps (frames saved every 10 steps)

plt.plot(time_ps, msd)
plt.xlabel('Time (ps)')
plt.ylabel('MSD (Å²)')
plt.title('Mean Square Displacement')
plt.show()

Diffusion Coefficient Calculation:

# Estimate the diffusion coefficient from the linear region of the MSD
# (Einstein relation): D = lim_{t→∞} MSD(t) / (6t)
linear_region = slice(100, 500)
fit = np.polyfit(time_ps[linear_region], msd[linear_region], deg=1)
D = fit[0] / 6  # in Å²/ps; see the conversion sketch below for cm²/s
print(f"Diffusion coefficient: {D:.6f} Å²/ps")

3.7 Active Learning: Efficient Data Addition

Automatically detect configurations where the model is uncertain and add DFT calculations.

Ensemble Uncertainty Evaluation

Example 15: Prediction Uncertainty (15 lines)

# Train three independent models (e.g., rerun Example 5 with different
# seeds) and list their checkpoint paths (hypothetical file names)
models = ['model_seed1.ckpt', 'model_seed2.ckpt', 'model_seed3.ckpt']

def predict_with_uncertainty(atoms, models):
    """Ensemble prediction with uncertainty"""
    predictions = []
    for model in models:
        atoms.calc = SpkCalculator(model_file=model,
                                   neighbor_list=trn.ASENeighborList(cutoff=5.0),
                                   device='cpu')
        predictions.append(atoms.get_potential_energy())

    mean = np.mean(predictions)
    std = np.std(predictions)
    return mean, std

# Evaluate uncertainty for each configuration in MD trajectory
uncertainties = []
for atoms in traj_data[::100]:  # Every 100 frames
    _, std = predict_with_uncertainty(atoms, models)
    uncertainties.append(std)

# Identify configurations with high uncertainty
threshold = np.percentile(uncertainties, 95)
high_uncertainty_frames = np.where(np.array(uncertainties) > threshold)[0]
print(f"High uncertainty frames: {high_uncertainty_frames}")

Next Steps:
- Add configurations with high uncertainty to DFT calculations (a minimal export sketch follows this list)
- Update dataset and retrain model
- Verify accuracy improvement
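
A minimal sketch for collecting the flagged frames into a file for DFT labeling (the file name and extxyz format are illustrative choices):

from ase.io import write

# Map indices back to the subsampled trajectory (every 100th frame was scored)
selected = [traj_data[int(i) * 100] for i in high_uncertainty_frames]
write('to_label.extxyz', selected)  # hand these structures to the DFT workflow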


3.8 Troubleshooting: Common Errors and Solutions

Introduce problems and solutions frequently encountered in practice.

| Error | Cause | Solution |
|---|---|---|
| Out of Memory (OOM) | Batch size too large | Reduce batch_size: 32 → 16 → 8 |
| Loss becomes NaN | Learning rate too high | Lower lr: 1e-4 → 1e-5 |
| Energy drift in MD | Timestep too large | Reduce timestep: 0.5 fs → 0.25 fs |
| Poor generalization | Training data biased | Diversify data with Active Learning |
| CUDA error | GPU compatibility issue | Verify PyTorch and CUDA versions |

Debugging Best Practices

# 1. Test with small-scale data
data_module.num_train = 1000  # Quick test with 1,000 configurations

# 2. Check overfitting on 1 batch
trainer = pl.Trainer(max_epochs=100, overfit_batches=1)
# If training error approaches 0, model has learning capacity

# 3. Gradient clipping (a PyTorch Lightning Trainer argument, not an AtomisticTask one)
trainer = pl.Trainer(gradient_clip_val=1.0)  # Prevent gradient explosion

3.9 Chapter Summary

What You Learned

  1. Environment Setup
    - Installing Conda environment, PyTorch, SchNetPack
    - Choosing GPU/CPU environment

  2. Data Preparation
    - Downloading and loading MD17 dataset
    - Splitting into training/validation/test sets

  3. Model Training
    - Configuring SchNet architecture (6 layers, 128 dimensions)
    - Training for 50 epochs (CPU: 2-3 hours)
    - Monitoring progress with TensorBoard

  4. Accuracy Verification
    - Verifying Energy MAE < 1 kcal/mol achieved
    - Correlation plot of predictions vs true values
    - High accuracy with R² > 0.99

  5. MLP-MD Execution
    - Integration as ASE Calculator
    - Running MD for 10,000 steps (5 picoseconds)
    - Experiencing 10⁴× speedup over DFT

  6. Physical Property Calculations
    - Vibrational spectrum (Fourier transform)
    - Diffusion coefficient (calculated from mean square displacement)

  7. Active Learning
    - Configuration selection by ensemble uncertainty
    - Automated data addition strategy


To the Next Chapter

In Chapter 4, you'll learn about the latest MLP methods (NequIP, MACE) and actual research applications:
- Theory of E(3) equivariant graph neural networks
- Dramatic improvement in data efficiency (100,000→3,000 configurations)
- Application cases to catalytic reactions and battery materials
- Realization of large-scale simulations (1 million atoms)


Exercises

Exercise 1 (Difficulty: easy)

In Example 4's SchNet configuration, change n_interactions (number of message passing layers) to 3, 6, and 9, train the models, and predict how test MAE changes.

Hint: The deeper the layers, the better they capture long-range atomic interactions. However, too deep increases overfitting risk.
Sample Answer

**Expected Results**:

| `n_interactions` | Predicted Test MAE | Training Time | Characteristics |
|---|---|---|---|
| **3** | 0.8-1.2 kcal/mol | 1 hour | Shallow; cannot fully capture long-range interactions |
| **6** | 0.5-0.8 kcal/mol | 2-3 hours | Well balanced (recommended) |
| **9** | 0.6-1.0 kcal/mol | 4-5 hours | Overfitting risk; accuracy decreases with insufficient training data |

**Experimental Method**:
for n in [3, 6, 9]:
    representation = SchNet(n_interactions=n, ...)
    task = AtomisticTask(...)
    trainer.fit(task, datamodule=data_module)
    results = trainer.test(task, datamodule=data_module)
    print(f"n={n}: MAE={results[0]['test_energy_MAE']:.4f} eV")
**Conclusion**: For small molecules (21 aspirin atoms), `n_interactions=6` is optimal. For large-scale systems (100+ atoms), 9-12 layers may be effective.

Exercise 2 (Difficulty: medium)

In Example 11's MLP-MD, if energy drift exceeds acceptable range (e.g., 0.01 eV/step), what countermeasures can you consider? List three.

Hint: Consider from three perspectives: timestep, training accuracy, and MD algorithm.
Sample Answer

**Countermeasure 1: Reduce Timestep**

timestep = 0.25 * units.fs  # Halve from 0.5 fs to 0.25 fs
dyn = VelocityVerlet(atoms, timestep=timestep)

- **Reason**: A smaller timestep reduces numerical integration error
- **Drawback**: 2× longer computation time

**Countermeasure 2: Improve Model Training Accuracy**

# Train with more data
data_module.num_train = 200000  # Increase from 100,000 to 200,000 configurations

# Or shift the loss weighting toward forces (set per output, see Example 4)
output_forces = ModelOutput(name='forces', loss_fn=torch.nn.MSELoss(), loss_weight=0.99)

- **Reason**: Low force prediction accuracy causes unstable MD
- **Target**: Force MAE < 0.05 eV/Å

**Countermeasure 3: Switch to Langevin Dynamics (Heat Bath Coupling)**

from ase.md.langevin import Langevin
dyn = Langevin(atoms, timestep=0.5*units.fs,
               temperature_K=300, friction=0.01)

- **Reason**: The heat bath absorbs energy drift
- **Caution**: No longer strictly a microcanonical (NVE) ensemble

**Priority**: Countermeasure 2 (improve accuracy) → Countermeasure 1 (timestep) → Countermeasure 3 (Langevin)

3.10 Data License and Reproducibility

Information on datasets and tool versions needed to reproduce the hands-on code in this chapter.

3.10.1 Datasets Used

| Dataset | Description | License | Access Method |
|---|---|---|---|
| MD17 | Small molecule MD trajectories (10 types including aspirin, benzene) | CC0 1.0 (Public Domain) | Built into SchNetPack (`MD17(molecule='aspirin')`) |
| Aspirin molecule | 211,762 configurations, DFT (PBE/def2-SVP) | CC0 1.0 | sgdml.org |

Notes:
- Commercial Use: CC0 license allows all commercial use, modification, and redistribution freely
- Paper Citation: When using MD17, cite the following
Chmiela, S., et al. (2017). "Machine learning of accurate energy-conserving molecular force fields." Science Advances, 3(5), e1603015.
- Data Integrity: SchNetPack's download function verifies with SHA256 checksum

3.10.2 Environment Information for Code Reproducibility

To accurately reproduce the code examples in this chapter, use the following versions.

| Tool | Recommended Version | Installation Command | Compatibility |
|---|---|---|---|
| Python | 3.10.x | `conda create -n mlp python=3.10` | Tested on 3.9-3.11 |
| PyTorch | 2.1.0 | `conda install pytorch=2.1.0` | 2.0 or higher required |
| SchNetPack | 2.0.3 | `pip install schnetpack==2.0.3` | API differs between 2.x and 1.x series |
| ASE | 3.22.1 | `pip install ase==3.22.1` | 3.20 or higher recommended |
| PyTorch Lightning | 2.1.0 | `pip install pytorch-lightning==2.1.0` | Compatible with SchNetPack 2.0.3 |
| NumPy | 1.24.3 | `pip install numpy==1.24.3` | 1.20 or higher |
| Matplotlib | 3.7.1 | `pip install matplotlib==3.7.1` | 3.5 or higher |

Saving Environment File:

# Save current environment in reproducible form
conda env export > environment.yml

# Reproduce in another environment
conda env create -f environment.yml

Ensuring Reproducibility with Docker (recommended):

# Dockerfile example
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
RUN pip install schnetpack==2.0.3 ase==3.22.1 pytorch-lightning==2.1.0

3.10.3 Complete Record of Training Hyperparameters

Complete record of hyperparameters used in Examples 4 and 5 (for paper reproduction):

| Parameter | Value | Description |
|---|---|---|
| n_atom_basis | 128 | Dimension of atomic feature vectors |
| n_interactions | 6 | Number of message passing layers |
| cutoff | 5.0 Å | Cutoff radius for atomic interactions |
| n_filters | 128 | Number of convolution filters |
| batch_size | 32 | Mini-batch size |
| learning_rate | 1e-4 | Initial learning rate (AdamW) |
| max_epochs | 50 | Maximum training epochs |
| num_train | 100,000 | Number of training configurations |
| num_val | 10,000 | Number of validation configurations |
| num_test | 10,000 | Number of test configurations |
| random_seed | 42 | Random seed (data split reproducibility) |

Complete Reproduction Code:

import torch
torch.manual_seed(42)  # Ensure reproducibility

representation = SchNet(
    n_atom_basis=128, n_interactions=6, n_filters=128,
    radial_basis=spk.nn.GaussianRBF(n_rbf=20, cutoff=5.0),
    cutoff_fn=spk.nn.CosineCutoff(5.0),
)
task = AtomisticTask(
    model=model,  # NeuralNetworkPotential assembled as in Example 4
    outputs=[output_energy, output_forces],
    optimizer_cls=torch.optim.AdamW,
    optimizer_args={'lr': 1e-4, 'weight_decay': 0.01}
)
trainer = pl.Trainer(max_epochs=50, deterministic=True)

3.10.4 Energy and Force Unit Conversion Table

Unit conversions for physical quantities used in this chapter (SchNetPack and ASE standard units):

| Physical Quantity | SchNetPack/ASE Unit | eV | kcal/mol | Atomic Units |
|---|---|---|---|---|
| Energy | eV | 1.0 | 23.06 | 0.03674 Hartree |
| Force | eV/Å | 1.0 | 23.06 (kcal/mol/Å) | 0.01945 Hartree/Bohr |
| Distance | Å | - | - | 1.889726 Bohr |
| Time | fs (femtosecond) | - | - | 41.34 a.u. (1 a.u. = 0.02419 fs) |

Unit Conversion Example:

from ase import units

# Energy conversion
energy_ev = 1.0  # eV
energy_kcal = energy_ev * 23.06052  # kcal/mol
energy_hartree = energy_ev * 0.036749  # Hartree

# Using ASE unit constants (recommended)
print(f"{energy_ev} eV = {energy_ev * units.eV / units.kcal * units.mol} kcal/mol")

3.11 Practical Precautions: Common Failure Patterns in Hands-On

3.11.1 Pitfalls in Environment Setup and Installation

Failure 1: PyTorch and CUDA Version Mismatch

Problem:

RuntimeError: CUDA error: no kernel image is available for execution on the device

Cause:
PyTorch 2.1.0 is compiled with CUDA 11.8 or 12.1, but system CUDA is older version like 10.2

Diagnostic Code:

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version (PyTorch): {torch.version.cuda}")

# Check system CUDA version (terminal)
# nvcc --version

Solution:

# For system CUDA 11.8
conda install pytorch==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia

# Switch to CPU version if CUDA unavailable
conda install pytorch==2.1.0 cpuonly -c pytorch

Prevention:
Check GPU driver and CUDA version with nvidia-smi before environment setup

Failure 2: Confusing SchNetPack 1.x and 2.x APIs

Problem:

AttributeError: module 'schnetpack' has no attribute 'AtomsData'

Cause:
Running old tutorial code from SchNetPack 1.x series on 2.x series

Version Check:

import schnetpack as spk
print(spk.__version__)  # If 2.0.3, this chapter's code works

Main API Changes:

| SchNetPack 1.x | SchNetPack 2.x |
|---|---|
| `spk.AtomsData` | `spk.data.AtomsDataModule` |
| `spk.atomistic.Atomwise` | `spk.task.ModelOutput` |
| `spk.train.Trainer` | `pytorch_lightning.Trainer` |

Solution:
Use this chapter's code examples (2.x series) or refer to SchNetPack official documentation (schnetpack.readthedocs.io) 2.x series tutorials

Failure 3: Misdiagnosis of Memory Shortage (OOM)

Problem:

RuntimeError: CUDA out of memory. Tried to allocate 1.50 GiB

Common Misconception:
"GPU memory shortage so need to buy new GPU" → Wrong

Diagnostic Procedure:

# 1. Check batch size
print(f"Current batch size: {data_module.batch_size}")

# 2. Check GPU memory usage
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

Solutions (Priority Order):

  1. Reduce batch size: batch_size = 32 → 16 → 8 → 4
  2. Gradient accumulation: accumulate gradients over several small batches to simulate a large batch
trainer = pl.Trainer(accumulate_grad_batches=4)  # Update every 4 batches
  3. Mixed precision training: roughly halve memory usage
trainer = pl.Trainer(precision=16)  # Use float16

Guidelines:
- GPU 4GB: batch_size=4-8
- GPU 8GB: batch_size=16-32
- GPU 24GB: batch_size=64-128

3.11.2 Training and Debugging Pitfalls

Failure 4: Training Error Not Decreasing (NaN Loss)

Problem:

Epoch 5: train_loss=nan, val_loss=nan

Top 3 Causes:

  1. Learning rate too high: Gradient explosion → Parameters become NaN
  2. Lack of data normalization: Energy absolute values too large (e.g., -1000 eV)
  3. Inappropriate force loss coefficient: Force loss too dominant

Diagnostic Code:

# Check model output immediately after training starts
for batch in data_module.train_dataloader():
    output = task.model(batch)
    print(f"Energy prediction: {output['energy'][:5]}")  # First 5 samples
    print(f"Energy target: {batch['energy'][:5]}")
    break

# NaN check
print(f"Has NaN in prediction: {torch.isnan(output['energy']).any()}")

Solutions:

  1. Lower learning rate:
optimizer_args={'lr': 1e-5}  # Decrease from 1e-4 to 1e-5
  2. Gradient clipping:
trainer = pl.Trainer(gradient_clip_val=1.0)  # Clip gradient norm to ≤ 1.0
  3. Data normalization (handled by the RemoveOffsets transform in Example 3, but verify manually):
import schnetpack.transform as trn
data_module.train_transforms = [
    trn.SubtractCenterOfMass(),
    trn.RemoveOffsets('energy', remove_mean=True)  # Remove energy offset
]

Failure 5: Missing Overfitting Signs

Problem:
Training error decreases but validation error stagnates or increases

Epoch 30: train_loss=0.001, val_loss=0.050  # val_loss worsening

Cause:
Model memorizes training data, generalization performance to unseen data deteriorates

Diagnostic Graph:

import matplotlib.pyplot as plt
import pandas as pd

metrics = pd.read_csv('./training/lightning_logs/version_0/metrics.csv')
plt.plot(metrics['epoch'], metrics['train_loss'], label='Train')
plt.plot(metrics['epoch'], metrics['val_loss'], label='Validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

Solutions:

  1. Early Stopping (automatic stopping):
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=10, mode='min')
trainer = pl.Trainer(callbacks=[early_stop])
  2. Data augmentation (increase training data):
data_module.num_train = 200000  # Increase from 100,000 to 200,000
  3. Reduce model capacity (Dropout is not recommended and is less effective for MLPs; shrink the parameter count instead):
representation = SchNet(n_atom_basis=64, n_interactions=4, ...)  # Reduce from 128 → 64

3.11.3 MLP-MD Simulation Pitfalls

Failure 6: Underestimating Energy Drift

Problem:
Energy monotonically increases/decreases during MD simulation (breaking conservation law)

Misconception About Acceptable Range:
A drift that looks tiny per step accumulates over a run; the working guideline from Section 3.5 is |drift| < 0.001 eV/step.

Quantitative Diagnosis:

from ase.io import read

traj = read('aspirin_md.traj', index=':')
energies = [a.get_total_energy() for a in traj]

# Calculate drift (linear fit)
import numpy as np
time_steps = np.arange(len(energies))
drift_rate, offset = np.polyfit(time_steps, energies, deg=1)
print(f"Energy drift: {drift_rate:.6f} eV/step")

# Check acceptable range
if abs(drift_rate) > 0.001:
    print("⚠️ WARNING: Excessive energy drift detected!")

Solutions (Detailed from Exercise 2):

  1. Improve force training accuracy (most important):
# Put most of the loss weight on forces (set per output, see Example 4)
output_energy = ModelOutput(name='energy', loss_fn=torch.nn.MSELoss(), loss_weight=0.01)
output_forces = ModelOutput(name='forces', loss_fn=torch.nn.MSELoss(), loss_weight=0.99)
  2. Optimize timestep:
# Stability test
for dt in [0.1, 0.25, 0.5, 1.0]:  # fs
    ...  # If stable at dt=0.5 but divergent at dt=1.0, adopt dt=0.5
  3. Recognize MLP accuracy limits:
    If force MAE > 0.1 eV/Å, long-time MD (> 10 ps) has reduced reliability
    → Add training data with Active Learning

Failure 7: Not Verifying Physical Validity of MD Results

Problem:
Misconception that "MD completed so it's successful" → Actually unphysical structural changes

Items to Verify:

1. Temperature Control Check:

temperatures = [a.get_temperature() for a in traj]
print(f"Average T: {np.mean(temperatures):.1f} K (target: 300 K)")
print(f"Std T: {np.std(temperatures):.1f} K")
# Standard deviation > 30K indicates abnormality

2. Structure Disruption Check:

from ase.geometry.analysis import Analysis

# Compare initial and final structures
ana_init = Analysis(traj[0])
ana_final = Analysis(traj[-1])

# Check if bonds are broken (all_bonds is indexed [image][atom] → neighbor lists)
n_bonds_init = sum(len(b) for b in ana_init.all_bonds[0]) // 2
n_bonds_final = sum(len(b) for b in ana_final.all_bonds[0]) // 2
print(f"Initial bonds: {n_bonds_init}, Final bonds: {n_bonds_final}")

# Change in bond count → Possible structure disruption

3. Validity of Radial Distribution Function (RDF):

# Check if first peak position matches DFT calculations or X-ray diffraction data
# (Implementation advanced, omitted)
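
As a rough substitute for a full RDF on an isolated molecule, a histogram of all interatomic distances in the final frame can reveal gross structural changes; this sketch assumes traj and plt from the analysis code above:

# Interatomic-distance histogram as a simple RDF proxy (not a true periodic RDF)
d = traj[-1].get_all_distances()
pairs = d[np.triu_indices_from(d, k=1)]  # unique atom pairs only
plt.hist(pairs, bins=100)
plt.xlabel('Interatomic distance (Å)')
plt.ylabel('Pair count')
plt.show()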

Solution:
If physically valid results not obtained, likely extrapolation beyond training data range
→ Add those configurations to training data with Active Learning


3.12 Chapter Completion Checklist: Hands-On Quality Assurance

After completing this chapter, verify the following items. If you can check all items, you're ready to apply MLP in actual research projects.

3.12.1 Conceptual Understanding

Understanding Tools and Environment:

Understanding Model Training:

Understanding MLP-MD:

3.12.2 Practical Skills

Environment Setup:

Data Preparation and Training:

MLP-MD Simulation:

Analysis and Troubleshooting:

3.12.3 Application Skills

Application Plan to Your Research:

Problem Solving and Debugging:

Preparation for Advanced Techniques:

Bridge to Next Chapter:


References

  1. Schütt, K. T., et al. (2019). "SchNetPack: A Deep Learning Toolbox For Atomistic Systems." Journal of Chemical Theory and Computation, 15(1), 448-455.
    DOI: 10.1021/acs.jctc.8b00908

  2. Chmiela, S., et al. (2017). "Machine learning of accurate energy-conserving molecular force fields." Science Advances, 3(5), e1603015.
    DOI: 10.1126/sciadv.1603015

  3. Larsen, A. H., et al. (2017). "The atomic simulation environment—a Python library for working with atoms." Journal of Physics: Condensed Matter, 29(27), 273002.
    DOI: 10.1088/1361-648X/aa680e

  4. Paszke, A., et al. (2019). "PyTorch: An imperative style, high-performance deep learning library." Advances in Neural Information Processing Systems, 32.
    arXiv: 1912.01703

  5. Zhang, L., et al. (2019). "Active learning of uniformly accurate interatomic potentials for materials simulation." Physical Review Materials, 3(2), 023804.
    DOI: 10.1103/PhysRevMaterials.3.023804

  6. Schütt, K. T., et al. (2017). "Quantum-chemical insights from deep tensor neural networks." Nature Communications, 8(1), 13890.
    DOI: 10.1038/ncomms13890


Author Information

Created by: MI Knowledge Hub Content Team
Created on: 2025-10-17
Version: 1.1 (Chapter 3 quality improvement)
Series: MLP Introduction Series

Update History:
- 2025-10-19: v1.1 Quality improvement revision
- Added data license and reproducibility section (MD17 dataset, aspirin molecule information)
- Code reproducibility information (Python 3.10.x, PyTorch 2.1.0, SchNetPack 2.0.3, ASE 3.22.1)
- Complete record of training hyperparameters (11 items, for paper reproduction)
- Energy and force unit conversion table (eV, kcal/mol, Hartree mutual conversion)
- Added practical precautions section (7 failure patterns: CUDA mismatch, API confusion, OOM, NaN loss, overfitting, energy drift, physical validity)
- Added chapter completion checklist (12 conceptual understanding items, 16 practical skill items, 16 application skill items)
- 2025-10-17: v1.0 Chapter 3 initial release
- Python environment setup (Conda, PyTorch, SchNetPack)
- MD17 dataset preparation and splitting
- SchNet model training (15 code examples)
- MLP-MD execution and analysis (trajectory, vibrational spectrum, MSD)
- Active Learning uncertainty evaluation
- Troubleshooting table (5 items)
- 2 exercises (easy, medium)
- 6 references

License: Creative Commons BY-NC-SA 4.0
