Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Neural-network quantum molecular dynamics (NNQMD) simulations based on machine learning are revolutionizing atomistic simulations of materials.
The Allegro model combines group theory, rotational equivariance and local descriptors for higher accuracy and speed.
The Allegro-Legato model combines the Allegro model with sharpness-aware minimization (SAM) for improved smoothness and robustness.
The Allegro-Legato model exhibits a weaker dependence of time-to-failure on problem size and excellent computational scalability.

Paper Content

Introduction

Neural-network quantum molecular dynamics (NNQMD) simulations based on machine learning are revolutionizing atomistic modeling of materials
NNQMD predicts accurate interatomic forces and captures quantum properties such as electronic polarization and electronic excitation
NNQMD is one of the most scalable scientific applications on high-end supercomputers
A breakthrough in NNQMD is improved accuracy of force prediction over previous models
The latest NNQMD model, Allegro, combines state-of-the-art accuracy with record speed
NNQMD simulations face an unsolved issue known as fidelity scaling: the time-to-failure of a simulation shortens as the problem size grows
A training algorithm, sharpness-aware minimization (SAM), is used to train the Allegro model to enhance its robustness
The Allegro-Legato model increases the time-to-failure of NNQMD simulations while maintaining the same inference speed and nearly equal accuracy

Method innovation

Summary of neural-network quantum molecular dynamics

MD simulation follows time evolution of positions of atoms
Neural networks are trained to reproduce ground-truth QM values
Allegro model uses pairwise embedding energies between atomic pairs
Allegro attains accuracy through group-theoretical equivariance and speed through data locality
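
The time evolution of atomic positions described above can be illustrated with a minimal sketch. It uses the velocity Verlet integrator, a common choice in MD codes, though the summary does not say which integrator the authors use. The function names are hypothetical, and `harmonic_force` is only a stand-in for the neural-network force inference that a model such as Allegro would provide.

```python
import numpy as np

def velocity_verlet_step(pos, vel, forces, masses, dt, force_fn):
    """Advance positions and velocities by one time step dt.

    pos, vel, forces: arrays of shape (n_atoms, 3)
    masses: array of shape (n_atoms,)
    force_fn: callable mapping positions to forces (assumed interface)
    """
    inv_m = 1.0 / masses[:, None]
    vel_half = vel + 0.5 * dt * forces * inv_m       # half-step velocity update
    new_pos = pos + dt * vel_half                    # full-step position update
    new_forces = force_fn(new_pos)                   # forces at the new positions
    new_vel = vel_half + 0.5 * dt * new_forces * inv_m
    return new_pos, new_vel, new_forces

def harmonic_force(pos, k=1.0):
    # Placeholder force field (independent harmonic wells), NOT a trained model.
    return -k * pos
```

In an NNQMD setting, `force_fn` would be replaced by the trained network's force prediction; the integrator itself is unchanged.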

Summary of sharpness-aware minimization

Neural networks are trained by minimizing a loss function.
Design choice of optimization methods impacts convergence speed and generalization performance.
Adversarial attacks are a problem unique to neural networks.
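
As a rough illustration of how sharpness-aware minimization modifies a plain gradient step, the sketch below follows the commonly described two-step SAM update: first perturb the weights along the normalized gradient by a radius ρ, then descend using the gradient evaluated at the perturbed point. The toy quadratic loss, the learning rate, and the function names are assumptions made for illustration; this is not the training code used for Allegro-Legato.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.05, rho=0.005):
    """One SAM update on weights w with plain gradient descent as base optimizer."""
    g = grad_fn(w)
    # Ascent step: move to the (approximate) worst-case point within radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descent step: use the gradient evaluated at the perturbed weights.
    g_sam = grad_fn(w + eps)
    return w - lr * g_sam

def toy_grad(w):
    # Gradient of the toy loss L(w) = 0.5 * (1.0 * w[0]**2 + 10.0 * w[1]**2).
    return np.array([1.0, 10.0]) * w
```

Each SAM step calls `grad_fn` twice, which is why SAM costs more per iteration than its base optimizer.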

Key innovation: Allegro-Legato: SAM-enhanced Allegro

Hypothesis: Smoothened loss landscape through SAM enhances fidelity scaling of NNQMD.
Tested hypothesis by incorporating SAM into Allegro NNQMD model.
Tuned SAM’s hyperparameter ρ to provide most robust model.
Found ρ = 0.005 gives longest time-to-failure.
Used LAMMPS open-source MD simulation software.

RXMD-NN: scalable parallel implementation of Allegro-Legato NNQMD

Implemented Allegro-Legato NNQMD model in RXMD-NN software
Hierarchical divide-and-conquer scheme for “globally-scalable and local-fast” parallelization
Interprocess communication using non-blocking MPI library
CPU responsible for adjacency-list construction in parallel
PyTorch tensor object for force inference on GPUs
Control computational granularity to find ideal balance between horizontal and vertical scalability
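
To make the adjacency-list step concrete, here is a deliberately simple sketch that lists, for each atom, the indices of all other atoms within a cutoff radius. It is a brute-force O(N²) version without periodic boundaries or domain decomposition, so it should be read as an illustration of the data structure only, not as the RXMD-NN implementation; the function name and the cutoff value in the usage note are hypothetical.

```python
import numpy as np

def build_adjacency_list(pos, cutoff):
    """Return a list where entry i holds the indices of atoms within `cutoff` of atom i."""
    diff = pos[:, None, :] - pos[None, :, :]     # pairwise displacement vectors
    dist2 = np.sum(diff ** 2, axis=-1)           # squared pairwise distances
    mask = dist2 < cutoff ** 2
    np.fill_diagonal(mask, False)                # an atom is not its own neighbor
    return [np.nonzero(row)[0].tolist() for row in mask]
```

For three atoms placed at x = 0, 1 and 3 with a cutoff of 1.5, the first two atoms are neighbors of each other and the third has none. A production code would typically replace the all-pairs distance matrix with a cell list so that the cost grows linearly with the number of atoms.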

Results

Tested fidelity and computational scalability of Allegro-Legato NNQMD model
Implemented in RXMD-NN code on Polaris platform at ALCF

Experimental platform

Polaris is a Hewlett Packard Enterprise (HPE) Apollo 6500 Gen 10+ based system.
It has 560 nodes, each with one 2.8 GHz AMD EPYC Milan 7543P 32-core CPU, 512 GB of DDR4 RAM, four NVIDIA A100 GPUs with 40 GB of HBM2 memory per GPU, two 1.6 TB SSDs in RAID0 and two Slingshot network endpoints.
It uses the NVIDIA HGX A100 platform to connect all four GPUs via NVLink, with a GPU interconnect bandwidth of 600 GB/s.
The Slingshot interconnect is based on high-radix 64-port switches arranged in a dragonfly topology.
It is rated at a production peak performance of 44 petaflops, with per-node performance of 78 teraflops for double precision.

Fidelity-scaling results

Allegro and Allegro-Legato models are tested for robustness
NVT and NVE ensembles are used to study thermal-equilibrium and non-equilibrium properties
Simulation instances are thermalized at 200 K for 1000 steps in the NVT ensemble and then switched to NVE
Time step of 2 fs is used throughout the test
Results from over ten independent simulations are averaged to measure the time-to-failure, t_failure
Fidelity scaling exponent is defined to quantify fidelity scaling
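
One plausible way to extract such an exponent, assuming the time-to-failure follows a power law t_failure ∝ N^(-β) in the number of atoms N (a form this summary does not spell out), is a straight-line fit in log-log space. The function name below is hypothetical and the sketch is not the authors' analysis script.

```python
import numpy as np

def fit_scaling_exponent(n_atoms, t_failure):
    """Fit t_failure ~ C * N**(-beta) by linear regression in log-log space.

    Returns beta; a smaller beta means time-to-failure degrades more slowly
    as the system grows.
    """
    log_n = np.log(np.asarray(n_atoms, dtype=float))
    log_t = np.log(np.asarray(t_failure, dtype=float))
    slope, _intercept = np.polyfit(log_n, log_t, 1)
    return -slope
```

Under this assumed form, comparing the fitted β of two models trained on the same data gives a single-number measure of which one tolerates larger system sizes better.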

Computational-scaling results

Measured wall-clock time per MD step with a scaled workload: a 6,912P-atom ammonia system on P MD domains (the number of atoms grows in proportion to the number of domains)
Each MD domain consists of 6,912 atoms offloaded to single GPU
Runtime includes force inference, adjacency list construction, data transfer, and internode communication
Excellent scalability, 0.91 parallel efficiency for up to 13,271,040 atoms on 1,920 A100 GPUs
Fast time-to-solution of 3.46 seconds per MD step
GPU acceleration of NNQMD algorithm on single Polaris node with up to 7.6x speedup
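
The scaled-workload figures above can be checked with a short sketch. It assumes the usual weak-scaling definition of parallel efficiency, the reference wall-clock time divided by the wall-clock time on P domains when the work per domain is fixed; the summary does not state which reference run the authors used, and the function names are hypothetical.

```python
def total_atoms(atoms_per_domain, n_domains):
    """Total system size when every MD domain holds the same number of atoms."""
    return atoms_per_domain * n_domains

def weak_scaling_efficiency(t_reference, t_parallel):
    """Weak-scaling efficiency: 1.0 means the time per step did not grow with P."""
    return t_reference / t_parallel
```

With 6,912 atoms per domain and one domain per GPU, `total_atoms(6912, 1920)` reproduces the 13,271,040 atoms quoted for 1,920 GPUs.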

Discussion

SAM-enhanced Allegro model (Allegro-Legato) is more robust than SOTA Allegro model
The following subsections examine how SAM training affects accuracy and computational speed

Simulation time

MD simulation time is not affected by SAM.
GPU acceleration of NNQMD algorithm achieved 7.6x speedup.
Default value of 1 used for maximum tensor rank.
Larger tensor rank generates more accurate but larger models.

Training time

SAM requires more computation time than the base optimizer.
SAM converges faster than the default optimizer.
Training time increases drastically for larger maximum tensor ranks.
Allegro-Legato improves robustness without extra training cost.

Model accuracy

Obtained validation error in atomic force of 15.9 meV/Å (RMSE) and 11.6 meV/Å (MAE) with Allegro-Legato model
Obtained validation error of 14.7 meV/Å (RMSE) and 10.7 meV/Å (MAE) with original Allegro model
Guideline of MAE of 1 kcal/mol/Å for reliable MD simulations (corresponds to 43.4 meV/Å)
Allegro-Legato improves robustness with only a slight increase in force error, which remains about 4x smaller than the guideline
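
The unit conversion behind the guideline can be verified with a few lines. The sketch uses the exact SI values of the Avogadro constant and the elementary charge and the thermochemical calorie (4.184 J); the function name is hypothetical. The comparison in the usage note assumes the reported force errors are in meV/Å, as the "about 4x smaller" statement implies.

```python
AVOGADRO = 6.02214076e23              # particles per mole (exact SI value)
JOULE_PER_EV = 1.602176634e-19        # joules per electronvolt (exact SI value)
JOULE_PER_KCAL = 4184.0               # thermochemical kilocalorie in joules

def kcal_per_mol_to_mev(value):
    """Convert an energy (or force per Å) from kcal/mol to meV per particle."""
    joule_per_particle = value * JOULE_PER_KCAL / AVOGADRO
    return joule_per_particle / JOULE_PER_EV * 1000.0
```

`kcal_per_mol_to_mev(1.0)` evaluates to roughly 43.4, matching the guideline value, and dividing that by the Allegro-Legato MAE of 11.6 gives a factor of about 3.7.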

Implicit sharpness regularization in Allegro

Allegro models with a larger maximum tensor rank implicitly regularize sharpness, resulting in higher robustness
Allegro-Legato model achieves same level of sharpness as Allegro model with less computing time

Training details

Used default hyperparameters for fair comparison
SAM training uses default optimizer as base optimizer

Applications

Improved robustness of Allegro-Legato model while preserving accuracy and speed of Allegro
Enables large spatiotemporal scale NNQMD simulations on leadership-scale computers
Used to study vibrational properties of ammonia
Accurately reproducing vibrational spectra of molecular crystals and liquids is important for applications in energy, biological, and pharmaceutical systems
Ammonia has a higher energy density than liquid hydrogen and can use existing infrastructure
Nuclear quantum effects and vibrational anharmonicity must be considered when developing computational frameworks
Allegro-Legato model can replace expensive first-principles calculations in PIMD simulations
Performed massively parallel PIMD simulations with Allegro-Legato model to evaluate phonon spectra for inter-molecular modes of ammonia
Allegro-Legato model produces expected softening of high-energy modes at finite temperature with inclusion of nuclear quantum effects

Related work

NNQMD simulations have been developed and applied
Parallel implementation of NNQMD has been developed
Robustness of NNQMD has been quantified
The fidelity-scaling problem has been identified and a solution has been proposed

Conclusion

Proposed SAM-based solution to fidelity-scaling problem improves robustness while preserving accuracy and speed
Significantly lower exponent for Allegro-Legato model delays time-to-failure
Scalable parallel implementation with GPU acceleration
Simulation-time and training-time comparison with reference training times of Allegro models
