Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Neural-network quantum molecular dynamics (NNQMD) simulations based on machine learning are revolutionizing atomistic simulations of materials.
  • Allegro model combines group theory, rotational equivariance and local descriptors for higher accuracy and speed.
  • Allegro-Legato model combines Allegro model with sharpness aware minimization for improved smoothness and robustness.
  • Allegro-Legato model exhibits weaker dependence of time-to-failure on problem size and excellent computational scalability.

Paper Content

Introduction

  • Neural-network quantum molecular dynamics (NNQMD) simulations based on machine learning are revolutionizing atomistic modeling of materials
  • NNQMD predicts accurate interatomic forces and captures quantum properties such as electronic polarization and electronic excitation
  • NNQMD is one of the most scalable scientific applications on high-end supercomputers
  • A breakthrough in NNQMD is improved accuracy of force prediction over previous models
  • The latest NNQMD model, Allegro, combines state-of-the-art accuracy with record speed
  • NNQMD simulations face an unsolved issue known as fidelity scaling
  • A training algorithm, sharpness-aware minimization (SAM), is used to train the Allegro model to enhance its robustness
  • The Allegro-Legato model increases the time-to-failure of NNQMD simulations while maintaining the same inference speed and nearly equal accuracy

Method innovation

Summary of neural-network quantum molecular dynamics

  • MD simulation follows time evolution of positions of atoms
  • Neural networks are trained to reproduce ground-truth QM values
  • Allegro model uses pairwise embedding energies between atomic pairs
  • Allegro attains accuracy through group-theoretical equivariance and speed through data locality
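The first bullet above, following the time evolution of atomic positions, can be sketched with a velocity-Verlet integrator. This is a minimal illustration under stated assumptions: `force_fn` is a hypothetical stand-in for a trained force model, and the harmonic spring used here is a placeholder, not the Allegro model.

```python
def velocity_verlet(positions, velocities, masses, force_fn, dt, n_steps):
    """Advance 1D particles with the velocity-Verlet scheme.

    force_fn is a hypothetical callable mapping a list of positions
    to a list of forces (in NNQMD this role is played by the network).
    """
    x = list(positions)
    v = list(velocities)
    f = force_fn(x)
    for _ in range(n_steps):
        # half-step velocity update with the current forces
        v = [vi + 0.5 * dt * fi / m for vi, fi, m in zip(v, f, masses)]
        # full-step position update
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        # recompute forces at the new positions, then finish the velocity update
        f = force_fn(x)
        v = [vi + 0.5 * dt * fi / m for vi, fi, m in zip(v, f, masses)]
    return x, v


def harmonic_force(x, k=1.0):
    # placeholder force: independent springs pulling each particle to the origin
    return [-k * xi for xi in x]


def total_energy(x, v, masses, k=1.0):
    # kinetic plus potential energy for the placeholder spring model
    kinetic = sum(0.5 * m * vi * vi for m, vi in zip(masses, v))
    potential = sum(0.5 * k * xi * xi for xi in x)
    return kinetic + potential
```

With a small enough time step the total energy of this toy system stays nearly constant, which is the kind of sanity check commonly applied to constant-energy (NVE) runs.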

Summary of sharpness-aware minimization

  • Neural networks are trained by minimizing a loss function.
  • Design choice of optimization methods impacts convergence speed and generalization performance.
  • Adversarial attacks are a problem unique to neural networks.
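For orientation, the standard SAM update can be sketched as two gradient evaluations per step: one to find a perturbed point within radius ρ along the gradient direction, and one at that perturbed point to update the original weights. The sketch below is generic and uses a toy quadratic loss; the names `sam_step`, `grad_fn`, and `quadratic_grad` are hypothetical, and it does not reproduce the training code used in the paper.

```python
import math


def sam_step(w, grad_fn, lr, rho):
    """One sharpness-aware minimization update on a list of parameters w.

    grad_fn(w) is a hypothetical callable returning the loss gradient at w.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # stationary point: no ascent direction is defined
    # ascent step: move to the approximate worst-case point within radius rho
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # descent step: apply the gradient taken at the perturbed point
    # to the original (unperturbed) weights
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def quadratic_grad(w):
    # gradient of the toy loss L(w) = 0.5 * (w[0]**2 + 10 * w[1]**2)
    return [w[0], 10.0 * w[1]]
```

Because each update needs a second gradient evaluation, a SAM step costs roughly twice as much as a step of the base optimizer it wraps.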

Key innovation: Allegro-Legato: SAM-enhanced Allegro

  • Hypothesis: Smoothened loss landscape through SAM enhances fidelity scaling of NNQMD.
  • Tested hypothesis by incorporating SAM into Allegro NNQMD model.
  • Tuned SAM’s hyperparameter ρ to provide most robust model.
  • Found ρ = 0.005 gives longest time-to-failure.
  • Used LAMMPS open-source MD simulation software.

RXMD-NN: scalable parallel implementation of Allegro-Legato NNQMD

  • Implemented Allegro-Legato NNQMD model in RXMD-NN software
  • Hierarchical divide-and-conquer scheme for “globally-scalable and local-fast” parallelization
  • Interprocess communication using non-blocking MPI library
  • CPU responsible for adjacency-list construction in parallel
  • PyTorch tensor object for force inference on GPUs
  • Control computational granularity to find ideal balance between horizontal and vertical scalability
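To make the adjacency-list step concrete, here is a brute-force sketch that lists, for each atom, the neighbors within a cutoff radius. It is an illustration only: the function name and the O(N²) loop are assumptions made for brevity, and the summary does not describe how RXMD-NN actually builds its lists.

```python
def build_adjacency_list(positions, cutoff):
    """Return, for each atom i, the indices of atoms closer than `cutoff`.

    positions is a list of (x, y, z) tuples; no periodic boundaries are applied.
    """
    cutoff_sq = cutoff * cutoff
    n = len(positions)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(i + 1, n):
            xj, yj, zj = positions[j]
            dist_sq = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
            if dist_sq < cutoff_sq:
                # store the pair in both directions so each atom sees all neighbors
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors
```

Production MD codes typically replace the double loop with a cell-based search so that the cost grows linearly with the number of atoms.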

Results

  • Tested fidelity and computational scalability of Allegro-Legato NNQMD model
  • Implemented in RXMD-NN code on Polaris platform at ALCF

Experimental platform

  • Polaris is a Hewlett Packard Enterprise (HPE) Apollo 6500 Gen 10+ based system.
  • It has 560 nodes, each with one 2.8 GHz AMD EPYC Milan 7543P 32-core CPU, 512 GB of DDR4 RAM, four NVIDIA A100 GPUs with 40 GB of HBM2 memory per GPU, two 1.6 TB SSDs in RAID0 and two Slingshot network endpoints.
  • It uses the NVIDIA HGX A100 platform to connect all four GPUs via NVLink, with a GPU interconnect bandwidth of 600 GB/s.
  • Slingshot interconnect is based on high radix 64-port switches arranged in dragonfly topology.
  • Rated at a production peak performance of 44 petaflops with node-wise performance at 78 teraflops for double precision.
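The quoted system-level figure is consistent with the per-node numbers, as a quick arithmetic check shows. The variable names are illustrative.

```python
nodes = 560
gpus_per_node = 4
node_teraflops = 78.0  # double precision, per node, as quoted above

# aggregate peak in petaflops (1 petaflop = 1000 teraflops)
system_petaflops = nodes * node_teraflops / 1000.0
total_gpus = nodes * gpus_per_node

print(f"{system_petaflops:.2f} PF across {total_gpus} GPUs")
```

The product comes to 43.68 petaflops, which rounds to the stated 44.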

Fidelity-scaling results

  • Allegro and Allegro-Legato models are tested for robustness
  • NVT and NVE ensembles are used to study thermal-equilibrium and non-equilibrium properties
  • Simulation instances are thermalized at 200 K for 1000 steps and then switched to NVE
  • Time step of 2 fs is used throughout the test
  • Over ten independent simulations are averaged to measure the time-to-failure, t_failure
  • A fidelity-scaling exponent is defined to quantify how t_failure depends on the number of atoms
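One way to extract such an exponent is a straight-line fit in log-log space. The sketch below assumes a power-law form, t_failure = c · N^(−β), which is an assumption of this example rather than something spelled out in the list above; the function name and the synthetic data in the usage lines are hypothetical and are not results from the paper.

```python
import math


def fit_scaling_exponent(n_atoms, t_failure):
    """Least-squares fit of t_failure = c * N**(-beta) in log-log space.

    Returns (beta, c). Assumes all inputs are positive and that at least
    two distinct system sizes are given.
    """
    xs = [math.log(n) for n in n_atoms]
    ys = [math.log(t) for t in t_failure]
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # the slope of log(t) versus log(N) is -beta
    return -slope, math.exp(intercept)


# synthetic data generated from an exact power law, for illustration only
sizes = [1_000, 10_000, 100_000]
times = [5000.0 * n ** -0.25 for n in sizes]
beta, prefactor = fit_scaling_exponent(sizes, times)
```

Under this convention a smaller β means the time-to-failure degrades more slowly as the system grows.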

Computational-scaling results

  • Measured wall-clock time per MD step with a scaled workload: a 6,912P-atom ammonia system on P MD domains
  • Each MD domain consists of 6,912 atoms offloaded to single GPU
  • Runtime includes force inference, adjacency list construction, data transfer, and internode communication
  • Excellent scalability, 0.91 parallel efficiency for up to 13,271,040 atoms on 1,920 A100 GPUs
  • Fast time-to-solution of 3.46 seconds per MD step
  • GPU acceleration of NNQMD algorithm on single Polaris node with up to 7.6x speedup
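The scaled-workload bookkeeping can be checked in a few lines. The efficiency function uses a common weak-scaling definition (reference time per step divided by time per step on P domains at fixed atoms per domain); that definition and the timings in the usage line are assumptions for illustration, not measurements from the paper.

```python
ATOMS_PER_DOMAIN = 6912  # one MD domain per GPU, as stated above


def total_atoms(n_domains):
    # the workload grows in proportion to the number of MD domains
    return ATOMS_PER_DOMAIN * n_domains


def weak_scaling_efficiency(t_ref, t_p):
    """Ratio of reference wall-clock time per step to the time on P domains."""
    return t_ref / t_p


largest_run_atoms = total_atoms(1920)
example_efficiency = weak_scaling_efficiency(2.0, 2.5)  # hypothetical timings
```

With 1,920 domains the atom count evaluates to 13,271,040, matching the figure quoted in the list.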

Discussion

  • SAM-enhanced Allegro model (Allegro-Legato) is more robust than SOTA Allegro model
  • Examines how SAM training affects accuracy and computational speed

Simulation time

  • MD simulation time is not affected by SAM.
  • GPU acceleration of NNQMD algorithm achieved 7.6x speedup.
  • Default value of 1 used for maximum tensor rank.
  • Larger tensor rank generates more accurate but larger models.

Training time

  • SAM requires more computation time than the base optimizer.
  • SAM converges faster than the default optimizer.
  • Training time increases drastically for larger maximum tensor ranks.
  • Allegro-Legato improves robustness without extra training cost.

Model accuracy

  • Obtained validation error in atomic force of 15.9 meV/Å (RMSE) and 11.6 meV/Å (MAE) with Allegro-Legato model
  • Obtained validation error of 14.7 meV/Å (RMSE) and 10.7 meV/Å (MAE) with original Allegro model
  • Guideline of MAE of 1 kcal/mol/Å for reliable MD simulations (corresponds to 43.4 meV/Å)
  • Allegro-Legato improves robustness without sacrificing accuracy (force error is about 4x smaller than guideline)
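The unit conversion behind the guideline and the "about 4x" comparison can be reproduced directly. The sketch assumes the reported errors are in meV/Å and uses the thermochemical calorie; the function and variable names are illustrative.

```python
AVOGADRO = 6.02214076e23             # particles per mole (exact SI value)
ELEMENTARY_CHARGE = 1.602176634e-19  # joules per electronvolt (exact SI value)
JOULE_PER_KCAL = 4184.0              # thermochemical kilocalorie


def kcal_per_mol_to_mev(value):
    """Convert an energy in kcal/mol to meV per particle."""
    joule_per_particle = value * JOULE_PER_KCAL / AVOGADRO
    return joule_per_particle / ELEMENTARY_CHARGE * 1000.0


# a force of 1 kcal/mol/Å expressed in meV/Å
guideline = kcal_per_mol_to_mev(1.0)

# how many times smaller the reported MAE values are than the guideline
ratio_legato = guideline / 11.6   # Allegro-Legato MAE
ratio_allegro = guideline / 10.7  # original Allegro MAE
```

The guideline evaluates to roughly 43.4 meV/Å, and both ratios fall between 3.7 and 4.1.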

Implicit sharpness regularization in Allegro

  • Allegro models with a larger maximum tensor rank implicitly regularize sharpness, resulting in higher robustness
  • Allegro-Legato model achieves same level of sharpness as Allegro model with less computing time

Training details

  • Used default hyperparameters for fair comparison
  • SAM training uses default optimizer as base optimizer

Applications

  • Improved robustness of Allegro-Legato model while preserving accuracy and speed of Allegro
  • Enables large spatiotemporal scale NNQMD simulations on leadership-scale computers
  • Used to study vibrational properties of ammonia
  • Accurately reproducing vibrational spectra of molecular crystals and liquids is important for applications in energy, biological, and pharmaceutical systems
  • Ammonia has a higher energy density than liquid hydrogen and can rely on existing infrastructure
  • Nuclear quantum effects and vibrational anharmonicity must be considered when developing computational frameworks
  • Allegro-Legato model can replace expensive first-principles calculations in PIMD simulations
  • Performed massively parallel PIMD simulations with Allegro-Legato model to evaluate phonon spectra for inter-molecular modes of ammonia
  • Allegro-Legato model produces expected softening of high-energy modes at finite temperature with inclusion of nuclear quantum effects
  • NNQMD simulations have been developed and applied
  • Parallel implementation of NNQMD has been developed
  • Robustness of NNQMD has been quantified
  • Fidelity-scaling problem has been identified and a solution has been proposed

Conclusion

  • Proposed SAM-based solution to fidelity-scaling problem improves robustness while preserving accuracy and speed
  • Significantly lower exponent for Allegro-Legato model delays time-to-failure
  • Scalable parallel implementation with GPU acceleration
  • Simulation-time and training-time comparison with reference training times of Allegro models