Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- NNQMD simulations based on machine learning are revolutionizing atomistic simulations of materials.
- Allegro model combines group theory, rotational equivariance and local descriptors for higher accuracy and speed.
- Allegro-Legato model combines Allegro model with sharpness aware minimization for improved smoothness and robustness.
- Allegro-Legato model exhibits weaker dependence of time-to-failure on problem size and excellent computational scalability.
Paper Content
Introduction
- Neural-network quantum molecular dynamics (NNQMD) simulations based on machine learning are revolutionizing atomistic modeling of materials
- NNQMD predicts accurate interatomic forces and captures quantum properties such as electronic polarization and electronic excitation
- NNQMD is one of the most scalable scientific applications on high-end supercomputers
- A breakthrough in NNQMD is improved accuracy of force prediction over previous models
- The latest NNQMD model, Allegro, combines state-of-the-art accuracy with record speed
- NNQMD simulations face an unsolved issue known as fidelity scaling
- A training algorithm, sharpness-aware minimization (SAM), is used to train the Allegro model to enhance its robustness
- The Allegro-Legato model increases the time-to-failure of NNQMD simulations while maintaining the same inference speed and nearly equal accuracy
Method innovation
Summary of neural-network quantum molecular dynamics
- MD simulation follows time evolution of positions of atoms
- Neural networks are trained to reproduce ground-truth QM values
- Allegro model uses pairwise embedding energies between atomic pairs
- Allegro attains accuracy through group-theoretical equivariance and speed through data locality
Summary of sharpness-aware minimization
- Neural networks are trained by minimizing a loss function.
- Design choice of optimization methods impacts convergence speed and generalization performance.
- Adversarial attacks are a problem unique to neural networks.
Key innovation: allegro-legato: sam-enhanced allegro
- Hypothesis: Smoothened loss landscape through SAM enhances fidelity scaling of NNQMD.
- Tested hypothesis by incorporating SAM into Allegro NNQMD model.
- Tuned SAM’s hyperparameter ρ to provide most robust model.
- Found ρ = 0.005 gives longest time-to-failure.
- Used LAMMPS open-source MD simulation software.
Rxmd-nn: scalable parallel implementation of allegro-legato nnqmd
- Implemented Allegro-Legato NNQMD model in RXMD-NN software
- Hierarchical divide-and-conquer scheme for “globally-scalable and local-fast” parallelization
- Interprocess communication using non-blocking MPI library
- CPU responsible for adjacency-list construction in parallel
- PyTorch tensor object for force inference on GPUs
- Control computational granularity to find ideal balance between horizontal and vertical scalability
Results
- Tested fidelity and computational scalability of Allegro-Legato NNQMD model
- Implemented in RXMD-NN code on Polaris platform at ALCF
Experimental platform
- Polaris is a Hewlett Packard Enterprise (HPE) Apollo 6500 Gen 10+ based system.
- It has 560 nodes, each with one 2.8GHz AMD EPYC Milan 7543P 32-core CPU, 512 GB of DDR4 RAM, four NVIDIA A100 GPUs with 40GB HBM2 memory per GPU, two 1.6 TB of SSDs in RAID0 and two Slingshot network endpoints.
- It uses the NVIDIA A 100 HGX platform to connect all 4 GPUs via NVLink, with a GPU interconnect bandwidth of 600 GB/s.
- Slingshot interconnect is based on high radix 64-port switches arranged in dragonfly topology.
- Rated at a production peak performance of 44 petaflops with node-wise performance at 78 teraflops for double precision.
Fidelity-scaling results
- Allegro and Allegro-Legato models are tested for robustness
- NVT and NVE ensembles are used to study thermal-equilibrium and non-equilibrium properties
- Simulation instances are thermalized at 200K for 1000 steps and then switched to NVE
- Time step of 2fs is used throughout the test
- Over ten independent simulations are averaged to measure t failure
- Fidelity scaling exponent is defined to quantify fidelity scaling
Computational-scaling results
- Measured wall-clock time per MD step with scaled workload of 6,912 P-atom ammonia system on P MD domains
- Each MD domain consists of 6,912 atoms offloaded to single GPU
- Runtime includes force inference, adjacency list construction, data transfer, and internode communication
- Excellent scalability, 0.91 parallel efficiency for up to 13,271,040 atoms on 1,920 A100 GPUs
- Fast time-to-solution of 3.46 seconds per MD step
- GPU acceleration of NNQMD algorithm on single Polaris node with up to 7.6x speedup
Discussion
- SAM-enhanced Allegro model (Allegro-Legato) is more robust than SOTA Allegro model
- SAM training affects accuracy and computational speed
Simulation time
- MD simulation time is not affected by SAM.
- GPU acceleration of NNQMD algorithm achieved 7.6x speedup.
- Default value of 1 used for maximum tensor rank.
- Larger tensor rank generates more accurate but larger models.
Training time
- SAM requires more computation time than the base optimizer.
- SAM converges faster than the default optimizer.
- Training time increases drastically for larger maximum tensor ranks.
- Allegro-Legato improves robustness without extra training cost.
Model accuracy
- Obtained validation error in atomic force of 15.9 (RMSE) and 11.6 (MAE) with Allegro-Legato model
- Obtained validation error of 14.7 (RMSE) and 10.7 (MAE) with original Allegro model
- Guideline of MAE of 1kcal/mol/Å for reliable MD simulations (corresponds to 43.4meV/Å)
- Allegro-Legato improves robustness without sacrificing accuracy (force error is about 4x smaller than guideline)
Implicit sharpness regularization in allegro
- Allegro models with larger implicitly regulate sharpness, resulting in higher robustness
- Allegro-Legato model achieves same level of sharpness as Allegro model with less computing time
Training details
- Used default hyperparameters for fair comparison
- SAM training uses default optimizer as base optimizer
Applications
- Improved robustness of Allegro-Legato model while preserving accuracy and speed of Allegro
- Enables large spatiotemporal scale NNQMD simulations on leadership-scale computers
- Used to study vibrational properties of ammonia
- Accurately reproducing vibrational spectra of molecular crystals and liquids is important for applications in energy, biological, and pharmaceutical systems
- Ammonia has higher energy density than liquid hydrogen and existing infrastructure
- Nuclear quantum effects and vibrational anharmonicity must be considered when developing computational frameworks
- Allegro-Legato model can replace expensive first-principles calculations in PIMD simulations
- Performed massively parallel PIMD simulations with Allegro-Legato model to evaluate phonon spectra for inter-molecular modes of ammonia
- Allegro-Legato model produces expected softening of high-energy modes at finite temperature with inclusion of nuclear quantum effects
Related work
- NNQMD simulations have been developed and applied
- Parallel implementation of NNQMD has been developed
- Robustness of NNQMD has been quantified
- Fidelity-scaling problem has been identified and proposed solution has been suggested
Conclusion
- Proposed SAM-based solution to fidelity-scaling problem improves accuracy and speed
- Significantly lower exponent for Allegro-Legato model delays time-to-failure
- Scalable parallel implementation with GPU acceleration
- Simulation-time and training-time comparison with reference training times of Allegro models