Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Generating a Boltzmann distribution in high dimension has been achieved with Normalizing Flows.
- Current implementations rely on accurate training data.
- There is an incentive to train models with incomplete or no data.
- Standard losses based on Kullback-Leibler divergences have limitations.
- Strategies to alleviate these issues have been proposed.
- Imperfect pre-trained models can be further optimized in the absence of training data.
Paper Content
Introduction
- Statistical physics properties of materials and molecular systems are expressed as expectations over probability distributions.
- Monte Carlo averaging is used to estimate these expectations.
- Boltzmann distribution is used to characterize systems at equilibrium with a thermostat.
- Normalizing Flows are used to generate samples from the Boltzmann distribution.
- Training generative models on high-dimensional distributions is difficult.
- Training them in the absence of complete reference data is an unsolved problem.
- Standard loss function relies on samples from the target distribution.
- Data-free loss function uses samples from the generated distribution.
- Standard loss is stable and leads to good performance.
- Data-free loss is more erratic and often leads to mode collapse.
- Data-free loss is made more stable by leveraging importance sampling.
- Alignment penalty is used to discourage translations and rotations.
- Unnecessary degrees of freedom need to be removed for better performance.
- KL divergence optimization suffers from issues when discretized over minibatches.
Estimator variance as a loss
- Normalizing flows can compute exact probabilities for any given point.
- Importance sampling can be used to correct the sampler based on the trained generator.
- A generator can be used to estimate integral quantities of the form $\mathbb{E}_{p_B}[f]$.
- A training loss can be used to optimize the generator to minimize the variance of estimators of expectations.
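As a minimal sketch of the importance-sampling idea in this section, the toy example below (not taken from the paper) uses a 1D standard Gaussian in place of $p_B$ and a wider, shifted Gaussian in place of the generator density $p_G$. All function names and parameter values are hypothetical. It estimates $\mathbb{E}_{p_B}[f]$ from generator samples and also returns the empirical variance of the weighted terms, which is the kind of quantity a variance-based loss would try to reduce.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a 1D Gaussian N(mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def importance_estimate(f, n_samples, rng):
    """Estimate E_{p_B}[f] with samples from p_G, reweighted by p_B / p_G.

    Toy assumption: p_B = N(0, 1) (target), p_G = N(0.5, 1.5^2) (generator).
    Returns the estimate and the empirical variance of the weighted terms.
    """
    terms = []
    for _ in range(n_samples):
        x = rng.gauss(0.5, 1.5)  # x ~ p_G
        w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.5, 1.5)  # p_B(x) / p_G(x)
        terms.append(w * f(x))
    mean = sum(terms) / n_samples
    var = sum((t - mean) ** 2 for t in terms) / (n_samples - 1)
    return mean, var

rng = random.Random(0)
estimate, single_sample_var = importance_estimate(lambda x: x * x, 20000, rng)
print(estimate, single_sample_var)
```

For this toy target the exact value of $\mathbb{E}_{p_B}[x^2]$ is 1, so the printed estimate should be close to 1. A generator closer to the target would give a smaller single-sample variance.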
Practical recommendations
- Avoiding unnecessary symmetries in target distribution can be beneficial
- Hydrogen atoms can often be ignored
- Flat degrees of freedom should be removed
- Alignment penalty or generating configurations in internal coordinates can be used
- Loss function can suffer from training instabilities due to high variance of importance sampling weights
- Potential energy term can introduce training instabilities
- Potential energy should not be increased explicitly
- Decreasing probability distribution should be avoided
- Log-ratios used instead of ratios for numerical reasons
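The last recommendation can be illustrated with a short sketch. It assumes hypothetical log-ratios $\log(\tilde{p}_B(x_i)/p_G(x_i))$ of large magnitude, as can occur in high dimension. Exponentiating them directly underflows, whereas shifting by the maximum first (the log-sum-exp trick) yields usable normalized weights. The function name is illustrative and not from the paper.

```python
import math

def normalized_weights_from_logs(log_ratios):
    """Turn log importance ratios into weights summing to 1.

    Subtracting the maximum before exponentiating (the log-sum-exp trick)
    avoids overflow and underflow for large-magnitude log-ratios.
    """
    m = max(log_ratios)
    shifted = [math.exp(lr - m) for lr in log_ratios]
    total = sum(shifted)
    return [s / total for s in shifted]

# Hypothetical log-ratios with large magnitude
log_ratios = [-1200.0, -1201.0, -1203.0]

# Naive ratios underflow to exactly 0.0, so they cannot be normalized
naive = [math.exp(lr) for lr in log_ratios]

weights = normalized_weights_from_logs(log_ratios)
print(naive, weights)
```

The naive ratios are all zero, while the log-space version still recovers the relative importance of the three samples.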
Technical details
- Data generated by Metropolis-Hastings simulations with Parallel Tempering
- Potential energy evaluated according to CHARMM36m force field
- Alignment penalty used with maximum rotation of π/3
- Two model architectures used
- No error bars provided, but experiments reproducible
- Experiments performed on two GPUs: 40 minutes for Double Well 12D, 7-10 hours for Butane and Dialanine
Conclusion and perspectives
- Explored conditions necessary for training/refining flow-based models
- Found several losses lead to numerical failures in discrete setting
- Major instability issue when optimizing KL divergence between generated and target distributions
- Loss functions push model to spread local mass in improbable directions, resulting in instability
- Estimator variance minimization approach derived a stable data-free loss based on L2 distances
- Mask must be applied to follow criterion
- Stable optimization of correctly trained model
- Lifting requirement for complete reference data requires training protocol to explore target space
E analysis of the bias when generating deterministic, minimum-energy hydrogen coordinates
- Two-stage architecture generates heavy atom coordinates and hydrogen atoms
- Heavy atom coordinates are generated by generator G and hydrogen atoms by auxiliary neural network h
- Generator G is bijective, but the complete pipeline combining h and G is not
- Desirable target for generated distribution is marginal of target with respect to heavy atom coordinates
- Characterize convergence by computing probability of minor mode of dialanine
- Minor mode is defined solely based on values of $x_C$
- Probability of generation of all-atom configuration is non-zero only on minimum-energy-hydrogen manifold
- Probability of minor mode as generated by perfectly trained network is estimated by importance sampling estimator
F.1 integral quantity of interest
- Generator is used to estimate integral quantities
- True value of the integral quantity is denoted by $Q$
- The importance-sampling rewriting is exact when $p_G$ is non-zero wherever $p_B$ is non-zero
F.2 estimation by sampling
- Estimate Q by sampling
- Mini-batch of points sampled according to $p_G$
- $p_B$ itself is not known, only the unnormalized density $\tilde{p}_B = Z_B \, p_B$
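A minimal sketch of estimation by sampling when only an unnormalized target is available, under toy assumptions: the unnormalized density is $\exp(-x^2/2)$ and the generator is a Gaussian of standard deviation 2. The self-normalized form used here is one common way to handle the unknown constant $Z_B$ and is not necessarily the exact estimator of the paper. All names are hypothetical.

```python
import math
import random

def unnormalized_target(x):
    """Toy unnormalized density exp(-x^2 / 2): a standard Gaussian
    with its normalization constant dropped."""
    return math.exp(-0.5 * x * x)

def generator_pdf(x, std=2.0):
    """Toy generator density p_G = N(0, std^2), known exactly."""
    return math.exp(-0.5 * (x / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def self_normalized_estimate(f, n_samples, rng, std=2.0):
    """Estimate Q = E_{p_B}[f] from a mini-batch sampled according to p_G.

    The unknown constant Z_B cancels in the ratio of the two sums.
    """
    xs = [rng.gauss(0.0, std) for _ in range(n_samples)]
    ws = [unnormalized_target(x) / generator_pdf(x, std) for x in xs]
    return sum(w * f(x) for w, x in zip(ws, xs)) / sum(ws)

rng = random.Random(1)
q_hat = self_normalized_estimate(lambda x: x * x, 20000, rng)
print(q_hat)
```

With $f(x) = x^2$ the exact value is 1 for this toy target, and the estimate approaches it as the mini-batch grows.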
F.3 this estimator is unbiased
- The estimator is unbiased, meaning that for large mini-batches, the estimate tends to the true value.
- The convergence rate is typically in $O(1/\sqrt{N})$.
F.4 variance of the estimator
- Estimate $\hat{Q}$ may converge faster for some distributions than others
- Variance of the estimator $\hat{Q}$ should be as small as possible
- Gap between the estimate $\hat{Q}$ and the real value $Q$ is of the order of magnitude of $V$
- Can we train $p_G$ to minimize $V$?
F.5 reducing variances over mini-batches to variances over single samples
- Variance over mini-batch size is minimized
- All points are sampled identically
- Variance behaves as O(1/N)
- Quality of generator can be quantified as a loss
- Special case of estimating free energy differences
- Log-ratios of target and generated densities can be computed
- Pairwise L2 loss is defined
- Masked L2 loss with detached means is defined
- Variation of K is non-negative
- K increases with time and converges
- Potential energy of generated samples is stable when K is detached
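To make the pairwise L2 idea concrete, here is a hedged sketch that assumes the loss is the mean squared difference between the log-ratios of all pairs of samples in a mini-batch. The exact definition, masking and detached means used in the paper may differ. This assumed form is insensitive to a constant offset such as an unknown $\log Z_B$, and it equals twice the population variance of the log-ratios.

```python
def pairwise_l2_loss(log_ratios):
    """Mean over all ordered pairs (i, j) of (l_i - l_j)^2.

    l_i stands for log p_B~(x_i) - log p_G(x_i) on a mini-batch.
    Assumed form, for illustration only.
    """
    n = len(log_ratios)
    total = 0.0
    for li in log_ratios:
        for lj in log_ratios:
            total += (li - lj) ** 2
    return total / (n * n)

def variance(values):
    """Population variance (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

log_ratios = [0.3, -0.1, 0.8, 0.2]       # hypothetical mini-batch values
shifted = [l + 5.0 for l in log_ratios]  # mimics an unknown log Z_B offset

loss = pairwise_l2_loss(log_ratios)
print(loss, pairwise_l2_loss(shifted), 2.0 * variance(log_ratios))
```

The three printed numbers agree up to rounding. The loss vanishes only when all log-ratios are equal, that is, when the generated density is proportional to the target on the mini-batch.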
I optimization pitfalls induced by the discretization of distributions into minibatches
- Optimization pitfalls encountered during gradient descents over “distances” or divergences between probability distributions
- Search for new optimization criteria with better optimization properties
I.1 discretization issues with kullback-leibler and remedies
- Discretization and normalization issues can lead to a pitfall even if a parameterized model is used.
- The total mass of a distribution discretized on a minibatch is not 1, even if normalized by the number of samples.
- Gradient descent can lead to exploding dynamics if the minibatch is not properly normalized.
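One way to read the normalization issue is sketched below under toy assumptions: a standard Gaussian $p$ is evaluated on mini-batches drawn from a wider Gaussian $q$, and its Monte Carlo mass $(1/N)\sum_i p(x_i)/q(x_i)$ is computed. This interpretation of "discretized on a minibatch" and all names are assumptions for illustration. The mass equals 1 only in expectation and fluctuates from one mini-batch to the next.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a 1D Gaussian N(mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def minibatch_mass(n_samples, rng):
    """Monte Carlo mass of p = N(0, 1) on a mini-batch drawn from q = N(0, 2^2).

    (1/N) * sum_i p(x_i) / q(x_i) equals 1 only in expectation.
    """
    xs = [rng.gauss(0.0, 2.0) for _ in range(n_samples)]
    ratios = [normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0) for x in xs]
    return sum(ratios) / n_samples

rng = random.Random(2)
masses = [minibatch_mass(32, rng) for _ in range(5)]  # small mini-batches
large_batch_mass = minibatch_mass(50000, random.Random(3))
print(masses, large_batch_mass)
```

Small mini-batches give masses scattered around 1, while a very large batch gets close to 1. A loss that implicitly assumes unit mass on every mini-batch therefore carries a batch-dependent error.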
I.2 reducing the variance of estimators induced by discretization
- Losses vs. estimators of them by discretization
- Do not confuse quantity with estimator
- Do not confuse gradient of estimator with estimator of gradient
- Estimate gradient of $\mathrm{KL}(p_B \,\|\, p_G)$ w.r.t. generator parameters
- Stabilizing trick to reduce estimator variance
- KL(p||q) is a divergence, not a distance
J.4 gradient of the estimator vs. estimator of the gradient
- Gradient $\nabla^{L^2}_q$ does not take into account that $q$ should sum up to 1
- Gradient $\nabla^{L^2}_q$ should be projected onto the set of possible variations of $q$
- Gradient $\nabla^{L^2}_\theta$ of $\mathrm{KL}(p_d \,\|\, q_d)$ between discretized distributions is $\sum_i \frac{d \log q(x_i)}{d\theta} \left( q_d(x_i) - p_d(x_i) \right)$
- Discretization of gradient $\nabla^{L^2}_\theta$ of $\mathrm{KL}(p_d \,\|\, q)$ misses a term, leading to a positive additive term in gradient descent
- Normalizing flows ensure all q θ are probability distributions
- A gradient $A$ of the form $\sum \frac{dq}{d\theta} f$ should be estimated with a dedicated formula
- Variance reduction due to stabilizing trick is 1
- Stabilizing trick removes normalization mistakes on average
- Figures 1-3 show results of fine-tunings with $L_{KL_x}$ and $L^{df}_{KL_z}$ after pre-trainings with $L_{KL_z}$
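The gradient of the KL divergence between discretized distributions, $\sum_i \frac{d \log q(x_i)}{d\theta} \left( q_d(x_i) - p_d(x_i) \right)$, can be checked numerically. The sketch below assumes a toy model $q_\theta = N(\theta, 1)$, a fixed set of mini-batch points, and discretized distributions obtained by normalizing the densities over those points. These choices and all names are illustrative assumptions. The analytic expression is compared with a central finite difference.

```python
import math

def log_q(x, theta):
    """Log-density of the toy model q_theta = N(theta, 1)."""
    return -0.5 * (x - theta) ** 2 - 0.5 * math.log(2.0 * math.pi)

def discretize(log_values):
    """Normalize densities evaluated on a mini-batch so they sum to 1."""
    m = max(log_values)
    exps = [math.exp(v - m) for v in log_values]
    total = sum(exps)
    return [e / total for e in exps]

def kl_discrete(xs, p_d, theta):
    """KL(p_d || q_d) between distributions discretized on the points xs."""
    q_d = discretize([log_q(x, theta) for x in xs])
    return sum(p * math.log(p / q) for p, q in zip(p_d, q_d))

def kl_gradient(xs, p_d, theta):
    """sum_i dlog q(x_i)/dtheta * (q_d(x_i) - p_d(x_i)).

    For this toy model, dlog q(x)/dtheta = x - theta.
    """
    q_d = discretize([log_q(x, theta) for x in xs])
    return sum((x - theta) * (q - p) for x, p, q in zip(xs, p_d, q_d))

xs = [-1.3, -0.2, 0.4, 1.1, 2.0]               # fixed mini-batch points
p_d = discretize([log_q(x, 0.0) for x in xs])  # target p = N(0, 1), discretized
theta, eps = 0.7, 1e-5

analytic = kl_gradient(xs, p_d, theta)
numeric = (kl_discrete(xs, p_d, theta + eps)
           - kl_discrete(xs, p_d, theta - eps)) / (2 * eps)
print(analytic, numeric)
```

The two values agree to numerical precision, and the gradient vanishes at $\theta = 0$, where $q_d = p_d$. Dropping the $q_d(x_i)$ part of the sum would leave only the term weighted by $p_d(x_i)$, which no longer vanishes at the optimum in this toy setting.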