Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Generating a Boltzmann distribution in high dimension has been achieved with Normalizing Flows.
  • Current implementations rely on accurate training data.
  • There is an incentive to train models with incomplete or no data.
  • Standard losses based on Kullback-Leibler divergences have limitations.
  • Strategies to alleviate these issues have been proposed.
  • Imperfect pre-trained models can be further optimized in the absence of training data.

Paper Content

Introduction

  • Statistical physics properties of materials and molecular systems are expressed as expectations over probability distributions.
  • Monte Carlo averaging is used to estimate these expectations.
  • Boltzmann distribution is used to characterize systems at equilibrium with a thermostat.
  • Normalizing Flows are used to generate samples from the Boltzmann distribution.
  • Training generative models on high-dimensional distributions is difficult.
  • Training them in the absence of complete reference data is an unsolved problem.
  • Standard loss function relies on samples from the target distribution.
  • Data-free loss function uses samples from the generated distribution.
  • Standard loss is stable and leads to good performance.
  • Data-free loss is more erratic and often leads to mode collapse.
  • Data-free loss is made more stable by leveraging importance sampling.
  • Alignment penalty is used to discourage translations and rotations.
  • Unnecessary degrees of freedom need to be removed for better performance.
  • KL divergence optimization suffers from issues when discretized over minibatches.

Estimator variance as a loss

  • Normalizing flows can compute exact probabilities for any given point.
  • Importance sampling can be used to correct the sampler based on the trained generator.
  • A generator can be used to estimate integral quantities of the form E p B [f ].
  • A training loss can be used to optimize the generator to minimize the variance of estimators of expectations.

Practical recommendations

  • Avoiding unnecessary symmetries in target distribution can be beneficial
  • Hydrogen atoms can often be ignored
  • Flat degrees of freedom should be removed
  • Alignment penalty or generating configurations in internal coordinates can be used
  • Loss function can suffer from training instabilities due to high variance of importance sampling weights
  • Potential energy term can introduce training instabilities
  • Potential energy should not be increased explicitly
  • Decreasing probability distribution should be avoided
  • Log-ratios used instead of ratios for numerical reasons

Technical details

  • Data generated by Metropolis-Hastings simulations with Parallel Tempering
  • Potential energy evaluated according to CHARMM36m force field
  • Alignment penalty used with maximum rotation of π/3
  • Two model architectures used
  • No error bars provided, but experiments reproducible
  • Experiments performed on two GPUs, 40m for Double Well 12D, 7-10 hours for Butane and Dialanine

Conclusion and perspectives

  • Explored conditions necessary for training/refining flow-based models
  • Found several losses lead to numerical failures in discrete setting
  • Major instability issue when optimizing KL divergence between generated and target distributions
  • Loss functions push model to spread local mass in improbable directions, resulting in instability
  • Estimator variance minimization approach derived a stable data-free loss based on L2 distances
  • Mask must be applied to follow criterion
  • Stable optimization of correctly trained model
  • Lifting requirement for complete reference data requires training protocol to explore target space

E analysis of the bias when generating deterministic, minimum-energy hydrogen coordinates

  • Two-stage architecture generates heavy atom coordinates and hydrogen atoms
  • Heavy atom coordinates are generated by generator G and hydrogen atoms by auxiliary neural network h
  • Generator G is bijective, but complete pipeline hG is not
  • Desirable target for generated distribution is marginal of target with respect to heavy atom coordinates
  • Characterize convergence by computing probability of minor mode of dialanine
  • Minor mode is defined solely based on values of x C
  • Probability of generation of all-atom configuration is non-zero only on minimum-energy-hydrogen manifold
  • Probability of minor mode as generated by perfectly trained network is estimated by importance sampling estimator

F.1 integral quantity of interest

  • Generator is used to estimate integral quantities
  • True value of the generator is denoted by Q
  • Generator is exact when p G is not 0 and p B is not

F.2 estimation by sampling

  • Estimate Q by sampling
  • Mini-batch of points sampled according to pG
  • Do not know pB, only pB = ZBpB

F.3 this estimator is unbiased

  • The estimator is unbiased, meaning that for large mini-batches, the estimate tends to the true value.
  • The convergence rate is typically in O(1/ √ N).

Variance of the estimator

  • Estimate Q may converge faster for some distributions than others
  • Variance of estimator Q should be as small as possible
  • Gap between estimate Q and real value Q is of the order of magnitude of V
  • Can we train p G to minimize V?

F.5 reducing variances over mini-batches to variances over single samples

  • Variance over mini-batch size is minimized
  • All points are sampled identically
  • Variance behaves as O(1/N)
  • Quality of generator can be quantified as a loss
  • Special case of estimating free energy differences
  • Log-ratios of target and generated densities can be computed
  • Pairwise L2 loss is defined
  • Masked L2 loss with detached means is defined
  • Variation of K is non-negative
  • K increases with time and converges
  • Potential energy of generated samples is stable when K is detached

I optimization pitfalls induced by the discretization of distributions into minibatches

  • Optimization pitfalls encountered during gradient descents over “distances” or divergences between probability distributions
  • Search for new optimization criteria with better optimization properties

I.1 discretization issues with kullback-leibler and remedies

  • Discretization and normalization issues can lead to a pitfall even if a parameterized model is used.
  • The total mass of a distribution discretized on a minibatch is not 1, even if normalized by the number of samples.
  • Gradient descent can lead to exploding dynamics if the minibatch is not properly normalized.

I.2 reducing the variance of estimators induced by discretization

  • Losses vs. estimators of them by discretization
  • Do not confuse quantity with estimator
  • Do not confuse gradient of estimator with estimator of gradient
  • Estimate gradient of KL(p B p G ) w.r.t. generator parameters
  • Stabilizing trick to reduce estimator variance
  • KL(p||q) is a divergence, not a distance

J.4 gradient of the estimator vs. estimator of the gradient

  • Gradient ∇ L 2 q does not take into account that q should sum up to 1
  • Gradient ∇ L 2 q should be projected onto set of possible variations of q
  • Gradient ∇ L 2 θ of KL(p d q d ) between discretized distributions is i d log q(xi) dθ q d (x i ) − p d (x i )
  • Discretization of gradient ∇ L 2 θ of KL(p d q) misses a term, leading to positive additive term in gradient descent
  • Normalizing flows ensure all q θ are probability distributions
  • Estimation of gradient A of form X dq dθ f should be done with formula
  • Variance reduction due to stabilizing trick is 1
  • Stabilizing trick removes normalization mistakes on average
  • Figures 1-3 show results of fine-tunings with LKLx and L df KLz after pre-trainings with LKLz