Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Generating a Boltzmann distribution in high dimension has been achieved with Normalizing Flows.
Current implementations rely on accurate training data.
There is an incentive to train models with incomplete or no data.
Standard losses based on Kullback-Leibler divergences have limitations.
Strategies to alleviate these issues have been proposed.
Imperfect pre-trained models can be further optimized in the absence of training data.

Paper Content

Introduction

Statistical physics properties of materials and molecular systems are expressed as expectations over probability distributions.
Monte Carlo averaging is used to estimate these expectations.
The Boltzmann distribution characterizes systems at equilibrium with a thermostat.
Normalizing Flows are used to generate samples from the Boltzmann distribution.
Training generative models on high-dimensional distributions is difficult.
Training them in the absence of complete reference data is an unsolved problem.
The standard loss function relies on samples from the target distribution.
The data-free loss function uses samples from the generated distribution.
The standard loss is stable and leads to good performance.
The data-free loss is more erratic and often leads to mode collapse.
The data-free loss is made more stable by leveraging importance sampling.
An alignment penalty is used to discourage translations and rotations.
Unnecessary degrees of freedom need to be removed for better performance.
KL divergence optimization suffers from issues when discretized over minibatches.
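As a rough sketch of the two losses contrasted above, and assuming the usual Boltzmann-generator conventions (the symbols $U$, $k_B T$, $Z_B$ and the loss names are notation introduced here for illustration, not taken from the summary), the two directions of the KL divergence can be written as:

```latex
\begin{align}
p_B(x) &= \frac{1}{Z_B} \exp\!\left(-\frac{U(x)}{k_B T}\right), \\
\mathcal{L}_{\text{data}} &= \mathrm{KL}(p_B \,\|\, p_G)
  = \mathbb{E}_{x \sim p_B}\left[\log p_B(x) - \log p_G(x)\right], \\
\mathcal{L}_{\text{data-free}} &= \mathrm{KL}(p_G \,\|\, p_B)
  = \mathbb{E}_{x \sim p_G}\left[\log p_G(x) + \frac{U(x)}{k_B T}\right] + \log Z_B .
\end{align}
```

The first expectation needs samples from the target $p_B$, while the second only needs samples from the generator and evaluations of the potential energy; the constant $\log Z_B$ does not affect gradients.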

Estimator variance as a loss

Normalizing flows can compute exact probabilities for any given point.
Importance sampling can be used to correct the sampler based on the trained generator.
A generator can be used to estimate integral quantities of the form $\mathbb{E}_{p_B}[f]$.
A training loss can be used to optimize the generator to minimize the variance of estimators of expectations.
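To make the importance-sampling correction concrete, here is a minimal NumPy sketch. It assumes a toy 1D double-well potential as the target and a plain Gaussian standing in for the trained generator; both choices, and all function names, are hypothetical and only serve to illustrate the reweighting step.

```python
import numpy as np

def log_target_unnormalized(x):
    """Log of an unnormalized Boltzmann density exp(-U(x)), with U(x) = 2 (x^2 - 1)^2."""
    return -2.0 * (x**2 - 1.0) ** 2

def log_gaussian(x, mean, std):
    """Exact log-density of the stand-in generator (a Gaussian)."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))

def importance_estimate(f, n_samples, rng, mean=0.0, std=1.5):
    """Self-normalized importance-sampling estimate of E_{p_B}[f]."""
    x = rng.normal(mean, std, size=n_samples)            # samples from the generator
    log_w = log_target_unnormalized(x) - log_gaussian(x, mean, std)
    w = np.exp(log_w - log_w.max())                      # shift for numerical stability
    w /= w.sum()                                         # normalization removes the unknown Z_B
    return np.sum(w * f(x))

rng = np.random.default_rng(0)
estimate_x = importance_estimate(lambda x: x, 200_000, rng)
estimate_x2 = importance_estimate(lambda x: x**2, 200_000, rng)
print(estimate_x, estimate_x2)
```

Because the toy target is symmetric, the estimate of the mean should be close to zero, which gives a quick sanity check on the weights.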

Practical recommendations

Avoiding unnecessary symmetries in target distribution can be beneficial
Hydrogen atoms can often be ignored
Flat degrees of freedom should be removed
Alignment penalty or generating configurations in internal coordinates can be used
Loss function can suffer from training instabilities due to high variance of importance sampling weights
Potential energy term can introduce training instabilities
Potential energy should not be increased explicitly
Decreasing probability distribution should be avoided
Log-ratios used instead of ratios for numerical reasons
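The last recommendation can be illustrated with a short NumPy sketch. The log-density values below are made up for the example; the point is only that exponentiating very negative log-densities underflows, while working with log-ratios and shifting by the maximum stays finite.

```python
import numpy as np

def normalized_weights_naive(log_p_target, log_p_gen):
    """Importance weights from raw density ratios (prone to underflow)."""
    ratios = np.exp(log_p_target) / np.exp(log_p_gen)
    return ratios / ratios.sum()

def normalized_weights_log(log_p_target, log_p_gen):
    """Importance weights computed from log-ratios, shifted by their maximum."""
    log_ratio = log_p_target - log_p_gen
    w = np.exp(log_ratio - log_ratio.max())
    return w / w.sum()

# Hypothetical log-densities of a high-dimensional system: both are very negative.
log_p_target = np.array([-1000.0, -1001.0, -1002.0])
log_p_gen = np.array([-1000.5, -1000.5, -1000.5])

with np.errstate(invalid="ignore", divide="ignore"):
    naive = normalized_weights_naive(log_p_target, log_p_gen)  # 0/0 gives NaN
stable = normalized_weights_log(log_p_target, log_p_gen)

print(naive)
print(stable)
```

The naive version returns NaN for every weight, whereas the log-ratio version yields valid weights that sum to one.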

Technical details

Data generated by Metropolis-Hastings simulations with Parallel Tempering
Potential energy evaluated according to CHARMM36m force field
Alignment penalty used with maximum rotation of π/3
Two model architectures used
No error bars provided, but experiments reproducible
Experiments performed on two GPUs: 40 minutes for Double Well 12D, 7-10 hours for Butane and Dialanine

Conclusion and perspectives

Explored conditions necessary for training/refining flow-based models
Found several losses lead to numerical failures in discrete setting
Major instability issue when optimizing KL divergence between generated and target distributions
Loss functions push model to spread local mass in improbable directions, resulting in instability
Estimator variance minimization approach derived a stable data-free loss based on L2 distances
Mask must be applied to follow criterion
Stable optimization of correctly trained model
Lifting requirement for complete reference data requires training protocol to explore target space

E analysis of the bias when generating deterministic, minimum-energy hydrogen coordinates

Two-stage architecture generates heavy atom coordinates and hydrogen atoms
Heavy atom coordinates are generated by generator G and hydrogen atoms by auxiliary neural network h
Generator $G$ is bijective, but the complete pipeline $h \circ G$ is not
Desirable target for generated distribution is marginal of target with respect to heavy atom coordinates
Characterize convergence by computing probability of minor mode of dialanine
Minor mode is defined solely based on values of $x_C$
Probability of generation of all-atom configuration is non-zero only on minimum-energy-hydrogen manifold
Probability of minor mode as generated by perfectly trained network is estimated by importance sampling estimator

F.1 integral quantity of interest

Generator is used to estimate integral quantities
True value of the integral quantity is denoted by $Q$
Importance sampling is exact when $p_G$ is nonzero wherever $p_B$ is nonzero

F.2 estimation by sampling

Estimate Q by sampling
Mini-batch of points sampled according to $p_G$
We do not know $p_B$, only the unnormalized density $\tilde{p}_B = Z_B \, p_B$
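A hedged reconstruction of the sampling estimate described in this section, using a self-normalized form so that the unknown constant $Z_B$ cancels (the weight symbol $w_i$ and the hat notation are introduced here for illustration):

```latex
\begin{align}
Q &= \mathbb{E}_{p_B}[f] = \int f(x)\, p_B(x)\, dx
   = \mathbb{E}_{x \sim p_G}\!\left[ f(x)\, \frac{p_B(x)}{p_G(x)} \right], \\
w_i &= \frac{\tilde{p}_B(x_i)}{p_G(x_i)}, \qquad x_i \sim p_G, \quad i = 1, \dots, N, \\
\hat{Q} &= \frac{\sum_{i=1}^{N} w_i\, f(x_i)}{\sum_{i=1}^{N} w_i} .
\end{align}
```

Multiplying $\tilde{p}_B$ by any constant rescales the numerator and the denominator identically, so $\hat{Q}$ can be computed without knowing $Z_B$.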

F.3 this estimator is unbiased

The estimator is unbiased, meaning that for large mini-batches, the estimate tends to the true value.
The convergence rate is typically in $O(1/\sqrt{N})$.

Variance of the estimator

Estimate $\hat{Q}$ may converge faster for some distributions than others
Variance of estimator $\hat{Q}$ should be as small as possible
Gap between estimate $\hat{Q}$ and real value $Q$ is of the order of magnitude of $V$
Can we train $p_G$ to minimize $V$?
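The dependence of the estimator variance on the mini-batch size can be checked empirically. The sketch below assumes a toy setting that is not from the paper: a standard normal target with known normalization, a wider Gaussian as the sampling distribution, and $f(x) = x^2$.

```python
import numpy as np

def estimator_variance(batch_size, n_repeats, rng, std_q=1.5):
    """Empirical variance, over repeated mini-batches, of an importance-sampling
    estimate of E_p[x^2] with p = N(0, 1) and samples drawn from q = N(0, std_q^2)."""
    x = rng.normal(0.0, std_q, size=(n_repeats, batch_size))
    log_p = -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)
    log_q = -0.5 * (x / std_q) ** 2 - np.log(std_q) - 0.5 * np.log(2.0 * np.pi)
    weights = np.exp(log_p - log_q)
    estimates = np.mean(weights * x**2, axis=1)   # one estimate per mini-batch
    return estimates.var()

rng = np.random.default_rng(0)
var_small = estimator_variance(100, 5000, rng)
var_large = estimator_variance(400, 5000, rng)
ratio = var_small / var_large   # expected to be near 400 / 100 = 4
print(var_small, var_large, ratio)
```

Quadrupling the mini-batch size should divide the variance by roughly four, so the typical error shrinks by about a factor of two.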

F.5 reducing variances over mini-batches to variances over single samples

Variance over mini-batch size is minimized
All points are sampled identically
Variance behaves as O(1/N)
Quality of generator can be quantified as a loss
Special case of estimating free energy differences
Log-ratios of target and generated densities can be computed
Pairwise L2 loss is defined
Masked L2 loss with detached means is defined
Variation of K is non-negative
K increases with time and converges
Potential energy of generated samples is stable when K is detached
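One way to picture a pairwise L2 loss on log-ratios is sketched below. The exact definition, the mask, and the detached means used in the paper are not given in this summary, so the form here (mean squared difference over all pairs of a mini-batch) is an assumption chosen for illustration.

```python
import numpy as np

def pairwise_l2_loss(log_ratio):
    """Mean squared difference between the log-ratios of all pairs in a mini-batch.

    log_ratio[i] stands for log p~_B(x_i) - log p_G(x_i) on generated samples x_i.
    """
    diff = log_ratio[:, None] - log_ratio[None, :]
    return np.mean(diff**2)

rng = np.random.default_rng(0)
log_ratio = rng.normal(size=64)                      # placeholder values, not real data

loss = pairwise_l2_loss(log_ratio)
loss_shifted = pairwise_l2_loss(log_ratio + 3.7)     # adding an unknown constant such as log Z_B
twice_variance = 2.0 * log_ratio.var()

print(loss, loss_shifted, twice_variance)
```

Under this assumed form the loss is unchanged by an additive constant, so the unknown normalization of the target drops out, and it equals twice the mini-batch variance of the log-ratios. It vanishes exactly when all log-ratios agree, that is, when the generated density is proportional to the target on the sampled points.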

I optimization pitfalls induced by the discretization of distributions into minibatches

Optimization pitfalls encountered during gradient descents over “distances” or divergences between probability distributions
Search for new optimization criteria with better optimization properties

I.1 discretization issues with kullback-leibler and remedies

Discretization and normalization issues can lead to a pitfall even if a parameterized model is used.
The total mass of a distribution discretized on a minibatch is not 1, even if normalized by the number of samples.
Gradient descent can lead to exploding dynamics if the minibatch is not properly normalized.
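The statement about the total mass on a mini-batch can be checked numerically. This sketch assumes a standard normal density evaluated on its own samples, a toy case picked because the expected value is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)                          # mini-batch drawn from q = N(0, 1)
density = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)  # q evaluated on its own samples

mass = density.mean()                     # "total mass" normalized by the number of samples
expected = 1.0 / (2.0 * np.sqrt(np.pi))   # integral of q(x)^2, about 0.282

print(mass, expected)
```

The average density over the samples converges to the integral of $q^2$, not to 1, so a distribution discretized on a mini-batch has to be renormalized explicitly before it is treated as a probability vector.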

I.2 reducing the variance of estimators induced by discretization

Losses vs. estimators of them by discretization
Do not confuse quantity with estimator
Do not confuse gradient of estimator with estimator of gradient
Estimate gradient of $\mathrm{KL}(p_B \,\|\, p_G)$ w.r.t. generator parameters
Stabilizing trick to reduce estimator variance
KL(p||q) is a divergence, not a distance
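The asymmetry that keeps KL from being a distance is easy to see on a small discrete example. The two probability vectors below are arbitrary illustrative values.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) between two discrete distributions
    with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.8, 0.1, 0.1]
q = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]

kl_pq = kl(p, q)
kl_qp = kl(q, p)
print(kl_pq, kl_qp)
```

Both values are positive but they differ, so swapping the arguments changes the objective. This is why the data-based and data-free KL losses behave differently during training.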

J.4 gradient of the estimator vs. estimator of the gradient

Gradient $\nabla^{L^2}_q$ does not take into account that $q$ should sum up to 1
Gradient $\nabla^{L^2}_q$ should be projected onto set of possible variations of $q$
Gradient $\nabla^{L^2}_\theta$ of $\mathrm{KL}(p_d \,\|\, q_d)$ between discretized distributions is $\sum_i \frac{d \log q(x_i)}{d\theta} \left( q_d(x_i) - p_d(x_i) \right)$
Discretization of gradient $\nabla^{L^2}_\theta$ of $\mathrm{KL}(p_d \,\|\, q)$ misses a term, leading to positive additive term in gradient descent
Normalizing flows ensure all $q_\theta$ are probability distributions
Estimation of gradient $A$ of form $\sum \frac{dq}{d\theta} f$ should be done with formula
Variance reduction due to stabilizing trick is 1
Stabilizing trick removes normalization mistakes on average
Figures 1-3 show results of fine-tunings with $L_{KL_x}$ and $L^{df}_{KL_z}$ after pre-trainings with $L_{KL_z}$
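The gradient of the KL divergence between discretized distributions, as stated in this section, can be verified with finite differences. The sketch assumes a toy model in which $\log q_\theta(x_i)$ is linear in $\theta$ on a fixed mini-batch; the feature matrix and all names are hypothetical.

```python
import numpy as np

def discretize(log_q):
    """Renormalize density values on the mini-batch so that they sum to 1."""
    w = np.exp(log_q - log_q.max())
    return w / w.sum()

def kl_discrete(theta, features, p_d):
    """KL(p_d || q_d) where log q(x_i) = features[i] @ theta (toy parameterization)."""
    q_d = discretize(features @ theta)
    return float(np.sum(p_d * (np.log(p_d) - np.log(q_d))))

def analytic_gradient(theta, features, p_d):
    """sum_i dlog q(x_i)/dtheta * (q_d(x_i) - p_d(x_i)); here dlog q(x_i)/dtheta = features[i]."""
    q_d = discretize(features @ theta)
    return features.T @ (q_d - p_d)

rng = np.random.default_rng(0)
n_points, n_params = 8, 3
features = rng.normal(size=(n_points, n_params))
p_d = discretize(rng.normal(size=n_points))    # arbitrary discretized target
theta = rng.normal(size=n_params)

eps = 1e-6
numeric = np.array([
    (kl_discrete(theta + eps * e, features, p_d)
     - kl_discrete(theta - eps * e, features, p_d)) / (2.0 * eps)
    for e in np.eye(n_params)
])
analytic = analytic_gradient(theta, features, p_d)
print(numeric, analytic)
```

The $q_d(x_i)$ term in the formula comes from differentiating the mini-batch normalization; dropping it leaves only the $-p_d(x_i)$ part, which is the kind of missing term the section warns about.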
