Abstract
- Sampling from Gibbs distributions and computing their log-partition function are important tasks in computer science.
- Algorithms for non-convex potentials suffer from the curse of dimensionality.
- Smooth functions allow faster convergence rates for optimization.
- It is possible to achieve similar rates for sampling and log-partition computation.
- Polynomial-time algorithms sometimes exhibit interesting behavior but no near-optimal rates.
Paper Content
Introduction
- Sampling and log-partition problems are important in computer science
- Sampling problem is to draw samples from a distribution with density p(x)
- Log-partition problem is to compute the normalization constant
- Distributions of the form p(x) are known as Gibbs distributions
- Exact sampling and log-partition computation are possible for some simple functions
- We study the worst-case error of algorithms over a function class
- Error depends on variables B, n, d, m
- We study the asymptotic behavior in terms of n and B
- Typical convergence rates are of the form $O_{m,d}(B n^{-m/d})$
- Lemma 3 states that in the limit of low temperatures, sampling is equivalent to optimization
- Optimal worst-case convergence rate for optimization is the same as for approximation
- Low-temperature limit yields a distribution concentrated on the maximizers
- Polynomial-time algorithms with fast convergence rates are possible for the high-temperature case and for optimization
- Ideal algorithm should have close to optimal convergence rate, polynomial runtime, and adaptivity
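For reference, the two problems can be written out as follows. This is a sketch under assumptions: the sign convention $p_f \propto \exp(f)$ (chosen to match the bullets about maximizers), the domain symbol $X$, and the names $Z_f$ and $L_f$ are introduced here for illustration and are not quoted from the summary.

```latex
p_f(x) = \frac{\exp(f(x))}{Z_f}, \qquad
Z_f = \int_{X} \exp(f(x)) \,\mathrm{d}x, \qquad
L_f = \log Z_f .
```

Under this convention, sampling means drawing from $p_f$ and the log-partition problem means computing $L_f$. Scaling $f$ by a large factor corresponds to a low temperature.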
Contribution
- Analyze information-based complexity of sampling and log-partition problems
- Show optimal rate for log-partition problem is same as for approximation
- Show optimal rate for sampling is $\min\{1, B n^{-m/d}\}$
- Show reductions between different problems
- Analyze bounds on convergence rates for different algorithms
Related work
- Analysis of sampling algorithms has received attention in recent years
- Good convergence rates established for versions of Langevin and Hamiltonian Monte Carlo algorithms
- Algorithm with optimal convergence rate for d = 1
- Not much known about algorithm-independent lower bounds in other cases
- Convergence rates established for versions of Langevin algorithm for non-log-concave distributions
- Convergence rate polynomial in d but depends on properties of f
- Mixing time of parallel and simulated tempering can scale exponentially with d
- Adaptive rejection sampling algorithm proposed and analyzed
- Sampling related to optimization, can be easier or harder
- Log-partition problem often addressed through sampling algorithms
- Laplace approximation does not converge to true log-partition function
- Information-based complexity used to analyze convergence rates
- Paper organized into 6 sections
Information-based complexity
- Log-partition: Define log-partition problem in general setting of Novak (1988)
- Sampling: Define sampling problem in general setting of Novak (1988)
Deterministic evaluation points
- Define a space A of admissible maps S : F → M
- Consider maps that evaluate functions in a deterministic set of points
- Allow adaptive points by defining $A^{\mathrm{ad}}_n$, where evaluation points may be chosen depending on previous function values
- Interested in the (nonadaptive/adaptive) minimax optimal error
- Sets $A_n$ and $A^{\mathrm{ad}}_n$ can be interpreted as classes of “black-box algorithms”
- Minimax-optimal errors $e_n$ and $e^{\mathrm{ad}}_n$ give lower bounds to what can be achieved by computationally efficient algorithms
- An idealized sampling algorithm takes a source of randomness $\omega$ sampled from a distribution $P_\Omega$
- Maps $S \in A_n$ (or $A^{\mathrm{ad}}_n$) produce distributions based on n function evaluations of a function f
- Theorem 5 adapts known results on minimax optimal rates to considered function spaces
- Theorem 7 gives upper bounds on minimax optimal rates
- Theorem 7 is optimal for the deterministic point setting
Stochastic evaluation points
- Monte-Carlo type methods can be used to choose points stochastically
- A map S can output a random distribution
- Sampling algorithms typically follow a fixed distribution
- Random samples are produced by evaluating a function at n (stochastic) points
- Results for approximation, optimization, and integration are known
- Faster rate for integration can be achieved by spending half of the n points for approximating f with g
- Upper bound for stochastic log-partition exists
- Lower bound for stochastic log-partition is known for optimization regime
- Lower bound outside of optimization regime is an open problem
- Combination of approximation and rejection sampling can be used
- Upper bound for sampling with stochastic evaluation points decays faster than exponential in n
- Lower bound for sampling with stochastic evaluation points is tight for optimization regime
- Zero error can be achieved for certain function classes
Relations between different problems
- Problems such as sampling, log-partition estimation, and optimization are related.
- Bounds can be established via connection to function approximation.
- Moving least squares method produces an approximant with optimal convergence rates and smooth approximation.
- Theorem 15 shows that the moving least squares method has desired properties.
Runtime-accuracy trade-off
- Investigating sampling and log-partition algorithms
- Studying convergence rate and runtime complexity
- Convergence rate and runtime complexity can be traded off
- Increasing n without using additional function values can improve runtime complexity
- Cannot trade runtime complexity for better convergence rates
- Can use N evaluations of an interpolant created using n function evaluations
Relation between stochastic and deterministic evaluation points
- Construction in Example 17 yields sampling algorithm with deterministic evaluation points
- Construction limits convergence rate of sampling algorithm to $\Omega_{m,d}(B n^{-m/d})$
- Applying construction to log-partition algorithm yields stochastic log-partition algorithm with deterministic evaluation points
- Convergence rate of construction limited to $\Omega_{m,d}(B n^{-m/d})$
Relation between sampling and log-partition estimation
- Sampling algorithms are used for log-partition estimation
- Thermodynamic integration is a method to achieve this
- Monte Carlo methods are used to evaluate the inner expectation
- The outer integral is approximated with a quadrature rule
- Thermodynamic integration has a convergence rate of $\Omega_{m,d,f}(n^{-1/2})$
- Bisection sampling is another method to achieve efficient log-partition estimation
- Bisection sampling has a convergence rate of $O_{m,d}(N \log N)$
- Both methods can be performed on top of an approximation of f
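The thermodynamic integration bullets above can be illustrated with a minimal Python sketch. It assumes the domain $[0,1]^d$ and the convention $p_t \propto \exp(t f)$, so that $L_f = \int_0^1 E_{p_t}[f] \, dt$. The inner expectation uses Monte Carlo samples obtained by rejection sampling from a uniform proposal, and the outer integral uses Gauss-Legendre quadrature. The function names, the known upper bound `f_max`, and the choice of quadrature rule are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def rejection_sample(f, f_max, d, t, n_samples, rng):
    """Draw n_samples from p_t(x) proportional to exp(t * f(x)) on [0, 1]^d.

    Uses a uniform proposal; f_max must be an upper bound on f (assumed known).
    """
    accepted, count = [], 0
    while count < n_samples:
        x = rng.random((2 * n_samples, d))
        keep = rng.random(2 * n_samples) < np.exp(t * (f(x) - f_max))
        accepted.append(x[keep])
        count += int(keep.sum())
    return np.concatenate(accepted)[:n_samples]

def thermodynamic_integration(f, f_max, d, n_temps=8, n_samples=2000, seed=0):
    """Estimate L_f = log of the integral of exp(f) over [0, 1]^d."""
    rng = np.random.default_rng(seed)
    # Gauss-Legendre nodes and weights, mapped from [-1, 1] to [0, 1].
    nodes, weights = np.polynomial.legendre.leggauss(n_temps)
    ts, ws = 0.5 * (nodes + 1.0), 0.5 * weights
    # Inner expectation E_{p_t}[f], estimated by Monte Carlo at each node t.
    inner = [f(rejection_sample(f, f_max, d, t, n_samples, rng)).mean() for t in ts]
    return float(np.dot(ws, inner))

if __name__ == "__main__":
    f = lambda x: 2.0 * x.sum(axis=1)  # hypothetical test function
    print(thermodynamic_integration(f, f_max=2.0, d=1))
    print(np.log((np.exp(2.0) - 1.0) / 2.0))  # closed form for comparison
```

The Monte Carlo error of the inner expectation is what limits such a scheme to an $n^{-1/2}$-type rate, in line with the bullet on its convergence rate.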
Relation to optimization
- Sampling and optimization are related
- Two kinds of optimization problems exist
- Sampling can be used to solve optimization problems
- Probabilities can be upper-bounded to solve optimization problems
- Sampling algorithms can be used to achieve optimal rate for optimization
Algorithms
- Investigating convergence rates of algorithmic approaches
- Sampling and log-partition problems
Approximation-based algorithms
- Approximation-based methods can achieve optimal rates for sampling and log-partition problems with deterministic points
- Piecewise constant approximation divides $X$ into $N^d$ equally sized cubes
- Output the function $g_n$ that is piecewise constant on each cube and interpolates f at the center $x^{(i)}$ of the cube $X_i$
- Given a piecewise constant function $g_n$, we can easily compute its log-partition function $L_{g_n}$
- We can sample from $g_n$ in time $O_{m,d}(n)$
- Convergence rate of piecewise constant approximation is slow because it does not exploit higher-order smoothness
- Combining with higher-order function approximation can achieve rate $O_{m,d}(B n^{-m/d})$
- Approximating λp with λq yields same sampling and log-partition errors
- Need to divide approximation bound by normalization constant
- Norm of the (normalized) density plays an important role for convergence rates
- Sum-of-squares model proposed by Marteau-Ferey et al. (2022) achieves rate of $O_{m,d,f}(n^{-m/d})$
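The piecewise constant (PC) construction described in this list can be sketched in a few lines of Python. The sketch assumes the domain $[0,1]^d$ and $n = N^d$ function evaluations at the cube centers. The function names are hypothetical and this is not the code from the paper's repository.

```python
import numpy as np

def pc_fit(f, d, N):
    """Evaluate f at the centers of the N^d equally sized cubes of [0, 1]^d."""
    axes = [(np.arange(N) + 0.5) / N] * d
    centers = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    return centers, f(centers)

def pc_log_partition(values, d, N):
    """Log-partition of the PC approximant: log(sum_i exp(f(x_i)) * N^(-d))."""
    v_max = values.max()  # subtract the maximum for numerical stability
    return float(v_max + np.log(np.exp(values - v_max).sum()) - d * np.log(N))

def pc_sample(centers, values, N, n_samples, rng):
    """Sample from the PC density: pick a cube, then a uniform point inside it."""
    probs = np.exp(values - values.max())
    probs /= probs.sum()
    idx = rng.choice(len(values), size=n_samples, p=probs)
    offsets = (rng.random((n_samples, centers.shape[1])) - 0.5) / N
    return centers[idx] + offsets

if __name__ == "__main__":
    f = lambda x: 1.5 * x.sum(axis=1)  # hypothetical test function
    centers, values = pc_fit(f, d=2, N=50)
    print(pc_log_partition(values, d=2, N=50))
    print(pc_sample(centers, values, 50, 5, np.random.default_rng(0)))
```

Because all cubes have equal volume, the cube probabilities are a softmax of the center values, which is why both the log-partition function and exact samples of the approximant are cheap to obtain.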
Simple stochastic algorithms
- Rejection sampling with uniform proposal distribution can achieve better rates than density-based approximation.
- Lower bounds for the convergence of rejection sampling can be obtained using Lemma 11.
- Monte Carlo quadrature can be used to approximate integrals.
- Theorem 25 gives an upper bound on the convergence rate of Monte Carlo log-partition.
- Theorem 26 shows that Monte Carlo sampling cannot achieve good rates in the optimization regime.
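As an illustration of Monte Carlo quadrature for the log-partition function, here is a minimal Python sketch. It assumes the domain $[0,1]^d$ (volume 1) and uniformly drawn evaluation points. The function name and the test function are hypothetical.

```python
import numpy as np

def mc_log_partition(f, d, n, seed=0):
    """Estimate log of the integral of exp(f) over [0, 1]^d from n uniform points."""
    rng = np.random.default_rng(seed)
    vals = f(rng.random((n, d)))
    v_max = vals.max()  # subtract the maximum for numerical stability
    return float(v_max + np.log(np.mean(np.exp(vals - v_max))))

if __name__ == "__main__":
    f = lambda x: 1.0 * x.sum(axis=1)  # hypothetical test function
    print(mc_log_partition(f, d=3, n=200_000))
    print(3 * np.log(np.e - 1.0))  # closed form for comparison
```

When f is strongly peaked, few uniform points land near the maximizers and the estimate is dominated by a handful of samples, which gives some intuition for why plain Monte Carlo struggles in the optimization regime.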
Markov chain Monte Carlo algorithms
- MCMC methods are a popular class of sampling algorithms
- Gradient-based MCMC algorithms have been studied intensively
- Theoretical guarantees consider the case of concave functions
- Extensions allow for non-concave functions in a compact region
- Mixing time bound depends exponentially on L, related to $\|f\|_{C^2}$
- Convergence rates for other MCMC methods on $F_{d,m,B}$ are an open problem
Variational formulation for log-partition estimation
- Variational approach to log-partition problem introduced by Bach (2022)
- Optimization setting: probability measures on X
- Approximate function f with a model of the form $\varphi(x)^* H \varphi(x)$, with a Hermitian matrix H and feature map $\varphi$
- Define moment matrix for probability distribution P
- Reduce infinite-dimensional convex optimization problem to finite-dimensional
- Variational formulation by Donsker and Varadhan (1983) for general base distributions Q
- Replace integral with $\operatorname{tr}[H \Sigma_P]$
- Replace KL divergence with something that only depends on $\Sigma_P$
- Bach (2022) proposes multiple lower bounds on the KL divergence
- Tightest one yields upper bound on log-partition function
- Lemma 27: infimum over P merges with supremum over P
- Theorem 28: lower bound for OPT relaxation
- Implications of Theorem 28 on convergence rates
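The variational formulation referred to in this list can be written out as follows. This is a sketch under assumptions: the supremum ranges over probability measures $P$ on $X$ that are absolutely continuous with respect to $Q$, and the definition of the moment matrix $\Sigma_P$ is the natural one that makes the trace identity hold. The exact notation in the paper may differ.

```latex
\log \int_X \exp(f(x)) \,\mathrm{d}Q(x)
  = \sup_{P} \Big\{ \int_X f(x) \,\mathrm{d}P(x) - \mathrm{KL}(P \,\|\, Q) \Big\},
\qquad
\int_X \varphi(x)^* H \varphi(x) \,\mathrm{d}P(x) = \operatorname{tr}[H \Sigma_P],
\quad
\Sigma_P = \int_X \varphi(x) \varphi(x)^* \,\mathrm{d}P(x).
```

Replacing the KL term by a lower bound that depends only on $\Sigma_P$ enlarges the objective, so the relaxed problem gives an upper bound on the log-partition function.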
Experiments
- Investigated the convergence behavior of simple algorithms numerically
- Studied functions of the form $f : [0, 1]^3 \to \mathbb{R},\ x \mapsto \beta(x_1 + x_2 + x_3)$
- Functions are simple and concave, but pose a challenge to some general algorithms
- Dimension d = 3 chosen for visualization purposes
- Plots can be reproduced using code at github.com/dholzmueller/sampling_experiments
Log-partition estimation
- PC algorithm computes log-partition function of piecewise constant approximation
- MC algorithm uses Monte Carlo log-partition estimation
- PC+MC uses importance sampling and MC quadrature
- PC+MC convergence rate is $O(n^{-5/6})$
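The PC+MC idea can be sketched as follows: build a piecewise constant approximant $g$, then correct its log-partition function by importance sampling, using $Z_f = Z_g \cdot E_{x \sim p_g}[\exp(f(x) - g(x))]$. The sketch assumes the domain $[0,1]^d$. The function name and the split of the budget between grid points and Monte Carlo points are illustrative assumptions, not the repository's implementation.

```python
import numpy as np

def pc_mc_log_partition(f, d, N, n_mc, seed=0):
    """Log-partition estimate: PC approximation plus importance-sampling correction."""
    rng = np.random.default_rng(seed)
    # Piecewise constant approximant g: values of f at the N^d cube centers.
    axes = [(np.arange(N) + 0.5) / N] * d
    centers = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    g_vals = f(centers)
    g_max = g_vals.max()
    weights = np.exp(g_vals - g_max)
    log_z_g = g_max + np.log(weights.sum()) - d * np.log(N)
    # Draw x ~ p_g: choose a cube by its weight, then a uniform point inside it.
    idx = rng.choice(len(g_vals), size=n_mc, p=weights / weights.sum())
    x = centers[idx] + (rng.random((n_mc, d)) - 0.5) / N
    # Monte Carlo estimate of E_{p_g}[exp(f - g)]; g(x) equals g_vals[idx].
    ratio = np.exp(f(x) - g_vals[idx])
    return float(log_z_g + np.log(ratio.mean()))

if __name__ == "__main__":
    f = lambda x: 3.0 * x.sum(axis=1)  # hypothetical test function
    print(pc_mc_log_partition(f, d=3, N=8, n_mc=4096))
    print(3 * np.log((np.exp(3.0) - 1.0) / 3.0))  # closed form for comparison
```

Since $f - g$ is small on each cube, the importance weights have low variance, which is the intuition for why the combination converges faster than PC or MC alone.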
Sampling
- Estimate distances between probability distributions through samples
- Energy distance is an efficient and easy-to-compute measure
- Compare PC, MC, RS and PC+MC sampling algorithms
- Combining approximation-based and stochastic methods performs better than either of the two in isolation
- RS initially performs poorly but reaches fast convergence for larger values of n
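A sample-based energy distance can be computed as in the following minimal Python sketch. It uses the plug-in estimate of $2\,E\|X - Y\| - E\|X - X'\| - E\|Y - Y'\|$ with Euclidean norms. Whether the paper's experiments use this exact estimator is an assumption, and the function name is hypothetical.

```python
import numpy as np

def energy_distance(x, y):
    """Plug-in estimate of the (squared) energy distance between two samples.

    x and y are arrays of shape (n_x, d) and (n_y, d).
    """
    def mean_dist(a, b):
        # Mean Euclidean distance over all pairs (a_i, b_j).
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()

    return float(2.0 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((200, 3))
    y = rng.random((200, 3))
    print(energy_distance(x, y))        # two samples from the same distribution
    print(energy_distance(x, x + 1.0))  # shifted sample, clearly different
```

The pairwise computation costs $O(n_x n_y d)$ time and memory, so large sample sets would need to be processed in chunks.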