Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Sampling from Gibbs distributions and computing their log-partition function are important tasks in computer science.
  • Algorithms for non-convex potentials suffer from the curse of dimensionality.
  • Smooth functions allow faster convergence rates for optimization.
  • It is possible to achieve similar rates for sampling and log-partition computation.
  • Polynomial-time algorithms sometimes exhibit interesting behavior but no near-optimal rates.

Paper Content

Introduction

  • Sampling and log-partition problems are important in computer science
  • Sampling problem is to draw samples from a distribution with density p(x)
  • Log-partition problem is to compute the normalization constant
  • Distributions of the form p(x) are known as Gibbs distributions
  • Exact sampling and log-partition computation are possible for some simple functions
  • We study the worst-case error of algorithms over a function class
  • Error depends on variables B, n, d, m
  • We study the asymptotic behavior in terms of n and B
  • Typical convergence rates are of the form O m,d (Bn −m/d )
  • Lemma 3 states that in the limit of low temperatures, sampling is equivalent to optimization
  • Optimal worst-case convergence rate for optimization is the same as for approximation
  • High-temperature limit yields a unique distribution on the maximizers
  • Polynomial-time algorithms with fast convergence rates are possible for the high-temperature case and for optimization
  • Ideal algorithm should have close to optimal convergence rate, polynomial runtime, and adaptivity

Contribution

  • Analyze information-based complexity of sampling and log-partition problems
  • Show optimal rate for log-partition problem is same as for approximation
  • Show optimal rate for sampling is min{1, Bn −m/d }
  • Show reductions between different problems
  • Analyze bounds on convergence rates for different algorithms
  • Analysis of sampling algorithms has received attention in recent years
  • Good convergence rates established for versions of Langevin and Hamiltonian Monte Carlo algorithms
  • Algorithm with optimal convergence rate for d = 1
  • Not much known about algorithm-independent lower bounds in other cases
  • Convergence rates established for versions of Langevin algorithm for non-log-concave distributions
  • Convergence rate polynomial in d but depends on properties of f
  • Mixing time of parallel and simulated tempering can scale exponentially with d
  • Adaptive rejection sampling algorithm proposed and analyzed
  • Sampling related to optimization, can be easier or harder
  • Log-partition problem often addressed through sampling algorithms
  • Laplace approximation does not converge to true log-partition function
  • Information-based complexity used to analyze convergence rates
  • Paper organized into 6 sections

Information-based complexity

  • Log-partition: Define log-partition problem in general setting of Novak (1988)
  • Sampling: Define sampling problem in general setting of Novak (1988)

Deterministic evaluation points

  • Define a space A of admissible maps S : F → M
  • Consider maps that evaluate functions in a deterministic set of points
  • Allow adaptive points by defining , where evaluation points may be chosen depending on previous function values
  • Interested in the (nonadaptive/adaptive) minimax optimal error
  • Sets A n and A ad n can be interpreted as classes of “black-box algorithms”
  • Minimax-optimal errors e n and e ad n give lower bounds to what can be achieved by computationally efficient algorithms
  • Consider an idealized sampling algorithm to take some source of randomness ω sampled from a distribution P Ω
  • Maps S ∈ A n (or A ad n ) produce distributions based on n function evaluations of a function f
  • Theorem 5 adapts known results on minimax optimal rates to considered function spaces
  • Theorem 7 gives upper bounds on minimax optimal rates
  • Theorem 7 is optimal for the deterministic point setting

Stochastic evaluation points

  • Monte-Carlo type methods can be used to choose points stochastically
  • A map S can output a random distribution
  • Sampling algorithms typically follow a fixed distribution
  • Random samples are produced by evaluating a function at n (stochastic) points
  • Results for approximation, optimization, and integration are known
  • Faster rate for integration can be achieved by spending half of the n points for approximating f with g
  • Upper bound for stochastic log-partition exists
  • Lower bound for stochastic log-partition is known for optimization regime
  • Lower bound outside of optimization regime is an open problem
  • Combination of approximation and rejection sampling can be used
  • Upper bound for sampling with stochastic evaluation points decays faster than exponential in n
  • Lower bound for sampling with stochastic evaluation points is tight for optimization regime
  • Zero error can be achieved for certain function classes

Relations between different problems

  • Problems such as sampling, log-partition estimation, and optimization are related.
  • Bounds can be established via connection to function approximation.
  • Moving least squares method produces an approximant with optimal convergence rates and smooth approximation.
  • Theorem 15 shows that the moving least squares method has desired properties.

Runtime-accuracy trade-off

  • Investigating sampling and log-partition algorithms
  • Studying convergence rate and runtime complexity
  • Convergence rate and runtime complexity can be traded off
  • Increasing n without using additional function values can improve runtime complexity
  • Cannot trade runtime complexity for better convergence rates
  • Can use N evaluations of an interpolant created using n function evaluations

Relation between stochastic and deterministic evaluation points

  • Construction in Example 17 yields sampling algorithm with deterministic evaluation points
  • Construction limits convergence rate of sampling algorithm to Ω m,d (Bn −m/d )
  • Applying construction to log-partition algorithm yields stochastic log-partition algorithm with deterministic evaluation points
  • Convergence rate of construction limited to Ω m,d (Bn −m/d )

Relation between sampling and log-partition estimation

  • Sampling algorithms are used for log-partition estimation
  • Thermodynamic integration is a method to achieve this
  • Monte Carlo methods are used to evaluate the inner expectation
  • The outer integral is approximated with a quadrature rule
  • Thermodynamic integration has a convergence rate of Ω m,d,f (n −1/2 )
  • Bisection sampling is another method to achieve efficient log-partition estimation
  • Bisection sampling has a convergence rate of O m,d (N log(N ))
  • Both methods can be performed on top of an approximation of f

Relation to optimization

  • Sampling and optimization are related
  • Two kinds of optimization problems exist
  • Sampling can be used to solve optimization problems
  • Probabilities can be upper-bounded to solve optimization problems
  • Sampling algorithms can be used to achieve optimal rate for optimization

Algorithms

  • Investigating convergence rates of algorithmic approaches
  • Sampling and log-partition problems

Approximation-based algorithms

  • Approximation-based methods can achieve optimal rates for sampling and log-partition problems with deterministic points
  • Piecewise constant approximation divides X into N d equally-sized cubes
  • Output the function g n that is piecewise constant on each cube and interpolates f at the center x (i) of the cube X i
  • Given a piecewise constant function g n , we can easily compute L g
  • We can sample from g n in time O m,d (n)
  • Convergence rate of piecewise constant approximation is bad
  • Combining with higher-order function approximation can achieve rate O m,d (Bn −m/d )
  • Approximating λp with λq yields same sampling and log-partition errors
  • Need to divide approximation bound by normalization constant
  • Norm of the (normalized) density plays an important role for convergence rates
  • Sum-of-squares model proposed by Marteau-Ferey et al. (2022) achieves rate of O m,d,f (n −m/d )

Simple stochastic algorithms

  • Rejection sampling with uniform proposal distribution can achieve better rates than density-based approximation.
  • Lower bounds for the convergence of rejection sampling can be obtained using Lemma 11.
  • Monte Carlo quadrature can be used to approximate integrals.
  • Theorem 25 gives an upper bound on the convergence rate of Monte Carlo log-partition.
  • Theorem 26 shows that Monte Carlo sampling cannot achieve good rates in the optimization regime.

Markov chain monte carlo algorithms

  • MCMC methods are a popular class of sampling algorithms
  • Gradient-based MCMC algorithms have been studied intensively
  • Theoretical guarantees consider the case of concave functions
  • Extensions allow for non-concave functions in a compact region
  • Mixing time bound depends exponentially on L, related to f C 2
  • Convergence rates for other MCMC methods on F d,m,B is an open problem

Variational formulation for log-partition estimation

  • Variational approach to log-partition problem introduced by Bach (2022)
  • Optimization setting: probability measures on X
  • Approximate function f with model of form H (Hermitian matrix) and feature map ϕ
  • Define moment matrix for probability distribution P
  • Reduce infinite-dimensional convex optimization problem to finite-dimensional
  • Variational formulation by Donsker and Varadhan (1983) for general base distributions Q
  • Replace integral with tr[HΣ P ]
  • Replace KL divergence with something that only depends on Σ P
  • Bach (2022) proposes multiple lower bounds
  • Tightest one yields upper bound on log-partition function
  • Lemma 27: infimum over P merges with supremum over P
  • Theorem 28: lower bound for OPT relaxation
  • Implications of Theorem 28 on convergence rates

Experiments

  • Investigated the convergence behavior of simple algorithms numerically
  • Studied functions of the form f : [0, 1] 3 → R, x → β(x 1 + x 2 + x 3 )
  • Functions are simple and concave, but pose a challenge to some general algorithms
  • Dimension d = 3 chosen for visualization purposes
  • Plots can be reproduced using code at github.com/dholzmueller/sampling_experiments

Log-partition estimation

  • PC algorithm computes log-partition function of piecewise constant approximation
  • MC algorithm uses Monte Carlo log-partition estimation
  • PC+MC uses importance sampling and MC quadrature
  • PC+MC convergence rate is O(n-5/6)

Sampling

  • Estimate distances between probability distributions through samples
  • Energy distance is an efficient and easy-to-compute measure
  • Compare PC, MC, RS and PC+MC sampling algorithms
  • Combining approximation-based and stochastic methods performs better than either of the two in isolation
  • RS initially performs poorly but reaches fast convergence for larger values of n