Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Sampling from Gibbs distributions and computing their log-partition function are important tasks in computer science.
Algorithms for non-convex potentials suffer from the curse of dimensionality.
Smooth functions allow faster convergence rates for optimization.
It is possible to achieve similar rates for sampling and log-partition computation.
Polynomial-time algorithms sometimes exhibit interesting behavior but no near-optimal rates.

Paper Content

Introduction

Sampling and log-partition problems are important in computer science
Sampling problem is to draw samples from a distribution with density p(x)
Log-partition problem is to compute the normalization constant
Distributions of the form p(x) are known as Gibbs distributions
Exact sampling and log-partition computation are possible for some simple functions
We study the worst-case error of algorithms over a function class
Error depends on variables B, n, d, m
We study the asymptotic behavior in terms of n and B
Typical convergence rates are of the form O m,d (Bn −m/d )
Lemma 3 states that in the limit of low temperatures, sampling is equivalent to optimization
Optimal worst-case convergence rate for optimization is the same as for approximation
High-temperature limit yields a unique distribution on the maximizers
Polynomial-time algorithms with fast convergence rates are possible for the high-temperature case and for optimization
Ideal algorithm should have close to optimal convergence rate, polynomial runtime, and adaptivity

Contribution

Analyze information-based complexity of sampling and log-partition problems
Show optimal rate for log-partition problem is same as for approximation
Show optimal rate for sampling is min{1, Bn −m/d }
Show reductions between different problems
Analyze bounds on convergence rates for different algorithms

Analysis of sampling algorithms has received attention in recent years
Good convergence rates established for versions of Langevin and Hamiltonian Monte Carlo algorithms
Algorithm with optimal convergence rate for d = 1
Not much known about algorithm-independent lower bounds in other cases
Convergence rates established for versions of Langevin algorithm for non-log-concave distributions
Convergence rate polynomial in d but depends on properties of f
Mixing time of parallel and simulated tempering can scale exponentially with d
Adaptive rejection sampling algorithm proposed and analyzed
Sampling related to optimization, can be easier or harder
Log-partition problem often addressed through sampling algorithms
Laplace approximation does not converge to true log-partition function
Information-based complexity used to analyze convergence rates
Paper organized into 6 sections

Information-based complexity

Log-partition: Define log-partition problem in general setting of Novak (1988)
Sampling: Define sampling problem in general setting of Novak (1988)

Deterministic evaluation points

Define a space A of admissible maps S : F → M
Consider maps that evaluate functions in a deterministic set of points
Allow adaptive points by defining , where evaluation points may be chosen depending on previous function values
Interested in the (nonadaptive/adaptive) minimax optimal error
Sets A n and A ad n can be interpreted as classes of “black-box algorithms”
Minimax-optimal errors e n and e ad n give lower bounds to what can be achieved by computationally efficient algorithms
Consider an idealized sampling algorithm to take some source of randomness ω sampled from a distribution P Ω
Maps S ∈ A n (or A ad n ) produce distributions based on n function evaluations of a function f
Theorem 5 adapts known results on minimax optimal rates to considered function spaces
Theorem 7 gives upper bounds on minimax optimal rates
Theorem 7 is optimal for the deterministic point setting

Stochastic evaluation points

Monte-Carlo type methods can be used to choose points stochastically
A map S can output a random distribution
Sampling algorithms typically follow a fixed distribution
Random samples are produced by evaluating a function at n (stochastic) points
Results for approximation, optimization, and integration are known
Faster rate for integration can be achieved by spending half of the n points for approximating f with g
Upper bound for stochastic log-partition exists
Lower bound for stochastic log-partition is known for optimization regime
Lower bound outside of optimization regime is an open problem
Combination of approximation and rejection sampling can be used
Upper bound for sampling with stochastic evaluation points decays faster than exponential in n
Lower bound for sampling with stochastic evaluation points is tight for optimization regime
Zero error can be achieved for certain function classes

Relations between different problems

Problems such as sampling, log-partition estimation, and optimization are related.
Bounds can be established via connection to function approximation.
Moving least squares method produces an approximant with optimal convergence rates and smooth approximation.
Theorem 15 shows that the moving least squares method has desired properties.

Runtime-accuracy trade-off

Investigating sampling and log-partition algorithms
Studying convergence rate and runtime complexity
Convergence rate and runtime complexity can be traded off
Increasing n without using additional function values can improve runtime complexity
Cannot trade runtime complexity for better convergence rates
Can use N evaluations of an interpolant created using n function evaluations

Relation between stochastic and deterministic evaluation points

Construction in Example 17 yields sampling algorithm with deterministic evaluation points
Construction limits convergence rate of sampling algorithm to Ω m,d (Bn −m/d )
Applying construction to log-partition algorithm yields stochastic log-partition algorithm with deterministic evaluation points
Convergence rate of construction limited to Ω m,d (Bn −m/d )

Relation between sampling and log-partition estimation

Sampling algorithms are used for log-partition estimation
Thermodynamic integration is a method to achieve this
Monte Carlo methods are used to evaluate the inner expectation
The outer integral is approximated with a quadrature rule
Thermodynamic integration has a convergence rate of Ω m,d,f (n −1/2 )
Bisection sampling is another method to achieve efficient log-partition estimation
Bisection sampling has a convergence rate of O m,d (N log(N ))
Both methods can be performed on top of an approximation of f

Relation to optimization

Sampling and optimization are related
Two kinds of optimization problems exist
Sampling can be used to solve optimization problems
Probabilities can be upper-bounded to solve optimization problems
Sampling algorithms can be used to achieve optimal rate for optimization

Algorithms

Investigating convergence rates of algorithmic approaches
Sampling and log-partition problems

Approximation-based algorithms

Approximation-based methods can achieve optimal rates for sampling and log-partition problems with deterministic points
Piecewise constant approximation divides X into N d equally-sized cubes
Output the function g n that is piecewise constant on each cube and interpolates f at the center x (i) of the cube X i
Given a piecewise constant function g n , we can easily compute L g
We can sample from g n in time O m,d (n)
Convergence rate of piecewise constant approximation is bad
Combining with higher-order function approximation can achieve rate O m,d (Bn −m/d )
Approximating λp with λq yields same sampling and log-partition errors
Need to divide approximation bound by normalization constant
Norm of the (normalized) density plays an important role for convergence rates
Sum-of-squares model proposed by Marteau-Ferey et al. (2022) achieves rate of O m,d,f (n −m/d )

Simple stochastic algorithms

Rejection sampling with uniform proposal distribution can achieve better rates than density-based approximation.
Lower bounds for the convergence of rejection sampling can be obtained using Lemma 11.
Monte Carlo quadrature can be used to approximate integrals.
Theorem 25 gives an upper bound on the convergence rate of Monte Carlo log-partition.
Theorem 26 shows that Monte Carlo sampling cannot achieve good rates in the optimization regime.

Markov chain monte carlo algorithms

MCMC methods are a popular class of sampling algorithms
Gradient-based MCMC algorithms have been studied intensively
Theoretical guarantees consider the case of concave functions
Extensions allow for non-concave functions in a compact region
Mixing time bound depends exponentially on L, related to f C 2
Convergence rates for other MCMC methods on F d,m,B is an open problem

Variational formulation for log-partition estimation

Variational approach to log-partition problem introduced by Bach (2022)
Optimization setting: probability measures on X
Approximate function f with model of form H (Hermitian matrix) and feature map ϕ
Define moment matrix for probability distribution P
Reduce infinite-dimensional convex optimization problem to finite-dimensional
Variational formulation by Donsker and Varadhan (1983) for general base distributions Q
Replace integral with tr[HΣ P ]
Replace KL divergence with something that only depends on Σ P
Bach (2022) proposes multiple lower bounds
Tightest one yields upper bound on log-partition function
Lemma 27: infimum over P merges with supremum over P
Theorem 28: lower bound for OPT relaxation
Implications of Theorem 28 on convergence rates

Experiments

Investigated the convergence behavior of simple algorithms numerically
Studied functions of the form f : [0, 1] 3 → R, x → β(x 1 + x 2 + x 3 )
Functions are simple and concave, but pose a challenge to some general algorithms
Dimension d = 3 chosen for visualization purposes
Plots can be reproduced using code at github.com/dholzmueller/sampling_experiments

Log-partition estimation

PC algorithm computes log-partition function of piecewise constant approximation
MC algorithm uses Monte Carlo log-partition estimation
PC+MC uses importance sampling and MC quadrature
PC+MC convergence rate is O(n-5/6)

Sampling

Estimate distances between probability distributions through samples
Energy distance is an efficient and easy-to-compute measure
Compare PC, MC, RS and PC+MC sampling algorithms
Combining approximation-based and stochastic methods performs better than either of the two in isolation
RS initially performs poorly but reaches fast convergence for larger values of n

Link to paper#

Abstract#

Paper Content#

Introduction#

Contribution#

Related work#

Information-based complexity#

Deterministic evaluation points#

Stochastic evaluation points#

Relations between different problems#

Runtime-accuracy trade-off#

Relation between stochastic and deterministic evaluation points#

Relation between sampling and log-partition estimation#

Relation to optimization#

Algorithms#

Approximation-based algorithms#

Simple stochastic algorithms#

Markov chain monte carlo algorithms#

Variational formulation for log-partition estimation#

Experiments#

Log-partition estimation#

Sampling#

Link to paper

Abstract

Paper Content

Introduction

Contribution

Related work

Information-based complexity

Deterministic evaluation points

Stochastic evaluation points

Relations between different problems

Runtime-accuracy trade-off

Relation between stochastic and deterministic evaluation points

Relation between sampling and log-partition estimation

Relation to optimization

Algorithms

Approximation-based algorithms

Simple stochastic algorithms

Markov chain monte carlo algorithms

Variational formulation for log-partition estimation

Experiments

Log-partition estimation

Sampling