Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Deep learning has potential for generation tasks due to its latent representation
  • Generative models can generate observations randomly
  • Diffusion Model is a rising class of generative models with power-generating ability
  • Diffusion Model has drawbacks such as slow generation process, single data types, low likelihood, and inability for dimension reduction
  • Improved techniques for existing problems in the diffusion-based model field include speed-up improvement, data structure diversification, likelihood optimization, and dimension reduction
  • Applications with diffusion models include computer vision, sequence modeling, audio, and AI for science

Paper Content

Introduction

  • Deep generative models have potential to create patterns humans cannot distinguish
  • Focus on diffusion-based generative models
  • Diffusion models do not require aligning posterior distributions, dealing with intractable partition functions, training additional discriminators, or imposing network constraints
  • Diffusion models have been used in computer vision, natural language processing, and graph analysis
  • Lack of systematic taxonomy and analysis of research progress on diffusion models
  • Diffusion models provide tractable probabilistic parameterization, stable training procedure, and unified loss function design
  • Diffusion models have been used in computer vision, sequence modeling, audio processing, and AI for science
  • Diffusion models have inherent drawback of plenty of sampling steps and long sampling time
  • Works aspire to accelerate diffusion process and improve sampling quality
  • Diffusion models improved algorithms classified into four categories: speed-up improvement, data structure diversification, likelihood optimization, and dimension reduction

  • Application of diffusion models to computer vision, natural language processing, bioinformatics, and speech processing
  • Domain-specialized problem formulation, related datasets, evaluation metrics, and downstream tasks, along with sets of benchmarks
  • Limitations of models and possible further-proof directions

Problem statement

Notions and definitions

State

  • States are data distributions that describe diffusion models.
  • Starting state 0 is the initial distribution.
  • Noise is injected into the starting state.
  • After enough steps, the distribution becomes a known noise distribution (Gaussian).
  • This is called the prior state.
  • Intermediate states are the distributions between the starting and prior states.

Process & transition kernel

  • Forward process transforms starting state into tractable noise
  • Reverse process samples noise gradients into samples as starting state
  • Interchange between states is achieved by transition kernel
  • Forward process consists of forward transition kernels
  • Reverse process consists of reverse transition kernels
  • Most frequently used kernel is Markov kernel
  • Variable noise scale controls randomness of process

Discrete and continuous

  • Discrete process contains infinite steps
  • Continuous process used in improved algorithms to obtain better performance
  • Continuous process enables extraction of information from any time state
  • Continuous process has better theoretical support

Training objective

  • Diffusion model is a type of generative model
  • Training objective is to keep starting and sample distributions close
  • Log-likelihood is maximized to achieve this
  • σ in reverse process differs from forward process

Problem formulation

Denoised diffusion probabilistic model

  • DDPM chooses a sequence of noise coefficients for Markov transition kernels following specific patterns.
  • DDPM forward and reverse processes are defined.
  • Diffusion training objective is to minimize the negative log-likelihood.

Score matching formulation

  • Score matching model attempts to solve data distribution estimation problem by approximating the gradient of data.
  • Score network is trained to predict the score.
  • Score matching process consists of a sequence of perturbation steps with increasing noise scales.
  • Gaussian perturbation kernel is used and the score is equivalent to the gradient of the perturbation kernel.
  • Transition kernel between two neighbor states is defined.

Score matching process:

Ddpm & dsm

  • Wiener process/Brownian Motion is used to describe the data distribution and probability density of the system.
  • The forward SDE equation has a unique solution.
  • The Reversed SDE Process is defined with respect to the reverse-time Stochastic Differential Equation.
  • The score of the system is used to match the data.

Score sde training objective:

  • Score SDE uses a weighting scheme in the score loss.
  • Score loss is compared to denoised score matching.
  • () and (0) are continuous time variables of and 0.

Sde-based ddpm & dsm:

  • DDPM and DSM transition kernel can be expressed as two continuous-time variables of discrete noise scales
  • SDE can be classified as Variation Preserving (VP) and Variation Explosion (VE)
  • Probability Flow ODE is a continuous-time ODE that supports the same marginal probability density as SDE
  • Probability Flow ODE can be solved with larger step sizes and no randomness

Training strategy

Denoising diffusion training strategy

  • Minimizing negative log-likelihood requires using 1: -1
  • Baye’s rule is used to parameterize the posterior
  • Mean and variance schedules are expressed
  • Reparameterizing 1-1 w.r.t 0 yields a simplified training objective
  • Most diffusion models use DDPMs training strategy, but there are exceptions

Score matching training strategy

  • Traditional score-matching techniques require a lot of computing power
  • Advanced methods find ways to avoid computing the Hessian
  • Implicit score matching (ISM) uses a non-normalized density function that can be optimized by a neural network
  • Sliced score matching (SSM) uses reverse-mode auto-differentiation to estimate the score
  • Denoised score matching (DSM) transforms the original score matching into a perturbation kernel learning by adding noise to a sequence

Sampling algorithm

  • Unconditional sampling is the process of rebuilding samples from random noise.
  • Conditional sampling is a class of sampling that utilizes specific conditions.

Unconditional sampling

Ancestral sampling

  • Ancestral sampling is reconstructed with the gradient of inverse Markovian step by step.
  • PC sampling is inspired by a type of ODE black-box ODE solver.

Langevin dynamics sampling

Conditional sampling

  • Labeled Condition Sampling uses gradient guidance and a classifier with UNet Encoder architecture
  • Labels can be text, categorical, binary, or extracted features
  • Unlabeled Condition Sampling uses self-information as guidance and is used in denoising, resolution, and inpainting tasks

Algorithm improvement

  • Diffusion models have low speed and high computation cost.
  • Improved algorithms are classified according to mainstream problems.

Speed-up improvement

  • Diffusion models have high-fidelity generation but low sampling speed.
  • Advanced techniques can be divided into four categories to improve sampling speed.

Training schedule

  • Modifying traditional training settings
  • Key factors in training schemes influence learning patterns and models’ performance
  • Training enhancement divided into three categories: knowledge distillation, diffusion scheme learning, and noise scale designing

Knowledge distillation

  • Knowledge distillation is a method for obtaining small-scale networks from complex teacher models
  • Student models benefit from model compression and acceleration
  • Salimans et al. applied the core idea to diffusion model improvement
  • Student models learn to conduct two-step updates from teacher models in one-step
  • Denoising student distills knowledge from scratch by minimizing KL Divergence

Diffusion scheme learning

  • Diffusion model encodes data onto latent spaces with the same dimension to achieve high expressiveness.
  • Current methods divided into projecting approaches exploration and encoding degree optimization.
  • Truncation conducts a trade-off between generating speed and sample fidelity.
  • Works focus on the diversity of diffusion kernels.

Noise scale designing

  • Traditional diffusion process uses noise to determine transition steps.
  • Noise scale design can lead to reasonable generation and fast convergence.
  • Existing methods treat noise scale as a learnable parameter.
  • Different methods use different approaches to design noise scale.

Training-free sampling

  • Training enhancement methods can be used to speed up sampling
  • Training-free methods apply pre-trained information directly to advanced sampling algorithms with fewer steps and higher fidelity
  • Training-free methods are divided into four categories: analytical methods, Flow-based, Unification Reformulation, and Connection

Continuous space

Analytical method

  • Existing training-free sampling methods use hand-crafted noise sequences.
  • Analytical methods optimize reverse mean and covariance for each state.
  • Analytical methods have a theoretical guarantee, but are limited to certain distributions.

Implicit sampler

  • Implicit sampler follows jump-step pattern using pre-trained diffusion model
  • Probability treated as Score SDE derived from discrete formulation
  • Implicit sampler is type of neural ODE solver
  • Advanced ODE solvers used, such as PNDM, edm, DEIS, gDDIM, and DPM-Solver
  • Dynamic programming based jump-step method for sampling optimal implicit route

Differential equation solver sampler

  • Differential Equation (DE) Solver Sampler minimizes approximation error during reverse sampling
  • Two basic DE formulations: SDE and ODE
  • Higher-order DE solvers have smaller approximation errors and higher order of convergence
  • Speed-prior and accuracy-prior methods
  • Semi-linear-based ODE performs the best

Dynamic programming adjustment

  • Dynamic programming (DP) is a technique used to find optimized solutions in a reduced time.
  • DP algorithms explore the optimal traversal along a trajectory, assuming each path has the same KL divergence.
  • Current DP-based methods have a computational cost of O 2.

Mixed-modeling

  • Mixed-modeling applies fast-sampling and high-expressiveness generative models in diffusion pipeline.
  • Mixed modeling improvement can be classified into two classes from the perspective of mixing purposes: acceleration mixture and expressiveness mixture.

Acceleration mixture

  • Acceleration mixture applies high-speed generation of VAEs and GANs to reduce steps in sampling data from random noise.
  • Two types of models generate predicted 0 with VAE and GAN.
  • ES-DDPM reconstructs intermediate samples as early stop technique.

Expressiveness mixture

  • Expressiveness mixture support diffusion models to express data or noise in different patterns
  • High expressiveness data combined with fast-sampling generative models to obtain mean and variance more accurately
  • Noise modulation, space projection, and kernel expressiveness are high expressiveness methods
  • Reformulation problems unify diffusion pipeline based on one or two variables
  • Connection problems link score and diffusion frameworks to extend them into a higher view

Data structure diversification

  • Diffusion methods are mostly used for image generation tasks.
  • Diffusion mechanism has been used in inter-disciplinary tasks with different data types.
  • Traditional diffusion patterns are expected to be extended for universal use.

Non-linear space

  • Existing methods handle linear perturbations
  • Non-linear space has effects on low-level vision tasks
  • Kawar et al. and DPS use pseudo-inverse operator and posterior sampling approximation to solve JPEG artifact correction, image deblurring, and phase retrieval

Image & point cloud

  • Luo et al. proposed a method for generating point cloud data
  • Other techniques have been developed to generate and complete 3D shapes
  • Improvements have been made to latent space transformation, such as canonical map, condition feature extraction sub-nets, and point-voxel representation

Latent space

  • Expressiveness mixture modeling is used to process latent space data distributions for diffusion applications.
  • Current methods project data into continuous space, with the help of EDM and antigen-diffusion models.
  • Latent space processing can be beneficial in new application fields.

Function

  • Traditional diffusion processes are limited for some tasks
  • Dutordoir et al. proposed a diffusion model that samples from the function space
  • This model captures multi-dimensional distributions by sampling from joint posteriors

Others

  • Score-flow uses a flow function to project RGB images into dequantization space
  • Cold diffusion proposes algorithms for projecting data into random distributions with the support of reconstructing correction

Discrete space

  • Deep generative models have achieved success in natural language processing, multimodal learning, and AI for science.
  • Processing discrete data such as sentences, residue, atom, and vector-quantized data is necessary to eliminate inductive bias.
  • Diffusion models are a promising approach for relevant tasks.
  • Main problem is divided into processing text & categorical data, and vector-quantized data.

Text & categorical

  • D3PM uses diffusion algorithm to process categorical features
  • Multi-nomial diffusion and ARDM extend categorical diffusion to multi-nomial data

Vector-quantized

  • Vector-quantized (VQ) data is proposed to combine data from different fields into the codebook.
  • VQ data processing achieved great performance in autoregressive encoders.

Constrained space

  • Graph-based neural networks can be used to analyze data such as social networks, molecular data, and weather conditions.
  • Manifold learning methods can be used to non-redundantly express and portray data such as proteins and RNA.

Manifold space

  • Data structures such as images and video are defined in Euclidean space.
  • Data in robotics, geoscience, and protein modeling are defined in Riemannian manifold.
  • Current methods for Euclidean space cannot capture Riemann feature.
  • Recent methods applied diffusion sampling into Riemannian manifold.
  • Theoretical works provide comprehensive support for manifold sampling.

Graph

  • Graph-based neural networks are popular due to their high expressiveness.
  • Diffusion theories are used to process graph data.

Likelihood optimization

  • Variational methods, diffusion methods, and other methods use the principle of variational evidence lower bound (ELBO) to train models.
  • Solutions to the likelihood optimization problem can be divided into two classes - improved ELBO and variational gap optimization.

Improved elbo

Score connection

  • Score connection methods provide a connection between ELBO optimization and score matching.
  • Score-flow treats the forward KL divergence in ELBO as optimizing a score-matching loss.
  • Huang et al. treated Brownian motion as a latent variable to track the loglikelihood estimation.
  • Analytic-DPM and NCSN++ enhance ELBO by analyzing the KL Divergence and introducing a truncation factor.

Re-design

  • Loss transformation techniques are compared to re-Design methods.
  • Re-Design methods directly tighten the ELBO.
  • VDM and DDPM++ optimize ELBO by finding optimal factors.
  • Improved DDPM and D3PM propose hybrid loss functions based on ELBO.

Variational gap optimization

  • Minimizing the variational gap is an approach to maximize loglikelihood.
  • INDM (120) is successful in the VAE field.

Dimension reduction

  • Variational auto-encoder projects data into a lower dimension
  • Diffusion models have high expressiveness from equal-dimension transitions
  • Diffusing on a low-dimensional manifold has wide applications in graph-based representations
  • Reduced-dimension diffusion can be achieved with latent and dimension projection techniques

Latent projection

  • Project training data onto lower dimensional latent space using flow function and VAE-encoder
  • LSGM, INDM, and PDM learn smoother models in smaller space, reducing network evaluations and speeding up sampling
  • Weighting training techniques use joint training of diffusion models and projecting models based on ELBO and log-likelihood maximization

Dimension projection

  • Dimension projection reduces spatial redundancy on image manifolds
  • DVDP combines DDPM and VAE
  • Theoretical analysis of reduction scale of dimensionality and down-sampling & up-sampling steps needs to be explored

Application

  • Diffusion models have powerful ability to generate realistic samples
  • Diffusion models used in computer vision, natural language processing, and bioinformatics

Computer vision

  • CMDE outperformed vanilla conditional denoising estimator in in-painting and super-resolution tasks
  • DDRM proposed an efficient, unsupervised posterior sampling method for image restoration
  • Palette developed a unified diffusion-based framework for low-level vision tasks
  • DiffC proposed an unconditional generative approach for lossy image compression
  • RePaint replaced reverse diffusion by sampling unmasked regions using given image information

High-level vision

  • FSDM is a few-shot generation framework based on conditional diffusion models
  • CARD proposed a denoising diffusion-based conditional generative model to predict data distribution
  • GLIDE explored realistic image synthesis conditioned on text using diffusion models
  • DreamFusion extended GLIDE’s achievement into 3D space
  • LSGM built a diffusion model trained in the latent space with a variational autoencoder framework
  • VQ-Diffusion improved vector quantized diffusion by exploring classifier-free guidance sampling

3d vision

  • 33 was an early work on 3D vision tasks using diffusion
  • Diffusion process used to generate point clouds
  • 210 used diffusion for point cloud generation without shape encoders
  • 34 proposed a diffusion model for point cloud completion
  • 35 used a neural network to denoise point clouds

Video modeling

  • Video diffusion uses generative models to create videos
  • RVD, FDM, MCVD, and RaMViD are all methods of using diffusion models to generate videos

Medical application

  • Diffusion models can be applied to medical images.
  • Score-MRI proposed a diffusion-based framework for MRI reconstruction.
  • [213] provided a more flexible framework that didn’t require a paired dataset for training.
  • R2D2+ combined diffusion-based MRI reconstruction and super-resolution into the same network.
  • [215] explored the application of the generative diffusion model to medical image segmentation.

Sequential modeling

Natural language processing

  • Diffusion models are non-autoregressive
  • Diffusion-LM used diffusions to denoise noisy vectors into word vectors
  • Bit Diffusion used diffusion models to generate discrete data for image caption tasks

Time series

  • CSDI [41] used score-based diffusion models to address time series imputation.
  • SSSD [42] used structured state space models to capture long-term dependencies in time series data.

Audio

  • WaveGrad and DiffWave applied diffusion models to raw waveform generation
  • GradTTS and Diff-TTS implemented diffusion models but generated mel feature instead of raw waves
  • DiffVC challenged the one-shot many-to-many voice conversion problem
  • DiffSinger extended sound generation to singing voice synthesis based on a shallow diffusion mechanism

Ai for science

Molecular conformation generation

  • ConfGF was an early work on diffusion-based molecular conformation generation models
  • DGSM proposed to dynamically construct molecular graph structures between atoms
  • GeoDiff introduced a roto-translational invariant Markov process to impose constraints on the density
  • EDM incorporated discrete atom features and deriving the equations required for loglikelihood computation
  • Torsional diffusion operated on the space of torsional angles
  • DiffDock conducts denoised score matching on transition, rotation, and torsion angle

Material design

  • CDVAE explored the periodic structure of stable material generation
  • Diffusion-based network designed to capture specific local bonding preferences
  • Recent work developed a diffusion-based generative model to target specific antigen structures
  • Anand et al. introduced a diffusion-based generative model for protein structure and sequence
  • ProteinSGM formulated protein design as an image inpainting problem
  • DiffFolding generates protein backbone concentrating on internal angles

Conclusions & discussions

  • Diffusion model is important for many fields
  • Paper provides review of diffusion models, including theory, algorithms, and applications

Limitations & further directions

  • Diffusion models should be viewed as a class, not a branch of DDPM-based models
  • Training objectives and evaluation metrics should match the initial goal
  • Complex modeling is needed to eliminate inductive bias
  • Improvement algorithms with reduced steps should be explored

Appendix b evaluation metric b.1 inception score (is)

  • Inception score is a way to measure the diversity and resolution of generated images based on the ImageNet dataset.
  • Inception score is divided into two parts: diversity measurement and quality measurement.
  • Diversity measurement is calculated based on the class entropy of generated samples.
  • Quality measurement is computed through the similarity between a sample and the related class images using entropy.
  • KL divergence is applied to inception score calculation.

B.2 frechet inception distance (fid)

  • Inception Score is based on a specific dataset with 1000 classes and a trained network.
  • Bias between ImageNet and real-world images may cause inaccurate outcome.
  • FID is proposed to solve bias from specific reference datasets.
  • FID shows distance between real-world data distribution and generated samples using mean and covariance.

B.3 negative log likelihood (nll)

  • Negative log-likelihood is a common evaluation metric for data distribution.
  • Normalizing flow field, VAE field, and improved DDPM use NLL for evaluation.

Appendix c benchmarks

  • Benchmarks of landmark models and improved techniques are provided on CIFAR-10, ImageNet, and CelebA-64 datasets
  • Performance of LSUN, FFHQ, and MINST datasets are not presented
  • Performance is listed according to NFE in descending order
  • Different tasks such as audio diffusion, audio SDE, molecular score, molecular diffusion, protein score, and protein diffusion are discussed

C.1 benchmarks on celeba-64