Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Deep learning has potential for generation tasks due to its latent representation
Generative models can generate observations randomly
Diffusion Model is a rising class of generative models with power-generating ability
Diffusion Model has drawbacks such as slow generation process, single data types, low likelihood, and inability for dimension reduction
Improved techniques for existing problems in the diffusion-based model field include speed-up improvement, data structure diversification, likelihood optimization, and dimension reduction
Applications with diffusion models include computer vision, sequence modeling, audio, and AI for science

Paper Content

Introduction

Deep generative models have potential to create patterns humans cannot distinguish
Focus on diffusion-based generative models
Diffusion models do not require aligning posterior distributions, dealing with intractable partition functions, training additional discriminators, or imposing network constraints
Diffusion models have been used in computer vision, natural language processing, and graph analysis
Lack of systematic taxonomy and analysis of research progress on diffusion models
Diffusion models provide tractable probabilistic parameterization, stable training procedure, and unified loss function design
Diffusion models have been used in computer vision, sequence modeling, audio processing, and AI for science
Diffusion models have inherent drawback of plenty of sampling steps and long sampling time
Works aspire to accelerate diffusion process and improve sampling quality
Diffusion models improved algorithms classified into four categories: speed-up improvement, data structure diversification, likelihood optimization, and dimension reduction

•

Application of diffusion models to computer vision, natural language processing, bioinformatics, and speech processing
Domain-specialized problem formulation, related datasets, evaluation metrics, and downstream tasks, along with sets of benchmarks
Limitations of models and possible further-proof directions

Problem statement

Notions and definitions

State

States are data distributions that describe diffusion models.
Starting state 0 is the initial distribution.
Noise is injected into the starting state.
After enough steps, the distribution becomes a known noise distribution (Gaussian).
This is called the prior state.
Intermediate states are the distributions between the starting and prior states.

Process & transition kernel

Forward process transforms starting state into tractable noise
Reverse process samples noise gradients into samples as starting state
Interchange between states is achieved by transition kernel
Forward process consists of forward transition kernels
Reverse process consists of reverse transition kernels
Most frequently used kernel is Markov kernel
Variable noise scale controls randomness of process

Discrete and continuous

Discrete process contains infinite steps
Continuous process used in improved algorithms to obtain better performance
Continuous process enables extraction of information from any time state
Continuous process has better theoretical support

Training objective

Diffusion model is a type of generative model
Training objective is to keep starting and sample distributions close
Log-likelihood is maximized to achieve this
σ in reverse process differs from forward process

Problem formulation

Denoised diffusion probabilistic model

DDPM chooses a sequence of noise coefficients for Markov transition kernels following specific patterns.
DDPM forward and reverse processes are defined.
Diffusion training objective is to minimize the negative log-likelihood.

Score matching formulation

Score matching model attempts to solve data distribution estimation problem by approximating the gradient of data.
Score network is trained to predict the score.
Score matching process consists of a sequence of perturbation steps with increasing noise scales.
Gaussian perturbation kernel is used and the score is equivalent to the gradient of the perturbation kernel.
Transition kernel between two neighbor states is defined.

Score matching process:

Ddpm & dsm

Wiener process/Brownian Motion is used to describe the data distribution and probability density of the system.
The forward SDE equation has a unique solution.
The Reversed SDE Process is defined with respect to the reverse-time Stochastic Differential Equation.
The score of the system is used to match the data.

Score sde training objective:

Score SDE uses a weighting scheme in the score loss.
Score loss is compared to denoised score matching.
() and (0) are continuous time variables of and 0.

Sde-based ddpm & dsm:

DDPM and DSM transition kernel can be expressed as two continuous-time variables of discrete noise scales
SDE can be classified as Variation Preserving (VP) and Variation Explosion (VE)
Probability Flow ODE is a continuous-time ODE that supports the same marginal probability density as SDE
Probability Flow ODE can be solved with larger step sizes and no randomness

Training strategy

Denoising diffusion training strategy

Minimizing negative log-likelihood requires using 1: -1
Baye’s rule is used to parameterize the posterior
Mean and variance schedules are expressed
Reparameterizing 1-1 w.r.t 0 yields a simplified training objective
Most diffusion models use DDPMs training strategy, but there are exceptions

Score matching training strategy

Traditional score-matching techniques require a lot of computing power
Advanced methods find ways to avoid computing the Hessian
Implicit score matching (ISM) uses a non-normalized density function that can be optimized by a neural network
Sliced score matching (SSM) uses reverse-mode auto-differentiation to estimate the score
Denoised score matching (DSM) transforms the original score matching into a perturbation kernel learning by adding noise to a sequence

Sampling algorithm

Unconditional sampling is the process of rebuilding samples from random noise.
Conditional sampling is a class of sampling that utilizes specific conditions.

Unconditional sampling

Ancestral sampling

Ancestral sampling is reconstructed with the gradient of inverse Markovian step by step.
PC sampling is inspired by a type of ODE black-box ODE solver.

Langevin dynamics sampling

Conditional sampling

Labeled Condition Sampling uses gradient guidance and a classifier with UNet Encoder architecture
Labels can be text, categorical, binary, or extracted features
Unlabeled Condition Sampling uses self-information as guidance and is used in denoising, resolution, and inpainting tasks

Algorithm improvement

Diffusion models have low speed and high computation cost.
Improved algorithms are classified according to mainstream problems.

Speed-up improvement

Diffusion models have high-fidelity generation but low sampling speed.
Advanced techniques can be divided into four categories to improve sampling speed.

Training schedule

Modifying traditional training settings
Key factors in training schemes influence learning patterns and models’ performance
Training enhancement divided into three categories: knowledge distillation, diffusion scheme learning, and noise scale designing

Knowledge distillation

Knowledge distillation is a method for obtaining small-scale networks from complex teacher models
Student models benefit from model compression and acceleration
Salimans et al. applied the core idea to diffusion model improvement
Student models learn to conduct two-step updates from teacher models in one-step
Denoising student distills knowledge from scratch by minimizing KL Divergence

Diffusion scheme learning

Diffusion model encodes data onto latent spaces with the same dimension to achieve high expressiveness.
Current methods divided into projecting approaches exploration and encoding degree optimization.
Truncation conducts a trade-off between generating speed and sample fidelity.
Works focus on the diversity of diffusion kernels.

Noise scale designing

Traditional diffusion process uses noise to determine transition steps.
Noise scale design can lead to reasonable generation and fast convergence.
Existing methods treat noise scale as a learnable parameter.
Different methods use different approaches to design noise scale.

Training-free sampling

Training enhancement methods can be used to speed up sampling
Training-free methods apply pre-trained information directly to advanced sampling algorithms with fewer steps and higher fidelity
Training-free methods are divided into four categories: analytical methods, Flow-based, Unification Reformulation, and Connection

Continuous space

Analytical method

Existing training-free sampling methods use hand-crafted noise sequences.
Analytical methods optimize reverse mean and covariance for each state.
Analytical methods have a theoretical guarantee, but are limited to certain distributions.

Implicit sampler

Implicit sampler follows jump-step pattern using pre-trained diffusion model
Probability treated as Score SDE derived from discrete formulation
Implicit sampler is type of neural ODE solver
Advanced ODE solvers used, such as PNDM, edm, DEIS, gDDIM, and DPM-Solver
Dynamic programming based jump-step method for sampling optimal implicit route

Differential equation solver sampler

Differential Equation (DE) Solver Sampler minimizes approximation error during reverse sampling
Two basic DE formulations: SDE and ODE
Higher-order DE solvers have smaller approximation errors and higher order of convergence
Speed-prior and accuracy-prior methods
Semi-linear-based ODE performs the best

Dynamic programming adjustment

Dynamic programming (DP) is a technique used to find optimized solutions in a reduced time.
DP algorithms explore the optimal traversal along a trajectory, assuming each path has the same KL divergence.
Current DP-based methods have a computational cost of O 2.

Mixed-modeling

Mixed-modeling applies fast-sampling and high-expressiveness generative models in diffusion pipeline.
Mixed modeling improvement can be classified into two classes from the perspective of mixing purposes: acceleration mixture and expressiveness mixture.

Acceleration mixture

Acceleration mixture applies high-speed generation of VAEs and GANs to reduce steps in sampling data from random noise.
Two types of models generate predicted 0 with VAE and GAN.
ES-DDPM reconstructs intermediate samples as early stop technique.

Expressiveness mixture

Expressiveness mixture support diffusion models to express data or noise in different patterns
High expressiveness data combined with fast-sampling generative models to obtain mean and variance more accurately
Noise modulation, space projection, and kernel expressiveness are high expressiveness methods
Reformulation problems unify diffusion pipeline based on one or two variables
Connection problems link score and diffusion frameworks to extend them into a higher view

Data structure diversification

Diffusion methods are mostly used for image generation tasks.
Diffusion mechanism has been used in inter-disciplinary tasks with different data types.
Traditional diffusion patterns are expected to be extended for universal use.

Non-linear space

Existing methods handle linear perturbations
Non-linear space has effects on low-level vision tasks
Kawar et al. and DPS use pseudo-inverse operator and posterior sampling approximation to solve JPEG artifact correction, image deblurring, and phase retrieval

Image & point cloud

Luo et al. proposed a method for generating point cloud data
Other techniques have been developed to generate and complete 3D shapes
Improvements have been made to latent space transformation, such as canonical map, condition feature extraction sub-nets, and point-voxel representation

Latent space

Expressiveness mixture modeling is used to process latent space data distributions for diffusion applications.
Current methods project data into continuous space, with the help of EDM and antigen-diffusion models.
Latent space processing can be beneficial in new application fields.

Function

Traditional diffusion processes are limited for some tasks
Dutordoir et al. proposed a diffusion model that samples from the function space
This model captures multi-dimensional distributions by sampling from joint posteriors

Others

Score-flow uses a flow function to project RGB images into dequantization space
Cold diffusion proposes algorithms for projecting data into random distributions with the support of reconstructing correction

Discrete space

Deep generative models have achieved success in natural language processing, multimodal learning, and AI for science.
Processing discrete data such as sentences, residue, atom, and vector-quantized data is necessary to eliminate inductive bias.
Diffusion models are a promising approach for relevant tasks.
Main problem is divided into processing text & categorical data, and vector-quantized data.

Text & categorical

D3PM uses diffusion algorithm to process categorical features
Multi-nomial diffusion and ARDM extend categorical diffusion to multi-nomial data

Vector-quantized

Vector-quantized (VQ) data is proposed to combine data from different fields into the codebook.
VQ data processing achieved great performance in autoregressive encoders.

Constrained space

Graph-based neural networks can be used to analyze data such as social networks, molecular data, and weather conditions.
Manifold learning methods can be used to non-redundantly express and portray data such as proteins and RNA.

Manifold space

Data structures such as images and video are defined in Euclidean space.
Data in robotics, geoscience, and protein modeling are defined in Riemannian manifold.
Current methods for Euclidean space cannot capture Riemann feature.
Recent methods applied diffusion sampling into Riemannian manifold.
Theoretical works provide comprehensive support for manifold sampling.

Graph

Graph-based neural networks are popular due to their high expressiveness.
Diffusion theories are used to process graph data.

Likelihood optimization

Variational methods, diffusion methods, and other methods use the principle of variational evidence lower bound (ELBO) to train models.
Solutions to the likelihood optimization problem can be divided into two classes - improved ELBO and variational gap optimization.

Improved elbo

Score connection

Score connection methods provide a connection between ELBO optimization and score matching.
Score-flow treats the forward KL divergence in ELBO as optimizing a score-matching loss.
Huang et al. treated Brownian motion as a latent variable to track the loglikelihood estimation.
Analytic-DPM and NCSN++ enhance ELBO by analyzing the KL Divergence and introducing a truncation factor.

Re-design

Loss transformation techniques are compared to re-Design methods.
Re-Design methods directly tighten the ELBO.
VDM and DDPM++ optimize ELBO by finding optimal factors.
Improved DDPM and D3PM propose hybrid loss functions based on ELBO.

Variational gap optimization

Minimizing the variational gap is an approach to maximize loglikelihood.
INDM (120) is successful in the VAE field.

Dimension reduction

Variational auto-encoder projects data into a lower dimension
Diffusion models have high expressiveness from equal-dimension transitions
Diffusing on a low-dimensional manifold has wide applications in graph-based representations
Reduced-dimension diffusion can be achieved with latent and dimension projection techniques

Latent projection

Project training data onto lower dimensional latent space using flow function and VAE-encoder
LSGM, INDM, and PDM learn smoother models in smaller space, reducing network evaluations and speeding up sampling
Weighting training techniques use joint training of diffusion models and projecting models based on ELBO and log-likelihood maximization

Dimension projection

Dimension projection reduces spatial redundancy on image manifolds
DVDP combines DDPM and VAE
Theoretical analysis of reduction scale of dimensionality and down-sampling & up-sampling steps needs to be explored

Application

Diffusion models have powerful ability to generate realistic samples
Diffusion models used in computer vision, natural language processing, and bioinformatics

Computer vision

CMDE outperformed vanilla conditional denoising estimator in in-painting and super-resolution tasks
DDRM proposed an efficient, unsupervised posterior sampling method for image restoration
Palette developed a unified diffusion-based framework for low-level vision tasks
DiffC proposed an unconditional generative approach for lossy image compression
RePaint replaced reverse diffusion by sampling unmasked regions using given image information

High-level vision

FSDM is a few-shot generation framework based on conditional diffusion models
CARD proposed a denoising diffusion-based conditional generative model to predict data distribution
GLIDE explored realistic image synthesis conditioned on text using diffusion models
DreamFusion extended GLIDE’s achievement into 3D space
LSGM built a diffusion model trained in the latent space with a variational autoencoder framework
VQ-Diffusion improved vector quantized diffusion by exploring classifier-free guidance sampling

3d vision

33 was an early work on 3D vision tasks using diffusion
Diffusion process used to generate point clouds
210 used diffusion for point cloud generation without shape encoders
34 proposed a diffusion model for point cloud completion
35 used a neural network to denoise point clouds

Video modeling

Video diffusion uses generative models to create videos
RVD, FDM, MCVD, and RaMViD are all methods of using diffusion models to generate videos

Medical application

Diffusion models can be applied to medical images.
Score-MRI proposed a diffusion-based framework for MRI reconstruction.
[213] provided a more flexible framework that didn’t require a paired dataset for training.
R2D2+ combined diffusion-based MRI reconstruction and super-resolution into the same network.
[215] explored the application of the generative diffusion model to medical image segmentation.

Sequential modeling

Natural language processing

Diffusion models are non-autoregressive
Diffusion-LM used diffusions to denoise noisy vectors into word vectors
Bit Diffusion used diffusion models to generate discrete data for image caption tasks

Time series

CSDI [41] used score-based diffusion models to address time series imputation.
SSSD [42] used structured state space models to capture long-term dependencies in time series data.

Audio

WaveGrad and DiffWave applied diffusion models to raw waveform generation
GradTTS and Diff-TTS implemented diffusion models but generated mel feature instead of raw waves
DiffVC challenged the one-shot many-to-many voice conversion problem
DiffSinger extended sound generation to singing voice synthesis based on a shallow diffusion mechanism

Ai for science

Molecular conformation generation

ConfGF was an early work on diffusion-based molecular conformation generation models
DGSM proposed to dynamically construct molecular graph structures between atoms
GeoDiff introduced a roto-translational invariant Markov process to impose constraints on the density
EDM incorporated discrete atom features and deriving the equations required for loglikelihood computation
Torsional diffusion operated on the space of torsional angles
DiffDock conducts denoised score matching on transition, rotation, and torsion angle

Material design

CDVAE explored the periodic structure of stable material generation
Diffusion-based network designed to capture specific local bonding preferences
Recent work developed a diffusion-based generative model to target specific antigen structures
Anand et al. introduced a diffusion-based generative model for protein structure and sequence
ProteinSGM formulated protein design as an image inpainting problem
DiffFolding generates protein backbone concentrating on internal angles

Conclusions & discussions

Diffusion model is important for many fields
Paper provides review of diffusion models, including theory, algorithms, and applications

Limitations & further directions

Diffusion models should be viewed as a class, not a branch of DDPM-based models
Training objectives and evaluation metrics should match the initial goal
Complex modeling is needed to eliminate inductive bias
Improvement algorithms with reduced steps should be explored

Appendix b evaluation metric b.1 inception score (is)

Inception score is a way to measure the diversity and resolution of generated images based on the ImageNet dataset.
Inception score is divided into two parts: diversity measurement and quality measurement.
Diversity measurement is calculated based on the class entropy of generated samples.
Quality measurement is computed through the similarity between a sample and the related class images using entropy.
KL divergence is applied to inception score calculation.

B.2 frechet inception distance (fid)

Inception Score is based on a specific dataset with 1000 classes and a trained network.
Bias between ImageNet and real-world images may cause inaccurate outcome.
FID is proposed to solve bias from specific reference datasets.
FID shows distance between real-world data distribution and generated samples using mean and covariance.

B.3 negative log likelihood (nll)

Negative log-likelihood is a common evaluation metric for data distribution.
Normalizing flow field, VAE field, and improved DDPM use NLL for evaluation.

Appendix c benchmarks

Benchmarks of landmark models and improved techniques are provided on CIFAR-10, ImageNet, and CelebA-64 datasets
Performance of LSUN, FFHQ, and MINST datasets are not presented
Performance is listed according to NFE in descending order
Different tasks such as audio diffusion, audio SDE, molecular score, molecular diffusion, protein score, and protein diffusion are discussed

Link to paper#

Abstract#

Paper Content#

Introduction#

•#

Problem statement#

Notions and definitions#

State#

Process & transition kernel#

Discrete and continuous#

Training objective#

Problem formulation#

Denoised diffusion probabilistic model#

Score matching formulation#

Score matching process:#

Ddpm & dsm#

Score sde training objective:#

Sde-based ddpm & dsm:#

Training strategy#

Denoising diffusion training strategy#

Score matching training strategy#

Sampling algorithm#

Unconditional sampling#

Ancestral sampling#

Langevin dynamics sampling#

Conditional sampling#

Algorithm improvement#

Speed-up improvement#

Training schedule#

Knowledge distillation#

Diffusion scheme learning#

Noise scale designing#

Training-free sampling#

Continuous space#

Analytical method#

Implicit sampler#

Differential equation solver sampler#

Dynamic programming adjustment#

Mixed-modeling#

Acceleration mixture#

Expressiveness mixture#

Data structure diversification#

Non-linear space#

Image & point cloud#

Latent space#

Function#

Others#

Discrete space#

Text & categorical#

Vector-quantized#

Constrained space#

Manifold space#

Graph#

Likelihood optimization#

Improved elbo#

Score connection#

Re-design#

Variational gap optimization#

Dimension reduction#

Latent projection#

Dimension projection#

Application#

Computer vision#

High-level vision#

3d vision#

Video modeling#

Medical application#

Sequential modeling#

Natural language processing#

Time series#

Audio#

Ai for science#

Molecular conformation generation#

Material design#

Conclusions & discussions#

Limitations & further directions#

Appendix b evaluation metric b.1 inception score (is)#

B.2 frechet inception distance (fid)#

B.3 negative log likelihood (nll)#

Appendix c benchmarks#

Link to paper

Abstract

Paper Content

Introduction

•

Problem statement

Notions and definitions

State

Process & transition kernel

Discrete and continuous

Training objective

Problem formulation

Denoised diffusion probabilistic model

Score matching formulation

Score matching process:

Ddpm & dsm

Score sde training objective:

Sde-based ddpm & dsm:

Training strategy

Denoising diffusion training strategy

Score matching training strategy

Sampling algorithm

Unconditional sampling

Ancestral sampling

Langevin dynamics sampling

Conditional sampling

Algorithm improvement

Speed-up improvement

Training schedule

Knowledge distillation

Diffusion scheme learning

Noise scale designing

Training-free sampling

Continuous space

Analytical method

Implicit sampler

Differential equation solver sampler

Dynamic programming adjustment

Mixed-modeling

Acceleration mixture

Expressiveness mixture

Data structure diversification

Non-linear space

Image & point cloud

Latent space

Function

Others

Discrete space

Text & categorical

Vector-quantized

Constrained space

Manifold space

Graph

Likelihood optimization

Improved elbo

Score connection

Re-design

Variational gap optimization

Dimension reduction

Latent projection

Dimension projection

Application

Computer vision

High-level vision

3d vision

Video modeling

Medical application

Sequential modeling

Natural language processing

Time series

Audio

Ai for science

Molecular conformation generation

Material design

Conclusions & discussions

Limitations & further directions

Appendix b evaluation metric b.1 inception score (is)

B.2 frechet inception distance (fid)

B.3 negative log likelihood (nll)

Appendix c benchmarks