Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Deep learning has potential for generation tasks due to its latent representation
- Generative models can generate observations randomly
- Diffusion models are a rising class of generative models with powerful generating ability
- Diffusion models have drawbacks such as a slow generation process, support for only a single data type, low likelihood, and an inability to perform dimension reduction
- Improved techniques for existing problems in the diffusion-based model field include speed-up improvement, data structure diversification, likelihood optimization, and dimension reduction
- Applications with diffusion models include computer vision, sequence modeling, audio, and AI for science
Paper Content
Introduction
- Deep generative models have potential to create patterns humans cannot distinguish
- Focus on diffusion-based generative models
- Diffusion models do not require aligning posterior distributions, dealing with intractable partition functions, training additional discriminators, or imposing network constraints
- Diffusion models have been used in computer vision, natural language processing, and graph analysis
- Lack of systematic taxonomy and analysis of research progress on diffusion models
- Diffusion models provide tractable probabilistic parameterization, stable training procedure, and unified loss function design
- Diffusion models have been used in computer vision, sequence modeling, audio processing, and AI for science
- Diffusion models have the inherent drawback of requiring many sampling steps and a long sampling time
- Many works aim to accelerate the diffusion process and improve sampling quality
- Improved algorithms for diffusion models are classified into four categories: speed-up improvement, data structure diversification, likelihood optimization, and dimension reduction
- Application of diffusion models to computer vision, natural language processing, bioinformatics, and speech processing
- Domain-specialized problem formulation, related datasets, evaluation metrics, and downstream tasks, along with sets of benchmarks
- Limitations of the models and possible future directions
Problem statement
Notions and definitions
State
- States are data distributions that describe diffusion models.
- Starting state 0 is the initial distribution.
- Noise is injected into the starting state.
- After enough steps, the distribution becomes a known noise distribution (Gaussian).
- This is called the prior state.
- Intermediate states are the distributions between the starting and prior states.
Process & transition kernel
- Forward process transforms starting state into tractable noise
- Reverse process gradually denoises the tractable noise back into samples from the starting state
- Interchange between states is achieved by transition kernel
- Forward process consists of forward transition kernels
- Reverse process consists of reverse transition kernels
- Most frequently used kernel is Markov kernel
- Variable noise scale controls randomness of process
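As a minimal sketch of a forward Markov transition kernel, assuming the Gaussian form used by DDPM: the function name `forward_step`, the constant noise scale of 0.02, and the toy starting distribution are illustrative choices, not taken from the paper.

```python
import numpy as np

def forward_step(x_prev, beta, rng):
    """One Gaussian Markov transition: x_t ~ N(sqrt(1 - beta) * x_prev, beta * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * noise

rng = np.random.default_rng(0)
x = rng.uniform(2.0, 3.0, size=10_000)   # toy starting state, clearly non-Gaussian
for _ in range(500):                       # repeated noise injection (forward process)
    x = forward_step(x, beta=0.02, rng=rng)

# After enough steps the samples should be close to the prior state N(0, 1).
print(float(x.mean()), float(x.std()))
```

The noise scale `beta` controls how quickly the starting state is forgotten: larger values reach the prior state in fewer steps.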
Discrete and continuous
- Discrete process contains a finite number of steps; the continuous process is its limit with infinitely many steps
- Continuous process used in improved algorithms to obtain better performance
- Continuous process enables extraction of information from any time state
- Continuous process has better theoretical support
Training objective
- Diffusion model is a type of generative model
- Training objective is to keep starting and sample distributions close
- Log-likelihood is maximized to achieve this
- σ in reverse process differs from forward process
Problem formulation
Denoising diffusion probabilistic model (DDPM)
- DDPM chooses a sequence of noise coefficients for Markov transition kernels following specific patterns.
- DDPM forward and reverse processes are defined.
- Diffusion training objective is to minimize the negative log-likelihood.
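For reference, the DDPM processes summarized above are usually written as follows. This uses the standard notation of the original DDPM paper (noise coefficients $\beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$), which is assumed here and may differ from the survey's own symbols.

```latex
% Forward transition kernel and its closed form
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right)

% Learned reverse transition kernel
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)

% Variational bound on the negative log-likelihood
\mathbb{E}\left[-\log p_\theta(x_0)\right]
\le \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]
```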
Score matching formulation
- Score matching model attempts to solve the data distribution estimation problem by approximating the gradient of the log data density (the score).
- Score network is trained to predict the score.
- Score matching process consists of a sequence of perturbation steps with increasing noise scales.
- Gaussian perturbation kernel is used and the score is equivalent to the gradient of the log of the perturbation kernel.
- Transition kernel between two neighbor states is defined.
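The link between a Gaussian perturbation kernel and its score can be checked numerically. The sketch below assumes an isotropic kernel N(x_t; x_0, sigma^2 I); the helper names and the test values are hypothetical.

```python
import numpy as np

def log_kernel(x_t, x_0, sigma):
    """Log-density of the Gaussian perturbation kernel N(x_t; x_0, sigma^2 I)."""
    d = x_t.size
    return -np.sum((x_t - x_0) ** 2) / (2 * sigma**2) - 0.5 * d * np.log(2 * np.pi * sigma**2)

def kernel_score(x_t, x_0, sigma):
    """Closed-form gradient of log_kernel with respect to x_t."""
    return -(x_t - x_0) / sigma**2

def numerical_grad(f, x, h=1e-5):
    """Central finite-difference gradient, used only as a sanity check."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

x_0 = np.array([0.5, -1.0, 2.0])
x_t = np.array([0.7, -0.4, 1.1])
sigma = 0.8
analytic = kernel_score(x_t, x_0, sigma)
numeric = numerical_grad(lambda z: log_kernel(z, x_0, sigma), x_t)
```

A score network is trained to reproduce `analytic` from `x_t` alone, without access to `x_0`.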
Score matching process:
DDPM & DSM
- Wiener process/Brownian Motion is used to describe the data distribution and probability density of the system.
- The forward SDE equation has a unique solution.
- The Reversed SDE Process is defined with respect to the reverse-time Stochastic Differential Equation.
- The score of the system is used to match the data.
Score SDE training objective:
- Score SDE uses a weighting scheme in the score loss.
- Score loss is compared to denoising score matching.
- x(t) and x(0) are the continuous-time counterparts of x_t and x_0.
SDE-based DDPM & DSM:
- DDPM and DSM transition kernel can be expressed as two continuous-time variables of discrete noise scales
- SDE can be classified as Variance Preserving (VP) and Variance Exploding (VE)
- Probability Flow ODE is a continuous-time ODE that shares the same marginal probability density as the SDE
- Probability Flow ODE can be solved with larger step sizes and no randomness
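The equations behind this section are commonly written as below. The notation follows the Score SDE literature (drift $f$, diffusion coefficient $g$, Wiener process $w$, reverse-time Wiener process $\bar{w}$) and is assumed here rather than copied from the survey.

```latex
% Forward SDE
dx = f(x, t)\,dt + g(t)\,dw

% Reverse-time SDE
dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w}

% Probability flow ODE (same marginals p_t, no randomness)
dx = \left[f(x, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)\right]dt

% VP SDE (continuous-time DDPM) and VE SDE (continuous-time DSM)
dx = -\tfrac{1}{2}\,\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw,
\qquad
dx = \sqrt{\frac{d\,[\sigma^2(t)]}{dt}}\,dw
```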
Training strategy
Denoising diffusion training strategy
- Minimizing the negative log-likelihood reduces to training on the L_{1:T-1} terms of the variational bound
- Bayes' rule is used to parameterize the posterior
- Mean and variance schedules are expressed
- Reparameterizing L_{1:T-1} w.r.t. x_0 yields a simplified training objective
- Most diffusion models use the DDPM training strategy, but there are exceptions
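A simplified objective of this kind is often implemented as a noise-prediction loss. The sketch below is one hedged reading of it: the linear schedule, the function name `simple_loss`, and both stand-in "models" are assumptions for illustration. For toy data drawn from N(0, I), the optimal noise predictor is known in closed form and is used as the oracle.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)

def simple_loss(eps_model, x0, rng):
    """Monte Carlo estimate of E ||eps - eps_model(x_t, t)||^2 over random t and eps."""
    t = rng.integers(0, T, size=x0.shape[0])
    eps = rng.standard_normal(x0.shape)
    a = alpha_bars[t][:, None]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps   # closed-form forward sample
    return float(np.mean(np.sum((eps - eps_model(x_t, t)) ** 2, axis=1)))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4096, 2))       # toy data ~ N(0, I)

# Untrained stand-in that always predicts zero noise.
zero_model = lambda x_t, t: np.zeros_like(x_t)
# Closed-form optimum for N(0, I) data: E[eps | x_t] = sqrt(1 - alpha_bar_t) * x_t.
oracle_model = lambda x_t, t: np.sqrt(1.0 - alpha_bars[t])[:, None] * x_t

loss_zero = simple_loss(zero_model, x0, rng)
loss_oracle = simple_loss(oracle_model, x0, rng)
```

In practice `eps_model` would be a neural network and the loss would be minimized by gradient descent; the oracle only shows that a better noise predictor gives a lower loss.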
Score matching training strategy
- Traditional score-matching techniques require a lot of computing power
- Advanced methods find ways to avoid computing the Hessian
- Implicit score matching (ISM) uses a non-normalized density function that can be optimized by a neural network
- Sliced score matching (SSM) uses reverse-mode auto-differentiation to estimate the score
- Denoising score matching (DSM) transforms the original score matching problem into learning the score of a perturbation kernel by adding a sequence of noise to the data
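To illustrate why DSM avoids the Hessian, here is a minimal sketch for a single noise scale. The regression target is the score of the Gaussian perturbation kernel, so only function evaluations are needed. The toy data, the name `dsm_loss`, and the two candidate score functions are assumptions for the demo.

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Denoising score matching loss for one noise scale sigma."""
    noise = rng.standard_normal(x.shape)
    x_tilde = x + sigma * noise
    target = -(x_tilde - x) / sigma**2        # score of the perturbation kernel
    return float(np.mean((score_fn(x_tilde) - target) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)              # toy data ~ N(0, 1)
sigma = 1.0

true_score = lambda z: -z / (1.0 + sigma**2)  # score of the perturbed density N(0, 1 + sigma^2)
naive_score = lambda z: -z                    # score of the clean data, ignores the noise

loss_true = dsm_loss(true_score, x, sigma, rng)
loss_naive = dsm_loss(naive_score, x, sigma, rng)
```

The minimizer of the DSM loss is the score of the perturbed distribution, which is why `true_score` attains the lower loss here.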
Sampling algorithm
- Unconditional sampling is the process of rebuilding samples from random noise.
- Conditional sampling is a class of sampling that utilizes specific conditions.
Unconditional sampling
Ancestral sampling
- Ancestral sampling reconstructs samples step by step by following the reverse Markov chain.
- Predictor-corrector (PC) sampling is inspired by a type of black-box ODE solver.
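A minimal sketch of ancestral sampling, assuming the DDPM update rule with variance beta_t. To keep it self-contained, the trained network is replaced by a closed-form oracle for toy data concentrated on the two points +2 and -2; the schedule, the oracle, and all names are hypothetical.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def oracle_eps(x_t, t, m=2.0):
    """Exact E[eps | x_t] when the data are the points +m and -m with equal mass."""
    a = alpha_bars[t]
    x0_mean = m * np.tanh(m * np.sqrt(a) * x_t / (1.0 - a))
    return (x_t - np.sqrt(a) * x0_mean) / np.sqrt(1.0 - a)

def ancestral_sample(n, rng):
    x = rng.standard_normal(n)                # start from the prior state
    for t in range(T - 1, -1, -1):            # walk the reverse Markov chain
        eps = oracle_eps(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(n) if t > 0 else 0.0   # no noise on the last step
        x = mean + np.sqrt(betas[t]) * noise
    return x

samples = ancestral_sample(2000, np.random.default_rng(0))
```

With a real model, `oracle_eps` would be replaced by the trained noise-prediction network.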
Langevin dynamics sampling
Conditional sampling
- Labeled Condition Sampling uses gradient guidance and a classifier with UNet Encoder architecture
- Labels can be text, categorical, binary, or extracted features
- Unlabeled Condition Sampling uses self-information as guidance and is used in denoising, super-resolution, and inpainting tasks
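Gradient guidance with a classifier can be sketched as adding the classifier's gradient to the unconditional score. In the example below both terms are simple Gaussians so the result can be checked against the exact posterior score; `guided_score`, the `scale` parameter, and the Gaussian stand-in for the classifier are assumptions, not the architecture described in the paper.

```python
import numpy as np

def guided_score(x, y, prior_score, classifier_grad, scale=1.0):
    """Conditional score: grad log p(x | y) = grad log p(x) + scale * grad log p(y | x)."""
    return prior_score(x) + scale * classifier_grad(x, y)

# Toy setup: x ~ N(0, 1) and a Gaussian "classifier" p(y | x) = N(y; x, 1).
prior_score = lambda x: -x
classifier_grad = lambda x, y: (y - x)

x = np.linspace(-2.0, 2.0, 9)
y = 1.5
guided = guided_score(x, y, prior_score, classifier_grad)

# Exact posterior is N(y / 2, 1 / 2), whose score is -(x - y / 2) / 0.5.
exact = -(x - y / 2.0) / 0.5
```

Setting `scale` above 1 sharpens the conditioning at the cost of sample diversity.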
Algorithm improvement
- Diffusion models have low speed and high computation cost.
- Improved algorithms are classified according to mainstream problems.
Speed-up improvement
- Diffusion models have high-fidelity generation but low sampling speed.
- Advanced techniques can be divided into four categories to improve sampling speed.
Training schedule
- Modifying traditional training settings
- Key factors in training schemes influence learning patterns and models’ performance
- Training enhancement divided into three categories: knowledge distillation, diffusion scheme learning, and noise scale designing
Knowledge distillation
- Knowledge distillation is a method for obtaining small-scale networks from complex teacher models
- Student models benefit from model compression and acceleration
- Salimans et al. applied the core idea to diffusion model improvement
- Student models learn to perform the teacher model's two-step update in a single step
- Denoising student distills knowledge from scratch by minimizing KL Divergence
Diffusion scheme learning
- Diffusion model encodes data onto latent spaces with the same dimension to achieve high expressiveness.
- Current methods are divided into exploring projection approaches and optimizing the degree of encoding.
- Truncation conducts a trade-off between generating speed and sample fidelity.
- Works focus on the diversity of diffusion kernels.
Noise scale designing
- Traditional diffusion process uses noise to determine transition steps.
- Noise scale design can lead to reasonable generation and fast convergence.
- Existing methods treat noise scale as a learnable parameter.
- Different methods use different approaches to design noise scale.
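For context, two widely used hand-designed noise schedules are sketched below: a linear schedule and a cosine schedule. These are fixed baselines, not the learnable schemes mentioned above, and the default constants are common choices assumed here.

```python
import numpy as np

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Noise scales that grow linearly with the step index."""
    return np.linspace(beta_start, beta_end, T)

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Noise scales derived from a squared-cosine curve for the cumulative signal level."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

T = 1000
lin = linear_betas(T)
cos = cosine_betas(T)
# Cumulative signal level: how much of the starting state survives after t steps.
lin_signal = np.cumprod(1.0 - lin)
cos_signal = np.cumprod(1.0 - cos)
```

Comparing `lin_signal` and `cos_signal` shows how the schedule changes the pace at which information is destroyed.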
Training-free sampling
- Training enhancement methods can be used to speed up sampling
- Training-free methods apply pre-trained information directly to advanced sampling algorithms with fewer steps and higher fidelity
- Training-free methods are divided into four categories: analytical methods, Flow-based, Unification Reformulation, and Connection
Continuous space
Analytical method
- Existing training-free sampling methods use hand-crafted noise sequences.
- Analytical methods optimize reverse mean and covariance for each state.
- Analytical methods have a theoretical guarantee, but are limited to certain distributions.
Implicit sampler
- Implicit sampler follows jump-step pattern using pre-trained diffusion model
- Probability treated as Score SDE derived from discrete formulation
- Implicit sampler is type of neural ODE solver
- Advanced ODE solvers used, such as PNDM, EDM, DEIS, gDDIM, and DPM-Solver
- Dynamic programming based jump-step method for sampling optimal implicit route
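The jump-step pattern can be sketched with a deterministic DDIM-style update that visits only a subset of the training steps. The pre-trained model is replaced here by a closed-form noise predictor for toy Gaussian data N(MU, S^2); the schedule, the constants, and the names are assumptions for the demo.

```python
import numpy as np

T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))   # assumed linear schedule
MU, S = 3.0, 0.5                                            # toy data distribution N(MU, S^2)

def oracle_eps(x_t, t):
    """Closed-form E[eps | x_t] for Gaussian data, standing in for a trained network."""
    a = alpha_bars[t]
    return np.sqrt(1.0 - a) * (x_t - np.sqrt(a) * MU) / (a * S**2 + 1.0 - a)

def ddim_sample(n, num_steps, rng):
    ts = np.linspace(T - 1, 0, num_steps).astype(int)       # jump over most steps
    x = rng.standard_normal(n)
    for i, t in enumerate(ts):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[ts[i + 1]] if i + 1 < num_steps else 1.0
        eps = oracle_eps(x, t)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)      # predicted clean sample
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps  # deterministic jump
    return x

samples = ddim_sample(5000, num_steps=20, rng=np.random.default_rng(0))
```

Only 20 network evaluations are used instead of 1000; the discretization error shows up mainly in the spread of the samples rather than in their mean.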
Differential equation solver sampler
- Differential Equation (DE) Solver Sampler minimizes approximation error during reverse sampling
- Two basic DE formulations: SDE and ODE
- Higher-order DE solvers have smaller approximation errors and higher order of convergence
- Speed-prior and accuracy-prior methods
- Semi-linear-based ODE performs the best
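The effect of solver order can be seen on a toy probability flow ODE with a known solution. The sketch assumes a variance-exploding process with noise level sigma(t) = t and Gaussian data, so the score and the exact trajectory are available in closed form; it compares a first-order Euler solver with a second-order Heun solver on the same time grid.

```python
import numpy as np

SIGMA_DATA = 1.0   # assumed standard deviation of the toy Gaussian data

def ode_rhs(x, t):
    """dx/dt = -t * score(x, t), with score(x, t) = -x / (SIGMA_DATA^2 + t^2)."""
    return t * x / (SIGMA_DATA**2 + t**2)

def solve(x, ts, second_order):
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        h = t_next - t_cur
        d_cur = ode_rhs(x, t_cur)
        x_euler = x + h * d_cur                       # first-order (Euler) proposal
        if second_order:
            # Heun correction: average the slopes at both ends of the step.
            x = x + 0.5 * h * (d_cur + ode_rhs(x_euler, t_next))
        else:
            x = x_euler
    return x

ts = np.linspace(10.0, 0.0, 11)                       # 10 steps from t = 10 down to t = 0
x_T = 3.0
exact = x_T * np.sqrt((SIGMA_DATA**2 + ts[-1]**2) / (SIGMA_DATA**2 + ts[0]**2))
err_euler = abs(solve(x_T, ts, second_order=False) - exact)
err_heun = abs(solve(x_T, ts, second_order=True) - exact)
```

The Heun solver costs roughly two function evaluations per step, which is the usual speed versus accuracy trade-off between solver orders.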
Dynamic programming adjustment
- Dynamic programming (DP) is a technique used to find optimized solutions in a reduced time.
- DP algorithms explore the optimal traversal along a trajectory, assuming each path has the same KL divergence.
- Current DP-based methods have a computational cost of O(T^2).
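A generic version of such a dynamic program is sketched below: given a table of costs for jumping directly between any two steps, it finds the cheapest path with a fixed number of jumps. The quadratic cost table is a placeholder assumption; in the surveyed methods the costs would come from terms of the training objective.

```python
import numpy as np

def best_path(cost, K):
    """Cheapest path from step 0 to step T with exactly K jumps; cost[i, j] is the price of i -> j."""
    T = cost.shape[0] - 1
    dp = np.full((K + 1, T + 1), np.inf)      # dp[k, j]: best cost to reach j with k jumps
    parent = np.zeros((K + 1, T + 1), dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, T + 1):
            candidates = dp[k - 1, :j] + cost[:j, j]
            parent[k, j] = int(np.argmin(candidates))
            dp[k, j] = candidates[parent[k, j]]
    path = [T]
    for k in range(K, 0, -1):                 # backtrack through the stored parents
        path.append(int(parent[k, path[-1]]))
    return path[::-1], float(dp[K, T])

idx = np.arange(7)
cost = ((idx[None, :] - idx[:, None]) ** 2).astype(float)   # placeholder: squared jump length
path, total = best_path(cost, K=3)
```

Each of the K layers scans all pairs of steps, which is where the quadratic dependence on the number of steps comes from.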
Mixed-modeling
- Mixed-modeling applies fast-sampling and high-expressiveness generative models in diffusion pipeline.
- Mixed modeling improvement can be classified into two classes from the perspective of mixing purposes: acceleration mixture and expressiveness mixture.
Acceleration mixture
- Acceleration mixture applies high-speed generation of VAEs and GANs to reduce steps in sampling data from random noise.
- Two types of models generate the predicted x_0 with a VAE and a GAN.
- ES-DDPM reconstructs intermediate samples as early stop technique.
Expressiveness mixture
- Expressiveness mixture supports diffusion models in expressing data or noise in different patterns
- High expressiveness data combined with fast-sampling generative models to obtain mean and variance more accurately
- Noise modulation, space projection, and kernel expressiveness are high expressiveness methods
- Reformulation problems unify diffusion pipeline based on one or two variables
- Connection problems link score and diffusion frameworks to extend them into a higher view
Data structure diversification
- Diffusion methods are mostly used for image generation tasks.
- Diffusion mechanism has been used in inter-disciplinary tasks with different data types.
- Traditional diffusion patterns are expected to be extended for universal use.
Non-linear space
- Existing methods handle linear perturbations
- Non-linear space has effects on low-level vision tasks
- Kawar et al. and DPS use pseudo-inverse operator and posterior sampling approximation to solve JPEG artifact correction, image deblurring, and phase retrieval
Image & point cloud
- Luo et al. proposed a method for generating point cloud data
- Other techniques have been developed to generate and complete 3D shapes
- Improvements have been made to latent space transformation, such as canonical map, condition feature extraction sub-nets, and point-voxel representation
Latent space
- Expressiveness mixture modeling is used to process latent space data distributions for diffusion applications.
- Current methods project data into continuous space, with the help of EDM and antigen-diffusion models.
- Latent space processing can be beneficial in new application fields.
Function
- Traditional diffusion processes are limited for some tasks
- Dutordoir et al. proposed a diffusion model that samples from the function space
- This model captures multi-dimensional distributions by sampling from joint posteriors
Others
- Score-flow uses a flow function to project RGB images into dequantization space
- Cold diffusion proposes algorithms for projecting data into random distributions with the support of reconstructing correction
Discrete space
- Deep generative models have achieved success in natural language processing, multimodal learning, and AI for science.
- Processing discrete data such as sentences, residue, atom, and vector-quantized data is necessary to eliminate inductive bias.
- Diffusion models are a promising approach for relevant tasks.
- Main problem is divided into processing text & categorical data, and vector-quantized data.
Text & categorical
- D3PM uses diffusion algorithm to process categorical features
- Multinomial diffusion and ARDM extend categorical diffusion to multinomial data
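A minimal sketch of a categorical forward process, assuming a uniform transition matrix (with probability beta the category is resampled uniformly). The number of categories, the schedule, and the function name are illustrative assumptions.

```python
import numpy as np

def uniform_transition(K, beta):
    """Row-stochastic matrix: keep the category with prob. 1 - beta, else resample uniformly."""
    return (1.0 - beta) * np.eye(K) + beta / K * np.ones((K, K))

K, T = 5, 200
Q_bar = np.eye(K)                              # cumulative transition matrix
for beta in np.linspace(0.01, 0.2, T):
    Q_bar = Q_bar @ uniform_transition(K, beta)

x0 = np.eye(K)[2]                              # one-hot starting category
probs_T = x0 @ Q_bar                           # distribution of x_T given x_0
```

After enough steps `probs_T` no longer depends on the starting category, which plays the role of the prior state for discrete data.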
Vector-quantized
- Vector-quantized (VQ) data is proposed to combine data from different fields into the codebook.
- VQ data processing achieved great performance in autoregressive encoders.
Constrained space
- Graph-based neural networks can be used to analyze data such as social networks, molecular data, and weather conditions.
- Manifold learning methods can be used to non-redundantly express and portray data such as proteins and RNA.
Manifold space
- Data structures such as images and video are defined in Euclidean space.
- Data in robotics, geoscience, and protein modeling are defined on Riemannian manifolds.
- Current methods for Euclidean space cannot capture Riemannian features.
- Recent methods apply diffusion sampling to Riemannian manifolds.
- Theoretical works provide comprehensive support for manifold sampling.
Graph
- Graph-based neural networks are popular due to their high expressiveness.
- Diffusion theories are used to process graph data.
Likelihood optimization
- Variational methods, diffusion methods, and other methods use the principle of variational evidence lower bound (ELBO) to train models.
- Solutions to the likelihood optimization problem can be divided into two classes - improved ELBO and variational gap optimization.
Improved ELBO
Score connection
- Score connection methods provide a connection between ELBO optimization and score matching.
- Score-flow treats the forward KL divergence in ELBO as optimizing a score-matching loss.
- Huang et al. treated Brownian motion as a latent variable to track the log-likelihood estimation.
- Analytic-DPM and NCSN++ enhance ELBO by analyzing the KL Divergence and introducing a truncation factor.
Re-design
- Loss transformation techniques are compared to re-design methods.
- Re-design methods directly tighten the ELBO.
- VDM and DDPM++ optimize ELBO by finding optimal factors.
- Improved DDPM and D3PM propose hybrid loss functions based on ELBO.
Variational gap optimization
- Minimizing the variational gap is an approach to maximizing the log-likelihood.
- INDM (120) is successful in the VAE field.
Dimension reduction
- Variational auto-encoder projects data into a lower dimension
- Diffusion models have high expressiveness from equal-dimension transitions
- Diffusing on a low-dimensional manifold has wide applications in graph-based representations
- Reduced-dimension diffusion can be achieved with latent and dimension projection techniques
Latent projection
- Project training data onto lower dimensional latent space using flow function and VAE-encoder
- LSGM, INDM, and PDM learn smoother models in smaller space, reducing network evaluations and speeding up sampling
- Weighting training techniques use joint training of diffusion models and projecting models based on ELBO and log-likelihood maximization
Dimension projection
- Dimension projection reduces spatial redundancy on image manifolds
- DVDP combines DDPM and VAE
- Theoretical analysis of reduction scale of dimensionality and down-sampling & up-sampling steps needs to be explored
Application
- Diffusion models have powerful ability to generate realistic samples
- Diffusion models used in computer vision, natural language processing, and bioinformatics
Computer vision
- CMDE outperformed vanilla conditional denoising estimator in in-painting and super-resolution tasks
- DDRM proposed an efficient, unsupervised posterior sampling method for image restoration
- Palette developed a unified diffusion-based framework for low-level vision tasks
- DiffC proposed an unconditional generative approach for lossy image compression
- RePaint replaced reverse diffusion by sampling unmasked regions using given image information
High-level vision
- FSDM is a few-shot generation framework based on conditional diffusion models
- CARD proposed a denoising diffusion-based conditional generative model to predict data distribution
- GLIDE explored realistic image synthesis conditioned on text using diffusion models
- DreamFusion extended GLIDE’s achievement into 3D space
- LSGM built a diffusion model trained in the latent space with a variational autoencoder framework
- VQ-Diffusion improved vector quantized diffusion by exploring classifier-free guidance sampling
3D vision
- [33] was an early work on 3D vision tasks using diffusion
- Diffusion process used to generate point clouds
- [210] used diffusion for point cloud generation without shape encoders
- [34] proposed a diffusion model for point cloud completion
- [35] used a neural network to denoise point clouds
Video modeling
- Video diffusion uses generative models to create videos
- RVD, FDM, MCVD, and RaMViD are all methods of using diffusion models to generate videos
Medical application
- Diffusion models can be applied to medical images.
- Score-MRI proposed a diffusion-based framework for MRI reconstruction.
- [213] provided a more flexible framework that didn’t require a paired dataset for training.
- R2D2+ combined diffusion-based MRI reconstruction and super-resolution into the same network.
- [215] explored the application of the generative diffusion model to medical image segmentation.
Sequential modeling
Natural language processing
- Diffusion models are non-autoregressive
- Diffusion-LM used diffusions to denoise noisy vectors into word vectors
- Bit Diffusion used diffusion models to generate discrete data for image caption tasks
Time series
- CSDI [41] used score-based diffusion models to address time series imputation.
- SSSD [42] used structured state space models to capture long-term dependencies in time series data.
Audio
- WaveGrad and DiffWave applied diffusion models to raw waveform generation
- GradTTS and Diff-TTS implemented diffusion models but generated mel feature instead of raw waves
- DiffVC tackled the one-shot many-to-many voice conversion problem
- DiffSinger extended sound generation to singing voice synthesis based on a shallow diffusion mechanism
AI for science
Molecular conformation generation
- ConfGF was an early work on diffusion-based molecular conformation generation models
- DGSM proposed to dynamically construct molecular graph structures between atoms
- GeoDiff introduced a roto-translational invariant Markov process to impose constraints on the density
- EDM incorporated discrete atom features and derived the equations required for log-likelihood computation
- Torsional diffusion operated on the space of torsional angles
- DiffDock conducts denoising score matching on translation, rotation, and torsion angles
Material design
- CDVAE explored the periodic structure of stable material generation
- Diffusion-based network designed to capture specific local bonding preferences
- Recent work developed a diffusion-based generative model to target specific antigen structures
- Anand et al. introduced a diffusion-based generative model for protein structure and sequence
- ProteinSGM formulated protein design as an image inpainting problem
- DiffFolding generates protein backbone concentrating on internal angles
Conclusions & discussions
- Diffusion model is important for many fields
- Paper provides review of diffusion models, including theory, algorithms, and applications
Limitations & further directions
- Diffusion models should be viewed as a class, not a branch of DDPM-based models
- Training objectives and evaluation metrics should match the initial goal
- Complex modeling is needed to eliminate inductive bias
- Improvement algorithms with reduced steps should be explored
Appendix B evaluation metric
B.1 Inception score (IS)
- Inception score is a way to measure the diversity and resolution of generated images based on the ImageNet dataset.
- Inception score is divided into two parts: diversity measurement and quality measurement.
- Diversity measurement is calculated based on the class entropy of generated samples.
- Quality measurement is computed through the similarity between a sample and the related class images using entropy.
- KL divergence is applied to inception score calculation.
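The computation described above can be sketched directly from class probabilities. The function below assumes the per-sample class probabilities have already been produced by a pretrained classifier; the name `inception_score` and the two toy inputs are illustrative.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp of the mean KL divergence between p(y | x) and the marginal p(y)."""
    p_y = probs.mean(axis=0)                                   # marginal class distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

num_classes = 10
# Every sample is maximally ambiguous: lowest possible score.
uniform_preds = np.full((100, num_classes), 1.0 / num_classes)
# Every sample is confidently classified and classes are balanced: highest possible score.
confident_preds = np.eye(num_classes)[np.arange(100) % num_classes]

score_low = inception_score(uniform_preds)
score_high = inception_score(confident_preds)
```

The score ranges from 1 to the number of classes, so it rewards both confident per-sample predictions and a balanced spread across classes.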
B.2 Fréchet inception distance (FID)
- Inception Score is based on a specific dataset with 1000 classes and a trained network.
- Bias between ImageNet and real-world images may cause inaccurate outcome.
- FID is proposed to solve bias from specific reference datasets.
- FID shows distance between real-world data distribution and generated samples using mean and covariance.
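Given the mean and covariance of two feature sets, the Fréchet distance between the fitted Gaussians can be computed as below. Feature extraction is assumed to have been done already; the function name and the random test statistics are hypothetical.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    # Trace of the matrix square root equals the sum of square roots of the eigenvalues.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4))
sigma = a @ a.T + np.eye(4)          # a valid (positive definite) covariance matrix
mu = rng.standard_normal(4)
shift = np.array([1.0, 0.0, -2.0, 0.5])

same = frechet_distance(mu, sigma, mu, sigma)
shifted = frechet_distance(mu, sigma, mu + shift, sigma)
```

Identical statistics give a distance of zero, and a pure mean shift contributes its squared length.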
B.3 Negative log-likelihood (NLL)
- Negative log-likelihood is a common evaluation metric for data distribution.
- Normalizing flow field, VAE field, and improved DDPM use NLL for evaluation.
Appendix C benchmarks
- Benchmarks of landmark models and improved techniques are provided on CIFAR-10, ImageNet, and CelebA-64 datasets
- Results on the LSUN, FFHQ, and MNIST datasets are not presented
- Performance is listed according to NFE in descending order
- Different tasks such as audio diffusion, audio SDE, molecular score, molecular diffusion, protein score, and protein diffusion are discussed