Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Diffusion models have made breakthroughs in image, audio, and video generation.
- Diffusion models have slow sampling speed and limit potential for real-time applications.
- Consistency models are proposed as a new family of generative models that achieve high sample quality without adversarial training.
- Consistency models support fast one-step generation and few-step sampling.
- Consistency models support zero-shot data editing without explicit training.
Paper Content
Introduction
- Diffusion models are score-based generative models used for image, audio, and video generation
- Diffusion models do not rely on adversarial training and are less prone to issues like unstable training and mode collapse
- Diffusion models do not impose strict constraints on model architectures
- Diffusion models use an iterative sampling process to remove noise from a random noise vector
- This iterative refinement allows for trade-off of compute for sample quality
- Diffusion models can be used for zeroshot data editing tasks
- Diffusion models require 10-2000 times more compute than single-step generative models
- Objective is to create generative models that facilitate efficient, single-step generation
- Consistency models are proposed to map any point on an ODE trajectory to its origin
- Consistency models are self-consistent and can generate data samples with one network evaluation
- Two methods are proposed to train consistency models
- Consistency models are demonstrated on CIFAR-10, ImageNet 64, and LSUN 256
- Consistency models outperform progressive distillation and many GANs
Diffusion models
- Consistency models are inspired by diffusion models
- Diffusion models generate data by perturbing data to noise and creating samples from noise
- Diffusion models use a stochastic differential equation (SDE)
- There is an ordinary differential equation (ODE) associated with the SDE
- Diffusion models are designed so the distribution at the end is close to a tractable Gaussian distribution
- Sampling from diffusion models is slow
- Existing methods for fast sampling include faster numerical ODE solvers and distillation techniques
- Progressive distillation does not require collecting a large dataset of samples
Consistency models
- We propose consistency models, a new type of generative models
- Consistency models can be trained in either the distillation mode or the isolation mode
- Consistency models have the property of self-consistency
- Consistency models are parameterized using skip connections
- Sampling from consistency models involves one forward pass
- Consistency models can interpolate between samples
- Consistency models can perform denoising
- Consistency models can be used for image editing tasks such as inpainting, colorization, super-resolution, and stroke-guided image editing
Training consistency models via distillation
- Present a method for training consistency models based on distilling a pre-trained score model
- Sample along the distribution of ODE trajectories by first sampling x from the dataset, followed by sampling x tn`1 from the transition density of the SDE
- Train the consistency model by minimizing its output differences on the pair px φ tn , x tn`1 q
- Consistency distillation loss is defined as an expectation taken with respect to x " p data , n " U 1, N ´1 , and x tn
1 " N px; t 2 n
1 Iq - Consider the squared 2 distance, 1 distance, and the Learned Perceptual Image Patch Similarity (LPIPS)
- Minimize the objective by stochastic gradient descent on the model parameters θ
- Update θ ´with exponential moving average (EMA)
- Provide a theoretical justification for consistency distillation based on asymptotic analysis
- Theorem 1 states that, under some regularity conditions, the estimated consistency model can become arbitrarily accurate
- Continuous-time distillation loss functions involve Jacobian-vector products and require forward-mode automatic differentiation
Training consistency models in isolation
- Consistency models can be trained without relying on pre-trained diffusion models.
- Diffusion distillation techniques are different from consistency models.
- There is an unbiased estimator of the ground truth score function.
- Monte Carlo estimate of the score function can be formed.
- Pre-trained diffusion model can be replaced in consistency distillation.
- Consistency training objective does not depend on diffusion model parameters.
- Consistency training loss can be extended to hold in continuous time.
Experiments
- Used consistency distillation and consistency training to learn consistency models on real image datasets
- Compared results according to Fréchet Inception Distance, Inception Score, Precision, and Recall
- Additional experimental details in Appendix C
Training consistency models
- Performance of consistency models trained by consistency distillation (CD) and consistency training (CT) is affected by various hyperparameters.
- Metric function used for CD experiments is squared 2 distance, 1 distance, and Learned Perceptual Image Patch Similarity (LPIPS).
- ODE solver used for CD experiments is Euler’s forward method and Heun’s second order method.
- Number of discretization steps N used for CD experiments is N P t9, 12, 18, 36, 50, 60, 80, 120u.
- Optimal metric for CD is LPIPS.
- Heun ODE solver and N " 18 are the best choices for CD.
- For CT experiments, LPIPS is used throughout.
- Convergence of CT is sensitive to N -smaller N leads to faster convergence but worse samples, whereas larger N leads to slower convergence but better samples upon convergence.
- Adaptive schedules of N and µ significantly improve the convergence speed and sample quality of CT.
Few-step image generation
- CD and PD are two distillation approaches that do not require synthetic data
- CD outperforms PD across all datasets, sampling steps, and metric functions
- CT outperforms single-step, non-adversarial generative models by a significant margin
- CT samples share significant structural similarity with EDM samples
Zero-shot image editing
- Consistency models allow zeroshot image editing by modifying the multistep sampling process.
- Consistency model can colorize gray-scale bedroom images, generate high-resolution images from low-resolution inputs, and generate images based on stroke inputs.
- Examples of zero-shot capability of consistency models include inpainting, interpolation, denoising, colorization, super-resolution, and stroke-guided image generation.
Conclusion
- Introduced consistency models to support one-step and few-step generation
- Consistency distillation method outperforms existing distillation techniques for diffusion models
- Consistency models outdo other available models that permit single-step generation, barring GANs
- Allow zero-shot image editing applications such as inpainting, colorization, super-resolution, denoising, interpolation, and stroke-guided image generation
- Share striking similarities with techniques employed in other fields
- Consistency distillation and consistency training objectives can be generalized to hold for infinite time steps
- Depending on whether θ ´" θ or θ ´" stopgradpθq, two possible continuous-time extensions for the consistency distillation objective
- Theorem 3 implies that consistency models can be trained by minimizing L 8 CD pθ, θ; φq
- Theorem 4 implies that consistency models can be trained by minimizing L 8 CD, 1 pθ, θ; φq
- L 8 CD pθ, θ; φq " 0 if and only if f θ px t , tq " x for all x t P R d and t P r , T s
- L 8 CD, 1 pθ, θ; φq " 0 if and only if f θ px t , tq " x for all x t P R d and t P r , T s