Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Diffusion models have made breakthroughs in image, audio, and video generation.
Diffusion models have slow sampling speed and limit potential for real-time applications.
Consistency models are proposed as a new family of generative models that achieve high sample quality without adversarial training.
Consistency models support fast one-step generation and few-step sampling.
Consistency models support zero-shot data editing without explicit training.

Paper Content

Introduction

Diffusion models are score-based generative models used for image, audio, and video generation
Diffusion models do not rely on adversarial training and are less prone to issues like unstable training and mode collapse
Diffusion models do not impose strict constraints on model architectures
Diffusion models use an iterative sampling process to remove noise from a random noise vector
This iterative refinement allows for trade-off of compute for sample quality
Diffusion models can be used for zeroshot data editing tasks
Diffusion models require 10-2000 times more compute than single-step generative models
Objective is to create generative models that facilitate efficient, single-step generation
Consistency models are proposed to map any point on an ODE trajectory to its origin
Consistency models are self-consistent and can generate data samples with one network evaluation
Two methods are proposed to train consistency models
Consistency models are demonstrated on CIFAR-10, ImageNet 64, and LSUN 256
Consistency models outperform progressive distillation and many GANs

Diffusion models

Consistency models are inspired by diffusion models
Diffusion models generate data by perturbing data to noise and creating samples from noise
Diffusion models use a stochastic differential equation (SDE)
There is an ordinary differential equation (ODE) associated with the SDE
Diffusion models are designed so the distribution at the end is close to a tractable Gaussian distribution
Sampling from diffusion models is slow
Existing methods for fast sampling include faster numerical ODE solvers and distillation techniques
Progressive distillation does not require collecting a large dataset of samples

Consistency models

We propose consistency models, a new type of generative models
Consistency models can be trained in either the distillation mode or the isolation mode
Consistency models have the property of self-consistency
Consistency models are parameterized using skip connections
Sampling from consistency models involves one forward pass
Consistency models can interpolate between samples
Consistency models can perform denoising
Consistency models can be used for image editing tasks such as inpainting, colorization, super-resolution, and stroke-guided image editing

Training consistency models via distillation

Present a method for training consistency models based on distilling a pre-trained score model
Sample along the distribution of ODE trajectories by first sampling x from the dataset, followed by sampling x tn`1 from the transition density of the SDE
Train the consistency model by minimizing its output differences on the pair px φ tn , x tn`1 q
Consistency distillation loss is defined as an expectation taken with respect to x " p data , n " U 1, N ´1 , and x tn1 " N px; t 2 n1 Iq
Consider the squared 2 distance, 1 distance, and the Learned Perceptual Image Patch Similarity (LPIPS)
Minimize the objective by stochastic gradient descent on the model parameters θ
Update θ ´with exponential moving average (EMA)
Provide a theoretical justification for consistency distillation based on asymptotic analysis
Theorem 1 states that, under some regularity conditions, the estimated consistency model can become arbitrarily accurate
Continuous-time distillation loss functions involve Jacobian-vector products and require forward-mode automatic differentiation

Training consistency models in isolation

Consistency models can be trained without relying on pre-trained diffusion models.
Diffusion distillation techniques are different from consistency models.
There is an unbiased estimator of the ground truth score function.
Monte Carlo estimate of the score function can be formed.
Pre-trained diffusion model can be replaced in consistency distillation.
Consistency training objective does not depend on diffusion model parameters.
Consistency training loss can be extended to hold in continuous time.

Experiments

Used consistency distillation and consistency training to learn consistency models on real image datasets
Compared results according to Fréchet Inception Distance, Inception Score, Precision, and Recall
Additional experimental details in Appendix C

Training consistency models

Performance of consistency models trained by consistency distillation (CD) and consistency training (CT) is affected by various hyperparameters.
Metric function used for CD experiments is squared 2 distance, 1 distance, and Learned Perceptual Image Patch Similarity (LPIPS).
ODE solver used for CD experiments is Euler’s forward method and Heun’s second order method.
Number of discretization steps N used for CD experiments is N P t9, 12, 18, 36, 50, 60, 80, 120u.
Optimal metric for CD is LPIPS.
Heun ODE solver and N " 18 are the best choices for CD.
For CT experiments, LPIPS is used throughout.
Convergence of CT is sensitive to N -smaller N leads to faster convergence but worse samples, whereas larger N leads to slower convergence but better samples upon convergence.
Adaptive schedules of N and µ significantly improve the convergence speed and sample quality of CT.

Few-step image generation

CD and PD are two distillation approaches that do not require synthetic data
CD outperforms PD across all datasets, sampling steps, and metric functions
CT outperforms single-step, non-adversarial generative models by a significant margin
CT samples share significant structural similarity with EDM samples

Zero-shot image editing

Consistency models allow zeroshot image editing by modifying the multistep sampling process.
Consistency model can colorize gray-scale bedroom images, generate high-resolution images from low-resolution inputs, and generate images based on stroke inputs.
Examples of zero-shot capability of consistency models include inpainting, interpolation, denoising, colorization, super-resolution, and stroke-guided image generation.

Conclusion

Introduced consistency models to support one-step and few-step generation
Consistency distillation method outperforms existing distillation techniques for diffusion models
Consistency models outdo other available models that permit single-step generation, barring GANs
Allow zero-shot image editing applications such as inpainting, colorization, super-resolution, denoising, interpolation, and stroke-guided image generation
Share striking similarities with techniques employed in other fields
Consistency distillation and consistency training objectives can be generalized to hold for infinite time steps
Depending on whether θ ´" θ or θ ´" stopgradpθq, two possible continuous-time extensions for the consistency distillation objective
Theorem 3 implies that consistency models can be trained by minimizing L 8 CD pθ, θ; φq
Theorem 4 implies that consistency models can be trained by minimizing L 8 CD, 1 pθ, θ; φq
L 8 CD pθ, θ; φq " 0 if and only if f θ px t , tq " x for all x t P R d and t P r , T s
L 8 CD, 1 pθ, θ; φq " 0 if and only if f θ px t , tq " x for all x t P R d and t P r , T s

Link to paper#

Abstract#

Paper Content#

Introduction#

Diffusion models#

Consistency models#

Training consistency models via distillation#

Training consistency models in isolation#

Experiments#

Training consistency models#

Few-step image generation#

Zero-shot image editing#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Diffusion models

Consistency models

Training consistency models via distillation

Training consistency models in isolation

Experiments

Training consistency models

Few-step image generation

Zero-shot image editing

Conclusion