Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Diffusion models have made breakthroughs in image, audio, and video generation.
  • Diffusion models have slow sampling speed and limit potential for real-time applications.
  • Consistency models are proposed as a new family of generative models that achieve high sample quality without adversarial training.
  • Consistency models support fast one-step generation and few-step sampling.
  • Consistency models support zero-shot data editing without explicit training.

Paper Content

Introduction

  • Diffusion models are score-based generative models used for image, audio, and video generation
  • Diffusion models do not rely on adversarial training and are less prone to issues like unstable training and mode collapse
  • Diffusion models do not impose strict constraints on model architectures
  • Diffusion models use an iterative sampling process to remove noise from a random noise vector
  • This iterative refinement allows for trade-off of compute for sample quality
  • Diffusion models can be used for zeroshot data editing tasks
  • Diffusion models require 10-2000 times more compute than single-step generative models
  • Objective is to create generative models that facilitate efficient, single-step generation
  • Consistency models are proposed to map any point on an ODE trajectory to its origin
  • Consistency models are self-consistent and can generate data samples with one network evaluation
  • Two methods are proposed to train consistency models
  • Consistency models are demonstrated on CIFAR-10, ImageNet 64, and LSUN 256
  • Consistency models outperform progressive distillation and many GANs

Diffusion models

  • Consistency models are inspired by diffusion models
  • Diffusion models generate data by perturbing data to noise and creating samples from noise
  • Diffusion models use a stochastic differential equation (SDE)
  • There is an ordinary differential equation (ODE) associated with the SDE
  • Diffusion models are designed so the distribution at the end is close to a tractable Gaussian distribution
  • Sampling from diffusion models is slow
  • Existing methods for fast sampling include faster numerical ODE solvers and distillation techniques
  • Progressive distillation does not require collecting a large dataset of samples

Consistency models

  • We propose consistency models, a new type of generative models
  • Consistency models can be trained in either the distillation mode or the isolation mode
  • Consistency models have the property of self-consistency
  • Consistency models are parameterized using skip connections
  • Sampling from consistency models involves one forward pass
  • Consistency models can interpolate between samples
  • Consistency models can perform denoising
  • Consistency models can be used for image editing tasks such as inpainting, colorization, super-resolution, and stroke-guided image editing

Training consistency models via distillation

  • Present a method for training consistency models based on distilling a pre-trained score model
  • Sample along the distribution of ODE trajectories by first sampling x from the dataset, followed by sampling x tn`1 from the transition density of the SDE
  • Train the consistency model by minimizing its output differences on the pair px φ tn , x tn`1 q
  • Consistency distillation loss is defined as an expectation taken with respect to x " p data , n " U 1, N ´1 , and x tn1 " N px; t 2 n1 Iq
  • Consider the squared 2 distance, 1 distance, and the Learned Perceptual Image Patch Similarity (LPIPS)
  • Minimize the objective by stochastic gradient descent on the model parameters θ
  • Update θ ´with exponential moving average (EMA)
  • Provide a theoretical justification for consistency distillation based on asymptotic analysis
  • Theorem 1 states that, under some regularity conditions, the estimated consistency model can become arbitrarily accurate
  • Continuous-time distillation loss functions involve Jacobian-vector products and require forward-mode automatic differentiation

Training consistency models in isolation

  • Consistency models can be trained without relying on pre-trained diffusion models.
  • Diffusion distillation techniques are different from consistency models.
  • There is an unbiased estimator of the ground truth score function.
  • Monte Carlo estimate of the score function can be formed.
  • Pre-trained diffusion model can be replaced in consistency distillation.
  • Consistency training objective does not depend on diffusion model parameters.
  • Consistency training loss can be extended to hold in continuous time.

Experiments

  • Used consistency distillation and consistency training to learn consistency models on real image datasets
  • Compared results according to Fréchet Inception Distance, Inception Score, Precision, and Recall
  • Additional experimental details in Appendix C

Training consistency models

  • Performance of consistency models trained by consistency distillation (CD) and consistency training (CT) is affected by various hyperparameters.
  • Metric function used for CD experiments is squared 2 distance, 1 distance, and Learned Perceptual Image Patch Similarity (LPIPS).
  • ODE solver used for CD experiments is Euler’s forward method and Heun’s second order method.
  • Number of discretization steps N used for CD experiments is N P t9, 12, 18, 36, 50, 60, 80, 120u.
  • Optimal metric for CD is LPIPS.
  • Heun ODE solver and N " 18 are the best choices for CD.
  • For CT experiments, LPIPS is used throughout.
  • Convergence of CT is sensitive to N -smaller N leads to faster convergence but worse samples, whereas larger N leads to slower convergence but better samples upon convergence.
  • Adaptive schedules of N and µ significantly improve the convergence speed and sample quality of CT.

Few-step image generation

  • CD and PD are two distillation approaches that do not require synthetic data
  • CD outperforms PD across all datasets, sampling steps, and metric functions
  • CT outperforms single-step, non-adversarial generative models by a significant margin
  • CT samples share significant structural similarity with EDM samples

Zero-shot image editing

  • Consistency models allow zeroshot image editing by modifying the multistep sampling process.
  • Consistency model can colorize gray-scale bedroom images, generate high-resolution images from low-resolution inputs, and generate images based on stroke inputs.
  • Examples of zero-shot capability of consistency models include inpainting, interpolation, denoising, colorization, super-resolution, and stroke-guided image generation.

Conclusion

  • Introduced consistency models to support one-step and few-step generation
  • Consistency distillation method outperforms existing distillation techniques for diffusion models
  • Consistency models outdo other available models that permit single-step generation, barring GANs
  • Allow zero-shot image editing applications such as inpainting, colorization, super-resolution, denoising, interpolation, and stroke-guided image generation
  • Share striking similarities with techniques employed in other fields
  • Consistency distillation and consistency training objectives can be generalized to hold for infinite time steps
  • Depending on whether θ ´" θ or θ ´" stopgradpθq, two possible continuous-time extensions for the consistency distillation objective
  • Theorem 3 implies that consistency models can be trained by minimizing L 8 CD pθ, θ; φq
  • Theorem 4 implies that consistency models can be trained by minimizing L 8 CD, 1 pθ, θ; φq
  • L 8 CD pθ, θ; φq " 0 if and only if f θ px t , tq " x for all x t P R d and t P r , T s
  • L 8 CD, 1 pθ, θ; φq " 0 if and only if f θ px t , tq " x for all x t P R d and t P r , T s