Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Diffusion models can achieve better image sample quality than current state-of-the-art generative models.
  • Classifier guidance can improve sample quality and trade off diversity for fidelity.
  • FID scores of 2.97, 4.59, and 7.72 achieved on ImageNet 128$\times$128, 256$\times$256, and 512$\times$512 respectively.
  • FID scores of 3.94 and 3.85 achieved on ImageNet 256$\times$256 and 512$\times$512 with classifier guidance and upsampling diffusion models.

Paper Content

Introduction

  • Generative models can generate natural language, images, and speech/music
  • Generative models can be used to generate images from text prompts or learn feature representations
  • Paper describes improvements to diffusion models and evaluation setup
  • Architecture improvements give substantial boost to FID
  • Gradients from classifier can be used to guide diffusion model during sampling
  • Improved models achieve state-of-the-art on unconditional and conditional image synthesis tasks

Background

  • Diffusion models sample from a distribution by reversing a gradual noising process.
  • Noise is drawn from a diagonal Gaussian distribution.
  • A diffusion model learns to produce a slightly more “denoised” $x_{t-1}$ from $x_t$.
  • Training objective is mean-squared error loss between the true noise and the predicted noise.
  • The distribution of $x_{t-1}$ given $x_t$ is modeled as a diagonal Gaussian.
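The noising process and the noise-prediction loss described above can be sketched in NumPy. This is a minimal illustration, assuming a linear noise schedule with 1000 steps; the names `make_alpha_bar`, `q_sample`, `noise_prediction_loss`, and the zero-predicting stand-in model are hypothetical and not taken from the paper's code.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, noise):
    """Draw x_t from q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def noise_prediction_loss(predict_noise, x0, t, alpha_bar, rng):
    """Mean-squared error between the true noise and the predicted noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alpha_bar, noise)
    return float(np.mean((noise - predict_noise(x_t, t)) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 8, 8, 3))           # toy "images"
zero_model = lambda x_t, t: np.zeros_like(x_t)   # placeholder for a trained network
loss = noise_prediction_loss(zero_model, x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

A real model would replace `zero_model` with a network that takes the noisy sample and the timestep; the loss is then minimized over randomly drawn timesteps.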

Improvements

  • Following the work of Song and Ermon and Ho et al., several recent papers proposed improvements to diffusion models
  • Nichol and Dhariwal proposed parameterizing the variance of the model as a neural network
  • Nichol and Dhariwal proposed a hybrid objective for training both the model and the variance
  • Song et al. proposed an alternative non-Markovian noising process with the same forward marginals as DDPM
  • Song et al. proposed a way to turn any model into a deterministic mapping from latents to images
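The learned variance and hybrid objective mentioned above can be illustrated as follows. This is a sketch, assuming the variance is an interpolation in log space between $\beta_t$ and the posterior variance $\tilde{\beta}_t$, and that the hybrid loss is a weighted sum of the simple and variational terms; the function names and the default weight are assumptions made for illustration.

```python
import numpy as np

def posterior_variance(betas):
    """beta_tilde_t = (1 - abar_{t-1}) / (1 - abar_t) * beta_t, with abar_{-1} = 1."""
    alpha_bar = np.cumprod(1.0 - betas)
    alpha_bar_prev = np.append(1.0, alpha_bar[:-1])
    return (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * betas

def interpolated_variance(v, beta_t, beta_tilde_t):
    """Sigma = exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t)), v in [0, 1]."""
    return np.exp(v * np.log(beta_t) + (1.0 - v) * np.log(beta_tilde_t))

def hybrid_loss(l_simple, l_vlb, lam=1e-3):
    """Weighted sum L_simple + lambda * L_vlb; the default lambda is an assumed small weight."""
    return l_simple + lam * l_vlb

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
beta_tilde = posterior_variance(betas)
t = 500                                  # any t >= 1; beta_tilde is zero at t = 0
sigma_mid = interpolated_variance(0.5, betas[t], beta_tilde[t])
```

In a full model, `v` would be a per-dimension network output rather than a scalar constant.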

Sample quality metrics

  • Quantitative evaluations are used to compare sample quality across models.
  • Inception Score (IS) measures how well a model captures the full ImageNet class distribution.
  • Fréchet Inception Distance (FID) is more consistent with human judgement than Inception Score.
  • sFID, a version of FID that uses spatial features, better captures spatial relationships.
  • Improved Precision and Recall metrics measure sample fidelity and diversity.
  • FID is the de facto standard metric for state-of-the-art generative modeling work.
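As a rough illustration of what FID computes, the sketch below fits a Gaussian to each of two feature sets and evaluates the Fréchet distance between them. It assumes the features have already been extracted; the random arrays here are stand-ins for Inception features, and the helper names are hypothetical.

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^{1/2})."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^{1/2}) equals the sum of square roots of the eigenvalues
    # of cov1 @ cov2, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * trace_sqrt)

def fid_from_features(feats_a, feats_b):
    """Fit a Gaussian to each (num_samples, dim) feature set and compare them."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    return frechet_distance(mu_a, cov_a, mu_b, cov_b)

rng = np.random.default_rng(0)
real = rng.standard_normal((2000, 16))          # stand-in for Inception features
fake = rng.standard_normal((2000, 16)) + 0.5    # shifted distribution
same = fid_from_features(real, real)            # close to zero
shifted = fid_from_features(real, fake)         # clearly positive
```

Lower values indicate that the two feature distributions are closer, which is why a lower FID is better.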

Architecture improvements

  • UNet architecture improves sample quality over previous architectures
  • Architecture changes explored: increasing depth/width, number of attention heads, attention at different resolutions, BigGAN residual block, rescaling residual connections
  • Models are trained on ImageNet 128$\times$128 with batch size 256 and sampled using 250 steps
  • Table 1 shows architecture changes improve performance
  • Figure 2 shows increased depth helps performance but takes longer to reach same performance as wider model
  • Table 2 shows more heads or fewer channels per head improves FID
  • Opt to use 64 channels per head as default
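Fixing the number of channels per head means the head count grows with layer width. The snippet below is a small sketch of that bookkeeping; the function name and the channel widths per resolution are hypothetical and not taken from the paper.

```python
def num_attention_heads(channels, channels_per_head=64):
    """Number of heads when each head is given a fixed channel width."""
    if channels % channels_per_head != 0:
        raise ValueError("channels must be divisible by channels_per_head")
    return channels // channels_per_head

# Hypothetical channel widths at three UNet resolutions (not taken from the paper).
widths = {32: 256, 16: 512, 8: 1024}
heads = {res: num_attention_heads(ch) for res, ch in widths.items()}
```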

Adaptive group normalization

  • Used adaptive group normalization (AdaGN) layer in experiments
  • Improved FID when using AdaGN layer
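AdaGN can be sketched as a group normalization followed by a scale and shift computed from the timestep and class embedding, i.e. AdaGN(h, y) = y_s GroupNorm(h) + y_b. The NumPy version below is illustrative only; the shapes, the random projection weights, and the helper names are assumptions.

```python
import numpy as np

def group_norm(h, num_groups, eps=1e-5):
    """Normalize an (N, C, H, W) array within each group of channels (no affine)."""
    n, c, height, width = h.shape
    grouped = h.reshape(n, num_groups, -1)
    mean = grouped.mean(axis=2, keepdims=True)
    var = grouped.var(axis=2, keepdims=True)
    return ((grouped - mean) / np.sqrt(var + eps)).reshape(n, c, height, width)

def ada_gn(h, emb, weight, bias, num_groups=32):
    """AdaGN(h, y) = y_s * GroupNorm(h) + y_b, with [y_s, y_b] = emb @ weight + bias."""
    scale_shift = emb @ weight + bias              # (N, 2C)
    y_s, y_b = np.split(scale_shift, 2, axis=1)    # each (N, C)
    return y_s[:, :, None, None] * group_norm(h, num_groups) + y_b[:, :, None, None]

rng = np.random.default_rng(0)
h = rng.standard_normal((2, 64, 8, 8))             # intermediate activations
emb = rng.standard_normal((2, 128))                # timestep + class embedding
weight = rng.standard_normal((128, 128)) * 0.02    # projects to 2 * 64 channels
bias = np.zeros(128)
out = ada_gn(h, emb, weight, bias)
```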

Classifier guidance

  • GANs for conditional image synthesis make heavy use of class labels
  • Synthetic labels can be used in label-limited regimes
  • Diffusion models can be conditioned on class labels
  • Classifiers can be used to guide the diffusion sampling process
  • Two ways of deriving conditional sampling processes using classifiers are reviewed

Conditional reverse noising process

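  • The starting point is a diffusion model with an unconditional reverse noising process $p_\theta(x_t \mid x_{t+1})$
  • To condition on a label $y$, each transition is sampled in proportion to $p_\theta(x_t \mid x_{t+1})\, p_\phi(y \mid x_t)$, which can be approximated as a Gaussian whose mean is shifted by $\Sigma g$, where $g$ is the classifier gradient
  • Algorithm 1 summarizes the corresponding sampling algorithm; Algorithm 2 covers the DDIM variant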
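A single guided sampling step can be sketched as drawing from a Gaussian whose mean is shifted by the scaled classifier gradient. The example below is a toy illustration: the quadratic "classifier", the class center, and the stand-in mean and variance are all assumptions, used only so the shift can be checked numerically.

```python
import numpy as np

def guided_step(mu, sigma2, grad_log_prob, scale, rng):
    """Sample x_{t-1} ~ N(mu + scale * Sigma * grad, Sigma) for a diagonal Sigma."""
    shifted_mean = mu + scale * sigma2 * grad_log_prob
    return shifted_mean + np.sqrt(sigma2) * rng.standard_normal(mu.shape)

def toy_grad_log_prob(x, class_center):
    """Gradient of a toy log p(y | x) = -0.5 * ||x - center_y||^2 + const."""
    return class_center - x

rng = np.random.default_rng(0)
x_t = np.zeros(2)
mu = x_t.copy()                  # stand-in for the diffusion model's predicted mean
sigma2 = np.full(2, 0.01)        # stand-in for the predicted diagonal variance
center = np.array([1.0, -1.0])   # hypothetical class center
grad = toy_grad_log_prob(x_t, center)
samples = np.stack([guided_step(mu, sigma2, grad, 10.0, rng) for _ in range(5000)])
```

With these toy values the sample mean moves from the unguided mean toward the class center by `scale * sigma2 * grad`.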

Conditional sampling for DDIM

  • The conditional sampling derivation above is only valid for the stochastic diffusion sampling process.
  • A score-based conditioning trick, adapted from Song et al., is used for deterministic sampling methods such as DDIM.
  • Score function is derived from a model that predicts noise added to a sample.
  • New epsilon prediction is defined from the score of the joint distribution.
  • Sampling procedure is the same as regular DDIM, but with modified noise predictions.
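The modified noise prediction and the deterministic update can be sketched as below. This assumes the guided prediction takes the form eps_hat = eps - sqrt(1 - abar_t) * grad log p(y | x_t) and that the DDIM update is run with no added noise; the `scale` argument and the function names are assumptions added for illustration.

```python
import numpy as np

def guided_eps(eps, grad_log_prob, alpha_bar_t, scale=1.0):
    """eps_hat = eps - sqrt(1 - abar_t) * scale * grad log p(y | x_t)."""
    return eps - np.sqrt(1.0 - alpha_bar_t) * scale * grad_log_prob

def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (no added noise) using the given noise prediction."""
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_hat

alpha_bar_t, alpha_bar_prev = 0.5, 0.6
x0 = np.array([1.0, -2.0])
true_eps = np.array([0.3, 0.1])
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * true_eps

# With a zero classifier gradient the step reduces to regular DDIM.
eps_hat = guided_eps(true_eps, np.zeros(2), alpha_bar_t)
x_prev = ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev)
```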

Scaling classifier gradients

  • Trained classification models on ImageNet
  • Incorporated classifier into sampling process of diffusion model
  • Scaled classifier gradients by a factor larger than 1
  • Sample quality of both unconditional and conditional models improved by classifier guidance
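One way to see why a gradient scale larger than 1 helps: scaling the gradient of log p(y | x) by s is equivalent to taking the gradient of the log of a renormalized distribution proportional to p(y | x)^s, which is sharper than the original when s > 1. The toy example below illustrates this sharpening; the probabilities are hypothetical classifier outputs.

```python
import numpy as np

def sharpen(probs, scale):
    """Renormalized p(y | x)^scale, the distribution implied by scaled gradients."""
    powered = np.asarray(probs, dtype=float) ** scale
    return powered / powered.sum()

probs = np.array([0.5, 0.3, 0.2])   # hypothetical classifier output
sharper = sharpen(probs, 4.0)       # mass concentrates on the most likely class
```

Larger scales put more weight on the classifier's modes, which favors fidelity at the cost of diversity.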

Results

  • Trained separate diffusion models on three LSUN classes: bedroom, horse, and cat
  • Trained conditional diffusion models on ImageNet
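  • Where no BigGAN-deep model was available at a resolution, the authors trained their own; some baseline values are taken from previous papers due to lack of public models or samples
  • Some reported results use two-resolution stacks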

State-of-the-art image synthesis

  • Diffusion models obtain the best FID on each task and the best sFID on all but one task
  • Improved architecture achieves state-of-the-art image generation on LSUN and ImageNet 64$\times$64
  • Classifier guidance allows models to outperform GANs
  • At similar perceptual quality, samples from the diffusion model contain more modes than samples from the GAN

Comparison to upsampling

  • Nichol and Dhariwal and Saharia et al. trained two-stage diffusion models by combining a low-resolution diffusion model with a corresponding upsampling diffusion model.
  • Guidance and upsampling improve sample quality along different axes, so the two can be combined.
  • Upsampling improves precision while keeping recall high; guidance provides a knob to trade off diversity for higher precision.

Related work

  • Score-based generative models were introduced by Song and Ermon
  • Ho et al. found connection between this method and diffusion models
  • Many works followed up with more promising results
  • Diffusion models work well for audio
  • GAN-like setup can improve samples from these models
  • Techniques from stochastic differential equations can improve sample quality
  • Methods to improve sampling speed
  • Diffusion models used for ImageNet generation task
  • Goyal et al. described technique for learning model with learned iterative generation steps
  • Truncation trick for GANs to trade off diversity for fidelity
  • Classifier rejection sampling to filter out bad samples
  • Low-temperature sampling to emphasize modes of data distribution
  • VQ-VAE and VQ-VAE-2 autoregressive models with quantized latent codes
  • DCTransformer, a related method with a different compression scheme, and the VAEs NVAE and VDVAE, applied to difficult image generation domains
  • Energy-based models with Langevin dynamics for sampling coherent images
  • Classifiers used as stand-alone generative models

Limitations and future work

  • Diffusion models are slower than GANs at sampling time
  • Luhman and Luhman [37] explore a way to distill the DDIM sampling process into a single step model
  • Samples from the single step model are not yet competitive with GANs
  • Classifier guidance technique is limited to labeled datasets
  • Future work could extend classifier guidance to unlabeled data
  • Classifier guidance demonstrates that powerful generative models can be obtained from the gradients of a classification function

Conclusion

  • Diffusion models can obtain better sample quality than state-of-the-art GANs
  • Diffusion models can be used for both unconditional and class-conditional tasks
  • Diversity and fidelity can be traded off by adjusting the scale of the classifier gradients
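  • Guided diffusion models can reduce the sampling time gap between GANs and diffusion models, although diffusion models still require multiple forward passes during sampling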
  • Combining guidance with upsampling gives state-of-the-art results on high-resolution conditional image synthesis

Appendix

  • Results better than StyleGAN2 and BigGAN-deep can be achieved with the same or lower compute budget
  • Naive implementation of models in PyTorch is inefficient
  • Optimized version uses larger per-GPU batch sizes and fused GroupNorm-Swish and fused Adam CUDA ops
  • Training for fewer iterations can maintain sample quality superior to BigGAN-deep
  • Temperature scaling does not provide any substantial improvement in evaluation metrics
  • Low temperatures have both low precision and low recall
  • Samples from the model with no guidance have almost perfect reconstructions
  • Samples from the model with classifier guidance are unique and not stored in the training set
  • DDIM latent space interpolations can be used to reconstruct and interpolate real images
  • Samples from the best 512$\times$512 model can be seen in Figures 13, 14, and 15