Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Diffusion models can achieve better image sample quality than current state-of-the-art generative models.
- Classifier guidance can improve sample quality and trade off diversity for fidelity.
- FID scores of 2.97, 4.59, and 7.72 achieved on ImageNet 128$\times$128, 256$\times$256, and 512$\times$512 respectively.
- FID scores of 3.94 and 3.85 achieved on ImageNet 256$\times$256 and 512$\times$512 with classifier guidance and upsampling diffusion models.
Paper Content
Introduction
- Generative models can generate natural language, images, and speech/music
- Generative models can be used to generate images from text prompts or learn feature representations
- Paper describes improvements to diffusion models and evaluation setup
- Architecture improvements give substantial boost to FID
- Gradients from classifier can be used to guide diffusion model during sampling
- Improved models achieve state-of-the-art on unconditional and conditional image synthesis tasks
Background
- Diffusion models sample from a distribution by reversing a gradual noising process.
- Noise is drawn from a diagonal Gaussian distribution.
- A diffusion model learns to produce a slightly more “denoised” x t−1 from x t.
- Training objective is mean-squared error loss between the true noise and the predicted noise.
- Distribution of x t−1 given x t is modeled as a diagonal Gaussian.
Improvements
- Song and Ermon and Ho et al. proposed improvements to diffusion models
- Nichol and Dhariwal proposed parameterizing the variance of the model as a neural network
- Nichol and Dhariwal proposed a hybrid objective for training both the model and the variance
- Song et al. proposed an alternative non-Markovian noising process with the same forward marginals as DDPM
- Song et al. proposed a way to turn any model into a deterministic mapping from latents to images
Sample quality metrics
- Quantitative evaluations are used to compare sample quality across models.
- Inception Score (IS) measures how well a model captures the full ImageNet class distribution.
- Fréchet Inception Distance (FID) is more consistent with human judgement and captures spatial relationships.
- Improved Precision and Recall metrics measure sample fidelity and diversity.
- FID is the de facto standard metric for state-of-the-art generative modeling work.
Architecture improvements
- UNet architecture improves sample quality over previous architectures
- Architecture changes explored: increasing depth/width, number of attention heads, attention at different resolutions, BigGAN residual block, rescaling residual connections
- Training models on ImageNet 128x128 with batch size 256, sampling using 250 steps
- Table 1 shows architecture changes improve performance
- Figure 2 shows increased depth helps performance but takes longer to reach same performance as wider model
- Table 2 shows more heads or fewer channels per head improves FID
- Opt to use 64 channels per head as default
Adaptive group normalization
- Used adaptive group normalization (AdaGN) layer in experiments
- Improved FID when using AdaGN layer
Classifier guidance
- GANs use class labels to generate images
- Synthetic labels can be used in label-limited regimes
- Diffusion models can be conditioned on class labels
- Classifiers can be used to guide the diffusion sampling process
- Two ways of deriving conditional sampling processes using classifiers are reviewed
Conditional reverse noising process
- Diffusion model with reverse noising process
- Sampling from distribution can be approximated as a perturbed Gaussian distribution
- Algorithm 1 and 2 summarize the corresponding sampling algorithm
Conditional sampling for ddim
- Conditional sampling is only valid for stochastic diffusion sampling process.
- Score-based conditioning trick is used for deterministic sampling methods.
- Score function is derived from a model that predicts noise added to a sample.
- New epsilon prediction is defined from the score of the joint distribution.
- Sampling procedure is the same as regular DDIM, but with modified noise predictions.
Scaling classifier gradients
- Trained classification models on ImageNet
- Incorporated classifier into sampling process of diffusion model
- Scaled classifier gradients by a factor larger than 1
- Sample quality of both unconditional and conditional models improved by classifier guidance
Results
- Trained separate diffusion models on three LSUN classes: bedroom, horse, and cat
- Trained conditional diffusion models on ImageNet
- Trained own model due to lack of public models or samples
- Results use two-resolution stacks
State-of-the-art image synthesis
- Diffusion models obtain best FID and sFID on most tasks
- Improved architecture achieves state-of-the-art image generation on LSUN and ImageNet 64x64
- Classifier guidance allows models to outperform GANs
- Diffusion models contain more modes than GANs
Comparison to upsampling
- Nichol and Dhariwal and Saharia et al. trained two-stage diffusion models by combining a low-resolution diffusion model with a corresponding upsampling diffusion model.
- Combining classifier guidance with upsampling improves sample quality along different axes.
- Guidance provides a knob to trade off diversity for higher precision.
Related work
- Score based generative models introduced by Song and Ermon
- Ho et al. found connection between this method and diffusion models
- Many works followed up with more promising results
- Diffusion models work well for audio
- GAN-like setup can improve samples from these models
- Techniques from stochastic differential equations can improve sample quality
- Methods to improve sampling speed
- Diffusion models used for ImageNet generation task
- Goyal et al. described technique for learning model with learned iterative generation steps
- Truncation trick for GANs to trade off diversity for fidelity
- Classifier rejection sampling to filter out bad samples
- Low-temperature sampling to emphasize modes of data distribution
- VQ-VAE and VQ-VAE-2 autoregressive models with quantized latent codes
- DCTransformer and NVAE/VDVAE VAEs for difficult image generation
- Energy-based models with Langevin dynamics for sampling coherent images
- Classifiers used as stand-alone generative models
Limitations and future work
- Diffusion models are slower than GANs at sampling time
- Luhman and Luhman [37] explore a way to distill the DDIM sampling process into a single step model
- Samples from the single step model are not yet competitive with GANs
- Classifier guidance technique is limited to labeled datasets
- Future work could extend classifier guidance to unlabeled data
- Classifier guidance demonstrates that powerful generative models can be obtained from the gradients of a classification function
Conclusion
- Diffusion models can obtain better sample quality than state-of-the-art GANs
- Diffusion models can be used for both unconditional and class-conditional tasks
- Diversity and fidelity can be traded off by adjusting the scale of the classifier gradients
- Sampling time gap between GANs and diffusion models can be reduced
- Better sample quality can be achieved on high-resolution conditional image synthesis
- Results better than StyleGAN2 and BigGAN-deep can be achieved with the same or lower compute budget
- Naive implementation of models in PyTorch is inefficient
- Optimized version uses larger per-GPU batch sizes and fused GroupNorm-Swish and fused Adam CUDA ops
- Training for fewer iterations can maintain sample quality superior to BigGAN-deep
- Temperature scaling does not provide any substantial improvement in evaluation metrics
- Low temperatures have both low precision and low recall
- Samples from the model with no guidance have almost perfect reconstructions
- Samples from the model with classifier guidance are unique and not stored in the training set
- DDIM latent space interpolations can be used to reconstruct and interpolate real images
- Samples from the best 512x512 model can be seen in Figures 13, 14, and 15