Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- DDPMs produce excellent samples
- DDPMs can achieve competitive log-likelihoods while maintaining high sample quality
- Learning variances of the reverse diffusion process allows sampling with fewer forward passes
- Precision and recall used to compare DDPMs and GANs
- Sample quality and likelihood of DDPMs scale smoothly with model capacity and training compute
Paper Content
Introduction
- Diffusion probabilistic models match data distribution by learning to reverse a gradual, multi-step noising process
- Ho et al. (2020) showed that denoising diffusion probabilistic models (DDPMs) and score-based generative models are equivalent
- DDPMs can produce high-quality images and audio
- Prior work had not shown that DDPMs can achieve log-likelihoods competitive with other likelihood-based models
- This paper shows that DDPMs can achieve competitive log-likelihoods, even on high-diversity datasets
- DDPMs can generate audio using a small number of sampling steps
- DDPMs can be optimized with a hybrid learning objective
- DDPMs can sample in fewer steps with little change in sample quality
- DDPMs have higher recall than GANs for similar FID
- Performance of models increases with model size and training compute
Denoising diffusion probabilistic models
- DDPMs are formulated by Ho et al. (2020).
- DDPMs involve a fixed noising process q which adds diagonal Gaussian noise at each timestep.
- Sohl-Dickstein et al. (2015) provides a more general derivation.
Definitions
- A forward noising process is defined which adds Gaussian noise at each time step with variance βt.
- The latent xT is nearly an isotropic Gaussian distribution if T and βt are sufficiently large.
- A neural network is used to approximate the reverse distribution q(xt-1|xt).
- A variational lower bound is written as an equation which can be evaluated in closed form.
- To evaluate the final decoding term on images, the probability of pθ(x0|x1) landing in the correct discretization bin is computed.
- An arbitrary step of the noised latents can be sampled directly conditioned on the input x0.
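The direct sampling of a noised latent can be sketched in NumPy as follows. This is a minimal illustration, not the paper's code: the linear schedule endpoints (1e-4 and 0.02), the toy image shape, and the function names are assumptions chosen for the example, and timesteps are 0-indexed so that index 0 is the first noising step.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_t (endpoint values are assumptions)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) in one shot."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

T = 1000
betas = linear_beta_schedule(T)
# alpha_bar[t] is the running product of (1 - beta_s) up to and including step t.
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # toy "image" scaled to [-1, 1]
x_mid, _ = q_sample(x0, T // 2, alpha_bar, rng)
x_last, _ = q_sample(x0, T - 1, alpha_bar, rng)
```

Because alpha_bar shrinks toward zero as t grows, x_last carries almost no signal from x0, which matches the claim that xT is nearly an isotropic Gaussian.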
Training in practice
- Objective in Equation 4 is a sum of independent terms
- Equation 9 provides an efficient way to sample from an arbitrary step of the forward noising process
- Ho et al. (2020) uniformly sample t for each image in each mini-batch
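- Network can predict either x0 or the noise ε, from which µθ(xt, t) is derived
- Predicting the noise works best with a reweighted loss, L_simple, which can be viewed as a reweighted form of L_vlb
- Ho et al. (2020) explain the better sample quality of the reweighted loss through a connection to generative score matching
- Ho et al. (2020) obtained their best results by fixing the variance to σt² I rather than learning it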
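A rough sketch of one evaluation of the noise-prediction loss, with t drawn uniformly per image, might look like the following. The name `simple_loss`, the stand-in `zero_model`, the linear schedule values, and the batch shape are all hypothetical; a real setup would use a trained network and an autodiff framework.

```python
import numpy as np

def simple_loss(eps_model, x0_batch, alpha_bar, rng):
    """Mean squared error between the true noise and the predicted noise."""
    T = alpha_bar.shape[0]
    # One timestep per image, drawn uniformly from the T steps.
    t = rng.integers(0, T, size=x0_batch.shape[0])
    noise = rng.standard_normal(x0_batch.shape)
    # Reshape per-example coefficients so they broadcast over image axes.
    shape = (-1,) + (1,) * (x0_batch.ndim - 1)
    signal_scale = np.sqrt(alpha_bar[t]).reshape(shape)
    noise_scale = np.sqrt(1.0 - alpha_bar[t]).reshape(shape)
    x_t = signal_scale * x0_batch + noise_scale * noise
    return np.mean((noise - eps_model(x_t, t)) ** 2)

def zero_model(x_t, t):
    """Hypothetical stand-in for the network: always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0_batch = rng.uniform(-1.0, 1.0, size=(16, 3, 8, 8))
loss = simple_loss(zero_model, x0_batch, alpha_bar, rng)
```

With the zero-predicting stand-in, the loss is just the mean squared noise, so it sits close to 1; a trained model would drive it lower.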
Improving the log-likelihood
- Ho et al. (2020) found that DDPMs can generate high-fidelity samples according to FID and Inception Score.
- However, Ho et al. (2020) were unable to achieve competitive log-likelihoods with these models.
- Log-likelihood is a widely used metric in generative modeling and it is believed that optimizing log-likelihood forces generative models to capture all of the modes of the data distribution.
- Small improvements in log-likelihood can have a dramatic impact on sample quality and learnt feature representations.
- This section explores modifications to the algorithm that allow DDPMs to achieve better log-likelihoods on image datasets.
- Experiments were conducted on ImageNet 64 × 64 and CIFAR-10 datasets.
- The setup from Ho et al. (2020) achieved a log-likelihood of 3.99 (bits/dim) on ImageNet 64 × 64 after 200K training iterations.
- Increasing T from 1000 to 4000 improved log-likelihood to 3.77.
- The model parameterizes the variance as an interpolation between βt and β̃t in the log domain.
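The log-domain interpolation of the variance can be illustrated with a small sketch. Here β̃t denotes the posterior variance, computed as (1 − ᾱt−1)/(1 − ᾱt) · βt, and v stands for the interpolation weight produced by the network. The function names, the linear schedule, and the handling of the first timestep (where β̃t is exactly zero) are assumptions of this example.

```python
import numpy as np

def posterior_variance(betas):
    """beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t."""
    alpha_bar = np.cumprod(1.0 - betas)
    alpha_bar_prev = np.append(1.0, alpha_bar[:-1])  # alpha_bar before step 1 is 1
    return betas * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)

def learned_variance(v, t, betas, beta_tilde):
    """Sigma = exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t))."""
    # beta_tilde is exactly 0 at the first step, so reuse the second value
    # there to keep the log finite (an assumption of this sketch).
    safe_tilde = np.append(beta_tilde[1], beta_tilde[1:])
    return np.exp(v * np.log(betas[t]) + (1.0 - v) * np.log(safe_tilde[t]))

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
beta_tilde = posterior_variance(betas)
t = 500
v = np.array([0.0, 0.5, 1.0])  # toy interpolation weights
sigma = learned_variance(v, t, betas, beta_tilde)
```

Setting v to 0 recovers β̃t and setting v to 1 recovers βt; intermediate values give a geometric interpolation between the two.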
Improving the noise schedule
- Linear noise schedule works well for high-resolution images, but is sub-optimal for 64 × 64 and 32 × 32 images
- End of forward noising process is too noisy and doesn’t contribute to sample quality
- Constructed a different noise schedule to address this problem
- Cosine schedule gives ᾱt a linear drop-off in the middle of the process, while changing very little near the extremes of t = 0 and t = T
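One way to write such a cosine schedule is sketched below, defining ᾱt = f(t)/f(0) with f(t) = cos²(((t/T + s)/(1 + s)) · π/2). The offset s = 0.008 and the clipping of βt at 0.999 are values assumed here for illustration and should be checked against the paper before being relied on.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) for t = 0..T, with a squared-cosine f."""
    steps = np.arange(T + 1)
    f = np.cos((steps / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Recover beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped near t = T."""
    alpha_bar = cosine_alpha_bar(T, s)
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

T = 1000
alpha_bar = cosine_alpha_bar(T)  # length T + 1, starting at exactly 1
betas = cosine_betas(T)          # length T
```

The small offset s keeps βt from being vanishingly small near t = 0, and the clip avoids a variance of essentially 1 at the very last step.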
Reducing gradient noise
- Optimizing L_vlb directly was expected to achieve the best log-likelihoods, but was difficult to optimize in practice.
- L_hybrid achieved better log-likelihoods on the training set given the same amount of training time.
- The gradient of L_vlb was much noisier than that of L_hybrid.
- To reduce the variance of L_vlb, importance sampling was employed.
- Optimizing L_vlb with importance sampling achieved the best log-likelihoods.
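The importance-sampling idea can be sketched as a loss-aware timestep sampler. This example assumes that t is drawn with probability proportional to the square root of a running estimate of E[L_t²], that each sampled term is reweighted by 1/p_t so the estimate of the summed loss stays unbiased, and that sampling is uniform until every timestep has a full history of 10 values. The class name, the history length, and the toy losses are hypothetical choices for this illustration.

```python
import numpy as np

class LossAwareSampler:
    """Samples timesteps in proportion to sqrt(E[L_t^2]) from recent losses."""

    def __init__(self, T, history=10):
        self.T = T
        self.history = history
        self.losses = np.zeros((T, history))
        self.counts = np.zeros(T, dtype=int)

    def probs(self):
        # Fall back to uniform sampling until every timestep has a full history.
        if np.any(self.counts < self.history):
            return np.full(self.T, 1.0 / self.T)
        p = np.sqrt(np.mean(self.losses ** 2, axis=1))
        return p / p.sum()

    def sample(self, batch_size, rng):
        p = self.probs()
        t = rng.choice(self.T, size=batch_size, p=p)
        return t, 1.0 / p[t]  # weights make mean(weights * L_t) estimate sum_t L_t

    def update(self, t, losses):
        for ti, li in zip(t, losses):
            if self.counts[ti] < self.history:
                self.losses[ti, self.counts[ti]] = li
                self.counts[ti] += 1
            else:
                # Drop the oldest entry and append the newest.
                self.losses[ti, :-1] = self.losses[ti, 1:]
                self.losses[ti, -1] = li

rng = np.random.default_rng(0)
toy_losses = np.array([1.0, 2.0, 3.0, 4.0])  # made-up per-timestep losses
sampler = LossAwareSampler(T=4)
for _ in range(10):
    all_t = np.arange(4)
    sampler.update(all_t, toy_losses[all_t])
t, weights = sampler.sample(batch_size=8, rng=rng)
estimate = np.mean(weights * toy_losses[t])  # estimates the sum of all L_t
```

In this toy case the losses are constant, so every weighted term equals the total loss and the estimator has zero variance, which is the ideal that importance sampling aims for.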
Improving sampling speed
- Models are trained with 4000 diffusion steps, so producing a single sample takes several minutes on a modern GPU.
- Sampling can be done in a few seconds instead of minutes by reducing the steps used during sampling.
- Sampling noise schedule can be obtained from a given sequence of t values.
- Sampling steps can be reduced from T to K by using K evenly spaced real numbers between 1 and T (inclusive), each rounded to the nearest integer.
- L_hybrid model with learnt sigmas maintains high sample quality with 100 sampling steps.
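The reduced sampling schedule can be sketched as follows. The example assumes that the noise level at each retained step is re-derived from the cumulative products, so that β for a retained step S_t equals 1 − ᾱ(S_t)/ᾱ(S_t−1) over the subsequence. The function names and the toy linear schedule are hypothetical; timesteps are 1-indexed to match "between 1 and T".

```python
import numpy as np

def strided_timesteps(T, K):
    """K evenly spaced values between 1 and T (inclusive), rounded to integers."""
    return np.round(np.linspace(1, T, K)).astype(int)

def respaced_betas(alpha_bar, steps):
    """Betas for the shortened chain, derived from the original alpha_bar."""
    # alpha_bar[t - 1] holds the cumulative product for 1-indexed timestep t.
    ab = alpha_bar[steps - 1]
    ab_prev = np.append(1.0, ab[:-1])
    return 1.0 - ab / ab_prev

T, K = 4000, 100
betas = np.linspace(2.5e-5, 5e-3, T)  # toy linear schedule (an assumption)
alpha_bar = np.cumprod(1.0 - betas)
steps = strided_timesteps(T, K)
new_betas = respaced_betas(alpha_bar, steps)
```

The shortened chain ends at the same cumulative noise level as the full one, because the product of (1 − new_betas) telescopes to the original ᾱ at the last retained step.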
Comparison to GANs
- Likelihood is a good proxy for mode-coverage, but difficult to compare to GANs.
- Precision and recall used instead.
- Class-conditional models trained, class information injected through same pathway as timestep t.
- Two diffusion models trained, plus one BigGAN-deep model with 100M parameters for comparison.
- 50K samples generated for metrics, full training set used for FID.
Scaling model size
- Algorithmic changes can improve log-likelihood and FID without changing training compute
- Trend in modern machine learning is that larger models and more training time improve model performance
- Investigated how FID and NLL scale as a function of training compute
- Results suggest DDPMs improve in a predictable way as training compute increases
- FID scales approximately according to a power law, while NLL does not fit a power law as cleanly
Related work
- Song et al. (2020a) and Song et al. (2020b) propose fast sampling algorithms for models trained with the DDPM objective.
- Gao et al. (2020) develops a diffusion model with reverse diffusion steps modeled by an energy-based model.
- Our method allows fast sampling directly from the ancestral process, removing the need for extra hyperparameters.
Conclusion
- DDPMs can sample faster and achieve better log-likelihoods with little impact on sample quality
- Learning Σθ with the interpolation parameterization and the L_hybrid objective improves log-likelihood
- Fewer steps are needed for sampling
- DDPMs match sample quality of GANs and have better mode coverage
- More training compute leads to better sample quality and log-likelihood
- DDPMs combine good log-likelihoods, high-quality samples, and fast sampling
Appendix
- UNet model architecture used
- Attention layers use multi-head attention
- Model conditions on t using GroupNorm
- Model parameters and FLOPs estimated
- Smaller model used for CIFAR-10
- Dropout values {0.1, 0.2, 0.3} used
- Adam used for all experiments
- Batch size of 128 and learning rate of 10^-4 used
- EMA rate of 0.9999 used
- 50K samples produced for most experiments
- 10K samples produced for unconditional ImageNet 64 × 64
- FID reference statistics computed over the full training set for CIFAR-10 and ImageNet
- FID reference statistics computed over 50K training samples for LSUN
- L_hybrid model outperforms DDIM with more than 50 diffusion steps
- L_hybrid and L_vlb models combined to achieve FID of 19.9 and NLL of 3.52 bits/dim
- Strided subset of timesteps used to improve log-likelihood
- NLL not calculated for DDIM
- Overfitting observed on CIFAR-10 and ImageNet 64 × 64
- EMA hyperparameter of 0.9999 and 0.99995 worked best
- Dropout of 0.1 and 0.3 used
- Cosine schedule easier to optimize and overfit
- Two models trained on class conditional ImageNet 256 × 256
- VQ-VAE-2 used for comparison
- Diffusion models obtain the best FIDs for a likelihood-based model