Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Diffusion models are difficult to apply to high resolution images.
- Existing approaches focus on lower dimensional spaces or multiple super-resolution levels.
- This paper aims to improve denoising diffusion for high resolution images while keeping the model simple.
- Four main findings: the noise schedule should be adjusted for larger images, it is sufficient to scale only a particular part of the architecture, dropout should be added at specific locations, and downsampling is an effective strategy to avoid high resolution feature maps.
Paper Content
Introduction
- Score-based diffusion models are used to generate data by adding random noise and approximating the denoising process with a neural network.
- Diffusion models are effective for image, audio, and video generation.
- For higher resolutions, literature typically uses lower dimensional latent spaces or divides the generative process into multiple sub-problems.
- This paper aims to improve standard denoising diffusion for higher resolutions while keeping the model as simple as possible.
Background: diffusion models
- Diffusion model generates data by learning the reverse of a destruction process
- Gaussian noise is added over time
- Hyperparameters determine how much signal is destroyed
- Variance preserving process fixes relation between hyperparameters
- Transition distributions are given by a normal distribution
- Noise schedule is the α-cosine schedule
- Denoising process can be written as an equation
- Without loss of generality, the neural network can be parametrized to predict the clean data x from the noisy input z_t
- Epsilon and v prediction parametrizations can be used
- Epsilon loss is used to train the model
- A lower bound on the model log-likelihood can be derived using variational inference
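The background above can be made concrete with a small scalar sketch. It assumes the cosine schedule α_t = cos(πt/2), σ_t = sin(πt/2), written through its log signal-to-noise ratio, and a variance preserving process (α_t² + σ_t² = 1). The function names are illustrative and not taken from the paper.

```python
import math


def sigmoid(x):
    """Logistic function, used to map a log-SNR to alpha^2 or sigma^2."""
    return 1.0 / (1.0 + math.exp(-x))


def logsnr_cosine(t):
    """log(alpha_t^2 / sigma_t^2) for the cosine schedule, valid for 0 < t < 1."""
    return -2.0 * math.log(math.tan(math.pi * t / 2.0))


def alpha_sigma(logsnr):
    """Variance preserving: alpha^2 = sigmoid(logsnr), sigma^2 = sigmoid(-logsnr)."""
    return math.sqrt(sigmoid(logsnr)), math.sqrt(sigmoid(-logsnr))


def diffuse(x, eps, t):
    """Sample z_t = alpha_t * x + sigma_t * eps for one scalar pixel."""
    alpha, sigma = alpha_sigma(logsnr_cosine(t))
    return alpha * x + sigma * eps
```

Working in log-SNR space keeps α and σ tied together by construction, which is what makes the later schedule shifts a one-line change.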
Method: simple diffusion
- Introduce modifications to enable denoising diffusion to work on high resolutions
- Modifications improve performance on high resolutions
Adjusting noise schedules
- Noise schedule used in diffusion models is the α-cosine schedule
- This schedule was originally proposed to improve performance on CIFAR10 and ImageNet
- For higher resolutions, not enough noise is added
- Diffusion distribution for pixel i is given by q(z_t^(i) | x) = N(α_t x_i, σ_t²)
- Variance of independent random variables is additive
- For higher resolutions, noise schedule can be changed in a predictable way
- When averaging over a window of size s x s, the ratio α_t/σ_t increases by a factor s, so the SNR increases by a factor s²
- Noise schedule can be defined with respect to a reference resolution
- SNR is multiplied by (64/d)² for d > 64
- Interpolating schedules can be used to include higher frequency details
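The schedule adjustment above can be sketched in a few lines. The shift follows the (64/d)² rule stated in the list; the interpolated schedule is written here as a linear blend in t between two shifted schedules, and the endpoint resolutions (32 and 256) and the blend direction are assumptions for illustration. The last helper empirically checks the averaging argument: unit Gaussian noise averaged over an s x s window has variance close to 1/s².

```python
import math
import random


def logsnr_cosine(t):
    """log-SNR of the unshifted cosine schedule, valid for 0 < t < 1."""
    return -2.0 * math.log(math.tan(math.pi * t / 2.0))


def logsnr_shifted(t, image_res, ref_res=64):
    """Cosine schedule with the SNR multiplied by (ref_res / image_res)^2."""
    return logsnr_cosine(t) + 2.0 * math.log(ref_res / image_res)


def logsnr_interpolated(t, image_res, low_ref=32, high_ref=256):
    """Blend of two shifted schedules (endpoints are an assumption)."""
    return (t * logsnr_shifted(t, image_res, high_ref)
            + (1.0 - t) * logsnr_shifted(t, image_res, low_ref))


def pooled_noise_variance(s, n_samples=20000, seed=0):
    """Empirical variance of unit Gaussian noise averaged over an s x s window."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(0.0, 1.0) for _ in range(s * s)) / (s * s)
             for _ in range(n_samples)]
    mu = sum(means) / n_samples
    return sum((m - mu) ** 2 for m in means) / n_samples
```

For a 128 x 128 image and reference resolution 64, the shift subtracts 2·log 2 from the log-SNR at every t, which compensates for the factor 4 in SNR gained by 2 x 2 average pooling.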
Multiscale training loss
- Noise schedule of diffusion model should be adjusted when training on high resolution images to keep signal-to-noise ratio constant.
- The standard training loss is dominated by high frequency details, so the authors propose replacing it with a multiscale version.
- Multiscale loss enables quicker convergence at resolutions greater than 256x256.
- Training loss is a weighted sum of losses for resolutions starting at base resolution (32x32) and including final resolution.
- Relative weight of loss is decreased as resolution is increased.
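A minimal sketch of such a multiscale loss on plain nested lists is given below. It assumes average pooling as the downsampling operator and a weight of 1/s for the loss at resolution s x s, which is one weighting consistent with the description above; both choices, and the function names, are assumptions rather than a confirmed reproduction of the paper's code.

```python
def avg_pool(img, factor):
    """Average-pool a square image (list of rows) by an integer factor."""
    n = len(img) // factor
    return [[sum(img[i * factor + a][j * factor + b]
                 for a in range(factor) for b in range(factor)) / factor ** 2
             for j in range(n)] for i in range(n)]


def mse(a, b):
    """Mean squared error between two square images of equal size."""
    n = len(a)
    return sum((a[i][j] - b[i][j]) ** 2
               for i in range(n) for j in range(n)) / n ** 2


def multiscale_loss(eps, eps_hat, base_res=32):
    """Sum of downsampled MSE losses from base_res up to the full resolution,
    each weighted by 1 / resolution (assumed weighting)."""
    d = len(eps)
    total, s = 0.0, base_res
    while s <= d:
        factor = d // s
        total += mse(avg_pool(eps, factor), avg_pool(eps_hat, factor)) / s
        s *= 2
    return total
```

A constant prediction error contributes at every scale, while a pure checkerboard error averages out at coarser scales and only contributes to the full-resolution term.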
Scaling the architecture
- Typical model architectures halve the channels each time the resolution is doubled.
- Low computational intensity leads to poor utilization of the accelerator and large activations result in out-of-memory issues.
- Scaling on the 16x16 resolution is sufficient to improve performance.
- Low resolution operations have relatively small feature maps.
- Memory requirements per device scale as 1/(number of devices).
- Avoiding high resolution feature maps is important to prevent out-of-memory issues.
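The trade-off above is easy to see with some arithmetic. The sketch below assumes channels halve each time the resolution doubles and counts activation elements and 3x3 convolution weights per level; the base width of 1024 channels at 16x16 is a hypothetical value chosen only for illustration.

```python
def level_stats(res, base_res=16, base_channels=1024):
    """Channels, activation elements and 3x3-conv weights at one U-Net level,
    assuming channels halve each time the resolution doubles."""
    channels = base_channels * base_res // res
    activations = channels * res * res      # one feature map, per example
    conv_weights = 9 * channels * channels  # one 3x3 conv, bias ignored
    return channels, activations, conv_weights


if __name__ == "__main__":
    for res in (16, 32, 64, 128):
        print(res, level_stats(res))
```

Under these assumptions, going from 16x16 to 128x128 makes each feature map 8 times larger while a convolution holds 64 times fewer weights, so adding capacity at 16x16 costs parameters but little activation memory.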
Dropout
- ImageNet dataset has 1 million images
- Regularizing networks to avoid overfitting is important
- Dropout is enabled on a subset of network layers
- The hypothesis that it is sufficient to regularize only the lower resolution feature maps is confirmed
- Increasing number of 16x16 modules improves performance
- Downsampling techniques can be used to avoid high resolution feature maps
- Multiscale loss reduces FID score for larger resolutions
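One downsampling option named later in this summary is the ‘5/3’ DWT. As an illustration, here is a one-dimensional, single-level, floating-point lifting sketch of the LeGall 5/3 wavelet with mirrored boundaries. It is a generic textbook construction, not the paper's implementation; a 2D transform would apply it along rows and then along columns.

```python
def dwt53_1d(x):
    """One level of the 5/3 wavelet via lifting; len(x) must be even.
    Returns (low, high) frequency halves."""
    even, odd = x[0::2], x[1::2]
    n = len(odd)
    # Predict: detail = odd sample minus the mean of its even neighbours.
    high = [odd[i] - 0.5 * (even[i] + even[min(i + 1, n - 1)])
            for i in range(n)]
    # Update: smooth = even sample plus a quarter of neighbouring details.
    low = [even[i] + 0.25 * (high[max(i - 1, 0)] + high[i])
           for i in range(n)]
    return low, high


def idwt53_1d(low, high):
    """Exact inverse of dwt53_1d: undo the update, then undo the predict."""
    n = len(low)
    even = [low[i] - 0.25 * (high[max(i - 1, 0)] + high[i])
            for i in range(n)]
    odd = [high[i] + 0.5 * (even[i] + even[min(i + 1, n - 1)])
           for i in range(n)]
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out
```

Because the transform is invertible, the network can operate on half-resolution low and high frequency maps without discarding information, in contrast to plain average pooling.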
The U-ViT architecture
- Convolutional layers can be replaced with MLP blocks if the architecture already uses self-attention at that resolution
- Combination of self-attention and MLP blocks has high accelerator utilization, leading to faster training
- U-Vision Transformer (U-ViT) architecture is a small convolutional U-Net with a large transformer applied at 16x16 resolution
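A quick calculation shows why placing the transformer at 16x16 is attractive. The sketch assumes one token per spatial position and counts the entries of a single self-attention matrix, which grows quadratically in the number of tokens; the helper name is illustrative.

```python
def attention_cost(resolution):
    """Number of tokens and attention-matrix entries (per head, per example)
    when every spatial position of a square feature map is one token."""
    tokens = resolution * resolution
    return tokens, tokens * tokens


if __name__ == "__main__":
    for res in (16, 32, 64):
        print(res, attention_cost(res))
```

At 16x16 the attention matrix has 65,536 entries, while at 64x64 it would have about 16.8 million, a 256-fold increase.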
Text to image generation
- Trained a simple diffusion model conditioned on text data
- Used T5 XXL text encoder as conditioning
- Trained three models on different image resolutions (256x256, 512x512, 384x640)
- Images are rotated during preprocessing if width is smaller than height, and a ‘portrait mode’ flag is set to true
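The rotation step can be sketched as follows for an image stored as a list of rows. The rotation direction (clockwise here) and the function name are assumptions; the summary only states that portrait images are rotated and flagged.

```python
def to_landscape(image):
    """Rotate portrait images by 90 degrees so that width >= height.
    Returns (image, portrait_flag); the flag can be passed to the model
    as extra conditioning so the rotation can be undone after sampling."""
    height, width = len(image), len(image[0])
    if width < height:
        # Clockwise rotation: reverse the row order, then transpose.
        rotated = [list(row) for row in zip(*image[::-1])]
        return rotated, True
    return image, False
```

With this convention every training example has the same orientation, so a single 384x640 model can cover both landscape and portrait images.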
Related work
- Score-based diffusion models are generative models that pre-define a stochastic destruction process.
- Diffusion models for high resolutions are generally not learned directly, but divided into sub-problems.
- This paper shows that it is possible to train a single denoising diffusion model for resolutions up to 512 × 512.
Experiments
Effects of the proposed modifications
- Noise schedule affects the quality of generated images.
- Shifting the log SNR curve using the ratio between the image resolution and a reference (base) resolution improves performance.
- The difference in performance between shifting towards a base resolution of 64 and of 32 is relatively small.
Comparison with literature
- Simple diffusion is compared to existing approaches in literature
- U-ViT models perform well on train FID and Inception Score
- U-Net models perform better on eval FID
- Simple diffusion achieves SOTA FID scores on class-conditional ImageNet generation
- Simple diffusion is a little better than some recent text-to-image models
- Simple diffusion is the first model that can generate images of this quality using only a single diffusion model
Conclusion
- Introduced several modifications of denoising diffusion formulation for high resolution images
- Achieved state-of-the-art performance on ImageNet in FID score
- First single-stage text to image model that can generate images with high visual quality
- Defined how signal is destroyed (diffused)
- Loss computed as defined below
- Conditioning added as input to the uvit call
- Standard cosine log-SNR schedule
- Used standard ddpm sampler
- Effect of guidance on ImageNet models
- CLIP score vs. MS-COCO FID30K score for the text to image model
- Generated images with simple diffusion in full image space
- Samples drawn from U-Net model with guidance scale 4
- Standard and shifted diffusion noise on an image of 512 × 512
- Text to image samples at resolution 512 × 512
- Text to image samples generated with simple diffusion at resolution 256 × 256
- The ‘5/3’ DWT maps an image to low and high frequency response feature maps
- Standard cosine schedule and shifted schedule
- Interpolated schedule
- Downsampling strategies on ImageNet 512 × 512
- Multiscale loss
- Comparison to generative models in the literature on ImageNet
- Guidance scale: the shifted schedule is quite sensitive to guidance
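For the standard DDPM sampler mentioned in the list above, a single ancestral step can be sketched for one scalar pixel in terms of log-SNR values. It assumes a variance preserving process and a model that outputs a prediction x_pred of the clean data; the function names are illustrative, and a full sampler would loop this step from t = 1 down to t = 0.

```python
import math
import random


def sigmoid(x):
    """Logistic function: alpha^2 = sigmoid(logsnr), sigma^2 = sigmoid(-logsnr)."""
    return 1.0 / (1.0 + math.exp(-x))


def ddpm_posterior(z_t, x_pred, logsnr_t, logsnr_s):
    """Mean and variance of q(z_s | z_t, x = x_pred) for s < t,
    i.e. logsnr_s >= logsnr_t."""
    c = -math.expm1(logsnr_t - logsnr_s)  # equals sigma_{t|s}^2 / sigma_t^2
    alpha_t = math.sqrt(sigmoid(logsnr_t))
    alpha_s = math.sqrt(sigmoid(logsnr_s))
    mean = alpha_s * (z_t * (1.0 - c) / alpha_t + c * x_pred)
    var = sigmoid(-logsnr_s) * c
    return mean, var


def ddpm_step(z_t, x_pred, logsnr_t, logsnr_s, rng=random):
    """Draw z_s from the Gaussian posterior above."""
    mean, var = ddpm_posterior(z_t, x_pred, logsnr_t, logsnr_s)
    return mean + math.sqrt(max(var, 0.0)) * rng.gauss(0.0, 1.0)
```

Two limiting cases are a useful sanity check: when both log-SNR values are equal the step returns z_t unchanged, and when logsnr_s is very large the step collapses onto x_pred.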