Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Diffusion models are difficult to apply to high resolution images.
Existing approaches focus on lower dimensional spaces or multiple super-resolution levels.
This paper aims to improve denoising diffusion for high resolution images while keeping the model simple.
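The four main findings are: the noise schedule should be adjusted for high resolution images, it is sufficient to scale only a particular part of the architecture, dropout should be added at specific locations in the architecture, and downsampling is an effective strategy to avoid high resolution feature maps.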

Paper Content

Introduction

Score-based diffusion models are used to generate data by adding random noise and approximating the denoising process with a neural network.
Diffusion models are effective for image, audio, and video generation.
For higher resolutions, literature typically uses lower dimensional latent spaces or divides the generative process into multiple sub-problems.
This paper aims to improve standard denoising diffusion for higher resolutions while keeping the model as simple as possible.

Background: diffusion models

Diffusion model generates data by learning the reverse of a destruction process
Gaussian noise is added over time
Hyperparameters determine how much signal is destroyed
Variance preserving process fixes relation between hyperparameters
Transition distributions are given by a normal distribution
Noise schedule is the α-cosine schedule
Denoising process can be written as an equation
Without loss of generality, the neural network can be defined to predict an approximation of the data
Epsilon and v prediction parametrizations can be used
Epsilon loss is used to train the model
A lower bound on the model log-likelihood can be derived using variational inference
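The points above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: it assumes a cosine schedule written in terms of log SNR with clipping bounds of ±15, and the function names (`logsnr_cosine`, `alpha_sigma`, `diffuse`) are hypothetical.

```python
import numpy as np

def logsnr_cosine(t, logsnr_min=-15.0, logsnr_max=15.0):
    """Cosine schedule as log(alpha_t^2 / sigma_t^2) for t in [0, 1].

    The +/-15 bounds are an assumed clipping range, not taken from the summary.
    """
    t_min = np.arctan(np.exp(-0.5 * logsnr_max))
    t_max = np.arctan(np.exp(-0.5 * logsnr_min))
    return -2.0 * np.log(np.tan(t_min + t * (t_max - t_min)))

def alpha_sigma(logsnr):
    """Variance preserving: alpha^2 + sigma^2 = 1, so alpha^2 = sigmoid(logsnr)."""
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-logsnr)))
    sigma = np.sqrt(1.0 / (1.0 + np.exp(logsnr)))
    return alpha, sigma

def diffuse(x, t, rng):
    """Sample z_t = alpha_t * x + sigma_t * eps and return both training targets."""
    alpha, sigma = alpha_sigma(logsnr_cosine(t))
    eps = rng.standard_normal(x.shape)   # epsilon-prediction target
    z_t = alpha * x + sigma * eps        # noisy sample at time t
    v = alpha * eps - sigma * x          # v-prediction target
    return z_t, eps, v
```

Under the variance preserving constraint, the data can be recovered from a v prediction as `alpha * z_t - sigma * v`, which is why the two parametrizations are interchangeable.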

Method: simple diffusion

Introduce modifications to enable denoising diffusion to work on high resolutions
Modifications improve performance on high resolutions

Adjusting noise schedules

Noise schedule used in diffusion models is the α-cosine schedule
This schedule was originally proposed to improve performance on CIFAR10 and ImageNet
For higher resolutions, not enough noise is added
Diffusion distribution for pixel i is given by q(z_t^(i) | x)
Variance of independent random variables is additive
For higher resolutions, noise schedule can be changed in a predictable way
SNR increases by a factor s^2 when averaging over a window of size s × s, because the variance of the averaged noise drops by s^2
Noise schedule can be defined with respect to a reference resolution
SNR is multiplied by (64/d)^2 for d > 64
Interpolating schedules can be used to include higher frequency details
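The shift can be sketched as follows. Multiplying the SNR by (64/d)^2 is the same as adding 2 log(64/d) to the log SNR. The function names are hypothetical, the ±15 clipping bounds are assumed, and the linear-in-t form of the interpolated schedule is an assumption about how two shifted schedules might be blended, not a statement of the paper's exact formula.

```python
import numpy as np

def logsnr_cosine(t, logsnr_min=-15.0, logsnr_max=15.0):
    """Cosine schedule as log SNR for t in [0, 1] (bounds are assumed)."""
    t_min = np.arctan(np.exp(-0.5 * logsnr_max))
    t_max = np.arctan(np.exp(-0.5 * logsnr_min))
    return -2.0 * np.log(np.tan(t_min + t * (t_max - t_min)))

def logsnr_shifted(t, image_d, reference_d=64):
    """Shift the schedule so SNR is multiplied by (reference_d / image_d)^2."""
    return logsnr_cosine(t) + 2.0 * np.log(reference_d / image_d)

def logsnr_interpolated(t, image_d, ref_start, ref_end):
    """Blend two shifted schedules linearly in t (assumed form).

    At t = 0 this equals the schedule shifted to ref_start,
    at t = 1 the schedule shifted to ref_end.
    """
    start = logsnr_shifted(t, image_d, ref_start)
    end = logsnr_shifted(t, image_d, ref_end)
    return (1.0 - t) * start + t * end
```

For a 256 × 256 image with reference resolution 64, the SNR at every t is 16 times lower than under the unshifted schedule, so more of the image is destroyed at the same time step.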

Multiscale training loss

Noise schedule of diffusion model should be adjusted when training on high resolution images to keep signal-to-noise ratio constant.
The standard training loss is dominated by high frequency details, so the authors propose replacing it with a multiscale version.
Multiscale loss enables quicker convergence at resolutions greater than 256x256.
Training loss is a weighted sum of losses for resolutions starting at base resolution (32x32) and including final resolution.
Relative weight of loss is decreased as resolution is increased.
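A minimal sketch of such a loss, assuming square inputs whose side is the base resolution times a power of two, average pooling as the downsampling operator, and a weight of 1/s for resolution s. The weight and the function names are assumptions chosen to match the description above, not confirmed details.

```python
import numpy as np

def avg_pool(x, factor):
    """Average-pool a (H, W, C) array by an integer factor via reshaping."""
    h, w, c = x.shape
    return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def multiscale_loss(eps, eps_hat, base=32):
    """Weighted sum of mean squared errors at resolutions base, 2*base, ..., d."""
    d = eps.shape[0]
    total = 0.0
    s = base
    while s <= d:
        factor = d // s
        diff = avg_pool(eps, factor) - avg_pool(eps_hat, factor)
        total += np.mean(diff ** 2) / s   # assumed weight 1/s: smaller at higher resolution
        s *= 2
    return total
```

For a 64 × 64 input the sum has two terms (32 × 32 and 64 × 64); for 512 × 512 it has five.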

Scaling the architecture

Typical model architectures halve the channels each time the resolution is doubled.
Low computational intensity leads to poor utilization of the accelerator and large activations result in out-of-memory issues.
Scaling on the 16x16 resolution is sufficient to improve performance.
Low resolution operations have relatively small feature maps.
Memory requirements per device decrease with 1/devices.
Avoiding high resolution feature maps is important to prevent out-of-memory issues.
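The memory argument can be made concrete with a little arithmetic. The sketch below assumes a U-Net that halves the resolution and doubles the channels at each level. The starting values (512 × 512 with 64 channels) are hypothetical numbers chosen for illustration.

```python
def unet_feature_maps(top_resolution, top_channels, num_levels):
    """List (resolution, channels, activation elements per image) for each level.

    Each level down halves the resolution and doubles the channels, so the
    number of activation elements halves per level.
    """
    levels = []
    for i in range(num_levels):
        res = top_resolution // 2 ** i
        ch = top_channels * 2 ** i
        levels.append((res, ch, res * res * ch))
    return levels

if __name__ == "__main__":
    for res, ch, elems in unet_feature_maps(512, 64, 6):
        print(f"{res:4d} x {res:<4d} channels={ch:5d} activations={elems:,}")
```

With these assumed numbers, a 16 × 16 feature map holds 32 times fewer activation elements than the 512 × 512 one even though it has 32 times more channels, which is why adding capacity at 16 × 16 is comparatively cheap in memory.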

Dropout

ImageNet dataset has 1 million images
Regularizing networks to avoid overfitting is important
Dropout is enabled on a subset of network layers
The hypothesis that regularizing only the lower resolution feature maps is sufficient holds
Increasing number of 16x16 modules improves performance
Downsampling techniques can be used to avoid high resolution feature maps
Multiscale loss reduces FID score for larger resolutions

The U-ViT architecture

Convolutional layers can be replaced with MLP blocks if the architecture already uses self-attention
Combination of self-attention and MLP blocks has high accelerator utilization, leading to faster training
U-Vision Transformer (U-ViT) architecture is a small convolutional U-Net with a large transformer applied at 16x16 resolution

Text to image generation

Trained a simple diffusion model conditioned on text data
Used T5 XXL text encoder as conditioning
Trained three models on different image resolutions (256x256, 512x512, 384x640)
Images are rotated during preprocessing if width is smaller than height, and a ‘portrait mode’ flag is set to true
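The rotation step might look like the sketch below. The function name `to_landscape` and the rotation direction are assumptions. The summary only states that portrait images are rotated and flagged.

```python
import numpy as np

def to_landscape(image):
    """Rotate a (H, W, C) image by 90 degrees if width < height.

    Returns the (possibly rotated) image and a portrait-mode flag that can be
    passed to the model as extra conditioning.
    """
    height, width = image.shape[:2]
    if width < height:
        # Rotation direction is an assumption; any fixed 90 degree turn works
        # as long as it is undone consistently after sampling.
        return np.rot90(image, k=1, axes=(0, 1)), True
    return image, False
```

At sampling time the flag tells the model which orientation to generate, and a flagged output would be rotated back by the inverse turn (`np.rot90(image, k=-1, axes=(0, 1))`).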

Related work

Score-based diffusion models are generative models that pre-define a stochastic destruction process.
Diffusion models for high resolutions are generally not learned directly, but divided into sub-problems.
This paper shows that it is possible to train a single denoising diffusion model for resolutions up to 512 × 512.

Experiments

Effects of the proposed modifications

Noise schedule affects the quality of generated images.
Shifting the log SNR curve using the ratio between the image resolution and the reference resolution improves performance.
The difference in performance between shifting towards 64 or 32 is relatively small.

Comparison with literature

Simple diffusion is compared to existing approaches in the literature
U-ViT models perform well on train FID and Inception Score
U-Net models perform better on eval FID
Simple diffusion achieves SOTA FID scores on class-conditional ImageNet generation
Simple diffusion is a little worse in FID than some recent text-to-image models
Simple diffusion is the first model that can generate images of this quality using only a single diffusion model

Conclusion

Introduced several modifications to the denoising diffusion formulation for high resolution images
Achieved state-of-the-art performance on ImageNet in FID score
First single-stage text to image model that can generate images with high visual quality
Implementation notes:
Defines how the signal is destroyed (diffused)
Loss is computed on the diffused sample
Conditioning is added as an input to the U-ViT call
Standard cosine logsnr schedule
Standard DDPM sampler is used

Figures and tables:
Effect of guidance on ImageNet models
CLIP score vs MSCOCO FID30K score for the text to image model
Generated images with simple diffusion in full image space
Samples drawn from the U-Net model with guidance scale 4
Standard and shifted diffusion noise on an image of 512 × 512
Text to image samples at resolution 512 × 512
Text to image samples generated with simple diffusion at resolution 256 × 256
The ‘5/3’ DWT transforms an image into low and high frequency response feature maps
Standard cosine schedule and shifted schedule
Interpolated schedule
Downsampling strategies on ImageNet 512 × 512
Multiscale loss
Comparison to generative models in the literature on ImageNet
Guidance scale: the shifted schedule is quite sensitive to guidance
