Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Diffusion models are difficult to apply to high resolution images.
  • Existing approaches focus on lower dimensional spaces or multiple super-resolution levels.
  • This paper aims to improve denoising diffusion for high resolution images while keeping the model simple.
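  • Four main findings: the noise schedule should be adjusted for larger images, it is sufficient to scale only a particular part of the architecture, dropout should be added at specific locations in the architecture, and downsampling is an effective strategy to avoid high resolution feature maps.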

Paper Content

Introduction

  • Score-based diffusion models are used to generate data by adding random noise and approximating the denoising process with a neural network.
  • Diffusion models are effective for image, audio, and video generation.
  • For higher resolutions, literature typically uses lower dimensional latent spaces or divides the generative process into multiple sub-problems.
  • This paper aims to improve standard denoising diffusion for higher resolutions while keeping the model as simple as possible.

Background: diffusion models

  • Diffusion model generates data by learning the reverse of a destruction process
  • Gaussian noise is added over time
  • Hyperparameters determine how much signal is destroyed
  • Variance preserving process fixes relation between hyperparameters
  • Transition distributions are given by a normal distribution
  • Noise schedule is the α-cosine schedule
  • Denoising process can be written as an equation
  • Without loss of generality, the neural network can be taken to predict the clean data x from the noisy input
  • Epsilon and v prediction parametrizations can be used
  • Epsilon loss is used to train the model
  • Lower bound on the model log-likelihood can be derived using variational inference
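
The background bullets above can be made concrete with a small sketch. It assumes the common form of the α-cosine schedule, α_t = cos(πt/2) and σ_t = sin(πt/2), which is variance preserving because α_t² + σ_t² = 1, and the usual definition of the v target, v = α_t ε − σ_t x. The function names are hypothetical and the code works on plain lists of floats rather than image tensors.

```python
import math

def cosine_alpha_sigma(t):
    """Assumed alpha-cosine schedule for t in [0, 1]; alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def diffuse(x, eps, t):
    """Destruction process: z_t = alpha_t * x + sigma_t * eps."""
    alpha, sigma = cosine_alpha_sigma(t)
    return [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]

def v_target(x, eps, t):
    """v parametrization: v = alpha_t * eps - sigma_t * x."""
    alpha, sigma = cosine_alpha_sigma(t)
    return [alpha * ei - sigma * xi for xi, ei in zip(x, eps)]

def x_from_v(z, v, t):
    """Recover the data estimate from a v prediction: x = alpha_t * z - sigma_t * v."""
    alpha, sigma = cosine_alpha_sigma(t)
    return [alpha * zi - sigma * vi for zi, vi in zip(z, v)]

def eps_from_v(z, v, t):
    """Recover the noise estimate from a v prediction: eps = sigma_t * z + alpha_t * v."""
    alpha, sigma = cosine_alpha_sigma(t)
    return [sigma * zi + alpha * vi for zi, vi in zip(z, v)]

def snr(t):
    """Signal-to-noise ratio alpha_t^2 / sigma_t^2 (undefined at t = 0)."""
    alpha, sigma = cosine_alpha_sigma(t)
    return alpha ** 2 / sigma ** 2
```

With a perfect v prediction, `x_from_v` and `eps_from_v` return the original data and noise exactly, which is why the ε and v parametrizations are interchangeable up to a reweighting of the loss.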

Method: simple diffusion

  • Introduce modifications to enable denoising diffusion to work on high resolutions
  • Modifications improve performance on high resolutions

Adjusting noise schedules

  • Noise schedule used in diffusion models is the α-cosine schedule
  • This schedule was originally proposed to improve performance on CIFAR10 and ImageNet
  • For higher resolutions, not enough noise is added
  • Diffusion distribution for pixel i is given by q(z_t^(i) | x)
  • Variance of independent random variables is additive
  • For higher resolutions, noise schedule can be changed in a predictable way
  • When averaging over a window of size s × s, the ratio α_t/σ_t increases by a factor s, so the SNR increases by a factor s²
  • Noise schedule can be defined with respect to a reference resolution
  • SNR is multiplied by (64/d)² for d > 64
  • Interpolating schedules can be used to include higher frequency details

Multiscale training loss

  • Noise schedule of diffusion model should be adjusted when training on high resolution images to keep signal-to-noise ratio constant.
  • Standard training loss is dominated by high frequency details, so propose replacing it with multiscale version.
  • Multiscale loss enables quicker convergence at resolutions greater than 256x256.
  • Training loss is a weighted sum of losses for resolutions starting at base resolution (32x32) and including final resolution.
  • Relative weight of loss is decreased as resolution is increased.
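
The weighted sum described above might look roughly like the following. The sketch assumes that each term compares average-pooled versions of the true and predicted noise, and that the term at resolution s is weighted by 1/s; both the pooling choice and the exact weights are assumptions here, and the helper names are hypothetical. Images are square nested lists to keep the example dependency-free.

```python
def avg_pool(img, factor):
    """Average-pool a square 2D list by an integer factor."""
    n = len(img) // factor
    area = factor * factor
    return [
        [
            sum(img[i * factor + a][j * factor + b]
                for a in range(factor) for b in range(factor)) / area
            for j in range(n)
        ]
        for i in range(n)
    ]

def mse(a, b):
    """Mean squared error between two square 2D lists of equal size."""
    n = len(a)
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

def multiscale_loss(eps, eps_hat, base_resolution=32):
    """Sum of (1/s) * MSE at resolutions s = base, 2*base, ..., d (assumed weighting)."""
    d = len(eps)
    total = 0.0
    s = base_resolution
    while s <= d:
        factor = d // s
        total += mse(avg_pool(eps, factor), avg_pool(eps_hat, factor)) / s
        s *= 2
    return total
```

A checkerboard error, which is purely high frequency, averages to zero at the pooled resolutions and contributes only through the down-weighted full-resolution term, whereas a constant error is counted at every scale.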

Scaling the architecture

  • Typical model architectures halve the channels each time the resolution is doubled.
  • Low computational intensity leads to poor utilization of the accelerator and large activations result in out-of-memory issues.
  • Scaling on the 16x16 resolution is sufficient to improve performance.
  • Low resolution operations have relatively small feature maps.
  • Memory requirements per device scale as 1/(number of devices).
  • Avoiding high resolution feature maps is important to prevent out-of-memory issues.
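
To illustrate why high resolution feature maps dominate memory even when channels are halved at each doubling of resolution, here is a small arithmetic sketch. The channel count of 1024 at 16 × 16 is a hypothetical number chosen for the example, not a value taken from the paper.

```python
def channels_at(resolution, base_resolution=16, base_channels=1024):
    """Channels halve each time the resolution doubles (hypothetical base width)."""
    return base_channels * base_resolution // resolution

def activation_elements(resolution, base_resolution=16, base_channels=1024):
    """Number of values in one feature map: height * width * channels."""
    return resolution * resolution * channels_at(resolution, base_resolution, base_channels)

def activation_table(resolutions=(16, 32, 64, 128, 256)):
    """Map each resolution to its feature map size in elements."""
    return {r: activation_elements(r) for r in resolutions}
```

Under these assumptions a 256 × 256 feature map holds 16 times as many values as the 16 × 16 one, despite having 16 times fewer channels, which is consistent with adding capacity at 16 × 16 rather than at the higher resolutions.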

Dropout

  • ImageNet dataset has 1 million images
  • Regularizing networks to avoid overfitting is important
  • Dropout is enabled on a subset of network layers
  • The experiments support the hypothesis that it is sufficient to regularize only the lower resolution feature maps
  • Increasing number of 16x16 modules improves performance
  • Downsampling techniques can be used to avoid high resolution feature maps
  • Multiscale loss reduces FID score for larger resolutions
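
One way to express "dropout on a subset of layers" is to choose the rate from the resolution a layer operates at. The threshold of 16 and the rate of 0.1 below are hypothetical values for illustration, and the helper names are not from the paper.

```python
import random

def dropout_rate_for(resolution, max_dropout_resolution=16, rate=0.1):
    """Enable dropout only for layers at or below the given resolution (assumed rule)."""
    return rate if resolution <= max_dropout_resolution else 0.0

def dropout(values, rate, rng):
    """Inverted dropout on a list of floats: zero with probability `rate`,
    otherwise rescale by 1 / (1 - rate) so the expected value is unchanged."""
    if rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

For example, `dropout(features, dropout_rate_for(64), random.Random(0))` leaves a 64 × 64 layer's features untouched, while the same call with resolution 16 regularizes them.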

The U-ViT architecture

  • Convolutional layers can be replaced with MLP blocks at resolutions where the architecture already uses self-attention
  • Combination of self-attention and MLP blocks has high accelerator utilization, leading to faster training
  • U-Vision Transformer (U-ViT) architecture is a small convolutional U-Net with a large transformer applied at 16x16 resolution

Text to image generation

  • Trained a simple diffusion model conditioned on text data
  • Used T5 XXL text encoder as conditioning
  • Trained three models on different image resolutions (256x256, 512x512, 384x640)
  • Images are rotated during preprocessing if width is smaller than height, and a ‘portrait mode’ flag is set to true
  • Score-based diffusion models are a class of generative models that pre-define a stochastic destruction process.
  • Diffusion models for high resolutions are generally not learned directly, but divided into sub-problems.
  • This paper shows that it is possible to train a single denoising diffusion model for resolutions up to 512 × 512.
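
The portrait handling mentioned in the preprocessing bullet above can be sketched as follows. The rotation direction (90° clockwise) and the function name are assumptions; the summary only states that portrait images are rotated and flagged.

```python
def to_landscape(image):
    """Rotate portrait images so that width >= height.

    `image` is a list of rows. Returns (image, portrait_flag), where the flag
    records whether a rotation was applied so it can be passed as conditioning.
    """
    height = len(image)
    width = len(image[0])
    if width < height:
        # Reverse the row order, then transpose: a 90 degree clockwise rotation.
        rotated = [list(row) for row in zip(*image[::-1])]
        return rotated, True
    return image, False
```

At sampling time the flag would let a generated image be rotated back to portrait orientation.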

Experiments

Effects of the proposed modifications

  • Noise schedule affects the quality of generated images.
  • Shifting the log SNR curve by the ratio between the image resolution and a reference resolution improves performance.
  • The difference in performance between shifting towards 64 and shifting towards 32 is relatively small.

Comparison with literature

  • Simple diffusion is compared to existing approaches in literature
  • U-ViT models perform well on train FID and Inception Score
  • U-Net models perform better on eval FID
  • Simple diffusion achieves SOTA FID scores on class-conditional ImageNet generation
  • Simple diffusion is a little better than some recent text-to-image models
  • Simple diffusion is the first model that can generate images of this quality using only a single diffusion model

Conclusion

  • Introduced several modifications of denoising diffusion formulation for high resolution images
  • Achieved state-of-the-art performance on ImageNet in FID score
  • First single-stage text to image model that can generate images with high visual quality
  • Defined how signal is destroyed (diffused)
  • Loss computed as defined below
  • Conditioning added as input to the uvit call
  • Standard cosine logsnr schedule
  • Used standard ddpm sampler
  • Effect of guidance on ImageNet models
  • CLIP score vs MSCOCO FID30K score for text to image model
  • Generated images with simple diffusion in full image space
  • Samples drawn from U-Net model with guidance scale 4
  • Standard and shifted diffusion noise on an image of 512 × 512
  • Text to image samples at resolution 512 × 512
  • Text to image samples generated with simple diffusion at resolution 256 × 256
  • ‘5/3’ DWT transform transforms an image to low and high frequency response feature maps
  • Standard cosine schedule and shifted schedule
  • Interpolated schedule
  • Downsampling strategies on ImageNet 512 × 512
  • Multiscale loss
  • Comparison to generative models in the literature on ImageNet
  • Guidance scale: the shifted schedule is quite sensitive to guidance