Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Diffusion models decompose the image formation process into a sequential application of denoising autoencoders and achieve state-of-the-art synthesis results.
  • Diffusion models are powerful, but training them often takes hundreds of GPU days and inference is expensive because of sequential evaluations.
  • Applying diffusion models in the latent space of pretrained autoencoders reduces complexity while preserving detail.
  • Cross-attention layers turn diffusion models into powerful and flexible generators for general conditioning inputs.
  • Latent diffusion models achieve new state of the art for image inpainting and competitive performance on various tasks.

Paper Content

Introduction

  • Image synthesis is one of the fastest-developing fields of computer vision and also one of the most computationally demanding
  • GANs have been shown to be mostly confined to data with comparably limited variability
  • Diffusion models have achieved impressive results in image synthesis and beyond
  • DMs belong to the class of likelihood-based models, which can model complex distributions of natural images
  • Training and evaluating DMs requires repeated function evaluations and gradient computations in the high-dimensional space of RGB images
  • Our approach reduces the computational complexity for both training and sampling
  • We separate training into two distinct phases: an autoencoder and a generative model
  • Our method scales more gracefully to higher dimensional data and can be applied to high-resolution synthesis of megapixel images
  • We achieve competitive performance on multiple tasks while significantly lowering computational costs
  • We design a general-purpose conditioning mechanism based on cross-attention
  • We release pretrained latent diffusion and autoencoding models
  • Generative Adversarial Networks (GANs) allow for efficient sampling of high resolution images with good perceptual quality.
  • Variational autoencoders (VAEs) and flow-based models enable efficient synthesis of high resolution images, but sample quality is not as good as GANs.
  • Autoregressive models (ARMs) achieve strong performance in density estimation, but are limited to low resolution images.
  • Maximum-likelihood training spends a disproportionate amount of capacity on modeling high-frequency details, resulting in long training times.
  • Two-stage approaches use ARMs to model a compressed latent image space instead of raw pixels.
  • Diffusion Probabilistic Models (DMs) have achieved state-of-the-art results in density estimation and sample quality.
  • Approaches to combine the strengths of different methods into more efficient and performant models exist.
  • Our proposed LDMs work on a compressed latent space of lower dimensionality, making training computationally cheaper and speeding up inference.

Method

  • Training diffusion models for high-resolution image synthesis is computationally demanding.
  • Proposed approach introduces an explicit separation of compressive and generative learning phases.
  • Autoencoding model learns a space that is perceptually equivalent to image space but with reduced complexity.
  • Sampling is performed on a low-dimensional space, reducing computational complexity.
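  • Exploits the inductive bias of DMs, inherited from their UNet architecture, so that the aggressive, quality-reducing compression levels required by previous approaches are not needed.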
  • General-purpose compression models can be used to train multiple generative models and for other downstream applications.

Perceptual image compression

  • Perceptual compression model based on previous work
  • Autoencoder trained by combination of perceptual loss and patch-based adversarial objective
  • Local realism enforced to avoid blurriness
  • Encoder downsamples image by factor f
  • Two regularization variants: KL-penalty and vector quantization
  • Mild compression rates achieve good reconstructions
  • Preserves details of image better than previous works
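To make the downsampling factor f concrete, the following minimal sketch computes the spatial size of the latent for a given image size, assuming f is a power of two and divides both image dimensions. The function names and the default of 4 latent channels are illustrative assumptions, not values taken from the summary above.

```python
def latent_shape(height, width, f, channels=4):
    """Return (h, w, c) of the latent for an image downsampled by factor f.

    `channels=4` is an assumed, illustrative number of latent channels.
    """
    if f < 1 or f & (f - 1) != 0:
        raise ValueError("f is assumed to be a power of two (f = 2**m)")
    if height % f or width % f:
        raise ValueError("height and width must be divisible by f")
    return height // f, width // f, channels


def compression_ratio(height, width, f, channels=4):
    """Ratio of RGB values in the image to values in the latent."""
    h, w, c = latent_shape(height, width, f, channels)
    return (height * width * 3) / (h * w * c)
```

Under these assumptions, a 256 x 256 RGB image with f = 8 maps to a 32 x 32 x 4 latent, which holds 48 times fewer values than the image.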

Latent diffusion models

  • Diffusion Models are probabilistic models used to learn a data distribution.
  • Reweighted variant of variational lower bound is used for image synthesis.
  • Model consists of a sequence of denoising autoencoders.
  • Perceptual compression models are used to access a low-dimensional latent space.
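The reweighted objective mentioned above trains a network to predict the noise that was added to a latent z at a random timestep t, i.e. it minimizes the squared error between the true noise and the prediction. The sketch below illustrates the noising step and this loss on plain Python lists. The linear beta schedule, its default values, and all function names are assumptions chosen for illustration; the noise predictor itself is left out.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z0, alpha_bar, rng=None):
    """Sample z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    rng = rng or random.Random()
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    zt = [a * z + b * e for z, e in zip(z0, eps)]
    return zt, eps


def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between the true noise and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)
```

In a latent diffusion model, z0 would be the encoder output for a training image rather than the image itself, which is where the savings in computation come from.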

Conditioning mechanisms

  • Diffusion models can model conditional distributions of the form p(z|y).
  • Diffusion models can be used for image-to-image translation tasks.
  • Combining the generative power of DMs with other types of conditionings is an under-explored area of research.
  • Cross-attention mechanism is used to make DMs more flexible.
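In cross-attention conditioning, the queries come from the flattened intermediate features of the denoising network, while the keys and values come from an encoded representation of the conditioning input y (for example, text tokens). The sketch below shows a single attention head in pure Python. The helper names, the list-of-lists matrix layout, and the projection matrices w_q, w_k, w_v are illustrative assumptions, not the authors' implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """Single-head attention: queries from latents, keys/values from conditioning."""
    q = matmul(latent_tokens, w_q)          # (n_latent, d)
    k = matmul(cond_tokens, w_k)            # (n_cond, d)
    v = matmul(cond_tokens, w_v)            # (n_cond, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]    # (d, n_cond)
    scores = matmul(q, k_t)                 # (n_latent, n_cond)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)               # (n_latent, d)
```

Each latent position receives a weighted mixture of the conditioning tokens, so the same denoising network can be reused with different conditioning encoders.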

Experiments

  • LDMs provide a means for flexible and computationally tractable diffusion-based image synthesis.
  • LDMs are compared to pixel-based diffusion models in terms of training and inference.
  • LDMs trained in VQ-regularized latent spaces sometimes achieve better sample quality.
  • Reconstruction capabilities of VQ-regularized first-stage models fall slightly behind those of their continuous counterparts.

On perceptual compression tradeoffs

  • LDMs with different downsampling factors were tested
  • Fixed computational resources and same number of parameters for all experiments
  • Small downsampling factors result in slow training progress
  • Too large values of f cause stagnating fidelity
  • LDM-4 and -8 offer best conditions for high-quality synthesis results

Image generation with latent diffusion

  • Trained unconditional models on CelebA-HQ, FFHQ, LSUN-Churches and -Bedrooms
  • Evaluated sample quality and coverage of data manifold using FID and Precision-and-Recall
  • Reported state-of-the-art FID of 5.11 on CelebA-HQ
  • Trained diffusion models in a fixed latent space, rather than training them jointly with the first stage
  • Trained a 1.45B-parameter KL-regularized LDM conditioned on language prompts
  • Improved upon powerful AR and GAN-based methods for text-to-image generation
  • Used classifier-free guidance to boost sample quality
  • Trained models to synthesize images based on semantic layouts on OpenImages
  • Used images of landscapes paired with semantic maps to train semantic synthesis models
  • Applied super-resolution and inpainting models to generate large images between 512² and 1024²
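Classifier-free guidance, mentioned in the list above, combines an unconditional and a conditional noise prediction at sampling time. A minimal sketch under the commonly used formulation, where the guided prediction is the unconditional one pushed along the direction of the conditional one by a guidance scale; the function name and list-based inputs are illustrative assumptions.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Return eps_uncond + scale * (eps_cond - eps_uncond), element-wise.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates beyond the conditional prediction.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("both predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Larger scales typically trade sample diversity for closer adherence to the conditioning.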

Super-resolution with latent diffusion

  • LDM can be used to train for super-resolution
  • Low-resolution inputs are produced from ImageNet images by bicubic interpolation with 4×-downsampling
  • LDM-SR outperforms SR3 in FID, but SR3 has better IS
  • Simple image regression model has highest PSNR and SSIM scores
  • User study confirms good performance of LDM-SR
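PSNR is computed directly from the pixel-wise mean squared error, which helps explain why a plain regression model can score highest on it: blurry but pixel-accurate outputs have low MSE even when they look less realistic. A minimal sketch, assuming images are given as flat sequences of pixel values in the range 0 to max_value; the function name and defaults are illustrative.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the score depends only on per-pixel differences, it does not measure whether high-frequency detail looks plausible, which is why FID and user studies are reported alongside it.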

Inpainting with latent diffusion

  • Inpainting is the task of filling masked regions of an image with new content
  • We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches
  • We compare the inpainting efficiency of two models with different regularizations
  • Our model with attention improves the overall image quality and is favored by human subjects in a user study

Limitations & societal impact

Conclusion

  • Latent diffusion models improve training and sampling efficiency of denoising diffusion models
  • Cross-attention conditioning mechanism used in experiments
  • Experiments show favorable results compared to state-of-the-art methods
  • Updated results on text-to-image synthesis
  • Updated results on class-conditional synthesis on ImageNet
  • User study conducted
  • Added Fig. 5 to main paper, moved Fig. 18 to appendix, added Fig. 13 to appendix
  • Diffusion models can be conditioned at test-time
  • Post-hoc image-guiding used
  • Gaussian guider with fixed variance used
  • Perceptual similarity guiding used
  • Cross-attention mechanism used
  • Class-conditional model uses single learnable embedding layer
  • Synthetic masks used for image-inpainting
  • FID, Precision, and Recall scores estimated
  • FID and Inception Score computed for Text-to-Image models
  • FID scores computed for Layout-to-Image models
  • FID and Inception Score computed for Super-Resolution models
  • Human preference scores assessed for two distinct tasks
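Several of the evaluations listed above rely on FID. As a reminder of what the score measures, the standard definition compares Gaussian fits to Inception features of real and generated images; the symbols below (means and covariances of real features r and generated features g) are the usual notation and are not taken from this summary.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one.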