Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results.
Because these models typically operate directly in pixel space, optimizing powerful DMs often consumes hundreds of GPU days, and inference is expensive due to sequential evaluations.
Applying diffusion models in the latent space of pretrained autoencoders reduces complexity while preserving detail.
Cross-attention layers turn diffusion models into powerful and flexible generators for general conditioning inputs.
Latent diffusion models achieve new state of the art for image inpainting and competitive performance on various tasks.

Paper Content

Introduction

Image synthesis is a fast-moving area of computer vision that also has high computational demands
GANs have mostly been confined to data with comparably limited variability
Diffusion models have achieved impressive results in image synthesis and beyond
DMs belong to the class of likelihood-based models, which can model complex distributions of natural images
Training and evaluating DMs requires repeated function evaluations and gradient computations in the high-dimensional space of RGB images
Our approach reduces the computational complexity for both training and sampling
We separate training into two distinct phases: first an autoencoder is trained, then a generative model is trained in its latent space
Our method scales more gracefully to higher dimensional data and can be applied to high-resolution synthesis of megapixel images
We achieve competitive performance on multiple tasks while significantly lowering computational costs
We design a general-purpose conditioning mechanism based on cross-attention
We release pretrained latent diffusion and autoencoding models

Related work

Generative Adversarial Networks (GANs) allow for efficient sampling of high resolution images with good perceptual quality.
Variational autoencoders (VAEs) and flow-based models enable efficient synthesis of high resolution images, but sample quality is not as good as GANs.
Autoregressive models (ARMs) achieve strong performance in density estimation, but are limited to low resolution images.
Maximum-likelihood training spends a disproportionate amount of capacity on modeling high-frequency details, resulting in long training times.
Two-stage approaches use ARMs to model a compressed latent image space instead of raw pixels.
Diffusion Probabilistic Models (DMs) have achieved state-of-the-art results in density estimation and sample quality.
Approaches to combine the strengths of different methods into more efficient and performant models exist.
Our proposed LDMs work on a compressed latent space of lower dimensionality, making training computationally cheaper and speeding up inference.

Method

Training diffusion models for high-resolution image synthesis is computationally costly.
Proposed approach introduces an explicit separation of compressive and generative learning phases.
Autoencoding model learns a space that is perceptually equivalent to image space but with reduced complexity.
Sampling is performed on a low-dimensional space, reducing computational complexity.
Exploits the inductive bias of DMs, inherited from their UNet architecture, which alleviates the need for aggressive, quality-reducing compression levels.
General-purpose compression models can be used to train multiple generative models and for other downstream applications.

Perceptual image compression

Perceptual compression model based on previous work
Autoencoder trained by combination of perceptual loss and patch-based adversarial objective
Local realism enforced to avoid blurriness
Encoder downsamples image by factor f
Two regularization variants: KL-penalty and vector quantization
Mild compression rates achieve good reconstructions
Preserves details of image better than previous works
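
The downsampling relation described above can be written out explicitly. In the block below, $\mathcal{E}$ and $\mathcal{D}$ denote the encoder and decoder, $H \times W$ is the image size and $h \times w \times c$ the latent size. The symbols are an assumption about notation, chosen to match the factor f mentioned above.

```latex
z = \mathcal{E}(x), \qquad \tilde{x} = \mathcal{D}(z), \qquad
x \in \mathbb{R}^{H \times W \times 3}, \quad
z \in \mathbb{R}^{h \times w \times c}, \qquad
f = \frac{H}{h} = \frac{W}{w}
```

Under this reading, a larger f means a smaller latent grid for the diffusion model to work on, at the price of a harder reconstruction task for the decoder.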

Latent diffusion models

Diffusion Models are probabilistic models used to learn a data distribution.
Reweighted variant of variational lower bound is used for image synthesis.
Model consists of a sequence of denoising autoencoders.
Perceptual compression models are used to access a low-dimensional latent space.
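
To make the denoising objective concrete, here is a minimal NumPy sketch of noising a latent and scoring a noise prediction with a mean squared error. It is a sketch under stated assumptions, not the paper's implementation: the linear beta schedule, the latent shape, and the names `make_alpha_bar`, `noise_latent`, `denoising_loss` and `zero_model` are all hypothetical, and the "model" is a stand-in where a real system would use a trained UNet.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_latent(z0, t, alpha_bar, rng):
    """Draw z_t from q(z_t | z_0) and return it with the noise that was used."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bar[t]
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return z_t, eps

def denoising_loss(predict_noise, z0, t, alpha_bar, rng):
    """Mean squared error between the true noise and the predicted noise."""
    z_t, eps = noise_latent(z0, t, alpha_bar, rng)
    return float(np.mean((eps - predict_noise(z_t, t)) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((4, 32, 32))  # hypothetical latent: 4 channels, 32 x 32

# Stand-in predictor that always outputs zero noise (a real model would be a UNet).
zero_model = lambda z_t, t: np.zeros_like(z_t)
loss = denoising_loss(zero_model, z0, t=500, alpha_bar=alpha_bar, rng=rng)
```

The only difference from a pixel-space setup in this sketch is that `z0` stands for an encoded latent rather than an RGB image, which is where the computational saving comes from.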

Conditioning mechanisms

Diffusion models can model conditional distributions of the form p(z|y).
Diffusion models can be used for image-to-image translation tasks.
Combining the generative power of DMs with other types of conditionings is an under-explored area of research.
Cross-attention mechanism is used to make DMs more flexible.
Samples from LDMs trained on various datasets.
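
The cross-attention mechanism mentioned above can be sketched with standard scaled dot-product attention, where queries come from the denoiser's intermediate features and keys and values come from the encoded conditioning input. This is an illustrative single-head NumPy version; the function names, the projection matrices `w_q`, `w_k`, `w_v` and all dimensions are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(features, context, w_q, w_k, w_v):
    """Single-head cross-attention.

    features: (n, d_model) flattened spatial features of the denoiser
    context:  (m, d_ctx) encoded conditioning tokens (e.g. text embeddings)
    """
    q = features @ w_q  # queries from the image side, shape (n, d)
    k = context @ w_k   # keys from the conditioning side, shape (m, d)
    v = context @ w_v   # values from the conditioning side, shape (m, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n, m)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, m, d_model, d_ctx, d = 16, 5, 8, 12, 8  # hypothetical sizes
features = rng.standard_normal((n, d_model))
context = rng.standard_normal((m, d_ctx))
w_q = rng.standard_normal((d_model, d))
w_k = rng.standard_normal((d_ctx, d))
w_v = rng.standard_normal((d_ctx, d))
out, weights = cross_attention(features, context, w_q, w_k, w_v)
```

Because the conditioning enters only through the keys and values, the same layer can in principle accept any input that can be encoded as a sequence of vectors, which is what makes the mechanism general-purpose.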

Experiments

LDMs provide a means for flexible and computationally tractable diffusion-based image synthesis.
LDMs are compared to pixel-based diffusion models in terms of training and inference.
LDMs trained in VQ-regularized latent spaces sometimes achieve better sample quality.
Reconstruction capabilities of VQ-regularized first stage models slightly fall behind those of their continuous counterparts.

On perceptual compression tradeoffs

LDMs with different downsampling factors were tested
Fixed computational resources and same number of parameters for all experiments
Small downsampling factors result in slow training progress
Too large values of f cause stagnating fidelity
LDM-4 and -8 offer best conditions for high-quality synthesis results
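
The tradeoff above is easier to see with numbers. The sketch below computes the latent grid size and the reduction in spatial positions for a few downsampling factors. The 256 x 256 input size, the list of factors, and the helper names `latent_grid` and `spatial_reduction` are assumptions chosen for illustration.

```python
def latent_grid(height, width, f):
    """Spatial size of the latent for an image downsampled by factor f."""
    if height % f or width % f:
        raise ValueError("image size must be divisible by f")
    return height // f, width // f

def spatial_reduction(f):
    """How many times fewer spatial positions the latent has than the image."""
    return f * f

# Illustrative factors; f = 1 corresponds to working directly on pixels.
for f in (1, 2, 4, 8, 16, 32):
    h, w = latent_grid(256, 256, f)
    print(f"LDM-{f}: {h} x {w} latent grid, {spatial_reduction(f)}x fewer positions")
```

Small factors leave most of the work in a large grid, while very large factors leave the diffusion model little spatial information to refine, which is consistent with the middle values performing best.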

Image generation with latent diffusion

Trained unconditional models on CelebA-HQ, FFHQ, LSUN-Churches and -Bedrooms
Evaluated sample quality and coverage of data manifold using FID and Precision-and-Recall
Reported state-of-the-art FID of 5.11 on CelebA-HQ
Trained diffusion models in a fixed latent space (the first stage is not trained jointly with the diffusion model)
Trained a 1.45B parameter KL-regularized LDM conditioned on language prompts
Improved upon powerful AR and GAN-based methods for text-to-image generation
Used classifier-free guidance to boost sample quality
Trained models to synthesize images based on semantic layouts on OpenImages
Used images of landscapes paired with semantic maps to train semantic synthesis models
Applied super-resolution and inpainting models to generate large images between 512² and 1024²
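
Classifier-free guidance, mentioned in the list above, combines a conditional and an unconditional noise prediction at sampling time. The sketch below uses one common parameterization, in which a scale of 1 recovers the conditional prediction; other write-ups use a shifted scale, so the exact convention here is an assumption, as is the function name `guided_noise`.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, scale):
    """Extrapolate from the unconditional towards the conditional prediction.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 pushes further in the direction of the condition.
    """
    eps_uncond = np.asarray(eps_uncond, dtype=np.float64)
    eps_cond = np.asarray(eps_cond, dtype=np.float64)
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy two-element "predictions" to show the arithmetic.
guided = guided_noise([0.0, 1.0], [1.0, 3.0], scale=2.0)
```

In practice both predictions come from the same network, evaluated once with the conditioning input and once with an empty or null conditioning.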

Super-resolution with latent diffusion

LDM can be used to train for super-resolution
Experiment used bicubic interpolation with 4×-downsampling and ImageNet data
LDM-SR outperforms SR3 in FID, but SR3 has better IS
Simple image regression model has highest PSNR and SSIM scores
User study confirms good performance of LDM-SR
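
PSNR, one of the metrics above, is a direct function of the pixel-wise mean squared error, which helps explain why a regression model trained on that error can score highest on it without looking best to human raters. A minimal sketch follows, assuming images scaled to the range [0, 1]; the function name `psnr` and the `max_value` parameter are choices made for this example.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in decibels between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01.
reference = np.zeros((8, 8))
estimate = np.full((8, 8), 0.1)
value = psnr(reference, estimate)
```

Metrics such as FID compare feature statistics of whole image sets instead, so the two kinds of score can rank the same models differently.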

Inpainting with latent diffusion

Inpainting is the task of filling masked regions of an image with new content
We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches
We compare the inpainting efficiency of two models with different regularizations
Our model with attention improves the overall image quality and is favored by human subjects in a user study

Limitations & societal impact

Conclusion

Latent diffusion models improve training and sampling efficiency of denoising diffusion models
Cross-attention conditioning mechanism used in experiments
Experiments show favorable results compared to state-of-the-art methods
Updated results on text-to-image synthesis
Updated results on class-conditional synthesis on ImageNet
User study conducted
Added Fig. 5 to main paper, moved Fig. 18 to appendix, added Fig. 13 to appendix
Diffusion models can be conditioned at test-time
Post-hoc image-guiding used
Gaussian guider with fixed variance used
Perceptual similarity guiding used
Cross-attention mechanism used
Class-conditional model uses single learnable embedding layer
Synthetic masks used for image-inpainting
FID, Precision, and Recall scores estimated
FID and Inception Score computed for Text-to-Image models
FID scores computed for Layout-to-Image models
FID and Inception Score computed for Super-Resolution models
Human preference scores assessed for two distinct tasks
