Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models achieve state-of-the-art synthesis results.
- Diffusion models are powerful, but training them often takes hundreds of GPU days, and inference is expensive because of sequential evaluations.
- Applying diffusion models in the latent space of pretrained autoencoders reduces complexity while preserving detail.
- Cross-attention layers turn diffusion models into powerful and flexible generators for general conditioning inputs.
- Latent diffusion models achieve new state of the art for image inpainting and competitive performance on various tasks.
Paper Content
Introduction
- Image synthesis is among the computer vision fields with the most spectacular recent development, and also among those with the greatest computational demands
- GANs are mostly confined to data with comparably limited variability
- Diffusion models have achieved impressive results in image synthesis and beyond
- DMs belong to the class of likelihood-based models, which can model complex distributions of natural images
- Training and evaluating DMs requires repeated function evaluations and gradient computations in the high-dimensional space of RGB images
- Our approach reduces the computational complexity for both training and sampling
- We separate training into two distinct phases: an autoencoder and a generative model
- Our method scales more gracefully to higher dimensional data and can be applied to high-resolution synthesis of megapixel images
- We achieve competitive performance on multiple tasks while significantly lowering computational costs
- We design a general-purpose conditioning mechanism based on cross-attention
- We release pretrained latent diffusion and autoencoding models
Related work
- Generative Adversarial Networks (GANs) allow for efficient sampling of high resolution images with good perceptual quality.
- Variational autoencoders (VAEs) and flow-based models enable efficient synthesis of high resolution images, but their sample quality is not on par with that of GANs.
- Autoregressive models (ARMs) achieve strong performance in density estimation, but are limited to low resolution images.
- Maximum-likelihood training spends a disproportionate amount of capacity on modeling high-frequency details, resulting in long training times.
- Two-stage approaches use ARMs to model a compressed latent image space instead of raw pixels.
- Diffusion Probabilistic Models (DMs) have achieved state-of-the-art results in density estimation and sample quality.
- Approaches to combine the strengths of different methods into more efficient and performant models exist.
- Our proposed LDMs work on a compressed latent space of lower dimensionality, making training computationally cheaper and speeding up inference.
Method
- Training diffusion models for high-resolution image synthesis remains computationally demanding.
- Proposed approach introduces an explicit separation of compressive and generative learning phases.
- Autoencoding model learns a space that is perceptually equivalent to image space but with reduced complexity.
- Sampling is performed on a low-dimensional space, reducing computational complexity.
- Exploits the inductive bias of DMs for spatially structured data, which alleviates the need for aggressive, quality-reducing compression levels.
- General-purpose compression models can be used to train multiple generative models and for other downstream applications.
Perceptual image compression
- Perceptual compression model based on previous work
- Autoencoder trained by combination of perceptual loss and patch-based adversarial objective
- Local realism enforced to avoid blurriness
- Encoder downsamples image by factor f
- Two regularization variants: KL-penalty and vector quantization
- Mild compression rates achieve good reconstructions
- Preserves details of image better than previous works
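To make the role of the downsampling factor f concrete, the sketch below computes the latent resolution (H/f by W/f) for several values of f and the resulting reduction in the number of values the generative model has to handle. The helper names `latent_shape` and `compression_ratio` and the latent channel count of 4 are illustrative assumptions, not values taken from the summary above.

```python
def latent_shape(height, width, f, channels=4):
    """Return (h, w, c) of the latent for an encoder that downsamples by f.

    `channels` is a hypothetical latent channel count chosen for illustration.
    """
    if height % f or width % f:
        raise ValueError("image size must be divisible by the downsampling factor")
    return height // f, width // f, channels


def compression_ratio(height, width, f, channels=4, image_channels=3):
    """How many times fewer values the latent holds than the RGB image."""
    h, w, c = latent_shape(height, width, f, channels)
    return (height * width * image_channels) / (h * w * c)


# Print latent shape and compression ratio for a 256 x 256 RGB image.
for f in (1, 2, 4, 8, 16, 32):
    print(f, latent_shape(256, 256, f), compression_ratio(256, 256, f))
```

Because the spatial size shrinks by f in each dimension, the number of latent positions falls quadratically with f, which is where the savings in the diffusion stage come from.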
Latent diffusion models
- Diffusion Models are probabilistic models used to learn a data distribution.
- Reweighted variant of variational lower bound is used for image synthesis.
- Model consists of a sequence of denoising autoencoders.
- Perceptual compression models are used to access a low-dimensional latent space.
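The training signal described above can be sketched as a noise-prediction loss evaluated on latents rather than pixels: noise a latent z_0 to step t, ask the model to predict the noise, and take the squared error. This is a minimal sketch under stated assumptions: the linear beta schedule, the number of steps T, and the `zero_predictor` stand-in for the denoising network are all hypothetical choices for illustration, not details from the summary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: a DDPM-style linear beta schedule with T steps.
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor


def add_noise(z0, t, eps):
    """Forward process: sample z_t from z_0 in closed form."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps


def ldm_loss(eps_model, z0, t):
    """Mean squared error between the true noise and the model's prediction."""
    eps = rng.standard_normal(z0.shape)
    z_t = add_noise(z0, t, eps)
    return np.mean((eps - eps_model(z_t, t)) ** 2)


def zero_predictor(z_t, t):
    """Hypothetical stand-in for the denoising network: always predicts zero."""
    return np.zeros_like(z_t)


z0 = rng.standard_normal((4, 32, 32))  # stand-in for an encoded image E(x)
print(ldm_loss(zero_predictor, z0, t=500))
```

In a real setup the latent `z0` would come from the frozen encoder and `eps_model` would be a time-conditional network; only the space in which the loss is computed differs from a pixel-based diffusion model.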
Conditioning mechanisms
- Diffusion models can model conditional distributions of the form p(z|y).
- Diffusion models can be used for image-to-image translation tasks.
- Combining the generative power of DMs with other types of conditionings is an under-explored area of research.
- Cross-attention mechanism is used to make DMs more flexible.
- Samples from LDMs trained on various datasets.
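The cross-attention idea can be illustrated with a single-head sketch in NumPy: queries are computed from the denoiser's intermediate features, while keys and values are computed from the encoded conditioning input y. The function name, the projection matrices, and all dimensions below are illustrative assumptions; an actual model would use learned multi-head layers inside the denoising network.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """Single-head cross-attention.

    latent_tokens: (N, d_latent) flattened intermediate features of the denoiser
    cond_tokens:   (M, d_cond)   encoded conditioning input y (e.g. text tokens)
    """
    q = latent_tokens @ w_q  # queries come from the image latents
    k = cond_tokens @ w_k    # keys come from the conditioning
    v = cond_tokens @ w_v    # values come from the conditioning
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))  # (N, M), each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
N, M, d_latent, d_cond, d = 16, 5, 8, 12, 8  # illustrative sizes
out, weights = cross_attention(
    rng.standard_normal((N, d_latent)),
    rng.standard_normal((M, d_cond)),
    rng.standard_normal((d_latent, d)),
    rng.standard_normal((d_cond, d)),
    rng.standard_normal((d_cond, d)),
)
print(out.shape, weights.shape)
```

Since only the keys and values depend on y, the same mechanism can accept text tokens, class embeddings, or layout encodings, provided a suitable encoder maps them to a token sequence.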
Experiments
- LDMs provide a means for flexible and computationally tractable diffusion-based image synthesis.
- LDMs are compared to pixel-based diffusion models in terms of training and inference.
- LDMs trained in VQ-regularized latent spaces sometimes achieve better sample quality.
- Reconstruction capabilities of VQ-regularized first stage models slightly fall behind those of their continuous counterparts.
On perceptual compression tradeoffs
- LDMs with different downsampling factors were tested
- Fixed computational resources and same number of parameters for all experiments
- Small downsampling factors result in slow training progress
- Too large values of f cause stagnating fidelity
- LDM-4 and -8 offer best conditions for high-quality synthesis results
Image generation with latent diffusion
- Trained unconditional models on CelebA-HQ, FFHQ, LSUN-Churches and -Bedrooms
- Evaluated sample quality and coverage of data manifold using FID and Precision-and-Recall
- Reported state-of-the-art FID of 5.11 on CelebA-HQ
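- Trained the diffusion models in a fixed latent space, avoiding the need to weigh reconstruction quality against learning the prior over the latent space
- Trained a 1.45B-parameter KL-regularized LDM conditioned on language prompts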
- Improved upon powerful AR and GAN-based methods for text-to-image generation
- Used classifier-free guidance to boost sample quality
- Trained models to synthesize images based on semantic layouts on OpenImages
- Used images of landscapes paired with semantic maps to train semantic synthesis models
- Applied super-resolution and inpainting models to generate large images between 512² and 1024² pixels
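Classifier-free guidance, mentioned in the list above, combines a conditional and an unconditional noise prediction at each sampling step. The sketch below shows one common parameterization; the function name `guided_noise` and the convention that a scale of 1 recovers the plain conditional prediction are assumptions made for illustration, and other formulations shift the scale by one.

```python
import numpy as np


def guided_noise(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 pushes further in the conditional direction.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)


rng = np.random.default_rng(0)
eps_c = rng.standard_normal((4, 32, 32))  # stand-in for the conditional prediction
eps_u = rng.standard_normal((4, 32, 32))  # stand-in for the unconditional prediction
print(guided_noise(eps_c, eps_u, scale=3.0).shape)
```

Larger scales typically trade sample diversity for closer adherence to the conditioning, so the value is usually tuned per task.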
Super-resolution with latent diffusion
- LDMs can be trained for super-resolution by conditioning directly on low-resolution images
- The experiment degraded ImageNet images by bicubic interpolation with 4× downsampling
- LDM-SR outperforms SR3 in FID, but SR3 has better IS
- Simple image regression model has highest PSNR and SSIM scores
- User study confirms good performance of LDM-SR
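One way to see why a simple regression model can score highest on PSNR is that PSNR is a monotone function of the pixel-wise mean squared error, so a model trained to minimize that error is optimizing the metric directly, even when its outputs look blurrier. The sketch below computes PSNR for images scaled to [0, 1]; the function name and the `max_value` default are illustrative assumptions.

```python
import numpy as np


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_value]."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


reference = np.zeros((8, 8))
estimate = reference + 0.1  # uniform error of 0.1 gives an MSE of 0.01
print(psnr(reference, estimate))
```

Perceptual metrics such as FID and human preference studies complement PSNR because they are less tied to exact pixel alignment.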
Inpainting with latent diffusion
- Inpainting is the task of filling masked regions of an image with new content
- We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches
- We compare the inpainting efficiency of two models with different regularizations
- Our model with attention improves the overall image quality and is favored by human subjects in a user study
Limitations & societal impact
Conclusion
- Latent diffusion models improve training and sampling efficiency of denoising diffusion models
- Cross-attention conditioning mechanism used in experiments
- Experiments show favorable results compared to state-of-the-art methods
- Updated results on text-to-image synthesis
- Updated results on class-conditional synthesis on ImageNet
- User study conducted
- Added Fig. 5 to main paper, moved Fig. 18 to appendix, added Fig. 13 to appendix
- Diffusion models can be conditioned at test-time
- Post-hoc image-guiding used
- Gaussian guider with fixed variance used
- Perceptual similarity guiding used
- Cross-attention mechanism used
- Class-conditional model uses single learnable embedding layer
- Synthetic masks used for image-inpainting
- FID, Precision, and Recall scores estimated
- FID and Inception Score computed for Text-to-Image models
- FID scores computed for Layout-to-Image models
- FID and Inception Score computed for Super-Resolution models
- Human preference scores assessed for two distinct tasks