Abstract

  • Generative models have been widely studied in computer vision.
  • Diffusion models have been used to generate high quality images.
  • GANs have been found to have the ability to disentangle different attributes.
  • This work explores whether diffusion models have the same capability.
  • It was found that diffusion models can modify images towards a style without changing the semantic content.
  • A simple, light-weight image editing algorithm was proposed.
  • Experiments showed that the proposed method outperformed other diffusion-model-based image-editing algorithms.

Paper Content

Introduction

  • Image generation is a widely studied research problem in computer vision
  • Generative models such as GANs and VAEs have been proposed
  • Diffusion models have recently attracted attention for their ability to generate high-quality images
  • Disentangling different aspects of generated images is important for image editing and style transfer
  • GANs have an inherent disentanglement capability
  • Whether diffusion models have this capability had not yet been established
  • This paper seeks to answer if diffusion models have this capability
  • Results show that diffusion models can disentangle a wide range of concepts and attributes
  • Optimal mixing weights of two descriptions can generate convincing image pairs
  • Proposed image editing algorithm can match or outperform more sophisticated baselines
  • Generative models should be able to disentangle different attributes
  • Disentanglement can be achieved by moving in particular directions in latent space
  • Multiple methods have been proposed to discover these latent directions
  • Disentanglement has been studied in GANs, VAEs, and flow-based models
  • Diffusion models have achieved state-of-the-art performance in image synthesis
  • Text-to-image diffusion models take text descriptions as inputs and generate images
  • Image editing has been studied using GANs and diffusion models
  • Most methods require fine-tuning diffusion models and auxiliary inputs
  • Proposed method performs image editing without auxiliary inputs and fine-tuning

Attribute disentanglement in stable diffusion models

  • Explore disentanglement properties of diffusion models
  • Propose approach for disentangled image modification and editing

Preliminaries on diffusion models

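  • DDIM is the sampling procedure used with the diffusion model to generate an image from a text description
  • The forward diffusion process adds Gaussian noise to the image at each step; the backward process removes it step by step with a denoising network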
  • The parameters of the denoising network are fixed
  • Hyperparameters govern the diffusion process
  • The generated image is a deterministic function of initial random noise and text descriptions
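
The determinism noted in the last bullet can be illustrated with a minimal sketch of the standard deterministic DDIM update (no fresh noise is injected during the backward pass). Everything named here is an assumption for illustration: `toy_denoiser` is a hypothetical stand-in for the frozen denoising network, and `alpha_bars` is a made-up noise schedule, not the one used in the paper.

```python
import math

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update on a flat list of floats."""
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_ab_t = math.sqrt(1.0 - alpha_bar_t)
    # Predict the clean sample from the noisy sample and the predicted noise.
    x0_pred = [(x - sqrt_one_minus_ab_t * e) / sqrt_ab_t for x, e in zip(x_t, eps)]
    # Move the prediction to the previous, less noisy level without new noise.
    c0 = math.sqrt(alpha_bar_prev)
    c1 = math.sqrt(1.0 - alpha_bar_prev)
    return [c0 * x0 + c1 * e for x0, e in zip(x0_pred, eps)]

def toy_denoiser(x_t, t, text_embedding):
    """Hypothetical stand-in for the frozen, text-conditioned denoising network."""
    return [0.1 * x + 0.01 * c for x, c in zip(x_t, text_embedding)]

def sample(x_T, text_embedding, alpha_bars):
    """Run the backward process from step T down to step 1.

    alpha_bars[t] is the cumulative signal level at step t, with
    alpha_bars[0] == 1.0 corresponding to a clean image.
    """
    x = list(x_T)
    for t in range(len(alpha_bars) - 1, 0, -1):
        eps = toy_denoiser(x, t, text_embedding)
        x = ddim_step(x, eps, alpha_bars[t], alpha_bars[t - 1])
    return x

# Same initial noise and same text embedding give the same output every time.
schedule = [1.0, 0.9, 0.7, 0.4, 0.1]
image_a = sample([0.3, -1.2], [1.0, 0.0], schedule)
image_b = sample([0.3, -1.2], [1.0, 0.0], schedule)
```

Because the only inputs are the initial noise and the text embedding, any difference between two generated images can be attributed to a change in the text conditioning, which is what makes the disentanglement analysis possible.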

The disentanglement properties

  • The stable diffusion model is capable of disentangling styles from semantic content.
  • An example is given of two text embeddings, one style-neutral and one with an explicit style.
  • When the style-neutral embedding is replaced by the styled one for only part of the denoising steps, the model keeps the identity of the person while changing the facial expression.
  • Experiments on more objects and styles have been performed and the observations are consistent.
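
The partial replacement described above can be sketched as a per-step choice of which text embedding is fed to the denoiser. The function name, the parameter `num_neutral_steps`, and the choice to switch embeddings at a single point are assumptions for illustration; the summary only says that the replacement is partial.

```python
def embedding_schedule(c_neutral, c_styled, num_steps, num_neutral_steps):
    """Return the text embedding fed to the denoiser at each denoising step.

    The first `num_neutral_steps` steps (in execution order) use the
    style-neutral embedding; the remaining steps use the styled one.
    The switch point is a hypothetical parameter.
    """
    if not 0 <= num_neutral_steps <= num_steps:
        raise ValueError("num_neutral_steps must lie in [0, num_steps]")
    return [
        c_neutral if i < num_neutral_steps else c_styled
        for i in range(num_steps)
    ]

# Toy 2-dimensional embeddings standing in for real text-encoder outputs.
neutral = [0.0, 0.0]
styled = [1.0, 1.0]
per_step = embedding_schedule(neutral, styled, num_steps=50, num_neutral_steps=35)
```

With the initial noise held fixed, comparing the image generated from `per_step` against the image generated from the neutral embedding alone isolates the effect of the styled description.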

Optimizing for disentanglement

  • Proposed a principled and tractable optimization scheme to combine a given pair of $c^{(0)}$ and $c^{(1)}$ to achieve the best disentanglement
  • Soft combination of the text embeddings used instead of feeding either $c^{(0)}$ or $c^{(1)}$ at each denoising step
  • Optimization procedure to find an optimal $\lambda_{1:T}$ such that $X_0^{(\lambda)}$ maintains the same semantic content as $X_0^{(0)}$ but conforms to the style described in $c^{(1)}$
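
One way to write this scheme down, under stated assumptions: the soft combination is taken to be a convex mix of the two embeddings at each step, and the objective is taken to be a style term plus a content term weighted by $\beta$. The loss names $\mathcal{L}_{\text{style}}$ and $\mathcal{L}_{\text{content}}$ are placeholders introduced here, not the paper's notation, and the exact form of each term is not given in this summary.

```latex
% Per-step soft combination of the two text embeddings (assumed convex form)
c_t^{(\lambda)} = \lambda_t \, c^{(1)} + (1 - \lambda_t) \, c^{(0)},
\qquad t = 1, \dots, T

% Assumed shape of the objective: match the style of c^{(1)},
% preserve the content of the neutral image X_0^{(0)}
\min_{\lambda_{1:T}} \;
\mathcal{L}_{\text{style}}\!\left(X_0^{(\lambda)},\, c^{(1)}\right)
+ \beta \, \mathcal{L}_{\text{content}}\!\left(X_0^{(\lambda)},\, X_0^{(0)}\right)
```

Setting every $\lambda_t$ to 0 or 1 recovers the hard choice between the two embeddings, so the soft combination is a relaxation that makes the weights optimizable by gradient descent.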

Extension to image editing

  • Disentangled image modification algorithm developed to extend to disentangled image editing
  • Find optimal text embedding for disentanglement
  • Generate noisy images based on given image
  • Introduce new diffusion process to close approximation gap
  • Follow same procedure in Sec. 3.3 to perform image editing
  • Re-diffusion approach adopted to enhance quality of edited image

Experiments

  • Experiments to explore the disentanglement capability of the stable diffusion model
  • Pre-trained model is frozen and default hyperparameters are kept
  • Images generated are 512x512
  • DDIM sampler used to synthesize images with 50 backward diffusion steps
  • Adam optimizer used to optimize $\lambda_{1:T}$ with learning rate 0.03
  • Loss-balancing weight β set to 0.05 for human face experiments and 0.03 for scenes and buildings
  • Number of re-diffusion steps when editing real images is 20
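
The settings above can be collected into a small configuration object for reproducibility. This is a sketch only: the class, field names, and domain keys are hypothetical, while the values are the ones listed in the bullets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    """Reported settings; all field names are hypothetical."""
    image_size: int = 512          # generated images are 512x512
    ddim_steps: int = 50           # backward diffusion steps
    optimizer: str = "adam"        # used to optimize the combination weights
    learning_rate: float = 0.03
    beta: float = 0.05             # loss-balancing weight (face default)
    rediffusion_steps: int = 20    # used when editing real images

# Per-domain values of beta as listed above; the keys are assumed labels.
_BETA_BY_DOMAIN = {"face": 0.05, "scene": 0.03, "building": 0.03}

def config_for(domain: str) -> ExperimentConfig:
    """Return the configuration for a given image domain."""
    if domain not in _BETA_BY_DOMAIN:
        raise ValueError(f"unknown domain: {domain!r}")
    return ExperimentConfig(beta=_BETA_BY_DOMAIN[domain])

face_cfg = config_for("face")
scene_cfg = config_for("scene")
```

Keeping the configuration frozen mirrors the setup described above, where the pre-trained model and its default hyperparameters are left unchanged and only β varies by domain.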

Exploring the disentanglement capability

  • Stable diffusion model can inherently disentangle a wide range of objects and attributes
  • Global styles like scenery, drawing, and architecture materials can be disentangled
  • Local attributes like facial expressions can be disentangled
  • Difficulties disentangling attributes that involve small objects
  • Learned $\lambda_{1:T}$ transfers well to unseen images

Evaluation on disentangled image editing

  • Evaluated proposed method on image editing task
  • Used Celeb-A and LSUN-church datasets
  • Conducted subjective evaluation on Amazon Mechanical Turk
  • Asked 3 questions regarding editing quality
  • The proposed method outperformed the baseline on 6 out of 8 attributes
  • The baseline suffered from an over-optimization problem
  • The proposed method was less competitive on human-related edits
  • Qualitative comparison with other baselines showed high quality of edited images

Ablation study

  • Optimal $\lambda_{1:T}$ depends on specific values of $c^{(0)}$, $c^{(1)}$, and $X_T$
  • Investigating robustness against variations in $c^{(0)}$ and $c^{(1)}$
  • Varying text descriptions to check robustness
  • Varying complexity of target attribute description in $c^{(1)}$
  • Varying $c^{(0)}$ to check robustness

Conclusion

  • Studied the disentanglement property in the stable diffusion model
  • Found that stable diffusion inherently has the disentanglement capability
  • Proposed a simple and light-weight disentanglement algorithm
  • Optimized combination weights of two text embeddings for style matching and content preservation
  • Outperformed sophisticated baselines that require fine-tuning on image editing task
  • Reported detailed hyperparameter settings and model architectures used
  • Used different loss balancing weight and different initialization of combination weights for attributes on person and scenes
  • Used pre-trained models without changing any parameters
  • Provided exact text descriptions used for disentanglement and image editing
  • Demonstrated inherent disentanglement capability in the stable diffusion model
  • Provided examples of attributes that can be disentangled by the method
  • Detailed subjective evaluation process
  • Compared performance of method with DiffusionCLIP on image editing task
  • Collected source and edited images from state-of-the-art diffusion-model-based image editing methods
  • Analyzed robustness of method to different choices of text descriptions
  • Investigated whether successful disentanglement depends on choice of particular image used for optimization and number of images used for optimization