Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Generative models have been studied in computer vision.
Diffusion models have been used to generate high quality images.
GANs have been found to have the ability to disentangle different attributes.
This work explores whether diffusion models have the same capability.
It was found that diffusion models can modify images towards a style without changing the semantic content.
A simple, light-weight image editing algorithm was proposed.
Experiments showed that the proposed method outperformed other diffusion-model-based image-editing algorithms.

Paper Content

Introduction

Image generation is a widely studied research problem in computer vision
Generative models such as GANs and VAEs have been proposed
Diffusion models have recently attracted attention for their ability to generate high-quality images
Disentangling different aspects of generated images is important for image editing and style transfer
GANs have an inherent disentanglement capability
Diffusion models have yet to be found to have this capability
This paper seeks to answer if diffusion models have this capability
Results show that diffusion models can disentangle a wide range of concepts and attributes
Optimal mixing weights of two descriptions can generate convincing image pairs
Proposed image editing algorithm can match or outperform more sophisticated baselines

Related works

Generative models should be able to disentangle different attributes
Disentanglement can be achieved by moving in particular directions in latent space
Multiple methods have been proposed to discover these latent directions
Disentanglement has been studied in GANs, VAEs, and flow-based models
Diffusion models have achieved state-of-the-art performance in image synthesis
Text-to-image diffusion models take text descriptions as inputs and generate images
Image editing has been studied using GANs and diffusion models
Most methods require fine-tuning diffusion models and auxiliary inputs
Proposed method performs image editing without auxiliary inputs and fine-tuning

Attribute disentanglement in stable diffusion models

Explore disentanglement properties of diffusion models
Propose approach for disentangled image modification and editing

Preliminaries on diffusion models

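DDIM is the sampler used with the text-to-image diffusion model to generate an image from a text description
The forward diffusion process adds Gaussian noise to the image at each step, and DDIM sampling reverses it by denoising step by step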
The parameters of the denoising network are fixed
Hyperparameters govern the diffusion process
The generated image is a deterministic function of initial random noise and text descriptions
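The determinism noted above can be illustrated with a minimal sketch of DDIM sampling with no added sampling noise (eta = 0). This is not the paper's code: `toy_denoiser` is a hypothetical stand-in for the frozen denoising network, images and embeddings are flat lists of floats, and `alpha_bars` is an assumed cumulative noise schedule.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a flat list of values."""
    # Estimate the clean image implied by the predicted noise.
    x0_pred = [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
               for x, e in zip(x_t, eps_pred)]
    # Move to the previous (less noisy) step without injecting new noise.
    return [math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]

def toy_denoiser(x_t, text_embedding, t):
    """Hypothetical stand-in for the frozen, text-conditioned noise predictor."""
    return [0.1 * x + 0.01 * c for x, c in zip(x_t, text_embedding)]

def ddim_sample(x_T, text_embedding, alpha_bars, denoiser=toy_denoiser):
    """Run all denoising steps; alpha_bars[0] = 1.0 corresponds to the clean image."""
    x = list(x_T)
    for t in range(len(alpha_bars) - 1, 0, -1):
        eps = denoiser(x, text_embedding, t)
        x = ddim_step(x, eps, alpha_bars[t], alpha_bars[t - 1])
    return x
```

Because no random noise is drawn inside the loop, calling `ddim_sample` twice with the same `x_T` and the same text embedding returns the same result.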

The disentanglement properties

The stable diffusion model is capable of disentangling styles from semantic content.
An example is given of two text embeddings, one style-neutral and one with an explicit style.
When the style-neutral embedding is replaced by the styled one at only part of the denoising steps, the model can maintain the identity of the person while changing the facial expression.
Experiments on more objects and styles have been performed and the observations are consistent.
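A partial replacement of this kind can be sketched as a per-step schedule of text embeddings. The sketch assumes that the switch happens at a single step and that the styled embedding is used for the later steps; the summary above does not state which steps are replaced, and `partial_replacement_schedule` and `switch_step` are hypothetical names.

```python
def partial_replacement_schedule(c0, c1, num_steps, switch_step):
    """Return the embedding used at each denoising step, noisiest step first.

    c0 is the style-neutral embedding, c1 the embedding with an explicit style.
    Steps before switch_step use c0; the remaining steps use c1 (an assumption).
    """
    if not 0 <= switch_step <= num_steps:
        raise ValueError("switch_step must lie between 0 and num_steps")
    return [c0 if step < switch_step else c1 for step in range(num_steps)]
```

For example, `partial_replacement_schedule("neutral", "styled", 5, 3)` uses the neutral embedding for the first three steps and the styled one for the last two.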

Optimizing for disentanglement

A principled and tractable optimization scheme is proposed to combine a given pair $c^{(0)}$ and $c^{(1)}$ to achieve the best disentanglement
A soft combination of the two text embeddings is used at each denoising step instead of feeding either $c^{(0)}$ or $c^{(1)}$
The optimization procedure finds an optimal $\lambda_{1:T}$ such that $X_0^{(\lambda)}$ maintains the same semantic content as $X_0^{(0)}$ but conforms to the style described in $c^{(1)}$
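The soft combination and the search for per-step weights can be sketched as follows. Several things here are assumptions made for illustration: the mixing is taken to be a convex combination of the two embeddings, `toy_loss` is a made-up quadratic objective standing in for the paper's style-matching and content-preservation terms, and plain gradient descent with finite differences replaces the Adam optimizer used in the experiments.

```python
def mix_embeddings(c0, c1, lam):
    """Assumed soft combination: lam * c1 + (1 - lam) * c0, element-wise."""
    return [lam * b + (1.0 - lam) * a for a, b in zip(c0, c1)]

def toy_loss(lambdas, beta):
    """Hypothetical objective: style term plus beta times a content term.

    The style term prefers weights near 1 (more of the styled embedding).
    The content term penalizes large weights, more strongly at early steps.
    """
    T = len(lambdas)
    style = sum((1.0 - lam) ** 2 for lam in lambdas) / T
    content = sum(((T - i) / T) * lam ** 2 for i, lam in enumerate(lambdas)) / T
    return style + beta * content

def optimize_lambdas(loss_fn, num_steps, lr=0.03, iters=200, init=0.5, h=1e-4):
    """Gradient descent on the per-step weights using central differences."""
    lambdas = [init] * num_steps
    for _ in range(iters):
        grads = []
        for i in range(num_steps):
            up = list(lambdas)
            up[i] += h
            down = list(lambdas)
            down[i] -= h
            grads.append((loss_fn(up) - loss_fn(down)) / (2.0 * h))
        # Clamp to [0, 1] so each weight stays a valid mixing coefficient.
        lambdas = [min(1.0, max(0.0, lam - lr * g))
                   for lam, g in zip(lambdas, grads)]
    return lambdas
```

With this toy objective, `optimize_lambdas(lambda lams: toy_loss(lams, beta=1.0), num_steps=5)` yields weights that grow toward the later denoising steps. In the actual method the loss would be computed on images generated by the frozen diffusion model, which this sketch does not attempt.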

Extension to image editing

Disentangled image modification algorithm developed to extend to disentangled image editing
Find optimal text embedding for disentanglement
Generate noisy images based on given image
Introduce new diffusion process to close approximation gap
Follow same procedure in Sec. 3.3 to perform image editing
Re-diffusion approach adopted to enhance quality of edited image

Experiments

Experiments to explore the disentanglement capability of the stable diffusion model
Pre-trained model is frozen and default hyperparameters are kept
Images generated are 512x512
DDIM sampler used to synthesize images with 50 backward diffusion steps
Adam optimizer used to optimize $\lambda_{1:T}$ with learning rate 0.03
β set to 0.05 for human face experiments and 0.03 for scenes and buildings
Number of re-diffusion steps when editing real images is 20
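The settings listed above can be collected in a small configuration object. The values come from the list; the class and field names are hypothetical and do not correspond to any released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters as reported in the summary (field names are assumptions)."""
    image_size: int = 512          # generated images are 512x512
    ddim_steps: int = 50           # backward diffusion steps of the DDIM sampler
    optimizer: str = "adam"        # optimizer for the per-step mixing weights
    learning_rate: float = 0.03
    beta: float = 0.05             # loss balancing weight
    rediffusion_steps: int = 20    # used when editing real images

# Human face experiments and scene/building experiments differ only in beta.
FACE_CONFIG = ExperimentConfig(beta=0.05)
SCENE_CONFIG = ExperimentConfig(beta=0.03)
```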

Exploring the disentanglement capability

Stable diffusion model can inherently disentangle a wide range of objects and attributes
Global styles like scenery, drawing, and architecture materials can be disentangled
Local attributes like facial expressions can be disentangled
Difficulties disentangling attributes that involve small objects
Learned $\lambda_{1:T}$ transfers well to unseen images

Evaluation on disentangled image editing

Evaluated proposed method on image editing task
Used Celeb-A and LSUN-church datasets
Conducted subjective evaluation on Amazon Mechanical Turk
Asked 3 questions regarding editing quality
The proposed method outperformed the baseline on 6 out of 8 attributes
Baseline had over-optimization problem
Less competitive in human-related editing
Qualitative comparison with other baselines showed high quality of edited images

Ablation study

Optimal $\lambda_{1:T}$ depends on specific values of $c^{(0)}$, $c^{(1)}$, and $X_T$
Investigating robustness against variations in $c^{(0)}$ and $c^{(1)}$
Varying text descriptions to check robustness
Varying complexity of target attribute description in $c^{(1)}$
Varying $c^{(0)}$ to check robustness

Conclusion

Studied the disentanglement property in the stable diffusion model
Found that stable diffusion inherently has the disentanglement capability
Proposed a simple and light-weight disentanglement algorithm
Optimized combination weights of two text embeddings for style matching and content preservation
Outperformed sophisticated baselines that require fine-tuning on image editing task
Reported detailed hyperparameter settings and model architectures used
Used different loss balancing weight and different initialization of combination weights for attributes on person and scenes
Used pre-trained models without changing any parameters
Provided exact text descriptions used for disentanglement and image editing
Demonstrated inherent disentanglement capability in the stable diffusion model
Provided examples of attributes that can be disentangled by the method
Detailed subjective evaluation process
Compared performance of method with DiffusionCLIP on image editing task
Collected source and edited images from state-of-the-art diffusion-model-based image editing methods
Analyzed robustness of method to different choices of text descriptions
Investigated whether successful disentanglement depends on choice of particular image used for optimization and number of images used for optimization
