Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Generative models have been studied in computer vision.
- Diffusion models have been used to generate high quality images.
- GANs have been found to have the ability to disentangle different attributes.
- This work explores whether diffusion models have the same capability.
- It was found that diffusion models can modify images towards a style without changing the semantic content.
- A simple, light-weight image editing algorithm was proposed.
- Experiments showed that the proposed method outperformed other diffusion-model-based image-editing algorithms.
Paper Content
Introduction
- Image generation is a widely studied research problem in computer vision
- Generative models such as GANs and VAEs have been proposed
- Diffusion models have recently attracted attention for their ability to generate high-quality images
- Disentangling different aspects of generated images is important for image editing and style transfer
- GANs have an inherent disentanglement capability
- Diffusion models have yet to be found to have this capability
- This paper seeks to answer if diffusion models have this capability
- Results show that diffusion models can disentangle a wide range of concepts and attributes
- Optimal mixing weights of two descriptions can generate convincing image pairs
- Proposed image editing algorithm can match or outperform more sophisticated baselines
Related works
- Generative models should be able to disentangle different attributes
- Disentanglement can be achieved by moving in particular directions in latent space
- Multiple methods have been proposed to discover these latent directions
- Disentanglement has been studied in GANs, VAEs, and flow-based models
- Diffusion models have achieved state-of-the-art performance in image synthesis
- Text-to-image diffusion models take text descriptions as inputs and generate images
- Image editing has been studied using GANs and diffusion models
- Most methods require fine-tuning diffusion models and auxiliary inputs
- Proposed method performs image editing without auxiliary inputs and fine-tuning
Attribute disentanglement in stable diffusion models
- Explore disentanglement properties of diffusion models
- Propose approach for disentangled image modification and editing
Preliminaries on diffusion models
- DDIM is a model used to generate an image from a text description
- DDIM adds Gaussian noise to the image at each step of the process
- The parameters of the denoising network are fixed
- Hyperparameters govern the diffusion process
- The generated image is a deterministic function of initial random noise and text descriptions
The disentanglement properties
- The stable diffusion model is capable of disentangling styles from semantic content.
- An example is given of two text embeddings, one style-neutral and one with an explicit style.
- When the text embeddings are partially replaced, the model can maintain the identity of the person while changing the facial expression.
- Experiments on more objects and styles have been performed and the observations are consistent.
Optimizing for disentanglement
- Proposed a principled and tractable optimization scheme to combine a given pair of c (0) and c (1) to achieve the best disentanglement
- Soft combination of the text embeddings used instead of feeding either c (0) or c (1) at each denoising step
- Optimization procedure to find an optimal λ1:T such that X (λ) 0 maintains the same semantic content as X (0) 0 but conforms to the style described in c (1)
Extension to image editing
- Disentangled image modification algorithm developed to extend to disentangled image editing
- Find optimal text embedding for disentanglement
- Generate noisy images based on given image
- Introduce new diffusion process to close approximation gap
- Follow same procedure in Sec. 3.3 to perform image editing
- Re-diffusion approach adopted to enhance quality of edited image
Experiments
- Experiments to explore the disentanglement capability of the stable diffusion model
- Pre-trained model is frozen and default hyperparameters are kept
- Images generated are 512x512
- DDIM sampler used to synthesize images with 50 backward diffusion steps
- Adam optimizer used to optimize λ1:T with learning rate 0.03
- β set to 0.05 for human face experiments and 0.03 for scenes and buildings
- Number of re-diffusion steps when editing real images is 20
Exploring the disentanglement capability
- Stable diffusion model can inherently disentangle a wide range of objects and attributes
- Global styles like scenery, drawing, and architecture materials can be disentangled
- Local attributes like facial expressions can be disentangled
- Difficulties disentangling attributes that involve small objects
- Learned λ1:T has great transferability to unseen images
Evaluation on disentangled image editing
- Evaluated proposed method on image editing task
- Used Celeb-A and LSUN-church datasets
- Conducted subjective evaluation on Amazon Mechanical Turk
- Asked 3 questions regarding editing quality
- 6 out of 8 attributes outperformed baseline
- Baseline had over-optimization problem
- Less competitive in human-related editing
- Qualitative comparison with other baselines showed high quality of edited images
Ablation study
- Optimal λ1:T depends on specific values of c (0) , c (1) , and X T
- Investigating robustness against variations in c (0) and c (1)
- Varying text descriptions to check robustness
- Varying complexity of target attribute description in c (1)
- Varying c (0) to check robustness
Conclusion
- Studied the disentanglement property in the stable diffusion model
- Found that stable diffusion inherently has the disentanglement capability
- Proposed a simple and light-weight disentanglement algorithm
- Optimized combination weights of two text embeddings for style matching and content preservation
- Outperformed sophisticated baselines that require fine-tuning on image editing task
- Reported detailed hyperparameter settings and model architectures used
- Used different loss balancing weight and different initialization of combination weights for attributes on person and scenes
- Used pre-trained models without changing any parameters
- Provided exact text descriptions used for disentanglement and image editing
- Demonstrated inherent disentanglement capability in the stable diffusion model
- Provided examples of attributes that can be disentangled by the method
- Detailed subjective evaluation process
- Compared performance of method with DIF-FUSIONCLIP on image editing task
- Collected source and edited images from state-of-the-art diffusion-model-based image editing methods
- Analyzed robustness of method to different choices of text descriptions
- Investigated whether successful disentanglement depends on choice of particular image used for optimization and number of images used for optimization