Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed algorithm enables diffusion models to be controlled by arbitrary guidance modalities without retraining.
- Algorithm successfully generates quality images with guidance functions including segmentation, face recognition, object detection, and classifier signals.
Paper Content
Introduction
- Diffusion models are powerful tools for creating digital art and graphics.
- Most models are controlled through conditioning.
- Guidance is a more flexible approach to controlling model outputs.
- Guidance functions can be used without re-training or modification.
- We propose an algorithm that enables universal guidance for diffusion models.
Background
- Review of core framework behind diffusion models
- Definition of problem setting of controlled image generation
- Discussion of previous related works
Diffusion models
- Diffusion models are generative models used for image, audio and text generation.
- Diffusion models are a combination of a forward process and a reverse process.
- The forward process adds noise to a clean data point, while the reverse process attempts to denoise a noisy input.
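The forward process and the clean-image estimate it implies can be sketched in a few lines. This is a minimal illustration, assuming the common parameterization z_t = sqrt(alpha_t) * z_0 + sqrt(1 - alpha_t) * eps with a cumulative noise-schedule value alpha_t in (0, 1). The function names, the three-pixel "image", and the value of alpha_t are made up for the example, and plain Python lists stand in for image tensors.

```python
import math
import random

def forward_noise(z0, alpha_t, eps):
    """Forward process: mix a clean point z0 with Gaussian noise eps."""
    a, b = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [a * z + b * e for z, e in zip(z0, eps)]

def predict_clean(zt, alpha_t, eps_hat):
    """Invert the forward mix, given an estimate eps_hat of the noise."""
    a, b = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [(z - b * e) / a for z, e in zip(zt, eps_hat)]

rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]                    # toy "image" with three pixels
eps = [rng.gauss(0.0, 1.0) for _ in z0]  # noise drawn by the forward process
alpha_t = 0.3                            # assumed schedule value, not from the paper

zt = forward_noise(z0, alpha_t, eps)
# Recovery is exact here only because the true noise is passed in;
# a trained denoiser would supply an imperfect estimate instead.
z0_hat = predict_clean(zt, alpha_t, eps)
```

In a real model the reverse process replaces the true noise with a network prediction, so the recovered clean image is only approximate.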
Controlled image generation
- Focus on controlled image generation with various constraints
- Consider a differentiable guidance function f
- A loss function ℓ measures the closeness of two vectors c and c′
- Given a prompt c, the goal is to generate an image z for which ℓ(c, f(z)) is small
- Two categories of prior work: conditional image generation and guided image generation
- Conditional image generation requires training new diffusion models
- Guided image generation uses frozen pre-trained diffusion model
- Prior work studied guided image generation with a variety of restrictions and external guidance functions
- This work studies universal guidance algorithms for guided image generation with any off-the-shelf guidance functions
Universal guidance
- Proposed guidance algorithm augments image sampling method of diffusion model
- Algorithm motivated by the observation that the clean image reconstructed from a noisy sample, while imperfect, is still a suitable input for a generic guidance function
- Forward universal guidance extends classifier guidance to exploit this observation
- Backward universal guidance pushes the generated image to satisfy the constraint defined by the guidance function
- Self-recurrence trick to improve fidelity of generated images
Forward universal guidance
- Classifier guidance is a sampling method that takes a class prompt c and uses a classifier f_cl, which outputs classification probabilities, as the guidance function.
- Universal guidance extends classifier guidance to accept any general guidance function and loss function.
- Universal guidance computes the guidance on the predicted clean image ẑ_0 rather than on the noisy sample.
- A related approach is studied in (Chung et al., 2022a).
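A toy sketch of the forward-guidance update may help. It assumes the update adds s(t) times the gradient of ℓ(c, f(ẑ_0)) with respect to z_t to the predicted noise. Everything else is a stand-in chosen for illustration: the denoiser is the closed-form optimum for standard-normal data, f is the identity, ℓ is squared error, and central finite differences replace automatic differentiation.

```python
import math

def predict_clean(zt, alpha_t, eps):
    a, b = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [(z - b * e) / a for z, e in zip(zt, eps)]

def toy_denoiser(zt, alpha_t):
    # Hypothetical stand-in for the noise predictor (optimal for N(0, I) data).
    return [math.sqrt(1.0 - alpha_t) * z for z in zt]

def guidance_loss(z0_hat, target):
    # l(c, f(z)) with f = identity and squared-error loss (both assumptions).
    return sum((z - c) ** 2 for z, c in zip(z0_hat, target))

def loss_at(zt, alpha_t, target):
    # Loss evaluated on the predicted clean image, as a function of z_t.
    eps = toy_denoiser(zt, alpha_t)
    return guidance_loss(predict_clean(zt, alpha_t, eps), target)

def numerical_grad(fn, x, h=1e-5):
    # Central finite differences; a real pipeline would use autodiff.
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((fn(up) - fn(down)) / (2.0 * h))
    return grad

def forward_guided_eps(zt, alpha_t, target, s_t):
    # Guided noise = predicted noise + s(t) * gradient of the loss w.r.t. z_t.
    eps = toy_denoiser(zt, alpha_t)
    grad = numerical_grad(lambda z: loss_at(z, alpha_t, target), zt)
    return [e + s_t * g for e, g in zip(eps, grad)]

zt, target, alpha_t = [1.0, -0.5], [2.0, 2.0], 0.5
eps_plain = toy_denoiser(zt, alpha_t)
eps_guided = forward_guided_eps(zt, alpha_t, target, s_t=0.1)
z0_plain = predict_clean(zt, alpha_t, eps_plain)
z0_guided = predict_clean(zt, alpha_t, eps_guided)
```

In this toy setting the clean image implied by the guided noise lies closer to the target than the unguided one, which is the effect the guidance term is meant to have.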
Backward universal guidance
- Forward guidance sometimes fails to match the given prompt.
- Backward guidance is proposed to supplement forward guidance and enforce the constraint.
- Backward guidance produces an optimized direction for the generated image to match the given prompt.
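Backward guidance can be sketched as an optimization in clean-image space followed by a conversion back to a noise correction. The example assumes the parameterization z_t = sqrt(alpha_t) * z_0 + sqrt(1 - alpha_t) * eps, under which shifting the predicted clean image by delta corresponds to subtracting sqrt(alpha_t / (1 - alpha_t)) * delta from the predicted noise. The identity guidance function, squared-error loss, step count, and learning rate are illustrative assumptions.

```python
import math

def predict_clean(zt, alpha_t, eps):
    a, b = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [(z - b * e) / a for z, e in zip(zt, eps)]

def loss_grad(z, target):
    # Gradient of the squared-error loss sum((z - c)^2), with f = identity.
    return [2.0 * (zi - ci) for zi, ci in zip(z, target)]

def backward_delta(z0_hat, target, steps=50, lr=0.1):
    """Approximate the delta minimizing l(c, f(z0_hat + delta)) by gradient descent."""
    delta = [0.0] * len(z0_hat)
    for _ in range(steps):
        shifted = [z + d for z, d in zip(z0_hat, delta)]
        grad = loss_grad(shifted, target)
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta

def backward_guided_eps(eps, delta, alpha_t):
    # Translate the clean-space shift into a correction of the predicted noise.
    scale = math.sqrt(alpha_t / (1.0 - alpha_t))
    return [e - scale * d for e, d in zip(eps, delta)]

zt, target, alpha_t = [1.0, -0.5], [2.0, 2.0], 0.5
eps = [0.2, -0.1]                       # pretend output of a noise predictor
z0_hat = predict_clean(zt, alpha_t, eps)
delta = backward_delta(z0_hat, target)
eps_backward = backward_guided_eps(eps, delta, alpha_t)
z0_implied = predict_clean(zt, alpha_t, eps_backward)
```

The clean image implied by the corrected noise equals ẑ_0 + delta, so the constraint is enforced directly on the clean estimate without differentiating through the denoiser.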
Per-step self-recurrence
- Applying universal guidance to standard generation pipelines often produces images with artifacts and strange behaviors.
- Attempts to prioritize realism by decreasing the guidance strength s(t) were ineffective.
- Self-recurrence is used to explore different regions of the data manifold and improve the harmony of generated images.
- Algorithm 1 summarizes the universal guidance algorithm composed of forward universal guidance, backward universal guidance and per-step self-recurrence.
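The control flow of per-step self-recurrence can be sketched as follows. This is not the paper's Algorithm 1 verbatim: it assumes a deterministic DDIM-style sampling step, re-noising from level t-1 back to level t with z_t = sqrt(alpha_t / alpha_{t-1}) * z_{t-1} + sqrt(1 - alpha_t / alpha_{t-1}) * eps', and a hypothetical `guide` callable that returns the (guided) noise prediction. The schedule values are arbitrary.

```python
import math
import random

def toy_denoiser(zt, alpha_t):
    # Hypothetical unguided noise predictor (optimal for N(0, I) data).
    return [math.sqrt(1.0 - alpha_t) * z for z in zt]

def ddim_step(zt, eps, alpha_t, alpha_prev):
    """Deterministic DDIM-style move from noise level alpha_t to alpha_prev."""
    a, b = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    z0_hat = [(z - b * e) / a for z, e in zip(zt, eps)]
    return [math.sqrt(alpha_prev) * z0 + math.sqrt(1.0 - alpha_prev) * e
            for z0, e in zip(z0_hat, eps)]

def renoise(z_prev, alpha_t, alpha_prev, rng):
    """Inject fresh noise to return from level t-1 to level t."""
    ratio = alpha_t / alpha_prev
    return [math.sqrt(ratio) * z + math.sqrt(1.0 - ratio) * rng.gauss(0.0, 1.0)
            for z in z_prev]

def sample(alphas, k, guide, dim=2, seed=0):
    """Run the sampler with k recurrence iterations per step (k >= 1)."""
    rng = random.Random(seed)
    zt = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    calls = 0
    for t in range(len(alphas) - 1, 0, -1):
        for r in range(k):
            eps = guide(zt, alphas[t])   # forward/backward guidance would go here
            calls += 1
            z_prev = ddim_step(zt, eps, alphas[t], alphas[t - 1])
            if r < k - 1:                # the last iteration keeps z_prev as is
                zt = renoise(z_prev, alphas[t], alphas[t - 1], rng)
        zt = z_prev
    return zt, calls

alphas = [0.999, 0.8, 0.5, 0.2, 0.05]    # cumulative schedule, index 0 is cleanest
sample_out, calls = sample(alphas, k=3, guide=toy_denoiser)
```

Each sampling step calls the noise predictor k times, which is why the running time grows linearly with the number of recurrence iterations.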
Experiments
- Tested proposed universal guidance algorithm against a variety of guidance functions
- Experimented with Stable Diffusion and ImageNet diffusion model
- Results demonstrate universal algorithm is comparable to specialized conditional model in generating quality images that satisfy text constraints
Results for Stable Diffusion
- Stable Diffusion used as foundation model for guided image generation
- Experiments with CLIP feature extractor, segmentation network, face recognition network, and object detection network
- Forward guidance produces high-quality images that match given prompt
- Loss function calculates negative cosine similarity between image embedding and CLIP text embedding
- Segmentation map used to produce clear separation between object and background
- Face recognition module used to guide image generation to resemble given person
- Object location guidance used to generate objects in designated location
- Style guidance used to capture reference style from style image
- Results show high-quality images that match given text and style prompts
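The CLIP guidance loss mentioned above, negative cosine similarity between two embeddings, is simple to write down. The sketch below uses short hand-written vectors in place of real CLIP image and text embeddings, and the function name is chosen only for this example.

```python
import math

def negative_cosine_similarity(u, v):
    """Return -cos(u, v); lower values mean better-aligned embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return -dot / (norm_u * norm_v)

image_embedding = [1.0, 2.0, 3.0]        # placeholder for an image embedding
text_embedding = [2.0, 4.0, 6.0]         # placeholder for a text embedding

aligned = negative_cosine_similarity(image_embedding, text_embedding)
orthogonal = negative_cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = negative_cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

The loss is lowest when the two embeddings point in the same direction and highest when they point in opposite directions, so minimizing it pulls the image embedding toward the text embedding.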
Results for ImageNet diffusion
- Results presented for guided image generation using an unconditional diffusion model trained on ImageNet
- Experiments conducted with CLIP guidance, object location guidance and segmentation-guided inpainting
- Hand-crafted text prompts used to assess limit of universal guidance algorithm
- Results show successful guidance to produce quality images that match text prompts
- Results demonstrate effectiveness of universal guidance algorithm and necessity of backward guidance
- Results show ability of algorithm to handle multiple guidance functions
Limitations
- Generation using universal guidance is slower than standard conditional generation
- Multiple iterations of denoising are required to generate high-quality images
- Time complexity of algorithm scales linearly with number of recurrence steps
- Backward guidance is required in certain scenarios
- Computing backward guidance requires performing minimization with a multistep gradient descent inner loop
- Sampling hyper-parameters must be chosen individually for each guidance network
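A rough way to reason about the slowdown is to count network evaluations. The counting model below is an assumption made for illustration, not a figure from the paper: one denoiser call per recurrence iteration, one guidance-gradient evaluation for forward guidance, and one per inner gradient-descent step of backward guidance.

```python
def guidance_cost(num_steps, recurrence_k, backward_m=0, use_forward=True):
    """Count evaluations under a simple, assumed cost model."""
    iterations = num_steps * recurrence_k
    grads_per_iteration = (1 if use_forward else 0) + backward_m
    return {
        "denoiser_calls": iterations,
        "guidance_grad_calls": iterations * grads_per_iteration,
    }

baseline = guidance_cost(num_steps=50, recurrence_k=1)
heavy = guidance_cost(num_steps=50, recurrence_k=4, backward_m=3)
```

Under this model, going from one to four recurrence iterations multiplies the denoiser calls by four, and the backward inner loop adds further guidance-network evaluations on top.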
Conclusion
- Proposes a universal guidance algorithm for guided image generation
- Algorithm only requires guidance and loss functions to be differentiable
- Demonstrates promising results with complex guidance including segmentation, face recognition and object detection
- Multiple guidance functions can be combined and used in conjunction
- Self-recurrence helps segmentation-guided generation
- Object-detection guidance also works with the unconditional ImageNet model
- Algorithm handles multiple guidance functions effectively