Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposed algorithm enables diffusion models to be controlled by arbitrary guidance modalities without retraining.
Algorithm successfully generates quality images with guidance functions including segmentation, face recognition, object detection, and classifier signals.

Paper Content

Introduction

Diffusion models are powerful tools for creating digital art and graphics.
Most models are controlled through conditioning.
Guidance is a more flexible approach to controlling model outputs.
Guidance functions can be used without re-training or modification.
We propose an algorithm that enables universal guidance for diffusion models.

Background

Review of core framework behind diffusion models
Definition of problem setting of controlled image generation
Discussion of previous related works

Diffusion models

Diffusion models are generative models used for image, audio and text generation.
Diffusion models are a combination of a forward process and a reverse process.
The forward process adds noise to a clean data point, while the reverse process attempts to denoise a noisy input.

Controlled image generation

Focus on controlled image generation with various constraints
Consider a differentiable guidance function f
Measure closeness of two vectors c and c
Prompt is a particular choice of c
Two categories of prior work: conditional image generation and guided image generation
Conditional image generation requires training new diffusion models
Guided image generation uses frozen pre-trained diffusion model
Prior work studied guided image generation with a variety of restrictions and external guidance functions
This work studies universal guidance algorithms for guided image generation with any off-the-shelf guidance functions

Universal guidance

Proposed guidance algorithm augments image sampling method of diffusion model
Algorithm motivated by observation that reconstructed clean image is appropriate for generic guidance function
Forward universal guidance extends classifier guidance to leverage observation
Backward universal guidance helps enforce generated image to satisfy constraint based on guidance function
Self-recurrence trick to improve fidelity of generated images

Forward universal guidance

Classifier guidance is a method of sampling that uses a class prompt c and a guidance function f cl to output classification probability.
Universal guidance is an extension of classifier guidance that allows for any general guidance function and loss function.
Universal guidance uses a predicted clean image ẑ0 to calculate the guidance.
Forward universal guidance is a related approach that is studied in (Chung et al., 2022a).

Backward universal guidance

Forward guidance sometimes fails to match the given prompt.
Backward guidance is proposed to supplement forward guidance and enforce the constraint.
Backward guidance produces an optimized direction for the generated image to match the given prompt.

Per-step self-recurrence

Applying universal guidance to standard generation pipelines often produces images with artifacts and strange behaviors.
Attempts to prioritize realness by decreasing s(t) were ineffective.
Self-recurrence is used to explore different regions of the data manifold and improve the harmony of generated images.
Algorithm 1 summarizes the universal guidance algorithm composed of forward universal guidance, backward universal guidance and per-step self-recurrence.

Experiments

Tested proposed universal guidance algorithm against a variety of guidance functions
Experimented with Stable Diffusion and ImageNet diffusion model
Results demonstrate universal algorithm is comparable to specialized conditional model in generating quality images that satisfy text constraints

Results for stable diffusion

Stable Diffusion used as foundation model for guided image generation
Experiments with CLIP feature extractor, segmentation network, face recognition network, and object detection network
Forward guidance produces high-quality images that match given prompt
Loss function calculates negative cosine similarity between image embedding and CLIP text embedding
Segmentation map used to produce clear separation between object and background
Face recognition module used to guide image generation to resemble given person
Object location guidance used to generate objects in designated location
Style guidance used to capture reference style from style image
Results show high-quality images that match given text and style prompts

Results for imagenet diffusion

Results presented for guided image generation using an unconditional diffusion model trained on ImageNet
Experiments conducted with CLIP guidance, object location guidance and segmentation-guided inpainting
Hand-crafted text prompts used to assess limit of universal guidance algorithm
Results show successful guidance to produce quality images that match text prompts
Results demonstrate effectiveness of universal guidance algorithm and necessity of backward guidance
Results show ability of algorithm to handle multiple guidance functions

Limitations

Generation using universal guidance is slower than standard conditional generation
Multiple iterations of denoising are required to generate high-quality images
Time complexity of algorithm scales linearly with number of recurrence steps
Backward guidance is required in certain scenarios
Computing backward guidance requires performing minimization with a multistep gradient descent inner loop
Sampling hyper-parameters must be chosen individually for each guidance network

Conclusion

Proposes a universal guidance algorithm for guided image generation
Algorithm only requires guidance and loss functions to be differentiable
Demonstrates promising results with complex guidance including segmentation, face recognition and object detection
Multiple guidance functions can be combined and used in conjunction
Self-recurrence helps segmentation-guided generation
Generation guided by object detection with the unconditional ImageNet model
Algorithm handles multiple guidance functions effectively

Link to paper#

Abstract#

Paper Content#

Introduction#

Background#

Diffusion models#

Controlled image generation#

Universal guidance#

Forward universal guidance#

Backward universal guidance#

Per-step self-recurrence#

Experiments#

Results for stable diffusion#

Results for imagenet diffusion#

Limitations#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Background

Diffusion models

Controlled image generation

Universal guidance

Forward universal guidance

Backward universal guidance

Per-step self-recurrence

Experiments

Results for stable diffusion

Results for imagenet diffusion

Limitations

Conclusion