Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposed algorithm enables diffusion models to be controlled by arbitrary guidance modalities without retraining.
  • Algorithm successfully generates quality images with guidance functions including segmentation, face recognition, object detection, and classifier signals.

Paper Content

Introduction

  • Diffusion models are powerful tools for creating digital art and graphics.
  • Most models are controlled through conditioning.
  • Guidance is a more flexible approach to controlling model outputs.
  • Guidance functions can be used without re-training or modification.
  • We propose an algorithm that enables universal guidance for diffusion models.

Background

  • Review of core framework behind diffusion models
  • Definition of problem setting of controlled image generation
  • Discussion of previous related works

Diffusion models

  • Diffusion models are generative models used for image, audio and text generation.
  • Diffusion models are a combination of a forward process and a reverse process.
  • The forward process adds noise to a clean data point, while the reverse process attempts to denoise a noisy input.

Controlled image generation

  • Focus on controlled image generation with various constraints
  • Consider a differentiable guidance function f
  • Measure closeness of two vectors c and c
  • Prompt is a particular choice of c
  • Two categories of prior work: conditional image generation and guided image generation
  • Conditional image generation requires training new diffusion models
  • Guided image generation uses frozen pre-trained diffusion model
  • Prior work studied guided image generation with a variety of restrictions and external guidance functions
  • This work studies universal guidance algorithms for guided image generation with any off-the-shelf guidance functions

Universal guidance

  • Proposed guidance algorithm augments image sampling method of diffusion model
  • Algorithm motivated by observation that reconstructed clean image is appropriate for generic guidance function
  • Forward universal guidance extends classifier guidance to leverage observation
  • Backward universal guidance helps enforce generated image to satisfy constraint based on guidance function
  • Self-recurrence trick to improve fidelity of generated images

Forward universal guidance

  • Classifier guidance is a method of sampling that uses a class prompt c and a guidance function f cl to output classification probability.
  • Universal guidance is an extension of classifier guidance that allows for any general guidance function and loss function.
  • Universal guidance uses a predicted clean image αΊ‘0 to calculate the guidance.
  • Forward universal guidance is a related approach that is studied in (Chung et al., 2022a).

Backward universal guidance

  • Forward guidance sometimes fails to match the given prompt.
  • Backward guidance is proposed to supplement forward guidance and enforce the constraint.
  • Backward guidance produces an optimized direction for the generated image to match the given prompt.

Per-step self-recurrence

  • Applying universal guidance to standard generation pipelines often produces images with artifacts and strange behaviors.
  • Attempts to prioritize realness by decreasing s(t) were ineffective.
  • Self-recurrence is used to explore different regions of the data manifold and improve the harmony of generated images.
  • Algorithm 1 summarizes the universal guidance algorithm composed of forward universal guidance, backward universal guidance and per-step self-recurrence.

Experiments

  • Tested proposed universal guidance algorithm against a variety of guidance functions
  • Experimented with Stable Diffusion and ImageNet diffusion model
  • Results demonstrate universal algorithm is comparable to specialized conditional model in generating quality images that satisfy text constraints

Results for stable diffusion

  • Stable Diffusion used as foundation model for guided image generation
  • Experiments with CLIP feature extractor, segmentation network, face recognition network, and object detection network
  • Forward guidance produces high-quality images that match given prompt
  • Loss function calculates negative cosine similarity between image embedding and CLIP text embedding
  • Segmentation map used to produce clear separation between object and background
  • Face recognition module used to guide image generation to resemble given person
  • Object location guidance used to generate objects in designated location
  • Style guidance used to capture reference style from style image
  • Results show high-quality images that match given text and style prompts

Results for imagenet diffusion

  • Results presented for guided image generation using an unconditional diffusion model trained on ImageNet
  • Experiments conducted with CLIP guidance, object location guidance and segmentation-guided inpainting
  • Hand-crafted text prompts used to assess limit of universal guidance algorithm
  • Results show successful guidance to produce quality images that match text prompts
  • Results demonstrate effectiveness of universal guidance algorithm and necessity of backward guidance
  • Results show ability of algorithm to handle multiple guidance functions

Limitations

  • Generation using universal guidance is slower than standard conditional generation
  • Multiple iterations of denoising are required to generate high-quality images
  • Time complexity of algorithm scales linearly with number of recurrence steps
  • Backward guidance is required in certain scenarios
  • Computing backward guidance requires performing minimization with a multistep gradient descent inner loop
  • Sampling hyper-parameters must be chosen individually for each guidance network

Conclusion

  • Proposes a universal guidance algorithm for guided image generation
  • Algorithm only requires guidance and loss functions to be differentiable
  • Demonstrates promising results with complex guidance including segmentation, face recognition and object detection
  • Multiple guidance functions can be combined and used in conjunction
  • Self-recurrence helps segmentation-guided generation
  • Generation guided by object detection with the unconditional ImageNet model
  • Algorithm handles multiple guidance functions effectively