Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Text-conditional diffusion models generate high-quality, diverse images.
Users can control image generation by defining a collage.
Collage Diffusion modifies text-image cross-attention with the layers’ alpha masks.
Collage Diffusion learns specialized text representations per layer.
Layer-based controls provide fine-grained control over the final output.
Collage Diffusion generates globally harmonized images.

Paper Content

Introduction

Diffusion-based text-conditional image generation can generate plausible images from a text prompt
Recent work provides new ways to specify desired output, such as sketching, segmentation masks, and reference images
This paper seeks to give users precise control over image output when creating scenes with a specific desired spatial arrangement
Users can make a collage of images to express artistic intent
Collage Diffusion generates novel, high-quality images that respect the scene composition and object appearance
Collage input enables per-layer control mechanisms to control the harmonization-fidelity tradeoff

Problem definition and goals

Goal is to generate high-quality images that respect desired scene composition
User describes desired intent with a collage
Collage consists of a text string and a sequence of layers
Output image should be globally harmonized and have appearance fidelity
Diffusion-based techniques are used to constrain the spatial layout and appearance of objects
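The collage input described above can be sketched as a small data structure. This is only an illustration: the class names `Layer` and `Collage`, the RGBA array layout, the back-to-front ordering, and the check that each layer string appears inside the full text string are assumptions made for the example, not details confirmed by this summary.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class Layer:
    """One collage layer (hypothetical structure)."""
    rgba: np.ndarray  # (H, W, 4) floats in [0, 1]; channel 3 is the alpha mask
    text: str         # layer string c_i describing this layer

    @property
    def alpha(self) -> np.ndarray:
        # Alpha mask of the layer, shape (H, W)
        return self.rgba[..., 3]


@dataclass
class Collage:
    """A text string plus a sequence of layers (assumed ordered back to front)."""
    text: str
    layers: List[Layer] = field(default_factory=list)

    def validate(self) -> None:
        # All layers are assumed to share one canvas size.
        shapes = {layer.rgba.shape for layer in self.layers}
        if len(shapes) > 1:
            raise ValueError(f"layers have different shapes: {shapes}")
        # Assumption: each layer string is a substring of the full text.
        for layer in self.layers:
            if layer.text not in self.text:
                raise ValueError(f"{layer.text!r} not found in collage text")
```

A collage for "a crab on a beach" would then hold two layers, one with the text "a beach" and one with the text "a crab".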

Related work

Traditional graphics techniques can be used to flatten collage layers into a single image
Diffusion-based image harmonization techniques can be used to improve the visual quality of the image
Adding noise to the image can lead to a loss of spatial and appearance fidelity
Collage Diffusion seeks to better maintain the spatial and appearance fidelity of the initial collage
Existing techniques define spatial layouts in terms of segmentation maps
Collage Diffusion uses a collage as an intuitive way to specify spatial composition
Collage Diffusion aims to preserve visual characteristics of the input layers
Collage Diffusion can be framed as a constrained form of image stylization
Image-to-image approaches struggle to constrain scene composition for collage-conditional generation
Layered image and video editing is a well-established technique in traditional computer graphics
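The traditional flattening step mentioned at the start of this list can be illustrated with standard alpha compositing (the "over" operator). This is a minimal sketch, assuming layers are `(H, W, 4)` float arrays ordered back to front; the function name `flatten_layers` is hypothetical.

```python
import numpy as np


def flatten_layers(layers):
    """Composite RGBA layers back to front with the 'over' operator.

    Returns premultiplied RGB of shape (H, W, 3) and alpha of shape (H, W).
    """
    h, w, _ = layers[0].shape
    out_rgb = np.zeros((h, w, 3))  # premultiplied color accumulated so far
    out_a = np.zeros((h, w, 1))    # coverage accumulated so far
    for layer in layers:           # each later layer sits on top of the result
        rgb, a = layer[..., :3], layer[..., 3:4]
        out_rgb = rgb * a + out_rgb * (1.0 - a)
        out_a = a + out_a * (1.0 - a)
    return out_rgb, out_a[..., 0]
```

The flattened image carries no notion of which pixel came from which layer, which is why the per-layer masks are kept alongside it in the sketches of the later steps.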

Collage diffusion

Text-conditioned diffusion models can be used to perform image harmonization.
Collage Diffusion leverages additional information to increase fidelity of output.

Global image harmonization

SDEdit algorithm improves image quality by adding Gaussian noise and denoising the noised image.
SDEdit is not sufficient for complex images with many objects.
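The noise-then-denoise idea behind SDEdit can be sketched as follows. This is a toy illustration under stated assumptions: the `denoiser` argument stands in for the trained model, the linear noise schedule and the Euler update are choices made for the example, and `toy_denoiser` is a placeholder with no learned behavior.

```python
import numpy as np


def sdedit(image, denoiser, sigma_start, num_steps=10, seed=0):
    """Add Gaussian noise of scale sigma_start, then denoise step by step.

    `denoiser(x, sigma)` is assumed to return an estimate of the clean image.
    """
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(sigma_start, 0.0, num_steps + 1)
    x = image + sigma_start * rng.standard_normal(image.shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoiser(x, sigma)
        direction = (x - denoised) / sigma      # points from clean estimate to x
        x = x + (sigma_next - sigma) * direction  # Euler step toward less noise
    return x


def toy_denoiser(x, sigma):
    # Placeholder only: predicts a flat image at the current mean intensity.
    return np.full_like(x, x.mean())
```

A larger `sigma_start` gives the model more freedom to harmonize the image but discards more of the input, which is the tradeoff the following sections address.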

Spatial fidelity through cross-attention manipulation

Collage Diffusion modifies the text-image cross-attention in the text-conditional U-Net model $D_\theta$
Not all tokens in the input collage text $c$ correspond to layer strings $c_i$
Global tokens lack specific regional influence
Layer tokens are restricted to the regions of the image according to where the corresponding layer is visible
Cross-attention is computed as $\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$
Collage Diffusion alters $QK^T$ to increase or decrease the influence of a particular token on a part of the image
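One way to picture this modification is an additive bias on the attention scores before the softmax. The sketch below is an assumption, not the paper's formula: the `strength` parameter, the signed-mask form of the bias, and the names `visibility` and `is_layer_token` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def collage_cross_attention(Q, K, V, visibility, is_layer_token, strength=4.0):
    """Cross-attention with a per-pixel, per-token bias added to QK^T.

    Q: (P, d) pixel queries; K: (T, d) token keys; V: (T, dv) token values.
    visibility: (P, T), 1 where a token's layer is visible at a pixel, else 0.
    is_layer_token: (T,) bool; global tokens (False) receive no bias.
    """
    d = Q.shape[-1]
    scores = Q @ K.T                              # (P, T)
    signed = 2.0 * visibility - 1.0               # +1 visible, -1 hidden
    bias = strength * signed * is_layer_token[None, :]
    weights = softmax((scores + bias) / np.sqrt(d), axis=-1)
    return weights @ V, weights
```

With this form, a layer token gains weight at pixels where its layer is visible and loses weight elsewhere, while global tokens keep their unmodified scores.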

Appearance fidelity through textual inversion

Layer text often fails to capture the appearance of layer images.
Starting image provides guidance on desired look, but noise reduces influence.
Collage Diffusion refines layer text to more accurately capture layer’s appearance.

Controlling the harmonization-fidelity tradeoff with per-layer noise

Content in input collage layers needs to be changed to globally harmonize the image.
Users can control the harmonization-fidelity tradeoff on a per-object basis.
Noise levels are set for each layer and converted into a single-channel noise image.
Gaussian blur is applied to the noise image to smooth boundaries.
Collage Diffusion modifies the diffusion process to add different levels of noise to different regions of the image.
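The steps above (per-layer noise levels, a single-channel noise image, and a Gaussian blur) can be sketched with NumPy. The helper names, the back-to-front painting rule, the `base_level` for uncovered pixels, and the blur width are assumptions for illustration; how the resulting map enters the sampler is not specified in this summary.

```python
import numpy as np


def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge padding; output has the input shape."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")

    def conv(v):
        return np.convolve(v, kernel, mode="valid")

    rows = np.apply_along_axis(conv, 1, padded)  # blur along width
    return np.apply_along_axis(conv, 0, rows)    # blur along height


def build_noise_map(alphas, levels, base_level=1.0, blur_sigma=1.0):
    """Paint per-layer noise levels back to front, then smooth the boundaries.

    alphas: list of (H, W) alpha masks in [0, 1]; levels: one float per layer.
    """
    noise = np.full(alphas[0].shape, base_level, dtype=float)
    for alpha, level in zip(alphas, levels):
        noise = alpha * level + (1.0 - alpha) * noise
    return gaussian_blur(noise, blur_sigma)


def add_spatial_noise(image, noise_map, seed=0):
    # Each pixel receives Gaussian noise scaled by its own level.
    rng = np.random.default_rng(seed)
    return image + noise_map * rng.standard_normal(image.shape)
```

A layer given a low level stays close to its input appearance, while a layer given a high level is free to change during denoising.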

Editing individual layers in generated images

Generating an image from an n-layer collage requires per-layer noise controls.
Per-layer noise controls enable users to keep a part of an input collage “fixed”.
Layer-driven editing can be used to refine an image when a small part doesn’t look right.

Experiments

Evaluated performance of Collage Diffusion by analyzing its ability to generate globally harmonized images
Analyzed capacity of Collage Diffusion with user in the loop
Analyzed capacity of Collage Diffusion for image generation in non-interactive settings
Compared Collage Diffusion against multiple image harmonization approaches
Ablated impact of individual components of Collage Diffusion

Experimental setup

Generate 10 images with different random seeds
User selects image they like
User selects object in image to re-generate
Process continues until user is satisfied

Non-interactive generation

Cross-attention manipulation (CA) improves spatial fidelity across scenes
Textual inversion (TI) improves appearance fidelity across scenes
Per-layer noise (LN) helps optimize the harmonization-fidelity tradeoff across scenes
Collage construction involves importing images, generating masks, placing objects, and adding captions
Stable Diffusion is used as the base model
Euler ancestral solver is used with 50 steps
Noise is tuned to optimize the harmonization-fidelity tradeoff
SA is used for comparison

Interactive editing

Collage Diffusion is used to author complex scenes
It is done in 3 steps: generating an initial collection of images, exploring different options for an object, and exploring different options for another object
Global harmonization (GH) generates globally-harmonized images, while SA struggles with harmonization
SA fails to harmonize images when objects need to be moved or rotated
GH reliably generates globally-harmonized images with consistent perspective and lighting
Collage Diffusion is most helpful when objects are easy to discriminate or when the user is particular about the exact appearance of several complex layers
Iterative editing workflow is valuable for complex scenes

Conclusion

Text-conditional diffusion models can produce high-quality images from natural language input
Collage Diffusion introduces a new form of control to generate visually compelling images
Collage Diffusion enables users to express compositional intent
Collage Diffusion manipulates cross-attention, learns layer-specific representations, and harmonizes layers
Collage Diffusion outputs diverse, globally-harmonized images
Iterative editing workflow enables user to modify individual layers of generated images
Collage Diffusion generates images with greater spatial and appearance fidelity than baseline methods
