Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Text-conditional diffusion models generate high-quality, diverse images.
- Users can control image generation by defining a collage.
- Collage Diffusion modifies text-image cross-attention with the layers’ alpha masks.
- Collage Diffusion learns specialized text representations per layer.
- Layer-based controls provide fine-grained control over the final output.
- Collage Diffusion generates globally harmonized images.
Paper Content
Introduction
- Diffusion-based text-conditional image generation can generate plausible images from a text prompt
- Recent work provides new ways to specify desired output, such as sketching, segmentation masks, and reference images
- This paper seeks to give users precise control over image output when creating scenes with a specific desired spatial arrangement
- Users can make a collage of images to express artistic intent
- Collage Diffusion generates novel, high-quality images that respect the scene composition and object appearance
- Collage input enables per-layer control mechanisms to control the harmonization-fidelity tradeoff
Problem definition and goals
- Goal is to generate high-quality images that respect desired scene composition
- User describes desired intent with a collage
- Collage consists of a text string and a sequence of layers
- Output image should be globally harmonized and have appearance fidelity
- Diffusion-based techniques used to constrain spatial layout and appearance of objects
Related work
- Traditional graphics techniques can be used to flatten collage layers into a single image
- Diffusion-based image harmonization techniques can be used to improve the visual quality of the image
- Adding noise to the image can lead to a loss of spatial and appearance fidelity
- Collage Diffusion seeks to better maintain the spatial and appearance fidelity of the initial collage
- Existing techniques define spatial layouts in terms of segmentation maps
- Collage Diffusion uses a collage as an intuitive way to specify spatial composition
- Collage Diffusion aims to preserve visual characteristics of the input layers
- Collage Diffusion can be framed as a constrained form of image stylization
- Image-to-image approaches struggle to constrain scene composition for collage-conditional generation
- Layered image and video editing is a well-established technique in traditional computer graphics
Collage diffusion
- Text-conditioned diffusion models can be used to perform image harmonization.
- Collage Diffusion leverages additional information to increase fidelity of output.
Global image harmonization
- SDEdit algorithm improves image quality by adding Gaussian noise and denoising the noised image.
- SDEdit is not sufficient for complex images with many objects.
Spatial fidelity through cross-attention manipulation
- Collage Diffusion modifies the text-image cross-attention in the text-conditional U-Net model D θ
- Not all tokens in the input collage text c correspond to layer strings c i
- Global tokens lack specific regional influence
- Layer tokens are restricted to the regions of the image according to where the corresponding layer is visible
- Cross-attention is computed as softmax QK T √ d V
- Collage Diffusion alters QK T to increase or decrease the influence of a particular token on a part of the image
Appearance fidelity through textual inversion
- Layer text often fails to capture the appearance of layer images.
- Starting image provides guidance on desired look, but noise reduces influence.
- Collage Diffusion refines layer text to more accurately capture layer’s appearance.
Controlling the harmonization-fidelity tradeoff with per-layer noise
- Content in input collage layers needs to be changed to globally harmonize the image.
- Users can control the harmonization-fidelity tradeoff on a per-object basis.
- Noise levels are set for each layer and converted into a single-channel noise image.
- Gaussian blur is applied to the noise image to smooth boundaries.
- Collage Diffusion modifies the diffusion process to add different levels of noise to different regions of the image.
Editing individual layers in generated images
- Generating an image from a n-layer collage requires per-layer noise controls.
- Per-layer noise controls enable users to keep a part of an input collage “fixed”.
- Layer-driven editing can be used to refine an image when a small part doesn’t look right.
Experiments
- Evaluated performance of Collage Diffusion by analyzing its ability to generate globally harmonized images
- Analyzed capacity of Collage Diffusion with user in the loop
- Analyzed capacity of Collage Diffusion for image generation in non-interactive settings
- Compared Collage Diffusion against multiple image harmonization approaches
- Ablated impact of individual components of Collage Diffusion
Experimental setup
- Generate 10 images with different random seeds
- User selects image they like
- User selects object in image to re-generate
- Process continues until user is satisfied
Non-interactive generation
- CA improves spatial fidelity across scenes
- TI improves appearance fidelity across scenes
- LN helps optimize the harmonization-fidelity tradeoff across scenes
- Collage construction involves importing images, generating masks, placing objects, and adding captions
- Stable Diffusion is used as the base model
- Euler ancestral solver is used with 50 steps
- Noise is tuned to optimize the harmonization-fidelity tradeoff
- SA is used for comparison
Interactive editing
- Collage Diffusion is used to author complex scenes
- It is done in 3 steps: generating an initial collection of images, exploring different options for an object, and exploring different options for another object
- GH generates globally-harmonized images, while SA struggles with harmonization
- SA fails to harmonize images when objects need to be moved or rotated
- GH reliably generates globally-harmonized images with consistent perspective and lighting
- Collage Diffusion is most helpful when objects are easy to discriminate or when the user is particular about the exact appearance of several complex layers
- Iterative editing workflow is valuable for complex scenes
Conclusion
- Text-conditional diffusion models can produce high-quality images from natural language input
- Collage Diffusion introduces a new form of control to generate visually compelling images
- Collage Diffusion enables users to express compositional intent
- Collage Diffusion manipulates cross-attention, learns layer-specific representations, and harmonizes layers
- Collage Diffusion outputs diverse, globally-harmonized images
- Iterative editing workflow enables user to modify individual layers of generated images
- Collage Diffusion generates images with greater spatial and appearance fidelity than baseline methods