

  • Text-conditional diffusion models generate high-quality, diverse images.
  • Users can control image generation by defining a collage.
  • Collage Diffusion modifies text-image cross-attention with the layers’ alpha masks.
  • Collage Diffusion learns specialized text representations per layer.
  • Layer-based controls provide fine-grained control over the final output.
  • Collage Diffusion generates globally harmonized images.

Paper Content


  • Diffusion-based text-conditional image generation can generate plausible images from a text prompt
  • Recent work provides new ways to specify desired output, such as sketching, segmentation masks, and reference images
  • This paper seeks to give users precise control over image output when creating scenes with a specific desired spatial arrangement
  • Users can make a collage of images to express artistic intent
  • Collage Diffusion generates novel, high-quality images that respect the scene composition and object appearance
  • Collage input enables per-layer control mechanisms to control the harmonization-fidelity tradeoff

Problem definition and goals

  • Goal is to generate high-quality images that respect desired scene composition
  • User describes desired intent with a collage
  • Collage consists of a text string and a sequence of layers
  • Output image should be globally harmonized and have appearance fidelity
  • Diffusion-based techniques used to constrain spatial layout and appearance of objects
  • Traditional graphics techniques can be used to flatten collage layers into a single image
  • Diffusion-based image harmonization techniques can be used to improve the visual quality of the image
  • Adding noise to the image can lead to a loss of spatial and appearance fidelity
  • Collage Diffusion seeks to better maintain the spatial and appearance fidelity of the initial collage
  • Existing techniques define spatial layouts in terms of segmentation maps
  • Collage Diffusion uses a collage as an intuitive way to specify spatial composition
  • Collage Diffusion aims to preserve visual characteristics of the input layers
  • Collage Diffusion can be framed as a constrained form of image stylization
  • Image-to-image approaches struggle to constrain scene composition for collage-conditional generation
  • Layered image and video editing is a well-established technique in traditional computer graphics

Collage diffusion

  • Text-conditioned diffusion models can be used to perform image harmonization.
  • Collage Diffusion leverages additional information to increase fidelity of output.

Global image harmonization

  • SDEdit algorithm improves image quality by adding Gaussian noise and denoising the noised image.
  • SDEdit is not sufficient for complex images with many objects.
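
The noise-then-denoise idea behind SDEdit can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a variance-exploding noise parameterization and a plain deterministic Euler update (rather than the Euler ancestral solver mentioned later), and `denoise_fn` and `toy_denoiser` are hypothetical stand-ins for a trained diffusion denoiser.

```python
import numpy as np

def sdedit(image, denoise_fn, sigma_start, num_steps, rng):
    """Noise `image` to level `sigma_start`, then denoise back to sigma = 0.

    `denoise_fn(x, sigma)` is a hypothetical stand-in for a trained model:
    it returns an estimate of the clean image given a noisy input.
    """
    # Noise levels decreasing linearly from sigma_start to 0 (an assumption).
    sigmas = np.linspace(sigma_start, 0.0, num_steps + 1)
    # Forward step: perturb the input with Gaussian noise.
    x = image + sigma_start * rng.standard_normal(image.shape)
    # Reverse steps: Euler updates toward the denoiser's estimate.
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoise_fn(x, sigma)
        direction = (x - denoised) / sigma      # estimated noise direction
        x = x + direction * (sigma_next - sigma)
    return x

def toy_denoiser(x, sigma):
    # Posterior mean under a zero-mean, unit-variance Gaussian prior.
    # A real system would call a text-conditional diffusion model here.
    return x / (1.0 + sigma ** 2)

rng = np.random.default_rng(0)
collage = rng.uniform(0.0, 1.0, size=(8, 8, 3))   # flattened collage (toy data)
harmonized = sdedit(collage, toy_denoiser, sigma_start=0.5, num_steps=10, rng=rng)
```

A larger `sigma_start` gives the denoiser more freedom to harmonize the image but discards more of the input, which is the tradeoff the later sections try to control.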

Spatial fidelity through cross-attention manipulation

  • Collage Diffusion modifies the text-image cross-attention in the text-conditional U-Net model $D_\theta$
  • Not all tokens in the input collage text $c$ correspond to layer strings $c_i$
  • Global tokens lack specific regional influence
  • Layer tokens are restricted to the regions of the image according to where the corresponding layer is visible
  • Cross-attention is computed as $\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V$
  • Collage Diffusion alters $QK^\top$ to increase or decrease the influence of a particular token on a part of the image
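
One way to picture this manipulation is as an additive bias on the attention scores before the softmax. The sketch below is an assumption-laden illustration rather than the paper's exact formulation: the `boost` and `suppress` magnitudes, the `visible` mask layout, and the function name are all hypothetical. Layer tokens are pushed toward positions where their layer is visible and away from everywhere else, while global tokens are left untouched.

```python
import numpy as np

def masked_cross_attention(Q, K, V, visible, is_layer_token,
                           boost=1.0, suppress=1e4):
    """Cross-attention with an additive, mask-derived bias on Q K^T.

    Q: (P, d) image queries; K: (T, d) text keys; V: (T, d_v) text values.
    visible: (P, T) bool, True where token t's layer is visible at position p.
    is_layer_token: (T,) bool, False for global tokens (left unmodified).
    boost / suppress: hypothetical bias magnitudes, not values from the paper.
    """
    d = Q.shape[-1]
    scores = Q @ K.T                                      # (P, T)
    bias = np.where(visible, boost, -suppress)            # per-position bias
    bias = np.where(is_layer_token[None, :], bias, 0.0)   # global tokens: no bias
    logits = (scores + bias) / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 4 image positions, 3 tokens (token 0 is global).
rng = np.random.default_rng(0)
P, T, d = 4, 3, 8
Q = rng.standard_normal((P, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))
is_layer_token = np.array([False, True, True])
visible = np.zeros((P, T), dtype=bool)
visible[:2, 1] = True    # token 1's layer covers positions 0 and 1
visible[2:, 2] = True    # token 2's layer covers positions 2 and 3
out, weights = masked_cross_attention(Q, K, V, visible, is_layer_token)
```

In this toy setup, each layer token receives essentially zero attention weight outside its own region, while the global token can still influence every position.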

Appearance fidelity through textual inversion

  • Layer text often fails to capture the appearance of layer images.
  • Starting image provides guidance on desired look, but noise reduces influence.
  • Collage Diffusion refines layer text to more accurately capture layer’s appearance.

Controlling the harmonization-fidelity tradeoff with per-layer noise

  • Content in input collage layers needs to be changed to globally harmonize the image.
  • Users can control the harmonization-fidelity tradeoff on a per-object basis.
  • Noise levels are set for each layer and converted into a single-channel noise image.
  • Gaussian blur is applied to the noise image to smooth boundaries.
  • Collage Diffusion modifies the diffusion process to add different levels of noise to different regions of the image.
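
The steps above can be sketched as follows. This is a simplified illustration under stated assumptions: layers are given as boolean masks ordered back to front, the blur is a hand-rolled separable Gaussian, and the spatially varying noise is applied once to the starting image. The paper's actual modification of the diffusion process may differ, and all function names here are hypothetical.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D array with edge padding."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel = kernel / kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    # Blur along rows, then along columns; "valid" restores the original size.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def build_noise_map(layer_masks, layer_noise, blur_sigma=2.0):
    """Convert per-layer noise levels into one smoothed single-channel map.

    layer_masks: list of (H, W) bool masks, ordered back to front.
    layer_noise: one noise level per layer.
    """
    noise_map = np.zeros(layer_masks[0].shape)
    for mask, level in zip(layer_masks, layer_noise):
        noise_map = np.where(mask, level, noise_map)   # front layers overwrite
    return gaussian_blur(noise_map, blur_sigma)        # smooth layer boundaries

def add_spatial_noise(image, noise_map, rng):
    """Add Gaussian noise to an (H, W, C) image, scaled per pixel by noise_map."""
    return image + noise_map[..., None] * rng.standard_normal(image.shape)

# Toy collage: a background layer with high noise and a foreground square
# with low noise, so the square keeps more of its original appearance.
H = W = 32
background = np.ones((H, W), dtype=bool)
foreground = np.zeros((H, W), dtype=bool)
foreground[8:24, 8:24] = True
noise_map = build_noise_map([background, foreground], [0.8, 0.2])

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(H, W, 3))
noised = add_spatial_noise(image, noise_map, rng)
```

Setting a layer's noise level to zero leaves its pixels untouched in this sketch, which corresponds to keeping that part of the collage fixed.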

Editing individual layers in generated images

  • Generating an image from an n-layer collage requires per-layer noise controls.
  • Per-layer noise controls enable users to keep a part of an input collage “fixed”.
  • Layer-driven editing can be used to refine an image when a small part doesn’t look right.


  • Evaluated performance of Collage Diffusion by analyzing its ability to generate globally harmonized images
  • Analyzed capacity of Collage Diffusion with user in the loop
  • Analyzed capacity of Collage Diffusion for image generation in non-interactive settings
  • Compared Collage Diffusion against multiple image harmonization approaches
  • Ablated impact of individual components of Collage Diffusion

Experimental setup

  • Generate 10 images with different random seeds
  • User selects image they like
  • User selects object in image to re-generate
  • Process continues until user is satisfied

Non-interactive generation

  • Cross-attention manipulation (CA) improves spatial fidelity across scenes
  • Textual inversion (TI) improves appearance fidelity across scenes
  • Per-layer noise (LN) helps optimize the harmonization-fidelity tradeoff across scenes
  • Collage construction involves importing images, generating masks, placing objects, and adding captions
  • Stable Diffusion is used as the base model
  • Euler ancestral solver is used with 50 steps
  • Noise is tuned to optimize the harmonization-fidelity tradeoff
  • SA is used for comparison

Interactive editing

  • Collage Diffusion is used to author complex scenes
  • Authoring proceeds in three steps: generating an initial collection of images, exploring different options for one object, and then exploring different options for another object
  • GH generates globally harmonized images, while SA struggles with harmonization
  • SA fails to harmonize images when objects need to be moved or rotated
  • GH reliably generates globally harmonized images with consistent perspective and lighting
  • Collage Diffusion is most helpful when objects are easy to discriminate or when the user is particular about the exact appearance of several complex layers
  • Iterative editing workflow is valuable for complex scenes


  • Text-conditional diffusion models can produce high-quality images from natural language input
  • Collage Diffusion introduces a new form of control to generate visually compelling images
  • Collage Diffusion enables users to express compositional intent
  • Collage Diffusion manipulates cross-attention, learns layer-specific representations, and harmonizes layers
  • Collage Diffusion outputs diverse, globally harmonized images
  • Iterative editing workflow enables users to modify individual layers of generated images
  • Collage Diffusion generates images with greater spatial and appearance fidelity than baseline methods