Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Generative models can create incredible images but lack control.
  • This work offers a new paradigm that allows flexible control of the output image.
  • Compositionality is the core idea, decomposing an image into factors and training a diffusion model.
  • At inference, the representations work as composable elements, leading to a huge design space.
  • Supports various levels of conditions, such as text description, depth map, sketch, color histogram.
  • Improves controllability and serves as a general framework for classical generative tasks.

Paper Content

Introduction

  • Noam Chomsky proposed the idea of “The infinite use of finite means”
  • Generative image models can now produce photorealistic and diverse images
  • Recent works extend text-to-image models by introducing conditions such as segmentation maps, scene graphs, sketches, depth maps, and inpainting masks
  • The key to controllable image generation lies in compositionality, not only in conditioning
  • Composer is a realization of compositional generative models
  • Composer is capable of producing new images from unseen combinations of representations
  • Composer can be used for text-to-image generation, style transfer, pose transfer, image translation, virtual try-on, interpolation, image variation, image reconfiguration, colorization, and more

Method

  • Image is divided into independent components
  • Conditional diffusion model used to reassemble components
  • Introduction to diffusion models and guidance directions
  • Implementation of image decomposition and composition explained

Diffusion models

  • Diffusion models are generative models that produce data from Gaussian noise
  • Mean-squared error is used as the denoising objective
  • Classifier-free guidance is used for conditional data sampling
  • Sampling algorithms such as DDIM and DPM-Solver are used to speed up the sampling process
  • DDIM can be used to reverse a sample back to its pure noise latent
  • Composer is a diffusion model that accepts multiple conditions
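  • Bidirectional guidance (inverting an image to its latent under one set of conditions, then sampling from that latent under another) can be used to manipulate images in a disentangled manner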
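
The guidance step listed above can be sketched as a small function. This is a minimal illustration and not Composer's code: `eps_cond` and `eps_uncond` are hypothetical stand-ins for two noise predictions of a denoiser, and plain Python lists stand in for tensors. With guidance weight `w`, the combined prediction is `w * eps_cond + (1 - w) * eps_uncond`; passing a prediction made under a second condition set instead of the unconditional one gives a bidirectional variant of the same formula.

```python
from typing import List, Sequence


def guided_eps(eps_c2: Sequence[float], eps_c1: Sequence[float], w: float) -> List[float]:
    """Combine two noise predictions with guidance weight w.

    eps_c2: prediction under the condition set to emphasise.
    eps_c1: prediction under the condition set to move away from
            (the unconditional prediction in plain classifier-free guidance).
    """
    if len(eps_c2) != len(eps_c1):
        raise ValueError("predictions must have the same length")
    return [w * a + (1.0 - w) * b for a, b in zip(eps_c2, eps_c1)]


# Toy numbers standing in for denoiser outputs (hypothetical values).
eps_cond = [1.0, 2.0, -1.0]
eps_uncond = [0.0, 1.0, 1.0]

no_guidance = guided_eps(eps_cond, eps_uncond, w=1.0)  # reduces to eps_cond
strong = guided_eps(eps_cond, eps_uncond, w=3.0)       # extrapolates past eps_cond
```

A weight of 1.0 returns the conditional prediction unchanged, while larger weights push the sample further in the direction that separates the two predictions.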

Decomposition

  • Decompose image into 8 representations
  • Use title/description as image captions
  • Represent captions with sentence/word embeddings
  • Represent color statistics with CIELab histogram
  • Extract sketch of image with edge detection/sketch simplification
  • Extract instance masks with YOLOv5
  • Extract depth map with monocular depth estimation
  • Use raw grayscale (intensity) images and masked images as further representations
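
To make the color-statistics representation more concrete, the sketch below builds a normalized color histogram from a list of RGB pixels. It is only a stand-in for the representation described above: it quantizes in HLS using the standard-library `colorsys` module rather than in CIELab, it applies no smoothing, and the function name and bin counts are illustrative assumptions.

```python
import colorsys
from typing import Iterable, List, Tuple


def color_histogram(pixels: Iterable[Tuple[int, int, int]],
                    h_bins: int = 4, l_bins: int = 2, s_bins: int = 2) -> List[float]:
    """Return a flattened, normalized hue/lightness/saturation histogram."""
    hist = [0.0] * (h_bins * l_bins * s_bins)
    n = 0
    for r, g, b in pixels:
        h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
        # min(...) keeps a value of exactly 1.0 inside the last bin.
        hi = min(int(h * h_bins), h_bins - 1)
        li = min(int(l * l_bins), l_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        hist[(hi * l_bins + li) * s_bins + si] += 1.0
        n += 1
    if n == 0:
        raise ValueError("need at least one pixel")
    return [count / n for count in hist]


# A tiny "image": two pure-red pixels, one pure-blue pixel, one white pixel.
palette = color_histogram([(255, 0, 0), (255, 0, 0), (0, 0, 255), (255, 255, 255)])
```

The resulting vector carries no spatial information, which is what lets a palette from one image be paired with the layout of another.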

Composition

  • Use diffusion models to recompose images from representations
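  • Global conditioning: Project global representations (such as sentence embeddings and color histograms) and add them to the timestep embedding
  • Localized conditioning: Project spatial representations (such as sketches and depth maps) into uniform-dimensional embeddings that match the spatial size of the noisy latent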
  • Joint training strategy: Use independent dropout probability for each condition
  • Upscale images from 64x64 to 256x256 and then to 1024x1024 resolution with two upsampling models
  • Optional prior model to improve diversity of generated images
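
The joint training strategy above can be illustrated with a small condition-dropout routine. This is a sketch under stated assumptions: the function name, the use of `None` for a dropped condition, and the default probabilities (including the extra "drop all" and "keep all" cases) are illustrative choices, so the paper should be consulted for the settings actually used.

```python
import random
from typing import Dict, Optional


def drop_conditions(conditions: Dict[str, object],
                    rng: random.Random,
                    p_drop_each: float = 0.5,
                    p_drop_all: float = 0.1,
                    p_keep_all: float = 0.1) -> Dict[str, Optional[object]]:
    """Randomly null out conditions for one training example.

    A dropped condition is replaced by None; a real model would map it
    to a learned "missing" embedding or a zero tensor.
    """
    u = rng.random()
    if u < p_drop_all:
        return {name: None for name in conditions}
    if u < p_drop_all + p_keep_all:
        return dict(conditions)
    # Otherwise decide independently for every condition.
    return {name: (None if rng.random() < p_drop_each else value)
            for name, value in conditions.items()}


# Placeholder strings stand in for the actual condition tensors.
example = {"caption": "a red car", "sketch": "<sketch>", "depth": "<depth>", "palette": "<hist>"}
rng = random.Random(0)
batch = [drop_conditions(example, rng) for _ in range(1000)]
```

Because every subset of conditions appears during training, the same model can later be sampled with any combination of them present or absent.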

Experiments

Training details

  • Trained 2B parameter base model for conditional image generation at 64x64 resolution
  • Trained 1.1B parameter model for upscaling images to 256x256 resolution
  • Trained 300M parameter model for upscaling images to 1024x1024 resolution
  • Trained 1B parameter prior model for projecting captions to image embeddings
  • Used batch sizes of 4096, 1024, 512, and 512 for the prior, base, and two upsampling models, respectively
  • Trained on combination of public datasets, including ImageNet21K, WebVision, and LAION dataset
  • Eliminated duplicates, low resolution images, and images potentially containing harmful content from LAION dataset
  • Pretrained base model with 1M steps on full dataset using image embeddings as condition
  • Finetuned base model on subset of 60M examples (excluding LAION images with aesthetic scores below 7.0) for 200K steps with all conditions enabled
  • Prior and upsampling models trained for 1M steps on full dataset

Image manipulation

  • Composer can create new images that vary in certain aspects from a given image
  • Composer offers flexibility to control the scope of image variations
  • Composer yields more accurate reconstructions than unCLIP
  • Composer can blend two images for variations and control which elements to interpolate between them
  • Composer can manipulate an image through direct modification of one or more of its representations
  • Composer can restrict variations within an area defined by a masked image

Reformulation of traditional generation tasks

  • Traditional image generation and manipulation tasks can be reformulated using the Composer architecture
  • Two methods to colorize an image using Composer: one conditions the sampling process on both the grayscale version of the image and the palette, the other involves applying a reconfiguration
  • Style transfer: Composer disentangles content and style representations, allowing style of one image to be transferred to another
  • Image translation: Transform an image to a variant with content kept unchanged but style converted to match a target domain
  • Virtual try-on: Given a garment image and a body image, the sampling process is conditioned on the masked image and the CLIP image embedding of the garment image to produce a virtual try-on result

Compositional image generation

  • Composer can be conditioned on visual components from different sources.
  • This allows for a large number of generation results from a limited set of materials.
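
The size of this design space follows from simple counting: if each representation can be taken from any source image independently, the number of condition sets grows exponentially with the number of representations. The sketch below enumerates a toy case; the image names and pools are hypothetical, and the count ignores the additional option of leaving a condition out.

```python
from itertools import product

# Hypothetical pools of source images for three of the representations.
sources = {
    "sketch": ["img_a", "img_b"],
    "depth": ["img_a", "img_c"],
    "palette": ["img_b", "img_c", "img_d"],
}

# Every way of picking one source image per representation.
names = list(sources)
combinations = [dict(zip(names, choice)) for choice in product(*sources.values())]


def design_space_size(n_images: int, n_representations: int) -> int:
    """Number of condition sets when every slot is filled independently."""
    return n_images ** n_representations


toy_count = len(combinations)            # 2 * 2 * 3 choices
large_count = design_space_size(100, 8)  # 100 images, 8 representations
```

Even a modest pool of images therefore yields far more condition sets than could be sampled exhaustively.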

Text-to-image generation

  • Composer’s image generation quality was assessed by comparing it to state-of-the-art text-to-image generation models on the COCO dataset.
  • Sampling steps of 100, 50, and 20 were used for the prior, base, and 64 × 64 to 256 × 256 upsampling models respectively.
  • Guidance scale of 3.0 was used for the prior and base models.
  • Composer achieved a competitive FID score of 9.2 and a CLIP score of 0.28 on COCO, comparable to the best-performing models.

Related work
  • Diffusion models are successful for image generation
  • Diffusion models outperform GANs and are comparable to autoregressive models
  • Recent hierarchical diffusion models use one large diffusion model to produce small-resolution images and two smaller diffusion models to upscale the images
  • Composer supports composable conditions and has better flexibility and controllability than other methods

Conclusion and discussion

  • Decomposition-composition paradigm expands control space of generative models
  • Composer architecture can be used to reformulate traditional generative tasks
  • Composer can be used for image generation and manipulation tasks
  • Joint training on multiple conditions can reduce single-condition generation performance
  • Highlights potential risks associated with image generation models