Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Generative models can create incredible images but lack control.
This work offers a new paradigm that allows flexible control of the output image.
Compositionality is the core idea, decomposing an image into factors and training a diffusion model.
At inference, the representations work as composable elements, leading to a huge design space.
Supports various levels of conditions, such as text description, depth map, sketch, color histogram.
Improves controllability and serves as a general framework for classical generative tasks.

Paper Content

Introduction

Noam Chomsky proposed the idea of “The infinite use of finite means”
Generative image models can now produce photorealistic and diverse images
Recent works extend text-to-image models by introducing conditions such as segmentation maps, scene graphs, sketches, depthmaps, and inpainting masks
Key to controllable image generation relies on compositionality
Composer is a realization of compositional generative models
Composer is capable of producing new images from unseen combinations of representations
Composer can be used for text-to-image generation, style transfer, pose transfer, image translation, virtual try-on, interpolation, image variation, image reconfiguration, colorization, and more

Method

Image is divided into independent components
Conditional diffusion model used to reassemble components
Introduction to diffusion models and guidance directions
Implementation of image decomposition and composition explained

Diffusion models

Diffusion models are generative models that produce data from Gaussian noise
Mean-squared error is used as the denoising objective
Classifier-free guidance is used for conditional data sampling
Sampling algorithms such as DDIM and DPM-Solver are used to speed up the sampling process
DDIM can be used to reverse a sample back to its pure noise latent
Composer is a diffusion model that accepts multiple conditions
Bidirectional guidance can be used to manipulate images in a disentangled manner

Decomposition

Decompose image into 8 representations
Use title/description as image captions
Represent captions with sentence/word embeddings
Represent color statistics with CIELab histogram
Extract sketch of image with edge detection/sketch simplification
Extract instance masks with YOLOv5
Extract depthmap with monocular depth estimation
Introduce raw grayscale images and image masks

Composition

Use diffusion models to recompose images from representations
Global conditioning: Project and add representations to timestep embedding
Localized conditioning: Project representations into uniform-dimensional embeddings
Joint training strategy: Use independent dropout probability for each condition
Upscale images from 64x64 to 1024x1024 resolution
Optional prior model to improve diversity of generated images

Experiments

Training details

Trained 2B parameter base model for conditional image generation at 64x64 resolution
Trained 1.1B parameter model for upscaling images to 256x256 resolution
Trained 300M parameter model for upscaling images to 1024x1024 resolution
Trained 1B parameter prior model for projecting captions to image embeddings
Used batch sizes of 4096, 1024, 512, and 512 for prior, base, and two upsampling models
Trained on combination of public datasets, including ImageNet21K, WebVision, and LAION dataset
Eliminated duplicates, low resolution images, and images potentially containing harmful content from LAION dataset
Pretrained base model with 1M steps on full dataset using image embeddings as condition
Finetuned base model on subset of 60M examples (excluding LAION images with aesthetic scores below 7.0) for 200K steps with all conditions enabled
Prior and upsampling models trained for 1M steps on full dataset

Image manipulation

Composer can create new images that vary in certain aspects from a given image
Composer offers flexibility to control the scope of image variations
Composer yields more accurate reconstructions than unCLIP
Composer can blend two images for variations and control which elements to interpolate between them
Composer can manipulate an image through direct modification of one or more of its representations
Composer can restrict variations within an area defined by a masked image

Reformulation of traditional generation tasks

Traditional image generation and manipulation tasks can be reformulated using the Composer architecture
Two methods to colorize an image using Composer: one conditions the sampling process on both the grayscale version of the image and the palette, the other involves applying a reconfiguration
Style transfer: Composer disentangles content and style representations, allowing style of one image to be transferred to another
Image translation: Transform an image to a variant with content kept unchanged but style converted to match a target domain
Virtual try-on: Given a garment image and a body image, the sampling process is conditioned on the masked image and the CLIP image embedding of the garment image to produce a virtual try-on result

Compositional image generation

Composer can be conditioned on visual components from different sources.
This allows for a large number of generation results from a limited set of materials.

Text-to-image generation

Composer’s image generation quality was assessed by comparing it to state-of-the-art text-to-image generation models on the COCO dataset.
Sampling steps of 100, 50, and 20 were used for the prior, base, and 64 × 64 to 256 × 256 upsampling models respectively.
Guidance scale of 3.0 was used for the prior and base models.
Composer achieved a competitive FID score of 9.2 and a CLIP score of 0.28 on COCO, comparable to the best-performing models.

Diffusion models are successful for image generation
Diffusion models outperform GANs and are comparable to autoregressive models
Recent hierarchical diffusion models use one large diffusion model to produce small-resolution images and two smaller diffusion models to upscale the images
Composer supports composable conditions and has better flexibility and controllability than other methods

Conclusion and discussion

Decomposition-composition paradigm expands control space of generative models
Composer architecture can be used to reformulate traditional generative tasks
Composer can be used for image generation and manipulation tasks
Joint training of multiple conditions can downweight single-conditional generation performance
Potential risks associated with image generation models highlighted

Link to paper#

Abstract#

Paper Content#

Introduction#

Method#

Diffusion models#

Decomposition#

Composition#

Experiments#

Training details#

Image manipulation#

Reformulation of traditional generation tasks#

Compositional image generation#

Text-to-image generation#

Related work#

Conclusion and discussion#

Link to paper

Abstract

Paper Content

Introduction

Method

Diffusion models

Decomposition

Composition

Experiments

Training details

Image manipulation

Reformulation of traditional generation tasks

Compositional image generation

Text-to-image generation

Related work

Conclusion and discussion