Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Generative models can create incredible images but lack control.
- This work offers a new paradigm that allows flexible control of the output image.
- Compositionality is the core idea, decomposing an image into factors and training a diffusion model.
- At inference, the representations work as composable elements, leading to a huge design space.
- Supports various levels of conditions, such as text description, depth map, sketch, color histogram.
- Improves controllability and serves as a general framework for classical generative tasks.
Paper Content
Introduction
- Noam Chomsky proposed the idea of “The infinite use of finite means”
- Generative image models can now produce photorealistic and diverse images
- Recent works extend text-to-image models by introducing conditions such as segmentation maps, scene graphs, sketches, depthmaps, and inpainting masks
- Key to controllable image generation relies on compositionality
- Composer is a realization of compositional generative models
- Composer is capable of producing new images from unseen combinations of representations
- Composer can be used for text-to-image generation, style transfer, pose transfer, image translation, virtual try-on, interpolation, image variation, image reconfiguration, colorization, and more
Method
- Image is divided into independent components
- Conditional diffusion model used to reassemble components
- Introduction to diffusion models and guidance directions
- Implementation of image decomposition and composition explained
Diffusion models
- Diffusion models are generative models that produce data from Gaussian noise
- Mean-squared error is used as the denoising objective
- Classifier-free guidance is used for conditional data sampling
- Sampling algorithms such as DDIM and DPM-Solver are used to speed up the sampling process
- DDIM can be used to reverse a sample back to its pure noise latent
- Composer is a diffusion model that accepts multiple conditions
- Bidirectional guidance can be used to manipulate images in a disentangled manner
Decomposition
- Decompose image into 8 representations
- Use title/description as image captions
- Represent captions with sentence/word embeddings
- Represent color statistics with CIELab histogram
- Extract sketch of image with edge detection/sketch simplification
- Extract instance masks with YOLOv5
- Extract depthmap with monocular depth estimation
- Introduce raw grayscale images and image masks
Composition
- Use diffusion models to recompose images from representations
- Global conditioning: Project and add representations to timestep embedding
- Localized conditioning: Project representations into uniform-dimensional embeddings
- Joint training strategy: Use independent dropout probability for each condition
- Upscale images from 64x64 to 1024x1024 resolution
- Optional prior model to improve diversity of generated images
Experiments
Training details
- Trained 2B parameter base model for conditional image generation at 64x64 resolution
- Trained 1.1B parameter model for upscaling images to 256x256 resolution
- Trained 300M parameter model for upscaling images to 1024x1024 resolution
- Trained 1B parameter prior model for projecting captions to image embeddings
- Used batch sizes of 4096, 1024, 512, and 512 for prior, base, and two upsampling models
- Trained on combination of public datasets, including ImageNet21K, WebVision, and LAION dataset
- Eliminated duplicates, low resolution images, and images potentially containing harmful content from LAION dataset
- Pretrained base model with 1M steps on full dataset using image embeddings as condition
- Finetuned base model on subset of 60M examples (excluding LAION images with aesthetic scores below 7.0) for 200K steps with all conditions enabled
- Prior and upsampling models trained for 1M steps on full dataset
Image manipulation
- Composer can create new images that vary in certain aspects from a given image
- Composer offers flexibility to control the scope of image variations
- Composer yields more accurate reconstructions than unCLIP
- Composer can blend two images for variations and control which elements to interpolate between them
- Composer can manipulate an image through direct modification of one or more of its representations
- Composer can restrict variations within an area defined by a masked image
Reformulation of traditional generation tasks
- Traditional image generation and manipulation tasks can be reformulated using the Composer architecture
- Two methods to colorize an image using Composer: one conditions the sampling process on both the grayscale version of the image and the palette, the other involves applying a reconfiguration
- Style transfer: Composer disentangles content and style representations, allowing style of one image to be transferred to another
- Image translation: Transform an image to a variant with content kept unchanged but style converted to match a target domain
- Virtual try-on: Given a garment image and a body image, the sampling process is conditioned on the masked image and the CLIP image embedding of the garment image to produce a virtual try-on result
Compositional image generation
- Composer can be conditioned on visual components from different sources.
- This allows for a large number of generation results from a limited set of materials.
Text-to-image generation
- Composer’s image generation quality was assessed by comparing it to state-of-the-art text-to-image generation models on the COCO dataset.
- Sampling steps of 100, 50, and 20 were used for the prior, base, and 64 × 64 to 256 × 256 upsampling models respectively.
- Guidance scale of 3.0 was used for the prior and base models.
- Composer achieved a competitive FID score of 9.2 and a CLIP score of 0.28 on COCO, comparable to the best-performing models.
Related work
- Diffusion models are successful for image generation
- Diffusion models outperform GANs and are comparable to autoregressive models
- Recent hierarchical diffusion models use one large diffusion model to produce small-resolution images and two smaller diffusion models to upscale the images
- Composer supports composable conditions and has better flexibility and controllability than other methods
Conclusion and discussion
- Decomposition-composition paradigm expands control space of generative models
- Composer architecture can be used to reformulate traditional generative tasks
- Composer can be used for image generation and manipulation tasks
- Joint training of multiple conditions can downweight single-conditional generation performance
- Potential risks associated with image generation models highlighted