Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduces X&Fuse, a general approach for conditioning on visual information when generating images from text.
- Demonstrates potential of X&Fuse in three different text-to-image generation scenarios.
- Retrieve&Fuse results in significant improvements on MS-COCO benchmark, achieving state-of-the-art FID score of 6.65 in zero-shot settings.
- Crop&Fuse outperforms textual inversion method while being more than x100 faster.
- Scene&Fuse achieves FID score of 5.03 on MS-COCO in zero-shot settings.
Paper Content
Introduction
- Current state-of-the-art text-to-image generation diffusion models are restricted to textual inputs
- X&Fuse is a new general approach to utilize visual information on top of the textual information
- X&Fuse can fit various scenarios, process all of the visual information, enable full attention between different elements, and can be easily applied to pretrained text-to-image models
- X&Fuse yields state-of-the-art zero-shot FID results on the MS-COCO benchmark
- X&Fuse creates new opportunities for injecting visual cues when generating images from captions
Related work
- Text-to-image generation models are improving due to new datasets, models, and approaches
- These approaches often restrict the model to textual inputs
- This work focuses on using additional visual inputs
- Retrieval augmented diffusion models have been used to facilitate textless training
- Subject-driven generation uses textual inversion to generate images with properties from a set of images
- Scene-based generation enables the user to specify the scene of the generated image
Method
Modeling
- U-Net architecture receives two inputs: text embeddings and a noised image
- Model receives an additional image (conditioned image) to fuse into the generated image
- Weights that process the conditioned image and noised image are shared
Advantages
- X&Fuse does not assume spatial correspondence between conditioned and generated images
- X&Fuse preserves identity in subject-driven generation
- X&Fuse can be easily adjusted to new distributions
- X&Fuse does not require any new weights
Vanilla text-to-image generation
Evaluation
- Collected 5k images from MS-COCO validation set
- Used segmentation model to identify objects in images
- Extracted object with highest confidence to create dataset
- Leveraged in-context learning abilities of language models to create captions
- Filtered images with people and subjects that are too small or too large
- Used FID, CLIP-Score and human evaluation to measure results
- Used vanilla Stable Diffusion (SD) model, SD (Cont) and textual inversion method as baselines
Retrieve&fuse
- Retrieve&Fuse approach is a special case of X&Fuse
- Retrieve conditioned image from a bank of images
- Semi-parametric approach
- Makes use of a large database of text-image pairs
- Condition model on RGB image of the scene
- Create scene image by applying segmentation model on ground truth image
- Add predicted textual label to caption and color image according to predicted mask
- Choose color according to order of objects in caption
- Generate any object at inference time
- Compare to Make-A-Scene
- Evaluate on MS-COCO benchmark
- Use 30K-FID and CLIP-Score as automatic metrics
- Use hyperparameters and settings from Sec. 4.2
Results
- Our model is able to get a higher CLIP-Score between the generated image and the subject image.
- Crop&Fuse FID score is lower than the baseline.
- Human raters prefer X&Fuse by a large margin in all parameters.
Ablation study
- Alternative approaches for conditioning on visual data are tested
- Retrieve&Channel and Retrieve&CLIP are trained with the same settings as Retrieve&Fuse
- Non-trainable alternatives Retrieve&Null and Retrieve&Init lead to degraded performance
- Increasing index size improves FID score
- Removing ground truth conditioning yields slightly worse performance
- Using an image-based index improves CLIP-Score but drops FID
Subject-driven generation
Crop&fuse
- Self-supervised scheme used to train model on subject-driven generation
- Augmentations applied to crop to encourage model to use input caption when reconstructing augmented object
- Hyperparameters and settings described in section 5.4
- Augmentations include random affine transformation, scaling factor between 0.2 and 4, translation factor of 0.3, and degree range of 0-180
Analysis
- Training encourages the model to change the subject according to the prompt.
- The more the subject is augmented, the harder it is to reconstruct it.
- At inference time, the scale value can be changed to preserve fidelity to either the caption or the subject.
Scene-based generation
- Model receives additional input to generate an image
- Goal is to show versatility and simplicity of X&Fuse approach, not achieve state-of-the-art results
Conclusions
- Introduced a new approach, X&Fuse, for conditioning on visual information in text-to-image
- Compared X&Fuse to strong baselines and modeling alternatives
- Experimented with three different scenarios
- X&Fuse set a new state-of-the-art FID result in text-to-image
- Showed impressive performance regardless of the scenario
- Provided an appealing option for other scenarios that may benefit from additional visual information