Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introduces X&Fuse, a general approach for conditioning on visual information when generating images from text.
Demonstrates potential of X&Fuse in three different text-to-image generation scenarios.
Retrieve&Fuse results in significant improvements on MS-COCO benchmark, achieving state-of-the-art FID score of 6.65 in zero-shot settings.
Crop&Fuse outperforms the textual inversion method while being more than 100x faster.
Scene&Fuse achieves FID score of 5.03 on MS-COCO in zero-shot settings.

Paper Content

Introduction

Current state-of-the-art text-to-image generation diffusion models are restricted to textual inputs
X&Fuse is a new general approach to utilize visual information on top of the textual information
X&Fuse can fit various scenarios, process all of the visual information, enable full attention between different elements, and can be easily applied to pretrained text-to-image models
X&Fuse yields state-of-the-art zero-shot FID results on the MS-COCO benchmark
X&Fuse creates new opportunities for injecting visual cues when generating images from captions

Text-to-image generation models are improving due to new datasets, models, and approaches
These approaches often restrict the model to textual inputs
This work focuses on using additional visual inputs
Retrieval augmented diffusion models have been used to facilitate textless training
Subject-driven generation uses textual inversion to generate images with properties from a set of images
Scene-based generation enables the user to specify the scene of the generated image

Method

Modeling

U-Net architecture receives two inputs: text embeddings and a noised image
Model receives an additional image (conditioned image) to fuse into the generated image
Weights that process the conditioned image and noised image are shared
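The fusion step can be illustrated with a small sketch. The notes above do not spell out the mechanism, so this assumes that features of the conditioned image are concatenated with features of the noised image along the token axis before self-attention, that the same projection weights are applied to both, and that only the noised-image positions are kept afterwards. The function and variable names (`fused_self_attention`, `w_q`, `w_k`, `w_v`) are hypothetical, and NumPy stands in for the real U-Net layers.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fused_self_attention(noised_tokens, cond_tokens, w_q, w_k, w_v):
    """Self-attention over the concatenation of noised and conditioned tokens.

    noised_tokens: (n, d) features of the noised image.
    cond_tokens:   (m, d) features of the conditioned image.
    w_q, w_k, w_v: (d, d) projections, shared by both images (assumption).
    Returns (n, d): updated features for the noised image only.
    """
    n = noised_tokens.shape[0]
    # Every token can attend to every other token, across both images.
    tokens = np.concatenate([noised_tokens, cond_tokens], axis=0)  # (n + m, d)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v
    # Keep only the positions that belong to the noised image.
    return out[:n]


rng = np.random.default_rng(0)
d = 8
noised = rng.normal(size=(16, d))
cond = rng.normal(size=(16, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
fused = fused_self_attention(noised, cond, w_q, w_k, w_v)
```

Because the two images only interact through attention, this sketch needs no spatial alignment between them and introduces no new parameters, which is consistent with the advantages listed for X&Fuse.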

Advantages

X&Fuse does not assume spatial correspondence between conditioned and generated images
X&Fuse preserves identity in subject-driven generation
X&Fuse can be easily adjusted to new distributions
X&Fuse does not require any new weights

Vanilla text-to-image generation

Evaluation

Collected 5k images from MS-COCO validation set
Used segmentation model to identify objects in images
Extracted object with highest confidence to create dataset
Leveraged in-context learning abilities of language models to create captions
Filtered images with people and subjects that are too small or too large
Used FID, CLIP-Score and human evaluation to measure results
Used vanilla Stable Diffusion (SD) model, SD (Cont) and textual inversion method as baselines
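As a rough illustration of the CLIP-Score metric mentioned above, the sketch below averages the cosine similarity between pairs of embeddings. It assumes the embeddings have already been computed by a CLIP encoder, which is not shown. The exact scaling convention used in the paper is not stated in these notes, so the plain cosine similarity here is an assumption, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_clip_score(pairs):
    """Average cosine similarity over (embedding_a, embedding_b) pairs."""
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

The same function covers both uses in this summary: caption-to-image similarity, and the similarity between a generated image and the subject image.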

Retrieve&Fuse

Retrieve&Fuse approach is a special case of X&Fuse
Retrieve conditioned image from a bank of images
Semi-parametric approach
Makes use of a large database of text-image pairs
Condition model on RGB image of the scene
Create scene image by applying segmentation model on ground truth image
Add predicted textual label to caption and color image according to predicted mask
Choose color according to order of objects in caption
Generate any object at inference time
Compare to Make-A-Scene
Evaluate on MS-COCO benchmark
Use 30K-FID and CLIP-Score as automatic metrics
Use hyperparameters and settings from Sec. 4.2
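The retrieval step (picking a conditioned image from a bank of images) can be sketched as a nearest-neighbor lookup over embeddings. This is a minimal sketch, not the paper's implementation: it assumes the bank and the query are already embedded in a shared space and that cosine similarity is the ranking criterion. `retrieve_nearest` and the toy vectors are hypothetical.

```python
import numpy as np


def retrieve_nearest(query, bank, k=1):
    """Indices of the k bank embeddings most similar to the query (cosine).

    query: (d,) embedding of the input, e.g. the caption (assumption).
    bank:  (n, d) embeddings of the indexed text-image pairs.
    """
    bank_unit = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = bank_unit @ query_unit
    # Sort by descending similarity and keep the top k.
    return np.argsort(-sims)[:k]


bank = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.9, 0.1])
top = retrieve_nearest(query, bank, k=2)  # indices 0 and 2 for this toy bank
```

A real index with many entries would typically use an approximate nearest-neighbor library rather than a full matrix product, but the ranking logic is the same.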

Results

Our model is able to get a higher CLIP-Score between the generated image and the subject image.
Crop&Fuse FID score is lower than the baseline.
Human raters prefer X&Fuse by a large margin on all evaluated criteria.

Ablation study

Alternative approaches for conditioning on visual data are tested
Retrieve&Channel and Retrieve&CLIP are trained with the same settings as Retrieve&Fuse
Non-trainable alternatives Retrieve&Null and Retrieve&Init lead to degraded performance
Increasing index size improves FID score
Removing ground truth conditioning yields slightly worse performance
Using an image-based index improves CLIP-Score but drops FID

Subject-driven generation

Crop&Fuse

Self-supervised scheme used to train model on subject-driven generation
Augmentations applied to crop to encourage model to use input caption when reconstructing augmented object
Hyperparameters and settings described in section 5.4
Augmentations include a random affine transformation with a scaling factor between 0.2 and 4, a translation factor of 0.3, and a rotation range of 0 to 180 degrees
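To make the augmentation ranges concrete, the sketch below samples one affine transform using the values listed above. It assumes uniform sampling and that the translation factor is a fraction of the image size applied independently on each axis. These conventions resemble torchvision's RandomAffine, but the notes do not say which library was used. `sample_affine` is a hypothetical helper that only builds the 2x3 matrix and does not warp an image.

```python
import math
import random


def sample_affine(rng, scale_range=(0.2, 4.0), translate=0.3, degrees=(0.0, 180.0)):
    """Sample a 2x3 affine matrix (rotation, uniform scale, translation)."""
    scale = rng.uniform(*scale_range)
    angle = math.radians(rng.uniform(*degrees))
    # Translation as a fraction of the image size, in [-translate, translate].
    tx = rng.uniform(-translate, translate)
    ty = rng.uniform(-translate, translate)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [
        [scale * cos_a, -scale * sin_a, tx],
        [scale * sin_a, scale * cos_a, ty],
    ]


matrix = sample_affine(random.Random(0))
```

The wide scale range (from a fifth of the original size up to four times) means the augmented crop often looks quite different from the target, which fits the stated goal of pushing the model to rely on the caption.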

Analysis

Training encourages the model to change the subject according to the prompt.
The more the subject is augmented, the harder it is to reconstruct it.
At inference time, the scale value can be changed to preserve fidelity to either the caption or the subject.

Scene-based generation

Model receives additional input to generate an image
Goal is to show versatility and simplicity of X&Fuse approach, not achieve state-of-the-art results
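The scene image described earlier (segmentation masks painted with a color chosen by the order in which objects appear in the caption) could be built roughly as follows. This is a sketch under stated assumptions: masks are given as nested lists of 0/1 values keyed by label, labels appear verbatim in the caption, and later objects overwrite earlier ones where masks overlap. `PALETTE` and `build_scene_image` are hypothetical, and the colors are arbitrary.

```python
PALETTE = [(230, 25, 75), (60, 180, 75), (0, 130, 200), (255, 225, 25)]  # arbitrary colors


def build_scene_image(caption, masks, background=(0, 0, 0)):
    """Color each object's mask by the order its label appears in the caption.

    masks: dict mapping a label to a 2D list of 0/1 values, all the same size.
    Returns a 2D list of RGB tuples.
    """
    present = [label for label in masks if label in caption]
    # The first object mentioned gets the first color, and so on.
    ordered = sorted(present, key=caption.index)
    first_mask = next(iter(masks.values()))
    height, width = len(first_mask), len(first_mask[0])
    scene = [[background] * width for _ in range(height)]
    for rank, label in enumerate(ordered):
        color = PALETTE[rank % len(PALETTE)]
        for y in range(height):
            for x in range(width):
                if masks[label][y][x]:
                    scene[y][x] = color
    return scene


masks = {
    "dog": [[1, 0], [0, 0]],
    "cat": [[0, 1], [0, 0]],
}
scene = build_scene_image("a cat next to a dog", masks)
```

Tying colors to caption order rather than to a fixed class list is what allows any object to be generated at inference time: the model only has to associate "first color" with "first object named in the caption".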

Conclusions

Introduced a new approach, X&Fuse, for conditioning on visual information in text-to-image generation
Compared X&Fuse to strong baselines and modeling alternatives
Experimented with three different scenarios
X&Fuse set a new state-of-the-art zero-shot FID result in text-to-image generation on MS-COCO
Showed impressive performance regardless of the scenario
Provided an appealing option for other scenarios that may benefit from additional visual information
