Abstract

  • Introduces X&Fuse, a general approach for conditioning on visual information when generating images from text.
  • Demonstrates potential of X&Fuse in three different text-to-image generation scenarios.
  • Retrieve&Fuse results in significant improvements on MS-COCO benchmark, achieving state-of-the-art FID score of 6.65 in zero-shot settings.
  • Crop&Fuse outperforms the textual inversion method while being more than 100x faster.
  • Scene&Fuse achieves FID score of 5.03 on MS-COCO in zero-shot settings.

Paper Content

Introduction

  • Current state-of-the-art text-to-image generation diffusion models are restricted to textual inputs
  • X&Fuse is a new general approach to utilize visual information on top of the textual information
  • X&Fuse can fit various scenarios, process all of the visual information, enable full attention between different elements, and can be easily applied to pretrained text-to-image models
  • X&Fuse yields state-of-the-art zero-shot FID results on the MS-COCO benchmark
  • X&Fuse creates new opportunities for injecting visual cues when generating images from captions
  • Text-to-image generation models are improving due to new datasets, models, and approaches
  • These approaches often restrict the model to textual inputs
  • This work focuses on using additional visual inputs
  • Retrieval augmented diffusion models have been used to facilitate textless training
  • Subject-driven generation uses textual inversion to generate images with properties from a set of images
  • Scene-based generation enables the user to specify the scene of the generated image

Method

Modeling

  • U-Net architecture receives two inputs: text embeddings and a noised image
  • Model receives an additional image (conditioned image) to fuse into the generated image
  • Weights that process the conditioned image and noised image are shared
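
One way to picture the fusion described above is a single attention layer in which both images are projected with the same weights and the noised-image queries attend over the tokens of both images. The sketch below is a pure-Python toy under that assumption. The function names (`project`, `fused_attention`), the single-head layout, and the tiny dimensions are illustrative and are not taken from the paper.

```python
import math


def project(tokens, weight):
    """Multiply each token (list of floats) by a weight matrix (d_in x d_out)."""
    return [
        [sum(t[i] * weight[i][j] for i in range(len(t))) for j in range(len(weight[0]))]
        for t in tokens
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fused_attention(noised_tokens, cond_tokens, w_q, w_k, w_v):
    """Single-head attention where noised-image queries attend over both images.

    Assumption: the same w_q / w_k / w_v are applied to both token sets
    (shared weights), and keys/values of the two images are concatenated so
    every noised token can attend to every conditioned token.
    """
    queries = project(noised_tokens, w_q)
    all_tokens = noised_tokens + cond_tokens  # concatenate along the sequence axis
    keys = project(all_tokens, w_k)
    values = project(all_tokens, w_v)
    scale = 1.0 / math.sqrt(len(keys[0]))

    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append(
            [sum(a * v[j] for a, v in zip(attn, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


identity = [[1.0, 0.0], [0.0, 1.0]]            # toy projection weights
noised = [[1.0, 0.0], [0.0, 1.0]]              # 2 tokens from the noised image
cond = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]   # 3 tokens from the conditioned image
out, attn = fused_attention(noised, cond, identity, identity, identity)
print(len(out), len(attn[0]))  # number of output tokens, number of keys attended over
```

Because the conditioned image only enters through extra keys and values, a layer like this needs no new parameters and does not require the two images to be spatially aligned.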

Advantages

  • X&Fuse does not assume spatial correspondence between conditioned and generated images
  • X&Fuse preserves identity in subject-driven generation
  • X&Fuse can be easily adjusted to new distributions
  • X&Fuse does not require any new weights

Vanilla text-to-image generation

Evaluation

  • Collected 5k images from MS-COCO validation set
  • Used segmentation model to identify objects in images
  • Extracted object with highest confidence to create dataset
  • Leveraged in-context learning abilities of language models to create captions
  • Filtered images with people and subjects that are too small or too large
  • Used FID, CLIP-Score and human evaluation to measure results
  • Used vanilla Stable Diffusion (SD) model, SD (Cont) and textual inversion method as baselines
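
The subject-selection steps listed above can be sketched as a small filter. Everything concrete here is an assumption made for illustration: the `Detection` record, the `person` label, and the area thresholds `min_frac` and `max_frac` are hypothetical, since this summary does not state the values that were used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    label: str         # predicted class name from the segmentation model
    confidence: float  # model confidence in [0, 1]
    area_frac: float   # fraction of the image covered by the object's mask


def pick_subject(detections: List[Detection],
                 min_frac: float = 0.05,
                 max_frac: float = 0.60) -> Optional[Detection]:
    """Return the highest-confidence object, or None if the image is filtered out."""
    if not detections:
        return None
    # Assumption: an image is dropped if any detected object is a person.
    if any(d.label == "person" for d in detections):
        return None
    best = max(detections, key=lambda d: d.confidence)
    # Drop subjects that are too small or too large (thresholds are hypothetical).
    if not (min_frac <= best.area_frac <= max_frac):
        return None
    return best


kept = pick_subject([Detection("dog", 0.92, 0.30), Detection("bench", 0.71, 0.20)])
dropped = pick_subject([Detection("person", 0.99, 0.40), Detection("dog", 0.80, 0.25)])
print(kept.label if kept else None, dropped)
```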

Retrieve&Fuse

  • Retrieve&Fuse approach is a special case of X&Fuse
  • Retrieve conditioned image from a bank of images
  • Semi-parametric approach
  • Makes use of a large database of text-image pairs
  • Condition model on RGB image of the scene
  • Create scene image by applying segmentation model on ground truth image
  • Add predicted textual label to caption and color image according to predicted mask
  • Choose color according to order of objects in caption
  • Generate any object at inference time
  • Compare to Make-A-Scene
  • Evaluate on MS-COCO benchmark
  • Use 30K-FID and CLIP-Score as automatic metrics
  • Use hyperparameters and settings from Sec. 4.2
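
The retrieval step at the top of this list (fetching a conditioned image from a bank of text-image pairs) can be sketched as a nearest-neighbor lookup. The sketch assumes the query caption and the bank entries are already embedded as vectors and compares them by cosine similarity. This summary does not say which encoder or similarity measure the paper uses, so `retrieve`, the toy embeddings, and the file names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, bank, k=1):
    """Return the names of the k bank entries most similar to the query embedding."""
    ranked = sorted(bank, key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy index: (image name, embedding). Real embeddings would come from an encoder.
bank = [
    ("dog.jpg", [0.9, 0.1, 0.0]),
    ("car.jpg", [0.0, 1.0, 0.2]),
    ("cat.jpg", [0.7, 0.0, 0.7]),
]
query = [1.0, 0.0, 0.1]  # hypothetical embedding of the input caption
print(retrieve(query, bank, k=2))  # ['dog.jpg', 'cat.jpg']
```

A linear scan like this is only practical for a toy bank; a large index would normally use an approximate nearest-neighbor structure instead.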

Results

  • The X&Fuse model achieves a higher CLIP-Score between the generated image and the subject image.
  • Crop&Fuse achieves a lower (better) FID score than the baseline.
  • Human raters prefer X&Fuse by a large margin in all parameters.

Ablation study

  • Alternative approaches for conditioning on visual data are tested
  • Retrieve&Channel and Retrieve&CLIP are trained with the same settings as Retrieve&Fuse
  • Non-trainable alternatives Retrieve&Null and Retrieve&Init lead to degraded performance
  • Increasing index size improves FID score
  • Removing ground truth conditioning yields slightly worse performance
  • Using an image-based index improves CLIP-Score but drops FID

Subject-driven generation

Crop&Fuse

  • Self-supervised scheme used to train model on subject-driven generation
  • Augmentations applied to crop to encourage model to use input caption when reconstructing augmented object
  • Hyperparameters and settings described in section 5.4
  • Augmentations are random affine transformations with a scaling factor between 0.2 and 4, a translation factor of 0.3, and a rotation range of 0-180 degrees
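
The augmentation ranges above can be sketched as a parameter sampler plus a 2x3 affine matrix. The interpretation is an assumption: the translation factor is treated as a maximum shift expressed as a fraction of the image size, as in common image-augmentation libraries, and rotation is taken about the origin for simplicity. The function names are hypothetical.

```python
import math
import random


def sample_affine_params(rng, scale_range=(0.2, 4.0), translate=0.3, degrees=(0.0, 180.0)):
    """Draw one random affine augmentation from the stated ranges."""
    return {
        "angle": rng.uniform(*degrees),            # rotation in degrees
        "scale": rng.uniform(*scale_range),        # isotropic scaling factor
        "tx": rng.uniform(-translate, translate),  # shift as a fraction of width
        "ty": rng.uniform(-translate, translate),  # shift as a fraction of height
    }


def affine_matrix(params, width, height):
    """Build a 2x3 matrix: rotate and scale about the origin, then translate."""
    theta = math.radians(params["angle"])
    s = params["scale"]
    return [
        [s * math.cos(theta), -s * math.sin(theta), params["tx"] * width],
        [s * math.sin(theta), s * math.cos(theta), params["ty"] * height],
    ]


def apply_affine(matrix, x, y):
    """Map a single point (x, y) through the 2x3 affine matrix."""
    return (
        matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
        matrix[1][0] * x + matrix[1][1] * y + matrix[1][2],
    )


rng = random.Random(0)  # seeded so the sketch is reproducible
params = sample_affine_params(rng)
matrix = affine_matrix(params, width=64, height=64)
print(params)
print(apply_affine(matrix, 10.0, 0.0))
```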

Analysis

  • Training encourages the model to change the subject according to the prompt.
  • The more the subject is augmented, the harder it is to reconstruct it.
  • At inference time, the scale value can be changed to preserve fidelity to either the caption or the subject.

Scene-based generation

  • Model receives additional input to generate an image
  • Goal is to show versatility and simplicity of X&Fuse approach, not achieve state-of-the-art results
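
A scene image of the kind this summary describes (a segmentation mask colored by the order in which objects appear in the caption) could be built along these lines. The palette, the mask format, and the `color_scene` helper are hypothetical; the summary does not give the actual colors or data layout.

```python
# Hypothetical palette: the i-th object mentioned in the caption gets PALETTE[i].
PALETTE = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]
BACKGROUND = (0, 0, 0)


def color_scene(mask, caption):
    """Turn a label mask into an RGB scene image.

    mask: 2D list whose cells hold a predicted label string, or None for background.
    caption: text that already contains the predicted labels.
    Simplification: labels are located by plain substring search in the caption.
    """
    text = caption.lower()
    labels = {cell for row in mask for cell in row if cell is not None}
    present = [label for label in labels if label in text]
    ordered = sorted(present, key=text.index)  # order of first mention in the caption
    color_of = {label: PALETTE[i % len(PALETTE)] for i, label in enumerate(ordered)}
    return [[color_of.get(cell, BACKGROUND) for cell in row] for row in mask]


mask = [
    ["dog", None],
    ["bench", "bench"],
]
scene = color_scene(mask, "A dog sitting next to a bench")
print(scene)
```

Tying the color to caption order rather than to a fixed class list is what lets the same scheme cover object labels that were not enumerated in advance.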

Conclusions

  • Introduced a new approach, X&Fuse, for conditioning on visual information in text-to-image generation
  • Compared X&Fuse to strong baselines and modeling alternatives
  • Experimented with three different scenarios
  • X&Fuse set a new state-of-the-art FID result in text-to-image generation
  • Showed impressive performance regardless of the scenario
  • Provided an appealing option for other scenarios that may benefit from additional visual information