Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposed a learning-based encoder for fast and accurate concept customization
  • Consists of global and local mapping networks
  • Global mapping network projects hierarchical features of image into multiple “new” words in textual word embedding space
  • Local mapping network injects encoded patch features into cross attention layers to provide omitted details
  • Compares method with prior optimization-based approaches on user-defined concepts
  • Demonstrates faster encoding process with more high-fidelity inversion and robust editability

Paper Content

Introduction

  • Large-scale diffusion models demonstrate impressive superiority in text-to-image generation
  • Applied to various tasks such as image editing, data augmentation, and artistic creation
  • Customized text-to-image generation aims to learn a specific concept from a small set of user-provided images
  • Existing methods usually adopt the per-concept optimization formulation, which requires several or tens of minutes to learn a single concept
  • Proposed a learning-based encoder to encode visual concepts into textual embeddings
  • Global mapping network maps CLIP image features into the textual word embedding space
  • Local mapping network encodes CLIP features into the textual feature space
  • Experiments demonstrate that the method can encode the target concept efficiently and faithfully

Text-to-image generation

  • Deep generative models have been successful in text-conditioned image generation
  • Models can be categorized into three groups: GAN-based, VAE-based, and diffusion-based
  • Diffusion-based models demonstrate high-quality and controllable imaginary generation
  • GLIDE introduces diffusion models into text-to-image generation
  • Diffusion models struggle to express specific or user-defined concepts

Gan inversion

  • GAN inversion is a way to project real images into latent codes
  • There are two types of GAN inversion algorithms: optimization-based and encoder-based
  • Optimization-based methods require many iterations, while encoder-based methods only require one feed-forward pass
  • ELITE proposes a local mapping network to improve details consistency

Diffusion-based inversion

  • Text-to-image diffusion models can be inverted in two types of latent spaces: Textual Word Embedding (TWE) space and Imagebased Noise Map (INM) space.
  • Existing inversion methods in the TWE space use an optimization-based formulation.
  • These methods require multiple user-provided images and many iterations to learn a new concept.
  • Our method uses an encoder-based formulation to accelerate the process and learn a new concept with one single image.

Proposed method

Preliminary

  • Stable Diffusion is a text-to-image model
  • It consists of an autoencoder and a conditional diffusion model
  • Autoencoder maps an image to a lower dimensional latent space
  • Conditional diffusion model generates latent codes based on text condition
  • Cross attention is used to incorporate text information in image synthesis

Global mapping

  • We choose the textual word embedding space of CLIP text encoder as the target for inversion.
  • We propose a global mapping network to encode the given concept image into word embeddings.
  • We introduce a pseudo-word S* to represent the new concept and associate its embedding with v.
  • We use the word embedding of the deepest feature during local training and image generation stages.

Local mapping

  • Single word embedding is not enough to accurately describe a concept
  • Multiple word embeddings can reduce editability
  • Local mapping network encodes multi-layer CLIP features into textual feature space
  • Cross attention module adds local information to generation
  • Attention map is reweighted to focus on object region
  • Evaluation metrics include Image-alignment, Text-alignment, and KID
  • Optimization time is also used as a metric

Ablation study

  • Conducted ablation studies to evaluate effects of components in method
  • Visualized words learned by multi-layer features in global mapping network
  • Experimented with several variants to demonstrate editability
  • Local mapping network improves consistency with concept image without compromising editability
  • λ controls fusion of information from global and local mapping networks
  • λ set to 0.6 for editing prompts and 0.8 for generating prompts

Qualitative results

  • ELITE is compared to existing optimization-based methods
  • ELITE is capable of faithfully capturing details of target concept and generating diverse images
  • ELITE has superior editing performance compared to existing methods

Quantitative results

  • ELITE achieved better text-alignment than state-of-the-art methods
  • ELITE achieved comparable detail consistency and image quality
  • ELITE was significantly faster than optimization-based methods
  • User study showed ELITE was preferred over competing methods

Conclusion

  • Propose a novel learning-based encoder for text-to-image synthesis
  • Reduces computational and memory burden of learning new concepts
  • Superior flexibility in editing learned concepts into new scenes
  • Future work to leverage multiple concept images and compose multiple concepts
  • Figure 1: given an image, learn a pseudo-word to represent target concept
  • Figure 4: visual comparisons of concept generation
  • Figure 5: visualization of learned word embeddings
  • Figure 6: visual comparisons of different variants
  • Figure 7: visual comparisons of effect of value of λ
  • Figure 8: visual comparisons of concept editing
  • Ablation and quantitative comparisons with existing methods
  • User study