Abstract
- Proposed a learning-based encoder for fast and accurate concept customization
- Consists of global and local mapping networks
- Global mapping network projects hierarchical features of image into multiple “new” words in textual word embedding space
- Local mapping network injects encoded patch features into cross attention layers to provide omitted details
- Compares method with prior optimization-based approaches on user-defined concepts
- Demonstrates a faster encoding process with higher-fidelity inversion and more robust editability
Paper Content
Introduction
- Large-scale diffusion models achieve impressive results in text-to-image generation
- Applied to various tasks such as image editing, data augmentation, and artistic creation
- Customized text-to-image generation aims to learn a specific concept from a small set of user-provided images
- Existing methods usually adopt a per-concept optimization formulation, which takes from several minutes to tens of minutes to learn a single concept
- Proposed a learning-based encoder to encode visual concepts into textual embeddings
- Global mapping network maps CLIP image features into the textual word embedding space
- Local mapping network encodes CLIP features into the textual feature space
- Experiments demonstrate that the method can encode the target concept efficiently and faithfully
Related work
Text-to-image generation
- Deep generative models have been successful in text-conditioned image generation
- Models can be categorized into three groups: GAN-based, VAE-based, and diffusion-based
- Diffusion-based models demonstrate high-quality and controllable image generation
- GLIDE introduces diffusion models into text-to-image generation
- Diffusion models struggle to express specific or user-defined concepts
GAN inversion
- GAN inversion is a way to project real images into latent codes
- There are two types of GAN inversion algorithms: optimization-based and encoder-based
- Optimization-based methods require many iterations, while encoder-based methods only require one feed-forward pass
- ELITE proposes a local mapping network to improve detail consistency
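The contrast between the two inversion styles can be illustrated with a toy example. This is a minimal sketch, assuming a hypothetical scalar "generator" `generator(z) = 2z + 1`; it is not a real GAN, and the function names are made up for illustration. The optimization route needs many gradient steps per input, while the encoder route is a single function call.

```python
def generator(z):
    """Toy stand-in for a generator: maps a scalar latent to a scalar 'image'."""
    return 2.0 * z + 1.0


def invert_by_optimization(x, steps=200, lr=0.05):
    """Recover z by gradient descent on the reconstruction error (g(z) - x)^2."""
    z = 0.0
    for _ in range(steps):
        # d/dz of (2z + 1 - x)^2 is 2 * (g(z) - x) * 2
        grad = 2.0 * (generator(z) - x) * 2.0
        z -= lr * grad
    return z


def invert_by_encoder(x):
    """Recover z in one feed-forward pass.

    A trained encoder would approximate this map; here the exact inverse
    of the toy generator is used as a placeholder.
    """
    return (x - 1.0) / 2.0


target = generator(1.5)
z_opt = invert_by_optimization(target)  # 200 iterations
z_enc = invert_by_encoder(target)       # one pass
```

Both routes land on roughly the same latent here; the difference that matters is the per-input cost.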
Diffusion-based inversion
- Text-to-image diffusion models can be inverted in two types of latent spaces: Textual Word Embedding (TWE) space and Image-based Noise Map (INM) space.
- Existing inversion methods in the TWE space use an optimization-based formulation.
- These methods require multiple user-provided images and many iterations to learn a new concept.
- Our method uses an encoder-based formulation to accelerate the process and learn a new concept from a single image.
Proposed method
Preliminary
- Stable Diffusion is a text-to-image model
- It consists of an autoencoder and a conditional diffusion model
- Autoencoder maps an image to a lower dimensional latent space
- Conditional diffusion model generates latent codes based on text condition
- Cross attention is used to incorporate text information in image synthesis
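The cross-attention step mentioned above can be sketched in a few lines. This is a minimal single-head sketch in plain Python, assuming the queries (from image latent positions) and the keys and values (from text tokens) have already been linearly projected; the learned projection matrices, multi-head splitting, and the toy inputs are all omitted or made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cross_attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v.

    q: one row per image position; k, v: one row per text token.
    Returns the attended features and the attention weights.
    """
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy inputs: 2 image positions, 3 text tokens, feature size 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = cross_attention(queries, keys, values)
```

Each row of `weights` says how strongly one image position attends to each text token, which is the quantity the later attention-reweighting step builds on.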
Global mapping
- We choose the textual word embedding space of CLIP text encoder as the target for inversion.
- We propose a global mapping network to encode the given concept image into word embeddings.
- We introduce a pseudo-word S* to represent the new concept; its word embedding v is the one predicted by the global mapping network.
- We use the word embedding of the deepest feature during local training and image generation stages.
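The idea of mapping image features to word embeddings and then substituting them for the pseudo-word can be sketched as follows. This is a minimal sketch under several assumptions: one pooled feature vector per CLIP layer is given as a plain list, a single linear layer per feature level stands in for the learned mapping, index 0 is taken to be the deepest feature, and the tiny vocabulary and all numbers are made up for illustration.

```python
def linear(x, weight, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def global_mapping(layer_features, layer_params):
    """Map one pooled feature vector per layer to one word embedding each."""
    return [linear(f, w, b) for f, (w, b) in zip(layer_features, layer_params)]


def embed_prompt(tokens, vocab, pseudo_word, pseudo_embedding):
    """Look up word embeddings, substituting the predicted one for the pseudo-word."""
    return [pseudo_embedding if t == pseudo_word else vocab[t] for t in tokens]


# Hypothetical pooled features from two layers (feature size 3).
layer_features = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
# Hypothetical per-layer parameters mapping to embedding size 2.
layer_params = [
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]),
    ([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]], [0.0, 1.0]),
]
words = global_mapping(layer_features, layer_params)

vocab = {"a": [0.1, 0.2], "photo": [0.3, 0.4], "of": [0.5, 0.6]}
prompt = ["a", "photo", "of", "S*"]
# Only the embedding from the deepest feature (index 0 here) replaces S*.
embedded = embed_prompt(prompt, vocab, "S*", words[0])
```

The resulting list of embeddings would then go through the text encoder as usual; no per-concept optimization is involved at this point.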
Local mapping
- Single word embedding is not enough to accurately describe a concept
- Multiple word embeddings can reduce editability
- Local mapping network encodes multi-layer CLIP features into textual feature space
- Cross attention module adds local information to generation
- Attention map is reweighted to focus on object region
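The reweighting step can be sketched as scaling the local branch's output at each image position by how strongly that position attends to the pseudo-word. This is a minimal sketch, assuming the local cross-attention output and the text attention weights are already computed; normalizing by the maximum weight is an assumption made here for illustration, and the function name and numbers are hypothetical.

```python
def reweight_local_output(local_out, text_attention, pseudo_index):
    """Scale each position's local features by its attention to the pseudo-word.

    local_out: one feature row per image position (local branch output).
    text_attention: one row of attention weights over text tokens per position.
    pseudo_index: position of the pseudo-word S* among the text tokens.
    """
    focus = [row[pseudo_index] for row in text_attention]
    peak = max(focus) or 1.0  # avoid dividing by zero if nothing attends to S*
    return [[value * (f / peak) for value in row] for row, f in zip(local_out, focus)]


# Toy inputs: 2 image positions, 2 text tokens, S* is token 1.
local_out = [[1.0, 1.0], [2.0, 2.0]]
text_attention = [[0.2, 0.8], [0.6, 0.4]]
reweighted = reweight_local_output(local_out, text_attention, pseudo_index=1)
```

Positions that barely attend to the pseudo-word (likely background) receive little local detail, so the injected patch information stays concentrated on the object region.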
Experiments
- Evaluation metrics include Image-alignment, Text-alignment, and KID
- Optimization time is also used as a metric
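The two alignment scores can be sketched as similarities between embeddings. This is a minimal sketch, assuming both scores are cosine similarities in a shared CLIP embedding space, as is common for these metrics; the embeddings below are made-up 2-D vectors, and the CLIP encoders that would produce real ones, as well as KID, are out of scope here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_alignment(generated, references):
    """Average pairwise cosine similarity between two sets of embeddings."""
    scores = [cosine_similarity(g, r) for g in generated for r in references]
    return sum(scores) / len(scores)


# Hypothetical embeddings of two generated images.
generated_embs = [[1.0, 0.0], [0.6, 0.8]]
# Hypothetical embedding of the concept image (for image-alignment).
concept_embs = [[1.0, 0.0]]
# Hypothetical embedding of the text prompt (for text-alignment).
prompt_embs = [[0.0, 1.0]]

image_alignment = mean_alignment(generated_embs, concept_embs)
text_alignment = mean_alignment(generated_embs, prompt_embs)
```

The toy vectors also show the tension between the two scores: the generated image closest to the concept image is the one furthest from the prompt.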
Ablation study
- Conducted ablation studies to evaluate effects of components in method
- Visualized words learned by multi-layer features in global mapping network
- Experimented with several variants to demonstrate editability
- Local mapping network improves consistency with concept image without compromising editability
- λ controls fusion of information from global and local mapping networks
- λ set to 0.6 for editing prompts and 0.8 for generating prompts
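The role of λ can be sketched as a weight on the local branch when the two branches are combined. This is a minimal sketch, assuming additive fusion of the two cross-attention outputs (global output plus λ times local output); the function name and the toy feature values are hypothetical.

```python
def fuse(global_out, local_out, lam):
    """Combine the two branches as global + lam * local, element by element."""
    return [
        [g + lam * l for g, l in zip(g_row, l_row)]
        for g_row, l_row in zip(global_out, local_out)
    ]


# Toy features for a single image position.
global_out = [[1.0, 2.0]]
local_out = [[0.5, -1.0]]

fused_for_editing = fuse(global_out, local_out, lam=0.6)
fused_for_generation = fuse(global_out, local_out, lam=0.8)
global_only = fuse(global_out, local_out, lam=0.0)
```

A smaller λ leaves more room for the text prompt to reshape the concept, while a larger λ pulls the result closer to the details of the concept image.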
Qualitative results
- ELITE is compared to existing optimization-based methods
- ELITE is capable of faithfully capturing details of target concept and generating diverse images
- ELITE has superior editing performance compared to existing methods
Quantitative results
- ELITE achieved better text-alignment than state-of-the-art methods
- ELITE achieved comparable detail consistency and image quality
- ELITE was significantly faster than optimization-based methods
- User study showed ELITE was preferred over competing methods
Conclusion
- Proposes a novel learning-based encoder for customized text-to-image generation
- Reduces computational and memory burden of learning new concepts
- Superior flexibility in editing learned concepts into new scenes
- Future work: leveraging multiple concept images and composing multiple concepts
- Figure 1: given an image, learn a pseudo-word to represent target concept
- Figure 4: visual comparisons of concept generation
- Figure 5: visualization of learned word embeddings
- Figure 6: visual comparisons of different variants
- Figure 7: visual comparisons of effect of value of λ
- Figure 8: visual comparisons of concept editing
- Ablation and quantitative comparisons with existing methods
- User study