Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduce an Extended Textual Conditioning space in text-to-image models
- Show that the extended space provides greater disentangling and control over image synthesis
- Introduce Extended Textual Inversion (XTI)
- XTI is more expressive and precise, and converges faster than the original Textual Inversion (TI) space
- Conduct a series of experiments to analyze and understand the properties of the new space
Paper Content
Introduction
- Neural generative models have advanced the field of image synthesis.
- Text-to-image models have taken this field to new heights.
- Text-to-image models use the encoded text as conditioning.
- Extended U-net introduces a new textual conditioning space called P+ space.
- P+ space is more expressive and provides better control on the synthesized image.
Related works
- Exploring neural sub-spaces in generative models has been studied
- StyleGAN has been used to explore extended latent space
- P+ is similar to W+ but is more editable
- Exploiting deeper and more disentangled layers has been explored
- Text-to-image diffusion models use cross-attention layers to condition by text prompts
- Different layers are responsible for different abstraction levels
Text-driven editing
- Text-to-Image models generate images based on textual inputs
- Diffusion models are powerful architectures used in Text-to-Image models
- Single-image editing has been attempted with Text-to-Image models
- Text-only editing approaches can be used for global or local editing
- Plug-and-Play, InstructPix2Pix, and Parmar et al. allow users to manipulate real images with instructions
Personalization
- Synthesizing concepts not widespread in training data is challenging
- Inversion process enables regenerating depicted object using text-guided diffusion model
- Personalization of text-to-image models is powerful technique
- Current methods face trade-off between learning tokens and avoiding overfitting
Extended conditioning space
- Experiment conducted on Stable Diffusion model
- Cross-attention layers of denoising U-net partitioned into two subsets
- Two conditioning prompts used: “red cube” and “green lizard”
- Prompts injected into different subsets of cross-attention layers
- Results suggest different attributes exert greater influence at different levels
- Extended Textual Conditioning space (P+) introduced
- P+ allows for higher degree of control over various attributes
- Potential for enhancing textual inversion
Extended textual inversion (xti)
- Goal of Textual Inversion (TI) operation is to find a representation of an object in the conditioning space P.
- Extended Textual Inversion (XTI) adds new textual tokens and token embeddings to the tokenizer model.
- Reconstruction objective for the embeddings is defined to predict the noise of a noisy image.
Experiments and evaluation
- Analysis of U-net cross-attention layers conducted
- Motivation for effectiveness of proposed P+ space
- Comprehensive evaluation of XTI approach for personalization task
- Stable Diffusion 1.4 model used
- Vector with 768 entries used for token embedding
- U-net has 4 spatial resolution levels
- 16, 32, and 64 resolution levels have 2 and 3 cross-attention layers respectively
- 8 resolution level has 1 cross-attention layer
- Distribution of cross-attention varies across layers
- Coarse layers attend more to object token, fine layers attend more to appearance token
- CLIP similarity metric used to quantify contribution of each layer
- Coarse layers determine object shape and structure, fine layers determine color appearance
- Style is a more ambiguous descriptor involving both shape and texture
Xti evaluation
- Evaluated proposed XTI and compared to original Textual Inversion (TI)
- Used combined dataset of 9 concepts from TI and 6 concepts from another dataset
- Focused on TI as baseline because it does not fine tune model weights
- Fine-tuning approaches have disadvantages
- Used batch size of 8 and 5000 optimization steps for TI, reduced learning rate of 0.005 for XTI
- Evaluated editability quality with average cosine similarity between CLIP embeddings
- Measured distortion with average pairwise cosine similarity between ViT-S/16 DINO embeddings
- Compared XTI to TI, DreamBooth, and single-image inversion
- XTI outperforms TI in subject and text similarity
- Conducted user study, results show preference for XTI for both subject and text fidelity
Single image inversion
- Extended Textual Inversion is effective in data-hungry setups with a single image.
- Learning rate was reduced to 0.001 to prevent overfitting.
- DreamBooth performed poorly in the single-image setting and was prone to overfitting.
Embedding density
- XTI has better editability properties than original TI
- Evaluated density of optimized tokens with respect to original tokens look-up table embeddings
- Kernel-based density estimation used to quantify intuition
- Density of optimized tokens is significantly smaller compared to original embeddings
- Figure 11 illustrates original tokens density distribution and textual inversion tokens densities
Style mixing application
- Denoising U-net layers are responsible for different aspects of a synthesized image.
- Style Mixing combines the inversions of two different concepts by passing tokens from different subjects to different layers.
- An additional density regularization loss term enhances the ability to mix objects and styles.
- Figure 12 demonstrates combining two concepts from [15].
- Figure 13 shows a variety of examples generated with this method.
- Figure 14 provides a qualitative comparison between XTI-based style mixing and baselines.