Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introduce an Extended Textual Conditioning space in text-to-image models
  • Show that the extended space provides greater disentangling and control over image synthesis
  • Introduce Extended Textual Inversion (XTI)
  • XTI is more expressive and precise, and converges faster than the original Textual Inversion (TI) space
  • Conduct a series of experiments to analyze and understand the properties of the new space

Paper Content

Introduction

  • Neural generative models have advanced the field of image synthesis.
  • Text-to-image models have taken this field to new heights.
  • Text-to-image models use the encoded text as conditioning.
  • The paper introduces a new textual conditioning space, P+, in which each cross-attention layer of the denoising U-net receives its own textual condition.
  • P+ space is more expressive and provides better control on the synthesized image.
  • Exploring neural sub-spaces in generative models has been studied
  • StyleGAN has been used to explore extended latent space
  • P+ is similar to W+ but is more editable
  • Exploiting deeper and more disentangled layers has been explored
  • Text-to-image diffusion models use cross-attention layers to condition by text prompts
  • Different layers are responsible for different abstraction levels

Text-driven editing

  • Text-to-Image models generate images based on textual inputs
  • Diffusion models are powerful architectures used in Text-to-Image models
  • Single-image editing has been attempted with Text-to-Image models
  • Text-only editing approaches can be used for global or local editing
  • Plug-and-Play, InstructPix2Pix, and Parmar et al. allow users to manipulate real images with instructions

Personalization

  • Synthesizing concepts not widespread in training data is challenging
  • An inversion process enables regenerating a depicted object with a text-guided diffusion model
  • Personalization of text-to-image models is a powerful technique
  • Current methods face a trade-off between learning tokens and avoiding overfitting

Extended conditioning space

  • Experiment conducted on Stable Diffusion model
  • Cross-attention layers of denoising U-net partitioned into two subsets
  • Two conditioning prompts used: “red cube” and “green lizard”
  • Prompts injected into different subsets of cross-attention layers
  • Results suggest different attributes exert greater influence at different levels
  • Extended Textual Conditioning space (P+) introduced
  • P+ allows for higher degree of control over various attributes
  • Potential for enhancing textual inversion
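
The two-prompt experiment above can be sketched as a mapping from cross-attention layer index to prompt. This is a minimal illustration, not the authors' code: the function name, the layer count of 16, and the particular split of layers into two subsets are all assumptions made for the example.

```python
from typing import Dict, Iterable

NUM_LAYERS = 16  # assumption: number of cross-attention layers in the U-net


def assign_prompts(first_subset: Iterable[int], first_prompt: str,
                   second_prompt: str,
                   num_layers: int = NUM_LAYERS) -> Dict[int, str]:
    """Map every cross-attention layer index to the prompt that conditions it."""
    chosen = set(first_subset)
    if not chosen <= set(range(num_layers)):
        raise ValueError("first_subset contains an index outside the U-net")
    return {i: (first_prompt if i in chosen else second_prompt)
            for i in range(num_layers)}


# Hypothetical split: layers 4..11 form one subset, the remaining layers the other.
per_layer_prompt = assign_prompts(first_subset=range(4, 12),
                                  first_prompt="red cube",
                                  second_prompt="green lizard")
```

In a real pipeline each prompt would be encoded by the text encoder and the resulting condition passed only to the layers it is mapped to; swapping the two prompts between subsets is what reveals which attributes each subset controls.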

Extended textual inversion (XTI)

  • Goal of Textual Inversion (TI) operation is to find a representation of an object in the conditioning space P.
  • Extended Textual Inversion (XTI) adds new textual tokens and learns a separate token embedding for each cross-attention layer, i.e. a representation in P+ rather than in P.
  • Reconstruction objective for the embeddings is defined to predict the noise of a noisy image.
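
The reconstruction objective can be written roughly as follows. The notation is an assumption chosen for this sketch rather than copied from the paper: z_t is the noisy image (or latent) at timestep t, ε is the noise that was added, ε_θ is the frozen denoising U-net, and p_1, ..., p_n are the token embeddings fed to the n cross-attention layers.

```latex
\{p_1^{*}, \dots, p_n^{*}\}
  = \arg\min_{p_1, \dots, p_n}
    \mathbb{E}_{z,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
    \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, p_1, \dots, p_n) \right\|_2^2 \right]
```

Original TI corresponds to the special case in which all layers share one embedding, p_1 = ... = p_n.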

Experiments and evaluation

  • Analysis of U-net cross-attention layers conducted
  • Motivation for effectiveness of proposed P+ space
  • Comprehensive evaluation of XTI approach for personalization task
  • Stable Diffusion 1.4 model used
  • Vector with 768 entries used for token embedding
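  • U-net has 4 spatial resolution levels: 8, 16, 32, and 64
  • The 16, 32, and 64 resolution levels have 2 and 3 cross-attention layers on the downward and upward paths, respectively
  • The 8 resolution level has 1 cross-attention layer, for 16 cross-attention layers in total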
  • Distribution of cross-attention varies across layers
  • Coarse layers attend more to object token, fine layers attend more to appearance token
  • CLIP similarity metric used to quantify contribution of each layer
  • Coarse layers determine object shape and structure, fine layers determine color appearance
  • Style is a more ambiguous descriptor involving both shape and texture
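
The layer counts above imply how many embeddings are optimized per concept. The short calculation below assumes that the "2 and 3" layers refer to the downward and upward paths of the U-net, and that one 768-entry embedding is learned per cross-attention layer; the dictionary layout is illustrative only.

```python
EMBED_DIM = 768  # entries per token embedding, as stated above

# resolution level -> (downward-path layers, upward-path layers)
# Assumption: the single layer at resolution 8 sits at the bottleneck and is counted once.
CROSS_ATTN_LAYERS = {64: (2, 3), 32: (2, 3), 16: (2, 3), 8: (1, 0)}

total_layers = sum(down + up for down, up in CROSS_ATTN_LAYERS.values())
total_values = total_layers * EMBED_DIM

print(total_layers)   # number of per-layer embeddings for one learned token
print(total_values)   # number of optimized scalars for one learned token
```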

XTI evaluation

  • Evaluated proposed XTI and compared to original Textual Inversion (TI)
  • Used combined dataset of 9 concepts from TI and 6 concepts from another dataset
  • Focused on TI as baseline because it does not fine tune model weights
  • Fine-tuning approaches have disadvantages
  • Used a batch size of 8 and 5000 optimization steps for TI, and a reduced learning rate of 0.005 for XTI
  • Evaluated editability (text similarity) with the average cosine similarity between CLIP embeddings of the generated images and their text prompts
  • Measured distortion (subject similarity) with the average pairwise cosine similarity between ViT-S/16 DINO embeddings of the generated images and the input images
  • Compared XTI to TI, DreamBooth, and single-image inversion
  • XTI outperforms TI in subject and text similarity
  • Conducted user study, results show preference for XTI for both subject and text fidelity
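
Both metrics reduce to averaging cosine similarities between embedding vectors. The sketch below shows only that arithmetic; the helper names are hypothetical, and the short toy vectors stand in for real CLIP or DINO embeddings, which would come from the respective pretrained models.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(xs: Sequence[Sequence[float]],
                         ys: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity over every pair (x, y) with x in xs, y in ys."""
    return sum(cosine(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


# Toy stand-ins for embeddings of generated images and of reference images.
generated = [[1.0, 0.0]]
references = [[1.0, 0.0], [0.0, 1.0]]
subject_similarity = mean_pairwise_cosine(generated, references)
```

For text similarity the second argument would hold the prompt embeddings instead of the reference-image embeddings.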

Single image inversion

  • Extended Textual Inversion is effective in data-scarce setups, including inversion from a single image.
  • Learning rate was reduced to 0.001 to prevent overfitting.
  • DreamBooth performed poorly in the single-image setting and was prone to overfitting.

Embedding density

  • XTI has better editability properties than original TI
  • Evaluated the density of the optimized tokens with respect to the embeddings in the original token look-up table
  • Kernel-based density estimation used to quantify intuition
  • Density of optimized tokens is significantly smaller compared to original embeddings
  • Figure 11 illustrates original tokens density distribution and textual inversion tokens densities
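
A kernel-based density estimate of this kind can be sketched as follows. The Gaussian kernel, the bandwidth value, and the two-dimensional toy vectors are assumptions for illustration; the summary does not state which kernel or bandwidth the authors used.

```python
import math
from typing import Sequence


def log_kde(x: Sequence[float], samples: Sequence[Sequence[float]],
            bandwidth: float) -> float:
    """Log of a Gaussian kernel density estimate at x, given reference samples."""
    dim = len(x)
    log_norm = -0.5 * dim * math.log(2.0 * math.pi * bandwidth ** 2)
    log_terms = []
    for s in samples:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, s))
        log_terms.append(log_norm - sq_dist / (2.0 * bandwidth ** 2))
    # log-sum-exp keeps the estimate stable when individual kernels underflow
    peak = max(log_terms)
    total = sum(math.exp(t - peak) for t in log_terms)
    return peak + math.log(total) - math.log(len(samples))


# Toy stand-in for the original look-up table embeddings.
vocabulary = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]

near_density = log_kde([0.05, 0.05], vocabulary, bandwidth=0.5)  # inside the cloud
far_density = log_kde([2.0, 2.0], vocabulary, bandwidth=0.5)     # far from it
```

A learned token whose log-density is low relative to the original embeddings lies in a sparsely populated region of the embedding space, which is the quantity the comparison above is concerned with.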

Style mixing application

  • Denoising U-net layers are responsible for different aspects of a synthesized image.
  • Style Mixing combines the inversions of two different concepts by passing tokens from different subjects to different layers.
  • An additional density regularization loss term enhances the ability to mix objects and styles.
  • Figure 12 demonstrates combining two concepts from [15].
  • Figure 13 shows a variety of examples generated with this method.
  • Figure 14 provides a qualitative comparison between XTI-based style mixing and baselines.

Conclusions, limitations, and future work