Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introduce an Extended Textual Conditioning space in text-to-image models
Show that the extended space provides greater disentangling and control over image synthesis
Introduce Extended Textual Inversion (XTI)
XTI is more expressive and precise, and converges faster than the original Textual Inversion (TI) space
Conduct a series of experiments to analyze and understand the properties of the new space

Paper Content

Introduction

Neural generative models have advanced the field of image synthesis.
Text-to-image models have taken this field to new heights.
Text-to-image models use the encoded text as conditioning.
Extended U-net introduces a new textual conditioning space called P+ space.
P+ space is more expressive and provides better control on the synthesized image.

Exploring neural sub-spaces in generative models has been studied
StyleGAN has been used to explore extended latent space
P+ is similar to W+ but is more editable
Exploiting deeper and more disentangled layers has been explored
Text-to-image diffusion models use cross-attention layers to condition by text prompts
Different layers are responsible for different abstraction levels

Text-driven editing

Text-to-Image models generate images based on textual inputs
Diffusion models are powerful architectures used in Text-to-Image models
Single-image editing has been attempted with Text-to-Image models
Text-only editing approaches can be used for global or local editing
Plug-and-Play, InstructPix2Pix, and Parmar et al. allow users to manipulate real images with instructions

Personalization

Synthesizing concepts not widespread in training data is challenging
Inversion process enables regenerating depicted object using text-guided diffusion model
Personalization of text-to-image models is powerful technique
Current methods face trade-off between learning tokens and avoiding overfitting

Extended conditioning space

Experiment conducted on Stable Diffusion model
Cross-attention layers of denoising U-net partitioned into two subsets
Two conditioning prompts used: “red cube” and “green lizard”
Prompts injected into different subsets of cross-attention layers
Results suggest different attributes exert greater influence at different levels
Extended Textual Conditioning space (P+) introduced
P+ allows for higher degree of control over various attributes
Potential for enhancing textual inversion

Extended textual inversion (xti)

Goal of Textual Inversion (TI) operation is to find a representation of an object in the conditioning space P.
Extended Textual Inversion (XTI) adds new textual tokens and token embeddings to the tokenizer model.
Reconstruction objective for the embeddings is defined to predict the noise of a noisy image.

Experiments and evaluation

Analysis of U-net cross-attention layers conducted
Motivation for effectiveness of proposed P+ space
Comprehensive evaluation of XTI approach for personalization task
Stable Diffusion 1.4 model used
Vector with 768 entries used for token embedding
U-net has 4 spatial resolution levels
16, 32, and 64 resolution levels have 2 and 3 cross-attention layers respectively
8 resolution level has 1 cross-attention layer
Distribution of cross-attention varies across layers
Coarse layers attend more to object token, fine layers attend more to appearance token
CLIP similarity metric used to quantify contribution of each layer
Coarse layers determine object shape and structure, fine layers determine color appearance
Style is a more ambiguous descriptor involving both shape and texture

Xti evaluation

Evaluated proposed XTI and compared to original Textual Inversion (TI)
Used combined dataset of 9 concepts from TI and 6 concepts from another dataset
Focused on TI as baseline because it does not fine tune model weights
Fine-tuning approaches have disadvantages
Used batch size of 8 and 5000 optimization steps for TI, reduced learning rate of 0.005 for XTI
Evaluated editability quality with average cosine similarity between CLIP embeddings
Measured distortion with average pairwise cosine similarity between ViT-S/16 DINO embeddings
Compared XTI to TI, DreamBooth, and single-image inversion
XTI outperforms TI in subject and text similarity
Conducted user study, results show preference for XTI for both subject and text fidelity

Single image inversion

Extended Textual Inversion is effective in data-hungry setups with a single image.
Learning rate was reduced to 0.001 to prevent overfitting.
DreamBooth performed poorly in the single-image setting and was prone to overfitting.

Embedding density

XTI has better editability properties than original TI
Evaluated density of optimized tokens with respect to original tokens look-up table embeddings
Kernel-based density estimation used to quantify intuition
Density of optimized tokens is significantly smaller compared to original embeddings
Figure 11 illustrates original tokens density distribution and textual inversion tokens densities

Style mixing application

Denoising U-net layers are responsible for different aspects of a synthesized image.
Style Mixing combines the inversions of two different concepts by passing tokens from different subjects to different layers.
An additional density regularization loss term enhances the ability to mix objects and styles.
Figure 12 demonstrates combining two concepts from [15].
Figure 13 shows a variety of examples generated with this method.
Figure 14 provides a qualitative comparison between XTI-based style mixing and baselines.

$P+$: Extended Textual Conditioning in Text-to-Image Generation

Link to paper

Abstract

Paper Content

Introduction

Text-driven editing

Personalization

Extended conditioning space

Extended textual inversion (xti)

Experiments and evaluation

Xti evaluation

Single image inversion

Embedding density

Style mixing application

Conclusions, limitations, and future work

Link to paper#

Abstract#

Paper Content#

Introduction#

Related works#

Text-driven editing#

Personalization#

Extended conditioning space#

Extended textual inversion (xti)#

Experiments and evaluation#

Xti evaluation#

Single image inversion#

Embedding density#

Style mixing application#

Conclusions, limitations, and future work#

Link to paper

Abstract

Paper Content

Introduction

Related works

Text-driven editing

Personalization

Extended conditioning space

Extended textual inversion (xti)

Experiments and evaluation

Xti evaluation

Single image inversion

Embedding density

Style mixing application

Conclusions, limitations, and future work