Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Text-to-image models allow users to create images through natural language.
The paper presents an approach that gives users this creative freedom using only 3-5 images of a user-provided concept, such as an object or a style.
The approach learns to represent the concept through new “words” in the embedding space of a frozen text-to-image model.
A single word embedding is sufficient for capturing unique and varied concepts.
Code, data and new words are available online.

Paper Content

Introduction

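The paper opens with Rose’s request to Jack in the film “Titanic”: “draw me like one of your French girls”
This simple request contains a wealth of information
It indicates that Jack should produce a drawing
The drawing should match the style of Jack’s prior work
Through the single word “me”, the drawing should portray a specific subject: Rose herself
Introducing new concepts like this into large-scale models is difficult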

Related work

Text-guided synthesis is a widely studied topic in the context of GANs
Typically, a conditional model is trained to reproduce samples from paired image-caption datasets
Several approaches employ test-time optimization to explore the latent spaces of a pre-trained generator
Text-based interfaces are used for image editing, generator domain adaptation, video manipulation, motion synthesis, style transfer, and texture synthesis
Our approach builds on open-ended, conditional synthesis models
Inversion is done through optimization-based techniques or encoders
Diffusion-based inversion can be done by adding noise to an image and then de-noising it
Personalization efforts are found in recommendation systems, federated learning, vision and graphics
PALAVRA leverages a pre-trained CLIP model for retrieval and segmentation of personalized objects
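
The noise-then-denoise inversion noted in the list above can be made concrete with the standard DDPM forward process. This is general background rather than a formula quoted from this summary; the symbols x_0 (clean image or latent), ε (Gaussian noise), and ᾱ_t (cumulative noise schedule) are the usual textbook notation and are assumptions here.

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```

Running the trained denoiser backwards from x_t then yields an image near x_0. A larger t discards more of the original, so the choice of t trades fidelity against freedom to change the image.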

Method

Goal is to enable language-guided generation of user-specified concepts
New concepts are encoded into an intermediate representation of a pre-trained text-to-image model
This leverages the rich semantic and visual prior of the model
The word-embedding stage of the text encoder is chosen as the target for inversion
Prior work showed this embedding space can capture basic image semantics
Visual reconstruction objective used to find pseudo-words
Latent Diffusion Models (LDMs) used
LDMs consist of autoencoder and diffusion model
Autoencoder maps images to spatial latent code
Diffusion model produces codes within latent space
Text encoder used to map conditioning input to conditioning vector
The new pseudo-word embedding is initialized with the embedding of a single-word coarse descriptor of the concept
Experiments conducted using 2×V100 GPUs with a batch size of 4
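
The steps above can be sketched as a toy optimization in which only the pseudo-word vector is updated while everything else stays frozen. This is a minimal illustration under stated assumptions, not the paper's implementation: the 4-dimensional vocabulary, the linear `generate` function standing in for the frozen diffusion model, the squared-error loss standing in for the denoising objective, and all names (`frozen_vocab`, `FROZEN_W`, `embed`, `invert`) are hypothetical.

```python
# Toy stand-ins (assumptions): a tiny frozen embedding table and a frozen
# linear "generator" that maps one embedding to a 3-value "image".
DIM = 4
frozen_vocab = {
    "a": [0.1, 0.0, 0.2, 0.0],
    "photo": [0.0, 0.3, 0.0, 0.1],
    "of": [0.2, 0.1, 0.0, 0.0],
    "cat": [0.5, 0.5, 0.0, 0.0],
}
FROZEN_W = [
    [0.5, -0.2, 0.1, 0.3],
    [-0.4, 0.6, 0.2, -0.1],
    [0.3, 0.1, -0.5, 0.4],
]


def generate(v):
    """Frozen model: its weights are never updated during inversion."""
    return [sum(w * x for w, x in zip(row, v)) for row in FROZEN_W]


def loss(v, targets):
    """Mean squared reconstruction error over the concept images."""
    out = generate(v)
    return sum(sum((o - t) ** 2 for o, t in zip(out, tgt))
               for tgt in targets) / len(targets)


def embed(prompt, pseudo_word, v_star):
    """Look up each token; the placeholder maps to the learned vector."""
    return [v_star if tok == pseudo_word else frozen_vocab[tok]
            for tok in prompt.split()]


def invert(targets, init_word, steps=200, lr=0.2):
    """Gradient descent on the pseudo-word vector only."""
    v = list(frozen_vocab[init_word])  # start from a coarse descriptor
    for _ in range(steps):
        out = generate(v)
        grad = [0.0] * DIM
        for target in targets:
            for i, row in enumerate(FROZEN_W):
                err = out[i] - target[i]
                for j in range(DIM):
                    grad[j] += 2 * err * row[j] / len(targets)
        v = [x - lr * g for x, g in zip(v, grad)]
    return v


# Three toy "images" of one concept, then reuse of the learned vector.
concept_images = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.6], [1.1, -0.1, 0.4]]
v_star = invert(concept_images, init_word="cat")
prompt_embeddings = embed("a photo of S*", "S*", v_star)
```

The point of the design is that `frozen_vocab` and `FROZEN_W` are read but never written, so the learned vector can be dropped into any prompt without disturbing the model's existing behavior.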

Qualitative comparisons and applications

Demonstrates applications enabled through Textual Inversions
Provides visual comparisons to state-of-the-art and human-captioning baselines

Image variations

Our method captures unique details of an object using a single pseudo-word
We compared our method to two baselines: LDM guided by a human caption and DALLE-2 guided by either a human caption or an image prompt
Our method better captures the unique details of the concept than the baselines
We can compose novel scenes by incorporating the learned pseudo-words into new conditioning texts
We compared our method to several personalization baselines
Our method can optimize a single pseudo-word and re-use it for a multitude of new generations
Baseline models require expensive optimization for every new creation
Our method builds upon pre-trained, large-scale text-to-image synthesis models
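
Reusing one learned pseudo-word across many new conditioning texts, as described above, amounts to filling a placeholder into prompt templates. The snippet below is a small sketch of that idea; the placeholder string `S*`, the slot name `obj`, the templates, and the helper `compose_prompt` are illustrative assumptions rather than anything prescribed by the paper.

```python
def compose_prompt(template, pseudo_words):
    """Fill named slots in a template with learned placeholder strings."""
    return template.format(**pseudo_words)


# One pseudo-word, learned once, reused in several new prompts.
learned = {"obj": "S*"}
templates = [
    "a photo of {obj}",
    "a photo of {obj} on the beach",
    "an oil painting of {obj}",
]
prompts = [compose_prompt(t, learned) for t in templates]
```

Because the mapping is keyed by slot name, the same helper accepts more than one learned placeholder in a single template.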

Style transfer

Text-guided synthesis can be used to capture the unique style of an artist.
Model can find pseudo-words representing a specific, unknown style.
Results demonstrate that the ability to capture concepts extends beyond simple object reconstructions.
Differs from traditional style transfer as content of input image is not necessarily maintained.

Concept compositions

Model can reason over multiple novel pseudo-words at the same time
Struggles with relations between concepts (e.g. placing two concepts side-by-side)

Bias reduction

Text-to-image models inherit biases from internet-scale data used to train them.
Examples of bias include white-passing and male-passing images of CEOs and heterosexual couples for “wedding”.
A small, curated dataset can be used to learn a new “fairer” word for a biased concept.
Bias can be reduced by learning a new embedding from a small, more diverse set.

Downstream applications

Pseudo-words can be used in downstream models that build on the same initial LDM model.
Blended Latent Diffusion (Avrahami et al., 2022a) enables localized text-based editing of images.
Localized synthesis process can be conditioned on learned pseudo-words without requiring additional modifications.

Image curation

Generated 16 candidates for each prompt, manually selected best result
Similar curation processes with larger batches typically used in text-conditioned generation works
The selection process could be automated by using CLIP to rank images
Provide large-scale, uncurated galleries of generated results in supplementary materials
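
Automated selection by ranking could look roughly like the sketch below, which scores each candidate by cosine similarity to the prompt and keeps the best. The short vectors stand in for CLIP text and image embeddings, which in practice would come from a CLIP encoder; the function names and the choice of cosine similarity as the ranking score are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(prompt_embedding, candidate_embeddings, top_k=1):
    """Return indices of the top_k candidates, most similar first."""
    order = sorted(
        range(len(candidate_embeddings)),
        key=lambda i: cosine(prompt_embedding, candidate_embeddings[i]),
        reverse=True,
    )
    return order[:top_k]


# Toy 3-d vectors standing in for CLIP embeddings (assumption).
prompt_vec = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]
best = rank_candidates(prompt_vec, candidates, top_k=2)
```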

Quantitative analysis

Inversion into a latent space provides many design choices
Many core premises of GAN inversion also exist in the textual embedding space
Solutions typically used in GAN inversion do not generalize to this space and are often unhelpful or harmful

Evaluation metrics

Analyzed quality of latent space embeddings on two fronts: reconstruction and editability
Measured similarity of generated images to concept-specific training set by considering semantic CLIP-space distances
Evaluated ability to modify concepts using textual prompts by synthesizing 64 samples using 50 DDIM steps
Calculated average CLIP-space embedding of samples and computed cosine similarity with CLIP-space embedding of textual prompts
Method does not involve direct optimization of CLIP-based objective score and is not sensitive to adversarial scoring flaws
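
The two scores above can be sketched in a few lines. This assumes the reconstruction score is the average pairwise cosine similarity between generated and training image embeddings, which is one reading of "semantic CLIP-space distances", and that the editability score is the cosine similarity between the mean generated embedding and the prompt embedding. The 2-d vectors are stand-ins for CLIP embeddings, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise average of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def reconstruction_score(generated, training):
    """Average pairwise similarity of generated vs. training embeddings."""
    sims = [cosine(g, t) for g in generated for t in training]
    return sum(sims) / len(sims)


def editability_score(generated, prompt_embedding):
    """Similarity of the mean generated embedding to the prompt embedding."""
    return cosine(mean_vector(generated), prompt_embedding)


# Toy embeddings (assumption): two generated samples, one training image.
generated = [[1.0, 0.0], [0.0, 1.0]]
training = [[1.0, 0.0]]
recon = reconstruction_score(generated, training)
edit = editability_score(generated, [1.0, 1.0])
```

Averaging the sample embeddings before comparing with the text means a single off-prompt sample does not dominate the editability score.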

Evaluation setups

Evaluate embedding space using experimental setups inspired by GAN inversion
Consider extended, multi-vector latent space
Consider progressive multi-vector setup
Introduce regularization to keep learned embedding close to existing words
Introduce unique, per-image tokens
Compare to human-level performance using captions
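Add two reference baselines: one that reproduces the training set regardless of the prompt, and one that follows the prompt but ignores the personalized concept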
Evaluate own setup with increased/decreased learning rate
Consider additional setups in supplementary
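
The regularization setup in the list above can be illustrated with a simple penalty that pulls the learned vector toward an existing word's embedding. The sketch assumes a squared L2 distance to an anchor embedding (for example that of a coarse descriptor) added to the reconstruction loss; the weight, learning rate, and function names are hypothetical, and the reconstruction gradient is passed in as a plain list so the example stays self-contained.

```python
def l2_penalty(v, v_anchor):
    """Squared L2 distance between the learned and anchor embeddings."""
    return sum((a - b) ** 2 for a, b in zip(v, v_anchor))


def l2_penalty_grad(v, v_anchor):
    """Gradient of the squared L2 distance with respect to v."""
    return [2 * (a - b) for a, b in zip(v, v_anchor)]


def regularized_step(v, recon_grad, v_anchor, lr=0.1, weight=0.5):
    """One gradient step on: reconstruction loss + weight * ||v - anchor||^2."""
    reg_grad = l2_penalty_grad(v, v_anchor)
    return [x - lr * (g + weight * r)
            for x, g, r in zip(v, recon_grad, reg_grad)]


# With a zero reconstruction gradient, the step moves v toward the anchor.
v = [1.0, 1.0]
anchor = [0.0, 0.0]
v_next = regularized_step(v, recon_grad=[0.0, 0.0], v_anchor=anchor)
```

A larger `weight` keeps the pseudo-word closer to the existing vocabulary, which would be expected to limit how far the embedding can drift toward a pure reconstruction of the training images.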

Results

Semantic reconstruction quality of method and baselines is comparable to random images from training set
Single-word method achieves comparable reconstruction quality and improved editability over multi-word baselines
Distortion-editability trade-off curve exists, single-embedding model can be moved along it by changing learning rate
Use of human descriptions for concepts leads to diminished editability

Human evaluations

Conducted user study with two questionnaires
600 responses to each questionnaire, for a total of 1,200 responses
Results align with CLIP-based metrics
Demonstrates reconstruction-editability tradeoff
Outlines limitations of human-based captioning

Limitations

Method offers increased freedom but may struggle to learn precise shapes, instead capturing the semantic essence of a concept
For artistic creations, this is often enough
Aim to achieve better control over accuracy in the future
Optimization times are lengthy, roughly two hours for a single concept

Social impact

Text-to-image models can be used to generate misleading content.
Models are susceptible to biases found in training data.
Ability to more precisely describe concepts can reduce biases.
Ability to learn artistic styles may be misused for copyright infringement.

Conclusions

Introduced task of personalized, language-guided generation
Leverage text-to-image model to create images of specific concepts in novel settings and scenes
Approach called “Textual Inversions” operates by inverting concepts into new pseudo-words
Pseudo-words can be injected into new scenes using natural language descriptions
Keeps a text-driven interface for ease of editing, while providing visual cues where natural language reaches its limits
Implemented over LDM, but applicable to other text-to-image models
Investigated two recent approaches to inversion: Bipartite DDIM-inversion and pivotal tuning
Bipartite inversion allows for more accurate reconstructions without modifying the model, but image structure is lost for complex prompts at high guidance scales
Pivotal tuning improves shapes at the cost of visual artifacts, and fails to adhere to simple prompts at high guidance scales
Our approach shows the best results with ∼5 images
Additional results of personalized generation provided in Figures 13-16
