Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Text-to-image models allow users to create images through natural language.
  • A new approach is presented to allow creative freedom with 3-5 images of a user-provided concept.
  • The approach uses “words” in the embedding space of a frozen text-to-image model to represent the concept.
  • A single word embedding is sufficient for capturing unique and varied concepts.
  • Code, data and new words are available online.

Paper Content

Introduction

  • Request contains a wealth of information
  • Request should produce a drawing
  • Drawing should match style of Jack’s prior work
  • Drawing should portray Rose herself
  • Finding new concepts in large scale models is difficult
  • Text-guided synthesis is a widely studied topic in the context of GANs
  • Typically, a conditional model is trained to reproduce samples from paired image-caption datasets
  • Several approaches employ test-time optimization to explore the latent spaces of a pre-trained generator
  • Text-based interfaces are used for image editing, generator domain adaptation, video manipulation, motion synthesis, style transfer, and texture synthesis
  • Our approach builds on open-ended, conditional synthesis models
  • Inversion is done through optimization-based techniques or encoders
  • Diffusion-based inversion can be done by adding noise to an image and then de-noising it
  • Personalization efforts are found in recommendation systems, federated learning, vision and graphics
  • PALAVRA leverages a pre-trained CLIP model for retrieval and segmentation of personalized objects

Method

  • Goal is to enable language-guided generation of user-specified concepts
  • Representation of pre-trained text-to-image model is used
  • Leverage rich semantic and visual prior of model
  • Word-embedding stage of text encoders used
  • Prior work showed this embedding space can capture basic image semantics
  • Visual reconstruction objective used to find pseudo-words
  • Latent Diffusion Models (LDMs) used
  • LDMs consist of autoencoder and diffusion model
  • Autoencoder maps images to spatial latent code
  • Diffusion model produces codes within latent space
  • Text encoder used to map conditioning input to conditioning vector
  • Text embeddings initialized with single-word coarse descriptor
  • Experiments conducted using 2xV100 GPUs with batch size of 4

Qualitative comparisons and applications

  • Demonstrates applications enabled through Textual Inversions
  • Provides visual comparisons to state-of-the-art and human-captioning baselines

Image variations

  • Our method captures unique details of an object using a single pseudoword
  • We compared our method to two baselines: LDM guided by a human caption and DALLE-2 guided by either a human caption or an image prompt
  • Our method better captures the unique details of the concept than the baselines
  • We can compose novel scenes by incorporating the learned pseudo-words into new conditioning texts
  • We compared our method to several personalization baselines
  • Our method can optimize a single pseudo-word and re-use it for a multitude of new generations
  • Baseline models require expensive optimization for every new creation
  • Our method builds upon pre-trained, large-scale text-to-image synthesis models

Style transfer

  • Text-guided synthesis can be used to capture the unique style of an artist.
  • Model can find pseudowords representing a specific, unknown style.
  • Results demonstrate that the ability to capture concepts extends beyond simple object reconstructions.
  • Differs from traditional style transfer as content of input image is not necessarily maintained.

Concept compositions

  • Model can reason over multiple novel pseudo-words at the same time
  • Struggles with relations between concepts (e.g. placing two concepts side-by-side)

Bias reduction

  • Text-to-image models inherit biases from internet-scale data used to train them.
  • Examples of bias include whitepassing and male-passing images of CEOs and heterosexual couples for “wedding”.
  • A small, curated dataset can be used to learn a new “fairer” word for a biased concept.
  • Bias can be reduced by learning a new embedding from a small, more diverse set.

Downstream applications

  • Pseudo-words can be used in downstream models that build on the same initial LDM model.
  • Blended Latent Diffusion (Avrahami et al., 2022a) enables localized text-based editing of images.
  • Localized synthesis process can be conditioned on learned pseudo-words without requiring additional modifications.

Image curation

  • Generated 16 candidates for each prompt, manually selected best result
  • Similar curation processes with larger batches typically used in text-conditioned generation works
  • Automate selection process by using CLIP to rank images
  • Provide large-scale, uncurated galleries of generated results in supplementary materials

Quantitative analysis

  • Inversion into a latent space provides many design choices
  • Many core premises of GAN inversion also exist in the textual embedding space
  • Solutions typically used in GAN inversion do not generalize to this space and are often unhelpful or harmful

Evaluation metrics

  • Analyzed quality of latent space embeddings by considering two fronts: reconstruction and editability
  • Measured similarity of generated images to concept-specific training set by considering semantic CLIP-space distances
  • Evaluated ability to modify concepts using textual prompts by synthesizing 64 samples using 50 DDIM steps
  • Calculated average CLIP-space embedding of samples and computed cosine similarity with CLIP-space embedding of textual prompts
  • Method does not involve direct optimization of CLIP-based objective score and is not sensitive to adversarial scoring flaws

Evaluation setups

  • Evaluate embedding space using experimental setups inspired by GAN inversion
  • Consider extended, multi-vector latent space
  • Consider progressive multi-vector setup
  • Introduce regularization to keep learned embedding close to existing words
  • Introduce unique, per-image tokens
  • Compare to human-level performance using captions
  • Add two reference baselines
  • Evaluate own setup with increased/decreased learning rate
  • Consider additional setups in supplementary

Results

  • Semantic reconstruction quality of method and baselines is comparable to random images from training set
  • Single-word method achieves comparable reconstruction quality and improved editability over multi-word baselines
  • Distortion-editability trade-off curve exists, single-embedding model can be moved along it by changing learning rate
  • Use of human descriptions for concepts leads to diminished editability

Human evaluations

  • Conducted user study with two questionnaires
  • 600 responses to each questionnaire, for a total of 1,200 responses
  • Results align with CLIP-based metrics
  • Demonstrates reconstruction-editability tradeoff
  • Outlines limitations of human-based captioning

Limitations

  • Method offers increased freedom but may struggle with precise shapes
  • Method is often enough for artistic creations
  • Aim to achieve better control over accuracy in the future
  • Optimization times are lengthy, roughly two hours for a single concept

Social impact

  • Text-to-image models can be used to generate misleading content.
  • Models are susceptible to biases found in training data.
  • Ability to more precisely describe concepts can reduce biases.
  • Ability to learn artistic styles may be misused for copyright infringement.

Conclusions

  • Introduced task of personalized, language-guided generation
  • Leverage text-to-image model to create images of specific concepts in novel settings and scenes
  • Approach called “Textual Inversions” operates by inverting concepts into new pseudo-words
  • Pseudo-words can be injected into new scenes using natural language descriptions
  • Text-driven interface for ease of editing, but providing visual cues when approaching the limits of natural language
  • Implemented over LDM, but applicable to other text-to-image models
  • Investigated two recent approaches to inversion: Bipartite DDIM-inversion and pivotal tuning
  • Bipartite inversion allows for more accurate reconstructions without modifying the model, but their structure is lost for complex prompts in high guidance scales
  • Pivotal tuning improves shapes at the cost of visual artifacts, and fail to adhere to simple prompts at high guidance scales
  • Our approach shows the best results with ∼ 5 images
  • Additional results of personalized generation provided in Figures 13-16