Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Text-to-image models allow users to create images through natural language.
- A new approach is presented to allow creative freedom with 3-5 images of a user-provided concept.
- The approach uses “words” in the embedding space of a frozen text-to-image model to represent the concept.
- A single word embedding is sufficient for capturing unique and varied concepts.
- Code, data and new words are available online.
Paper Content
Introduction
- Request contains a wealth of information
- Request should produce a drawing
- Drawing should match style of Jack’s prior work
- Drawing should portray Rose herself
- Finding new concepts in large scale models is difficult
Related work
- Text-guided synthesis is a widely studied topic in the context of GANs
- Typically, a conditional model is trained to reproduce samples from paired image-caption datasets
- Several approaches employ test-time optimization to explore the latent spaces of a pre-trained generator
- Text-based interfaces are used for image editing, generator domain adaptation, video manipulation, motion synthesis, style transfer, and texture synthesis
- Our approach builds on open-ended, conditional synthesis models
- Inversion is done through optimization-based techniques or encoders
- Diffusion-based inversion can be done by adding noise to an image and then de-noising it
- Personalization efforts are found in recommendation systems, federated learning, vision and graphics
- PALAVRA leverages a pre-trained CLIP model for retrieval and segmentation of personalized objects
Method
- Goal is to enable language-guided generation of user-specified concepts
- Representation of pre-trained text-to-image model is used
- Leverage rich semantic and visual prior of model
- Word-embedding stage of text encoders used
- Prior work showed this embedding space can capture basic image semantics
- Visual reconstruction objective used to find pseudo-words
- Latent Diffusion Models (LDMs) used
- LDMs consist of autoencoder and diffusion model
- Autoencoder maps images to spatial latent code
- Diffusion model produces codes within latent space
- Text encoder used to map conditioning input to conditioning vector
- Text embeddings initialized with single-word coarse descriptor
- Experiments conducted using 2xV100 GPUs with batch size of 4
Qualitative comparisons and applications
- Demonstrates applications enabled through Textual Inversions
- Provides visual comparisons to state-of-the-art and human-captioning baselines
Image variations
- Our method captures unique details of an object using a single pseudoword
- We compared our method to two baselines: LDM guided by a human caption and DALLE-2 guided by either a human caption or an image prompt
- Our method better captures the unique details of the concept than the baselines
- We can compose novel scenes by incorporating the learned pseudo-words into new conditioning texts
- We compared our method to several personalization baselines
- Our method can optimize a single pseudo-word and re-use it for a multitude of new generations
- Baseline models require expensive optimization for every new creation
- Our method builds upon pre-trained, large-scale text-to-image synthesis models
Style transfer
- Text-guided synthesis can be used to capture the unique style of an artist.
- Model can find pseudowords representing a specific, unknown style.
- Results demonstrate that the ability to capture concepts extends beyond simple object reconstructions.
- Differs from traditional style transfer as content of input image is not necessarily maintained.
Concept compositions
- Model can reason over multiple novel pseudo-words at the same time
- Struggles with relations between concepts (e.g. placing two concepts side-by-side)
Bias reduction
- Text-to-image models inherit biases from internet-scale data used to train them.
- Examples of bias include whitepassing and male-passing images of CEOs and heterosexual couples for “wedding”.
- A small, curated dataset can be used to learn a new “fairer” word for a biased concept.
- Bias can be reduced by learning a new embedding from a small, more diverse set.
Downstream applications
- Pseudo-words can be used in downstream models that build on the same initial LDM model.
- Blended Latent Diffusion (Avrahami et al., 2022a) enables localized text-based editing of images.
- Localized synthesis process can be conditioned on learned pseudo-words without requiring additional modifications.
Image curation
- Generated 16 candidates for each prompt, manually selected best result
- Similar curation processes with larger batches typically used in text-conditioned generation works
- Automate selection process by using CLIP to rank images
- Provide large-scale, uncurated galleries of generated results in supplementary materials
Quantitative analysis
- Inversion into a latent space provides many design choices
- Many core premises of GAN inversion also exist in the textual embedding space
- Solutions typically used in GAN inversion do not generalize to this space and are often unhelpful or harmful
Evaluation metrics
- Analyzed quality of latent space embeddings by considering two fronts: reconstruction and editability
- Measured similarity of generated images to concept-specific training set by considering semantic CLIP-space distances
- Evaluated ability to modify concepts using textual prompts by synthesizing 64 samples using 50 DDIM steps
- Calculated average CLIP-space embedding of samples and computed cosine similarity with CLIP-space embedding of textual prompts
- Method does not involve direct optimization of CLIP-based objective score and is not sensitive to adversarial scoring flaws
Evaluation setups
- Evaluate embedding space using experimental setups inspired by GAN inversion
- Consider extended, multi-vector latent space
- Consider progressive multi-vector setup
- Introduce regularization to keep learned embedding close to existing words
- Introduce unique, per-image tokens
- Compare to human-level performance using captions
- Add two reference baselines
- Evaluate own setup with increased/decreased learning rate
- Consider additional setups in supplementary
Results
- Semantic reconstruction quality of method and baselines is comparable to random images from training set
- Single-word method achieves comparable reconstruction quality and improved editability over multi-word baselines
- Distortion-editability trade-off curve exists, single-embedding model can be moved along it by changing learning rate
- Use of human descriptions for concepts leads to diminished editability
Human evaluations
- Conducted user study with two questionnaires
- 600 responses to each questionnaire, for a total of 1,200 responses
- Results align with CLIP-based metrics
- Demonstrates reconstruction-editability tradeoff
- Outlines limitations of human-based captioning
Limitations
- Method offers increased freedom but may struggle with precise shapes
- Method is often enough for artistic creations
- Aim to achieve better control over accuracy in the future
- Optimization times are lengthy, roughly two hours for a single concept
Social impact
- Text-to-image models can be used to generate misleading content.
- Models are susceptible to biases found in training data.
- Ability to more precisely describe concepts can reduce biases.
- Ability to learn artistic styles may be misused for copyright infringement.
Conclusions
- Introduced task of personalized, language-guided generation
- Leverage text-to-image model to create images of specific concepts in novel settings and scenes
- Approach called “Textual Inversions” operates by inverting concepts into new pseudo-words
- Pseudo-words can be injected into new scenes using natural language descriptions
- Text-driven interface for ease of editing, but providing visual cues when approaching the limits of natural language
- Implemented over LDM, but applicable to other text-to-image models
- Investigated two recent approaches to inversion: Bipartite DDIM-inversion and pivotal tuning
- Bipartite inversion allows for more accurate reconstructions without modifying the model, but their structure is lost for complex prompts in high guidance scales
- Pivotal tuning improves shapes at the cost of visual artifacts, and fail to adhere to simple prompts at high guidance scales
- Our approach shows the best results with ∼ 5 images
- Additional results of personalized generation provided in Figures 13-16