Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Text-to-image models have enabled high-quality and diverse synthesis of images from a given text prompt.
- A new approach has been developed to “personalize” text-to-image diffusion models.
- The approach uses a few images of a subject to fine-tune a pretrained text-to-image model.
- The approach enables synthesizing the subject in diverse scenes, poses, views, and lighting conditions.
Paper Content
Introduction
- Problem: synthesizing instances of specific subjects in new contexts
- Recently developed large text-to-image models enable high-quality and diverse synthesis of images based on a text prompt
- Models lack ability to mimic appearance of subjects in a given reference set
- Goal: personalizing text-to-image diffusion models to user-specific image generation needs
- Expand language-vision dictionary of the model to bind new words with specific subjects
- Use pre-trained Imagen model as base model
- Contributions: subject-driven generation and technique for fine-tuning text-to-image diffusion models in a few-shot setting
Related work
- Compositing Objects into Scenes is a challenging task
- Image composition techniques are limited in their ability to adapt the correct lighting, cast proper shadows, or fit the background content in a semantically aware manner
- Text-Driven Editing is an intuitive approach to image manipulation
- GANs are used for high-quality generation
- Diffusion models achieve state-of-the-art generation quality
- Text-based editing approaches are limited to highly-curated datasets
- Inversion of diffusion models is a challenging task
- Personalization is a prominent factor in various fields of Machine Learning
- MyStyle requires around 100 images to learn an adequate prior and is constrained to the face domain
- Our approach can reconstruct the identity of different types of subjects with only 3-5 casually captured images
Preliminaries
- Diffusion models are probabilistic generative models that learn a data distribution by denoising a variable sampled from a Gaussian distribution.
- Cascaded text-to-image diffusion models are used to generate high-resolution images from text.
- Noise conditioning augmentation is used to improve sample quality.
- Text-conditioning is important for visual quality and semantic fidelity.
- Text is tokenized using a tokenizer and a learned vocabulary.
- Language model is conditioned on the token identifier vector to produce an embedding.
Method
- Objective is to generate new images of a subject with high detail fidelity and variations guided by text prompts
- Examples of output variations include changing place, property, pose, expression, material, etc.
- First task is to implant subject instance into output domain of model and bind with unique identifier
- Problem of overfitting and language drift addressed with autogenous class-specific prior preservation loss
- Super-resolution components of model fine-tuned to better conserve subject details
- Pre-trained Imagen model used as base model
Representing the subject with a rare-token identifier
- Goal is to “implant” a new (key, value) pair into a diffusion model’s “dictionary”
- Text prompts are usually written by humans and sourced from large online datasets
- Opt for simpler approach and label all input images of the subject with a unique identifier and a class descriptor
- Leverage the diffusion model’s prior of the specific class and entangle it with the embedding of the subject’s unique identifier
- Naive way of constructing an identifier is to use an existing word, but this has limitations
- Approach is to find relatively rare tokens in the vocabulary and invert them into text space
- Use tokens in the T5-XXL tokenizer range of {5000, …, 10000}
Class-specific prior preservation loss
- Few-shot personalization of a diffusion model is used to generate images from a small set of images depicting the target subject
- Classic denoising loss is used with the same hyperparameters as the original diffusion model
- Overfitting and language-drift are two key issues that arise with this naive fine-tuning strategy
- Autogenous class-specific prior-preserving loss is proposed to counter both the overfitting and language-drift issues
- Training process takes about 15 minutes on one TPUv4
Personalized instance-specific super-resolution
- Text-to-image diffusion model controls for most visual semantics
- Super-resolution models are essential for photorealistic content and preserving details
- Without fine-tuning, output can contain artifacts
- Fine-tuning 64x64 → 256x256 SR model is essential
- Fine-tuning 256x256 → 1024x1024 model can benefit some subject instances
- Reducing noise augmentation during fine-tuning helps recover fine-grained details
Experiments
- Bullet Points:
- Show applications and experimental evaluation of method
- Potential text-guided semantic modifications of subject instance
- Recontextualization, modification of subject properties, combinations of animal species, original art renditions, viewpoint and expression modification
- Preserve unique visual features that give subject its identity or essence
- Reference subject’s unique identifier using [V]
- All experiments conducted using images from Unsplash
Applications
- We can generate novel images of a specific subject instance by prompting a trained model with a sentence containing a unique identifier and a class noun
- We can recontextualize a subject in different environments, with high preservation of subject details and realistic interaction between the scene and the subject
- We can generate new images of the same subject with modified expressions that were not seen in the original set of subject images
- We can generate images with specified viewpoints for a subject
- We can accessorize subjects with aesthetically pleasing results
- We can modify subject instance properties, such as color
- We can generate new instances of a subject with different classes, while preserving the subject’s identity
- We can use a prior-preservation loss to capture a wider range of poses for a subject without sacrificing subject fidelity
- We can use lower noise to train super-resolution models to improve fidelity
- We can generate the same semantic variations of unique objects with a high emphasis on preserving the subject identity