Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Text-to-image models have enabled high-quality and diverse synthesis of images from a given text prompt.
  • A new approach has been developed to “personalize” text-to-image diffusion models.
  • The approach uses a few images of a subject to fine-tune a pretrained text-to-image model.
  • The approach enables synthesizing the subject in diverse scenes, poses, views, and lighting conditions.

Paper Content

Introduction

  • Problem: synthesizing instances of specific subjects in new contexts
  • Recently developed large text-to-image models enable high-quality and diverse synthesis of images based on a text prompt
  • Models lack ability to mimic appearance of subjects in a given reference set
  • Goal: personalizing text-to-image diffusion models to user-specific image generation needs
  • Expand language-vision dictionary of the model to bind new words with specific subjects
  • Use pre-trained Imagen model as base model
  • Contributions: subject-driven generation and technique for fine-tuning text-to-image diffusion models in a few-shot setting
  • Compositing Objects into Scenes is a challenging task
  • Image composition techniques are limited in their ability to adapt the correct lighting, cast proper shadows, or fit the background content in a semantically aware manner
  • Text-Driven Editing is an intuitive approach to image manipulation
  • GANs are used for high-quality generation
  • Diffusion models achieve state-of-the-art generation quality
  • Text-based editing approaches are limited to highly-curated datasets
  • Inversion of diffusion models is a challenging task
  • Personalization is a prominent factor in various fields of Machine Learning
  • MyStyle requires around 100 images to learn an adequate prior and is constrained to the face domain
  • Our approach can reconstruct the identity of different types of subjects with only 3-5 casually captured images

Preliminaries

  • Diffusion models are probabilistic generative models that learn a data distribution by denoising a variable sampled from a Gaussian distribution.
  • Cascaded text-to-image diffusion models are used to generate high-resolution images from text.
  • Noise conditioning augmentation is used to improve sample quality.
  • Text-conditioning is important for visual quality and semantic fidelity.
  • Text is tokenized using a tokenizer and a learned vocabulary.
  • Language model is conditioned on the token identifier vector to produce an embedding.

Method

  • Objective is to generate new images of a subject with high detail fidelity and variations guided by text prompts
  • Examples of output variations include changing place, property, pose, expression, material, etc.
  • First task is to implant subject instance into output domain of model and bind with unique identifier
  • Problem of overfitting and language drift addressed with autogenous class-specific prior preservation loss
  • Super-resolution components of model fine-tuned to better conserve subject details
  • Pre-trained Imagen model used as base model

Representing the subject with a rare-token identifier

  • Goal is to “implant” a new (key, value) pair into a diffusion model’s “dictionary”
  • Text prompts are usually written by humans and sourced from large online datasets
  • Opt for simpler approach and label all input images of the subject with a unique identifier and a class descriptor
  • Leverage the diffusion model’s prior of the specific class and entangle it with the embedding of the subject’s unique identifier
  • Naive way of constructing an identifier is to use an existing word, but this has limitations
  • Approach is to find relatively rare tokens in the vocabulary and invert them into text space
  • Use tokens in the T5-XXL tokenizer range of {5000, …, 10000}

Class-specific prior preservation loss

  • Few-shot personalization of a diffusion model is used to generate images from a small set of images depicting the target subject
  • Classic denoising loss is used with the same hyperparameters as the original diffusion model
  • Overfitting and language-drift are two key issues that arise with this naive fine-tuning strategy
  • Autogenous class-specific prior-preserving loss is proposed to counter both the overfitting and language-drift issues
  • Training process takes about 15 minutes on one TPUv4

Personalized instance-specific super-resolution

  • Text-to-image diffusion model controls for most visual semantics
  • Super-resolution models are essential for photorealistic content and preserving details
  • Without fine-tuning, output can contain artifacts
  • Fine-tuning 64x64 → 256x256 SR model is essential
  • Fine-tuning 256x256 → 1024x1024 model can benefit some subject instances
  • Reducing noise augmentation during fine-tuning helps recover fine-grained details

Experiments

  • Bullet Points:
  • Show applications and experimental evaluation of method
  • Potential text-guided semantic modifications of subject instance
  • Recontextualization, modification of subject properties, combinations of animal species, original art renditions, viewpoint and expression modification
  • Preserve unique visual features that give subject its identity or essence
  • Reference subject’s unique identifier using [V]
  • All experiments conducted using images from Unsplash

Applications

  • We can generate novel images of a specific subject instance by prompting a trained model with a sentence containing a unique identifier and a class noun
  • We can recontextualize a subject in different environments, with high preservation of subject details and realistic interaction between the scene and the subject
  • We can generate new images of the same subject with modified expressions that were not seen in the original set of subject images
  • We can generate images with specified viewpoints for a subject
  • We can accessorize subjects with aesthetically pleasing results
  • We can modify subject instance properties, such as color
  • We can generate new instances of a subject with different classes, while preserving the subject’s identity
  • We can use a prior-preservation loss to capture a wider range of poses for a subject without sacrificing subject fidelity
  • We can use lower noise to train super-resolution models to improve fidelity
  • We can generate the same semantic variations of unique objects with a high emphasis on preserving the subject identity