Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Generative models produce high-quality images from a large-scale database
  • Propose Custom Diffusion to augment existing text-to-image models
  • Optimizing a few parameters in the text-to-image conditioning mechanism is sufficient to represent new concepts
  • Can jointly train for multiple concepts or combine multiple fine-tuned models
  • Generates variations of multiple, new concepts and composes them with existing concepts
  • Outperforms several baselines and concurrent works

Paper Content

Introduction

  • Text-to-image models can generate images of unprecedented quality
  • Users often want to generate images of personal concepts, such as family, friends, pets, or personal objects and places
  • Describing these concepts through text is difficult and unable to produce the personal concept with sufficient fidelity
  • Model customization is needed to generate these personal concepts
  • Challenges include model forgetting, overfitting, and concept mixing and omission
  • Custom Diffusion is a fine-tuning technique to overcome these challenges
  • It is computationally and memory efficient
  • It can compose multiple new concepts efficiently
  • Generative models aim to synthesize samples from a data distribution
  • Types of generative models include GANs, VAEs, autoregressive, flow-based, and diffusion models
  • Models can be conditioned on a class, image, or text prompt
  • Text-to-image models have demonstrated remarkable generalization ability
  • Aim to adapt models to become specialists in new concepts
  • Leverage generative models for image and model editing
  • Transfer learning used to produce a whole distribution of images
  • Aim to acquire multiple new concepts without catastrophic forgetting

Method

  • Updates a small subset of weights in the cross-attention layers of the model
  • Uses a regularization set of real images to prevent overfitting

Single-concept fine-tuning

  • Stable Diffusion model is built on Latent Diffusion Model
  • Images are encoded into a latent representation using VAE, Patch-GAN and LPIPS
  • Diffusion models aim to approximate the original data distribution
  • Model is trained to learn the reverse process of a fixed-length Markov chain
  • Model is conditioned on timestep and can be further conditioned on any other modality, e.g. text
  • Cross-attention layers have higher rate of change of weights compared to other layers
  • Model fine-tuning modifies latent features according to condition features
  • Regularization dataset is used to prevent language drift

Multiple-concept compositional fine-tuning

  • Joint training on multiple concepts is used
  • Different modifier tokens are used to denote target concepts
  • Weight updates are restricted to cross-attention key and value parameters
  • Constrained optimization is used to merge concepts
  • Training is done with 250-500 steps, batch size of 8 and learning rate of 8 x 10-5
  • Images are randomly resized and text prompts are adjusted accordingly

Experiments

  • Ten datasets used, consisting of two scene categories, two pets, and six objects
  • Evaluation metrics include image alignment, text alignment, KID, and a human preference study
  • Baselines include DreamBooth, Textual Inversion, and Custom Diffusion
  • 200 steps of DDPM sampler with a scale 6 used

Single-concept fine-tuning results

  • Tested fine-tuned models on challenging prompts
  • Custom Diffusion has higher text-image alignment and captures visual details of target object
  • Outperforms concurrent methods
  • Lower KID than most baselines
  • Training time of โˆผ6 minutes, 75MB storage, batch size of 8

Multiple-concept fine-tuning results

  • Generated two new concepts in the same scene
  • Compared method with DreamBooth and Textual Inversion
  • Generated mountains in background
  • Changed background scene
  • Added another object
  • Changed object property
  • Accessorized pet cat with sunglasses
  • Preserved visual similarity better than baselines

Human preference study

  • Human preference study conducted on Amazon Mechanical Turk
  • Compared to DreamBooth, Textual Inversion, and our w/ fine-tune all baseline, our method lies further along the upper right corner
  • Trade-off between high similarity to target images vs. text-alignment on new prompts
  • Our method is preferred over baselines in both single-concept and multi-concept

Ablation and applications

  • Our method uses generated images as regularization
  • We compare our method with using generated images as regularization
  • We evaluate our method without regularization dataset
  • We evaluate our method without data augmentation

Discussion and limitations

  • Proposed a method for fine-tuning large-scale text-to-image diffusion models on new concepts
  • Need only a few image examples
  • Generates novel variations of the fine-tuned concept in new contexts
  • Preserves visual similarity with target images
  • Only need to save a small subset of model weights
  • Can coherently compose multiple new concepts in the same scene
  • Difficult compositions remain challenging
  • Pretrained model also faces similar difficulty
  • Compressing model storage can reduce memory
  • Comparing with DreamBooth and Textual Inversion
  • Using generated images as regularization
  • Trade-off between text-alignment and image-alignment
  • Low FID on MS-COCO
  • Limitations of multi-concept fine-tuning
  • Initializing V* with rarely occurring token-id