Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Generative models produce high-quality images from a large-scale database
- Propose Custom Diffusion to augment existing text-to-image models
- Optimizing a few parameters in the text-to-image conditioning mechanism is sufficient to represent new concepts
- Can jointly train for multiple concepts or combine multiple fine-tuned models
- Generates variations of multiple, new concepts and composes them with existing concepts
- Outperforms several baselines and concurrent works
Paper Content
Introduction
- Text-to-image models can generate images of unprecedented quality
- Users often want to generate images of personal concepts, such as family, friends, pets, or personal objects and places
- Describing these concepts through text is difficult and unable to produce the personal concept with sufficient fidelity
- Model customization is needed to generate these personal concepts
- Challenges include model forgetting, overfitting, and concept mixing and omission
- Custom Diffusion is a fine-tuning technique to overcome these challenges
- It is computationally and memory efficient
- It can compose multiple new concepts efficiently
Related work
- Generative models aim to synthesize samples from a data distribution
- Types of generative models include GANs, VAEs, autoregressive, flow-based, and diffusion models
- Models can be conditioned on a class, image, or text prompt
- Text-to-image models have demonstrated remarkable generalization ability
- Aim to adapt models to become specialists in new concepts
- Leverage generative models for image and model editing
- Transfer learning used to produce a whole distribution of images
- Aim to acquire multiple new concepts without catastrophic forgetting
Method
- Updates a small subset of weights in the cross-attention layers of the model
- Uses a regularization set of real images to prevent overfitting
Single-concept fine-tuning
- Stable Diffusion model is built on Latent Diffusion Model
- Images are encoded into a latent representation using VAE, Patch-GAN and LPIPS
- Diffusion models aim to approximate the original data distribution
- Model is trained to learn the reverse process of a fixed-length Markov chain
- Model is conditioned on timestep and can be further conditioned on any other modality, e.g. text
- Cross-attention layers have higher rate of change of weights compared to other layers
- Model fine-tuning modifies latent features according to condition features
- Regularization dataset is used to prevent language drift
Multiple-concept compositional fine-tuning
- Joint training on multiple concepts is used
- Different modifier tokens are used to denote target concepts
- Weight updates are restricted to cross-attention key and value parameters
- Constrained optimization is used to merge concepts
- Training is done with 250-500 steps, batch size of 8 and learning rate of 8 x 10-5
- Images are randomly resized and text prompts are adjusted accordingly
Experiments
- Ten datasets used, consisting of two scene categories, two pets, and six objects
- Evaluation metrics include image alignment, text alignment, KID, and a human preference study
- Baselines include DreamBooth, Textual Inversion, and Custom Diffusion
- 200 steps of DDPM sampler with a scale 6 used
Single-concept fine-tuning results
- Tested fine-tuned models on challenging prompts
- Custom Diffusion has higher text-image alignment and captures visual details of target object
- Outperforms concurrent methods
- Lower KID than most baselines
- Training time of โผ6 minutes, 75MB storage, batch size of 8
Multiple-concept fine-tuning results
- Generated two new concepts in the same scene
- Compared method with DreamBooth and Textual Inversion
- Generated mountains in background
- Changed background scene
- Added another object
- Changed object property
- Accessorized pet cat with sunglasses
- Preserved visual similarity better than baselines
Human preference study
- Human preference study conducted on Amazon Mechanical Turk
- Compared to DreamBooth, Textual Inversion, and our w/ fine-tune all baseline, our method lies further along the upper right corner
- Trade-off between high similarity to target images vs. text-alignment on new prompts
- Our method is preferred over baselines in both single-concept and multi-concept
Ablation and applications
- Our method uses generated images as regularization
- We compare our method with using generated images as regularization
- We evaluate our method without regularization dataset
- We evaluate our method without data augmentation
Discussion and limitations
- Proposed a method for fine-tuning large-scale text-to-image diffusion models on new concepts
- Need only a few image examples
- Generates novel variations of the fine-tuned concept in new contexts
- Preserves visual similarity with target images
- Only need to save a small subset of model weights
- Can coherently compose multiple new concepts in the same scene
- Difficult compositions remain challenging
- Pretrained model also faces similar difficulty
- Compressing model storage can reduce memory
- Comparing with DreamBooth and Textual Inversion
- Using generated images as regularization
- Trade-off between text-alignment and image-alignment
- Low FID on MS-COCO
- Limitations of multi-concept fine-tuning
- Initializing V* with rarely occurring token-id