Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Diffusion models have been successful in text-to-image generation.
  • Existing methods for customizing these models have limitations.
  • Proposed approach addresses these limitations.
  • Method involves fine-tuning singular values of weight matrices.
  • Cut-Mix-Unmix data-augmentation technique enhances quality of multi-subject image generation.
  • Proposed SVDiff method has significantly smaller model size.

Paper Content

Introduction

  • Recent years have seen rapid advancement of text-to-image generative models
  • These models can generate high-quality images from text prompts
  • Researchers have investigated ways to use these models for image editing
  • Some methods allow the diffusion models to be adapted to specific tasks or user preferences
  • Limitations include large parameter space and difficulty in learning multiple personalized concepts
  • Text-to-image diffusion models have been used for image synthesis and various applications.
  • Recent advancements have explored transformer-based architectures.
  • Text-to-image synthesis has seen significant growth with the introduction of diffusion models.
  • StableDiffusion is a variant of latent diffusion models (LDMs) used in experiments.
  • LDMs transform input images into a latent code and perform denoising in the latent space.

Method

  • FSGAN is a method for adapting GANs in few-shot settings.
  • It uses SVD to learn a compact update in the GAN’s parameter space.

Compact parameter space for diffusion finetuning

  • Introduce the concept of spectral shifts from FSGAN to the parameter space of diffusion models
  • Perform Singular Value Decomposition (SVD) on the weight matrices of the pre-trained diffusion model
  • Optimize the spectral shift, which is the difference between the singular values of the updated weight matrix and the original weight matrix
  • Fine-tune using a weighted prior-preservation loss
  • Combine individually trained spectral shifts into a new model to create novel renderings

Cut-mix-unmix for multi-subject generation

  • Model tends to mix styles when rendering difficult compositions
  • Proposed Cut-Mix-Unmix technique to guide model not to mix styles
  • Cut-Mix-Unmix data augmentation applied with pre-defined probability
  • Unmix regularization on cross-attention maps to enforce separation between subjects

Single-image editing

  • CoSINE is a framework for single image editing
  • CoSINE uses a diffusion model with an image-prompt pair
  • To mitigate overfitting, CoSINE uses the spectral shift parameter space
  • CoSINE allows more flexible edits rather than exact reconstructions
  • DDIM inversion can be used to improve editing quality for edits with no significant structural changes

Experiment

  • Evaluated SVDiff on various tasks
  • Used DDIM sampler with η = 0 for generated samples

Single-subject generation

  • Proposed SVDiff for customized single-subject generation
  • Fine-tuning pretrained text-to-image diffusion model on single object or concept
  • Visual comparisons of 5 examples in Fig. 5
  • SVDiff produces similar results to DreamBooth despite smaller parameter space
  • Custom Diffusion tends to underfit training images
  • Text and image alignment in Fig. 10 shows SVDiff performance similar to DreamBooth, Custom Diffusion underfits

Multi-subject generation

  • Cut-Mix-Unmix data augmentation technique is used with a probability of 0.6
  • A user study was conducted with 400 generated image pairs
  • Results showed that SVD was favored over full weights 60.9% of the time
  • Cut-Mix-Unmix is not necessary for semantically well-separated concepts, but is necessary for semantically similar concepts

Single image editing

  • Results presented for single image editing application
  • Aim of experiment is to demonstrate that regularizing parameter space with spectral shifts mitigates language drift issue
  • Without DDIM inversion, fine-tuning with spectral shifts can lead to over-creative results
  • DDIM inversion improves editing quality and alignment with input image for non-structural edits when using spectral shift parameter space

Analysis and ablation

  • Weight combination, interpolation and style mixing are analyzed
  • Weight combination is analyzed using equation 4
  • Fig. 8 shows comparison between combining only spectral shifts and combining full weights
  • Task arithmetic property of language models holds in StableDiffusion
  • Style-mixing is demonstrated using a single fine-tuned model
  • Interpolation is demonstrated for both spectral shifts and full weights
  • Interpolation is capable of generating intermediate concepts between two original classes

Comparison with lora

  • SVDiff provides a balanced trade-off between faithfulness and realism
  • SVDiff results in a significantly smaller delta checkpoint size
  • LoRA has the flexibility to adjust its capability by changing the rank
  • SVDiff requires min(M, N ) floats for W matrix, compared to (M + N ) floats for LoRA

Conclusion and limitation

  • Proposed a compact parameter space, spectral shift, for diffusion model fine-tuning
  • Experiments show similar or better results compared to full weight fine-tuning
  • Cut-Mix-Unmix data augmentation technique improves multisubject generation
  • Spectral shift serves as a regularization method, enabling single image editing
  • Limitations include decrease in performance with more subjects and inadequate background preservation

E. analysis on spectral shifts

  • We present the results of correlation analysis of individually learned spectral shifts for each subject
  • The diagonal entries show the average cosine similarities between two runs with different learning rates
  • Similarity between conceptually similar subjects is relatively high
  • Scaling both the spectral shift and full weight delta affects the presence of personalized attributes and features
  • Scaling the weight delta influences attribute strength
  • DDIM inversion improves editing quality and alignment with input images
  • Cut-Mix-Unmix data-augmentation helps to disentangle subjects of similar categories
  • SVDiff enables successful image edits despite slight misalignment with the original image
  • Combining spectral shifts and weight deltas retains individual subject features but may mix styles for similar subjects
  • Changing coarse class word and appending “in style of” affects style transfer and mixing
  • Correlation of spectral shifts and text-and image-alignment scores are compared
  • SVDiff performs similarly as DreamBooth and preserves subject identities better than Custom Diffusion
  • LoRA tends to underfit the input images and fails to remove objects
  • Instruct-Pix2Pix tends to alter the overall color scheme and struggles with significant or structural edits
  • Cross-attention maps of the fine-tuned model without using unmix regularization show that the dog’s special token attends largely to the panda
  • Limiting rank of spectral shifts leads to limited ability to capture details in the edited samples
  • Scaling spectral shifts and weight deltas affects attribute strength and can cause deviation from the text prompt