Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Diffusion models have been successful in text-to-image generation.
Existing methods for customizing these models are limited by a large parameter space and difficulty in learning multiple personalized concepts.
The proposed approach addresses these limitations.
The method fine-tunes the singular values of the weight matrices.
A Cut-Mix-Unmix data-augmentation technique enhances the quality of multi-subject image generation.
The proposed SVDiff method has a significantly smaller model size.

Paper Content

Introduction

Recent years have seen rapid advancement of text-to-image generative models.
These models can generate high-quality images from text prompts.
Researchers have investigated ways to use these models for image editing.
Some methods allow the diffusion models to be adapted to specific tasks or user preferences.
Limitations include a large parameter space and difficulty in learning multiple personalized concepts.

Related work

Text-to-image diffusion models have been used for image synthesis and various applications.
Recent advancements have explored transformer-based architectures.
Text-to-image synthesis has seen significant growth with the introduction of diffusion models.
StableDiffusion is a variant of latent diffusion models (LDMs) used in experiments.
LDMs transform input images into a latent code and perform denoising in the latent space.

Method

FSGAN is a method for adapting GANs in few-shot settings.
It uses SVD to learn a compact update in the GAN’s parameter space.

Compact parameter space for diffusion finetuning

Introduce the concept of spectral shifts from FSGAN to the parameter space of diffusion models
Perform Singular Value Decomposition (SVD) on the weight matrices of the pre-trained diffusion model
Optimize the spectral shift, which is the difference between the singular values of the updated weight matrix and the original weight matrix
Fine-tune using a weighted prior-preservation loss
Combine individually trained spectral shifts into a new model to create novel renderings
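
The steps above can be sketched for a single weight matrix as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names `decompose` and `apply_spectral_shift` are hypothetical, and clamping the shifted singular values at zero is an assumption of this sketch.

```python
import numpy as np

def decompose(weight):
    """Return U, sigma, Vt such that weight == U @ diag(sigma) @ Vt."""
    u, sigma, vt = np.linalg.svd(weight, full_matrices=False)
    return u, sigma, vt

def apply_spectral_shift(u, sigma, vt, delta):
    """Rebuild the weight from shifted singular values.

    Assumption of this sketch: shifted singular values are clamped at
    zero so they stay non-negative.
    """
    shifted = np.maximum(sigma + delta, 0.0)
    # Scaling the columns of U by the singular values equals U @ diag(shifted).
    return (u * shifted) @ vt

rng = np.random.default_rng(0)
weight = rng.standard_normal((6, 4))      # stand-in for a pre-trained weight
u, sigma, vt = decompose(weight)          # computed once, then kept frozen
delta = np.zeros_like(sigma)              # the only trainable parameters
restored = apply_spectral_shift(u, sigma, vt, delta)
```

With `delta` at zero the original weight is recovered, so fine-tuning starts from the pre-trained model and only `min(M, N)` numbers per matrix are learned.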

Cut-Mix-Unmix for multi-subject generation

The model tends to mix styles when rendering difficult compositions
The proposed Cut-Mix-Unmix technique guides the model not to mix styles
Cut-Mix-Unmix data augmentation is applied with a pre-defined probability
Unmix regularization is applied to the cross-attention maps to enforce separation between subjects
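
One way to picture the two ingredients above is the sketch below. It is a hedged illustration only: the left/right split, the helper names, the fallback when no mixing happens, and the squared-error penalty are assumptions of this sketch, and attention maps are assumed to have the same resolution as the mask.

```python
import numpy as np

def cut_mix(image_a, image_b):
    """Stitch the left half of image_a to the right half of image_b.

    Returns the stitched image and a boolean mask that is True where
    pixels come from image_a.
    """
    height, width = image_a.shape[:2]
    mask_a = np.zeros((height, width), dtype=bool)
    mask_a[:, : width // 2] = True
    mixed = np.where(mask_a[..., None], image_a, image_b)
    return mixed, mask_a

def maybe_cut_mix(image_a, image_b, rng, probability=0.6):
    """Apply cut_mix with the given probability, else return image_a alone."""
    if rng.random() < probability:
        return cut_mix(image_a, image_b)
    return image_a, np.ones(image_a.shape[:2], dtype=bool)

def unmix_loss(attn_a, attn_b, mask_a):
    """Penalise attention each subject token places on the other subject's region."""
    leak_a = attn_a[~mask_a]   # subject A's token attending to B's half
    leak_b = attn_b[mask_a]    # subject B's token attending to A's half
    return float(np.mean(leak_a ** 2) + np.mean(leak_b ** 2))
```

In a training loop, `unmix_loss` would be added to the denoising loss with some weight; that weight is not given here and would need tuning.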

Single-image editing

CoSINE is a framework for single image editing
CoSINE uses a diffusion model with an image-prompt pair
To mitigate overfitting, CoSINE uses the spectral shift parameter space
CoSINE allows more flexible edits rather than exact reconstructions
DDIM inversion can be used to improve editing quality for edits with no significant structural changes
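
For readers unfamiliar with DDIM inversion, the following toy sketch shows the deterministic (η = 0) update run in both directions. The noise predictor `eps_fn` and the schedule `alphas` (cumulative signal levels ordered from clean to noisy) are placeholders assumed for illustration; a real model would supply a learned, prompt-conditioned predictor.

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two noise levels (eta = 0)."""
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def ddim_invert(x0, eps_fn, alphas):
    """Walk from the clean input towards noise, reusing the sampler update."""
    x = x0
    for i in range(len(alphas) - 1):
        x = ddim_step(x, eps_fn(x, i), alphas[i], alphas[i + 1])
    return x

def ddim_sample(x_noisy, eps_fn, alphas):
    """Walk from the noisy latent back to a clean sample."""
    x = x_noisy
    for i in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_fn(x, i), alphas[i], alphas[i - 1])
    return x

alphas = np.linspace(0.999, 0.01, 10)            # hypothetical schedule
toy_eps = lambda x, step: np.full_like(x, 0.3)   # stand-in noise predictor
x0 = np.array([1.0, -2.0, 0.5])
latent = ddim_invert(x0, toy_eps, alphas)
reconstruction = ddim_sample(latent, toy_eps, alphas)
```

With this constant toy predictor the round trip is exact. With a learned predictor, inversion is only approximate, which is consistent with it working best for edits that keep the overall structure.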

Experiment

Evaluated SVDiff on various tasks
Used DDIM sampler with η = 0 for generated samples

Single-subject generation

Proposed SVDiff for customized single-subject generation
Fine-tuning pretrained text-to-image diffusion model on single object or concept
Visual comparisons of 5 examples in Fig. 5
SVDiff produces similar results to DreamBooth despite smaller parameter space
Custom Diffusion tends to underfit training images
Text and image alignment scores in Fig. 10 show SVDiff performing similarly to DreamBooth, while Custom Diffusion underfits

Multi-subject generation

Cut-Mix-Unmix data augmentation technique is used with a probability of 0.6
A user study was conducted with 400 generated image pairs
Results showed that SVDiff (spectral shifts) was favored over full-weight fine-tuning 60.9% of the time
Cut-Mix-Unmix is not necessary for semantically well-separated concepts, but is necessary for semantically similar concepts

Single image editing

Results presented for single image editing application
The aim of the experiment is to demonstrate that regularizing the parameter space with spectral shifts mitigates the language drift issue
Without DDIM inversion, fine-tuning with spectral shifts can lead to over-creative results
DDIM inversion improves editing quality and alignment with input image for non-structural edits when using spectral shift parameter space

Analysis and ablation

Weight combination, interpolation and style mixing are analyzed
Weight combination is analyzed using equation 4
Fig. 8 shows comparison between combining only spectral shifts and combining full weights
Task arithmetic property of language models holds in StableDiffusion
Style-mixing is demonstrated using a single fine-tuned model
Interpolation is demonstrated for both spectral shifts and full weights
Interpolation is capable of generating intermediate concepts between two original classes
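
Combination and interpolation of spectral shifts can be illustrated on the singular values alone. The summary does not reproduce equation 4, so the additive form below, the zero clamp, and the function names are assumptions of this sketch.

```python
import numpy as np

def combine_shifts(sigma, deltas):
    """Add several individually trained shifts onto the same singular values."""
    total = np.sum(deltas, axis=0)
    return np.maximum(sigma + total, 0.0)

def interpolate_shifts(sigma, delta_a, delta_b, alpha):
    """Blend two shifts; alpha = 1 gives subject A, alpha = 0 gives subject B."""
    delta = alpha * delta_a + (1.0 - alpha) * delta_b
    return np.maximum(sigma + delta, 0.0)

sigma = np.array([3.0, 2.0, 1.0])        # singular values of a frozen weight
delta_a = np.array([0.5, -0.5, 0.0])     # hypothetical shift for subject A
delta_b = np.array([0.0, 0.25, -2.0])    # hypothetical shift for subject B
combined = combine_shifts(sigma, [delta_a, delta_b])        # [3.5, 1.75, 0.0]
halfway = interpolate_shifts(sigma, delta_a, delta_b, 0.5)
```

The resulting singular values would then be plugged back between the frozen `U` and `V^T` factors of each layer to obtain the merged or intermediate model.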

Comparison with LoRA

SVDiff provides a balanced trade-off between faithfulness and realism
SVDiff results in a significantly smaller delta checkpoint size
LoRA has the flexibility to adjust its capability by changing the rank
For a weight matrix W of size M × N, SVDiff requires min(M, N) floats, compared to (M + N) floats for rank-1 LoRA and r(M + N) floats for rank r
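
The storage comparison is simple arithmetic, shown here for one layer. The layer size 320 × 768 is a made-up example, and the helper names are hypothetical.

```python
def svdiff_params(m, n):
    """One shift per singular value of an m x n weight matrix."""
    return min(m, n)

def lora_params(m, n, rank=1):
    """Two low-rank factors of shapes (m, rank) and (rank, n)."""
    return rank * (m + n)

m, n = 320, 768                              # hypothetical layer shape
svdiff_count = svdiff_params(m, n)           # 320
lora_rank1_count = lora_params(m, n)         # 1088
lora_rank4_count = lora_params(m, n, rank=4) # 4352
```

The gap widens as the LoRA rank grows, which is the flexibility-versus-size trade-off noted above.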

Conclusion and limitation

Proposed a compact parameter space, spectral shift, for diffusion model fine-tuning
Experiments show similar or better results compared to full weight fine-tuning
Cut-Mix-Unmix data augmentation technique improves multi-subject generation
Spectral shift serves as a regularization method, enabling single image editing
Limitations include decrease in performance with more subjects and inadequate background preservation

E. Analysis on spectral shifts

We present the results of correlation analysis of individually learned spectral shifts for each subject
The diagonal entries show the average cosine similarities between two runs with different learning rates
Similarity between conceptually similar subjects is relatively high
Scaling both the spectral shift and full weight delta affects the presence of personalized attributes and features
Scaling the weight delta influences attribute strength
DDIM inversion improves editing quality and alignment with input images
Cut-Mix-Unmix data-augmentation helps to disentangle subjects of similar categories
SVDiff enables successful image edits despite slight misalignment with the original image
Combining spectral shifts and weight deltas retains individual subject features but may mix styles for similar subjects
Changing coarse class word and appending “in style of” affects style transfer and mixing
Correlation of spectral shifts is compared with text- and image-alignment scores
SVDiff performs similarly to DreamBooth and preserves subject identities better than Custom Diffusion
LoRA tends to underfit the input images and fails to remove objects
Instruct-Pix2Pix tends to alter the overall color scheme and struggles with significant or structural edits
Cross-attention maps of the fine-tuned model without using unmix regularization show that the dog’s special token attends largely to the panda
Limiting rank of spectral shifts leads to limited ability to capture details in the edited samples
Scaling spectral shifts and weight deltas affects attribute strength and can cause deviation from the text prompt
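
The correlation analysis described at the start of this appendix can be approximated by comparing learned shifts from two training runs. In this sketch each subject's shifts are flattened into one vector and compared directly; any per-layer averaging is not reproduced, and the subject names and values are invented for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two flattened shift vectors."""
    a = np.ravel(np.asarray(a, dtype=float))
    b = np.ravel(np.asarray(b, dtype=float))
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cross_run_similarity(run_a, run_b):
    """Rows index subjects in run_a, columns the same subjects in run_b.

    Diagonal entries compare the same subject across the two runs.
    """
    names = sorted(run_a)
    matrix = np.array(
        [[cosine_similarity(run_a[row], run_b[col]) for col in names] for row in names]
    )
    return names, matrix

run_a = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}   # hypothetical shifts, run 1
run_b = {"dog": [2.0, 0.0], "cat": [1.0, 1.0]}   # hypothetical shifts, run 2
names, matrix = cross_run_similarity(run_a, run_b)
```

High off-diagonal values in such a matrix would correspond to the observation that conceptually similar subjects learn similar shifts.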
