Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Diffusion models have been successful in text-to-image generation.
- Existing methods for customizing these models have limitations.
- Proposed approach addresses these limitations.
- Method involves fine-tuning singular values of weight matrices.
- Cut-Mix-Unmix data-augmentation technique enhances quality of multi-subject image generation.
- Proposed SVDiff method has significantly smaller model size.
Paper Content
Introduction
- Recent years have seen rapid advancement of text-to-image generative models
- These models can generate high-quality images from text prompts
- Researchers have investigated ways to use these models for image editing
- Some methods allow the diffusion models to be adapted to specific tasks or user preferences
- Limitations of these adaptation methods include a large parameter space and difficulty in learning multiple personalized concepts
Related work
- Text-to-image diffusion models have been used for image synthesis and various applications.
- Recent advancements have explored transformer-based architectures.
- Text-to-image synthesis has seen significant growth with the introduction of diffusion models.
- StableDiffusion is a variant of latent diffusion models (LDMs) used in experiments.
- LDMs transform input images into a latent code and perform denoising in the latent space.
Method
- FSGAN is a method for adapting GANs in few-shot settings.
- It uses SVD to learn a compact update in the GAN’s parameter space.
Compact parameter space for diffusion finetuning
- Introduce the concept of spectral shifts from FSGAN to the parameter space of diffusion models
- Perform Singular Value Decomposition (SVD) on the weight matrices of the pre-trained diffusion model
- Optimize the spectral shift, which is the difference between the singular values of the updated weight matrix and the original weight matrix
- Fine-tune using a weighted prior-preservation loss
- Combine individually trained spectral shifts into a new model to create novel renderings
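The steps above can be sketched with NumPy on a single 2-D weight matrix. This is a minimal illustration, not the authors' implementation: the function names, the clamp of shifted singular values at zero, and the element-wise sum used to combine two shifts are assumptions made here, and the random matrix only stands in for a pretrained layer.

```python
import numpy as np

def decompose(weight):
    """SVD of a 2-D weight matrix: weight = U @ diag(sigma) @ Vt."""
    u, sigma, vt = np.linalg.svd(weight, full_matrices=False)
    return u, sigma, vt

def apply_spectral_shift(u, sigma, vt, delta):
    """Rebuild the weight from shifted singular values.

    Clamping at zero is an assumption of this sketch; it keeps the
    shifted singular values non-negative.
    """
    shifted = np.maximum(sigma + delta, 0.0)
    return (u * shifted) @ vt  # scales each column of U, then projects back

def combine_shifts(*deltas):
    """Assumed combination rule: element-wise sum of separately trained shifts."""
    return np.sum(deltas, axis=0)

rng = np.random.default_rng(0)
weight = rng.normal(size=(6, 4))                      # stand-in for a pretrained layer
u, sigma, vt = decompose(weight)                      # U and Vt stay frozen

delta_a = rng.normal(scale=0.01, size=sigma.shape)    # pretend shift learned for subject A
delta_b = rng.normal(scale=0.01, size=sigma.shape)    # pretend shift learned for subject B

weight_a = apply_spectral_shift(u, sigma, vt, delta_a)
weight_ab = apply_spectral_shift(u, sigma, vt, combine_shifts(delta_a, delta_b))
```

Only `delta` would be trained in such a setup, so the per-matrix update has as many entries as there are singular values.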
Cut-Mix-Unmix for multi-subject generation
- Model tends to mix styles when rendering difficult compositions
- Proposed Cut-Mix-Unmix technique to guide model not to mix styles
- Cut-Mix-Unmix data augmentation applied with pre-defined probability
- Unmix regularization on cross-attention maps to enforce separation between subjects
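A rough sketch of the data-augmentation half of this idea follows. The left/right stitching, the prompt template, the placeholder tokens `[V1]` and `[V2]`, and all function names are illustrative assumptions; the unmix regularization on cross-attention maps is not shown, although the returned masks are the kind of per-subject region information it would need.

```python
import random
import numpy as np

def cut_mix(image_a, image_b, name_a, name_b):
    """Stitch the left half of image_a to the right half of image_b.

    Both images must share the same (height, width, channels) shape.
    Returns the stitched image, an assumed prompt, and one boolean mask
    per subject marking the half it occupies.
    """
    height, width, _ = image_a.shape
    half = width // 2
    mixed = np.concatenate([image_a[:, :half], image_b[:, half:]], axis=1)
    mask_a = np.zeros((height, width), dtype=bool)
    mask_a[:, :half] = True
    prompt = f"photo of a {name_a} on the left and a {name_b} on the right"
    return mixed, prompt, (mask_a, ~mask_a)

def sample_training_example(image_a, image_b, name_a, name_b, p_mix, rng):
    """With probability p_mix return a stitched pair, otherwise a single-subject sample."""
    if rng.random() < p_mix:
        return cut_mix(image_a, image_b, name_a, name_b)
    full = np.ones(image_a.shape[:2], dtype=bool)
    return image_a, f"photo of a {name_a}", (full,)

rng = random.Random(0)
dog = np.zeros((4, 8, 3))   # placeholder images of equal size
cat = np.ones((4, 8, 3))
mixed, prompt, masks = sample_training_example(
    dog, cat, "[V1] dog", "[V2] cat", p_mix=0.6, rng=rng
)
```

Naming the position of each subject in the prompt is what lets the stitched layout be "unmixed" again at inference time by simply omitting the positional phrases.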
Single-image editing
- CoSINE is a framework for single image editing
- CoSINE fine-tunes a diffusion model on a single image-prompt pair
- To mitigate overfitting, CoSINE uses the spectral shift parameter space
- CoSINE allows more flexible edits rather than exact reconstructions
- DDIM inversion can be used to improve editing quality for edits with no significant structural changes
Experiment
- Evaluated SVDiff on various tasks
- Used DDIM sampler with η = 0 for generated samples
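For readers unfamiliar with the η = 0 setting, the sketch below shows one deterministic DDIM update in its standard textbook form. The function name and arguments are chosen here for illustration, and the noise prediction `eps` is passed in by the caller rather than produced by a real model. Run from a noisier level to a cleaner one it is a sampling step; run in the opposite direction with the same formula it is the usual approximation behind DDIM inversion.

```python
import numpy as np

def ddim_step(x, eps, alpha_bar_from, alpha_bar_to):
    """Deterministic DDIM update (eta = 0) between two noise levels.

    x              : latent at the level whose cumulative alpha is alpha_bar_from
    eps            : noise predicted for x (supplied by the caller in this sketch)
    alpha_bar_from : cumulative alpha of the current level, in (0, 1]
    alpha_bar_to   : cumulative alpha of the target level, in (0, 1]
    """
    # Estimate the clean latent implied by x and the predicted noise.
    x0_pred = (x - np.sqrt(1.0 - alpha_bar_from) * eps) / np.sqrt(alpha_bar_from)
    # Re-noise that estimate to the target level using the same noise direction.
    return np.sqrt(alpha_bar_to) * x0_pred + np.sqrt(1.0 - alpha_bar_to) * eps

rng = np.random.default_rng(0)
x_t = rng.normal(size=(4, 4))
eps = rng.normal(size=(4, 4))

x_cleaner = ddim_step(x_t, eps, alpha_bar_from=0.5, alpha_bar_to=0.8)      # sampling direction
x_back = ddim_step(x_cleaner, eps, alpha_bar_from=0.8, alpha_bar_to=0.5)   # inversion direction
```

With no random term in the update, the same starting latent always yields the same sample, which is what makes inverting an input image into a latent meaningful.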
Single-subject generation
- Proposed SVDiff for customized single-subject generation
- Fine-tuning pretrained text-to-image diffusion model on single object or concept
- Visual comparisons of 5 examples in Fig. 5
- SVDiff produces similar results to DreamBooth despite smaller parameter space
- Custom Diffusion tends to underfit training images
- Text- and image-alignment scores in Fig. 10 show SVDiff performing similarly to DreamBooth, while Custom Diffusion underfits
Multi-subject generation
- Cut-Mix-Unmix data augmentation technique is used with a probability of 0.6
- A user study was conducted with 400 generated image pairs
- Results showed that SVDiff was favored over full-weight fine-tuning 60.9% of the time
- Cut-Mix-Unmix is not necessary for semantically well-separated concepts, but is necessary for semantically similar concepts
Single-image editing
- Results presented for single image editing application
- The aim of the experiment is to demonstrate that regularizing the parameter space with spectral shifts mitigates the language drift issue
- Without DDIM inversion, fine-tuning with spectral shifts can lead to over-creative results
- DDIM inversion improves editing quality and alignment with input image for non-structural edits when using spectral shift parameter space
Analysis and ablation
- Weight combination, interpolation and style mixing are analyzed
- Weight combination is analyzed using equation 4
- Fig. 8 shows comparison between combining only spectral shifts and combining full weights
- Task arithmetic property of language models holds in StableDiffusion
- Style-mixing is demonstrated using a single fine-tuned model
- Interpolation is demonstrated for both spectral shifts and full weights
- Interpolation is capable of generating intermediate concepts between two original classes
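Interpolation between two fine-tuned models can be pictured as a convex blend of their updates. The sketch below assumes a simple linear rule with a mixing weight `alpha`; the function name, the convention for which endpoint `alpha = 0` selects, and the toy vectors are illustrative choices, and the same rule would apply to full weight deltas in place of spectral shifts.

```python
import numpy as np

def interpolate_shifts(delta_a, delta_b, alpha):
    """Linear blend of two updates: alpha = 0 gives delta_a, alpha = 1 gives delta_b."""
    delta_a = np.asarray(delta_a, dtype=float)
    delta_b = np.asarray(delta_b, dtype=float)
    return (1.0 - alpha) * delta_a + alpha * delta_b

delta_a = np.array([0.02, -0.01, 0.00])   # toy shift for concept A
delta_b = np.array([-0.01, 0.03, 0.01])   # toy shift for concept B

# Sweep from concept A to concept B in five steps.
sweep = [interpolate_shifts(delta_a, delta_b, a) for a in np.linspace(0.0, 1.0, 5)]
```

Each blended vector would then be applied to the frozen pretrained weights before sampling, giving one image per step of the sweep.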
Comparison with LoRA
- SVDiff provides a balanced trade-off between faithfulness and realism
- SVDiff results in a significantly smaller delta checkpoint size
- LoRA has the flexibility to adjust its capability by changing the rank
- For an M × N weight matrix W, SVDiff requires min(M, N) floats, compared to r(M + N) floats for a rank-r LoRA, i.e. (M + N) at rank 1
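The storage comparison can be checked with a few lines of arithmetic. The helper names and the example layer shape are hypothetical; the LoRA count assumes the usual two low-rank factors of shapes M × r and r × N.

```python
def svdiff_param_count(m, n):
    """One trainable float per singular value of an m x n matrix."""
    return min(m, n)

def lora_param_count(m, n, rank=1):
    """Two low-rank factors of shapes (m, rank) and (rank, n)."""
    return rank * (m + n)

m, n = 320, 768  # hypothetical layer shape, chosen only for illustration
print(svdiff_param_count(m, n))        # 320
print(lora_param_count(m, n, rank=1))  # 1088
print(lora_param_count(m, n, rank=4))  # 4352
```

Raising the LoRA rank increases capacity and storage together, whereas the spectral-shift count is fixed by the matrix shape.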
Conclusion and limitation
- Proposed a compact parameter space, spectral shift, for diffusion model fine-tuning
- Experiments show similar or better results compared to full weight fine-tuning
- Cut-Mix-Unmix data augmentation technique improves multi-subject generation
- Spectral shift serves as a regularization method, enabling single image editing
- Limitations include decrease in performance with more subjects and inadequate background preservation
E. Analysis on spectral shifts
- We present the results of correlation analysis of individually learned spectral shifts for each subject
- The diagonal entries show the average cosine similarities between two runs with different learning rates
- Similarity between conceptually similar subjects is relatively high
- Scaling both the spectral shift and full weight delta affects the presence of personalized attributes and features
- Scaling the weight delta influences attribute strength
- DDIM inversion improves editing quality and alignment with input images
- Cut-Mix-Unmix data-augmentation helps to disentangle subjects of similar categories
- SVDiff enables successful image edits despite slight misalignment with the original image
- Combining spectral shifts and weight deltas retains individual subject features but may mix styles for similar subjects
- Changing coarse class word and appending “in style of” affects style transfer and mixing
- Correlations of spectral shifts and text- and image-alignment scores are compared
- SVDiff performs similarly to DreamBooth and preserves subject identities better than Custom Diffusion
- LoRA tends to underfit the input images and fails to remove objects
- Instruct-Pix2Pix tends to alter the overall color scheme and struggles with significant or structural edits
- Cross-attention maps of the fine-tuned model without using unmix regularization show that the dog’s special token attends largely to the panda
- Limiting rank of spectral shifts leads to limited ability to capture details in the edited samples
- Scaling spectral shifts and weight deltas affects attribute strength and can cause deviation from the text prompt
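The correlation analysis mentioned at the top of this list can be pictured as a matrix of cosine similarities between flattened spectral shifts. The sketch below uses made-up subject names and toy vectors; unlike the analysis described above, it compares a single run per subject, so its diagonal is trivially 1 rather than an average across runs with different learning rates.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two flattened, non-zero vectors."""
    a, b = np.ravel(a).astype(float), np.ravel(b).astype(float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_matrix(shifts):
    """Pairwise cosine similarities for a dict mapping subject name to shift vector."""
    names = list(shifts)
    matrix = np.array(
        [[cosine_similarity(shifts[p], shifts[q]) for q in names] for p in names]
    )
    return names, matrix

# Toy shifts: the two "dog" vectors point the same way, the "teapot" is orthogonal.
shifts = {
    "dog_1": [1.0, 0.0, 0.0],
    "dog_2": [2.0, 0.0, 0.0],
    "teapot": [0.0, 1.0, 0.0],
}
names, matrix = similarity_matrix(shifts)
```

In practice each vector would be the concatenation of the learned shifts from every decomposed layer of one fine-tuned model.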