Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Text-to-image synthesis has made great progress
- A generic approach using latent diffusion models as image priors is presented
- Feature matching and KL divergence loss are used to improve the approach
- The approach is tested on three applications: text-to-3D, StyleGAN adaptation, and layered image editing
- Results show the method is better than existing baselines
Paper Content
Introduction
- Diffusion models have shown impressive image generation capabilities.
- Text-to-image diffusion models have been applied to various visual editing and processing tasks.
- A unified approach is underexplored.
- Leverage a pretrained diffusion model as a generic image prior for various visual synthesis applications.
- Propose a feature matching loss to extract detailed information from the decoder to guide text-based visual synthesis tasks.
- Propose a KL loss to regularize the optimized latent, stabilizing the optimization process.
- Extensively evaluate method on three downstream tasks and show competitive results.
Related work
- Diffusion models generate images by denoising independent noises
- Impressive progress has been made in photorealistic and zero-shot text-to-image generation
- Diffusion model can be adapted to different conditional generation tasks
- Text-driven 3D generative models use CLIP model to guide 3D generation
- DreamFusion uses noise residual predicted by pre-trained diffusion model for backpropagation
- Image generator domain adaptation uses pre-trained generator with few-shot or text-guided zero-shot domain adaptation
- Text-driven image editing uses GANs to achieve editing of appearances while preserving the shape
- Text-to-image diffusion models have shown success in manipulation tasks
- Our method manipulates images using test-time optimization with diffusion guidance
Proposed method
- DreamFusion proposed using a diffusion model for text-driven 3D generation tasks
- Score distillation sampling involves perturbing the latent code with random noise
- Training is based on the gradient computed from the noise residual
- Jacobian NeRF and Latent NeRF adapted score distillation to the latent diffusion models
- Latent score distillation does not use the decoder, leading to inferior results
Feature matching loss
- We propose a feature matching loss to guide the differentiable renderer.
- We use the decoder of the stable diffusion autoencoder.
- We compare the features of real and synthetic images to compute the feature matching loss.
Kullback-leibler divergence regularizer.
- Optimization of latent code can cause poor quality decoded images.
- Feature matching loss is used to mitigate this issue.
- KL loss is proposed to further regularize the latent space.
Training procedure
- Latent score distillation, feature matching gradient, and KL loss are all parts of the diffusion prior.
- Latent code v is perturbed at a random time step.
- Predicted noise ε is used to derive the latent score distillation gradient.
- Feature matching loss is computed by inputting latent code v and updated latent code v + (ε − ε) into the decoder.
- Final loss is a combination of the three parts with balancing factors.
Experiments
- Used Stable-Diffusion v1.4 and v1.5 as pretrained diffusion model
- Evaluated on three applications
Applications
- Text-to-3D task aims to generate 3D model from text description
- Evaluate method on two text-guided 3D generative models
- Use feature matching loss and KL regularizer
- Compare results with two other baselines
- Use StyleGAN2 with feature matching loss and KL regularizer
- Layered image editing approach of Text2LIVE used
- Use CNN generator and additional trainable latent code
- Train parameters with combination of latent score distillation, feature matching loss and KL regularizer
- Use object mask supervision for blending alpha map
- Compare to CLIP-guided Text2LIVE and latent-score-distillation baseline
- Ablation study to evaluate effectiveness of feature matching loss and KL regularizer
- Janus problem and color over-saturation can occur