Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Text-to-image synthesis has made great progress
  • A generic approach using latent diffusion models as image priors is presented
  • Feature matching and KL divergence loss are used to improve the approach
  • The approach is tested on three applications: text-to-3D, StyleGAN adaptation, and layered image editing
  • Results show the method is better than existing baselines

Paper Content

Introduction

  • Diffusion models have shown impressive image generation capabilities.
  • Text-to-image diffusion models have been applied to various visual editing and processing tasks.
  • A unified approach that covers these tasks remains underexplored.
  • The paper leverages a pretrained diffusion model as a generic image prior for various visual synthesis applications.
  • It proposes a feature matching loss that extracts detailed information from the decoder to guide text-based visual synthesis tasks.
  • It proposes a KL loss that regularizes the optimized latent and stabilizes the optimization process.
  • The method is extensively evaluated on three downstream tasks and shows competitive results.
  • Diffusion models generate images by iteratively denoising independently sampled noise
  • Impressive progress has been made in photorealistic and zero-shot text-to-image generation
  • Diffusion models can be adapted to different conditional generation tasks
  • Text-driven 3D generative models use the CLIP model to guide 3D generation
  • DreamFusion backpropagates the noise residual predicted by a pre-trained diffusion model
  • Image generator domain adaptation starts from a pre-trained generator and adapts it in a few-shot or text-guided zero-shot manner
  • Text-driven image editing uses GANs to edit appearance while preserving shape
  • Text-to-image diffusion models have shown success in manipulation tasks
  • Our method manipulates images using test-time optimization with diffusion guidance

Proposed method

  • DreamFusion proposed using a diffusion model for text-driven 3D generation tasks
  • Score distillation sampling involves perturbing the latent code with random noise
  • Training is based on the gradient computed from the noise residual
  • Jacobian NeRF and Latent NeRF adapted score distillation to latent diffusion models
  • Latent score distillation does not use the decoder, leading to inferior results
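
The score distillation step summarized above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: `toy_denoiser` is a hypothetical stand-in for the text-conditioned noise predictor, and `alpha_bar_t` and `weight` are assumed names for the cumulative noise-schedule value and the timestep weighting.

```python
import numpy as np

def perturb_latent(z, noise, alpha_bar_t):
    """Forward-diffuse latent z: z_t = sqrt(a) * z + sqrt(1 - a) * noise."""
    return np.sqrt(alpha_bar_t) * z + np.sqrt(1.0 - alpha_bar_t) * noise

def score_distillation_grad(z, predict_noise, alpha_bar_t, rng, weight=1.0):
    """Return weight * (predicted noise - sampled noise) for latent z."""
    noise = rng.standard_normal(z.shape)          # random noise added to the latent
    z_t = perturb_latent(z, noise, alpha_bar_t)   # perturbed latent
    eps_hat = predict_noise(z_t)                  # noise predicted by the model
    return weight * (eps_hat - noise)             # noise residual used as gradient

def toy_denoiser(z_t):
    # Hypothetical placeholder; a real setup would call a pretrained U-Net here.
    return 0.5 * z_t

rng = np.random.default_rng(0)
z = np.zeros((2, 4))
grad = score_distillation_grad(z, toy_denoiser, alpha_bar_t=0.5, rng=rng)
print(grad.shape)
```

The residual has the same shape as the latent, so it can be applied to the latent directly without passing through the decoder, which is the limitation the last bullet points out.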

Feature matching loss

  • We propose a feature matching loss to guide the differentiable renderer.
  • We use the decoder of the stable diffusion autoencoder.
  • We compare the features of real and synthetic images to compute the feature matching loss.
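
As a rough illustration of feature matching, the sketch below compares intermediate activations of a toy decoder for two inputs. The decoder here is a hypothetical stack of random linear layers, not the Stable Diffusion autoencoder, and the mean L1 distance is an assumed choice of metric that the summary does not specify.

```python
import numpy as np

def decoder_features(z, weights):
    """Run a toy multi-layer decoder and collect each layer's activations."""
    feats, h = [], z
    for w in weights:
        h = np.tanh(h @ w)
        feats.append(h)
    return feats

def feature_matching_loss(z_a, z_b, weights):
    """Mean L1 distance between per-layer features of two latents."""
    feats_a = decoder_features(z_a, weights)
    feats_b = decoder_features(z_b, weights)
    per_layer = [np.abs(a - b).mean() for a, b in zip(feats_a, feats_b)]
    return sum(per_layer) / len(per_layer)

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 8)), rng.standard_normal((8, 8))]  # hypothetical layers
z = rng.standard_normal((3, 4))
print(feature_matching_loss(z, z + 0.1, weights))
```

Matching features at several depths, rather than only the final output, is what lets the decoder contribute detailed guidance.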

Kullback-Leibler divergence regularizer

  • Optimization of latent code can cause poor quality decoded images.
  • Feature matching loss is used to mitigate this issue.
  • KL loss is proposed to further regularize the latent space.
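
The summary does not give the exact form of the KL loss. A common choice, assumed here, is the closed-form KL divergence between a diagonal Gaussian over the latent and a standard normal, as used when training variational autoencoders. The parameter names `mean` and `logvar` are illustrative.

```python
import numpy as np

def kl_to_standard_normal(mean, logvar):
    """KL( N(mean, exp(logvar)) || N(0, I) ), summed over all elements."""
    return 0.5 * np.sum(mean**2 + np.exp(logvar) - 1.0 - logvar)

# A latent that already matches the standard normal incurs no penalty.
print(kl_to_standard_normal(np.zeros(4), np.zeros(4)))
# A latent whose mean drifts away from zero is penalized.
print(kl_to_standard_normal(np.ones(4), np.zeros(4)))
```

A penalty of this kind keeps the optimized latent close to the distribution the decoder saw during training, which is one plausible reading of how the regularizer stabilizes optimization.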

Training procedure

  • Latent score distillation, feature matching gradient, and KL loss are all parts of the diffusion prior.
  • Latent code v is perturbed at a random time step.
  • The predicted noise ε̂ is used to derive the latent score distillation gradient.
  • Feature matching loss is computed by inputting the latent code v and the updated latent code, v shifted by the residual between the predicted noise ε̂ and the sampled noise ε, into the decoder.
  • Final loss is a combination of the three parts with balancing factors.
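
Read as a weighted sum, the combined objective can be written as below. The symbols are assumptions introduced for illustration: the λ terms are the balancing factors, and the first term stands for the objective whose gradient is the latent score distillation gradient.

```latex
\mathcal{L}
  = \lambda_{\mathrm{SDS}}\,\mathcal{L}_{\mathrm{SDS}}
  + \lambda_{\mathrm{FM}}\,\mathcal{L}_{\mathrm{FM}}
  + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}}
```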

Experiments

  • Used Stable-Diffusion v1.4 and v1.5 as the pretrained diffusion models
  • Evaluated on three applications

Applications

  • The text-to-3D task aims to generate a 3D model from a text description
  • The method is evaluated on two text-guided 3D generative models
  • The feature matching loss and KL regularizer are applied
  • Results are compared with two other baselines
  • For StyleGAN adaptation, StyleGAN2 is used with the feature matching loss and KL regularizer
  • For layered image editing, the approach of Text2LIVE is used
  • A CNN generator and an additional trainable latent code are used
  • Parameters are trained with a combination of latent score distillation, feature matching loss, and KL regularizer
  • Object mask supervision is used for the blending alpha map
  • The method is compared to CLIP-guided Text2LIVE and a latent-score-distillation baseline
  • An ablation study evaluates the effectiveness of the feature matching loss and KL regularizer
  • The Janus problem and color over-saturation can occur