Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Diffusion-based generative models have been successful in text-based image generation.
  • It is challenging to apply these models for real-world visual content editing, especially in videos.
  • FateZero is a zero-shot text-based editing method on real-world videos without per-prompt training or use-specific mask.
  • FateZero captures intermediate attention maps during inversion, which retain both structural and motion information.
  • FateZero fuses self-attentions with a blending mask obtained by cross-attention features from the source prompt.
  • FateZero has implemented a reform of the self-attention mechanism in denoising UNet by introducing spatial-temporal attention.
  • FateZero is the first to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model.

Paper Content

Introduction

  • Diffusion-based models can generate high-quality images and videos from text prompts.
  • Previous diffusion-based editing methods mainly work on images.
  • Manipulating videos through generative priors is challenging.
  • There are no publicly available generic text-to-video models.
  • Current editing methods use DDIM for inversion and denoising.
  • Error accumulation can break the motion and structure of the original video.
  • FateZero is a simple yet effective method for zero-shot video editing.
  • FateZero stores self and cross-attention maps and uses attention blending to preserve original structures.
  • FateZero can be used for video style editing, local editing, and object replacement.
  • Video editing can be done by using example as style guide, but this can fail when track is lost
  • Image style transfer can be used to reduce temporal consistency, but style may still be imperfect
  • Layer-atlas based methods show promise for local editing, but lack 3D motion perception
  • Diffusion-based models can be used for object shape editing, but artifacts can still occur
  • Image generation can be done with VAE, GAN, VQVAE, and transformer
  • Text-to-image generation can be done with GPT, CLIP, and diffusion-based models
  • Video generation is more difficult and requires larger cascaded models and datasets
  • Image editing can be done with SDEdit, DiffEdit, Blended Diffusion, Plug-and-play, Pix2pix-Zero, and Prompt-to-Prompt
  • Optimization can be used to improve editing ability, but frame-wise application of image methods to video can cause flickering and inconsistency

Methods

  • Targets zero-shot text-driven video editing without optimization
  • Introduces method to enable video appearance editing
  • Discusses more challenging case to enable shape-aware editing of video
  • Proposed method is general editing method that can be used in various text-to-image or text-to-video models

Preliminary: latent diffusion and inversion

  • Latent Diffusion Models are used to reduce noise in an autoencoder.
  • U-Net is trained to remove artificial noise using an objective.
  • DDIM Inversion is used to convert random noise to a clean latent.

Fatezero video editing

  • Use pretrained text-to-image model, Stable Diffusion, as base model
  • Modifications made for video editing
  • Inversion Attention Fusion to reduce frame inconsistency
  • Attention Map Blending to prevent semantic leaks
  • Spatial-Temporal Self-Attention for better temporal consistency
  • Algorithm in supplementary materials

Shape-aware video editing

  • Shape reforming of a specific object in a video is difficult
  • No publicly-available generic video diffusion model exists
  • Editing method is compared to DDIM inversion
  • Editing method has better performance in terms of editing ability, motion consistency, and temporal consistency
  • Motion and structure are represented by high-quality spatial-temporal attention maps during inversion and editing

Experiments

Implementation details

  • We use a trained model as the base for zero-shot style and attribute editing.
  • We use a pretrained model for shape editing.
  • We use videos from DAVIS and other in-the-wild videos to evaluate our approach.
  • We generate the source prompt for the video using an image caption model.
  • We design the target prompt for each video by replacing or adding words.

Applications

Baseline comparisons

  • We build four state-of-the-art baselines for comparison
  • We use the trained CLIP model for quantitative evaluation
  • We measure temporal consistency with ‘Tem-Con’
  • We measure frame-wise editing accuracy with ‘Frame-Acc’
  • We measure editing quality, image fidelity, and temporal consistency with three user studies
  • Our proposed zero-shot method achieves the best temporal consistency
  • Our method preserves motion by fusion of attention during inversion
  • We use cross-attention and spatial-temporal self-attention during DDIM inversion
  • We propose Attention Blending Block to enhance shape editing performance
  • Our framework benefits video editing using existing image diffusion models
  • Limitations include difficulty in generating new motion or shape
  • Future work includes testing on generic pretrained video diffusion model