Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Diffusion-based generative models have been successful in text-based image generation.
- It is challenging to apply these models for real-world visual content editing, especially in videos.
- FateZero is a zero-shot text-based editing method for real-world videos that requires neither per-prompt training nor a user-specific mask.
- FateZero captures intermediate attention maps during inversion, which retain both structural and motion information.
- FateZero fuses self-attentions with a blending mask obtained by cross-attention features from the source prompt.
- FateZero reforms the self-attention mechanism in the denoising UNet by introducing spatial-temporal attention, which improves frame consistency.
- FateZero is the first to show zero-shot text-driven video style and local attribute editing from a trained text-to-image model.
Paper Content
Introduction
- Diffusion-based models can generate high-quality images and videos from text prompts.
- Previous diffusion-based editing methods mainly work on images.
- Manipulating videos through generative priors is challenging.
- There are no publicly available generic text-to-video models.
- Current editing methods use DDIM for inversion and denoising.
- Error accumulation can break the motion and structure of the original video.
- FateZero is a simple yet effective method for zero-shot video editing.
- FateZero stores self and cross-attention maps and uses attention blending to preserve original structures.
- FateZero can be used for video style editing, local editing, and object replacement.
Related work
- Video editing can be done by using an example as a style guide, but this can fail when tracking is lost
- Frame-wise image style transfer can be followed by a step that reduces temporal inconsistency, but the style may still be imperfect
- Layered-atlas-based methods show promise for local editing, but lack 3D motion perception
- Diffusion-based models can be used for object shape editing, but artifacts can still occur
- Image generation can be done with VAE, GAN, VQVAE, and transformer
- Text-to-image generation can be done with GPT, CLIP, and diffusion-based models
- Video generation is more difficult and requires larger cascaded models and datasets
- Image editing can be done with SDEdit, DiffEdit, Blended Diffusion, Plug-and-play, Pix2pix-Zero, and Prompt-to-Prompt
- Optimization can be used to improve editing ability, but frame-wise application of image methods to video can cause flickering and inconsistency
Methods
- Targets zero-shot text-driven video editing without per-video optimization
- First introduces a method that enables video appearance editing
- Then discusses the more challenging case of shape-aware video editing
- The proposed method is a general editing method that can be used with various text-to-image or text-to-video models
Preliminary: latent diffusion and inversion
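- Latent Diffusion Models run the diffusion process in the latent space of an autoencoder rather than in pixel space.
- A U-Net is trained to predict the artificial noise added to a latent, using a noise-prediction objective.
- DDIM sampling deterministically converts random noise to a clean latent; DDIM inversion runs the same update in the opposite direction, mapping a clean latent back to a noisy one.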
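The relationship between deterministic DDIM sampling and DDIM inversion can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the linear noise schedule in `make_alphas`, the step count, and `toy_eps` (a hypothetical stand-in for the trained U-Net noise predictor) are all assumptions made so the example runs on its own.

```python
import numpy as np

def make_alphas(num_steps=50, beta_start=1e-4, beta_end=2e-2):
    """Cumulative alpha products for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def toy_eps(z, level):
    """Hypothetical stand-in for the U-Net; a real model would predict the noise in z."""
    return 0.1 * z

def ddim_move(z, eps, a_from, a_to):
    """Deterministic DDIM update that moves z from noise level a_from to a_to."""
    z0_pred = (z - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * z0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(z0, alphas, eps_fn):
    """Clean latent -> noisy latent, stepping toward higher noise levels."""
    a = np.concatenate([[1.0], alphas])  # a[0] = 1 is the clean level
    z = z0
    for t in range(len(alphas)):
        z = ddim_move(z, eps_fn(z, t), a[t], a[t + 1])
    return z

def ddim_sample(z_noisy, alphas, eps_fn):
    """Noisy latent -> clean latent, stepping toward lower noise levels."""
    a = np.concatenate([[1.0], alphas])
    z = z_noisy
    for t in reversed(range(len(alphas))):
        z = ddim_move(z, eps_fn(z, t + 1), a[t + 1], a[t])
    return z

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8))          # toy "latent"
alphas = make_alphas()
z_noisy = ddim_invert(z0, alphas, toy_eps)
z_rec = ddim_sample(z_noisy, alphas, toy_eps)
print("max reconstruction error:", np.max(np.abs(z_rec - z0)))
```

Inversion evaluates the noise predictor at the current latent instead of the next one, so the round trip is only approximate. The residual is small in this toy setting, but with a real network and an edited prompt it is the kind of error that can accumulate across steps.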
FateZero video editing
- Use pretrained text-to-image model, Stable Diffusion, as base model
- Modifications made for video editing
- Inversion Attention Fusion to reduce frame inconsistency
- Attention Map Blending to prevent semantic leaks
- Spatial-Temporal Self-Attention for better temporal consistency
- Algorithm in supplementary materials
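As a rough illustration of the attention blending idea listed above, the sketch below thresholds the source-prompt cross-attention of one word into a binary mask and uses it to choose, per query position, between the self-attention stored during inversion and the one computed during editing. The function names, the max-normalization, the threshold value `tau`, and the flattened `(num_pixels, ...)` layout are assumptions for this example; the paper's full algorithm is in its supplementary material.

```python
import numpy as np

def blend_mask(cross_attn_src, word_idx, tau=0.3):
    """Binary mask over spatial positions from the source cross-attention.

    cross_attn_src: (num_pixels, num_tokens) attention stored during inversion.
    word_idx: index of the prompt token being edited.
    tau: threshold (assumed value) applied after max-normalization.
    """
    word_map = cross_attn_src[:, word_idx]
    word_map = word_map / (word_map.max() + 1e-8)
    return (word_map > tau).astype(cross_attn_src.dtype)

def fuse_self_attention(self_src, self_edit, mask):
    """Blend self-attention maps of shape (num_pixels, num_pixels).

    Query rows inside the mask take the editing-time attention;
    rows outside keep the attention stored during inversion.
    """
    m = mask[:, None]  # broadcast the per-query mask over the key axis
    return m * self_edit + (1.0 - m) * self_src
```

Because each output row is copied whole from one of the two inputs, the fused map stays row-normalized, so it can replace the original self-attention without re-applying a softmax.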
Shape-aware video editing
- Shape reforming of a specific object in a video is difficult
- No publicly-available generic video diffusion model exists
- Editing method is compared to DDIM inversion
- Editing method has better performance in terms of editing ability, motion consistency, and temporal consistency
- Motion and structure are represented by high-quality spatial-temporal attention maps during inversion and editing
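One way to picture a spatial-temporal attention map is a self-attention layer whose queries come from the current frame while keys and values come from the current frame concatenated with a reference frame. The sketch below assumes the reference is the middle frame of the clip and uses plain projection matrices `w_q`, `w_k`, `w_v`; these choices and all names are illustrative assumptions rather than the paper's exact layer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_temporal_self_attention(frames, w_q, w_k, w_v):
    """frames: (num_frames, num_tokens, dim); w_*: (dim, dim).

    Returns outputs of shape (num_frames, num_tokens, dim) and attention
    maps of shape (num_frames, num_tokens, 2 * num_tokens).
    """
    num_frames = frames.shape[0]
    ref = frames[num_frames // 2]                 # assumed reference: middle frame
    outputs, maps = [], []
    for z in frames:
        kv_in = np.concatenate([z, ref], axis=0)  # current frame + reference frame
        q = z @ w_q
        k = kv_in @ w_k
        v = kv_in @ w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        outputs.append(attn @ v)
        maps.append(attn)
    return np.stack(outputs), np.stack(maps)
```

Each row of the returned maps spans both the frame's own tokens and the reference frame's tokens, which is how a single map can carry spatial structure and cross-frame correspondence at once.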
Experiments
Implementation details
- We use a trained Stable Diffusion model directly as the base for zero-shot style and attribute editing.
- We use a pretrained video model for shape editing.
- We use videos from DAVIS and other in-the-wild videos to evaluate our approach.
- We generate the source prompt for the video using an image caption model.
- We design the target prompt for each video by replacing or adding words.
Applications
Baseline comparisons
- We build four state-of-the-art baselines for comparison
- We use the trained CLIP model for quantitative evaluation
- We measure temporal consistency with ‘Tem-Con’
- We measure frame-wise editing accuracy with ‘Frame-Acc’
- We measure editing quality, image fidelity, and temporal consistency with three user studies
- Our proposed zero-shot method achieves the best temporal consistency
- Our method preserves motion by fusion of attention during inversion
- We use cross-attention and spatial-temporal self-attention during DDIM inversion
- We propose Attention Blending Block to enhance shape editing performance
- Our framework benefits video editing using existing image diffusion models
- Limitations include difficulty in generating new motion or shape
- Future work includes testing on generic pretrained video diffusion model
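The two CLIP-based metrics named in the list above can be sketched as follows. The notes do not spell out the formulas, so the definitions here are assumptions: 'Tem-Con' is taken as the mean cosine similarity between CLIP image embeddings of consecutive frames, and 'Frame-Acc' as the fraction of edited frames whose embedding is closer to the target prompt than to the source prompt. The embeddings are assumed to be precomputed with a CLIP model, and the function names are hypothetical.

```python
import numpy as np

def _normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def tem_con(frame_emb):
    """Mean cosine similarity between consecutive frame embeddings.

    frame_emb: (num_frames, dim) CLIP image embeddings of the edited video.
    """
    e = _normalize(frame_emb)
    return float(np.mean(np.sum(e[:-1] * e[1:], axis=-1)))

def frame_acc(frame_emb, src_text_emb, tgt_text_emb):
    """Fraction of frames more similar to the target prompt than to the source.

    src_text_emb, tgt_text_emb: (dim,) CLIP text embeddings of the two prompts.
    """
    e = _normalize(frame_emb)
    sim_src = e @ _normalize(src_text_emb)
    sim_tgt = e @ _normalize(tgt_text_emb)
    return float(np.mean(sim_tgt > sim_src))
```

Under these definitions both scores lie in a bounded range and higher is better; a video with identical frames would score a perfect 'Tem-Con' regardless of edit quality, which is why the two metrics are reported together with user studies.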