Abstract

  • Presents Video-P2P, a framework for real-world video editing with cross-attention control
  • Adapts an image generation diffusion model to complete various video editing tasks
  • Introduces a novel decoupled-guidance strategy for attention control
  • Enables various text-driven editing applications
  • Works well on real-world videos for generating new characters while preserving original poses and scenes

Paper Content

Introduction

  • Video creation and editing are key tasks
  • Text-driven editing is a promising pipeline
  • Editing local objects in a video is challenging
  • This paper proposes a pipeline for video editing
  • Text-driven image editing requires a model to generate target content
  • Attention control is the most effective pipeline for detailed image editing
  • Attention control first inverts images into latent features with a pre-trained diffusion model
  • It then controls the attention maps during the denoising process to edit the image
  • Proposes a novel framework showing that a pre-trained image diffusion model can be adapted for video editing
  • Uses a common structure for inversion and attention control across all frames
  • Adopts a method to convert a text-to-image (T2I) model into a text-to-set (T2S) model
  • Optimizes a shared unconditional embedding for all frames to align the denoising latent features with the diffusion latent features
  • Proposes a decoupled-guidance strategy for attention control

Text-driven editing

  • Generative models have been used for image editing
  • Video editing with generative models has seen advances recently
  • Generative models can be used for tasks such as stylization and customization
  • Proposed method allows for local editing with a diffusion model pre-trained on images

Method

  • V is a real video with n frames
  • Prompt-to-Prompt setting introduces source prompt P and edited prompt P* to generate edited video V*
  • Video-P2P framework proposed to achieve cross-attention control in video editing
  • Shared unconditional embedding optimized for video inversion
  • Different guidance used for source and edited prompts, with attention maps incorporated
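
The unconditional embedding mentioned above enters the denoising process through classifier-free guidance. This summary does not spell out the formula, so the following is a minimal sketch of the standard formulation, not the paper's code. The function name `guided_noise` is hypothetical, plain lists stand in for noise tensors, and the default scale of 7.5 is an assumption based on a value commonly used with Stable Diffusion.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    """Combine unconditional and text-conditioned noise predictions.

    eps_uncond: prediction made with the unconditional embedding.
    eps_cond:   prediction made with the prompt embedding.
    guidance_scale: assumed default; 1.0 returns the conditional prediction.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    # Standard classifier-free guidance: push away from the unconditional
    # prediction in the direction of the conditional one.
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    print(guided_noise([1.0, 2.0], [3.0, 2.0], guidance_scale=2.0))
```

Because the unconditional prediction appears in every denoising step, changing the unconditional embedding changes the whole trajectory, which is why it can be optimized for reconstruction or left at its initial value for editability.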

Video inversion

  • Constructed a T2S model by using 1x3x3 convolution kernels and adding temporal attention
  • Replaced self-attentions with frame-attentions
  • The model processes the video as frame pairs and runs n times to obtain a prediction for every frame
  • Fine-tuned the query projection matrices and the additional temporal attention to perform noise prediction
  • Used DDIM inversion to generate latent features, then optimized one unconditional embedding shared by all frames
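
The DDIM inversion step can be illustrated with a small sketch. It uses the standard deterministic DDIM update on scalar "latents"; the toy noise predictor, the noise schedule, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two cumulative-alpha noise levels."""
    # Estimate the clean latent implied by the current noise prediction.
    x0_pred = (x - math.sqrt(1.0 - alpha_from) * eps) / math.sqrt(alpha_from)
    # Re-noise that estimate to the requested level with the same prediction.
    return math.sqrt(alpha_to) * x0_pred + math.sqrt(1.0 - alpha_to) * eps


def ddim_invert(x0, alphas, eps_model):
    """Walk a clean latent toward noise; alphas run from near 1 down to near 0."""
    trajectory = [x0]
    for t in range(len(alphas) - 1):
        x = trajectory[-1]
        trajectory.append(ddim_step(x, eps_model(x, t), alphas[t], alphas[t + 1]))
    return trajectory


def ddim_sample(x_noisy, alphas, eps_model):
    """Walk a noisy latent back to the clean level with the same update."""
    x = x_noisy
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, t), alphas[t], alphas[t - 1])
    return x


if __name__ == "__main__":
    alphas = [0.9999, 0.9, 0.5, 0.1]      # hypothetical 4-level schedule
    toy_model = lambda x, t: 0.3          # hypothetical constant noise predictor
    frames = [0.2, -1.0, 0.7]             # one scalar "latent" per frame
    for frame in frames:
        noisy = ddim_invert(frame, alphas, toy_model)[-1]
        print(frame, ddim_sample(noisy, alphas, toy_model))
```

With a constant predictor the round trip is exact. A real noise predictor depends on the latent, so inversion and sampling evaluate it at different points and the reconstruction drifts; optimizing the shared unconditional embedding is how the method realigns the two trajectories.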

Decoupled-guidance attention control

  • Existing works require an inference pipeline with both reconstruction ability and editability to perform attention control on real images.
  • Video inversion allows for an inference pipeline to reconstruct the original video, but the T2S model is not as robust as T2I models.
  • An initialized unconditional embedding makes the model more editable, but it cannot reconstruct perfectly.
  • Algorithm 1 combines the abilities of two inference pipelines to obtain the edited video.
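
One way to picture how two inference pipelines can be combined is a mask-based blend: keep the reconstruction branch where the edited word receives little attention, and take the editing branch elsewhere. The sketch below is an assumption in the spirit of Prompt-to-Prompt local blending, not a transcription of Algorithm 1. The function names are hypothetical, flat lists stand in for spatial maps, and the 0.3 threshold mirrors the attention threshold listed under the implementation details.

```python
def attention_mask(attention, threshold=0.3):
    """Binary mask that is 1.0 where peak-normalized attention exceeds threshold."""
    peak = max(attention)
    if peak <= 0.0:
        return [0.0] * len(attention)
    return [1.0 if a / peak > threshold else 0.0 for a in attention]


def blend_latents(z_reconstruct, z_edit, mask):
    """Take the editing branch inside the mask and the reconstruction outside."""
    if not (len(z_reconstruct) == len(z_edit) == len(mask)):
        raise ValueError("all inputs must have the same length")
    return [m * e + (1.0 - m) * r
            for r, e, m in zip(z_reconstruct, z_edit, mask)]


if __name__ == "__main__":
    # Hypothetical cross-attention of the edited word over four positions.
    attention = [0.05, 0.9, 0.6, 0.1]
    mask = attention_mask(attention)
    print(mask)
    print(blend_latents([1.0, 1.0, 1.0, 1.0], [5.0, 5.0, 5.0, 5.0], mask))
```

The design intent is that the branch with the optimized unconditional embedding supplies faithful background content, while the branch with the initialized embedding supplies the edited region.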

Experiments

Implementation details

  • Developed method based on CompVis Stable Diffusion
  • Sample 8 or 24 frames from video at 512x512 resolution
  • Initialize model by finetuning T2S model for 500 steps
  • Cross-attention replacing ratio set to 0.4, attention threshold set to 0.3
  • Refinement ratio set to 0.4
  • 8-frame experiments run on a single V100 GPU: 5 minutes for initialization, 6 minutes for inversion, and 1 minute for inference
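
The settings above can be collected in a small configuration object. This is a sketch for orientation only: the class and field names are hypothetical, and the uniform frame-sampling helper is an assumption, since the summary does not say how the 8 or 24 frames are chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoP2PConfig:
    """Hyperparameters as listed in the implementation details."""
    num_frames: int = 8                 # 8 or 24 frames per video
    resolution: int = 512               # frames are 512x512
    tuning_steps: int = 500             # T2S fine-tuning steps
    cross_replace_ratio: float = 0.4    # cross-attention replacing ratio
    attention_threshold: float = 0.3
    refinement_ratio: float = 0.4


def sample_frame_indices(total_frames, num_frames):
    """Pick evenly spaced frame indices (assumed sampling strategy)."""
    if num_frames < 1 or num_frames > total_frames:
        raise ValueError("num_frames must be between 1 and total_frames")
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


if __name__ == "__main__":
    config = VideoP2PConfig()
    print(config)
    print(sample_frame_indices(15, config.num_frames))
```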

Applications

  • Video-P2P enables editing applications such as word swapping, prompt refinement, and attention re-weighting
  • Video-P2P maintains semantic consistency and temporal coherence
  • Word swapping allows for the replacement of entities while preserving unrelated regions
  • Prompt refinement enables the modification of object properties
  • Attention re-weighting allows for the manipulation of the extent of the corresponding generation
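
Two of these applications reduce to simple operations on cross-attention maps. The sketch below follows the general Prompt-to-Prompt idea under stated assumptions: each map is a list of rows (one per spatial position) with one weight per prompt token, the function names are hypothetical, and the replacing ratio is read as the fraction of early denoising steps in which the source maps are injected. Prompt refinement is omitted for brevity.

```python
def swap_attention(source_maps, target_maps, step, total_steps, replace_ratio=0.4):
    """Word swap: inject source maps during early steps, then use the target's own."""
    inject_until = round(replace_ratio * total_steps)
    chosen = source_maps if step < inject_until else target_maps
    return [row[:] for row in chosen]  # copy so callers can modify the result


def reweight_attention(maps, token_index, scale):
    """Attention re-weighting: scale the weights of a single prompt token."""
    return [
        [w * scale if j == token_index else w for j, w in enumerate(row)]
        for row in maps
    ]


if __name__ == "__main__":
    source = [[0.2, 0.8], [0.5, 0.5]]   # hypothetical 2 positions x 2 tokens
    target = [[0.6, 0.4], [0.1, 0.9]]
    print(swap_attention(source, target, step=0, total_steps=50))
    print(swap_attention(source, target, step=20, total_steps=50))
    print(reweight_attention(source, token_index=1, scale=2.0))
```

Injecting the source maps early fixes the layout of the scene, while releasing them later lets the new word shape its own appearance; scaling one token's weights strengthens or weakens how much that word shows up in the result.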

Comparison

  • TAV+DDIM and Video-P2P both allow for video editing with text prompts
  • Video-P2P can edit a local area while minimizing the influence on unrelated regions
  • Video-P2P can generate temporally consistent results where TAV+DDIM fails
  • Video-P2P outperforms Dreamix in preserving details and motion consistency
  • Video-P2P performs well on all metrics compared to other methods
  • Video-P2P has a high preference rate compared to other methods

Ablation study

  • Shared unconditional embedding improves PSNR compared to TAV+DDIM
  • Using multiple unconditional embeddings increases PSNR by 0.2 but uses more parameters
  • Decoupled-guidance attention control improves editing quality
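
PSNR, the reconstruction metric used in this ablation, follows the standard definition 10 * log10(MAX^2 / MSE). The sketch below computes it over flat lists of pixel values; the function name and the 8-bit default peak of 255 are assumptions, and the numbers in the example are made up rather than taken from the paper.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical inputs: no error to measure
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [0.0, 0.0, 0.0, 0.0]
    reconstructed = [25.5, 25.5, 25.5, 25.5]  # made-up values
    print(psnr(original, reconstructed))
```

Higher values mean the inverted video is closer to the original, so a gain of 0.2 corresponds to a small reduction in mean squared error.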

Conclusion

  • The proposed approach, Video-P2P, enables both local and global video editing
  • Leverages pre-trained image diffusion model
  • Optimizes shared unconditional embedding based on T2S model
  • Uses different unconditional embeddings for source and target prompts
  • Integrates attention maps from two branches for improved attention control
  • Applications include word swap, prompt refinement, and attention re-weighting