Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Presents Video-P2P, a framework for real-world video editing with cross-attention control
Adapts an image generation diffusion model to complete various video editing tasks
Introduces a novel decoupled-guidance strategy for attention control
Enables various text-driven editing applications
Works well on real-world videos for generating new characters while preserving original poses and scenes

Paper Content

Introduction

Video creation and editing are key tasks
Text-driven editing is a promising pipeline
Editing local objects in a video is challenging
This paper proposes a pipeline for video editing
Text-driven image editing requires a model to generate target content
Attention control is the most effective pipeline for detailed image editing
Inverting images into latent features with a pre-trained diffusion model
Controlling attention maps in the denoising process to edit the image
Proposing a novel framework to show pre-trained image diffusion model can be adapted for video editing
Using a structure on inversion and attention control for all frames
Adopting a method to convert a T2I model into a T2S model
Optimizing a shared unconditional embedding for all frames to align the denoising latent features with the diffusion latent features
Proposing a decoupled-guidance strategy in attention control

Text driven editing

Generative models have been used for image editing
Video editing with generative models has seen advances recently
Generative models can be used for tasks such as stylization and customization
Proposed method allows for local editing with a diffusion model pre-trained on images

Method

V is a real video with n frames
Prompt-to-Prompt setting introduces source prompt P and edited prompt P* to generate edited video V*
Video-P2P framework proposed to achieve cross-attention control in video editing
Shared unconditional embedding optimized for video inversion
Different guidance used for source and edited prompts, with attention maps incorporated

Video inversion

Constructed a T2S model with 1x3x3 pattern convolution kernels and temporal attention
Replaced self-attentions with frame-attentions
Model processes video pair-by-pair and computes n times to obtain prediction for every frame
Fine-tuned query projection matrices and additional temporal attention to perform noise prediction
Used DDIM inversion to generate latent features and shared unconditional embedding for all frames

Decoupled-guidance attention control

Existing works require an inference pipeline with both reconstruction ability and editability to perform attention control on real images.
Video inversion allows for an inference pipeline to reconstruct the original video, but the T2S model is not as robust as T2I models.
An initialized unconditional embedding makes the model more editable, but it cannot reconstruct perfectly.
Algorithm 1 combines the abilities of two inference pipelines to obtain the edited video.

Experiments

Implementation details

Developed method based on CompVis Stable Diffusion
Sample 8 or 24 frames from video at 512x512 resolution
Initialize model by finetuning T2S model for 500 steps
Cross-attention replacing ratio set to 0.4, attention threshold set to 0.3
Refinement ratio set to 0.4
8-frame experiments conducted on single V100 GPU, 5 minutes for initialization, 6 minutes for inversion, 1 minute for inference

Applications

Video-P2P enables editing applications such as word swapping, prompt refinement, and attention re-weighting
Video-P2P maintains semantic consistency and temporal coherence
Word swapping allows for the replacement of entities while preserving unrelated regions
Prompt refinement enables the modification of object properties
Attention re-weighting allows for the manipulation of the extent of the corresponding generation

Comparison

TAV+DDIM and Video-P2P both allow for video editing with text prompts
Video-P2P can edit a local area and minimize the influence
Video-P2P can generate temporal-consistent results where TAV+DDIM fails
Video-P2P outperforms Dreamix in preserving details and motion consistency
Video-P2P performs well on all metrics compared to other methods
Video-P2P has a high preference rate compared to other methods

Ablation study

Shared unconditional embedding improves PSNR compared to TAV+DDIM
Using multiple unconditional embeddings increases PSNR by 0.2 but uses more parameters
Decoupled-guidance attention control improves editing quality

Conclusion

Proposed approach Video-P2P enables video editing locally and globally
Leverages pre-trained image diffusion model
Optimizes shared unconditional embedding based on T2S model
Uses different unconditional embeddings for source and target prompts
Integrates attention maps from two branches for improved attention control
Applications include word swap, prompt refinement, and attention re-weighting

Link to paper#

Abstract#

Paper Content#

Introduction#

Text driven editing#

Method#

Video inversion#

Decoupled-guidance attention control#

Experiments#

Implementation details#

Applications#

Comparison#

Ablation study#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Text driven editing

Method

Video inversion

Decoupled-guidance attention control

Experiments

Implementation details

Applications

Comparison

Ablation study

Conclusion