Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Presents Video-P2P, a framework for real-world video editing with cross-attention control
- Adapts an image generation diffusion model to complete various video editing tasks
- Introduces a novel decoupled-guidance strategy for attention control
- Enables various text-driven editing applications
- Works well on real-world videos for generating new characters while preserving original poses and scenes
Paper Content
Introduction
- Video creation and editing are key tasks
- Text-driven editing is a promising pipeline
- Editing local objects in a video is challenging
- This paper proposes a pipeline for video editing
- Text-driven image editing requires a model to generate target content
- Attention control is the most effective pipeline for detailed image editing
- Inverting images into latent features with a pre-trained diffusion model
- Controlling attention maps in the denoising process to edit the image
- Proposing a novel framework to show pre-trained image diffusion model can be adapted for video editing
- Using a structure on inversion and attention control for all frames
- Adopting a method to convert a T2I model into a T2S model
- Optimizing a shared unconditional embedding for all frames to align the denoising latent features with the diffusion latent features
- Proposing a decoupled-guidance strategy in attention control
Text driven editing
- Generative models have been used for image editing
- Video editing with generative models has seen advances recently
- Generative models can be used for tasks such as stylization and customization
- Proposed method allows for local editing with a diffusion model pre-trained on images
Method
- V is a real video with n frames
- Prompt-to-Prompt setting introduces source prompt P and edited prompt P* to generate edited video V*
- Video-P2P framework proposed to achieve cross-attention control in video editing
- Shared unconditional embedding optimized for video inversion
- Different guidance used for source and edited prompts, with attention maps incorporated
Video inversion
- Constructed a T2S model with 1x3x3 pattern convolution kernels and temporal attention
- Replaced self-attentions with frame-attentions
- Model processes video pair-by-pair and computes n times to obtain prediction for every frame
- Fine-tuned query projection matrices and additional temporal attention to perform noise prediction
- Used DDIM inversion to generate latent features and shared unconditional embedding for all frames
Decoupled-guidance attention control
- Existing works require an inference pipeline with both reconstruction ability and editability to perform attention control on real images.
- Video inversion allows for an inference pipeline to reconstruct the original video, but the T2S model is not as robust as T2I models.
- An initialized unconditional embedding makes the model more editable, but it cannot reconstruct perfectly.
- Algorithm 1 combines the abilities of two inference pipelines to obtain the edited video.
Experiments
Implementation details
- Developed method based on CompVis Stable Diffusion
- Sample 8 or 24 frames from video at 512x512 resolution
- Initialize model by finetuning T2S model for 500 steps
- Cross-attention replacing ratio set to 0.4, attention threshold set to 0.3
- Refinement ratio set to 0.4
- 8-frame experiments conducted on single V100 GPU, 5 minutes for initialization, 6 minutes for inversion, 1 minute for inference
Applications
- Video-P2P enables editing applications such as word swapping, prompt refinement, and attention re-weighting
- Video-P2P maintains semantic consistency and temporal coherence
- Word swapping allows for the replacement of entities while preserving unrelated regions
- Prompt refinement enables the modification of object properties
- Attention re-weighting allows for the manipulation of the extent of the corresponding generation
Comparison
- TAV+DDIM and Video-P2P both allow for video editing with text prompts
- Video-P2P can edit a local area and minimize the influence
- Video-P2P can generate temporal-consistent results where TAV+DDIM fails
- Video-P2P outperforms Dreamix in preserving details and motion consistency
- Video-P2P performs well on all metrics compared to other methods
- Video-P2P has a high preference rate compared to other methods
Ablation study
- Shared unconditional embedding improves PSNR compared to TAV+DDIM
- Using multiple unconditional embeddings increases PSNR by 0.2 but uses more parameters
- Decoupled-guidance attention control improves editing quality
Conclusion
- Proposed approach Video-P2P enables video editing locally and globally
- Leverages pre-trained image diffusion model
- Optimizes shared unconditional embedding based on T2S model
- Uses different unconditional embeddings for source and target prompts
- Integrates attention maps from two branches for improved attention control
- Applications include word swap, prompt refinement, and attention re-weighting