Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Text-driven image and video diffusion models have achieved high generation realism.
  • Few works have done text-based motion and appearance editing of general videos.
  • Our approach combines low-resolution information from the original video with new, high-resolution information.
  • We propose a mixed objective to improve motion editability.
  • We introduce a new framework for image animation.
  • Our method has superior performance compared to baseline methods.

Paper Content

Introduction

  • Recent advancements in generative models and multimodal vision-language models have enabled large-scale text-to-image models with high realism and diversity.
  • Text-based image editing methods offer text-based editing of generated and real images.
  • Text-to-video models have been proposed, but few methods exist for video editing.
  • Text-guided video editing requires alignment, fidelity, and quality.
  • Dreamix is a new method for adapting a text-conditioned video diffusion model for video editing.
  • Dreamix uses a degraded version of the original video and mixed finetuning to maintain fidelity.
  • Dreamix can be used for image animation and subject-driven video generation.

Diffusion models for synthesis

  • Deep diffusion models are a new paradigm for image generation.
  • They outperform GANs.
  • Diffusion models have multiple formulations; EDM showed that these formulations are equivalent.
  • Text-to-image generation has made progress.
  • Video generation is a challenging task.

Diffusion models for editing

  • Image editing with generative models has been studied extensively
  • Many models are based on GANs
  • Editing methods have adopted diffusion models
  • Text-to-image diffusion models can be used for editing
  • Finetuning and optimization can be used to personalize the model
  • Text-to-video models can edit motion
  • Cascaded video diffusion models can reduce computational complexity

General editing by video diffusion models

  • Proposed new method for video editing
  • Extended to image animation

Text-guided video editing by inverting corruptions

  • We wish to edit an input video using a text prompt
  • We leverage the power of a cascade of VDMs
  • We corrupt the video by downsampling and adding noise
  • We use the text prompt to select feasible outputs that align with the edits desired by the user
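
The corruption step above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `corrupt_video`, the downsampling factors, the average-pooling choice, and the variance-preserving noise mix are all assumptions made for the example. Only the noise scale s in [0, 1] comes from the summary.

```python
import numpy as np

def corrupt_video(video, s, spatial_factor=4, temporal_factor=2, rng=None):
    """Downsample a video, then mix in Gaussian noise of strength s in [0, 1].

    video: float array of shape (frames, height, width, channels).
    All names and default factors here are hypothetical.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("noise scale s must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng

    # Temporal downsampling: keep every temporal_factor-th frame.
    low = video[::temporal_factor]

    # Spatial downsampling: average over non-overlapping square blocks.
    t, h, w, c = low.shape
    h2, w2 = h // spatial_factor, w // spatial_factor
    low = low[:, :h2 * spatial_factor, :w2 * spatial_factor]
    low = low.reshape(t, h2, spatial_factor, w2, spatial_factor, c).mean(axis=(2, 4))

    # Assumed variance-preserving mix: s=0 keeps the low-res video, s=1 is pure noise.
    noise = rng.standard_normal(low.shape)
    return np.sqrt(1.0 - s) * low + np.sqrt(s) * noise
```

In this sketch, a small s keeps most of the low-resolution structure of the input, while a larger s discards more of it and leaves more room for the text prompt to steer the result.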

Mixed video-image finetuning

  • Naive method relies on corrupted version of input video
  • Preliminary stage of finetuning model on input video
  • Model updates prior on both motion and appearance
  • Model trained on sequence of frames
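  • Model trained on two objectives: one on the full video and one on its individual frames
  • The mixed objective mitigates overfitting to the input video and improves motion editability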
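
One way to picture the two-objective finetuning is as a weighted sum of a video loss and a frames loss. The sketch below is an assumption-laden illustration: the names `denoising_loss`, `mixed_finetuning_loss`, and `alpha`, the mean-squared-error form, and the convex combination are all hypothetical. The summary only states that there is a mixing weight between the video and frames objectives.

```python
import numpy as np

def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise (assumed loss form)."""
    return float(np.mean((predicted_noise - true_noise) ** 2))

def mixed_finetuning_loss(video_loss, frame_loss, alpha):
    """Blend the video objective and the frames objective.

    alpha is a hypothetical mixing weight: alpha=1 uses only the video
    objective, alpha=0 uses only the frames objective.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * video_loss + (1.0 - alpha) * frame_loss
```

Under this reading, lowering `alpha` shifts finetuning toward appearance learned from individual frames, and raising it shifts finetuning toward the motion of the original video.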

Hyperparameters

  • Hyperparameters for inference time: noise scale s in range [0, 1]
  • Hyperparameters for finetuning: number of steps, learning rate, mixing weight between video and frames objectives
  • Qualitative and quantitative analysis of hyperparameter impact in Fig. 7 and Sec. 6.3
  • Additional implementation details in Appendix A

Applications of Dreamix

  • Proposed method can be used to edit motion and appearance in real-world videos
  • Framework proposed for using Dreamix for single images
  • Transform image into a coarse, corrupted video and edit it using Dreamix
  • Simulate camera motion, such as panning and zoom
  • Use Dreamix for text-conditioned video generation given an image collection
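
The image-to-coarse-video step with simulated camera motion can be illustrated with a simple horizontal pan. This is a simplified stand-in, assuming a sliding crop window instead of whatever transformations the paper actually uses; `image_to_pan_video` and its parameters are hypothetical names.

```python
import numpy as np

def image_to_pan_video(image, num_frames, crop_size):
    """Turn one image into a coarse video by sliding a crop window left to right.

    image: array of shape (height, width, channels).
    crop_size: (crop_height, crop_width) of each output frame.
    """
    h, w, _ = image.shape
    crop_h, crop_w = crop_size
    if crop_h > h or crop_w > w:
        raise ValueError("crop must fit inside the image")

    # Evenly spaced horizontal offsets from the left edge to the right edge.
    offsets = np.linspace(0, w - crop_w, num_frames).round().astype(int)
    top = (h - crop_h) // 2  # keep the window vertically centered

    frames = [image[top:top + crop_h, x:x + crop_w] for x in offsets]
    return np.stack(frames)
```

The resulting array has shape (num_frames, crop_height, crop_width, channels) and could then be corrupted and edited like any other input video.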

Experiments

Qualitative results

  • Dreamix can edit videos and animate images
  • Dreamix can generate new motion and control camera movements
  • Dreamix can add effects, objects, and change backgrounds
  • Dreamix can take an image collection and generate new videos with the subject in motion

Baseline comparisons

  • Compared method to two baselines: Text-to-Video and Plug-and-Play
  • Performed human-rated evaluation on dataset of 29 videos and 127 text prompts
  • Results of evaluation seen in Table 2
  • Success rate of each method observed
  • Frame-by-frame methods like Plug-and-Play performed poorly in terms of visual quality
  • Text-to-Video baseline had low fidelity
  • Dreamix balanced the three dimensions (alignment, fidelity, and quality), resulting in a high success rate

Ablation study

  • Motion changes require high editability
  • Frame-based finetuning typically outperformed video-only finetuning
  • Denoising without finetuning worked well for style transfer, where finetuning was often detrimental
  • Edits that must preserve fine details, such as background, color, or texture changes, required finetuning

Discussion

  • Hyperparameter selection can be automated to make the method more user friendly
  • Automatic evaluation metrics are imperfectly correlated with human preference
  • Frequency of objects in the dataset and editability can be used to determine in advance which video-prompt pairs are likely to succeed
  • Computationally expensive, needs to be sped up for wider applications

Conclusion