Abstract
- Text-driven image and video diffusion models have achieved high generation realism.
- Few works have done text-based motion and appearance editing of general videos.
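- Our approach uses a video diffusion model to combine, at inference time, low-resolution spatio-temporal information from the original video with new, high-resolution information synthesized to align with the text prompt.
- We propose a mixed objective, jointly finetuning with full temporal attention and with temporal attention masking, to improve motion editability.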
- We introduce a new framework for image animation.
- Our method has superior performance compared to baseline methods.
Paper Content
Introduction
- Recent advancements in generative models and multimodal vision-language models have enabled large-scale text-to-image models with high realism and diversity.
- Text-based image editing methods offer text-based editing of generated and real images.
- Text-to-video models have been proposed, but few methods exist for video editing.
- Text-guided video editing requires alignment, fidelity, and quality.
- Dreamix is a new method for adapting a text-conditioned video diffusion model for video editing.
- Dreamix uses a degraded version of the original video and mixed finetuning to maintain fidelity.
- Dreamix can be used for image animation and subject-driven video generation.
Related work
Diffusion models for synthesis
- Deep diffusion models are a new paradigm for image generation.
- They outperform GANs.
- Diffusion models have multiple formulations, which EDM showed to be equivalent.
- Text-to-image generation has made progress.
- Video generation is a challenging task.
Diffusion models for editing
- Image editing with generative models has been studied extensively
- Many models are based on GANs
- Editing methods have adopted diffusion models
- Text-to-image diffusion models can be used for editing
- Finetuning and optimization can be used to personalize the model
- Text-to-video models can edit motion
- Cascaded video diffusion models can reduce computational complexity
General editing by video diffusion models
- Proposed new method for video editing
- Extended to image animation
Text-guided video editing by inverting corruptions
- We wish to edit an input video using a text prompt
- We leverage the power of a cascade of VDMs
- We corrupt the video by downsampling and adding noise
- We use the text prompt to select feasible outputs that align with the edits desired by the user
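The corruption step above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `corrupt_video`, the downsampling factors, the use of average pooling, and the variance-preserving noise formula are all hypothetical choices made for this example. Only the general idea (downsample, then add noise controlled by a scale `s` in [0, 1]) comes from the summary.

```python
import numpy as np

def corrupt_video(video, s, spatial_factor=4, temporal_factor=2, rng=None):
    """Downsample a video and add Gaussian noise.

    video: float array shaped (frames, height, width, channels).
    s: noise scale in [0, 1]; 0 keeps the downsampled video, 1 yields pure noise.
    All parameter names and defaults are illustrative assumptions.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("noise scale s must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng

    # Temporal downsampling: keep every temporal_factor-th frame.
    low = video[::temporal_factor]

    # Spatial downsampling: average over non-overlapping blocks.
    t, h, w, c = low.shape
    h2, w2 = h // spatial_factor, w // spatial_factor
    low = low[:, : h2 * spatial_factor, : w2 * spatial_factor]
    low = low.reshape(t, h2, spatial_factor, w2, spatial_factor, c).mean(axis=(2, 4))

    # Assumed variance-preserving mix of signal and i.i.d. Gaussian noise.
    noise = rng.standard_normal(low.shape)
    return np.sqrt(1.0 - s) * low + np.sqrt(s) * noise
```

In this sketch, a small `s` keeps most of the low-resolution structure of the input, while a larger `s` leaves more for the text-conditioned model to synthesize.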
Mixed video-image finetuning
- Naive method relies on corrupted version of input video
- Preliminary stage of finetuning model on input video
- Model updates prior on both motion and appearance
- Model trained on sequence of frames
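- Model trained on two objectives: the full video, and its individual frames with temporal attention masked
- Mixing in the frame objective mitigates overfitting to the original motion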
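To make the two objectives concrete, here is a toy sketch of temporal attention with and without masking, plus a weighted combination of the two losses. The function names `temporal_attention` and `mixed_objective`, the single-head attention layout, and the weight `alpha` are assumptions for illustration; the actual model architecture and loss are not specified in this summary.

```python
import numpy as np

def temporal_attention(q, k, v, mask_temporal=False):
    """Single-head attention across frames; q, k, v are shaped (frames, dim).

    With mask_temporal=True each frame attends only to itself, so the
    frames are treated as independent images (hypothetical illustration).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask_temporal:
        keep = np.eye(q.shape[0], dtype=bool)
        scores = np.where(keep, scores, -np.inf)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def mixed_objective(video_loss, frame_loss, alpha):
    """Weighted sum of the video and frame losses; alpha is the mixing weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * video_loss + (1.0 - alpha) * frame_loss
```

With masking enabled, the attention output for each frame depends only on that frame, which is why the frame objective can teach appearance without reinforcing the original motion.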
Hyperparameters
- Hyperparameters for inference time: noise scale s in range [0, 1]
- Hyperparameters for finetuning: number of steps, learning rate, mixing weight between video and frames objectives
- Qualitative and quantitative analysis of hyperparameter impact in Fig. 7 and Sec. 6.3
- Additional implementation details in Appendix A
Applications of Dreamix
- Proposed method can be used to edit motion and appearance in real-world videos
- Framework proposed for using Dreamix for single images
- Transform image into a coarse, corrupted video and edit it using Dreamix
- Simulate camera motion, such as panning and zoom
- Use Dreamix for text-conditioned video generation given an image collection
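The image-animation idea above, turning a still image into a coarse video with simulated camera motion, might look roughly like the following. This sketch only simulates a horizontal pan by sliding a crop window; the function name `image_to_coarse_video`, the frame count, and the shift size are hypothetical, and the paper's actual transformations may differ.

```python
import numpy as np

def image_to_coarse_video(image, num_frames=16, max_shift=8):
    """Simulate a left-to-right pan over a still image.

    image: array shaped (height, width, channels).
    Returns an array shaped (num_frames, height, width - max_shift, channels).
    Parameter names and defaults are illustrative assumptions.
    """
    width = image.shape[1]
    crop_w = width - max_shift
    if crop_w <= 0:
        raise ValueError("max_shift must be smaller than the image width")
    # Evenly spaced horizontal offsets for the crop window, one per frame.
    offsets = np.linspace(0, max_shift, num_frames).round().astype(int)
    return np.stack([image[:, x : x + crop_w] for x in offsets])
```

The resulting clip has plausible global motion but no new content, so it would then be corrupted and edited like any other input video.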
Experiments
Qualitative results
- Dreamix can edit videos and animate images
- Dreamix can generate new motion and control camera movements
- Dreamix can add effects, objects, and change backgrounds
- Dreamix can take an image collection and generate new videos with the subject in motion
Baseline comparisons
- Compared method to two baselines: Text-to-Video and Plug-and-Play
- Performed human-rated evaluation on dataset of 29 videos and 127 text prompts
- Results of evaluation seen in Table 2
- Success rate of each method observed
- Frame-by-frame methods like Plug-and-Play performed poorly in terms of visual quality
- Text-to-Video baseline had low fidelity
- Dreamix balanced between the three dimensions, resulting in high success rate
Ablation study
- Motion changes require high editability
- Frame-based finetuning typically outperformed video-only finetuning
- Denoising without finetuning worked well for style transfer, where finetuning was often detrimental
- Preserving fine details in background, color or texture changes required finetuning
Discussion
- Hyperparameter selection can be automated to make the method more user friendly
- Automatic evaluation metrics are imperfectly correlated with human preference
- Frequency of objects in dataset and editability can be used to determine successful pairs in advance
- The method is computationally expensive and needs to be sped up for wider applications