Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Existing T2V generation methods rely on large-scale text-video datasets for fine-tuning.
- Humans can learn new visual concepts from a single exemplar.
- One-Shot Video Generation uses a single text-video pair for training.
- T2I diffusion models are adapted for T2V generation.
- Tune-A-Video uses Sparse-Causal Attention to generate videos from text prompts.
Paper Content
Introduction
- A large-scale multimodal dataset has enabled breakthroughs in open-domain Text-to-Image (T2I) generation
- Recent works have extended the spatial-only T2I generation models to the spatiotemporal domain
- Humans are capable of one-shot learning
- Can pre-trained T2I models infer other novel videos from a single video example?
- Intuitively, the key to video generation is to keep the continuous motion of consistent objects
- T2I models can properly attend to verbs via cross-modal attention for static motion generation
- Self-attention layers in T2I models are only driven by spatial similarities rather than pixel positions
- Introduce a novel problem of One-Shot Video Generation
- Generate videos from text prompts via an efficient one-shot tuning of pre-trained T2I diffusion models
- Generate temporally-coherent videos with customized attributes, subjects, places, etc.
Related work
Text-to-image generation
- DALL-E treats text-to-image generation as a sequence-to-sequence translation problem
- Parti uses a more advanced image tokenizer and an encoder-decoder architecture
- CogView2 uses hierarchical transformers and local parallel auto-regressive generation
- Make-A-Scene focuses on improving scene generation controllability
- DDPMs are widely used for T2I generation
- Recent works explore pre-trained T2I diffusion models for text-driven image editing
Text-to-video generation
- Text-to-video (T2V) generation is a relatively new research field.
- GODIVA is the first work to extend VQ-VAE to T2V generation.
- CogVideo extends CogView2 to T2V generation.
- Phenaki is the first work to generate videos from time-variable prompts.
Single video generative models
- Single-video GANs generate videos similar to the input video.
- These GANs are computationally expensive, which makes them impractical to use.
- Patch nearest-neighbour methods generate higher quality videos with less computation time, but are limited in generalization.
- SinFusion adapts diffusion models to single-video tasks, but cannot produce videos of different semantic contexts.
- Our work studies open-domain video generation, guided by text prompts, where the output can differ in appearance from the input video.
Method
- Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) are introduced
- Problem setting is formulated
- Tune-A-Video approach is presented for one-shot video generation
Preliminary: diffusion models
- Denoising Diffusion Probabilistic Models (DDPMs) are latent generative models trained to recreate a fixed forward Markov chain
- DDPMs use a prior distribution and Gaussian transitions to generate the Markov chain
- Latent Diffusion Models (LDMs) are variants of DDPMs that operate in the latent space of an autoencoder
- One-Shot Video Generation is a new problem for T2V generation that exploits pre-trained text-to-image (T2I) diffusion models
- Network Inflation uses a U-Net with 2D convolutional residual blocks and attention blocks
- Sparse-Causal Attention (SC-Attn) is proposed to achieve better temporal consistency
- SC-Attn is computationally efficient and supports autoregressive generation of long video sequences
- One-Shot Tuning fine-tunes the inflated T2V models for One-Shot Video Generation
- LDMs are used with a fixed image autoencoder to encode each video frame
- DDIM sampler and classifier-free guidance are used for T2V generation
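
The Sparse-Causal Attention idea above can be illustrated with a small NumPy sketch. This summary does not spell out which frames each frame attends to; the choice below (the first frame plus the immediately preceding frame) follows one reading of the paper and should be treated as an assumption. The function and variable names are hypothetical, and the sketch uses a single head with no batch dimension.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sparse_causal_attention(frames, w_q, w_k, w_v):
    """Attend from each frame to the first frame and the previous frame.

    frames: array of shape (num_frames, num_tokens, dim), per-frame latent tokens.
    w_q, w_k, w_v: projection matrices of shape (dim, dim_head).
    Returns an array of shape (num_frames, num_tokens, dim_head).
    """
    num_frames = frames.shape[0]
    outputs = []
    for i in range(num_frames):
        prev = max(i - 1, 0)  # assumption: frame 0 uses itself as "previous"
        # Keys and values come only from two frames, never from future frames.
        context = np.concatenate([frames[0], frames[prev]], axis=0)
        q = frames[i] @ w_q   # (num_tokens, dim_head)
        k = context @ w_k     # (2 * num_tokens, dim_head)
        v = context @ w_v     # (2 * num_tokens, dim_head)
        scores = q @ k.T / np.sqrt(q.shape[-1])
        outputs.append(softmax(scores) @ v)
    return np.stack(outputs)


rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 6, 8))  # 4 frames, 6 tokens each, 8 channels
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = sparse_causal_attention(frames, w_q, w_k, w_v)
print(out.shape)  # (4, 6, 8)
```

In this sketch each frame attends to `2 * num_tokens` keys regardless of video length, whereas full spatio-temporal attention would attend to `num_frames * num_tokens` keys. Because no frame looks at later frames, frames could in principle be generated one after another.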
Comparison with VDM baselines
- VDM baselines factorize space and time by adding temporal attention after spatial attention blocks
- The same training pipeline is used for all methods to keep the comparison fair
- VDM baselines with factorized space-time attention fail to generate consistent content
- Tune-A-Video with spatio-temporal cross-frame attention maintains better temporal consistency
- Tune-A-Video produces higher CLIP score and is preferred in human evaluation for video quality and text-video faithfulness
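
A CLIP-score-style text-video metric, as mentioned above, is commonly computed as the average cosine similarity between per-frame image embeddings and the prompt embedding. The sketch below assumes the embeddings were already produced by a CLIP-style encoder (not included here); the exact evaluation protocol used in the paper is not given in this summary, so the function name and the averaging scheme are assumptions.

```python
import numpy as np


def mean_frame_text_similarity(frame_embeddings, text_embedding):
    """Average cosine similarity between each frame embedding and the prompt embedding.

    frame_embeddings: shape (num_frames, dim); text_embedding: shape (dim,).
    Both are assumed to come from the same CLIP-style encoder.
    """
    frames = np.asarray(frame_embeddings, dtype=float)
    text = np.asarray(text_embedding, dtype=float)
    # Normalize to unit length so the dot product equals cosine similarity.
    frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    text = text / np.linalg.norm(text)
    return float((frames @ text).mean())


# Toy embeddings standing in for real encoder outputs.
text = np.array([1.0, 0.0])
frames = np.array([[2.0, 0.0], [0.0, 3.0]])
score = mean_frame_text_similarity(frames, text)
print(score)  # 0.5
```

The first toy frame is perfectly aligned with the prompt (similarity 1) and the second is orthogonal to it (similarity 0), which gives the average of 0.5.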
Ablation study
- Sparse-Causal Attention (SC-Attn) and One-Shot Tuning are two key components of Tune-A-Video
- SC-Attn captures spatio-temporal information when generating videos
- One-Shot Tuning is capable of performing semantic mixing and has more flexibility than other methods
- One-Shot Tuning is able to generate temporally-coherent videos with motion information
Conclusion
- Introduce a new task called One-Shot Video Generation
- Propose Tune-A-Video, a solution based on pretrained T2I diffusion models
- Exploit properties of pretrained T2I models with Sparse-Causal Attention
- Update only the projection matrices in the attention blocks, using a single training sample
- Supports several T2V applications, including subject replacement, background change, attribute modification, style transfer
- Tune-A-Video generates diverse and high-definition videos that are well-aligned with the motion of source videos and semantics of text prompts
- Achieves better performance compared to CogVideo
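
The tuning strategy noted in the conclusion (updating only attention projection matrices while the rest of the pre-trained model stays fixed) can be sketched as a simple parameter filter. The parameter names below are hypothetical and not taken from any specific codebase, and the default of updating only query projections is an assumption of this sketch; the summary says only that projection matrices in the attention blocks are updated.

```python
def select_trainable(param_names, trainable_suffixes=("to_q.weight",)):
    """Return the parameter names that would be updated during one-shot tuning.

    Everything else is treated as frozen. The suffixes are hypothetical names
    for attention projection matrices.
    """
    return [name for name in param_names if name.endswith(trainable_suffixes)]


# Hypothetical parameter names for one block of an inflated U-Net.
param_names = [
    "down.0.resnet.conv1.weight",
    "down.0.attn1.to_q.weight",
    "down.0.attn1.to_k.weight",
    "down.0.attn1.to_v.weight",
    "down.0.attn2.to_q.weight",
]
trainable = select_trainable(param_names)
frozen = [name for name in param_names if name not in trainable]
print(trainable)
```

In a real training loop the same filter would decide which parameters receive gradients, so the convolutional weights and the remaining projections keep their pre-trained values.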