Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recent text-to-video (T2V) generation methods rely on large-scale text-video datasets for fine-tuning.
  • Humans can learn new visual concepts from a single exemplar.
  • One-Shot Video Generation uses a single text-video pair for training.
  • T2I diffusion models are adapted for T2V generation.
  • Tune-A-Video uses Sparse-Causal Attention to generate videos from text prompts.

Paper Content

Introduction

  • A large-scale multimodal dataset has enabled breakthroughs in open-domain Text-to-Image (T2I) generation
  • Recent works have extended the spatial-only T2I generation models to the spatiotemporal domain
  • Humans are capable of one-shot learning
  • Can pre-trained T2I models infer other novel videos from a single video example?
  • Intuitively, the key to video generation is to keep the continuous motion of consistent objects
  • T2I models can properly attend to verbs via cross-modal attention for static motion generation
  • Self-attention layers in T2I models are only driven by spatial similarities rather than pixel positions
  • Introduce a novel problem of One-Shot Video Generation
  • Generate videos from text prompts via an efficient one-shot tuning of pre-trained T2I diffusion models
  • Generate temporally-coherent videos with customized attributes, subjects, places, etc.

Text-to-image generation

  • DALL-E treats text-to-image generation as a sequence-to-sequence translation problem
  • Parti uses a more advanced image tokenizer and an encoder-decoder architecture
  • CogView2 uses hierarchical transformers and local parallel auto-regressive generation
  • Make-A-Scene focuses on improving scene generation controllability
  • Denoising Diffusion Probabilistic Models (DDPMs) are widely used for T2I generation
  • Recent works explore pre-trained T2I diffusion models for text-driven image editing

Text-to-video generation

  • Text-to-video (T2V) generation is a relatively new research field.
  • GODIVA is the first work to extend VQ-VAE to T2V generation.
  • CogVideo extends CogView2 to T2V generation.
  • Phenaki is the first work to generate videos from time-variable prompts.

Single video generative models

  • Single-video GANs generate videos similar to the input video.
  • These GANs require very long computation times, which makes them impractical to use.
  • Patch nearest-neighbour methods generate higher quality videos with less computation time, but are limited in generalization.
  • SinFusion adapts diffusion models to single-video tasks, but cannot produce videos of different semantic contexts.
  • Our work studies open-domain video generation, guided by text prompts, where the output can differ in appearance from the input video.

Method

  • Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) are introduced
  • Problem setting is formulated
  • Tune-A-Video approach is presented for one-shot video generation

Preliminary: diffusion models

  • Denoising Diffusion Probabilistic Models (DDPMs) are latent generative models trained to recreate a fixed forward Markov chain
  • The forward chain adds Gaussian noise step by step; the learned reverse process starts from a prior distribution and uses Gaussian transitions to regenerate the chain
  • Latent Diffusion Models (LDMs) are variants of DDPMs that operate in the latent space of an autoencoder
  • One-Shot Video Generation is a new problem for T2V generation that exploits pre-trained text-to-image (T2I) diffusion models
  • Network Inflation extends the T2I U-Net, which is built from 2D convolutional residual blocks and attention blocks, to the spatio-temporal domain
  • Sparse-Causal Attention (SC-Attn) is proposed to achieve better temporal consistency
  • SC-Attn is computationally efficient and supports autoregressive generation of long video sequences
  • One-Shot Tuning fine-tunes the inflated T2V models for One-Shot Video Generation
  • LDMs are used with a fixed image autoencoder to encode each video frame
  • DDIM sampler and classifier-free guidance are used for T2V generation
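
To make the diffusion preliminaries above more concrete, here is a minimal NumPy sketch of two standard ingredients: closed-form sampling from the forward (noising) chain, and the classifier-free guidance combination of noise predictions. It is not the paper's implementation. The linear variance schedule, its endpoint values, the latent tensor shape, and all function names are illustrative assumptions.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Variance schedule beta_1..beta_T (endpoint values are assumptions)."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_noise(x0, t, betas, rng):
    """Sample x_t from q(x_t | x_0) in closed form.

    q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I),
    where alpha_bar_t is the cumulative product of (1 - beta_s) for s <= t.
    """
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps


def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and text-conditional noise predictions."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
# Hypothetical latent video: 8 frames, 4 channels, 16x16 latents per frame.
latents = rng.standard_normal((8, 4, 16, 16))
noisy, eps = forward_noise(latents, t=999, betas=betas, rng=rng)
```

In a latent diffusion setup, `latents` would come from the fixed image autoencoder applied to each frame, and the two noise predictions passed to `classifier_free_guidance` would come from the denoising network run without and with the text prompt.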

Comparison with VDM baselines

  • VDM baselines factorize space and time by adding temporal attention after spatial attention blocks
  • The same training pipeline is used for all methods to ensure a fair comparison
  • VDM baselines with factorized space-time attention fail to generate consistent content
  • Tune-A-Video with spatio-temporal cross-frame attention maintains better temporal consistency
  • Tune-A-Video achieves a higher CLIP score and is preferred in human evaluation for video quality and text-video faithfulness
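
The summary does not spell out how the CLIP score is computed. A common reading, assumed here, is the average cosine similarity between the embedding of each generated frame and the embedding of the text prompt. The sketch below shows only that arithmetic; the embeddings are random placeholders standing in for the outputs of a CLIP image and text encoder, and the function names are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity along the last axis (inputs broadcast together)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)


def clip_style_score(frame_embeddings, text_embedding):
    """Mean frame-to-text cosine similarity.

    frame_embeddings: (num_frames, dim) array, one embedding per frame.
    text_embedding:   (dim,) array for the prompt.
    """
    sims = cosine_similarity(frame_embeddings, text_embedding[None, :])
    return float(np.mean(sims))


rng = np.random.default_rng(0)
# Placeholder embeddings; real ones would come from a CLIP model.
frame_embeddings = rng.standard_normal((8, 512))
text_embedding = rng.standard_normal(512)
score = clip_style_score(frame_embeddings, text_embedding)
```

A higher score under this definition means the frames are, on average, closer to the prompt in the shared embedding space; it says nothing about temporal consistency between frames.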

Ablation study

  • Sparse-Causal Attention (SC-Attn) and One-Shot Tuning are two key components of Tune-A-Video
  • SC-Attn captures spatio-temporal information when generating videos
  • One-Shot Tuning is capable of performing semantic mixing and has more flexibility than other methods
  • One-Shot Tuning is able to generate temporally-coherent videos with motion information
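
To illustrate how SC-Attn can mix spatial and temporal information, here is a minimal single-head NumPy sketch. It assumes that each frame forms its queries from its own tokens and takes keys and values from the first frame and the immediately preceding frame; the summary above does not state this detail, so treat it as an assumption about the method. The tensor shapes, the handling of the first frame, and the names `sparse_causal_attention`, `w_q`, `w_k`, and `w_v` are illustrative, not the authors' code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sparse_causal_attention(frames, w_q, w_k, w_v):
    """Single-head attention over (first frame, previous frame) per frame.

    frames: (T, N, C) array of T frames with N spatial tokens of width C.
    w_q, w_k, w_v: (C, D) projection matrices.
    Returns an array of shape (T, N, D).
    """
    num_frames = frames.shape[0]
    dim = w_q.shape[1]
    outputs = []
    for i in range(num_frames):
        prev = max(i - 1, 0)  # assumption: the first frame uses itself as "previous"
        context = np.concatenate([frames[0], frames[prev]], axis=0)  # (2N, C)
        q = frames[i] @ w_q   # (N, D) queries from the current frame
        k = context @ w_k     # (2N, D) keys from first and previous frames
        v = context @ w_v     # (2N, D) values from first and previous frames
        attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)  # (N, 2N)
        outputs.append(attn @ v)
    return np.stack(outputs)


rng = np.random.default_rng(0)
T, N, C, D = 5, 6, 8, 4
frames = rng.standard_normal((T, N, C))
w_q, w_k, w_v = (rng.standard_normal((C, D)) for _ in range(3))
out = sparse_causal_attention(frames, w_q, w_k, w_v)
```

Under this assumption each frame attends to 2N key/value tokens regardless of video length, whereas full spatio-temporal attention would attend to T·N tokens. Because frame i only looks at frame 0 and frame i-1, frames can also be processed one after another, which fits the autoregressive generation of longer sequences mentioned earlier.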

Conclusion

  • Introduce a new task called One-Shot Video Generation
  • Propose Tune-A-Video, a solution based on pretrained T2I diffusion models
  • Exploit properties of pretrained T2I models with Sparse-Causal Attention
  • Update the projection matrices in the attention blocks using a single training sample
  • Supports several T2V applications, including subject replacement, background change, attribute modification, style transfer
  • Tune-A-Video generates diverse and high-definition videos that are well-aligned with the motion of source videos and semantics of text prompts
  • Achieves better performance compared to CogVideo