Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

T2V generation requires large-scale text-video datasets for fine-tuning.
Humans can learn new visual concepts from a single exemplar.
One-Shot Video Generation uses a single text-video pair for training.
T2I diffusion models are adapted for T2V generation.
Tune-A-Video uses Sparse-Causal Attention to generate videos from text prompts.

Paper Content

Introduction

A large-scale multimodal dataset has enabled breakthroughs in open-domain Text-to-Image (T2I) generation
Recent works have extended the spatial-only T2I generation models to the spatiotemporal domain
Human is capable of one-shot learning
Can pre-trained T2I models infer other novel videos from a single video example?
Intuitively, the key to video generation is to keep the continuous motion of consistent objects
T2I models can properly attend to verbs via cross-modal attention for static motion generation
Self-attention layers in T2I models are only driven by spatial similarities rather than pixel positions
Introduce a novel problem of One-Shot Video Generation
Generate videos from text prompts via an efficient one-shot tuning of pre-trained T2I diffusion models
Generate temporally-coherent videos with customized attributes, subjects, places, etc.

Text-to-image generation

DALL-E uses text-to-image generation as a sequence-to-sequence translation problem
Parti uses a more advanced image tokenizer and an encoder-decoder architecture
CogView2 uses hierarchical transformers and local parallel auto-regressive generation
Make-A-Scene focuses on improving scene generation controllability
DDPMs are widely used for T2I generation
Recent works explore pre-trained T2I diffusion models for text-driven image editing

Text-to-video generation

Text-to-video (T2V) generation is a relatively new research field.
GODIVA is the first work to extend VQ-VAE to T2V generation.
CogVideo extends CogView-2 to T2V generation.
Phenaki is the first work to generate videos from time variable prompts.

Single video generative models

Single-video GANs generate videos similar to the input video.
These GANs are limited in computation time and impractical to use.
Patch nearest-neighbour methods generate higher quality videos with less computation time, but are limited in generalization.
SinFusion adapts diffusion models to single-video tasks, but cannot produce videos of different semantic contexts.
Our work studies open-domain video generation of different appearance to the input video, guided by text prompts.

Method

Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) are introduced
Problem setting is formulated
Tune-A-Video approach is presented for one-shot video generation

Preliminary: diffusion models

Denoising Diffusion Probabilistic Models (DDPMs) are latent generative models trained to recreate a fixed forward Markov chain
DDPMs use a prior distribution and Gaussian transitions to generate the Markov chain
Latent Diffusion Models (LDMs) are variants of DDPMs that operate in the latent space of an autoencoder
One-Shot Video Generation is a new problem for T2V generation that exploits pre-trained text-to-image (T2I) diffusion models
Network Inflation uses a U-Net with 2D convolutional residual blocks and attention blocks
Sparse-Causal Attention (SC-Attn) is proposed to achieve better temporal consistency
SC-Attn is computationally efficient and supports autoregressive generation of long video sequences
One-Shot Tuning fine-tunes the inflated T2V models for One-Shot Video Generation
LDMs are used with a fixed image autoencoder to encode each video frame
DDIM sampler and classifier-free guidance are used for T2V generation

Comparison with vdm baselines

VDM baselines factorize space and time by adding temporal attention after spatial attention blocks
Training pipeline used for fair comparison
VDM baselines with factorized space-time attention fail to generate consistent content
Tune-A-Video with spatio-temporal cross-frame attention maintains better temporal consistency
Tune-A-Video produces higher CLIP score and is preferred in human evaluation for video quality and text-video faithfulness

Ablation study

Sparse-Causal Attention (SC-Attn) and One-Shot Tuning are two key components of Tune-A-Video
SC-Attn captures spato-temporal information when generating videos
One-Shot Tuning is capable of performing semantic mixing and has more flexibility than other methods
One-Shot Tuning is able to generate temporally-coherent videos with motion information

Conclusion

Introduce a new task called One-Shot Video Generation
Propose Tune-A-Video, a solution based on pretrained T2I diffusion models
Exploit properties of pretrained T2I models with Sparse-Causal Attention
Update projection matrices in attention block on one training sample
Supports several T2V applications, including subject replacement, background change, attribute modification, style transfer
Tune-A-Video generates diverse and high-definition videos that are well-aligned with the motion of source videos and semantics of text prompts
Achieves better performance compared to CogVideo

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Text-to-image generation#

Text-to-video generation#

Single video generative models#

Method#

Preliminary: diffusion models#

Comparison with vdm baselines#

Ablation study#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Text-to-image generation

Text-to-video generation

Single video generative models

Method

Preliminary: diffusion models

Comparison with vdm baselines

Ablation study

Conclusion