Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos.
- Augment language model with special time tokens to predict event boundaries and textual descriptions in same output sequence.
- Leverage unlabeled narrated videos for dense video captioning by reformulating sentence boundaries of transcribed speech as pseudo event boundaries.
- Vid2Seq model pretrained on YT-Temporal-1B dataset improves state of the art on dense video captioning benchmarks.
- Generalizes well to video paragraph captioning and standard video clip captioning tasks.
Paper Content
Introduction
- Dense video captioning requires temporal localization and captioning of all events in an untrimmed video
- Standard video captioning produces a single caption for a given short video clip
- Dense captioning is more difficult and requires long-range video information
- Existing methods mostly use two-stage approaches
- This work proposes a video language model, Vid2Seq, which jointly predicts all event captions and their corresponding temporal boundaries
- Vid2Seq is pretrained on unlabeled narrated videos
- Vid2Seq improves the state of the art on multiple dense video captioning datasets
- Code and trained models will be publicly released
Related work
- Dense video captioning lies at the intersection of event localization and event captioning
- Existing methods for dense video captioning consist of a temporal localization stage followed by an event captioning stage
- Recent works jointly train the captioning and localization modules
- Wang et al. propose to view dense video captioning as a set prediction task
- Deng et al. propose to first generate a paragraph and then ground each sentence in the video
- Zhang et al. propose to generate event boundaries sequentially
- Zhu et al. perform dense video captioning by generating a single output sequence
- Recent works have explored video-text pretraining
- We propose a pretraining method that does not rely on any manual annotation
- We formulate dense event captioning as a sequence-to-sequence problem
- We cast visual localization as a language modeling task
Method
- Goal of dense video captioning is to localize and describe events in an untrimmed video
- Key challenge is to model relationships between events
- Manual collection of annotations is expensive
- Develop a unified multi-modal model to predict event boundaries and captions
- Pretraining strategy leverages cross-modal supervision from unlabeled narrated videos
Model
- We wish to design a model for dense video captioning that can capture relationships between events using visual and transcribed speech cues.
- We cast dense video captioning as a sequence-to-sequence problem where the input and output sequences contain both semantic information and temporal localization.
- We develop a multi-modal encoder-decoder architecture that takes video frames and transcribed speech as input.
- The output of the model is an event sequence that contains both textual descriptions and timestamps.
- We construct the output event sequence by augmenting a text tokenizer with special time tokens.
- We construct the input transcript sequence similarly as the event sequence.
- We use a visual encoder and a text encoder to embed the video frames and transcribed speech.
- The text decoder generates the event sequence by using the encoder embeddings.
- The text encoder and decoder are initialized with T5-Base which has been pretrained on Web text corpora.
Training
- Leverage large amount of unlabeled narrated videos to train dense event captioning model
- Pretraining method uses cross-modal supervision in readily-available narrated videos
- Finetune architecture for various downstream tasks including dense event captioning
- Narrated videos do not contain dense event captioning annotations, so use transcribed speech sentences and timestamps as supervisory signal
- Speech transcripts are not always visually grounded and often temporally misaligned
- Vid2Seq model is suitable for using weak supervision
- Pretraining on entire minutes-long videos is beneficial
- Training objectives are based on maximum likelihood objective
- Finetuning uses maximum likelihood objective based on event sequence
- Inference autoregressively generates event sequence using beam search
Experiments
- Outlined experimental setup in Section 4.1
- Presented ablation studies in Section 4
Experimental setup
- Pretraining on YT-Temporal-1B dataset
- Evaluating Vid2Seq on three downstream dense video captioning datasets
- Using Adam optimizer
- Evaluating with CIDEr, METEOR, SODA c, average precision, average recall, and F1 Score
- Default Vid2Seq model predicts text and time tokens, uses visual frames and transcribed speech as input, builds on T5-Base language model, and is pretrained on untrimmed videos from YT-Temporal-1B with both generative and denoising losses
- Pretraining task formulation uses untrimmed videos and integrates sentence boundaries of transcribed speech via time tokens
- Adding denoising loss strongly benefits model with both modalities
- Model with T5-Base outperforms its variant with T5-Small
- Pretraining on 150K narrated videos yields important benefits
- Vid2Seq sets new state of the art on all three datasets
- Model that jointly predicts event boundaries and captions localizes better and benefits more from pretraining than localization-only baseline
- Vid2Seq outperforms prior methods on YouCook2, ViTT, ActivityNet Captions, MSR-VTT, and MSVD datasets
- Vid2Seq has potential to be extended to a wide range of other video tasks