Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos.
Augment language model with special time tokens to predict event boundaries and textual descriptions in same output sequence.
Leverage unlabeled narrated videos for dense video captioning by reformulating sentence boundaries of transcribed speech as pseudo event boundaries.
Vid2Seq model pretrained on YT-Temporal-1B dataset improves state of the art on dense video captioning benchmarks.
Generalizes well to video paragraph captioning and standard video clip captioning tasks.

Dense video captioning requires temporal localization and captioning of all events in an untrimmed video
Standard video captioning produces a single caption for a given short video clip
Dense captioning is more difficult and requires long-range video information
Existing methods mostly use two-stage approaches
This work proposes a video language model, Vid2Seq, which jointly predicts all event captions and their corresponding temporal boundaries
Vid2Seq is pretrained on unlabeled narrated videos
Vid2Seq improves the state of the art on multiple dense video captioning datasets
Code and trained models will be publicly released

Dense video captioning lies at the intersection of event localization and event captioning
Existing methods for dense video captioning consist of a temporal localization stage followed by an event captioning stage
Recent works jointly train the captioning and localization modules
Wang et al. propose to view dense video captioning as a set prediction task
Deng et al. propose to first generate a paragraph and then ground each sentence in the video
Zhang et al. propose to generate event boundaries sequentially
Zhu et al. perform dense video captioning by generating a single output sequence
Recent works have explored video-text pretraining
We propose a pretraining method that does not rely on any manual annotation
We formulate dense event captioning as a sequence-to-sequence problem
We cast visual localization as a language modeling task

Goal of dense video captioning is to localize and describe events in an untrimmed video
Key challenge is to model relationships between events
Manual collection of annotations is expensive
Develop a unified multi-modal model to predict event boundaries and captions
Pretraining strategy leverages cross-modal supervision from unlabeled narrated videos

We wish to design a model for dense video captioning that can capture relationships between events using visual and transcribed speech cues.
We cast dense video captioning as a sequence-to-sequence problem where the input and output sequences contain both semantic information and temporal localization.
We develop a multi-modal encoder-decoder architecture that takes video frames and transcribed speech as input.
The output of the model is an event sequence that contains both textual descriptions and timestamps.
We construct the output event sequence by augmenting a text tokenizer with special time tokens.
We construct the input transcript sequence in the same way as the event sequence.
We use a visual encoder and a text encoder to embed the video frames and transcribed speech.
The text decoder generates the event sequence by using the encoder embeddings.
The text encoder and decoder are initialized with T5-Base, which has been pretrained on web text corpora.
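
The sequence construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the number of time bins (100), the `<time=k>` token spelling, the whitespace split standing in for a subword tokenizer, and the function names are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumption: these notes do not state how many time tokens are used.
NUM_TIME_BINS = 100


def time_token(t: float, duration: float, num_bins: int = NUM_TIME_BINS) -> str:
    """Quantize a timestamp (seconds) into a relative time token."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    idx = int(t * num_bins / duration)
    idx = min(max(idx, 0), num_bins - 1)  # clamp into the valid bin range
    return f"<time={idx}>"


def build_event_sequence(
    events: List[Tuple[float, float, str]], duration: float
) -> List[str]:
    """Flatten (start, end, caption) events into one token sequence.

    Each event contributes a start time token, an end time token,
    and then its caption tokens. Events are ordered by start time.
    """
    tokens: List[str] = []
    for start, end, caption in sorted(events, key=lambda e: e[0]):
        tokens.append(time_token(start, duration))
        tokens.append(time_token(end, duration))
        tokens.extend(caption.split())  # stand-in for a real subword tokenizer
    return tokens


events = [(10.0, 25.0, "crack the eggs"), (0.0, 10.0, "heat the pan")]
print(build_event_sequence(events, duration=100.0))
```

Under the same assumptions, the function could also be applied to transcribed speech sentences and their timestamps to build the input transcript sequence, since both sequences interleave time tokens with text tokens.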

Leverage large amount of unlabeled narrated videos to train dense event captioning model
Pretraining method uses cross-modal supervision in readily-available narrated videos
Finetune architecture for various downstream tasks including dense event captioning
Narrated videos do not contain dense event captioning annotations, so use transcribed speech sentences and timestamps as supervisory signal
Speech transcripts are not always visually grounded and often temporally misaligned
Vid2Seq model is suitable for using weak supervision
Pretraining on entire minutes-long videos is beneficial
Training objectives are based on maximum likelihood
Finetuning uses maximum likelihood objective based on event sequence
Inference autoregressively generates event sequence using beam search
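
To make the inference step concrete, here is a small, generic beam search sketch in Python. It is a textbook version, not the decoding code used for Vid2Seq: the `next_log_probs` callback, the `<eos>` token, the toy distribution, and the absence of length normalization are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[float, List[str]]  # (cumulative log-probability, tokens)


def beam_search(
    next_log_probs: Callable[[List[str]], Dict[str, float]],
    beam_size: int = 2,
    max_len: int = 10,
    eos: str = "<eos>",
) -> List[str]:
    """Return the highest-scoring sequence found with a fixed-width beam."""
    beams: List[Hypothesis] = [(0.0, [])]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for score, seq in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep the top candidates; completed ones leave the beam.
        beams = []
        for score, seq in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((score, seq))
        if not beams:
            break
    pool = finished + beams
    return max(pool, key=lambda c: c[0])[1] if pool else []


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical next-token distribution where greedy decoding is suboptimal."""
    if not prefix:
        return {"A": math.log(0.6), "B": math.log(0.4)}
    if prefix == ["A"]:
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if prefix == ["B"]:
        return {"z": math.log(0.9), "w": math.log(0.1)}
    return {eos_token: 0.0 for eos_token in ["<eos>"]}


print(beam_search(toy_model, beam_size=2))
print(beam_search(toy_model, beam_size=1))
```

With a beam of width 1 the search reduces to greedy decoding and commits to the locally best first token, while a wider beam can recover a sequence with higher overall probability. In a real model, `next_log_probs` would be the decoder conditioned on the encoder embeddings.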

Pretraining on YT-Temporal-1B dataset
Evaluating Vid2Seq on three downstream dense video captioning datasets
Using Adam optimizer
Evaluating with CIDEr, METEOR, SODA_c, average precision, average recall, and F1 Score
Default Vid2Seq model predicts text and time tokens, uses visual frames and transcribed speech as input, builds on T5-Base language model, and is pretrained on untrimmed videos from YT-Temporal-1B with both generative and denoising losses
Pretraining task formulation uses untrimmed videos and integrates sentence boundaries of transcribed speech via time tokens
Adding denoising loss strongly benefits model with both modalities
Model with T5-Base outperforms its variant with T5-Small
Pretraining on 150K narrated videos yields important benefits
Vid2Seq sets new state of the art on all three datasets
Model that jointly predicts event boundaries and captions localizes better and benefits more from pretraining than localization-only baseline
Vid2Seq outperforms prior methods on YouCook2, ViTT, ActivityNet Captions, MSR-VTT, and MSVD datasets
Vid2Seq has potential to be extended to a wide range of other video tasks
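
The localization metrics listed above (average precision, average recall, F1 Score) are usually computed from the temporal IoU between predicted and ground-truth segments. The sketch below shows one plausible way to do this in Python. The IoU thresholds (0.3, 0.5, 0.7, 0.9), the matching rule, and the function names are assumptions for illustration, not the official evaluation code.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def localization_scores(
    pred: List[Segment],
    gt: List[Segment],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7, 0.9),  # assumed thresholds
) -> Tuple[float, float, float]:
    """Average precision and recall over IoU thresholds, plus their harmonic mean."""
    precisions, recalls = [], []
    for th in thresholds:
        # A prediction counts as correct if it overlaps any ground-truth event enough.
        p_hits = sum(any(temporal_iou(p, g) >= th for g in gt) for p in pred)
        # A ground-truth event counts as found if any prediction overlaps it enough.
        r_hits = sum(any(temporal_iou(p, g) >= th for p in pred) for g in gt)
        precisions.append(p_hits / len(pred) if pred else 0.0)
        recalls.append(r_hits / len(gt) if gt else 0.0)
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


pred = [(0.0, 10.0), (20.0, 30.0)]
gt = [(0.0, 10.0), (22.0, 30.0), (50.0, 60.0)]
print(localization_scores(pred, gt))
```

Captioning metrics such as CIDEr, METEOR, and SODA_c additionally compare the generated text with reference captions and are not covered by this sketch.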