Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos.
  • Augment language model with special time tokens to predict event boundaries and textual descriptions in same output sequence.
  • Leverage unlabeled narrated videos for dense video captioning by reformulating sentence boundaries of transcribed speech as pseudo event boundaries.
  • Vid2Seq model pretrained on YT-Temporal-1B dataset improves state of the art on dense video captioning benchmarks.
  • Generalizes well to video paragraph captioning and standard video clip captioning tasks.

Paper Content

Introduction

  • Dense video captioning requires temporal localization and captioning of all events in an untrimmed video
  • Standard video captioning produces a single caption for a given short video clip
  • Dense captioning is more difficult and requires long-range video information
  • Existing methods mostly use two-stage approaches
  • This work proposes a video language model, Vid2Seq, which jointly predicts all event captions and their corresponding temporal boundaries
  • Vid2Seq is pretrained on unlabeled narrated videos
  • Vid2Seq improves the state of the art on multiple dense video captioning datasets
  • Code and trained models will be publicly released
  • Dense video captioning lies at the intersection of event localization and event captioning
  • Existing methods for dense video captioning consist of a temporal localization stage followed by an event captioning stage
  • Recent works jointly train the captioning and localization modules
  • Wang et al. propose to view dense video captioning as a set prediction task
  • Deng et al. propose to first generate a paragraph and then ground each sentence in the video
  • Zhang et al. propose to generate event boundaries sequentially
  • Zhu et al. perform dense video captioning by generating a single output sequence
  • Recent works have explored video-text pretraining
  • We propose a pretraining method that does not rely on any manual annotation
  • We formulate dense event captioning as a sequence-to-sequence problem
  • We cast visual localization as a language modeling task

Method

  • Goal of dense video captioning is to localize and describe events in an untrimmed video
  • Key challenge is to model relationships between events
  • Manual collection of annotations is expensive
  • Develop a unified multi-modal model to predict event boundaries and captions
  • Pretraining strategy leverages cross-modal supervision from unlabeled narrated videos

Model

  • We wish to design a model for dense video captioning that can capture relationships between events using visual and transcribed speech cues.
  • We cast dense video captioning as a sequence-to-sequence problem where the input and output sequences contain both semantic information and temporal localization.
  • We develop a multi-modal encoder-decoder architecture that takes video frames and transcribed speech as input.
  • The output of the model is an event sequence that contains both textual descriptions and timestamps.
  • We construct the output event sequence by augmenting a text tokenizer with special time tokens.
  • We construct the input transcript sequence similarly as the event sequence.
  • We use a visual encoder and a text encoder to embed the video frames and transcribed speech.
  • The text decoder generates the event sequence by using the encoder embeddings.
  • The text encoder and decoder are initialized with T5-Base which has been pretrained on Web text corpora.

Training

  • Leverage large amount of unlabeled narrated videos to train dense event captioning model
  • Pretraining method uses cross-modal supervision in readily-available narrated videos
  • Finetune architecture for various downstream tasks including dense event captioning
  • Narrated videos do not contain dense event captioning annotations, so use transcribed speech sentences and timestamps as supervisory signal
  • Speech transcripts are not always visually grounded and often temporally misaligned
  • Vid2Seq model is suitable for using weak supervision
  • Pretraining on entire minutes-long videos is beneficial
  • Training objectives are based on maximum likelihood objective
  • Finetuning uses maximum likelihood objective based on event sequence
  • Inference autoregressively generates event sequence using beam search

Experiments

  • Outlined experimental setup in Section 4.1
  • Presented ablation studies in Section 4

Experimental setup

  • Pretraining on YT-Temporal-1B dataset
  • Evaluating Vid2Seq on three downstream dense video captioning datasets
  • Using Adam optimizer
  • Evaluating with CIDEr, METEOR, SODA c, average precision, average recall, and F1 Score
  • Default Vid2Seq model predicts text and time tokens, uses visual frames and transcribed speech as input, builds on T5-Base language model, and is pretrained on untrimmed videos from YT-Temporal-1B with both generative and denoising losses
  • Pretraining task formulation uses untrimmed videos and integrates sentence boundaries of transcribed speech via time tokens
  • Adding denoising loss strongly benefits model with both modalities
  • Model with T5-Base outperforms its variant with T5-Small
  • Pretraining on 150K narrated videos yields important benefits
  • Vid2Seq sets new state of the art on all three datasets
  • Model that jointly predicts event boundaries and captions localizes better and benefits more from pretraining than localization-only baseline
  • Vid2Seq outperforms prior methods on YouCook2, ViTT, ActivityNet Captions, MSR-VTT, and MSVD datasets
  • Vid2Seq has potential to be extended to a wide range of other video tasks