Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos
- Pre-train VideoTaskformer by predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video
- Learn step representations globally, leveraging video of the entire surrounding task as context
- Introduce two new benchmarks for detecting mistakes in instructional videos
- Introduce a long-term forecasting benchmark
- Outperforms previous baselines on these tasks
- Evaluate VideoTaskformer on 3 existing benchmarks and achieves new state-of-the-art performance
Paper Content
Introduction
- Trying to build a bookshelf using a YouTube video
- Need to repeatedly hit pause on the video
- Interactive assistant can guide user through task
- Composite task involves multiple fine-grained activities
- Ideal assistant has high-level and low-level understanding
- Prior work models step representations from single short video clips
- VideoTaskformer learns step representations for masked video steps
- Mistake detection task and dataset for verifying video representations
- VideoTaskformer learns step representations for whole video
- Network learns to predict labels for masked steps
- Representations improve performance on downstream tasks
- VideoTaskformer capable of detecting mistake types
Related works
- Large-scale narrated instructional video datasets enable learning joint video-language representations and task structure from videos
- Assembly-101 dataset and Ikea ASM provide videos of people assembling and disassembling toys and furniture
- Existing benchmarks for evaluating representations learned on instructional video datasets include step localization, step classification, procedural activity recognition, and step forecasting
- Recent works attempt to learn procedures from instructional videos
- Video action recognition models have improved over the last few years
- Works learn representations for longer video clips containing semantically more complex actions
Learning task structure through masked modeling of steps
- Goal is to learn task-aware step representations from instructional videos
- Developed VideoTaskformer, a video model pre-trained using a BERT style masked modeling loss
- Masking is done at the step level
- Framework consists of two steps: pre-training and fine-tuning
- Pre-training is done on weakly labeled data
- During fine-tuning, a subset of the parameters is adjusted using labeled data from the downstream tasks
- Pre-training approach uses masked step modeling loss
- Step modeling extends masked language modeling techniques used in BERT and VideoBERT
- Evaluated on 6 downstream tasks
- Step representations are learned from entire video with all steps as input
- Step classification and distribution matching are used as training objectives
Downstream tasks
- Mistake step detection: Identify which step in a video is incorrect
- Mistake ordering detection: Verify if the steps in a video are in the correct temporal order
- Short-term forecasting: Predict the step label given the previous n segments
- Long-term step forecasting: Predict the step labels for the next 5 steps given a single step
- Procedural activity recognition: Recognize the procedural activity (i.e., task label) from a long instructional video
- Step classification: Predict the step label given the video clip
- Evaluation: Zero-shot performance and fine-tuning results
Implementation details
- Step labels are needed to train VideoTaskformer
- Step labels are difficult to obtain manually
- WikiHow dataset is used as a weak form of supervision
- WikiHow steps are compared to transcribed speech using a language model and video model
Datasets and evaluation metrics
- Pre-training uses videos and transcripts from the HowTo100M and WikiHow datasets
- Evaluation uses videos and step annotations from the COIN dataset
- 3 new benchmark tasks introduced: mistake step detection, mistake ordering detection, and long-term step forecasting
- Mistake step detection dataset created by randomly replacing one step with a step from a different video
- Mistake ordering detection dataset created by randomly shuffling the ordering of the steps in a given video
- Long-term step forecasting predicts the step class label for the next 5 consecutive steps
- 3 existing benchmarks evaluated: step classification, procedural activity recognition, and short-term step forecasting
Experiments
- Evaluated VideoTaskformer (VideoTF)
- Compared with existing baselines
- 6 downstream tasks: step classification, procedural activity recognition, step forecasting, mistake step detection, mistake ordering detection, and long term forecasting
- Results on datasets described in Sec. 4
Baselines
- TimeSformer (LwDS) [13] is a baseline model pre-trained on HowTo100M
- TimeSformer w/ KB transfer (LwDS) [13] adds knowledge base transfer to the baseline model
- Steps from clustering ASR text is an unsupervised baseline using only transcribed speech
- Base models tested include S3D, SlowFast, TimeSformer trained on HT100M and Kinetics
- Loss functions tested include Step Classification and Distribution Matching
- Modalities tested include video features and ASR text
- Task label is included as input to the downstream model
- Linear-probe and fine-tuning are evaluated
- Results are compared to several baselines on 6 downstream tasks
- VideoTF with step classification loss outperforms LwDS by 2%
- Distribution matching loss works slightly better than step classification loss
- Linear-probe performance is competitive and outperforms baselines
- VideoTF achieves a 5% improvement over LwDS on long-term forecasting
- Adding task labels improves performance on all three tasks
- Qualitative results show VideoTF correctly predicts mistake steps and orders
- VideoTF correctly predicts next steps given past steps