Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos
Pre-train VideoTaskformer by predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video
Learn step representations globally, leveraging video of the entire surrounding task as context
Introduce two new benchmarks for detecting mistakes in instructional videos
Introduce a long-term forecasting benchmark
VideoTaskformer outperforms previous baselines on these tasks
Evaluate VideoTaskformer on 3 existing benchmarks and achieve new state-of-the-art performance

Large-scale narrated instructional video datasets enable learning joint video-language representations and task structure from videos
The Assembly101 and IKEA ASM datasets provide videos of people assembling and disassembling toys and furniture
Existing benchmarks for evaluating representations learned on instructional video datasets include step localization, step classification, procedural activity recognition, and step forecasting
Recent works attempt to learn procedures from instructional videos
Video action recognition models have improved over the last few years
Other works learn representations for longer video clips containing semantically more complex actions

Goal is to learn task-aware step representations from instructional videos
Developed VideoTaskformer, a video model pre-trained using a BERT-style masked modeling loss
Masking is done at the step level
Framework consists of two stages: pre-training and fine-tuning
Pre-training is done on weakly labeled data
During fine-tuning, a subset of the parameters is adjusted using labeled data from the downstream tasks
Pre-training approach uses a masked step modeling loss
Masked step modeling extends the masked language modeling techniques used in BERT and VideoBERT
Evaluated on 6 downstream tasks
Step representations are learned from the entire video with all steps as input
Step classification and distribution matching are used as training objectives
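
The masking and the two training objectives above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `MASK` placeholder and all function names are hypothetical, the model producing the logits is left out, and KL divergence is assumed as the distance used for distribution matching.

```python
import math
import random

MASK = "<MASK>"  # hypothetical placeholder standing in for a masked-out step


def mask_one_step(step_features, rng):
    """Replace one randomly chosen step with MASK; return the masked list and its index."""
    idx = rng.randrange(len(step_features))
    masked = list(step_features)
    masked[idx] = MASK
    return masked, idx


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step_classification_loss(logits, target_index):
    """Cross-entropy against a single hard (weak) step label."""
    probs = softmax(logits)
    return -math.log(probs[target_index])


def distribution_matching_loss(logits, target_dist, eps=1e-12):
    """KL(target || predicted) against a soft distribution over step labels.

    Assumption: KL divergence is used as the matching distance.
    """
    probs = softmax(logits)
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target_dist, probs)
        if t > 0
    )
```

In this sketch, the logits would come from the model's prediction at the masked position, computed with all remaining steps of the video as context. The hard label or soft distribution would come from the weak supervision.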

Mistake step detection: Identify which step in a video is incorrect
Mistake ordering detection: Verify if the steps in a video are in the correct temporal order
Short-term forecasting: Predict the step label given the previous n segments
Long-term step forecasting: Predict the step labels for the next 5 steps given a single step
Procedural activity recognition: Recognize the procedural activity (i.e., task label) from a long instructional video
Step classification: Predict the step label given the video clip
Evaluation: Zero-shot performance and fine-tuning results

Step labels are needed to train VideoTaskformer
Step labels are difficult to obtain manually
WikiHow dataset is used as a weak form of supervision
WikiHow steps are compared to transcribed speech using a language model and video model
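
The matching of WikiHow steps to transcribed speech can be sketched as follows. This is a toy illustration under stated assumptions: the word-overlap similarity is a simple stand-in for the learned language and video models, and the function names and the `temperature` parameter are hypothetical. The hard label corresponds to a step classification target and the soft distribution to a distribution matching target.

```python
import math


def token_overlap(a, b):
    """Jaccard overlap between word sets: a toy stand-in for a learned similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def weak_step_labels(asr_sentence, wikihow_steps, temperature=0.1):
    """Return (hard_label, soft_distribution) over the candidate WikiHow steps."""
    sims = [token_overlap(asr_sentence, s) for s in wikihow_steps]
    # Softmax over similarities; temperature is an illustrative assumption.
    m = max(sims)
    exps = [math.exp((s - m) / temperature) for s in sims]
    total = sum(exps)
    soft = [e / total for e in exps]
    # Hard label: index of the most similar step.
    hard = max(range(len(sims)), key=sims.__getitem__)
    return hard, soft


steps = [
    "crack the eggs into a bowl",
    "whisk the eggs",
    "pour the batter into the pan",
]
hard, soft = weak_step_labels("now we whisk the eggs until smooth", steps)
```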

Pre-training uses videos and transcripts from the HowTo100M and WikiHow datasets
Evaluation uses videos and step annotations from the COIN dataset
3 new benchmark tasks introduced: mistake step detection, mistake ordering detection, and long-term step forecasting
Mistake step detection dataset created by randomly replacing one step with a step from a different video
Mistake ordering detection dataset created by randomly shuffling the ordering of the steps in a given video
Long-term step forecasting predicts the step class label for the next 5 consecutive steps
3 existing benchmarks evaluated: step classification, procedural activity recognition, and short-term step forecasting
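
The construction of the three new benchmarks can be sketched as below. This is a minimal sketch assuming each video is already a list of step labels; the function names, the `distractor_pool` argument, and the `p_shuffle` ratio of shuffled to unshuffled examples are illustrative assumptions, not details from the paper.

```python
import random


def make_mistake_step_example(steps, distractor_pool, rng):
    """Replace one random step with a step taken from a different video.

    distractor_pool is assumed to hold steps from other videos only.
    Returns the corrupted step list and the index of the replaced step.
    """
    idx = rng.randrange(len(steps))
    corrupted = list(steps)
    corrupted[idx] = rng.choice(distractor_pool)
    return corrupted, idx


def shuffle_steps(steps, rng):
    """Return a permutation of steps that differs from the original order."""
    if len(set(steps)) < 2:
        raise ValueError("need at least two distinct steps to create a wrong ordering")
    shuffled = list(steps)
    while shuffled == list(steps):
        rng.shuffle(shuffled)
    return shuffled


def make_ordering_example(steps, rng, p_shuffle=0.5):
    """Return (steps, label) with label 1 for the correct order and 0 for a shuffled one."""
    if rng.random() < p_shuffle:
        return shuffle_steps(steps, rng), 0
    return list(steps), 1


def make_forecasting_example(step_labels, start, horizon=5):
    """Given the step at index start, the targets are the next `horizon` step labels."""
    future = step_labels[start + 1 : start + 1 + horizon]
    if len(future) < horizon:
        raise ValueError("not enough future steps for the requested horizon")
    return step_labels[start], future
```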

Evaluated VideoTaskformer (VideoTF)
Compared with existing baselines
6 downstream tasks: step classification, procedural activity recognition, short-term step forecasting, mistake step detection, mistake ordering detection, and long-term step forecasting
Results on datasets described in Sec. 4

TimeSformer (LwDS) [13] is a baseline model pre-trained on HowTo100M
TimeSformer w/ KB transfer (LwDS) [13] adds knowledge base transfer to the baseline model
"Steps from clustering ASR text" is an unsupervised baseline that uses only transcribed speech
Base models tested include S3D, SlowFast, and TimeSformer trained on HowTo100M (HT100M) and Kinetics
Loss functions tested include Step Classification and Distribution Matching
Modalities tested include video features and ASR text
Task label is included as input to the downstream model
Both linear-probe and fine-tuning settings are evaluated
Results are compared to several baselines on 6 downstream tasks
VideoTF with step classification loss outperforms LwDS by 2%
Distribution matching loss works slightly better than step classification loss
Linear-probe performance is competitive and outperforms baselines
VideoTF achieves a 5% improvement over LwDS on long-term forecasting
Adding task labels improves performance on all three tasks
Qualitative results show VideoTF correctly identifies mistake steps and incorrect step orderings
VideoTF correctly predicts next steps given past steps