Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos
  • Pre-train VideoTaskformer by predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video
  • Learn step representations globally, leveraging video of the entire surrounding task as context
  • Introduce two new benchmarks for detecting mistakes in instructional videos
  • Introduce a long-term forecasting benchmark
  • Outperforms previous baselines on these tasks
  • Evaluate VideoTaskformer on 3 existing benchmarks and achieves new state-of-the-art performance

Paper Content


  • Trying to build a bookshelf using a YouTube video
  • Need to repeatedly hit pause on the video
  • Interactive assistant can guide user through task
  • Composite task involves multiple fine-grained activities
  • Ideal assistant has high-level and low-level understanding
  • Prior work models step representations from single short video clips
  • VideoTaskformer learns step representations for masked video steps
  • Mistake detection task and dataset for verifying video representations
  • VideoTaskformer learns step representations for whole video
  • Network learns to predict labels for masked steps
  • Representations improve performance on downstream tasks
  • VideoTaskformer capable of detecting mistake types
  • Large-scale narrated instructional video datasets enable learning joint video-language representations and task structure from videos
  • Assembly-101 dataset and Ikea ASM provide videos of people assembling and disassembling toys and furniture
  • Existing benchmarks for evaluating representations learned on instructional video datasets include step localization, step classification, procedural activity recognition, and step forecasting
  • Recent works attempt to learn procedures from instructional videos
  • Video action recognition models have improved over the last few years
  • Works learn representations for longer video clips containing semantically more complex actions

Learning task structure through masked modeling of steps

  • Goal is to learn task-aware step representations from instructional videos
  • Developed VideoTaskformer, a video model pre-trained using a BERT style masked modeling loss
  • Masking is done at the step level
  • Framework consists of two steps: pre-training and fine-tuning
  • Pre-training is done on weakly labeled data
  • During fine-tuning, a subset of the parameters is adjusted using labeled data from the downstream tasks
  • Pre-training approach uses masked step modeling loss
  • Step modeling extends masked language modeling techniques used in BERT and VideoBERT
  • Evaluated on 6 downstream tasks
  • Step representations are learned from entire video with all steps as input
  • Step classification and distribution matching are used as training objectives

Downstream tasks

  • Mistake step detection: Identify which step in a video is incorrect
  • Mistake ordering detection: Verify if the steps in a video are in the correct temporal order
  • Short-term forecasting: Predict the step label given the previous n segments
  • Long-term step forecasting: Predict the step labels for the next 5 steps given a single step
  • Procedural activity recognition: Recognize the procedural activity (i.e., task label) from a long instructional video
  • Step classification: Predict the step label given the video clip
  • Evaluation: Zero-shot performance and fine-tuning results

Implementation details

  • Step labels are needed to train VideoTaskformer
  • Step labels are difficult to obtain manually
  • WikiHow dataset is used as a weak form of supervision
  • WikiHow steps are compared to transcribed speech using a language model and video model

Datasets and evaluation metrics

  • Pre-training uses videos and transcripts from the HowTo100M and WikiHow datasets
  • Evaluation uses videos and step annotations from the COIN dataset
  • 3 new benchmark tasks introduced: mistake step detection, mistake ordering detection, and long-term step forecasting
  • Mistake step detection dataset created by randomly replacing one step with a step from a different video
  • Mistake ordering detection dataset created by randomly shuffling the ordering of the steps in a given video
  • Long-term step forecasting predicts the step class label for the next 5 consecutive steps
  • 3 existing benchmarks evaluated: step classification, procedural activity recognition, and short-term step forecasting


  • Evaluated VideoTaskformer (VideoTF)
  • Compared with existing baselines
  • 6 downstream tasks: step classification, procedural activity recognition, step forecasting, mistake step detection, mistake ordering detection, and long term forecasting
  • Results on datasets described in Sec. 4


  • TimeSformer (LwDS) [13] is a baseline model pre-trained on HowTo100M
  • TimeSformer w/ KB transfer (LwDS) [13] adds knowledge base transfer to the baseline model
  • Steps from clustering ASR text is an unsupervised baseline using only transcribed speech
  • Base models tested include S3D, SlowFast, TimeSformer trained on HT100M and Kinetics
  • Loss functions tested include Step Classification and Distribution Matching
  • Modalities tested include video features and ASR text
  • Task label is included as input to the downstream model
  • Linear-probe and fine-tuning are evaluated
  • Results are compared to several baselines on 6 downstream tasks
  • VideoTF with step classification loss outperforms LwDS by 2%
  • Distribution matching loss works slightly better than step classification loss
  • Linear-probe performance is competitive and outperforms baselines
  • VideoTF achieves a 5% improvement over LwDS on long-term forecasting
  • Adding task labels improves performance on all three tasks
  • Qualitative results show VideoTF correctly predicts mistake steps and orders
  • VideoTF correctly predicts next steps given past steps