
  • Masked Autoencoders (MAEs) learn self-supervised representations by randomly masking input image patches and training with a reconstruction loss.
  • Contrastive learning self-supervised methods encourage two versions of the same input to have a similar representation, while pulling apart the representations for different inputs.
  • ViC-MAE combines MAE and contrastive learning by pooling the local feature representations learned under the MAE reconstruction objective and leveraging this global representation under a contrastive objective across video frames.
  • ViC-MAE generalizes well to both video classification and image classification tasks.
  • ViC-MAE yields improved results compared to combining MAE pre-training with previously proposed contrastive objectives.

Paper Content


  • Self-supervised visual representation learning has been successful in image benchmarks
  • Two paradigms have driven this success: joint-embedding methods and masked image modeling
  • Joint-embedding methods learn representations that are invariant to specific transformations
  • Masked image modeling works by randomly masking out parts of the input and forcing a model to predict the masked parts
  • Self-supervised methods from the image domain have been replicated for video representation learning with success
  • There is still a gap in performance in the video-to-image transfer learning setting
  • Videos contain complex changes in pose, viewpoint, deformations, etc.
  • Gordon et al. and Feichtenhofer et al. have obtained good results on video and image benchmarks
  • Parthasarathy et al. have obtained results that rival those of ImageNet pre-training
  • ViC-MAE is proposed to leverage contrastive learning and masked image modeling for videos
  • Training with negative pairs surpasses methods that only train with positive samples
  • Training with strong image transformations as augmentations is not necessary
  • ViC-MAE obtains the best video-to-image transfer learning results on the ImageNet-1K benchmark
  • ViC-MAE achieves higher accuracy than strong alternatives based on existing methods
  • ViC-MAE shows superior transfer learning accuracy on a wide array of downstream image classification tasks
  • Self-supervised learning from videos uses prior knowledge about videos to create pretext training tasks
  • Self-supervised learning from videos also uses contrastive learning
  • Masked autoencoders (MAEs) can be used to pre-train models for transfer learning
  • Learning image representations from video can improve performance over artificially produced augmentations
  • Video datasets can provide natural image augmentations and correspondences to learn robust image representations
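
The random masking step behind masked image modeling can be illustrated with a minimal sketch. It uses only the Python standard library and works on patch indices rather than pixels. The function name `random_patch_mask`, the 75% mask ratio, and the 196-patch grid are assumptions chosen for illustration; the summary above does not state the values used by ViC-MAE.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists by random shuffling.

    mask_ratio is an assumed value for illustration, not one taken from the summary.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])    # patches the model must reconstruct
    visible = sorted(indices[num_masked:])   # patches the encoder actually sees
    return visible, masked


# Assumed example: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```

Only the visible patches would be passed to the encoder; the reconstruction loss is then computed on the masked ones.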


  • Proposes ViC-MAE for space-time feature learning
  • Uses contrastive learning at the time level
  • Uses masked image modeling at the space level


  • Masked image modeling is a way to learn visual representations in a self-supervised manner.
  • Negative-free representation learning is a way to learn global visual representations without negative examples.
  • Representation collapse can be avoided by using methods such as SimSiam and VICReg.
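
To make the idea of avoiding collapse without negatives more concrete, the following sketch computes variance and covariance regularization terms in the style of VICReg for a small batch of embeddings. It is a simplified, standard-library illustration and not the authors' code: the function name, the target standard deviation `gamma=1.0`, and `eps=1e-4` are assumptions. The variance term is large when all embeddings are identical (collapse), and the covariance term is large when embedding dimensions are redundant.

```python
import math


def variance_covariance_terms(embeddings, gamma=1.0, eps=1e-4):
    """Return (variance_term, covariance_term) for a batch of embedding vectors.

    embeddings: list of n vectors of dimension d, with n >= 2.
    gamma and eps are assumed hyperparameters for illustration.
    """
    n = len(embeddings)
    d = len(embeddings[0])
    if n < 2:
        raise ValueError("need at least two embeddings")
    means = [sum(row[j] for row in embeddings) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in embeddings]
    # Unbiased covariance matrix of the batch (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Hinge on the per-dimension standard deviation: penalize std below gamma.
    variance_term = sum(
        max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)
    ) / d
    # Penalize squared off-diagonal covariances: decorrelate dimensions.
    covariance_term = sum(
        cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j
    ) / d
    return variance_term, covariance_term


collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
decorrelated = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
correlated = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]

v_col, c_col = variance_covariance_terms(collapsed)
v_dec, c_dec = variance_covariance_terms(decorrelated)
v_cor, c_cor = variance_covariance_terms(correlated)
```

In this toy example the collapsed batch is penalized by the variance term, the correlated batch by the covariance term, and the decorrelated batch by neither.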

Combining MAE with contrastive methods.

  • Combining MAE with contrastive learning methods can be done by using the [CLS] token of the transformer as a global video feature representation
  • Sample two frames from a video and perform patch-level masking
  • Pass the $x_i^{\text{CLS}}$ token to a projector network $P$ to obtain $p_i = P(x_i^{\text{CLS}}) / \lVert P(x_i^{\text{CLS}}) \rVert_2$
  • Experiment with SimSiam and VICReg losses
  • Pass $x_i^{\text{CLS}}$ to a projector network $P$ to obtain $p_i = P(x_i^{\text{CLS}}) / \lVert P(x_i^{\text{CLS}}) \rVert_2$
  • Calculate variance and covariance losses
  • Use masked image modeling at the frame level and image-level similarity at the time level
  • Pull each video frame towards a global video representation in the latent space
  • Use InfoNCE contrastive learning loss
  • Use a schedule to gradually introduce the contrastive loss
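
Three of the steps listed above (L2-normalizing the projected feature, the InfoNCE loss, and a schedule that gradually introduces the contrastive term) can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the temperature of 0.1, the use of the other items in the batch as negatives, and the linear ramp shape of the schedule are all assumed, as are the function names.

```python
import math


def l2_normalize(vec):
    """Divide a vector by its L2 norm, as in p_i = P(x) / ||P(x)||_2."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, 1e-12) for v in vec]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss where positives[i] pairs with anchors[i].

    All other rows of `positives` act as negatives for anchor i (assumption).
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    losses = []
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def contrastive_weight(step, warmup_steps, max_weight=1.0):
    """Assumed schedule: linear ramp from 0 to max_weight, then constant."""
    if warmup_steps <= 0:
        return max_weight
    return max_weight * min(1.0, step / warmup_steps)


# Toy features for two videos: frame A of each video versus frame B.
frames_a = [[1.0, 0.0], [0.0, 1.0]]
frames_b_aligned = [[1.0, 0.0], [0.0, 1.0]]   # matching frames are similar
frames_b_swapped = [[0.0, 1.0], [1.0, 0.0]]   # matching frames are dissimilar

loss_aligned = info_nce(frames_a, frames_b_aligned)
loss_swapped = info_nce(frames_a, frames_b_swapped)
```

A combined objective could then take the form `reconstruction_loss + contrastive_weight(step, warmup_steps) * contrastive_loss`, so that the reconstruction objective dominates early in training. The exact weighting used in the paper is not given in this summary.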

Experiment settings

  • Perform experiments to demonstrate performance of method on ImageNet and other image recognition datasets
  • Evaluate method on Kinetics dataset for action recognition
  • Use Vision Transformer (ViT) architectures
  • Use the small decoder proposed by He et al.
  • Experiment with two alternatives for contrastive learning
  • Pre-train with Moments in Time and Kinetics-400 datasets
  • Use AdamW optimizer with batch size of 512
  • Evaluate pre-training quality by end-to-end finetuning
  • Use multi-view testing with K temporal clips on video datasets, reporting Top-1 and Top-5 accuracy
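
As a rough illustration of the evaluation protocol mentioned above, the sketch below averages class scores over several views of each video and then computes top-k accuracy. The helper names and the toy scores are hypothetical, and averaging scores across views is an assumption about how the views are combined. The toy example has only three classes, so it reports top-1 and top-2 rather than top-5.

```python
def average_views(view_scores):
    """Average per-class scores over the views (clips or crops) of one video."""
    num_views = len(view_scores)
    num_classes = len(view_scores[0])
    return [sum(v[c] for v in view_scores) / num_views for c in range(num_classes)]


def top_k_accuracy(scores_per_video, labels, k):
    """Fraction of videos whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in zip(scores_per_video, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Two toy videos, two views each, three classes (made-up scores).
video_views = [
    [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]],
    [[0.1, 0.6, 0.3], [0.2, 0.5, 0.3]],
]
labels = [0, 2]

video_scores = [average_views(views) for views in video_views]
top1 = top_k_accuracy(video_scores, labels, k=1)
top2 = top_k_accuracy(video_scores, labels, k=2)
```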

Results and ablations

  • Performed experiments to analyze ViC-MAE framework
  • Experiments done under learning with negative pairs setting
  • Mean pooling used over ViT-B/16 features
  • Linear evaluation and end-to-end finetuning done over 100 epochs
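
Mean pooling, as used in the ablations above, reduces a set of per-patch features to a single global feature by averaging over patches. A minimal standard-library sketch, with a hypothetical helper name and toy two-dimensional features in place of real ViT-B/16 outputs:

```python
def mean_pool(patch_features):
    """Average a list of per-patch feature vectors into one global vector."""
    num_patches = len(patch_features)
    dim = len(patch_features[0])
    return [sum(patch[j] for patch in patch_features) / num_patches for j in range(dim)]


# Three toy patch features of dimension 2.
patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
global_feature = mean_pool(patches)
print(global_feature)  # [3.0, 4.0]
```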

Main result

  • Evaluated video-to-image transfer learning using ImageNet-1K benchmark
  • Results in Table 1 show our method surpasses the previous state of the art by 1.58 percentage points of accuracy
  • TubeViT is still the best model under the ViT-B/16 architecture on the Kinetics-400 benchmark
  • Our method performs better than other methods in video-to-image transfer
  • MAE + contrastive learning model outperforms competing methods by > 3%

Transfer learning performance.

  • Evaluated transfer learning performance of model across 12 image classification tasks
  • Trained two models using two video datasets
  • Model significantly outperformed other baselines on 9 out of 12 datasets
  • Investigated effect of various frame-level image transformations, frame separation, and pooling operator
  • Using only color augmentations reduced performance by >2%
  • Using combination of strong color and spatial augmentations not superior to using only strong spatial augmentations
  • Increasing frame separation when using continuous sampling increases performance
  • Adopted only strong spatial augmentations, discarding color augmentations
  • Model underperformed TubeViT by 7.1 percentage points in absolute accuracy
  • Model underperformed MaskFeat by 0.7 percentage points in absolute accuracy
  • Model surpassed MViTv1-B, TimeSformer, and ViViT-B by 0.3, 0.8, and 1.5 percentage points in absolute accuracy, respectively
  • Model underperformed DINO by 1 percentage point in absolute accuracy
  • Model operates over video frames using masked image modeling and contrastive learning