Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Masked Autoencoders (MAEs) learn self-supervised representations by randomly masking input image patches and training with a reconstruction loss.
Contrastive self-supervised methods encourage two versions of the same input to have similar representations, while pushing apart the representations of different inputs.
ViC-MAE combines MAE and contrastive learning by pooling the local feature representations learned under the MAE reconstruction objective and leveraging this global representation under a contrastive objective across video frames.
ViC-MAE generalizes well to both video classification and image classification tasks.
ViC-MAE yields improved results compared to combining MAE pre-training with previously proposed contrastive objectives.

Self-supervised visual representation learning has been successful in image benchmarks
Two paradigms have driven this success: joint-embedding methods and masked image modeling
Joint-embedding methods learn representations that are invariant to specific transformations
Masked image modeling works by randomly masking out parts of the input and forcing a model to predict the masked parts
Self-supervised methods from the image domain have been replicated for video representation learning with success
There is still a gap in performance in the video-to-image transfer learning setting
Videos contain complex changes in pose, viewpoint, deformations, etc.
Gordon et al. and Feichtenhofer et al. have obtained good results for video and image benchmarks
Parthasarathy et al. have obtained results that rival ImageNet results
ViC-MAE proposes to leverage contrastive learning and masked image modeling for videos
Training with negative pairs surpasses methods that only train with positive samples
Training with strong image transformations as augmentations is not necessary
ViC-MAE obtains the best video-to-image transfer learning results on the ImageNet-1K benchmark
ViC-MAE achieves higher accuracy than strong alternatives based on existing methods
ViC-MAE shows superior transfer learning accuracy on a wide array of downstream image classification tasks

Self-supervised learning from videos uses prior knowledge about videos to create pretext training tasks
Self-supervised learning from videos also uses contrastive learning
Masked autoencoders (MAEs) can be used to pre-train models for transfer learning
Learning image representations from video can improve performance over artificially produced augmentations
Video datasets can provide natural image augmentations and correspondences to learn robust image representations

Masked image modeling is a way to learn visual representations in a self-supervised manner.
Negative-free representation learning is a way to learn global visual representations without negative examples.
Representation collapse can be avoided by using methods such as SimSiam and VICReg.
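As a rough illustration of masked image modeling, the sketch below picks a random subset of patch indices to hide and computes a reconstruction error over the hidden patches only. The function names, the 0.75 mask ratio, and the 196-patch grid are assumptions chosen for the example, not values taken from this summary.

```python
import random


def random_patch_mask(num_patches, mask_ratio, rng=None):
    """Split patch indices into (visible, masked) lists at random."""
    rng = rng or random.Random()
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def masked_mse(pred, target, masked):
    """Mean squared error computed only over the masked patches."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Assumed setting: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(196, mask_ratio=0.75, rng=random.Random(0))
```

In this sketch, the encoder would see only the `visible` patches, and the decoder's predictions would be scored with `masked_mse` on the `masked` ones.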

Combining MAE with contrastive learning methods can be done by using the [CLS] token of the transformer as a global video feature representation
Sample two frames from a video and perform patch-level masking
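Pass the $x_i^{\mathrm{CLS}}$ token to a projector network $P$ to obtain $p_i = P(x_i^{\mathrm{CLS}}) / \lVert P(x_i^{\mathrm{CLS}}) \rVert_2$
Experiment with SimSiam and VICReg losses
Pass $x_i^{\mathrm{CLS}}$ to a projector network $P$ to obtain $p_i = P(x_i^{\mathrm{CLS}}) / \lVert P(x_i^{\mathrm{CLS}}) \rVert_2$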
Calculate variance and covariance losses
Use masked image modeling at the frame level and image-level similarity at the temporal level
Pull each video frame towards a global video representation in the latent space
Use InfoNCE contrastive learning loss
Use a schedule to gradually introduce the contrastive loss
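The contrastive part of the steps above can be sketched as follows: projections are L2-normalized, an InfoNCE loss treats frames from the same video as positives and frames from other videos in the batch as negatives, and a weight ramps the contrastive term in over training. The temperature value, the linear ramp, and all function names are assumptions made for illustration; the summary does not specify the shape of the schedule.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over a batch.

    anchors[i] and positives[i] come from the same video; every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)


def contrastive_weight(step, warmup_steps, max_weight=1.0):
    """Assumed schedule: linear ramp from 0 to max_weight over warmup_steps."""
    return max_weight * min(1.0, step / warmup_steps)


def total_loss(recon_loss, contrastive_loss, step, warmup_steps):
    """Reconstruction loss plus the gradually introduced contrastive loss."""
    return recon_loss + contrastive_weight(step, warmup_steps) * contrastive_loss
```

With this ramp, early training is driven by the reconstruction term alone, and the contrastive term reaches full weight once `step` reaches `warmup_steps`.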

Perform experiments to demonstrate performance of method on ImageNet and other image recognition datasets
Evaluate method on Kinetics dataset for action recognition
Use Vision Transformer (ViT) architectures
Use the small decoder proposed by He et al.
Experiment with two alternatives for contrastive learning
Pre-train with Moments in Time and Kinetics-400 datasets
Use AdamW optimizer with batch size of 512
Evaluate pre-training quality by end-to-end finetune
Report Top-1 and Top-5 accuracy using K-view (multi-view) testing on video datasets
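One common way to realize multi-view testing is to average the class scores predicted for several views of the same video and then compute Top-k accuracy on the averaged scores. The sketch below assumes that averaging scheme; the function names and the toy scores are hypothetical and are not results from the paper.

```python
def average_views(view_scores):
    """Average per-class scores over the K views of one video."""
    k = len(view_scores)
    num_classes = len(view_scores[0])
    return [sum(view[c] for view in view_scores) / k for c in range(num_classes)]


def top_k_accuracy(scores_per_video, labels, k):
    """Fraction of videos whose true label is among the k highest scores."""
    hits = 0
    for scores, label in zip(scores_per_video, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example: two videos, K = 2 views each, three classes.
video_a = average_views([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2]])
video_b = average_views([[0.1, 0.6, 0.3], [0.2, 0.3, 0.5]])
labels = [0, 2]
top1 = top_k_accuracy([video_a, video_b], labels, k=1)
top2 = top_k_accuracy([video_a, video_b], labels, k=2)
```

In the toy example the second video is misclassified at Top-1 but recovered at Top-2, which is the kind of gap the Top-1 and Top-5 columns capture on real benchmarks.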

Evaluated video-to-image transfer learning using ImageNet-1K benchmark
Results in Table 1 show our method surpasses the previous state of the art by 1.58 percentage points of accuracy
TubeViT is still the best model under the ViT/B-16 architecture on the Kinetics-400 benchmark
Our method performs better than other methods in video-to-image transfer
MAE + contrastive learning model outperforms competing methods by > 3%

Evaluated transfer learning performance of model across 12 image classification tasks
Trained two models using two video datasets
Model significantly outperformed other baselines on 9 out of 12 datasets
Investigated effect of various frame-level image transformations, frame separation, and pooling operator
Using only color augmentations reduced performance by >2%
Using combination of strong color and spatial augmentations not superior to using only strong spatial augmentations
Increasing frame separation when using continuous sampling increases performance
Using only strong spatial augmentations and discarding color augmentations
Model underperformed compared to TubeViT by 7.1 percentage points in absolute accuracy
Model underperformed compared to MaskFeat by 0.7 percentage points in absolute accuracy
Model surpassed MViTv1-B, TimeSformer, and ViViT-B by 0.3, 0.8, and 1.5 percentage points in absolute accuracy respectively
Model underperformed compared to DINO by 1 percentage point in absolute accuracy
Model operates over video frames using masked image modeling and contrastive learning