Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Masked Autoencoders (MAEs) learn self-supervised representations by randomly masking input image patches and a reconstruction loss.
- Contrastive learning self-supervised methods encourage two versions of the same input to have a similar representation, while pulling apart the representations for different inputs.
- ViC-MAE combines MAE and contrastive learning by pooling the local feature representations learned under the MAE reconstruction objective and leveraging this global representation under a contrastive objective across video frames.
- ViC-MAE generalizes well to both video classification and image classification tasks.
- ViC-MAE yields improved results compared to combining MAE pre-training with previously proposed contrastive objectives.
Paper Content
Introduction
- Self-supervised visual representation learning has been successful in image benchmarks
- Two paradigms have driven this success: joint-embedding methods and masked image modeling
- Joint-embedding methods learn representations that are invariant to specific transformations
- Masked image modeling works by randomly masking out parts of the input and forcing a model to predict the masked parts
- Self-supervised methods from the image domain have been replicated for video representation learning with success
- There is still a gap in performance in the video-to-image transfer learning setting
- Videos contain complex changes in pose, viewpoint, deformations, etc.
- Gordeon et.al. and Feichtenhofer et.al. have obtained good results for video and image benchmarks
- Parthasarathy et.al. has obtained results that rival ImageNet results
- ViC-MAE proposed to leverage contrastive learning and masked image modeling for videos
- Training with negative pairs surpasses methods that only train with positive samples
- Training with strong image transformations as augmentations is not necessary
- ViC-MAE obtains best video-to-image transfer learning results in Imagenet-1k benchmark
- ViC-MAE achieves superior accuracy than strong alternatives based on existing methods
- ViC-MAE shows superior transfer learning accuracy on a wide array of downstream image classification tasks
Related work
- Self-supervised learning from videos uses prior knowledge about videos to create pretext training tasks
- Self-supervised learning from videos also uses contrastive learning
- Masked auto encoders (MAE) can be used to pre-train models for transfer learning
- Learning image representations from video can improve performance over artificially produced augmentations
- Video datasets can provide natural image augmentations and correspondences to learn robust image representations
Method
- Proposes ViC-MAE for space-time feature learning
- Uses contrastive learning at the time level
- Uses masked image modelling at the space level
Background
- Masked image modeling is a way to learn visual representations in a self-supervised manner.
- Negative-free representation learning is a way to learn global visual representations without negative examples.
- Representation collapse can be avoided by using methods such as SiamSiam and VicReg.
Combining mae with contrastive methods.
- Combining MAE with contrastive learning methods can be done by using the [CLS] token of the transformer as a global video feature representation
- Sample two frames from a video and perform patch-level masking
- Pass the x CLS i token to a projector network P to obtain p i P(x CLS i )/ P(x CLS i ) 2
- Experiment with SiamSiam and VicReg losses
- Pass x CLS i to a projector network P to obtain p i P(x CLS i )/ P(x CLS i ) 2
- Calculate variance and covariance losses
- Use masking image modeling at the frame level and image level similarity at the time level
- Pull each video frame towards a global video representation in the latent space
- Use InfoNCE contrastive learning loss
- Use an schedule to gradually introduce the contrastive loss
Experiment settings
- Perform experiments to demonstrate performance of method on ImageNet and other image recognition datasets
- Evaluate method on Kinetics dataset for action recognition
- Use Vision Transformer (ViT) architectures
- Use small decoder proposed by He et.al
- Experiment with two alternatives for contrastive learning
- Pre-train with Moments in Time and Kinetics-400 datasets
- Use AdamW optimizer with batch size of 512
- Evaluate pre-training quality by end-to-end finetune
- Use K Top-1 Top-5 for multi-view testing on video datasets
Results and ablations
- Performed experiments to analyze ViC-MAE framework
- Experiments done under learning with negative pairs setting
- Mean pooling used over ViT-B/16 features
- Linear evaluation and end-to-end finetuning done over 100 epochs
Main result
- Evaluated video-to-image transfer learning using ImageNet-1K benchmark
- Results in Table 1 show our method surpasses previous state of the art by 1.58% points of accuracy
- TubeViT is still the best model under ViT/B-16 architecture on Kinetics-400 benchmark
- Our method performs best than other methods in video-to-image transfer
- MAE + contrastive learning model outperforms competing methods by > 3%
Transfer learning performance.
- Evaluated transfer learning performance of model across 12 image classification tasks
- Trained two models using two video datasets
- Model significantly outperformed other baselines on 9 out of 12 datasets
- Investigated effect of various frame-level image transformations, frame separation, and pooling operator
- Using only color augmentations reduced performance by >2%
- Using combination of strong color and spatial augmentations not superior to using only strong spatial augmentations
- Increasing frame separation when using continuous sampling increases performance
- Using only strong spatial augmentations and discarding color augmentations
- Model underperformed compared to TubeViT by 7.1% points in absolute accuracy
- Model underperformed compared to MaskFeat by 0.7% points in absolute accuracy
- Model surpassed MViTv1-B, TimeSformer, and ViVit-B by 0.3%, 0.8%, and 1.5% points in absolute accuracy respectively
- Model underperformed compared to DINO by 1% points in absolute accuracy
- Model operates over video frames using masked image modeling and contrastive learning