Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Foundation models have shown good performance on computer vision tasks
  • Existing models focus on image-level pretraining and adaption, which are limited for complex video-level understanding tasks
  • InternVideo explores masked video modeling and video-language contrastive learning as pretraining objectives
  • InternVideo achieves state-of-the-art performance on 39 video datasets from various tasks
  • InternVideo obtains 91.1% and 77.2% top-1 accuracy on Kinetics-400 and Something-Something V2 benchmarks

Paper Content

Introduction

  • Foundation models gaining increasing attention in research community
  • Simple adaption or zero/few-shot learning to reduce design and training costs
  • Developing foundation models to cultivate cognition from perception
  • Video understanding and tasks less explored compared to image ones
  • High computing burden and current video benchmarks can be handled by exploiting image backbones
  • Transferability of current vision foundation models narrow
  • Propose a cost-effective and versatile model InternVideo
  • Explore popular video masked modeling and multimodal contrastive learning
  • Propose a unified representation learning with both two self-supervised training manners
  • Propose a systematic video understanding benchmark
  • Design a unified video representation (UVR) learning paradigm
  • Make VideoMAE scalable and explore its scalability
  • Efficient and effective multimodal architecture design and training receipt
  • Features from VideoMAE and multimodal models are complementary
  • Make practices and guidelines for training large-scale video foundation models
  • Validate model with state-of-the-art performance in 10 tasks with 39 datasets
  • Propose a lightweight model interaction learning in supervision
  • Masked video encoder scalable in model and data size
  • Pluggable local temporal and global spatiotemporal interaction modules to reuse pretrained ViT
  • Supervised action recognition to further enhance video representation
  • Cross-model attention to conduct feature alignments between encoders
  • Learned video representations outperform rivals in vision-language tasks
  • Current vision models require manually labeled datasets for training
  • Recent works have proposed vision foundation models which use web-scale noisy image-text pairs and manually annotated images
  • Video foundation models have shown promising performance for video recognition, but struggle for video-only tasks
  • Self-supervised learning focuses on designing different pretext tasks for pretraining, such as contrastive learning and masked modeling
  • Multimodal pretraining uses pretrained visual and language encoders to extract offline video and text features, and often includes two or three pretraining tasks

Internvideo

  • InternVideo is a general video foundation model.
  • It uses the vision transformer (ViT) and its variant UniformerV2.
  • It uses self-supervised and supervised training.
  • It integrates the merits of generative and contrastive pertaining.
  • It sets new performance records on 34 benchmarks from 10 mainstream video tasks.

Self-supervised video pretraining

  • InternVideo conducts both masked and contrastive training without supervision for representation learning.
  • Video masked modeling produces features that excel at action discrimination.
  • Video-language contrastive learning is able to understand videos with semantics from text without annotations.
  • Two transformers with different structures are employed for better leveraging optimization targets.

Supervised video post-pretraining

  • Action recognition is a meta task in video downstream applications.
  • Masked video encoder and multimodal one are trained separately as a post-pretraining step.
  • Kinetics-710 is proposed as a unified video benchmark for finetuning the video encoders.

Cross-model interaction

  • Cross-representation learning is conducted with added cross-model attention modules
  • Backbones are frozen except for classification layers and query tokens in the multimodal video encoder
  • Cross-model attention is formed by Multi-Head Cross Attention and Feed-Forward Network
  • Tokens from the masked video encoder are used as keys and values, while queries are from the multimodal video encoder
  • Class token is updated based on tokens from the masked encoder
  • Features in all stages of the masked video encoder and the ones in the final stage of the multimodal video encoder are enhanced
  • Prediction scores are fused with a learnable linear combination

Experiments

  • Details of experimental configurations in Section 4.1
  • Performance of Intern-Video on proposed video understanding benchmark in Section 4.3

Implementations

  • Post-pretrained multi-modal model with WebVid2M, WebVid10M, and HowTo100M
  • Co-trained video model with image-text dataset subset of LAION-400M
  • Alternated images and videos for each iteration
  • Trained for 400k steps on 128 NVIDIA A100 GPUs in 2 weeks
  • Trained VideoMAE-Huge for 1200 epochs on UnlabeledHybrid dataset with 64 80G-A100 GPUs
  • Used cosine annealing learning rate schedule and warmup 10% total epochs
  • Added tanh gating layers in extra MHCA and FFN
  • Trained coordinated models with batch size of 64, learning rate of 5 ร— 10^5, weight decay of 0.001, dropout rate of 0.9, and EMA rate of 0.9999

Downstream tasks

  • Evaluated InternVideo on a spectrum of downstream tasks
  • Improved action understanding and video-language alignment tasks
  • Impressive zero-shot and open-set capabilities
  • Evaluated on 8 datasets
  • Improved performance on temporal action localization, spatiotemporal action localization, video retrieval, video question answering, visual language navigation, zero-shot action recognition, zero-shot video retrieval, and zero-shot multiple choice

Concluding remarks

  • Proposed a versatile and training-efficient video foundation model InternVideo
  • First work to perform best among existing researches on all action understanding, video-language alignment, and video open understanding tasks
  • Achieved state-of-the-art performance on nearly 40 datasets covering 10 different tasks
  • Exploits a unified video representation based on the cross-model learning between masked video learning (VideoMAE) and video-language contrastive modeling
  • Efficient in training, 64.5K GPU hours (A100-80G)
  • Record-breaking results in all used datasets
  • Even for zero-shot and open-set settings, consistent and non-trivial performance increases
  • Exploring model coordination and cognition is necessary for its studies
  • Combining foundation models with decision-making to form intelligent agents
  • InternVideo significantly surpasses previous SOTA methods on almost all benchmarks
  • Exploits 12 million video clips from 5 different domains
  • Results show that InternVideo is robust and effective