Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Foundation models have shown good performance on computer vision tasks
Existing models focus on image-level pretraining and adaptation, which limits them on complex video-level understanding tasks
InternVideo explores masked video modeling and video-language contrastive learning as pretraining objectives
InternVideo achieves state-of-the-art performance on 39 video datasets from various tasks
InternVideo obtains 91.1% and 77.2% top-1 accuracy on Kinetics-400 and Something-Something V2 benchmarks

Paper Content

Introduction

Foundation models are gaining increasing attention in the research community
Simple adaptation or zero/few-shot learning reduces design and training costs
Foundation models are developed to cultivate cognition from perception
Video understanding tasks are less explored than image ones
This is due to the high computing burden of video and because current video benchmarks can be handled by exploiting image backbones
Transferability of current vision foundation models is narrow
Propose a cost-effective and versatile model InternVideo
Explore popular video masked modeling and multimodal contrastive learning
Propose unified representation learning that combines both self-supervised training manners
Propose a systematic video understanding benchmark
Design a unified video representation (UVR) learning paradigm
Make VideoMAE scalable and explore its scalability
Efficient and effective multimodal architecture design and training recipe
Features from VideoMAE and multimodal models are complementary
Provide practices and guidelines for training large-scale video foundation models
Validate model with state-of-the-art performance in 10 tasks with 39 datasets
Propose lightweight model interaction learning with supervision
Masked video encoder scalable in model and data size
Pluggable local temporal and global spatiotemporal interaction modules to reuse pretrained ViT
Supervised action recognition to further enhance video representation
Cross-model attention to conduct feature alignments between encoders
Learned video representations outperform rivals in vision-language tasks

Related work

Current vision models require manually labeled datasets for training
Recent works have proposed vision foundation models which use web-scale noisy image-text pairs and manually annotated images
Video foundation models have shown promising performance on video recognition, but struggle on video-only tasks
Self-supervised learning focuses on designing different pretext tasks for pretraining, such as contrastive learning and masked modeling
Multimodal pretraining uses pretrained visual and language encoders to extract offline video and text features, and often includes two or three pretraining tasks

InternVideo

InternVideo is a general video foundation model.
It uses the vision transformer (ViT) and its variant UniformerV2.
It uses self-supervised and supervised training.
It integrates the merits of generative and contrastive pretraining.
It sets new performance records on 39 benchmarks from 10 mainstream video tasks.

Self-supervised video pretraining

InternVideo conducts both masked and contrastive training without supervision for representation learning.
Video masked modeling produces features that excel at action discrimination.
Video-language contrastive learning is able to understand videos with semantics from text without annotations.
Two transformers with different structures are employed, one per objective, to better leverage the two optimization targets.
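
As a rough illustration of the video-language contrastive objective described above, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired video and text embeddings. This is a generic formulation and not necessarily the exact loss used by InternVideo; the function names, the temperature of 0.07, and the toy embeddings are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss; pair i of each list is the positive.

    The temperature default is an assumed value, not taken from the paper.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # Cosine similarity between every video and every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    # Video-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy usage: matched pairs give a much lower loss than mismatched pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = contrastive_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
```

The loss only needs paired video and text, which is why this style of training can pick up semantics from text without manual annotations.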

Supervised video post-pretraining

Action recognition is a meta task in video downstream applications.
The masked video encoder and the multimodal encoder are trained separately as a supervised post-pretraining step.
Kinetics-710 is proposed as a unified video benchmark for finetuning the video encoders.

Cross-model interaction

Cross-representation learning is conducted with added cross-model attention modules
Backbones are frozen except for classification layers and query tokens in the multimodal video encoder
Cross-model attention is formed by Multi-Head Cross Attention and Feed-Forward Network
Tokens from the masked video encoder are used as keys and values, while queries are from the multimodal video encoder
Class token is updated based on tokens from the masked encoder
Features in all stages of the masked video encoder and the ones in the final stage of the multimodal video encoder are enhanced
Prediction scores are fused with a learnable linear combination
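
The cross-model attention described above can be sketched roughly as follows. This is a minimal single-head version under stated assumptions: learned projections and the Feed-Forward Network are omitted, the scalar `gate` stands in for a tanh gating parameter, and `fuse_scores` uses two hypothetical learnable weights. All names are illustrative and none come from the InternVideo code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys_values):
    """Each query token attends over tokens from the other encoder.

    The same tokens serve as keys and values (no learned projections here).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys_values])
        attended.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(len(q))]
        )
    return attended


def gated_cross_model_update(queries, keys_values, gate):
    """Residual update: query + tanh(gate) * attention output.

    With gate = 0 the update is an identity, so the frozen backbone's
    features are untouched at the start of training.
    """
    g = math.tanh(gate)
    attended = cross_attention(queries, keys_values)
    return [[qd + g * ad for qd, ad in zip(q, a)] for q, a in zip(queries, attended)]


def fuse_scores(scores_a, scores_b, weight_a, weight_b):
    """Linear combination of two encoders' prediction scores."""
    return [weight_a * a + weight_b * b for a, b in zip(scores_a, scores_b)]


# Toy usage: one query token (multimodal encoder) and two key/value tokens
# (masked video encoder).
query_tokens = [[1.0, 0.0]]
masked_tokens = [[0.0, 2.0], [0.0, 2.0]]
unchanged = gated_cross_model_update(query_tokens, masked_tokens, gate=0.0)
updated = gated_cross_model_update(query_tokens, masked_tokens, gate=20.0)
fused = fuse_scores([0.2, 0.8], [0.6, 0.4], 0.5, 0.5)
```

A zero-initialized gate is one common reason to use tanh gating on newly added modules: the added attention starts out as a no-op and is blended in gradually as the gate is learned.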

Experiments

Details of experimental configurations in Section 4.1
Performance of InternVideo on proposed video understanding benchmark in Section 4.3

Implementations

Post-pretrained multi-modal model with WebVid2M, WebVid10M, and HowTo100M
Co-trained video model with image-text dataset subset of LAION-400M
Alternated images and videos for each iteration
Trained for 400k steps on 128 NVIDIA A100 GPUs in 2 weeks
Trained VideoMAE-Huge for 1200 epochs on UnlabeledHybrid dataset with 64 80G-A100 GPUs
Used cosine annealing learning rate schedule with warmup over the first 10% of total epochs
Added tanh gating layers in extra MHCA and FFN
Trained coordinated models with batch size of 64, learning rate of 5 × 10^-5, weight decay of 0.001, dropout rate of 0.9, and EMA rate of 0.9999
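
To make the schedule and EMA settings above concrete, here is a small sketch of a cosine annealing learning rate with linear warmup over the first 10% of training, plus an exponential moving average (EMA) update of model weights. It works in steps rather than epochs, and the function names, the linear warmup shape, and the `min_lr` parameter are assumptions for illustration, not details confirmed by the paper.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_frac=0.1, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine annealing."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup (assumed shape).
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def ema_update(ema_params, params, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Toy usage with the learning rate quoted above for the coordinated models.
base = 5e-5
start_lr = lr_at(0, 1000, base)      # beginning of warmup
peak_lr = lr_at(100, 1000, base)     # end of warmup
mid_lr = lr_at(550, 1000, base)      # halfway through the cosine phase
final_lr = lr_at(1000, 1000, base)   # end of training
ema = ema_update([1.0], [0.0])
```

With an EMA rate of 0.9999, each step moves the averaged weights only 0.01% of the way toward the current weights, so the average changes slowly and smooths out step-to-step noise.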

Downstream tasks

Evaluated InternVideo on a spectrum of downstream tasks
Improved action understanding and video-language alignment tasks
Impressive zero-shot and open-set capabilities
Evaluated on 8 datasets
Improved performance on temporal action localization, spatiotemporal action localization, video retrieval, video question answering, visual language navigation, zero-shot action recognition, zero-shot video retrieval, and zero-shot multiple choice

Concluding remarks

Proposed a versatile and training-efficient video foundation model InternVideo
First work to perform best among existing methods across action understanding, video-language alignment, and video open understanding tasks
Achieved state-of-the-art performance on nearly 40 datasets covering 10 different tasks
Exploits a unified video representation based on the cross-model learning between masked video learning (VideoMAE) and video-language contrastive modeling
Efficient in training, 64.5K GPU hours (A100-80G)
Record-breaking results in all used datasets
Even for zero-shot and open-set settings, consistent and non-trivial performance increases
Further exploration of model coordination and cognition is noted as necessary for future studies
Combining foundation models with decision-making to form intelligent agents
InternVideo significantly surpasses previous SOTA methods on almost all benchmarks
Exploits 12 million video clips from 5 different domains
Results show that InternVideo is robust and effective
