Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Foundation models have shown good performance on computer vision tasks
- Existing models focus on image-level pretraining and adaptation, which limits them on complex video-level understanding tasks
- InternVideo explores masked video modeling and video-language contrastive learning as pretraining objectives
- InternVideo achieves state-of-the-art performance on 39 video datasets from various tasks
- InternVideo obtains 91.1% and 77.2% top-1 accuracy on Kinetics-400 and Something-Something V2 benchmarks
Paper Content
Introduction
- Foundation models are gaining increasing attention in the research community
- Simple adaptation or zero/few-shot learning reduces design and training costs
- Developing foundation models to cultivate cognition from perception
- Video understanding and tasks less explored compared to image ones
- Reasons: video carries a high computing burden, and current video benchmarks can be handled by exploiting image backbones
- Transferability of current vision foundation models is narrow
- Propose a cost-effective and versatile model InternVideo
- Explore popular video masked modeling and multimodal contrastive learning
- Propose unified representation learning that combines both self-supervised training approaches
- Propose a systematic video understanding benchmark
- Design a unified video representation (UVR) learning paradigm
- Make VideoMAE scalable and explore its scalability
- Efficient and effective multimodal architecture design and training recipe
- Features from VideoMAE and multimodal models are complementary
- Make practices and guidelines for training large-scale video foundation models
- Validate model with state-of-the-art performance in 10 tasks with 39 datasets
- Propose lightweight cross-model interaction learning trained with supervision
- Masked video encoder scalable in model and data size
- Pluggable local temporal and global spatiotemporal interaction modules to reuse pretrained ViT
- Supervised action recognition to further enhance video representation
- Cross-model attention to conduct feature alignments between encoders
- Learned video representations outperform rivals in vision-language tasks
Related work
- Current vision models require manually labeled datasets for training
- Recent works have proposed vision foundation models which use web-scale noisy image-text pairs and manually annotated images
- Video foundation models have shown promising performance on video recognition, but struggle on video-only tasks
- Self-supervised learning focuses on designing different pretext tasks for pretraining, such as contrastive learning and masked modeling
- Multimodal pretraining uses pretrained visual and language encoders to extract offline video and text features, and often includes two or three pretraining tasks
InternVideo
- InternVideo is a general video foundation model.
- It uses the vision transformer (ViT) and its variant UniformerV2.
- It uses self-supervised and supervised training.
- It integrates the merits of generative and contrastive pretraining.
- It sets new performance records on 39 benchmarks from 10 mainstream video tasks.
Self-supervised video pretraining
- InternVideo conducts both masked and contrastive training without supervision for representation learning.
- Video masked modeling produces features that excel at action discrimination.
- Video-language contrastive learning is able to understand videos with semantics from text without annotations.
- Two transformers with different structures are employed for better leveraging optimization targets.
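The two pretraining objectives above can be illustrated with a small, dependency-free sketch. It is not the authors' code: `tube_mask`, `contrastive_loss`, the 0.9 mask ratio, and the 0.07 temperature are assumptions chosen for illustration, and the loss is a generic CLIP-style symmetric InfoNCE rather than the exact objective used in the paper.

```python
import math
import random


def tube_mask(num_frames, patches_per_frame, mask_ratio=0.9, seed=0):
    """Sample one spatial mask and repeat it over all frames (tube masking).

    Returns `num_frames` rows of booleans; True marks a masked patch.
    The mask ratio is an assumed default, not a value from the summary.
    """
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * patches_per_frame))
    masked = set(rng.sample(range(patches_per_frame), num_masked))
    frame_mask = [i in masked for i in range(patches_per_frame)]
    return [list(frame_mask) for _ in range(num_frames)]


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings."""
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    # Cosine similarity between every video and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))
```

In this sketch the masked encoder would only see patches where the mask is False and reconstruct the rest, while the contrastive loss is low when each video embedding is closest to its own caption embedding within the batch.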
Supervised video post-pretraining
- Action recognition is a meta task in video downstream applications.
- The masked video encoder and the multimodal video encoder are each trained separately as a post-pretraining step.
- Kinetics-710 is proposed as a unified video benchmark for finetuning the video encoders.
Cross-model interaction
- Cross-representation learning is conducted with added cross-model attention modules
- Backbones are frozen except for classification layers and query tokens in the multimodal video encoder
- Cross-model attention is formed by Multi-Head Cross Attention and Feed-Forward Network
- Tokens from the masked video encoder are used as keys and values, while queries are from the multimodal video encoder
- Class token is updated based on tokens from the masked encoder
- Features in all stages of the masked video encoder and the ones in the final stage of the multimodal video encoder are enhanced
- Prediction scores are fused with a learnable linear combination
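The data flow described above can be sketched in plain Python. This is a deliberately simplified illustration, not the paper's module: it uses a single attention head with no learned projections and omits the feed-forward network, it assumes both encoders produce tokens of the same width, and the names `cross_attention`, `gated_cross_model_update`, and `fuse_scores` are hypothetical. The tanh gate follows the gating mentioned under Implementations.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: each query attends over keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim_out = len(values[0])
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim_out)])
    return out


def gated_cross_model_update(mm_tokens, masked_tokens, gate):
    """Residual update of multimodal-encoder tokens from masked-encoder tokens.

    Queries come from the multimodal encoder; keys and values come from the
    masked video encoder. tanh(gate) scales the attention branch, so a gate
    of 0 leaves the multimodal tokens unchanged.
    """
    attended = cross_attention(mm_tokens, masked_tokens, masked_tokens)
    g = math.tanh(gate)
    return [[x + g * a for x, a in zip(tok, att)]
            for tok, att in zip(mm_tokens, attended)]


def fuse_scores(scores_a, scores_b, weight_a, weight_b):
    """Linear combination of two per-class score vectors (weights are learned in practice)."""
    return [weight_a * a + weight_b * b for a, b in zip(scores_a, scores_b)]
```

One appeal of a zero-initialised tanh gate is that the frozen backbones start out behaving exactly as before, and the cross-model branch is blended in only as the gate is learned.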
Experiments
- Details of experimental configurations in Section 4.1
- Performance of InternVideo on proposed video understanding benchmark in Section 4.3
Implementations
- Post-pretrained multi-modal model with WebVid2M, WebVid10M, and HowTo100M
- Co-trained video model with image-text dataset subset of LAION-400M
- Alternated images and videos for each iteration
- Trained for 400k steps on 128 NVIDIA A100 GPUs in 2 weeks
- Trained VideoMAE-Huge for 1200 epochs on UnlabeledHybrid dataset with 64 80G-A100 GPUs
- Used cosine annealing learning rate schedule with warmup over the first 10% of total epochs
- Added tanh gating layers in extra MHCA and FFN
- Trained coordinated models with batch size of 64, learning rate of 5 × 10^-5, weight decay of 0.001, dropout rate of 0.9, and EMA rate of 0.9999
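As a rough illustration of two of the training details listed above, the following sketch computes a cosine-annealed learning rate with a warmup phase and applies an exponential moving average (EMA) to parameters. The function names are hypothetical, and the linear shape of the warmup and the zero final learning rate are assumptions; the summary only states the warmup fraction and the EMA rate.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_frac=0.1, min_lr=0.0):
    """Linear warmup for the first `warmup_frac` of steps, then cosine decay."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Ramp up linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def ema_update(ema_params, params, decay=0.9999):
    """One EMA step: ema = decay * ema + (1 - decay) * param, elementwise."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

With a decay this close to 1, the averaged weights change very slowly, which is why EMA copies are typically used for evaluation rather than for computing gradients.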
Downstream tasks
- Evaluated InternVideo on a spectrum of downstream tasks
- Improved action understanding and video-language alignment tasks
- Impressive zero-shot and open-set capabilities
- Evaluated on 8 datasets
- Improved performance on temporal action localization, spatiotemporal action localization, video retrieval, video question answering, visual language navigation, zero-shot action recognition, zero-shot video retrieval, and zero-shot multiple choice
Concluding remarks
- Proposed a versatile and training-efficient video foundation model InternVideo
- First work to perform best among existing methods on all action understanding, video-language alignment, and video open understanding tasks
- Achieved state-of-the-art performance on nearly 40 datasets covering 10 different tasks
- Exploits a unified video representation based on the cross-model learning between masked video learning (VideoMAE) and video-language contrastive modeling
- Efficient in training, 64.5K GPU hours (A100-80G)
- Record-breaking results in all used datasets
- Even for zero-shot and open-set settings, consistent and non-trivial performance increases
- Further study of video foundation models requires exploring model coordination and cognition
- Combining foundation models with decision-making to form intelligent agents
- InternVideo significantly surpasses previous SOTA methods on almost all benchmarks
- Exploits 12 million video clips from 5 different domains
- Results show that InternVideo is robust and effective