Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- EVA is a vision-centric foundation model that uses publicly accessible data.
- EVA is pre-trained to reconstruct masked out image-text aligned vision features.
- EVA can be scaled up to one billion parameters and sets new records on a range of vision tasks.
- EVA can serve as a vision-centric, multi-modal pivot to connect images and text.
- EVA can be used to initialize the vision tower of a giant CLIP.
Paper Content
Introduction
- Scaling up pre-trained language models (PLMs) has revolutionized natural language processing (NLP).
- The key to this success lies in the self-supervised learning task of masked signal prediction.
- Transformer models can be scaled up to billions of parameters using nearly unlimited unlabelled data.
- Masked image modeling (MIM) has boomed as a viable approach for vision model pretraining and scaling.
- EVA is a vanilla ViT encoder scaled up to one billion parameters with strong visual representations.
- EVA uses 29.6 million public accessible unlabeled images for pretraining.
- EVA sets new records on several representative vision benchmarks.
- EVA does not need a costly supervised training stage.
- EVA makes a significant breakthrough in the challenging large vocabulary object-level recognition task.
- EVA can serve as a vision-centric, multi-modal pivot that builds a bridge between vision and language.
- EVA can greatly stabilize the giant CLIP’s training & optimization process.
- EVA bridges the gap between vision and language with masked signal modeling.
The feature instrumentality project
- MIM vision pretext task with compelling transfer performance is studied
- Two promising candidates are recovering masked out tokenized semantic vision features and feature distillation from strong pre-trained representation
- CLIP vision features are used
- Pilot experiments show that tokenization process is unnecessary and feature distillation fails to provide consistent performance gain
- Reconstructing masked out CLIP vision features is highly performant
- Pretext task can scale up to billion-scale parameters and tens of millions of unlabeled images
Pre-training
- EVA is a vanilla ViT with 1.0B parameters
- Pre-training objective is to reconstruct masked out image-text aligned vision features
- 40% of input patches are corrupted with [MASK] tokens
- Pre-training data includes CC12M, CC3M, COCO, ADE20K, ImageNet-21K, and Object365
- Pre-training is optimized via Adam with decoupled weight decay of 0.05
- Pre-training infrastructure is NVIDIA A100-SXM4-40GB and PyTorch
- Pre-training uses fp16 format with dynamic loss scaling
Evaluation on downstream tasks
- Evaluated pre-trained EVA on image classification, video action recognition, object detection & instance segmentation, semantic segmentation, and contrastive image-text pre-training
- Achieved state-of-the-art performance on downstream tasks
Image classification
- Evaluated EVA on ImageNet-1K validation set
- Training settings: intermediate fine-tuning on ImageNet-21K for 60 epochs, then EVA is further fine-tuned on ImageNet-1K training set for 10 epochs
- EVA achieves 89.6% top-1 accuracy with 336x336 inputs
- Robustness & generalization ability evaluated on 6 different ImageNet-1K validation set variants
- EVA achieves highest averaged accuracy and smallest performance gap
- Evaluated EVA on Kinetics-400, Kinetics-600 and Kinetics-700 benchmarks
- Training & evaluation settings: EVA processes video data via spatial-temporal attention, trained using K-722 training set for 40 epochs with 8 frames and 224x224 resolution
- EVA achieves better performance compared with some recent video-specific or large foundation models in video recognition
Object detection & instance segmentation
- Evaluated object detection and instance segmentation performance on COCO and LVISv1.0 benchmark
- COCO has 80 object categories, LVISv1.0 has 1,200+
- Reported box AP, mask AP, and mask rare AP on LVISv1.0
- Used same model architecture and hyperparameters for COCO and LVISv1.0
- Achieved state-of-the-art results on both COCO and LVISv1.0
- Closed the performance gap between LVISv1.0 and COCO
- Analyzed the performance gap between LVISv1.0 and COCO
- Reported semantic segmentation performance on ADE20K and COCO-Stuff-164K
Semantic segmentation
- Evaluated EVA on ADE20K and COCO-Stuff-164K datasets
- Weakened model adaptation processes due to GPU memory limitation
- EVA achieves strong results in both datasets
- CLIP (Contrastive Language-Image Pre-training) connects vision and language
- EVA CLIP-g has 78.5 top-1 accuracy
- EVA CLIP-g has highest zero-shot classification accuracy averaged on 12 benchmarks
- EVA CLIP-g has smallest performance drop when facing natural distribution shifts
- EVA CLIP-g is largest performant CLIP model trained via publicly accessible data and resources
- Interleaved MIM & image-text contrastive pre-training is an efficient and scalable CLIP training approach
Related work
- Masked image modeling (MIM) learns visual representations by predicting masked visual contents.
- ViT and iGPT report the first meaningful MIM pre-training results.
- BEiT family improves MIM performance via masked visual token prediction.
- Recent work explores pixel/feature regression in MIM.
- ConvNets have been the standard visual architecture.
- ViTs with hierarchical architectures and multi-modal representations demonstrate various vision benchmarks.
- Vanilla ViT can be scaled up to billion-scale parameters.
Conclusion
- Launched EVA, a one billion parameters vanilla ViT encoder
- Explored the limits of masked visual representation learning
- Showed simple masked feature modeling as a visual learning pretext task scales well
- Attained excellent results in a representative & diverse set of downstream tasks
- Bridging the gap between vision and language study via masked modeling
- Merged dataset Kinetics-722 (K-722) with all valid training samples from Kinetics-400 (K-400), Kinetics-600 (K-600) and Kinetics-700 (K-700)
- Input video resolution is 224x224 with 8 frames
- Removed leaked videos in all validation sets and duplicated videos in all training sets
- Further fine-tuned on each dataset using more input video frames of 16 with a resolution of 224x224
- Multi-view inference with 4 temporal clips and 3 spatial crops
- Hyper-parameters for fine-tuning on K-400, K-600 and K-700
- Intermediate fine-tuning on Objects365 with batch size of 128 for 380k iterations
- Fine-tuning COCO and LVIS with learning rate initialized as 2.5e-5
- Semantic segmentation settings following ViT-Adapter with Mask2Former as the segmentation head
- Pre-trained weights for ADE20K initialized from COCO-Stuff
- Regressing the masked out image-text aligned vision features (i.e., CLIP features) scales up well
- New state-of-the-art ImageNet-1K image classification result with a canonical linear classifier
- New state-of-the-art results in object detection and instance segmentation tasks on both COCO val and test-dev splits