Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

EVA is a vision-centric foundation model that uses only publicly accessible data.
EVA is pre-trained to reconstruct masked-out image-text aligned vision features.
EVA can be scaled up to one billion parameters and sets new records on a range of vision tasks.
EVA can serve as a vision-centric, multi-modal pivot to connect images and text.
EVA can be used to initialize the vision tower of a giant CLIP.

Scaling up pre-trained language models (PLMs) has revolutionized natural language processing (NLP).
The key to this success lies in the self-supervised learning task of masked signal prediction.
Transformer models can be scaled up to billions of parameters using nearly unlimited unlabelled data.
Masked image modeling (MIM) has boomed as a viable approach for vision model pretraining and scaling.
EVA is a vanilla ViT encoder scaled up to one billion parameters with strong visual representations.
EVA uses 29.6 million publicly accessible unlabeled images for pretraining.
EVA sets new records on several representative vision benchmarks.
EVA does not need a costly supervised training stage.
EVA makes a significant breakthrough in the challenging task of large-vocabulary object-level recognition.
EVA can serve as a vision-centric, multi-modal pivot that builds a bridge between vision and language.
EVA can greatly stabilize the giant CLIP’s training & optimization process.
EVA bridges the gap between vision and language with masked signal modeling.

A MIM vision pretext task with compelling transfer performance is sought
Two promising candidates are recovering masked-out tokenized semantic vision features and distilling features from a strong pre-trained representation
Both candidates use CLIP vision features
Pilot experiments show that the tokenization process is unnecessary and that feature distillation fails to provide a consistent performance gain
Directly reconstructing masked-out CLIP vision features is highly performant
This pretext task scales to billion-scale parameters and tens of millions of unlabeled images
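
The masked feature reconstruction idea above can be sketched in plain Python, without any deep learning framework. This is only an illustration under stated assumptions: the notes do not say how masked positions are chosen or which loss is used, so uniform random masking and a negative cosine similarity loss are assumed here. The names `sample_mask` and `masked_feature_loss` and the toy feature vectors are hypothetical. The default ratio of 0.4 is the 40% masking ratio listed in the pre-training settings.

```python
import math
import random


def sample_mask(num_patches, mask_ratio=0.4, rng=None):
    """Pick which patch positions get replaced by a [MASK] token.

    Assumption: positions are drawn uniformly at random.
    """
    if rng is None:
        rng = random.Random()
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def masked_feature_loss(predicted, target, mask):
    """Average loss over masked positions; visible patches are ignored.

    Assumption: the per-patch loss is negative cosine similarity.
    """
    per_patch = [
        -cosine_similarity(p, t)
        for p, t, is_masked in zip(predicted, target, mask)
        if is_masked
    ]
    return sum(per_patch) / len(per_patch)


# Toy usage: 10 patches with 4-dimensional stand-ins for CLIP features.
rng = random.Random(0)
mask = sample_mask(10, rng=rng)
target = [[rng.uniform(0.1, 1.0) for _ in range(4)] for _ in range(10)]
loss_perfect = masked_feature_loss(target, target, mask)  # prediction equals target
```

In this sketch the loss reaches its minimum of -1 when every predicted feature at a masked position points in the same direction as its target feature. Visible patches contribute nothing to the loss.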

EVA is a vanilla ViT with 1.0B parameters
Pre-training objective is to reconstruct masked out image-text aligned vision features
40% of input patches are corrupted with [MASK] tokens
Pre-training data includes CC12M, CC3M, COCO, ADE20K, ImageNet-21K, and Objects365
Pre-training is optimized with Adam with decoupled weight decay (AdamW), using a weight decay of 0.05
Pre-training runs on NVIDIA A100-SXM4-40GB GPUs with PyTorch
Pre-training uses fp16 format with dynamic loss scaling
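
To make "Adam with decoupled weight decay" concrete, here is a minimal sketch of one update step on a list of scalar weights. Only the weight decay of 0.05 comes from the notes above. The learning rate, betas, and epsilon are common defaults assumed for illustration, and `adamw_step` and `init_state` are hypothetical helper names, not part of any library.

```python
import math


def init_state(num_params):
    """Create the step counter and zeroed moment estimates."""
    return {"step": 0, "m": [0.0] * num_params, "v": [0.0] * num_params}


def adamw_step(params, grads, state, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.05):
    """Apply one Adam update with decoupled weight decay, in place.

    Assumption: lr, betas and eps are illustrative defaults; only
    weight_decay=0.05 is taken from the notes.
    """
    beta1, beta2 = betas
    state["step"] += 1
    t = state["step"]
    for i, (p, g) in enumerate(zip(params, grads)):
        # Decoupled decay: shrink the weight directly instead of
        # adding weight_decay * p to the gradient.
        p = p * (1.0 - lr * weight_decay)
        # Standard Adam moment estimates with bias correction.
        state["m"][i] = beta1 * state["m"][i] + (1.0 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1.0 - beta2) * g * g
        m_hat = state["m"][i] / (1.0 - beta1 ** t)
        v_hat = state["v"][i] / (1.0 - beta2 ** t)
        params[i] = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


# Toy usage: with a zero gradient, only the decay term moves the weights.
weights = [1.0, -2.0]
state = init_state(len(weights))
adamw_step(weights, [0.0, 0.0], state, lr=0.1)
```

Because the decay is applied to the weight itself rather than folded into the gradient, it is not rescaled by the adaptive second-moment term. That is the difference from Adam with plain L2 regularization.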

Evaluated pre-trained EVA on image classification, video action recognition, object detection & instance segmentation, semantic segmentation, and contrastive image-text pre-training
Achieved state-of-the-art performance on downstream tasks

Evaluated EVA on ImageNet-1K validation set
Training settings: EVA is first fine-tuned on ImageNet-21K for 60 epochs (intermediate fine-tuning), then further fine-tuned on the ImageNet-1K training set for 10 epochs
EVA achieves 89.6% top-1 accuracy with 336x336 inputs
Robustness and generalization ability are evaluated on 6 different ImageNet-1K validation set variants
EVA achieves the highest averaged accuracy and the smallest performance gap on these variants

Evaluated EVA on the Kinetics-400, Kinetics-600 and Kinetics-700 benchmarks
Training & evaluation settings: EVA processes video data via spatial-temporal attention and is trained on the K-722 training set for 40 epochs with 8 frames at 224x224 resolution
EVA outperforms some recent video-specific and large foundation models in video recognition

Evaluated object detection and instance segmentation performance on COCO and LVISv1.0 benchmark
COCO has 80 object categories, while LVISv1.0 has more than 1,200
Reported box AP, mask AP, and mask rare AP on LVISv1.0
Used the same model architecture and hyperparameters for COCO and LVISv1.0
Achieved state-of-the-art results on both COCO and LVISv1.0
Analyzed and closed the performance gap between LVISv1.0 and COCO

Evaluated EVA on ADE20K and COCO-Stuff-164K datasets
Used a weakened model adaptation process because of GPU memory limitations
EVA achieves strong results on both datasets

CLIP (Contrastive Language-Image Pre-training) connects vision and language
EVA CLIP-g reaches 78.5% zero-shot top-1 accuracy on ImageNet-1K
EVA CLIP-g has the highest zero-shot classification accuracy averaged over 12 benchmarks
EVA CLIP-g has the smallest performance drop under natural distribution shifts
EVA CLIP-g is the largest performant CLIP model trained with publicly accessible data and resources
Interleaved MIM & image-text contrastive pre-training is an efficient and scalable CLIP training approach
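
The zero-shot classification results above rest on a simple mechanism: an image is assigned to the class whose text embedding is closest to the image embedding in the shared space. The sketch below shows that step only, in plain Python. The image and text encoders are not shown, the 3-dimensional embeddings are made up, and `zero_shot_classify` and the "a photo of a ..." prompts are assumptions for illustration rather than details taken from the notes.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return (best_label, scores) using cosine similarity in the shared space."""
    image = l2_normalize(image_embedding)
    scores = {}
    for label, text_embedding in class_text_embeddings.items():
        text = l2_normalize(text_embedding)
        scores[label] = sum(a * b for a, b in zip(image, text))
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy usage with made-up embeddings standing in for encoder outputs.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
label, scores = zero_shot_classify([0.8, 0.3, 0.1], text_embeddings)
```

No classifier is trained on the target dataset in this setup. The class names alone define the decision rule, which is why accuracy can be reported on benchmarks the model was never fine-tuned on.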

Masked image modeling (MIM) learns visual representations by predicting masked visual contents.
ViT and iGPT report the first meaningful MIM pre-training results.
The BEiT family improves MIM performance via masked visual token prediction.
Recent work explores pixel/feature regression in MIM.
ConvNets have long been the standard visual architecture.
ViTs with hierarchical architectures and multi-modal representations have shown strong results on various vision benchmarks.
Vanilla ViT can be scaled up to billion-scale parameters.

Launched EVA, a one-billion-parameter vanilla ViT encoder
Explored the limits of masked visual representation learning
Showed that simple masked feature modeling scales well as a visual learning pretext task
Attained excellent results in a representative & diverse set of downstream tasks
Bridged the gap between vision and language via masked signal modeling

Built the merged dataset Kinetics-722 (K-722) from all valid training samples of Kinetics-400 (K-400), Kinetics-600 (K-600) and Kinetics-700 (K-700)
Input video resolution is 224x224 with 8 frames
Removed leaked videos in all validation sets and duplicated videos in all training sets
Further fine-tuned on each dataset using 16 input video frames at a resolution of 224x224
Multi-view inference with 4 temporal clips and 3 spatial crops
Hyper-parameters for fine-tuning on K-400, K-600 and K-700
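
The multi-view inference setting above (4 temporal clips and 3 spatial crops, so 12 views per video) can be sketched as follows. The notes do not say how the views are combined, so averaging per-class scores is an assumption here. `aggregate_views` and the toy scores are hypothetical.

```python
NUM_TEMPORAL_CLIPS = 4  # from the notes
NUM_SPATIAL_CROPS = 3   # from the notes


def aggregate_views(view_scores):
    """Average per-class scores over all views.

    view_scores[clip][crop] is a list of per-class scores for one view.
    Returns (predicted_class_index, mean_scores).
    Assumption: views are combined by a plain average.
    """
    flat = [scores for clip in view_scores for scores in clip]
    num_classes = len(flat[0])
    mean_scores = [
        sum(scores[c] for scores in flat) / len(flat)
        for c in range(num_classes)
    ]
    predicted = max(range(num_classes), key=lambda c: mean_scores[c])
    return predicted, mean_scores


# Toy usage: 4 clips x 3 crops = 12 views over 2 classes, with made-up scores.
views = [
    [[0.6, 0.4] for _ in range(NUM_SPATIAL_CROPS)]
    for _ in range(NUM_TEMPORAL_CLIPS)
]
views[0][0] = [0.2, 0.8]  # one view disagrees with the rest
predicted, mean_scores = aggregate_views(views)
```

Averaging over views smooths out a single clip or crop that misses the action. In the toy example the one dissenting view does not change the final prediction.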
Intermediate fine-tuning on Objects365 with batch size of 128 for 380k iterations
Fine-tuned on COCO and LVIS with an initial learning rate of 2.5e-5

Semantic segmentation settings follow ViT-Adapter, with Mask2Former as the segmentation head
ADE20K models are initialized from COCO-Stuff pre-trained weights

Regressing the masked-out image-text aligned vision features (i.e., CLIP features) scales up well
New state-of-the-art ImageNet-1K image classification result with a canonical linear classifier
New state-of-the-art results in object detection and instance segmentation tasks on both COCO val and test-dev splits