Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • EVA is a vision-centric foundation model that uses publicly accessible data.
  • EVA is pre-trained to reconstruct masked out image-text aligned vision features.
  • EVA can be scaled up to one billion parameters and sets new records on a range of vision tasks.
  • EVA can serve as a vision-centric, multi-modal pivot to connect images and text.
  • EVA can be used to initialize the vision tower of a giant CLIP.

Paper Content

Introduction

  • Scaling up pre-trained language models (PLMs) has revolutionized natural language processing (NLP).
  • The key to this success lies in the self-supervised learning task of masked signal prediction.
  • Transformer models can be scaled up to billions of parameters using nearly unlimited unlabelled data.
  • Masked image modeling (MIM) has boomed as a viable approach for vision model pretraining and scaling.
  • EVA is a vanilla ViT encoder scaled up to one billion parameters with strong visual representations.
  • EVA uses 29.6 million publicly accessible unlabeled images for pretraining.
  • EVA sets new records on several representative vision benchmarks.
  • EVA does not need a costly supervised training stage.
  • EVA makes a significant breakthrough in the challenging large vocabulary object-level recognition task.
  • EVA can serve as a vision-centric, multi-modal pivot that builds a bridge between vision and language.
  • EVA can greatly stabilize the giant CLIP’s training & optimization process.
  • EVA bridges the gap between vision and language with masked signal modeling.

The feature instrumentality project

  • The project looks for a MIM vision pretext task with compelling transfer performance
  • Two promising candidates: (i) recovering masked-out tokenized semantic vision features, and (ii) feature distillation from a strong pre-trained representation
  • CLIP vision features serve as the image-text aligned vision features
  • Pilot experiments show that the tokenization process is unnecessary and that feature distillation fails to provide a consistent performance gain
  • Simply reconstructing the masked-out CLIP vision features is highly performant
  • This pretext task scales up to billion-scale parameters and tens of millions of unlabeled images
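
The pretext task described above (regressing CLIP vision features at masked positions) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not state the loss function, so negative cosine similarity is an assumption here, as are uniform random masking, the 256-patch sequence length, and the function names `sample_mask` and `masked_feature_loss`. The 0.4 default reflects the 40% masking ratio listed under Pre-training.

```python
import math
import random

def sample_mask(num_patches, mask_ratio=0.4, seed=0):
    """Mark round(num_patches * mask_ratio) patches as masked (True).

    Assumption: uniform random masking, chosen only for simplicity.
    """
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_ids for i in range(num_patches)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def masked_feature_loss(predicted, target, mask):
    """Mean negative cosine similarity, computed over masked positions only.

    predicted: per-patch features from the model being pre-trained.
    target:    per-patch features from the frozen teacher (CLIP in the paper).
    Assumption: the loss choice is illustrative, not taken from the summary.
    """
    losses = [-cosine_similarity(p, t)
              for p, t, m in zip(predicted, target, mask) if m]
    if not losses:
        raise ValueError("mask selects no positions")
    return sum(losses) / len(losses)

# Toy usage with random stand-in features (hypothetical sizes).
rng = random.Random(1)
target = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(256)]
predicted = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(256)]
mask = sample_mask(256)
print(sum(mask))  # 102 masked patches out of 256
print(masked_feature_loss(predicted, target, mask))
```

Visible (unmasked) positions contribute nothing to the loss, so the model is only rewarded for inferring the teacher features it cannot see directly.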

Pre-training

  • EVA is a vanilla ViT with 1.0B parameters
  • Pre-training objective is to reconstruct masked out image-text aligned vision features
  • 40% of input patches are corrupted with [MASK] tokens
  • Pre-training data includes CC12M, CC3M, COCO, ADE20K, ImageNet-21K, and Objects365
  • Pre-training is optimized via Adam with decoupled weight decay of 0.05
  • Pre-training infrastructure is NVIDIA A100-SXM4-40GB and PyTorch
  • Pre-training uses fp16 format with dynamic loss scaling
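
Dynamic loss scaling, mentioned in the last bullet, keeps small fp16 gradients from underflowing by multiplying the loss by a scale factor that adapts during training. The sketch below shows the general technique only. The class name `DynamicLossScaler` and all default values are illustrative assumptions, not settings reported for EVA.

```python
import math

class DynamicLossScaler:
    """Generic dynamic loss scaler (illustrative defaults, not EVA's)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Backpropagating the scaled loss scales every gradient equally.
        return loss * self.scale

    def unscale(self, scaled_grads):
        """Return unscaled gradients, or None if any value overflowed."""
        if any(not math.isfinite(g) for g in scaled_grads):
            return None
        return [g / self.scale for g in scaled_grads]

    def update(self, found_overflow):
        if found_overflow:
            # Overflow: the optimizer step is skipped and the scale shrinks.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                # A long run of clean steps: try a larger scale again.
                self.scale *= self.growth_factor
                self._good_steps = 0

# Toy usage: one overflowing step halves the scale.
scaler = DynamicLossScaler(growth_interval=2)
grads = scaler.unscale([float("inf"), 1.0])
scaler.update(found_overflow=grads is None)
print(scaler.scale)  # 32768.0
```

The scale shrinks immediately on overflow but grows only after many clean steps, which biases the scheme toward stability.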

Evaluation on downstream tasks

  • Evaluated pre-trained EVA on image classification, video action recognition, object detection & instance segmentation, semantic segmentation, and contrastive image-text pre-training
  • Achieved state-of-the-art performance on downstream tasks

Image classification

  • Evaluated EVA on ImageNet-1K validation set
  • Training settings: intermediate fine-tuning on ImageNet-21K for 60 epochs, then EVA is further fine-tuned on ImageNet-1K training set for 10 epochs
  • EVA achieves 89.6% top-1 accuracy with 336x336 inputs
  • Robustness & generalization ability evaluated on 6 different ImageNet-1K validation set variants
  • EVA achieves highest averaged accuracy and smallest performance gap

Video action recognition

  • Evaluated EVA on Kinetics-400, Kinetics-600 and Kinetics-700 benchmarks
  • Training & evaluation settings: EVA processes video data via spatial-temporal attention and is trained on the merged Kinetics-722 (K-722) training set for 40 epochs with 8 frames at 224x224 resolution
  • EVA achieves better performance compared with some recent video-specific or large foundation models in video recognition
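
The image classification results in this summary rest on two quantities: top-1 accuracy, and an averaged accuracy with a performance gap across validation-set variants. The sketch below shows one way to compute both. The gap definition (reference accuracy minus the average over all variants) is an assumption, since the summary does not define it, and the variant names and numbers in the usage lines are made-up placeholders, not results from the paper.

```python
def top1_accuracy(logits, labels):
    """Percentage of samples whose highest-scoring class equals the label."""
    correct = 0
    for scores, label in zip(logits, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)

def average_and_gap(accuracy_by_variant, reference_variant):
    """Average accuracy over all variants, and the gap to the reference one.

    Assumption: gap = accuracy on the reference set minus the average.
    """
    average = sum(accuracy_by_variant.values()) / len(accuracy_by_variant)
    gap = accuracy_by_variant[reference_variant] - average
    return average, gap

# Toy usage with placeholder numbers (not taken from the paper).
logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
labels = [1, 0, 0]
print(top1_accuracy(logits, labels))

placeholder = {"original": 90.0, "shifted": 80.0}
print(average_and_gap(placeholder, "original"))  # (85.0, 5.0)
```

A smaller gap means accuracy degrades less when the evaluation data drifts away from the original validation set.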

Object detection & instance segmentation

  • Evaluated object detection and instance segmentation performance on COCO and LVISv1.0 benchmark
  • COCO has 80 object categories, LVISv1.0 has 1,200+
  • Reported box AP, mask AP, and mask rare AP on LVISv1.0
  • Used same model architecture and hyperparameters for COCO and LVISv1.0
  • Achieved state-of-the-art results on both COCO and LVISv1.0
  • Closed the performance gap between LVISv1.0 and COCO
  • Analyzed the performance gap between LVISv1.0 and COCO
  • Reported semantic segmentation performance on ADE20K and COCO-Stuff-164K
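
Box AP and mask AP, the metrics reported above, are built on intersection over union (IoU) between predictions and ground truth. As a reminder of that building block, here is a minimal box IoU sketch. It is the standard textbook computation, not the evaluation code used for EVA. The threshold list follows the usual COCO-style convention of averaging over IoU thresholds from 0.50 to 0.95 in steps of 0.05, and the helper names are hypothetical.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(iou):
    """Number of thresholds at which a detection with this IoU is a match."""
    return sum(1 for t in IOU_THRESHOLDS if iou >= t)

# Toy usage: two 2x2 boxes overlapping in a 1x1 square.
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
print(iou)                      # 1/7, about 0.143
print(thresholds_passed(0.72))  # 5
```

Mask AP follows the same idea with pixel masks in place of boxes.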

Semantic segmentation

  • Evaluated EVA on ADE20K and COCO-Stuff-164K datasets
  • Weakened model adaptation processes due to GPU memory limitation
  • EVA achieves strong results in both datasets

Contrastive image-text pre-training

  • CLIP (Contrastive Language-Image Pre-training) connects vision and language
  • EVA CLIP-g reaches 78.5% zero-shot top-1 accuracy on ImageNet-1K
  • EVA CLIP-g has the highest zero-shot classification accuracy averaged over 12 benchmarks
  • EVA CLIP-g has the smallest performance drop when facing natural distribution shifts
  • EVA CLIP-g is the largest performant CLIP model trained via publicly accessible data and resources
  • Interleaved MIM & image-text contrastive pre-training is an efficient and scalable CLIP training approach

Related work

  • Masked image modeling (MIM) learns visual representations by predicting masked visual contents.
  • ViT and iGPT report the first meaningful MIM pre-training results.
  • BEiT family improves MIM performance via masked visual token prediction.
  • Recent work explores pixel/feature regression in MIM.
  • ConvNets have been the standard visual architecture.
  • ViTs with hierarchical architectures and multi-modal representations demonstrate strong results on various vision benchmarks.
  • Vanilla ViT can be scaled up to billion-scale parameters.
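
The zero-shot classification results quoted for EVA CLIP-g earlier in this summary rely on the general CLIP recipe: embed the image and one text prompt per class, then pick the class whose text embedding is most similar to the image embedding. The sketch below illustrates that recipe with toy vectors. The function names and embeddings are hypothetical stand-ins. In practice the embeddings would come from the CLIP image and text encoders.

```python
import math

def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def zero_shot_classify(image_embedding, text_embedding_by_class):
    """Return the class whose text embedding has the highest cosine
    similarity with the image embedding, plus all similarity scores."""
    image = l2_normalize(image_embedding)
    scores = {}
    for class_name, text_embedding in text_embedding_by_class.items():
        text = l2_normalize(text_embedding)
        scores[class_name] = sum(a * b for a, b in zip(image, text))
    best = max(scores, key=scores.get)
    return best, scores

# Toy usage with made-up 2-d embeddings.
text_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
label, scores = zero_shot_classify([0.9, 0.2], text_embeddings)
print(label)  # cat
```

No classifier is trained on the target dataset, which is why the evaluation is called zero-shot.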

Conclusion

  • Launched EVA, a one-billion-parameter vanilla ViT encoder
  • Explored the limits of masked visual representation learning
  • Showed that simple masked feature modeling, as a visual learning pretext task, scales well
  • Attained excellent results in a representative & diverse set of downstream tasks
  • Bridged the gap between vision and language study via masked modeling
  • Merged dataset Kinetics-722 (K-722) with all valid training samples from Kinetics-400 (K-400), Kinetics-600 (K-600) and Kinetics-700 (K-700)
  • Input video resolution is 224x224 with 8 frames
  • Removed leaked videos in all validation sets and duplicated videos in all training sets
  • Further fine-tuned on each dataset using 16 input video frames at a resolution of 224x224
  • Multi-view inference with 4 temporal clips and 3 spatial crops
  • Hyper-parameters for fine-tuning on K-400, K-600 and K-700
  • Intermediate fine-tuning on Objects365 with batch size of 128 for 380k iterations
  • Fine-tuning COCO and LVIS with learning rate initialized as 2.5e-5
  • Semantic segmentation settings following ViT-Adapter with Mask2Former as the segmentation head
  • Pre-trained weights for ADE20K initialized from COCO-Stuff
  • Regressing the masked out image-text aligned vision features (i.e., CLIP features) scales up well
  • New state-of-the-art ImageNet-1K image classification result with a canonical linear classifier
  • New state-of-the-art results in object detection and instance segmentation tasks on both COCO val and test-dev splits
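
The multi-view video inference listed above (4 temporal clips and 3 spatial crops, so 12 views per video) can be sketched as follows. The summary does not say how the views are combined. Averaging per-view softmax scores is a common choice and is assumed here, and the function names are hypothetical.

```python
import math

def softmax(logits):
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multi_view_predict(view_logits):
    """Average softmax scores over all views and return (class, scores).

    view_logits: one list of class logits per view, for example
    4 temporal clips x 3 spatial crops = 12 lists.
    """
    probabilities = [softmax(logits) for logits in view_logits]
    num_classes = len(probabilities[0])
    averaged = [sum(p[c] for p in probabilities) / len(probabilities)
                for c in range(num_classes)]
    predicted = max(range(num_classes), key=averaged.__getitem__)
    return predicted, averaged

# Toy usage: 11 views favour class 0, one view strongly favours class 1.
views = [[2.0, 0.0, 0.0]] * 11 + [[0.0, 5.0, 0.0]]
predicted, averaged = multi_view_predict(views)
print(len(views), predicted)  # 12 0
```

Averaging over views makes the prediction less sensitive to which part of the video a single clip or crop happens to cover.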