Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Ability to understand and generate natural language from visual information is important for applications
Deep learning has led to development of large-scale vision-language models
MoE can improve efficiency and effectiveness of VLMs
Contributions: proposed VL-MoE, explored scaling strategies, presented ablations

Vision-language pretraining involves developing model architecture and pretraining objectives to learn effective multimodal representations from image-text pairs
Two main approaches for model architecture: dual-encoder models with separate encoders per modality, and fusion-encoder models with a complex fusion module
MOME Transformer unifies dual-encoder and fusion-encoder models
Increasing interest to grow VL model capacity with affordable compute budget
Pretraining objectives can be categorized into discriminative and generative modeling
Sparse Mixture of Experts models studied for conditional computation, multitask learning, and multimodal learning

Utilized a unified masked data modeling objective to pretrain VL-MoE on monomodal and multimodal data
Used masked language modeling to learn language representations from text-only data
Used masked image modeling to learn vision representations from image data
Used masked vision-language modeling to recover masked image patches and text tokens
Input text is tokenized and projected onto word embeddings
Input image is split and reshaped into patches and flattened into vectors
Image and text input vectors are concatenated
Used mixture-of-modality-experts Transformer to encode different modalities
Used mixture-of-experts model to selectively activate different parts of a neural network
Replaced a subset of V-FFN and T-FFN with V-MoE and T-MoE layers
Used Batch Priority Routing for stable training of VL-MoE
Pretrained on 4 million images and 10 million image-text pairs
Used Adam optimizer with linear warmup and cosine learning rate decay
Used 32 expert parallelism and TUTEL for fast routing and computation
Results show cost-performance tradeoff of VL-MoE dominates dense models
Maximum wall-clock overhead of VL-MoE compared to dense counterparts is 13%
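
The Batch Priority Routing step mentioned above can be illustrated with a small sketch. It assumes top-1 routing and a fixed per-expert capacity, neither of which is stated in these notes, and the function name batch_priority_route is hypothetical. The idea shown is only the ordering: tokens are visited from highest to lowest router score, so when an expert is full it is the low-scoring tokens that get dropped. This is not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def batch_priority_route(router_logits, capacity):
    """Top-1 routing with priority ordering (illustrative sketch).

    router_logits: one list of expert logits per token.
    capacity: assumed maximum number of tokens per expert.
    Returns the chosen expert index per token, or None if the token is dropped.
    """
    probs = [softmax(row) for row in router_logits]
    # Preferred expert of each token and the router's confidence in it.
    choice = [max(range(len(p)), key=p.__getitem__) for p in probs]
    score = [p[c] for p, c in zip(probs, choice)]
    # Priority order: highest-scoring tokens claim expert slots first.
    order = sorted(range(len(probs)), key=lambda t: -score[t])
    load = [0] * len(router_logits[0])
    assignment = [None] * len(probs)
    for t in order:
        e = choice[t]
        if load[e] < capacity:
            assignment[t] = e
            load[e] += 1
    return assignment


logits = [[2.0, 0.0], [5.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
assignment = batch_priority_route(logits, capacity=1)
```

In this toy batch three tokens prefer expert 0, but only the most confident one keeps its slot; the other two are dropped, which is the "dropped tokens" behaviour the notes refer to later.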

Explored performance of VL-MoE on vision-and-language downstream tasks
Used 480x480 image resolution for VQA and 384x384 for other tasks
Used VQA 2.0 dataset and formulated it as a classification problem
Used NLVR2 dataset for visual reasoning task
Used COCO and Flickr30K datasets for image-text retrieval
Fine-tuned VL-MoE with image-text contrastive and image-text matching objectives
VL-MoE outperformed previous models on VQA, NLVR2, COCO, and Flickr30K
VL-MoE is the first to demonstrate that a mixture-of-experts architecture can scale with modest architecture size and training counts
VL-MoE outperformed VLMO LARGE and ALBEF on COCO and Flickr30K
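
The image-text contrastive objective used for retrieval fine-tuning can be sketched as a symmetric InfoNCE-style loss over a batch of paired embeddings. This is a generic formulation, assumed here for illustration; the temperature value and the function name contrastive_loss are not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image cross-entropy.

    Row k of image_embs is assumed to be paired with row k of text_embs.
    The temperature is an assumed hyperparameter.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity of every image with every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return (i2t + t2i) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Correctly paired embeddings give a loss near zero, while swapped pairs give a much larger loss, which is the signal that pulls matching images and texts together.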

Image Classification task evaluates model on vision-only downstream task
ILSVRC-2012 ImageNet dataset used with 1.3M images and 1k classes
Average pooling over final vectors and linear classifier layer used to predict label
Natural Language Inference task evaluates model on language-only downstream task
Model is given a premise and a hypothesis sentence and must determine whether the hypothesis is true, false, or undetermined; evaluated on the Multi-Genre Natural Language Inference (MNLI) dataset
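
The image classification head described above (average pooling over the final vectors, then a linear classifier) can be sketched as follows. The helper names mean_pool and linear_classify and the toy weights are illustrative assumptions, not values from the paper.

```python
def mean_pool(token_vectors):
    """Average the final-layer vectors of all tokens into one vector."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]


def linear_classify(pooled, weights, bias):
    """Apply a linear layer and return (predicted class index, logits).

    weights holds one row per class; bias holds one value per class.
    """
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    label = max(range(len(logits)), key=logits.__getitem__)
    return label, logits


# Two patch vectors of width 2 and a toy 2-class classifier.
final_vectors = [[1.0, 3.0], [3.0, 5.0]]
pooled = mean_pool(final_vectors)
label, logits = linear_classify(pooled, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

For ImageNet the classifier would have one weight row per class (1k rows), but the pooling and projection steps are the same.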

Conducted ablation studies to analyze contributions of Mixture-of-Experts module
Evaluated models on visual reasoning, image-text retrieval, image classification and natural language inference
Scaling both T-FFN and V-FFN improved downstream performance on corresponding modality and overall vision-language tasks
Optimal number of experts in Mixture-of-Experts is still a topic of debate
Auxiliary losses used to prevent instability caused by imbalance of multimodal data
Z-loss hurt vision-language pretraining of VL-MoE
Using the load-balancing loss alone led to unstable training and underperforming models
“vloss” led to most stable training
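
To make two of the auxiliary losses above concrete, the sketch below uses common formulations from the sparse MoE literature: a Switch Transformer-style load-balancing loss (number of experts times the sum over experts of token fraction times mean router probability) and a router z-loss (mean squared log-sum-exp of the router logits). Whether VL-MoE uses exactly these forms is an assumption, and the “vloss” variant is not shown because these notes do not define it.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def load_balance_loss(router_logits):
    """Assumed Switch-style loss: N * sum_e(fraction_e * mean_prob_e).

    Equals 1.0 under perfectly uniform routing and grows with imbalance.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]
    # Fraction of tokens whose top-1 choice is each expert.
    frac = [0.0] * num_experts
    for p in probs:
        frac[max(range(num_experts), key=p.__getitem__)] += 1.0 / num_tokens
    # Mean router probability assigned to each expert.
    mean_prob = [sum(p[e] for p in probs) / num_tokens
                 for e in range(num_experts)]
    return num_experts * sum(f * m for f, m in zip(frac, mean_prob))


def router_z_loss(router_logits):
    """Assumed z-loss: mean squared log-sum-exp of the logits per token."""
    total = 0.0
    for row in router_logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return total / len(router_logits)


balanced = load_balance_loss([[5.0, 0.0], [0.0, 5.0]])
imbalanced = load_balance_loss([[5.0, 0.0], [5.0, 0.0]])
z = router_z_loss([[0.0, 0.0]])
```

The load-balancing term penalizes batches where most tokens go to the same expert, while the z-loss penalizes large router logits regardless of how balanced the routing is.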

Mixture-of-Experts (MoE) can improve the efficiency and effectiveness of vision-language models
Dividing a large vision-language model into smaller, specialized sub-models can achieve state-of-the-art performance
Larger expert pools yield consistent performance improvements
MoE can improve the interpretability of vision-language models
MoE is a valuable technique for scaling vision-language models
New research directions for exploring the effectiveness of MoEs in other vision-language tasks
Dropped tokens issue inherited from MoE training
Distribution of dropped tokens in VL-MoE BASE/32E across different pretraining tasks
Pretrain losses for different scaling strategies
Optimizing VL-MoE or LIMOE with the contrastive loss
Fine-tuning base-size models for 10 epochs with a batch size of 128
Input image resolution is 480 × 480, 384 × 384, or 224 × 224, depending on the setting
Learning rates from {5e-5, 1e-4}
Averaged over 3 runs