Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent vision-language models have shown impressive multi-modal generation capabilities.
- Prismer is a data- and parameter-efficient vision-language model that leverages an ensemble of domain experts.
- Prismer requires training of a small number of components, with the majority of network weights inherited from pre-trained domain experts.
- Prismer can efficiently pool expert knowledge and adapt it to various vision-language reasoning tasks.
- Prismer achieves competitive fine-tuned and few-shot learning performance with up to two orders of magnitude less training data.
Paper Content
Introduction
- Large pre-trained models have good generalisation capabilities but require a lot of resources.
- Language domain models need yottaFLOP scale compute budget.
- Vision-language learning is more challenging and requires extra skills.
- Prismer model uses pre-trained experts and lightweight components.
- Prismer is trained on 13M image/alt-text data and performs well.
- Prismer is robust against noisy experts and performance scales with quantity/quality of experts.
Related work
- VLMs use transformers to model the vision-language relationship
- VLMs usually leverage a pre-trained object detector to encode images
- Dual-stream VLMs encode vision and language features in separate networks
- Prismer uses pre-trained models to provide output predictions as auxiliary signals
- Prismer focuses on language generation and only requires a single autoregressive training objective
- Prismer leverages powerful pre-trained domain expert models for data-efficient training
- Prismer unifies pre-trained experts in a single architecture design
Prismer: open-ended reasoning with multi-modal knowledge
- Prismer model is a type of vision-language generative model
- Takes multi-modal signals as input and outputs free-form text
Model overview
- Prismer is an encoder-decoder transformer model
- Leverages a library of existing pre-trained experts
- Vision encoder takes RGB image and multi-modal labels as input
- Language decoder is conditioned on multi-modal features
- Data efficiency achieved by leveraging combined power of strong domain-specific experts
- Built on top of existing pre-trained vision-only and language-only backbone models
- Vision encoder extended to accept multi-modal signals
- Majority of network weights of pre-trained experts frozen
- Experts Resampler and Adaptor used to link vision and language parts
- Re-formulated as language modelling or prefix language modelling problem
Pre-trained experts
- Prismer includes two types of pre-trained experts: Backbone Experts and Modality Experts
- Backbone Experts are vision-only and language-only pre-trained models that encode images and texts into a meaningful sequence of tokens
- Modality Experts are up to 6 experts from the vision domain, encoding three low-level and three high-level vision signals
- Modality Experts are treated as black-box predictors and their predicted labels are used as input for the Prismer model
- Modality-specific post-processing is applied to the predicted labels, transforming them to a R H×W×C tensor
Training objective
- Prismer is trained with a single objective to predict the next text token autoregressively.
- Prismer uses an encoder-decoder architecture to predict multi-modal features.
- Pre-processing step is computationally cheap and fast.
- Generative objective only requires one forward pass to compute gradients.
- Model is less suitable for multi-modal discriminative tasks.
Experiments
Prismer model variants
- PrismerZ is a model variant of Prismer that only requires RGB images and is more efficient and applicable to a wider range of applications.
- Prismer is less efficient in data inference but has better predictive performance.
- Prismer and PrismerZ use ViT and RoBERTa as the frozen vision encoder and language decoder respectively.
Training and evaluation details
- Pre-training data is constructed from 5 datasets
- Training is done with AdamW optimiser and ZeRO Stage 2 technique
- Hyper-parameters and data processing techniques are in Appendix A
- Training costs are in Appendix B
- Performance is evaluated through language modelling
- Beam search with a beam size of 3 is used for text generation
- Prefix prompt of “A picture of” is added for image captioning tasks
Results on vision-language benchmarks
- Fine-tuned models on COCO Caption, NoCaps and VQAv2
- Prismer and PrismerZ achieve superior performance compared to other VLMs with similar model sizes
- Prismer can achieve competitive performance on par with VLMs trained with more data
- Prismer outperforms original vision backbones ViT-B and ViT-L
- Few-shot classification via lightweight fine-tuning
- Prismer underperforms GIT and Flamingo
- Prismer’s performance can likely be improved with stronger vision backbone
Additional analysis
Intriguing properties of prismer
- Performance of Prismer improves with addition of more modality experts
- Performance plateaus after a certain number of experts
- Performance of Prismer improves with better quality experts
- Prismer maintains performance even when including experts that predict noise
Architecture design and training details
- Adaptor design consisting of residual connection and encoder-decoder structure performs best
- Lightweight designs for resampler layers and latent variables are important for stable training
- Freezing pre-trained parameters is essential for strong performance and avoiding over-fitting
Conclusions, limitations and discussion
- Introduced Prismer, a vision-language model designed for reasoning tasks
- Prismer is parameter-efficient and uses a small number of trainable components to connect an ensemble of diverse, pre-trained experts
- Prismer achieves competitive performance in image captioning, VQA, and image classification benchmarks
- Prismer does not have the ability to perform few-shot in-context prompting
- Prismer shows limited adaptability to a different expert with a different set of semantic information
- Prismer entangles its multi-modal features from all experts included during pre-training
- Prismer converts all expert labels into an image-like 3-dimensional tensor
- Prismer has two main trainable components: the Experts Resampler and the Adaptor
- Prismer BASE and Prismer LARGE are capable of generating captions that are semantically coherent and aligned with the visual content of the images
- Prismer LARGE produces more detailed and semantically coherent captions than Prismer BASE
- Ablation studies for architecture components and training strategies