Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Recent vision-language models have shown impressive multi-modal generation capabilities.
Prismer is a data- and parameter-efficient vision-language model that leverages an ensemble of domain experts.
Prismer requires training of a small number of components, with the majority of network weights inherited from pre-trained domain experts.
Prismer can efficiently pool expert knowledge and adapt it to various vision-language reasoning tasks.
Prismer achieves competitive fine-tuned and few-shot learning performance with up to two orders of magnitude less training data.

Paper Content

Introduction

Large pre-trained models generalise well but are expensive to train in both data and compute.
Models in the language domain require yottaFLOP-scale compute budgets.
Vision-language learning is more challenging, since it requires extra skills unique to visual and multi-modal reasoning.
The Prismer model combines pre-trained experts with a small number of lightweight trainable components.
Prismer is trained on 13M image/alt-text examples and still achieves competitive performance.
Prismer is robust against noisy experts, and its performance scales with the quantity and quality of experts.

Related work

VLMs use transformers to model the vision-language relationship
VLMs usually leverage a pre-trained object detector to encode images
Dual-stream VLMs encode vision and language features in separate networks
Prismer uses pre-trained models to provide output predictions as auxiliary signals
Prismer focuses on language generation and only requires a single autoregressive training objective
Prismer leverages powerful pre-trained domain expert models for data-efficient training
Prismer unifies pre-trained experts in a single architecture design

Prismer: open-ended reasoning with multi-modal knowledge

Prismer is a vision-language generative model
It takes multi-modal signals as input and outputs free-form text

Model overview

Prismer is an encoder-decoder transformer model
Leverages a library of existing pre-trained experts
Vision encoder takes RGB image and multi-modal labels as input
Language decoder is conditioned on multi-modal features
Data efficiency achieved by leveraging combined power of strong domain-specific experts
Built on top of existing pre-trained vision-only and language-only backbone models
Vision encoder extended to accept multi-modal signals
Majority of network weights of pre-trained experts frozen
Experts Resampler and Adaptor used to link vision and language parts
Vision-language reasoning tasks are re-formulated as language modelling or prefix language modelling problems
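
The role of the Experts Resampler mentioned above can be illustrated with a small cross-attention sketch: a fixed set of latent queries attends over however many multi-modal tokens the experts produce, so the output length stays constant. This is a simplified illustration, not the paper's implementation. It assumes single-head attention with no learned projections, and the sizes `DIM` and `NUM_LATENTS` and the helper `fake_tokens` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, features):
    """Each latent query attends over all feature tokens.

    Simplification: single head, no learned key/value projections.
    The output has one vector per latent, regardless of len(features).
    """
    dim = len(latents[0])
    out = []
    for q in latents:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in features]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, features)) for d in range(dim)])
    return out


random.seed(0)
DIM, NUM_LATENTS = 4, 3  # hypothetical sizes, chosen only for illustration
latents = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]


def fake_tokens(n):
    """Stand-in for flattened expert features (random values, not real data)."""
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]


two_experts = fake_tokens(10)   # fewer experts, fewer tokens
six_experts = fake_tokens(30)   # more experts, more tokens
out_small = cross_attend(latents, two_experts)
out_large = cross_attend(latents, six_experts)
print(len(out_small), len(out_large))  # both equal NUM_LATENTS
```

Because the number of output vectors depends only on the number of latents, the language decoder sees a fixed-size input no matter how many experts are plugged in.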

Pre-trained experts

Prismer includes two types of pre-trained experts: Backbone Experts and Modality Experts
Backbone Experts are vision-only and language-only pre-trained models that encode images and texts into a meaningful sequence of tokens
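Modality Experts are up to 6 experts from the vision domain, encoding three low-level vision signals (depth, surface normals, and edges) and three high-level vision signals (object labels, segmentation labels, and text labels)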
Modality Experts are treated as black-box predictors and their predicted labels are used as input for the Prismer model
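Modality-specific post-processing is applied to the predicted labels, transforming each of them into a tensor in $\mathbb{R}^{H \times W \times C}$, where H, W and C are the image height, width and number of channels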
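
As a rough illustration of what modality-specific post-processing into an H×W×C tensor could look like, the sketch below handles one low-level signal and one high-level signal. The exact rules used by Prismer are not given in this summary, so both functions are assumptions: `depth_to_tensor` min-max normalises a depth map into a single channel, and `labels_to_tensor` replaces each pixel's class id with a vector from a hypothetical `embedding_table`.

```python
def depth_to_tensor(depth):
    """Min-max normalise an H x W depth map into an H x W x 1 nested list."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # avoid division by zero on constant maps
    return [[[(v - lo) / scale] for v in row] for row in depth]


def labels_to_tensor(label_map, embedding_table):
    """Replace each pixel's class id with a C-dimensional vector (H x W x C)."""
    return [[list(embedding_table[c]) for c in row] for row in label_map]


def shape(tensor):
    """Return (H, W, C) of a nested-list tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


# Toy 2 x 3 inputs; the values are made up for illustration.
depth = [[0.5, 2.0, 3.5], [1.0, 2.0, 0.5]]
label_map = [[0, 0, 1], [2, 1, 1]]
embedding_table = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}  # hypothetical

depth_tensor = depth_to_tensor(depth)
label_tensor = labels_to_tensor(label_map, embedding_table)
print(shape(depth_tensor), shape(label_tensor))  # (2, 3, 1) (2, 3, 2)
```

Once every expert output has the same image-like layout, the vision encoder can treat all of them the same way as the RGB input.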

Training objective

Prismer is trained with a single objective to predict the next text token autoregressively.
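Following the standard encoder-decoder design, the vision encoder produces the multi-modal features and the language decoder predicts the text conditioned on them.
The one-time pre-processing step of collecting expert labels is computationally cheap and fast.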
Generative objective only requires one forward pass to compute gradients.
Model is less suitable for multi-modal discriminative tasks.
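
The single autoregressive objective described above amounts to minimising the negative log-likelihood of each ground-truth token given the preceding tokens and the multi-modal features. A minimal sketch follows, assuming the decoder's per-step output distributions are already available as plain lists of probabilities. The function name `next_token_nll` and the toy numbers are illustrative only.

```python
import math


def next_token_nll(token_probs, target_ids):
    """Mean negative log-likelihood of the target tokens.

    token_probs[t] is the predicted distribution over the vocabulary at
    step t (already conditioned on earlier tokens and image features);
    target_ids[t] is the index of the ground-truth token at step t.
    """
    assert len(token_probs) == len(target_ids)
    total = 0.0
    for probs, target in zip(token_probs, target_ids):
        total -= math.log(probs[target])
    return total / len(target_ids)


# Toy example with a 3-token vocabulary (made-up probabilities).
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.25, 0.25, 0.5],
]
targets = [0, 1, 2]
loss = next_token_nll(probs, targets)
print(round(loss, 4))
```

All steps of the sequence contribute to one scalar loss, which is why a single forward pass is enough to compute gradients.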

Experiments

Prismer model variants

PrismerZ is a variant of Prismer that relies only on RGB images, without modality expert labels; it is more efficient and applicable to a wider range of applications.
Prismer is less efficient at inference, since expert labels must be computed first, but has better predictive performance.
Prismer and PrismerZ use ViT and RoBERTa as the frozen vision encoder and language decoder respectively.

Training and evaluation details

Pre-training data is constructed from 5 datasets
Training is done with AdamW optimiser and ZeRO Stage 2 technique
Hyper-parameters and data processing techniques are in Appendix A
Training costs are in Appendix B
Performance is evaluated through language modelling
Beam search with a beam size of 3 is used for text generation
Prefix prompt of “A picture of” is added for image captioning tasks
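
The decoding setup above (beam size 3, captions started from the prefix “A picture of”) can be sketched with a toy beam search. This is not Prismer's decoder: `toy_model` is a hypothetical hand-written bigram table standing in for the language decoder, and the search applies no length normalisation.

```python
import math


def beam_search(next_log_probs, prefix, beam_size=3, max_new_tokens=5, eos="<eos>"):
    """Return the highest-scoring continuation of `prefix` found by beam search."""
    beams = [(0.0, list(prefix))]  # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((score + logp, tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            # Hypotheses ending in EOS are set aside; others keep expanding.
            (finished if tokens[-1] == eos else beams).append((score, tokens))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])[1]


# Hypothetical bigram "model": the next-token distribution depends on the last token.
TABLE = {
    "of": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"beach": 0.9, "dog": 0.1},
    "dog": {"<eos>": 1.0},
    "cat": {"<eos>": 1.0},
    "beach": {"<eos>": 1.0},
}


def toy_model(tokens):
    return {tok: math.log(p) for tok, p in TABLE[tokens[-1]].items()}


result = beam_search(toy_model, ["A", "picture", "of"], beam_size=3)
caption = " ".join(t for t in result if t != "<eos>")
print(caption)  # A picture of the beach
```

In this toy table a greedy decoder would pick "a" first (probability 0.6) and end with a sequence of probability 0.3, whereas the beam keeps "the" alive and finds the higher-probability caption (0.4 × 0.9 = 0.36).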

Results on vision-language benchmarks

Models are fine-tuned and evaluated on COCO Caption, NoCaps and VQAv2
Prismer and PrismerZ achieve superior performance compared to other VLMs with similar model sizes
Prismer can achieve competitive performance on par with VLMs trained with more data
Prismer outperforms original vision backbones ViT-B and ViT-L
Few-shot classification is performed via lightweight fine-tuning
Prismer underperforms GIT and Flamingo
Prismer’s performance can likely be improved with stronger vision backbone

Additional analysis

Intriguing properties of Prismer

Performance of Prismer improves with addition of more modality experts
Performance plateaus after a certain number of experts
Performance of Prismer improves with better quality experts
Prismer maintains performance even when including experts that predict noise

Architecture design and training details

Adaptor design consisting of residual connection and encoder-decoder structure performs best
Lightweight designs for resampler layers and latent variables are important for stable training
Freezing pre-trained parameters is essential for strong performance and avoiding over-fitting
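
To make the Adaptor design above concrete, here is a minimal sketch of a residual bottleneck block: the input is projected down, passed through a non-linearity, projected back up, and added to the original input. The weights, the dimensions, and the choice of ReLU are assumptions made for illustration; the summary does not specify them.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def adaptor(x, w_down, w_up):
    """Residual bottleneck: x + W_up * relu(W_down * x)."""
    hidden = [max(0.0, h) for h in matvec(w_down, x)]  # down-project, then ReLU
    update = matvec(w_up, hidden)                      # up-project to input size
    return [xi + ui for xi, ui in zip(x, update)]      # residual connection


# Hypothetical weights: 4-dimensional features with a 2-dimensional bottleneck.
x = [1.0, -2.0, 0.5, 3.0]
w_down = [[0.5, 0.0, 0.0, 0.5],
          [0.0, 1.0, 0.0, 0.0]]
w_up = [[1.0, 0.0],
        [0.0, 1.0],
        [0.5, 0.5],
        [0.0, 0.0]]

y = adaptor(x, w_down, w_up)
print(y)  # [3.0, -2.0, 1.5, 3.0]

# With a zero up-projection the block is an identity map, so the frozen
# backbone's features pass through unchanged.
identity = adaptor(x, w_down, [[0.0, 0.0]] * 4)
print(identity == x)  # True
```

The residual path is what lets such a block sit inside a frozen backbone: only the small projection matrices are trained, and the pre-trained features are perturbed rather than replaced.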

Conclusions, limitations and discussion

The paper introduces Prismer, a vision-language model designed for reasoning tasks
Prismer is parameter-efficient and uses a small number of trainable components to connect an ensemble of diverse, pre-trained experts
Prismer achieves competitive performance in image captioning, VQA, and image classification benchmarks
Prismer does not have the ability to perform few-shot in-context prompting
Prismer shows limited adaptability to a different expert with a different set of semantic information
Prismer entangles its multi-modal features from all experts included during pre-training
Prismer converts all expert labels into an image-like 3-dimensional tensor
Prismer has two main trainable components: the Experts Resampler and the Adaptor
Prismer BASE and Prismer LARGE are capable of generating captions that are semantically coherent and aligned with the visual content of the images
Prismer LARGE produces more detailed and semantically coherent captions than Prismer BASE
Ablation studies cover architecture components and training strategies
