Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recent vision-language models have shown impressive multi-modal generation capabilities.
  • Prismer is a data- and parameter-efficient vision-language model that leverages an ensemble of domain experts.
  • Prismer requires training of a small number of components, with the majority of network weights inherited from pre-trained domain experts.
  • Prismer can efficiently pool expert knowledge and adapt it to various vision-language reasoning tasks.
  • Prismer achieves competitive fine-tuned and few-shot learning performance with up to two orders of magnitude less training data.

Paper Content

Introduction

  • Large pre-trained models generalise well but require large amounts of training data and compute.
  • Large language-domain models need compute budgets at the yottaFLOP scale.
  • Vision-language learning is more challenging because it requires extra skills on top of language processing.
  • The Prismer model combines pre-trained experts with a small number of lightweight trainable components.
  • Prismer is trained on 13M image/alt-text pairs and still reaches competitive performance on image captioning, VQA and image classification.
  • Prismer is robust against noisy experts and performance scales with quantity/quality of experts.
  • VLMs use transformers to model the vision-language relationship
  • VLMs usually leverage a pre-trained object detector to encode images
  • Dual-stream VLMs encode vision and language features in separate networks
  • Prismer uses pre-trained models to provide output predictions as auxiliary signals
  • Prismer focuses on language generation and only requires a single autoregressive training objective
  • Prismer leverages powerful pre-trained domain expert models for data-efficient training
  • Prismer unifies pre-trained experts in a single architecture design

Prismer: open-ended reasoning with multi-modal knowledge

  • Prismer model is a type of vision-language generative model
  • Takes multi-modal signals as input and outputs free-form text

Model overview

  • Prismer is an encoder-decoder transformer model
  • Leverages a library of existing pre-trained experts
  • Vision encoder takes RGB image and multi-modal labels as input
  • Language decoder is conditioned on multi-modal features
  • Data efficiency achieved by leveraging combined power of strong domain-specific experts
  • Built on top of existing pre-trained vision-only and language-only backbone models
  • Vision encoder extended to accept multi-modal signals
  • Majority of network weights of pre-trained experts frozen
  • Experts Resampler and Adaptor used to link vision and language parts
  • All vision-language reasoning tasks are re-formulated as a language modelling or prefix language modelling problem
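
The summary does not describe the internals of the Experts Resampler. As a rough illustration only, the sketch below assumes a Perceiver-style design in which a fixed set of latent queries cross-attends over a variable-length sequence of expert tokens, so the output length does not depend on how many experts contribute. The function names, the single attention head and the absence of learned projections are simplifying assumptions, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(latents, tokens):
    """Cross-attend each latent query over all input tokens.

    latents: list of query vectors (hypothetical learned parameters)
    tokens:  list of input vectors with the same dimension
    Returns one output vector per latent, whatever len(tokens) is.
    """
    dim = len(latents[0])
    outputs = []
    for query in latents:
        scores = [
            sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
            for token in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * token[d] for w, token in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


latents = [[1.0, 0.0], [0.0, 1.0]]           # 2 latent queries
few_tokens = [[0.5, 0.1], [0.2, 0.9]]        # e.g. tokens from one expert
many_tokens = few_tokens * 3 + [[0.3, 0.3]]  # e.g. tokens from several experts

print(len(resample(latents, few_tokens)))   # 2
print(len(resample(latents, many_tokens)))  # 2
```

Whatever the number of input tokens, the decoder side sees a sequence whose length is set by the number of latents, which is one way a lightweight module can link a multi-expert encoder to a frozen decoder.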

Pre-trained experts

  • Prismer includes two types of pre-trained experts: Backbone Experts and Modality Experts
  • Backbone Experts are vision-only and language-only pre-trained models that encode images and texts into a meaningful sequence of tokens
  • Modality Experts are up to 6 experts from the vision domain, encoding three low-level and three high-level vision signals
  • Modality Experts are treated as black-box predictors and their predicted labels are used as input for the Prismer model
  • Modality-specific post-processing is applied to the predicted labels, transforming them into an ℝ^{H×W×C} tensor
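
To make the post-processing step concrete, here is a minimal sketch of turning one expert prediction into an image-like H×W×C structure. It assumes a depth-style expert that returns a 2-D grid of scalars, and it uses min-max scaling purely as an illustrative choice; the helper names and the toy depth map are hypothetical.

```python
def to_hwc(grid):
    """Wrap a 2-D grid of scalars into an H x W x 1 nested list."""
    return [[[float(v)] for v in row] for row in grid]


def min_max_scale(tensor):
    """Scale every value of an H x W x C nested list into [0, 1]."""
    values = [v for row in tensor for pixel in row for v in pixel]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero on constant maps
    return [[[(v - low) / span for v in pixel] for pixel in row] for row in tensor]


def shape(tensor):
    """Return (H, W, C) of a rectangular nested list."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


depth_prediction = [[2.0, 4.0, 6.0],
                    [8.0, 10.0, 12.0]]  # hypothetical 2 x 3 depth map

depth_tensor = min_max_scale(to_hwc(depth_prediction))
print(shape(depth_tensor))  # (2, 3, 1)
print(depth_tensor[0][0])   # [0.0]
print(depth_tensor[1][2])   # [1.0]
```

Once every expert output has the same image-like layout, all of them can be fed to the vision encoder alongside the RGB image.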

Training objective

  • Prismer is trained with a single objective to predict the next text token autoregressively.
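  • Following the encoder-decoder design, the vision encoder produces multi-modal features and the language decoder generates text conditioned on them.
  • The pre-processing step that collects expert labels is computationally cheap and fast.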
  • Generative objective only requires one forward pass to compute gradients.
  • Model is less suitable for multi-modal discriminative tasks.
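
The single training objective can be sketched as a negative log-likelihood over the text tokens. The example below is a toy version under stated assumptions: `step_probs` stands in for the decoder's per-step output distributions (already conditioned on the image features and earlier tokens), the loss is averaged over scored tokens by convention, and `prefix_len` marks tokens that are conditioned on but not scored, as in prefix language modelling. None of these names come from the paper.

```python
import math


def autoregressive_nll(step_probs, target_ids, prefix_len=0):
    """Average negative log-likelihood of the scored target tokens.

    step_probs[t] is a hypothetical decoder distribution over the
    vocabulary at step t. Tokens before prefix_len are not scored.
    """
    scored = range(prefix_len, len(target_ids))
    losses = [-math.log(step_probs[t][target_ids[t]]) for t in scored]
    if not losses:
        raise ValueError("prefix_len leaves no tokens to score")
    return sum(losses) / len(losses)


# Toy vocabulary of 3 tokens and a sequence of 3 steps.
step_probs = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
    [0.25, 0.25, 0.5],
]
target_ids = [0, 1, 2]

full = autoregressive_nll(step_probs, target_ids)
prefixed = autoregressive_nll(step_probs, target_ids, prefix_len=1)
print(round(full, 4))      # 0.6931
print(round(prefixed, 4))  # 0.6931
```

Because every target token receives probability 0.5 in this toy setup, both calls return ln 2; with a real decoder the two values would generally differ.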

Experiments

Prismer model variants

  • PrismerZ is a variant of Prismer that requires only RGB images, which makes it more efficient and applicable to a wider range of applications.
  • Prismer is less efficient at inference because it also processes expert labels, but it has better predictive performance.
  • Prismer and PrismerZ use ViT and RoBERTa as the frozen vision encoder and language decoder respectively.

Training and evaluation details

  • Pre-training data is constructed from 5 datasets
  • Training is done with AdamW optimiser and ZeRO Stage 2 technique
  • Hyper-parameters and data processing techniques are in Appendix A
  • Training costs are in Appendix B
  • Performance is evaluated through language modelling
  • Beam search with a beam size of 3 is used for text generation
  • Prefix prompt of “A picture of” is added for image captioning tasks
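
The decoding setup described above (beam size 3, captions starting from the prefix "A picture of") can be illustrated with a toy beam search. This is a sketch, not the paper's implementation: `toy_decoder` is a hand-written, hypothetical stand-in for a real language decoder, and no length normalisation is applied.

```python
import math


def beam_search(next_token_probs, prefix, beam_size=3, max_new_tokens=3, eos="<eos>"):
    """Keep the `beam_size` highest-scoring continuations at each step.

    next_token_probs(tokens) returns a dict mapping each candidate next
    token to its probability; it stands in for a real decoder.
    `prefix` must contain at least one token.
    """
    beams = [(0.0, list(prefix))]  # (log-probability, tokens)
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == eos:
                candidates.append((score, tokens))  # finished beams carry over
                continue
            for token, prob in next_token_probs(tokens).items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][1]


def toy_decoder(tokens):
    """Hand-written distribution keyed on the last token only."""
    table = {
        "of": {"a": 0.6, "two": 0.4},
        "a": {"dog": 0.5, "cat": 0.5},
        "two": {"dogs": 0.9, "cats": 0.1},
    }
    return table.get(tokens[-1], {"<eos>": 1.0})


prefix = ["A", "picture", "of"]
print(" ".join(beam_search(toy_decoder, prefix)))
# A picture of two dogs <eos>
print(" ".join(beam_search(toy_decoder, prefix, beam_size=1)))
# A picture of a dog <eos>
```

With a beam size of 1 the search is greedy and commits to "a" (probability 0.6), ending with a sequence probability of 0.30; with a beam size of 3 it keeps "two" alive and finds "two dogs" at 0.36.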

Results on vision-language benchmarks

  • Models are fine-tuned and evaluated on COCO Caption, NoCaps and VQAv2
  • Prismer and PrismerZ achieve superior performance compared to other VLMs with similar model sizes
  • Prismer can achieve competitive performance on par with VLMs trained with more data
  • Prismer outperforms its original vision backbones ViT-B and ViT-L
  • Few-shot classification is evaluated via lightweight fine-tuning
  • In this few-shot setting, Prismer underperforms GIT and Flamingo
  • Prismer’s performance can likely be improved with a stronger vision backbone

Additional analysis

Intriguing properties of Prismer

  • Performance of Prismer improves with addition of more modality experts
  • Performance plateaus after a certain number of experts
  • Performance of Prismer improves with better quality experts
  • Prismer maintains performance even when including experts that predict noise
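
One way to picture the noise-robustness experiment is to replace an expert's prediction with random values of the same shape. The sketch below assumes uniform noise in [0, 1) and a dictionary of per-expert H×W×C label maps; the function names, the `"depth"` key and the toy values are all hypothetical.

```python
import random


def noise_expert(height, width, channels, seed=0):
    """Return an H x W x C nested list of uniform noise in [0, 1).

    Stands in for an expert whose predictions carry no information.
    """
    rng = random.Random(seed)
    return [
        [[rng.random() for _ in range(channels)] for _ in range(width)]
        for _ in range(height)
    ]


def swap_expert(expert_labels, name, replacement):
    """Return a copy of the expert-label dict with one entry replaced."""
    updated = dict(expert_labels)
    updated[name] = replacement
    return updated


labels = {"depth": [[[0.2], [0.4]], [[0.6], [0.8]]]}  # hypothetical 2 x 2 x 1 map
noisy = swap_expert(labels, "depth", noise_expert(2, 2, 1))

print(len(noisy["depth"]), len(noisy["depth"][0]), len(noisy["depth"][0][0]))
# 2 2 1
```

The replacement keeps the tensor layout intact, so the rest of the pipeline runs unchanged and any difference in downstream performance can be attributed to the uninformative expert.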

Architecture design and training details

  • An Adaptor design that combines a residual connection with an encoder-decoder structure performs best
  • Lightweight designs for resampler layers and latent variables are important for stable training
  • Freezing pre-trained parameters is essential for strong performance and avoiding over-fitting
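
The Adaptor design mentioned above (a residual connection around an encoder-decoder structure) can be sketched as a small bottleneck block. This is an illustration under assumptions: the encoder-decoder is read as a down-projection followed by an up-projection, ReLU is used as the non-linearity, and the up-projection is zero-initialised so the block starts out as the identity. The weights and dimensions are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in vector]


def adaptor(x, down, up):
    """Residual bottleneck: x + up(relu(down(x)))."""
    hidden = relu(matvec(down, x))  # encode into a smaller dimension
    update = matvec(up, hidden)     # decode back to the input dimension
    return [xi + ui for xi, ui in zip(x, update)]


x = [1.0, -2.0, 3.0, 0.5]                  # a 4-dim feature vector
down = [[0.1, 0.2, 0.3, 0.4],
        [0.4, 0.3, 0.2, 0.1]]              # projects 4 -> 2
up_zero = [[0.0, 0.0] for _ in range(4)]   # projects 2 -> 4, zero-initialised

print(adaptor(x, down, up_zero) == x)  # True
```

With a zero up-projection the frozen backbone's features pass through untouched, so training can start from the pre-trained behaviour and move away from it gradually.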

Conclusions, limitations and discussion

  • The paper introduces Prismer, a vision-language model designed for reasoning tasks
  • Prismer is parameter-efficient and uses a small number of trainable components to connect an ensemble of diverse, pre-trained experts
  • Prismer achieves competitive performance in image captioning, VQA, and image classification benchmarks
  • Prismer does not have the ability to perform few-shot in-context prompting
  • Prismer shows limited adaptability to a different expert with a different set of semantic information
  • Prismer entangles its multi-modal features from all experts included during pre-training
  • Prismer converts all expert labels into an image-like 3-dimensional tensor
  • Prismer has two main trainable components: the Experts Resampler and the Adaptor
  • Prismer BASE and Prismer LARGE are capable of generating captions that are semantically coherent and aligned with the visual content of the images
  • Prismer LARGE produces more detailed and semantically coherent captions than Prismer BASE
  • Ablation studies cover architecture components and training strategies