Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Vision-language pre-training has become increasingly expensive because large models are trained end-to-end
  • BLIP-2 is a pre-training strategy that uses frozen pre-trained image encoders and language models
  • BLIP-2 bridges the modality gap with a lightweight Querying Transformer
  • BLIP-2 achieves state-of-the-art performance on various vision-language tasks
  • BLIP-2 outperforms Flamingo80B on zero-shot VQAv2 (by 8.7%) with 54x fewer trainable parameters
  • BLIP-2 has emerging capabilities of zero-shot image-to-text generation

Paper Content

Introduction

  • Vision-language pre-training (VLP) research has seen rapid advancement in the past few years
  • Pre-trained models with larger scale have been developed to push the state-of-the-art on various downstream tasks
  • Most state-of-the-art models incur high computation cost during pre-training
  • Proposed a generic and compute-efficient VLP method by bootstrapping from pre-trained vision and language models
  • Pre-trained models offer high-quality visual representation and strong language generation
  • Unimodal pre-trained models remain frozen during pre-training
  • Querying Transformer (Q-Former) pre-trained with two-stage pre-training strategy to facilitate cross-modal alignment
  • BLIP-2 achieves state-of-the-art performance on various vision-language tasks
  • BLIP-2 can perform zero-shot image-to-text generation
  • BLIP-2 is more compute-efficient than existing state-of-the-art methods

End-to-end vision-language pre-training

  • Vision-language pre-training is used to learn multimodal foundation models with improved performance on various vision-language tasks.
  • Different model architectures have been proposed, including dual-encoder, fusion-encoder, encoder-decoder, and unified transformer.
  • Pre-training objectives have been proposed, including image-text contrastive learning, image-text matching, and (masked) language modeling.
  • End-to-end pre-training using large-scale image-text pair datasets is used, but this can incur a high computation cost and is inflexible for leveraging readily-available unimodal pre-trained models.

Modular vision-language pre-training

  • Methods leverage off-the-shelf pre-trained models and keep them frozen during VLP
  • Some methods freeze the image encoder, including early work and the more recent LiT
  • Other methods freeze the language model to use knowledge from LLMs
  • Challenge is to align visual features to text space
  • Existing methods use language modeling loss to generate texts conditioned on the image
  • BLIP-2 leverages both frozen image encoders and frozen LLMs for vision-language tasks

Method

  • Proposed BLIP-2, a vision-language pre-training method
  • Uses a Querying Transformer (Q-Former) pre-trained in two stages

Model architecture

  • Q-Former is a trainable module that bridges the gap between a frozen image encoder and a frozen LLM.
  • Q-Former consists of two transformer submodules that share the same self-attention layers.
  • Q-Former contains 188M parameters and 32 learnable query embeddings.
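
To make the bottleneck role of the learnable queries concrete, here is a minimal sketch of a fixed set of queries cross-attending to frozen image features. It is not the authors' implementation: a single attention head without learned projections stands in for a full transformer layer, queries and image features are assumed to share one dimension, and the toy sizes (4 queries, dimension 8) are assumptions chosen for readability in place of the 32 queries noted above.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, image_feats):
    """Single-head cross-attention: each query becomes a weighted mix of image features."""
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score between this query and every image feature.
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in image_feats]
        weights = softmax(scores)
        out.append([sum(w * f[d] for w, f in zip(weights, image_feats)) for d in range(dim)])
    return out

random.seed(0)
NUM_QUERIES, DIM = 4, 8  # toy sizes, not the paper's

# Stand-in for the learnable query embeddings (trained in the real model).
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# Stand-ins for frozen image encoder outputs at two different resolutions.
few_patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(10)]
many_patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]

out_few = cross_attend(queries, few_patches)
out_many = cross_attend(queries, many_patches)

# The output size depends only on the number of queries, not on the patch count.
print(len(out_few), len(out_few[0]))
print(len(out_many), len(out_many[0]))
```

Whatever the number of image patches, the output keeps one row per query, which is what lets a small module hand a fixed-length visual summary to the language model.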

Bootstrap vision-language representation learning from a frozen image encoder

  • Q-Former is connected to a frozen image encoder and pre-trained using image-text pairs
  • Three pre-training objectives are jointly optimized, each with a different attention masking strategy
  • Image-Text Contrastive Learning (ITC) maximizes mutual information between image and text representation
  • Image-grounded Text Generation (ITG) trains Q-Former to generate texts given input images
  • Image-Text Matching (ITM) learns fine-grained alignment between image and text representation
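  • In the second stage, Q-Former (with the frozen image encoder) is connected to a frozen LLM to harvest its generative language capability
  • With the LLM kept frozen, Q-Former is trained with the language modeling loss (decoder-based LLMs) or the prefix language modeling loss (encoder-decoder LLMs)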
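
The three attention masking strategies can be sketched as boolean matrices over a sequence of query positions followed by text positions. The mask layouts below follow one reading of the paper (ITC keeps queries and text separate, ITG lets text see all queries plus earlier text, ITM is fully bidirectional) and should be treated as an assumption; `build_mask` and `render` are hypothetical helper names, not part of any released code.

```python
def build_mask(num_queries, num_text, objective):
    """Return mask[i][j] == True when position i may attend to position j.

    Positions 0..num_queries-1 are queries; the rest are text tokens.
    """
    n = num_queries + num_text
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            i_is_query = i < num_queries
            j_is_query = j < num_queries
            if objective == "itm":
                # Bidirectional: everything sees everything.
                mask[i][j] = True
            elif objective == "itc":
                # Unimodal: queries see queries, text sees text.
                mask[i][j] = i_is_query == j_is_query
            elif objective == "itg":
                if i_is_query:
                    # Queries see each other but never the text.
                    mask[i][j] = j_is_query
                else:
                    # Text sees all queries and text up to its own position.
                    mask[i][j] = j_is_query or j <= i
            else:
                raise ValueError(f"unknown objective: {objective}")
    return mask

def render(mask):
    """Draw a mask with 'x' for allowed attention and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

for name in ("itc", "itg", "itm"):
    print(name)
    print(render(build_mask(2, 3, name)))
```

Because only the mask changes, the same self-attention layers can serve all three objectives in one training step.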

Model pre-training

  • Pre-training data includes COCO, Visual Genome, CC3M, CC12M, SBU, and LAION400M
  • Pre-training uses CapFilt method to create synthetic captions for web images
  • Two state-of-the-art pre-trained vision transformer models used: ViT-L/14 and ViT-G/14
  • Frozen language model explored: OPT model family and FlanT5 model family
  • Pre-training for 250k steps in first stage and 80k steps in second stage
  • Batch size of 2320/1680 for ViT-L/ViT-G in first stage and 1920/1520 for OPT/FlanT5 in second stage
  • Frozen ViT and LLM parameters are converted to FP16, except for FlanT5, which uses BFloat16
  • AdamW optimizer used with cosine learning rate decay
  • Images augmented with random resized cropping and horizontal flipping

Experiment

  • BLIP-2 performs better than previous state-of-the-art models on zero-shot vision-language tasks.
  • BLIP-2 requires fewer trainable parameters than previous models.

Instructed zero-shot image-to-text generation

  • BLIP-2 enables a LLM to understand images while following text prompts
  • Allows control of image-to-text generation with instructions
  • Examples demonstrate a range of zero-shot image-to-text capabilities
  • BLIP-2 achieves state-of-the-art zero-shot results on the VQAv2 and GQA datasets
  • Stronger image encoder or LLM leads to better performance
  • Representation learning pre-trains Q-Former to learn visual features relevant to text

Image captioning

  • Finetune BLIP-2 models for image captioning task
  • Use prompt “a photo of” as initial input to LLM
  • Keep LLM frozen during finetuning, update parameters of Q-Former and image encoder
  • Experiment with ViT-G and various LLMs
  • Perform finetuning on COCO, evaluate on COCO test set and zero-shot transfer to NoCaps
  • BLIP-2 achieves state-of-the-art performance with significant improvement on NoCaps

Visual question answering

  • Finetune parameters of Q-Former and image encoder while keeping LLM frozen
  • LLM receives Q-Former’s output and question as input and is asked to generate answer
  • Q-Former is conditioned on question tokens to extract image features more relevant to question
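
A rough sketch of how the LLM input might be assembled for VQA: the Q-Former outputs are projected to the LLM's embedding width and placed in front of the question's token embeddings. The linear projection reflects one reading of the paper and is an assumption here; all sizes are toy values, and `linear` and `build_llm_input` are hypothetical names. Conditioning the Q-Former itself on the question is not shown.

```python
import random

def linear(x, weight, bias):
    """Compute W x + b for one vector; weight has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def build_llm_input(query_outputs, text_embeddings, weight, bias):
    """Project each query output to the LLM width and prepend it to the text embeddings."""
    visual_prefix = [linear(q, weight, bias) for q in query_outputs]
    return visual_prefix + text_embeddings

random.seed(0)
NUM_QUERIES, Q_DIM, LLM_DIM, NUM_TOKENS = 4, 6, 10, 5  # toy sizes, not the paper's

# Stand-ins for Q-Former outputs and for the frozen LLM's question token embeddings.
query_outputs = [[random.gauss(0, 1) for _ in range(Q_DIM)] for _ in range(NUM_QUERIES)]
question_embeddings = [[random.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TOKENS)]

# Projection parameters (trainable in a real model; random here).
weight = [[random.gauss(0, 0.1) for _ in range(Q_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

llm_input = build_llm_input(query_outputs, question_embeddings, weight, bias)
print(len(llm_input), len(llm_input[0]))
```

The LLM then sees a sequence whose first positions carry visual information and whose remaining positions are the unchanged question embeddings, so its own weights never need to be updated.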

Image-text retrieval

  • Finetuned the first-stage pre-trained model, without an LLM, for image-text retrieval
  • Evaluated model on COCO and Flickr30K datasets
  • Used ViT-L and ViT-G as image encoder
  • Achieved state-of-the-art performance with ITC and ITM losses
  • ITG loss beneficial for image-text retrieval
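
One way an ITC-style score could drive retrieval is sketched below: the image is represented by several query outputs, the text by a single embedding, and the image-text similarity is taken as the highest cosine similarity over the queries. Taking the maximum over queries is one reading of the paper and is an assumption here, the vectors are hand-picked toy values, and any ITM-based re-ranking is omitted.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def itc_similarity(query_outputs, text_embedding):
    """Image-text score: best cosine similarity between any query output and the text."""
    t = normalize(text_embedding)
    return max(sum(a * b for a, b in zip(normalize(q), t)) for q in query_outputs)

def rank_texts(query_outputs, text_embeddings):
    """Return text indices sorted from most to least similar to the image."""
    scores = [itc_similarity(query_outputs, t) for t in text_embeddings]
    return sorted(range(len(text_embeddings)), key=lambda i: scores[i], reverse=True)

# Toy image with two query outputs, and three candidate captions.
image_queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
captions = [
    [0.9, 0.1, 0.0],  # close to the first query
    [0.0, 0.0, 1.0],  # orthogonal to both queries
    [0.5, 0.5, 0.0],  # between the two queries
]

ranking = rank_texts(image_queries, captions)
print(ranking)
```

With these toy vectors the caption aligned with a single query ranks first, the caption between the two queries second, and the orthogonal caption last.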

Limitation

  • LLMs can perform in-context learning with few-shot examples
  • BLIP-2 does not observe improved VQA performance when providing LLM with in-context VQA examples
  • Lack of in-context learning capability attributed to pretraining dataset containing only single image-text pair per sample
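  • BLIP-2’s image-to-text generation can produce unsatisfactory results, for example because of inaccurate knowledge in the LLM, an incorrect reasoning path, or outdated information about new image content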
  • BLIP-2 inherits risks of LLMs such as outputting offensive language, propagating social bias, or leaking private information

Conclusion

  • Proposed BLIP-2, a generic and compute-efficient method for vision-language pre-training
  • Leverages frozen pretrained image encoders and LLMs
  • Achieves state-of-the-art performance on various vision-language tasks
  • Small number of trainable parameters during pre-training
  • Enables zero-shot instructed image-to-text generation
  • Compares with state-of-the-art methods on zero-shot vision-language tasks
  • Improves image-text retrieval performance