Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Cost of vision-language pre-training has become expensive
- BLIP-2 is a pre-training strategy that uses frozen pre-trained image encoders and language models
- BLIP-2 bridges the modality gap with a lightweight Querying Transformer
- BLIP-2 achieves state-of-the-art performance on various vision-language tasks
- BLIP-2 outperforms Flamingo80B with 54x fewer trainable parameters
- BLIP-2 has emerging capabilities of zero-shot image-to-text generation
Paper Content
Introduction
- Vision-language pre-training (VLP) research has seen rapid advancement in the past few years
- Pre-trained models with larger scale have been developed to push the state-of-the-art on various downstream tasks
- Most state-of-the-art models incur high computation cost during pre-training
- Proposed a generic and compute-efficient VLP method by bootstrapping from pre-trained vision and language models
- Pre-trained models offer high-quality visual representation and strong language generation
- Unimodal pre-trained models remain frozen during pre-training
- Querying Transformer (Q-Former) pre-trained with two-stage pre-training strategy to facilitate cross-modal alignment
- BLIP-2 achieves state-of-the-art performance on various vision-language tasks
- BLIP-2 can perform zero-shot image-to-text generation
- BLIP-2 is more compute-efficient than existing state-of-the-arts
Related work
End-to-end vision-language pre-training
- Vision-language pre-training is used to learn multimodal foundation models with improved performance on various vision-language tasks.
- Different model architectures have been proposed, including dual-encoder, fusion-encoder, encoder-decoder, and unified transformer.
- Pre-training objectives have been proposed, including image-text contrastive learning, image-text matching, and (masked) language modeling.
- End-to-end pre-training using large-scale image-text pair datasets is used, but this can incur a high computation cost and is inflexible for leveraging readily-available unimodal pre-trained models.
Modular vision-language pre-training
- Methods leverage off-the-shelf pre-trained models and keep them frozen during VLP
- Image encoder is frozen, including early work and recent LiT
- Language model is frozen to use knowledge from LLMs
- Challenge is to align visual features to text space
- Existing methods use language modeling loss to generate texts conditioned on the image
- BLIP-2 leverages both frozen image encoders and frozen LLMs for vision-language tasks
Method
- Proposed BLIP-2, a vision-language pre-training method
- Uses a Querying Transformer (Q-Former) pre-trained in two stages
Model architecture
- Q-Former is a trainable module that bridges the gap between a frozen image encoder and a frozen LLM.
- Q-Former consists of two transformer submodules that share the same self-attention layers.
- Q-Former contains 188M parameters and 32 learnable query embeddings.
Bootstrap vision-language representation learning from a frozen image encoder
- Q-Former is connected to a frozen image encoder and pre-trained using image-text pairs
- Three pre-training objectives are jointly optimized, each with a different attention masking strategy
- Image-Text Contrastive Learning (ITC) maximizes mutual information between image and text representation
- Image-grounded Text Generation (ITG) trains Q-Former to generate texts given input images
- Image-Text Matching (ITM) learns fine-grained alignment between image and text representation
- Q-Former is connected to a frozen LLM to harvest its generative language capability
- LLM is pre-trained with language modeling loss or prefix language modeling loss
Model pre-training
- Pre-training data includes COCO, Visual Genome, CC3M, CC12M, SBU, and LAION400M
- Pre-training uses CapFilt method to create synthetic captions for web images
- Two state-of-the-art pre-trained vision transformer models used: ViT-L/14 and ViT-G/14
- Frozen language model explored: OPT model family and FlanT5 model family
- Pre-training for 250k steps in first stage and 80k steps in second stage
- Batch size of 2320/1680 for ViT-L/ViT-G in first stage and 1920/1520 for OPT/FlanT5 in second stage
- Parameters converted to FP16 or BFloat16
- AdamW optimizer used with cosine learning rate decay
- Images augmented with random resized cropping and horizontal flipping
Experiment
- BLIP-2 performs better than previous state-of-the-art models on zero-shot vision-language tasks.
- BLIP-2 requires fewer trainable parameters than previous models.
Instructed zero-shot image-to-text generation
- BLIP-2 enables a LLM to understand images while following text prompts
- Allows control of image-to-text generation with instructions
- Examples demonstrate a range of zero-shot image-to-text capabilities
- BLIP-2 achieves state-of-the-art result on VQAv2 and GQA datasets
- Stronger image encoder or LLM leads to better performance
- Representation learning pre-trains Q-Former to learn visual features relevant to text
Image captioning
- Finetune BLIP-2 models for image captioning task
- Use prompt “a photo of” as initial input to LLM
- Keep LLM frozen during finetuning, update parameters of Q-Former and image encoder
- Experiment with ViT-G and various LLMs
- Perform finetuning on COCO, evaluate on COCO test set and zero-shot transfer to NoCaps
- BLIP-2 achieves state-of-the-art performance with significant improvement on NoCaps
Visual question answering
- Finetune parameters of Q-Former and image encoder while keeping LLM frozen
- LLM receives Q-Former’s output and question as input and is asked to generate answer
- Q-Former is conditioned on question tokens to extract image features more relevant to question
Image-text retrieval
- Finetuned first-stage-pretrained model w/o LLM for image-text retrieval
- Evaluated model on COCO and Flickr30K datasets
- Used ViT-L and ViT-G as image encoder
- Achieved state-of-the-art performance with ITC and ITM losses
- ITG loss beneficial for image-text retrieval
Limitation
- LLMs can perform in-context learning with fewshot examples
- BLIP-2 does not observe improved VQA performance when providing LLM with in-context VQA examples
- Lack of in-context learning capability attributed to pretraining dataset containing only single image-text pair per sample
- BLIP-2’s image-to-text generation could have unsatisfactory results due to various reasons
- BLIP-2 inherits risks of LLMs such as outputting offensive language, propagating social bias, or leaking private information
Conclusion
- Proposed BLIP-2, a generic and compute-efficient method for vision-language pre-training
- Leverages frozen pretrained image encoders and LLMs
- Achieves state-of-the-art performance on various vision-language tasks
- Small amount of trainable parameters during pre-training
- Enables zero-shot instructed image-to-text generation
- Compares with state-of-the-art methods on zero-shot vision-language tasks
- Improves image-text retrieval performance