Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- VLP has improved performance for vision-language tasks, but only excels in understanding or generation tasks.
- Performance improvement is achieved by scaling up dataset with noisy image-text pairs from the web.
- BLIP is a new VLP framework which transfers flexibly to both understanding and generation tasks.
- BLIP utilizes noisy web data by bootstrapping captions and achieves state-of-the-art results on a range of vision-language tasks.
- BLIP also demonstrates strong generalization ability when transferred to video-language tasks in a zero-shot manner.
Paper Content
Introduction
- Existing vision-language pre-training methods have two major limitations
- BLIP is a new VLP framework which enables a wider range of downstream tasks
- BLIP introduces a new model architecture (MED) and a new dataset boostrapping method (CapFilt)
- BLIP achieves state-of-the-art performance on a wide range of vision-language tasks
Related work
Vision-language pre-training
- VLP uses image-text pairs to improve performance of vision and language tasks
- Most methods use image and alt-text pairs from the web
- Noise is prevalent in web texts, but its negative impact is overlooked
- CapFilt uses web datasets in a more effective way
- Unified encoder-decoder models offer flexibility and better performance on downstream tasks
Knowledge distillation
- Knowledge distillation (KD) is a technique to improve the performance of a student model by distilling knowledge from a teacher model.
- Self-distillation is a special case of KD where the teacher and student have equal sizes.
- KD has been used for image classification and VLP (Visual Language Processing).
- CapFilt is a KD technique for VLP, where the captioner and filter distill knowledge through synthetic captions and removing noisy captions.
Data augmentation
- Data augmentation is widely used in computer vision but less so in language tasks.
- Generative language models have been used to synthesize examples for NLP tasks.
- Synthetic captions can be used in large-scale vision-language pre-training.
Method
- Proposed BLIP, a unified VLP framework to learn from noisy image-text pairs
- Introduced new model architecture MED and its pre-training objectives
- Described CapFilt for dataset bootstrapping
Model architecture
- Uses visual transformer to encode images
- Text encoder is same as BERT
- MED model can operate in 3 functionalities: unimodal encoder, image-grounded text encoder, image-grounded text decoder
Pre-training objectives
- Jointly optimize 3 objectives during pre-training: 2 understanding-based and 1 generation-based
- Image-Text Contrastive Loss (ITC) aligns feature space of visual and text transformer
- Image-Text Matching Loss (ITM) learns image-text multimodal representation
- Language Modeling Loss (LM) maximizes likelihood of text in autoregressive manner
Capfilt
- Limited number of high-quality human-annotated image-text pairs
- Recent work uses larger number of image and alt-text pairs automatically collected from web
- Alt-texts often do not accurately describe visual content of images, making them a noisy signal
Experiments and discussions
- Pre-training details are introduced
- Detailed experimental analysis is provided on the method
Pre-training details
- Models implemented in PyTorch
- Pre-trained on two 16-GPU nodes
- Image transformer initialized from ViT pre-trained on ImageNet
- Text transformer initialized from BERT base
- Two variants of ViTs explored: ViT-B/16 and ViT-L/16
- Pre-trained for 20 epochs with batch size of
- Datasets used: SBU captions and LAION
Effect of capfilt
- Performance improvement observed when only captioner or filter applied to dataset with 14M images
- Effects of captioner and filter compliment each other, leading to substantial improvements
- CapFilt boosts performance with larger dataset and larger vision backbone
- Performance of base model improved when using large captioner and filter with ViT-L
- Examples of captions and images demonstrate effect of captioner and filter
Diversity is key for synthetic captions
- Nucleus sampling is a stochastic decoding method used to generate synthetic captions
- Beam search is a deterministic decoding method used to generate captions with the highest probability
- Nucleus sampling leads to better performance than beam search
- Nucleus sampling generates more diverse and surprising captions, which contain more new information
- Beam search tends to generate safe captions that are common in the dataset
Parameter sharing and decoupling
- Pre-training: parameters are shared except for self-attention layers, leads to better performance and reduces model size
- CapFilt: captioner and filter are end-to-end finetuned individually, sharing parameters decreases performance due to confirmation bias
Comparison with state-of-the-arts
- Compared BLIP to existing VLP methods on a range of vision-language tasks
- Omitted SNLI-VE from benchmark due to noisy test data
Image-text retrieval
- Evaluated BLIP for image-to-text and text-to-image retrieval on COCO and Flickr30K datasets
- Finetuned pre-trained model using ITC and ITM losses
- Selected k candidates based on image-text feature similarity and reranked them based on pairwise ITM scores
- BLIP achieved substantial performance improvement compared to existing methods
Image captioning
- Two datasets were used for image captioning: No-Caps and COCO
- The model was finetuned on COCO with the LM loss
- A prompt “a picture of” was added to each caption, leading to better results
- 200M images were used
- LEMON requires a pre-trained object detector and higher resolution input images, leading to slower inference time than BLIP
Visual question answering (vqa)
- VQA requires a model to predict an answer given an image and a question.
- BLIP outperforms ALBEF by +1.64% on the test set using 14M images.
- BLIP achieves better performance than SimVLM using less pre-training data and a smaller vision backbone.
Natural language visual reasoning (nlvr 2 )
- NLVR 2 asks a model to predict whether a sentence describes a pair of images.
- ViLBERT is pre-trained with additional VQA data.
- A modification is made to the pre-trained model to enable reasoning over two images.
- BLIP outperforms existing methods except for ALBEF.
- Performance on NLVR 2 does not benefit much from additional web images.
Visual dialog (visdial)
- VisDial extends VQA in a conversational setting
- Model needs to predict answer based on image-question pair, dialog history, and image caption
- Model ranks a pool of answer candidates
- Image and caption embeddings are concatenated and passed to dialog encoder
- Dialog encoder is trained with ITM loss to discriminate true/false answers
- Achieves state-of-the-art performance on VisDial v1.0 validation set
Zero-shot transfer to video-language tasks
- Our image-language model has strong generalization ability to video-language tasks.
- We sample frames from videos and concatenate them into a single sequence.
- Our models achieve state-of-the-art performance on both video-language tasks.
Additional ablation study
- CapFilt was pre-trained using a bootstrapped dataset
- Continuing training does not help improve the model
Conclusion
- Proposed BLIP, a new VLP framework with state-of-the-art performance
- Pre-train a multimodal mixture of encoder-decoder model using a dataset bootstrapped from large-scale noisy image-text pairs
- Dataset bootstrapped by injecting diverse synthetic captions and removing noisy captions
- Released dataset to facilitate future vision-language research
- Potential directions to further enhance performance: multiple rounds of dataset bootstrapping, generate multiple synthetic captions per image, model ensemble
- Finetune on downstream vision-language tasks with AdamW optimizer, cosine learning rate schedule, image resolution of 384x384
- Experiment with VQA2.0 dataset, use decoder to rank 3,128 candidate answers
- Experiment with NLVR 2, VisDial v1.0
- Examples of web text and synthetic text
- Model architecture for downstream tasks
- Pre-training model architecture and objectives of BLIP
- Unimodal encoder trained with ITC loss, image-grounded text encoder trained with ITM loss, image-grounded text decoder trained with LM loss
- Beam search and nucleus sampling for synthetic caption generation
- Parameter sharing strategies for text encoder and decoder during pre-training
- Comparison with state-of-the-art image-text retrieval methods
- Zero-shot image-text retrieval results on Flickr30K
- Comparison with state-of-the-art methods on VisDial v1.0 validation set
- Text-to-video retrieval on MSRVTT dataset
- Video question answering on two datasets
- Improvement with CapFilt not due to longer training
- Statistics of pre-training datasets