Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

VLP has improved performance for vision-language tasks, but only excels in understanding or generation tasks.
Performance improvement is achieved by scaling up dataset with noisy image-text pairs from the web.
BLIP is a new VLP framework which transfers flexibly to both understanding and generation tasks.
BLIP utilizes noisy web data by bootstrapping captions and achieves state-of-the-art results on a range of vision-language tasks.
BLIP also demonstrates strong generalization ability when transferred to video-language tasks in a zero-shot manner.

Paper Content

Introduction

Existing vision-language pre-training methods have two major limitations
BLIP is a new VLP framework which enables a wider range of downstream tasks
BLIP introduces a new model architecture (MED) and a new dataset boostrapping method (CapFilt)
BLIP achieves state-of-the-art performance on a wide range of vision-language tasks

Vision-language pre-training

VLP uses image-text pairs to improve performance of vision and language tasks
Most methods use image and alt-text pairs from the web
Noise is prevalent in web texts, but its negative impact is overlooked
CapFilt uses web datasets in a more effective way
Unified encoder-decoder models offer flexibility and better performance on downstream tasks

Knowledge distillation

Knowledge distillation (KD) is a technique to improve the performance of a student model by distilling knowledge from a teacher model.
Self-distillation is a special case of KD where the teacher and student have equal sizes.
KD has been used for image classification and VLP (Visual Language Processing).
CapFilt is a KD technique for VLP, where the captioner and filter distill knowledge through synthetic captions and removing noisy captions.

Data augmentation

Data augmentation is widely used in computer vision but less so in language tasks.
Generative language models have been used to synthesize examples for NLP tasks.
Synthetic captions can be used in large-scale vision-language pre-training.

Method

Proposed BLIP, a unified VLP framework to learn from noisy image-text pairs
Introduced new model architecture MED and its pre-training objectives
Described CapFilt for dataset bootstrapping

Model architecture

Uses visual transformer to encode images
Text encoder is same as BERT
MED model can operate in 3 functionalities: unimodal encoder, image-grounded text encoder, image-grounded text decoder

Pre-training objectives

Jointly optimize 3 objectives during pre-training: 2 understanding-based and 1 generation-based
Image-Text Contrastive Loss (ITC) aligns feature space of visual and text transformer
Image-Text Matching Loss (ITM) learns image-text multimodal representation
Language Modeling Loss (LM) maximizes likelihood of text in autoregressive manner

Capfilt

Limited number of high-quality human-annotated image-text pairs
Recent work uses larger number of image and alt-text pairs automatically collected from web
Alt-texts often do not accurately describe visual content of images, making them a noisy signal

Experiments and discussions

Pre-training details are introduced
Detailed experimental analysis is provided on the method

Pre-training details

Models implemented in PyTorch
Pre-trained on two 16-GPU nodes
Image transformer initialized from ViT pre-trained on ImageNet
Text transformer initialized from BERT base
Two variants of ViTs explored: ViT-B/16 and ViT-L/16
Pre-trained for 20 epochs with batch size of
Datasets used: SBU captions and LAION

Effect of capfilt

Performance improvement observed when only captioner or filter applied to dataset with 14M images
Effects of captioner and filter compliment each other, leading to substantial improvements
CapFilt boosts performance with larger dataset and larger vision backbone
Performance of base model improved when using large captioner and filter with ViT-L
Examples of captions and images demonstrate effect of captioner and filter

Diversity is key for synthetic captions

Nucleus sampling is a stochastic decoding method used to generate synthetic captions
Beam search is a deterministic decoding method used to generate captions with the highest probability
Nucleus sampling leads to better performance than beam search
Nucleus sampling generates more diverse and surprising captions, which contain more new information
Beam search tends to generate safe captions that are common in the dataset

Pre-training: parameters are shared except for self-attention layers, leads to better performance and reduces model size
CapFilt: captioner and filter are end-to-end finetuned individually, sharing parameters decreases performance due to confirmation bias

Comparison with state-of-the-arts

Compared BLIP to existing VLP methods on a range of vision-language tasks
Omitted SNLI-VE from benchmark due to noisy test data

Image-text retrieval

Evaluated BLIP for image-to-text and text-to-image retrieval on COCO and Flickr30K datasets
Finetuned pre-trained model using ITC and ITM losses
Selected k candidates based on image-text feature similarity and reranked them based on pairwise ITM scores
BLIP achieved substantial performance improvement compared to existing methods

Image captioning

Two datasets were used for image captioning: No-Caps and COCO
The model was finetuned on COCO with the LM loss
A prompt “a picture of” was added to each caption, leading to better results
200M images were used
LEMON requires a pre-trained object detector and higher resolution input images, leading to slower inference time than BLIP

Visual question answering (vqa)

VQA requires a model to predict an answer given an image and a question.
BLIP outperforms ALBEF by +1.64% on the test set using 14M images.
BLIP achieves better performance than SimVLM using less pre-training data and a smaller vision backbone.

Natural language visual reasoning (nlvr 2 )

NLVR 2 asks a model to predict whether a sentence describes a pair of images.
ViLBERT is pre-trained with additional VQA data.
A modification is made to the pre-trained model to enable reasoning over two images.
BLIP outperforms existing methods except for ALBEF.
Performance on NLVR 2 does not benefit much from additional web images.

Visual dialog (visdial)

VisDial extends VQA in a conversational setting
Model needs to predict answer based on image-question pair, dialog history, and image caption
Model ranks a pool of answer candidates
Image and caption embeddings are concatenated and passed to dialog encoder
Dialog encoder is trained with ITM loss to discriminate true/false answers
Achieves state-of-the-art performance on VisDial v1.0 validation set

Zero-shot transfer to video-language tasks

Our image-language model has strong generalization ability to video-language tasks.
We sample frames from videos and concatenate them into a single sequence.
Our models achieve state-of-the-art performance on both video-language tasks.

Additional ablation study

CapFilt was pre-trained using a bootstrapped dataset
Continuing training does not help improve the model

Conclusion

Proposed BLIP, a new VLP framework with state-of-the-art performance
Pre-train a multimodal mixture of encoder-decoder model using a dataset bootstrapped from large-scale noisy image-text pairs
Dataset bootstrapped by injecting diverse synthetic captions and removing noisy captions
Released dataset to facilitate future vision-language research
Potential directions to further enhance performance: multiple rounds of dataset bootstrapping, generate multiple synthetic captions per image, model ensemble
Finetune on downstream vision-language tasks with AdamW optimizer, cosine learning rate schedule, image resolution of 384x384
Experiment with VQA2.0 dataset, use decoder to rank 3,128 candidate answers
Experiment with NLVR 2, VisDial v1.0
Examples of web text and synthetic text
Model architecture for downstream tasks
Pre-training model architecture and objectives of BLIP
Unimodal encoder trained with ITC loss, image-grounded text encoder trained with ITM loss, image-grounded text decoder trained with LM loss
Beam search and nucleus sampling for synthetic caption generation
Parameter sharing strategies for text encoder and decoder during pre-training
Comparison with state-of-the-art image-text retrieval methods
Zero-shot image-text retrieval results on Flickr30K
Comparison with state-of-the-art methods on VisDial v1.0 validation set
Text-to-video retrieval on MSRVTT dataset
Video question answering on two datasets
Improvement with CapFilt not due to longer training
Statistics of pre-training datasets

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Vision-language pre-training#

Knowledge distillation#

Data augmentation#

Method#

Model architecture#

Pre-training objectives#

Capfilt#

Experiments and discussions#

Pre-training details#

Effect of capfilt#

Diversity is key for synthetic captions#

Parameter sharing and decoupling#

Comparison with state-of-the-arts#

Image-text retrieval#

Image captioning#

Visual question answering (vqa)#

Natural language visual reasoning (nlvr 2 )#

Visual dialog (visdial)#

Zero-shot transfer to video-language tasks#

Additional ablation study#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Vision-language pre-training

Knowledge distillation

Data augmentation

Method

Model architecture

Pre-training objectives

Capfilt

Experiments and discussions

Pre-training details

Effect of capfilt

Diversity is key for synthetic captions

Parameter sharing and decoupling

Comparison with state-of-the-arts

Image-text retrieval

Image captioning

Visual question answering (vqa)

Natural language visual reasoning (nlvr 2 )

Visual dialog (visdial)

Zero-shot transfer to video-language tasks

Additional ablation study

Conclusion