Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs have good zero-shot generalization to new language tasks.
LLMs are not effective for zero-shot VQA due to modality and task disconnection.
Img2Prompt is a plug-and-play module that bridges the disconnections, allowing LLMs to perform zero-shot VQA without end-to-end training.
Img2Prompt is flexible, reduces cost, and achieves comparable or better performance than end-to-end training.

Paper Content

Introduction

VQA is a vision-language task with real-world applications
Human annotations are expensive and can introduce biases
Zero-shot VQA methods do not require ground-truth annotations
LLMs can perform tasks with zero in-domain data
LLMs have been used in zero-shot VQA
Finetuning a vision encoder and LLM is costly and introduces interdependence

Recent advances in VQA methods

VQA is a multi-modal evaluation benchmark that requires models to answer natural language questions according to an image.
Vision-language models have been pretrained on large-scale image-text datasets and fine-tuned for VQA tasks.
Recent works have incorporated external knowledge into networks to solve knowledge-based VQA, but results show difficulty in answering questions requiring reasoning ability.

LLM for zero/few-shot VQA tasks

LLMs are powerful for natural language understanding and reasoning
LLMs generate target tokens autoregressively
Prior VQA methods using LLMs fall into two categories: multi-modal pretraining and language-mediated VQA
Multi-modal pretraining is highly compute-inefficient and can lead to catastrophic forgetting
Language-mediated VQA uses language as the intermediate representation of the image
Difficulties in utilizing LLMs effectively in zero-shot VQA stem from modality disconnection and task disconnection
Img2Prompt is a new zero-shot technique to address the task disconnection on generic LLMs

Answer extraction

Generate captions from VQA image
Extract noun phrases, verb phrases, adjective phrases, numbers, and boolean-typed words as potential answers
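The extraction step can be illustrated with a deliberately crude sketch. It does not use a syntactic parser, so it cannot find real noun, verb, or adjective phrases; a stopword filter stands in for the parser, and only numbers and boolean-typed words get their own label. The word lists and the function name `extract_candidates` are hypothetical and exist only for this example.

```python
import re
from typing import List, Tuple

# Hypothetical word lists for illustration; a real pipeline would use a parser.
BOOLEAN_WORDS = {"yes", "no", "true", "false"}
NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "is", "are",
             "and", "with", "to"}

def extract_candidates(caption: str) -> List[Tuple[str, str]]:
    """Return (token, kind) pairs for potential answers, in caption order."""
    tokens = re.findall(r"[a-z]+|\d+", caption.lower())
    seen = set()
    candidates = []
    for tok in tokens:
        if tok in STOPWORDS or tok in seen:
            continue  # skip function words and repeated tokens
        seen.add(tok)
        if tok.isdigit() or tok in NUMBER_WORDS:
            kind = "number"
        elif tok in BOOLEAN_WORDS:
            kind = "boolean"
        else:
            kind = "word"  # stand-in for noun/verb/adjective phrases
        candidates.append((tok, kind))
    return candidates

print(extract_candidates("A man is making two drinks at a bar"))
```

The `kind` label is what a later question-generation step would use to choose how to phrase a question for each candidate.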

Question generation

Extract answer candidates from image captions
Use question generation networks to generate questions for each answer candidate
Template-based question generation uses a parser to obtain the part of speech of each answer and applies a question template designed for that part of speech
Neural question generation uses a pretrained T5-large model to generate questions from answers
Exemplar QA pairs bridge task disconnect between language modeling and VQA
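A minimal sketch of the template-based variant follows. The part-of-speech tags are assumed to come from an upstream parser, and the template wording in `TEMPLATES` is hypothetical; the templates used in the paper may differ.

```python
from typing import Dict, Optional, Tuple

# Hypothetical templates keyed by part-of-speech tag (illustrative wording only).
TEMPLATES: Dict[str, str] = {
    "NOUN": "What is shown in this picture?",
    "VERB": "What action is being taken in this picture?",
    "ADJ": "How would you describe what is in this picture?",
    "NUM": "How many are there in this picture?",
}

def generate_qa(answer: str, pos: str) -> Optional[Tuple[str, str]]:
    """Pair an answer candidate with the template question for its POS tag.

    Returns None when no template exists for the tag.
    """
    question = TEMPLATES.get(pos)
    if question is None:
        return None
    return question, answer

pair = generate_qa("bartender", "NOUN")
if pair is not None:
    question, answer = pair
    print(f"Question: {question} Answer: {answer}")
```

Each resulting pair is an exemplar that shows the LLM the question-answer format before it sees the real question.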

Question-relevant caption prompt

Generate captions about the question-relevant portion of the image
Use Image-grounded Text Encoder (ITE) to determine relevant image regions
Use GradCAM to generate a coarse localisation map
Sample a subset of image patches with probability proportional to patch relevance
Generate captions from the sampled image patches using top-k sampling
Discard captions whose matching score is below 0.5
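The sampling and filtering steps above can be sketched with the standard library alone. The relevance values stand in for a GradCAM map flattened to one score per patch, and the matching scores are assumed to be provided by an image-text matching model. Drawing distinct patches (sampling without replacement) is an assumption of this sketch, as are the function names.

```python
import random
from typing import List, Sequence, Tuple

def sample_patches(relevance: Sequence[float], k: int,
                   rng: random.Random) -> List[int]:
    """Draw up to k distinct patch indices, each draw proportional to relevance."""
    remaining = [i for i, r in enumerate(relevance) if r > 0]
    weights = [relevance[i] for i in remaining]
    chosen = []
    for _ in range(min(k, len(remaining))):
        # Weighted draw over the patches that have not been picked yet.
        pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pos))
        weights.pop(pos)
    return sorted(chosen)

def filter_captions(scored: Sequence[Tuple[str, float]],
                    threshold: float = 0.5) -> List[str]:
    """Keep captions whose matching score is at least the threshold."""
    return [caption for caption, score in scored if score >= threshold]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(sample_patches([0.0, 0.2, 0.5, 0.0, 0.3], k=2, rng=rng))
print(filter_captions([("a man at a bar", 0.7), ("a red car", 0.4)]))
```

Patches with zero relevance are never drawn, so captions generated from the sample stay focused on the regions the question refers to.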

Prompt design

Synthetic question-relevant captions and question-answer pairs are used to construct complete prompts for LLM.
The instruction text is “Please reason the answers of question according to the contexts.”
Greedy decoding is used to get the answer and meaningless tokens are removed.
30 answer candidates with highest frequencies and one caption containing each answer are selected.
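The frequency-based selection in the last point can be sketched as follows, assuming each caption has already been paired with the answer candidates extracted from it. The input layout (a dict from caption to candidate list), the function name, and the choice of the first matching caption are assumptions of this example.

```python
from collections import Counter
from typing import Dict, List, Tuple

def select_exemplars(caption_answers: Dict[str, List[str]],
                     top_n: int = 30) -> List[Tuple[str, str]]:
    """Return (answer, caption) pairs for the top_n most frequent answers.

    For each selected answer, the first caption (in insertion order)
    that contains it is used.
    """
    counts = Counter(answer
                     for answers in caption_answers.values()
                     for answer in answers)
    selected = []
    for answer, _ in counts.most_common(top_n):
        caption = next(c for c, answers in caption_answers.items()
                       if answer in answers)
        selected.append((answer, caption))
    return selected

captions = {
    "a man at a bar": ["man", "bar"],
    "a man behind the bar": ["man", "bar"],
    "a man holding a glass": ["man", "glass"],
}
print(select_exemplars(captions, top_n=2))
```

With the default `top_n=30` and fewer than 30 distinct candidates, every candidate is returned.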

Experiment

Compare Img2Prompt with other zero-shot and few-shot VQA methods
Perform ablation studies on prompt patterns and caption selection strategies
Show qualitative examples and discuss failure cases

Environment setup

Validated method on 3 datasets: VQAv2, OK-VQA, A-OKVQA
VQAv2 has 214,354 questions in the validation set and 107,394 in the test-dev set
OK-VQA has 5,046 test questions and A-OKVQA has 1,100 validation questions and 6,700 test questions
Used BLIP to generate captions and GradCAM to localize image regions
Compared with prior VQA methods, which fall into 3 categories: zero-shot, zero-shot with pre-training, and few-shot

Main results

Img2Prompt surpasses prior zero-shot models with frozen LLMs
LLMs help better comprehend questions and give more accurate answers
Scaling LLMs improves VQA scores
Img2Prompt outperforms end-to-end pretraining and few-shot models
Img2Prompt enables various LLMs to perform zero-shot VQA tasks

Analysis on question generation methods

Three question generation techniques are compared: image-agnostic, template-based, and neural-based
Two synthetic QA selection strategies are compared: random and max freq.
Neural-based generation performs best; image-agnostic generation performs worst
Answer hit rate and answer noise rate are evaluated to measure visual information quality

Ablation on caption selection

Max Frequency strategy selects captions containing the most frequent answer candidates
Min Frequency strategy selects captions containing the least frequent answer candidates
Max Frequency does not provide more info than exemplar prompts
Min Frequency provides info not in QA pairs, boosting performance

Ablation study on prompt design

Option 1: Append synthetic QA pair after caption
Option 2: Present all captions at once, followed by all QA pairs
Option 2 performs significantly better than Option 1
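The two layouts can be compared side by side in a short sketch. The instruction string is the one quoted in the prompt design section; the field labels `Contexts:`, `Question:`, and `Answer:`, the line breaks, and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

INSTRUCTION = "Please reason the answers of question according to the contexts."

def build_prompt_interleaved(captions: List[str],
                             qa_pairs: List[Tuple[str, str]],
                             question: str) -> str:
    """Option 1: each caption is followed directly by its synthetic QA pair."""
    lines = [INSTRUCTION]
    for caption, (q, a) in zip(captions, qa_pairs):
        lines.append(f"Contexts: {caption}")
        lines.append(f"Question: {q} Answer: {a}")
    lines.append(f"Question: {question} Answer:")
    return "\n".join(lines)

def build_prompt_grouped(captions: List[str],
                         qa_pairs: List[Tuple[str, str]],
                         question: str) -> str:
    """Option 2: all captions first, then all QA pairs, then the real question."""
    lines = [INSTRUCTION, "Contexts: " + " ".join(captions)]
    for q, a in qa_pairs:
        lines.append(f"Question: {q} Answer: {a}")
    lines.append(f"Question: {question} Answer:")
    return "\n".join(lines)

captions = ["A man is making drinks at a bar.", "Bottles line the shelf."]
qa_pairs = [("Where is the man?", "bar"), ("What lines the shelf?", "bottles")]
print(build_prompt_grouped(captions, qa_pairs, "What is the man's job?"))
```

Both builders end with an open `Answer:` slot, so the LLM's continuation is read as the predicted answer.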

Examples and failure case analysis

LLM correctly infers that a man making drinks at a bar is a bartender
LLM is unable to make inferences based on qualitative physics in some cases

Limitation

Generating image captions and question-answer pairs incurs extra inference overhead.
Reducing the prompt can reduce the overhead, but accuracy may be affected.
Method avoids expensive end-to-end multimodal representation alignment.

Conclusion

Proposed Img2Prompt module to exploit knowledge and reasoning power of large language models for zero-shot VQA tasks
Provides visual information and task guidance to LLMs in the format of easily-digestible prompts
Eliminates requirement for expensive end-to-end vision-language alignment
Increases model deployment flexibility and decreases model deployment cost
Achieves comparable or superior zero-shot VQA performance to other methods
Inherent bias of these systems still exists
Can be integrated with few-shot models
Experimental comparisons between Img2Prompt and supervised model on A-OKVQA dataset
Outperforms almost all supervised models while using a smaller language model
Experimental comparisons with different prompts in OK-VQA val set
Experimental comparisons with models trained in A-OKVQA training dataset
Experimental results of using different number of captions and QA pairs as prompts
Experimental results of using different number of patches to generate question-relevant captions
