Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs have good zero-shot generalization to new language tasks.
- LLMs are not effective for zero-shot VQA due to modality and task disconnection.
- Img2Prompt is a plug-and-play module that bridges the disconnections, allowing LLMs to perform zero-shot VQA without end-to-end training.
- Img2Prompt is flexible, reduces cost, and achieves comparable or better performance than end-to-end training.
Paper Content
Introduction
- VQA is a vision-language task with real-world applications
- Human annotations are expensive and can introduce biases
- Zero-shot VQA methods do not require ground-truth annotations
- LLMs can perform tasks with zero in-domain data
- LLMs have been used in zero-shot VQA
- Finetuning a vision encoder and LLM is costly and introduces interdependence
Related work
Recent advances in vqa methods
- VQA is a multi-modal evaluation benchmark that requires models to answer natural language questions according to an image.
- Vision-language models have been pretrained on large-scale image-text datasets and fine-tuned for VQA tasks.
- Recent works have incorporated external knowledge into networks to solve knowledge-based VQA, but results show difficulty in answering questions requiring reasoning ability.
Llm for zero/few-shot vqa tasks
- LLMs are powerful for natural language understanding and reasoning
- LLMs generate target tokens autoregressively
- Prior VQA methods using LLMs fall into two categories: multi-modal pretraining and language-mediated VQA
- Multi-modal pretraining is highly compute-inefficient and can lead to catastrophic forgetting
- Language-mediated VQA uses language as the intermediate representation of the image
- Difficulties in utilizing LLMs effectively in zero-shot VQA stem from modality disconnection and task disconnection
- Img2Prompt is a new zero-shot technique to address the task disconnection on generic LLMs
Answer extraction
- Generate captions from VQA image
- Extract noun phrases, verb phrases, adjective phrases, numbers, and boolean-typed words as potential answers
Question generation
- Extract answer candidates from image captions
- Use question generation networks to generate questions for each answer candidate
- Template-based question generation uses parser to obtain part-of-speech for each answer and design specific question templates
- Neural question generation uses pretrained T5-large model to generate questions from answers
- Exemplar QA pairs bridge task disconnect between language modeling and VQA
Question-relevant caption prompt
- Generate captions about the question-relevant portion of the image
- Use Imagegrounded Text Encoder (ITE) to determine relevant image regions
- Use GradCAM to generate a coarse localisation map
- Sample a subset of image patches with probability proportional to patch relevance
- Generate captions from the sampled image patches using top-k sampling
- Filter captions with less than 0.5 matching scores
Prompt design
- Synthetic question-relevant captions and question-answer pairs are used to construct complete prompts for LLM.
- The instruction text is “Please reason the answers of question according to the contexts.”
- Greedy decoding is used to get the answer and meaningless tokens are removed.
- 30 answer candidates with highest frequencies and one caption containing each answer are selected.
Experiment
- Compare Img2Prompt with other zero-shot and few-shot VQA methods
- Perform ablation studies on prompt patterns and caption selection strategies
- Show qualitative examples and discuss failure cases
Environment setup
- Validated method on 3 datasets: VQAv2, OK-VQA, A-OKVQA
- VQAv2 has 214,354 questions in validation set and 107,394 in test-dev dataset
- OK-VQA has 5,046 test questions and A-OKVQA has 1,100 validation questions and 6,700 test questions
- Used BLIP to generate captions and GradCam to localize image regions
- Compared with prior VQA methods, which fall into 3 categories: zero-shot, zero-shot with pre-training, and few-shot
Main results
- Img2Prompt surpasses prior zero-shot model with frozen LLMs
- LLMs help better comprehend questions and give more accurate answers
- Scaling LLMs improves VQA scores
- Img2Prompt outperforms end-to-end pretraining and few-shot models
- Img2Prompt enables various LLMs to perform zero-shot VQA tasks
Analysis on question generation methods
- Three question generation techniques are compared: image-agnostic, template-based, and neural-based
- Two synthetic QA selection strategies are compared: random and max freq.
- Neural-based performs the best, Agnostic performs the worst
- Answer hit rate and answer noise rate are evaluated to measure visual information quality
Ablation on caption selection
- Max Frequency strategy selects captions with highest frequencies
- Min Frequency strategy selects captions with lowest frequencies
- Max Frequency does not provide more info than exemplar prompts
- Min Frequency provides info not in QA pairs, boosting performance
Ablation study on prompt design
- Option 1: Append synthetic QA pair after caption
- Option 2: Present all captions at once, followed by all QA pairs
- Option 2 performs significantly better than Option 1
Examples and failure case analysis
- LLM correctly infers that a man making drinks at a bar is a bartender
- LLM is unable to make inferences based on qualitative physics in some cases
Limitation
- Generating image captions and question-answer pairs incurs extra inference overhead.
- Reducing the prompt can reduce the overhead, but accuracy may be affected.
- Method avoids expensive end-to-end multimodal representation alignment.
Conclusion
- Proposed Img2Prompt module to exploit knowledge and reasoning power of large language models for zero-shot VQA tasks
- Provides visual information and task guidance to LLMs in the format of easily-digestible prompts
- Eliminates requirement for expensive end-to-end vision-language alignment
- Increases model deployment flexibility and decreases model deployment cost
- Achieves comparable or superior zero-shot VQA performance to other methods
- Inherent bias of these systems still exists
- Integrated with few-shot models
- Experimental comparisons between Img2Prompt and supervised model on A-OKVQA dataset
- Outperform almost all supervised model with smaller size language model
- Experimental comparisons with different prompts in OK-VQA val set
- Experimental comparisons with models trained in A-OKVQA training dataset
- Experimental results of using different number of captions and QA pairs as prompts
- Experimental results of using different number of patches to generate question-relevant captions