Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • LLMs have good zero-shot generalization to new language tasks.
  • LLMs are not effective for zero-shot VQA due to modality and task disconnection.
  • Img2Prompt is a plug-and-play module that bridges the disconnections, allowing LLMs to perform zero-shot VQA without end-to-end training.
  • Img2Prompt is flexible, reduces cost, and achieves comparable or better performance than end-to-end training.

Paper Content

Introduction

  • VQA is a vision-language task with real-world applications
  • Human annotations are expensive and can introduce biases
  • Zero-shot VQA methods do not require ground-truth annotations
  • LLMs can perform tasks with zero in-domain data
  • LLMs have been used in zero-shot VQA
  • Finetuning a vision encoder and LLM is costly and introduces interdependence

Recent advances in VQA methods

  • VQA is a multi-modal evaluation benchmark that requires models to answer natural language questions according to an image.
  • Vision-language models have been pretrained on large-scale image-text datasets and fine-tuned for VQA tasks.
  • Recent works have incorporated external knowledge into networks to solve knowledge-based VQA, but results show difficulty in answering questions requiring reasoning ability.

LLM for zero/few-shot VQA tasks

  • LLMs are powerful for natural language understanding and reasoning
  • LLMs generate target tokens autoregressively
  • Prior VQA methods using LLMs fall into two categories: multi-modal pretraining and language-mediated VQA
  • Multi-modal pretraining is highly compute-inefficient and can lead to catastrophic forgetting
  • Language-mediated VQA uses language as the intermediate representation of the image
  • Difficulties in utilizing LLMs effectively in zero-shot VQA stem from modality disconnection and task disconnection
  • Img2Prompt is a new zero-shot technique to address the task disconnection on generic LLMs

Answer extraction

  • Generate captions from VQA image
  • Extract noun phrases, verb phrases, adjective phrases, numbers, and boolean-typed words as potential answers

Question generation

  • Extract answer candidates from image captions
  • Use question generation networks to generate questions for each answer candidate
  • Template-based question generation uses a parser to obtain the part of speech of each answer and applies a question template designed for that part of speech
  • Neural question generation uses a pretrained T5-large model to generate questions from answers
  • Exemplar QA pairs bridge the task disconnect between language modeling and VQA
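
The template-based variant can be pictured as a lookup from an answer's part of speech to a fixed question. The following is a minimal Python sketch, assuming the answers have already been tagged by a parser. The tag names, the template wording, and the `template_questions` helper are hypothetical placeholders for illustration, not the templates used in the paper.

```python
# Hypothetical templates keyed by a coarse part-of-speech tag.
# The wording is illustrative only, not taken from the paper.
TEMPLATES = {
    "NOUN": "What object is in this image?",
    "VERB": "What action is being taken in this image?",
    "ADJ": "What is the property of the object in this image?",
    "NUM": "How many objects are in this image?",
}
FALLBACK = "What is in this image?"


def template_questions(tagged_answers):
    """Turn (answer, pos_tag) pairs into synthetic (question, answer) exemplars."""
    pairs = []
    for answer, pos in tagged_answers:
        # Unknown tags fall back to a generic question.
        question = TEMPLATES.get(pos, FALLBACK)
        pairs.append((question, answer))
    return pairs


if __name__ == "__main__":
    for q, a in template_questions([("dog", "NOUN"), ("running", "VERB")]):
        print(f"Question: {q} Answer: {a}")
```

A neural generator would replace the lookup with a model call, but the output shape (a list of question-answer exemplars) would stay the same.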

Question-relevant caption prompt

  • Generate captions about the question-relevant portion of the image
  • Use Image-grounded Text Encoder (ITE) to determine relevant image regions
  • Use GradCAM to generate a coarse localisation map
  • Sample a subset of image patches with probability proportional to patch relevance
  • Generate captions from the sampled image patches using top-k sampling
  • Filter out captions whose matching score is below 0.5
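
The sampling and filtering steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the relevance values stand in for a GradCAM map flattened to one score per patch, sampling is done without replacement (the summary does not say which variant is used), and the function names `sample_patches` and `filter_captions` are hypothetical.

```python
import random


def sample_patches(relevance, k, rng=None):
    """Sample up to k distinct patch indices, each draw proportional to relevance."""
    rng = rng or random.Random()
    # Only patches with positive relevance can be drawn.
    remaining = {i: w for i, w in enumerate(relevance) if w > 0}
    chosen = []
    while remaining and len(chosen) < k:
        indices = list(remaining)
        weights = [remaining[i] for i in indices]
        pick = rng.choices(indices, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # assumption: sample without replacement
    return chosen


def filter_captions(scored_captions, threshold=0.5):
    """Keep captions whose matching score is at least the threshold."""
    return [caption for caption, score in scored_captions if score >= threshold]


if __name__ == "__main__":
    relevance = [0.0, 0.2, 0.5, 0.0, 0.3]  # toy per-patch relevance scores
    print(sample_patches(relevance, k=2, rng=random.Random(0)))
    print(filter_captions([("a man at a bar", 0.9), ("a blurry photo", 0.3)]))
```

Because each draw is weighted by relevance, patches the question attends to are more likely to be captioned, while the randomness lets repeated draws produce varied captions.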

Prompt design

  • Synthetic question-relevant captions and question-answer pairs are used to construct the complete prompt for the LLM.
  • The instruction text is “Please reason the answers of question according to the contexts.”
  • Greedy decoding is used to get the answer and meaningless tokens are removed.
  • The 30 answer candidates with the highest frequencies are selected, together with one caption containing each answer.
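
A prompt of this kind can be assembled with plain string formatting. The sketch below is a hedged illustration: the instruction text is the one quoted above, but the field labels (`Contexts:`, `Question:`, `Answer:`), the line layout, and the `build_prompt` helper are assumptions made for the example.

```python
INSTRUCTION = "Please reason the answers of question according to the contexts."


def build_prompt(captions, qa_pairs, question):
    """Assemble instruction, captions, exemplar QA pairs, and the target question."""
    lines = [INSTRUCTION]
    # Assumed layout: all captions are joined into a single context line.
    lines.append("Contexts: " + " ".join(captions))
    for q, a in qa_pairs:
        lines.append(f"Question: {q} Answer: {a}")
    # The final question is left open so the LLM completes the answer.
    lines.append(f"Question: {question} Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        build_prompt(
            captions=["A man mixes drinks at a bar."],
            qa_pairs=[("Where is the man?", "bar")],
            question="What is his job?",
        )
    )
```

The resulting string would be passed to a frozen LLM and decoded greedily; the exemplar pairs show the model the expected answer format.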

Experiment

  • Compare Img2Prompt with other zero-shot and few-shot VQA methods
  • Perform ablation studies on prompt patterns and caption selection strategies
  • Show qualitative examples and discuss failure cases

Environment setup

  • Validated method on 3 datasets: VQAv2, OK-VQA, A-OKVQA
  • VQAv2 has 214,354 questions in the validation set and 107,394 in the test-dev set
  • OK-VQA has 5,046 test questions; A-OKVQA has 1,100 validation questions and 6,700 test questions
  • Used BLIP to generate captions and GradCAM to localize image regions
  • Compared with prior VQA methods, which fall into 3 categories: zero-shot, zero-shot with pre-training, and few-shot

Main results

  • Img2Prompt surpasses prior zero-shot models that use frozen LLMs
  • LLMs help better comprehend questions and give more accurate answers
  • Scaling LLMs improves VQA scores
  • Img2Prompt is comparable to, and in some cases better than, end-to-end pretrained and few-shot models
  • Img2Prompt enables various LLMs to perform zero-shot VQA tasks

Analysis on question generation methods

  • Three question generation techniques are compared: image-agnostic, template-based, and neural-based
  • Two synthetic QA selection strategies are compared: random and max frequency
  • The neural-based method performs best; the image-agnostic method performs worst
  • Answer hit rate and answer noise rate are evaluated to measure visual information quality

Ablation on caption selection

  • The Max Frequency strategy selects captions containing the answer candidates with the highest frequencies
  • The Min Frequency strategy selects captions containing the answer candidates with the lowest frequencies
  • Max Frequency adds little information beyond what the exemplar QA prompts already provide
  • Min Frequency supplies information that is not in the QA pairs, which boosts performance
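
The two strategies differ only in the direction of the frequency ranking. A minimal Python sketch follows, assuming each caption comes with the list of answer candidates extracted from it; the function name `select_by_frequency`, its parameters, and the alphabetical tie-breaking are hypothetical choices for illustration.

```python
from collections import Counter


def select_by_frequency(captions, answers_per_caption, n, strategy="max"):
    """Pick n answers by frequency and one caption containing each.

    answers_per_caption[i] lists the answer candidates extracted from captions[i].
    """
    if strategy not in ("max", "min"):
        raise ValueError("strategy must be 'max' or 'min'")
    counts = Counter(a for answers in answers_per_caption for a in answers)
    # Rank by frequency; ties are broken alphabetically for determinism.
    sign = -1 if strategy == "max" else 1
    ranked = sorted(counts, key=lambda a: (sign * counts[a], a))
    selected = []
    for answer in ranked[:n]:
        # Take the first caption that mentions this answer.
        for caption, answers in zip(captions, answers_per_caption):
            if answer in answers:
                selected.append((answer, caption))
                break
    return selected


if __name__ == "__main__":
    caps = ["a dog on grass", "a dog runs", "a red ball"]
    answers = [["dog", "grass"], ["dog"], ["red", "ball"]]
    print(select_by_frequency(caps, answers, n=1, strategy="max"))
    print(select_by_frequency(caps, answers, n=2, strategy="min"))
```

In this toy input, the max strategy returns the caption about the frequent answer "dog", while the min strategy surfaces captions about rarer answers, which matches the intuition that rare answers carry information the exemplar QA pairs do not already cover.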

Ablation study on prompt design

  • Option 1: Append each synthetic QA pair directly after its caption
  • Option 2: Present all captions at once, followed by all QA pairs
  • Option 2 performs significantly better than Option 1
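
The difference between the two options is purely one of ordering, which a short Python sketch makes concrete. The field labels and the function names `interleaved_layout` and `grouped_layout` are assumptions for illustration; each input triple is a (caption, question, answer) tuple.

```python
def interleaved_layout(triples):
    """Option 1: each caption is followed directly by its synthetic QA pair."""
    return "\n".join(
        f"Contexts: {c} Question: {q} Answer: {a}" for c, q, a in triples
    )


def grouped_layout(triples):
    """Option 2: all captions first, then all QA pairs."""
    contexts = "Contexts: " + " ".join(c for c, _, _ in triples)
    pairs = [f"Question: {q} Answer: {a}" for _, q, a in triples]
    return "\n".join([contexts] + pairs)


if __name__ == "__main__":
    triples = [
        ("A man stands at a bar.", "Where is the man?", "bar"),
        ("He pours a drink.", "What is he pouring?", "drink"),
    ]
    print(interleaved_layout(triples))
    print("---")
    print(grouped_layout(triples))
```

With the grouped layout, the model sees every caption before the first exemplar question, so each exemplar can draw on the full context rather than a single caption.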

Examples and failure case analysis

  • LLM correctly infers that a man making drinks at a bar is a bartender
  • LLM is unable to make inferences based on qualitative physics in some cases

Limitation

  • Generating image captions and question-answer pairs incurs extra inference overhead.
  • Reducing the prompt can reduce the overhead, but accuracy may be affected.
  • Method avoids expensive end-to-end multimodal representation alignment.

Conclusion

  • Proposed Img2Prompt module to exploit knowledge and reasoning power of large language models for zero-shot VQA tasks
  • Provides visual information and task guidance to LLMs in the format of easily-digestible prompts
  • Eliminates requirement for expensive end-to-end vision-language alignment
  • Increases model deployment flexibility and decreases model deployment cost
  • Achieves comparable or superior zero-shot VQA performance to other methods
  • Inherent bias of these systems still exists
  • Integrated with few-shot models
  • Experimental comparisons between Img2Prompt and supervised models on the A-OKVQA dataset
  • Outperforms almost all supervised models with smaller-size language models
  • Experimental comparisons with different prompts on the OK-VQA val set
  • Experimental comparisons with models trained on the A-OKVQA training set
  • Experimental results of using different numbers of captions and QA pairs as prompts
  • Experimental results of using different numbers of patches to generate question-relevant captions