Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposed system paradigm integrates ChatGPT with a pool of vision experts
  • Defined and explored a comprehensive list of advanced vision tasks
  • Textual prompt design allows language models to accept, associate, and process multimodal information
  • Zero-shot experiments demonstrate effectiveness in addressing specified capabilities
  • Discussed and compared the system paradigm with an alternative approach that extends language models to multimodal scenarios through joint finetuning

Paper Content

Introduction

  • Recent years have seen significant advances in computer vision
  • Different vision problems require different models
  • One research direction is to combine vision and language modules
  • Large language models have shown impressive dialogue capability
  • NLP research has demonstrated the effectiveness of integrating external NLP tools with LLMs
  • MM-REACT combines vision experts with ChatGPT for multimodal reasoning and action
  • MM-REACT provides extra flexibility in module upgrades
  • LLMs have strong chain-of-thought capabilities
  • LLMs can use external NLP tools to solve problems
  • LLMs can reason and can take actions, but the two capabilities have mostly been used separately rather than together
  • Recent studies have attempted to merge reasoning and action for LLMs
  • MM-REACT uses vision tools as executable actions
  • MM-REACT uses ChatGPT to determine which vision expert to invoke

User input

  • ChatGPT only accepts texts as input
  • File paths are used to indicate non-text inputs
  • Vision experts are used to understand image content from different perspectives
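
The path-as-placeholder idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `build_user_message`, the `<image: ...>` tag format, the extension list, and the example path are all assumptions made for this sketch.

```python
from pathlib import Path

# Hypothetical set of extensions treated as image inputs.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def build_user_message(text: str, attachments: list[str]) -> str:
    """Return a text-only message in which each attachment appears as its file path."""
    lines = [text.strip()] if text.strip() else []
    for path in attachments:
        # The language model never sees pixels, only this path string.
        kind = "image" if Path(path).suffix.lower() in IMAGE_EXTENSIONS else "file"
        lines.append(f"<{kind}: {path}>")
    return "\n".join(lines)


print(build_user_message("What is in this picture?", ["/tmp/dog.jpg"]))
```

The model can then refer to the image by repeating the path, and a vision expert resolves the path to the actual file.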

ChatGPT response

  • ChatGPT is expected to provide two kinds of responses
  • The key challenge is to set up a protocol for knowing when to invoke a vision expert
  • The keyword “Assistant” in a response signals that a vision expert is required
  • ChatGPT is encouraged to show its thought process, highlighting why an external tool is required
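
One way to picture the two kinds of responses is a small classifier that checks for the keyword. This is a sketch under assumptions: the exact watchword string (here with a trailing comma), the `ParsedResponse` type, and the convention that the thought precedes the keyword are all hypothetical choices for illustration.

```python
from dataclasses import dataclass

# Keyword named in the text; the trailing comma is an assumption of this sketch.
WATCHWORD = "Assistant,"


@dataclass
class ParsedResponse:
    needs_expert: bool  # True if the model asked for a vision expert
    thought: str        # text before the keyword, read as the model's reasoning
    body: str           # the request for the expert, or the final answer


def classify_response(response: str) -> ParsedResponse:
    """Split a model response into thought and request, or treat it as a final answer."""
    head, sep, tail = response.partition(WATCHWORD)
    if sep:  # keyword found: a vision expert is required
        return ParsedResponse(True, head.strip(), tail.strip())
    return ParsedResponse(False, "", response.strip())


reply = "I need to see the image content. Assistant, what objects are in /tmp/dog.jpg?"
print(classify_response(reply))
```

A response without the keyword is passed straight back to the user; one with the keyword is routed to the expert-parsing step.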

Vision experts

  • Use regular expression matching to parse expert name and file path
  • Standardize output into text format
  • Represent output of detection model as <object name, x1, y1, x2, y2>
  • Add text description to explain numerical values
  • Inject knowledge of vision experts’ usages into prefix
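
The parsing and serialization steps above might look like the following. It is a sketch only: the action format `Assistant, <expert> <path>`, the expert name `object_detection`, and the wording of the explanatory header are assumptions, as is reading (x1, y1) and (x2, y2) as top-left and bottom-right corners.

```python
import re
from typing import Optional

# Hypothetical action format: "Assistant, <expert_name> <file_path>".
ACTION_PATTERN = re.compile(r"Assistant,\s*(?P<expert>\w+)\s+(?P<path>\S+)")


def parse_action(response: str) -> Optional[tuple[str, str]]:
    """Extract (expert name, file path) from a response, or None if no action is present."""
    match = ACTION_PATTERN.search(response)
    if match is None:
        return None
    return match.group("expert"), match.group("path")


def format_detections(detections: list[tuple[str, int, int, int, int]]) -> str:
    """Serialize detection output as text rows, preceded by a line explaining the numbers."""
    header = (
        "Objects as <object name, x1, y1, x2, y2>; "
        "(x1, y1) is assumed to be the top-left and (x2, y2) the bottom-right corner:"
    )
    rows = [f"<{name}, {x1}, {y1}, {x2}, {y2}>" for name, x1, y1, x2, y2 in detections]
    return "\n".join([header, *rows])


print(parse_action("Assistant, object_detection /tmp/dog.jpg"))
print(format_detections([("dog", 10, 20, 110, 220)]))
```

The explanatory header plays the role of the text description in the list above: without it, the language model would receive bare numbers with no indication of what they mean.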

Extensibility

  • Motivated by ReAct, which uses NLP tools
  • Extended to vision domain by replacing non-text modality with path string
  • Can be extended to other modalities, such as speech and audio
  • Can incorporate more tools by formatting their outputs in text format
  • Performance can be enhanced by upgrading to more powerful LLM
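
To illustrate why text-formatted outputs make the system easy to extend, here is a toy tool registry with a reason-and-act loop. Everything in it is hypothetical: the registry layout, the tool names, the prompt wording, and the scripted stand-in for the language model. A real system would call an actual LLM and actual experts.

```python
import re
from typing import Callable

# Hypothetical registry: tool name -> (usage description, function from path to text).
TOOLS: dict[str, tuple[str, Callable[[str], str]]] = {}
ACTION = re.compile(r"Assistant,\s*(\w+)\s+(\S+)")


def register_tool(name: str, description: str, fn: Callable[[str], str]) -> None:
    """Adding a tool only requires a description and a function that returns text."""
    TOOLS[name] = (description, fn)


def build_prefix() -> str:
    """Describe every registered tool in the prompt prefix."""
    lines = ["You can call these tools by replying 'Assistant, <tool> <path>':"]
    lines += [f"- {name}: {desc}" for name, (desc, _) in TOOLS.items()]
    return "\n".join(lines)


def run(llm: Callable[[str], str], user_message: str, max_steps: int = 5) -> str:
    history = build_prefix() + "\nUser: " + user_message
    for _ in range(max_steps):
        reply = llm(history)
        match = ACTION.search(reply)
        if match is None or match.group(1) not in TOOLS:
            return reply  # no known tool requested: treat as the final answer
        name, path = match.groups()
        observation = TOOLS[name][1](path)  # tool output is plain text
        history += f"\nAI: {reply}\nObservation: {observation}"
    return "Stopped: step limit reached."


# Stub tools standing in for real experts; the second one is a non-vision modality.
register_tool("image_captioning", "describes an image", lambda path: "a dog on grass")
register_tool("speech_to_text", "transcribes an audio file", lambda path: "hello world")


def scripted_llm(history: str) -> str:
    """Scripted stand-in for the language model, used only to exercise the loop."""
    if "Observation:" not in history:
        return "Assistant, speech_to_text /tmp/clip.wav"
    return "The clip says: hello world"


print(run(scripted_llm, "What is said in /tmp/clip.wav?"))
```

In this sketch, supporting a new modality or swapping in a stronger language model changes only the registry entry or the `llm` argument; the loop itself stays the same.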

Experiments

Experiment setup

  • Implemented MM-REACT based on LangChain codebase and ReAct
  • Accessed ChatGPT via Azure API with token length limit of 4096
  • Utilized vision experts from Azure Cognitive Services APIs
  • Expanded toolset with customized tools for spatial understanding and image editing
  • Examples of capabilities and application scenarios in Figures 4-14
  • Unfolded steps in Figure 18
  • Enhanced LLM from ChatGPT to GPT-4 in Figures 23 and 24
  • Plugged image editing tool from X-decoder in Figure 25

Limitations

  • Recognition capability in the wild is hard to evaluate with accuracy numbers due to lack of annotated benchmarks
  • Vision capability is limited by integrated vision experts
  • Knowledge is injected in the prefix, limited by context window
  • Visual signals are converted to text so that ChatGPT can process them
  • Manual prompt engineering required for MM-REACT

Conclusion

  • MM-REACT is a system paradigm that combines multimodal reasoning and action to solve visual understanding problems.