Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed system paradigm integrates ChatGPT with a pool of vision experts
- Defined and explored a comprehensive list of advanced vision tasks
- Textual prompt design allows language models to accept, associate, and process multimodal information
- Zero-shot experiments demonstrate effectiveness in addressing specified capabilities
- Discussed and compared system paradigm with alternative approach
Paper Content
Introduction
- Recent years have seen significant advancement for computer vision
- Different vision problems require different models
- One research direction is to combine vision and language modules
- Large language models have shown impressive dialogue capability
- NLP research has demonstrated the effectiveness of integrating external NLP tools with LLMs
- MM-REACT combines vision experts with ChatGPT for multimodal reasoning and action
- MM-REACT provides extra flexibility in module upgrades
Related work
- LLMs have strong chain-of-thought capabilities
- LLMs can use external NLP tools to solve problems
- LLMs can reason and take action independently, but not together
- Recent studies have attempted to merge reasoning and action for LLMs
- MM-REACT uses vision tools as executable actions
- MM-REACT uses ChatGPT to determine which vision expert to invoke
User input
- ChatGPT only accepts texts as input
- File paths are used to indicate non-text inputs
- Vision experts are used to understand image content from different perspectives
Chatgpt response
- ChatGPT is expected to provide two kinds of responses
- Key challenge is to set up a protocol to know when to invoke vision expert
- Use keyword “Assistant” to distinguish if vision expert is required
- Encourage Chat-GPT to show thought process to highlight why external tool is required
Vision experts
- Use regular expression matching to parse expert name and file path
- Standardize output into text format
- Represent output of detection model as <object name, x1, y1, x2, y2>
- Add text description to explain numerical values
- Inject knowledge of vision experts’ usages into prefix
Extensibility
- Motivated by REACT, which uses NLP tools
- Extended to vision domain by replacing non-text modality with path string
- Can be extended to other modalities, such as speech and audio
- Can incorporate more tools by formatting their outputs in text format
- Performance can be enhanced by upgrading to more powerful LLM
Experiments
Experiment setup
- Implemented MM-REACT based on LangChain codebase and ReAct
- Accessed ChatGPT via Azure API with token length limit of 4096
- Utilized vision experts from Azure Cognitive Services APIs
- Expanded toolset with customized tools for spatial understanding and image editing
- Examples of capabilities and application scenarios in Figures 4-14
- Unfolded steps in Figure 18
- Enhanced LLM from ChatGPT to GPT-4 in Figures 23 and 24
- Plugged image editing tool from X-decoder in Figure 25
Limitations
- Recognition capability in the wild is hard to evaluate with accuracy numbers due to lack of annotated benchmarks
- Vision capability is limited by integrated vision experts
- Knowledge is injected in the prefix, limited by context window
- Visual signals are converted to text words for ChatGPT understanding
- Manual prompt engineering required for MM-REACT
Conclusion
- MM-REACT is a system paradigm that combines multimodal reasoning and action to solve visual understanding problems.