Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs have shown good performance on complex reasoning by using CoT prompting
- Existing CoT studies are mostly limited to language modality
- A possible solution is to fine-tune small language models by fusing vision and language features
- The challenge is that language models tend to generate wrong reasoning chains
- Multimodal-CoT incorporates vision features in a decoupled training framework
- Multimodal-CoT outperforms previous state-of-the-art LLM by 16% on ScienceQA benchmark
Paper Content
Introduction
- Knowledge acquisition is strengthened by modeling diverse data modalities
- LLMs generate intermediate reasoning steps before inferring the answer (CoT reasoning)
- Existing studies related to CoT reasoning are largely isolated in the language modality
- Multimodal-CoT decomposes multi-step problems into intermediate reasoning steps and then infers the answer
- Two ways to perform Multimodal-CoT: prompting LLMs and fine-tuning small models
- Multimodal-CoT proposed to incorporate vision features in a decoupled training framework
- New state-of-the-art performance on the ScienceQA benchmark, outperforming accuracy of GPT-3.5 by 16%
Background
- Language models can be used to elicit CoT reasoning
- Recent progress has been made in prompting and fine-tuning language models to do this
Cot reasoning with llms
- CoT techniques are used to elicit multi-step reasoning abilities of LLMs
- Two major paradigms of techniques: Zero-Shot-CoT and Few-Shot-CoT
- Manual-CoT and Auto-CoT are used to generate demonstrations
- Recent studies focus on optimizing demonstrations and reasoning chains
- Problem decomposition and voting over multiple reasoning paths are used to optimize reasoning chains
- Program-of-thoughts (PoT) models the reasoning process as a program
Eliciting cot reasoning by fine-tuning models
- Recent interest in eliciting CoT reasoning by fine-tuning language models
- Lu et al. (2022a) fine-tuned T5 model on large-scale dataset with CoT annotations
- Performance decline when using CoT to infer answer
- Magister et al. (2022) and Ho et al. (2022) employed knowledge distillation to fine-tune student model on CoT outputs from teacher model
- New challenges in training 1B-models to be CoT reasoners
- Models under 100 billion parameters tend to produce illogical CoT that leads to wrong answers
Challenge of multimodal-cot
- Existing studies suggest CoT reasoning ability may emerge in language models with over 100 billion parameters.
- It is a challenge to elicit CoT reasoning in 1B-models, especially in the multimodal scenario.
Towards the role of cot
- Fine-tuning a text-only baseline for CoT reasoning on the ScienceQA benchmark
- Model takes the concatenation of tokens of the question text, context text, and multiple options as the input
- Comparing performance with three variants: No CoT, Reasoning, and Explanation
- Accuracy decreases by 12.54% if the model predicts rationales before answers
- Maximum length of generated outputs is always less than 400 tokens
Misleading by hallucinated rationales
- CoT problem is split into two stages: rationale generation and answer inference
- RougeL score and accuracy reported for rationale generation and answer inference
- 50 error cases randomly sampled, model tends to generate hallucinated rationales that mislead answer inference
- 64% of error cases involve model hallucinating due to lack of reference to vision content
Multimodality contributes to effective rationales
- Multimodal-CoT framework consists of two stages: rationale generation and answer inference
- To inject vision information, a caption is appended to the input of both stages
- Vision features are fused with language representations before feeding to the decoder
- Vision features boost the RougeL score of rationale generation and contribute to better answer accuracy
- Vision features mitigate the phenomenon of hallucination
Multimodal-cot
- Proposed Multimodal-CoT framework to incorporate vision features
- Overview of procedure and technical design of model architecture
Framework overview
- Multimodal-CoT consists of two training stages: rationale generation and answer inference.
- Both stages share the same model architecture but differ in the input and output.
- Vision-language is used as an example to show how Multimodal-CoT works.
- In the rationale generation stage, the model is fed with language and vision inputs.
- In the answer inference stage, the rationale is appended to the original language input.
Model architecture
Experiments
- Benchmark dataset presented
- Implementation of technique described
- Baselines for comparison established
- Results and findings reported
Dataset
- Method is evaluated on ScienceQA benchmark
- ScienceQA is a large-scale multimodal science question dataset
- ScienceQA contains 21k questions with rich domain diversity
- Dataset is split into training, validation, and test splits
Implementation
- Use T5 encoder-decoder architecture (Raffel et al., 2020)
- Initialize models with UnifiedQA (Khashabi et al., 2020)
- Also use FLAN-T5 (Chung et al., 2022)
- Do not use image captions
- Fine-tune models up to 20 epochs, with a learning rate of 5e-5
- Maximum input sequence length is 512
Main results
- Mutimodal-CoT Large outperforms GPT-3.5 by 16.51%.
- Mutimodal-CoT Large surpasses human performance.
- Mutimodal-CoT Large achieves a 21.37% performance gain for questions with paired images.
- Using image features is more effective than existing methods.
- Decoupled training strategy contributes to superior results.
Analysis
- Investigates how Multimodal-CoT works
- Discusses contribution factors and limitations
- Uses models under the base size for analysis
Multimodality boosts convergence
- Multimodal-CoT converges more quickly than the baseline
- Decoupled methods achieve higher accuracy at the beginning than single-turn baselines
- Vision features help generate more effective rationales which contribute to better answer accuracy
Using different vision features
- Three types of vision features are compared: CLIP, DETR, and ResNet.
- Vision features generally improve model performance compared to language only baseline.
- DETR performs best and is used by default in Multimodal-CoT.
General effectiveness across backbone models
- Our approach is generally effective for widely-used backbone models.
- Model is robust to some extent and can predict correct answer by ignoring incorrect rationales.
- Most mistakes are due to failures of understanding maps and counting numbers.
- Other mistakes are due to lack of commonsense knowledge and logical mistakes.
Conclusion
- Incorporating vision features helps generate more effective rationales and improves answer accuracy
- Scaling the language model size can mitigate the issue of incorrect rationales, but not as effective as using vision features
- VQA baselines take the question, context, and choices as the textual input, and the image as the vision input
- UnifiedQA and GPT-3.5 models are used, with CoT applied after the answer
- A stronger baseline is developed to directly predict the choice
- Four types of vision features are compared: CLIP, DETR, and ResNet
- Manual investigation of randomly selected examples shows that the model is robust to some extent and can predict the correct answer by ignoring incorrect rationales
- Most mistakes are due to failure to understand maps and counting numbers in the images, utilizing the alphabet, and logical mistakes
- Prospective directions for future studies include improving the quality of CoT and applying a filtering mechanism