Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs have shown good performance on complex reasoning by using CoT prompting
Existing CoT studies are mostly limited to the language modality
A possible solution is to fine-tune small language models by fusing vision and language features
The challenge is that language models tend to generate wrong reasoning chains
Multimodal-CoT incorporates vision features in a decoupled training framework
Multimodal-CoT outperforms the previous state-of-the-art LLM by 16 percentage points in accuracy on the ScienceQA benchmark

Paper Content

Introduction

Knowledge acquisition is strengthened by modeling diverse data modalities
LLMs generate intermediate reasoning steps before inferring the answer (CoT reasoning)
Existing studies related to CoT reasoning are largely isolated in the language modality
Multimodal-CoT decomposes multi-step problems into intermediate reasoning steps and then infers the answer
Two ways to perform Multimodal-CoT: prompting LLMs and fine-tuning small models
Multimodal-CoT proposed to incorporate vision features in a decoupled training framework
New state-of-the-art performance on the ScienceQA benchmark, exceeding the accuracy of GPT-3.5 by 16 percentage points

Background

Language models can be used to elicit CoT reasoning
Recent progress has been made in prompting and fine-tuning language models to do this

CoT reasoning with LLMs

CoT techniques are used to elicit multi-step reasoning abilities of LLMs
Two major paradigms of techniques: Zero-Shot-CoT and Few-Shot-CoT
Manual-CoT and Auto-CoT are used to generate demonstrations
Recent studies focus on optimizing demonstrations and reasoning chains
Problem decomposition and voting over multiple reasoning paths are used to optimize reasoning chains
Program-of-thoughts (PoT) models the reasoning process as a program
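
The two prompting paradigms and the voting step can be sketched as follows. This is a minimal illustration, not code from any of the cited works: the trigger phrase "Let's think step by step." is the commonly cited Zero-Shot-CoT prompt, while the function names, the prompt layout, and the idea of passing already-extracted final answers to the vote are assumptions made for this example.

```python
from collections import Counter


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-Shot-CoT: append a reasoning trigger instead of demonstrations."""
    return f"Q: {question}\nA: Let's think step by step."


def few_shot_cot_prompt(demos: list[tuple[str, str, str]], question: str) -> str:
    """Few-Shot-CoT: prepend (question, rationale, answer) demonstrations."""
    blocks = [
        f"Q: {q}\nA: {rationale} The answer is {answer}."
        for q, rationale, answer in demos
    ]
    # The test question comes last, with the answer left open for the model.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def majority_vote(answers: list[str]) -> str:
    """Voting over reasoning paths: keep the most frequent final answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]
```

In practice the list passed to `majority_vote` would hold the final answers parsed from several sampled reasoning chains for the same question; how those chains are sampled and parsed is left out here.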

Eliciting CoT reasoning by fine-tuning models

Recent interest in eliciting CoT reasoning by fine-tuning language models
Lu et al. (2022a) fine-tuned T5 model on large-scale dataset with CoT annotations
Performance declined when CoT was used to infer the answer
Magister et al. (2022) and Ho et al. (2022) employed knowledge distillation to fine-tune student model on CoT outputs from teacher model
New challenges in training 1B-models to be CoT reasoners
Models under 100 billion parameters tend to produce illogical CoT that leads to wrong answers

Challenge of Multimodal-CoT

Existing studies suggest CoT reasoning ability may emerge in language models with over 100 billion parameters.
It is a challenge to elicit CoT reasoning in 1B-models, especially in the multimodal scenario.

Towards the role of CoT

Fine-tuning a text-only baseline for CoT reasoning on the ScienceQA benchmark
Model takes the concatenation of tokens of the question text, context text, and multiple options as the input
Comparing performance with three variants: No CoT, Reasoning, and Explanation
Accuracy decreases by 12.54 percentage points if the model predicts rationales before answers
Maximum length of generated outputs is always less than 400 tokens
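
A sketch of how the text-only input and the three target formats might be assembled is shown below. The template strings, the function names, and the variant keys are assumptions for illustration; the summary only states that question, context, and options are concatenated. The sketch takes "Reasoning" to mean the rationale precedes the answer and "Explanation" to mean the rationale follows it.

```python
def build_input(question: str, context: str, options: list[str]) -> str:
    """Concatenate question, context, and lettered options into one input string."""
    letters = "ABCDEFGHIJ"
    option_text = " ".join(f"({letters[i]}) {opt}" for i, opt in enumerate(options))
    return f"Question: {question}\nContext: {context}\nOptions: {option_text}"


def build_target(variant: str, answer: str, rationale: str) -> str:
    """Build the training target for one of the three variants."""
    if variant == "no_cot":
        # Answer only, no rationale.
        return f"The answer is ({answer})."
    if variant == "reasoning":
        # Rationale first, then the answer.
        return f"{rationale} The answer is ({answer})."
    if variant == "explanation":
        # Answer first, then the rationale.
        return f"The answer is ({answer}). {rationale}"
    raise ValueError(f"unknown variant: {variant}")
```

Keeping the input fixed and varying only the target makes the comparison between the three variants a matter of output format alone.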

Misleading by hallucinated rationales

CoT problem is split into two stages: rationale generation and answer inference
RougeL score and accuracy reported for rationale generation and answer inference
In 50 randomly sampled error cases, the model tends to generate hallucinated rationales that mislead answer inference
64% of error cases involve model hallucinating due to lack of reference to vision content
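
For readers unfamiliar with the rationale metric, ROUGE-L scores a generated text by its longest common subsequence (LCS) with a reference. The sketch below computes the F1 form of that score. Lowercasing, whitespace tokenization, and the choice of F1 over a recall-weighted variant are assumptions of this example; the summary does not say which implementation was used.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between a reference rationale and a generated rationale."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A high ROUGE-L score only indicates surface overlap with the reference rationale, which is why the error analysis above still matters: a rationale can overlap heavily and still contain a hallucinated detail.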

Multimodality contributes to effective rationales

Multimodal-CoT framework consists of two stages: rationale generation and answer inference
To inject vision information, a caption is appended to the input of both stages
Vision features are fused with language representations before feeding to the decoder
Vision features boost the RougeL score of rationale generation and contribute to better answer accuracy
Vision features mitigate the phenomenon of hallucination
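
One way to write such a fusion step is cross-attention followed by a learned gate. The formulation below is a sketch under that assumption. The symbols are introduced here for illustration and do not appear in the summary above: $H_{\text{language}} \in \mathbb{R}^{n \times d}$ is the encoded text, $H_{\text{vision}} \in \mathbb{R}^{m \times d}$ is the projected vision features, $W_l, W_v \in \mathbb{R}^{d \times d}$ are learned weights, and $\sigma$ is the sigmoid function.

```latex
\begin{align*}
H_{\text{vision}}^{\text{attn}} &= \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
  \qquad Q = H_{\text{language}},\quad K = V = H_{\text{vision}}, \\
\lambda &= \sigma\!\left(H_{\text{language}} W_l + H_{\text{vision}}^{\text{attn}} W_v\right), \\
H_{\text{fuse}} &= (1 - \lambda) \odot H_{\text{language}} + \lambda \odot H_{\text{vision}}^{\text{attn}}.
\end{align*}
```

Under these definitions $H_{\text{vision}}^{\text{attn}}$, $\lambda$, and $H_{\text{fuse}}$ all have shape $n \times d$, so the fused representation can be passed to the decoder in place of the text-only encoding. Because every entry of $\lambda$ lies in $(0, 1)$, the gate can fall back toward the language representation when the image carries little useful signal.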

Multimodal-CoT

Proposed Multimodal-CoT framework to incorporate vision features
Overview of procedure and technical design of model architecture

Framework overview

Multimodal-CoT consists of two training stages: rationale generation and answer inference.
Both stages share the same model architecture but differ in the input and output.
The vision-language setting is used as an example to show how Multimodal-CoT works.
In the rationale generation stage, the model is fed with language and vision inputs.
In the answer inference stage, the rationale is appended to the original language input.
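
The two-stage procedure can be sketched as a small pipeline. The `Model` callable type, the function name, and the newline used to join the rationale to the input are hypothetical. Passing the same vision features to the second stage is also an assumption, based on both stages sharing one architecture.

```python
from typing import Callable, Sequence

# Hypothetical model interface: (language input, vision features) -> generated text.
Model = Callable[[str, Sequence[float]], str]


def multimodal_cot(
    language_input: str,
    vision_features: Sequence[float],
    rationale_model: Model,
    answer_model: Model,
) -> tuple[str, str]:
    """Run rationale generation, then answer inference on the augmented input."""
    # Stage 1: generate a rationale from the language and vision inputs.
    rationale = rationale_model(language_input, vision_features)
    # Stage 2: append the rationale to the original language input, then
    # infer the answer (assumption: the vision features are reused here).
    augmented_input = f"{language_input}\n{rationale}"
    answer = answer_model(augmented_input, vision_features)
    return rationale, answer
```

Because the two stages are separate callables, each can be trained and evaluated on its own, which is what makes the training framework decoupled.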

Model architecture

Experiments

Benchmark dataset presented
Implementation of technique described
Baselines for comparison established
Results and findings reported

Dataset

Method is evaluated on ScienceQA benchmark
ScienceQA is a large-scale multimodal science question dataset
ScienceQA contains 21k questions with rich domain diversity
Dataset is split into training, validation, and test splits

Implementation

Use T5 encoder-decoder architecture (Raffel et al., 2020)
Initialize models with UnifiedQA (Khashabi et al., 2020)
Also use FLAN-T5 (Chung et al., 2022)
Do not use image captions
Fine-tune models up to 20 epochs, with a learning rate of 5e-5
Maximum input sequence length is 512
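
The stated hyperparameters can be collected in a small configuration object, as sketched below. The class name, field names, and the `truncate` helper are hypothetical; only the values (20 epochs, learning rate 5e-5, input length 512, no captions) come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters stated in the summary; field names are hypothetical."""
    max_epochs: int = 20              # "up to 20 epochs"
    learning_rate: float = 5e-5
    max_input_length: int = 512       # maximum input sequence length, in tokens
    use_image_captions: bool = False  # captions are not used


def truncate(token_ids: list[int], config: FineTuneConfig) -> list[int]:
    """Clip an already-tokenized input to the configured maximum length."""
    return token_ids[: config.max_input_length]
```

Freezing the dataclass keeps the settings from being changed by accident between the two training stages.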

Main results

Multimodal-CoT Large outperforms GPT-3.5 by 16.51 percentage points in accuracy.
Multimodal-CoT Large surpasses human performance.
Multimodal-CoT Large achieves a gain of 21.37 percentage points on questions with paired images.
Using image features is more effective than existing methods.
Decoupled training strategy contributes to superior results.

Analysis

Investigates how Multimodal-CoT works
Discusses contribution factors and limitations
Uses base-size models for the analysis

Multimodality boosts convergence

Multimodal-CoT converges more quickly than the baseline
Decoupled methods achieve higher accuracy at the beginning than single-turn baselines
Vision features help generate more effective rationales which contribute to better answer accuracy

Using different vision features

Three types of vision features are compared: CLIP, DETR, and ResNet.
Vision features generally improve model performance compared to language only baseline.
DETR performs best and is used by default in Multimodal-CoT.

General effectiveness across backbone models

Our approach is generally effective for widely-used backbone models.
The model is robust to some extent and can predict the correct answer by ignoring incorrect rationales.
Most mistakes are due to failures of understanding maps and counting numbers.
Other mistakes are due to lack of commonsense knowledge and logical mistakes.

Conclusion

Incorporating vision features helps generate more effective rationales and improves answer accuracy
Scaling the language model size can mitigate the issue of incorrect rationales, but is not as effective as using vision features
VQA baselines take the question, context, and choices as the textual input, and the image as the vision input
UnifiedQA and GPT-3.5 models are used, with CoT applied after the answer
A stronger baseline is developed to directly predict the choice
Three types of vision features are compared: CLIP, DETR, and ResNet
Manual investigation of randomly selected examples shows that the model is robust to some extent and can predict the correct answer by ignoring incorrect rationales
Most mistakes stem from failures to understand maps, count numbers in the images, or use the alphabet, and from logical errors
Prospective directions for future studies include improving the quality of CoT and applying a filtering mechanism
