Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Chain-of-Thought (CoT) prompting can improve the multi-step reasoning abilities of large language models (LLMs).
Prompting with invalid reasoning steps can achieve over 80-90% of the performance obtained using CoT with valid demonstrations.
Relevance to the query and correctly ordering the reasoning steps are important for effective CoT reasoning.

Paper Content

Introduction

LLMs can perform new tasks when prompted with a few demonstrations
Chain-of-Thought (CoT) prompting can improve LLMs' ability to do complex, multi-step reasoning
CoT prompting includes a rationale for each example, which encourages the LLM to generate its intermediate reasoning process
Recent findings show that in-context learning is different from fine-tuning/training
We study how and why CoT prompting works
We find that the validity of reasoning accounts for only a small portion of the performance
We identify and formulate other aspects of a CoT rationale
LLMs have already gained a lot of “reasoning abilities” from pretraining
Demonstrations specify an output space/format that regularizes the model generation
Evaluation scores should be interpreted in view of the prior knowledge LLMs possess

Study formulation

Chain-of-Thought rationale consists of a series of reasoning steps
Two components of a CoT rationale: Bridging objects and Language templates
Bridging objects are key and necessary objects for the model to make a successful prediction
Language templates are textual hints and relations/predicates that guide the model
Questions: Are ground truth bridging objects/language templates important? What are the key aspects needed for the LLM to reason properly?
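
The two components above can be made concrete with a small sketch. It assumes, for arithmetic rationales only, that numbers and simple equations count as bridging objects and that the remaining text is the language template. The regular expression, the `[OBJ]` placeholder, and the function name `split_rationale` are illustrative choices for this sketch, not the paper's annotation procedure.

```python
import re

# Assumption: in arithmetic rationales, numbers and simple equations
# (e.g. "32 + 42 = 74") are treated as bridging objects.
_BRIDGE = re.compile(r"\d+(?:\.\d+)?(?:\s*[-+*/=]\s*\d+(?:\.\d+)?)*")


def split_rationale(step: str):
    """Split a rationale into (bridging objects, language template).

    The template is the rationale with every bridging object replaced
    by the hypothetical placeholder "[OBJ]".
    """
    objects = _BRIDGE.findall(step)
    template = _BRIDGE.sub("[OBJ]", step)
    return objects, template


step = (
    "Originally, Leah had 32 chocolates and her sister had 42. "
    "So in total they had 32 + 42 = 74."
)
objects, template = split_rationale(step)
print(objects)
print(template)
```

For multi-hop factual QA, the bridging objects would be entities rather than numbers, so a pattern like this would not apply and annotation by hand is the more realistic route.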

Datasets & in-context exemplars

Experimented on two tasks involving multi-step reasoning: arithmetic reasoning & multi-hop factual question answering
Selected benchmarks where CoT prompting brings significant improvements
Goal is to understand how different aspects of the Chain-of-Thought rationales contribute to the performance of CoT prompting
Experimented on GSM8K and Bamboogle datasets
800 out of 1319 test examples for GSM8K and all 125 test samples for Bamboogle were used for evaluation
Prompt exemplars were edited to make the structure more consistent and reduce redundancy

Backbone language model

Used InstructGPT-175B as the backbone LLM
Also tested on text-davinci-003, an improved version

Evaluation

Evaluation of predicted rationales is usually done by assessing the correctness of the final answer (extrinsic evaluation).
This can be too strict in some cases, so intrinsic evaluation is also done to measure the Recall/F1 of bridging objects.
For GSM8K, ground truth reasoning steps are used as a proxy for bridging objects.
For Bamboogle, bridging objects are manually annotated.
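
A minimal sketch of the intrinsic metrics, assuming bridging objects are compared by exact string match as sets. The matching rule and the function name `bridging_object_scores` are assumptions made for illustration; the summary above does not specify how predicted and gold objects are aligned.

```python
def bridging_object_scores(predicted, gold):
    """Recall, precision, and F1 of predicted bridging objects.

    Assumption: objects are matched by exact string equality and
    duplicates are ignored (set semantics).
    """
    pred_set, gold_set = set(predicted), set(gold)
    overlap = len(pred_set & gold_set)
    recall = overlap / len(gold_set) if gold_set else 0.0
    precision = overlap / len(pred_set) if pred_set else 0.0
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}


gold = ["32 + 42 = 74", "74 - 35 = 39"]
predicted = ["32 + 42 = 74", "74 - 30 = 44"]
scores = bridging_object_scores(predicted, gold)
print(scores)
```

Here one of the two gold objects is recovered, so the rationale gets partial credit even though its final answer would be marked wrong under extrinsic evaluation.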

Constructing invalid chain of reasoning

We manually write rationales with invalid reasoning for all the in-context demonstration examples.
We only ablate the parts in a CoT rationale which are involved with derivations that are logically sound and helpful for answering the query.
We keep the premise steps which are copies/paraphrases of facts from the query, and change the subsequent steps such that they do not logically derive the final answer.
We apply drastic changes to both the bridging objects and language templates, so that little valid reasoning exists to help solve the query.
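
The construction above can be illustrated with a sketch that builds a few-shot prompt from one demonstration. Both rationales below are examples written for this sketch in the spirit of the description, not a verified copy of the paper's prompts. The `Demo` class, the `build_prompt` helper, the `Q:`/`A:` layout, and the choice to keep the final answer unchanged in the invalid version are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    question: str
    rationale: str
    answer: str


def build_prompt(demos, query):
    """Join demonstrations and the test query into one prompt string."""
    blocks = [
        f"Q: {d.question}\nA: {d.rationale} The answer is {d.answer}."
        for d in demos
    ]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


QUESTION = (
    "Leah had 32 chocolates and her sister had 42. "
    "If they ate 35, how many pieces do they have left in total?"
)

# Valid reasoning: each step follows from the previous one.
valid = Demo(
    QUESTION,
    "Originally, Leah had 32 chocolates and her sister had 42. "
    "So in total they had 32 + 42 = 74. "
    "After eating 35, they had 74 - 35 = 39 pieces left in total.",
    "39",
)

# Invalid reasoning: the premise sentence is kept, but the later steps
# no longer logically derive the answer.
invalid = Demo(
    QUESTION,
    "Originally, Leah had 32 chocolates and her sister had 42. "
    "So her sister had 42 - 32 = 10 chocolates more than Leah has. "
    "After eating 35, since 10 + 35 = 45, they had 45 - 6 = 39 pieces left in total.",
    "39",
)

prompt = build_prompt([invalid], "A hypothetical test question goes here.")
print(prompt)
```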

Results & analysis

Relevance and coherence are key for the performance of CoT prompting
Keeping relevance is crucial
Relevance matters more than coherence for bridging objects
Coherence of language templates is important

Ablation settings

Ablation settings based on CoT prompts
Four ablation settings, each examining one aspect (relevance or coherence) of one component (bridging objects or language templates)
Two other settings: no relevance and no coherence
Relevance ablated by randomly substituting alternatives
Coherence ablated by randomly shuffling components and permuting orderings
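
The two ablation operations can be sketched as follows, assuming a rationale has already been split into a list of components (for example, its bridging objects). The function names, the use of a pool drawn from other examples, and the seeded generator are illustrative assumptions; the exact substitution and shuffling procedure is not given in this summary.

```python
import random


def ablate_coherence(items, rng):
    """Keep the same components but randomly permute their order."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


def ablate_relevance(items, pool, rng):
    """Replace each component with a random one that does not belong
    to this example (assumption: drawn from a pool of other examples)."""
    candidates = [x for x in pool if x not in items]
    return [rng.choice(candidates) for _ in items]


rng = random.Random(0)  # seeded so the sketch is reproducible
objects = ["32 + 42 = 74", "74 - 35 = 39"]
pool = ["5 * 3 = 15", "23 - 15 = 8", "20 - 12 = 8", "8 / 4 = 2"]

no_coherence = ablate_coherence(objects, rng)
no_relevance = ablate_relevance(objects, pool, rng)
print(no_coherence)
print(no_relevance)
```

Shuffling preserves which components appear and changes only their order, while substitution preserves the number of slots and changes their content, which is what lets the two aspects be examined separately.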

Discussion

LLMs learn limited reasoning from CoT demonstrations
LLMs have already gained complex reasoning ability from pretraining
LLMs face difficulties in capturing task semantics
Learning to reason in-context is possible and could be powerful
LLMs are not good few-shot learners of reasoning
Quantifying the level of prior knowledge LLMs have is important
Ablation settings in this paper are different from Madaan and Yazdanbakhsh (2022)
This work is an attempt to empirically understand in-context learning in sequence generation problems

Conclusion

Validity of reasoning accounts for only a small portion of the performance
Relevance to input query and following order of reasoning steps are key to effectiveness of CoT prompting
Prompt exemplars were edited to make their structure more consistent and reduce redundancy
Use Recall/F1 of bridging objects as metrics for intrinsic evaluation of generated rationales
LLMs suffer less from ablations when they have more prior knowledge about the task
LLMs can effectively utilize prior knowledge to solve new problems
LLMs may overrely on prior knowledge and ignore important information in context
Performance comparison of text-davinci-002 and text-davinci-003
Correctness of bridging objects is a good indicator of quality of reasoning steps
