Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Chain-of-Thought (CoT) prompting can improve the multi-step reasoning abilities of large language models (LLMs).
  • CoT prompting can achieve over 80-90% of the performance obtained using CoT with invalid demonstrations.
  • Relevance to the query and correctly ordering the reasoning steps are important for effective CoT reasoning.

Paper Content

Introduction

  • LLMs can perform new tasks when prompted with a few demonstrations
  • Chain-of-Thought (CoT) prompting can improve LLMs ability to do complex and multi-step reasoning
  • CoT prompting includes a rationale for each example, which encourages the LLM to generate its intermediate reasoning process
  • Recent findings show that in-context learning is different from fine-tuning/training
  • We study how and why CoT prompting works
  • We find that the validity of reasoning matters only a small portion to the performance
  • We identify and formulate other aspects of a CoT rationale
  • LLMs have already gained a lot of “reasoning abilities” from pretraining
  • Demonstrations specify an output space/format that regularizes the model generation
  • Evaluation scores should be interpreted in view of the prior knowledge LLMs possess

Study formulation

  • Chain-of-Thought rationale consists of a series of reasoning steps
  • Two components of a CoT rationale: Bridging objects and Language templates
  • Bridging objects are key and necessary objects for the model to make a successful prediction
  • Language templates are textual hints and relations/predicates that guide the model
  • Questions: Are ground truth bridging objects/language templates important? What are the key aspects needed for the LLM to reason properly?

Datasets & in-context exemplars

  • Experimented on two tasks involving multi-step reasoning: arithmetic reasoning & multi-hop factual question answering
  • Selected benchmarks where CoT prompting brings significant improvements
  • Goal is to understand how different aspects of the Chain-of-Thought rationales contribute to the performance of CoT prompting
  • Experimented on GSM8K and Bamboogle datasets
  • 800 out of 1319 test examples for GSM8K and all 125 test samples for Bamboogle were used for evaluation
  • Prompt exemplars were edited to make the structure more consistent and reduce redundancy

Backbone language model

  • Used InstructGPT-175B2 as backbone LLM
  • Tested on improved version text-davinci-003

Evaluation

  • Evaluation of predicted rationales is usually done by assessing the correctness of the final answer (extrinsic evaluation).
  • This can be too strict in some cases, so intrinsic evaluation is also done to measure the Recall/F1 of bridging objects.
  • For GSM8K, ground truth reasoning steps are used as a proxy for bridging objects.
  • For Bamboogle, bridging objects are manually annotated.

Constructing invalid chain of reasoning

  • We manually write rationales with invalid reasoning for all the in-context demonstration examples.
  • We only ablate the parts in a CoT rationale which are involved with derivations that are logically sound and helpful for answering the query.
  • We keep the premise steps which are copies/paraphrases of facts from the query, and change the subsequent steps such that they do not logically derive the final answer.
  • We apply drastic changes to both the bridging objects and language templates, so that little valid reasoning exists to help solve the query.

Results & analysis

  • Relevance and coherence are key for the performance of CoT prompting
  • Keeping relevance is crucial
  • Relevance matters more than coherence for bridging objects
  • Coherence of language templates is important

Ablation settings

  • Ablation settings based on CoT prompts
  • Four ablation settings to examine one aspect of a certain component
  • Two other settings: no relevance and no coherence
  • Relevance ablated by randomly substituting alternatives
  • Coherence ablated by randomly shuffling components and permuting orderings

Discussion

  • LLMs learn limited reasoning from CoT demonstrations
  • LLMs have already gained complex reasoning ability from pretraining
  • LLMs face difficulties in capturing task semantics
  • Learning to reason in-context is possible and could be powerful
  • LLMs are not good few-shot learners of reasoning
  • Quantifying the level of prior knowledge LLMs have is important
  • Ablation settings in this paper are different from Madaan and Yazdanbakhsh (2022)
  • Understanding in-context learning in sequence generation problems is an attempt to empirically understand in-context learning

Conclusion

  • Validity of reasoning matters only a small portion to performance
  • Relevance to input query and following order of reasoning steps are key to effectiveness of CoT prompting
  • Editing to make structure more consistent and reduce redundancy
  • Use Recall/F1 of bridging objects as metrics for intrinsic evaluation of generated rationales
  • LLMs suffer less from ablations when they have more prior knowledge about the task
  • LLMs can effectively utilize prior knowledge to solve new problems
  • LLMs may overrely on prior knowledge and ignore important information in context
  • Performance comparison of text-davinci-002 and text-davinci-003
  • Correctness of bridging objects is a good indicator of quality of reasoning steps