Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Chain-of-Thought (CoT) prompting can improve the multi-step reasoning abilities of large language models (LLMs).
- Prompting with invalid reasoning steps in the demonstrations can still achieve over 80-90% of the performance obtained with standard CoT prompting.
- Relevance to the query and correctly ordering the reasoning steps are important for effective CoT reasoning.
Paper Content
Introduction
- LLMs can perform new tasks when prompted with a few demonstrations
- Chain-of-Thought (CoT) prompting can improve LLMs' ability to do complex, multi-step reasoning
- CoT prompting includes a rationale for each example, which encourages the LLM to generate its intermediate reasoning process
- Recent findings show that in-context learning is different from fine-tuning/training
- We study how and why CoT prompting works
- We find that the validity of the reasoning accounts for only a small portion of the performance
- We identify and formulate other aspects of a CoT rationale
- LLMs have already gained a lot of “reasoning abilities” from pretraining
- Demonstrations specify an output space/format that regularizes the model generation
- Evaluation scores should be interpreted in view of the prior knowledge LLMs possess
Study formulation
- Chain-of-Thought rationale consists of a series of reasoning steps
- Two components of a CoT rationale: Bridging objects and Language templates
- Bridging objects are key and necessary objects for the model to make a successful prediction
- Language templates are textual hints and relations/predicates that guide the model
- Questions: Are ground truth bridging objects/language templates important? What are the key aspects needed for the LLM to reason properly?
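To make the two components concrete, here is a minimal sketch that splits an arithmetic rationale into bridging objects and a language template. The example rationale, the regular expression, and the `split_rationale` helper are illustrative assumptions, not the paper's procedure; it treats numbers and equations as the bridging objects and everything else as the template.

```python
import re

# Hypothetical rationale, invented for illustration (not from the paper's prompts).
rationale = "Tom has 3 boxes with 4 pens each. So he has 3 * 4 = 12 pens."

# Assumption: in arithmetic reasoning, numbers and equations act as bridging objects.
NUMBER_OR_EQUATION = re.compile(r"\d+(?:\s*[-+*/=]\s*\d+)*")

def split_rationale(text):
    """Return (bridging_objects, language_template) for one rationale."""
    bridging_objects = NUMBER_OR_EQUATION.findall(text)
    # Replace every bridging object with a slot to expose the language template.
    language_template = NUMBER_OR_EQUATION.sub("___", text)
    return bridging_objects, language_template

objects, template = split_rationale(rationale)
print(objects)   # the numbers and the equation
print(template)  # the surrounding text with slots
```

For factual question answering the bridging objects would be entities rather than numbers, so a regex like this would not apply.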
Datasets & in-context exemplars
- Experimented on two tasks involving multi-step reasoning: arithmetic reasoning & multi-hop factual question answering
- Selected benchmarks where CoT prompting brings significant improvements
- Goal is to understand how different aspects of the Chain-of-Thought rationales contribute to the performance of CoT prompting
- Experimented on GSM8K and Bamboogle datasets
- 800 out of 1319 test examples for GSM8K and all 125 test samples for Bamboogle were used for evaluation
- Prompt exemplars were edited to make the structure more consistent and reduce redundancy
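As a rough illustration of how in-context exemplars are combined with a test query, the sketch below assembles a few-shot CoT prompt. The `Q:`/`A:` layout, the "The answer is" suffix, the field names, and the example content are assumptions chosen for illustration; the paper's exact prompt format is not reproduced here.

```python
def build_cot_prompt(exemplars, query):
    """Concatenate (question, rationale, answer) exemplars, then append the query."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['rationale']} The answer is {ex['answer']}."
        )
    # The final block leaves the answer open for the model to complete.
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

# Hypothetical exemplar, invented for illustration.
exemplars = [
    {
        "question": "Tom has 3 boxes with 4 pens each. How many pens does he have?",
        "rationale": "Tom has 3 boxes with 4 pens each. So he has 3 * 4 = 12 pens.",
        "answer": "12",
    }
]

prompt = build_cot_prompt(exemplars, "Ann has 5 bags with 2 apples each. How many apples?")
print(prompt)
```

Swapping the `rationale` field for an invalid or ablated version, while leaving everything else unchanged, is what lets each setting be compared against standard CoT.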
Backbone language model
- Used InstructGPT-175B as the backbone LLM
- Also tested on the improved version, text-davinci-003
Evaluation
- Evaluation of predicted rationales is usually done by assessing the correctness of the final answer (extrinsic evaluation).
- This can be too strict in some cases, so intrinsic evaluation is also done to measure the Recall/F1 of bridging objects.
- For GSM8K, ground truth reasoning steps are used as a proxy for bridging objects.
- For Bamboogle, bridging objects are manually annotated.
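The intrinsic metrics can be sketched as set overlap between the bridging objects found in a generated rationale and the annotated ones. Exact string matching over sets and the `bridging_object_scores` helper are assumptions made for this sketch; the paper's matching rules may differ.

```python
def bridging_object_scores(predicted, gold):
    """Compute precision, recall and F1 of predicted bridging objects against gold ones."""
    pred_set, gold_set = set(predicted), set(gold)
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical example: the model recovers two of the three gold objects.
scores = bridging_object_scores(predicted=["3", "4", "13"], gold=["3", "4", "12"])
print(scores)
```

A rationale can score well here even when its final answer is wrong, which is why these metrics complement the extrinsic answer-accuracy evaluation.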
Constructing invalid chain of reasoning
- We manually write rationales with invalid reasoning for all the in-context demonstration examples.
- We only ablate the parts in a CoT rationale which are involved with derivations that are logically sound and helpful for answering the query.
- We keep the premise steps which are copies/paraphrases of facts from the query, and change the subsequent steps such that they do not logically derive the final answer.
- We apply drastic changes to both the bridging objects and language templates, so that little valid reasoning exists to help solve the query.
Results & analysis
- Relevance and coherence are key for the performance of CoT prompting
- Keeping relevance is crucial
- Relevance matters more than coherence for bridging objects
- Coherence of language templates is important
Ablation settings
- Ablation settings based on CoT prompts
- Four ablation settings each examine one aspect (relevance or coherence) of one component (bridging objects or language templates)
- Two other settings ablate a whole aspect: no relevance and no coherence
- Relevance is ablated by randomly substituting components with alternatives
- Coherence is ablated by randomly shuffling components and permuting their order
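The two ablation operations can be sketched for bridging objects as follows. The helper names, the `___` slot convention, and the pool of alternatives are assumptions for illustration; the paper's actual sampling of alternatives is not reproduced here.

```python
import random

def ablate_coherence(objects, rng):
    """Keep the same objects but shuffle their order within the rationale."""
    shuffled = list(objects)
    rng.shuffle(shuffled)
    return shuffled

def ablate_relevance(objects, alternatives, rng):
    """Replace each object with a randomly chosen alternative unrelated to the query."""
    return [rng.choice(alternatives) for _ in objects]

def fill_template(template, objects, slot="___"):
    """Put objects back into the template slots, left to right."""
    out = template
    for obj in objects:
        out = out.replace(slot, obj, 1)
    return out

rng = random.Random(0)  # fixed seed so the ablation is reproducible
template = "Tom has ___ boxes with ___ pens each. So he has ___ pens."
objects = ["3", "4", "3 * 4 = 12"]
alternatives = ["7", "25", "8 - 2 = 6"]  # hypothetical pool of alternatives

print(fill_template(template, ablate_coherence(objects, rng)))
print(fill_template(template, ablate_relevance(objects, alternatives, rng)))
```

The same two operations could be applied to language templates instead of bridging objects to cover the remaining settings.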
Discussion
- LLMs learn limited reasoning from CoT demonstrations
- LLMs have already gained complex reasoning ability from pretraining
- LLMs face difficulties in capturing task semantics
- Learning to reason in-context is possible and could be powerful
- LLMs are not good few-shot learners of reasoning
- Quantifying the level of prior knowledge LLMs have is important
- The ablation settings in this paper differ from those of Madaan and Yazdanbakhsh (2022)
- This work is also a preliminary attempt to empirically understand in-context learning in sequence generation problems
Conclusion
- Validity of reasoning accounts for only a small portion of the performance
- Relevance to the input query and following the correct order of reasoning steps are key to the effectiveness of CoT prompting
- Prompt exemplars were edited to make their structure more consistent and reduce redundancy
- Use Recall/F1 of bridging objects as metrics for intrinsic evaluation of generated rationales
- LLMs suffer less from ablations when they have more prior knowledge about the task
- LLMs can effectively utilize prior knowledge to solve new problems
- LLMs may overrely on prior knowledge and ignore important information in context
- Performance comparison of text-davinci-002 and text-davinci-003
- Correctness of bridging objects is a good indicator of quality of reasoning steps