Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs have made progress in natural language processing and may exhibit reasoning abilities.
- This paper provides an overview of the current state of knowledge on reasoning in LLMs.
- Techniques, methods, benchmarks, findings, implications, and future directions are discussed.
Paper Content
Introduction
- Reasoning is a cognitive process
- It involves using evidence, arguments, and logic
- It is important in fields like psychology, philosophy, and computer science
- Large language models have made advancements in natural language processing
- These models exhibit emergent behaviors, including the ability to “reason”
- LLMs can answer questions with explicit reasoning steps
- Reasoning ability is a hallmark of human intelligence
- It is unclear whether LLMs are actually reasoning
- Different forms of reasoning may be used depending on the task
- Focus on “informal deductive reasoning” in large language models
Towards reasoning in large language models
- Reasoning is seen as a weakness in language models and other NLP models.
- Research suggests that reasoning ability may emerge in language models with over 100 billion parameters.
- This paper focuses on techniques applicable to improving or eliciting reasoning in large-scale models.
Fully supervised finetuning
- Research is being done to improve reasoning in small language models through supervised finetuning
- Fully supervised finetuning has two major limitations: difficult and time-consuming to create datasets and models are limited to a specific domain
Prompting & in-context learning
- Large language models (LLMs) can be prompted with a question and a few input, output exemplars to potentially solve a problem through “reasoning”
- LLMs still fall short when it comes to tasks that require multiple steps of reasoning to solve
- Chain of thought prompting involves providing a few examples of “chain of though” (CoT) in the prompt to LLMs
- Rationale engineering aims to more effectively elicit or utilize reasoning in LLMs
- Rationale creation & refinement involves creating more effective examples of reasoning steps
- Rationale exploration allows LLMs to explore various ways of reasoning
- Rationale verification ensures that the rationales produced by LLMs are valid
- Least-to-most prompting decomposes complex problems into subproblems and solves them in a specific order
- Dynamic least-to-most prompting is designed to solve more realistic semantic parsing problems
- Decomposed prompting breaks down a complex problem into subproblems that can be handled by a shared library of prompting-based LLMs
- Successive prompting iteratively decomposes a complex problem into a simple problem
- Selection-inference framework uses LLMs as modules to select and infer reasoning steps
- Abductive and recursive prompting is used to solve binary questions
- Numerical reasoning on complex numbers is performed by replacing the complex numbers with simple numbers
- Language model cascade is a unifying framework for understanding this line of work
Hybrid method
- Prompting techniques can help utilize reasoning in large language models, but do not improve the reasoning capabilities of the models.
- Hybrid approach aims to improve reasoning capabilities and use techniques such as prompting.
- Pretraining and finetuning LLMs on datasets with reasoning can lead to better generalization.
- Self-improvement of reasoning abilities can be achieved through bootstrapping.
Measuring reasoning in large language models
- Focus on using downstream performance on reasoning tasks as primary measurement for model’s “reasoning” ability
- Little work on directly analyzing rationales generated by models
- Summarize methods and benchmarks for evaluating reasoning abilities of LLMs
Downstream task performance
- Measure reasoning abilities of LLMs by evaluating performance on tasks
- Arithmetic Reasoning: GSM8K, Math, MathQA, SVAMP, AS-Div, AQuA, MAWPS
- Commonsense Reasoning: CSQA, StrategyQA, ARC
- Symbolic Reasoning: Last Letter Concatenation, Coin Flip
- BIG-bench, SCAN, WikiTableQA, FetaQA, CommonGen, Open Relation Modeling
Formal analysis on reasoning
- LLMs have demonstrated impressive performance on various reasoning tasks
- Most existing evaluations focus on accuracy on downstream tasks, not directly assessing reasoning steps
- Error analysis of generated rationales has been limited in depth
- Efforts to develop metrics and benchmarks to enable formal analysis of reasoning in LLMs
- Question of whether models are able to reason like humans or just achieve good performance through other means
Findings and implications
- Reasoning appears to emerge only in large language models
- Chain of thought prompts improve performance on reasoning tasks
- LLMs show human-like content effects on reasoning
- LLMs struggle with complex reasoning tasks
- Current benchmarks may not accurately gauge LLMs’ reasoning abilities
- LLMs may not be capable of robust reasoning
- Techniques like chain of thought prompting can help elicit reasoning abilities
- Finetuning with CoT data can improve reasoning
- Models can self-improve through bootstrapping their reasoning
Conclusion
- LLMs have made significant progress in natural language processing and related fields
- It is unclear to what extent LLMs are capable of true reasoning
- Further research is needed to understand LLMs’ reasoning capabilities and potential for use in applications
- This paper provides an overview of the current state of the field and encourages further discussion and research