Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • LLMs have made progress in natural language processing and may exhibit reasoning abilities.
  • This paper provides an overview of the current state of knowledge on reasoning in LLMs.
  • Techniques, methods, benchmarks, findings, implications, and future directions are discussed.

Paper Content

Introduction

  • Reasoning is a cognitive process
  • It involves using evidence, arguments, and logic
  • It is important in fields like psychology, philosophy, and computer science
  • Large language models have made advancements in natural language processing
  • These models exhibit emergent behaviors, including the ability to “reason”
  • LLMs can answer questions with explicit reasoning steps
  • Reasoning ability is a hallmark of human intelligence
  • It is unclear whether LLMs are actually reasoning
  • Different forms of reasoning may be used depending on the task
  • The paper focuses mainly on “informal deductive reasoning” in large language models

Towards reasoning in large language models

  • Reasoning is seen as a weakness in language models and other NLP models.
  • Research suggests that reasoning ability may emerge in language models with over 100 billion parameters.
  • This paper focuses on techniques applicable to improving or eliciting reasoning in large-scale models.

Fully supervised finetuning

  • Research is being done to improve reasoning in small language models through supervised finetuning
  • Fully supervised finetuning has two major limitations: datasets with explicit reasoning are difficult and time-consuming to create, and the resulting models are limited to a specific domain

Prompting & in-context learning

  • Large language models (LLMs) can be prompted with a question and a few input-output exemplars to potentially solve a problem through “reasoning”
  • LLMs still fall short when it comes to tasks that require multiple steps of reasoning to solve
  • Chain of thought (CoT) prompting involves providing a few examples of “chain of thought” reasoning in the prompt to LLMs
  • Rationale engineering aims to more effectively elicit or utilize reasoning in LLMs
  • Rationale creation & refinement involves creating more effective examples of reasoning steps
  • Rationale exploration allows LLMs to explore various ways of reasoning
  • Rationale verification ensures that the rationales produced by LLMs are valid
  • Least-to-most prompting decomposes complex problems into subproblems and solves them in a specific order
  • Dynamic least-to-most prompting is designed to solve more realistic semantic parsing problems
  • Decomposed prompting breaks down a complex problem into subproblems that can be handled by a shared library of prompting-based LLMs
  • Successive prompting iteratively decomposes a complex problem into a sequence of simple problems
  • Selection-inference framework uses LLMs as modules to select and infer reasoning steps
  • Abductive and recursive prompting is used to solve binary questions
  • Numerical reasoning on complex (that is, hard-to-compute) numbers is performed by first replacing them with simple numbers
  • Language model cascade is a unifying framework for understanding this line of work
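
As a rough illustration of two of the ideas above, the sketch below builds a few-shot chain of thought prompt and then aggregates several sampled completions by majority vote, which is one common way to realize rationale exploration (often called self-consistency). The exemplar, the "The answer is X." convention, and all function names are assumptions made for this example, not details from the paper; the completions are passed in as plain strings because no particular model API is assumed.

```python
import re
from collections import Counter

# Hypothetical worked exemplar: (question, reasoning steps, final answer).
EXEMPLARS = [
    (
        "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5.",
        "5",
    ),
]


def build_cot_prompt(question, exemplars=EXEMPLARS):
    """Prepend worked exemplars so the model can imitate step-by-step reasoning."""
    parts = [
        f"Q: {q}\nA: {rationale} The answer is {answer}."
        for q, rationale, answer in exemplars
    ]
    parts.append(f"Q: {question}\nA:")  # left open for the model to complete
    return "\n\n".join(parts)


def extract_answer(completion):
    """Return the text after 'The answer is' at the end of a completion, or None."""
    match = re.search(r"The answer is\s+(.+?)\.?\s*$", completion.strip())
    return match.group(1) if match else None


def majority_vote(completions):
    """Pick the most frequent final answer among several sampled rationales."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

In use, `build_cot_prompt(...)` would be sent to a model several times with sampling enabled, and the resulting completions passed to `majority_vote`. Rationale verification would replace the simple vote with a check on each rationale.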

Hybrid method

  • Prompting techniques can help utilize reasoning in large language models, but do not improve the reasoning capabilities of the models.
  • The hybrid approach aims to improve the reasoning capabilities of LLMs while also making better use of them through techniques such as prompting.
  • Pretraining and finetuning LLMs on datasets with reasoning can lead to better generalization.
  • Self-improvement of reasoning abilities can be achieved through bootstrapping.
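
One way to read the bootstrapping idea is as a loop: the model generates rationales, only those that reach a known correct answer are kept, and the kept examples become finetuning data for the next round. The sketch below covers just the filtering step of a single round. The function names, the "Answer:" convention, and the canned `toy_generate` stand-in for a model are all hypothetical, and the finetuning step itself is deliberately left out.

```python
def bootstrap_round(problems, generate, extract_answer):
    """Keep self-generated rationales whose final answer matches the gold answer.

    problems: iterable of (question, gold_answer) pairs.
    generate: callable mapping a question to a rationale string (a model stand-in).
    extract_answer: callable mapping a rationale to its final answer string.
    """
    kept = []
    for question, gold in problems:
        rationale = generate(question)
        if extract_answer(rationale) == gold:
            kept.append({"question": question, "rationale": rationale, "answer": gold})
    return kept


def toy_generate(question):
    """Stand-in for a sampled LLM completion; canned strings for illustration only."""
    canned = {
        "2 + 2": "Adding 2 and 2 gives 4. Answer: 4",
        "3 * 3": "3 plus 3 is 6. Answer: 6",  # faulty rationale, wrong answer
    }
    return canned[question]


def final_answer(rationale):
    """Take whatever follows the last 'Answer:' marker as the final answer."""
    return rationale.rsplit("Answer:", 1)[-1].strip()
```

Calling `bootstrap_round([("2 + 2", "4"), ("3 * 3", "9")], toy_generate, final_answer)` keeps only the first problem. Note that filtering on the final answer alone can still admit rationales that are wrong but happen to land on the right answer.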

Measuring reasoning in large language models

  • Focus on using downstream performance on reasoning tasks as primary measurement for model’s “reasoning” ability
  • Little work on directly analyzing rationales generated by models
  • Summarize methods and benchmarks for evaluating reasoning abilities of LLMs

Downstream task performance

  • Measure reasoning abilities of LLMs by evaluating performance on tasks
  • Arithmetic Reasoning: GSM8K, MATH, MathQA, SVAMP, ASDiv, AQuA, MAWPS
  • Commonsense Reasoning: CSQA, StrategyQA, ARC
  • Symbolic Reasoning: Last Letter Concatenation, Coin Flip
  • Other benchmarks: BIG-bench, SCAN, WikiTableQA, FeTaQA, CommonGen, Open Relation Modeling
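
To make the symbolic reasoning tasks and the downstream measurement concrete, the sketch below generates toy instances of Last Letter Concatenation and Coin Flip and scores predictions by exact match. The question wording, the helper names, and the case-insensitive matching rule are assumptions for illustration; the published benchmarks define their own formats and scoring.

```python
import random


def last_letter_concatenation(name):
    """Gold answer: the last letter of each word, concatenated in order."""
    return "".join(word[-1] for word in name.split())


def coin_flip_example(people, rng):
    """Build one Coin Flip question and its gold answer.

    The coin starts heads up; it is still heads up exactly when the number
    of flips is even.
    """
    flips = [rng.random() < 0.5 for _ in people]
    steps = [
        f"{person} {'flips' if flipped else 'does not flip'} the coin."
        for person, flipped in zip(people, flips)
    ]
    question = "A coin is heads up. " + " ".join(steps) + " Is the coin still heads up?"
    answer = "yes" if sum(flips) % 2 == 0 else "no"
    return question, answer


def exact_match_accuracy(predictions, golds):
    """Fraction of predictions equal to the gold answer, ignoring case and outer spaces."""
    if not golds or len(predictions) != len(golds):
        raise ValueError("predictions and golds must be non-empty and equally long")
    correct = sum(
        p.strip().lower() == g.strip().lower() for p, g in zip(predictions, golds)
    )
    return correct / len(golds)
```

For instance, `last_letter_concatenation("Amy Brown")` gives `"yn"`. An accuracy computed this way reflects only final answers and says nothing about whether the intermediate reasoning was valid.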

Formal analysis on reasoning

  • LLMs have demonstrated impressive performance on various reasoning tasks
  • Most existing evaluations focus on accuracy on downstream tasks, not directly assessing reasoning steps
  • Error analysis of generated rationales has been limited in depth
  • Efforts to develop metrics and benchmarks to enable formal analysis of reasoning in LLMs
  • Question of whether models are able to reason like humans or just achieve good performance through other means

Findings and implications

  • Reasoning appears to emerge only in large language models
  • Chain of thought prompts improve performance on reasoning tasks
  • LLMs show human-like content effects on reasoning
  • LLMs struggle with complex reasoning tasks
  • Current benchmarks may not accurately gauge LLMs’ reasoning abilities
  • LLMs may not be capable of robust reasoning
  • Techniques like chain of thought prompting can help elicit reasoning abilities
  • Finetuning with CoT data can improve reasoning
  • Models can self-improve through bootstrapping their reasoning

Conclusion

  • LLMs have made significant progress in natural language processing and related fields
  • It is unclear to what extent LLMs are capable of true reasoning
  • Further research is needed to understand LLMs’ reasoning capabilities and potential for use in applications
  • This paper provides an overview of the current state of the field and encourages further discussion and research