Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs have made progress in natural language processing and may exhibit reasoning abilities.
This paper provides an overview of the current state of knowledge on reasoning in LLMs.
Techniques, methods, benchmarks, findings, implications, and future directions are discussed.

Paper Content

Introduction

Reasoning is a cognitive process
It involves using evidence, arguments, and logic
It is important in fields like psychology, philosophy, and computer science
Large language models have made advancements in natural language processing
These models exhibit emergent behaviors, including the ability to “reason”
LLMs can answer questions with explicit reasoning steps
Reasoning ability is a hallmark of human intelligence
It is unclear whether LLMs are actually reasoning
Different forms of reasoning may be used depending on the task
This survey focuses on “informal deductive reasoning” in large language models

Towards reasoning in large language models

Reasoning is seen as a weakness in language models and other NLP models.
Research suggests that reasoning ability may emerge in language models with over 100 billion parameters.
This paper focuses on techniques applicable to improving or eliciting reasoning in large-scale models.

Fully supervised finetuning

Research is being done to improve reasoning in small language models through supervised finetuning
Fully supervised finetuning has two major limitations: datasets containing explicit reasoning are difficult and time-consuming to create, and the resulting models are limited to a specific domain

Prompting & in-context learning

Large language models (LLMs) can be prompted with a question and a few (input, output) exemplars to potentially solve a problem through “reasoning”
LLMs still fall short when it comes to tasks that require multiple steps of reasoning to solve
Chain of thought prompting involves providing a few examples of “chain of thought” (CoT) in the prompt to LLMs
Rationale engineering aims to more effectively elicit or utilize reasoning in LLMs
Rationale creation & refinement involves creating more effective examples of reasoning steps
Rationale exploration allows LLMs to explore various ways of reasoning
Rationale verification ensures that the rationales produced by LLMs are valid
Least-to-most prompting decomposes complex problems into subproblems and solves them in a specific order
Dynamic least-to-most prompting is designed to solve more realistic semantic parsing problems
Decomposed prompting breaks down a complex problem into subproblems that can be handled by a shared library of prompting-based LLMs
Successive prompting iteratively decomposes a complex problem into simple subproblems, with each new subproblem able to use the answers to the previous ones
Selection-inference framework uses LLMs as modules to select and infer reasoning steps
Abductive and recursive prompting is used to solve binary questions
Numerical reasoning over complicated numbers is performed by first replacing them with simple numbers to obtain a simpler expression, then applying that expression to the original numbers
Language model cascade is a unifying framework for understanding this line of work
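To make the chain of thought and rationale exploration items above concrete, here is a minimal Python sketch. It assumes a "Q: ... A: ... The answer is X." prompt layout and takes a majority vote over several sampled completions, which is the idea behind self-consistency. The exemplar, the function names, and the sample completions are all hypothetical and are not taken from the paper; no real model is called.

```python
import re
from collections import Counter

# Hypothetical exemplar: (question, rationale, answer). Illustrative only.
EXEMPLARS = [
    ("Tom has 3 apples and buys 2 more. How many apples does he have?",
     "Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5.",
     "5"),
]


def build_cot_prompt(question, exemplars=EXEMPLARS):
    """Format few-shot exemplars with rationales, then append the new question."""
    parts = [
        f"Q: {q}\nA: {rationale} The answer is {answer}."
        for q, rationale, answer in exemplars
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def extract_answer(completion):
    """Return the text after a trailing 'The answer is', or None if absent."""
    match = re.search(r"The answer is\s+(.+?)\.?\s*$", completion.strip())
    return match.group(1) if match else None


def majority_vote(completions):
    """Return the most common final answer across sampled rationales."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Stand-ins for completions sampled from a model (assumed, not real output).
samples = [
    "There are 4 + 3 = 7 pens. The answer is 7.",
    "4 pens plus 3 pens is 7 pens. The answer is 7.",
    "4 times 3 is 12. The answer is 12.",
]
print(build_cot_prompt("A box holds 4 pens and 3 more are added. How many pens?"))
print(majority_vote(samples))
```

The vote ignores the rationales themselves and only compares final answers, so one faulty reasoning path is outvoted when most paths agree.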

Hybrid method

Prompting techniques can help elicit or better utilize reasoning in large language models, but they do not improve the models’ reasoning capabilities, because the model parameters remain unchanged.
The hybrid approach aims to both improve the reasoning capabilities of LLMs and make better use of them through techniques such as prompting.
Pretraining and finetuning LLMs on datasets with reasoning can lead to better generalization.
Self-improvement of reasoning abilities can be achieved through bootstrapping.
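One way to read the bootstrapping item above is as a filter-then-finetune loop: the model generates a rationale for each question, only rationales whose final answer matches a known label are kept, and the kept examples become finetuning data. The Python sketch below shows just the filtering step under that assumption. `toy_generate` is a hypothetical stand-in for a model call, and the finetuning step is deliberately left out.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]            # (question, gold answer)
Generated = Tuple[str, str]          # (rationale, predicted answer)
Kept = Tuple[str, str, str]          # (question, rationale, answer)


def bootstrap_round(generate: Callable[[str], Generated],
                    dataset: List[Example]) -> List[Kept]:
    """Keep self-generated rationales whose final answer matches the label."""
    kept = []
    for question, gold in dataset:
        rationale, predicted = generate(question)
        if predicted == gold:
            kept.append((question, rationale, predicted))
    return kept


def toy_generate(question: str) -> Generated:
    """Hypothetical model stub: adds the integers in 'a + b', but is
    deliberately off by one whenever the first operand is odd."""
    a, b = (int(x) for x in question.split("+"))
    total = a + b if a % 2 == 0 else a + b + 1
    return f"{a} plus {b} is {total}.", str(total)


dataset = [("2 + 3", "5"), ("3 + 4", "7"), ("10 + 1", "11")]
kept = bootstrap_round(toy_generate, dataset)
for question, rationale, answer in kept:
    print(question, "|", rationale, "|", answer)
# In a full loop, the model would be finetuned on `kept` and the round repeated.
```

Filtering on the final answer is a cheap proxy for rationale quality; it can still admit a flawed rationale that happens to reach the right answer.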

Measuring reasoning in large language models

Existing work mainly uses downstream performance on reasoning tasks as the primary measure of a model’s “reasoning” ability
Little work directly analyzes the rationales generated by models
This section summarizes methods and benchmarks for evaluating the reasoning abilities of LLMs

Downstream task performance

Measure reasoning abilities of LLMs by evaluating performance on tasks
Arithmetic Reasoning: GSM8K, MATH, MathQA, SVAMP, ASDiv, AQuA, MAWPS
Commonsense Reasoning: CSQA, StrategyQA, ARC
Symbolic Reasoning: Last Letter Concatenation, Coin Flip
Other benchmarks: BIG-bench, SCAN, WikiTableQA, FeTaQA, CommonGen, Open Relation Modeling
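As an illustration of measuring downstream task performance, the Python sketch below computes reference answers for the two symbolic tasks named above and scores predictions by exact match. The task definitions follow their usual descriptions (concatenate the last letter of each word; track whether a coin that starts heads up is still heads up). The function names and the case-insensitive matching rule are assumptions made for this example, not a benchmark's official scorer.

```python
def last_letter_concatenation(words):
    """Reference answer: join the last letter of each word."""
    return "".join(word[-1] for word in words)


def coin_flip(flips):
    """Reference answer for 'Is the coin still heads up?'.

    The coin starts heads up; each True entry means someone flips it.
    """
    heads_up = True
    for flipped in flips:
        if flipped:
            heads_up = not heads_up
    return "yes" if heads_up else "no"


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to the reference, ignoring case and
    surrounding whitespace (an assumed normalization)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


references = [
    last_letter_concatenation(["Elon", "Musk"]),
    coin_flip([True, False]),
    coin_flip([True, True]),
]
predictions = ["nk", "No", "no"]  # hypothetical model outputs
print(references)
print(exact_match_accuracy(predictions, references))
```

Exact match scores only the final answer, so it says nothing about whether the intermediate reasoning steps were valid.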

Formal analysis on reasoning

LLMs have demonstrated impressive performance on various reasoning tasks
Most existing evaluations focus on accuracy on downstream tasks, not directly assessing reasoning steps
Error analysis of generated rationales has been limited in depth
Efforts are under way to develop metrics and benchmarks that enable formal analysis of reasoning in LLMs
It remains an open question whether models reason like humans or achieve good performance through other means

Findings and implications

Reasoning appears to emerge only in large language models
Chain of thought prompts improve performance on reasoning tasks
LLMs show human-like content effects on reasoning
LLMs struggle with complex reasoning tasks
Current benchmarks may not accurately gauge LLMs’ reasoning abilities
LLMs may not be capable of robust reasoning
Techniques like chain of thought prompting can help elicit reasoning abilities
Finetuning with CoT data can improve reasoning
Models can self-improve through bootstrapping their reasoning

Conclusion

LLMs have made significant progress in natural language processing and related fields
It is unclear to what extent LLMs are capable of true reasoning
Further research is needed to understand LLMs’ reasoning capabilities and potential for use in applications
This paper provides an overview of the current state of the field and encourages further discussion and research
