Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Language models (LMs) can be augmented with reasoning skills and the ability to use tools.
  • Augmentations can be used separately or in combination.
  • Augmented LMs (ALMs) can use external modules to expand their context processing ability.
  • ALMs can learn to reason, use tools, and act while still performing standard natural language tasks.
  • ALMs have the potential to address limitations of traditional LMs.

Paper Content


  • Reasoning is the ability to make inferences using evidence and logic
  • Reasoning can be divided into multiple types of skills
  • Reasoning often involves deductions from inference chains
  • Previous work has shown that LLMs can solve simple reasoning problems but fail at complex reasoning
  • The challenge for LMs on complex reasoning problems is to obtain the solution by correctly composing the answers to the sub-problems
  • The paper discusses works related to three popular paradigms for eliciting reasoning in LMs

Recursive prompting

  • Problem decomposition can be used to solve complex tasks
  • Problem decomposition can be used to solve sub-problems independently or sequentially
  • Least-to-most prompting decomposes complex problems into sub-problems
  • Recent works employ in-context learning to decompose problems
  • Different works use different methods to decompose problems
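As a minimal sketch of sequential problem decomposition in the spirit of least-to-most prompting, the loop below answers sub-problems in order and feeds each earlier answer back into the next prompt. The functions `toy_decompose` and `toy_answer` are hypothetical stand-ins for LM calls and are not part of any cited work.

```python
from typing import Callable, List


def least_to_most(question: str,
                  decompose: Callable[[str], List[str]],
                  answer: Callable[[str], str]) -> str:
    """Answer sub-questions in order, appending each Q/A pair to the context."""
    context = ""
    result = ""
    for sub_question in decompose(question):
        prompt = f"{context}Q: {sub_question}\nA:"
        result = answer(prompt)
        context += f"Q: {sub_question}\nA: {result}\n"
    return result  # answer to the last (hardest) sub-question


# Hypothetical stand-in for an LM that decomposes the problem.
def toy_decompose(question: str) -> List[str]:
    return ["How many apples does Anna have?",
            "How many apples do they have together?"]


# Hypothetical stand-in for an LM that answers one sub-question.
# Assumed problem: Elsa has 5 apples and Anna has 2 more than Elsa.
def toy_answer(prompt: str) -> str:
    last_question = prompt.rsplit("Q: ", 1)[1]
    if "together" in last_question:
        # Reuse the earlier answer stored in the context.
        anna = int(prompt.split("A: ")[1].split("\n")[0])
        return str(anna + 5)
    return str(5 + 2)


question = ("Elsa has 5 apples. Anna has 2 more apples than Elsa. "
            "How many apples do they have together?")
print(least_to_most(question, toy_decompose, toy_answer))
```

The second sub-question can only be answered because the first answer is present in the prompt, which is the point of solving sub-problems sequentially rather than independently.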

Explicitly teaching language models to reason

  • Prompting approaches require model scale and need to discover prompts that elicit multi-step computation tasks.
  • With scratchpads (Nye et al., 2021), the LM is fine-tuned on example tasks with their associated computation steps.
  • Taylor et al. (2022) use a similar approach in the context of LM pre-training.
  • Taylor et al. (2022) use a special token to mimic an internal working memory.
  • Zelikman et al. (2022), Yu et al. (2022), Ouyang et al. (2022), Chung et al. (2022), Iyer et al. (2022), and Ho et al. (2022) use instruction fine-tuning to improve reasoning skills.
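To make the scratchpad idea concrete, the sketch below builds one training example in which the intermediate computation steps of an addition are written out before the final answer. The text layout (`Input:`, `Scratchpad:`, `Target:`) is an illustrative assumption, not the exact format used in the cited papers, and the function only handles non-negative integers.

```python
def addition_scratchpad(a: int, b: int) -> str:
    """Write out digit-by-digit addition with carries as a training example."""
    lines = [f"Input: {a} + {b}", "Scratchpad:"]
    x, y = str(a)[::-1], str(b)[::-1]  # least significant digit first
    carry = 0
    digits = []
    for i in range(max(len(x), len(y))):
        da = int(x[i]) if i < len(x) else 0
        db = int(y[i]) if i < len(y) else 0
        total = da + db + carry
        lines.append(f"{da} + {db} + carry {carry} = {total}, "
                     f"write {total % 10}, carry {total // 10}")
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    lines.append(f"Target: {''.join(reversed(digits))}")
    return "\n".join(lines)


print(addition_scratchpad(29, 57))
```

A model fine-tuned on such examples sees the computation steps at training time and is expected to produce them before the answer at test time.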

Comparison and limitations of abstract reasoning

  • Reasoning can be seen as breaking down a problem into smaller parts.
  • Exploring all possible reasoning paths is difficult and there is no guarantee that the steps are valid.
  • A reasoning language model seeks to improve its context to increase the chance of a correct answer.
  • Mistakes on mathematical operations or known facts can lead to wrong output.
  • External tools such as search engines and calculators can be used to validate intermediate steps.
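One way to picture a calculator validating intermediate steps is sketched below: arithmetic claims of the form `a op b = c` are extracted from a reasoning chain and recomputed. The regular expression and the chain format are assumptions made for illustration; real reasoning traces would need more robust parsing.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}
NUM = r"(-?\d+(?:\.\d+)?)"
STEP = re.compile(NUM + r"\s*([+\-*/])\s*" + NUM + r"\s*=\s*" + NUM)


def check_steps(chain: str):
    """Return (expression, claimed, actual) for every wrong arithmetic step."""
    errors = []
    for a, op, b, claimed in STEP.findall(chain):
        if op == "/" and float(b) == 0:
            continue  # skip undefined divisions instead of raising
        actual = OPS[op](float(a), float(b))
        if abs(actual - float(claimed)) > 1e-9:
            errors.append((f"{a} {op} {b}", float(claimed), actual))
    return errors


chain = ("There are 12 boxes with 15 pens each, so 12 * 15 = 170 pens. "
         "Giving away 20 leaves 170 - 20 = 150.")
print(check_steps(chain))
```

Here the second step is internally consistent, yet it inherits the error from the first one, which shows how a single wrong intermediate step can corrupt the final output.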

Using tools and act

  • A recent line of LM research allows models to access knowledge not stored in their weights
  • An LM can query external modules such as a Python interpreter or a search engine

Calling another model

  • Tool can be another neural network or the LM itself
  • Iteratively refine output by repeatedly calling the model
  • Re3 generates stories of over two thousand words
  • Re3 first generates a plan, a setting, and characters
  • Re3 then injects information from the plan and the story state into a new GPT3 prompt
  • A learned detailed outliner expands a brief initial outline
  • PEER model initialized from LM-Adapted T5 and trained on Wikipedia edits
  • Models can be chained together to refine complex tasks
  • Models can also leverage other modalities, such as vision, alongside language
  • Flamingo models trained on large-scale multimodal web corpora
  • Socratic Models allow models to exchange information and acquire new multimodal capabilities
  • Images can be incorporated to improve reasoning capabilities of moderate size LMs
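The idea of iteratively refining an output by repeatedly calling a model can be sketched as a critique-and-revise loop. Both `toy_critique` and `toy_revise` are hypothetical rule-based stand-ins for model calls; in a real system each would be a call to an LM, possibly the same one.

```python
from typing import Callable, Optional


def refine(draft: str,
           critique: Callable[[str], Optional[str]],
           revise: Callable[[str, str], str],
           max_rounds: int = 3) -> str:
    """Alternate critique and revision until no issue is reported."""
    for _ in range(max_rounds):
        feedback = critique(draft)
        if feedback is None:  # nothing left to fix
            break
        draft = revise(draft, feedback)
    return draft


# Hypothetical stand-in for a model that reviews the draft.
def toy_critique(draft: str) -> Optional[str]:
    if not draft[0].isupper():
        return "capitalize"
    if not draft.endswith("."):
        return "add period"
    return None


# Hypothetical stand-in for a model that applies the feedback.
def toy_revise(draft: str, feedback: str) -> str:
    if feedback == "capitalize":
        return draft[0].upper() + draft[1:]
    return draft + "."


print(refine("the model calls itself", toy_critique, toy_revise))
```

The `max_rounds` cap is a design choice of this sketch: it bounds the cost when the critic never becomes satisfied.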

Information retrieval

  • LMs can be augmented with memory units to improve reasoning abilities
  • Knowledge can be offloaded from the LM by retrieving from an external source
  • Memory augmentation strategies help the LM avoid producing non-factual and out-of-date information
  • Two types of retrievers exist: dense and sparse
  • Various works augment LMs with a dense retriever by appending the retrieved documents to the current context
  • He et al. (2022) and Trivedi et al. (2022) combine a retriever with reasoning via chain-of-thoughts prompting
  • LaMDA and BlenderBot are agent-like LMs designed for dialogue applications
  • ReAct interleaves reasoning and acting
  • WebGPT and WebShop are LM-based agents that can interact with a custom text-based web-browsing environment
  • Most works on web navigation and computer-control assume the typical human interface

Computing via symbolic modules and code interpreters

  • LMs are prone to errors when dealing with large numbers or complex arithmetic
  • GPT3 cannot perform out-of-distribution addition
  • In reinforcement learning settings, the agent's action space can be equipped with symbolic modules
  • Mind’s Eye uses a physics engine to ground LMs' physical reasoning
  • PAL uses CoT prompting and Python code to decompose tasks

Acting on the virtual and physical world

  • LMs can be used to control virtual and physical agents in simulated and real-world environments
  • LMs can be used to represent goals and plans, and improve learning and generalization on tasks beyond language processing
  • LMs can be used to break down high-level tasks into a series of simple commands
  • LMs can be used to write robot policy code given natural language commands
  • LMs can encode common sense knowledge about the world
  • LMs lack contextual grounding, making it difficult to use them for decision making in the real world
  • NLMap-SayCan is a framework to gather and integrate contextual information into LM planners
  • RT-1 leverages large-scale, diverse, task-agnostic robotic datasets to learn a model that can follow natural language instructions

Learning to reason, use tools, and act

  • LMs can be augmented with reasoning and tools
  • Several approaches exist to teach LMs to reason and use tools


  • Teaching LMs to reason and act can be done by providing them with human-written demonstrations
  • Common ways of doing this are few-shot prompting and regular gradient-based learning
  • Supervised learning is usually done after pre-training with a language modeling objective
  • Taylor et al. (2022) propose to mix pre-training texts with human-annotated examples containing explicit reasoning
  • Some authors use supervised fine-tuning followed by reinforcement learning from human feedback
  • Few-shot prompting is common for teaching LMs to reason and act
  • Performance depends on the format of examples, the choice of few-shot examples, and the order in which they are presented
  • Bootstrapping combines data efficiency of few-shot prompting with advantages of fine-tuning
  • Bootstrapping can be applied to teach models to reason and use tools

Reinforcement learning

  • Supervised learning from human-created demonstrations is effective for teaching models to reason and act, but such data is difficult and costly to obtain.
  • Human preference data (rankings or likes/dislikes) is easier, faster, and cheaper to obtain than full demonstrations.
  • Reinforcement Learning (RL) has proven successful for learning complex behaviors through feedback-based interaction with an environment.
  • RL is a natural framework for training LMs to act and use tools since many of these tools are nondifferentiable.
  • Most existing work on RL and ALMs has focused on teaching LMs how to act rather than reason.
  • Hard-coded reward functions are used to update the weights of the model using a scalar reward generated by a task-dependent function.
  • RL has been used to teach an LM to search and fetch additional factual information, navigate a virtual shop, and interface with a graph-based knowledge base.
  • Human feedback can be used to improve the quality of machine-generated text.
  • RLHF (Reinforcement Learning from Human Feedback) uses human preferences as an evaluation metric and as an objective function to optimize the language model.
  • RLHF has been applied directly on top of a general-purpose LM pre-trained via self-supervised learning, and after an initial supervised fine-tuning phase.
  • RLHF has been used to teach an LM to use an external tool (e.g. search engine, web-browser, information retrieval module).
  • RLHF has also proven useful for a wide range of language generation tasks.
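To illustrate how human preference rankings can become an objective function, the sketch below shows a pairwise loss that is commonly used to train a reward model from comparisons: it is small when the preferred output receives the higher reward. This particular loss is an assumption about one typical setup, not a description of every RLHF system, and the scalar rewards are placeholders for the outputs of a learned model.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))


# Placeholder rewards for a pair where a human preferred the first output.
print(preference_loss(2.0, 0.0))  # reward model agrees with the human
print(preference_loss(0.0, 2.0))  # reward model disagrees: larger loss
```

Once trained, such a reward model supplies the scalar signal that the RL step optimizes, in place of a hard-coded reward function.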

Limitations and future directions

  • RL methods have seen recent algorithmic progress and performance improvements
  • Instability issues can still make RL training difficult and slow
  • Supervised learning is an efficient and robust way to fine-tune language models
  • However, supervised learning assumes the existence of a large number of expert demonstrations
  • Bootstrapping methods and offline RL combine “the best of both worlds”
  • ILQL, an offline RL method, combines elements of online RL and supervised learning
  • Toolformer teaches itself to use tools in a self-supervised way
  • Text used for supervision can lack context
  • ALMs can access recent information from the external world
  • There is a tradeoff between memorizing information in the weights and querying tools
  • The non-parametric framework can be generalized
  • ALMs can be seen as instantiating the concept of an autonomous intelligent agent
  • ALMs offer potential benefits in truthfulness, uncertainty estimation and reduction, and interpretability