Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recent advancements in LLM agents have shown impressive performance
  • Implementing these methods can be challenging due to lack of data or well-defined state space
  • Reflexion is an approach that endows an agent with dynamic memory and self-reflection capabilities
  • Heuristic introduced to enable agent to pinpoint hallucination instances, avoid repetition and construct memory map
  • Evaluated in AlfWorld and HotPotQA environments with success rates of 97% and 51% respectively

Paper Content

Introduction

  • Mastering decision-making and knowledge-intensive search tasks is important for natural language agents
  • LLMs have achieved impressive results on various benchmarks
  • Grounding complex tasks in natural language helps agents avoid false-negative errors
  • Learning optimal policies for natural language RL agents is challenging due to vast and mostly unbound state spaces
  • Several decision-making approaches have been proposed to enable natural language agents to select their next action
  • Chain-of-thought reasoning leverages emergent properties to solve tasks in a single action
  • ReAct utilizes emergent properties to solve problems
  • Several recent works have aimed to allow natural language agents to exhibit reflective-like qualities
  • DEPS uses multi-step reasoning and sub-task error correction to solve long-range tasks
  • Huang et al. use inner monologue and success detection to allow agents to reason and act
  • Haluptzok et al. use self-generated solutions to fine-tune an LLM
  • Human-in-the-loop approaches have been used to improve performance
  • LLMs possess an emergent property of self-reflection
  • Reflexion agent uses self-reflection to learn from its own mistakes
  • Reflexion agent achieves improved performance on AlfWorld and HotPotQA benchmarks

Architecture

  • Reflexion agent uses ReAct decision-making approach
  • Output is constrained to binary success status
  • Agent computes heuristic to suggest self-reflection
  • Agent queries LLM for next action
  • Maximum of three reflections stored in agent’s memory

Setup

  • Agent is tasked to solve a problem by executing actions to learn from observations within an environment
  • At each time step, agent receives an observation and executes an action based on its current policy
  • Context is given to the agent based on its current state and trajectory history
  • Policy is not learned over a state space
  • Aim to constrain specific reward information given to the agent
  • Agent is equipped with a heuristic function to detect common modes of failure

Heuristics

  • Heuristic h is defined to tell the agent when to reflect
  • Heuristic h takes into account the current state, hyperparameters, and trajectory history
  • Hyperparameter Ω is set to detect hallucination of repeated consecutive actions
  • Hyperparameter ε restricts the maximum number of actions allowed in an environment per trial
  • Heuristic h replaces the role of a human-in-the-loop

Reflexion

  • Agent initiates self-reflection process on current state, reward, actions/observations, and existing memory
  • Model used for self-reflection is an LLM prompted with two-shot learning examples of failed/ideal reflection pairs
  • Agent not granted access to domain-specific solutions to encourage creative/novel techniques
  • Self-reflection modeled in equation
  • Reflection added to agent’s memory, environment reset, next trial starts

Reward model

  • Binary reward model assigns 0 or 1 to an action taken by the agent
  • Agent’s knowledge is limited to binary success status in environment
  • Binary reward models applicable to language problems such as code generation and debugging
  • Agent learns faster than random action sampling in arbitrary environments

Action space

  • Defining a natural language action space for large language models is difficult due to the vast number of possible actions.
  • There is a trade-off between minimizing false positives and false negatives.
  • The action space chosen is a combination of actions and language.
  • False negatives are handled by responding with “Invalid Action” observations.
  • Few-shot prompting is used to demonstrate permissible action syntax.

Evaluation

Alfworld

  • AlfWorld is a suite of text-based environments
  • Agent can choose from a list of admissible actions
  • Agent receives an observation and reward from the environment
  • Reward is filtered to a binary reward
  • Six different tasks and over 3000 environments
  • ReAct problem solving strategy is implemented
  • ReAct allows agent to reason and act
  • Action space in each state is not explicitly defined
  • Hand-selected few-shot trajectories used
  • 134 AlfWorld environments tested
  • Reflexion agent outperformed base ReAct agent by 20%

Hotpotqa

  • HotPotQA is a dataset with 113k question-and-answer pairs.
  • Agents must search Wikipedia to answer questions.
  • EM is used to grade the agent’s answers.
  • The agent is given six few-shot examples of successful trajectories.
  • The agent can perform three actions: search, lookup, and finish.
  • The agent can self-reflect and reset the environment if it provides an incorrect answer.

Discussion

  • Reflexion promotes learning through discovery
  • Agent demonstrates unfamiliarity with environment in first trial
  • Agent recognizes flaw in initial plan after reflecting
  • Agent remembers desk lamp was on desk 1
  • Agent devises more accurate plan to find mug
  • Agent succeeds after 1 reflection, ReAct-only approach takes 6 trials

Hallucination vs. inefficient planning

  • Hallucination is the most common reason for failure in AlfWorld runs.
  • Hallucination is defined as two or more consecutive identical actions with the same environment response.
  • Inefficient planning is defined as executing more than 30 actions without reaching a successful state.
  • With reflection, the agent can correct all mistakes related to inefficient planning.

Reflexion enables more intuitive search queries

  • A key component to an agent’s success is its ability to form intuitive search queries
  • An agent can learn from its past mistakes
  • Reflexion is an approach designed to promote discovery for problem-solving LLM-agents
  • A Reflexion agent must have access to a heuristic for termination and a binary reward model
  • The heuristic can be designed to match actions that are associated with high probabilities of failure
  • In the AlfWorld experiment, a simple heuristic was used to demonstrate the applicability of the approach
  • In other experiments, a more specific heuristic could be applied
  • The binary reward model can be a compilation or type-check attempt
  • In a capture-the-flag task, the possession status of the flag could be the binary reward model
  • Reflexion served as a redirection mechanism in some cases and as a summarization tool in others
  • In subsequent trials, the agent did not need to search every location to find an object
  • Reflexion is needed to ensure that the query length does not exceed the maximum number of tokens while still discovering important information

Limitations of reflexion

  • Reflexion relies on self-reflection present in language models
  • Used GPT-3.0 and GPT-3.5 to power a ReAct agent
  • ReAct agent used in AlfWorld and HotPotQA tasks
  • Shortcoming in ability to improve on baseline performance in WebShop task
  • Agent only achieved 1 additional task relative to baseline
  • Successful item purchase not necessarily dependent on agent’s ability to plan and execute, but on quality of search engine results

Conclusion

  • Proposed an approach to allow natural language agents to learn from past mistakes
  • Demonstrated learning curves on AlfWorld and HotPotQA benchmarks that outperform base ReAct agents
  • Attempted to improve performance on WebShop benchmark
  • Reflexion is applicable to improve performance on decision-making and knowledge-intensive tasks
  • Reward model constrained to imitate environments with difficult to design or compute reward models
  • Encouraged others to apply Reflexion to more complex tasks
  • Figures 2, 3 and 6 show performance on AlfWorld, HotPotQA and WebShop
  • Example of HotPotQA question and thought/action/observation process
  • Example of failed task and new plan of action