Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Recent advancements in LLM agents have shown impressive performance
Implementing these methods can be challenging due to lack of data or well-defined state space
Reflexion is an approach that endows an agent with dynamic memory and self-reflection capabilities
Heuristic introduced to enable agent to pinpoint hallucination instances, avoid repetition and construct memory map
Evaluated in AlfWorld and HotPotQA environments with success rates of 97% and 51% respectively

Paper Content

Introduction

Mastering decision-making and knowledge-intensive search tasks is important for natural language agents
LLMs have achieved impressive results on various benchmarks
Grounding complex tasks in natural language helps agents avoid false-negative errors
Learning optimal policies for natural language RL agents is challenging due to vast and mostly unbound state spaces
Several decision-making approaches have been proposed to enable natural language agents to select their next action
Chain-of-thought reasoning leverages emergent properties to solve tasks in a single action
ReAct utilizes emergent properties to solve problems
Several recent works have aimed to allow natural language agents to exhibit reflective-like qualities
DEPS uses multi-step reasoning and sub-task error correction to solve long-range tasks
Huang et al. use inner monologue and success detection to allow agents to reason and act
Haluptzok et al. use self-generated solutions to fine-tune an LLM
Human-in-the-loop approaches have been used to improve performance
LLMs possess an emergent property of self-reflection
Reflexion agent uses self-reflection to learn from its own mistakes
Reflexion agent achieves improved performance on AlfWorld and HotPotQA benchmarks

Architecture

Reflexion agent uses ReAct decision-making approach
Output is constrained to binary success status
Agent computes heuristic to suggest self-reflection
Agent queries LLM for next action
Maximum of three reflections stored in agent’s memory

Setup

Agent is tasked to solve a problem by executing actions to learn from observations within an environment
At each time step, agent receives an observation and executes an action based on its current policy
Context is given to the agent based on its current state and trajectory history
Policy is not learned over a state space
Aim to constrain specific reward information given to the agent
Agent is equipped with a heuristic function to detect common modes of failure

Heuristics

Heuristic h is defined to tell the agent when to reflect
Heuristic h takes into account the current state, hyperparameters, and trajectory history
Hyperparameter Ω is set to detect hallucination of repeated consecutive actions
Hyperparameter ε restricts the maximum number of actions allowed in an environment per trial
Heuristic h replaces the role of a human-in-the-loop

Reflexion

Agent initiates self-reflection process on current state, reward, actions/observations, and existing memory
Model used for self-reflection is an LLM prompted with two-shot learning examples of failed/ideal reflection pairs
Agent not granted access to domain-specific solutions to encourage creative/novel techniques
Self-reflection modeled in equation
Reflection added to agent’s memory, environment reset, next trial starts

Reward model

Binary reward model assigns 0 or 1 to an action taken by the agent
Agent’s knowledge is limited to binary success status in environment
Binary reward models applicable to language problems such as code generation and debugging
Agent learns faster than random action sampling in arbitrary environments

Action space

Defining a natural language action space for large language models is difficult due to the vast number of possible actions.
There is a trade-off between minimizing false positives and false negatives.
The action space chosen is a combination of actions and language.
False negatives are handled by responding with “Invalid Action” observations.
Few-shot prompting is used to demonstrate permissible action syntax.

Evaluation

Alfworld

AlfWorld is a suite of text-based environments
Agent can choose from a list of admissible actions
Agent receives an observation and reward from the environment
Reward is filtered to a binary reward
Six different tasks and over 3000 environments
ReAct problem solving strategy is implemented
ReAct allows agent to reason and act
Action space in each state is not explicitly defined
Hand-selected few-shot trajectories used
134 AlfWorld environments tested
Reflexion agent outperformed base ReAct agent by 20%

Hotpotqa

HotPotQA is a dataset with 113k question-and-answer pairs.
Agents must search Wikipedia to answer questions.
EM is used to grade the agent’s answers.
The agent is given six few-shot examples of successful trajectories.
The agent can perform three actions: search, lookup, and finish.
The agent can self-reflect and reset the environment if it provides an incorrect answer.

Discussion

Reflexion promotes learning through discovery
Agent demonstrates unfamiliarity with environment in first trial
Agent recognizes flaw in initial plan after reflecting
Agent remembers desk lamp was on desk 1
Agent devises more accurate plan to find mug
Agent succeeds after 1 reflection, ReAct-only approach takes 6 trials

Hallucination vs. inefficient planning

Hallucination is the most common reason for failure in AlfWorld runs.
Hallucination is defined as two or more consecutive identical actions with the same environment response.
Inefficient planning is defined as executing more than 30 actions without reaching a successful state.
With reflection, the agent can correct all mistakes related to inefficient planning.

Reflexion enables more intuitive search queries

A key component to an agent’s success is its ability to form intuitive search queries
An agent can learn from its past mistakes
Reflexion is an approach designed to promote discovery for problem-solving LLM-agents
A Reflexion agent must have access to a heuristic for termination and a binary reward model
The heuristic can be designed to match actions that are associated with high probabilities of failure
In the AlfWorld experiment, a simple heuristic was used to demonstrate the applicability of the approach
In other experiments, a more specific heuristic could be applied
The binary reward model can be a compilation or type-check attempt
In a capture-the-flag task, the possession status of the flag could be the binary reward model
Reflexion served as a redirection mechanism in some cases and as a summarization tool in others
In subsequent trials, the agent did not need to search every location to find an object
Reflexion is needed to ensure that the query length does not exceed the maximum number of tokens while still discovering important information

Limitations of reflexion

Reflexion relies on self-reflection present in language models
Used GPT-3.0 and GPT-3.5 to power a ReAct agent
ReAct agent used in AlfWorld and HotPotQA tasks
Shortcoming in ability to improve on baseline performance in WebShop task
Agent only achieved 1 additional task relative to baseline
Successful item purchase not necessarily dependent on agent’s ability to plan and execute, but on quality of search engine results

Conclusion

Proposed an approach to allow natural language agents to learn from past mistakes
Demonstrated learning curves on AlfWorld and HotPotQA benchmarks that outperform base ReAct agents
Attempted to improve performance on WebShop benchmark
Reflexion is applicable to improve performance on decision-making and knowledge-intensive tasks
Reward model constrained to imitate environments with difficult to design or compute reward models
Encouraged others to apply Reflexion to more complex tasks
Figures 2, 3 and 6 show performance on AlfWorld, HotPotQA and WebShop
Example of HotPotQA question and thought/action/observation process
Example of failed task and new plan of action

Link to paper#

Abstract#

Paper Content#

Introduction#

Architecture#

Setup#

Heuristics#

Reflexion#

Reward model#

Action space#

Evaluation#

Alfworld#

Hotpotqa#

Discussion#

Hallucination vs. inefficient planning#

Reflexion enables more intuitive search queries#

Limitations of reflexion#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Architecture

Setup

Heuristics

Reflexion

Reward model

Action space

Evaluation

Alfworld

Hotpotqa

Discussion

Hallucination vs. inefficient planning

Reflexion enables more intuitive search queries

Limitations of reflexion

Conclusion