Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

RL agents can solve difficult tasks but require a lot of training data and struggle to generalize.
Large-scale language models (LSLMs) have strong reasoning ability and can adapt to new tasks, but cannot interact with an environment on their own.
This work combines the complementary abilities of RL and LSLMs into a single system with three parts: Planner, Actor, and Reporter.

Environment is a 2D partially observable grid-world with unique objects
Actor can move and perform two special actions: examine and pick up
Planner is a pre-trained large language model
Reporter translates Actor’s action and observation to Planner’s language
Tasks focus on reasoning, generalization, and exploration in embodied environments
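
The division of labour described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `run_episode`, `scripted_planner`, `toy_actor`, and `toy_reporter` are hypothetical names, the Planner here is a scripted stand-in for a pre-trained language model, and the "done" stop signal is an assumption made for the sketch.

```python
from typing import Callable, Dict, List


def run_episode(
    task: str,
    planner: Callable[[List[str]], str],
    actor: Callable[[str], Dict],
    reporter: Callable[[str, Dict], str],
    max_steps: int = 5,
) -> List[str]:
    """Alternate Planner -> Actor -> Reporter until the Planner says 'done'."""
    dialogue = [f"Task: {task}"]
    for _ in range(max_steps):
        # The Planner reads the dialogue so far and issues the next instruction.
        instruction = planner(dialogue)
        dialogue.append(f"Planner: {instruction}")
        if instruction == "done":  # assumed stop signal, not from the paper
            break
        # The Actor executes the instruction in the environment.
        observation = actor(instruction)
        # The Reporter translates the outcome back into text for the Planner.
        dialogue.append(f"Reporter: {reporter(instruction, observation)}")
    return dialogue


def scripted_planner(dialogue: List[str]) -> str:
    """Stand-in for a language model: a fixed plan indexed by reports seen."""
    n_reports = sum(line.startswith("Reporter:") for line in dialogue)
    plan = ["examine the box", "pick up the key", "done"]
    return plan[min(n_reports, len(plan) - 1)]


def toy_actor(instruction: str) -> Dict:
    """Stand-in for the RL policy: pretends every instruction succeeds."""
    return {"instruction": instruction, "success": True}


def toy_reporter(instruction: str, observation: Dict) -> str:
    """Turns the Actor's outcome into a sentence the Planner can read."""
    return f"I did: {instruction}" if observation["success"] else "I failed"


dialogue = run_episode(
    "pick up the right object", scripted_planner, toy_actor, toy_reporter
)
```

The only channel between the environment and the Planner in this sketch is the text appended by the Reporter, which mirrors the role the summary assigns to it.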

Examines interaction between Planner, Actor and Reporter in tasks that require all three components for success
Previous work shows LSLMs can break down complex tasks into step-by-step instructions
Planner needs to issue information gathering instructions and incorporate reported information
Tasks involve objects with abstract properties not grounded in LM’s previous experiences
Analyzing performance of different Planners and their robustness
All components pre-trained
Objects have ‘secret property’ (good/bad/unknown)
When the Actor ‘examines’ an object, the Reporter relays a text string to the Planner
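
As a rough illustration of that last step, the Reporter's job for an ‘examine’ event can be thought of as filling in a sentence. The exact wording of the report string is an assumption made here, and `SecretProperty` and `examine_report` are hypothetical names.

```python
from enum import Enum


class SecretProperty(Enum):
    """The three values an object's secret property can take."""
    GOOD = "good"
    BAD = "bad"
    UNKNOWN = "unknown"


def examine_report(object_name: str, prop: SecretProperty) -> str:
    """Build the text relayed to the Planner (assumed wording)."""
    return f"I examined {object_name}, its secret property value is {prop.value}"


report = examine_report("the box", SecretProperty.GOOD)
```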

Task requires gathering information
Goal is to pick up correct object based on another object’s secret property
Successful episode consists of 5 steps
LSLM Planner and Actor can complete task with good accuracy
Pure RL baseline performs poorly
Two main failure cases: Planner not inferring next instruction and Actor not following instruction
Smaller language models can infer correct object 58% of the time, larger models 96% of the time
Actor might encounter distribution shift which makes it unable to follow Planner’s instruction

Task requires agent to examine multiple objects and pick up the one with the good secret property
RL baseline performs worse, but agent framework with Planner-Actor-Reporter is still able to complete the task zero-shot
Agents perform better in this task than in the previous task
Planner can recover from errors
Larger language models (70B) perform significantly better than smaller models (7B)
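
The search task above amounts to examining objects one by one until a good one turns up. The sketch below hand-codes that strategy to make the expected sequence of instructions concrete; in the paper the instructions come from a language model, and `search_for_good` and the `examine` callback are hypothetical names.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def search_for_good(
    objects: Iterable[str],
    examine: Callable[[str], str],
) -> Tuple[Optional[str], List[str]]:
    """Examine objects in order; pick up the first one reported as 'good'."""
    instructions: List[str] = []
    for name in objects:
        instructions.append(f"examine {name}")
        if examine(name) == "good":
            instructions.append(f"pick up {name}")
            return name, instructions
    # No object was reported as good, so nothing is picked up.
    return None, instructions


# Toy world: secret properties are hidden until an object is examined.
secrets = {"the box": "bad", "the key": "good", "the ball": "bad"}
target, instructions = search_for_good(secrets, lambda name: secrets[name])
```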

Studied behavior of Planner in agent framework with Reporter that always reports accurate information
Investigated how to train Reporter from scratch with RL
Task specified as ‘If {decider object} is close to the wall, pick up {object 1}, otherwise pick up {object 2}’
Reporter’s input is same visual observations as Actor, output is binary classifier head
Recent work used pretrained models with visual grounding to act as Reporter module
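
To make the task template and the Reporter's binary output concrete, here is a small sketch. It assumes the classifier head produces a single logit that is thresholded at zero. `make_task`, `reporter_sentence`, and `resolve_target` are hypothetical names, and `resolve_target` is a hand-written stand-in for the reasoning the Planner would do.

```python
TEMPLATE = (
    "If {decider} is close to the wall, pick up {object_1}, "
    "otherwise pick up {object_2}"
)


def make_task(decider: str, object_1: str, object_2: str) -> str:
    """Fill the task template with concrete object names."""
    return TEMPLATE.format(decider=decider, object_1=object_1, object_2=object_2)


def reporter_sentence(logit: float, decider: str, threshold: float = 0.0) -> str:
    """Turn the binary head's logit into a sentence (assumed threshold)."""
    relation = "is close to" if logit > threshold else "is not close to"
    return f"{decider} {relation} the wall"


def resolve_target(report: str, object_1: str, object_2: str) -> str:
    """Stand-in for the Planner: choose the target implied by the report."""
    return object_2 if " is not close to " in report else object_1


task = make_task("the box", "the key", "the ball")
report = reporter_sentence(1.3, "the box")
target = resolve_target(report, "the key", "the ball")
```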

Logical Reasoning: Ability to take complex instructions and do logical operations
Generalization: Ability to generalize to new inputs
Exploration: Ability to explore the world to learn new information
Perception: Ability to use raw observation to process the world and make decisions
Option Elimination: Logical reasoning and generalization task
Step Tasks: Pick up two objects in order
Conditional Secret Property: Pick up target object based on properties of decider object
Search Secret Property: Collect information to determine target object
Visual Location Conditional: Navigate to object to examine surroundings to determine target object

Trained Actor on ‘pick up’ and ‘examine’ tasks
Actor receives event report as separate observation
After 5000 learner updates, performance on Search task is 25%
Performance on Conditional task is 33%
Pure RL baseline finds tasks difficult
Planner-Actor-Reporter agent performs well when given 5 examples of optimal performance