Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • RL agents can solve difficult tasks but require a lot of training data and struggle to generalize.
  • LSLMs have strong reasoning ability and can adapt to new tasks, but don’t have the ability to interact with the environment.
  • This work combines the complementary abilities of RL and LSLMs into a single system with three parts: Planner, Actor, and Reporter.

Paper Content

Methods

  • Environment is a 2D partially observable grid-world with unique objects
  • Actor can move and perform two special actions: examine and pickup
  • Planner is a pre-trained large language model
  • Reporter translates Actor’s action and observation to Planner’s language
  • Tasks focus on reasoning, generalization, and exploration in embodied environments

Language models as interactive planners

  • Examines interaction between Planner, Actor and Reporter in tasks that require all three components for success
  • Previous work shows LSLMs can break down complex tasks into step-by-step instructions
  • Planner needs to issue information gathering instructions and incorporate reported information
  • Tasks involve objects with abstract properties not grounded in LM’s previous experiences
  • Analyzing performance of different Planners and their robustness
  • All components pre-trained
  • Objects have ‘secret property’ (good/bad/unknown)
  • When Actor ’examines’ object, Reporter relays text string to Planner

Secret property conditional task

  • Task requires gathering information
  • Goal is to pick up correct object based on another object’s secret property
  • Successful episode consists of 5 steps
  • LSLM Planner and Actor can complete task with good accuracy
  • Pure RL baseline performs poorly
  • Two main failure cases: Planner not inferring next instruction and Actor not following instruction
  • Smaller language models can infer correct object 58% of the time, larger models 96% of the time
  • Actor might encounter distribution shift which makes it unable to follow Planner’s instruction

Secret property search task

  • Task requires agent to examine multiple objects and pick up the one with the good secret property
  • RL baseline performs worse, but agent framework with Planner-Actor-Reporter is still able to complete the task zero-shot
  • Agents perform better in this task than in the previous task
  • Planner can recover from errors
  • Larger language models (70B) perform significantly better than smaller models (7B)

Robustness to irrelevant reports

  • 70B Planner is robust to mistakes from Actor
  • Examined if it can be robust to noisy Reporter
  • Irrelevant actions reported 20% of the time
  • 70B Planner uses strategies of repetition and cycling
  • 70B Planner performance not dramatically reduced

Training a truthful reporter

  • Studied behavior of Planner in agent framework with Reporter that always reports accurate information
  • Investigated how to train Reporter from scratch with RL
  • Task specified as ‘If {decider object} is close to the wall, pick up {object 1}, otherwise pick up {object 2}’
  • Reporter’s input is same visual observations as Actor, output is binary classifier head
  • Recent work used pretrained models with visual grounding to act as Reporter module

Discussion and future work

  • System uses Planner, Actor, and Reporter modules
  • Planner is a pre-trained language model
  • Actor is a pre-trained policy
  • Reporter translates information back to Planner
  • Tasks leverage language model’s abstract reasoning
  • Agent view is 11x11 crop of scene from top-down perspective
  • Actor and Reporter modules trained with VTrace loss

C all task descriptions

  • Logical Reasoning: Ability to take complex instructions and do logical operations
  • Generalization: Ability to generalize to new inputs
  • Exploration: Ability to explore the world to learn new information
  • Perception: Ability to use raw observation to process the world and make decisions
  • Option Elimination: Logical reasoning and generalization task
  • Step Tasks: Pick up two objects in order
  • Conditional Secret Property: Pick up target object based on properties of decider object
  • Search Secret Property: Collect information to determine target object
  • Visual Location Conditional: Navigate to object to examine surroundings to determine target object

D rl baselines

  • Trained Actor on ‘pick up’ and ’examine’ tasks
  • Actor receives event report as separate observation
  • After 5000 learner updates, performance on Search task is 25%
  • Performance on Conditional task is 33%
  • Pure RL baseline finds tasks difficult
  • Planner-Actor-Reporter agent performs well with 5 examples of optimal performance