Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- RL agents can solve difficult tasks but require a lot of training data and struggle to generalize.
- Large-scale language models (LSLMs) have strong reasoning ability and can adapt to new tasks, but cannot interact with an environment on their own.
- This work combines the complementary abilities of RL and LSLMs into a single system with three parts: Planner, Actor, and Reporter.
Paper Content
Methods
- Environment is a 2D partially observable grid-world with unique objects
- Actor can move and perform two special actions: examine and pickup
- Planner is a pre-trained large language model
- Reporter translates Actor’s action and observation to Planner’s language
- Tasks focus on reasoning, generalization, and exploration in embodied environments
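The interaction between the three modules can be sketched as a simple loop. This is a minimal illustration only: the function names, the string-based interfaces, and the toy stand-ins for each module are assumptions made for this sketch, not the paper's implementation.

```python
from typing import Callable, List

# Hypothetical interfaces, assumed for illustration only.
Planner = Callable[[str], str]   # dialogue so far -> next instruction
Actor = Callable[[str], str]     # instruction -> description of what happened
Reporter = Callable[[str], str]  # event description -> text report for the Planner


def run_episode(task: str, planner: Planner, actor: Actor,
                reporter: Reporter, max_steps: int = 5) -> List[str]:
    """Alternate Planner instructions and Reporter feedback until a pickup."""
    dialogue = [f"Task: {task}"]
    for _ in range(max_steps):
        instruction = planner("\n".join(dialogue))
        dialogue.append(f"Planner: {instruction}")
        event = actor(instruction)
        dialogue.append(f"Reporter: {reporter(event)}")
        if event.startswith("picked up"):
            break  # in this sketch, picking up an object ends the episode
    return dialogue


# Toy stand-ins: a scripted "Planner" instead of a language model,
# and an "Actor" that always succeeds instead of a trained policy.
def toy_planner(dialogue: str) -> str:
    if "Reporter:" not in dialogue:
        return "examine A"  # gather information first
    return "pick up B" if "Reporter: A is good" in dialogue else "pick up C"


def toy_actor(instruction: str) -> str:
    verb, _, target = instruction.rpartition(" ")
    return f"examined {target}" if verb == "examine" else f"picked up {target}"


def toy_reporter(event: str) -> str:
    return "A is good" if event == "examined A" else f"I {event}"


dialogue = run_episode("If A is good, pick up B, otherwise pick up C",
                       toy_planner, toy_actor, toy_reporter)
```

In this sketch the Planner only ever sees text, so any information from the environment has to reach it through the Reporter.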
Language models as interactive planners
- Examines interaction between Planner, Actor and Reporter in tasks that require all three components for success
- Previous work shows LSLMs can break down complex tasks into step-by-step instructions
- Planner needs to issue information gathering instructions and incorporate reported information
- Tasks involve objects with abstract properties not grounded in LM’s previous experiences
- Analyzing performance of different Planners and their robustness
- All components pre-trained
- Objects have ‘secret property’ (good/bad/unknown)
- When Actor ‘examines’ object, Reporter relays text string to Planner
Secret property conditional task
- Task requires gathering information
- Goal is to pick up correct object based on another object’s secret property
- Successful episode consists of 5 steps
- LSLM Planner and Actor can complete task with good accuracy
- Pure RL baseline performs poorly
- Two main failure cases: Planner not inferring next instruction and Actor not following instruction
- Smaller language models can infer correct object 58% of the time, larger models 96% of the time
- Actor might encounter distribution shift which makes it unable to follow Planner’s instruction
Secret property search task
- Task requires agent to examine multiple objects and pick up the one with the good secret property
- RL baseline performs worse, but agent framework with Planner-Actor-Reporter is still able to complete the task zero-shot
- Agents perform better in this task than in the previous task
- Planner can recover from errors
- Larger language models (70B) perform significantly better than smaller models (7B)
Robustness to irrelevant reports
- 70B Planner is robust to mistakes from Actor
- Tested whether the 70B Planner is also robust to a noisy Reporter
- Irrelevant actions reported 20% of the time
- 70B Planner uses strategies of repetition and cycling
- 70B Planner performance not dramatically reduced
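One way to picture the noisy Reporter is as a wrapper that replaces a truthful report with an irrelevant one 20% of the time. The sketch below rests on assumptions: the wrapper name, the example strings, and the uniform choice among irrelevant reports are hypothetical, and the paper's exact noise model is not described in this summary.

```python
import random
from typing import Callable, Optional, Sequence


def noisy_reporter(reporter: Callable[[str], str],
                   irrelevant_reports: Sequence[str],
                   p_irrelevant: float = 0.2,
                   rng: Optional[random.Random] = None) -> Callable[[str], str]:
    """Wrap a truthful reporter so that some reports are replaced by noise."""
    rng = rng or random.Random()

    def wrapped(event: str) -> str:
        # With probability p_irrelevant, ignore the event and emit a distractor.
        if rng.random() < p_irrelevant:
            return rng.choice(irrelevant_reports)
        return reporter(event)

    return wrapped


# Hypothetical truthful reporter and distractor strings.
truthful = lambda event: f"I {event}"
distractors = ["I moved up", "I moved left"]

noisy = noisy_reporter(truthful, distractors, p_irrelevant=0.2,
                       rng=random.Random(0))
reports = [noisy("examined A") for _ in range(10_000)]
noise_rate = sum(r in distractors for r in reports) / len(reports)
```

A Planner facing this wrapper cannot tell in advance which reports are noise, which is consistent with the repetition and cycling strategies noted above: repeating an instruction gives another chance at a relevant report.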
Training a truthful reporter
- Earlier experiments studied the Planner in an agent framework whose Reporter always reports accurate information
- Investigated how to train Reporter from scratch with RL
- Task specified as ‘If {decider object} is close to the wall, pick up {object 1}, otherwise pick up {object 2}’
- Reporter’s input is same visual observations as Actor, output is binary classifier head
- Recent work used pretrained models with visual grounding to act as Reporter module
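The task template and the binary Reporter output can be illustrated with a small sketch. The template string follows the task description above; everything else is an assumption made for illustration: the helper names, the convention that a Reporter output of 1 means "close to the wall", the idealised Planner that trusts that bit, and the use of episode success as the Reporter's reward.

```python
TEMPLATE = ("If {decider} is close to the wall, pick up {obj1}, "
            "otherwise pick up {obj2}")


def format_task(decider: str, obj1: str, obj2: str) -> str:
    """Fill in the task template with concrete object names."""
    return TEMPLATE.format(decider=decider, obj1=obj1, obj2=obj2)


def correct_target(near_wall: bool, obj1: str, obj2: str) -> str:
    """Object that should be picked up given the true state of the decider."""
    return obj1 if near_wall else obj2


def episode_reward(reporter_bit: int, near_wall: bool,
                   obj1: str, obj2: str) -> float:
    """Reward under an idealised Planner that trusts the Reporter's bit.

    Assumption: 1 is read as "close to the wall", and the Reporter is
    rewarded only when the resulting pickup is the correct one.
    """
    chosen = correct_target(bool(reporter_bit), obj1, obj2)
    return 1.0 if chosen == correct_target(near_wall, obj1, obj2) else 0.0


task = format_task("A", "B", "C")
```

Under these assumptions a Reporter that outputs the true bit always earns reward, so reinforcement learning on episode success pushes it toward truthful reporting.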
Discussion and future work
- System uses Planner, Actor, and Reporter modules
- Planner is a pre-trained language model
- Actor is a pre-trained policy
- Reporter translates information back to Planner
- Tasks leverage language model’s abstract reasoning
- Agent view is 11x11 crop of scene from top-down perspective
- Actor and Reporter modules trained with VTrace loss
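The 11x11 agent view can be sketched as a crop of the full grid. This sketch assumes the crop is centred on the agent and that cells outside the grid are filled with a padding value; neither detail is stated in this summary, and the function name and integer cell encoding are hypothetical.

```python
from typing import List


def egocentric_crop(grid: List[List[int]], agent_row: int, agent_col: int,
                    size: int = 11, pad: int = -1) -> List[List[int]]:
    """Return a size x size window of `grid` centred on the agent.

    Cells that fall outside the grid are filled with `pad`.
    """
    half = size // 2
    rows, cols = len(grid), len(grid[0])
    crop = []
    for r in range(agent_row - half, agent_row + half + 1):
        row = []
        for c in range(agent_col - half, agent_col + half + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(grid[r][c] if inside else pad)
        crop.append(row)
    return crop


# Toy 20x20 world where each cell stores its own flat index.
world = [[r * 20 + c for c in range(20)] for r in range(20)]
view = egocentric_crop(world, agent_row=0, agent_col=0)
```

Because the window covers only part of the world, objects outside it are invisible to the agent, which is what makes the environment partially observable.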
C All task descriptions
- Logical Reasoning: Ability to take complex instructions and do logical operations
- Generalization: Ability to generalize to new inputs
- Exploration: Ability to explore the world to learn new information
- Perception: Ability to use raw observation to process the world and make decisions
- Option Elimination: Logical reasoning and generalization task
- Step Tasks: Pick up two objects in order
- Conditional Secret Property: Pick up target object based on properties of decider object
- Search Secret Property: Collect information to determine target object
- Visual Location Conditional: Navigate to object to examine surroundings to determine target object
D RL baselines
- Trained Actor on ‘pick up’ and ‘examine’ tasks
- Actor receives event report as separate observation
- After 5000 learner updates, performance on Search task is 25%
- Performance on Conditional task is 33%
- Pure RL baseline finds tasks difficult
- Planner-Actor-Reporter agent performs well with 5 examples of optimal performance
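The "5 examples of optimal performance" suggest a few-shot prompt in which example episodes precede the new task. The sketch below shows one plausible way to assemble such a prompt; the function name, the line prefixes, and the blank-line separator are assumptions, since the exact prompt format is not given in this summary.

```python
from typing import List, Sequence


def build_prompt(examples: Sequence[List[str]], task: str) -> str:
    """Concatenate example episodes, then append the new task description.

    Each example is a list of dialogue lines (task, instructions, reports).
    """
    blocks = ["\n".join(example) for example in examples]
    blocks.append(f"Task: {task}")
    return "\n\n".join(blocks)


# One hypothetical example episode, repeated to stand in for five examples.
example = [
    "Task: If A is good, pick up B, otherwise pick up C",
    "Planner: examine A",
    "Reporter: A is good",
    "Planner: pick up B",
]
prompt = build_prompt([example] * 5,
                      "If D is good, pick up E, otherwise pick up F")
```

The Planner would then be asked to continue the prompt, producing the next instruction for the new task.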