Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Reinforcement learning algorithms have difficulty without a dense, well-shaped reward function.
- Intrinsically motivated exploration methods reward agents for visiting novel states or transitions, but are limited in large environments.
- ELLM uses background knowledge from text corpora to shape exploration.
- ELLM rewards agents for achieving goals suggested by a language model.
- ELLM guides agents toward human-meaningful and useful behaviors without requiring a human in the loop.
- ELLM is evaluated in the Crafter game environment and the Housekeep robotic simulator.
- ELLM-trained agents have better coverage of common-sense behaviors and usually match or improve performance on downstream tasks.
Paper Content
Introduction
- Reinforcement learning algorithms require rewards to incentivize progress.
- Intrinsically motivated RL methods use novelty, surprise, uncertainty, or prediction errors as rewards.
- Not all novelty is useful.
- Exploring with LLMs (ELLM) uses large, pretrained language models to suggest useful goals.
- ELLM yields meaningful exploratory rewards in two challenging domains.
Background and related work
- Intrinsically motivated RL algorithms explore outcomes rather than actions
- Knowledge-based IMs focus on maximizing the diversity of states
- Competence-based IMs maximize the diversity of skills mastered by the agent
- Representing goals in language unlocks the possibility of using text representations and generative models of text
- ELLM uses pretrained LLMs to constrain exploration towards plausibly useful goals
- The ELLM reward scheme rewards the agent for the similarity between the captioned transition and the suggested goals
- Competence-based IM (CB-IM) algorithms train a goal-conditioned policy to maximize the intrinsic reward R_int
- Measure final performance on the original task defined by R either during training or after a fine-tuning phase
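The similarity-based reward described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the threshold value of 0.5, and the toy 3-dimensional vectors (standing in for real sentence embeddings) are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def intrinsic_reward(caption_emb, goal_embs, threshold=0.5):
    """Best caption-to-goal similarity, or 0.0 if it does not exceed the threshold.

    The threshold value is a hypothetical choice made for this sketch.
    """
    if not goal_embs:
        return 0.0
    best = max(cosine_similarity(caption_emb, g) for g in goal_embs)
    return best if best > threshold else 0.0

# Toy vectors standing in for embeddings of a transition caption and two goals.
caption = [1.0, 0.0, 0.0]
goals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
reward = intrinsic_reward(caption, goals)
```

Taking the maximum over goals means the agent is rewarded for matching any one suggestion, and the threshold keeps weakly related captions from producing a reward.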
Implementation details
- ELLM algorithm is summarized in Algorithm 1 and Figure 1
- Impose novelty bias by filtering out LM suggestions already achieved in same episode
- Two forms of agent training: goal-conditioned and goal-free
- Combining pixel observations with embedded language-state captions performs better
- DQN algorithm used for training
- Test robustness of method with variant of ELLM using learned captioner trained on human descriptions
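The novelty bias mentioned in this list (dropping suggestions already achieved in the same episode) might look like the sketch below. Exact string matching after normalization is an assumption made for simplicity; the function name and example goals are hypothetical.

```python
def filter_novel_goals(suggestions, achieved_this_episode):
    """Keep only suggested goals that have not been achieved in the current episode.

    Goals are compared as lowercase, whitespace-stripped strings (an assumption
    of this sketch; a real system could compare goals by embedding similarity).
    """
    achieved = {g.strip().lower() for g in achieved_this_episode}
    return [g for g in suggestions if g.strip().lower() not in achieved]

suggested = ["Chop tree", "drink water", "place table"]
done = ["chop tree"]
remaining = filter_novel_goals(suggested, done)
```

The set of achieved goals would be reset at the start of each episode, so the same goal can be rewarded again in later episodes.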
Experiments
- Tests hypotheses H1 and H2
- Evaluates ELLM in two complex environments
Crafter
- Tested ELLM in the Crafter environment, a 2D version of Minecraft
- Crafter has an achievement tree with prerequisites
- Modified the game by augmenting the action space and reducing the amount of wood required to craft a table
- Used Codex as the LLM with the open-ended suggestion generation variant of ELLM
- Measured exploration quality as the average number of unique achievements per episode
- ELLM learns to unlock about 6 achievements every episode
- ELLM rewards the agent for achieving any goal suggested by the LLM, using a similarity-based reward function
- The RND baseline rewards the agent for maximizing a form of novelty estimated by the prediction error of a model
- Pretrained agents outperform the unconditioned ones
- Goal-conditioned ELLM and RND stand out as the best-performing methods
- Tested the robustness of ELLM to diverse and imperfect captions
- ELLM performance is overall robust to this imperfect captioner
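The exploration metric used for Crafter (average number of unique achievements per episode) can be computed as in this small sketch. The episode logs and achievement names are made up for illustration.

```python
def mean_unique_achievements(episodes):
    """Average number of distinct achievements unlocked per episode.

    `episodes` is a list of per-episode achievement logs; repeated unlocks
    of the same achievement within an episode are counted once.
    """
    if not episodes:
        return 0.0
    return sum(len(set(ep)) for ep in episodes) / len(episodes)

# Hypothetical logs: two episodes with 2 and 1 distinct achievements.
logs = [
    ["collect wood", "collect wood", "place table"],
    ["collect wood"],
]
score = mean_unique_achievements(logs)  # (2 + 1) / 2 = 1.5
```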
Housekeep
- Housekeep is an embodied robotics environment where the agent is tasked with cleaning up a house by rearranging misplaced objects
- The agent must match the environment’s ground truth correct mapping of objects to receptacles without direct instructions
- The task consists of 4 different scenes with one room each, each with 5 different misplaced objects and a suite of different possible receptacles
- The game reward’s rearrangement success rate is used as a measure of exploration quality
- The agent operates with low-level actions: moving forward, turning, looking up or down, and picking or placing an object
- LLM accuracy at identifying mismatches is above 87%, but accuracy of identifying matches varies greatly
- Pretraining and finetuning lead to higher success rates during pretraining
- Directly finetuning the pretrained model on the ground truth correct rearrangement matches or outperforms the baselines
- Directly training a new agent on the downstream task using the frozen pretrained model as an exploratory actor matches or outperforms all baselines
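As a rough illustration of a rearrangement success rate, the sketch below counts the fraction of misplaced objects that end up in an acceptable receptacle. The data layout, function name, and object names are assumptions; the benchmark's own metric may be defined differently.

```python
def rearrangement_success_rate(placements, correct_receptacles):
    """Fraction of objects whose final receptacle is one of the acceptable ones.

    placements: dict mapping object name -> receptacle it ended up in.
    correct_receptacles: dict mapping object name -> set of acceptable receptacles.
    Objects that were never placed count as failures.
    """
    if not correct_receptacles:
        return 0.0
    hits = sum(
        1
        for obj, acceptable in correct_receptacles.items()
        if placements.get(obj) in acceptable
    )
    return hits / len(correct_receptacles)

# Hypothetical episode with four misplaced objects, two placed correctly.
truth = {
    "mug": {"kitchen cabinet", "sink"},
    "book": {"shelf"},
    "sponge": {"sink"},
    "toy": {"toy box"},
}
final = {"mug": "sink", "book": "shelf", "sponge": "bed"}
rate = rearrangement_success_rate(final, truth)
```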
Conclusions and discussion
- ELLM is an intrinsic motivation method that biases exploration towards common-sense and useful behaviors.
- ELLM focuses exploration on common-sensical goals.
- ELLM requires states and transition captions.
- Textual observations increase performance in all conditions.
- ELLM can be used to suggest plausible visual goals.
B. Crafter downstream training
- Place Crafting Table
- Attack Cow
- Make Wood Sword
- Mine Stone
- Deforestation
- Gardening
- Plant Row
C. Crafter environment modifications
- The default action space contains a “do” action whose effect depends on the target object
- The action space was modified to make the exploration problem harder by splitting “do” into more precise action combinations
- Action space now 260 possible actions
- Human priors used to disallow invalid combinations
- 6/10 good actions, 6/21 rewarded actions, 7/15 good actions, 7/51 all actions suggested in prompt
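One way to picture the enlarged action space is as a product of verbs and objects, optionally pruned by a hand-written validity prior. The verbs, objects, and valid pairs below are hypothetical and far smaller than the 260-action space mentioned above.

```python
from itertools import product

# Hypothetical vocabulary; the real environment uses a much larger set.
VERBS = ["collect", "place", "attack"]
NOUNS = ["wood", "table", "cow"]

# Hand-written prior listing which combinations make sense (assumed for this sketch).
VALID = {("collect", "wood"), ("place", "table"), ("attack", "cow")}

def build_action_space(verbs, nouns, valid=None):
    """Enumerate verb-noun actions, keeping only valid pairs when a prior is given."""
    combos = list(product(verbs, nouns))
    if valid is not None:
        combos = [c for c in combos if c in valid]
    return [f"{verb} {noun}" for verb, noun in combos]

full_space = build_action_space(VERBS, NOUNS)           # all 9 combinations
pruned_space = build_action_space(VERBS, NOUNS, VALID)  # 3 sensible combinations
```

Without the prior, most combinations are nonsensical, which is what makes exploration harder than with a single context-dependent "do" action.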
E. Housekeep tasks
- Housekeep benchmark features a variety of household scenes and episodes
- Ground truth correct object-receptacle placements determined by humans
- RL pretraining focuses on first 4 tasks with 5 misplaced objects per task
G. Algorithmic details
- Use DQN, double Q-learning, dueling networks, and multi-step learning
- Take in 84x84 images encoded with Nature Atari CNN
- Pass image through linear layer to output 512 dimensional vector
- Compute language embedding of state caption/goals using SBERT model
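To make the image branch concrete, the sketch below works out the feature size that reaches the 512-unit linear layer. It assumes the standard Nature DQN convolution stack (8x8 stride 4, 4x4 stride 2, 3x3 stride 1, with 32, 64, and 64 channels, no padding); the summary above does not list these layer sizes, so treat them as an assumption.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a convolution with no padding."""
    return (size - kernel) // stride + 1

# (output channels, kernel size, stride) for the assumed Nature DQN CNN.
NATURE_CNN = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

def flattened_features(input_size=84):
    """Number of features after the conv stack, before the linear layer."""
    size = input_size
    channels = 0
    for channels, kernel, stride in NATURE_CNN:
        size = conv_out(size, kernel, stride)
    return channels * size * size

features = flattened_features(84)  # 84 -> 20 -> 9 -> 7, so 64 * 7 * 7 = 3136
```

Under these assumptions the linear layer maps 3136 features to the 512-dimensional image vector, which is then combined with the language embeddings.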
J. Code and compute
- Code to be released under MIT license and other licenses
- Use OpenAI’s APIs for LLM access
- Experiments with GPT-3 models led to degraded performance
- Codex is free and Davinci is priced at $0.02 per 1,000 tokens
- Caching helps reduce API queries
- Each API query takes 0.02 seconds
- 100 GPUs used for pretraining
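The caching and pricing points above can be illustrated with a small sketch. `CachedLLM`, `fake_llm`, and the prompt text are hypothetical stand-ins and do not call any real API; the cost helper simply applies the $0.02 per 1,000 tokens figure quoted for Davinci.

```python
class CachedLLM:
    """Wrap a query function so identical prompts trigger only one call."""

    def __init__(self, query_fn):
        self._query_fn = query_fn
        self._cache = {}
        self.api_calls = 0  # number of cache misses, i.e. real queries made

    def suggest(self, prompt):
        if prompt not in self._cache:
            self.api_calls += 1
            self._cache[prompt] = self._query_fn(prompt)
        return self._cache[prompt]

def davinci_cost_usd(num_tokens, price_per_1k=0.02):
    """Cost in USD at a flat per-1,000-token price."""
    return num_tokens / 1000 * price_per_1k

def fake_llm(prompt):
    """Stand-in for a remote LLM call (returns fixed suggestions)."""
    return ["chop tree", "drink water"]

llm = CachedLLM(fake_llm)
first = llm.suggest("You see a tree and water.")
second = llm.suggest("You see a tree and water.")  # served from the cache
```

Because many states produce the same caption, keying the cache on the prompt text can remove a large share of queries; the actual savings depend on how often prompts repeat.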
K. Societal impact
- LLMs have been shown to have impressive capabilities, but can also be prone to harmful biases and stereotypes
- If used in RL, it is necessary to understand and mitigate any negative behaviors that can be learned
- More careful study is necessary if deployed to real world
- Mitigations for ELLM include filtering LLM generations, prompting the LM with guidelines, and using closed-form ELLM with constrained goal spaces
- ELLM uses a pretrained large language model to suggest goals in a task-agnostic way
- ELLM uses GPT-3 to suggest goals and Sentence-BERT (SBERT) embeddings to compute similarity between suggested goals and demonstrated behaviors
- ELLM is applied to partially observed Markov decision processes
- Textual observations can increase performance
- Reward confusion matrix shows probability of column achievement when row achievement is unlocked
- Policy parametrization for ELLM includes conditioning on embeddings of goals and state
- Classification accuracy of LLM for Housekeep tasks