Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Reinforcement learning algorithms typically struggle in the absence of a dense, well-shaped reward function.
Intrinsically motivated exploration methods reward agents for visiting novel states or transitions, but their benefits are limited in large environments where most discovered novelty is irrelevant to downstream tasks.
ELLM uses background knowledge from text corpora to shape exploration.
ELLM rewards agents for achieving goals suggested by a language model.
ELLM guides agents toward human-meaningful and useful behaviors without requiring a human in the loop.
ELLM is evaluated in the Crafter game environment and the Housekeep robotic simulator.
ELLM-trained agents have better coverage of common-sense behaviors and usually match or improve performance on downstream tasks.

Paper Content

Introduction

Reinforcement learning algorithms require rewards to incentivize progress.
Intrinsically motivated RL methods use novelty, surprise, uncertainty, or prediction errors as rewards.
Not all novelty is useful.
Exploring with LLMs (ELLM) uses large, pretrained language models to suggest useful goals.
ELLM yields meaningful exploratory rewards in two challenging domains.

Background and related work

Intrinsically motivated RL algorithms explore outcomes rather than actions
Knowledge-based IMs focus on maximizing the diversity of states
Competence-based IMs maximize the diversity of skills mastered by the agent
Representing goals in language unlocks the possibility of using text representations and generative models of text
ELLM uses pretrained LLMs to constrain exploration towards plausibly useful goals
The ELLM reward scheme rewards the agent for the similarity between the captioned transition and the suggested goals
CB-IM algorithms train a goal-conditioned policy to maximize R_int
Measure final performance on the original task defined by R either during training or after a fine-tuning phase
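
The similarity-based reward described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed` is a hypothetical bag-of-words stand-in for a sentence encoder, and both the thresholding step and the threshold value 0.5 are assumed placeholders.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for a sentence encoder: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def intrinsic_reward(caption, goals, threshold=0.5):
    """Return (reward, matched_goal).

    The reward is the highest similarity between the transition caption and
    any suggested goal, or 0.0 if that similarity is below the (assumed)
    threshold or no goals were suggested.
    """
    if not goals:
        return 0.0, None
    caption_vec = embed(caption)
    best_goal = max(goals, key=lambda g: cosine(caption_vec, embed(g)))
    best_sim = cosine(caption_vec, embed(best_goal))
    if best_sim < threshold:
        return 0.0, None
    return best_sim, best_goal
```

For example, `intrinsic_reward("chop tree", ["chop tree", "drink water"])` matches the first goal, while a caption that shares no words with any goal yields a reward of 0.0.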

Implementation details

ELLM algorithm is summarized in Algorithm 1 and Figure 1
Impose novelty bias by filtering out LM suggestions already achieved in same episode
Two forms of agent training: goal-conditioned and goal-free
Combining pixel observations with embedded language state captions performs better
DQN algorithm used for training
Test robustness of method with variant of ELLM using learned captioner trained on human descriptions
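
The novelty bias above (dropping suggestions already achieved in the same episode) might look like the sketch below. The function name is hypothetical, and matching goals by normalized exact string is an assumption made for simplicity; the summary does not say how achieved goals are matched.

```python
def filter_novel_goals(suggested_goals, achieved_this_episode):
    """Drop suggestions the agent has already achieved in the current episode.

    Assumption: two goals are "the same" when their lowercased, stripped
    strings are identical.
    """
    achieved = {g.strip().lower() for g in achieved_this_episode}
    return [g for g in suggested_goals if g.strip().lower() not in achieved]


# The achieved list would be reset at the start of every episode.
achieved = ["Chop tree"]
suggestions = ["chop tree", "drink water", "place crafting table"]
remaining = filter_novel_goals(suggestions, achieved)
```

Because the filter only looks at the current episode, a goal becomes rewardable again once the episode resets.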

Experiments

Tests hypotheses H1 and H2
Evaluates ELLM in two complex environments

Crafter

Tested ELLM in the Crafter environment, a 2D version of Minecraft
Crafter has an achievement tree with prerequisites
Modified the game by augmenting the action space and reducing the amount of wood required to craft a table
Used Codex as the LLM with the open-ended suggestion generation variant of ELLM
Measured exploration quality as the average number of unique achievements per episode
ELLM learns to unlock about 6 achievements every episode
ELLM rewards the agent for achieving any goal suggested by the LLM using a similarity-based reward function
The RND baseline rewards the agent for maximizing a form of novelty estimated by the prediction error of a model
Pretrained agents outperform the unconditioned ones
Goal-conditioned ELLM and RND stand out as the best-performing methods
Tested the robustness of ELLM to diverse and imperfect captions
ELLM performance is overall robust to this imperfect captioner
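
The exploration metric used above, the average number of unique achievements per episode, can be computed as in this small sketch. The data layout (one list of unlocked achievement names per episode) is an assumption for illustration.

```python
def mean_unique_achievements(episodes):
    """Average number of distinct achievements unlocked per episode.

    `episodes` is a list of lists; each inner list holds the achievement
    names unlocked during one episode (repeats are counted once).
    """
    if not episodes:
        return 0.0
    return sum(len(set(ep)) for ep in episodes) / len(episodes)


# Illustrative data only, not results from the paper.
logged = [
    ["collect wood", "collect wood", "place table"],
    ["collect wood"],
    [],
]
score = mean_unique_achievements(logged)
```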

Housekeep

Housekeep is an embodied robotics environment where the agent is tasked with cleaning up a house by rearranging misplaced objects
The agent must match the environment’s ground truth correct mapping of objects to receptacles without direct instructions
The task consists of 4 different scenes with one room each, each with 5 different misplaced objects and a suite of different possible receptacles
The game reward’s rearrangement success rate is used as a measure of exploration quality
The agent operates with low-level actions: moving forward, turning, looking up or down, and picking or placing an object
LLM accuracy at identifying mismatches is above 87%, but accuracy of identifying matches varies greatly
Pretraining with ELLM leads to higher success rates during pretraining
Directly finetuning the pretrained model on the ground truth correct rearrangement matches or outperforms the baselines
Directly training a new agent on the downstream task using the frozen pretrained model as an exploratory actor matches or outperforms all baselines
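
The LLM accuracy figures above are reported separately for matches and mismatches. A minimal sketch of that per-class computation follows; the boolean encoding (True means the object belongs in the receptacle) and the sample data are assumptions for illustration.

```python
def per_class_accuracy(predictions, labels):
    """Accuracy on true matches and on true mismatches, computed separately.

    Both arguments are equal-length lists of booleans, where True means
    "this object belongs in this receptacle".
    """
    match_total = sum(1 for y in labels if y)
    mismatch_total = len(labels) - match_total
    match_correct = sum(1 for p, y in zip(predictions, labels) if y and p)
    mismatch_correct = sum(
        1 for p, y in zip(predictions, labels) if not y and not p
    )
    return {
        "match": match_correct / match_total if match_total else None,
        "mismatch": mismatch_correct / mismatch_total if mismatch_total else None,
    }


# Illustrative labels and predictions, not data from the paper.
labels = [True, True, False, False, False, False]
preds = [True, False, False, False, False, True]
acc = per_class_accuracy(preds, labels)
```

Splitting the two classes matters because most object-receptacle pairs are mismatches, so a single overall accuracy would hide poor performance on matches.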

Conclusions and discussion

ELLM is an intrinsic motivation method that biases exploration towards common-sense and useful behaviors.
ELLM focuses exploration on common-sensical goals.
ELLM requires states and transition captions.
Textual observations increase performance in all conditions.
ELLM can be used to suggest plausible visual goals.

B. Crafter downstream training

Place Crafting Table
Attack Cow
Make Wood Sword
Mine Stone
Deforestation
Gardening
Plant Row

C. Crafter env modifications

Default action space contains “do” action which takes different actions depending on object
Modified the action space to make the exploration problem harder by turning “do” into more precise action combinations
Action space now 260 possible actions
Human priors used to disallow invalid combinations
6/10 good actions, 6/21 rewarded actions, 7/15 good actions, 7/51 all actions suggested in prompt
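
One way to picture the modified action space is as a product of verbs and nouns with a hand-written validity filter. The sketch below is purely illustrative: the verb and noun lists and the `VALID` set are hypothetical, and it does not reproduce the environment's actual 260 actions.

```python
from itertools import product

# Hypothetical vocabulary for illustration; not the environment's real lists.
VERBS = ["eat", "chop", "drink"]
NOUNS = ["cow", "tree", "water"]

# Hypothetical human prior: the verb/noun pairs considered sensible.
VALID = {("eat", "cow"), ("chop", "tree"), ("drink", "water")}


def build_action_space(verbs, nouns, is_valid):
    """Enumerate verb-noun actions, keeping only those the prior allows."""
    return [f"{v} {n}" for v, n in product(verbs, nouns) if is_valid(v, n)]


all_actions = build_action_space(VERBS, NOUNS, lambda v, n: True)
allowed = build_action_space(VERBS, NOUNS, lambda v, n: (v, n) in VALID)
```

The unfiltered product grows multiplicatively with the vocabulary, which is what makes exploration harder than with a single context-dependent “do” action.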

E. Housekeep tasks

Housekeep benchmark features a variety of household scenes and episodes
Ground truth correct object-receptacle placements determined by humans
RL pretraining focuses on first 4 tasks with 5 misplaced objects per task

G. algorithmic details

Use DQN, double Q-learning, dueling networks, and multi-step learning
Take in 84x84 images encoded with Nature Atari CNN
Pass image through linear layer to output 512 dimensional vector
Compute language embedding of state caption/goals using SBERT model
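
To make the image pathway above concrete, the arithmetic below traces an 84x84 input through the convolutional stack from the Nature DQN paper and into a 512-dimensional linear layer. The layer hyperparameters are the standard ones for that architecture; that they are used unmodified here is an assumption, since the summary does not list them.

```python
def conv_out(size, kernel, stride):
    """Output side length of a square convolution with no padding."""
    return (size - kernel) // stride + 1


# Standard Nature DQN convolutional stack: (out_channels, kernel, stride).
# Assumed to match the encoder referred to above.
NATURE_CNN = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def flattened_features(input_size=84, layers=NATURE_CNN):
    """Number of features after the last conv layer, flattened."""
    size = input_size
    channels = None
    for channels, kernel, stride in layers:
        size = conv_out(size, kernel, stride)
    return channels * size * size


def linear_params(in_features, out_features=512):
    """Weights plus biases of the linear layer producing the image vector."""
    return in_features * out_features + out_features


features = flattened_features()        # spatial size goes 84 -> 20 -> 9 -> 7
params = linear_params(features)
```

Under these assumptions the flattened feature map has 64 x 7 x 7 = 3136 entries, which the linear layer maps to the 512-dimensional vector.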

J. code and compute

Code to be released under MIT license and other licenses
Use OpenAI’s APIs for LLM access
Experiments with GPT-3 models led to degraded performance
Codex is free and Davinci is priced at $0.02/1000 tokens
Caching helps reduce API queries
Each API query takes 0.02 seconds
100 GPUs used for pretraining
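
The caching and pricing points above can be illustrated with a small sketch. `CachedLLM` and `query_fn` are hypothetical names; `query_fn` stands in for whatever function actually calls the API, and keying the cache on the exact prompt string is an assumption.

```python
class CachedLLM:
    """Wrap a query function so that repeated prompts hit a local cache."""

    def __init__(self, query_fn):
        self._query_fn = query_fn   # hypothetical: performs the real API call
        self._cache = {}
        self.api_calls = 0

    def suggest(self, prompt):
        """Return suggestions for a prompt, querying the API only on a miss."""
        if prompt not in self._cache:
            self._cache[prompt] = self._query_fn(prompt)
            self.api_calls += 1
        return self._cache[prompt]


def query_cost_usd(num_tokens, price_per_1k=0.02):
    """Cost of a request at a per-1000-token price (default: $0.02)."""
    return num_tokens / 1000 * price_per_1k


# Fake query function so the sketch runs without network access.
llm = CachedLLM(lambda prompt: ["chop tree"])
llm.suggest("You see a tree.")
llm.suggest("You see a tree.")   # served from cache
llm.suggest("You see water.")
```

Since many states produce identical captions, and therefore identical prompts, a cache like this avoids paying for the same query twice.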

K. societal impact

LLMs have been shown to have impressive capabilities, but can also be prone to harmful biases and stereotypes
If used in RL, it is necessary to understand and mitigate any negative behaviors that can be learned
More careful study is necessary if deployed to real world
Mitigations for ELLM include filtering LLM generations, prompting the LM with guidelines, and using closed-form ELLM with constrained goal spaces
ELLM uses a pretrained large language model to suggest goals in a task-agnostic way
ELLM uses GPT-3 to suggest goals and SentenceBert embeddings to compute similarity between suggested goals and demonstrated behaviors
ELLM is applied to partially observed Markov decision processes
Textual observations can increase performance
Reward confusion matrix shows probability of column achievement when row achievement is unlocked
Policy parametrization for ELLM includes conditioning on embeddings of goals and state
Classification accuracy of LLM for Housekeep tasks
