

  • Reinforcement learning algorithms have difficulty without a dense, well-shaped reward function.
  • Intrinsically motivated exploration methods reward agents for visiting novel states or transitions, but are limited in large environments.
  • ELLM uses background knowledge from text corpora to shape exploration.
  • ELLM rewards agents for achieving goals suggested by a language model.
  • ELLM guides agents toward human-meaningful and useful behaviors without requiring a human in the loop.
  • ELLM is evaluated in the Crafter game environment and the Housekeep robotic simulator.
  • ELLM-trained agents have better coverage of common-sense behaviors and usually match or improve performance on downstream tasks.

Paper Content


  • Reinforcement learning algorithms require rewards to incentivize progress.
  • Intrinsically motivated RL methods use novelty, surprise, uncertainty, or prediction errors as rewards.
  • Not all novelty is useful.
  • Exploring with LLMs (ELLM) uses large, pretrained language models to suggest useful goals.
  • ELLM yields meaningful exploratory rewards in two challenging domains.
  • Intrinsically motivated RL algorithms explore outcomes rather than actions
  • Knowledge-based IMs focus on maximizing the diversity of states
  • Competence-based IMs maximize the diversity of skills mastered by the agent
  • Representing goals in language unlocks the possibility of using text representations and generative models of text
  • ELLM uses pretrained LLMs to constrain exploration towards plausibly useful goals
  • LLM reward scheme rewards the agent for the similarity between the captioned transition and the goals
  • CB-IM algorithms train a goal-conditioned policy to maximize R_int
  • Measure final performance on the original task defined by R either during training or after a fine-tuning phase
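
The similarity-based reward described above can be illustrated with a minimal sketch. It assumes the intrinsic reward is the best cosine similarity between the caption of the current transition and any suggested goal, zeroed below a threshold. The `embed` function here is a toy bag-of-words stand-in for a real sentence encoder, and the `threshold` value is an illustrative assumption, not a value taken from the paper.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def intrinsic_reward(caption, goals, threshold=0.5):
    """Best caption-to-goal similarity, or 0.0 if it does not exceed the threshold."""
    if not goals:
        return 0.0
    caption_vec = embed(caption)
    best = max(cosine(caption_vec, embed(goal)) for goal in goals)
    return best if best > threshold else 0.0


# Hypothetical captions and goals, for illustration only.
reward_hit = intrinsic_reward("chop tree", ["chop tree", "drink water"])
reward_miss = intrinsic_reward("eat cow", ["chop tree"])
```

Thresholding keeps weakly related captions from leaking small rewards into training; the right cutoff depends on the encoder and would need tuning.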

Implementation details

  • ELLM algorithm is summarized in Algorithm 1 and Figure 1
  • Impose novelty bias by filtering out LM suggestions already achieved in same episode
  • Two forms of agent training: goal-conditioned and goal-free
  • Pixel observations combined with embedded language-state captions perform better
  • DQN algorithm used for training
  • Test robustness of method with variant of ELLM using learned captioner trained on human descriptions
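
The novelty filter mentioned above could look roughly like the following sketch. It assumes goals are compared as normalized strings; `filter_novel_goals` is a hypothetical helper name, and a real system might instead compare embeddings.

```python
def filter_novel_goals(suggested, achieved_this_episode):
    """Drop suggested goals the agent already achieved in the current episode.

    Assumption: goals match when their lowercased, stripped strings are equal.
    """
    achieved = {goal.strip().lower() for goal in achieved_this_episode}
    return [goal for goal in suggested if goal.strip().lower() not in achieved]


# Hypothetical suggestions and episode history, for illustration only.
remaining = filter_novel_goals(
    ["Chop tree", "drink water", "place table"],
    ["chop tree"],
)
```

Resetting the achieved set at the start of each episode keeps the bias per-episode, so a goal becomes rewardable again in the next episode.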


  • Tests hypotheses H1 and H2
  • Evaluates ELLM in two complex environments


  • Tested ELLM in the Crafter environment, a 2D version of Minecraft
  • Crafter has an achievement tree with prerequisites
  • Modified the game by augmenting the action space and reducing the amount of wood required to craft a table
  • Used Codex as the LLM with the open-ended suggestion generation variant of ELLM
  • Measured exploration quality as the average number of unique achievements per episode
  • ELLM learns to unlock about 6 achievements every episode
  • Rewards the agent for achieving any goal suggested by the LLM using a similarity-based reward function
  • Rewards the agent for maximizing a form of novelty estimated by the prediction error of a model
  • Pretrained agents outperform the unconditioned ones
  • Goal-conditioned ELLM and RND stand out as the best-performing methods
  • Tested the robustness of ELLM to diverse and imperfect captions
  • ELLM performance is overall robust to this imperfect captioner
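
The exploration metric above (average number of unique achievements per episode) is simple to compute. The sketch below assumes each episode is logged as a list of achievement names, possibly with repeats; the log format and names are illustrative assumptions.

```python
def mean_unique_achievements(episodes):
    """Average count of distinct achievements unlocked per episode."""
    if not episodes:
        return 0.0
    return sum(len(set(episode)) for episode in episodes) / len(episodes)


# Hypothetical episode logs: repeats within an episode count once.
logs = [
    ["collect wood", "collect wood", "place table"],
    ["collect wood"],
]
score = mean_unique_achievements(logs)
```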


  • Housekeep is an embodied robotics environment where the agent is tasked with cleaning up a house by rearranging misplaced objects
  • The agent must match the environment’s ground truth correct mapping of objects to receptacles without direct instructions
  • The task consists of 4 different scenes with one room each, each with 5 different misplaced objects and a suite of different possible receptacles
  • The game reward’s rearrangement success rate is used as a measure of exploration quality
  • The agent operates with low-level actions: moving forward, turning, looking up or down, and picking or placing an object
  • LLM accuracy at identifying mismatches is above 87%, but accuracy of identifying matches varies greatly
  • Pretraining and finetuning leads to higher success rates during pretraining
  • Directly finetuning the pretrained model on the ground truth correct rearrangement matches or outperforms the baselines
  • Directly training a new agent on the downstream task using the frozen pretrained model as an exploratory actor matches or outperforms all baselines
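
Because the LLM's accuracy differs between correct placements (matches) and incorrect ones (mismatches), it helps to report the two separately. The sketch below shows one way to do that, assuming boolean labels where `True` means the object-receptacle pair is a correct placement; the function name and data are hypothetical.

```python
def per_class_accuracy(truth, predicted):
    """Accuracy computed separately for matches (True) and mismatches (False)."""
    counts = {True: [0, 0], False: [0, 0]}  # label -> [correct, total]
    for t, p in zip(truth, predicted):
        counts[t][1] += 1
        counts[t][0] += int(t == p)
    return {
        ("match" if label else "mismatch"): (correct / total if total else None)
        for label, (correct, total) in counts.items()
    }


# Hypothetical labels, for illustration only.
truth = [True, True, False, False, False]
predicted = [True, False, False, False, False]
accuracy = per_class_accuracy(truth, predicted)
```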

Conclusions and discussion

  • ELLM is an intrinsic motivation method that biases exploration towards common-sense and useful behaviors.
  • ELLM focuses exploration on common-sensical goals.
  • ELLM requires state and transition captions.
  • Textual observations increase performance in all conditions.
  • ELLM can be used to suggest plausible visual goals.

B. Crafter downstream training

  • Place Crafting Table
  • Attack Cow
  • Make Wood Sword
  • Mine Stone
  • Deforestation
  • Gardening
  • Plant Row

C. Crafter env modifications

  • The default action space contains a “do” action whose effect depends on the object it is applied to
  • Modified the action space to make the exploration problem harder by splitting “do” into more precise action-object combinations
  • The action space now contains 260 possible actions
  • Human priors used to disallow invalid combinations
  • 6/10 good actions, 6/21 rewarded actions, 7/15 good actions, 7/51 all actions suggested in prompt
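
To make the modified action space concrete, the sketch below expands a generic interaction into verb-object combinations and uses a hand-written whitelist to reject invalid ones. The verb and noun lists and the whitelist are hypothetical and far smaller than the 260 actions described above.

```python
from itertools import product

# Hypothetical vocabulary; the real environment's lists are not reproduced here.
VERBS = ["chop", "drink", "attack"]
NOUNS = ["tree", "water", "cow"]

# Human prior (assumption): only these verb-object pairs have any effect.
VALID_PAIRS = {("chop", "tree"), ("drink", "water"), ("attack", "cow")}


def build_action_space(verbs, nouns):
    """Enumerate every verb-object combination as a discrete action."""
    return [f"{verb} {noun}" for verb, noun in product(verbs, nouns)]


def is_allowed(action, valid_pairs):
    """Return True if the action's verb-object pair is in the whitelist."""
    verb, noun = action.split(" ", 1)
    return (verb, noun) in valid_pairs


actions = build_action_space(VERBS, NOUNS)
allowed = [a for a in actions if is_allowed(a, VALID_PAIRS)]
```

The agent still has to choose among all combinations, which is what makes exploration harder; the whitelist only determines which choices do anything.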

E. Housekeep tasks

  • Housekeep benchmark features a variety of household scenes and episodes
  • Ground truth correct object-receptacle placements determined by humans
  • RL pretraining focuses on first 4 tasks with 5 misplaced objects per task

G. Algorithmic details

  • Use DQN, double Q-learning, dueling networks, and multi-step learning
  • Take in 84x84 images encoded with Nature Atari CNN
  • Pass image through linear layer to output 512 dimensional vector
  • Compute language embedding of state caption/goals using SBERT model
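
The three DQN extensions listed above are standard and can each be written in a few lines. The following is a framework-free sketch of the textbook definitions, not the authors' code; the function names and the default `gamma` are assumptions.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Multi-step target: discounted sum of n rewards plus a discounted bootstrap."""
    target = bootstrap_value
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


def double_q_bootstrap(online_q_next, target_q_next):
    """Double Q-learning: the online net picks the action, the target net scores it."""
    best_action = max(range(len(online_q_next)), key=online_q_next.__getitem__)
    return target_q_next[best_action]


def dueling_q(value, advantages):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [value + adv - mean_advantage for adv in advantages]


# Small worked examples with made-up numbers.
target = n_step_return([1.0, 1.0], bootstrap_value=0.0, gamma=0.5)
bootstrap = double_q_bootstrap([0.1, 0.9], [5.0, 2.0])
q_values = dueling_q(1.0, [1.0, 3.0])
```

In a full agent the value from `double_q_bootstrap` would be passed as `bootstrap_value` to `n_step_return`, and the Q-values themselves would come from a dueling head.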

J. Code and compute

  • Code to be released under MIT license and other licenses
  • Use OpenAI’s APIs for LLM access
  • Experiments with GPT-3 models led to degraded performance
  • Codex is free and Davinci is priced at $0.02/1000 tokens
  • Caching helps reduce API queries
  • Each API query takes 0.02 seconds
  • 100 GPUs used for pretraining
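
Caching of LLM queries, as noted above, can be as simple as a dictionary keyed by prompt. The sketch below uses a stand-in `fake_llm` function instead of a real API call; `CachedSuggester` and `fake_llm` are hypothetical names introduced for illustration.

```python
class CachedSuggester:
    """Wrap a prompt -> suggestions function and reuse answers for repeated prompts."""

    def __init__(self, query_fn):
        self._query_fn = query_fn
        self._cache = {}
        self.api_calls = 0  # counts only calls that reach the underlying function

    def suggest(self, prompt):
        if prompt not in self._cache:
            self.api_calls += 1
            self._cache[prompt] = self._query_fn(prompt)
        return self._cache[prompt]


def fake_llm(prompt):
    """Stand-in for a remote LLM query (assumption: returns a list of goal strings)."""
    return ["chop tree"] if "tree" in prompt else ["explore"]


suggester = CachedSuggester(fake_llm)
first = suggester.suggest("You see: tree, grass")
second = suggester.suggest("You see: tree, grass")
```

Caching only pays off when prompts repeat exactly, so it helps to build the prompt from a canonical description of the state, for example a sorted list of visible objects.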

K. Societal impact

  • LLMs have been shown to have impressive capabilities, but can also be prone to harmful biases and stereotypes
  • If used in RL, it is necessary to understand and mitigate any negative behaviors that can be learned
  • More careful study is necessary if deployed to real world
  • Mitigations for ELLM include filtering LLM generations, prompting the LM with guidelines, and using closed-form ELLM with constrained goal spaces
  • ELLM uses a pretrained large language model to suggest goals in a task-agnostic way
  • ELLM uses GPT-3 to suggest goals and SentenceBERT embeddings to compute similarity between suggested goals and demonstrated behaviors
  • ELLM is applied to partially observed Markov decision processes
  • Textual observations can increase performance
  • Reward confusion matrix shows probability of column achievement when row achievement is unlocked
  • Policy parametrization for ELLM includes conditioning on embeddings of goals and state
  • Classification accuracy of LLM for Housekeep tasks