Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Autonomous agents have made progress in specialist domains
  • Humans learn and adapt in the open world
  • Three ingredients for building generalist agents: environment, knowledge base, and agent architecture
  • MineDojo is a new framework built on Minecraft with thousands of tasks and an internet-scale knowledge base
  • Novel agent learning algorithm uses large pre-trained video-language models as a learned reward function
  • Agent is able to solve a variety of open-ended tasks specified in free-form language without a manually designed dense reward

Paper Content

Introduction

  • Developing autonomous embodied agents to attain human-level performance is a long-standing goal for AI research
  • Progress has been made in games and robotics
  • Agents are typically trained tabula rasa in isolated worlds with limited complexity and diversity
  • Humans inhabit an infinitely rich reality and can leverage large amounts of prior knowledge
  • MINEDOJO is a framework to develop open-ended, generally capable agents
  • MINEDOJO features a benchmarking suite with thousands of diverse open-ended tasks
  • MINEDOJO also provides an internet-scale, multimodal knowledge base
  • MINEDOJO’s simulator provides unified observation and action spaces
  • MINEDOJO’s tasks are divided into programmatic and creative tasks
  • Programmatic tasks can be automatically assessed
  • Creative tasks do not have well-defined success criteria
  • MINEDOJO uses a learned model to evaluate creative tasks

Task suite I: Programmatic tasks

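  • Each Programmatic task is formalized as a 5-tuple: goal, guidance, initial conditions, success criterion, and optional dense reward function
  • Detailed guidance is generated with OpenAI’s GPT-3-davinci API
  • Initial conditions specify the starting state of the agent and the world
  • Success criterion is a deterministic function
  • Dense reward function is optional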
  • 4 categories of programmatic tasks with 1,581 template-generated natural language goals
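
The 5-tuple above can be pictured as a small record type. The following is a minimal sketch, not MineDojo's actual API: the class name `ProgrammaticTask`, the dictionary-based `State` type, and the `harvest_task` template helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical state type: a flat dict of item counts, used only for illustration.
State = Dict[str, int]


@dataclass(frozen=True)
class ProgrammaticTask:
    goal: str                          # natural-language goal
    guidance: str                      # detailed hints, e.g. generated by a language model
    initial_conditions: State          # starting state of the agent and the world
    success: Callable[[State], bool]   # deterministic success criterion
    # Optional dense reward, computed from the previous and current state.
    dense_reward: Optional[Callable[[State, State], float]] = None


def harvest_task(item: str, quantity: int) -> ProgrammaticTask:
    """Fill a (hypothetical) template to create a harvest-style task."""
    return ProgrammaticTask(
        goal=f"harvest {quantity} {item}",
        guidance=f"Find a source of {item} and collect it.",
        initial_conditions={item: 0},
        success=lambda s: s.get(item, 0) >= quantity,
        dense_reward=lambda prev, cur: float(cur.get(item, 0) - prev.get(item, 0)),
    )


task = harvest_task("wool", 2)
print(task.goal)
print(task.success({"wool": 1}), task.success({"wool": 2}))
```

Because the success criterion is an ordinary deterministic function of the state, tasks built this way can be assessed automatically, which is what distinguishes them from Creative tasks.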

Task suite II: Creative tasks

  • Creative tasks are defined as a 3-tuple of goal, guidance, and initial conditions, with no success criterion or dense reward function
  • A novel task evaluation metric is designed based on a pre-trained contrastive video-language model
  • Human evaluations show high agreement with the learned metric
  • 216 Creative tasks are manually authored
  • The rest of the 1,560 Creative tasks are generated through two systematic approaches
  • Approach 1 mines tasks from YouTube tutorial videos
  • Approach 2 uses GPT-3 to generate new task ideas
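
Since Creative tasks have no programmatic success check, a learned scorer has to stand in for one. The sketch below shows only the general shape of that idea. The `CreativeTask` class, the `judge_success` helper, the `threshold` value, and the `toy_scorer` stand-in are assumptions made for illustration; a real scorer would be a pre-trained video-language model, and how its score is turned into a success judgement is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class CreativeTask:
    goal: str                           # free-form natural-language goal
    guidance: str                       # optional hints for the agent
    initial_conditions: Dict[str, int]  # starting state of the agent and the world


# A scorer maps (video frames, goal text) to a similarity score.
Scorer = Callable[[Sequence[str], str], float]


def judge_success(task: CreativeTask, frames: Sequence[str],
                  scorer: Scorer, threshold: float = 0.5) -> bool:
    """Declare success when the learned score reaches an assumed threshold."""
    return scorer(frames, task.goal) >= threshold


def toy_scorer(frames: Sequence[str], goal: str) -> float:
    """Stand-in scorer: fraction of frame labels that appear in the goal text."""
    if not frames:
        return 0.0
    hits = sum(1 for label in frames if label in goal)
    return hits / len(frames)


task = CreativeTask("build a nether portal", "Use obsidian blocks.", {})
print(judge_success(task, ["portal", "portal", "tree"], toy_scorer))
```

Swapping `toy_scorer` for a real model changes nothing in `judge_success`, which is the point of keeping the scorer behind a plain function interface.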

Internet-scale knowledge base

  • Two common approaches to training embodied agents are RL with reward functions and learning from human demonstrations
  • Crafting reward functions is challenging for the task suite
  • Turn to the open web as an ever-growing source of learning material
  • Harvest domain knowledge by web scraping and filtering
  • Collect 33 years' worth of YouTube videos, 6K+ Wiki pages, and millions of Reddit comment threads
  • Language is a key component of the database
  • Take special measures to filter out low-quality and toxic content
  • 730K+ narrated Minecraft videos, 2.2B words in English transcripts
  • 6,735 Wiki pages with text, images, tables, and diagrams
  • 340K+ posts and 6.6M comments under the “r/Minecraft” subreddit

Agent learning with large-scale pre-training

  • Grand challenge of embodied AI is to build a single agent that can complete a wide range of open-world tasks
  • MINEDOJO framework aims to facilitate new techniques towards this goal
  • Initial step towards this goal is to develop a proof of concept that demonstrates how a single language-prompted agent can be trained in MINEDOJO to complete several complex Minecraft tasks
  • Novel agent learning algorithm takes advantage of massive YouTube data offered by MINEDOJO
  • Multi-task reinforcement learning setting, where agent is tasked with completing a collection of MINEDOJO tasks specified by language instructions
  • Agents developed in popular RL benchmarks rely on dense and task-specific reward functions to guide random exploration
  • MINECLIP is proposed to learn a dense, language-conditioned reward function from in-the-wild YouTube videos and their transcripts
  • MINECLIP serves as an automatic evaluation metric for Creative tasks that lack a simple success criterion
  • MINECLIP eliminates the need to manually engineer reward functions for each and every MINEDOJO task
  • MINECLIP also provides a high-quality reward signal without any domain adaptation techniques
  • Several techniques introduced to significantly improve RL training efficiency
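
The idea of a dense, language-conditioned reward can be illustrated with a minimal sketch. It assumes that embeddings for recent video frames and for the task prompt have already been produced by some contrastive video-language model. The function names, the mean pooling over frames, and the use of raw cosine similarity as the reward are simplifications chosen for illustration, not the paper's exact formulation.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pool(frame_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame embeddings into a single video embedding."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]


def language_reward(frame_embeddings: Sequence[Sequence[float]],
                    text_embedding: Sequence[float]) -> float:
    """Reward is high when recent frames match the language prompt."""
    return cosine_similarity(mean_pool(frame_embeddings), text_embedding)


# Toy 2-D embeddings: the video points along the first axis.
video = [[1.0, 0.0], [0.8, 0.0]]
print(language_reward(video, [1.0, 0.0]))  # aligned prompt
print(language_reward(video, [0.0, 1.0]))  # unrelated prompt
```

Because the reward depends only on the prompt embedding and the observed frames, the same function serves every task that can be described in language, with no per-task reward engineering.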

Experiments

  • Evaluated agent-learning approach on 8 Programmatic tasks and 4 Creative tasks from MINEDOJO benchmarking suite
  • Split tasks into 3 groups and trained one multi-task agent for each group
  • Compared MINECLIP to manually written reward functions
  • MINECLIP is competitive with manual reward
  • MINECLIP outperforms hand-engineered reward functions in 3 tasks
  • MINECLIP dominates the Sparse-only baseline
  • The original OpenAI CLIP model fails to achieve any success
  • MINECLIP is an effective approach to solving open-ended tasks
  • MINECLIP shows good zero-shot generalization to significant visual distribution shift
  • Learned policy is more robust to visual changes than baseline
  • A single agent is also trained on all 12 tasks
  • Compared with the per-group agents, it shows a performance boost in 6 tasks, degradation in 4 tasks, and roughly the same success rates in 2 tasks
  • The agent generalizes to novel tasks with finetuning

Related work

  • Many environments have been developed for open-ended agent learning
  • Minecraft offers an exciting alternative for open-ended agent learning
  • Malmo platform is the first comprehensive release of a Gym-style agent API for Minecraft
  • MineRL provides a codebase and human play trajectories for the annual Diamond Challenge
  • MINEDOJO’s simulator builds upon MineRL
  • Internet-scale databases are used to learn open-vocabulary reward models
  • Pre-training is used to develop generally capable embodied agents

Conclusion

  • MINEDOJO is a framework for developing generally capable embodied agents
  • MINEDOJO has a benchmarking suite of thousands of Programmatic and Creative tasks
  • MINEDOJO has an internet-scale multimodal knowledge base of videos, wiki, and forum discussions
  • MINECLIP is an effective language-conditioned reward function trained with in-the-wild YouTube videos
  • MINECLIP has strong performance and agrees with human evaluation results
  • MINEDOJO is designed to facilitate the development of multi-tasking and continually learning agents
  • MINEDOJO’s observation space contains multiple modalities
  • MINEDOJO’s action space is compound and can be modelled in an autoregressive manner
  • Environments in the MINEDOJO simulator can be easily and flexibly customized
  • Programmatic tasks are constructed by filling manually written templates
  • Creative tasks are mined from YouTube tutorial videos, generated with the GPT-3 API, and manually brainstormed
  • Playthrough task is a special task that tests the agent’s ability to defeat the Ender Dragon
  • MINEDOJO’s databases are uploaded to zenodo.org
  • Videos are collected via the YouTube Data API
  • Wiki pages provide unstructured knowledge in multimodal tables, recipes, illustrations, and step-by-step tutorials
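
The note above that the compound action space can be modelled autoregressively means that each action component is sampled conditioned on the components chosen before it. The sketch below shows this for two components only. The component names, the probability tables, and the helper functions are hypothetical and do not reflect MineDojo's real action space.

```python
import math
import random
from typing import Dict, Tuple

Dist = Dict[str, float]


def sample_categorical(rng: random.Random, probs: Dist) -> str:
    """Draw one key from a dict of probabilities."""
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


def sample_compound_action(rng: random.Random, first: Dist,
                           second_given_first: Dict[str, Dist]) -> Tuple[str, str]:
    """Sample the first component, then the second conditioned on the first."""
    a1 = sample_categorical(rng, first)
    a2 = sample_categorical(rng, second_given_first[a1])
    return a1, a2


def joint_probability(first: Dist, second_given_first: Dict[str, Dist],
                      action: Tuple[str, str]) -> float:
    """Chain rule: p(a1, a2) = p(a1) * p(a2 | a1)."""
    a1, a2 = action
    return first[a1] * second_given_first[a1][a2]


# Hypothetical tables: a movement component, then a functional component.
movement = {"forward": 0.7, "noop": 0.3}
functional = {
    "forward": {"attack": 0.2, "noop": 0.8},
    "noop": {"attack": 0.5, "noop": 0.5},
}

rng = random.Random(0)
action = sample_compound_action(rng, movement, functional)
print(action, joint_probability(movement, functional, action))
```

Factoring the joint distribution this way keeps each sampling step small, at the cost of fixing an order over the components.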