Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Autonomous agents have made progress in specialist domains
Humans learn and adapt in the open world
Three ingredients for building generalist agents: environment, knowledge base, and agent architecture
MINEDOJO is a new framework built on Minecraft with thousands of tasks and an internet-scale knowledge base
Novel agent learning algorithm uses pre-trained video-language models as a learned reward function
Agent is able to solve open-ended tasks without a manually designed reward

Developing autonomous embodied agents to attain human-level performance is a long-standing goal for AI research
Progress has been made in games and robotics
Agents are typically trained tabula rasa in isolated worlds with limited complexity and diversity
Humans inhabit an infinitely rich reality and can leverage large amounts of prior knowledge
MINEDOJO is a framework to develop open-ended, generally capable agents
MINEDOJO features a benchmarking suite with thousands of diverse open-ended tasks
MINEDOJO also provides an internet-scale, multimodal knowledge base
MINEDOJO’s simulator provides unified observation and action spaces
MINEDOJO’s tasks are divided into Programmatic and Creative tasks
Programmatic tasks can be automatically assessed
Creative tasks do not have well-defined success criteria
MINEDOJO uses a learned model to evaluate creative tasks

Formalize each Programmatic task as a 5-tuple
Detailed guidance is generated with OpenAI’s GPT-3-davinci API
Initial conditions of the agent and the world
Success criterion is a deterministic function
Optional dense reward function
4 categories of Programmatic tasks with 1,581 template-generated natural language goals
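
The 5-tuple above can be pictured as a small record type. The sketch below is illustrative only: the class name `ProgrammaticTask`, its field names, the template string, the initial conditions, and the inventory-based success check are all assumptions made for this example, not MINEDOJO's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical world state for illustration: item name -> count in the inventory.
Inventory = Dict[str, int]


@dataclass
class ProgrammaticTask:
    """Illustrative 5-tuple: goal, guidance, initial conditions, success, reward."""
    goal: str                                 # template-generated English goal
    guidance: str                             # detailed guidance text
    initial_conditions: Dict[str, object]     # agent and world setup
    success: Callable[[Inventory], bool]      # deterministic success criterion
    dense_reward: Optional[Callable[[Inventory], float]] = None  # optional


def make_harvest_task(item: str, quantity: int) -> ProgrammaticTask:
    """Fill a hand-written template to produce one task (template is made up)."""
    return ProgrammaticTask(
        goal=f"obtain {quantity} {item}",
        guidance="",  # left empty here; the notes say guidance comes from GPT-3
        initial_conditions={"biome": "plains", "inventory": {}},
        success=lambda inv: inv.get(item, 0) >= quantity,
        dense_reward=lambda inv: min(inv.get(item, 0), quantity) / quantity,
    )


task = make_harvest_task("wool", 2)
print(task.goal)                   # obtain 2 wool
print(task.success({"wool": 1}))   # False
print(task.success({"wool": 2}))   # True
```

Because the success check is a pure function of the observed state, the same state always yields the same verdict, which is what makes Programmatic tasks automatically assessable.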

Creative tasks are defined as a 3-tuple
A novel task evaluation metric is designed based on a pre-trained contrastive video-language model
Human evaluations show high agreement with the learned metric
Of the 1,560 Creative tasks, 216 are manually authored
The remaining Creative tasks are generated through two systematic approaches
Approach 1 mines tasks from YouTube tutorial videos
Approach 2 uses GPT-3 to generate new task ideas
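
A Creative task can be sketched the same way, with the success check replaced by a learned score. These notes do not list the three components, so the sketch assumes they are goal, guidance, and initial conditions. The names `CreativeTask`, `evaluate`, and `toy_scorer`, the threshold, and the use of text captions in place of video frames are hypothetical stand-ins; a real evaluator would be a video-language model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence


@dataclass
class CreativeTask:
    """Assumed 3-tuple: goal, guidance, initial conditions (no success function)."""
    goal: str
    guidance: str = ""
    initial_conditions: Dict[str, object] = field(default_factory=dict)


def toy_scorer(captions: Sequence[str], goal: str) -> float:
    """Stand-in for a learned model: fraction of captions sharing a word with the goal."""
    goal_words = set(goal.lower().split())
    if not captions:
        return 0.0
    hits = sum(1 for c in captions if goal_words & set(c.lower().split()))
    return hits / len(captions)


def evaluate(task: CreativeTask,
             captions: Sequence[str],
             scorer: Callable[[Sequence[str], str], float],
             threshold: float) -> bool:
    """Declare success when the learned score clears an (assumed) threshold."""
    return scorer(captions, task.goal) >= threshold


task = CreativeTask(goal="build a house")
captions = ["player places house wall", "player walks", "player adds roof"]
print(evaluate(task, captions, toy_scorer, threshold=0.3))  # True
```

The point of the design is that the verdict comes from a score rather than a hand-written rule, so the quality of the evaluation depends on how well the scorer agrees with human judgment.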

Two approaches to training embodied agents: RL with reward functions, or imitation of human demonstrations
Crafting reward functions is challenging for the task suite
Turn to the open web as an ever-growing source of learning material
Harvest domain knowledge by web scraping and filtering
Collect 33 years' worth of YouTube video footage, 6K+ Wiki pages, and millions of Reddit comment threads
Language is a key component of the database
Take special measures to filter out low-quality and toxic content
730K+ narrated Minecraft videos, 2.2B words in English transcripts
6,735 Wiki pages with text, images, tables, and diagrams
340K+ posts and 6.6M comments under the “r/Minecraft” subreddit

Grand challenge of embodied AI is to build a single agent that can complete a wide range of open-world tasks
MINEDOJO framework aims to facilitate new techniques towards this goal
Initial step towards this goal is to develop a proof of concept that demonstrates how a single language-prompted agent can be trained in MINEDOJO to complete several complex Minecraft tasks
Novel agent learning algorithm takes advantage of massive YouTube data offered by MINEDOJO
Multi-task reinforcement learning setting, where the agent is tasked with completing a collection of MINEDOJO tasks specified by language instructions
Agents developed in popular RL benchmarks rely on dense and task-specific reward functions to guide random exploration
MINECLIP is proposed to learn a dense, language-conditioned reward function from in-the-wild YouTube videos and their transcripts
MINECLIP serves as an automatic evaluation metric for Creative tasks that lack a simple success criterion
MINECLIP eliminates the need to manually engineer reward functions for each and every MINEDOJO task
MINECLIP also provides a high-quality reward signal without any domain adaptation techniques
Several techniques introduced to significantly improve RL training efficiency
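
One way a contrastive video-language model can act as a language-conditioned reward is sketched below. It is a minimal sketch under stated assumptions, not MINECLIP's implementation: the embeddings are toy vectors, and the choice to score the goal prompt against negative prompts with a softmax and subtract the chance level `1/N` is an assumption made for this example, as are the function names and the temperature value.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def language_reward(video_emb: Sequence[float],
                    prompt_embs: List[Sequence[float]],
                    goal_index: int = 0,
                    temperature: float = 0.1) -> float:
    """Reward = probability mass on the goal prompt above chance, floored at 0.

    prompt_embs holds the goal prompt plus negative prompts (assumed setup).
    """
    logits = [cosine(video_emb, p) / temperature for p in prompt_embs]
    peak = max(logits)                           # subtract max for stability
    exps = [math.exp(x - peak) for x in logits]
    p_goal = exps[goal_index] / sum(exps)
    return max(p_goal - 1.0 / len(prompt_embs), 0.0)


prompts = [[1.0, 0.0], [0.0, 1.0]]               # goal prompt, one negative prompt
print(language_reward([1.0, 0.0], prompts) > 0.49)   # True: clip matches the goal
print(language_reward([0.0, 1.0], prompts))          # 0.0: clip matches the negative
```

The reward is dense because every short clip of agent behaviour gets a score, and it is language-conditioned because changing the goal prompt changes the score without writing a new reward function.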

Evaluated agent-learning approach on 8 Programmatic tasks and 4 Creative tasks from MINEDOJO benchmarking suite
Split tasks into 3 groups and trained one multi-task agent for each group
Compared MINECLIP to manually written reward functions
MINECLIP is competitive with manual reward
MINECLIP outperforms hand-engineered reward functions in 3 tasks
MINECLIP dominates Sparse-only baseline
Original OpenAI CLIP model fails to achieve any success
MINECLIP is an effective approach to solving open-ended tasks
MINECLIP shows good zero-shot generalization to significant visual distribution shift
Learned policy is more robust to visual changes than baseline
Trained single agent for all 12 tasks
Performance boost in 6 tasks, degradation in 4 tasks, and roughly the same success rates in 2 tasks
Agent generalizes to novel tasks with finetuning

Many environments have been developed for open-ended agent learning
Minecraft offers an exciting alternative for open-ended agent learning
Malmo platform is the first comprehensive release of a Gym-style agent API for Minecraft
MineRL provides a codebase and human play trajectories for the annual Diamond Challenge
MINEDOJO’s simulator builds upon MineRL
Internet-scale databases are used to learn open-vocabulary reward models
Pre-training is used to develop generally capable embodied agents

MINEDOJO is a framework for developing generally capable embodied agents
MINEDOJO has a benchmarking suite of thousands of Programmatic and Creative tasks
MINEDOJO has an internet-scale multimodal knowledge base of videos, wiki, and forum discussions
MINECLIP is an effective language-conditioned reward function trained with in-the-wild YouTube videos
MINECLIP has strong performance and agrees with human evaluation results
MINEDOJO is designed to facilitate the development of multi-tasking and continually learning agents
MINEDOJO’s observation space contains multiple modalities
MINEDOJO’s action space is compound and can be modelled in an autoregressive manner
Environments in the MINEDOJO simulator can be easily and flexibly customized
Programmatic tasks are constructed by filling manually written templates
Creative tasks are collected from YouTube tutorial videos, generated with the GPT-3 API, and manually brainstormed
Playthrough task is a special task that tests the agent’s ability to defeat the Ender dragon
MINEDOJO’s databases are uploaded to zenodo.org
Videos are collected through the YouTube Data API
Wiki pages provide unstructured knowledge in multimodal tables, recipes, illustrations, and step-by-step tutorials
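
The note above on the compound action space being modelled autoregressively can be illustrated with a two-step sampler: pick the action type first, then pick its argument conditioned on that type. This is a sketch under assumptions: the action names, argument sets, and the function `sample_action` are hypothetical and not MINEDOJO's real action space, and a learned policy would produce both distributions from a network rather than from fixed weights and a uniform choice.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical action types, each with its own argument set (None = no argument).
ARGUMENTS: Dict[str, Optional[List[str]]] = {
    "noop": None,
    "attack": None,
    "craft": ["planks", "stick", "torch"],
    "equip": ["sword", "pickaxe"],
}


def sample_action(rng: random.Random,
                  type_weights: Dict[str, float]) -> Tuple[str, Optional[str]]:
    """Sample p(type), then p(argument | type), instead of one flat joint draw."""
    types = list(ARGUMENTS)
    weights = [type_weights[t] for t in types]
    action_type = rng.choices(types, weights=weights, k=1)[0]
    options = ARGUMENTS[action_type]
    # The second draw only considers arguments valid for the chosen type.
    argument = rng.choice(options) if options else None
    return action_type, argument


rng = random.Random(0)
weights = {"noop": 1.0, "attack": 1.0, "craft": 2.0, "equip": 1.0}
action_type, argument = sample_action(rng, weights)
print(action_type, argument)
```

Factoring the draw this way keeps invalid combinations, such as an argument for `noop`, out of the sample space entirely.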