Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Autonomous agents have made progress in specialist domains
- Humans learn and adapt in the open world
- Three ingredients for building generalist agents: environment, knowledge base, and agent architecture
- MineDojo is a new framework built on Minecraft with thousands of tasks and an internet-scale knowledge base
- Novel agent learning algorithm uses a large pre-trained video-language model as a learned reward function
- Agent is able to solve a variety of open-ended, language-specified tasks without a manually designed dense shaping reward
Paper Content
Introduction
- Developing autonomous embodied agents to attain human-level performance is a long-standing goal for AI research
- Progress has been made in games and robotics
- Agents are typically trained tabula rasa in isolated worlds with limited complexity and diversity
- Humans inhabit an infinitely rich reality and can leverage large amounts of prior knowledge
- MINEDOJO is a framework to develop open-ended, generally capable agents
- MINEDOJO features a benchmarking suite with thousands of diverse open-ended tasks
- MINEDOJO also provides an internet-scale, multimodal knowledge base
- MINEDOJO’s simulator provides unified observation and action spaces
- MINEDOJO’s tasks are divided into programmatic and creative tasks
- Programmatic tasks can be automatically assessed
- Creative tasks do not have well-defined success criteria
- MINEDOJO uses a learned model to evaluate creative tasks
Task suite I: programmatic tasks
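- Formalize each programmatic task as a 5-tuple: goal, guidance, initial conditions, success criterion, and dense reward function
- Guidance: natural language hints; OpenAI’s GPT-3-davinci API is leveraged to generate detailed guidance
- Initial conditions: the starting state of the agent and the world
- Success criterion: a deterministic function of the world state
- Dense reward function: optional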
- 4 categories of programmatic tasks with 1,581 template-generated natural language goals
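The 5-tuple above can be pictured as a small record type. The sketch below is a minimal illustration, not MINEDOJO's actual API: the class name `ProgrammaticTask`, its field names, the toy dictionary world state, and the helper `make_harvest_task` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical world state: a flat mapping from item name to inventory count.
WorldState = Dict[str, int]


@dataclass(frozen=True)
class ProgrammaticTask:
    """Illustrative container for the 5-tuple; all names are assumptions."""
    goal: str                                  # natural language goal
    guidance: Optional[str]                    # natural language hints
    initial_conditions: WorldState             # starting agent/world state
    success: Callable[[WorldState], bool]      # deterministic success criterion
    dense_reward: Optional[Callable[[WorldState], float]] = None  # optional


def make_harvest_task(item: str, quantity: int) -> ProgrammaticTask:
    """Fill a toy goal template of the form 'harvest {quantity} {item}'."""
    return ProgrammaticTask(
        goal=f"harvest {quantity} {item}",
        guidance=None,
        initial_conditions={item: 0},
        # Success is a pure function of the state: same state, same answer.
        success=lambda state: state.get(item, 0) >= quantity,
        # Toy dense reward: fraction of the target quantity collected so far.
        dense_reward=lambda state: min(state.get(item, 0), quantity) / quantity,
    )


task = make_harvest_task("wool", 2)
print(task.goal)
print(task.success({"wool": 1}), task.success({"wool": 2}))
```

Keeping the success criterion a pure function of the world state is what allows such tasks to be assessed automatically.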
Task suite II: creative tasks
- Creative tasks are defined as a 3-tuple (goal, guidance, initial conditions), with no success criterion or dense reward function
- A novel task evaluation metric is designed based on a pre-trained contrastive video-language model
- Human evaluations show high agreement with the learned metric
- 216 Creative tasks are manually authored
- The suite totals 1,560 Creative tasks; those not manually authored are generated through two systematic approaches
- Approach 1 mines tasks from YouTube tutorial videos
- Approach 2 uses GPT-3 to generate new task ideas
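Agreement between a learned metric and human judgment, as mentioned above, can be quantified in several ways. The sketch below uses an F1 score between binary human labels and thresholded metric scores. It is an illustration only: the function names, the 0.5 threshold, and the sample data are assumptions, and it is not claimed to reproduce the paper's evaluation protocol.

```python
from typing import List, Sequence


def threshold_scores(scores: Sequence[float], threshold: float) -> List[bool]:
    """Turn continuous metric scores into binary success predictions."""
    return [score >= threshold for score in scores]


def f1_agreement(human: Sequence[bool], predicted: Sequence[bool]) -> float:
    """F1 score of predicted success labels against human success labels."""
    if len(human) != len(predicted):
        raise ValueError("label sequences must have the same length")
    true_pos = sum(1 for h, p in zip(human, predicted) if h and p)
    false_pos = sum(1 for h, p in zip(human, predicted) if (not h) and p)
    false_neg = sum(1 for h, p in zip(human, predicted) if h and (not p))
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Made-up example data: five episodes judged by a human and by a metric.
human_labels = [True, True, False, False, True]
metric_scores = [0.9, 0.7, 0.2, 0.6, 0.4]
predicted_labels = threshold_scores(metric_scores, threshold=0.5)
agreement = f1_agreement(human_labels, predicted_labels)
print(f"F1 agreement: {agreement:.3f}")
```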
Internet-scale knowledge base
- Two common approaches to training embodied agents are RL with reward functions and learning from human demonstrations
- Crafting reward functions is challenging for the task suite
- Turn to the open web as an ever-growing source of learning material
- Harvest domain knowledge by web scraping and filtering
- Collect 33 years' worth of YouTube videos, 6K+ Wiki pages, and millions of Reddit comment threads
- Language is a key component of the database
- Take special measures to filter out low-quality and toxic contents
- 730K+ narrated Minecraft videos, 2.2B words in English transcripts
- 6,735 Wiki pages with text, images, tables, and diagrams
- 340K+ posts and 6.6M comments under the “r/Minecraft” subreddit
Agent learning with large-scale pre-training
- Grand challenge of embodied AI is to build a single agent that can complete a wide range of open-world tasks
- MINEDOJO framework aims to facilitate new techniques towards this goal
- Initial step towards this goal is to develop a proof of concept that demonstrates how a single language-prompted agent can be trained in MINEDOJO to complete several complex Minecraft tasks
- Novel agent learning algorithm takes advantage of massive YouTube data offered by MINEDOJO
- Multi-task reinforcement learning setting, where agent is tasked with completing a collection of MINEDOJO tasks specified by language instructions
- Agents developed in popular RL benchmarks rely on dense and task-specific reward functions to guide random explorations
- MINECLIP is proposed to learn a dense, language-conditioned reward function from in-the-wild YouTube videos and their transcripts
- MINECLIP serves as an automatic evaluation metric for Creative tasks that lack a simple success criterion
- MINECLIP eliminates the need to manually engineer reward functions for each and every MINEDOJO task
- MINECLIP also provides a high-quality reward signal without any domain adaptation techniques
- Several techniques introduced to significantly improve RL training efficiency
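The idea of a language-conditioned reward derived from a contrastive video-language model can be sketched as follows: embed a short window of recent frames and the task prompt in a shared space, then use their similarity as the reward. This is a minimal sketch under stated assumptions, not the MINECLIP implementation: `embed_video` and `embed_text` are hypothetical toy stand-ins for learned encoders, the four-word vocabulary is invented, and raw cosine similarity is used as the reward purely for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]

# Hypothetical shared feature space: one dimension per concept.
VOCAB = ["sheep", "wool", "lava", "cow"]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embed_video(frames: Sequence[Sequence[float]]) -> Vector:
    """Toy video encoder (assumption): average the per-frame feature vectors."""
    count = len(frames)
    width = len(frames[0])
    return [sum(frame[i] for frame in frames) / count for i in range(width)]


def embed_text(prompt: str) -> Vector:
    """Toy text encoder (assumption): word counts over the tiny vocabulary."""
    words = prompt.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def language_reward(frames: Sequence[Sequence[float]], prompt: str) -> float:
    """Reward = similarity between recent frames and the language prompt."""
    return cosine_similarity(embed_video(frames), embed_text(prompt))


# Made-up frame features for two short clips.
sheep_clip = [[1.0, 0.8, 0.0, 0.0], [0.9, 1.0, 0.0, 0.1]]
lava_clip = [[0.0, 0.0, 1.0, 0.0], [0.1, 0.0, 0.9, 0.0]]
prompt = "shear sheep to obtain wool"
print(language_reward(sheep_clip, prompt) > language_reward(lava_clip, prompt))
```

Because the reward depends only on the frames and the prompt, the same function can score any task that can be described in language, which is what removes the need for per-task reward engineering.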
Experiments
- Evaluated agent-learning approach on 8 Programmatic tasks and 4 Creative tasks from MINEDOJO benchmarking suite
- Split tasks into 3 groups and trained one multi-task agent for each group
- Compared MINECLIP to manually written reward functions
- MINECLIP is competitive with manual reward
- MINECLIP outperforms hand-engineered reward functions in 3 tasks
- MINECLIP dominates Sparse-only baseline
- Original OpenAI CLIP model fails to achieve any success
- MINECLIP is an effective approach to solving open-ended tasks
- MINECLIP shows good zero-shot generalization to significant visual distribution shift
- Learned policy is more robust to visual changes than baseline
- Trained a single agent for all 12 tasks
- Performance boost in 6 tasks, degradation in 4 tasks, and roughly the same success rates in 2 tasks
- The agent generalizes to novel tasks with fine-tuning
Related work
- Many environments have been developed for open-ended agent learning
- Minecraft offers an exciting alternative for open-ended agent learning
- Malmo platform is the first comprehensive release of a Gym-style agent API for Minecraft
- MineRL provides a codebase and human play trajectories for the annual Diamond Challenge
- MINEDOJO’s simulator builds upon MineRL
- Internet-scale databases are used to learn open-vocabulary reward models
- Pre-training is used to develop generally capable embodied agents
Conclusion
- MINEDOJO is a framework for developing generally capable embodied agents
- MINEDOJO has a benchmarking suite of thousands of Programmatic and Creative tasks
- MINEDOJO has an internet-scale multimodal knowledge base of videos, wiki, and forum discussions
- MINECLIP is an effective language-conditioned reward function trained with in-the-wild YouTube videos
- MINECLIP has strong performance and agrees with human evaluation results
- MINEDOJO is designed to facilitate the development of multi-tasking and continually learning agents
- MINEDOJO’s observation space contains multiple modalities
- MINEDOJO’s action space is compound and can be modelled in an autoregressive manner
- Environments in the MINEDOJO simulator can be easily and flexibly customized
- Programmatic tasks are constructed by filling manually written templates
- Creative tasks are mined from YouTube tutorial videos, generated with the GPT-3 API, and manually brainstormed
- Playthrough task is a special task that tests the agent’s ability to defeat the Ender dragon
- MINEDOJO’s databases are uploaded to zenodo.org
- Videos are collected via the YouTube Data API
- Wiki pages provide unstructured knowledge in multimodal tables, recipes, illustrations, and step-by-step tutorials
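The point above about a compound action space being modelled autoregressively can be illustrated with a toy sampler: pick a functional action first, then pick its argument conditioned on that choice. This is a hedged sketch only. The action names, argument lists, and uniform random sampling are all assumptions standing in for a learned policy, and they do not reflect the simulator's real action layout.

```python
import random
from typing import Dict, List

# Hypothetical functional actions; the real simulator's set may differ.
FUNCTIONAL_ACTIONS: List[str] = ["no_op", "use", "attack", "craft", "equip"]

# Hypothetical arguments, defined only for actions that take one.
ARGUMENT_CHOICES: Dict[str, List[str]] = {
    "craft": ["stick", "planks", "torch"],
    "equip": ["slot_0", "slot_1"],
}


def sample_compound_action(rng: random.Random) -> Dict[str, str]:
    """Sample a functional action, then an argument conditioned on it."""
    functional = rng.choice(FUNCTIONAL_ACTIONS)
    action = {"functional": functional}
    # The second choice depends on the first: this dependency is what
    # makes the sampling autoregressive rather than independent per component.
    if functional in ARGUMENT_CHOICES:
        action["argument"] = rng.choice(ARGUMENT_CHOICES[functional])
    return action


rng = random.Random(0)
samples = [sample_compound_action(rng) for _ in range(200)]
print(samples[:3])
```

Conditioning the argument on the functional action avoids sampling meaningless combinations, such as a crafting recipe paired with an attack.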