Abstract
- General intelligence requires solving tasks across many domains.
- DreamerV3 is a general and scalable algorithm based on world models that outperforms previous approaches.
- DreamerV3 is able to work across a wide range of domains with fixed hyperparameters.
- DreamerV3 has favorable scaling properties, with larger models leading to higher data-efficiency and performance.
- DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula.
Paper Content
Introduction
- Reinforcement learning enables computers to solve individual tasks through interaction.
- Applying algorithms to new application domains requires expert knowledge and computational resources.
- Different domains pose unique learning challenges, prompting specialized algorithms.
- DreamerV3 is a general and scalable algorithm that masters a wide range of domains with fixed hyperparameters.
DreamerV3
- The DreamerV3 algorithm consists of three neural networks: the world model, the critic, and the actor
- The three networks are trained concurrently from replayed experience without sharing gradients
- The symlog transformation is used for predicting quantities of unknown orders of magnitude
- World model, critic, and actor introduced with robust learning objectives
- KL balancing and free bits enable world model to learn without tuning
- Scaling down large returns without amplifying small returns allows fixed policy entropy regularizer
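The return-scaling idea in the last bullet can be illustrated with a minimal sketch. It assumes the scale is the 5th-to-95th percentile range of a batch of returns, smoothed with an exponential moving average, and that the divisor is floored at 1 so that small returns are never amplified. The class name `ReturnNormalizer`, the `percentile` helper, and the decay of 0.99 are hypothetical choices for illustration, not values taken from this summary.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


class ReturnNormalizer:
    """Tracks a smoothed return range and divides returns by max(1, range)."""

    def __init__(self, decay=0.99):  # decay is an assumed value
        self.decay = decay
        self.scale = None  # moving average of the percentile range

    def update(self, returns):
        spread = percentile(returns, 95) - percentile(returns, 5)
        if self.scale is None:
            self.scale = spread  # initialize with the first observed range
        else:
            self.scale = self.decay * self.scale + (1.0 - self.decay) * spread
        return self.scale

    def normalize(self, returns):
        # Flooring the divisor at 1 scales large returns down
        # but leaves small returns untouched.
        denom = max(1.0, self.scale if self.scale is not None else 0.0)
        return [r / denom for r in returns]
```

With this floor, a task whose returns span hundreds of units is compressed to roughly unit range, while a sparse-reward task with returns near zero passes through unchanged, which is what lets a single entropy scale work across both.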
Symlog predictions
- Reconstructing inputs and predicting rewards and values can be challenging.
- Squared loss can lead to divergence, absolute and Huber losses can stagnate learning.
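- Symlog predictions are a simple solution to this dilemma: the network predicts a symlog-transformed target, and the inverse transform (symexp) recovers the original scale.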
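For illustration, the symlog function is the symmetric logarithm sign(x) · ln(|x| + 1), and symexp is its inverse, sign(x) · (exp(|x|) − 1). The summary above does not spell out these formulas, so the following is a minimal scalar sketch rather than the paper's implementation.

```python
import math


def symlog(x: float) -> float:
    """Symmetric log: sign(x) * ln(|x| + 1). Compresses large magnitudes."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x: float) -> float:
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


def symlog_squared_loss(prediction: float, target: float) -> float:
    """Squared error against the symlog-transformed target."""
    return 0.5 * (prediction - symlog(target)) ** 2
```

Because symlog is close to the identity near zero and logarithmic for large magnitudes, a squared loss on the transformed target stays bounded in scale for large targets without flattening gradients for small ones.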
World model learning
- World model learns compact representations of sensory inputs
- World model predicts future representations and rewards for potential actions
- Implemented as Recurrent State-Space Model
- Encoder maps sensory inputs to stochastic representations
- Sequence model predicts sequence of representations given past actions
- Predict rewards, episode continuation flags, and reconstruct inputs
- Use convolutional neural networks and multi-layer perceptrons
- Long-term video predictions visualized in Figure 5
- Optimize parameters to minimize prediction, dynamics, and representation losses
- Use free bits and symlog predictions to stabilize trade-off with representation loss
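The dynamics and representation losses with free bits can be sketched as follows. This is a simplified illustration for a single categorical latent. The threshold of 1 nat and the loss scales 0.5 and 0.1 are assumptions not stated in this summary, and the function names are hypothetical. Plain Python has no automatic differentiation, so the stop-gradient placement that distinguishes the two losses is only indicated in comments.

```python
import math


def categorical_kl(p, q):
    """KL(p || q) in nats for two discrete distributions, assuming q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_losses(posterior, prior, free_nats=1.0, dyn_scale=0.5, rep_scale=0.1):
    """Return (dynamics_loss, representation_loss) with free-bits clipping."""
    kl = categorical_kl(posterior, prior)
    # Free bits: below the threshold the loss is constant, so it
    # contributes no gradient once the KL is already small enough.
    clipped = max(free_nats, kl)
    # Dynamics loss: in an autodiff framework the posterior would be
    # wrapped in a stop-gradient, so only the prior (sequence model) moves.
    dyn_loss = dyn_scale * clipped
    # Representation loss: the prior would be wrapped in a stop-gradient,
    # so only the posterior (encoder) moves, and with a smaller weight.
    rep_loss = rep_scale * clipped
    return dyn_loss, rep_loss
```

Weighting the two directions differently is what "KL balancing" refers to: the prior is pulled toward the posterior more strongly than the posterior is regularized toward the prior.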
Actor critic learning
- Actor and critic neural networks learn behaviors from abstract sequences predicted by the world model
- Actions are selected by sampling from the actor network without lookahead planning
- Actor and critic operate on model states and benefit from Markovian representations
- Actor aims to maximize expected return with a discount factor
- Critic learns to predict return of each state under current actor behavior
- The critic loss uses a discrete regression approach with twohot-encoded targets
- Returns are normalized by an exponentially decaying average of the range between their 5th and 95th percentiles
- Entropy scale is fixed at 3 × 10⁻⁴
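The twohot encoding mentioned above can be sketched as follows: a scalar target is represented as weights on the two nearest bins, chosen so that the weighted sum of bin values reproduces the target. The bin layout used in the usage note is a hypothetical example, and the function names are illustrative only.

```python
def twohot(value, bins):
    """Encode a scalar as weights on its two neighbouring bins.

    `bins` must be sorted in ascending order. Values outside the bin
    range are clamped to the nearest edge bin.
    """
    if value <= bins[0]:
        return [1.0] + [0.0] * (len(bins) - 1)
    if value >= bins[-1]:
        return [0.0] * (len(bins) - 1) + [1.0]
    # Index of the largest bin that does not exceed the value.
    k = max(i for i, b in enumerate(bins) if b <= value)
    upper_weight = (value - bins[k]) / (bins[k + 1] - bins[k])
    encoding = [0.0] * len(bins)
    encoding[k] = 1.0 - upper_weight
    encoding[k + 1] = upper_weight
    return encoding


def decode(encoding, bins):
    """Expected bin value under the encoded weights."""
    return sum(w * b for w, b in zip(encoding, bins))
```

For example, with bins `[-2, -1, 0, 1, 2]`, the value 0.25 puts weight 0.75 on the bin at 0 and weight 0.25 on the bin at 1. The critic can then be trained with a categorical cross-entropy against this soft target, so gradient magnitude does not depend on the scale of the returns.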
Results
- Proprio Control Suite: DreamerV3 sets a new state-of-the-art on this benchmark
- Visual Control Suite: DreamerV3 establishes a new state-of-the-art on this benchmark
- Atari 100k: DreamerV3 outperforms previous methods without relying on added algorithmic complexity
- Atari 200M: DreamerV3 outperforms Rainbow and IQN
- BSuite: DreamerV3 establishes a new state-of-the-art on this benchmark
- Crafter: DreamerV3 sets a new state-of-the-art on this benchmark
- Scaling properties: DreamerV3 has favorable scaling properties
- Minecraft: DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch
Previous work
- Developing general-purpose algorithms is a goal of reinforcement learning research
- PPO [19] is widely used and requires little tuning but uses a lot of experience
- SAC [38] is popular for continuous control but requires tuning and struggles with high-dimensional inputs
- MuZero [34] plans using a value prediction model and has achieved high performance
- Gato [62] fits one large model to expert demonstrations but is only applicable to tasks with expert data
- DreamerV3 masters a range of environments with fixed hyperparameters and from scratch
- Microsoft released MALMO [63] for research purposes
- MineRL [15] offers competition environments
- MineDojo [64] provides tasks with sparse rewards and language descriptions
Conclusion
- Presents DreamerV3, a general and scalable reinforcement learning algorithm
- Masters a wide range of domains with fixed hyperparameters
- Systematically addresses varying signal magnitudes and instabilities
- Establishes new state-of-the-art on continuous control from states and images, on BSuite, and on Crafter
- Learns successfully in 3D environments that require spatial and temporal reasoning
- Outperforms IMPALA in DMLab tasks using 130 times fewer interactions
- Obtains diamonds in Minecraft end-to-end from sparse rewards
- Final performance and data-efficiency of DreamerV3 improve monotonically as a function of model size
- Limitations include not collecting diamonds in all scenarios
- Human experts can typically collect diamonds in all scenarios
- Speed at which blocks break is increased to allow learning with a stochastic policy
- Training larger models to solve multiple tasks across overlapping domains is a promising direction
- Symlog predictions used for world model and reward predictor/critic
- World model regularizer combines KL balancing and free bits
- Policy regularizer uses fixed entropy regularizer and scales large return ranges
- Unimix categoricals used for world model representations and dynamics, and actor network
- Layer normalization is used, with SiLU as the activation function
- Critic EMA regularizer used
- Replay buffer uniformly samples from all inserted subsequences
- Hyperparameters tuned to perform well across visual control suite and Atari 200M
- Ablation removes symlog encoding of inputs to world model
- Target KL value of 3.5 nats on average over replay buffer
- Reward normalized by running standard deviation and clipped beyond magnitude of 10
- Target policy randomness of 40% on average across imagined states
- Minecraft Diamond environment built on top of MineRL
- Sparse reward structure of MineRL competition environment used
- Inputs include RGB first-person camera image, inventory counts, equipped item, and scalar inputs for health, hunger, and breath levels
- Flat categorical action space with 25 actions used
- Break speed multiplier set to 100
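The unimix categoricals listed above can be illustrated with a short sketch: the network's categorical probabilities are mixed with a uniform distribution so that no class probability can collapse to zero, which keeps log-probabilities and KL terms finite. The 1% mixing weight is an assumption that is not stated in this summary, and the function name is hypothetical.

```python
def unimix(probs, mix=0.01):
    """Mix a categorical distribution with a uniform one.

    Returns (1 - mix) * probs + mix * uniform, so every class keeps
    probability of at least mix / len(probs).
    """
    n = len(probs)
    return [(1.0 - mix) * p + mix / n for p in probs]
```

Applied to a fully deterministic distribution over four classes, every class ends up with probability of at least 0.0025, and the result still sums to one.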