Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

General intelligence requires solving tasks across many domains.
DreamerV3 is a general and scalable algorithm based on world models that outperforms previous approaches.
DreamerV3 is able to work across a wide range of domains with fixed hyperparameters.
DreamerV3 has favorable scaling properties, with larger models leading to higher data-efficiency and performance.
DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula.

Paper Content

Introduction

Reinforcement learning enables computers to solve individual tasks through interaction.
Applying algorithms to new application domains requires expert knowledge and computational resources.
Different domains pose unique learning challenges, prompting specialized algorithms.
DreamerV3 is a general and scalable algorithm that masters a wide range of domains with fixed hyperparameters.

DreamerV3

The DreamerV3 algorithm consists of three neural networks
The networks are trained concurrently from replayed experience without sharing gradients
A simple transformation is introduced for predicting quantities of unknown orders of magnitude
World model, critic, and actor introduced with robust learning objectives
KL balancing and free bits enable world model to learn without tuning
Scaling down large returns without amplifying small returns allows fixed policy entropy regularizer
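
As a rough illustration of the last point, the sketch below divides returns by a running estimate of their range but never by less than 1, so large returns shrink while small returns are left alone. The 5th to 95th percentile range follows the actor critic section of this summary. The names ReturnScaler and percentile, and the decay value of 0.99, are assumptions made for this example rather than the authors' code.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


class ReturnScaler:
    """Tracks a smoothed 5th-to-95th percentile range of returns (hypothetical helper)."""

    def __init__(self, decay=0.99):  # decay value is an assumption
        self.decay = decay
        self.scale = None

    def update(self, returns):
        """Blend the range of the current batch into the running estimate."""
        batch_range = percentile(returns, 95) - percentile(returns, 5)
        if self.scale is None:
            self.scale = batch_range
        else:
            self.scale = self.decay * self.scale + (1 - self.decay) * batch_range

    def normalize(self, returns):
        """Divide by max(1, scale): large returns shrink, small ones are untouched."""
        denom = 1.0 if self.scale is None else max(1.0, self.scale)
        return [r / denom for r in returns]
```

Because the divisor is clamped at 1, a task with tiny sparse rewards is not amplified into noise, which is what lets a single entropy scale work across tasks.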

Symlog predictions

Reconstructing inputs and predicting rewards and values can be challenging.
Squared loss can lead to divergence, while absolute and Huber losses can stagnate learning.
Symlog predictions are a simple solution to this dilemma.
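
A minimal sketch of the transformation, assuming the usual definition symlog(x) = sign(x) ln(|x| + 1) with inverse symexp(x) = sign(x) (exp(|x|) - 1). The function names and the squared-error helper are illustrative choices for this example, not taken from the authors' code.

```python
import math


def symlog(x: float) -> float:
    """Compress large magnitudes symmetrically around zero: sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x: float) -> float:
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


def symlog_loss(prediction: float, target: float) -> float:
    """Squared error measured in symlog space.

    The network output `prediction` is compared against the transformed target.
    To read off a value in the original units, apply symexp to the prediction.
    """
    return 0.5 * (prediction - symlog(target)) ** 2
```

Near zero the function is close to the identity, while a target of 1000 maps to roughly 6.9, so a squared loss on the transformed target stays bounded without discarding the sign or ordering of the targets.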

World model learning

World model learns compact representations of sensory inputs
World model predicts future representations and rewards for potential actions
Implemented as Recurrent State-Space Model
Encoder maps sensory inputs to stochastic representations
Sequence model predicts sequence of representations given past actions
Predict rewards, episode continuation flags, and reconstruct inputs
Use convolutional neural networks and multi-layer perceptrons
Long-term video predictions visualized in Figure 5
Optimize parameters to minimize prediction, dynamics, and representation losses
Use free bits and symlog predictions to stabilize trade-off with representation loss
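
The free bits idea in the list above can be sketched as follows: the KL divergence between the encoder's representation (posterior) and the sequence model's prediction (prior) is clipped from below, so the term stops contributing once it is already small. The threshold of 1 nat and the loss weights 0.5 and 0.1 are assumed defaults for this example, and the function names are hypothetical. The stop-gradient placement that distinguishes the two terms is only described in comments, since plain Python has no automatic differentiation.

```python
import math


def categorical_kl(p, q):
    """KL(p || q) in nats for two discrete distributions given as probability lists.

    Assumes every entry of q is strictly positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def world_model_kl_terms(posterior, prior, free_nats=1.0, beta_dyn=0.5, beta_rep=0.1):
    """Return (dynamics_loss, representation_loss) with free bits applied.

    Both terms have the same value. In an autodiff framework they differ only in
    where gradients flow: the dynamics term would treat the posterior as constant
    (training the prior), and the representation term would treat the prior as
    constant (training the posterior).
    """
    kl = categorical_kl(posterior, prior)
    clipped = max(free_nats, kl)  # free bits: no incentive to push the KL below the threshold
    return beta_dyn * clipped, beta_rep * clipped
```

Weighting the dynamics term more heavily than the representation term is what the summary calls KL balancing: the prior is pulled toward the representations faster than the representations are pulled toward the prior.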

Actor critic learning

Actor and critic neural networks learn behaviors from abstract sequences predicted by the world model
Actions are selected by sampling from the actor network without lookahead planning
Actor and critic operate on model states and benefit from Markovian representations
Actor aims to maximize expected return with a discount factor
Critic learns to predict return of each state under current actor behavior
Critic loss function is discrete regression approach with twohot encoded targets
Returns are normalized by exponentially decaying standard deviation or 5th to 95th percentile
Entropy scale is fixed at 3 × 10^-4
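
To make the twohot targets above concrete, here is a small sketch that spreads a scalar over the two nearest bins so that the weighted sum of bin locations recovers the scalar. The bin layout and the function names are assumptions for this example. A critic trained this way would output a softmax over the bins and be fit to these weights with a cross-entropy loss.

```python
def twohot(value, bins):
    """Encode a scalar as weights on its two neighboring bins.

    `bins` must be sorted in ascending order. Values outside the bin range
    are clamped to the nearest edge.
    """
    value = min(max(value, bins[0]), bins[-1])
    weights = [0.0] * len(bins)
    # Index of the last bin that is less than or equal to the value.
    k = 0
    for i, b in enumerate(bins):
        if b <= value:
            k = i
    if k == len(bins) - 1:
        weights[k] = 1.0
        return weights
    upper = (value - bins[k]) / (bins[k + 1] - bins[k])
    weights[k] = 1.0 - upper
    weights[k + 1] = upper
    return weights


def decode(weights, bins):
    """Expected bin location under the given weights."""
    return sum(w * b for w, b in zip(weights, bins))
```

For example, with bins at -2, -1, 0, 1, 2 the value 0.25 puts weight 0.75 on the bin at 0 and 0.25 on the bin at 1, and decoding those weights gives 0.25 back.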

Results

Proprio Control Suite: DreamerV3 sets a new state-of-the-art on this benchmark
Visual Control Suite: DreamerV3 establishes a new state-of-the-art on this benchmark
Atari 100k: DreamerV3 outperforms previous methods without added algorithmic complexity
Atari 200M: DreamerV3 outperforms Rainbow and IQN
BSuite: DreamerV3 establishes a new state-of-the-art on this benchmark
Crafter: DreamerV3 sets a new state-of-the-art on this benchmark
Scaling properties: DreamerV3 has favorable scaling properties
Minecraft: DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch

Previous work

Developing general-purpose algorithms is a goal of reinforcement learning research
PPO [19] is widely used and requires little tuning but uses a lot of experience
SAC [38] is popular for continuous control but requires tuning and struggles with high-dimensional inputs
MuZero [34] plans using a value prediction model and has achieved high performance
Gato [62] fits one large model to expert demonstrations but is only applicable to tasks with expert data
DreamerV3 masters a range of environments with fixed hyperparameters and from scratch
Microsoft released MALMO [63] for research purposes
MineRL [15] offers competition environments
MineDojo [64] provides tasks with sparse rewards and language descriptions

Conclusion

Presents DreamerV3, a general and scalable reinforcement learning algorithm
Masters a wide range of domains with fixed hyperparameters
Systematically addresses varying signal magnitudes and instabilities
Establishes new state-of-the-art on continuous control from states and images, on BSuite, and on Crafter
Learns successfully in 3D environments that require spatial and temporal reasoning
Outperforms IMPALA in DMLab tasks using 130 times fewer interactions
Obtains diamonds in Minecraft end-to-end from sparse rewards
Final performance and data-efficiency of DreamerV3 improve monotonically as a function of model size
Limitations include not collecting diamonds in all scenarios
Human experts can typically collect diamonds in all scenarios
Speed at which blocks break is increased to allow learning with a stochastic policy
Training larger models to solve multiple tasks across overlapping domains is a promising direction
Symlog predictions used for world model and reward predictor/critic
World model regularizer combines KL balancing and free bits
Policy regularizer uses fixed entropy regularizer and scales large return ranges
Unimix categoricals used for world model representations and dynamics, and actor network
Layer normalization and SiLU activation function used
Critic EMA regularizer used
Replay buffer uniformly samples from all inserted subsequences
Hyperparameters tuned to perform well across visual control suite and Atari 200M
Ablation removes symlog encoding of inputs to world model
Target KL value of 3.5 nats on average over replay buffer
Reward normalized by running standard deviation and clipped beyond magnitude of 10
Target policy randomness of 40% on average across imagined states
Minecraft Diamond environment built on top of MineRL
Sparse reward structure of MineRL competition environment used
Inputs include RGB first-person camera image, inventory counts, equipped item, and scalar inputs for health, hunger, and breath levels
Flat categorical action space with 25 actions used
Break speed multiplier set to 100
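
The unimix categoricals listed above can be illustrated with a short sketch: the network's categorical probabilities are mixed with a small share of the uniform distribution, so no class ever has exactly zero probability and KL and log-probability terms stay finite. The 1% mixing share and the function name are assumptions for this example.

```python
def unimix(probs, mix=0.01):
    """Mix a categorical distribution with the uniform distribution.

    `probs` is a list of probabilities summing to 1; `mix` is the share of
    uniform probability (assumed 1% here).
    """
    n = len(probs)
    return [(1.0 - mix) * p + mix / n for p in probs]
```

Applied to a fully deterministic distribution over four classes, this leaves 0.9925 on the chosen class and 0.0025 on each of the others.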
