Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • General intelligence requires solving tasks across many domains.
  • DreamerV3 is a general and scalable algorithm based on world models that outperforms previous approaches.
  • DreamerV3 is able to work across a wide range of domains with fixed hyperparameters.
  • DreamerV3 has favorable scaling properties, with larger models leading to higher data-efficiency and performance.
  • DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula.

Paper Content

Introduction

  • Reinforcement learning enables computers to solve individual tasks through interaction.
  • Applying algorithms to new application domains requires expert knowledge and computational resources.
  • Different domains pose unique learning challenges, prompting specialized algorithms.
  • DreamerV3 is a general and scalable algorithm that masters a wide range of domains with fixed hyperparameters.

Dreamerv3

  • DreamerV3 algorithm consists of 3 neural networks
  • Algorithm is trained concurrently from replayed experience without sharing gradients
  • Transformation for predicting quantities of unknown orders of magnitude
  • World model, critic, and actor introduced with robust learning objectives
  • KL balancing and free bits enable world model to learn without tuning
  • Scaling down large returns without amplifying small returns allows fixed policy entropy regularizer

Symlog predictions

  • Reconstructing inputs and predicting rewards and values can be challenging.
  • Squared loss can lead to divergence, absolute and Huber losses can stagnate learning.
  • Symlog predictions is a simple solution to this dilemma.

World model learning

  • World model learns compact representations of sensory inputs
  • World model predicts future representations and rewards for potential actions
  • Implemented as Recurrent State-Space Model
  • Encoder maps sensory inputs to stochastic representations
  • Sequence model predicts sequence of representations given past actions
  • Predict rewards, episode continuation flags, and reconstruct inputs
  • Use convolutional neural networks and multi-layer perceptrons
  • Long-term video predictions visualized in Figure 5
  • Optimize parameters to minimize prediction, dynamics, and representation losses
  • Use free bits and symlog predictions to stabilize trade-off with representation loss

Actor critic learning

  • Actor and critic neural networks learn behaviors from abstract sequences predicted by the world model
  • Actions are selected by sampling from the actor network without lookahead planning
  • Actor and critic operate on model states and benefit from Markovian representations
  • Actor aims to maximize expected return with a discount factor
  • Critic learns to predict return of each state under current actor behavior
  • Critic loss function is discrete regression approach with twohot encoded targets
  • Returns are normalized by exponentially decaying standard deviation or 5th to 95th percentile
  • Entropy scale is fixed at 3 x 10-4

Results

  • Proprio Control Suite: DreamerV3 sets a new state-of-the-art on this benchmark
  • Visual Control Suite: DreamerV3 establishes a new state-of-the-art on this benchmark
  • Atari 100k: DreamerV3 outperforms previous methods without complexity
  • Atari 200M: DreamerV3 outperforms Rainbow and IQN
  • BSuite: DreamerV3 establishes a new state-of-the-art on this benchmark
  • Crafter: DreamerV3 sets a new state-of-the-art on this benchmark
  • Scaling properties: DreamerV3 has favorable scaling properties
  • Minecraft: DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch

Previous work

  • Developing general-purpose algorithms is a goal of reinforcement learning research
  • PPO 19 is widely used and requires little tuning but uses a lot of experience
  • SAC 38 is popular for continuous control but requires tuning and struggles with high-dimensional inputs
  • MuZero 34 plans using a value prediction model and has achieved high performance
  • Gato 62 fits one large model to expert demonstrations but only applicable to tasks with expert data
  • DreamerV3 masters a range of environments with fixed hyperparameters and from scratch
  • Microsoft released MALMO 63 for research purposes
  • MineRL 15 offers competition environments
  • MineDojo 64 provides tasks with sparse rewards and language descriptions

Conclusion

  • Presents DreamerV3, a general and scalable reinforcement learning algorithm
  • Masters a wide range of domains with fixed hyperparameters
  • Systematically addresses varying signal magnitudes and instabilities
  • Establishes new state-of-the-art on continuous control from states and images, on BSuite, and on Crafter
  • Learns successfully in 3D environments that require spatial and temporal reasoning
  • Outperforms IMPALA in DMLab tasks using 130 times fewer interactions
  • Obtains diamonds in Minecraft end-to-end from sparse rewards
  • Final performance and data-efficiency of DreamerV3 improve monotonically as a function of model size
  • Limitations include not collecting diamonds in all scenarios
  • Human experts can typically collect diamonds in all scenarios
  • Speed at which blocks break is increased to allow learning with a stochastic policy
  • Training larger models to solve multiple tasks across overlapping domains is a promising direction
  • Symlog predictions used for world model and reward predictor/critic
  • World model regularizer combines KL balancing and free bits
  • Policy regularizer uses fixed entropy regularizer and scales large return ranges
  • Unimix categoricals used for world model representations and dynamics, and actor network
  • Layer normalization and SiLU used as activation function
  • Critic EMA regularizer used
  • Replay buffer uniformly samples from all inserted subsequences
  • Hyperparameters tuned to perform well across visual control suite and Atari 200M
  • Ablation removes symlog encoding of inputs to world model
  • Target KL value of 3.5 nats on average over replay buffer
  • Reward normalized by running standard deviation and clipped beyond magnitude of 10
  • Target policy randomness of 40% on average across imagined states
  • Minecraft Diamond environment built on top of MineRL
  • Sparse reward structure of MineRL competition environment used
  • Inputs include RGB first-person camera image, inventory counts, equipped item, and scalar inputs for health, hunger, and breath levels
  • Flat categorical action space with 25 actions used
  • Break speed multiplier set to 100