Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Cross-entropy loss improves with model size and training compute following a power law
Intrinsic performance is a monotonic function of the return defined as the minimum compute required to achieve the given return
Intrinsic performance scales as a power law in model size and environment interactions
Optimal model size scales as a power law in the training compute budget
Varying the “horizon length” of the task mostly changes the coefficient but not the exponent of this relationship

Paper Content

Introduction

Recent studies have found relationships between neural network performance and model size/training compute to be governed by smooth power laws
Studies have focused on generative modeling with cross-entropy loss
This paper seeks to extend these results to reinforcement learning, which generally has no cross-entropy loss
Introduces intrinsic performance, which is defined to be equal to training compute on the compute-efficient frontier
Studies relationships between performance, model size and environment interactions across a range of environments

Intrinsic performance

Cross-entropy test loss scales smoothly with training compute in generative modeling
Mean episode return in reinforcement learning does not necessarily scale smoothly
Intrinsic performance is a metric that behaves like test loss and scales as a power law with compute
Intrinsic performance is the minimum compute required to train a model of any size to reach the same return

The power law for intrinsic performance

Intrinsic performance I scales as a power law with model parameters N and environment interactions E.
Intrinsic performance I is similar to the scaling law for language models.
Intrinsic definition of I determines the exponent β.
Intrinsic performance I scales as a power law in N when interactions are not bottlenecked.
Intrinsic performance I scales as a power law in E when model size is not bottlenecked.

Optimal model size vs compute

Power law for intrinsic performance implies optimal model size scales as a power law with exponent 4
Exponent varied between 0.40 and 0.80 for different environments
Exponent for language modeling is around 0.50
Exponent for 32x32 images is around 0.65
Intriguing conjecture that exponent would be 0.5 in every domain
Optimal model size for RL is smaller than for generative modeling

Experimental setup

Ran experiments using RL environments: CoinRun, StarPilot, FruitBot, Dota 2, MNIST
Used PPO and PPG algorithms with Adam optimization algorithm
Hyperparameters in Appendix B

Procgen benchmark

Used 3 Procgen environments: CoinRun, StarPilot and FruitBot
Used PPG-EWMA with a fixed KL penalty objective
Trained for 200 million environment interactions
Used CNN architecture from IMPALA
Conducted widthscaling and depth-scaling experiments
Varied total number of parameters from 1/64 to 8 times the default
Varied number of residual blocks per stack from 1 to 64

Dota 2

Used 1v1 version of Dota 2 to save computational expense
Used PPO with asynchronous setup to ensure only on-policy data used, no data reuse
8 parallel GPU workers, trained for 13.6-82.6 billion environment interactions
LSTM architecture, embedding and hidden state sizes varied from 8 to 4096

Mnist

MNIST environment samples a handwritten digit from the MNIST training set uniformly and independently random at each timestep
Immediate reward of 1 for a correct label and 0 for an incorrect label
Mean training accuracy is measured instead of mean episode return
Horizon length of the task is artificially controlled by varying hyperparameters of method advantage estimation
PPO-EWMA used with rollouts of length 512
Simple CNN architecture with ReLU activations and 5x5 convolutional layer, 2x2 max pooling, 3x3 convolutional layer, 2x2 max pooling, and dense layer with 1,000 channels
Width of network scaled by varying total number of parameters from 1/64 to 8 times the default
Separate policy and value function networks used

Learning rates

Tuning hyperparameters is important for quantitative results
Adam learning rate should be proportional to initialization scale
For width-scaling experiments, Adam learning rate should be proportional to 1/√width
For depth-scaling experiments, Adam learning rate should be proportional to 1/depth^L
Learning rate should be tuned separately for each model size
Learning rate schedule should be carefully tuned for RL
Values of scaling exponents should be considered uncertain

Results

Power law holds across environments and model sizes
Closeness of power law fit to learning curves shown in Figure 2 and Appendix C
Primary determinant of exponents is domain
Within MNIST, increasing horizon lowers exponent
Within Procgen, easy and hard modes have similar exponents
Difficulty mode does not affect scaling exponents

Effect of task horizon length

MNIST experiments were used to artificially alter the “horizon length” of the task by setting the GAE credit assignment parameter λ to 1 and varying the GAE discount rate γ.
Proposition 1 states that the covariance matrix of the policy gradient is approximately for some symmetric positive semi-definite matrices Σ θ and Π θ that do not depend on h.
Gradient variance is an affine function of h (i.e., a linear function with an intercept).
The number of environment interactions required should be an affine function of h.
Results show that the number of interactions closely follows an affine function of the horizon length.
Increasing the horizon length shifts the optimal model size vs compute curve to the right.
Task horizon length is influenced by how long rewards are delayed for relative to the actions the agent is currently learning.

Variability of exponents over training

Power law for intrinsic performance holds across environments and model sizes
Scaling constants vary over the course of training
Fitted power law to three different periods of training
Early and middle periods of training have significantly lower scaling constants
Measurement of scaling constants is an extrapolation if learning curves don’t reach compute-efficient frontier
Exclude first 1/64 of training for other environments

Scaling depth

Experiments involved scaling width and depth of networks
Power law for intrinsic performance held, but with more noise
Fitted values of α N and α E for depth-scaling experiments similar to width-scaling experiments
Optimal model size for given compute budget smaller for depth-scaling experiments
Optimal model size vs compute scaling laws more similar if measured using FLOPs per forward pass

Natural performance metrics

Natural performance metrics can be found in some environments
CoinRun: fail-to-success ratio is a natural performance metric
Dota 2: TrueSkill rating system is a natural performance metric
Intrinsic performance and natural performance metric fit closely, except for Dota 2 at higher levels

Discussion

Extrapolating sample efficiency

We can use a power law to extrapolate sample efficiency to unseen model sizes and environment interactions.
We can compare the extrapolated infinite-width limit to human sample efficiency.
On StarPilot, humans are around 10,000 times more sample efficient than the extrapolated infinitely-wide model.

Cost-efficient reinforcement learning

Sample efficiency is the primary metric of algorithmic progress in reinforcement learning.
The cost of running the environment and the algorithm can both be taken into account.
It is usually inefficient to use a model that is much cheaper to run than the environment when training from scratch.

Limitations

Not separate training runs for each compute budget
Variability of exponents over training
Not carefully optimized aspect ratios
High-variance learning curves

Forecasting compute requirements

Scaling exponents for reinforcement learning lie in a similar range to generative modeling
Scaling coefficients for reinforcement learning vary by multiple orders of magnitude
Arithmetic intensity may confound scaling coefficients
Sample efficiency is an affine function of the task horizon length
Training data requirements are a sum of two components
Continuing to increase the horizon length will eventually lead to a proportional increase in the compute budget
Measuring scaling exponents precisely is challenging

Conclusion

Extending scaling laws to single-agent reinforcement learning
Notion of intrinsic performance
Intrinsic performance scales as a power law in model size and environment interactions
Optimal model size scales as a power law in the training compute budget
Relationship affected by various properties of the training setup
Implications for biological anchors framework for forecasting transformative artificial intelligence
Computing intrinsic performance and fitting the power law constants require some care
Intrinsic performance is the minimum compute required to train a model of any size
Jointly fit the power law constants and a monotonic function
Fit log (f ) rather than f
Weight the data in proportion to 1 E
Smooth learning curves
Exclude data from early in training
Greedy adaptive batch size algorithm
Batch size schedule can be approximated by a power law
Batch size schedule works well on different Procgen environments
Batch size schedule works well on MNIST environment
Tuning B min important for MNIST environment
No batch size schedule used for Dota 2 experiments

Link to paper#

Abstract#

Paper Content#

Introduction#

Intrinsic performance#

The power law for intrinsic performance#

Optimal model size vs compute#

Experimental setup#

Procgen benchmark#

Dota 2#

Mnist#

Learning rates#

Results#

Effect of task horizon length#

Variability of exponents over training#

Scaling depth#

Natural performance metrics#

Discussion#

Extrapolating sample efficiency#

Cost-efficient reinforcement learning#

Limitations#

Forecasting compute requirements#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Intrinsic performance

The power law for intrinsic performance

Optimal model size vs compute

Experimental setup

Procgen benchmark

Dota 2

Mnist

Learning rates

Results

Effect of task horizon length

Variability of exponents over training

Scaling depth

Natural performance metrics

Discussion

Extrapolating sample efficiency

Cost-efficient reinforcement learning

Limitations

Forecasting compute requirements

Conclusion