Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

This paper focuses on sample-efficient deep reinforcement learning with a simulator.
The proposed algorithmic framework, uncertainty-first local planning (UFLP), takes advantage of the ability to reset the environment to a previously observed state.
UFLP can dramatically improve the sample cost of several baseline RL algorithms on difficult exploration tasks.
UFLP can achieve super-human performance on the Atari game, Montezuma’s Revenge.

Paper Content

Introduction

Simulators are used in modern reinforcement learning
Local access protocol allows agent to revisit previously observed states
Local access can improve exploration of state space
Recent works have proposed sample- and computationally-efficient algorithms under local access protocol
Uncertainty-based intrinsic rewards and bonuses are used to encourage exploration
Randomized value functions are another approach to exploration

Problem setting

MDPs are characterized by a tuple of elements
The elements include a state space, action space, reward function, probability transition kernel, initial state distribution, and discount factor
The action space is finite
At each state, the agent picks an action and the environment evolves to a random next state
A stationary policy is a mapping from a state to a distribution over actions
The value function is the expectation of cumulative rewards received under a policy
The action value function is the expectation of cumulative rewards received under a policy for a given state and action
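For reference, the two value functions can be written out explicitly. The notation below (policy \pi, reward function r, discount factor \gamma, state s_t and action a_t at step t) follows common convention and is an assumption, since the summary does not reproduce the paper's symbols:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right]
```

The two are related by V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a), which is well defined here because the action space is finite.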

Simulator interaction protocol

Three protocols for interacting with MDP simulator: online access, local access, random access
Online access: initial state sampled from initial state distribution, agent can reset environment to initial state or move to next state given action
Local access: agent can reset environment to random initial state or previously observed state
Random access: agent can query simulator with any state-action pair to obtain reward and sample of next state
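The difference between the three protocols is easiest to see as three sets of allowed calls on the same simulator. The sketch below uses a hypothetical deterministic chain MDP; the class name `ToySimulator` and its method names are illustrative assumptions, not an interface from the paper or from any library.

```python
class ToySimulator:
    """Hypothetical chain MDP with states 0..n-1, used only for illustration."""

    def __init__(self, n=5):
        self.n = n
        self.state = 0
        self.observed = {0}  # states seen so far; local access is limited to these

    def _transition(self, state, action):
        # Action 1 moves right, action 0 moves left; reward 1 on reaching the last state.
        nxt = min(self.n - 1, state + 1) if action == 1 else max(0, state - 1)
        reward = 1.0 if nxt == self.n - 1 else 0.0
        return nxt, reward

    # Online access: reset to an initial state, then step forward.
    def reset(self):
        self.state = 0  # the initial state distribution is a point mass at 0 here
        return self.state

    def step(self, action):
        self.state, reward = self._transition(self.state, action)
        self.observed.add(self.state)
        return self.state, reward

    # Local access: online access plus resets to previously observed states.
    def reset_to(self, state):
        if state not in self.observed:
            raise ValueError("local access only allows previously observed states")
        self.state = state
        return self.state

    # Random access: query any state-action pair, observed or not.
    def query(self, state, action):
        return self._transition(state, action)
```

An online agent would only call `reset` and `step`; a local-access agent may also call `reset_to`; only a random-access agent may call `query` on states it has never reached.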

Algorithm framework

Algorithm framework for policy optimization with local access to a simulator
Simulator (Env) can be reset to any state observed during learning
Base agent (Agent) takes actions given observation of state and updates itself
Function measures uncertainty of agent about value of state-action pairs
History buffer stores necessary information to reset environment to particular state
Data collection process starts with given state-action pair
With probability p_init, the initial state is sampled from the initial state distribution
Otherwise, the highest-uncertainty state-action pair in the history buffer is chosen as the starting point
FIFO queue used to store states visited during training
Uncertainty metric for states can be used to choose most uncertain state
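The start-state rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `choose_start`, the `batch_size` argument (the number of history entries compared per reset), and the shape of a history entry are all hypothetical, and the real framework also has to store whatever the simulator needs to restore a state.

```python
import random
from collections import deque


def choose_start(history, uncertainty, sample_initial_state, p_init, batch_size, rng):
    """Return the point from which the next roll-out starts.

    history: FIFO buffer of previously visited entries (assumed hashable here).
    uncertainty: callable mapping an entry to a scalar uncertainty estimate.
    """
    # With probability p_init (or if nothing has been visited yet),
    # start from the initial state distribution.
    if not history or rng.random() < p_init:
        return sample_initial_state()
    # Otherwise sample a batch from the history and take its most uncertain entry.
    batch = rng.sample(list(history), min(batch_size, len(history)))
    return max(batch, key=uncertainty)


# A bounded deque behaves as a FIFO queue: the oldest entry is dropped when full.
history = deque(maxlen=1000)
```

Setting `batch_size` to the full buffer length always picks the globally most uncertain entry, while `batch_size = 1` reduces to a uniformly random revisit.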

Base agents and uncertainty metrics

Base agents

Double Deep Q Network (DDQN) is an improvement of the original DQN agent
DDQN is updated by minimizing a loss over transition tuples
To improve exploration, an additive bonus or intrinsic reward can be used
Bootstrapped DDQN mimics the behavior of Thompson sampling
Distributional DDQN predicts the distribution of the cumulative reward, whereas the standard DDQN Q-network predicts only its expectation
Approximate Policy Iteration (PI) updates the Q-function using least-squares Monte Carlo
PI can act greedily or with an acting-time bonus
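To make the DDQN update concrete, the sketch below computes standard double Q-learning regression targets for a batch of transitions with NumPy. It is a generic illustration rather than the paper's implementation; the function name `ddqn_targets` and the optional `bonus` argument (an intrinsic reward added to the environment reward) are assumptions.

```python
import numpy as np


def ddqn_targets(rewards, discounts, q_online_next, q_target_next, bonus=None):
    """Double Q-learning targets for a batch of transitions.

    rewards, discounts: arrays of shape (batch,); discount is 0 at terminal steps.
    q_online_next, q_target_next: arrays of shape (batch, num_actions) holding
    next-state Q-values from the online and target networks.
    """
    # The online network selects the next action ...
    best_actions = np.argmax(q_online_next, axis=1)
    # ... and the target network evaluates it, which reduces overestimation.
    next_values = q_target_next[np.arange(len(best_actions)), best_actions]
    total_reward = rewards if bonus is None else rewards + bonus
    return total_reward + discounts * next_values
```

The loss is then the squared (or Huber) difference between these targets and the online network's Q-value for the action actually taken.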

Uncertainty estimation

Standard deviation of ensemble predictions used to evaluate agent uncertainty
Covariance of random state-action features used to evaluate agent uncertainty
Approximate counts used to evaluate agent uncertainty
Random network distillation used to evaluate agent uncertainty
Error of neural network used as uncertainty metric for states
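Two of the metrics listed above are simple enough to sketch directly. The code below shows an ensemble standard deviation and a covariance-based score of the form sqrt(phi^T Sigma^{-1} phi); the function and class names are hypothetical, and using the regularization coefficient as a ridge term on the covariance matrix is an assumption about how such a metric is typically set up.

```python
import numpy as np


def ensemble_uncertainty(q_values):
    """Per-action standard deviation across ensemble members.

    q_values: array of shape (ensemble_size, num_actions).
    """
    return np.std(q_values, axis=0)


class CovarianceUncertainty:
    """Uncertainty from the covariance of visited state-action features."""

    def __init__(self, dim, reg=0.01):
        # Ridge term keeps the matrix invertible before any data arrives.
        self.cov = reg * np.eye(dim)

    def update(self, phi):
        # Accumulate the outer product of each visited feature vector.
        self.cov += np.outer(phi, phi)

    def __call__(self, phi):
        # Large when phi points in a direction that has rarely been visited.
        return float(np.sqrt(phi @ np.linalg.solve(self.cov, phi)))
```

Both scores shrink as the agent gathers data: ensemble members come to agree on visited pairs, and the covariance score drops along feature directions that have been seen often.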

Experiments

Evaluated benefits of local vs. online access by training agents on difficult exploration tasks
Used two bsuite environments and four Atari games
Games correspond to difficult exploration problems
Provided details on how to checkpoint and restore environment state in Appendix B
Provided hyperparameter choices in Appendix C
95% confidence interval shown in all figures

Behavior suite experiments

Introduction of two bsuite environments
Illustrations of the two environments in Figure 1

Deep sea

Environment is an N × N grid with one-hot state encoding
Agent starts from top left corner and moves down-left or down-right depending on action
Moving right has a small cost of 0.01/N
Reaching bottom-right corner gives reward of +1
Exploring uniformly at random has a 2^(-N) chance of finding the high-reward state
Local access improves sample efficiency
Best performance achieved with small p_init
Sample efficiency of BootDDQN is best when the history buffer batch size equals |H|, the size of the history buffer
Number of queries needed to achieve mean return 0.95 scales near-optimally, as O(N^2.47)
Local access leads to significant improvement over online access in hard version
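The following is a simplified sketch of the Deep Sea dynamics described above, written to show why uniform exploration succeeds with probability only 2^(-N). It is not the bsuite implementation: the class name and the `(state, reward, done)` return convention are assumptions, and details such as the one-hot observation encoding are omitted.

```python
class DeepSea:
    """Simplified N x N Deep Sea: one row down per step, N steps per episode."""

    def __init__(self, size):
        self.size = size
        self.reset()

    def reset(self):
        self.row, self.col = 0, 0  # top-left corner
        return (self.row, self.col)

    def step(self, action):
        """action 1 moves down-right, action 0 moves down-left."""
        reward = 0.0
        if action == 1:
            reward -= 0.01 / self.size  # small cost for every move to the right
            # Moving right from the bottom-right cell yields the +1 reward.
            if self.row == self.size - 1 and self.col == self.size - 1:
                reward += 1.0
            self.col = min(self.col + 1, self.size - 1)
        else:
            self.col = max(self.col - 1, 0)
        self.row += 1
        done = self.row == self.size
        return (self.row, self.col), reward, done
```

Only the action sequence that moves right at every one of the N steps collects the +1, for a total return of 0.99; every other sequence returns at most 0, so a uniformly random policy succeeds with probability 2^(-N).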

Atari

Evaluated approach on four Atari games from the Arcade Learning Environment: Montezuma’s Revenge, Pitfall, PrivateEye, and Venture
Montezuma’s Revenge has an average human performance of 4753 points
SOTA of 43K on Montezuma’s Revenge achieved by Go-Explore
Pitfall has sparse positive rewards and distractor rewards
PrivateEye has average human performance of 69K
Venture has denser rewards and average human score of 1187
Used DDQN and distributional DDQN implementations in Acme framework
Experimented with two different uncertainty metrics: approximate counts and RND
Local access significantly improves final return and sample efficiency
Number of cells found by online and local DDQN agents
Role of p_init and history buffer batch size in DDQN with approximate counts
DDQN-Intrinsic needs a larger value of p_init to get good performance in the local setting
Distributional DDQN with UFLP improves score of baseline algorithm to super-human level on Montezuma’s Revenge
Local access improves sample complexity and stability of baseline algorithm on PrivateEye
Neutral results on Venture
Both local and online access versions fail to obtain positive scores on Pitfall

Conclusions and future directions

Propose a new algorithmic framework for learning with a simulator under the local access protocol
Demonstrate that uncertainty-first approach to revisiting states in history can improve sample cost of baseline algorithms
Future direction: improve quality of uncertainty estimation in MDPs
Future direction: extend approach to partially observed environments
Cartpole Swingup environment: fix history buffer batch size = 5, study effect of p_init in default version
BootDDQN and DDQN-Bonus performance comparable for p_init values in [0.2, 1.0]
PI-Bonus local access leads to significant improvement over online access, best performance with small but non-zero p_init = 0.2
Fix p_init = 0.2, study effect of history buffer batch size for hard version
BootDDQN performance comparable for different values of the history buffer batch size
PI-Bonus: choosing an uncertain element, i.e., history buffer batch size > 1, is important
DDQN-Intrinsic local access finds more cells than online setting
DDQN with local access better mean return than DDQN-Intrinsic with online access
Evaluation results of DDQN-based agents in Table 1, distributional DDQN in Table 2
Max screen score seen during acting in Tables 3 and 4
DDQN-Intrinsic local agent with approximate count uncertainty reaches screen score 14300
For Atari games, use environment loader in Acme framework
For Q-networks, use MLP with two hidden layers, each with size 64
For BootDDQN, use ensemble of size 20 and prior scale of 40.0
For other agents, use covariance-based uncertainty metric cov with random Fourier features, feature dimension = 1500
For DDQN and DDQN-Bonus, regularization coefficient = 0.01, for PI-Bonus, = 0.1
For DDQN-Bonus and PI-Bonus, bonus scale = 1.0
Replay buffer of size 10^6
Horizontal position of cart denoted by , angle between pole and upright direction denoted by , angular velocity of pole denoted by
Default version: positive reward if and hard version: positive reward if
Model architecture: Acme AtariTorso network architecture followed by MLP with 512 hidden units
History buffer of size 10^6
Reset to highest-uncertainty state in sampled batch, then take random action
Sweep different combinations of p_init and history buffer batch size
Use = 0.1 for Montezuma’s Revenge, = 0.01 for all other settings
For Montezuma’s Revenge, limit maximum length of all roll-outs to 2000
Use 64 CPU machines as actors and 1 TPU machine as learner, except RND use 128 actors
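The covariance-based metric listed above relies on random Fourier features. A minimal sketch of one standard construction is given below (random cosine features approximating a Gaussian kernel); the function name `make_rff` and the `bandwidth` parameter are assumptions, and the summary does not say which variant or kernel width the paper used.

```python
import numpy as np


def make_rff(input_dim, feature_dim, rng, bandwidth=1.0):
    """Return a feature map whose inner products approximate a Gaussian kernel.

    phi(x) @ phi(y) is approximately exp(-||x - y||^2 / (2 * bandwidth^2)).
    """
    # Random projection directions and phases are drawn once and then frozen.
    weights = rng.normal(scale=1.0 / bandwidth, size=(feature_dim, input_dim))
    phases = rng.uniform(0.0, 2.0 * np.pi, size=feature_dim)

    def features(x):
        return np.sqrt(2.0 / feature_dim) * np.cos(weights @ x + phases)

    return features


rng = np.random.default_rng(0)
# Feature dimension 1500 as in the settings above; the input size is arbitrary here.
features = make_rff(input_dim=3, feature_dim=1500, rng=rng)
```

The resulting vectors can be fed to any covariance-style uncertainty score over state-action features; nearby inputs map to similar features, so uncertainty generalizes locally.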
