Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • This paper focuses on sample-efficient deep reinforcement learning with a simulator.
  • The proposed algorithmic framework, UFLP, takes advantage of the ability to reset the environment to a previously observed state.
  • UFLP can dramatically improve the sample cost of several baseline RL algorithms on difficult exploration tasks.
  • UFLP can achieve super-human performance on the Atari game, Montezuma’s Revenge.

Paper Content

Introduction

  • Simulators are used in modern reinforcement learning
  • Local access protocol allows agent to revisit previously observed states
  • Local access can improve exploration of state space
  • Recent works have proposed sample- and computationally-efficient algorithms under local access protocol
  • Uncertainty-based intrinsic rewards and bonuses are used to encourage exploration
  • Randomized value functions are another approach to exploration

Problem setting

  • MDPs are characterized by a tuple of elements
  • The elements include a state space, action space, reward function, probability transition kernel, initial state distribution, and discount factor
  • The action space is finite
  • At each state, the agent picks an action and the environment evolves to a random next state
  • A stationary policy is a mapping from a state to a distribution over actions
  • The value function is the expectation of cumulative rewards received under a policy
  • The action value function is the expectation of cumulative rewards received under a policy for a given state and action

Simulator interaction protocol

  • Three protocols for interacting with MDP simulator: online access, local access, random access
  • Online access: initial state sampled from initial state distribution, agent can reset environment to initial state or move to next state given action
  • Local access: agent can reset environment to random initial state or previously observed state
  • Random access: agent can query simulator with any state-action pair to obtain reward and sample of next state

Algorithm framework

  • Algorithm framework for policy optimization with local access to a simulator
  • Simulator (Env) can be reset to any state observed during learning
  • Base agent (Agent) takes actions given observation of state and updates itself
  • Function measures uncertainty of agent about value of state-action pairs
  • History buffer stores necessary information to reset environment to particular state
  • Data collection process starts with given state-action pair
  • With probability init, initial state is sampled from initial state distribution
  • Otherwise, highest-uncertainty state-action pair is chosen as starting point
  • FIFO queue used to store states visited during training
  • Uncertainty metric for states can be used to choose most uncertain state

Base agents and uncertainty metrics

Base agents

  • Double Deep Q Network (DDQN) is an improvement of the original DQN agent
  • DDQN is updated by minimizing a loss over transition tuples
  • To improve exploration, an additive bonus or intrinsic reward can be used
  • Bootstrapped DDQN mimics the behavior of Thompson sampling
  • Distributional DDQN predicts the expectation of the cumulative reward using the Q-network
  • Approximate Policy Iteration (PI) updates the Q-function using least-squares Monte Carlo
  • PI can act greedily or with an acting-time bonus

Uncertainty estimation

  • Standard deviation of ensemble predictions used to evaluate agent uncertainty
  • Covariance of random state-action features used to evaluate agent uncertainty
  • Approximate counts used to evaluate agent uncertainty
  • Random network distillation used to evaluate agent uncertainty
  • Error of neural network used as uncertainty metric for states

Experiments

  • Evaluated benefits of local vs. online access by training agents on difficult exploration tasks
  • Used two bsuite environments and four Atari games
  • Games correspond to difficult exploration problems
  • Provided details on how to checkpoint and restore environment state in Appendix B
  • Provided hyperparameter choices in Appendix C
  • 95% confidence interval shown in all figures

Behavior suite experiments

  • Introduction of two bsuite environments
  • Illustrations of the two environments in Figure 1

Deep sea

  • Environment is an × grid with one-hot state encoding
  • Agent starts from top left corner and moves down-left or down-right depending on action
  • Moving right has a small cost of -0.01/
  • Reaching bottom-right corner gives reward of +1
  • Exploring uniformly at random has 2 − chance of finding high-reward state
  • Local access improves sample efficiency
  • Best performance achieved with small init
  • Sample efficiency of BootDDQN best with = |H |
  • Number of queries needed to achieve mean return 0.95 scales near-optimally with O ( 2.47 )
  • Local access leads to significant improvement over online access in hard version

Atari

  • Evaluated approach on four Atari games from the Arcade Learning Environment
  • Average human performance of 4753 points on one game
  • SOTA of 43K achieved by Go-Explore
  • Pitfall has sparse positive rewards and distractor rewards
  • PrivateEye has average human performance of 69K
  • Venture has denser rewards and average human score of 1187
  • Used DDQN and distributional DDQN implementations in Acme framework
  • Experimented with two different uncertainty metrics: approximate counts and RND
  • Local access significantly improves final return and sample efficiency
  • Number of cells found by online and local DDQN agents
  • Role of init and history buffer batch size in DDQN with approximate counts
  • DDQN-Intrinsic needs larger value of init to get good performance in local setting
  • Distributional DDQN with UFLP improves score of baseline algorithm to super-human level on Montezuma’s Revenge
  • Local access improves sample complexity and stability of baseline algorithm on PrivateEye
  • Neutral results on Venture
  • Both local and online access versions fail to obtain positive scores on Pitfall

Conclusions and future directions

  • Propose a new algorithmic framework for learning with a simulator under the local access protocol
  • Demonstrate that uncertainty-first approach to revisiting states in history can improve sample cost of baseline algorithms
  • Improve quality of uncertainty estimation in MDPs
  • Extend approach to partially observed environments
  • Cartpole Swingup environment: fix history buffer batch size = 5, study effect of init in default version
  • BootDDQN and DDQN-Bonus performance comparable for init values in [0.2, 1.0]
  • PI-Bonus local access leads to significant improvement over online access, best performance with small but non-zero init = 0.2
  • Fix init = 0.2, study effect of history buffer batch size for hard version
  • BootDDQN performance comparable for different values of
  • PI-Bonus choosing uncertain element, i.e., > 1 important
  • DDQN-Intrinsic local access finds more cells than online setting
  • DDQN with local access better mean return than DDQN-Intrinsic with online access
  • Evaluation results of DDQN-based agents in Table 1, distributional DDQN in Table 2
  • Max screen score seen during acting in Tables 3 and 4
  • DDQN-Intrinsic local agent with approximate count uncertainty reaches screen score 14300
  • For Atari games, use environment loader in Acme framework
  • For Q-networks, use MLP with two hidden layers, each with size 64
  • For BootDDQN, use ensemble of size 20 and prior scale of 40.0
  • For other agents, use covariance-based uncertainty metric cov with random Fourier features, feature dimension = 1500
  • For DDQN and DDQN-Bonus, regularization coefficient = 0.01, for PI-Bonus, = 0.1
  • For DDQN-Bonus and PI-Bonus, bonus scale = 1.0
  • Replay buffer of size 10 6
  • Horizontal position of cart denoted by , angle between pole and upright direction denoted by , angular velocity of pole denoted by
  • Default version: positive reward if and hard version: positive reward if
  • Model architecture: Acme AtariTorso network architecture followed by MLP with 512 hidden units
  • History buffer of size 10 6
  • Reset to highest-uncertainty state in sampled batch, then take random action
  • Sweep different combinations of init and
  • Use = 0.1 for Montezuma’s Revenge, = 0.01 for all other settings
  • For Montezuma’s Revenge, limit maximum length of all roll-outs to 2000
  • Use 64 CPU machines as actors and 1 TPU machine as learner, except RND use 128 actors