Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- This paper focuses on sample-efficient deep reinforcement learning with a simulator.
- The proposed algorithmic framework, UFLP, takes advantage of the ability to reset the environment to a previously observed state.
- UFLP can dramatically improve the sample cost of several baseline RL algorithms on difficult exploration tasks.
- UFLP can achieve super-human performance on the Atari game, Montezuma’s Revenge.
Paper Content
Introduction
- Simulators are used in modern reinforcement learning
- Local access protocol allows agent to revisit previously observed states
- Local access can improve exploration of state space
- Recent works have proposed sample- and computationally-efficient algorithms under local access protocol
- Uncertainty-based intrinsic rewards and bonuses are used to encourage exploration
- Randomized value functions are another approach to exploration
Problem setting
- MDPs are characterized by a tuple of elements
- The elements include a state space, action space, reward function, probability transition kernel, initial state distribution, and discount factor
- The action space is finite
- At each state, the agent picks an action and the environment evolves to a random next state
- A stationary policy is a mapping from a state to a distribution over actions
- The value function is the expectation of cumulative rewards received under a policy
- The action value function is the expectation of cumulative rewards received under a policy for a given state and action
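The two definitions above can be written out in standard notation. The symbols here ($\pi$ for the policy, $\gamma$ for the discount factor, $r$ for the reward function) are assumptions, since this summary does not state the paper's own notation.

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s \right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a \right].
```

Under these definitions the two are related by $V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)$, which is well defined because the action space is finite.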
Simulator interaction protocol
- Three protocols for interacting with MDP simulator: online access, local access, random access
- Online access: initial state sampled from initial state distribution, agent can reset environment to initial state or move to next state given action
- Local access: agent can reset environment to random initial state or previously observed state
- Random access: agent can query simulator with any state-action pair to obtain reward and sample of next state
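The difference between the three protocols is easiest to see as three sets of allowed calls on one simulator. The following is a minimal sketch on a toy chain MDP; the class `ChainSimulator` and its method names (`reset`, `step`, `reset_to`, `query`) are hypothetical and not taken from the paper or from any library.

```python
class ChainSimulator:
    """Toy chain MDP used only to illustrate the three access protocols."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0
        self.visited = {0}  # states observed so far (needed for local access)

    def _transition(self, state, action):
        # Action 1 moves right, action 0 moves left; reward 1.0 on reaching the last state.
        if action == 1:
            nxt = min(state + 1, self.n_states - 1)
        else:
            nxt = max(state - 1, 0)
        reward = 1.0 if nxt == self.n_states - 1 else 0.0
        return nxt, reward

    # Online access: reset to an initial state, or move forward with an action.
    def reset(self):
        self.state = 0  # assumption: the initial state distribution is a point mass on 0
        return self.state

    def step(self, action):
        self.state, reward = self._transition(self.state, action)
        self.visited.add(self.state)
        return self.state, reward

    # Local access: additionally allows resets to any previously observed state.
    def reset_to(self, state):
        if state not in self.visited:
            raise ValueError("local access only allows resets to observed states")
        self.state = state
        return self.state

    # Random access: query any state-action pair, observed or not.
    def query(self, state, action):
        return self._transition(state, action)
```

An online agent would use only `reset` and `step`, a local-access agent may also call `reset_to`, and a random-access agent may call `query` on arbitrary pairs.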
Algorithm framework
- Algorithm framework for policy optimization with local access to a simulator
- Simulator (Env) can be reset to any state observed during learning
- Base agent (Agent) takes actions given observation of state and updates itself
- Function measures uncertainty of agent about value of state-action pairs
- History buffer stores necessary information to reset environment to particular state
- Data collection process starts with given state-action pair
- With probability p_init, the initial state is sampled from the initial state distribution
- Otherwise, highest-uncertainty state-action pair is chosen as starting point
- FIFO queue used to store states visited during training
- Uncertainty metric for states can be used to choose most uncertain state
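The choice of starting point described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function `choose_start`, its parameter names, and the use of a plain `deque` as the FIFO history buffer are all hypothetical, and the sketch selects among states rather than state-action pairs.

```python
import random
from collections import deque


def choose_start(history, uncertainty, sample_initial_state, p_init, batch_size, rng):
    """Return the state from which the next roll-out should start.

    history: FIFO buffer of previously visited states
    uncertainty: callable mapping a state to a scalar uncertainty score
    sample_initial_state: callable drawing from the initial state distribution
    """
    # With probability p_init (or if nothing has been visited yet),
    # start from a fresh initial state, as in the online protocol.
    if not history or rng.random() < p_init:
        return sample_initial_state()
    # Otherwise sample a batch from the history buffer and
    # reset to the state the agent is most uncertain about.
    batch = rng.sample(list(history), min(batch_size, len(history)))
    return max(batch, key=uncertainty)


# Usage with integer states and a made-up uncertainty table.
rng = random.Random(0)
history = deque([0, 1, 2, 3], maxlen=1000)  # oldest entries are dropped first
scores = {0: 0.1, 1: 0.9, 2: 0.4, 3: 0.2}
start = choose_start(history, scores.get, lambda: 0, p_init=0.0, batch_size=4, rng=rng)
print(start)
```

Setting `p_init=1.0` recovers pure online behaviour, while a batch size of 1 reduces the uncertainty-first choice to a uniform draw from the history buffer.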
Base agents and uncertainty metrics
Base agents
- Double Deep Q Network (DDQN) is an improvement of the original DQN agent
- DDQN is updated by minimizing a loss over transition tuples
- To improve exploration, an additive bonus or intrinsic reward can be used
- Bootstrapped DDQN mimics the behavior of Thompson sampling
- Distributional DDQN predicts the distribution of the cumulative reward, rather than only its expectation as the DDQN Q-network does
- Approximate Policy Iteration (PI) updates the Q-function using least-squares Monte Carlo
- PI can act greedily or with an acting-time bonus
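For reference, the loss over transition tuples used by DDQN is built on the standard double Q-learning target: the online network selects the next action and the target network evaluates it. The sketch below shows that target for a single transition in plain Python. The function names and the optional `intrinsic_reward` argument are assumptions made for illustration, and the exact loss used in the paper may differ.

```python
def ddqn_target(reward, discount, done, q_online_next, q_target_next, intrinsic_reward=0.0):
    """Double DQN regression target for one transition (s, a, r, s').

    q_online_next / q_target_next: lists of Q-values at s', one entry per action,
    from the online and target networks respectively.
    intrinsic_reward: optional exploration reward added to the extrinsic reward.
    """
    total_reward = reward + intrinsic_reward
    if done:
        return total_reward
    # Action selection uses the online network ...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ... while action evaluation uses the target network.
    return total_reward + discount * q_target_next[best_action]


def squared_td_loss(q_value, target):
    """Squared error between the predicted Q(s, a) and the target."""
    return 0.5 * (q_value - target) ** 2
```

Decoupling selection from evaluation is what distinguishes DDQN from DQN, where the target network would both select and evaluate the next action.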
Uncertainty estimation
- Standard deviation of ensemble predictions used to evaluate agent uncertainty
- Covariance of random state-action features used to evaluate agent uncertainty
- Approximate counts used to evaluate agent uncertainty
- Random network distillation used to evaluate agent uncertainty
- Error of neural network used as uncertainty metric for states
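Two of the metrics listed above are simple enough to sketch directly. The code below is an illustration only: `ensemble_std` and `CountUncertainty` are hypothetical names, the count-based score uses exact counts over hashable keys instead of the approximate counts mentioned above, and the `1 / sqrt(n + 1)` form is a common choice assumed here, not one confirmed by this summary.

```python
import math
import statistics
from collections import Counter


def ensemble_std(predictions):
    """Uncertainty as the standard deviation of the ensemble members'
    Q-value predictions for one state-action pair."""
    return statistics.pstdev(predictions)


class CountUncertainty:
    """Count-based uncertainty: rarely visited keys get a higher score."""

    def __init__(self):
        self.counts = Counter()

    def update(self, key):
        self.counts[key] += 1

    def __call__(self, key):
        # Unvisited keys have count 0 and therefore the maximal score 1.0.
        return 1.0 / math.sqrt(self.counts[key] + 1)


# Usage: members that agree give zero uncertainty, disagreement gives a positive score.
print(ensemble_std([1.0, 1.0, 1.0]))
print(ensemble_std([0.0, 2.0]))
```

Either score can be plugged in wherever the framework asks for an uncertainty function over states or state-action pairs.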
Experiments
- Evaluated benefits of local vs. online access by training agents on difficult exploration tasks
- Used two bsuite environments and four Atari games
- Games correspond to difficult exploration problems
- Provided details on how to checkpoint and restore environment state in Appendix B
- Provided hyperparameter choices in Appendix C
- 95% confidence interval shown in all figures
Behavior suite experiments
- Introduction of two bsuite environments
- Illustrations of the two environments in Figure 1
Deep sea
- Environment is an $N \times N$ grid with one-hot state encoding
- Agent starts from top left corner and moves down-left or down-right depending on action
- Moving right has a small cost of $-0.01/N$
- Reaching bottom-right corner gives reward of +1
- Exploring uniformly at random has a $2^{-N}$ chance of finding the high-reward state
- Local access improves sample efficiency
- Best performance achieved with small p_init
- BootDDQN is most sample-efficient when the history buffer batch size equals $|H|$ (the history buffer size)
- Number of queries needed to achieve mean return 0.95 scales near-optimally, as $O(N^{2.47})$
- Local access leads to significant improvement over online access in hard version
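The Deep Sea dynamics described above explain why uniform exploration fails: only one action sequence reaches the rewarding corner. The sketch below is loosely modeled on that description. The function names are hypothetical, and granting the +1 reward on the final right move from the bottom-right cell is an assumption about the reward placement.

```python
from itertools import product


def deep_sea_return(actions, size):
    """Total reward of one episode; actions are 0 (down-left) or 1 (down-right)."""
    row, column, total = 0, 0, 0.0
    for action in actions:
        if action == 1:
            total -= 0.01 / size  # small cost for every move to the right
            if row == size - 1 and column == size - 1:
                total += 1.0      # reward at the bottom-right corner
            column = min(column + 1, size - 1)
        else:
            column = max(column - 1, 0)
        row += 1                  # the agent descends one row per step
    return total


def fraction_rewarded(size):
    """Fraction of all length-`size` action sequences that collect the +1 reward."""
    hits = sum(deep_sea_return(seq, size) > 0.5 for seq in product((0, 1), repeat=size))
    return hits / 2 ** size


print(deep_sea_return([1] * 4, 4))
print(fraction_rewarded(4))
```

Enumerating all sequences is only feasible for small grids, but it shows the exponential decay of the success probability that makes this environment a hard exploration problem.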
Atari
- Evaluated approach on four Atari games from the Arcade Learning Environment
- Montezuma’s Revenge has average human performance of 4753 points
- SOTA of 43K on Montezuma’s Revenge achieved by Go-Explore
- Pitfall has sparse positive rewards and distractor rewards
- PrivateEye has average human performance of 69K
- Venture has denser rewards and average human score of 1187
- Used DDQN and distributional DDQN implementations in Acme framework
- Experimented with two different uncertainty metrics: approximate counts and RND
- Local access significantly improves final return and sample efficiency
- Number of cells found by online and local DDQN agents
- Role of p_init and history buffer batch size in DDQN with approximate counts
- DDQN-Intrinsic needs a larger value of p_init to get good performance in the local setting
- Distributional DDQN with UFLP improves score of baseline algorithm to super-human level on Montezuma’s Revenge
- Local access improves sample complexity and stability of baseline algorithm on PrivateEye
- Neutral results on Venture
- Both local and online access versions fail to obtain positive scores on Pitfall
Conclusions and future directions
- Propose a new algorithmic framework for learning with a simulator under the local access protocol
- Demonstrate that uncertainty-first approach to revisiting states in history can improve sample cost of baseline algorithms
- Future direction: improve quality of uncertainty estimation in MDPs
- Future direction: extend approach to partially observed environments
Appendix
- Cartpole Swingup environment: fix the history buffer batch size to 5 and study the effect of p_init in the default version
- BootDDQN and DDQN-Bonus perform comparably for p_init values in [0.2, 1.0]
- For PI-Bonus, local access leads to a significant improvement over online access, with the best performance at a small but non-zero p_init = 0.2
- Fix p_init = 0.2 and study the effect of the history buffer batch size in the hard version
- BootDDQN performs comparably for different values of the history buffer batch size
- For PI-Bonus, choosing an uncertain element, i.e., a history buffer batch size > 1, is important
- DDQN-Intrinsic local access finds more cells than online setting
- DDQN with local access better mean return than DDQN-Intrinsic with online access
- Evaluation results of DDQN-based agents in Table 1, distributional DDQN in Table 2
- Max screen score seen during acting in Tables 3 and 4
- DDQN-Intrinsic local agent with approximate count uncertainty reaches screen score 14300
- For Atari games, use environment loader in Acme framework
- For Q-networks, use MLP with two hidden layers, each with size 64
- For BootDDQN, use ensemble of size 20 and prior scale of 40.0
- For other agents, use covariance-based uncertainty metric cov with random Fourier features, feature dimension = 1500
- For DDQN and DDQN-Bonus, use a regularization coefficient of 0.01; for PI-Bonus, use 0.1
- For DDQN-Bonus and PI-Bonus, bonus scale = 1.0
- Replay buffer of size $10^6$
- Cartpole state includes the horizontal position of the cart, the angle between the pole and the upright direction, and the angular velocity of the pole
- The default and hard versions each give a positive reward only when a version-specific condition on these quantities holds
- Model architecture: Acme AtariTorso network architecture followed by MLP with 512 hidden units
- History buffer of size $10^6$
- Reset to highest-uncertainty state in sampled batch, then take random action
- Sweep different combinations of p_init and history buffer batch size
- Use = 0.1 for Montezuma’s Revenge, = 0.01 for all other settings
- For Montezuma’s Revenge, limit maximum length of all roll-outs to 2000
- Use 64 CPU machines as actors and 1 TPU machine as learner, except for RND, which uses 128 actors