Abstract

  • Selecting exploratory actions to generate experience is a challenge in RL.
  • Options-based exploration builds on graph Laplacian eigenfunctions.
  • Previous methods were limited to tabular domains, required a separate option discovery phase, and relied on exact value function learning.
  • This paper introduces a deep RL algorithm to discover Laplacian-based options.
  • Evaluated on pixel-based tasks, compared to state-of-the-art exploration methods.

Paper Content

Introduction

  • In reinforcement learning (RL), an agent interacts with an environment to maximize rewards.
  • At each step, the agent receives an observation and a reward, and takes an action.
  • Exploration is promoted by selecting actions according to a specific policy for an extended period of time.
  • Options are a formalism for representing behaviour at different timescales.
  • Methods based on these ideas have been successful in various domains.
  • A central question is how to define temporal abstractions for exploration.
  • Solutions often revolve around simple approaches or domain-specific information.
  • Graph Laplacian is used to capture the topology of the environment at various timescales.
  • This paper extends Laplacian-based options framework to deep function approximation.
  • It introduces an online algorithm for discovering Laplacian-based options.
  • It is benchmarked against state-of-the-art exploration methods.
  • It is effective in non-stationary environments.

Preliminaries

  • Agent interacts with environment at time step t
  • Agent selects action A_t
  • Environment emits scalar reward R_{t+1}
  • Agent’s goal is to maximize expected discounted sum of rewards
  • Value-based methods used to obtain policy
  • Tabular case assigns value to each state
  • As task complexity increases, the optimal value function must be approximated
  • Neural networks used to parameterize Q
  • Double DQN and n-step targets used to improve performance
  • Options framework provides formalism for abstractions over time
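
The combination of Double DQN and n-step targets mentioned above can be sketched as follows. This is a minimal illustration in plain Python, not the paper's implementation: the function name, argument names, and the toy numbers are assumptions made for this example.

```python
def n_step_double_dqn_target(rewards, gamma, q_online_next, q_target_next, done=False):
    """Bootstrapped n-step return with Double DQN action selection.

    rewards:        [R_{t+1}, ..., R_{t+n}]
    q_online_next:  online-network action values at the state n steps ahead
    q_target_next:  target-network action values at that same state
    (All names here are hypothetical, chosen for illustration.)
    """
    n = len(rewards)
    # Discounted sum of the n observed rewards.
    target = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if not done:
        # Double DQN: the online network picks the action,
        # the target network evaluates it.
        best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
        target += (gamma ** n) * q_target_next[best_action]
    return target


example = n_step_double_dqn_target(
    rewards=[1.0, 0.0, 2.0],
    gamma=0.5,
    q_online_next=[0.2, 0.9],
    q_target_next=[4.0, 8.0],
)
```

Decoupling action selection from action evaluation is what distinguishes Double DQN from the plain DQN target, which takes the maximum of the target network's own values.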

Covering eigenoptions

  • Options framework provides formalism for temporal abstraction
  • Option discovery is a challenge in HRL
  • Variety of strategies exist to discover options
  • Paper focuses on representation-driven option discovery (ROD) cycle
  • ROD cycle has 3 main steps
  • Representation-driven option discovery uses eigenfunctions of graph Laplacian
  • Eigenfunctions capture environment dynamics at different timescales
  • Empirical results show CEO is more efficient at covering Four Rooms domain
  • Paper evaluates CEO in different environments and compares to exploration algorithms
  • CEO is effective in covering state space and leads to faster reward maximization
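
To make the tabular idea concrete, the sketch below builds the graph Laplacian of a small chain of states, extracts the eigenvector of its smallest non-zero eigenvalue, and turns it into an intrinsic reward of the form e[s'] - e[s]. That reward form follows the usual eigenoption convention and is an assumption here, as are the chain environment, the power-iteration routine, and all function names; the summary itself does not specify them.

```python
def path_graph_laplacian(n):
    """Combinatorial Laplacian L = D - A of a chain of n states."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        lap[i][i] += 1.0
        lap[i + 1][i + 1] += 1.0
        lap[i][i + 1] -= 1.0
        lap[i + 1][i] -= 1.0
    return lap


def smoothest_eigenvector(lap, iters=500):
    """Eigenvector of the smallest non-zero eigenvalue (connected graph assumed).

    Power iteration on (c*I - L) with the constant eigenvector projected out.
    """
    n = len(lap)
    c = 2.0 * max(lap[i][i] for i in range(n))  # upper bound on eigenvalues of L
    vec = [float(i) for i in range(n)]          # deterministic starting vector
    for _ in range(iters):
        mean = sum(vec) / n
        vec = [v - mean for v in vec]           # remove the constant component
        vec = [c * vec[i] - sum(lap[i][j] * vec[j] for j in range(n))
               for i in range(n)]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    return vec


def intrinsic_reward(eigvec, s, s_next):
    """Reward for moving along the eigenvector: e[s'] - e[s] (assumed form)."""
    return eigvec[s_next] - eigvec[s]


lap = path_graph_laplacian(5)
e = smoothest_eigenvector(lap)
```

On the chain, this eigenvector varies monotonically from one end to the other, so an option that maximizes the intrinsic reward travels toward one end of the chain. The explicit eigendecomposition step is what becomes impractical in large state spaces.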

Approximate Laplacian-based options

  • CEO relies on costly eigendecomposition operation to define intrinsic rewards
  • Estimating Laplacian matrix in large state spaces is impractical
  • Recent methods use neural networks to approximate Laplacian eigenfunctions
  • An objective borrowed from graph theory is adapted to make it amenable to online RL
  • Objective incentivizes smoothness and orthogonality between eigenfunctions
  • Open question whether generalized Laplacian objective can be used to discover unknown parts of state space
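
The two pressures named above, smoothness and orthogonality, can be sketched as a sample-based loss. The nested-sum weighting below follows one common formulation of the generalized graph drawing objective and should be read as an assumption, since the summary does not give the exact form. The function names, the `beta` coefficient, and the list-of-lists batch layout are also hypothetical.

```python
def smoothness_term(f_s, f_next):
    """Penalizes eigenfunction change across observed transitions (s, s').

    f_s, f_next: lists of d-dimensional feature lists for paired states.
    """
    d = len(f_s[0])
    total = 0.0
    for u, v in zip(f_s, f_next):
        for i in range(d):
            weight = d - i  # earlier dimensions appear in more nested sums
            total += weight * (u[i] - v[i]) ** 2
    return 0.5 * total / len(f_s)


def orthonormality_penalty(f_a, f_b):
    """Penalizes correlated or non-unit-norm dimensions.

    f_a, f_b: features of two independently sampled batches of states.
    """
    d = len(f_a[0])
    total = 0.0
    for u, v in zip(f_a, f_b):
        for l in range(1, d + 1):
            for j in range(l):
                for k in range(l):
                    delta = 1.0 if j == k else 0.0
                    total += (u[j] * u[k] - delta) * (v[j] * v[k] - delta)
    return total / len(f_a)


def laplacian_loss(f_s, f_next, f_a, f_b, beta=1.0):
    """Smoothness plus beta-weighted orthonormality penalty (beta is assumed)."""
    return smoothness_term(f_s, f_next) + beta * orthonormality_penalty(f_a, f_b)
```

Because both terms are averages over sampled states and transitions, the loss can be minimized with stochastic gradient steps on a neural network, with no need to build or decompose the Laplacian matrix.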

A two-phased scalable method

  • The generalized Laplacian objective is well suited to option discovery methods
  • Laplacian-based options are defined by intrinsic rewards generated by eigenfunctions of Laplacian
  • Approximation errors can lead to bad reward functions
  • Deep function approximation used for learning main policy and options’ policies
  • Option termination defined by uniform random probability
  • Deep Covering Eigenoptions (DCEO) algorithm introduced
  • Validated efficacy on pixel-based versions of environments
  • DCEO outperforms several state-of-the-art methods
  • DCEO agent learns options by maximizing intrinsic rewards based on approximated eigenfunctions of graph Laplacian
  • DCEO agent learns to maximize reward with DDQN and n-step targets
  • DCEO agent explores with sampled option’s policy until it terminates
  • Evaluated DCEO in terms of state coverage and reward maximization
  • DCEO performs at least as well as well-established baselines
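
The exploration behaviour described above, sampling an option and following its policy until it terminates with a uniform random probability, can be sketched as a small action selector. This is an illustrative sketch only: the class name, the `start_prob` parameter, and the use of plain callables as policies are assumptions, not details from the paper.

```python
import random


class OptionExplorer:
    """Chooses between the main policy and temporally extended options."""

    def __init__(self, main_policy, option_policies, start_prob, term_prob, seed=0):
        self.main_policy = main_policy
        self.option_policies = option_policies
        self.start_prob = start_prob    # chance of starting an option (assumed)
        self.term_prob = term_prob      # uniform per-step termination chance
        self.rng = random.Random(seed)
        self.active = None              # index of the option being followed

    def act(self, obs):
        # An active option terminates with the same probability at every step.
        if self.active is not None and self.rng.random() < self.term_prob:
            self.active = None
        # When no option is running, occasionally sample one uniformly.
        if self.active is None and self.rng.random() < self.start_prob:
            self.active = self.rng.randrange(len(self.option_policies))
        if self.active is None:
            return self.main_policy(obs)
        return self.option_policies[self.active](obs)
```

With a per-step termination probability p, an option lasts 1/p steps on average, so p directly sets the timescale of the exploratory behaviour.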

A single continuous cycle

  • Laplacian-based options are effective for exploration under deep function approximation
  • The two-phased method requires an initial phase dedicated to option discovery, which is problematic
  • Algorithm introduced to go beyond these limitations and be fully online and generally applicable

Online discovery of Laplacian-based options

  • Randomly initializes Laplacian representation, options, and main DDQN learner
  • Adjusts options as Laplacian becomes more accurate
  • Addresses policy-dependent concern by being online
  • Evaluated in pixel-based environments
  • Competitive with or outperforms other baselines
  • Similar performance to two-phased algorithm
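
One way to picture the fully online cycle is a single update step that refines the Laplacian representation, trains every option on intrinsic rewards recomputed from the current representation, and trains the main learner on the environment reward. The sketch below assumes this ordering and uses placeholder callables for all three learners; none of these names or signatures come from the paper.

```python
def intrinsic_rewards(encoder, obs, next_obs):
    """One reward per option: the change in each approximate eigenfunction."""
    f, f_next = encoder(obs), encoder(next_obs)
    return [after - before for before, after in zip(f, f_next)]


def online_step(transition, encoder, update_laplacian, update_option, update_main):
    """One pass of a fully online cycle; every update_* callable is a placeholder."""
    obs, action, reward, next_obs = transition
    # Refine the Laplacian representation from the observed transition.
    update_laplacian(obs, next_obs)
    # Rewards are recomputed with the current encoder, so each option
    # tracks the representation as it becomes more accurate.
    for index, r_int in enumerate(intrinsic_rewards(encoder, obs, next_obs)):
        update_option(index, obs, action, r_int, next_obs)
    # The main learner is trained on the environment (extrinsic) reward.
    update_main(obs, action, reward, next_obs)
```

Recomputing intrinsic rewards at update time, instead of storing them with the transition, is the design choice that lets options adjust without a separate discovery phase.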

Exploration in the face of non-stationarity

  • Options can encode temporally extended behaviours
  • Options can be leveraged for continual exploration
  • Count-based and prediction-error-based methods are not as flexible in the face of non-stationarity
  • DCEO adapts quickly to changes in environment
  • DCEO’s performance remains robust to drastic changes in environment’s topology

Objects, obstacles and complex interactions

  • DCEO algorithm is effective for exploration
  • Environments have different topologies
  • Environments require different levels of abstraction
  • All transitions are stochastic
  • DCEO algorithm extends to challenging domains
  • DCEO outperforms other baselines in complex environment

Scaling up further

  • Investigated scalability of Laplacian representation to higher dimensional environments and partial observability
  • Performed experiments on Atari 2600 game Montezuma’s Revenge and 3D navigation task MiniWorld-FourRooms
  • Trained Laplacian with trajectories from random walks
  • Qualitative results show first eigenfunctions point to important stepping stones in Montezuma’s Revenge and traverse the observation space in MiniWorld-FourRooms
  • Suggests first options discovered in such environments without domain knowledge are meaningful

Conclusion

  • Introduced a scalable and generally applicable algorithm for Laplacian-based option discovery
  • Extended a tabular approach into an online method compatible with deep function approximation
  • Proposed a strategy for incorporating option discovery and reward maximization
  • Algorithm performs better than state-of-the-art baselines on a variety of environments and settings
  • First time a Laplacian-based method has done so
  • Results in non-stationary environments are promising
  • Variety of research directions for future improvement
  • Graph Laplacian introduced in RL through Proto-Value Functions framework
  • Used to learn eigenoptions
  • Full eigenspectrum used to define options that leverage diffusion distance
  • Options learned through mutual information-based objectives
  • Laplacian representation introduced by Wu et al. (2019)
  • DCEO learns a diversity of options simultaneously
  • Graph Laplacian used for credit assignment, reward shaping, and representation learning
  • Nine rooms, Maze, and Rubik’s cube 2x2 used as environments
  • Agent starts from particular position in each environment
  • State coverage experiments use random policy or exploration approach
  • CEO algorithm introduces two hyperparameters
  • Reward maximization experiments use Q-learning algorithm
  • CEO performs significantly better or is on par with evaluated baselines
  • CEO covers environment in a fundamentally different way
  • CEO avoids issue of detachment in exploration
  • CEO uses two-phased algorithm for reward maximization