Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Selecting exploratory actions to generate experience is a challenge in RL.
- Options-based exploration builds on graph Laplacian eigenfunctions.
- Previous methods were limited to tabular domains, required a separate option discovery phase, and relied on exact value function learning.
- This paper introduces a deep RL algorithm to discover Laplacian-based options.
- Evaluated on pixel-based tasks, compared to state-of-the-art exploration methods.
Paper Content
Introduction
- In reinforcement learning (RL), an agent interacts with an environment to maximize reward.
- At each step, the agent receives an observation and a reward, and takes an action.
- Exploration is promoted by selecting actions according to a specific policy for an extended period of time.
- Options are a formalism for representing behaviour at different timescales.
- Methods based on these ideas have been successful in various domains.
- A central question is how to define temporal abstractions for exploration.
- Solutions often revolve around simple approaches or domain-specific information.
- Graph Laplacian is used to capture the topology of the environment at various timescales.
- This paper extends Laplacian-based options framework to deep function approximation.
- It introduces an online algorithm for discovering Laplacian-based options.
- It is benchmarked against state-of-the-art exploration methods.
- It is effective in non-stationary environments.
Preliminaries
- Agent interacts with environment at time step t
- Agent selects action A_t
- Environment emits scalar reward R_{t+1}
- Agent’s goal is to maximize expected discounted sum of rewards
- Value-based methods used to obtain policy
- Tabular case assigns value to each state
- As task complexity increases, the optimal value function must be approximated
- Neural networks used to parameterize Q
- Double DQN and n-step targets used to improve performance
- Options framework provides formalism for abstractions over time
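The n-step Double DQN target mentioned in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `n_step_double_dqn_target`, the plain-list Q-value inputs, and the default `gamma` are all hypothetical. The online network selects the bootstrap action and the target network evaluates it.

```python
def n_step_double_dqn_target(rewards, q_online_next, q_target_next,
                             gamma=0.99, terminal=False):
    """Return an n-step Double DQN target for one trajectory segment.

    rewards        -- [R_{t+1}, ..., R_{t+n}] observed along the segment
    q_online_next  -- online-network action values at S_{t+n}
    q_target_next  -- target-network action values at S_{t+n}
    terminal       -- True if S_{t+n} is terminal (no bootstrapping)
    """
    n = len(rewards)
    # Discounted sum of the n observed rewards.
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if terminal:
        return g
    # Double DQN: pick the action with the online network ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... but evaluate that action with the target network.
    return g + (gamma ** n) * q_target_next[best_action]


target = n_step_double_dqn_target([1.0, 0.0, 2.0], [0.5, 1.5], [1.0, 0.2], gamma=0.5)
print(target)
```

Decoupling action selection from action evaluation is what distinguishes Double DQN from the plain max over target-network values.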
Covering eigenoptions
- Options framework provides formalism for temporal abstraction
- Option discovery is a challenge in hierarchical reinforcement learning (HRL)
- Variety of strategies exist to discover options
- Paper focuses on representation-driven option discovery (ROD) cycle
- ROD cycle has 3 main steps
- Representation-driven option discovery uses eigenfunctions of graph Laplacian
- Eigenfunctions capture environment dynamics at different timescales
- Empirical results show covering eigenoptions (CEO) are more efficient at covering the Four Rooms domain
- Paper evaluates CEO in different environments and compares to exploration algorithms
- CEO is effective in covering state space and leads to faster reward maximization
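To make the tabular idea concrete, the sketch below computes one eigenvector of the graph Laplacian of a six-state chain and turns it into an intrinsic reward. Several choices are assumptions made for illustration: the chain graph, the helper names `fiedler_vector` and `intrinsic_reward`, and the use of power iteration as a dependency-free stand-in for an exact eigendecomposition. The reward e(s') - e(s) is the usual eigenoption form and is assumed here rather than taken from this summary.

```python
def fiedler_vector(adjacency, iters=500):
    """Approximate the eigenvector of L = D - A with the second-smallest
    eigenvalue, assuming a connected, undirected graph.

    Uses power iteration on (c*I - L) with the constant vector projected out.
    """
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    c = 2.0 * max(degree)  # upper bound on the largest eigenvalue of L
    x = [float(i) for i in range(n)]  # arbitrary non-constant start
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]  # remove the constant eigenvector
        # y = (c*I - L) x = (c - D) x + A x
        y = [(c - degree[i]) * x[i]
             + sum(adjacency[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x


def intrinsic_reward(e, s, s_next):
    """Reward for moving from s to s_next along eigenvector e."""
    return e[s_next] - e[s]


n = 6
chain = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
e = fiedler_vector(chain)
print([round(v, 3) for v in e])
print(intrinsic_reward(e, 0, 1))
```

On the chain, the eigenvector varies monotonically from one end to the other, so an option that maximizes this reward travels toward one end of the chain. The sign of an eigenvector is arbitrary, so the direction depends on the starting vector.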
Approximate Laplacian-based options
- CEO relies on costly eigendecomposition operation to define intrinsic rewards
- Estimating Laplacian matrix in large state spaces is impractical
- Recent methods use neural networks to approximate Laplacian eigenfunctions
- Objective borrowed from graph theory to make it amenable to online RL
- Objective incentivizes smoothness and orthogonality between eigenfunctions
- Open question whether generalized Laplacian objective can be used to discover unknown parts of state space
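The two ingredients named above, smoothness and orthogonality, can be sketched as a sample-based loss. This is a simplified illustration and not the exact generalized objective used in the paper: the function name `laplacian_objective`, the weight `beta`, and the squared Frobenius penalty toward the identity are assumptions chosen to keep the example short.

```python
def laplacian_objective(f_s, f_next, f_indep, beta=1.0):
    """Smoothness over transitions plus beta times an orthonormality penalty.

    f_s, f_next -- feature vectors f(s), f(s') for sampled transitions
    f_indep     -- feature vectors for independently sampled states
    """
    d = len(f_s[0])
    # Smoothness: features of consecutive states should be close.
    smooth = sum(
        sum((a - b) ** 2 for a, b in zip(u, v)) for u, v in zip(f_s, f_next)
    ) / len(f_s)
    # Second moment C = E[f(s) f(s)^T], estimated from independent samples.
    m = len(f_indep)
    second_moment = [
        [sum(f[i] * f[j] for f in f_indep) / m for j in range(d)]
        for i in range(d)
    ]
    # Orthonormality: squared Frobenius distance between C and the identity.
    ortho = sum(
        (second_moment[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(d) for j in range(d)
    )
    return smooth + beta * ortho


orthonormal = [[2 ** 0.5, 0.0], [0.0, 2 ** 0.5]]
collapsed = [[1.0, 1.0], [1.0, 1.0]]
print(laplacian_objective(orthonormal, orthonormal, orthonormal))
print(laplacian_objective(collapsed, collapsed, collapsed))
```

The second call shows why the penalty is needed: a collapsed representation is perfectly smooth, so only the orthogonality term keeps the components from becoming identical.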
A two-phased scalable method
- The generalized Laplacian objective is well suited for use by option discovery methods
- Laplacian-based options are defined by intrinsic rewards generated by eigenfunctions of Laplacian
- Approximation errors can lead to bad reward functions
- Deep function approximation used for learning main policy and options’ policies
- Option termination defined by uniform random probability
- Deep Covering Eigenoptions (DCEO) algorithm introduced
- Validated efficacy on pixel-based versions of environments
- DCEO outperforms several state-of-the-art methods
- DCEO agent learns options by maximizing intrinsic rewards based on approximated eigenfunctions of graph Laplacian
- DCEO agent learns to maximize reward with DDQN and n-step targets
- DCEO agent explores with sampled option’s policy until it terminates
- Evaluated DCEO in terms of state coverage and reward maximization
- DCEO matches or outperforms well-established baselines
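The exploration behaviour described in this section, following a sampled option's policy until it terminates, might look like the sketch below. It is not the authors' code: the class name `OptionExplorer`, the parameters `option_prob` and `term_prob`, and the reading of "uniform random probability" as a fixed per-step termination probability are assumptions.

```python
import random


class OptionExplorer:
    """Chooses between the main policy and temporally extended options."""

    def __init__(self, main_policy, option_policies,
                 option_prob=0.1, term_prob=0.1, rng=None):
        self.main_policy = main_policy          # callable: obs -> action
        self.option_policies = option_policies  # list of callables: obs -> action
        self.option_prob = option_prob          # chance of starting an option (assumed)
        self.term_prob = term_prob              # per-step termination chance (assumed)
        self.rng = rng or random.Random()
        self.active = None                      # index of the running option, if any

    def act(self, obs):
        # A running option terminates with a fixed probability at each step.
        if self.active is not None and self.rng.random() < self.term_prob:
            self.active = None
        # With no option running, occasionally sample one uniformly at random.
        if self.active is None and self.rng.random() < self.option_prob:
            self.active = self.rng.randrange(len(self.option_policies))
        if self.active is not None:
            return self.option_policies[self.active](obs)
        return self.main_policy(obs)


explorer = OptionExplorer(
    main_policy=lambda obs: "greedy",
    option_policies=[lambda obs: "option-0", lambda obs: "option-1"],
    rng=random.Random(0),
)
print([explorer.act(None) for _ in range(10)])
```

Because the option stays active across calls to `act`, the agent commits to one behaviour for several steps instead of dithering step by step.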
A single continuous cycle
- Laplacian-based options are effective for exploration under deep function approximation
- The two-phased method needs an initial phase dedicated to option discovery, which is problematic
- Algorithm introduced to go beyond these limitations and be fully online and generally applicable
Online discovery of Laplacian-based options
- Randomly initializes Laplacian representation, options, and main DDQN learner
- Adjusts options as Laplacian becomes more accurate
- Learning online addresses the concern that the Laplacian representation depends on the policy used to collect data
- Evaluated in pixel-based environments
- Competitive with or outperforms other baselines
- Similar performance to two-phased algorithm
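One way options can adjust as the Laplacian estimate improves is to compute intrinsic rewards from the current representation each time a batch is replayed, rather than storing them with the transition. The sketch below illustrates that idea only. The function name `intrinsic_rewards`, the toy `early` and `late` representations, and the assumption that rewards are relabelled at update time are all hypothetical.

```python
def intrinsic_rewards(batch, representation, k):
    """Reward for option k on each replayed transition (s, s_next).

    `representation` maps a state to a list of approximate eigenfunction
    values. It is queried at update time, so the rewards always follow the
    latest estimate.
    """
    return [representation(s_next)[k] - representation(s)[k]
            for s, s_next in batch]


# Toy stand-ins for the representation early and late in training.
early = lambda s: [0.0, 0.1 * s]
late = lambda s: [0.5 * s, -0.2 * s]

replay = [(0, 1), (1, 2), (2, 1)]
print(intrinsic_rewards(replay, early, 0))  # rewards under the early estimate
print(intrinsic_rewards(replay, late, 0))   # same transitions, updated estimate
```

The same stored transitions yield different option rewards once the representation changes, so no separate discovery phase is needed to refresh them.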
Exploration in the face of non-stationarity
- Options can encode temporally extended behaviours
- Options can be leveraged for continual exploration
- Count-based and prediction-error-based methods are not as flexible in the face of non-stationarity
- DCEO adapts quickly to changes in environment
- DCEO’s performance remains robust to drastic changes in environment’s topology
Objects, obstacles and complex interactions
- DCEO algorithm is effective for exploration
- Environments have different topologies
- Environments require different levels of abstraction
- All transitions are stochastic
- DCEO algorithm extends to challenging domains
- DCEO outperforms other baselines in complex environment
Scaling up further
- Investigated scalability of Laplacian representation to higher dimensional environments and partial observability
- Performed experiments on Atari 2600 game Montezuma’s Revenge and 3D navigation task MiniWorld-FourRooms
- Trained Laplacian with trajectories from random walks
- Qualitative results show first eigenfunctions point to important stepping stones in Montezuma’s Revenge and traverse the observation space in MiniWorld-FourRooms
- Suggests first options discovered in such environments without domain knowledge are meaningful
Conclusion
- Introduced a scalable and generally applicable algorithm for Laplacian-based option discovery
- Extended a tabular approach into an online method compatible with deep function approximation
- Proposed a strategy for incorporating option discovery and reward maximization
- Algorithm performs better than state-of-the-art baselines on a variety of environments and settings
- First time a Laplacian-based method has done so
- Results in non-stationary environments are promising
- Variety of research directions for future improvement
- Graph Laplacian introduced in RL through Proto-Value Functions framework
- Used to learn eigenoptions
- Full eigenspectrum used to define options that leverage diffusion distance
- Options learned through mutual information-based objectives
- Laplacian representation introduced by Wu et al. (2019)
- DCEO learns a diversity of options simultaneously
- Graph Laplacian used for credit assignment, reward shaping, and representation learning
- Nine rooms, Maze, and Rubik’s cube 2x2 used as environments
- Agent starts from particular position in each environment
- State coverage experiments use random policy or exploration approach
- CEO algorithm introduces two hyperparameters
- Reward maximization experiments use Q-learning algorithm
- CEO performs significantly better than or on par with the evaluated baselines
- CEO covers environment in a fundamentally different way
- CEO avoids issue of detachment in exploration
- CEO uses two-phased algorithm for reward maximization