Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed a new method for optimistic planning in infinite-horizon discounted Markov decision processes
- Method adds regularization to updates of approximate value iteration procedure
- Allows use of approximate transition functions estimated via least-squares procedures in MDPs with linear function approximation
- Provides computationally efficient algorithm for learning near-optimal policies in discounted linear kernel MDPs from single stream of experience
- Achieves near-optimal statistical guarantees
Paper Content
Introduction
- The idea of constructing a confidence set of statistically plausible models and picking a policy that maximizes the expected return can be traced back to Lai & Robbins (1985).
- This design principle is known as optimism in the face of uncertainty.
- Jaksch, Ortner, and Auer (2010) achieved a big breakthrough with their UCRL2 algorithm.
- Computational efficiency of optimistic methods relies on the implementation of the optimistic planning subroutine.
- Jaksch et al. (2010) used extended value iteration (EVI).
- Fruit et al. (2018) and Lattimore & Szepesvári (2020) made mild adjustments to EVI.
- Qian et al. (2018) proposed even more effective optimistic dynamic programming procedures.
- Jin, Yang, Wang, and Jordan (2020) extended the idea of optimistic exploration to a class of large-scale MDPs.
- Wei, Jahromi, Luo, and Jain (2021) proposed algorithms that are either statistically or computationally efficient.
- Vial, Parulekar, Shakkottai, and Srikant (2022) provided approximate DP methods for stochastic shortest path problems.
- Geist, Scherrer, and Pietquin (2019) proposed Mirror-Descent Modified Policy Iteration (MD-MPI).
- Zhou, He, and Gu (2021) used a version of EVI adapted to linear function approximation.
- We propose to solve the infinite-horizon optimistic planning problem using regularized dynamic programming.
Preliminaries
- MDP is a sequential interaction between an agent and its environment
- Agent’s goal is to pick sequence of actions to maximize total discounted return
- Policy is a mapping from a state to a probability measure over actions
- Value function and action-value function of a policy are defined
- Transition operator acts on functions of the state; its adjoint acts on state-action distributions
- Bellman equations tie together value and action-value functions
- Discounted occupancy measure is induced by a policy
- Effective horizon H = 1/(1 - γ) is the inverse of the normalization constant (1 - γ)
- State-action occupancy measure satisfies recurrence relation
- Discounted return of a policy can be written as $R^\pi_\gamma$
- Online learning in discounted MDPs aims to produce an ε-optimal policy
- Performance is measured in terms of number of samples necessary to guarantee output policy is ε-optimal
- Regret of learner is also considered
- Linear kernel MDPs are distinct from linear MDPs
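The quantities listed above can be made concrete on a small tabular MDP. The following is a minimal sketch, not code from the paper: the function names, the random MDP, and the choice to normalize the return by (1 - γ) are assumptions made for illustration. It computes the value functions of a fixed policy, the normalized discounted occupancy measure, and the return expressed through either object.

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma):
    """Value and action-value functions of policy pi (tabular, illustrative).

    P: (X, A, X) transition probabilities, r: (X, A) rewards, pi: (X, A) policy.
    """
    X, _ = r.shape
    P_pi = np.einsum("xa,xay->xy", pi, P)      # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)                # expected one-step reward under pi
    V = np.linalg.solve(np.eye(X) - gamma * P_pi, r_pi)  # V = r_pi + gamma P_pi V
    Q = r + gamma * P @ V                      # Bellman equation for Q
    return V, Q

def occupancy_measure(P, pi, nu0, gamma):
    """Normalized discounted state-action occupancy measure mu(x, a)."""
    X = len(nu0)
    P_pi = np.einsum("xa,xay->xy", pi, P)
    # State occupancy solves nu = (1 - gamma) nu0 + gamma P_pi^T nu.
    nu = (1 - gamma) * np.linalg.solve((np.eye(X) - gamma * P_pi).T, nu0)
    return nu[:, None] * pi

# Small random MDP (3 states, 2 actions), purely illustrative.
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((3, 2))
pi = np.full((3, 2), 0.5)
nu0 = np.array([1.0, 0.0, 0.0])
gamma = 0.9

V, Q = evaluate_policy(P, r, pi, gamma)
mu = occupancy_measure(P, pi, nu0, gamma)
normalized_return = (1 - gamma) * nu0 @ V      # equals the inner product of mu and r
```

Under this normalization the occupancy measure sums to one, and the return can be read either as the scaled initial value or as the inner product of the occupancy measure with the reward.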
Algorithm and main result
- Implements optimism in discounted Markov decision processes
- Draws on techniques from convex optimization
- Algorithm is called RAVI-UCB
- Regularized approximate value iteration with upper confidence bounds
- Sequence of regularized Q-functions
- Optimistic estimate of Bellman operator
- Truncated to range [0, H]
- Analysis shows $(1 - \gamma)\langle \nu_0, V_k \rangle$ acts as an optimistic estimate of the optimal return
- Algorithm proceeds in sequence of epochs
- Updates model estimate via least-squares regression
- Applies approximate Bellman operator to produce state-action value estimate
- Regret bound and corollary guarantee quality of output policy
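To illustrate how regularization, an exploration bonus, and truncation can fit together in one planning loop, here is a tabular sketch. It is not the paper's algorithm: the summary does not reproduce the exact update, so the entropy-regularized (log-sum-exp) value and the multiplicative policy update below are assumptions in the spirit of mirror descent, and `P_hat`, `bonus`, `eta`, and `K` are hypothetical names.

```python
import numpy as np

def regularized_optimistic_vi(P_hat, r, bonus, gamma, eta, K):
    """K steps of entropy-regularized value iteration with an additive bonus.

    P_hat: (X, A, X) estimated transitions, r and bonus: (X, A) arrays.
    Tabular illustration only; the paper works with linear function approximation.
    """
    X, A = r.shape
    H = 1.0 / (1.0 - gamma)            # effective horizon
    pi = np.full((X, A), 1.0 / A)      # start from the uniform policy
    V = np.zeros(X)
    Q = np.zeros((X, A))
    for _ in range(K):
        # Optimistic Bellman backup on the estimated model, truncated to [0, H].
        Q = np.clip(r + bonus + gamma * P_hat @ V, 0.0, H)
        # Soft value: log-sum-exp of Q weighted by the previous policy.
        V = np.log((pi * np.exp(eta * Q)).sum(axis=1)) / eta
        # Multiplicative-weights policy update; rows stay normalized by construction.
        pi = pi * np.exp(eta * (Q - V[:, None]))
    return pi, V, Q
```

Because the soft value lies between the policy-weighted average and the maximum of the truncated Q-values, each value estimate stays within [0, H] without a separate clipping step on V.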
Analysis
- Theorem 3.1 is presented and a sequence of lemmas is stated.
- The analysis is split into two parts: general properties of the optimistic planning procedure and specifics of linear kernel MDPs.
- The proof is provided in the main text and more technical proofs are in Appendix A.
Optimistic planning
- RAVI-UCB produces policies with a bounded gap between the optimistic value and the return of the policy
- The gap is bounded in terms of the exploration bonus
- The exploration bonus must satisfy a certain condition for all x, a
- The gap is bounded by the sum of two terms
- The first term is bounded by the difference between the optimal return and the return of the policy
- The second term is bounded by the regularization and the boundedness of the Q-functions
- Pinsker’s inequality and the Fenchel-Young inequality are used to bound the second term
- The gap is bounded by the difference between the optimal return and the return of the policy plus the boundedness of the Q-functions
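The two inequalities named above can be checked numerically on small distributions. This sketch is only an illustration of the inequalities themselves, not of the paper's proof; the distributions, the values `q_values`, and the step size `eta` are arbitrary choices. Pinsker's inequality bounds the L1 distance by the square root of twice the KL divergence, and the Fenchel-Young inequality for the KL divergence bounds a linear term by the KL divergence plus a log-sum-exp term.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def l1_distance(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]          # comparator distribution (illustrative)
q = [0.4, 0.4, 0.2]          # reference distribution (illustrative)
q_values = [1.0, 3.0, 2.0]   # arbitrary bounded "Q-values"
eta = 0.5                    # arbitrary regularization parameter

# Pinsker: ||p - q||_1 <= sqrt(2 KL(p || q)).
pinsker_lhs = l1_distance(p, q)
pinsker_rhs = math.sqrt(2.0 * kl_divergence(p, q))

# Fenchel-Young for KL: <p, Q> <= KL(p || q) / eta + log(sum_a q(a) exp(eta Q(a))) / eta.
fy_lhs = sum(pi * v for pi, v in zip(p, q_values))
soft_value = math.log(sum(qi * math.exp(eta * v) for qi, v in zip(q, q_values))) / eta
fy_rhs = kl_divergence(p, q) / eta + soft_value
```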
Valid exploration bonuses for linear kernel MDPs
- Linear kernel MDPs satisfy the validity condition
- Proof of Lemma 4.2 is in Appendix A
- Proof relies on techniques from Zhou et al. (2021) and Cai et al. (2020)
- Truncation operation of RAVI-UCB guarantees boundedness of each $V_k$
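The summary does not state the form of the exploration bonus. A common choice with least-squares model estimates, assumed here for illustration, is an elliptical bonus: a regularized least-squares fit yields a parameter estimate and a covariance matrix, and the bonus for a feature vector scales with its norm under the inverse covariance. The names `ridge_estimate`, `elliptical_bonus`, `lam`, and `beta` are hypothetical.

```python
import numpy as np

def ridge_estimate(features, targets, lam=1.0):
    """Regularized least squares: returns (theta_hat, Lambda).

    features: (n, d) regression features, targets: (n,) regression targets.
    """
    d = features.shape[1]
    Lambda = lam * np.eye(d) + features.T @ features   # regularized covariance
    theta_hat = np.linalg.solve(Lambda, features.T @ targets)
    return theta_hat, Lambda

def elliptical_bonus(phi, Lambda, beta):
    """beta * sqrt(phi^T Lambda^{-1} phi): large in directions with little data."""
    return beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))

# Nine observations along the first coordinate only (illustrative data).
features = np.tile(np.array([1.0, 0.0]), (9, 1))
targets = features @ np.array([0.5, -0.3])
theta_hat, Lambda = ridge_estimate(features, targets, lam=1.0)

bonus_seen = elliptical_bonus(np.array([1.0, 0.0]), Lambda, beta=1.0)
bonus_unseen = elliptical_bonus(np.array([0.0, 1.0]), Lambda, beta=1.0)
```

In this toy example the bonus is smaller along the direction that has been observed repeatedly and stays at its initial level along the unobserved one, which is the behavior a validity condition on the bonus is meant to exploit.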
The epoch schedule
- Introduce notation to account for effects of randomized epoch schedule
- Pad out trajectory of states and actions with artificial observations
- Lemma 4.3: The sequence of policies selected by RAVI-UCB satisfies a bound
- Lemma 4.4: The sum of exploration bonuses generated by RAVI-UCB satisfies a bound
- Proofs provided in Appendix A.3 and A.4
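The summary does not spell out how the randomized epoch schedule is drawn. One natural reading, assumed here and not confirmed by the text above, is that each step ends the current epoch independently with probability 1 - γ, so epoch lengths are geometric with mean equal to the effective horizon 1/(1 - γ). The sketch below samples such lengths; `sample_epoch_lengths` is a hypothetical helper.

```python
import random

def sample_epoch_lengths(gamma, num_epochs, seed=0):
    """Sample epoch lengths when each step continues with probability gamma.

    Assumption for illustration: the epoch ends (for example via a reset)
    with probability 1 - gamma after every step.
    """
    rng = random.Random(seed)
    lengths = []
    for _ in range(num_epochs):
        steps = 1
        while rng.random() < gamma:    # continue the epoch with probability gamma
            steps += 1
        lengths.append(steps)
    return lengths

lengths = sample_epoch_lengths(gamma=0.9, num_epochs=20000)
mean_length = sum(lengths) / len(lengths)   # close to 1 / (1 - 0.9) = 10 on average
```

Because the lengths are random, an analysis over a fixed number of steps has to account for epochs of unequal size, which is one reason to pad the trajectory with artificial observations.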
Putting everything together
- Put claims together to prove Theorem 3.1
- Lemma 4.2 guarantees validity of exploration bonuses with probability 1-δ
- Lemma 4.1 bounds the expected regret of RAVI-UCB in terms of the exploration bonuses
- Lemmas 4.3 and 4.4 bound the sum of exploration bonuses
Discussion
- Our approach relaxes the optimistic properties that previous methods strive for
- We guarantee that our value estimates are optimistic in an average sense
- Our algorithm may execute several policies that do not individually satisfy any optimistic properties
- Our planning procedure can be used to produce optimistic policies in a stricter sense
- Our contribution is different from Geist, Scherrer, and Pietquin (2019)
- Our result can be strengthened to hold with high probability
- Our technique is challenging to extend to the infinite-horizon average-reward setting
- We use a linear kernel MDP assumption
- We wish to extend our analysis to more tractable MDP models
- Our algorithm needs access to a reset action
- Our guarantees only hold in expectation
- We use classic tools from the analysis of mirror descent