Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposed a new method for optimistic planning in infinite-horizon discounted Markov decision processes
  • Method adds regularization to updates of approximate value iteration procedure
  • Allows use of approximate transition functions estimated via least-squares procedures in MDPs with linear function approximation
  • Provides computationally efficient algorithm for learning near-optimal policies in discounted linear kernel MDPs from single stream of experience
  • Achieves near-optimal statistical guarantees

Paper Content

Introduction

  • The idea of constructing a confidence set of statistically plausible models and picking a policy that maximizes the expected return under the most favorable of these models can be traced back to Lai & Robbins (1985).
  • This design principle is known as optimism in the face of uncertainty.
  • Jaksch, Ortner, and Auer (2010) achieved a big breakthrough with their UCRL2 algorithm.
  • Computational efficiency of optimistic methods relies on the implementation of the optimistic planning subroutine.
  • Jaksch et al. (2010) used extended value iteration (EVI).
  • Fruit et al. (2018) and Lattimore & Szepesvári (2020) made mild adjustments to EVI.
  • Qian et al. (2018) proposed even more effective optimistic dynamic programming procedures.
  • Jin, Yang, Wang, and Jordan (2020) extended the idea of optimistic exploration to a class of large-scale MDPs.
  • Wei, Jahromi, Luo, and Jain (2021) proposed algorithms that are either statistically or computationally efficient.
  • Vial, Parulekar, Shakkottai, and Srikant (2022) provided approximate DP methods for stochastic shortest path problems.
  • Geist, Scherrer, and Pietquin (2019) proposed Mirror-Descent Modified Policy Iteration (MD-MPI).
  • Zhou, He, and Gu (2021) used a version of EVI adapted to linear function approximation.
  • We propose to solve the infinite-horizon optimistic planning problem using regularized dynamic programming.

Preliminaries

  • MDP is a sequential interaction between an agent and its environment
  • Agent’s goal is to pick sequence of actions to maximize total discounted return
  • Policy is a mapping from a state to a probability measure over actions
  • Value function and action-value function of a policy are defined
  • Transition operator and adjoint act on distributions
  • Bellman equations tie together value and action-value functions
  • Discounted occupancy measure is induced by a policy
  • Inverse of the normalization constant 1 - γ is the effective horizon, H = 1/(1 - γ)
  • State-action occupancy measure satisfies recurrence relation
  • Discounted return of a policy π can be written as R_γ^π
  • Online learning in discounted MDPs aims to produce an ε-optimal policy
  • Performance is measured in terms of number of samples necessary to guarantee output policy is ε-optimal
  • Regret of learner is also considered
  • Linear kernel MDPs are distinct from linear MDPs
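
The relationship between value functions, occupancy measures, and the discounted return can be checked numerically on a small tabular MDP. The sketch below is illustrative only: the random MDP, the function names `policy_value` and `occupancy`, and the normalization of the return by (1 - γ) are assumptions made for this example, not notation taken from the paper.

```python
import numpy as np

def policy_value(P, r, pi, gamma):
    """Solve the Bellman equations V = r_pi + gamma * P_pi V for a fixed policy."""
    n_states = r.shape[0]
    P_pi = np.einsum("xa,xay->xy", pi, P)      # state-to-state kernel under pi
    r_pi = np.sum(pi * r, axis=1)              # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def occupancy(P, pi, nu0, gamma):
    """Normalized discounted state-action occupancy measure mu(x, a)."""
    n_states = len(nu0)
    P_pi = np.einsum("xa,xay->xy", pi, P)
    # State occupancy: nu^T = (1 - gamma) * nu0^T (I - gamma * P_pi)^{-1}
    nu = (1 - gamma) * np.linalg.solve((np.eye(n_states) - gamma * P_pi).T, nu0)
    return nu[:, None] * pi

# Hypothetical random MDP with 4 states and 3 actions.
rng = np.random.default_rng(0)
X, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(X), size=(X, A))     # P[x, a, y] = P(y | x, a)
r = rng.uniform(size=(X, A))
pi = rng.dirichlet(np.ones(A), size=X)         # pi[x, a] = pi(a | x)
nu0 = np.full(X, 1.0 / X)                      # initial-state distribution

V = policy_value(P, r, pi, gamma)
mu = occupancy(P, pi, nu0, gamma)
return_from_values = (1 - gamma) * nu0 @ V     # normalized return via V
return_from_occupancy = np.sum(mu * r)         # the same return via mu
```

Under these assumptions the two expressions for the return agree, `mu` sums to one, and its state marginal satisfies the recurrence mentioned above: the mass at each state equals (1 - γ) times the initial mass plus γ times the mass flowing in from the previous step.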

Algorithm and main result

  • Implements optimism in discounted Markov decision processes
  • Draws on techniques from convex optimization
  • Algorithm is called RAVI-UCB
  • Regularized approximate value iteration with upper confidence bounds
  • Sequence of regularized Q-functions
  • Optimistic estimate of Bellman operator
  • Truncated to range [0, H]
  • Analysis shows that (1 - γ)⟨ν_0, V_k⟩ acts as an optimistic estimate of the optimal return
  • Algorithm proceeds in sequence of epochs
  • Updates model estimate via least-squares regression
  • Applies approximate Bellman operator to produce state-action value estimate
  • Regret bound and corollary guarantee quality of output policy
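
The planning loop described in the bullets above can be sketched in a tabular setting. This is a simplified illustration rather than the paper's algorithm: it assumes the transition estimate `P_hat` and the bonuses `bonus` are given, uses an entropy-regularized (log-sum-exp) value update with a multiplicative policy update as one plausible form of regularization, and omits the epoch schedule and the least-squares model update. The function name and all parameter names are hypothetical.

```python
import numpy as np

def regularized_optimistic_vi(P_hat, r, bonus, gamma, eta, n_iters):
    """Tabular sketch of regularized value iteration with exploration bonuses.

    P_hat: (X, A, X) estimated transitions, r: (X, A) rewards in [0, 1],
    bonus: (X, A) nonnegative exploration bonuses (assumed to be given).
    """
    n_states, n_actions = r.shape
    H = 1.0 / (1.0 - gamma)                          # effective horizon
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Optimistic Bellman backup, truncated to the range [0, H].
        Q = np.clip(r + bonus + gamma * (P_hat @ V), 0.0, H)
        # Regularized (soft) value relative to the previous policy.
        Z = np.sum(pi * np.exp(eta * Q), axis=1)
        V = np.log(Z) / eta
        # Multiplicative policy update; each row stays normalized because
        # sum_a pi(a|x) * exp(eta * (Q(x, a) - V(x))) = 1 by definition of V.
        pi = pi * np.exp(eta * (Q - V[:, None]))
    return Q, V, pi

# Hypothetical inputs: a random model estimate and a constant bonus.
rng = np.random.default_rng(1)
X, A, gamma = 5, 3, 0.9
P_hat = rng.dirichlet(np.ones(X), size=(X, A))
r = rng.uniform(size=(X, A))
bonus = np.full((X, A), 0.1)
Q, V, pi = regularized_optimistic_vi(P_hat, r, bonus, gamma, eta=0.5, n_iters=50)
```

The soft value V(x) always lies between the smallest and largest entry of Q(x, ·), so truncating Q to [0, H] also keeps every V in that range. A smaller `eta` makes successive policies change more slowly, which is the effect regularization is meant to have in this sketch.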

Analysis

  • Theorem 3.1 is presented and a sequence of lemmas is stated.
  • The analysis is split into two parts: general properties of the optimistic planning procedure and specifics of linear kernel MDPs.
  • The proof is provided in the main text and more technical proofs are in Appendix A.

Optimistic planning

  • RAVI-UCB produces policies with a bounded gap between the optimistic value and the return of the policy
  • The gap is bounded in terms of the exploration bonus
  • The exploration bonus must satisfy a certain condition for all x, a
  • The gap is bounded by the sum of two terms
  • The first term is bounded by the difference between the optimal return and the return of the policy
  • The second term is bounded by the regularization and the boundedness of the Q-functions
  • Pinsker’s inequality and the Fenchel-Young inequality are used to bound the second term
  • The gap is bounded by the difference between the optimal return and the return of the policy plus the boundedness of the Q-functions
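
The two inequalities named above can be checked numerically for a single state. The sketch assumes the standard textbook forms: Pinsker's inequality as ‖p - π‖₁ ≤ sqrt(2 KL(p‖π)), and the Fenchel-Young inequality for the log-sum-exp function as ⟨p, Q⟩ ≤ (1/η) log Σ_a π(a) exp(η Q(a)) + KL(p‖π)/η. How exactly these enter the paper's proof is not reproduced here, and all variable names are illustrative.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def soft_value(pi, Q, eta):
    """Log-sum-exp value (1/eta) * log sum_a pi(a) * exp(eta * Q(a))."""
    return float(np.log(np.sum(pi * np.exp(eta * Q))) / eta)

rng = np.random.default_rng(0)
n_actions, eta = 5, 0.3
p = rng.dirichlet(np.ones(n_actions))        # arbitrary comparator policy
pi = rng.dirichlet(np.ones(n_actions))       # reference policy
Q = rng.uniform(0.0, 10.0, size=n_actions)   # bounded action values

# Pinsker: the L1 distance is at most sqrt(2 * KL), so this gap is >= 0.
pinsker_gap = np.sqrt(2.0 * kl(p, pi)) - np.abs(p - pi).sum()

# Fenchel-Young: the soft value plus KL / eta dominates <p, Q>.
fy_gap = soft_value(pi, Q, eta) + kl(p, pi) / eta - p @ Q

# Equality holds at the exponentially reweighted policy p* ∝ pi * exp(eta * Q).
p_star = pi * np.exp(eta * Q)
p_star /= p_star.sum()
fy_gap_star = soft_value(pi, Q, eta) + kl(p_star, pi) / eta - p_star @ Q
```

Both gaps are nonnegative for any choice of `p`, and the Fenchel-Young gap vanishes at `p_star`, which is why a multiplicative-weights style update is the natural maximizer for this kind of regularized objective.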

Valid exploration bonuses for linear kernel MDPs

  • Linear kernel MDPs satisfy the validity condition
  • Proof of Lemma 4.2 is in Appendix A
  • Proof relies on techniques from Zhou et al. (2021) and Cai et al. (2020)
  • Truncation operation of RAVI-UCB guarantees boundedness of each V_k

The epoch schedule

  • Introduce notation to account for effects of randomized epoch schedule
  • Pad out trajectory of states and actions with artificial observations
  • Lemma 4.3: The sequence of policies selected by RAVI-UCB satisfies a bound
  • Lemma 4.4: The sum of exploration bonuses generated by RAVI-UCB satisfies a bound
  • Proofs provided in Appendix A.3 and A.4

Putting everything together

  • Put claims together to prove Theorem 3.1
  • Lemma 4.2 guarantees validity of exploration bonuses with probability 1-δ
  • Lemma 4.1 bounds the expected regret of RAVI-UCB in terms of the exploration bonuses
  • Lemmas 4.3 and 4.4 bound the sum of exploration bonuses

Discussion

  • Our approach relaxes the optimistic properties that previous methods strive for
  • We guarantee that our value estimates are optimistic in an average sense
  • Our algorithm may execute several policies that do not individually satisfy any optimistic properties
  • Our planning procedure can be used to produce optimistic policies in a stricter sense
  • Our contribution is different from Geist, Scherrer, and Pietquin (2019)
  • Our result can be strengthened to hold with high probability
  • Our technique is challenging to extend to the infinite-horizon average-reward setting
  • We use a linear kernel MDP assumption
  • We wish to extend our analysis to more tractable MDP models
  • Our algorithm needs access to a reset action
  • Our guarantees only hold in expectation
  • We use classic tools from the analysis of mirror descent