Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning
  • Characterizing the variance over values induced by a distribution over MDPs
  • Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation
  • New uncertainty Bellman equation converges to the true posterior variance over values and explicitly characterizes the gap in previous work
  • Easily integrated into common exploration strategies and scales naturally beyond the tabular setting
  • Experiments show that sharper uncertainty estimates improve sample-efficiency

Paper Content

Introduction

  • Goal of reinforcement learning (RL) agents is to maximize expected return
  • Model-based RL (MBRL) learns a statistical model of the environment
  • Recent improvements in deep MBRL algorithms due to models that quantify epistemic and aleatoric uncertainty
  • Paper proposes learning the solution to a Bellman recursion prescribed by theory
  • Experiments in tabular and continuous control problems demonstrate improved sample efficiency
  • Model-free approaches to Bayesian RL directly model the distribution over values
  • Model-based Bayesian RL maintains a posterior over plausible MDPs
  • Dyna-style actor-critic algorithms paired with model-based uncertainty estimates for improved performance
  • Optimism in the face of uncertainty (OFU) relies on building upper-confidence estimates of true values
  • Paper focuses on estimating and using the variance of the expected return for policy optimization
  • Aleatoric uncertainty about returns originates from aleatoric noise of MDP transitions and stochastic policy
  • Epistemic uncertainty about value function due to incomplete knowledge of MDP

Problem statement

  • Agent acts in an infinite-horizon MDP
  • Finite state and action spaces
  • Unknown transition function
  • Known and bounded reward function
  • Discount factor between 0 and 1
  • Reward function can be learned alongside transition function
  • One-step dynamics
  • Stochastic policy
  • At each time step, agent selects action, receives reward, transitions to next state
  • Bayesian setting with known prior distribution
  • Value function is expected sum of discounted rewards
  • Estimate variance of value function
  • Independent transitions and acyclic MDP assumed
  • Posterior mean transition and value functions of interest
  • Local uncertainty defined and UBE solved
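
The quantity of interest, the posterior variance of the value function, can always be estimated by brute force: sample MDPs from the posterior, evaluate the policy in each, and take the empirical variance. The sketch below assumes a tabular MDP with a Dirichlet posterior per state-action pair; the function names, the `prior` pseudo-count and the array shapes are illustrative choices, not taken from the paper. This Monte Carlo baseline does not need the independence or acyclicity assumptions, but it costs one policy evaluation per sample.

```python
import numpy as np

def policy_value(P, r, pi, gamma):
    """Exact value of policy `pi` in a tabular MDP.

    P: (S, A, S) transition probabilities, r: (S, A) rewards,
    pi: (S, A) action probabilities, gamma: discount in [0, 1).
    """
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = np.sum(pi * r, axis=1)           # expected one-step reward under pi
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)

def sample_posterior_values(counts, r, pi, gamma, n_samples, prior=1.0, seed=0):
    """Values of `pi` under `n_samples` MDPs drawn from a Dirichlet posterior.

    counts: (S, A, S) observed transition counts (hypothetical data).
    prior:  symmetric Dirichlet pseudo-count (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = counts.shape
    alpha = counts + prior
    values = np.empty((n_samples, n_states))
    for i in range(n_samples):
        # One independent Dirichlet draw per (state, action) row.
        P = np.array([[rng.dirichlet(alpha[s, a]) for a in range(n_actions)]
                      for s in range(n_states)])
        values[i] = policy_value(P, r, pi, gamma)
    return values

# Usage: posterior mean and variance of values for a uniform random policy.
S, A, gamma = 3, 2, 0.9
r = np.array([[0.0, 1.0], [0.5, 0.2], [1.0, 0.0]])
pi = np.full((S, A), 1.0 / A)
samples = sample_posterior_values(np.zeros((S, A, S)), r, pi, gamma, n_samples=200)
value_mean, value_variance = samples.mean(axis=0), samples.var(axis=0)
```

With more observed transitions the Dirichlet posterior concentrates and the estimated variance shrinks, which is the epistemic effect the rest of the paper targets.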

Uncertainty Bellman equation

  • Built a new UBE whose fixed-point solution is equal to the variance of the value function
  • Showed the gap between the UBE and the variance of the value function
  • The Bellman recursion propagates knowledge about the local rewards
  • The UBE propagates a notion of local uncertainty
  • The fixed-point solution to the UBE encodes the long-term epistemic uncertainty about the values of a given state
  • Theorem 1 states that the U-values converge exactly to the variance of values
  • Theorem 2 presents a clear relationship that connects Theorem 1 with the upper bound
  • The uncertainty reward has two components: total uncertainty about the mean values and average aleatoric uncertainty about the value of the next state
  • Corollary 1 states that the solution to the UBE results in an upper-bound of the variance
  • The gap between the exact reward function and the approximation is fully characterized by a gap term
  • The influence of the gap term depends on the stochasticity of the dynamics and the policy
  • The method returns the exact epistemic uncertainty about the values
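
For reference, the relations summarized above can be written out as follows. This is a sketch in notation assumed here ($\Phi_t$ for the posterior over transition functions, bars for posterior-mean quantities, $\mathbb{V}$ for variance), reconstructed from the bullet points and the law of total variance rather than copied from the paper, so the exact form should be checked against the original statement of Theorems 1 and 2.

```latex
\begin{align}
U^{\pi}_t(s) &= \gamma^2 u_t(s)
  + \gamma^2 \sum_{a,s'} \pi(a \mid s)\, \bar{p}_t(s' \mid s,a)\, U^{\pi}_t(s'), \\
u_t(s) &= \mathbb{V}_{a,s' \sim \pi,\bar{p}_t}\!\left[\bar{V}^{\pi}_t(s')\right]
  - \mathbb{E}_{p \sim \Phi_t}\!\left[\mathbb{V}_{a,s' \sim \pi,p}\!\left[V^{\pi,p}(s')\right]\right], \\
w_t(s) &= \mathbb{V}_{p \sim \Phi_t}\!\left[\sum_{a,s'} \pi(a \mid s)\, p(s' \mid s,a)\, \bar{V}^{\pi}_t(s')\right], \\
u_t(s) &= w_t(s) - g_t(s), \qquad g_t(s) \ge 0 .
\end{align}
```

The first term of $u_t$ is the total uncertainty about the mean values and the second is the average aleatoric uncertainty about the next-state value. Replacing $u_t$ with $w_t$ in the recursion gives the upper bound $W^{\pi}_t \ge U^{\pi}_t$.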

Toy example

  • MRP has 4 possible combinations of δ and β
  • Assumptions 1 and 2 are satisfied
  • Table 1 includes results for uncertainty rewards and UBEs
  • $W^\pi(s_2) = U^\pi(s_2)$
  • Gap terms for $s_2$ cancel out
  • $W^\pi$ overestimates the variance of the value by ∼36%
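
The same kind of comparison can be reproduced numerically on a small example. The sketch below uses a hypothetical four-state acyclic MRP (not the one from the paper, so its numbers will not match Table 1) with a uniform posterior over four models, two values each for δ and β. It enumerates the posterior exactly, computes the true variance of the values, and solves both recursions, assuming the local uncertainty rewards are defined as: u = (variance of the mean value at the next state under the mean model) minus (posterior average of the next-state value variance), and w = posterior variance of the expected next-state mean value.

```python
import itertools
import numpy as np

GAMMA = 0.9
N_STATES = 4
# Hypothetical rewards for states 0..3; state 3 is an absorbing terminal state.
REWARDS = np.array([0.0, 1.0, 0.5, 0.0])

def transition_matrix(delta, beta):
    """Acyclic chain: 0 -> {1, 2}, 1 -> {3, 2}, 2 -> 3, 3 absorbing."""
    P = np.zeros((N_STATES, N_STATES))
    P[0, 1], P[0, 2] = delta, 1.0 - delta
    P[1, 3], P[1, 2] = beta, 1.0 - beta
    P[2, 3] = 1.0
    P[3, 3] = 1.0
    return P

def values(P):
    """Solve the Bellman equation V = r + gamma * P V."""
    return np.linalg.solve(np.eye(N_STATES) - GAMMA * P, REWARDS)

def next_state_variance(P, v):
    """Variance of v(s') over s' ~ P(. | s), for every state s."""
    return P @ v**2 - (P @ v) ** 2

def solve_ube(P_mean, local_uncertainty):
    """Fixed point of U = gamma^2 * (local_uncertainty + P_mean U)."""
    A = np.eye(N_STATES) - GAMMA**2 * P_mean
    return np.linalg.solve(A, GAMMA**2 * local_uncertainty)

# Uniform posterior over four models: two values each for delta and beta.
models = [transition_matrix(d, b)
          for d, b in itertools.product((0.3, 0.7), (0.2, 0.8))]
V = np.array([values(P) for P in models])   # one row of values per model
P_mean = np.mean(models, axis=0)
V_mean = values(P_mean)

aleatoric = np.mean([next_state_variance(P, v) for P, v in zip(models, V)], axis=0)
u = next_state_variance(P_mean, V_mean) - aleatoric   # exact uncertainty reward
w = np.var([P @ V_mean for P in models], axis=0)      # upper-bound uncertainty reward

U = solve_ube(P_mean, u)
W = solve_ube(P_mean, w)
true_variance = V.var(axis=0)
```

In this example `U` matches `true_variance` at every state, `W` equals `U` at state 1 (its successors carry no epistemic uncertainty, so the gap vanishes), and `W` is strictly larger at state 0.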

Variance-driven optimistic exploration

  • Propose a technique to solve the RL problem using uncertainty quantification of Q-values
  • Define $\Gamma_t$ as the posterior distribution over MDPs
  • Update policy by solving the upper-confidence bound optimization problem
  • Typical RL techniques violate theoretical assumptions
  • Propose practical upper-bound on the solution of the UBE
  • Tabular implementation uses Dirichlet prior on transition function and Normal prior for rewards
  • Deep RL implementation uses MBPO architecture and approximates sum of cumulative uncertainty rewards
  • Train an ensemble of N value functions and a U-net
  • UBE-based methods have the added complexity of training the U-net
  • Small N reduces computational burden
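
A minimal tabular sketch of the optimistic policy update, assuming the upper-confidence objective takes the common form "posterior-mean Q-value plus λ times the standard deviation given by the U-values" and that the practical variant clips local uncertainty rewards from below at a threshold `u_min`. The function names, array shapes and the clipping rule are assumptions of this sketch, not the paper's Algorithm 1.

```python
import numpy as np

def clip_uncertainty_rewards(u, u_min=0.0):
    """Clip local uncertainty rewards from below (assumed practical variant)."""
    return np.maximum(u, u_min)

def optimistic_q(q_mean, q_variance, exploration_gain):
    """Upper-confidence Q-values: mean plus a scaled standard deviation."""
    std = np.sqrt(np.maximum(q_variance, 0.0))   # guard against negative estimates
    return q_mean + exploration_gain * std

def greedy_policy(q):
    """Deterministic policy, as an (S, A) one-hot array, that is greedy in q."""
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), np.argmax(q, axis=1)] = 1.0
    return pi

# Usage: one state, two actions. Action 1 has a lower mean but is more uncertain.
q_mean = np.array([[1.0, 0.9]])
q_variance = np.array([[0.0, 0.04]])
pi_optimistic = greedy_policy(optimistic_q(q_mean, q_variance, exploration_gain=1.0))
pi_greedy = greedy_policy(optimistic_q(q_mean, q_variance, exploration_gain=0.0))
```

With `exploration_gain=1.0` the uncertain action wins (0.9 + 0.2 > 1.0), while with a gain of zero the update reduces to ordinary greedy improvement.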

Experiments

  • Evaluated performance of policy optimization scheme
  • Examined different variance estimates from Section 4

Tabular environments

  • Evaluated tabular implementation in grid-world environments
  • Used PSRL as baseline
  • Tested agent’s ability to explore over multiple time steps in presence of deterrent
  • Considered deterministic version of problem
  • Optimal policy is to always go right
  • Ran each method for 1000 episodes and five random seeds
  • Found that using $u_{\min} = -0.05$ improves performance
  • Method achieved lowest learning time and best scaling with problem size
  • Method achieved lowest total regret across all values of L

Continuous control environments

  • Evaluated performance of deep RL implementation in continuous state-action spaces
  • Added action cost to complicate problem
  • SAC quickly converged to suboptimal solution
  • Exact-ube method had most robust performance across noise levels
  • Ensemble-mean was a strong baseline

Ensemble size ablation

  • Ensemble size is a critical factor for ensemble-based methods
  • An et al. suggest large ensemble size is needed for good performance, but is computationally expensive
  • Ablation study shows the proposed method achieves best or comparable performance across all environments and values of N
  • Performance increases for larger ensembles, matching observations from An et al.
  • Sample-based approximations of local uncertainty rewards are less sensitive to sample size
  • Larger ensembles may not always lead to better performance in presence of sparse rewards
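
To illustrate what a sample-based approximation of a local uncertainty reward looks like, the sketch below estimates the posterior variance of the expected next-state value for a single state-action pair from N sampled transition vectors and compares it with the closed form available for a Dirichlet posterior. The Dirichlet parameters and value vector are made-up inputs, and the function names are hypothetical; the deep RL setting in the paper uses model ensembles rather than exact Dirichlet samples.

```python
import numpy as np

def dirichlet_w_closed_form(alpha, v):
    """Var over p ~ Dirichlet(alpha) of the expected next-state value p @ v."""
    alpha = np.asarray(alpha, dtype=float)
    v = np.asarray(v, dtype=float)
    alpha_0 = alpha.sum()
    mean_p = alpha / alpha_0
    # v^T Cov[p] v, using the standard Dirichlet covariance.
    return (mean_p @ v**2 - (mean_p @ v) ** 2) / (alpha_0 + 1.0)

def dirichlet_w_sampled(alpha, v, n_samples, rng):
    """Same quantity estimated from `n_samples` posterior samples (n_samples >= 2)."""
    p = rng.dirichlet(alpha, size=n_samples)          # shape (n_samples, S)
    return np.var(p @ np.asarray(v, dtype=float), ddof=1)

# Usage with hypothetical inputs: three successor states.
alpha = [2.0, 3.0, 5.0]
v = [0.0, 1.0, 2.0]
rng = np.random.default_rng(0)
exact = dirichlet_w_closed_form(alpha, v)
estimate_small = dirichlet_w_sampled(alpha, v, n_samples=5, rng=rng)
estimate_large = dirichlet_w_sampled(alpha, v, n_samples=50_000, rng=rng)
```

The large-sample estimate converges to the closed form, while the small-sample estimate is unbiased but noisy; how much that noise matters in practice is what the ablation above measures.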

Conclusions

  • Derived an uncertainty Bellman equation whose fixed-point solution converges to the variance of values given a posterior distribution over MDPs
  • Characterized the gap in previous UBE formulations that upper-bound the variance of values
  • Gap is the consequence of an over-approximation of the uncertainty rewards being propagated through the Bellman recursion, which ignores the inherent aleatoric uncertainty from acting in an MDP
  • The new UBE instead recovers exclusively the epistemic uncertainty due to limited environment data
  • Serves as an effective exploration signal
  • Proposed a practical method to estimate the solution of the UBE
  • Scalable beyond tabular problems with standard deep RL practices
  • Variance estimation integrated into a model-based approach
  • Uses the principle of optimism in the face of uncertainty to explore effectively
  • Improves sample efficiency in hard exploration problems
  • Does not require large ensembles
  • Identity for covariance of value function
  • Lemma 2: Under Assumptions 1 and 2, for any s ∈ S and any policy π, $\mathrm{Cov}[p(s' \mid s, a), V^{\pi,p}(s')] = 0$
  • Lemma 3: Under Assumptions 1 and 2, it holds that a,s
  • Theorem 1: Under Assumptions 1 and 2, for any s ∈ S and policy π, the posterior variance of the value function, $\mathbb{V}_{p \sim \Phi_t}[V^{\pi,p}]$, obeys the uncertainty Bellman equation
  • Lemma 4: Under Assumptions 1 and 2, it holds that
  • Lemma 5: Under Assumptions 1 and 2, it holds that
  • Lemma 6: Under Assumptions 1 and 2, it holds that
  • Lemma 7: Under Assumptions 1 and 2, it holds that is non-negative
  • Theorem 2: Under Assumptions 1 and 2, for any s ∈ S and policy π, it holds that $u_t(s) = w_t(s) - g_t(s)$, where $g_t(s) = \mathbb{E}_{p \sim \Phi_t}\left[\mathbb{V}_{a,s' \sim \pi,p}[V^{\pi,p}(s')] - \mathbb{V}_{a,s' \sim \pi,p}[\bar{V}^{\pi}_t(s')]\right]$
  • Gap $g_t(s)$ is non-negative, thus $u_t(s) \le w_t(s)$
  • Evaluated performance in the DeepSea benchmark
  • Three uncertainty signals, since assumptions are violated in the practical setting
  • When integrated into Algorithm 1, performance in terms of learning time and total regret is quite similar
  • Selected exact-ube_3 as the default estimate for all other experiments
  • Ensemble size N is one important hyperparameter for all the OFU-based methods
  • Sample-based approximation of uncertainty rewards is not very sensitive to the number of samples
  • Exploration gain λ is an important hyperparameter for OFU-based methods
  • As λ increases, total regret of all the methods increases, but overall exact-ube achieves the best performance
  • Optimistic approach on top of MBPO presented in Algorithm 2
  • Main differences with the original implementation
  • Main hyperparameters for experiments included in Table 2