Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning
Characterizing the variance over values induced by a distribution over MDPs
Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation
New uncertainty Bellman equation whose solution converges to the true posterior variance over values and explicitly characterizes the gap in previous work
Easily integrated into common exploration strategies and scales naturally beyond the tabular setting
Experiments show that sharper uncertainty estimates improve sample-efficiency

Paper Content

Introduction

Goal of reinforcement learning (RL) agents is to maximize expected return
Model-based RL (MBRL) learns a statistical model of the environment
Recent improvements in deep MBRL algorithms due to models that quantify epistemic and aleatoric uncertainty
Paper proposes learning the solution to a Bellman recursion prescribed by theory
Experiments in tabular and continuous control problems demonstrate improved sample efficiency
Model-free approaches to Bayesian RL directly model the distribution over values
Model-based Bayesian RL maintains a posterior over plausible MDPs
Dyna-style actor-critic algorithms paired with model-based uncertainty estimates for improved performance
Optimism in the face of uncertainty (OFU) relies on building upper-confidence estimates of true values
Paper focuses on estimating and using the variance of the expected return for policy optimization
Aleatoric uncertainty about returns originates from aleatoric noise of MDP transitions and stochastic policy
Epistemic uncertainty about value function due to incomplete knowledge of MDP

Problem statement

Agent acts in an infinite-horizon MDP
Finite state and action spaces
Unknown transition function
Known and bounded reward function
Discount factor between 0 and 1
Reward function can be learned alongside transition function
One-step dynamics
Stochastic policy
At each time step, agent selects action, receives reward, transitions to next state
Bayesian setting with known prior distribution
Value function is expected sum of discounted rewards
Estimate variance of value function
Independent transitions and acyclic MDP assumed
Posterior mean transition and value functions of interest
Local uncertainty defined and UBE solved
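As a concrete reference point for the setup above, the value function of a fixed policy in a single, known finite MDP can be computed by solving a linear system. The sketch below is illustrative only: the function name `policy_evaluation`, the array layout, and the two-state example are assumptions made here, not taken from the paper.

```python
import numpy as np


def policy_evaluation(P, r, pi, gamma):
    """Value function V^{pi,p} of policy `pi` in a finite MDP.

    P:  (S, A, S) array, P[s, a, t] = p(t | s, a)
    r:  (S, A) array of known, bounded rewards
    pi: (S, A) array, pi[s, a] = probability of action a in state s
    """
    # Policy-averaged dynamics and rewards.
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.sum(pi * r, axis=1)
    n_states = P.shape[0]
    # Solve V = r_pi + gamma * P_pi V.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


# Hypothetical two-state, one-action example.
P = np.zeros((2, 1, 2))
P[0, 0] = [0.5, 0.5]   # state 0 stays or moves to state 1
P[1, 0] = [0.0, 1.0]   # state 1 is absorbing
r = np.array([[1.0], [0.0]])
pi = np.ones((2, 1))
V = policy_evaluation(P, r, pi, gamma=0.9)
```

In the Bayesian setting described above, `P` is uncertain, so `V` becomes a random variable; its posterior variance is the quantity the rest of the summary is about.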

Uncertainty Bellman equation

Built a new UBE whose fixed-point solution is equal to the variance of the value function
Characterized the gap between the upper bound given by the previous UBE and the variance of the value function
The Bellman recursion propagates knowledge about the local rewards
The UBE propagates a notion of local uncertainty
The fixed-point solution to the UBE encodes the long-term epistemic uncertainty about the values of a given state
Theorem 1 states that the U-values converge exactly to the variance of values
Theorem 2 presents a clear relationship that connects Theorem 1 with the upper bound
The uncertainty reward is the difference of two components: the total uncertainty about the mean values minus the average aleatoric uncertainty about the value of the next state
Corollary 1 states that the solution to the UBE results in an upper-bound of the variance
The gap between the exact reward function and the approximation is fully characterized by a gap term
The influence of the gap term depends on the stochasticity of the dynamics and the policy
The method returns the exact epistemic uncertainty about the values
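For orientation, the recursion described above can be written out as follows. This is a reconstruction in assumed notation, consistent with the theorem statements listed in this summary but not copied from the paper: $\Phi_t$ is the posterior over transition functions, $\bar{p}_t$ and $\bar{V}^{\pi}_t$ are the posterior mean transition and value functions, $\mathbb{V}$ denotes variance, and the known reward contributes no variance. The paper remains the authority on the exact form.

```latex
\begin{align}
U^{\pi}_t(s) &= \gamma^2 u_t(s)
  + \gamma^2 \sum_{a, s'} \pi(a \mid s)\, \bar{p}_t(s' \mid s, a)\, U^{\pi}_t(s'), \\
u_t(s) &= \mathbb{V}_{a, s' \sim \pi, \bar{p}_t}\!\left[\bar{V}^{\pi}_t(s')\right]
  - \mathbb{E}_{p \sim \Phi_t}\!\left[\mathbb{V}_{a, s' \sim \pi, p}\!\left[V^{\pi, p}(s')\right]\right], \\
w_t(s) &= \mathbb{V}_{p \sim \Phi_t}\!\left[\sum_{a, s'} \pi(a \mid s)\, p(s' \mid s, a)\, \bar{V}^{\pi}_t(s')\right]
  = \mathbb{V}_{a, s' \sim \pi, \bar{p}_t}\!\left[\bar{V}^{\pi}_t(s')\right]
  - \mathbb{E}_{p \sim \Phi_t}\!\left[\mathbb{V}_{a, s' \sim \pi, p}\!\left[\bar{V}^{\pi}_t(s')\right]\right].
\end{align}
```

The second equality for $w_t$ is the law of total variance applied to the mixture $\bar{p}_t = \mathbb{E}_{p \sim \Phi_t}[p]$. Subtracting $u_t$ from $w_t$ leaves only the difference of the two aleatoric terms, which is the gap term between the exact and approximate uncertainty rewards.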

Toy example

MRP has 4 possible combinations of δ and β
Assumptions 1 and 2 are satisfied
Table 1 includes results for uncertainty rewards and UBEs
$W^{\pi}(s_2) = U^{\pi}(s_2)$
Gap terms for $s_2$ cancel out
$W^{\pi}$ overestimates the variance of the value by ∼36%
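The comparison between the exact U-values and the upper bound W can be checked numerically on a small acyclic Markov reward process. The sketch below is a hypothetical stand-in, not the paper's toy example: the state layout, the values of δ and β, the uniform posterior over the four combinations, and all function names are assumptions made here, and the numbers it produces are not those of Table 1. The form of the local uncertainty rewards `u` and `w` is also an assumption based on the description in this summary.

```python
import itertools

import numpy as np

GAMMA = 0.9
# States: 0 (start), 1 and 2 (intermediate), 3 (terminal, reward 1), 4 (terminal, reward 0).
REWARD = np.array([0.0, 0.0, 0.0, 1.0, 0.0])


def transition_matrix(delta, beta):
    P = np.zeros((5, 5))
    P[0, 1], P[0, 2] = delta, 1.0 - delta   # only row 0 depends on delta
    P[1, 3], P[1, 4] = beta, 1.0 - beta     # only row 1 depends on beta
    P[2, 4] = 1.0
    return P  # terminal rows stay zero: no outgoing transitions


def values(P):
    return np.linalg.solve(np.eye(5) - GAMMA * P, REWARD)


def next_value_variance(P, f):
    """Variance of f(s') under s' ~ P(. | s), for every state s."""
    return P @ f**2 - (P @ f) ** 2


# Uniform posterior over the 4 combinations of (delta, beta).
models = [transition_matrix(d, b) for d, b in itertools.product([0.3, 0.7], [0.2, 0.8])]
weights = np.full(len(models), 1.0 / len(models))

V_all = np.stack([values(P) for P in models])            # one value function per model
V_bar = weights @ V_all                                  # posterior mean values
P_bar = np.tensordot(weights, np.stack(models), axes=1)  # posterior mean dynamics
true_var = weights @ (V_all - V_bar) ** 2                # exact posterior variance of values

# Exact uncertainty reward: total uncertainty minus average aleatoric uncertainty.
aleatoric = weights @ np.stack(
    [next_value_variance(P, V) for P, V in zip(models, V_all)]
)
u = next_value_variance(P_bar, V_bar) - aleatoric

# Upper-bound uncertainty reward: variance of the expected next mean value.
expected_next = np.stack([P @ V_bar for P in models])
w = weights @ (expected_next - weights @ expected_next) ** 2


def solve_ube(local_reward):
    # Fixed point of X = gamma^2 * local_reward + gamma^2 * P_bar X.
    return np.linalg.solve(np.eye(5) - GAMMA**2 * P_bar, GAMMA**2 * local_reward)


U = solve_ube(u)
W = solve_ube(w)
```

In this sketch `U` matches `true_var` at every state, while `W` is at least as large everywhere and strictly larger at the start state. At state 1 the two coincide because its successors are terminal states with deterministic values, so the gap term vanishes there.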

Variance-driven optimistic exploration

Propose a technique to solve the RL problem using uncertainty quantification of Q-values
Define $\Gamma_t$ as the posterior distribution over MDPs
Update policy by solving the upper-confidence bound optimization problem
Typical RL techniques violate theoretical assumptions
Propose practical upper-bound on the solution of the UBE
Tabular implementation uses Dirichlet prior on transition function and Normal prior for rewards
Deep RL implementation uses MBPO architecture and approximates sum of cumulative uncertainty rewards
Train an ensemble of N value functions and U-net
UBE-based methods have added complexity of training U-net
Small N reduces computational burden
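Two ingredients of the tabular variant can be sketched in a few lines: sampling transition functions from a Dirichlet posterior, and acting greedily on an upper-confidence estimate built from a variance signal. This is a minimal sketch under stated assumptions: the summary does not give the exact optimization objective, so the "mean plus λ times standard deviation" form, the clipping of negative variances at zero, and the function names are all illustrative choices rather than the paper's implementation.

```python
import numpy as np


def dirichlet_posterior_samples(counts, prior_alpha, n_samples, rng):
    """Sample transition functions from an independent Dirichlet posterior.

    counts: (S, A, S) array of observed transition counts
    Returns an array of shape (n_samples, S, A, S).
    """
    alpha = counts + prior_alpha
    n_states, n_actions, _ = alpha.shape
    flat = alpha.reshape(n_states * n_actions, n_states)
    # One independent Dirichlet per (state, action) pair.
    samples = np.stack([rng.dirichlet(a, size=n_samples) for a in flat], axis=1)
    return samples.reshape(n_samples, n_states, n_actions, n_states)


def optimistic_action(q_mean, q_variance, exploration_gain):
    """Greedy action with respect to mean + lambda * std (assumed UCB form)."""
    std = np.sqrt(np.maximum(q_variance, 0.0))
    return np.argmax(q_mean + exploration_gain * std, axis=-1)


rng = np.random.default_rng(0)
counts = np.zeros((3, 2, 3))
counts[0, 1, 2] = 5.0  # five observed transitions 0 --(a=1)--> 2
ensemble = dirichlet_posterior_samples(counts, prior_alpha=1.0, n_samples=4, rng=rng)

q_mean = np.array([[1.0, 0.8]])
q_variance = np.array([[0.0, 0.25]])
greedy = optimistic_action(q_mean, q_variance, exploration_gain=0.0)
optimistic = optimistic_action(q_mean, q_variance, exploration_gain=1.0)
```

With zero exploration gain the agent picks the action with the higher mean; with a gain of 1 the uncertain action wins because its bonus outweighs the difference in means.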

Experiments

Evaluated performance of policy optimization scheme
Examined different variance estimates from Section 4

Tabular environments

Evaluated tabular implementation in grid-world environments
Used PSRL as baseline
Tested agent’s ability to explore over multiple time steps in presence of deterrent
Considered deterministic version of problem
Optimal policy is to always go right
Ran each method for 1000 episodes and five random seeds
Found that using $u_{\min} = -0.05$ improves performance
Method achieved lowest learning time and best scaling with problem size
Method achieved lowest total regret across all values of L

Continuous control environments

Evaluated performance of deep RL implementation in continuous state-action spaces
Added action cost to complicate problem
SAC quickly converged to suboptimal solution
Exact-ube method had most robust performance across noise levels
Ensemble-mean was a strong baseline

Ensemble size ablation

Ensemble size is a critical factor for ensemble-based methods
An et al. suggest large ensemble size is needed for good performance, but is computationally expensive
Ablation study shows the proposed method achieves best or comparable performance across all environments and values of N
Performance increases for larger ensembles, matching observations from An et al.
Sample-based approximations of local uncertainty rewards are less sensitive to sample size
Larger ensembles may not always lead to better performance in presence of sparse rewards

Conclusions

Derived an uncertainty Bellman equation whose fixed-point solution converges to the variance of values given a posterior distribution over MDPs
Characterized the gap in previous UBE formulations that upper-bound the variance of values
Gap is the consequence of an over-approximation of the uncertainty rewards being propagated through the Bellman recursion
The over-approximation ignores the inherent aleatoric uncertainty from acting in an MDP
The new UBE recovers exclusively the epistemic uncertainty due to limited environment data
This epistemic uncertainty serves as an effective exploration signal
Proposed a practical method to estimate the solution of the UBE
Scalable beyond tabular problems with standard deep RL practices
Variance estimation integrated into a model-based approach
Uses the principle of optimism in the face of uncertainty to explore effectively
Improves sample efficiency in hard exploration problems
Does not require large ensembles
Identity for covariance of value function
Lemma 2: Under Assumptions 1 and 2, for any $s \in \mathcal{S}$ and any policy $\pi$, $\mathrm{Cov}[p(s' \mid s, a), V^{\pi,p}(s')] = 0$
Lemma 3: Under Assumptions 1 and 2, it holds that a,s
Theorem 1: Under Assumptions 1 and 2, for any $s \in \mathcal{S}$ and policy $\pi$, the posterior variance of the value function, $\mathbb{V}_{p \sim \Phi_t}[V^{\pi,p}]$, obeys the uncertainty Bellman equation
Lemma 4: Under Assumptions 1 and 2, it holds that
Lemma 5: Under Assumptions 1 and 2, it holds that
Lemma 6: Under Assumptions 1 and 2, it holds that
Lemma 7: Under Assumptions 1 and 2, it holds that is non-negative
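Theorem 2: Under Assumptions 1 and 2, for any $s \in \mathcal{S}$ and policy $\pi$, it holds that $u_t(s) = w_t(s) - g_t(s)$, where $g_t(s) = \mathbb{E}_{p \sim \Phi_t}\left[\mathbb{V}_{a,s' \sim \pi,p}\left[V^{\pi,p}(s')\right] - \mathbb{V}_{a,s' \sim \pi,p}\left[\bar{V}^{\pi}_t(s')\right]\right]$
Gap $g_t(s)$ is non-negative, thus $u_t(s) \le w_t(s)$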
Evaluated performance in the DeepSea benchmark
Three uncertainty signals, since assumptions are violated in the practical setting
When integrated into Algorithm 1, performance in terms of learning time and total regret is quite similar
Selected exact-ube_3 as the default estimate for all other experiments
Ensemble size N is one important hyperparameter for all the OFU-based methods
Sample-based approximation of uncertainty rewards is not very sensitive to the number of samples
Exploration gain λ is an important hyperparameter for OFU-based methods
As λ increases, total regret of all the methods increases, but overall exact-ube achieves the best performance
Optimistic approach on top of MBPO presented in Algorithm 2
Main differences with the original implementation
Main hyperparameters for experiments included in Table 2
