Abstract
- Problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning
- Characterizing the variance over values induced by a distribution over MDPs
- Previous work upper-bounds the posterior variance over values by solving a so-called uncertainty Bellman equation (UBE)
- New uncertainty Bellman equation converges to the true posterior variance over values and explicitly characterizes the gap in previous work
- Easily integrated into common exploration strategies and scales naturally beyond the tabular setting
- Experiments show that sharper uncertainty estimates improve sample-efficiency
Paper Content
Introduction
- Goal of reinforcement learning (RL) agents is to maximize expected return
- Model-based RL (MBRL) learns a statistical model of the environment
- Recent improvements in deep MBRL algorithms due to models that quantify epistemic and aleatoric uncertainty
- Paper proposes learning the solution to a Bellman recursion prescribed by theory
- Experiments in tabular and continuous control problems demonstrate improved sample efficiency
- Model-free approaches to Bayesian RL directly model the distribution over values
- Model-based Bayesian RL maintains a posterior over plausible MDPs
- Dyna-style actor-critic algorithms paired with model-based uncertainty estimates for improved performance
- Optimism in the face of uncertainty (OFU) relies on building upper-confidence estimates of true values
- Paper focuses on estimating and using the variance of the expected return for policy optimization
- Aleatoric uncertainty about returns originates from aleatoric noise of MDP transitions and stochastic policy
- Epistemic uncertainty about value function due to incomplete knowledge of MDP
Problem statement
- Agent acts in an infinite-horizon MDP
- Finite state and action spaces
- Unknown transition function
- Known and bounded reward function
- Discount factor between 0 and 1
- Reward function can be learned alongside transition function
- One-step dynamics
- Stochastic policy
- At each time step, agent selects action, receives reward, transitions to next state
- Bayesian setting with known prior distribution
- Value function is expected sum of discounted rewards
- Estimate variance of value function
- Two assumptions are made: independent transitions (Assumption 1) and an acyclic MDP (Assumption 2)
- Posterior mean transition and value functions of interest
- Local uncertainty defined and UBE solved
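The quantity of interest in this setup, the posterior variance of the value function, can be approximated by brute force: sample transition functions from the posterior, evaluate the policy under each sample, and take the variance across samples. The sketch below is a minimal illustration under stated assumptions, not the paper's method: it assumes a Dirichlet posterior per state-action pair, a known reward table, and hypothetical helper names (`sample_dirichlet`, `evaluate_policy`, `posterior_value_stats`).

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def evaluate_policy(p, r, policy, gamma, sweeps=200):
    """Iterative policy evaluation for a tabular MDP.

    p[s][a][s2] is a transition probability, r[s][a] a known reward,
    policy[s][a] the probability of taking action a in state s.
    """
    n_states = len(r)
    v = [0.0] * n_states
    for _ in range(sweeps):
        v = [
            sum(
                policy[s][a]
                * (r[s][a] + gamma * sum(p[s][a][s2] * v[s2] for s2 in range(n_states)))
                for a in range(len(r[s]))
            )
            for s in range(n_states)
        ]
    return v


def posterior_value_stats(alpha, r, policy, gamma, n_samples=200, seed=0):
    """Monte Carlo estimate of the posterior mean and variance of V(s).

    alpha[s][a] holds the Dirichlet concentration parameters (assumed posterior).
    """
    rng = random.Random(seed)
    n_states = len(r)
    samples = []
    for _ in range(n_samples):
        p = [
            [sample_dirichlet(alpha[s][a], rng) for a in range(len(r[s]))]
            for s in range(n_states)
        ]
        samples.append(evaluate_policy(p, r, policy, gamma))
    mean = [sum(v[s] for v in samples) / n_samples for s in range(n_states)]
    var = [sum((v[s] - mean[s]) ** 2 for v in samples) / n_samples for s in range(n_states)]
    return mean, var


# Hypothetical 3-state, 2-action problem with a uniform Dirichlet posterior.
n_states, n_actions = 3, 2
alpha = [[[1.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
rewards = [[0.0, 0.0], [0.0, 0.5], [1.0, 1.0]]
policy = [[0.5, 0.5] for _ in range(n_states)]
mean, var = posterior_value_stats(alpha, rewards, policy, gamma=0.9)
print(mean, var)
```

Sampling-based estimates like this require one full policy evaluation per posterior sample, which is the cost that a Bellman-style recursion for the variance avoids.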
Uncertainty Bellman equation
- Built a new UBE whose fixed-point solution is equal to the variance of the value function
- Showed the gap between the UBE and the variance of the value function
- The Bellman recursion propagates knowledge about the local rewards
- The UBE propagates a notion of local uncertainty
- The fixed-point solution to the UBE encodes the long-term epistemic uncertainty about the values of a given state
- Theorem 1 states that the U-values converge exactly to the variance of values
- Theorem 2 presents a clear relationship that connects Theorem 1 with the upper bound
- The uncertainty reward has two components: total uncertainty about the mean values and average aleatoric uncertainty about the value of the next state
- Corollary 1 states that the solution to the UBE results in an upper-bound of the variance
- The gap between the exact reward function and the approximation is fully characterized by a gap term
- The influence of the gap term depends on the stochasticity of the dynamics and the policy
- The method returns the exact epistemic uncertainty about the values
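The relationship between the exact local uncertainty, the upper-bound local uncertainty, and the gap term can be sketched with the law of total variance. The definitions of $u_t$ and $w_t$ below are reconstructed assumptions (the bullets above do not state them), with $\bar p_t$ and $\bar V^\pi_t$ denoting the posterior mean transition and value functions.

```latex
% Assumed definitions of the two local uncertainty rewards:
\begin{align*}
u_t(s) &= \mathbb{V}_{a,s' \sim \pi,\bar p_t}\big[\bar V^\pi_t(s')\big]
        - \mathbb{E}_{p \sim \Phi_t}\Big[\mathbb{V}_{a,s' \sim \pi,p}\big[V^{\pi,p}(s')\big]\Big], \\
w_t(s) &= \mathbb{V}_{p \sim \Phi_t}\Big[\mathbb{E}_{a,s' \sim \pi,p}\big[\bar V^\pi_t(s')\big]\Big].
\end{align*}
% Law of total variance for \bar V^\pi_t(s'), drawing p ~ \Phi_t first and then (a, s') ~ (\pi, p).
% Marginally, (a, s') is distributed according to \pi and \bar p_t:
\begin{align*}
\mathbb{V}_{a,s' \sim \pi,\bar p_t}\big[\bar V^\pi_t(s')\big]
  &= \mathbb{E}_{p \sim \Phi_t}\Big[\mathbb{V}_{a,s' \sim \pi,p}\big[\bar V^\pi_t(s')\big]\Big] + w_t(s).
\end{align*}
% Substituting into the definition of u_t(s):
\begin{align*}
u_t(s) &= w_t(s)
  - \underbrace{\mathbb{E}_{p \sim \Phi_t}\Big[\mathbb{V}_{a,s' \sim \pi,p}\big[V^{\pi,p}(s')\big]
  - \mathbb{V}_{a,s' \sim \pi,p}\big[\bar V^\pi_t(s')\big]\Big]}_{g_t(s)}.
\end{align*}
```

Read this way, the first term of $u_t$ is the total uncertainty about the mean values and the second is the average aleatoric uncertainty about the next-state value; the gap $g_t$ vanishes when next-state values carry no epistemic uncertainty.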
Toy example
- MRP has 4 possible combinations of δ and β
- Assumptions 1 and 2 are satisfied
- Table 1 includes results for uncertainty rewards and UBEs
- $W^\pi(s_2) = U^\pi(s_2)$
- Gap terms for $s_2$ cancel out
- $W^\pi$ overestimates the variance of the value by ∼36%
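The paper's toy MRP is not reproduced in this summary, so the sketch below checks the same two claims on a different, hypothetical five-state acyclic MRP: the solution of the exact UBE matches the true posterior variance, and the upper-bound recursion overestimates it. It assumes independent discrete posteriors over each state's transition row and the local rewards $u$ and $w$ written as in the comments; all names and numbers are illustrative.

```python
import itertools

GAMMA = 0.9
REWARD = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 0.0}  # states 3 and 4 are terminal
# POSTERIOR[s]: independent discrete posterior over the transition row of state s,
# given as (posterior probability, {next_state: transition probability}).
POSTERIOR = {
    0: [(0.5, {1: 0.8, 2: 0.2}), (0.5, {1: 0.3, 2: 0.7})],
    1: [(0.5, {3: 0.9, 4: 0.1}), (0.5, {3: 0.5, 4: 0.5})],
    2: [(0.5, {3: 0.2, 4: 0.8}), (0.5, {3: 0.6, 4: 0.4})],
}
ORDER = [4, 3, 2, 1, 0]  # reverse topological order of the acyclic MRP


def values(model):
    """Value of every state under one sampled transition model."""
    v = {}
    for s in ORDER:
        row = model.get(s, {})
        v[s] = REWARD[s] + GAMMA * sum(pr * v[s2] for s2, pr in row.items())
    return v


def expect(f):
    """Exact posterior expectation of f(model) by enumerating all models."""
    states = sorted(POSTERIOR)
    total = 0.0
    for combo in itertools.product(*(POSTERIOR[s] for s in states)):
        weight, model = 1.0, {}
        for s, (prob, row) in zip(states, combo):
            weight *= prob
            model[s] = row
        total += weight * f(model)
    return total


def mean_next(row, v):
    return sum(pr * v[s2] for s2, pr in row.items())


def var_next(row, v):
    m = mean_next(row, v)
    return sum(pr * (v[s2] - m) ** 2 for s2, pr in row.items())


V_MEAN = {s: expect(lambda m, s=s: values(m)[s]) for s in REWARD}
TRUE_VAR = {s: expect(lambda m, s=s: values(m)[s] ** 2) - V_MEAN[s] ** 2 for s in REWARD}
P_MEAN = {
    s: {s2: sum(prob * row[s2] for prob, row in POSTERIOR[s]) for s2 in POSTERIOR[s][0][1]}
    for s in POSTERIOR
}


def local_u(s):
    # Exact local uncertainty: variance of mean values minus average aleatoric variance.
    return var_next(P_MEAN[s], V_MEAN) - expect(lambda m: var_next(m[s], values(m)))


def local_w(s):
    # Upper-bound local uncertainty: posterior variance of the expected mean next value.
    second_moment = expect(lambda m: mean_next(m[s], V_MEAN) ** 2)
    return second_moment - mean_next(P_MEAN[s], V_MEAN) ** 2


def solve_ube(local_reward):
    """Backward recursion X(s) = gamma^2 * (reward(s) + sum_s' p_mean(s'|s) X(s'))."""
    out = {}
    for s in ORDER:
        if s not in POSTERIOR:
            out[s] = 0.0  # terminal states carry no uncertainty
        else:
            out[s] = GAMMA ** 2 * (local_reward(s) + mean_next(P_MEAN[s], out))
    return out


U = solve_ube(local_u)
W = solve_ube(local_w)
for s in sorted(REWARD):
    print(s, round(TRUE_VAR[s], 6), round(U[s], 6), round(W[s], 6))
```

In this hypothetical MRP the two recursions agree at states 1 and 2, whose successors are terminal and therefore have no epistemic uncertainty, while at state 0 the upper bound exceeds the exact variance by roughly 48%. The exact local reward at state 0 is negative, which is consistent with the need for special handling of negative rewards in practical implementations.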
Variance-driven optimistic exploration
- Propose a technique to solve the RL problem using uncertainty quantification of Q-values
- Define $\Gamma_t$ as the posterior distribution over MDPs
- Update policy by solving the upper-confidence bound optimization problem
- Typical RL techniques violate theoretical assumptions
- Propose practical upper-bound on the solution of the UBE
- Tabular implementation uses Dirichlet prior on transition function and Normal prior for rewards
- Deep RL implementation uses MBPO architecture and approximates sum of cumulative uncertainty rewards
- Train an ensemble of N value functions and U-net
- UBE-based methods have added complexity of training U-net
- Small N reduces computational burden
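A minimal tabular sketch of the two ingredients mentioned above: a Dirichlet posterior over transitions and an optimistic action choice. It assumes the upper-confidence objective takes the common form of mean plus exploration gain times standard deviation; the function names and the clipping of negative variance estimates at zero are illustrative choices, not details confirmed by the summary.

```python
import math


def dirichlet_posterior_mean(prior_alpha, counts):
    """Posterior mean transition probabilities after observing transition counts."""
    posterior = [a + c for a, c in zip(prior_alpha, counts)]
    total = sum(posterior)
    return [x / total for x in posterior]


def optimistic_action(q_mean, q_var, exploration_gain):
    """Pick argmax_a of q_mean[a] + exploration_gain * sqrt(q_var[a]).

    Negative variance estimates are clipped to zero before the square root
    (an assumption made here to keep the score well defined).
    """
    scores = [
        m + exploration_gain * math.sqrt(max(v, 0.0))
        for m, v in zip(q_mean, q_var)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Two actions: the second has a lower mean but higher uncertainty.
q_mean = [1.0, 0.9]
q_var = [0.0, 0.04]
print(optimistic_action(q_mean, q_var, exploration_gain=0.0))
print(optimistic_action(q_mean, q_var, exploration_gain=1.0))
print(dirichlet_posterior_mean([1.0, 1.0], [3, 1]))
```

With zero exploration gain the choice is greedy in the mean; with a positive gain the more uncertain action can win, which is the mechanism that drives exploration toward poorly understood states.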
Experiments
- Evaluated performance of policy optimization scheme
- Examined different variance estimates from Section 4
Tabular environments
- Evaluated tabular implementation in grid-world environments
- Used PSRL as baseline
- Tested agent’s ability to explore over multiple time steps in presence of deterrent
- Considered deterministic version of problem
- Optimal policy is to always go right
- Ran each method for 1000 episodes and five random seeds
- Found that using $u_{\min} = -0.05$ improves performance
- Method achieved lowest learning time and best scaling with problem size
- Method achieved lowest total regret across all values of L
Continuous control environments
- Evaluated performance of deep RL implementation in continuous state-action spaces
- Added action cost to complicate problem
- SAC quickly converged to suboptimal solution
- Exact-ube method had most robust performance across noise levels
- Ensemble-mean was a strong baseline
Ensemble size ablation
- Ensemble size is a critical factor for ensemble-based methods
- An et al. suggest large ensemble size is needed for good performance, but is computationally expensive
- Ablation study shows that exact-ube achieves best or comparable performance across all environments and values of N
- Performance increases for larger ensembles, matching observations from An et al.
- Sample-based approximations of local uncertainty rewards are less sensitive to sample size
- Larger ensembles may not always lead to better performance in presence of sparse rewards
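For reference, the simplest ensemble-based uncertainty signal is the per-state sample variance across the N value predictions, and its quality depends directly on N. The sketch below is a generic illustration with hypothetical names, not the paper's implementation.

```python
from statistics import fmean, variance


def ensemble_mean_and_variance(predictions):
    """Per-state mean and unbiased sample variance across ensemble members.

    predictions[i][s] is member i's value estimate for state s; at least
    two members are required for the sample variance to be defined.
    """
    per_state = list(zip(*predictions))
    means = [fmean(column) for column in per_state]
    variances = [variance(column) for column in per_state]
    return means, variances


# Two ensemble members, two states: the members disagree only on the first state.
means, variances = ensemble_mean_and_variance([[1.0, 2.0], [3.0, 2.0]])
print(means, variances)
```

With only a handful of members this estimator is noisy, which is one reason ensemble-variance methods tend to benefit from larger N.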
Conclusions
- Derived an uncertainty Bellman equation whose fixed-point solution converges to the variance of values given a posterior distribution over MDPs
- Characterized the gap in previous UBE formulations that upper-bound the variance of values
- Gap is the consequence of an over-approximation of the uncertainty rewards being propagated through the Bellman recursion, which ignores the inherent aleatoric uncertainty from acting in an MDP
- The new UBE instead recovers exclusively the epistemic uncertainty due to limited environment data
- This epistemic uncertainty serves as an effective exploration signal
- Proposed a practical method to estimate the solution of the UBE
- Scalable beyond tabular problems with standard deep RL practices
- Variance estimation integrated into a model-based approach
- Uses the principle of optimism in the face of uncertainty to explore effectively
- Improves sample efficiency in hard exploration problems
- Does not require large ensembles
- Identity for covariance of value function
- Lemma 2: Under Assumptions 1 and 2, for any $s \in S$ and any policy $\pi$, $\mathrm{Cov}[p(s' \mid s, a), V^{\pi,p}(s')] = 0$
- Lemma 3: Under Assumptions 1 and 2, it holds that a,s
- Theorem 1: Under Assumptions 1 and 2, for any $s \in S$ and policy $\pi$, the posterior variance of the value function, $\mathbb{V}_{p \sim \Phi_t}[V^{\pi,p}]$, obeys the uncertainty Bellman equation
- Lemma 4: Under Assumptions 1 and 2, it holds that
- Lemma 5: Under Assumptions 1 and 2, it holds that
- Lemma 6: Under Assumptions 1 and 2, it holds that
- Lemma 7: Under Assumptions 1 and 2, it holds that is non-negative
- Theorem 2: Under Assumptions 1 and 2, for any $s \in S$ and policy $\pi$, it holds that $u_t(s) = w_t(s) - g_t(s)$, where $g_t(s) = \mathbb{E}_{p \sim \Phi_t}\left[\mathbb{V}_{a,s' \sim \pi,p}\left[V^{\pi,p}(s')\right] - \mathbb{V}_{a,s' \sim \pi,p}\left[\bar{V}^{\pi}_t(s')\right]\right]$
- Gap $g_t(s)$ is non-negative, thus $u_t(s) \le w_t(s)$
- Evaluated performance in the DeepSea benchmark
- Three uncertainty signals, since assumptions are violated in the practical setting
- When integrated into Algorithm 1, performance in terms of learning time and total regret is quite similar
- Selected exact-ube_3 as the default estimate for all other experiments
- Ensemble size N is one important hyperparameter for all the OFU-based methods
- Sample-based approximation of uncertainty rewards is not very sensitive to the number of samples
- Exploration gain λ is an important hyperparameter for OFU-based methods
- As λ increases, total regret of all the methods increases, but overall exact-ube achieves the best performance
- Optimistic approach on top of MBPO presented in Algorithm 2
- Main differences with the original implementation
- Main hyperparameters for experiments included in Table 2