Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Exogenous state variables and rewards can slow reinforcement learning.
  • Reward function decomposes additively into endogenous and exogenous components.
  • Decomposition of state space into exogenous and endogenous state spaces must be discovered.
  • Algorithms introduced to discover exogenous and endogenous subspaces of state space.
  • Experiments show that these methods produce speedups in reinforcement learning.

Paper Content


  • Actions of an agent have limited effect on environment
  • Wireless cellular network has parameters that must be dynamically controlled
  • Formulated as a Markov Decision Process (MDP)
  • Reward function is negative of number of users with low bandwidth
  • Reward heavily influenced by exogenous factors
  • Stochasticity can confuse reinforcement learning algorithms
  • Need many trials to average away exogenous components
  • Learning rate needs to be small
  • Number of Monte Carlo trials required to estimate gradient grows large
  • Analyze setting and develop algorithms to detect and remove effects of exogenous state variables
  • Accelerates reinforcement learning (RL)

Mdps with exogenous states and rewards

  • Study of discrete time stationary MDPs with stochastic rewards and transitions
  • State and action spaces can be discrete or continuous
  • State space S, action space A, reward distribution R, transition function P, starting state distribution P0, discount factor γ
  • For all (s, a) in S x A, R(s, a) has expected value m(s, a) and finite variance σ2(s, a)
  • State space S takes form S = x d i=1 S i
  • State s can be written as d-tuple of values of state variables s = (s1, …, sd)
  • Random vector for state at time t is S t = x d i=i S t,i
  • Random variable for action at time t is A t
  • Random variable for reward at time t is R t
  • State variables can be decomposed into endogenous and exogenous sets E and X
  • Vector space of m x n matrices over R
  • Linear decomposition of S into endo/exo parts E and X via W
  • Direct Sum of vector spaces A and B
  • Vector subspace of vector space B
  • Orthogonal complement of subspace A of vector space S
  • Dimension of vector subspace A
  • Identity matrix of size n, Matrix of zeros of size m x n
  • Matrix that defines the linear exogenous subspace
  • Trace of matrix A, Transpose of A, Determinant of A
  • Euclidean norm of vector u
  • Covariance matrix of A, Cross-covariance matrix of A, B
  • Gaussian distribution with mean µ and variance σ2

Exogenous state variables

  • Exogeneity is a causal concept: a variable is exogenous if it is impossible for our actions to affect its value
  • Formalized in terms of Pearl’s do-calculus
  • A DBN with a certain structure is sufficient to ensure that the variables in X are exogenous
  • Multiple exo/endo decompositions are possible for an MDP with exogenous state variables
  • Not every subset of exogenous state variables can yield a valid exo/endo decomposition
  • Interested in the exo/endo decomposition where the exo set X is as large as possible
  • Maximal exo/endo decomposition is unique and contains all exogenous state variables
  • Statistically exogenous variables can be inferred from data, but not necessarily causally exogenous

Additive reward decomposition

  • Reinforcement learning can be accelerated when reward function can be decomposed into two functions
  • One function depends on exogenous variables, the other depends on both exogenous and endogenous variables
  • Mean and variance of exogenous and endogenous reward distributions can be calculated
  • Bellman optimality equation can be decomposed into two separate equations
  • Optimal policy for the endo-MDP is an optimal policy for the full exogenous state MDP
  • McGregor et al. (2017) show how to remove known exogenous state variables to accelerate Model Free Monte Carlo algorithm

Variance analysis of the exo/endo decomposition

  • Reinforcement learning on the endogenous MDP can be more efficient than on the original MDP
  • Estimating the value of a fixed policy in a given start state via Monte Carlo trials of length H
  • Sample complexity of estimating V π (s 0 ; H) on the full MDP compared to the sample complexity of estimating V π end (s 0 ; H) on the Endogenous MDP
  • Define B π (s 0 ; H) to be a random variable for the H-step cumulative discounted return
  • Theorem 5 states that the sample size bound using the endogenous MDP will be less than the required sample size using the full MDP
  • Variance and covariance of the H-step returns need to be computed
  • Algorithm 1 for discovering the exogenous variables of an MDP from data collected during exploration
  • Exogenous state discovery problem formulated as finding a set of variables that minimizes the residuals of the fitted exogenous reward function
  • Two variations of the optimization formulation
  • Most general formulation in which the exo-endo decomposition is defined by a diffeomorphism
  • Practical special case in which the exogenous and endogenous spaces are defined by a linear mapping

Variable selection formulation

  • Variable selection formulation aims to find two disjoint sets of state variables
  • Squared error of exogenous reward regression is minimized
  • Conditional mutual information constraints are used to enforce structure
  • Problem is broken into two subproblems
  • Hierarchical formulation is a simpler proxy for the coupled formulation

Continuous formulation

  • Continuous formulation assumes state space conditions
  • Mapping should preserve probability densities
  • Diffeomorphisms are a natural choice
  • Diffeomorphisms do not lose information
  • Probability density of transformation can be written analytically
  • Mutual information and conditional independence are preserved
  • Formulation jointly optimizes diffeomorphic state transformation and exogenous reward function
  • Relaxed definition of diffeomorphism (L-diffeomorphism) allows for mappings that are bijective and continuously differentiable almost everywhere

Linear formulation

  • Introduce a tractable formulation for continuous diffeomorphisms
  • Consider the general linear group GL(d, R) of invertible linear transformations in vector space R d
  • Define a linear mapping ξ exo from the full state space to the exogenous state space
  • Define the endogenous state space as the orthogonal complement of the exogenous state space
  • Represent the state-space decomposition defined by W = (W exo , W end )
  • Impose the requirement that the columns of W exo consist of orthonormal vectors
  • Estimate the expectations required in the optimization formulations from sample transitions
  • Adopt linear regression for the exogenous reward regression problem
  • Replace conditional mutual information with the conditional correlation coefficient (CCC)
  • Express the coupled formulation of the exogenous subspace discovery problem for the full setting
  • Express the linear hierarchical formulation for the full case

Conditions establishing sound inference

  • MDPs with finite, discrete state and action spaces can be used to find valid exo/endo decompositions
  • Exploration policy must be fully randomized to visit all states and execute all possible actions
  • Admissible MDPs must have all states reachable from the start state and all policies must reach a terminal state in a finite number of steps
  • Episodic MDPs must reset to the start state after reaching a terminal state
  • Ergodic MDPs must have all states reachable from each other in a finite number of steps
  • Conditional mutual information must be zero for a valid exo/endo decomposition
  • Faithfulness assumption must be made to infer the structure of a causal graph from observational data
  • PPO must be initialized to a fully-randomized policy to guarantee valid exogenous sets

Algorithms for decomposing an mdp into exogenous and endogenous components

  • GRDS algorithm is based on linear hierarchical formulation of Equation (22)
  • SRAS algorithm starts with d exo := 0 and adds one column at a time
  • GRDS algorithm solves inner objective by iterating from d exo := d down to zero
  • GRDS algorithm minimizes CCC and halts when it finds W exo with CCC <
  • GRDS algorithm solves optimization problem on Stiefel manifold

Analysis of the global rank-descending scheme

  • Assume linear regression is used to fit the exo reward function
  • Directly analyzing Algorithm 2 is difficult due to estimation errors, approximations, and using CCC as a proxy for conditional independence
  • Oracle-GRDS algorithm returns a matrix W exo that forms a valid exo/endo decomposition of the full form with maximal dimension
  • Unique maximal subspace X max is unique and all other exogenous subspaces X must be contained within X max
  • GRDS departs from the oracle version in three ways: CCC in place of CMI, parameter , and finite training sample
  • Stepwise Rank Ascending Scheme (SRAS) constructs W exo incrementally by solving a sequence of small manifold optimization problems
  • SRAS maintains current partial solution W exo, temporary matrix W temp, set of candidate column vectors C x, and orthonormal basis N for the null space of C x
  • SRAS computes new candidate vector ŵ by solving a simplified CCC minimization problem on the Stiefel manifold
  • SRAS checks if CCC is less than parameter , computes E matrix, and checks if CCC f ull is less than
  • Not all subsets of the maximal exogenous subspace are valid exogenous subspaces, so SRAS does not terminate when adding a candidate vector to W exo causes the full constraint to be violated