Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- The reward hypothesis suggests that goals and purposes can be thought of as maximizing the expected value of a reward.
- The paper aims to fully settle the hypothesis by specifying the requirements for it to hold.
Paper Content
Introduction
- Reward hypothesis posited by Sutton states that goals and purposes can be thought of as maximizing expected value of cumulative reward
- McCarthy’s claim that intelligence is the computational part of the ability to achieve goals
- Sutton’s hypothesis implies that to build AI, it is sufficient to solve RL
- Reward-is-enough hypothesis posits that intelligence can be understood as subserving the maximization of reward
- Abel et al. (2021) grounded the notion of goals and purposes as an ordering over policies
- Shakerinava and Ravanbakhsh (2022) grounded goals and purposes in preference relations over state-trajectories
- Pitis (2019) provides a normative account for why we should embrace a state-action dependent discount factor
- Dong et al. (2021), Lu et al. (2021), and early work on general RL explored general stochastic environments and policies
- New axiom introduced to accommodate discounted reward, average reward, and episodic settings
- Account does not give simple affirmation or refutation of reward hypothesis, but rather aims to specify implicit requirements on goals and purposes
The reward hypothesis
- Formalize reward hypothesis
- State assumptions for each phrase in claim
Goals as preferences
- Goals and purposes are expressed with a binary preference relation.
- Agent interaction is a cycle of observing the environment and taking action.
- Preferences are expressed over histories in deterministic settings.
- Distributions over histories are considered in stochastic settings.
- Rewards can be present in the agent’s observation or provided by an external observer.
- Preferences over policies are consistent with the distributions over histories they induce.
- Goals can be achieved in a defined time frame or of a continuing nature.
Maximizing cumulative sums
- Maximizing cumulative sum of scalar reward is domain of reinforcement learning
- Reward function and transition-dependent discount function used to compare policies
- Reward hypothesis means preference relation must be true for it to hold
Rationality axioms
- Completeness requires that the preference ordering make some judgment about any pair of distributions
- Transitivity requires that no coherent goal can involve cyclical preferences
- Independence requires that the preference between two distributions does not change when the other alternatives are the same
- Temporal -Indifference requires that the agent has no preference over which history is delayed, even if one history is highly preferred to the other
- There exists a utility function whose expectation for any distribution over histories is consistent with the preference relation
- There exists a Markov reward function such that the expected sum of rewards under a particular transition-dependent discount factor is consistent with the preference relation
- There exists an efficient algorithm that constructs the reward function and discount factor from a preference relation
- The form of the objective (e.g. discounted reward, episodic total reward, average reward) is determined by the preference relation and how it satisfies the Temporal -Indifference axiom
Objective goals
- Preferences and rewards originate from different agents.
- Designer provides learning signal reflecting preferences.
- Designer maintains preference relation over distributions.
- Agent receives rewards and discounts as separate inputs.
History and related work
- Discusses relevant literature from RL and economics
- Examines connections between the two fields
Economics
- Economics has studied rational behavior for centuries
- 1700s: Cramer and Bernoulli formulated the Expected Utility Hypothesis
- Expected Utility Hypothesis: individuals maximize expectation of utility
- Ramsey provided first formal axiomatic treatment of expected utility
- von Neumann and Morgenstern refined expected utility foundations of decision theory
- Research explored how to account for uncertainty, time, and computation
Utility theory for sequential decision making
- Pitis (2019) explored the relationship between the vNM axioms and the objectives of RL
- Shakerinava and Ravanbakhsh (2022) used utility theory to formalize “goals and purposes”
- Our work takes inspiration from Shakerinava and Ravanbakhsh and formalizes “goals and purposes” in a general setting
- Sunehag and Hutter (2011, 2015) studied what constitutes a rational RL agent
- Abel et al. (2021) studied the expressivity of reward in Markovian environments
- Abel et al. showed that there are restrictions on what kinds of preferences can be codified in terms of a reward function
- Abel et al. pointed out two styles of counterexample: steady state and entailment
The limited expressivity of markov reward
- Steady state counterexample violates Assumption 2
- Entailment counterexample violates Axiom 5
- Preference over policies requires 1 to be preferred over 2
- Preference over outcomes of Assumption 1
- Preference requires 1 to be preferred over 2 even though they induce identical distributions
Challenges to the reward hypothesis
- Common challenges to the reward hypothesis
- Formalization of the hypothesis can provide insight into arguments
Human irrationality
- Rational agents make decisions that are in their best interest
- Humans deviate from the model of rational agents
- Johnson-Laird showed people do not use mental logic when solving problems
- Kahneman and Tversky showed humans prefer outcomes that avoid risk
- Hayden and Niv argue against the presence of “economic values” in the brain
- Settling the reward hypothesis is about the expression of goals, not the behaviors that emerge in their pursuit
Multiple objectives
- Reward hypothesis suggests reducing purpose to a single scalar is difficult
- Multiobjective or multi-criteria decision making has been studied extensively
- Hausner (1953) drops continuity to generalize vNM results to multidimensional setting
- Gábor et al. (1998) propose multi-criteria RL in MDPs
- Miura (2022) shows multi-dimensional Markov reward functions are more expressive than scalar Markov reward functions
- Pitis et al. (2022) prove some multi-objective problems cannot be collapsed to a scalar objective
- Constrained MDPs are more expressive than single-reward MDPs
- Risk-sensitive objectives can cause optimal policy to be non-Markovian
- Axiom 5 can be violated, can be overcome by augmenting state using objective goals formulation
Conclusion
- Reward hypothesis requires assumptions and axioms
- Assumptions 1-4 and Axioms 1-5 must be satisfied for hypothesis to hold
- Algorithm 1 can construct reward function given preference relation
- Temporal indifference axiom is precondition for Memoryless axiom
- Additivity axiom states preferences remain unchanged after mixing
- VNM utility theorem guarantees preferences have utility representation
- Proposition 1 states if one policy has higher average reward, finite sum must eventually be larger
- Algorithm 2 sorts outcomes according to preference relation
- Extended axiom captures preferences expressed as reward bundles
- Additional axiom examines temporal nature of handling multiple objectives