Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

The reward hypothesis suggests that goals and purposes can be thought of as maximizing the expected value of the cumulative sum of a scalar reward.
The paper aims to fully settle the hypothesis by specifying the requirements for it to hold.

Paper Content

Introduction

The reward hypothesis, posited by Sutton, states that goals and purposes can be thought of as maximizing the expected value of cumulative reward
McCarthy claims that intelligence is the computational part of the ability to achieve goals
Sutton’s hypothesis implies that to build AI, it is sufficient to solve RL
Reward-is-enough hypothesis posits that intelligence can be understood as subserving the maximization of reward
Abel et al. (2021) grounded the notion of goals and purposes as an ordering over policies
Shakerinava and Ravanbakhsh (2022) grounded goals and purposes in preference relations over state-trajectories
Pitis (2019) provides a normative account for why we should embrace a state-action dependent discount factor
Dong et al. (2021), Lu et al. (2021), and early work on general RL explored general stochastic environments and policies
The paper introduces a new axiom to accommodate the discounted reward, average reward, and episodic settings
The account does not give a simple affirmation or refutation of the reward hypothesis; it aims to specify the implicit requirements on goals and purposes

The reward hypothesis

This section formalizes the reward hypothesis
It states the assumptions behind each phrase in the claim

Goals as preferences

Goals and purposes are expressed with a binary preference relation.
Agent interaction is a cycle of observing the environment and taking action.
Preferences are expressed over histories in deterministic settings.
Distributions over histories are considered in stochastic settings.
Rewards can be present in the agent’s observation or provided by an external observer.
Preferences over policies are consistent with the distributions over histories they induce.
Goals can be achieved in a defined time frame or of a continuing nature.

Maximizing cumulative sums

Maximizing the cumulative sum of a scalar reward is the domain of reinforcement learning
A reward function and a transition-dependent discount function are used to compare policies
The reward hypothesis holds for a preference relation when some reward function and discount function order distributions the same way the preference relation does
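
The comparison described above can be sketched in a few lines. This is a minimal illustration, not the paper's construction: the `Transition` type, the toy reward and discount functions, and the two example distributions are all hypothetical, and histories are assumed to be finite tuples with a finite distribution over them.

```python
from typing import Callable, Dict, Hashable, Tuple

Transition = Hashable  # hypothetical stand-in for one step of interaction
History = Tuple[Transition, ...]


def discounted_return(
    history: History,
    reward: Callable[[Transition], float],
    discount: Callable[[Transition], float],
) -> float:
    """Sum the rewards, weighting each by the product of the discounts before it."""
    total = 0.0
    weight = 1.0  # running product of the discounts seen so far
    for transition in history:
        total += weight * reward(transition)
        weight *= discount(transition)
    return total


def expected_return(distribution: Dict[History, float], reward, discount) -> float:
    """Expectation of the discounted return under a finite {history: probability} map."""
    return sum(p * discounted_return(h, reward, discount) for h, p in distribution.items())


def weakly_prefers(dist_a, dist_b, reward, discount) -> bool:
    """Preference induced by the reward and discount: A is at least as good as B."""
    return expected_return(dist_a, reward, discount) >= expected_return(dist_b, reward, discount)


# Toy example (assumed for illustration): reward 1 on reaching "goal", discount 0.5 everywhere.
def toy_reward(transition: Transition) -> float:
    return 1.0 if transition == "goal" else 0.0


def toy_discount(transition: Transition) -> float:
    return 0.5


dist_a = {("step", "goal"): 1.0}
dist_b = {("goal",): 0.5, ("step", "step", "goal"): 0.5}

print(expected_return(dist_a, toy_reward, toy_discount))  # 0.5
print(expected_return(dist_b, toy_reward, toy_discount))  # 0.625
```

In this toy case the induced ordering prefers `dist_b`. The reward hypothesis asks the converse question: given a preference relation, does a reward and discount pair exist that reproduces it?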

Rationality axioms

Completeness requires that the preference ordering make some judgment about any pair of distributions
Transitivity requires that no coherent goal can involve cyclical preferences
Independence requires that the preference between two distributions does not change when the other alternatives are the same
Temporal γ-Indifference requires that the agent has no preference over which history is delayed, even if one history is highly preferred to the other
There exists a utility function whose expectation for any distribution over histories is consistent with the preference relation
There exists a Markov reward function such that the expected sum of rewards under a particular transition-dependent discount factor is consistent with the preference relation
There exists an efficient algorithm that constructs the reward function and discount factor from a preference relation
The form of the objective (e.g. discounted reward, episodic total reward, average reward) is determined by the preference relation and how it satisfies the Temporal γ-Indifference axiom
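
To make the first two axioms concrete, here is a small sketch that checks completeness and transitivity by brute force. It makes simplifying assumptions: the relation is given as a hypothetical `weakly_prefers(a, b)` function over a small finite set of items, whereas the axioms above are stated over distributions. Independence and Temporal γ-Indifference are not checked here.

```python
from itertools import product


def is_complete(items, weakly_prefers) -> bool:
    """Every pair is comparable in at least one direction."""
    return all(
        weakly_prefers(a, b) or weakly_prefers(b, a)
        for a, b in product(items, repeat=2)
    )


def is_transitive(items, weakly_prefers) -> bool:
    """Whenever a >= b and b >= c hold, a >= c must hold as well."""
    return all(
        weakly_prefers(a, c)
        for a, b, c in product(items, repeat=3)
        if weakly_prefers(a, b) and weakly_prefers(b, c)
    )


# A relation derived from a utility table satisfies both checks.
utility = {"a": 2.0, "b": 1.0, "c": 1.0}
outcomes = list(utility)


def by_utility(x, y) -> bool:
    return utility[x] >= utility[y]


# A cyclic relation (rock beats scissors beats paper beats rock) is complete
# but not transitive, so it cannot express a coherent goal.
beats = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}
moves = ["rock", "paper", "scissors"]


def cyclic(x, y) -> bool:
    return x == y or (x, y) in beats


print(is_complete(outcomes, by_utility), is_transitive(outcomes, by_utility))  # True True
print(is_complete(moves, cyclic), is_transitive(moves, cyclic))  # True False
```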

Objective goals

Preferences and rewards originate from different agents.
Designer provides learning signal reflecting preferences.
Designer maintains preference relation over distributions.
Agent receives rewards and discounts as separate inputs.
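
One way to picture rewards and discounts arriving as separate inputs is an agent-side accumulator that is fed one (reward, discount) pair per step. This is a hedged sketch only: the `ReturnAccumulator` class and the signal values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ReturnAccumulator:
    """Tracks the discounted sum of rewards as (reward, discount) pairs arrive."""

    total: float = 0.0
    weight: float = 1.0  # product of all discounts received so far

    def update(self, reward: float, discount: float) -> None:
        # The current reward is weighted by the discounts of earlier steps only.
        self.total += self.weight * reward
        self.weight *= discount


# Hypothetical signals from the designer, one (reward, discount) pair per step.
signals = [(1.0, 0.9), (0.0, 0.9), (2.0, 0.0), (5.0, 1.0)]

accumulator = ReturnAccumulator()
for reward, discount in signals:
    accumulator.update(reward, discount)

print(accumulator.total)
```

In this example the third step carries a discount of 0, so the reward of 5 on the following step contributes nothing to the total. This is one way a per-step discount could cut off later reward without a separate episode-termination signal.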

History and related work

Discusses relevant literature from RL and economics
Examines connections between the two fields

Economics

Economics has studied rational behavior for centuries
1700s: Cramer and Bernoulli formulated the Expected Utility Hypothesis
Expected Utility Hypothesis: individuals maximize expectation of utility
Ramsey provided first formal axiomatic treatment of expected utility
von Neumann and Morgenstern refined expected utility foundations of decision theory
Research explored how to account for uncertainty, time, and computation

Utility theory for sequential decision making

Pitis (2019) explored the relationship between the vNM axioms and the objectives of RL
Shakerinava and Ravanbakhsh (2022) used utility theory to formalize “goals and purposes”
The paper takes inspiration from Shakerinava and Ravanbakhsh and formalizes “goals and purposes” in a general setting
Sunehag and Hutter (2011, 2015) studied what constitutes a rational RL agent
Abel et al. (2021) studied the expressivity of reward in Markovian environments
Abel et al. showed that there are restrictions on what kinds of preferences can be codified in terms of a reward function
Abel et al. pointed out two styles of counterexample: steady state and entailment

The limited expressivity of Markov reward

Steady state counterexample violates Assumption 2
Entailment counterexample violates Axiom 5
Preference over policies requires policy 1 to be preferred over policy 2
Preference over outcomes of Assumption 1
Preference requires policy 1 to be preferred over policy 2 even though they induce identical distributions

Challenges to the reward hypothesis

Common challenges to the reward hypothesis
Formalization of the hypothesis can provide insight into arguments

Human irrationality

Rational agents make decisions that are in their best interest
Humans deviate from the model of rational agents
Johnson-Laird showed people do not use mental logic when solving problems
Kahneman and Tversky showed humans prefer outcomes that avoid risk
Hayden and Niv argue against the presence of “economic values” in the brain
Settling the reward hypothesis is about the expression of goals, not the behaviors that emerge in their pursuit

Multiple objectives

The reward hypothesis suggests purpose can be reduced to a single scalar, which is often challenged as difficult
Multiobjective or multi-criteria decision making has been studied extensively
Hausner (1953) drops continuity to generalize vNM results to multidimensional setting
Gábor et al. (1998) propose multi-criteria RL in MDPs
Miura (2022) shows multi-dimensional Markov reward functions are more expressive than scalar Markov reward functions
Pitis et al. (2022) prove some multi-objective problems cannot be collapsed to a scalar objective
Constrained MDPs are more expressive than single-reward MDPs
Risk-sensitive objectives can cause optimal policy to be non-Markovian
In such cases Axiom 5 can be violated; this can be overcome by augmenting the state using the objective goals formulation

Conclusion

Reward hypothesis requires assumptions and axioms
Assumptions 1-4 and Axioms 1-5 must be satisfied for hypothesis to hold
Algorithm 1 can construct reward function given preference relation
Temporal indifference axiom is precondition for Memoryless axiom
Additivity axiom states preferences remain unchanged after mixing
vNM utility theorem guarantees preferences have utility representation
Proposition 1 states if one policy has higher average reward, finite sum must eventually be larger
Algorithm 2 sorts outcomes according to preference relation
Extended axiom captures preferences expressed as reward bundles
Additional axiom examines temporal nature of handling multiple objectives
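
The claim attributed to Proposition 1 above can be illustrated numerically. This is a toy check rather than a proof: the reward streams are invented for illustration, the helper name is hypothetical, and the comparison is only carried out up to a finite `max_steps`.

```python
from itertools import accumulate, chain, islice, repeat


def horizon_after_which_ahead(rewards_a, rewards_b, max_steps):
    """Smallest n such that A's partial sum exceeds B's at every horizon from n to max_steps.

    Returns None if A is still not ahead at max_steps.
    """
    sums_a = accumulate(islice(rewards_a, max_steps))
    sums_b = accumulate(islice(rewards_b, max_steps))
    last_not_ahead = 0
    for n, (sum_a, sum_b) in enumerate(zip(sums_a, sums_b), start=1):
        if sum_a <= sum_b:
            last_not_ahead = n
    return last_not_ahead + 1 if last_not_ahead < max_steps else None


# Policy A earns 1 per step (average reward 1).
# Policy B earns 10 up front and 0.5 per step afterwards (average reward 0.5).
horizon = horizon_after_which_ahead(
    repeat(1.0),
    chain([10.0], repeat(0.5)),
    max_steps=100,
)
print(horizon)  # 20
```

B's early bonus keeps it ahead, or tied, through step 19. From step 20 onward the finite sum for A, the policy with the higher average reward, is strictly larger.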
