Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • The reward hypothesis suggests that goals and purposes can be thought of as maximizing the expected value of the cumulative sum of a scalar reward.
  • The paper aims to fully settle the hypothesis by specifying the requirements for it to hold.

Paper Content

Introduction

  • Reward hypothesis posited by Sutton states that goals and purposes can be thought of as maximizing expected value of cumulative reward
  • McCarthy’s claim that intelligence is the computational part of the ability to achieve goals
  • Sutton’s hypothesis implies that to build AI, it is sufficient to solve RL
  • Reward-is-enough hypothesis posits that intelligence can be understood as subserving the maximization of reward
  • Abel et al. (2021) grounded the notion of goals and purposes as an ordering over policies
  • Shakerinava and Ravanbakhsh (2022) grounded goals and purposes in preference relations over state-trajectories
  • Pitis (2019) provides a normative account for why we should embrace a state-action dependent discount factor
  • Dong et al. (2021), Lu et al. (2021), and early work on general RL explored general stochastic environments and policies
  • New axiom introduced to accommodate discounted reward, average reward, and episodic settings
  • Account does not give simple affirmation or refutation of reward hypothesis, but rather aims to specify implicit requirements on goals and purposes

The reward hypothesis

  • Formalize reward hypothesis
  • State assumptions for each phrase in claim

Goals as preferences

  • Goals and purposes are expressed with a binary preference relation.
  • Agent interaction is a cycle of observing the environment and taking action.
  • Preferences are expressed over histories in deterministic settings.
  • Distributions over histories are considered in stochastic settings.
  • Rewards can be present in the agent’s observation or provided by an external observer.
  • Preferences over policies are consistent with the distributions over histories they induce.
  • Goals can be achieved in a defined time frame or of a continuing nature.

Maximizing cumulative sums

  • Maximizing cumulative sum of scalar reward is domain of reinforcement learning
  • Reward function and transition-dependent discount function used to compare policies
  • Reward hypothesis holds when the preference relation agrees with this comparison of expected cumulative discounted reward
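
As a rough illustration of the comparison described above, the sketch below computes a cumulative reward in which every transition carries its own discount. The function name `discounted_return`, the representation of a history as a list of `(reward, discount)` pairs, and the convention that a transition's discount applies to everything after it are assumptions made for this example, not notation taken from the paper.

```python
from typing import Iterable, Tuple


def discounted_return(transitions: Iterable[Tuple[float, float]]) -> float:
    """Sum rewards, weighting each by the product of the discounts of earlier transitions.

    Each item is a hypothetical (reward, discount) pair for one transition.
    Assumed convention: a transition's discount scales all later rewards.
    """
    total = 0.0
    weight = 1.0  # product of the discounts seen so far
    for reward, discount in transitions:
        total += weight * reward
        weight *= discount
    return total


# Two made-up histories. A discount of 0.0 on the last transition cuts off
# anything that would follow, which is one way to model an episodic ending.
history_a = [(1.0, 0.5), (2.0, 0.5), (4.0, 0.0)]
history_b = [(0.0, 1.0), (0.0, 1.0), (5.0, 0.0)]

print(discounted_return(history_a))  # 1 + 0.5*2 + 0.25*4 = 3.0
print(discounted_return(history_b))  # 0 + 0 + 5 = 5.0
```

Under these assumed values, `history_b` has the larger sum, so a preference relation represented by this reward and discount would rank it above `history_a`.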

Rationality axioms

  • Completeness requires that the preference ordering make some judgment about any pair of distributions
  • Transitivity requires that no coherent goal can involve cyclical preferences
  • Independence requires that the preference between two distributions does not change when the other alternatives are the same
  • Temporal γ-Indifference requires that the agent has no preference over which history is delayed, even if one history is highly preferred to the other
  • There exists a utility function whose expectation for any distribution over histories is consistent with the preference relation
  • There exists a Markov reward function such that the expected sum of rewards under a particular transition-dependent discount factor is consistent with the preference relation
  • There exists an efficient algorithm that constructs the reward function and discount factor from a preference relation
  • The form of the objective (e.g. discounted reward, episodic total reward, average reward) is determined by the preference relation and how it satisfies the Temporal γ-Indifference axiom
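
The link between a utility function and the first three axioms can be checked by brute force on a small finite example. The sketch below is only that: it starts from a made-up utility over three hypothetical histories (`h1`, `h2`, `h3`), defines a preference by comparing expected utilities, and verifies completeness, transitivity, and independence on a handful of distributions. It shows the easy direction (a utility representation implies the axioms) and is not the paper's proof or algorithm.

```python
from itertools import product
from typing import Dict

Distribution = Dict[str, float]  # history label -> probability


def expected_utility(dist: Distribution, utility: Dict[str, float]) -> float:
    return sum(p * utility[h] for h, p in dist.items())


def weakly_prefers(a: Distribution, b: Distribution, utility: Dict[str, float]) -> bool:
    """True when a is at least as good as b under the expected-utility comparison."""
    return expected_utility(a, utility) >= expected_utility(b, utility)


def mix(a: Distribution, b: Distribution, p: float) -> Distribution:
    """Return the mixture p*a + (1-p)*b."""
    keys = set(a) | set(b)
    return {h: p * a.get(h, 0.0) + (1 - p) * b.get(h, 0.0) for h in keys}


# Hypothetical utilities and distributions, chosen so all arithmetic is exact.
utility = {"h1": 0.0, "h2": 1.0, "h3": 4.0}
dists = [
    {"h1": 1.0},
    {"h2": 1.0},
    {"h1": 0.5, "h3": 0.5},
    {"h2": 0.5, "h3": 0.5},
]

# Completeness: every pair is comparable in at least one direction.
complete = all(
    weakly_prefers(a, b, utility) or weakly_prefers(b, a, utility)
    for a, b in product(dists, repeat=2)
)

# Transitivity: a >= b and b >= c implies a >= c.
transitive = all(
    weakly_prefers(a, c, utility)
    for a, b, c in product(dists, repeat=3)
    if weakly_prefers(a, b, utility) and weakly_prefers(b, c, utility)
)

# Independence: mixing both sides with the same third distribution
# does not change the preference.
independent = all(
    weakly_prefers(a, b, utility)
    == weakly_prefers(mix(a, c, 0.5), mix(b, c, 0.5), utility)
    for a, b, c in product(dists, repeat=3)
)

print(complete, transitive, independent)
```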

Objective goals

  • Preferences and rewards originate from different agents.
  • Designer provides learning signal reflecting preferences.
  • Designer maintains preference relation over distributions.
  • Agent receives rewards and discounts as separate inputs.
  • Discusses relevant literature from RL and economics
  • Examines connections between the two fields

Economics

  • Economics has studied rational behavior for centuries
  • 1700s: Cramer and Bernoulli formulated the Expected Utility Hypothesis
  • Expected Utility Hypothesis: individuals maximize expectation of utility
  • Ramsey provided first formal axiomatic treatment of expected utility
  • von Neumann and Morgenstern refined expected utility foundations of decision theory
  • Research explored how to account for uncertainty, time, and computation

Utility theory for sequential decision making

  • Pitis (2019) explored the relationship between the vNM axioms and the objectives of RL
  • Shakerinava and Ravanbakhsh (2022) used utility theory to formalize “goals and purposes”
  • Our work takes inspiration from Shakerinava and Ravanbakhsh and formalizes “goals and purposes” in a general setting
  • Sunehag and Hutter (2011, 2015) studied what constitutes a rational RL agent
  • Abel et al. (2021) studied the expressivity of reward in Markovian environments
  • Abel et al. showed that there are restrictions on what kinds of preferences can be codified in terms of a reward function
  • Abel et al. pointed out two styles of counterexample: steady state and entailment

The limited expressivity of Markov reward

  • Steady state counterexample violates Assumption 2
  • Entailment counterexample violates Axiom 5
  • Preference over policies requires policy π1 to be preferred over policy π2
  • Preference over outcomes of Assumption 1
  • Preference requires π1 to be preferred over π2 even though they induce identical distributions

Challenges to the reward hypothesis

  • Common challenges to the reward hypothesis
  • Formalization of the hypothesis can provide insight into arguments

Human irrationality

  • Rational agents make decisions that are in their best interest
  • Humans deviate from the model of rational agents
  • Johnson-Laird showed people do not use mental logic when solving problems
  • Kahneman and Tversky showed humans prefer outcomes that avoid risk
  • Hayden and Niv argue against the presence of “economic values” in the brain
  • Settling the reward hypothesis is about the expression of goals, not the behaviors that emerge in their pursuit

Multiple objectives

  • A common objection to the reward hypothesis is that reducing purpose to a single scalar seems difficult
  • Multiobjective or multi-criteria decision making has been studied extensively
  • Hausner (1953) drops continuity to generalize vNM results to multidimensional setting
  • Gábor et al. (1998) propose multi-criteria RL in MDPs
  • Miura (2022) shows multi-dimensional Markov reward functions are more expressive than scalar Markov reward functions
  • Pitis et al. (2022) prove some multi-objective problems cannot be collapsed to a scalar objective
  • Constrained MDPs are more expressive than single-reward MDPs
  • Risk-sensitive objectives can cause optimal policy to be non-Markovian
  • Axiom 5 can be violated in these settings; this can be overcome by augmenting the state using the objective goals formulation

Conclusion

  • Reward hypothesis requires assumptions and axioms
  • Assumptions 1-4 and Axioms 1-5 must be satisfied for hypothesis to hold
  • Algorithm 1 can construct reward function given preference relation
  • Temporal indifference axiom is precondition for Memoryless axiom
  • Additivity axiom states preferences remain unchanged after mixing
  • VNM utility theorem guarantees preferences have utility representation
  • Proposition 1 states if one policy has higher average reward, finite sum must eventually be larger
  • Algorithm 2 sorts outcomes according to preference relation
  • Extended axiom captures preferences expressed as reward bundles
  • Additional axiom examines temporal nature of handling multiple objectives
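
The statement of Proposition 1 listed above can be illustrated with a small numerical example. The sketch below uses two made-up reward streams: `steady` pays 1.0 at every step, while `front_loaded` pays 50.0 once and 0.5 afterwards, so it has the lower long-run average. The helper `first_step_ahead` and both streams are hypothetical names for this example only; it finds the first step at which the higher-average stream's running sum takes the lead, after which the gap keeps growing in this particular case.

```python
from typing import Callable, Optional


def first_step_ahead(
    reward_a: Callable[[int], float],
    reward_b: Callable[[int], float],
    horizon: int,
) -> Optional[int]:
    """Return the first n <= horizon where A's sum of n rewards exceeds B's, else None."""
    sum_a = 0.0
    sum_b = 0.0
    for n in range(1, horizon + 1):
        sum_a += reward_a(n)
        sum_b += reward_b(n)
        if sum_a > sum_b:
            return n
    return None


def steady(n: int) -> float:
    # Average reward 1.0 per step.
    return 1.0


def front_loaded(n: int) -> float:
    # Large first reward, then 0.5 per step (long-run average 0.5).
    return 50.0 if n == 1 else 0.5


print(first_step_ahead(steady, front_loaded, horizon=1000))  # prints 100
print(first_step_ahead(steady, front_loaded, horizon=50))    # prints None
```

With a horizon shorter than the crossover point the front-loaded stream still looks better, which is why the comparison has to be made over sufficiently long finite sums.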