Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Studies offline policy learning, which aims to learn an optimal decision rule for a given population
  • Existing methods rely on a uniform overlap assumption, which can be unrealistic in many situations
  • Proposed algorithm optimizes lower confidence bounds of policy values
  • Data-dependent upper bound for suboptimality of algorithm only depends on overlap for optimal policy and complexity of policy class
  • New self-normalized type concentration inequality developed for inverse-propensity-weighting estimators

Paper Content

Introduction

  • Policy learning aims to create an optimal decision rule for a given population
  • Policy learning is used in healthcare, advertising, and recommendation systems
  • Data is collected by executing a fixed and potentially suboptimal policy
  • Data is also collected by running an adaptive learning algorithm
  • Policy learning involves two components: policy evaluation and policy optimization
  • A good performance of policy learning relies on accurate evaluation for all policies
  • Uniform overlap assumption is often violated in practice
  • Fundamental challenge of policy learning is counterfactual reasoning
  • Estimation of policy value can have a huge variance if evaluated at out-of-distribution actions
  • Quality of assessment of policy value varies across policies
  • Need to quantify statistical confidence in policy value and incorporate into learning process

A pessimism-based framework

  • Proposed novel solution to offline policy learning with two ingredients: pessimism principle and generalized empirical Bernstein inequality
  • Algorithm optimizes confidence lower bounds (LCBs) instead of point estimates of policy values
  • Generic data-dependent upper bound for performance of algorithm
  • Complexity measure of policy and estimation error of optimal policy in upper bound
  • Oracle property of approach: adapts to overlap of optimal policy in data
  • Developed uniform concentration inequality for offline policy evaluation
  • Standard O(1/ √ T ) rate for learning performance when e(X, π * (X)) is lower bounded by a constant
  • Polynomial learning rate when propensity for optimal policy is decaying at a polynomial rate

Background

  • X and Y denote the space of contexts and rewards respectively
  • A contextual bandit model C is specified by an action set A, a context distribution PX and a set of reward distributions
  • Policy learning is done from an offline dataset collected from C

Data collecting process

  • Context Xt is drawn from PX at each time t
  • Action A t is taken according to behavior policy
  • Reward Yt is µ(Xt, At) + ǫt
  • Behavior policy is potentially adaptive
  • A t is treatment option taken for Xt
  • Yt(a) follows population joint distribution
  • Conditional independence condition: Yt(a) ⊥ ⊥ A t | X t , H t
  • Bounded reward: Yt ∈ [0, 1]
  • Known behavior policy: e t (x, a | H t ) for all (x, a) ∈ X ×A are known to learner

Policy learning and performance metric

  • Π is a collection of policies, each of which is a mapping from X to A
  • The average policy value of a policy π in Π is defined using the contextual bandit model C
  • The optimal policy π* maximizes the policy value
  • The goal is to learn a policy π and measure the learning performance by its suboptimality

Policy class complexity

  • Natarajan dimension is used to quantify the complexity of a policy class
  • Natarajan dimension is a generalization of the VC dimension
  • Finite upper bounds for the Natarajan dimension have been established for many policy classes
  • Policy learning from a fixed class is an area at the intersection of economics, causal inference and statistical machine learning.
  • Classical works focus on i.i.d. data collected by a fixed behavior policy that is known or unknown and estimable.
  • These works all assume a uniform overlap condition, i.e., the propensity for all actions in the behavior policy is uniformly lower bounded by a constant.
  • When the behavior policy is known, the uniform overlap assumption is not necessary for efficient policy learning.
  • There is a recent interest in policy learning from adaptively collected data.
  • This setting is more challenging because the observations are no longer i.i.d.
  • To tackle these challenges, existing works assume the propensities for all actions are deterministically lower bounded over time.
  • We introduce a novel algorithm to policy learning and show that the overlap of the optimal policy is sufficient for efficient policy learning.
  • Pessimism was initially proposed in offline reinforcement learning for learning an optimal policy in Markov decision processes.
  • Variance regularization has also appeared in statistical learning and empirical risk minimization.
  • Our method extends the setting in this line of work to unbounded random variables and adaptively collected hence mutually dependent data.
  • We develop a novel self-normalized-type concentration inequality for empirical risks.
  • Our work is generally related to reweighting-based approaches for off-policy evaluation in offline RL and contextual bandits.
  • We develop a novel empirical Bernstein’s inequality to provide uncertainty quantification for inverse-weighted estimators.

Pessimistic policy learning

  • Introducing the pessimism principle to eliminate reliance on uniform overlap assumption
  • Optimizing pessimistic policy value with policy-dependent regularization term
  • Proposition 3.1 states that suboptimality of policy only depends on regularization term of optimal policy
  • Learning performance guarantee and uniform concentration event differ from greedy algorithms in literature
  • Seek data-dependent bound for every policy in optimization problem to improve upper bound

Construction of the estimator and the regularizer

  • Instantiate generic idea from Section 3.2 into concrete algorithm
  • Estimator Q(π) and regularizer R(π) for all π ∈ Π
  • Estimator is AIPW-type estimator
  • Regularizer is β • V (π)
  • Scaling factor β depends on complexity of policy class Π
  • V (π) is proxy for standard deviation of Q(π)

Learning from a fixed behavior policy

  • Fixed behavior policy used to collect offline data
  • Upper and lower bounds on suboptimality of pessimistic learning algorithm

Suboptimality upper bound

  • Theorem 5.2 provides an upper bound for the suboptimality of pessimistic policy learning
  • Assumption 5.1 is required for the theorem to hold
  • The upper bound is data-dependent and small as long as the value of the optimal policy can be estimated
  • Assumption 5.1 is mild and does not require any assumptions on the adaptive data collecting process
  • Corollary 5.6 provides a polynomial upper bound on the suboptimality of pessimistic policy learning
  • The upper bound improves upon the results developed in other papers
  • The upper bound is efficient even when the assumptions in greedy learning do not hold

Pessimistic policy learning is minimax optimal

  • Pessimistic policy learning is the best effort given a fixed behavior policy
  • Theorem 4.4 establishes a minimax lower bound for policy learning in a policy class
  • Theorem 4.4 matches the upper bound in Corollary 4.3 up to logarithm factors of T and K
  • Pessimistic policy learning is the best effort given offline data from a fixed behavior policy

Learning from adaptive behavior policies

  • Pessimistic policy learning from adaptively collected data allows the behavior policy to change over time.
  • An additional mild condition is imposed that the sampling probability has a deterministic lower bound.

Minimax optimality

  • Investigated minimax optimality of pessimistic policy learning when data is adaptively collected
  • Worked under polynomial decay setting for clear interpretability
  • Fundamental difficulty of policy learning with adaptive behavior policy is captured by sampling propensities for optimal policy
  • Minimax lower bound matches upper bound, illustrating key role of overlap of optimal policy in fundamental limit of learning from offline dataset

Proof sketch of main results

  • Proof sketches are provided for the main results
  • Section 6.1 focuses on the upper bound in the fixed policy case
  • Section 6.2 focuses on the upper bound in the adaptive policy case
  • Notation is simplified to omit dependence on C when context is clear

Proof of upper bound for the fixed policy case

  • Estimation error for any (x, a) can be decomposed into two terms
  • Symmetrization via Rademacher process to bound term (i)
  • Lemma 6.1 bounds probability of term (i) exceeding a constant
  • Lemma 6.2 establishes uniform upper bound for tail probability of Rademacher process
  • Lemma 6.3 controls term (ii)
  • Combining results from Lemmas 6.1, 6.2 and 6.3, Theorem 4.1 is proven

Proof of upper bound for the adaptive policy case

  • Triangle inequality used to control terms (i) and (ii)
  • Decoupled tangent sequence used to control term (i)
  • Symmetrization used to move supremum over π out of probability
  • Tail bound of term (i) turned into tree Rademacher process
  • Self-normalized concentration inequality used to control tail bound
  • Lemma 6.8 used to control term (ii)
  • Union bound used to combine terms (i) and (ii)
  • Result used to prove Theorem 5.2

Discussion

  • Introduce new algorithm based on pessimism principle for policy learning from offline data
  • Algorithm promises efficient learning as long as optimal policy is sufficiently covered by data set
  • Developed generalized empirical Bernstein inequality for estimators with unbounded empirical losses
  • Extension to ERM, efficient and scalable algorithms, and sequential decision making

A cross-fitted pessimistic policy learning

  • Algorithm provides a cross-fitted version for fixed behavior policy case
  • Index set is randomly split into two disjoint folds
  • Estimator µ (k) is used to obtain an estimator for µ(x, a)
  • Estimation error of Q(π) can be bounded
  • Regularization term is constructed
  • Uniform concentration results can be obtained

B proof of fixed-policy results

  • Event E is defined
  • Symmetrization trick using Rademacher random variables is used to control probability of E
  • Lemma B.1 is used to control deviation of 1
  • Union bound is used to control probability of E
  • Tower property is used to bound probability of E
  • Hoeffding’s inequality is used to bound probability of E
  • Natarajan’s lemma is used to bound probability of E
  • Markov’s inequality is used to bound probability of E
  • Exchangeability between {A t , Y t } T t=1 and is used
  • Bernstein’s inequality is used to bound probability of E
  • Boundedness of Q(π) is used
  • Boundedness of e(X t , π * (X t )) is used
  • S is N-shattered by Π
  • For any v ∈ V, there exists some π * v
  • Fixed behavior policy is defined
  • Total-Variation distance is used
  • KL divergence is bounded
  • Union bound is used

C proof of adaptive results

  • Define event of interest
  • Define two additional events
  • Lemma C.1 controls conditional probability of E 1 and E 2
  • Bound left-handed side
  • Recursive symmetrization for t = T
  • Recursive symmetrization for t = T - 1
  • Continuing recursive symmetrization
  • Proof of Lemma 6.7
  • Proof of Lemma 6.8
  • Proof of Corollary 5.6
  • Proof of Theorem 5.7
  • Define event of interest
  • Define two additional events
  • Lemma C.1 controls conditional probability of E 1 and E 2
  • Bound left-handed side
  • Recursive symmetrization for t = T
  • Recursive symmetrization for t = T - 1
  • Continuing recursive symmetrization
  • Proof of Lemma 6.7
  • Proof of Lemma 6.8
  • Proof of Corollary 5.6
  • Proof of Theorem 5.7
  • Define event of interest
  • Define two additional events
  • Lemma C.1 controls conditional probability of E 1 and E 2
  • Bound left-handed side
  • Recursive symmetrization for t = T
  • Recursive symmetrization for t = T - 1
  • Continuing recursive symmetrization
  • Proof of Lemma 6.7
  • Proof of Lemma 6.8
  • Proof of Corollary 5.6
  • Proof of Theorem 5.7
  • Define event of interest
  • Define two additional events
  • Lemma C.1 controls conditional probability of E 1 and E 2
  • Bound left-handed side
  • Recursive symmetrization for t = T
  • Recursive symmetrization for t = T - 1
  • Continuing recursive symmetrization
  • Proof of Lemma 6.7
  • Proof of Lemma 6.8
  • Proof of Corollary 5.6
  • Proof of Theorem 5.7
  • Define event of interest
  • Define two additional events
  • Lemma C.1 controls conditional probability of E 1 and E 2
  • Bound left-handed side
  • Recursive symmetrization for t = T
  • Recursive symmetrization for t = T - 1
  • Continuing recursive symmetrization
  • Proof of Lemma 6.7
  • Proof of Lemma 6.8
  • Proof of Corollary 5.6
  • Proof of Theorem 5.7
  • Define event of interest
  • Define two additional events
  • Lemma C.1 controls conditional probability of E 1 and E 2
  • Bound left-handed side
  • Recursive symmetrization for t = T
  • Recursive symmetrization for t = T - 1
  • Continuing recursive symmetrization
  • Proof of Lemma 6.7
  • Proof of Lemma 6.8
  • Proof of Corollary 5.6
  • Proof of Theorem 5.7
  • Define event of interest
  • Define two additional events
  • Lemma C.1 controls conditional probability of E 1 and E 2
  • Bound left-handed side
  • Recursive symmetrization for t = T
  • Recursive symmetrization for t = T - 1
  • Continuing recursive symmetrization
  • Proof of Lemma 6.7
  • Proof of Lemma 6.8
  • Proof of Corollary 5.6
  • Proof of Theorem 5.7
  • Define event of interest
  • Define two additional events
  • Lemma C.1 controls conditional probability of E 1 and E 2
  • Bound left-handed side
  • Recursive symmetrization for t = T
  • Recursive symmetrization for t = T - 1
  • Continuing recursive symmetrization
  • Proof of Lemma 6.7
  • Proof of Lemma 6.8
  • Proof of Corollary 5.6
  • Proof of Theorem 5.7
  • Define event of interest
  • Define two additional events
  • Lemma C.1 controls conditional probability of E 1 and E 2
  • Bound left-handed side
  • Recursive symmetrization for t = T
  • Recursive symmetrization for t = T - 1
  • Continuing recursive symmetrization
  • Proof of Lemma 6.7
  • Proof of Lemma 6.8
  • Proof of Corollary 5.6
  • Proof of Theorem 5.7
  • Define event of interest
  • Define two additional events
  • Lemma C.1 controls conditional probability of E 1 and E 2
  • Bound left-handed side
  • Recursive symmetrization for t = T
  • Recursive symmetrization for t = T - 1
  • Continuing recursive symmetrization
  • Proof of Lemma 6.7
  • Proof of Lemma 6.8
  • Proof of Corollary 5.6
  • Proof of Theorem 5.7
  • Define event of interest
  • Define two additional events
  • Lemma C.1 controls conditional probability of E 1 and E 2
  • Bound left-handed side
  • Recursive symmetrization for t = T
  • Recursive symmetrization for t = T - 1
  • Continuing recursive symmetrization
  • Proof of Lemma 6.7
  • Proof of Lemma 6.8
  • Proof of Corollary 5.6
  • Proof of Theorem 5.7
  • Define event of interest
  • Define two additional events
  • Lemma C.1 controls conditional probability of E 1 and E 2
  • Bound left-handed side
  • Recursive symmetrization for t = T
  • Recursive symmetrization for t = T - 1
  • Continuing recursive symmetrization
  • Proof of Lemma 6.7
  • Proof of Lem…

E auxiliary lemmas

  • Natarajan’s lemma characterizes the number of distinct realizations of a policy
  • Lemma E.1 provides a uniform bound over a finite set with polynomially growing size
  • Lemma E.2 is a self-normalized concentration inequality
  • Lemma E.3 is Hoeffding’s inequality
  • Lemma E.4 is Freedman’s inequality, with Bernstein’s inequality as a special case
  • Lemma B.1 provides a bound on the expected regret