Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recommender systems aim to predict what items a user will interact with next.
  • Historically, this problem has been solved using supervised learning.
  • Recently, policy optimization has been used to maximize user engagement.
  • When training a new policy, data from a previously-deployed policy is used.
  • An alternative approach is local policy improvement without off-policy correction.
  • This approach does not involve density ratios and is well suited for recommender systems.

Paper Content

Introduction

  • Recommender systems are a common component of the web.
  • The goal of the recommendation problem is to present users with unseen items they will enjoy.
  • Historically, the problem has been framed as a prediction task.
  • More recently, research has focused on viewing recommendation as a form of intervention.
  • Offline learning for decision making is connected to recommender systems, specifically sequential recommendation.
  • Notation convention: upper-case letters represent random variables, lower-case letters represent actual values.
  • Random, data-driven estimates have the form Ĵ.

Offline learning for decision making

  • Offline learning is a process of coming up with a new policy that is “good”
  • The expected reward of a policy is defined as an expectation
  • The goal of offline learning is to maximize the expected reward
  • Inverse Propensity Scoring (IPS) is used to handle policy mismatch and estimate expected reward

Sequential recommendation as offline decision making

  • Sequential recommendation is a type of recommender system
  • Users interact with a catalog of items
  • Sequential models (e.g. neural networks) are used to predict the next item a user will interact with
  • Rewards (positive and negative) are associated with user-item interactions
  • Sequential recommendation can be interpreted as an offline decision making problem
  • The goal is to maximize expected reward
  • Recent works explore this from a RL and contextual bandit setting

Local policy improvement

  • Methods rely on optimizing a lower bound of expected reward function
  • Lower bound is easy to estimate from data and does not involve density ratios
  • Proposed methodology has not been used before in recommendation systems

The contextual bandit setting

  • Contextual bandit setting used
  • Netflix Prize data used
  • Ratings of 1-2 = 0, 3 = 0.5, 4-5 = 1
  • 20,000 user sequences used for validation and test sets
  • Matrix factorization model used to impute missing ratings
  • Local policy improvement approach used
  • Logging policy estimated with maximum likelihood estimation
  • Baselines compared to: Logging, L, ips
  • Results show similar AR@1 and iAR@1
  • Local policy improvement approach can control amount of policy increments

The rl setting

  • RL setting involves a Markov Decision Process (MDP) with context/state, action, reward, discount factor, starting distribution, and state transition model.
  • Trajectories are sampled from a policy.
  • Action-value, state-value, and advantage are defined.
  • Expected (discounted) cumulative reward of a policy can be written as an equation.
  • Local policy improvement objective is derived by introducing a penalty term.
  • Estimating advantage involves Monte-Carlo methods and temporal-difference (TD) methods.
  • TD approach is used in a typical recommender system setting.

Practical considerations and recipes

  • Local policy improvement approach has not been applied to recommender systems literature.
  • Typical recommender system interaction sequences can be formulated into policy optimization perspective.
  • Sequential model (e.g. convolutional/recurrent neural network or transformer) is the backbone of the architecture.
  • Loss function can be computed over the most recent actions.

And the

  • Two separate objective functions to optimize: Llpi and Ltd
  • Multi-head architecture to parametrize the policy
  • Auxiliary prediction head optimized by td loss
  • Data batching done by breaking down input sequence into sub-sequences or by doing one forward pass

Experiments

  • Evaluating recommender systems is difficult because only feedback on recommended items is observed
  • Offline evaluation is more challenging than off-policy learning
  • AB testing is considered the gold standard for evaluating a recommendation policy
  • Evaluation in this paper is done by looking at how heldout interactions are ranked by the new policy
  • Traditional recommendation ranking metrics provide a rough guidance
  • Looking at both traditional ranking metrics and divergence between new and logging policy gives a holistic picture

The mdp setting

  • Assume user context/state follows a transition model
  • Conduct experiments on two real-world e-commerce datasets
  • Use same train/validation/test data split as provided by Xin et al.
  • Use hit rate (HR) and normalized discounted cumulative gain (nDCG) as recommendation metrics
  • Reward actions leading to purchases higher than actions leading to clicks
  • Measure JS divergence between learned policy and estimated logging policy
  • Consider self-supervised Q-learning (sqn), self-supervised Actor-Critic (sac), Policy Gradient (pg) and Off-policy Policy Gradient (ips-pg) as baselines
  • Pick best hyperparameter/epoch by pnDCGp@20 + cnDCGc@20
  • Results show tradeoff between click and purchase metrics
  • JS divergence between learned policy and estimated logging policy goes down with larger Lagrangian multiplier
  • Regularization strength on td loss has a sweet spot on purchase metrics

Conclusion