Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Recommender systems aim to predict what items a user will interact with next.
Historically, this problem has been solved using supervised learning.
Recently, policy optimization has been used to maximize user engagement.
When training a new policy, data from a previously-deployed policy is used.
An alternative approach is local policy improvement without off-policy correction.
This approach does not involve density ratios and is well suited for recommender systems.

Paper Content

Introduction

Recommender systems are a common component of the web.
The goal of the recommendation problem is to present users with unseen items they will enjoy.
Historically, the problem has been framed as a prediction task.
More recently, research has focused on viewing recommendation as a form of intervention.

Offline learning for decision making is connected to recommender systems, specifically sequential recommendation.
Notation convention: upper-case letters represent random variables, lower-case letters represent actual values.
Random, data-driven estimates have the form Ĵ.

Offline learning for decision making

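Offline learning is the process of deriving a new policy that is “good” from data logged by a previously deployed policy.
The expected reward of a policy is defined as the expectation of the reward over contexts and over the actions the policy takes.
The goal of offline learning is to find a policy that maximizes the expected reward.
Inverse Propensity Scoring (IPS) handles the mismatch between the logging policy and the new policy by reweighting logged rewards with density ratios, which gives an estimate of the expected reward.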
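To make the IPS idea concrete, here is a minimal sketch of the standard IPS estimator on a toy log. The function name `ips_estimate` and the toy numbers are assumptions made for illustration; they are not taken from the paper.

```python
def ips_estimate(rewards, new_probs, logging_probs):
    """Inverse propensity scoring estimate of a new policy's expected reward.

    rewards[i]       : reward observed for the i-th logged action
    new_probs[i]     : probability the new policy gives to that action
    logging_probs[i] : probability the logging policy gave to it (must be > 0)
    """
    if not (len(rewards) == len(new_probs) == len(logging_probs)):
        raise ValueError("all inputs must have the same length")
    if not rewards:
        raise ValueError("at least one logged interaction is required")
    total = 0.0
    for r, p_new, p_log in zip(rewards, new_probs, logging_probs):
        if p_log <= 0.0:
            raise ValueError("logging propensities must be positive")
        # The density ratio p_new / p_log corrects for the policy mismatch.
        total += (p_new / p_log) * r
    return total / len(rewards)


# Hypothetical toy log with four interactions.
rewards = [1.0, 0.0, 1.0, 0.5]
logging_probs = [0.5, 0.25, 0.25, 0.5]
new_probs = [0.5, 0.25, 0.5, 0.25]
estimate = ips_estimate(rewards, new_probs, logging_probs)
```

When the new policy equals the logging policy, every ratio is 1 and the estimate reduces to the average logged reward. Small logging propensities inflate the ratios, which is the usual source of high variance in this estimator.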

Sequential recommendation as offline decision making

Sequential recommendation is a type of recommender system.
Users interact with a catalog of items.
Sequential models (e.g. neural networks) are used to predict the next item a user will interact with.
Rewards (positive and negative) are associated with user-item interactions.
Sequential recommendation can be interpreted as an offline decision making problem.
The goal is to maximize the expected reward.
Recent works explore this from an RL and a contextual bandit perspective.

Local policy improvement

The methods optimize a lower bound of the expected reward function.
The lower bound is easy to estimate from data and does not involve density ratios.
The proposed methodology has not been used before in recommender systems.

The contextual bandit setting

The first set of experiments uses the contextual bandit setting.
The Netflix Prize data is used.
Ratings are mapped to rewards: ratings of 1-2 map to 0, a rating of 3 maps to 0.5, and ratings of 4-5 map to 1.
20,000 user sequences are used for the validation and test sets.
A matrix factorization model is used to impute missing ratings.
The local policy improvement approach is used to learn the new policy.
The logging policy is estimated with maximum likelihood estimation.
Baselines for comparison: Logging, L, and ips.
Results show similar AR@1 and iAR@1.
The local policy improvement approach can control the size of the policy increments.
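The rating-to-reward mapping described above can be sketched as a small function. The name `rating_to_reward` is a hypothetical helper, and the choice to reject ratings outside 1-5 is an assumption.

```python
def rating_to_reward(rating):
    """Map a 1-5 star rating to a reward in {0, 0.5, 1}."""
    if rating in (1, 2):
        return 0.0
    if rating == 3:
        return 0.5
    if rating in (4, 5):
        return 1.0
    # Assumption: anything outside the 1-5 scale is treated as invalid input.
    raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")


# Hypothetical ratings from one user sequence.
rewards = [rating_to_reward(r) for r in [5, 3, 1, 4, 2]]
```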

The RL setting

RL setting involves a Markov Decision Process (MDP) with context/state, action, reward, discount factor, starting distribution, and state transition model.
Trajectories are sampled from a policy.
Action-value, state-value, and advantage are defined.
Expected (discounted) cumulative reward of a policy can be written as an equation.
Local policy improvement objective is derived by introducing a penalty term.
The advantage can be estimated with Monte-Carlo methods or temporal-difference (TD) methods.
The TD approach is used in a typical recommender system setting.
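As an illustration of the two estimation styles mentioned above, the sketch below computes a Monte-Carlo discounted return and textbook one-step TD advantage estimates. The function names and toy numbers are assumptions, and the state values are supplied by hand here; in practice they would come from a learned value function.

```python
def discounted_return(rewards, gamma):
    """Monte-Carlo style return: sum over t of gamma**t * rewards[t]."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_advantages(rewards, values, gamma):
    """One-step TD advantage estimates: r_t + gamma * V(s_{t+1}) - V(s_t).

    values must have len(rewards) + 1 entries; the last entry is the value
    of the state reached after the final action (0.0 if it is terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values needs exactly one more entry than rewards")
    return [r + gamma * values[t + 1] - values[t]
            for t, r in enumerate(rewards)]


# Hypothetical trajectory with hand-picked state values.
ret = discounted_return([1.0, 0.0, 1.0], gamma=0.5)
adv = td_advantages([1.0, 0.0], [0.5, 0.2, 0.0], gamma=0.5)
```

The Monte-Carlo return needs the whole remaining trajectory, while the TD estimate only needs the next state's value, which is one reason TD is convenient when interaction sequences are logged as short windows.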

Practical considerations and recipes

The local policy improvement approach has not previously been applied in the recommender systems literature.
Typical recommender system interaction sequences can be formulated from a policy optimization perspective.
A sequential model (e.g. a convolutional or recurrent neural network, or a transformer) is the backbone of the architecture.
The loss function can be computed over the most recent actions.

And the

There are two separate objective functions to optimize: L_lpi and L_td.
A multi-head architecture is used to parametrize the policy.
An auxiliary prediction head is optimized by the TD loss.
Data batching is done either by breaking the input sequence down into sub-sequences or by doing one forward pass over the whole sequence.

Experiments

Evaluating recommender systems is difficult because feedback is only observed on recommended items.
Offline evaluation is more challenging than off-policy learning.
A/B testing is considered the gold standard for evaluating a recommendation policy.
Evaluation in this paper is done by looking at how held-out interactions are ranked by the new policy.
Traditional recommendation ranking metrics provide rough guidance.
Looking at both traditional ranking metrics and the divergence between the new policy and the logging policy gives a holistic picture.

The MDP setting

The user context/state is assumed to follow a transition model.
Experiments are conducted on two real-world e-commerce datasets.
The train/validation/test data split is the same as the one provided by Xin et al.
Hit rate (HR) and normalized discounted cumulative gain (nDCG) are used as recommendation metrics.
Actions leading to purchases receive a higher reward than actions leading to clicks.
The JS divergence between the learned policy and the estimated logging policy is measured.
Self-supervised Q-learning (sqn), self-supervised Actor-Critic (sac), Policy Gradient (pg) and Off-policy Policy Gradient (ips-pg) are considered as baselines.
The best hyperparameter/epoch is picked by the sum of purchase nDCG@20 and click nDCG@20.
Results show a tradeoff between click and purchase metrics.
The JS divergence between the learned policy and the estimated logging policy goes down as the Lagrangian multiplier gets larger.
The regularization strength on the TD loss has a sweet spot for purchase metrics.
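For reference, the metrics named above can be sketched for the common case of a single held-out item per evaluation step. The function names are hypothetical, the single-relevant-item form of nDCG is an assumption about the evaluation protocol, and the JS divergence is computed with the natural logarithm.

```python
import math


def hit_rate_at_k(ranked_items, heldout_item, k):
    """1.0 if the held-out item appears in the top k, else 0.0."""
    return 1.0 if heldout_item in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, heldout_item, k):
    """nDCG@k with one relevant item: the ideal DCG is 1, so the score
    is 1 / log2(rank + 1) when the item is in the top k, else 0."""
    top_k = ranked_items[:k]
    if heldout_item not in top_k:
        return 0.0
    rank = top_k.index(heldout_item) + 1
    return 1.0 / math.log2(rank + 1)


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical ranking and action distributions over a three-item catalog.
hr = hit_rate_at_k([3, 1, 2], heldout_item=1, k=2)
ndcg = ndcg_at_k([3, 1, 2], heldout_item=1, k=2)
js_same = js_divergence([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

The JS divergence is 0 for identical policies and reaches its maximum of log 2 (in nats) for policies with disjoint support, so it gives a bounded measure of how far a learned policy has moved from the estimated logging policy.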

Local Policy Improvement for Recommender Systems
