Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recommender systems aim to predict what items a user will interact with next.
- Historically, this problem has been solved using supervised learning.
- Recently, policy optimization has been used to maximize user engagement.
- When training a new policy, data from a previously-deployed policy is used.
- An alternative approach is local policy improvement without off-policy correction.
- This approach does not involve density ratios and is well suited for recommender systems.
Paper Content
Introduction
- Recommender systems are a common component of the web.
- The goal of the recommendation problem is to present users with unseen items they will enjoy.
- Historically, the problem has been framed as a prediction task.
- More recently, research has focused on viewing recommendation as a form of intervention.
Background and related work
- Offline learning for decision making is connected to recommender systems, specifically sequential recommendation.
- Notation convention: upper-case letters represent random variables, lower-case letters represent actual values.
- Random, data-driven estimates have the form Ĵ.
Offline learning for decision making
- Offline learning is a process of coming up with a new policy that is “good”
- The expected reward of a policy is defined as an expectation
- The goal of offline learning is to maximize the expected reward
- Inverse Propensity Scoring (IPS) is used to handle policy mismatch and estimate expected reward
Sequential recommendation as offline decision making
- Sequential recommendation is a type of recommender system
- Users interact with a catalog of items
- Sequential models (e.g. neural networks) are used to predict the next item a user will interact with
- Rewards (positive and negative) are associated with user-item interactions
- Sequential recommendation can be interpreted as an offline decision making problem
- The goal is to maximize expected reward
- Recent works explore this from a RL and contextual bandit setting
Local policy improvement
- Methods rely on optimizing a lower bound of expected reward function
- Lower bound is easy to estimate from data and does not involve density ratios
- Proposed methodology has not been used before in recommendation systems
The contextual bandit setting
- Contextual bandit setting used
- Netflix Prize data used
- Ratings of 1-2 = 0, 3 = 0.5, 4-5 = 1
- 20,000 user sequences used for validation and test sets
- Matrix factorization model used to impute missing ratings
- Local policy improvement approach used
- Logging policy estimated with maximum likelihood estimation
- Baselines compared to: Logging, L, ips
- Results show similar AR@1 and iAR@1
- Local policy improvement approach can control amount of policy increments
The rl setting
- RL setting involves a Markov Decision Process (MDP) with context/state, action, reward, discount factor, starting distribution, and state transition model.
- Trajectories are sampled from a policy.
- Action-value, state-value, and advantage are defined.
- Expected (discounted) cumulative reward of a policy can be written as an equation.
- Local policy improvement objective is derived by introducing a penalty term.
- Estimating advantage involves Monte-Carlo methods and temporal-difference (TD) methods.
- TD approach is used in a typical recommender system setting.
Practical considerations and recipes
- Local policy improvement approach has not been applied to recommender systems literature.
- Typical recommender system interaction sequences can be formulated into policy optimization perspective.
- Sequential model (e.g. convolutional/recurrent neural network or transformer) is the backbone of the architecture.
- Loss function can be computed over the most recent actions.
And the
- Two separate objective functions to optimize: Llpi and Ltd
- Multi-head architecture to parametrize the policy
- Auxiliary prediction head optimized by td loss
- Data batching done by breaking down input sequence into sub-sequences or by doing one forward pass
Experiments
- Evaluating recommender systems is difficult because only feedback on recommended items is observed
- Offline evaluation is more challenging than off-policy learning
- AB testing is considered the gold standard for evaluating a recommendation policy
- Evaluation in this paper is done by looking at how heldout interactions are ranked by the new policy
- Traditional recommendation ranking metrics provide a rough guidance
- Looking at both traditional ranking metrics and divergence between new and logging policy gives a holistic picture
The mdp setting
- Assume user context/state follows a transition model
- Conduct experiments on two real-world e-commerce datasets
- Use same train/validation/test data split as provided by Xin et al.
- Use hit rate (HR) and normalized discounted cumulative gain (nDCG) as recommendation metrics
- Reward actions leading to purchases higher than actions leading to clicks
- Measure JS divergence between learned policy and estimated logging policy
- Consider self-supervised Q-learning (sqn), self-supervised Actor-Critic (sac), Policy Gradient (pg) and Off-policy Policy Gradient (ips-pg) as baselines
- Pick best hyperparameter/epoch by pnDCGp@20 + cnDCGc@20
- Results show tradeoff between click and purchase metrics
- JS divergence between learned policy and estimated logging policy goes down with larger Lagrangian multiplier
- Regularization strength on td loss has a sweet spot on purchase metrics