Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Pretrained large language models have been used to create social chatbots.
- These chatbots need to be engaging to retain users.
- This work proposes using human feedback to develop highly engaging chatbots.
- Evaluation metrics are used to measure the level of engagement.
- A/B testing shows an increase in user retention of up to 30%.
- Future work aims to use the reward model to fine-tune the language model.
Paper Content
Introduction
- PrLMs have enabled systems to perform language tasks with humanlike proficiency
- Dialogue generation is one such task
- PrLM-generated responses may not always be engaging
- This paper focuses on developing social chatbots that prioritize user engagement
- Human feedback is a promising and effective method to align systems with human intent
- This paper proposes an efficient approach for collecting feedback using automatic pseudo-labels
- Evaluation metrics measure how engaging deployed chatbots are
- Proposed method increases retention of GPT-J 6B model by more than 30%
- Anonymised user conversations and pseudo labels released publicly
Related work
- Chatbots and dialogue systems are designed for many applications
- This work focuses on chatbots for chitchat, with the objective of providing user entertainment and engagement
- Early social chatbots used rule-based methods, followed by retrieval-based models and generative models
- Recently, transformer-based designs have been used to develop more sophisticated social chatbots
- Incorporating human feedback in chatbot development has been shown to be an effective method
- Traditional method of assessing chatbots is through direct human assessment or text overlap based metrics
- Alternative evaluation metrics such as user retention and average conversational length are better suited for the properties of interest
Evaluating chit chat bots
- Chatbots used for task-oriented dialogue systems and virtual assistants
- This paper focuses on building entertaining social chatbots
- Optimizing for user engagement and entertainment
- Text overlap scores are not effective evaluation metrics
- Human evaluation is subjective, expensive and hard to scale
- Clear signals exist to demonstrate quality of chatbot
- Metrics based on assumption that captivating responses lead to longer conversations and higher user retention
Engagement evaluation metrics
- Mean Conversation Length measures average number of user queries per conversation session
- Retry Rate is fraction of system responses that user requested to regenerate
- User Star Rating is fraction of star ratings by users who responded to survey
- Retention Rate is fraction of users that engage with chatbot on Xth day after first conversation
Method
- Finetune PrLM on conversational and literary data
- Train reward model to determine if response is engaging and enjoyable
- Leverage reward model at inference time to produce captivating responses
Model fine-tuning
- PrLMs are trained on a variety of text sources.
- Finetuning is the standard to achieve better performance on specific tasks and text domains.
- This work finetunes decoder-based pretrained large language models on entertaining text domains, e.g. literature.
Reward modelling
- Aim is to incorporate human feedback in design of more engaging chatbots
- Proposed reward model learns how engaging a response is
- Retention experiment allows users to interact with different characters
- Manual annotations can be expensive and laborious to collect
- Three different pseudo-labelling strategies to measure engagement
Best-of-n sample rejection
- Trained reward model can be used to improve engagement of finetuned chatbot model
- Approach to use Proximal Policy Optimization (PPO) with reinforcement learning to update chatbot model weights
- Instead of updating chatbot model parameters, best of N sample rejection is applied
Experimental set-up
- Chatbot model used is GPT-J 6B
- Reward model based on GPT-2 small, medium, large and extra-large models
- Reward model fine-tuned to predict labels
- Chai user response dataset used to train reward model
- A/B experiments run to evaluate effectiveness of reward models
- Retention evaluation run for systems with better performance than previous best reward model
Results
Predicting conversation continuation
- Trained a reward model to predict if user sends at least one more message after chatbot response
- Binary classification problem, fed into dense layer with sigmoid activation
- Evaluated performance in production by sampling 4 responses from chatbot
- Metric was percentage improvement in MCL over serving chatbot without reward model
- GPT-2 reward model given last three user and chatbot turns as input
- Choice to be made in left- or right-padding input to reward model
- A/B experiment comparing chatbot with/without reward model
- GPT-2 small reward model trained on 90K rows
- Right-padded context gave significant improvement
- 3 A/B experiments studying scaling of MCL improvement with size of dataset
- All experiments used best-of-four rejection sampling
- MCL improvement grows linearly with log of number of rows
- Increasing dataset size by factor of 10 results in +11.4% increase in MCL
- Larger context window and predicting user responds at least 2 more times results in +14.0% increase in MCL
Adjusting hyper-parameters of gpt-2 reward models
- Three A/B experiments were conducted to explore the impact of hyper-parameters on the reward model
- Increasing the context given to the reward model by 256 tokens resulted in a 14% increase in MCL improvement
- Increasing the number of samples from 4 to 8 and 8 to 16 resulted in significant improvements of 6.54% and 6.90% respectively
- Increasing the number of further user messages from 1 to 2 was enough to prevent undesirable PPO behaviour
- Doubling the number of samples causes the MCL to improve by 6.72% but doubles the inference cost
- Adding one second of latency decreased the MCL by 3.01% and two seconds by 6.10%
- Each additional sample decreases the MCL improvement by 0.078%
Did the user retry the message?
- Chatbot interfaces often have the option to retry.
- Experimented with training reward model to predict user retry instead of conversation continuation.
- Trained GPT-2 large model on different dataset sizes.
- Ran A/B test comparing reward models to each other and to chatbot without reward model.
- Found that predicting retry results in worse MCL improvement than predicting conversation continuation.
Predicting whether the conversation continues without retrying the response
- Previous experiment found small improvement in MCL when predicting user retry
- Explored whether predicting user retry leads to improved retention
- Ran A/B test of 3 reward models plus baseline over 30 days
- Retention improved linearly with log of days since start of experiment
- Predicting conversation continuation increased retention by 12.1%, not retrying by 24.7%, both by 30.3%
Predicting the user rating
- Reward model can predict whether a chatbot response will receive a rating of two or more stars, three or more stars, or four stars.
- A/B experiment found that two stars or more reward model improved MCL by +8.70 ± 2.54%, three stars or more reward model improved MCL by +9.81 ± 2.57%, and four stars reward model improved MCL by +1.24 ± 2.47%.
- More important to avoid low-scoring one star or two star messages than to prioritize four star messages.
Reward model generalisation
- Explored using GPT-2 small reward model to rank responses of GPT-J chatbot
- Investigated larger GPT-2 reward models and differently fine-tuned GPT-J chatbot
- Found increasing number of parameters by factor of 10 gives +5.0% MCL improvement
- Increasing dataset size by factor of 10 gives +11.4% MCL improvement
- Pygmalion GPT-J chatbot gave +16.4% MCL improvement without reward model
- GPT-2 small reward model improved performance of fine-tuned GPT-J by +36.87%
- Pygmalion GPT-J improved to +54.33% with reward model
Conclusion
- Developing chatbots that are highly engaging and entertaining
- Training reward models with human feedback
- Leads to longer average user interactions and higher user retention
- Intuitive evaluation metrics: mean conversation length and user retention
- Pseudo labels to identify captivating responses
- GPT-J 6B language model increases user retention by over 30%
- Distribution of conversation lengths follows a decaying power-law
- Mean conversation length defined as mean of conversations up to 100 messages
- Percentage improvement of MCL relative to not having a reward model
- Distribution of user ratings of chatbot messages is top-heavy