Abstract

  • Pretrained large language models have been used to create social chatbots.
  • These chatbots need to be engaging to retain users.
  • This work proposes using human feedback to develop highly engaging chatbots.
  • Evaluation metrics are used to measure the level of engagement.
  • A/B testing shows an increase in user retention of more than 30%.
  • Future work aims to use the reward model to fine-tune the language model.

Paper Content

Introduction

  • Pretrained large language models (PrLMs) have enabled systems to perform language tasks with humanlike proficiency
  • Dialogue generation is one such task
  • PrLM-generated responses may not always be engaging
  • This paper focuses on developing social chatbots that prioritize user engagement
  • Human feedback is a promising and effective method to align systems with human intent
  • This paper proposes an efficient approach for collecting feedback using automatic pseudo-labels
  • Evaluation metrics measure how engaging deployed chatbots are
  • The proposed method increases retention of a GPT-J 6B model by more than 30%
  • Anonymised user conversations and pseudo-labels are released publicly
  • Chatbots and dialogue systems are designed for many applications
  • This work focuses on chatbots for chitchat, with the objective of providing user entertainment and engagement
  • Early social chatbots used rule-based methods, followed by retrieval-based models and generative models
  • Recently, transformer-based designs have been used to develop more sophisticated social chatbots
  • Incorporating human feedback in chatbot development has been shown to be an effective method
  • Traditional methods of assessing chatbots are direct human assessment and text-overlap-based metrics
  • Alternative evaluation metrics such as user retention and average conversation length are better suited to the properties of interest

Evaluating chit chat bots

  • Chatbots used for task-oriented dialogue systems and virtual assistants
  • This paper focuses on building entertaining social chatbots
  • Optimizing for user engagement and entertainment
  • Text overlap scores are not effective evaluation metrics
  • Human evaluation is subjective, expensive and hard to scale
  • For deployed chatbots, user behaviour provides clear signals of chatbot quality
  • The metrics rest on the assumption that captivating responses lead to longer conversations and higher user retention

Engagement evaluation metrics

  • Mean Conversation Length measures average number of user queries per conversation session
  • Retry Rate is fraction of system responses that user requested to regenerate
  • User Star Rating is fraction of star ratings by users who responded to survey
  • Retention Rate is fraction of users that engage with chatbot on Xth day after first conversation
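
The metric definitions above can be sketched in code. This is a minimal illustration, not the paper's implementation: the log layout (per-session query counts, per-response retry flags, per-user sets of active day offsets) and all function names are assumptions. User Star Rating is left out because its definition here is too terse to pin down.

```python
from typing import Dict, List, Set


def mean_conversation_length(queries_per_session: List[int]) -> float:
    """Average number of user queries per conversation session."""
    if not queries_per_session:
        raise ValueError("need at least one session")
    return sum(queries_per_session) / len(queries_per_session)


def retry_rate(was_retried: List[bool]) -> float:
    """Fraction of system responses the user asked to regenerate."""
    if not was_retried:
        raise ValueError("need at least one response")
    return sum(was_retried) / len(was_retried)


def retention_rate(active_days: Dict[str, Set[int]], day: int) -> float:
    """Fraction of users active on the given day after their first conversation.

    active_days maps a (hypothetical) user id to the set of day offsets,
    counted from that user's first conversation (day 0), on which the
    user sent at least one message.
    """
    if not active_days:
        raise ValueError("need at least one user")
    retained = sum(1 for days in active_days.values() if day in days)
    return retained / len(active_days)


if __name__ == "__main__":
    # Toy values for illustration only, not data from the paper.
    print(mean_conversation_length([2, 4, 6]))
    print(retry_rate([True, False, False, False]))
    print(retention_rate({"u1": {0, 1}, "u2": {0}}, day=1))
```

Each function takes pre-aggregated inputs, so the same code could serve both arms of an A/B test.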

Method

  • Finetune PrLM on conversational and literary data
  • Train reward model to determine if response is engaging and enjoyable
  • Leverage reward model at inference time to produce captivating responses

Model fine-tuning

  • PrLMs are trained on a variety of text sources.
  • Finetuning is the standard to achieve better performance on specific tasks and text domains.
  • This work finetunes decoder-based pretrained large language models on entertaining text domains, e.g. literature.

Reward modelling

  • Aim is to incorporate human feedback in design of more engaging chatbots
  • Proposed reward model learns how engaging a response is
  • Retention experiment allows users to interact with different characters
  • Manual annotations can be expensive and laborious to collect
  • Three different pseudo-labelling strategies to measure engagement
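
Pseudo-labels of this kind can be derived from logged user behaviour without manual annotation. The sketch below uses the signals discussed in the Results section (conversation continuation, retry, star rating). The `ResponseLog` schema, the field names and the 1 to 4 star scale are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseLog:
    """One chatbot response plus what the user did next (hypothetical schema)."""
    further_user_messages: int          # user messages sent after this response
    retried: bool                       # user asked to regenerate this response
    star_rating: Optional[int] = None   # assumed 1-4 stars, None if unrated


def label_continuation(log: ResponseLog, min_messages: int = 1) -> int:
    """1 if the user sent at least min_messages more messages, else 0."""
    return int(log.further_user_messages >= min_messages)


def label_no_retry(log: ResponseLog) -> int:
    """1 if the user kept the response, 0 if they asked for a retry."""
    return int(not log.retried)


def label_continuation_without_retry(log: ResponseLog, min_messages: int = 1) -> int:
    """1 only if the conversation continued and the response was not retried."""
    return int(label_continuation(log, min_messages) == 1 and label_no_retry(log) == 1)


def label_rating_at_least(log: ResponseLog, threshold: int) -> Optional[int]:
    """1 if the rating meets the threshold, 0 if not, None if unrated."""
    if log.star_rating is None:
        return None
    return int(log.star_rating >= threshold)
```

Each function yields a binary target that a reward model could be trained to predict from the conversation context and candidate response.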

Best-of-n sample rejection

  • Trained reward model can be used to improve engagement of finetuned chatbot model
  • One approach is to update the chatbot model weights with reinforcement learning, e.g. Proximal Policy Optimization (PPO)
  • Instead of updating the chatbot model parameters, this work applies best-of-N sample rejection at inference time
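
Best-of-N sample rejection can be sketched in a few lines. The `generate` and `reward` callables below are hypothetical stand-ins for the fine-tuned chatbot and the trained reward model; their signatures are assumptions, not an API from the paper.

```python
from typing import Callable, List


def best_of_n(
    context: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidate responses and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n independent candidates from the (stochastic) chatbot.
    candidates: List[str] = [generate(context) for _ in range(n)]
    # Keep the candidate the reward model scores highest; reject the rest.
    return max(candidates, key=lambda response: reward(context, response))


if __name__ == "__main__":
    # Toy stand-ins: a fixed sequence of replies and a length-based score.
    replies = iter(["ok", "tell me more about that!", "hm"])
    print(best_of_n("hi", lambda ctx: next(replies),
                    lambda ctx, resp: float(len(resp)), n=3))
```

The chatbot weights stay fixed, so the cost is paid at inference time: n generations plus n reward-model evaluations per served response.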

Experimental set-up

  • Chatbot model used is GPT-J 6B
  • Reward model based on GPT-2 small, medium, large and extra-large models
  • Reward model fine-tuned to predict labels
  • Chai user response dataset used to train reward model
  • A/B experiments run to evaluate effectiveness of reward models
  • Retention evaluation run for systems with better performance than previous best reward model

Results

Predicting conversation continuation

  • Trained a reward model to predict if user sends at least one more message after chatbot response
  • Treated as a binary classification problem: the model output is fed into a dense layer with sigmoid activation
  • Evaluated performance in production by sampling 4 responses from the chatbot and serving the one ranked highest by the reward model
  • Metric was percentage improvement in MCL over serving chatbot without reward model
  • GPT-2 reward model given last three user and chatbot turns as input
  • Input to the reward model can be either left- or right-padded
  • A/B experiment comparing chatbot with/without reward model
  • GPT-2 small reward model trained on 90K rows
  • Right-padded context gave significant improvement
  • 3 A/B experiments studying scaling of MCL improvement with size of dataset
  • All experiments used best-of-four rejection sampling
  • MCL improvement grows linearly with log of number of rows
  • Increasing dataset size by factor of 10 results in +11.4% increase in MCL
  • Using a larger context window and predicting whether the user responds at least 2 more times results in a +14.0% increase in MCL
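
The log-linear trend with dataset size reported in this section can be written compactly. Here $N$ is the number of training rows, $N_0$ a reference dataset size, and $\Delta\mathrm{MCL}$ the percentage improvement in MCL over serving the chatbot without a reward model. The symbols are introduced for illustration, and reading the +11.4% as additive percentage points per tenfold increase in rows is an assumption.

```latex
\Delta\mathrm{MCL}(N) \;\approx\; \Delta\mathrm{MCL}(N_0) + 11.4 \,\log_{10}\!\left(\frac{N}{N_0}\right)
```

Under this reading, going from $N_0$ to $100\,N_0$ rows would add about $2 \times 11.4 = 22.8$ percentage points, provided the trend holds beyond the range tested.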

Adjusting hyper-parameters of GPT-2 reward models

  • Three A/B experiments were conducted to explore the impact of hyper-parameters on the reward model
  • Increasing the context given to the reward model by 256 tokens resulted in a 14% increase in MCL improvement
  • Increasing the number of samples from 4 to 8 and from 8 to 16 improved MCL by 6.54% and 6.90% respectively
  • Increasing the number of further user messages from 1 to 2 was enough to prevent undesirable PPO behaviour
  • On average, doubling the number of samples improves the MCL by 6.72% but doubles the inference cost
  • Adding one second of latency decreased the MCL by 3.01% and two seconds by 6.10%
  • Each additional sample decreases the MCL improvement by 0.078%
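
The trade-off between more samples and their cost can be explored with a rough model. The sketch below assumes that the 6.72% gain per doubling and the 0.078% penalty per additional sample simply add, and that both trends hold outside the tested range of 4 to 16 samples. Neither assumption is confirmed by the text, so the output is an illustration only.

```python
import math

GAIN_PER_DOUBLING = 6.72   # % MCL improvement per doubling of samples (from the text)
COST_PER_SAMPLE = 0.078    # % MCL improvement lost per additional sample (from the text)


def sampling_gain(n: int, n_ref: int = 4) -> float:
    """Assumed log-linear gain (in %) from drawing n samples instead of n_ref."""
    if n < 1 or n_ref < 1:
        raise ValueError("sample counts must be positive")
    return GAIN_PER_DOUBLING * math.log2(n / n_ref)


def sampling_cost(n: int, n_ref: int = 4) -> float:
    """Assumed linear penalty (in %) for the samples added beyond n_ref."""
    return COST_PER_SAMPLE * (n - n_ref)


def net_gain(n: int, n_ref: int = 4) -> float:
    """Gain minus penalty, under the additivity assumption stated above."""
    return sampling_gain(n, n_ref) - sampling_cost(n, n_ref)


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, round(net_gain(n), 3))
```

Under these assumptions the per-sample penalty is small next to the gain per doubling in the tested range, so the doubled inference cost is likely the tighter constraint.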

Did the user retry the message?

  • Chatbot interfaces often have the option to retry.
  • Experimented with training reward model to predict user retry instead of conversation continuation.
  • Trained GPT-2 large model on different dataset sizes.
  • Ran A/B test comparing reward models to each other and to chatbot without reward model.
  • Found that predicting retry results in worse MCL improvement than predicting conversation continuation.

Predicting whether the conversation continues without retrying the response

  • Previous experiment found small improvement in MCL when predicting user retry
  • Explored whether predicting user retry leads to improved retention
  • Ran A/B test of 3 reward models plus baseline over 30 days
  • Retention improved linearly with log of days since start of experiment
  • Reward model predicting conversation continuation increased retention by 12.1%, predicting no retry by 24.7%, and predicting both (continuation without retry) by 30.3%

Predicting the user rating

  • Reward model can predict whether a chatbot response will receive a rating of two or more stars, three or more stars, or four stars.
  • A/B experiment found that two stars or more reward model improved MCL by +8.70 ± 2.54%, three stars or more reward model improved MCL by +9.81 ± 2.57%, and four stars reward model improved MCL by +1.24 ± 2.47%.
  • More important to avoid low-scoring one star or two star messages than to prioritize four star messages.

Reward model generalisation

  • Explored using GPT-2 small reward model to rank responses of GPT-J chatbot
  • Investigated larger GPT-2 reward models and differently fine-tuned GPT-J chatbot
  • Found increasing number of parameters by factor of 10 gives +5.0% MCL improvement
  • Increasing dataset size by factor of 10 gives +11.4% MCL improvement
  • Pygmalion GPT-J chatbot gave +16.4% MCL improvement without reward model
  • GPT-2 small reward model improved performance of fine-tuned GPT-J by +36.87%
  • Pygmalion GPT-J improved to +54.33% with reward model
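
The reward model's contribution on top of the Pygmalion chatbot can be separated from the gain due to the chatbot itself. The sketch assumes that the +16.4% and +54.33% figures are both measured against the same baseline, which the text implies but does not state. The function name is hypothetical.

```python
def relative_gain(pct_with: float, pct_without: float) -> float:
    """Percentage gain of one system over another.

    Both arguments are percentage improvements over a shared baseline,
    so the ratio of (1 + x/100) factors cancels that baseline out.
    """
    return ((1 + pct_with / 100) / (1 + pct_without / 100) - 1) * 100


if __name__ == "__main__":
    # Pygmalion GPT-J with reward model vs. Pygmalion GPT-J without it.
    print(round(relative_gain(54.33, 16.4), 1))
```

Under that assumption, the reward model adds roughly +32.6% MCL on top of Pygmalion GPT-J, close to the +36.87% it adds to the original fine-tuned GPT-J.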

Conclusion

  • Developing chatbots that are highly engaging and entertaining
  • Training reward models with human feedback
  • Leads to longer average user interactions and higher user retention
  • Intuitive evaluation metrics: mean conversation length and user retention
  • Pseudo-labels used to identify captivating responses
  • Proposed approach increases user retention of a GPT-J 6B chatbot by over 30%
  • Distribution of conversation lengths follows a decaying power-law
  • Mean conversation length defined as mean of conversations up to 100 messages
  • Percentage improvement of MCL relative to not having a reward model
  • Distribution of user ratings of chatbot messages is top-heavy