Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Reward learning enables reinforcement learning to be applied to tasks where reward is defined by human judgment.
  • Most work on reward learning has used simulated environments.
  • Reward learning for language is a key to making RL practical and safe for real-world tasks.
  • This paper applies reward learning to four natural language tasks.
  • For stylistic continuation, good results are achieved with only 5,000 comparisons evaluated by humans.
  • For summarization, models trained with 60,000 comparisons copy whole sentences from the input.

Paper Content

Introduction

  • The goal is to apply reinforcement learning to complex tasks that are defined only by human judgment
  • Human labels can be used to train a model of reward
  • Prior work on learning reward models from humans has been applied to modern deep learning, mostly in simulated environments
  • Real world settings likely involve and require natural language
  • Natural language processing has seen advances with pretraining a large generative language model
  • Reinforcement learning has been applied to natural language tasks
  • This paper combines pretraining advances with human preference learning
  • Fine-tuning pretrained language models with reinforcement learning is applied to two types of tasks
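  • For stylistic continuation, training on 5,000 human comparisons yields a fine-tuned model that humans prefer over the zero-shot pretrained model
  • For summarization, models trained with 60,000 human comparisons copy whole sentences from the input
  • Human labelers prefer these RL fine-tuned summarization models to supervised fine-tuning baselines and to human-written reference summaries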
  • Both online and offline data collection are tested
  • Offline data collection works similarly well for style tasks

Methods

  • Vocabulary Σ and language model ρ define probability distribution over sequences of tokens
  • Input space $X = \Sigma^{\le m}$ and output space $Y = \Sigma^{n}$
  • For example, x ∈ X could be an article of up to 1000 words and y ∈ Y a 100-word summary
  • ρ defines a probabilistic policy for this task
  • Initialize policy π = ρ and fine-tune π to perform task well using RL
  • Task defined by human judgments, so use human labels to train reward model
  • Ask humans to choose between four options $(y_0, y_1, y_2, y_3)$
  • Fit reward model r : X × Y → ℝ to the human choices by minimizing a softmax cross-entropy loss over the four options (loss (1) in the paper)
  • Initialize reward model from language model policy ρ
  • Fine-tune π to optimize reward model r
  • Add a penalty with expectation β KL(π, ρ) to keep π from moving too far from ρ
  • Use 774M parameter version of GPT-2 language model
  • Use temperature of T < 1 for all experiments
  • Train reward model using Adam optimizer
  • Train policy π using Proximal Policy Optimization
  • For stylistic continuation tasks, first fine-tune the language model on the BookCorpus dataset with supervised learning
  • Gather validation samples to estimate progress and inter-labeler agreement
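
The reward model fit and the KL penalty described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's code: it assumes the loss is the softmax cross-entropy over the four options, written as a negative log-likelihood so that lower is better, and that the penalty is applied per sample as β log(π(y|x)/ρ(y|x)), whose expectation under π is β KL(π, ρ). The function and argument names are hypothetical.

```python
import math

def reward_model_loss(rewards, chosen):
    """Negative log-likelihood that the labeler picks option `chosen`.

    rewards: list of r(x, y_i) for the candidate outputs y_0..y_3.
    chosen:  index b of the option the human selected.
    """
    # log-sum-exp with the max subtracted for numerical stability
    m = max(rewards)
    log_z = m + math.log(sum(math.exp(r - m) for r in rewards))
    return -(rewards[chosen] - log_z)

def penalized_reward(r, logp_pi, logp_rho, beta):
    """Reward used for RL: r(x, y) minus beta * log(pi(y|x) / rho(y|x))."""
    return r - beta * (logp_pi - logp_rho)
```

With four equal rewards the loss equals log 4, the loss of a uniform guess; raising the reward of the chosen option lowers it.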

Human labeling

  • Scale AI is used to collect labels
  • Scale API accepts requests of the form $(x, y_0, y_1, y_2, y_3)$
  • Instructions and dataset of 100 example comparisons are used to describe the task to Scale
  • Ground truth is not always clear
  • Agreement between labelers is low
  • Quality control process for Scale is complicated
  • Final evaluation of two models A and B is done with 2-way or 4-way comparisons
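  • Evaluating a model trained on Scale labels with the same Scale labelers is perilous: it shows the model fits the labelers' judgments, not that those judgments capture the intended task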
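
Agreement on 4-way comparisons can be estimated as the fraction of cases in which two labelers pick the same option. For reference, two labelers choosing uniformly at random among four options would agree 25% of the time. The sketch below is illustrative only; the data layout and the name `agreement_rate` are assumptions, not part of the paper or the Scale API.

```python
from itertools import combinations

def agreement_rate(labels_by_labeler):
    """Fraction of (comparison, labeler pair) cases where both labelers agree.

    labels_by_labeler: dict mapping a labeler name to a list of chosen
    option indices (0..3), one per comparison, all lists the same length.
    """
    agree = 0
    total = 0
    for a, b in combinations(labels_by_labeler.values(), 2):
        for choice_a, choice_b in zip(a, b):
            agree += int(choice_a == choice_b)
            total += 1
    return agree / total if total else float("nan")
```

For example, `agreement_rate({"author": [0, 1, 2, 3], "scale": [0, 1, 0, 0]})` compares one pair of labelers on four items.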

Experiments

  • Tested approach to RL fine-tuning of language models using a mock labeler as a stand-in for human labels
  • Showed that RL fine-tuning is effective at optimizing complex but artificial reward
  • Optimized language models from human preferences on stylistic continuation tasks with little data
  • Applied RL fine-tuning to summarization on CNN/Daily Mail and TL;DR datasets
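  • The resulting summarization policies are “smart copiers”: they copy whole sentences from the input but choose which sentences to copy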
  • Released code for reward modeling and fine-tuning in the offline data case
  • Defined a known reward function by training a sentiment classifier on the Amazon review dataset, then applied the method to optimize it
  • Optimized reward function using limited number of queries to a human
  • Analytically computed optimal policy and compared to learned policies
  • Applied method to two continuation tasks defined by human judgments: sentiment and descriptiveness
  • Dynamically adjusted β to obtain KL divergence of 6 nats for descriptiveness and 10 nats for sentiment
  • Trained range of models using different amounts of feedback
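
Two of the steps above lend themselves to a short sketch. For a known reward and a finite set of outputs, the policy that maximizes expected reward minus β KL(π, ρ) has the closed form π(y) ∝ ρ(y) exp(r(y)/β), which is what makes an analytic comparison possible. For the KL targets, the source only says β was adjusted dynamically; the proportional update rule below, with its gain of 0.1 and clipping at ±0.2, is an illustrative assumption. All function names are hypothetical.

```python
import math

def optimal_policy(rho, rewards, beta):
    """KL-regularized optimum over a finite set: pi(y) ∝ rho(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(rho, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def kl_divergence(pi, rho):
    """KL(pi, rho) in nats for two distributions over the same finite set."""
    return sum(p * math.log(p / q) for p, q in zip(pi, rho) if p > 0)

def update_beta(beta, observed_kl, target_kl, gain=0.1, clip=0.2):
    """Raise beta when the observed KL is above target, lower it when below."""
    error = (observed_kl - target_kl) / target_kl
    error = max(-clip, min(clip, error))  # limit the size of a single update
    return beta * (1.0 + gain * error)
```

A training loop would call `update_beta` once per policy update with the measured KL(π, ρ) and a target such as 6 or 10 nats. A larger β pulls the policy back toward ρ and a smaller β lets it drift further.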

Summarization

  • Applied method to two summarization tasks
  • Sampled articles/posts, truncated them to 500 tokens, and added a "TL;DR:" suffix
  • Set temperature of pretrained model to 0.5 for CNN/Daily Mail and 0.7 for TL;DR
  • Truncated to last newline character to ensure articles consist of whole sentences
  • Penalized summaries without newline by giving them a fixed score of -1
  • Used fixed KL coefficient β = 0.1 for CNN/Daily Mail and β = 0.03 for TL;DR
  • Trained online data collection models with 15k, 30k, and 60k human labels
  • Also showed zero-shot performance of pretrained model, supervised fine-tuned baseline, and lead-3 baseline
  • Combined supervised and RL fine-tuning
  • Human evaluations and ROUGE results suggest online data collection is important
  • RL fine-tuning causes models to copy more
  • Supervised + RL fine-tuning scores best on ROUGE
  • Copying is the easiest way to be accurate
  • Labelers check primarily for copying
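
The newline handling described above can be sketched as two small helpers. This is a sketch under assumptions: it works on strings rather than tokens, it treats the text up to the first newline of a sampled response as the summary, and it leaves an article with no newline unchanged. `reward_fn` is a hypothetical stand-in for the learned reward model.

```python
def truncate_to_last_newline(article):
    """Drop any trailing partial line so the article ends at a newline."""
    cut = article.rfind("\n")
    return article[: cut + 1] if cut != -1 else article

def score_summary(response, reward_fn, penalty=-1.0):
    """Return (summary, score); a response with no newline gets the fixed penalty."""
    cut = response.find("\n")
    if cut == -1:
        return response, penalty
    summary = response[:cut]
    return summary, reward_fn(summary)
```

The fixed penalty gives the policy a direct incentive to end its summary with a newline instead of running on until the token limit.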

Online data collection is hard

  • Online data collection had disadvantages in terms of software complexity, machine learning complexity, and quality control issues.
  • The right middle ground between offline and online data collection is batched data collection.
  • Batching data collection simplifies software architecture and diagnosis of ML issues.
  • Batch mode active learning techniques can be used with the policy π as the unlabeled data distribution.
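
Batched data collection can be pictured as a loop that alternates between one labeling job and one round of retraining. The sketch below only shows that control flow. All four callables and the function name are hypothetical placeholders, and the choice to refit the reward model on all labels collected so far is an assumption.

```python
def batched_training(policy, sample_fn, label_fn, fit_reward_fn,
                     train_policy_fn, num_batches, batch_size):
    """Alternate between collecting one batch of labels and retraining.

    sample_fn(policy, n)        -> list of n queries drawn from the current policy
    label_fn(queries)           -> list of human labels, one per query
    fit_reward_fn(dataset)      -> reward model fit to all labels collected so far
    train_policy_fn(policy, rm) -> policy improved against the reward model
    """
    reward_model = None
    dataset = []
    for _ in range(num_batches):
        # The current policy serves as the unlabeled data distribution.
        queries = sample_fn(policy, batch_size)
        # Each batch is sent out as one self-contained labeling job.
        labels = label_fn(queries)
        dataset.extend(zip(queries, labels))
        reward_model = fit_reward_fn(dataset)
        policy = train_policy_fn(policy, reward_model)
    return policy, reward_model, dataset
```

Because each iteration is a self-contained offline job, a failure can be traced to a specific batch, which is the simplification in software and diagnosis that the bullets above describe.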

Sharing parameters between reward model and policy causes overfitting

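  • Reward model and policy are both initialized to ρ but are trained as separate networks
  • Joint training could use RL as an auxiliary task to improve the reward model
  • Joint training could help the reward model stay strong enough that the policy cannot exploit it
  • Sharing parameters could also improve computational efficiency
  • In practice, the imbalance of data between the two objectives makes joint training challenging and led to overfitting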

Ambiguous tasks make labeling hard

  • Evaluation of a summary is subjective and multidimensional.
  • Choosing between samples with deficiencies is difficult for honest labelers.
  • Ambiguity also makes the research difficult to present and interpret.
  • Better to design less ambiguous labeling tasks.

Bugs can optimize for bad behavior

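  • A code refactor introduced a bug that flipped the sign of both the reward and the KL penalty
  • As a result, the model optimized for negative sentiment while still being regularized toward natural language
  • Because labelers had been told to give very low ratings to sexually explicit text, the model quickly learned to output only such text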
  • The problem was noticed only once training had finished
  • A mechanism such as Toyota’s Andon cord could have prevented this

Conclusion

  • Demonstrated RL fine-tuning of language models on four NLP tasks
  • Tasks include stylistic continuation with high sentiment or physically descriptive language, and summarization on the CNN/Daily Mail and TL;DR datasets
  • Achieved results by applying reward learning to language generation
  • Used KL regularization to prevent policy from diverging too far from natural language
  • Results are mixed: good on the continuation tasks, while the summarization models mostly copy from the input text
  • Data quality is a limiting factor
  • Application of human reward learning to natural language tasks is important for capability and safety
  • Quality assurance process handled by Scale AI
  • Samples from models shown in tables
  • Agreement between authors and Scale labelers estimated