Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Reward learning enables reinforcement learning to be applied to tasks where reward is defined by human judgment.
Most work on reward learning has used simulated environments.
Reward learning for language is a key to making RL practical and safe for real-world tasks.
This paper applies reward learning to four natural language tasks.
For stylistic continuation, good results are achieved with only 5,000 comparisons evaluated by humans.
For summarization, models trained with 60,000 comparisons copy whole sentences from the input.

Paper Content

Introduction

Reinforcement learning can be applied to complex tasks that are defined only by human judgment
Human labels can be used to train a model of reward
Prior work on learning reward models from humans has been applied to modern deep learning
Real-world settings will likely involve, and require, natural language
Natural language processing has seen advances with pretraining a large generative language model
Reinforcement learning has been applied to natural language tasks
This paper combines pretraining advances with human preference learning
Fine-tuning pretrained language models with reinforcement learning is applied to two types of tasks
For stylistic continuation, 5,000 human comparisons are enough for the fine-tuned model to be preferred by humans
For summarization, models trained with 60,000 human comparisons learn to copy whole sentences from the input
Human labelers prefer these models to supervised fine-tuning baselines and to human-written reference summaries
Online and offline data collection is tested
Offline data collection works similarly well for style tasks

Methods

Vocabulary Σ and language model ρ define probability distribution over sequences of tokens
Input space X = Σ^{≤m} and output space Y = Σ^n
For example, x ∈ X could be an article of up to 1000 words and y ∈ Y a 100-word summary
ρ defines a probabilistic policy for this task
Initialize policy π = ρ and fine-tune π to perform task well using RL
Task defined by human judgments, so use human labels to train reward model
Ask humans to choose between four options (y_0, y_1, y_2, y_3)
Fit reward model r : X × Y → R using loss (1)
Initialize reward model from language model policy ρ
Fine-tune π to optimize reward model r
Add a penalty with expectation β KL(π, ρ) to keep π from moving too far from ρ
Use 774M parameter version of GPT-2 language model
Use temperature of T < 1 for all experiments
Train reward model using Adam optimizer
Train policy π using Proximal Policy Optimization
For stylistic continuation, first apply supervised fine-tuning of the language model to the BookCorpus dataset
Gather validation samples to estimate progress and inter-labeler agreement
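
The two training signals above can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: it assumes loss (1) is the negative log-softmax of the reward assigned to the option the labeler chose, and that the KL penalty is applied per sample as β log(π(y|x) / ρ(y|x)), whose expectation under π is β KL(π, ρ). The function names and arguments are hypothetical.

```python
import math

def comparison_loss(rewards, chosen):
    """Negative log-softmax of the chosen option's reward.

    rewards: reward-model scores r(x, y_i) for the options shown to the labeler.
    chosen:  index b of the option the labeler picked.
    (Assumed form of loss (1); the summary above does not spell it out.)
    """
    m = max(rewards)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(r - m) for r in rewards))
    return log_z - rewards[chosen]

def penalized_reward(r, logp_policy, logp_ref, beta):
    """Per-sample RL reward: r(x, y) - beta * log(pi(y|x) / rho(y|x)).

    logp_policy and logp_ref are log-probabilities of the same sample y
    under the fine-tuned policy pi and the original language model rho.
    """
    return r - beta * (logp_policy - logp_ref)

if __name__ == "__main__":
    # With four equal scores the loss is log(4): the model is indifferent.
    print(comparison_loss([0.0, 0.0, 0.0, 0.0], chosen=2))
    # A sample that pi makes more likely than rho has its reward reduced.
    print(penalized_reward(1.0, logp_policy=-10.0, logp_ref=-12.0, beta=0.1))
```

In practice the log-probabilities would be summed over the tokens of y; scalars are used here only to keep the sketch self-contained.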

Human labeling

Scale AI is used to collect labels
Scale API accepts requests of the form (x, y_0, y_1, y_2, y_3)
Instructions and dataset of 100 example comparisons are used to describe the task to Scale
Ground truth is not always clear
Agreement between labelers is low
Quality control process for Scale is complicated
Final evaluation of two models A and B is done with 2-way or 4-way comparisons
Evaluating a model trained on Scale labels using the same Scale labelers is perilous
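
The 4-way evaluation above can be illustrated as follows. This is a rough sketch under stated assumptions: each comparison shows two samples from model A and two from model B in random order, and the win rate is the fraction of comparisons in which the labeler picked a sample from A. The helper names `make_four_way` and `win_rate` are hypothetical.

```python
import random

def make_four_way(a_samples, b_samples, rng):
    """Shuffle two samples from each model so labelers cannot tell the source.

    Returns the texts in presentation order and the hidden source of each.
    """
    items = [(s, "A") for s in a_samples] + [(s, "B") for s in b_samples]
    rng.shuffle(items)
    texts = [text for text, _ in items]
    sources = [source for _, source in items]
    return texts, sources

def win_rate(sources_per_query, choices, model="A"):
    """Fraction of comparisons where the chosen sample came from `model`."""
    wins = sum(
        1 for sources, choice in zip(sources_per_query, choices)
        if sources[choice] == model
    )
    return wins / len(choices)

if __name__ == "__main__":
    rng = random.Random(0)
    texts, sources = make_four_way(["a0", "a1"], ["b0", "b1"], rng)
    print(texts, sources)
    # Two labeled comparisons; the labeler chose position 0 both times.
    print(win_rate([["A", "B", "A", "B"], ["B", "A", "B", "A"]], [0, 0]))
```

A win rate near 0.5 means labelers have no preference between the two models.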

Experiments

Tested approach to RL fine-tuning of language models using a mock labeler as a stand-in for human labels
Showed that RL fine-tuning is effective at optimizing complex but artificial reward
Optimized language models from human preferences on stylistic continuation tasks with little data
Applied RL fine-tuning to summarization on CNN/Daily Mail and TL;DR datasets
Results are “smart copiers”
Released code for reward modeling and fine-tuning in the offline data case
Applied method to optimize known reward function by training a classifier on Amazon review dataset
Optimized reward function using limited number of queries to a human
Analytically computed optimal policy and compared to learned policies
Applied method to two continuation tasks defined by human judgments: sentiment and descriptiveness
Dynamically adjusted β to obtain KL divergence of 6 nats for descriptiveness and 10 nats for sentiment
Trained range of models using different amounts of feedback
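
One way to adjust β dynamically toward a target KL is a clipped proportional controller. The sketch below is an assumption about the general shape of such a controller, not a statement of the paper's exact rule; the `gain` and `clip` defaults are illustrative values.

```python
def update_beta(beta, observed_kl, target_kl, gain=0.1, clip=0.2):
    """Nudge the KL coefficient so KL(pi, rho) drifts toward target_kl.

    If the observed KL is above target, beta grows (stronger penalty);
    if it is below target, beta shrinks. The relative error is clipped
    so a single noisy estimate cannot move beta by much.
    gain and clip are hypothetical defaults chosen for illustration.
    """
    error = (observed_kl - target_kl) / target_kl
    error = max(-clip, min(clip, error))
    return beta * (1.0 + gain * error)

if __name__ == "__main__":
    beta = 0.1
    # Pretend the policy keeps overshooting a 6-nat target.
    for observed in [12.0, 9.0, 7.0, 6.0]:
        beta = update_beta(beta, observed, target_kl=6.0)
        print(round(beta, 5))
```

The update is multiplicative, so β stays positive as long as it starts positive and `gain * clip` is below 1.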

Summarization

Applied method to two summarization tasks
Sampled articles/posts, truncated to 500 tokens, added suffix
Set temperature of pretrained model to 0.5 for CNN/Daily Mail and 0.7 for TL;DR
Truncated to last newline character to ensure articles consist of whole sentences
Penalized summaries without newline by giving them a fixed score of -1
Used fixed KL coefficient β = 0.1 for CNN/Daily Mail and β = 0.03 for TL;DR
Trained online data collection models with 15k, 30k, and 60k human labels
Also showed zero-shot performance of pretrained model, supervised fine-tuned baseline, and lead-3 baseline
Combined supervised and RL fine-tuning
Human evaluations and ROUGE results suggest online data collection is important
RL fine-tuning causes models to copy more
Supervised + RL fine-tuning is best
Copying is the easiest way to be accurate
Labelers check primarily for copying

Online data collection is hard

Online data collection had disadvantages in terms of software complexity, machine learning complexity, and quality control issues.
The right middle ground between offline and online data collection is batched data collection.
Batching data collection simplifies software architecture and diagnosis of ML issues.
Batch mode active learning techniques can be used with the policy π as the unlabeled data distribution.

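Sharing parameters between reward model and policy causes overfitting

Reward model and policy are both initialized to the same value, ρ
Joint training would use RL as an auxiliary task to improve the reward model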
Joint training could help reward model stay strong
Sharing could improve computational efficiency
Imbalance of data makes joint training challenging

Ambiguous tasks make labeling hard

Evaluation of a summary is subjective and multidimensional.
Choosing between samples with deficiencies is difficult for honest labelers.
Ambiguity also makes the research difficult to present and interpret.
Better to design less ambiguous labeling tasks.

Bugs can optimize for bad behavior

A bug was introduced which flipped the sign of the reward and KL penalty
This caused the model to optimize for negative sentiment
The model quickly learned to output only sexually explicit text
The problem was noticed only once training had finished
A mechanism such as Toyota’s Andon cord could have prevented this

Conclusion

Demonstrated RL fine-tuning of language models on four NLP tasks
Tasks include stylistic continuation with high sentiment or physically descriptive language, and summarization on the CNN/Daily Mail and TL;DR datasets
Achieved results by applying reward learning to language generation
Used KL regularization to prevent policy from diverging too far from natural language
Results are mixed, with good results on continuation tasks but summarization tasks only copying from input text
Data quality is a limiting factor
Application of human reward learning to natural language tasks is important for capability and safety
Quality assurance process handled by Scale AI
Samples from models shown in tables
Agreement between authors and Scale labelers estimated
