Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Reward learning enables reinforcement learning to be applied to tasks where reward is defined by human judgment.
- Most work on reward learning has used simulated environments.
- Reward learning for language is a key to making RL practical and safe for real-world tasks.
- This paper applies reward learning to four natural language tasks.
- For stylistic continuation, good results are achieved with only 5,000 comparisons evaluated by humans.
- For summarization, models trained with 60,000 comparisons copy whole sentences from the input.
Paper Content
Introduction
- Reinforcement learning can be used to define complex tasks by human judgment
- Human labels can be used to train a model of reward
- Work learning models from humans has been applied to modern deep learning
- Real world settings likely involve and require natural language
- Natural language processing has seen advances with pretraining a large generative language model
- Reinforcement learning has been applied to natural language tasks
- This paper combines pretraining advances with human preference learning
- Fine-tuning pretrained language models with reinforcement learning is applied to two types of tasks
- Human comparisons result in the fine-tuned model being preferred by humans
- Models are trained with human samples to copy whole sentences from the input
- Human labelers prefer models to supervised fine-tuning and human-written reference summaries
- Online and offline data collection is tested
- Offline data collection works similarly well for style tasks
Methods
- Vocabulary Σ and language model ρ define probability distribution over sequences of tokens
- Input space X = Σ ≤m and output space Y = Σ n
- x ∈ X is an article of up to 1000 words and y ∈ Y is a 100-word summary
- ρ defines a probabilistic policy for this task
- Initialize policy π = ρ and fine-tune π to perform task well using RL
- Task defined by human judgments, so use human labels to train reward model
- Ask humans to choose between four options (y 0 , y 1 , y 2 , y 3 )
- Fit reward model r : X × Y → R using loss (1)
- Initialize reward model from language model policy ρ
- Fine-tune π to optimize reward model r
- Add penalty with expectation β KL(π, ρ)
- Use 774M parameter version of GPT-2 language model
- Use temperature of T < 1 for all experiments
- Train reward model using Adam optimizer
- Train policy π using Proximal Policy Optimization
- Use supervised finetuning of language model to BookCorpus dataset
- Gather validation samples to estimate progress and inter-labeler agreement
Human labeling
- Scale AI is used to collect labels
- Scale API accepts requests of the form (x, y 0 , y 1 , y 2 , y 3 )
- Instructions and dataset of 100 example comparisons are used to describe the task to Scale
- Ground truth is not always clear
- Agreement between labelers is low
- Quality control process for Scale is complicated
- Final evaluation of two models A and B is done with 2-way or 4-way comparisons
- Quality of model trained by Scale is perilous
Experiments
- Tested approach to RL fine-tuning of language models using a mock labeler as a stand-in for human labels
- Showed that RL fine-tuning is effective at optimizing complex but artificial reward
- Optimized language models from human preferences on stylistic continuation tasks with little data
- Applied RL fine-tuning to summarization on CNN/Daily Mail and TL;DR datasets
- Results are “smart copiers”
- Released code for reward modeling and fine-tuning in the offline data case
- Applied method to optimize known reward function by training a classifier on Amazon review dataset
- Optimized reward function using limited number of queries to a human
- Analytically computed optimal policy and compared to learned policies
- Applied method to two continuation tasks defined by human judgments: sentiment and descriptiveness
- Dynamically adjusted β to obtain KL divergence of 6 nats for descriptiveness and 10 nats for sentiment
- Trained range of models using different amounts of feedback
Summarization
- Applied method to two summarization tasks
- Sampled articles/posts, truncated to 500 tokens, added suffix
- Set temperature of pretrained model to 0.5 for CNN/Daily Mail and 0.7 for TL;DR
- Truncated to last newline character to ensure articles consist of whole sentences
- Penalized summaries without newline by giving them a fixed score of -1
- Used fixed KL coefficient β = 0.1 for CNN/Daily Mail and β = 0.03 for TL;DR
- Trained online data collection models with 15k, 30k, and 60k human labels
- Also showed zero-shot performance of pretrained model, supervised fine-tuned baseline, and lead-3 baseline
- Combined supervised and RL fine-tuning
- Human evaluations and ROUGE results suggest online data collection is important
- RL fine-tuning causes models to copy more
- Supervised + RL fine-tuning is best
- Copying is the easiest way to be accurate
- Labelers check primarily for copying
Online data collection is hard
- Online data collection had disadvantages in terms of software complexity, machine learning complexity, and quality control issues.
- The right middle ground between offline and online data collection is batched data collection.
- Batching data collection simplifies software architecture and diagnosis of ML issues.
- Batch mode active learning techniques can be used with the policy π as the unlabeled data distribution.
Sharing parameters between reward model and policy causes overfitting
- Reward model and policy are initialized to same value
- RL used as an auxiliary task to improve reward model
- Joint training could help reward model stay strong
- Sharing could improve computational efficiency
- Imbalance of data makes joint training challenging
Ambiguous tasks make labeling hard
- Evaluation of a summary is subjective and multidimensional.
- Choosing between samples with deficiencies is difficult for honest labelers.
- Research is difficult to present and interpret.
- Better to design less ambiguous labeling tasks.
Bugs can optimize for bad behavior
- A bug was introduced which flipped the sign of the reward and KL penalty
- This caused the model to optimize for negative sentiment
- The model quickly learned to output only sexually explicit text
- The problem was noticed only once training had finished
- A mechanism such as Toyota’s Andon cord could have prevented this
Conclusion
- Demonstrated RL fine-tuning of language models to four NLP tasks
- Tasks include stylistic continuation with high sentiment or physically descriptive language, and summarization on the CNN/Daily Mail and TL;DR datasets
- Achieved results by applying reward learning to language generation
- Used KL regularization to prevent policy from diverging too far from natural language
- Results are mixed, with good results on continuation tasks but summarization tasks only copying from input text
- Data quality is a limiting factor
- Application of human reward learning to natural language tasks is important for capability and safety
- Quality assurance process handled by Scale AI
- Samples from models shown in tables
- Agreement between authors and Scale labelers estimated