Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introducing an open-source modular library, RL4LMs, for optimizing language generators with RL.
- Presenting the GRUE benchmark, a set of 6 language generation tasks supervised by reward functions.
- Introducing NLPO, an easy-to-use, performant RL algorithm.
- RL techniques are better than supervised methods at aligning LMs to human preferences.
- NLPO exhibits greater stability and performance than previous policy gradient methods.
Paper Content
Introduction
- Language technology aims to interact with humans
- Most language models are trained without direct signals of human preference
- Human-in-the-loop is an option to incorporate user feedback, but is inefficient
- Automated metrics offer a compromise, but are not per-token differentiable
- Reinforcement Learning (RL) offers a natural path forward for optimizing non-differentiable, scalar objectives
- Goodhart’s Law warns of nonsense samples that achieve high-quality estimates
- Recent works have shown promising results in aligning LMs to human preferences via RL
- RL4LMs library enables generative HuggingFace models to be trained using RL methods
- GRUE benchmark challenges models to optimize reward functions while remaining fluent
- NLPO algorithm dynamically learns task-specific constraints over the distribution of language
- RL can be more data and parameter efficient than supervised learning
Related work
- Imitation learning algorithms have been used for NLP
- RL has been used to address the cliff MDP problem
- MIXER combined ideas from schedule sampling and REINFORCE
- RL has been used to align LMs with human preferences
- RL has been critiqued for being less stable than supervised LM training
Rl4lms: a library for training lms with rl
- RL4LMs is an open-source library for fine-tuning and evaluating RL algorithms on LM-based generation.
- RL4LMs is built on HuggingFace and stable-baselines-3.
- It can be used to train any decoder only or encoder-decoder transformer models.
- It provides reliable implementations of popular on-policy RL algorithms.
- It is modular and supports 6 different NLP tasks, 16 evaluation metrics and rewards, and 4 RL algorithms.
Environments: generation as a token-level mdp
- NLP task given a supervised dataset
- Generation viewed as a Markov Decision Process
- Episode begins with sampling a datapoint and ends when time step exceeds horizon or EOS token is generated
- Input is task-specific prompt used as initial state
- Action consists of token from vocabulary
- Reward depends on state and target string
Reward functions and evaluation metrics
- RL4LMs provides a generic interface for rewards
- RL algorithms can be applied to a range of textual metrics
- Examples of metrics: ROUGE, BLEU, SacreBLEU, METEOR, BertScore, BLEURT, CIDER, SPICE, PARENT, SummaCZS, perplexity, MSSTR, Shannon entropy, Distinct-1, Distinct-2, Li et al., classifiers trained on human preference data
On-policy actor-critic algorithms
- RL4LMs supports fine-tuning and training LMs from scratch
- RL4LMs uses on-policy actor-critic algorithms
- RL4LMs uses a parameterized control policy to maximize long term rewards
- RL4LMs initializes the value network from a pre-trained LM
- RL4LMs uses Generalized Advantage Estimation to increase training stability
- RL4LMs uses a token-level KL penalty to prevent the model from deviating from the initialized LM
Nlpo: natural language policy optimization
- Language generation action spaces are larger than what most RL algorithms are designed for
- Size of action space is a core cause of instability when training LMs
- NLPO is a parameterized-masked extension of PPO to address this issue
Grue (general reinforced-language understanding eval)
- GRUE is a collection of 7 generative NLP tasks
- Each task is evaluated at test time according to a task-specific mix of metrics
- Metrics span two categories: task preference and naturalness
- Models are free to use supervised data and compute metrics on intermediate generations
- 3 algorithms for direct fine-tuning are compared: Supervised, PPO, and NLPO
- Hybrid approach of supervised learning and RL methods is tested
- Zero-shot evaluations are run with no training data or parameter updates
- RL algorithms are tested on GRUE benchmark
- Human participant study is conducted to validate automated metrics
- Human judgments generally match those seen in automated metrics
- Automated metrics usually correlate with human judgments if text is above a certain threshold of naturalness
- Reward hacking behaviors may be undetected by automated metrics but caught by human preference feedback
Preference reward learning, selection, and hacking
- GRUE benchmark uses average of several measures to evaluate task performance
- RL models optimize single metric independently
- Many single metric rewards provide task performance gains over supervised methods
- KL constraint balances task-specific reward with base LM
- If KL constraint is removed, models reward hack
- NLPO outperforms PPO and supervised
- Human feedback model improves alignment to human preferences
Data budget: improve your reward or gather more demonstration?
- Fixed data collection budget
- IMDB text continuation task
- Model given partial movie review and asked to continue positively
- DistilBERT classifier trained on sentiment labels
- Trade-off between gathering sentiment labels or positive sentiment reviews
- More training data improves test accuracy and reward quality
- Learned reward function more data efficient than expert demonstrations
Practical considerations: which implementation details matter most?
- Generation is modeled as a token-level MDP, not a bandit environment
- Recent works tune LMs using RL by calculating a reward for all tokens in a sentence
- Setting the discount factor γ = 0.95 reduces the magnitude of the reward applied to tokens selected at the beginning
- Dropout and sampling methods are critical for stability of RL training
Conclusions
- GRUE benchmark and RL4LMs library can help align language models to human preferences via RL methods
- Training stability and consistency can lead to better user experiences when interacting with generative models
- PPO and NLPO are used to train policies
- Qualification round used to select workers
- 5 training algorithms benchmarked on 6 tasks
- Discount factor ablation to understand effect of discounted vs undiscounted environments
- Evaluation of GPT2 with different algorithms on IMDB sentiment text continuation task
- Data budget ablations to measure performance differences
- IMDB instructions, example, and interface used for qualification round and human evaluation experiments
- Averaged results, annotator agreement, and statistical significance tests to determine which models output better generations
- Sample generations from each of the algorithms for three randomly picked prompts