Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introducing an open-source modular library, RL4LMs, for optimizing language generators with RL.
Presenting the GRUE benchmark, a set of 6 language generation tasks supervised by reward functions.
Introducing NLPO, an easy-to-use, performant RL algorithm.
RL techniques are better than supervised methods at aligning LMs to human preferences.
NLPO exhibits greater stability and performance than previous policy gradient methods.

Paper Content

Introduction

Language technology aims to interact with humans
Most language models are trained without direct signals of human preference
Human-in-the-loop is an option to incorporate user feedback, but is inefficient
Automated metrics offer a compromise, but are not per-token differentiable
Reinforcement Learning (RL) offers a natural path forward for optimizing non-differentiable, scalar objectives
Goodhart’s Law is a risk: optimizing an automated metric can produce nonsense samples that still score highly
Recent works have shown promising results in aligning LMs to human preferences via RL
RL4LMs library enables generative HuggingFace models to be trained using RL methods
GRUE benchmark challenges models to optimize reward functions while remaining fluent
NLPO algorithm dynamically learns task-specific constraints over the distribution of language
RL can be more data and parameter efficient than supervised learning

Related work

Imitation learning algorithms have been used for NLP
RL has been used to address the cliff MDP problem
MIXER combined ideas from scheduled sampling and REINFORCE
RL has been used to align LMs with human preferences
RL has been critiqued for being less stable than supervised LM training

RL4LMs: a library for training LMs with RL

RL4LMs is an open-source library for fine-tuning and evaluating RL algorithms on LM-based generation.
RL4LMs is built on HuggingFace and stable-baselines-3.
It can be used to train any decoder-only or encoder-decoder transformer model.
It provides reliable implementations of popular on-policy RL algorithms.
It is modular and supports 6 different NLP tasks, 16 evaluation metrics and rewards, and 4 RL algorithms.

Environments: generation as a token-level MDP

Each environment is an NLP task defined by a supervised dataset
Generation is viewed as a Markov Decision Process (MDP)
Episode begins with sampling a datapoint and ends when time step exceeds horizon or EOS token is generated
Input is task-specific prompt used as initial state
Action consists of token from vocabulary
Reward depends on state and target string
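
The episode structure above can be sketched as a small environment class. This is a toy illustration, not the RL4LMs environment API: `TokenMDP`, `overlap_reward`, the `<eos>` string, and the choice to emit the reward only at the end of the episode are all assumptions made for the example, and sampling a datapoint is reduced to passing in one prompt and target.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

EOS = "<eos>"  # hypothetical end-of-sequence token


@dataclass
class TokenMDP:
    """Toy token-level MDP: the state is the prompt plus the tokens generated so far."""
    prompt: List[str]
    target: str
    reward_fn: Callable[[List[str], str], float]  # scores generated tokens against the target
    horizon: int = 8
    generated: List[str] = field(default_factory=list)

    def reset(self) -> List[str]:
        # The task-specific prompt is the initial state.
        self.generated = []
        return list(self.prompt)

    def step(self, token: str) -> Tuple[List[str], float, bool]:
        # One action = one token from the vocabulary.
        self.generated.append(token)
        done = token == EOS or len(self.generated) >= self.horizon
        # Assumption: the reward is only emitted on the final step.
        reward = self.reward_fn(self.generated, self.target) if done else 0.0
        return self.prompt + self.generated, reward, done


def overlap_reward(generated: List[str], target: str) -> float:
    """Hypothetical reward: fraction of generated words that appear in the target."""
    words = [t for t in generated if t != EOS]
    target_words = set(target.split())
    return sum(w in target_words for w in words) / max(len(words), 1)


env = TokenMDP(prompt=["the", "movie", "was"], target="really great fun",
               reward_fn=overlap_reward)
state = env.reset()
trajectory = []
for tok in ["really", "great", EOS]:
    state, reward, done = env.step(tok)
    trajectory.append((tok, reward, done))
```

Here the first two steps return a reward of 0 and the episode ends on the EOS action, at which point the reward function sees the full generation and the target string.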

Reward functions and evaluation metrics

RL4LMs provides a generic interface for rewards
RL algorithms can be applied to a range of textual metrics
Examples of metrics: ROUGE, BLEU, SacreBLEU, METEOR, BERTScore, BLEURT, CIDEr, SPICE, PARENT, SummaCZS, perplexity, MSTTR, Shannon entropy, Distinct-1 and Distinct-2 (Li et al.), and classifiers trained on human preference data
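
A generic reward interface of this kind might look like the sketch below, where any text metric is wrapped as a callable that returns a scalar. The class names and the call signature are hypothetical and are not taken from the RL4LMs code base; Distinct-n is used as the example metric because it needs no external model.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class RewardFunction(ABC):
    """Hypothetical interface: a text metric exposed as a scalar reward."""

    @abstractmethod
    def __call__(self, prompt: str, generated: str,
                 references: Optional[List[str]] = None) -> float:
        ...


class DistinctN(RewardFunction):
    """Distinct-n: number of unique n-grams divided by the total number of n-grams."""

    def __init__(self, n: int = 1):
        self.n = n

    def __call__(self, prompt: str, generated: str,
                 references: Optional[List[str]] = None) -> float:
        tokens = generated.split()  # whitespace tokenization, for illustration only
        ngrams = [tuple(tokens[i:i + self.n])
                  for i in range(len(tokens) - self.n + 1)]
        return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


distinct_1 = DistinctN(1)
distinct_2 = DistinctN(2)
text = "good good movie good movie"
score_1 = distinct_1("", text)  # 2 unique unigrams out of 5
score_2 = distinct_2("", text)  # 3 unique bigrams out of 4
```

Because every metric shares one call signature in this sketch, the same training loop could swap a lexical metric for a learned preference classifier without other changes.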

On-policy actor-critic algorithms

RL4LMs supports fine-tuning and training LMs from scratch
RL4LMs uses on-policy actor-critic algorithms
RL4LMs uses a parameterized control policy to maximize long term rewards
RL4LMs initializes the value network from a pre-trained LM
RL4LMs uses Generalized Advantage Estimation to increase training stability
RL4LMs uses a token-level KL penalty to prevent the model from deviating from the initialized LM
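
The last two points can be illustrated with a short sketch: a per-token KL penalty is subtracted from the task reward, and Generalized Advantage Estimation (GAE) is then computed over the shaped rewards. The function names, the coefficient `beta`, and the use of the sampled log-probability difference as the KL estimate are assumptions for this example, not the library's implementation.

```python
from typing import List


def kl_penalized_rewards(rewards: List[float], logp_policy: List[float],
                         logp_ref: List[float], beta: float) -> List[float]:
    """Subtract beta * (log pi(a|s) - log pi_ref(a|s)) from each token's reward."""
    return [r - beta * (lp - lr)
            for r, lp, lr in zip(rewards, logp_policy, logp_ref)]


def gae(rewards: List[float], values: List[float],
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation for one finished episode.

    values[t] is the critic's estimate at step t; the value after the
    final step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical three-token episode with a task reward only on the last token.
task_rewards = [0.0, 0.0, 1.0]
logp_policy = [-1.0, -2.0, -0.5]   # log-probs under the policy being trained
logp_ref = [-1.5, -2.0, -1.0]      # log-probs under the initial LM
shaped = kl_penalized_rewards(task_rewards, logp_policy, logp_ref, beta=0.1)
advantages = gae(shaped, values=[0.2, 0.4, 0.6])
```

Tokens that the trained policy makes more likely than the initial LM does receive a negative adjustment, which is what discourages drift away from the initial model.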

NLPO: natural language policy optimization

Language generation action spaces are larger than what most RL algorithms are designed for
Size of action space is a core cause of instability when training LMs
NLPO is a parameterized-masked extension of PPO to address this issue
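
One way to picture action masking is the sketch below. The notes above do not specify the masking rule, so this example assumes a top-p filter computed from a separate copy of the policy (called the mask policy here); the function names and the toy probabilities are hypothetical.

```python
from typing import Dict, Set


def top_p_tokens(probs: Dict[str, float], p: float) -> Set[str]:
    """Smallest set of highest-probability tokens whose total mass reaches p."""
    allowed = set()
    total = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        allowed.add(token)
        total += prob
        if total >= p:
            break
    return allowed


def apply_mask(policy_probs: Dict[str, float], allowed: Set[str]) -> Dict[str, float]:
    """Zero out disallowed tokens and renormalize the policy over the rest."""
    mass = sum(prob for token, prob in policy_probs.items() if token in allowed)
    return {token: prob / mass
            for token, prob in policy_probs.items() if token in allowed}


# Assumed next-token distributions over a four-word vocabulary.
mask_policy_probs = {"great": 0.5, "good": 0.3, "bad": 0.15, "zebra": 0.05}
policy_probs = {"great": 0.4, "good": 0.2, "bad": 0.3, "zebra": 0.1}

allowed = top_p_tokens(mask_policy_probs, p=0.75)
masked_policy = apply_mask(policy_probs, allowed)
```

With these numbers only "great" and "good" survive the mask, so the policy samples from two actions instead of four. On a real vocabulary the reduction in effective action space would be far larger.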

GRUE (General Reinforced-language Understanding Eval)

GRUE is a collection of 7 generative NLP tasks
Each task is evaluated at test time according to a task-specific mix of metrics
Metrics span two categories: task preference and naturalness
Models are free to use supervised data and compute metrics on intermediate generations
3 algorithms for direct fine-tuning are compared: Supervised, PPO, and NLPO
Hybrid approach of supervised learning and RL methods is tested
Zero-shot evaluations are run with no training data or parameter updates
RL algorithms are tested on GRUE benchmark
Human participant study is conducted to validate automated metrics
Human judgments generally match those seen in automated metrics
Automated metrics usually correlate with human judgments if text is above a certain threshold of naturalness
Reward hacking behaviors may be undetected by automated metrics but caught by human preference feedback

Preference reward learning, selection, and hacking

GRUE benchmark uses average of several measures to evaluate task performance
RL models optimize single metric independently
Many single metric rewards provide task performance gains over supervised methods
KL constraint balances task-specific reward with base LM
If KL constraint is removed, models reward hack
NLPO outperforms PPO and supervised
Human feedback model improves alignment to human preferences

Data budget: improve your reward or gather more demonstration?

Fixed data collection budget
IMDB text continuation task
Model given partial movie review and asked to continue positively
DistilBERT classifier trained on sentiment labels
Trade-off between gathering sentiment labels or positive sentiment reviews
More training data improves test accuracy and reward quality
Learned reward function more data efficient than expert demonstrations

Practical considerations: which implementation details matter most?

Generation is modeled as a token-level MDP, not a bandit environment
Recent works tune LMs using RL by calculating a single reward for all tokens in a sentence, which is equivalent to a bandit environment
Setting the discount factor γ = 0.95 reduces the magnitude of the reward applied to tokens selected at the beginning
Dropout and sampling methods are critical for stability of RL training
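
The effect of the discount factor on early tokens can be checked with a few lines of arithmetic. The 20-token episode with a single reward on the last token is a hypothetical setup chosen for illustration.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return-to-go at each step: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical 20-token episode: reward of 1.0 only on the final token.
rewards = [0.0] * 19 + [1.0]
discounted = discounted_returns(rewards, gamma=0.95)
undiscounted = discounted_returns(rewards, gamma=1.0)

first_token_return = discounted[0]   # 0.95 ** 19, roughly 0.377
last_token_return = discounted[-1]   # 1.0
```

With γ = 0.95 the first token sees only about 38% of the final reward, while with γ = 1.0 every token sees the full reward.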

Conclusions

GRUE benchmark and RL4LMs library can help align language models to human preferences via RL methods
Training stability and consistency can lead to better user experiences when interacting with generative models
PPO and NLPO are used to train policies
Qualification round used to select workers
5 training algorithms benchmarked on 6 tasks
Discount factor ablation to understand effect of discounted vs undiscounted environments
Evaluation of GPT2 with different algorithms on IMDB sentiment text continuation task
Data budget ablations to measure performance differences
IMDB instructions, example, and interface used for qualification round and human evaluation experiments
Averaged results, annotator agreement, and statistical significance tests to determine which models output better generations
Sample generations from each of the algorithms for three randomly picked prompts
