Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introducing an open-source modular library, RL4LMs, for optimizing language generators with RL.
  • Presenting the GRUE benchmark, a set of 6 language generation tasks supervised by reward functions.
  • Introducing NLPO, an easy-to-use, performant RL algorithm.
  • RL techniques are generally better than supervised methods at aligning LMs to human preferences.
  • NLPO exhibits greater stability and performance than previous policy gradient methods.

Paper Content

Introduction

  • Language technology aims to interact with humans
  • Most language models are trained without direct signals of human preference
  • Human-in-the-loop is an option to incorporate user feedback, but is inefficient
  • Automated metrics offer a compromise, but are not per-token differentiable
  • Reinforcement Learning (RL) offers a natural path forward for optimizing non-differentiable, scalar objectives
  • Goodhart’s Law is a risk: optimizing an imperfect metric can yield nonsense samples that still score highly
  • Recent works have shown promising results in aligning LMs to human preferences via RL
  • RL4LMs library enables generative HuggingFace models to be trained using RL methods
  • GRUE benchmark challenges models to optimize reward functions while remaining fluent
  • NLPO algorithm dynamically learns task-specific constraints over the distribution of language
  • RL can be more data and parameter efficient than supervised learning
  • Imitation learning algorithms have been used for NLP
  • RL has been used to address the cliff MDP problem
  • MIXER combined ideas from scheduled sampling and REINFORCE
  • RL has been used to align LMs with human preferences
  • RL has been critiqued for being less stable than supervised LM training

RL4LMs: a library for training LMs with RL

  • RL4LMs is an open-source library for fine-tuning and evaluating RL algorithms on LM-based generation.
  • RL4LMs is built on HuggingFace and stable-baselines-3.
  • It can be used to train any decoder-only or encoder-decoder transformer model.
  • It provides reliable implementations of popular on-policy RL algorithms.
  • It is modular and supports 6 different NLP tasks, 16 evaluation metrics and rewards, and 4 RL algorithms.

Environments: generation as a token-level MDP

  • Each environment is an NLP task defined by a supervised dataset
  • Generation viewed as a Markov Decision Process
  • Episode begins with sampling a datapoint and ends when time step exceeds horizon or EOS token is generated
  • Input is task-specific prompt used as initial state
  • Action consists of token from vocabulary
  • Reward depends on state and target string
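
The episode structure above can be sketched as a toy environment. This is a minimal illustration, not the RL4LMs API: the class `TokenMDP`, the `overlap_reward` metric, the `<eos>` string, and the choice to give the reward only at the final step are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

EOS = "<eos>"  # hypothetical end-of-sequence token


@dataclass
class TokenMDP:
    """Toy token-level generation environment (illustrative only)."""
    prompt: List[str]                  # task-specific prompt = initial state
    target: str                        # target string from the dataset
    reward_fn: Callable[[List[str], str], float]
    horizon: int = 8                   # maximum number of generated tokens
    generated: List[str] = field(default_factory=list)

    def state(self) -> List[str]:
        # The state is the prompt plus every token generated so far.
        return self.prompt + self.generated

    def step(self, token: str) -> Tuple[List[str], float, bool]:
        """Apply one action (a vocabulary token); return (state, reward, done)."""
        self.generated.append(token)
        done = token == EOS or len(self.generated) >= self.horizon
        # Assumption: sparse reward, computed once the episode has ended.
        reward = self.reward_fn(self.generated, self.target) if done else 0.0
        return self.state(), reward, done


def overlap_reward(generated: List[str], target: str) -> float:
    """Fraction of target words that appear in the generation (toy metric)."""
    target_words = set(target.split())
    hits = target_words & set(generated)
    return len(hits) / len(target_words) if target_words else 0.0


env = TokenMDP(prompt=["summarize", ":"], target="cats sleep",
               reward_fn=overlap_reward)
trajectory = [env.step(tok) for tok in ["cats", "sleep", EOS]]
final_state, final_reward, finished = trajectory[-1]
```

Here the episode ends on the EOS action; with a longer token stream it would instead end once `horizon` tokens had been generated.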

Reward functions and evaluation metrics

  • RL4LMs provides a generic interface for rewards
  • RL algorithms can be applied to a range of textual metrics
  • Examples of metrics: ROUGE, BLEU, SacreBLEU, METEOR, BERTScore, BLEURT, CIDEr, SPICE, PARENT, SummaCZS, perplexity, MSTTR, Shannon entropy, Distinct-1, Distinct-2 (Li et al.), classifiers trained on human preference data
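
One way to picture a generic reward interface is a single callable signature that any metric can be wrapped into. The sketch below is an assumption for illustration, not the library's actual interface: `RewardFn` and `make_distinct_reward` are hypothetical names, and Distinct-n is computed over whitespace tokens for simplicity.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: (generated text, reference texts) -> scalar score.
RewardFn = Callable[[str, Sequence[str]], float]


def distinct_n(text: str, n: int) -> float:
    """Ratio of unique n-grams to total n-grams in whitespace-tokenized text."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def make_distinct_reward(n: int) -> RewardFn:
    """Wrap a reference-free metric so it fits the generic reward signature."""
    def reward(generated: str, references: Sequence[str]) -> float:
        return distinct_n(generated, n)  # references are ignored by this metric
    return reward


rewards: List[RewardFn] = [make_distinct_reward(1), make_distinct_reward(2)]
scores = [fn("the cat saw the dog", ["a cat saw a dog"]) for fn in rewards]
```

For this sentence, four of the five unigrams are unique and all four bigrams are unique. Because every metric sits behind the same signature, a reference-based metric such as ROUGE could be swapped in without changing the training loop.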

On-policy actor-critic algorithms

  • RL4LMs supports fine-tuning and training LMs from scratch
  • RL4LMs uses on-policy actor-critic algorithms
  • RL4LMs uses a parameterized control policy to maximize long term rewards
  • RL4LMs initializes the value network from a pre-trained LM
  • RL4LMs uses Generalized Advantage Estimation to increase training stability
  • RL4LMs uses a token-level KL penalty to prevent the model from deviating too far from the initialized LM
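
The last two points can be sketched together. This is a simplified illustration under stated assumptions, not the library's code: the per-token KL term is approximated by the log-probability difference between the policy and the initial LM, the coefficient `beta` is held fixed, and the helper names are hypothetical.

```python
from typing import List


def kl_shaped_rewards(task_rewards: List[float],
                      logp_policy: List[float],
                      logp_ref: List[float],
                      beta: float) -> List[float]:
    """Subtract a per-token KL estimate, beta * (log pi - log pi_ref)."""
    return [r - beta * (lp - lr)
            for r, lp, lr in zip(task_rewards, logp_policy, logp_ref)]


def gae(rewards: List[float], values: List[float],
        gamma: float, lam: float) -> List[float]:
    """Generalized Advantage Estimation for one finished episode.

    values[t] is the critic's estimate for state t; the value after the
    final step is taken to be 0 because the episode has terminated.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


shaped = kl_shaped_rewards(
    task_rewards=[0.0, 0.0, 1.0],      # task reward arrives at the last token
    logp_policy=[-1.0, -2.0, -0.5],
    logp_ref=[-1.5, -2.0, -1.0],
    beta=0.1,
)
advantages = gae(shaped, values=[0.2, 0.4, 0.6], gamma=1.0, lam=1.0)
```

Tokens where the policy assigns more probability than the initial LM receive a small negative adjustment, so the policy pays a cost for moving away from the starting model even before the task reward arrives.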

NLPO: natural language policy optimization

  • Language generation action spaces are larger than what most RL algorithms are designed for
  • Size of action space is a core cause of instability when training LMs
  • NLPO is a parameterized-masked extension of PPO to address this issue
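
To illustrate what masking the action space can look like, the sketch below restricts sampling to a high-probability subset of the vocabulary. It is an assumption-laden toy, not the NLPO implementation: the use of a top-p (nucleus) rule, the separate `masking_probs` distribution, and both function names are chosen for illustration only.

```python
from typing import List


def top_p_mask(probs: List[float], p: float) -> List[bool]:
    """Keep the smallest set of most-likely tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    mask = [False] * len(probs)
    cumulative = 0.0
    for token_id in order:
        mask[token_id] = True
        cumulative += probs[token_id]
        if cumulative >= p:
            break
    return mask


def apply_mask(probs: List[float], mask: List[bool]) -> List[float]:
    """Zero out masked-out actions and renormalize the remaining mass."""
    kept = [q if keep else 0.0 for q, keep in zip(probs, mask)]
    total = sum(kept)
    return [q / total for q in kept]


# Hypothetical next-token distributions over a 4-token vocabulary.
masking_probs = [0.5, 0.3, 0.15, 0.05]   # distribution that defines the mask
policy_probs = [0.4, 0.3, 0.2, 0.1]      # distribution being trained

mask = top_p_mask(masking_probs, p=0.9)
restricted = apply_mask(policy_probs, mask)
```

With `p=0.9` the least likely token is excluded, so the policy samples from three actions instead of four. On a real vocabulary of tens of thousands of tokens the same rule removes most of the action space at each step.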

GRUE (General Reinforced-language Understanding Eval)

  • GRUE is a collection of 7 generative NLP tasks
  • Each task is evaluated at test time according to a task-specific mix of metrics
  • Metrics span two categories: task preference and naturalness
  • Models are free to use supervised data and compute metrics on intermediate generations
  • 3 algorithms for direct fine-tuning are compared: Supervised, PPO, and NLPO
  • Hybrid approach of supervised learning and RL methods is tested
  • Zero-shot evaluations are run with no training data or parameter updates
  • RL algorithms are tested on GRUE benchmark
  • Human participant study is conducted to validate automated metrics
  • Human judgments generally match those seen in automated metrics
  • Automated metrics usually correlate with human judgments if text is above a certain threshold of naturalness
  • Reward hacking behaviors may be undetected by automated metrics but caught by human preference feedback

Preference reward learning, selection, and hacking

  • GRUE benchmark uses average of several measures to evaluate task performance
  • RL models optimize single metric independently
  • Many single metric rewards provide task performance gains over supervised methods
  • KL constraint balances task-specific reward with base LM
  • If KL constraint is removed, models reward hack
  • NLPO outperforms PPO and the supervised baseline
  • Human feedback model improves alignment to human preferences

Data budget: improve your reward or gather more demonstration?

  • Fixed data collection budget
  • IMDB text continuation task
  • Model given partial movie review and asked to continue positively
  • DistilBERT classifier trained on sentiment labels
  • Trade-off between gathering sentiment labels or positive sentiment reviews
  • More training data improves test accuracy and reward quality
  • Learned reward function more data efficient than expert demonstrations

Practical considerations: which implementation details matter most?

  • Generation is modeled as a token-level MDP, not a bandit environment
  • Recent works tune LMs using RL by calculating a reward for all tokens in a sentence
  • Setting the discount factor γ = 0.95 reduces the magnitude of the reward applied to tokens selected at the beginning
  • Dropout and sampling methods are critical for stability of RL training
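
The effect of the discount factor on early tokens can be checked with a short calculation. The sketch assumes a single reward delivered at the final token and a 20-token generation; both are illustrative choices, and `discounted_credit` is a hypothetical helper.

```python
from typing import List


def discounted_credit(final_reward: float, num_tokens: int,
                      gamma: float) -> List[float]:
    """Return-to-go seen by each token when only the last step is rewarded."""
    last = num_tokens - 1
    return [final_reward * gamma ** (last - t) for t in range(num_tokens)]


credit = discounted_credit(final_reward=1.0, num_tokens=20, gamma=0.95)
first_token_credit = credit[0]    # 0.95 ** 19, roughly 0.377
last_token_credit = credit[-1]    # 1.0, the rewarded step itself

# With gamma = 1.0 every token sees the full reward.
undiscounted = discounted_credit(final_reward=1.0, num_tokens=20, gamma=1.0)
```

Under these assumptions the first token receives a little over a third of the signal that the final token receives, whereas the undiscounted setting credits every token equally.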

Conclusions

  • GRUE benchmark and RL4LMs library can help align language models to human preferences via RL methods
  • Training stability and consistency can lead to better user experiences when interacting with generative models
  • PPO and NLPO are used to train policies
  • Qualification round used to select workers
  • 5 training algorithms benchmarked on 6 tasks
  • Discount factor ablation to understand effect of discounted vs undiscounted environments
  • Evaluation of GPT2 with different algorithms on IMDB sentiment text continuation task
  • Data budget ablations to measure performance differences
  • IMDB instructions, example, and interface used for qualification round and human evaluation experiments
  • Averaged results, annotator agreement, and statistical significance tests to determine which models output better generations
  • Sample generations from each of the algorithms for three randomly picked prompts