Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Learning from human preferences is important for language models to be helpful and useful.
  • Existing works focus on supervised finetuning of pretrained models based on preferred data.
  • Supervised finetuning cannot learn from negative ratings, making it data inefficient.
  • Hindsight Finetuning proposed to make language models learn from diverse human feedback.
  • Hindsight Finetuning motivated by how humans learn from hindsight experience.
  • Applying Hindsight Finetuning to GPT-J improves results on summarization and dialogue tasks.

Paper Content

Introduction

  • Large neural network models are used in many applications
  • Human feedback is important to ensure models align with human values
  • Various methods have been developed to incorporate human feedback
  • Supervised finetuning is used to improve model performance
  • Limitation of supervised finetuning is that it cannot use negative-rated data
  • Chain of Hindsight Finetuning proposed to use both positive-rated and negative-rated data
  • Experiments show CoHF outperforms supervised finetuning
  • Prior work has explored using human feedback to improve various NLP tasks
  • Main techniques behind them can be categorized as supervised finetuning or training and learning a reward function from human feedback for reinforcement learning
  • Our work explores learning from chain of hindsight with human feedback
  • Key idea of learning from hindsight experience was explored in goal conditioned RL
  • Our work proposes algorithm improvements to construct hindsight experience directly from human rated model generations
  • Finetuning on chain of hindsight using human feedback is akin to instruction finetuning

Chain of hindsight finetuning

  • CoHF uses a standard causal, decoder-only Transformer model architecture
  • Goal is to train the Transformer on human rated data to learn to achieve higher human preference scores
  • Human feedback data is in the form of (x, {y i , r i , z i } n i=1 )
  • Rather than conventional SFT methods, CoHF leverages both positive-rated data and negative-rated data
  • Model is trained to predict most preferred data conditioning on less preferred data as well as human feedback and explanations
  • Sequence representation is given by (x, y n )
  • Prevents shortcut by randomly masking 15% of past tokens
  • Prevent overfitting by minimizing the negative log likelihood of the pretraining dataset
  • Training involves sampling minibatches of model outputs and generating hindsight feedback in natural language
  • Prediction is autoregressively predicting the most preferred model output sequence
  • Crossentropy loss is averaged for each timestep in the last model output sequence

Evaluation setup

  • Evaluate performance on standard NLP tasks
  • Use Language Model Evaluation Harness for evaluation
  • Consider two tasks best evaluated with human preference: summarization and dialogue
  • Summarization evaluated on TL;DRs dataset
  • Dialogue evaluated on dataset from Bai et al. (2022a)
  • Metrics for summarization: coverage, accuracy, coherence, overall quality
  • Metrics for dialogue: helpfulness, harmlessness
  • Finetune model on three datasets: WebGPT comparisons, Human Preference, Summarize from feedback
  • Model architecture same as GPT-J (Wang & Komatsuzaki, 2021)
  • Baselines: pretrained model and SFT

Main results

Automatic evaluation

  • Average performance of supervised finetuning decreased after finetuning
  • CoHF improves over pretrained model and supervised finetuned model
  • CoHF significantly outperforms both pretrained model and supervised finetuning
  • CoHF is more effective at learning from human feedback
  • CoHF performs slightly worse than SFT at smaller model sizes, but better at larger model sizes

Human evaluation

  • Human labelers hired to provide ratings for summarization and dialogue tasks
  • Human labelers presented with two summaries, one generated by SFT and one by CoHF
  • Results show CoHF is significantly more preferred by human labelers than SFT
  • Instead of having humans directly chat with finetuned model, data is reused to save costs and improve data quality
  • Results show CoHF is more favorable to human labelers compared to SFT

Model variations

  • Performance decreases when using a large mask ratio
  • Using a more diverse set of hindsight feedback is helpful
  • Model can learn from reversed chain of hindsight
  • Variable length chain of hindsight reduces the gap between training/finetuning and inference
  • Overfitting occurs when pretraining dataset regularization is disabled
  • Model can follow adversarial instructions encoded in chain of hindsight
  • Unlikelihood training on human feedback data may hurt the pretrained model

Conclusion

  • We propose Chain of Hindsight Finetuning (CoHF) to finetune language models with human feedback
  • CoHF can use both negative and positive examples
  • CoHF outperforms supervised finetuning for summarization and dialogue tasks
  • We use Adam optimizer, batch size 512, residual dropout of 0.1, and ฮป equals 1.5
  • We show screenshots of our labeling interface
  • We construct a chain of hindsight sequence from human ranked model generations
  • Model takes task question prompt and chain of hindsight as input and predicts model output
  • CoHF outperforms supervised finetuning on automatic evaluation
  • CoHF scales better than SFT
  • We experiment with two variants of SFT