Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Learning from human preferences is important for language models to be helpful and useful.
- Existing works focus on supervised finetuning of pretrained models based on preferred data.
- Supervised finetuning cannot learn from negative ratings, making it data inefficient.
- Hindsight Finetuning is proposed to let language models learn from diverse human feedback.
- Hindsight Finetuning is motivated by how humans learn from hindsight experience.
- Applying Hindsight Finetuning to GPT-J improves results on summarization and dialogue tasks.
Paper Content
Introduction
- Large neural network models are used in many applications
- Human feedback is important to ensure models align with human values
- Various methods have been developed to incorporate human feedback
- Supervised finetuning is used to improve model performance
- A limitation of supervised finetuning is that it cannot use negatively rated data
- Chain of Hindsight Finetuning (CoHF) is proposed to use both positively and negatively rated data
- Experiments show CoHF outperforms supervised finetuning
Related work
- Prior work has explored using human feedback to improve various NLP tasks
- The main techniques fall into two categories: supervised finetuning (or training), and learning a reward function from human feedback for reinforcement learning
- Our work explores learning from chain of hindsight with human feedback
- Key idea of learning from hindsight experience was explored in goal conditioned RL
- Our work proposes algorithmic improvements that construct hindsight experience directly from human-rated model generations
- Finetuning on chain of hindsight using human feedback is akin to instruction finetuning
Chain of hindsight finetuning
- CoHF uses a standard causal, decoder-only Transformer model architecture
- Goal is to train the Transformer on human rated data to learn to achieve higher human preference scores
- Human feedback data is in the form of $(x, \{y_i, r_i, z_i\}_{i=1}^{n})$
- Unlike conventional SFT methods, CoHF leverages both positively and negatively rated data
- The model is trained to predict the most preferred output, conditioned on less preferred outputs as well as human feedback and explanations
- Sequence representation is given by $(x, y_n)$
- To keep the model from learning shortcuts, 15% of past tokens are randomly masked during training
- To prevent overfitting, the negative log likelihood of the pretraining dataset is also minimized as a regularizer
- Training involves sampling minibatches of model outputs and generating hindsight feedback in natural language
- The model autoregressively predicts the most preferred model output sequence
- Cross-entropy loss is averaged over the timesteps of the last model output sequence
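The training recipe in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `build_chain`, `mask_context`, `masked_nll`, `total_loss`, and the `MASK` token are hypothetical, placing each feedback string before its output is an assumed ordering, masking is applied per token with probability 0.15 rather than to exactly 15% of tokens, and treating λ as the weight on the pretraining term is also an assumption.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def build_chain(prompt, rated):
    """Build one training sequence from (output, rating, feedback) triples.

    Outputs are ordered from least to most preferred, each preceded by its
    feedback tokens (this ordering is an assumption). The returned loss mask
    is 1 only on tokens of the most preferred output.
    """
    ordered = sorted(rated, key=lambda item: item[1])  # ascending rating
    tokens = list(prompt)
    loss_mask = [0] * len(tokens)
    for idx, (output, _rating, feedback) in enumerate(ordered):
        is_last = idx == len(ordered) - 1
        tokens += feedback
        loss_mask += [0] * len(feedback)
        tokens += output
        loss_mask += [1 if is_last else 0] * len(output)
    return tokens, loss_mask


def mask_context(tokens, loss_mask, ratio=0.15, rng=random):
    """Replace each non-target token with MASK with probability `ratio`."""
    return [
        MASK if m == 0 and rng.random() < ratio else t
        for t, m in zip(tokens, loss_mask)
    ]


def masked_nll(token_log_probs, loss_mask):
    """Average negative log likelihood over positions where the mask is 1."""
    picked = [-lp for lp, m in zip(token_log_probs, loss_mask) if m]
    return sum(picked) / len(picked)


def total_loss(coh_nll, pretrain_nll, lam=1.5):
    """Add the pretraining NLL as a regularizer (the weighting is assumed)."""
    return coh_nll + lam * pretrain_nll
```

In a real training loop the per-token log probabilities would come from the Transformer's output on the masked sequence; here they are passed in directly so the loss bookkeeping can be checked in isolation.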
Evaluation setup
- Evaluate performance on standard NLP tasks
- Use Language Model Evaluation Harness for evaluation
- Consider two tasks best evaluated with human preference: summarization and dialogue
- Summarization evaluated on TL;DRs dataset
- Dialogue evaluated on dataset from Bai et al. (2022a)
- Metrics for summarization: coverage, accuracy, coherence, overall quality
- Metrics for dialogue: helpfulness, harmlessness
- Finetune model on three datasets: WebGPT comparisons, Human Preference, Summarize from feedback
- Model architecture same as GPT-J (Wang & Komatsuzaki, 2021)
- Baselines: pretrained model and SFT
Main results
Automatic evaluation
- Average performance decreased after supervised finetuning
- CoHF improves over pretrained model and supervised finetuned model
- CoHF significantly outperforms both pretrained model and supervised finetuning
- CoHF is more effective at learning from human feedback
- CoHF performs slightly worse than SFT at smaller model sizes, but better at larger model sizes
Human evaluation
- Human labelers hired to provide ratings for summarization and dialogue tasks
- Human labelers presented with two summaries, one generated by SFT and one by CoHF
- Results show CoHF is significantly more preferred by human labelers than SFT
- For dialogue, instead of having humans chat directly with the finetuned model, existing data is reused to save costs and improve data quality
- Results show CoHF is more favorable to human labelers compared to SFT
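Pairwise human judgments like these are usually summarized as preference rates. The sketch below shows one way to tally them; the function name `preference_rates`, the `"tie"` option, and the toy labels are illustrative assumptions and do not reproduce the paper's data or results.

```python
from collections import Counter


def preference_rates(choices, labels=("CoHF", "SFT", "tie")):
    """Return the fraction of comparisons won by each label."""
    counts = Counter(choices)
    total = len(choices)
    return {label: counts[label] / total for label in labels}


# Toy judgments, invented purely to exercise the function.
toy_choices = ["CoHF", "CoHF", "SFT", "tie"]
rates = preference_rates(toy_choices)
```

Reporting a tie rate alongside the two win rates keeps undecided comparisons from being silently counted for either model.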
Model variations
- Performance decreases when using a large mask ratio
- Using a more diverse set of hindsight feedback is helpful
- Model can learn from reversed chain of hindsight
- Variable length chain of hindsight reduces the gap between training/finetuning and inference
- Overfitting occurs when pretraining dataset regularization is disabled
- Model can follow adversarial instructions encoded in chain of hindsight
- Unlikelihood training on human feedback data may hurt the pretrained model
Conclusion
- We propose Chain of Hindsight Finetuning (CoHF) to finetune language models with human feedback
- CoHF can use both negative and positive examples
- CoHF outperforms supervised finetuning for summarization and dialogue tasks
- We use Adam optimizer, batch size 512, residual dropout of 0.1, and λ equals 1.5
- We show screenshots of our labeling interface
- We construct a chain of hindsight sequence from human-ranked model generations
- Model takes task question prompt and chain of hindsight as input and predicts model output
- CoHF outperforms supervised finetuning on automatic evaluation
- CoHF scales better than SFT
- We experiment with two variants of SFT