Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduction of Super-NaturalInstructions, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions
- 76 distinct task types, including classification, extraction, infilling, sequence tagging, text rewriting, and text composition
- Tk-Instruct, a transformer model trained to follow a variety of in-context instructions
- Tk-Instruct outperforms existing instruction-following models
- Analysis of generalization as a function of various scaling parameters
Paper Content
Introduction
- NLP community has made progress in building models for generalization to unseen tasks
- Models like InstructGPT are successful, but the contribution of design choices is opaque
- Need for large-scale public benchmarks of NLP tasks and instructions to facilitate research
- Constructed meta-dataset of 1,616 NLP tasks and instructions
- Model Tk-INSTRUCT outperforms InstructGPT by 9.9 ROUGE-L points on 119 unseen English tasks
- Tk-INSTRUCT generates responses at least as well as the ground truth for 77% of testing instances
- Scaling up diversity of training tasks and model size important for strong generalization to unseen tasks
Related work
- Language instructions are versatile for defining goals
- They have been studied in a variety of applications
- This paper focuses on applications for general NLP tasks
- Recent literature has been motivated by building models that are generalizable across NLP tasks
- SUPER-NATURALINSTRUCTIONS is a metadataset consisting of NLP tasks and instructions
- Instructions follow a uniform schema
- Each instance consists of a textual input and a list of acceptable outputs
- Tasks were collected through a community effort on GitHub
- Quality control was done in several phases
- Tasks are categorized by task type, language, and domain
- Experiments and analysis are based on the T5 model
- Instructions are encoded and appended before the input instance
- Recommended recipe for benchmarking generalization is provided
Evaluation setup
- Split large collection of tasks into two subsets: one for evaluation and one for supervision
- Manually selected 12 categories that represent 154 tasks for evaluation
- Variety of tasks at word, sentence, and document levels, covering both classification and generation formats
- Maximum of 100 instances for each task, resulting in 15,310 testing instances in total
- Divided tasks into two tracks: one for English cross-task generalization and one for cross-lingual cross-task generalization
- Adopted ROUGE-L for reporting aggregated performance results
Baselines and existing models
- Evaluated heuristic baselines
- Evaluated existing language models
- Evaluated existing models fine-tuned to follow language instructions
- Estimated upper bound on models’ generalization to unseen tasks
Overall results
- Table 3 summarizes benchmarking results
- Input encoding contains most effective instructional elements
- Performance broken down according to task categories
- Instruction-tuning enables stronger generalization to unseen tasks
- Models learn to follow instructions by finetuning on instruction data
- Tk-INSTRUCT outperforms InstructGPT
- Models generalize best to unseen tasks for both English and non-English tasks
- Gap for improvement between generalization of instruction-based models and supervised training approach
Human evaluation
- Automatic metrics are an approximation of human judgments
- Human evaluation was conducted to confirm findings
- Crowdworkers were asked to indicate if they prefer model predictions or ground truth
- Human evaluation metric indicates how often model predictions were rated as good as ground truth
- Theoretical upper bound of metric is 100% when model is rated as good as ground truth for all instances
- Results of human evaluation align with automatic metrics
Further analysis
- Conducted further analysis to understand important factors for models to generalize across tasks
- Analysis done on English track and using T5-3B checkpoint
- Exceptions for experiments on model sizes
Scaling trends of generalization
- More observed tasks improve generalization
- Generalization performance grows log-linearly with increasing number of tasks
- More training instances do not help generalization
- Tuning larger models with instructions consistently lead to gains
Instructing with different elements
- Evaluated performance of Tk-INSTRUCT under different instructional elements
- SUP-NATINST provides multiple elements for instructing a task
- Trained multiple models with different combinations of these elements
- Performance of models when trained and evaluated on particular instruction encoding
Conclusion
- Constructed a large-scale benchmark of NLP tasks and instructions
- Trained Tk-INSTRUCT using the data
- Demonstrated capability to perform unseen tasks
- Provided analysis to understand important factors for generalization
- Data has notable variety but underlying distributions suffer from skews
- Biased toward English language
- Short responses are over-represented
- Extension of instruction-following setup to other modalities
- Used ROUGE-L as an aggregated metric
- Used Amazon Mechanical Turk for crowdsource feedback
- Simplified instruction schema from NATINST
- Used T5, GPT-3 and InstructGPT for experiments
- Encoded instruction with input
- Used ROUGE-L as automatic evaluation metric
- Studied model’s generalization ability in different senses
- 12 task categories included in the dataset