Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introduction of Super-NaturalInstructions, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions
  • 76 distinct task types, including classification, extraction, infilling, sequence tagging, text rewriting, and text composition
  • Tk-Instruct, a transformer model trained to follow a variety of in-context instructions
  • Tk-Instruct outperforms existing instruction-following models
  • Analysis of generalization as a function of various scaling parameters

Paper Content

Introduction

  • NLP community has made progress in building models for generalization to unseen tasks
  • Models like InstructGPT are successful, but the contribution of design choices is opaque
  • Need for large-scale public benchmarks of NLP tasks and instructions to facilitate research
  • Constructed meta-dataset of 1,616 NLP tasks and instructions
  • Model Tk-INSTRUCT outperforms InstructGPT by 9.9 ROUGE-L points on 119 unseen English tasks
  • Tk-INSTRUCT generates responses at least as well as the ground truth for 77% of testing instances
  • Scaling up the diversity of training tasks and the model size are both important for strong generalization to unseen tasks
  • Language instructions are versatile for defining goals
  • They have been studied in a variety of applications
  • This paper focuses on applications for general NLP tasks
  • Recent literature has been motivated by building models that are generalizable across NLP tasks
  • SUPER-NATURALINSTRUCTIONS is a meta-dataset consisting of NLP tasks and their instructions
  • Instructions follow a uniform schema
  • Each instance consists of a textual input and a list of acceptable outputs
  • Tasks were collected through a community effort on GitHub
  • Quality control was done in several phases
  • Tasks are categorized by task type, language, and domain
  • Experiments and analysis are based on the T5 model
  • Instructions are encoded and prepended to the input instance
  • Recommended recipe for benchmarking generalization is provided
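The schema and encoding described above can be illustrated with a minimal sketch. It assumes an instruction made of a task definition plus a few positive examples; the class names, field names, and prompt template are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Task:
    """Hypothetical container for one task's instruction (names are illustrative)."""
    definition: str
    # Each positive example is an (input, output) pair.
    positive_examples: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Instance:
    """One task instance: a textual input and a list of acceptable outputs."""
    input: str
    outputs: List[str]


def encode(task: Task, instance: Instance, num_examples: int = 2) -> str:
    """Build the model input by placing the instruction before the instance input.

    The template strings below are assumptions made for illustration only.
    """
    parts = [f"Definition: {task.definition}"]
    for i, (ex_in, ex_out) in enumerate(task.positive_examples[:num_examples], 1):
        parts.append(f"Positive Example {i} -\nInput: {ex_in}\nOutput: {ex_out}")
    parts.append(f"Now complete the following example -\nInput: {instance.input}\nOutput:")
    return "\n\n".join(parts)


task = Task(
    definition="Classify the sentiment of the review as positive or negative.",
    positive_examples=[("Great movie!", "positive"), ("Awful plot.", "negative")],
)
instance = Instance(input="I loved the soundtrack.", outputs=["positive"])
prompt = encode(task, instance)
```

The resulting string would be fed to a sequence-to-sequence model such as T5, with any of `instance.outputs` counted as a correct target.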

Evaluation setup

  • Split large collection of tasks into two subsets: one for evaluation and one for supervision
  • Manually selected 12 categories that represent 154 tasks for evaluation
  • Variety of tasks at word, sentence, and document levels, covering both classification and generation formats
  • Maximum of 100 instances for each task, resulting in 15,310 testing instances in total
  • Divided tasks into two tracks: one for English cross-task generalization and one for cross-lingual cross-task generalization
  • Adopted ROUGE-L for reporting aggregated performance results
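As a rough illustration of the metric, ROUGE-L can be computed from the longest common subsequence (LCS) between a prediction and a reference. The sketch below is not the paper's evaluation script: whitespace tokenization, lowercasing, the F1 variant, and taking the maximum over the acceptable outputs are all assumptions made here for illustration.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_instance(prediction: str, references: List[str]) -> float:
    """Score against every acceptable output and keep the best match (assumed)."""
    return max(rouge_l_f1(prediction, r) for r in references)


def aggregate(predictions: List[str], references: List[List[str]]) -> float:
    """Mean instance score on a 0-100 scale."""
    scores = [score_instance(p, refs) for p, refs in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)
```

Because the same function scores both classification and generation outputs, a single aggregated number can be reported across heterogeneous tasks, which is the role ROUGE-L plays in this setup.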

Baselines and existing models

  • Evaluated heuristic baselines
  • Evaluated existing language models
  • Evaluated existing models fine-tuned to follow language instructions
  • Estimated upper bound on models’ generalization to unseen tasks

Overall results

  • Table 3 summarizes benchmarking results
  • Reported results use the input encoding that contains the most effective instructional elements
  • Performance broken down according to task categories
  • Instruction-tuning enables stronger generalization to unseen tasks
  • Models learn to follow instructions by finetuning on instruction data
  • Tk-INSTRUCT outperforms InstructGPT
  • The instruction-tuned Tk-INSTRUCT models generalize best to unseen tasks, for both English and non-English tasks
  • Room for improvement remains: instruction-based models still trail the supervised training approach

Human evaluation

  • Automatic metrics are an approximation of human judgments
  • Human evaluation was conducted to confirm findings
  • Crowdworkers were asked to indicate if they prefer model predictions or ground truth
  • Human evaluation metric indicates how often model predictions were rated as good as ground truth
  • Theoretical upper bound of metric is 100% when model is rated as good as ground truth for all instances
  • Results of human evaluation align with automatic metrics
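The human evaluation metric described above reduces to a simple proportion. The sketch below assumes each judgment has already been collapsed into a boolean (True when the model prediction was rated at least as good as the ground truth); the function name and input format are hypothetical.

```python
from typing import Iterable


def at_least_as_good_rate(judgments: Iterable[bool]) -> float:
    """Percentage of instances where the prediction was rated as good as the ground truth."""
    judgments = list(judgments)
    if not judgments:
        raise ValueError("at least one judgment is required")
    return 100.0 * sum(judgments) / len(judgments)
```

The value reaches its upper bound of 100% only when every instance receives a True judgment.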

Further analysis

  • Conducted further analysis to understand important factors for models to generalize across tasks
  • Analysis done on the English track using the T5-3B checkpoint, except for the experiments on model size
  • More observed tasks improve generalization
  • Generalization performance grows log-linearly with increasing number of tasks
  • More training instances do not help generalization
  • Tuning larger models with instructions consistently leads to gains
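A log-linear trend means the score grows roughly as a + b * log(n), where n is the number of training tasks. The sketch below fits that form by ordinary least squares; the function name is hypothetical, and the inputs must be the reader's own measurements, since no numbers from the paper are reproduced here.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(num_tasks: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    """Fit score = a + b * ln(num_tasks) by least squares and return (a, b)."""
    if len(num_tasks) != len(scores) or len(num_tasks) < 2:
        raise ValueError("need at least two (num_tasks, score) pairs of equal length")
    xs = [math.log(n) for n in num_tasks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("num_tasks must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
```

Under such a fit, each doubling of the number of training tasks adds a constant b * ln(2) to the predicted score, which is why gains look steady on a log-scaled axis but shrink on a linear one.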

Instructing with different elements

  • Evaluated performance of Tk-INSTRUCT under different instructional elements
  • SUP-NATINST provides multiple elements for instructing a task
  • Trained multiple models with different combinations of these elements
  • Reported the performance of each model when trained and evaluated on a particular instruction encoding

Conclusion

  • Constructed a large-scale benchmark of NLP tasks and instructions
  • Trained Tk-INSTRUCT using the data
  • Demonstrated capability to perform unseen tasks
  • Provided analysis to understand important factors for generalization
  • Data has notable variety but underlying distributions suffer from skews
  • Biased toward English language
  • Short responses are over-represented
  • Extending the instruction-following setup to other modalities is left as a future direction
  • Used ROUGE-L as an aggregated metric
  • Used Amazon Mechanical Turk to collect crowdsourced feedback
  • Simplified instruction schema from NATINST
  • Used T5, GPT-3 and InstructGPT for experiments
  • Encoded instruction with input
  • Used ROUGE-L as automatic evaluation metric
  • Studied model’s generalization ability in different senses
  • 12 task categories included in the dataset