Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introduction of Super-NaturalInstructions, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions
76 distinct task types, including classification, extraction, infilling, sequence tagging, text rewriting, and text composition
Tk-Instruct, a transformer model trained to follow a variety of in-context instructions
Tk-Instruct outperforms existing instruction-following models
Analysis of generalization as a function of various scaling parameters

Paper Content

Introduction

NLP community has made progress in building models for generalization to unseen tasks
Models like InstructGPT are successful, but the contribution of design choices is opaque
Need for large-scale public benchmarks of NLP tasks and instructions to facilitate research
Constructed meta-dataset of 1,616 NLP tasks and instructions
Model Tk-INSTRUCT outperforms InstructGPT by 9.9 ROUGE-L points on 119 unseen English tasks
Tk-INSTRUCT generates responses at least as well as the ground truth for 77% of testing instances
Scaling up diversity of training tasks and model size important for strong generalization to unseen tasks

Language instructions are versatile for defining goals
They have been studied in a variety of applications
This paper focuses on applications for general NLP tasks
Recent literature has been motivated by building models that are generalizable across NLP tasks
SUPER-NATURALINSTRUCTIONS is a metadataset consisting of NLP tasks and instructions
Instructions follow a uniform schema
Each instance consists of a textual input and a list of acceptable outputs
Tasks were collected through a community effort on GitHub
Quality control was done in several phases
Tasks are categorized by task type, language, and domain
Experiments and analysis are based on the T5 model
Instructions are encoded and appended before the input instance
Recommended recipe for benchmarking generalization is provided

Evaluation setup

Split large collection of tasks into two subsets: one for evaluation and one for supervision
Manually selected 12 categories that represent 154 tasks for evaluation
Variety of tasks at word, sentence, and document levels, covering both classification and generation formats
Maximum of 100 instances for each task, resulting in 15,310 testing instances in total
Divided tasks into two tracks: one for English cross-task generalization and one for cross-lingual cross-task generalization
Adopted ROUGE-L for reporting aggregated performance results

Baselines and existing models

Evaluated heuristic baselines
Evaluated existing language models
Evaluated existing models fine-tuned to follow language instructions
Estimated upper bound on models’ generalization to unseen tasks

Overall results

Table 3 summarizes benchmarking results
Input encoding contains most effective instructional elements
Performance broken down according to task categories
Instruction-tuning enables stronger generalization to unseen tasks
Models learn to follow instructions by finetuning on instruction data
Tk-INSTRUCT outperforms InstructGPT
Models generalize best to unseen tasks for both English and non-English tasks
Gap for improvement between generalization of instruction-based models and supervised training approach

Human evaluation

Automatic metrics are an approximation of human judgments
Human evaluation was conducted to confirm findings
Crowdworkers were asked to indicate if they prefer model predictions or ground truth
Human evaluation metric indicates how often model predictions were rated as good as ground truth
Theoretical upper bound of metric is 100% when model is rated as good as ground truth for all instances
Results of human evaluation align with automatic metrics

Further analysis

Conducted further analysis to understand important factors for models to generalize across tasks
Analysis done on English track and using T5-3B checkpoint
Exceptions for experiments on model sizes

Scaling trends of generalization

More observed tasks improve generalization
Generalization performance grows log-linearly with increasing number of tasks
More training instances do not help generalization
Tuning larger models with instructions consistently lead to gains

Instructing with different elements

Evaluated performance of Tk-INSTRUCT under different instructional elements
SUP-NATINST provides multiple elements for instructing a task
Trained multiple models with different combinations of these elements
Performance of models when trained and evaluated on particular instruction encoding

Conclusion

Constructed a large-scale benchmark of NLP tasks and instructions
Trained Tk-INSTRUCT using the data
Demonstrated capability to perform unseen tasks
Provided analysis to understand important factors for generalization
Data has notable variety but underlying distributions suffer from skews
Biased toward English language
Short responses are over-represented
Extension of instruction-following setup to other modalities
Used ROUGE-L as an aggregated metric
Used Amazon Mechanical Turk for crowdsource feedback
Simplified instruction schema from NATINST
Used T5, GPT-3 and InstructGPT for experiments
Encoded instruction with input
Used ROUGE-L as automatic evaluation metric
Studied model’s generalization ability in different senses
12 task categories included in the dataset

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Evaluation setup#

Baselines and existing models#

Overall results#

Human evaluation#

Further analysis#

Scaling trends of generalization#

Instructing with different elements#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Evaluation setup

Baselines and existing models

Overall results

Human evaluation

Further analysis

Scaling trends of generalization

Instructing with different elements

Conclusion