Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Instruction tuning enables language models to perform new tasks from natural language descriptions
Unnatural Instructions is a large dataset of creative and diverse instructions, collected with virtually no human labor
Unnatural Instructions contains approximately 240,000 examples of instructions, inputs, and outputs
Training on Unnatural Instructions rivals the effectiveness of training on open-source manually-curated datasets

Paper Content

Introduction

Instruction tuning enables pretrained language models to generalize to unseen tasks in a zero-shot setting
Data can be collected by reformulating existing NLP datasets in an explicit instruction-input-output format
Alternatively, user-generated prompts can be collected
Process requires only 15 instruction examples to produce 64,000 diverse triplets of instructions, inputs, and outputs
More than 50% of generated examples are correct, and incorrect examples contain valuable information
Unnatural Instructions contains highly creative tasks and has a more diverse set of instructions than Super-Natural Instructions
Experiments show that fine-tuning on Unnatural Instructions outperforms other models on several benchmarks
Log-linear relationship between number of generated examples and downstream task performance
Language models can produce creative and diverse data faster and cheaper than human labor

Data collection

Unnatural Instructions is a dataset of 240,670 natural language instructions.
It was collected in an automatic process with only 15 manually-constructed examples.
The dataset was expanded by rephrasing structured instructions in free-form natural language.

Core dataset generation

Core dataset consists of examples in a structured format
Use stochastic decoding to generate example inputs
Use deterministic decoding to generate outputs
Each example contains four fields: instruction, input argument, constraints, and output
Generate examples by prompting a model with three task demonstrations
Apply three automatic filters to remove incorrect examples
Generate outputs by conditioning a pretrained language model with instruction, input argument, and constraints

Template expansion

Examples in the Unnatural Instructions core dataset have a strict instruction-input-output format
To increase format diversity, language model is used to reformulate tasks
Two alternative formulations are collected for each generated task
Input is embedded into the task description using an “{INPUT}” placeholder
More than 97.5% of instructions have two valid and distinct alternative formulations

Data analysis

Creativity is a major challenge when creating a general-purpose instructions dataset
200 examples were manually analyzed
A good experiment should have a single independent variable and multiple dependent variables
Task is to correct errors in a recipe for baking muffins
Task is to determine if characters in a text are static or dynamic
Task is to generate a limerick given two rhyming words
Unnatural Instructions tend to be more diverse than Super-Natural Instructions
Unnatural Instructions inputs tend to be less similar to each other than Super-Natural Instructions

Experimental setup

Used Unnatural Instructions to fine-tune models
Followed standard practice for fine-tuning (batch size of 16, 3 epochs)

Baselines

Compared Unnatural Instructions to a variety of models based on T5-11B
T0++ is an instruction-tuned variant of T5-LM
Compared utility of original manually-curated instructions in Super-Natural Instructions
Trained a model identical to ours but with different data, base model, and training hyperparameters

Evaluation

Evaluated models on four different benchmarks
Measured range of capabilities in a zero-shot setting
Evaluated on Super-Natural Instructions using Rouge-L
Evaluated on T0: Zero-Shot using rank classification and accuracy
Evaluated on “hard” subset of BIGbench in two formats
Evaluated on LMentry using greedy decoding and LMentry score

Results

Table 4 shows that T5-LM finetuned on Unnatural Instructions outperforms several strong instruction-tuned baselines.
Retraining a model on Super-Natural Instructions reveals a much stronger performance than Tk-Instruct.
Unnatural Instructions leads to stronger or equal performance for every dataset except Super-Natural Instructions itself.
T5-LM finetuned on Unnatural Instructions is outperformed by FLAN-T5, but the amount of training data for this model is larger.
Automated data generation with pretrained LMs is a viable and cost-effective alternative to human-curated data.

Performance with template expansion

Adding instruction paraphrases boosts performance on 3 benchmarks.
Models trained on the core dataset rely too much on the specific format and style of Super-Natural Instructions.
Template expansion successfully addresses this issue.

Performance scaling by dataset size

Generating more data adds a valuable signal to training data
Template expansion process is beneficial when controlling for dataset size
Added value of paraphrases is likely to be in terms of format diversity

Performance scaling by cost

Performance is measured as a function of the cost for obtaining training data
OpenAI’s pricing as of December 2022 estimates cost of generating an example at $0.02 for core dataset and $0.01 for expanded dataset
Human annotation cost estimated at $0.50-$1.00 per example, excluding indirect costs
Unnatural Instructions is more cost-efficient than manually curated data

Data collection ablations

Conducted structural prompt ablations to explore effect of different components of data collection pipeline
Trained models for 1,500 steps using 2,000 examples
Evaluated Super-Natural Instructions validation set performance, averaged across three different random seeds

Generative model

Used text-davinci-002, an instruction-tuned variant of GPT-3
Experimented with generating examples using the original (untuned) GPT-3 model
Replacing an instruction-tuned model with a vanilla model affects the quality of the data
Generating outputs requires some level of instruction tuning
Outputs generated by GPT-3 suffer from the model’s inability to stop

Meta-prompts

Language models are sensitive to the meta-prompt.
Three different metaprompt styles were experimented with.
Results show that the enumeration approach elicits more informative examples than the other two.
Surprisingly, the verbose meta-prompt performs worse than the minimalistic one.

In-context examples

Models are sensitive to slight variations in prompt content
Performance differences occur when provided with different demonstrations from the same dataset
Performance differences occur when permuting the in-context demonstrations
Experiments conducted with each of five demonstration sets separately
Data generation pipeline is largely robust to variations in the in-context demonstrations

Constraints

The constraints field details the task’s output space restrictions.
Adding the constraints field may emphasize these restrictions and steer the output generation model.
Two ablation experiments were conducted to verify the hypothesis.
Removing the constraints field from the data generation pipeline reduces performance.

Two-step process

An alternative to the two-step pipeline is to generate instruction-input-output triplets in one pass.
One-step generation obtains a score that is lower by 1.7 than the default two-step process.
The gap is likely due to using stochastic decoding in the unified input-output generation phase.

Instruction Tuning proposed by Efrat and Levy (2020) uses natural language instructions to learn new tasks.
State-of-the-art performance on NLI benchmarks achieved by Dagan et al. (2006) and Bowman et al. (2015).
Liu et al. (2022a) use human annotators and GPT-3 to create NLI examples.
Schick and Schütze (2021b) use pretrained language models to generate labeled text pairs.
Agrawal et al. (2022) use pretrained language models to create multilingual QA data.
Unnatural Instructions is the first work to generate a large-scale general-purpose dataset.

Conclusion

Introduce Unnatural Instructions, an automatically generated dataset of natural language instructions and their corresponding inputs and outputs
Experiments show models trained on Unnatural Instructions can outperform models trained on manually annotated datasets
Unnatural Instructions is cost-effective and provides enhanced diversity in instructions and a high level of creativity in tasks
Models without instruction tuning can generate useful instructions, but struggle with producing outputs
Utilizing models for general-purpose data generation is an intriguing direction for future research
Hyperparameters used for finetuning experiments with T5-LM
Evaluate model performance on Super-Natural Instructions, T0: Zero-Shot and LMEntry
Cluster instructions into task types and measure the number of unique task types
Compare Unnatural Instructions with Super-Natural Instructions
Performance of several models on four benchmarks
Performance of 11B T5-LM models trained on 2,000 examples on Super-Natural Instructions validation set
Human-and-model-in-the-loop dataset creation suggested by Nie et al. (2020)

Link to paper#

Abstract#

Paper Content#

Introduction#

Data collection#

Core dataset generation#

Template expansion#

Data analysis#

Experimental setup#

Baselines#

Evaluation#

Results#

Performance with template expansion#

Performance scaling by dataset size#

Performance scaling by cost#

Data collection ablations#

Generative model#

Meta-prompts#

In-context examples#

Constraints#

Two-step process#

Related work#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Data collection

Core dataset generation

Template expansion

Data analysis

Experimental setup

Baselines

Evaluation

Results

Performance with template expansion

Performance scaling by dataset size

Performance scaling by cost

Data collection ablations

Generative model

Meta-prompts

In-context examples

Constraints

Two-step process

Related work

Conclusion