Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Instruction tuning enables language models to perform new tasks from natural language descriptions
- Unnatural Instructions is a large dataset of creative and diverse instructions, collected with virtually no human labor
- Unnatural Instructions contains approximately 240,000 examples of instructions, inputs, and outputs
- Training on Unnatural Instructions rivals the effectiveness of training on open-source manually-curated datasets
Paper Content
Introduction
- Instruction tuning enables pretrained language models to generalize to unseen tasks in a zero-shot setting
- Data can be collected by reformulating existing NLP datasets in an explicit instruction-input-output format
- Alternatively, user-generated prompts can be collected
- Process requires only 15 instruction examples to produce 64,000 diverse triplets of instructions, inputs, and outputs
- More than 50% of generated examples are correct, and incorrect examples contain valuable information
- Unnatural Instructions contains highly creative tasks and has a more diverse set of instructions than Super-Natural Instructions
- Experiments show that fine-tuning on Unnatural Instructions outperforms other models on several benchmarks
- Log-linear relationship between number of generated examples and downstream task performance
- Language models can produce creative and diverse data faster and cheaper than human labor
Data collection
- Unnatural Instructions is a dataset of 240,670 natural language instructions.
- It was collected in an automatic process with only 15 manually-constructed examples.
- The dataset was expanded by rephrasing structured instructions in free-form natural language.
Core dataset generation
- Core dataset consists of examples in a structured format
- Use stochastic decoding to generate example inputs
- Use deterministic decoding to generate outputs
- Each example contains four fields: instruction, input argument, constraints, and output
- Generate examples by prompting a model with three task demonstrations
- Apply three automatic filters to remove incorrect examples
- Generate outputs by conditioning a pretrained language model with instruction, input argument, and constraints
Template expansion
- Examples in the Unnatural Instructions core dataset have a strict instruction-input-output format
- To increase format diversity, language model is used to reformulate tasks
- Two alternative formulations are collected for each generated task
- Input is embedded into the task description using an “{INPUT}” placeholder
- More than 97.5% of instructions have two valid and distinct alternative formulations
Data analysis
- Creativity is a major challenge when creating a general-purpose instructions dataset
- 200 examples were manually analyzed
- A good experiment should have a single independent variable and multiple dependent variables
- Task is to correct errors in a recipe for baking muffins
- Task is to determine if characters in a text are static or dynamic
- Task is to generate a limerick given two rhyming words
- Unnatural Instructions tend to be more diverse than Super-Natural Instructions
- Unnatural Instructions inputs tend to be less similar to each other than Super-Natural Instructions
Experimental setup
- Used Unnatural Instructions to fine-tune models
- Followed standard practice for fine-tuning (batch size of 16, 3 epochs)
Baselines
- Compared Unnatural Instructions to a variety of models based on T5-11B
- T0++ is an instruction-tuned variant of T5-LM
- Compared utility of original manually-curated instructions in Super-Natural Instructions
- Trained a model identical to ours but with different data, base model, and training hyperparameters
Evaluation
- Evaluated models on four different benchmarks
- Measured range of capabilities in a zero-shot setting
- Evaluated on Super-Natural Instructions using Rouge-L
- Evaluated on T0: Zero-Shot using rank classification and accuracy
- Evaluated on “hard” subset of BIGbench in two formats
- Evaluated on LMentry using greedy decoding and LMentry score
Results
- Table 4 shows that T5-LM finetuned on Unnatural Instructions outperforms several strong instruction-tuned baselines.
- Retraining a model on Super-Natural Instructions reveals a much stronger performance than Tk-Instruct.
- Unnatural Instructions leads to stronger or equal performance for every dataset except Super-Natural Instructions itself.
- T5-LM finetuned on Unnatural Instructions is outperformed by FLAN-T5, but the amount of training data for this model is larger.
- Automated data generation with pretrained LMs is a viable and cost-effective alternative to human-curated data.
Performance with template expansion
- Adding instruction paraphrases boosts performance on 3 benchmarks.
- Models trained on the core dataset rely too much on the specific format and style of Super-Natural Instructions.
- Template expansion successfully addresses this issue.
Performance scaling by dataset size
- Generating more data adds a valuable signal to training data
- Template expansion process is beneficial when controlling for dataset size
- Added value of paraphrases is likely to be in terms of format diversity
Performance scaling by cost
- Performance is measured as a function of the cost for obtaining training data
- OpenAI’s pricing as of December 2022 estimates cost of generating an example at $0.02 for core dataset and $0.01 for expanded dataset
- Human annotation cost estimated at $0.50-$1.00 per example, excluding indirect costs
- Unnatural Instructions is more cost-efficient than manually curated data
Data collection ablations
- Conducted structural prompt ablations to explore effect of different components of data collection pipeline
- Trained models for 1,500 steps using 2,000 examples
- Evaluated Super-Natural Instructions validation set performance, averaged across three different random seeds
Generative model
- Used text-davinci-002, an instruction-tuned variant of GPT-3
- Experimented with generating examples using the original (untuned) GPT-3 model
- Replacing an instruction-tuned model with a vanilla model affects the quality of the data
- Generating outputs requires some level of instruction tuning
- Outputs generated by GPT-3 suffer from the model’s inability to stop
Meta-prompts
- Language models are sensitive to the meta-prompt.
- Three different metaprompt styles were experimented with.
- Results show that the enumeration approach elicits more informative examples than the other two.
- Surprisingly, the verbose meta-prompt performs worse than the minimalistic one.
In-context examples
- Models are sensitive to slight variations in prompt content
- Performance differences occur when provided with different demonstrations from the same dataset
- Performance differences occur when permuting the in-context demonstrations
- Experiments conducted with each of five demonstration sets separately
- Data generation pipeline is largely robust to variations in the in-context demonstrations
Constraints
- The constraints field details the task’s output space restrictions.
- Adding the constraints field may emphasize these restrictions and steer the output generation model.
- Two ablation experiments were conducted to verify the hypothesis.
- Removing the constraints field from the data generation pipeline reduces performance.
Two-step process
- An alternative to the two-step pipeline is to generate instruction-input-output triplets in one pass.
- One-step generation obtains a score that is lower by 1.7 than the default two-step process.
- The gap is likely due to using stochastic decoding in the unified input-output generation phase.
Related work
- Instruction Tuning proposed by Efrat and Levy (2020) uses natural language instructions to learn new tasks.
- State-of-the-art performance on NLI benchmarks achieved by Dagan et al. (2006) and Bowman et al. (2015).
- Liu et al. (2022a) use human annotators and GPT-3 to create NLI examples.
- Schick and Schütze (2021b) use pretrained language models to generate labeled text pairs.
- Agrawal et al. (2022) use pretrained language models to create multilingual QA data.
- Unnatural Instructions is the first work to generate a large-scale general-purpose dataset.
Conclusion
- Introduce Unnatural Instructions, an automatically generated dataset of natural language instructions and their corresponding inputs and outputs
- Experiments show models trained on Unnatural Instructions can outperform models trained on manually annotated datasets
- Unnatural Instructions is cost-effective and provides enhanced diversity in instructions and a high level of creativity in tasks
- Models without instruction tuning can generate useful instructions, but struggle with producing outputs
- Utilizing models for general-purpose data generation is an intriguing direction for future research
- Hyperparameters used for finetuning experiments with T5-LM
- Evaluate model performance on Super-Natural Instructions, T0: Zero-Shot and LMEntry
- Cluster instructions into task types and measure the number of unique task types
- Compare Unnatural Instructions with Super-Natural Instructions
- Performance of several models on four benchmarks
- Performance of 11B T5-LM models trained on 2,000 examples on Super-Natural Instructions validation set
- Human-and-model-in-the-loop dataset creation suggested by Nie et al. (2020)