Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

We studied the design decisions of publicly available instruction tuning methods.
We found that task balancing and enrichment techniques are important for effective instruction tuning.
We showed that Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks.
We made the Flan 2022 collection of datasets, templates, and methods publicly available.

LLMs have unlocked new capabilities in NLP tasks
Instruction tuning further enhances the ability of LLMs
Evaluate open sourced instruction generalization efforts
Flan 2022 Collection offers most extensive publicly available set of tasks and methods
Model trained on this collection outperforms other public collections
Adding few-shot prompts improves zero-shot prompting results
Enriching task diversity and balancing task sources are critical to performance
Flan-T5 model converges faster and at a higher performance than T5 models
Open source the new Flan 2022 task collection, templates, and methods

Several instruction tuning task collections have been released since 2020
Natural Instructions, Flan 2021, PromptSource, and MetaICL consolidated task collections
Second wave of instruction tuning collections expanded prior resources
New directions include synthetic data generation and offering human feedback signals
Instruction tuning on human feedback has strong results but is expensive
This work focuses on instruction generalization without human feedback
Open source data collections are used to democratize accessibility to research

Recent research has yet to coalesce around a unified set of techniques for computer science tasks.
Flan 2022 is a new collection of tasks, combining Flan 2021, P3++3, Super-Natural Instructions, and some additional reasoning, dialog, and program synthesis datasets.
Flan 2022 includes four design components: mixed zero-shot, few-shot, and Chain-of-Thought templates at training, scaling T5-sized models to 1800+ tasks, enriching tasks with input inversion, and balancing task mixtures.
Evaluations are done on 8 “Held-In” tasks, 5 Chain-of-Thought tasks, and 2 “Held-Out” tasks (MMLU and BBH).
Flan-T5 XL outperforms other instruction tuning collections in almost every setting.
Training with mixed zero-and few-shot prompts significantly improves performance in both settings.
Scaling model sizes and tasks for the Flan 2022 collection improves performance on both Held-In and Held-Out tasks.
Input inversion is not beneficial for Held-In performance, but strongly beneficial for Held-Out performance.
Mixture weight balancing is important to optimize results.
Flan 2022 outperforms OPT-IML-Max’s much larger (10x) 30B and (58x) 175B models.