Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- We studied the design decisions of publicly available instruction tuning methods.
- We found that task balancing and enrichment techniques are important for effective instruction tuning.
- We showed that Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks.
- We made the Flan 2022 collection of datasets, templates, and methods publicly available.
Paper Content
Introduction
- LLMs have unlocked new capabilities in NLP tasks
- Instruction tuning further enhances the ability of LLMs
- Evaluate open sourced instruction generalization efforts
- Flan 2022 Collection offers most extensive publicly available set of tasks and methods
- Model trained on this collection outperforms other public collections
- Adding few-shot prompts improves zero-shot prompting results
- Enriching task diversity and balancing task sources are critical to performance
- Flan-T5 model converges faster and at a higher performance than T5 models
- Open source the new Flan 2022 task collection, templates, and methods
Public instruction tuning collections
- Several instruction tuning task collections have been released since 2020
- Natural Instructions, Flan 2021, PromptSource, and MetaICL consolidated task collections
- Second wave of instruction tuning collections expanded prior resources
- New directions include synthetic data generation and offering human feedback signals
- Instruction tuning on human feedback has strong results but is expensive
- This work focuses on instruction generalization without human feedback
- Open source data collections are used to democratize accessibility to research
Flan 2022 instruction tuning experiments
- Recent research has yet to coalesce around a unified set of techniques for computer science tasks.
- Flan 2022 is a new collection of tasks, combining Flan 2021, P3++3, Super-Natural Instructions, and some additional reasoning, dialog, and program synthesis datasets.
- Flan 2022 includes four design components: mixed zero-shot, few-shot, and Chain-of-Thought templates at training, scaling T5-sized models to 1800+ tasks, enriching tasks with input inversion, and balancing task mixtures.
- Evaluations are done on 8 “Held-In” tasks, 5 Chain-of-Thought tasks, and 2 “Held-Out” tasks (MMLU and BBH).
- Flan-T5 XL outperforms other instruction tuning collections in almost every setting.
- Training with mixed zero-and few-shot prompts significantly improves performance in both settings.
- Scaling model sizes and tasks for the Flan 2022 collection improves performance on both Held-In and Held-Out tasks.
- Input inversion is not beneficial for Held-In performance, but strongly beneficial for Held-Out performance.
- Mixture weight balancing is important to optimize results.
- Flan 2022 outperforms OPT-IML-Max’s much larger (10x) 30B and (58x) 175B models.