Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • We studied the design decisions of publicly available instruction tuning methods.
  • We found that task balancing and enrichment techniques are important for effective instruction tuning.
  • We showed that Flan-T5 needs less finetuning than T5 on single downstream tasks, converging faster and to higher performance.
  • We made the Flan 2022 collection of datasets, templates, and methods publicly available.

Paper Content

Introduction

  • LLMs have unlocked new capabilities in NLP tasks
  • Instruction tuning further enhances the ability of LLMs
  • This work evaluates open-source instruction generalization efforts
  • The Flan 2022 Collection offers the most extensive publicly available set of tasks and methods
  • A model trained on this collection outperforms models trained on other public collections
  • Adding few-shot prompts improves zero-shot prompting results
  • Enriching task diversity and balancing task sources are critical to performance
  • Flan-T5 converges faster and to higher performance than T5
  • The new Flan 2022 task collection, templates, and methods are open-sourced

Public instruction tuning collections

  • Several instruction tuning task collections have been released since 2020
  • Natural Instructions, Flan 2021, PromptSource, and MetaICL consolidated task collections
  • Second wave of instruction tuning collections expanded prior resources
  • New directions include synthetic data generation and offering human feedback signals
  • Instruction tuning on human feedback has strong results but is expensive
  • This work focuses on instruction generalization without human feedback
  • Open-source data collections help democratize access to research

Flan 2022 instruction tuning experiments

  • Recent research has yet to coalesce around a unified set of instruction tuning techniques.
  • Flan 2022 is a new collection of tasks, combining Flan 2021, P3++, Super-Natural Instructions, and some additional reasoning, dialog, and program synthesis datasets.
  • Flan 2022 includes four design components: mixed zero-shot, few-shot, and Chain-of-Thought templates at training, scaling T5-sized models to 1800+ tasks, enriching tasks with input inversion, and balancing task mixtures.
  • Evaluations are done on 8 “Held-In” tasks, 5 Chain-of-Thought tasks, and 2 “Held-Out” benchmarks (MMLU and BBH).
  • Flan-T5 XL, trained on the Flan 2022 collection, outperforms models trained on other instruction tuning collections in almost every setting.
  • Training with mixed zero- and few-shot prompts significantly improves performance in both settings.
  • Scaling model sizes and tasks for the Flan 2022 collection improves performance on both Held-In and Held-Out tasks.
  • Input inversion is not beneficial for Held-In performance, but strongly beneficial for Held-Out performance.
  • Mixture weight balancing is important to optimize results.
  • Flan-T5 XL trained on Flan 2022 outperforms OPT-IML-Max’s much larger (10x) 30B and (58x) 175B models.
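
Three of the design components listed above (mixed zero-shot and few-shot templates, input inversion, and mixture balancing) can be illustrated with a small sketch. This is not the released Flan 2022 code: the prompt layout, the function names, the `few_shot_rate` parameter, and the example weights are all assumptions made up for illustration.

```python
import random


def zero_shot(instruction, x):
    """Format one input as a zero-shot prompt (layout is hypothetical)."""
    return f"{instruction}\n\nInput: {x}\nOutput:"


def few_shot(instruction, exemplars, x):
    """Prepend solved (input, output) exemplars before the query input."""
    shots = "\n\n".join(
        f"Input: {ex_in}\nOutput: {ex_out}" for ex_in, ex_out in exemplars
    )
    return f"{instruction}\n\n{shots}\n\nInput: {x}\nOutput:"


def invert(example):
    """Input inversion: swap (x, y) so the model learns to produce x from y.

    The inverted pair needs its own instruction, e.g. "Write a question
    for this answer." instead of "Answer the question."
    """
    x, y = example
    return (y, x)


def make_training_pair(instruction, example, pool, rng, few_shot_rate=0.5, k=2):
    """Render one example as a zero-shot or few-shot prompt at random.

    few_shot_rate is an assumed knob: the fraction of training examples
    rendered with k exemplars drawn from the same task's pool.
    """
    x, y = example
    if rng.random() < few_shot_rate and len(pool) >= k:
        return few_shot(instruction, rng.sample(pool, k), x), y
    return zero_shot(instruction, x), y


def sample_source(weights, rng):
    """Pick a task source with probability proportional to its mixture weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

A usage example under the same assumptions: `sample_source({"flan2021": 0.4, "p3": 0.3, "sni": 0.3}, random.Random(0))` picks which source the next training example comes from, and `make_training_pair` then decides whether that example is shown with or without exemplars. The weights here are placeholders, not the values used in the paper.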