Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Instruction tuning is a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions.
MultiInstruct is the first multimodal instruction tuning benchmark dataset that consists of 47 diverse multimodal tasks.
OFA is the base pre-trained model for multimodal instruction tuning and multiple transfer learning strategies are explored to leverage the large-scale Natural Instructions dataset.
A new evaluation metric, Sensitivity, is designed to evaluate how sensitive the model is to the variety of instructions.

Paper Content

Introduction

Advances in large-scale pre-trained language models (PLMs) have enabled efficient learning paradigms to generalize PLMs to new tasks without task-specific tuning.
Instruction tuning has achieved success in zero-shot learning on natural language processing tasks.
Multimodal pretraining has shown potential for jointly interpreting text and images in a shared semantic space.
MULTIINSTRUCT is a benchmark dataset for multimodal instruction tuning with 47 diverse tasks from 11 broad categories.
OFA, a pre-trained multimodal language model, is fine-tuned on MULTIINSTRUCT.
Transfer learning strategies, such as Mixed Instruction Tuning, Sequential Instruction Tuning, and Adapter-based Sequential Instruction Tuning, are explored.
A new metric, Sensitivity, is developed to measure how sensitive the model is to the variety of instructions for the same task.

Related work

Multimodal pretraining has advanced downstream vision-language tasks
Several studies have explored building a unified pre-training framework to handle a diverse set of cross-modal and unimodal tasks
Efficient language model tuning strategies have been proposed to improve generalizability and adaptivity of large-scale pre-trained language models
Prompt tuning and in-context learning are two strategies used to improve generalizability
Instruction tuning is another strategy used to improve generalizability
MULTIINSTRUCT is a new multi-modal instruction tuning dataset with 47 tasks derived from 54 datasets

Multimodal task and data collection

MULTIINSTRUCT dataset covers a range of multimodal tasks that require reasoning among regions, image, and text
26 tasks collected from existing studies in visual and multimodal learning
Tasks include Visual Question Answering, Image Captioning, Grounded Generation, Image-Text Matching, Grounded Matching, Visual Relationship, Image Attributes, Image Generation, Commonsense Reasoning, Temporal Ordering, and Miscellaneous
21 additional tasks derived from the 26 existing tasks
47 tasks divided into 11 broad categories
Training and evaluation tasks split based on criteria of similarity to pre-training tasks and complexity

Task instruction creation

Definition of “instruction” provided
Instructions contain placeholders that are filled with input information from the original task
Iterative annotation process involving two expert annotators
Annotators provided with clear and detailed information about the task and dataset
Annotators review instructions and identify discrepancies or inconsistencies
Final set of instructions created for each task

Multimodal instruction formatting

Represent images, text, and bounding box coordinates as tokens in a unified vocabulary
Apply byte-pair encoding (BPE) to encode text input
Apply VQ-GAN to generate discrete image tokens
Represent regions or bounding boxes of an image by discretizing the four corner coordinates
Formulate tasks as natural language sequence-to-sequence generation problems
Input includes an image and an instruction containing placeholders
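
The bounding-box step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a box is given as pixel coordinates (x1, y1, x2, y2), that each coordinate is mapped to one of 1000 equal-width bins, and that location tokens are spelled `<bin_k>`. The function name, bin count, and token spelling are all assumptions made for the example.

```python
def discretize_box(box, image_width, image_height, num_bins=1000):
    """Map a pixel-space box (x1, y1, x2, y2) to four location-token strings.

    Assumption: coordinates are binned uniformly along each image axis.
    """
    x1, y1, x2, y2 = box
    tokens = []
    for value, size in ((x1, image_width), (y1, image_height),
                        (x2, image_width), (y2, image_height)):
        # Floor division keeps the arithmetic exact for integer pixels.
        bin_index = int(value * num_bins // size)
        # Clamp so that a coordinate on the image border stays in range.
        bin_index = max(0, min(bin_index, num_bins - 1))
        tokens.append(f"<bin_{bin_index}>")
    return tokens


# Hypothetical example: a box inside a 640x480 image.
box_tokens = discretize_box((32, 48, 320, 240), image_width=640, image_height=480)
print(" ".join(box_tokens))
```

Once a region is written as tokens like these, it can be placed in the input or output sequence alongside text tokens, which is what lets grounded tasks be posed as sequence-to-sequence generation.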

Problem setup

Followed same instruction tuning setting as previous studies
Evaluated zero-shot learning capabilities of finetuned large language models
Pre-trained multimodal language model M finetuned on collection of instruction tasks T
Each task t associated with number of training instances
Input information from each instance x^t_j (the j-th instance of task t) used to fill in placeholders in instruction
OFA used as pretrained multimodal model
Finetuned on MULTIINSTRUCT dataset
Transformer encoder used to encode instruction with necessary filled information and optional image
Training dataset contains many tasks, all instances mixed and randomly shuffled
Parameter-efficient instruction tuning strategy introduced with adapters
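
The data preparation described above (filling placeholders, then mixing and shuffling instances from all tasks) might look roughly like the sketch below. It is illustrative only: the `{text}` and `{region}` placeholder spelling, the dictionary layout, the toy tasks, and the choice to sample one instruction per instance at random are all assumptions, not details confirmed by this summary.

```python
import random


def fill_instruction(template, fields):
    """Replace each {name} placeholder in the template with its value."""
    return template.format(**fields)


def build_mixed_training_set(tasks, seed=0):
    """Pair every instance with one instruction, then shuffle across tasks."""
    rng = random.Random(seed)
    examples = []
    for task_name, task in tasks.items():
        for instance in task["instances"]:
            # Assumption: one instruction is sampled at random per instance.
            template = rng.choice(task["instructions"])
            examples.append({
                "task": task_name,
                "image": instance.get("image"),  # the image is optional
                "source": fill_instruction(template, instance["fields"]),
                "target": instance["target"],
            })
    rng.shuffle(examples)  # instances from all tasks end up interleaved
    return examples


# Hypothetical toy tasks, for illustration only.
toy_tasks = {
    "vqa": {
        "instructions": ["Answer the question: {text}",
                         "Look at the image and answer: {text}"],
        "instances": [
            {"image": "img_001.jpg", "fields": {"text": "What color is the car?"},
             "target": "red"},
            {"image": "img_002.jpg", "fields": {"text": "How many dogs are there?"},
             "target": "two"},
        ],
    },
    "grounded_captioning": {
        "instructions": ["Describe the content of region {region}."],
        "instances": [
            {"image": "img_003.jpg", "fields": {"region": "10 20 110 220"},
             "target": "a parked bicycle"},
        ],
    },
}

training_examples = build_mixed_training_set(toy_tasks, seed=0)
for example in training_examples:
    print(example["task"], "|", example["source"], "->", example["target"])
```

Each resulting `source` string, together with the optional image, is what the encoder would consume; `target` is the sequence the decoder is trained to produce.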

Evaluation metrics

Evaluated model on each task using 5 different instructions
Reported mean and maximum performance
Computed aggregated performance for each model
Proposed new metric to evaluate model sensitivity to instruction variations
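
A small sketch of how these quantities could be computed from per-instruction scores follows. The summary above does not state the Sensitivity formula, so the definition used here is an assumption: for each task, the standard deviation of the scores across instructions divided by their mean, averaged over tasks. The use of the population standard deviation and the handling of a zero mean are also assumptions.

```python
from statistics import mean, pstdev


def summarize_task(scores):
    """Mean and maximum performance of one task across its instructions."""
    return {"mean": mean(scores), "max": max(scores)}


def sensitivity(scores_by_task):
    """Average over tasks of (std across instructions) / (mean across instructions).

    Assumption: this ratio is the intended notion of Sensitivity.
    Lower values mean the model behaves more consistently across wordings.
    """
    ratios = []
    for scores in scores_by_task.values():
        average = mean(scores)
        # Guard against division by zero when every score is 0.
        ratios.append(pstdev(scores) / average if average else 0.0)
    return mean(ratios)


# Hypothetical scores: one value per instruction, five instructions per task.
toy_scores = {
    "task_a": [10, 10, 10, 10, 10],  # identical results for every instruction
    "task_b": [4, 4, 4, 4, 9],       # one instruction performs very differently
}

for name, scores in toy_scores.items():
    print(name, summarize_task(scores))
print("sensitivity:", sensitivity(toy_scores))
```

Dividing by the mean makes the value comparable across tasks whose metrics live on different scales, which is one reason a ratio is a plausible design for such a metric.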

Approaches for comparison

OFA is a pre-trained model with 472M parameters
OFA has demonstrated zero-shot capability on unseen multimodal tasks
OFA was fine-tuned on two datasets with instruction tuning
To ensure fairness, evaluation used instruction templates from which specific tokens had been removed
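
The difference between the Mixed and Sequential strategies named in the introduction can be shown as two data schedules. This is a sketch under stated assumptions: Mixed is taken to mean one training stage over the shuffled union of the text-only and multimodal examples, and Sequential is taken to mean a text-only stage followed by a multimodal stage. The function names and the list-of-stages representation are hypothetical, and the adapter-based variant is not shown.

```python
import random


def mixed_schedule(text_examples, multimodal_examples, seed=0):
    """One stage: both datasets merged into a single pool and shuffled."""
    rng = random.Random(seed)
    combined = list(text_examples) + list(multimodal_examples)
    rng.shuffle(combined)
    return [combined]


def sequential_schedule(text_examples, multimodal_examples, seed=0):
    """Two stages: text-only examples first, multimodal examples second."""
    rng = random.Random(seed)
    stage_one = list(text_examples)
    stage_two = list(multimodal_examples)
    rng.shuffle(stage_one)
    rng.shuffle(stage_two)
    return [stage_one, stage_two]


# Hypothetical example identifiers standing in for real training instances.
text_only = [f"text_{i}" for i in range(4)]
multimodal = [f"mm_{i}" for i in range(3)]

mixed_stages = mixed_schedule(text_only, multimodal)
sequential_stages = sequential_schedule(text_only, multimodal)

for stage_number, stage in enumerate(sequential_stages, start=1):
    print(f"sequential stage {stage_number}: {stage}")
print(f"mixed stage 1: {mixed_stages[0]}")
```

A training loop would then iterate over the stages in order, fine-tuning the same model on each stage's examples.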

Training details

Maximum input token length is 1024, maximum target length is 512
Image preprocessing follows OFA
Trained for 3 epochs on 8 Nvidia A100 GPUs with batch size 8, learning rate 1e-05, and float16 enabled
OFA MultiInstruct improves zero-shot performance over original pre-trained OFA model
OFA without instruction tuning failed to generate region-specific tokens
Fine-tuning OFA on NATURAL INSTRUCTIONS degrades zero-shot performance
OFA SeqInstruct achieves similar performance to OFA MultiInstruct
OFA AdapterInstruct tunes 12.6M parameters and achieves comparable or slightly worse performance

Number of multimodal instruction tasks

Performance improves and sensitivity decreases as number of multimodal instruction tasks increases
Low intra-task sensitivity indicates consistent results despite variations in instructions

Effect of diverse instructions on instruction tuning

Hypothesis: Using diverse instructions for multimodal instruction tuning can improve zero-shot performance and reduce intra-task sensitivity.
Results: Finetuning on 5 instructions significantly improves overall zero-shot performance and shows lower sensitivity.
Conclusion: Increasing diversity of instructions is effective and future work could explore crowd-sourcing or automatic generation strategies.
Results: Finetuning and transfer learning strategies reduce model sensitivity.

Conclusion and future work

Presented a new large-scale multimodal instruction tuning benchmark dataset, MULTIINSTRUCT
Covers a wide variety of vision and multimodal tasks
Each task is associated with multiple expert-written instructions
Finetuning OFA (Wang et al., 2022a) on MULTIINSTRUCT with instruction tuning improves zero-shot performance on various unseen multimodal tasks
Explored several transfer learning techniques to leverage the much larger text-only NATURAL INSTRUCTIONS dataset
40 multimodal tasks included in MULTIINSTRUCT
Zero-shot performance on Question Answering, Multimodal Commonsense Reasoning and Miscellaneous datasets
OFA MultiInstruct finetuned on different numbers of instructions
