Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Instruction tuning is a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions.
- MultiInstruct is the first multimodal instruction tuning benchmark dataset that consists of 47 diverse multimodal tasks.
- OFA is the base pre-trained model for multimodal instruction tuning and multiple transfer learning strategies are explored to leverage the large-scale Natural Instructions dataset.
- A new evaluation metric, Sensitivity, is designed to evaluate how sensitive the model is to the variety of instructions.
Paper Content
Introduction
- Advances in large-scale pre-trained language models (PLMs) have enabled efficient learning paradigms to generalize PLMs to new tasks without task-specific tuning.
- Instruction tuning has achieved success in zero-shot learning on natural language processing tasks.
- Multimodal pretraining has shown potential for jointly interpreting text and images in a shared semantic space.
- MULTIINSTRUCT is a benchmark dataset for multimodal instruction tuning with 47 diverse tasks from 11 broad categories.
- OFA is a pre-trained multimodal language model used to fine-tune MULTIINSTRUCT.
- Transfer learning strategies, such as Mixed Instruction Tuning, Sequential Instruction Tuning, and Adapter-based Sequential Instruction Tuning, are explored.
- A new metric -Sensitivity- is developed to measure how sensitive the model is toward the variety of instructions for the same task.
Related work
- Multimodal pretraining has advanced downstream vision-language tasks
- Several studies have explored building a unified pre-training framework to handle a diverse set of cross-modal and unimodal tasks
- Efficient language model tuning strategies have been proposed to improve generalizability and adaptivity of large-scale pre-trained language models
- Prompt tuning and in-context learning are two strategies used to improve generalizability
- Instruction tuning is another strategy used to improve generalizability
- MULTIINSTRUCT is a new multi-modal instruction tuning dataset with 47 tasks derived from 54 datasets
Multimodal task and data collection
- MULTIINSTRUCT dataset covers a range of multimodal tasks that require reasoning among regions, image, and text
- 26 tasks collected from existing studies in visual and multimodal learning
- Tasks include Visual Question Answering, Image Captioning, Grounded Generation, Image-Text Matching, Grounded Matching, Visual Relationship, Image Attributes, Image Generation, Commonsense Reasoning, Temporal Ordering, and Miscellaneous
- 21 additional tasks derived from the 26 existing tasks
- 47 tasks divided into 11 broad categories
- Training and evaluation tasks split based on criteria of similarity to pre-training tasks and complexity
Task instruction creation
- Definition of “instruction” provided
- Placeholders , and mentioned
- Iterative annotation process involving two expert annotators
- Annotators provided with clear and detailed information about the task and dataset
- Annotators review instructions and identify discrepancies or inconsistencies
- Final set of instructions created for each task
Multimodal instruction formatting
- Represent images, text, and bounding box coordinates as tokens in a unified vocabulary
- Apply bytepair encoding (BPE) to encode text input
- Apply VQ-GAN to generate discrete image tokens
- Represent regions or bounding boxes of an image by discretizing the four corner coordinates
- Formulate tasks as natural language sequence-to-sequence generation problems
- Input includes an image and an instruction with placeholders such as , or
Problem setup
- Followed same instruction tuning setting as previous studies
- Evaluated zero-shot learning capabilities of finetuned large language models
- Pre-trained multimodal language model M finetuned on collection of instruction tasks T
- Each task t associated with number of training instances
- Input information from x t j used to fill in placeholders in instruction
- OFA used as pretrained multimodal model
- Finetuned on MULTIINSTRUCT dataset
- Transformer encoder used to encode instruction with necessary filled information and optional image
- Training dataset contains many tasks, all instances mixed and randomly shuffled
- Parameter-efficient instruction tuning strategy introduced with adapters
Evaluation metrics
- Evaluated model using 5 instructions
- Reported mean and maximum performance
- Computed aggregated performance for each model
- Proposed new metric to evaluate model sensitivity to instruction variations
Approaches for comparison
- OFA is a pre-trained model with 472M parameters
- OFA has demonstrated zero-shot capability on unseen multimodal tasks
- OFA was fine-tuned on two datasets with instruction tuning
- Evaluation was done on instruction templates with removed specific tokens to ensure fairness
Training details
- Maximum input token length is 1024, maximum target length is 512
- Image preprocessing follows OFA
- Trained on 8 Nvidia A100 GPUs with batch size 8, learning rate 1e-05, float16 enabled for 3 epochs
- OFA MultiInstruct improves zero-shot performance over original pre-trained OFA model
- OFA without instruction tuning failed to generate region-specific tokens
- Fine-tuning OFA on NATURAL INSTRUCTIONS degrades zero-shot performance
- OFA SeqInstruct achieves similar performance to OFA MultiInstruct
- OFA AdapterInstruct tunes 12.6M parameters and achieves comparable or slightly worse performance
Number of multimodal instruction tasks
- Performance improves and sensitivity decreases as number of multimodal instruction tasks increases
- Low intra-task sensitivity indicates consistent results despite variations in instructions
Effect of diverse instructions on instruction tuning
- Hypothesis: Using diverse instructions for multimodal instruction tuning can improve zero-shot performance and reduce intra-task sensitivity.
- Results: Finetuning on 5 instructions significantly improves overall zeroshot performance and shows lower sensitivity.
- Conclusion: Increasing diversity of instructions is effective and future work could explore crowd-sourcing or automatic generation strategies.
- Results: Finetuning and transfer learning strategies reduce model sensitivity.
Conclusion and future work
- Presented a new large-scale multimodal instruction tuning benchmark dataset -MULTIINSTRUCT
- Covers a wide variety of vision and multimodal tasks
- Each task is associated with multiple expert-written instructions
- Finetuning OFA (Wang et al., 2022a) on MULTIINSTRUCT with instruction tuning improves zero-shot performance on various unseen multimodal tasks
- Explored several transferring learning techniques to leverage the much larger text-only NAT-URAL INSTRUCTIONS dataset
- 40 multimodal tasks included in MULTIINSTRUCT
- Zero-shot performance on Question Answering, Multimodal Commonsense Reasoning and Miscellaneous datasets
- OFA MultiInstruct finetuned on different numbers of instructions