Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introduce INSTRUCTOR, a new method for computing text embeddings given task instructions
  • INSTRUCTOR is a single embedder that can generate text embeddings tailored to different downstream tasks and domains
  • Annotate instructions for 330 diverse tasks and train INSTRUCTOR on this multitask mixture with a contrastive loss
  • Evaluate INSTRUCTOR on 70 embedding evaluation tasks, ranging from classification to text generation
  • INSTRUCTOR achieves state-of-the-art performance, with an average improvement of 3.4% compared to previous best results
  • INSTRUCTOR is robust to changes in instructions, and instruction finetuning mitigates challenge of training single model on diverse datasets

Paper Content

Introduction

  • Text embeddings represent discrete text inputs as fixed-sized vectors
  • Recently, there have been advances in learning text embeddings
  • Existing embeddings can have degraded performance when applied to new tasks or domains
  • Common method to address this issue is to finetune the embeddings on datasets in downstream tasks and domains
  • Hypothesis: text embeddings can be adjusted to different downstream applications using task and domain descriptions, without further finetuning
  • INSTRUCTOR is a single multitask model that generates task-and domain-aware embeddings
  • INSTRUCTOR is trained on MEDI, a collection of 330 text embedding datasets
  • INSTRUCTOR outperforms prior state-of-the-art embedding models
  • INSTRUCTOR is robust to paraphrases in instructions
  • INSTRUCTOR encodes inputs together with task instructions, providing task-specific representations

Embedding architecture

  • INSTRUCTOR is a single encoder architecture based on prior work
  • GTR models are used as the backbone encoder
  • GTR models are initialized from T5 models and finetuned on information search datasets
  • INSTRUCTOR encodes the concatenation of an input text and a task instruction
  • Mean pooling is applied to the last hidden representations to generate a fixed-sized, task-specific embedding

Training objective

  • INSTRUCTOR is trained by formulating a variety of tasks as a text-to-text problem
  • Training sample consists of (x, Ix, y, Iy)
  • Goodness of candidate y for input x is given by similarity s(x, y)
  • Maximize similarity between positive pairs and minimize negative pairs
  • 330 datasets with instructions across diverse task categories and domains created

Data construction

  • 300 datasets from Super-NaturalInstructions combined with 30 datasets from existing collections
  • Positive and negative pairs constructed using Sentence-T5 embeddings
  • Cosine similarity used to create positive and negative pairs for classification datasets
  • Scores computed for tasks with output labels as text sequences
  • Highest scores used to select positive and hard negative pairs
  • Training data from Super-NaturalInstructions improve instruction robustness
  • 30 datasets from Sentence Transformers, KILT and MedMCQA
  • Unified instruction template developed for datasets without instructions

Experiments

  • Trained INSTRUCTOR on MEDI data and evaluated on 70 downstream tasks
  • Used MTEB benchmark from Muennighoff et al. (2022) which consists of 56 datasets over 7 task categories
  • Applied INSTRUCTOR to prompt retrieval, in-context learning and text generation evaluation
  • Achieved state-of-the-art performance

Embedding evaluations

  • 70 evaluation datasets split into 9 categories by task objectives
  • 66 datasets unseen during training
  • MTEB (Muennighoff et al., 2022) provides a holistic view of current embedding models’ performance
  • Retrieval: query q and corpus D, embedding model used to embed q and p 1 …p n , similarity measured by cosine similarity, NDCG@10 used to measure performance
  • Reranking: query q and list of documents D, embedding model computes embeddings, cosine similarities used to rank documents, MAP used to measure performance
  • Clustering: set of documents, encoder maps each document into an embedding, k-means clustering algorithm used to partition embedded documents into clusters, v-measure used to measure performance
  • Pair Classification: predict binary label for a pair of texts, example is paraphrase identification, cosine similarity used to predict label, average precision score used to measure performance
  • Classification: embedding of input text used as features to a classifier, classifier trained on training data, sentence embeddings kept frozen, classification accuracy on test set used as evaluation metric
  • STS: sentence pair (t 1 , t 2 ), embedding model maps t 1 and t 2 separately, similarity between t 1 and t 2 measured by cosine similarity, Spearman’s rank correlation used to measure performance
  • Summarization: reference summary r and machine-generated summary t, embedding model maps them into embeddings separately, cosine similarity between r and t, Spearman’s rank correlation between human judgements and automatic scores
  • INSTRUCTOR used for automatic evaluations for 3 additional text generation tasks, cosine similarity between generated text and each reference text, Pearson correlation with human judgments reported

Baselines

  • Official MTEB benchmark used for comparisons
  • Two types of baselines: embedding models and semantic textual similarity
  • Embedding models trained on open-domain QA datasets
  • Semantic textual similarity trained on symmetric paraphrase datasets
  • All baselines based on pretrained language models
  • Sent-T5-XXL and GTR-XXL achieve best average performances

Main results

  • INSTRUCTOR achieves the best performance on all three benchmarks on average
  • INSTRUCTOR (335M) demonstrates large improvements over GTR-Large on text evaluation, prompt retrieval, and classification tasks
  • INSTRUCTOR performs better than the previous state-of-the-art model despite having one order of magnitude fewer parameters
  • Retrieval-based models perform well on retrieval and reranking, but lag behind on STS and classification
  • Similarity-based models perform well on STS, classification, and text evaluation, but not on retrieval

Analysis and ablations

  • INSTRUCTOR enables universal text embeddings for many tasks
  • Analyzed results from various perspectives
  • Reported average performance across all categories

Instructions enable diverse training

  • Split MEDI into symmetric and asymmetric groups
  • INSTRUCTOR finetuned without instructions yields similar or better performance than original GTR model
  • INSTRUCTOR suffers if finetuned without task instructions on combination of both types of data
  • Finetuning with instructions enables model to benefit from combination of symmetric and asymmetric data
  • Instruction finetuning important when diverse data used for embedding training

Instruction robustness

  • Previous work shows that instruction-finetuned language models are not robust to paraphrased instructions.
  • Measured INSTRUCTOR’s robustness to variation in human-written instructions.
  • Wrote five paraphrased instructions for all evaluation datasets.
  • Measured INSTRUCTOR’s performance gap between best- and worst-performing instructions.
  • Inclusion of 300 super-NI datasets is critical to the robustness of INSTRUCTOR.
  • Removing these datasets from training increases performance gap between best- and worst-performing instructions.

Complexity of instructions

  • Instructions can improve performance across different task categories.
  • As more information is provided in the instruction, performance improves.
  • As the model size increases, performance increases for both GTR and INSTRUCTOR, with INSTRUCTOR showing more pronounced improvement.

Instructions mitigate domain shifts

  • Instruction-based finetuning improves models’ ability to generalize to unseen domains and tasks.
  • Results in Table 3 show INSTRUCTOR largely improves GTR-Large’s performance on three unseen domains.
  • Domain-specific datasets benefit particularly from instruction finetuning.

Qualititive analysis

  • Used T-SNE to visualize two examples of classification with and without instructions
  • Without instructions, pairs with different sentiment were closer together, and pairs with same sentiment were farther apart
  • With instructions, pairs with same sentiment were close together and correctly classified, and pairs with different sentiment were farther apart
  • Text embeddings are used in many applications
  • Different embedding models are used for different applications
  • MEDI combines symmetric and asymmetric data to train INSTRUCTOR
  • Muennighoff et al. (2022) introduced a benchmark to evaluate embedding models
  • Poor zero-shot transfer abilities of existing embedding models
  • Instruction finetuning has yet to be studied in the context of broadly-applicable embeddings
  • Instructions can facilitate information retrieval
  • INSTRUCTOR can be applied to 8 tasks categories

Conclusion

  • Introduced INSTRUCTOR, a single model that creates broadly-applicable text embeddings using natural language instructions
  • Constructed MEDI, a collection of diverse datasets, to finetune INSTRUCTOR with instructions
  • Experiments showed INSTRUCTOR achieves state-of-the-art performance on text embedding benchmarks
  • Experiments showed INSTRUCTOR achieves prompt retrieval for fewshot in-context learning
  • 330 diverse datasets with human-written task instructions used for training
  • Evaluated on 70 embedding datasets, spanning various downstream applications
  • Outperforms prior best model by an average of 3.4%
  • Robustness to instruction paraphrases improved
  • Performance improves as instructions become more detailed
  • Benefits more from scaling up
  • T-SNE visualization of pair classification examples with and without instruction
  • Performance of model tuned with instructions of various complexity degrees in Billboard and Prompt Retrieval
  • Evaluation includes 70 diverse datasets in 9 different downstream applications