Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introduce INSTRUCTOR, a new method for computing text embeddings given task instructions
INSTRUCTOR is a single embedder that can generate text embeddings tailored to different downstream tasks and domains
Annotate instructions for 330 diverse tasks and train INSTRUCTOR on this multitask mixture with a contrastive loss
Evaluate INSTRUCTOR on 70 embedding evaluation tasks, ranging from classification and information retrieval to semantic textual similarity and text generation evaluation
INSTRUCTOR achieves state-of-the-art performance, with an average improvement of 3.4% compared to previous best results
INSTRUCTOR is robust to changes in instructions, and instruction finetuning mitigates the challenge of training a single model on diverse datasets

Paper Content

Introduction

Text embeddings represent discrete text inputs as fixed-sized vectors
Recently, there have been advances in learning text embeddings
Existing embeddings can have degraded performance when applied to new tasks or domains
A common way to address this issue is to finetune the embeddings on datasets from the downstream tasks and domains
Hypothesis: text embeddings can be adjusted to different downstream applications using task and domain descriptions, without further finetuning
INSTRUCTOR is a single multitask model that generates task- and domain-aware embeddings
INSTRUCTOR is trained on MEDI, a collection of 330 text embedding datasets
INSTRUCTOR outperforms prior state-of-the-art embedding models
INSTRUCTOR is robust to paraphrases in instructions
INSTRUCTOR encodes inputs together with task instructions, providing task-specific representations

Embedding architecture

INSTRUCTOR is a single encoder architecture based on prior work
GTR models are used as the backbone encoder
GTR models are initialized from T5 models and finetuned on information search datasets
INSTRUCTOR encodes the concatenation of an input text and a task instruction
Mean pooling is applied to the last hidden representations to generate a fixed-sized, task-specific embedding
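
The two steps above (concatenate instruction and input, then mean-pool the last hidden states) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the space separator, the example instruction string and the toy hidden states are all assumptions, and a real encoder would supply the hidden states.

```python
from typing import List


def build_input(instruction: str, text: str) -> str:
    """Concatenate instruction and input text (the space separator is an assumption)."""
    return f"{instruction} {text}"


def mean_pool(hidden_states: List[List[float]], mask: List[int]) -> List[float]:
    """Average the hidden vectors whose mask entry is 1 into one fixed-size vector."""
    kept = [vec for vec, m in zip(hidden_states, mask) if m == 1]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Hypothetical instruction and input, for illustration only.
encoder_input = build_input("Represent the sentence for classification:", "great movie")

# Toy stand-in for the encoder's last hidden states: 4 positions, 3 dimensions.
hidden = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [5.0, 4.0, 4.0],
    [9.0, 9.0, 9.0],  # padding position, excluded by the mask below
]
mask = [1, 1, 1, 0]

embedding = mean_pool(hidden, mask)
print(encoder_input)
print(embedding)  # [3.0, 2.0, 2.0]
```

The output length depends only on the hidden dimension, not on the number of tokens, which is what makes the embedding fixed-sized. Which positions the mask keeps (all non-padding tokens, or only the tokens of the input text) is a design choice this sketch leaves open.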

Training objective

INSTRUCTOR is trained by formulating a variety of tasks as a text-to-text problem
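Each training sample is a tuple (x, I_x, y, I_y), where I_x and I_y are the instructions associated with x and y
Goodness of candidate y for input x is given by the similarity s(x, y) between their embeddings
Maximize similarity between positive pairs and minimize similarity between negative pairs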
330 datasets with instructions, spanning diverse task categories and domains, were created for training
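
The objective can be sketched as a softmax-style contrastive loss over cosine similarities. This is one common formulation, assumed here for illustration: the notes above only say that similarity is maximized for positive pairs and minimized for negative pairs. The function names, the toy vectors and the temperature value are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity s(x, y) between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(
    x: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.1,  # illustrative value, an assumption
) -> float:
    """Negative log-probability of the positive candidate among all candidates."""
    logits = [cosine(x, positive) / temperature]
    logits += [cosine(x, n) / temperature for n in negatives]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits)
    )
    return log_denominator - logits[0]


x = [1.0, 0.0]
good = [0.9, 0.1]
bad = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = contrastive_loss(x, good, bad)
loss_swapped = contrastive_loss(x, bad[0], [good, bad[1]])
print(loss_aligned < loss_swapped)  # True
```

The loss is small when the positive candidate is the most similar one and grows when a negative candidate is closer to x than the positive.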

Data construction

300 datasets from Super-NaturalInstructions combined with 30 datasets from existing collections
Positive and negative pairs constructed using Sentence-T5 embeddings
Cosine similarity used to create positive and negative pairs for classification datasets
Scores computed for tasks with output labels as text sequences
Highest scores used to select positive and hard negative pairs
Training data from Super-NaturalInstructions improves instruction robustness
The remaining 30 datasets come from Sentence Transformers, KILT and MedMCQA
Unified instruction template developed for datasets without instructions
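
For classification datasets, the pair construction described above can be sketched as follows: for an anchor example, take the most similar example with the same label as the positive and the most similar example with a different label as the hard negative. The exact selection rule is an assumption made for illustration, `mine_pair` is a hypothetical helper, and the toy vectors stand in for Sentence-T5 embeddings.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pair(
    anchor: int,
    embeddings: List[Sequence[float]],
    labels: List[str],
) -> Tuple[Optional[int], Optional[int]]:
    """Return (positive_index, hard_negative_index) for one anchor example."""
    best_pos: Optional[int] = None
    best_neg: Optional[int] = None
    best_pos_score = -math.inf
    best_neg_score = -math.inf
    for j, (emb, label) in enumerate(zip(embeddings, labels)):
        if j == anchor:
            continue  # an example is never paired with itself
        score = cosine(embeddings[anchor], emb)
        if label == labels[anchor]:
            if score > best_pos_score:
                best_pos, best_pos_score = j, score
        elif score > best_neg_score:
            best_neg, best_neg_score = j, score
    return best_pos, best_neg


# Toy embeddings and labels, for illustration only.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [-1.0, 0.0]]
labels = ["pos", "pos", "pos", "neg", "neg"]

print(mine_pair(0, embeddings, labels))  # (1, 3)
```

Picking the most similar differently-labelled example gives a hard negative: it looks close to the anchor in embedding space but should not be matched with it.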

Experiments

Trained INSTRUCTOR on MEDI data and evaluated on 70 downstream tasks
Used MTEB benchmark from Muennighoff et al. (2022) which consists of 56 datasets over 7 task categories
Applied INSTRUCTOR to prompt retrieval, in-context learning and text generation evaluation
Achieved state-of-the-art performance

Embedding evaluations

70 evaluation datasets split into 9 categories by task objectives
66 datasets unseen during training
MTEB (Muennighoff et al., 2022) provides a holistic view of current embedding models’ performance
Retrieval: query q and corpus D = {p_1, ..., p_n}, embedding model used to embed q and p_1, ..., p_n, similarity measured by cosine similarity, NDCG@10 used to measure performance
Reranking: query q and list of documents D, embedding model computes embeddings, cosine similarities used to rank documents, MAP used to measure performance
Clustering: set of documents, encoder maps each document into an embedding, k-means clustering algorithm used to partition embedded documents into clusters, v-measure used to measure performance
Pair Classification: predict binary label for a pair of texts, example is paraphrase identification, cosine similarity used to predict label, average precision score used to measure performance
Classification: embedding of input text used as features to a classifier, classifier trained on training data, sentence embeddings kept frozen, classification accuracy on test set used as evaluation metric
STS: sentence pair (t_1, t_2), embedding model maps t_1 and t_2 separately, similarity between t_1 and t_2 measured by cosine similarity, Spearman’s rank correlation used to measure performance
Summarization: reference summary r and machine-generated summary t, embedding model maps them into embeddings separately, cosine similarity between r and t, Spearman’s rank correlation between human judgements and automatic scores
INSTRUCTOR used for automatic evaluations for 3 additional text generation tasks, cosine similarity between generated text and each reference text, Pearson correlation with human judgments reported
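
As an illustration of the retrieval setup in the list above, the sketch below ranks documents by cosine similarity to a query and scores the ranking with NDCG@k. It is a simplified stand-in for the benchmark's evaluation code, not a copy of it: the function names, toy vectors and relevance labels are assumptions, and the linear-gain form of DCG is used.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query: Sequence[float], corpus: Dict[str, Sequence[float]]) -> List[str]:
    """Return document ids sorted from most to least similar to the query."""
    return sorted(corpus, key=lambda doc_id: cosine(query, corpus[doc_id]), reverse=True)


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """DCG of the top-k ranking divided by the DCG of the ideal ordering."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy query, corpus and relevance judgments, for illustration only.
query = [1.0, 0.0]
corpus = {"d1": [0.9, 0.1], "d2": [0.0, 1.0], "d3": [0.5, 0.5]}
relevance = {"d1": 1.0}

ranking = rank_by_cosine(query, corpus)
print(ranking)                        # ['d1', 'd3', 'd2']
print(ndcg_at_k(ranking, relevance))  # 1.0
```

The same cosine ranking step underlies the reranking task; only the metric (MAP instead of NDCG@10) changes.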

Baselines

Official MTEB benchmark used for comparisons
Two types of baselines: retrieval models and semantic textual similarity models
Retrieval models are trained on open-domain QA datasets
Similarity models are trained on symmetric paraphrase datasets
All baselines based on pretrained language models
Sent-T5-XXL and GTR-XXL achieve best average performances

Main results

INSTRUCTOR achieves the best performance on all three benchmarks on average
INSTRUCTOR (335M) demonstrates large improvements over GTR-Large on text evaluation, prompt retrieval, and classification tasks
INSTRUCTOR performs better than the previous state-of-the-art model despite having one order of magnitude fewer parameters
Retrieval-based models perform well on retrieval and reranking, but lag behind on STS and classification
Similarity-based models perform well on STS, classification, and text evaluation, but not on retrieval

Analysis and ablations

INSTRUCTOR enables universal text embeddings for many tasks
Analyzed results from various perspectives
Reported average performance across all categories

Instructions enable diverse training

Split MEDI into symmetric and asymmetric groups
INSTRUCTOR finetuned without instructions on symmetric-only or asymmetric-only data performs similarly to or better than the original GTR model
INSTRUCTOR suffers if finetuned without task instructions on combination of both types of data
Finetuning with instructions enables model to benefit from combination of symmetric and asymmetric data
Instruction finetuning important when diverse data used for embedding training

Instruction robustness

Previous work shows that instruction-finetuned language models are not robust to paraphrased instructions.
Measured INSTRUCTOR’s robustness to variation in human-written instructions.
Wrote five paraphrased instructions for all evaluation datasets.
Measured INSTRUCTOR’s performance gap between best- and worst-performing instructions.
Inclusion of the 300 Super-NaturalInstructions (Super-NI) datasets is critical to the robustness of INSTRUCTOR.
Removing these datasets from training increases performance gap between best- and worst-performing instructions.
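
The robustness measure described above reduces to simple arithmetic: evaluate the same dataset under each paraphrased instruction and take the difference between the best and worst score. The sketch below assumes this reading; `instruction_gap` is a hypothetical helper, and the scores are placeholder values, not results from the paper.

```python
from typing import Dict


def instruction_gap(scores_by_instruction: Dict[str, float]) -> float:
    """Difference between the best- and worst-performing instruction."""
    if not scores_by_instruction:
        raise ValueError("need at least one instruction score")
    values = list(scores_by_instruction.values())
    return max(values) - min(values)


# Placeholder scores for five paraphrased instructions (not real results).
placeholder_scores = {
    "paraphrase_1": 0.75,
    "paraphrase_2": 0.5,
    "paraphrase_3": 0.625,
    "paraphrase_4": 0.75,
    "paraphrase_5": 0.5,
}

print(instruction_gap(placeholder_scores))  # 0.25
```

A smaller gap means the model is less sensitive to how the instruction is worded.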

Complexity of instructions

Instructions can improve performance across different task categories.
As more information is provided in the instruction, performance improves.
As the model size increases, performance increases for both GTR and INSTRUCTOR, with INSTRUCTOR showing more pronounced improvement.

Instructions mitigate domain shifts

Instruction-based finetuning improves models’ ability to generalize to unseen domains and tasks.
Results in Table 3 show INSTRUCTOR largely improves GTR-Large’s performance on three unseen domains.
Domain-specific datasets benefit particularly from instruction finetuning.

Qualitative analysis

Used t-SNE to visualize two examples of classification with and without instructions
Without instructions, pairs with different sentiment were closer together, and pairs with same sentiment were farther apart
With instructions, pairs with same sentiment were close together and correctly classified, and pairs with different sentiment were farther apart

Related work

Text embeddings are used in many applications
Different embedding models are used for different applications
MEDI combines symmetric and asymmetric data to train INSTRUCTOR
Muennighoff et al. (2022) introduced a benchmark to evaluate embedding models
Poor zero-shot transfer abilities of existing embedding models
Instruction finetuning has yet to be studied in the context of broadly-applicable embeddings
Instructions can facilitate information retrieval
INSTRUCTOR can be applied to 8 task categories

Conclusion

Introduced INSTRUCTOR, a single model that creates broadly-applicable text embeddings using natural language instructions
Constructed MEDI, a collection of diverse datasets, to finetune INSTRUCTOR with instructions
Experiments showed INSTRUCTOR achieves state-of-the-art performance on text embedding benchmarks
Experiments showed INSTRUCTOR also performs well on prompt retrieval for few-shot in-context learning
330 diverse datasets with human-written task instructions used for training
Evaluated on 70 embedding datasets, spanning various downstream applications
Outperforms prior best model by an average of 3.4%
Robustness to instruction paraphrases improved
Performance improves as instructions become more detailed
Benefits more from scaling up
t-SNE visualization of pair classification examples with and without instruction
Performance of model tuned with instructions of various complexity degrees in Billboard and Prompt Retrieval
Evaluation includes 70 diverse datasets in 9 different downstream applications
