Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduce INSTRUCTOR, a new method for computing text embeddings given task instructions
- INSTRUCTOR is a single embedder that can generate text embeddings tailored to different downstream tasks and domains
- Annotate instructions for 330 diverse tasks and train INSTRUCTOR on this multitask mixture with a contrastive loss
- Evaluate INSTRUCTOR on 70 embedding evaluation tasks, ranging from classification to text generation
- INSTRUCTOR achieves state-of-the-art performance, with an average improvement of 3.4% compared to previous best results
- INSTRUCTOR is robust to changes in instructions, and instruction finetuning mitigates challenge of training single model on diverse datasets
Paper Content
Introduction
- Text embeddings represent discrete text inputs as fixed-sized vectors
- Recently, there have been advances in learning text embeddings
- Existing embeddings can have degraded performance when applied to new tasks or domains
- Common method to address this issue is to finetune the embeddings on datasets in downstream tasks and domains
- Hypothesis: text embeddings can be adjusted to different downstream applications using task and domain descriptions, without further finetuning
- INSTRUCTOR is a single multitask model that generates task-and domain-aware embeddings
- INSTRUCTOR is trained on MEDI, a collection of 330 text embedding datasets
- INSTRUCTOR outperforms prior state-of-the-art embedding models
- INSTRUCTOR is robust to paraphrases in instructions
- INSTRUCTOR encodes inputs together with task instructions, providing task-specific representations
Embedding architecture
- INSTRUCTOR is a single encoder architecture based on prior work
- GTR models are used as the backbone encoder
- GTR models are initialized from T5 models and finetuned on information search datasets
- INSTRUCTOR encodes the concatenation of an input text and a task instruction
- Mean pooling is applied to the last hidden representations to generate a fixed-sized, task-specific embedding
Training objective
- INSTRUCTOR is trained by formulating a variety of tasks as a text-to-text problem
- Training sample consists of (x, Ix, y, Iy)
- Goodness of candidate y for input x is given by similarity s(x, y)
- Maximize similarity between positive pairs and minimize negative pairs
- 330 datasets with instructions across diverse task categories and domains created
Data construction
- 300 datasets from Super-NaturalInstructions combined with 30 datasets from existing collections
- Positive and negative pairs constructed using Sentence-T5 embeddings
- Cosine similarity used to create positive and negative pairs for classification datasets
- Scores computed for tasks with output labels as text sequences
- Highest scores used to select positive and hard negative pairs
- Training data from Super-NaturalInstructions improve instruction robustness
- 30 datasets from Sentence Transformers, KILT and MedMCQA
- Unified instruction template developed for datasets without instructions
Experiments
- Trained INSTRUCTOR on MEDI data and evaluated on 70 downstream tasks
- Used MTEB benchmark from Muennighoff et al. (2022) which consists of 56 datasets over 7 task categories
- Applied INSTRUCTOR to prompt retrieval, in-context learning and text generation evaluation
- Achieved state-of-the-art performance
Embedding evaluations
- 70 evaluation datasets split into 9 categories by task objectives
- 66 datasets unseen during training
- MTEB (Muennighoff et al., 2022) provides a holistic view of current embedding models’ performance
- Retrieval: query q and corpus D, embedding model used to embed q and p 1 …p n , similarity measured by cosine similarity, NDCG@10 used to measure performance
- Reranking: query q and list of documents D, embedding model computes embeddings, cosine similarities used to rank documents, MAP used to measure performance
- Clustering: set of documents, encoder maps each document into an embedding, k-means clustering algorithm used to partition embedded documents into clusters, v-measure used to measure performance
- Pair Classification: predict binary label for a pair of texts, example is paraphrase identification, cosine similarity used to predict label, average precision score used to measure performance
- Classification: embedding of input text used as features to a classifier, classifier trained on training data, sentence embeddings kept frozen, classification accuracy on test set used as evaluation metric
- STS: sentence pair (t 1 , t 2 ), embedding model maps t 1 and t 2 separately, similarity between t 1 and t 2 measured by cosine similarity, Spearman’s rank correlation used to measure performance
- Summarization: reference summary r and machine-generated summary t, embedding model maps them into embeddings separately, cosine similarity between r and t, Spearman’s rank correlation between human judgements and automatic scores
- INSTRUCTOR used for automatic evaluations for 3 additional text generation tasks, cosine similarity between generated text and each reference text, Pearson correlation with human judgments reported
Baselines
- Official MTEB benchmark used for comparisons
- Two types of baselines: embedding models and semantic textual similarity
- Embedding models trained on open-domain QA datasets
- Semantic textual similarity trained on symmetric paraphrase datasets
- All baselines based on pretrained language models
- Sent-T5-XXL and GTR-XXL achieve best average performances
Main results
- INSTRUCTOR achieves the best performance on all three benchmarks on average
- INSTRUCTOR (335M) demonstrates large improvements over GTR-Large on text evaluation, prompt retrieval, and classification tasks
- INSTRUCTOR performs better than the previous state-of-the-art model despite having one order of magnitude fewer parameters
- Retrieval-based models perform well on retrieval and reranking, but lag behind on STS and classification
- Similarity-based models perform well on STS, classification, and text evaluation, but not on retrieval
Analysis and ablations
- INSTRUCTOR enables universal text embeddings for many tasks
- Analyzed results from various perspectives
- Reported average performance across all categories
Instructions enable diverse training
- Split MEDI into symmetric and asymmetric groups
- INSTRUCTOR finetuned without instructions yields similar or better performance than original GTR model
- INSTRUCTOR suffers if finetuned without task instructions on combination of both types of data
- Finetuning with instructions enables model to benefit from combination of symmetric and asymmetric data
- Instruction finetuning important when diverse data used for embedding training
Instruction robustness
- Previous work shows that instruction-finetuned language models are not robust to paraphrased instructions.
- Measured INSTRUCTOR’s robustness to variation in human-written instructions.
- Wrote five paraphrased instructions for all evaluation datasets.
- Measured INSTRUCTOR’s performance gap between best- and worst-performing instructions.
- Inclusion of 300 super-NI datasets is critical to the robustness of INSTRUCTOR.
- Removing these datasets from training increases performance gap between best- and worst-performing instructions.
Complexity of instructions
- Instructions can improve performance across different task categories.
- As more information is provided in the instruction, performance improves.
- As the model size increases, performance increases for both GTR and INSTRUCTOR, with INSTRUCTOR showing more pronounced improvement.
Instructions mitigate domain shifts
- Instruction-based finetuning improves models’ ability to generalize to unseen domains and tasks.
- Results in Table 3 show INSTRUCTOR largely improves GTR-Large’s performance on three unseen domains.
- Domain-specific datasets benefit particularly from instruction finetuning.
Qualititive analysis
- Used T-SNE to visualize two examples of classification with and without instructions
- Without instructions, pairs with different sentiment were closer together, and pairs with same sentiment were farther apart
- With instructions, pairs with same sentiment were close together and correctly classified, and pairs with different sentiment were farther apart
Related work
- Text embeddings are used in many applications
- Different embedding models are used for different applications
- MEDI combines symmetric and asymmetric data to train INSTRUCTOR
- Muennighoff et al. (2022) introduced a benchmark to evaluate embedding models
- Poor zero-shot transfer abilities of existing embedding models
- Instruction finetuning has yet to be studied in the context of broadly-applicable embeddings
- Instructions can facilitate information retrieval
- INSTRUCTOR can be applied to 8 tasks categories
Conclusion
- Introduced INSTRUCTOR, a single model that creates broadly-applicable text embeddings using natural language instructions
- Constructed MEDI, a collection of diverse datasets, to finetune INSTRUCTOR with instructions
- Experiments showed INSTRUCTOR achieves state-of-the-art performance on text embedding benchmarks
- Experiments showed INSTRUCTOR achieves prompt retrieval for fewshot in-context learning
- 330 diverse datasets with human-written task instructions used for training
- Evaluated on 70 embedding datasets, spanning various downstream applications
- Outperforms prior best model by an average of 3.4%
- Robustness to instruction paraphrases improved
- Performance improves as instructions become more detailed
- Benefits more from scaling up
- T-SNE visualization of pair classification examples with and without instruction
- Performance of model tuned with instructions of various complexity degrees in Billboard and Prompt Retrieval
- Evaluation includes 70 diverse datasets in 9 different downstream applications