Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed PromptBERT, a novel contrastive learning method for learning better sentence representation
- Analyzed drawback of current sentence embedding from original BERT
- Proposed first prompt-based sentence embeddings method
- Discussed two prompt representing methods and three prompt searching methods
- Proposed novel unsupervised training objective by technology of template denoising
- Experiments show effectiveness of method
- Compared to SimCSE, PromptBert achieved 2.29 and 2.58 points of improvement based on BERT and RoBERTa in unsupervised setting
Paper Content
Introduction
- Pre-trained language models like BERT and RoBERTa have been successful in sentence embeddings
- Original BERT has poor performance in sentence embeddings compared to traditional word embedding methods like GloVe
- Anisotropy has been linked to explain the poor performance of original BERT
- Methods have been proposed to eliminate anisotropy in sentence embeddings
- Anisotropy may not be the primary cause of poor semantic similarity
- Token embeddings are biased by frequency, case sensitivity and subwords
- Removing these biased tokens can improve the performance of sentence representations
- Prompt-based method can avoid embedding bias and utilize the original BERT layers
- Prompts can provide a better way to generate positive pairs by different viewpoints from different templates
Related work
- Learning sentence embeddings is a popular NLP problem
- Leveraging BERT for sentence embeddings is a new trend
- Contrastive learning based methods achieve best results
- Original BERT has unsatisfactory performance
- Anisotropy in original BERT causes high similarity between sentence pairs
- Works focus on reducing anisotropy by post-processing sentence embeddings
Rethinking the sentence embeddings of the original bert
- Previous works explain poor performance of original BERT due to anisotropic token embeddings
- Relationship between anisotropy and performance examined
- Main reasons for poor performance are ineffective BERT layers and static token embedding biases
- BERT layers significantly harm sentence embedding performance
- Anisotropy not related to performance degradation
- Token embeddings biased by token frequency, subwords, and case
- Anisotropy not related to bias
- Removing embedding biases improves performance significantly
Prompt based sentence embeddings
- Proposed a prompt based sentence method to obtain sentence embeddings
- Reformulated sentence embedding task as mask language task
- Avoided embedding biases by representing sentences from [MASK] tokens
- Discussed implementation of prompt based sentence embeddings
- Proposed a prompt based contrastive learning method to fine-tune BERT on sentence embeddings
Represent sentence with the prompt
- Two methods to represent one sentence with a prompt are discussed
- First method uses the hidden vector of [MASK] token as sentence representation
- Second method maps the sentence to the tokens and calculates the weighted average of these tokens
- Disadvantages of the second method are noted
- First method is preferred
Prompt search
- Manual search requires hand-crafting templates
- Template generation based on T5 can outperform manual search
- OptiPrompt uses continuous template and unsupervised contrastive learning to increase spearman correlation
- OptiPrompt increases spearman correlation from 73.44 to 80.90
Prompt based contrastive learning with template denoising
- Contrastive learning uses BERT for sentence embeddings
- Challenge is how to construct proper positive instances
- Gao et al. (2021b) and Yan et al. (2021) discussed strategies to construct positive instances
- Proposed method to generate positive instances based on prompt
- Template denoising proposed to reduce influence of template on sentence representation
- Training objective proposed to use denoised sentence representation
Experiments
- Conducted experiments on STS tasks with non fine-tuned and fine-tuned BERT settings
- Exploited performance of original BERT in sentence embeddings
- Reported unsupervised and supervised results by fine-tuning BERT with downstream tasks
- Results of transfer tasks in Appendix C
Dataset
- Past works have conducted experiments on 7 common STS datasets
- SentEval toolkit is used to download the datasets
- Sentence pairs in each dataset are scored from 0 to 5 to indicate semantic similarity
Baselines
- Compared method with enlightening and state-of-the-art methods
- Used GLoVe, BERT-flow, and BERT-whitening as baselines
- Compared method with IS-BERT, InferSent, Universal Sentence Encoder, SBERT, SimCSE, and ConSERT in fine-tuned setting
Implementation details
- BERT was used to validate the effectiveness of the representation method in the non-fine-tuned setting
- BERT and RoBERTa were used in the fine-tuned setting with unsupervised and supervised training data
Non fine-tuned bert results
- Using templates can improve the results of original BERT on all datasets.
- Compared to pooling methods, our methods can improve spearman correlation by more than 10%.
- Manual template surpasses postprocess methods like BERT-flow and BERT-whitening.
- Continuous template by OptiPrompt can help original BERT achieve better results than unsupervised ConSERT.
Fine-tuned bert results
- Results of fine-tuned BERT are shown in Table 6
- Unsupervised and supervised methods are run
- BERT-flow and BERT-whitening are mentioned
- Unsupervised constrastive learning is unstable
- Model is trained with 10 random seeds
- Outperforms previous methods
- Leverages knowledge of unlabeled data
- Results of SimCSE with 10 random seeds are reported
Effectiveness of prompt based contrastive learning with template denoising
- Results of unsupervised training objectives in prompt based BERT reported
- Used same template with inner dropout noise as data augmentation
- Used different templates as positive pairs
- Used different templates with template denoising as default method
- Predicted same template and setting, changed way to generate positive pairs
Discussion
Template denoising
- Template denoising removes bias from templates and improves quality of top-k tokens predicted by MLM head.
- Template denoising removes unrelated tokens and helps model predict more related tokens.
- Template denoising significantly improves quality of tokens predicted by MLM head.
Stability in unsupervised contrastive learning
- Unsupervised contrastive learning in sentence embeddings produces unstable results.
- Results of unsupervised SimCSE-BERT base with 10 random seeds were reproduced.
- Results of the method used are more stable than SimCSE.
- Difference between best and worst results in SimCSE is up to 3.14%.
- Difference between best and worst results in the method used is only 0.53%.
Conclusion
- Poor performance of original BERT for sentence embeddings
- Inappropriate sentence representation methods cause underestimation of original BERT
- Proposed prompt-based sentence embedding method to better leverage BERT
- Contrastive learning method based on template denoising to further improve method
- Extensive experiments demonstrate efficiency of method on STS tasks and transfer tasks
- Manual templates used, automatic templates generated by T5 underperform
- Performance with continuous templates verify efficiency of prompts in sentence embeddings
- Eliminating biases by removing tokens
- Most of biases in static token embeddings are gradient from MLM classification head weight
- Pre-trained two BERT-like models with MLM pre-training objective
- Distribution of untying model less influenced by biases
- Greedy searching templates on bert-base-uncased
- Complex templates improve spearman correlation
- All BERT based methods use bert-base-uncased
- Template denoising helps eliminate different template biases
- Removed top 36 frequent tokens in bert-base-cased, bert-base-uncased and roberta-base