Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposed PromptBERT, a novel contrastive learning method for learning better sentence representation
  • Analyzed drawback of current sentence embedding from original BERT
  • Proposed first prompt-based sentence embeddings method
  • Discussed two prompt representing methods and three prompt searching methods
  • Proposed novel unsupervised training objective by technology of template denoising
  • Experiments show effectiveness of method
  • Compared to SimCSE, PromptBert achieved 2.29 and 2.58 points of improvement based on BERT and RoBERTa in unsupervised setting

Paper Content

Introduction

  • Pre-trained language models like BERT and RoBERTa have been successful in sentence embeddings
  • Original BERT has poor performance in sentence embeddings compared to traditional word embedding methods like GloVe
  • Anisotropy has been linked to explain the poor performance of original BERT
  • Methods have been proposed to eliminate anisotropy in sentence embeddings
  • Anisotropy may not be the primary cause of poor semantic similarity
  • Token embeddings are biased by frequency, case sensitivity and subwords
  • Removing these biased tokens can improve the performance of sentence representations
  • Prompt-based method can avoid embedding bias and utilize the original BERT layers
  • Prompts can provide a better way to generate positive pairs by different viewpoints from different templates
  • Learning sentence embeddings is a popular NLP problem
  • Leveraging BERT for sentence embeddings is a new trend
  • Contrastive learning based methods achieve best results
  • Original BERT has unsatisfactory performance
  • Anisotropy in original BERT causes high similarity between sentence pairs
  • Works focus on reducing anisotropy by post-processing sentence embeddings

Rethinking the sentence embeddings of the original bert

  • Previous works explain poor performance of original BERT due to anisotropic token embeddings
  • Relationship between anisotropy and performance examined
  • Main reasons for poor performance are ineffective BERT layers and static token embedding biases
  • BERT layers significantly harm sentence embedding performance
  • Anisotropy not related to performance degradation
  • Token embeddings biased by token frequency, subwords, and case
  • Anisotropy not related to bias
  • Removing embedding biases improves performance significantly

Prompt based sentence embeddings

  • Proposed a prompt based sentence method to obtain sentence embeddings
  • Reformulated sentence embedding task as mask language task
  • Avoided embedding biases by representing sentences from [MASK] tokens
  • Discussed implementation of prompt based sentence embeddings
  • Proposed a prompt based contrastive learning method to fine-tune BERT on sentence embeddings

Represent sentence with the prompt

  • Two methods to represent one sentence with a prompt are discussed
  • First method uses the hidden vector of [MASK] token as sentence representation
  • Second method maps the sentence to the tokens and calculates the weighted average of these tokens
  • Disadvantages of the second method are noted
  • First method is preferred
  • Manual search requires hand-crafting templates
  • Template generation based on T5 can outperform manual search
  • OptiPrompt uses continuous template and unsupervised contrastive learning to increase spearman correlation
  • OptiPrompt increases spearman correlation from 73.44 to 80.90

Prompt based contrastive learning with template denoising

  • Contrastive learning uses BERT for sentence embeddings
  • Challenge is how to construct proper positive instances
  • Gao et al. (2021b) and Yan et al. (2021) discussed strategies to construct positive instances
  • Proposed method to generate positive instances based on prompt
  • Template denoising proposed to reduce influence of template on sentence representation
  • Training objective proposed to use denoised sentence representation

Experiments

  • Conducted experiments on STS tasks with non fine-tuned and fine-tuned BERT settings
  • Exploited performance of original BERT in sentence embeddings
  • Reported unsupervised and supervised results by fine-tuning BERT with downstream tasks
  • Results of transfer tasks in Appendix C

Dataset

  • Past works have conducted experiments on 7 common STS datasets
  • SentEval toolkit is used to download the datasets
  • Sentence pairs in each dataset are scored from 0 to 5 to indicate semantic similarity

Baselines

  • Compared method with enlightening and state-of-the-art methods
  • Used GLoVe, BERT-flow, and BERT-whitening as baselines
  • Compared method with IS-BERT, InferSent, Universal Sentence Encoder, SBERT, SimCSE, and ConSERT in fine-tuned setting

Implementation details

  • BERT was used to validate the effectiveness of the representation method in the non-fine-tuned setting
  • BERT and RoBERTa were used in the fine-tuned setting with unsupervised and supervised training data

Non fine-tuned bert results

  • Using templates can improve the results of original BERT on all datasets.
  • Compared to pooling methods, our methods can improve spearman correlation by more than 10%.
  • Manual template surpasses postprocess methods like BERT-flow and BERT-whitening.
  • Continuous template by OptiPrompt can help original BERT achieve better results than unsupervised ConSERT.

Fine-tuned bert results

  • Results of fine-tuned BERT are shown in Table 6
  • Unsupervised and supervised methods are run
  • BERT-flow and BERT-whitening are mentioned
  • Unsupervised constrastive learning is unstable
  • Model is trained with 10 random seeds
  • Outperforms previous methods
  • Leverages knowledge of unlabeled data
  • Results of SimCSE with 10 random seeds are reported

Effectiveness of prompt based contrastive learning with template denoising

  • Results of unsupervised training objectives in prompt based BERT reported
  • Used same template with inner dropout noise as data augmentation
  • Used different templates as positive pairs
  • Used different templates with template denoising as default method
  • Predicted same template and setting, changed way to generate positive pairs

Discussion

Template denoising

  • Template denoising removes bias from templates and improves quality of top-k tokens predicted by MLM head.
  • Template denoising removes unrelated tokens and helps model predict more related tokens.
  • Template denoising significantly improves quality of tokens predicted by MLM head.

Stability in unsupervised contrastive learning

  • Unsupervised contrastive learning in sentence embeddings produces unstable results.
  • Results of unsupervised SimCSE-BERT base with 10 random seeds were reproduced.
  • Results of the method used are more stable than SimCSE.
  • Difference between best and worst results in SimCSE is up to 3.14%.
  • Difference between best and worst results in the method used is only 0.53%.

Conclusion

  • Poor performance of original BERT for sentence embeddings
  • Inappropriate sentence representation methods cause underestimation of original BERT
  • Proposed prompt-based sentence embedding method to better leverage BERT
  • Contrastive learning method based on template denoising to further improve method
  • Extensive experiments demonstrate efficiency of method on STS tasks and transfer tasks
  • Manual templates used, automatic templates generated by T5 underperform
  • Performance with continuous templates verify efficiency of prompts in sentence embeddings
  • Eliminating biases by removing tokens
  • Most of biases in static token embeddings are gradient from MLM classification head weight
  • Pre-trained two BERT-like models with MLM pre-training objective
  • Distribution of untying model less influenced by biases
  • Greedy searching templates on bert-base-uncased
  • Complex templates improve spearman correlation
  • All BERT based methods use bert-base-uncased
  • Template denoising helps eliminate different template biases
  • Removed top 36 frequent tokens in bert-base-cased, bert-base-uncased and roberta-base