Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposed PromptBERT, a novel contrastive learning method for learning better sentence representations
Analyzed drawbacks of current sentence embeddings from the original BERT
Proposed the first prompt-based sentence embedding method
Discussed two prompt representation methods and three prompt search methods
Proposed a novel unsupervised training objective based on template denoising
Experiments show the effectiveness of the method
Compared to SimCSE, PromptBERT achieved improvements of 2.29 and 2.58 points with BERT and RoBERTa, respectively, in the unsupervised setting

Paper Content

Introduction

Pre-trained language models like BERT and RoBERTa have been successful in sentence embeddings
Original BERT has poor performance in sentence embeddings compared to traditional word embedding methods like GloVe
Anisotropy has been proposed as the explanation for the poor performance of original BERT
Methods have been proposed to eliminate anisotropy in sentence embeddings
Anisotropy may not be the primary cause of poor semantic similarity
Token embeddings are biased by frequency, case sensitivity and subwords
Removing these biased tokens can improve the performance of sentence representations
Prompt-based method can avoid embedding bias and utilize the original BERT layers
Prompts can provide a better way to generate positive pairs by different viewpoints from different templates

Related work

Learning sentence embeddings is a popular NLP problem
Leveraging BERT for sentence embeddings is a new trend
Contrastive learning based methods achieve best results
Original BERT has unsatisfactory performance
Anisotropy in original BERT causes high similarity between sentence pairs
Works focus on reducing anisotropy by post-processing sentence embeddings
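
As a rough illustration of what "high similarity between sentence pairs" means, the sketch below measures anisotropy as the mean cosine similarity over all pairs of embeddings. This is one common proxy, not necessarily the exact measure used in the paper; the function names and toy vectors are assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs.

    Values close to 1 indicate that the vectors occupy a narrow cone
    (anisotropic); values near 0 or below indicate a spread-out space.
    """
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy vectors, made up for illustration (not real BERT outputs).
narrow_cone = [[1.0, 0.1], [1.0, 0.2], [1.0, -0.1]]
spread_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(mean_pairwise_cosine(narrow_cone))  # close to 1
print(mean_pairwise_cosine(spread_out))   # well below the narrow-cone value
```

Post-processing methods aim to move embeddings from the first regime toward the second.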

Rethinking the sentence embeddings of the original BERT

Previous works explain poor performance of original BERT due to anisotropic token embeddings
Relationship between anisotropy and performance examined
Main reasons for poor performance are ineffective BERT layers and static token embedding biases
BERT layers significantly harm sentence embedding performance
Anisotropy not related to performance degradation
Token embeddings biased by token frequency, subwords, and case
Anisotropy not related to bias
Removing embedding biases improves performance significantly
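
The idea of removing biased tokens before averaging static token embeddings can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names, the toy embedding table, and the rule that a leading "##" marks a subword are illustrative choices, not the paper's code.

```python
def is_biased(token, frequent_tokens):
    """Flag tokens affected by the three bias sources named above."""
    return (
        token in frequent_tokens                 # frequency bias
        or token.startswith("##")                # subword (WordPiece continuation)
        or any(ch.isupper() for ch in token)     # case bias
    )


def average_static_embedding(tokens, embedding_table, frequent_tokens):
    """Average static token embeddings after dropping biased tokens."""
    kept = [t for t in tokens if not is_biased(t, frequent_tokens)]
    if not kept:
        # Assumption: fall back to all tokens rather than return nothing.
        kept = list(tokens)
    dim = len(embedding_table[kept[0]])
    return [
        sum(embedding_table[t][i] for t in kept) / len(kept)
        for i in range(dim)
    ]


# Toy data, made up for illustration.
toy_table = {
    "the": [9.0, 9.0],
    "cat": [1.0, 0.0],
    "sat": [0.0, 1.0],
    "##s": [5.0, 5.0],
    ".": [7.0, 7.0],
}
toy_frequent = {"the", "."}
toy_tokens = ["the", "cat", "sat", "##s", "."]

print(average_static_embedding(toy_tokens, toy_table, toy_frequent))
```

In the toy input only "cat" and "sat" survive the filter, so the frequent and subword tokens no longer dominate the average.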

Prompt based sentence embeddings

Proposed a prompt-based method to obtain sentence embeddings
Reformulated the sentence embedding task as a masked language modeling task
Avoided embedding biases by representing sentences from [MASK] tokens
Discussed implementation of prompt based sentence embeddings
Proposed a prompt based contrastive learning method to fine-tune BERT on sentence embeddings

Represent sentence with the prompt

Two methods to represent one sentence with a prompt are discussed
First method uses the hidden vector of [MASK] token as sentence representation
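Second method takes the top-k tokens predicted at the [MASK] position by the MLM head and computes the probability-weighted average of their static token embeddings
Second method has disadvantages: it averages static token embeddings, so it still suffers from the embedding biases, and the weighted average makes BERT harder to fine-tune
First method is preferred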
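
The two representation methods can be sketched with toy data as follows. The template string, the whitespace tokenizer, and all function names are assumptions for illustration; a real setup would run BERT to obtain the hidden states and the MLM head's token probabilities.

```python
# Illustrative template; [X] is the sentence slot, [MASK] the position to read.
TEMPLATE = 'This sentence : "[X]" means [MASK] .'


def fill_template(template, sentence):
    """Insert the sentence into the template's [X] slot."""
    return template.replace("[X]", sentence)


def mask_index(tokens):
    """Position of the [MASK] token in a token list."""
    return tokens.index("[MASK]")


def mask_representation(tokens, hidden_states):
    """Method 1: the hidden vector at the [MASK] position."""
    return hidden_states[mask_index(tokens)]


def weighted_token_average(token_probs, embedding_table, k):
    """Method 2: probability-weighted average of the top-k predicted
    tokens' static embeddings (weights renormalized over the top k)."""
    top = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    dim = len(embedding_table[top[0][0]])
    return [
        sum(p * embedding_table[tok][i] for tok, p in top) / total
        for i in range(dim)
    ]


# Toy stand-ins for tokenizer output and model outputs.
tokens = fill_template(TEMPLATE, "a cat sat").split()
hidden_states = [[float(i), 0.0] for i in range(len(tokens))]
print(mask_representation(tokens, hidden_states))
```

Method 1 only needs an index lookup into the last hidden states, which is part of why it is simpler to fine-tune than the weighted average.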

Prompt search

Manual search requires hand-crafting templates
T5-based template generation, which can outperform manual search in other prompt-based tasks, underperforms manual templates for sentence embeddings
OptiPrompt replaces the discrete template with a continuous template optimized by unsupervised contrastive learning
OptiPrompt increases Spearman correlation from 73.44 to 80.90
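
Template search of any kind needs a score per template; in this summary that score is Spearman correlation between predicted similarities and gold similarity labels. The sketch below implements Spearman correlation from scratch and picks the best-scoring template. The template names and all numbers are made up for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def best_template(scores_by_template, gold):
    """Return the template whose predictions correlate best with gold."""
    return max(
        scores_by_template,
        key=lambda name: spearman(scores_by_template[name], gold),
    )


# Hypothetical predicted similarities for two candidate templates.
gold = [0.5, 2.0, 3.5, 5.0]
candidates = {
    "template_a": [0.1, 0.4, 0.6, 0.9],
    "template_b": [0.9, 0.1, 0.5, 0.3],
}
print(best_template(candidates, gold))
```

Because Spearman correlation depends only on rank order, the predicted similarities do not need to be on the same scale as the gold labels.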

Prompt based contrastive learning with template denoising

Contrastive learning uses BERT for sentence embeddings
Challenge is how to construct proper positive instances
Gao et al. (2021b) and Yan et al. (2021) discussed strategies to construct positive instances
Proposed method to generate positive instances based on prompt
Template denoising proposed to reduce influence of template on sentence representation
Training objective proposed to use denoised sentence representation
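
A minimal sketch of the training objective described above: each sentence is represented under two different templates, the template-only representation is subtracted from each view (template denoising), and an InfoNCE-style contrastive loss pulls the two denoised views of the same sentence together. The additive toy model of template bias, the temperature value, and all names are assumptions for illustration, not the paper's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def denoise(sentence_vec, template_vec):
    """Subtract the template-only representation from the sentence's."""
    return [s - t for s, t in zip(sentence_vec, template_vec)]


def info_nce(anchors, positives, temperature=0.05):
    """Mean contrastive loss; positives[i] is the positive for anchors[i],
    and every other entry of positives acts as an in-batch negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy model: representation = sentence content + template-specific bias.
contents = [[1.0, 0.0], [0.0, 1.0]]
bias_a, bias_b = [5.0, 5.0], [-3.0, 4.0]
raw_a = [[c + b for c, b in zip(content, bias_a)] for content in contents]
raw_b = [[c + b for c, b in zip(content, bias_b)] for content in contents]

denoised_a = [denoise(h, bias_a) for h in raw_a]
denoised_b = [denoise(h, bias_b) for h in raw_b]

print(info_nce(raw_a, raw_b))            # template bias blurs the pairs
print(info_nce(denoised_a, denoised_b))  # much smaller after denoising
```

In this toy setting the raw views are dominated by the template bias, so positives and negatives look alike; after subtraction only the sentence content remains and the loss drops sharply.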

Experiments

Conducted experiments on STS tasks with non fine-tuned and fine-tuned BERT settings
Examined the performance of original BERT in sentence embeddings
Reported unsupervised and supervised results by fine-tuning BERT with downstream tasks
Results of transfer tasks in Appendix C

Dataset

Past works have conducted experiments on 7 common STS datasets
SentEval toolkit is used to download the datasets
Sentence pairs in each dataset are scored from 0 to 5 to indicate semantic similarity

Baselines

Compared method with enlightening and state-of-the-art methods
Used GloVe, BERT-flow, and BERT-whitening as baselines
Compared method with IS-BERT, InferSent, Universal Sentence Encoder, SBERT, SimCSE, and ConSERT in fine-tuned setting

Implementation details

BERT was used to validate the effectiveness of the representation method in the non-fine-tuned setting
BERT and RoBERTa were used in the fine-tuned setting with unsupervised and supervised training data

Non fine-tuned BERT results

Using templates can improve the results of original BERT on all datasets.
Compared to pooling methods, templates improve Spearman correlation by more than 10%.
Manual templates surpass post-processing methods like BERT-flow and BERT-whitening.
Continuous template by OptiPrompt can help original BERT achieve better results than unsupervised ConSERT.

Fine-tuned BERT results

Results of fine-tuned BERT are shown in Table 6
Unsupervised and supervised methods are run
BERT-flow and BERT-whitening are mentioned
Unsupervised contrastive learning is unstable
Model is trained with 10 random seeds
Outperforms previous methods
Leverages knowledge of unlabeled data
Results of SimCSE with 10 random seeds are reported

Effectiveness of prompt based contrastive learning with template denoising

Results of unsupervised training objectives in prompt based BERT reported
Used same template with inner dropout noise as data augmentation
Used different templates as positive pairs
Used different templates with template denoising as default method
Used the same template and settings at prediction time; only the way positive pairs are generated during training was changed

Discussion

Template denoising

Template denoising removes bias from templates and improves quality of top-k tokens predicted by MLM head.
Template denoising removes unrelated tokens and helps model predict more related tokens.
Template denoising significantly improves quality of tokens predicted by MLM head.

Stability in unsupervised contrastive learning

Unsupervised contrastive learning in sentence embeddings produces unstable results.
Results of unsupervised SimCSE-BERT base with 10 random seeds were reproduced.
Results of the method used are more stable than SimCSE.
Difference between best and worst results in SimCSE is up to 3.14%.
Difference between best and worst results in the method used is only 0.53%.
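
The stability comparison boils down to the gap between the best and worst score across random seeds. A small sketch of that arithmetic follows; the scores are placeholders invented for illustration and are not results from the paper.

```python
from statistics import mean


def seed_spread(scores):
    """Gap between the best and worst score across seeds."""
    return max(scores) - min(scores)


# Hypothetical per-seed scores for one method (placeholder values only).
hypothetical_scores = [75.0, 76.5, 74.2]

print(f"mean over seeds: {mean(hypothetical_scores):.2f}")
print(f"best-worst gap: {seed_spread(hypothetical_scores):.2f}")
```

A smaller gap over the same number of seeds means the method is less sensitive to initialization and data order.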

Conclusion

Poor performance of original BERT for sentence embeddings
Inappropriate sentence representation methods cause underestimation of original BERT
Proposed prompt-based sentence embedding method to better leverage BERT
Contrastive learning method based on template denoising to further improve method
Extensive experiments demonstrate efficiency of method on STS tasks and transfer tasks
Manual templates are still used because templates generated automatically by T5 underperform
Performance with continuous templates verifies the efficiency of prompts in sentence embeddings
Biases can be eliminated by removing biased tokens
Most biases in static token embeddings come from gradients of the MLM classification head weight
Pre-trained two BERT-like models with the MLM pre-training objective, one tying the token embeddings to the MLM head weight and one untying them
Token embedding distribution of the untied model is less influenced by the biases
Templates were searched greedily on bert-base-uncased
More complex templates improve Spearman correlation
All BERT based methods use bert-base-uncased
Template denoising helps eliminate different template biases
Removed top 36 frequent tokens in bert-base-cased, bert-base-uncased and roberta-base
