Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Discrete prompts have been used to fine-tune Pre-trained Language Models.
Automatic methods can generate discrete prompts from a small set of training instances.
Automatically learnt discrete prompts contain noisy and counter-intuitive lexical constructs.
Study of robustness of discrete prompts by applying perturbations to an application using AutoPrompt.
Discrete prompt-based method remains relatively robust against NLI input perturbations.
Highly sensitive to other types of perturbations such as shuffling and deletion of prompt tokens.
Generalize poorly across different NLI datasets.

Pre-trained Language Models (PLMs) have been used for a range of Natural Language Processing (NLP) tasks
Manual writing of prompts is challenging
Automatic learning of discrete prompts has been proposed
Table 1 shows manually-written and AP-learnt prompts for fact retrieval
AP-learnt prompts outperform manual prompts in precision scores
AP-learnt prompts contain counter-intuitive language constructs
Automatic methods use a small number of training instances
Evaluation of robustness of discrete prompts is important
Discrete prompts are represented in natural language and supposed to be interpretable
Evaluation of robustness of discrete prompts to random or adversarial perturbations has not been studied
AP is used as an example of a widely-used method
MP outperforms both AP and HFT on CB and MNLI
Performance of AP is highly dataset-dependent
AP and MP are sensitive to the ordering of prompt tokens
Random deletion of prompt tokens decreases performance in both AP and MP
AP is relatively more robust against adversarial perturbations than MP

Prompting or in-context learning is an efficient method to extract knowledge from PLMs
Two types of prompts can be learnt: discrete and continuous
Continuous prompts are parameter efficient but cannot be learnt when a PLM is not publicly available
Discrete prompts can be used with diverse NLP tasks and are an attractive alternative to finetuning massive PLMs
Prior work has analyzed prompts from various viewpoints

AutoPrompt (AP) is a method of discrete prompt learning based on fill-in-the-blank tasks
Manually-written Prompts (MP) is a method for fine-tuning the entire masked language model with training data using manually-written prompts
Head-based Fine-Tuning (HFT) fine-tunes the PLM with a classifier head
RoBERTa-large (355M parameters) was used as the pretrained language model
Rate of degradation (RoD) was used to evaluate robustness
AP, MP, and HFT were evaluated on NLI
MP was always superior to AP
Performance of AP is highly dataset dependent
HFT is superior to AP in some datasets
Accuracy of MNLI was low on AP
Table 2 shows the average accuracy of models trained on 200 instances that performed well in both CB and MNLI

AP prompts have no obvious ordering among their tokens
Experiment conducted to measure effect of token order in discrete prompt
Accuracy of AP drops significantly when prompt tokens are randomly reordered
MP accuracy drops only slightly when prompt tokens are randomly reordered
AP prompts rely heavily on token order
Accuracy of AP drops as perturbation noise (measured by edit distance) increases

AP performs better than MP
It is difficult to determine the importance of prompt tokens
An experiment was conducted to delete one or more prompt tokens from a given discrete prompt
Results show that accuracy of both AP and MP drops when a single token is deleted
Performance of both AP and MP degrades when more tokens are deleted
Accuracy drop in CB is very small for MP even when all prompt tokens are deleted

Discrete prompt learning methods learn prompts from a small set of training instances.
Cross-dataset evaluation is used to study the transferability of the learnt discrete prompts.
Results show that AP-based prompts do not generalize well across datasets.

Adversarial perturbations can be used to probe the robustness of models
Previous studies have shown models can be fooled by perturbated test instances
Two annotators manually edited hypothesis sentences with two types of perturbations
Perturbations without label changes resulted in small RoD values
Perturbations with label changes resulted in significant drops in accuracy for both AP and MP

Discrete prompts remain relatively robust against token deletion
Highly sensitive to token shuffling
AP more robust than MP for perturbations with label changes
Generalize poorly across different datasets annotated for NLI
Possible limitations include not investigating other methods, using RoBERTa-large, focusing on NLI, and English datasets
Performance gap between MP/HFT and AP could affect accuracies
Aim is to analyze robustness of discrete prompts
Adversarial dataset came from existing datasets of CB and MNLI with no ethical concerns
RoBERTa known to have gender biases