Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Language models lack the ability to interpret and generate expressions of uncertainty.
  • Expressions of uncertainty are important for human decision-making.
  • GPT3’s accuracy varies depending on the expression of uncertainty used.
  • Models struggle to emit trustworthy expressions of uncertainty.

Paper Content

Introduction

  • Natural language systems need to communicate uncertainties
  • Expressions of uncertainty inform decision-making
  • Prior work has focused on mapping internal probabilities to verbal/numerical output
  • Our work focuses on non-unidimensional linguistic features (hedges, epistemic markers, etc.)
  • Experiments conducted in zero-shot and in-context learning settings
  • Systematic losses in accuracy when expressions of certainty are used
  • Teaching model to emit weakeners leads to better calibration
  • Need for model analysis through lens of uncertainty

Expressions of certainty and uncertainty: linguistic background

  • Certainty and uncertainty can be expressed through linguistic markers
  • Hedges weaken category membership and speaker commitment to truth value
  • Certainty markers are strengtheners that contain factive verbs
  • Evidential markers can be both strengtheners and weakeners
  • Personal pronouns and references to sources are additional expressions of (un)certainty

Expressions of uncertainty typology

  • Compiled a list of 50 expressions of verbal uncertainty
  • Coded each expression with linguistic features related to certainty and uncertainty

Methods

  • Studies LM expressions of uncertainty in the context of open-ended question answering
  • Uses handcrafted templates to insert linguistic expressions of uncertainty into QA questions
  • Uses zero-shot promoting with expressions of uncertainty to elicit answers
  • Creates in-context learning prompt with nearly fifty samples containing question-answer pairs and expressions of (un)certainty

Datasets

  • Analysis performed across four QA datasets
  • Subsampled datasets to single-token answers based on GPT3’s vocabulary
  • Calculated 95% confidence intervals using bootstrap resampling

Lm, prompting, and evaluation details

  • OpenAI’s GPT3 model (davinci 175B) is a commonly used large language model.
  • To measure accuracy, 10 tokens are generated and if any of them match the answer, it is counted as correct.

The impact of uncertainty on language generation

  • Expressions of uncertainty are necessary for decision-making processes.
  • Investigating how GPT3’s generation changes based on expressions of uncertainty.
  • Hypothesis 1: Model is robust to adding expressions of uncertainty.
  • Hypothesis 2: Model will respond differently based on uncertainty cues.

Variation in gpt3 responses across verbal uncertainties

  • GPT3 is sensitive to uncertainty cues
  • Accuracy changes significantly depending on the uncertainty expression used
  • Accuracy can change by up to 80% on the same set of questions
  • Weakeners perform significantly better than strengtheners
  • Factive verbs consistently result in significant losses in accuracy
  • Evidential markers significantly improve performance

A redistribution of probability mass when prompted with weakeners

  • Templates with weakeners outperform templates with strengtheners
  • Weakeners change the underlying probability distribution of the potential answers
  • Weakeners do not increase the probability-on-gold
  • Weakeners lead to a flattening of the probability mass across answers
  • Weakeners increase the entropy of the probability distribution of top tokens

Expressions of uncertainty compared to the standard prompting method

  • Expressions of uncertainty can lead to better performance than standard prompting method
  • Entropy is higher among weakeners, meaning model places probability more evenly across alternative answers
  • In TriviaQA, template “Online says it’s…” achieves 66% accuracy compared to 63% with standard method
  • In Natural Questions, 7 templates outperform standard method, 6 of which are expressions of uncertainty

The impact of the degree of uncertainty on performance

  • GPT3’s generation is sensitive to uncertainty in prompts
  • Introducing numerical values to study uncertainty at a more fine-grained level
  • Evaluating expected calibration error (ECE) and finding poor values
  • Certainty hurts model performance, especially at the extremes (0% and 100%)
  • Hyperbolic language and imbalances in numerical frequencies in training data could be causing results
  • Visualization of frequency of percentages in training data shows peaks at extremes and intervals

When lms emit their own uncertainty

  • Researchers have studied training models to generate expressions of uncertainty.
  • Model performance is studied based on in-context learning examples.
  • Few-shot learning with 50 samples is nearly as effective as fine-tuning on larger datasets.
  • Dataset contains 48 samples with probability-on-gold uniformly distributed between 0 and 100.

Experiment details

  • Study how language models respond when emitting their own uncertainty
  • Follow Lin et al. (2022) setup but modify it for strengtheners and weakeners
  • Teach model to output strengthener when confidence is above threshold and nothing otherwise
  • Teach model to output weakener when probability is below threshold and nothing otherwise

Prompting perturbations

  • Recent work has shown differences in prompting setup
  • Experiment with appending expressions of uncertainty
  • Experiment with different sample orderings and thresholds
  • Threshold does not drastically change accuracy, but does impact balance of training datasets
  • Limited differences in ordering of samples
  • Choose threshold of 0.5 and random ordering as hyper-parameters

Results

  • GPT3 has limited ability to learn naturalistic expressions of uncertainty in a calibrated manner
  • Measured calibration based on whether models successfully emit expressions of uncertainty when probability of top token is above/below training threshold
  • F1 score for uncertainty templates is 0.56, compared to 0.53 for certainty templates
  • Accuracy higher when uncertainty not expressed, but not significantly higher when certainty expressed
  • Entropy higher when uncertainty expressed and lower when certainty expressed
  • Placement of template affects accuracy and probability-on-gold, with better results when prompting model to respond as soon as possible
  • Prior work has focused on accurately extracting model confidence, measuring, and improving model calibration
  • Mixed results on the calibration of neural models
  • Tradeoff between model performance and calibration
  • Solutions to reducing model overconfidence through linguistic calibration
  • Semanticists and computational linguists have studied speaker commitment factors
  • Computational issues in factuality, veridicality, and commitment, bias, and hedges

Discussion and conclusion

  • Naturalistic expressions of uncertainty can impact model behavior.
  • Accuracy drops when naturalistic expressions of certainty are used.
  • Calibration gains when teaching models to express weakeners.
  • Shift to uncertainty to be safer for human-computer interactions.
  • Focus on training models to emit expressions of uncertainty.
  • Spoken language expressions of uncertainty may vary from written language.
  • Potential to integrate attributions of uncertainty in a verified manner.

B amazon mechanical turk results

  • Used Amazon Mechanical Turk to crowd-source expressions of uncertainty
  • Filtered workers to have HITs greater than 99 and at least 500 approved HITs
  • Collected 9 samples of 5 examples each
  • Analyzed expressions of certainty or uncertainty in two settings
  • Accuracy losses for templates with factive verbs
  • Use of evidential markers improved accuracy in three out of four datasets
  • Consistent drop in accuracy between 90% and 100% uncertainty
  • Increase in accuracy between 0% and 10% uncertainty
  • Use of plausibility shields, sources, and personal pronouns had mixed results