Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Questions with questionable assumptions are difficult to answer.
  • (QA)$^2$ is an open-domain evaluation dataset of naturally-occurring search engine queries that may or may not contain questionable assumptions.
  • Current models struggle with handling questionable assumptions.

Paper Content

Introduction

  • Naturally occurring information-seeking questions often contain false or unverifiable assumptions.
  • Examples of such questions are given.
  • Questionable assumptions remain challenging for general-purpose models.
  • (QA)2 is a dataset of naturally-occurring information-seeking questions with questionable assumptions.
  • Models tested on (QA)2 show modest performance.

Questions with questionable assumptions

  • Presuppositions are backgrounded meanings associated with a linguistic utterance
  • Information that is presupposed must be true for the utterance to be appropriate
  • Questionable assumptions are false or unverifiable assumptions likely to be believed by the speaker
  • QA2 dataset consists of 602 naturally-occurring, expert-annotated questions
  • Half of the questions contain questionable assumptions, half do not
  • 32 examples set aside for few-shot tuning or in-context demonstration

Data collection and annotation

  • Constructed QA2 in 3 steps: question collection, identification of questionable assumptions via crowdsourcing, and expert annotation
  • Used Google’s autocompletion API as source of questions
  • Scraped English wh-questions by calling API on prefix strings
  • Automatic filtering algorithm discarded duplicates and bad/non-questions
  • Questions lack case or punctuation information
  • Questions may contain grammatical errors, fragmented expressions, or be temporally misaligned
  • Crowdsourcing used for binary classification task to identify questions with questionable assumptions
  • Recruited workers on Amazon Mechanical Turk
  • Annotated 12,000 questions with 5 annotators each
  • Expert annotators manually verified and provided additional detailed annotations
  • Expert annotators identified questionable assumptions and transformed them into Yes/No verification questions
  • Verified by different third annotator and any disagreements resolved through adjudication discussion

Dataset statistics

  • Dataset contains 602 annotated questions
  • Half of questions contain questionable assumptions, half do not
  • Out of 301 questionable assumptions, 251 are false and 50 are unverifiable
  • 32 questions set aside as an adaptation set
  • Questions annotated with fields containing abstractive answer, extractive evidence, URL, questionable assumptions, and temporal status

Discussion

  • Questions with questionable assumptions often go unnoticed
  • Detection of questionable assumptions is dependent on narrower domains of factual knowledge
  • Most questionable assumptions are associated with either the wh-word or a definite description in the question
  • Existence and contextual uniqueness assumptions are two types of questionable assumptions

Evaluation

  • Propose two simpler subtasks in binary classification format: questionable assumption detection and verification
  • Tasks can be posed in question-answering format and evaluated using zero-shot, few-shot, or in-context learning setup
  • Tasks can also serve as automated metrics for model development

End-to-end abstractive qa

  • Evaluated end-to-end abstractive QA
  • Used human raters to judge acceptability of model-generated answers
  • 100 questions randomly sampled from evaluation set
  • 5 raters recruited on Prolific
  • Took 15 minutes to complete task, compensated $5 per task

Questionable assumption detection

  • Detecting questionable assumptions in questions
  • Binary classification task (contains/does not contain a questionable assumption)

Questionable assumption verification

  • Task aims to break down problem further by assuming there is an oracle that correctly identifies questionable assumption and converts it to Yes/No verification question.
  • Example given: from question “where are the winter olympics held 2021”, oracle provides false assumption “the winter olympics were held in 2021” and derives verification question “were the winter olympics held in 2021”.

Experiments

Models

  • Applied evaluations to range of models, including QA and general-purpose language models
  • Experimented with zero-shot, in-context learning and few-shot tuning
  • Dataset contains up-to-date information as of December 2022
  • Evaluated two opendomain QA models: Macaw and REALM
  • REALM has retrieval component which is expected to have advantage in assumption verification task
  • Evaluated general-purpose language models in zero-shot setting
  • Developed prompting strategy combining Let’s think step by step and task decomposition

Results and discussion

  • End-to-end abstractive QA is challenging, with the best performing model at 59% human-judged acceptability.
  • Text-davinci-003 was the best performing model in the zero-shot setting at 53%.
  • Models generally underperformed when answering questions with questionable assumptions compared to valid questions.
  • Text-davinci-003 was an exception with equivalent performance in both scenarios.
  • Adding in-context demonstrations to the prompt of text-davinci-003 improved its answers on questions containing questionable assumptions.

Subtasks

  • Most models performed around chance on questionable assumption detection.
  • GPT-3 text-davinci-003 was the best performing model with 57% accuracy.
  • Verification performance was generally higher than detection performance.

Error analysis

  • 59% of errors were producing a factually wrong answer
  • 27% of errors were not producing an answer
  • 10% of errors were producing an uninformative or underinformative answer
  • Questions with questionable assumptions have been part of benchmark QA datasets for a while.
  • Kim et al. (2021) argued for an approach that uses presuppositions to explain why questions are unanswerable.
  • Even when the difficulty of identifying assumptions is removed, factual verification is still challenging.
  • (QA) 2 is a robustness check for QA models and is similar to the challenge set of Tafjord and Clark (2021).

Conclusion

  • Proposed (QA)2, an evaluation set for question-answering systems
  • Consists of adaptation set and main evaluation set
  • Existing QA models have difficulty with questionable assumptions
  • (QA)2 serves as checkpointing tool for building better QA systems
  • Table 5 lists fields in (QA)2 and example annotation
  • Tables 6-9 list full prompt templates used for evaluation
  • End-to-end QA, questionable assumption detection, and assumption verification evaluated
  • Figure 1 and 2 show failure and success cases
  • 0-shot and 4-shot prompts for end-to-end QA, questionable assumption detection, and verification