Abstract

  • Models for textual entailment have been applied to settings like fact-checking and question answering.
  • We propose WiCE, a new dataset for verifying claims in text.
  • WiCE is built on real-world claims and evidence from Wikipedia.
  • Annotations are over sub-sentence units of the hypothesis, decomposed automatically by GPT-3.
  • Real claims in WiCE involve challenging verification problems.

Paper Content

Introduction

  • Textual entailment and natural language inference are longstanding problems in NLP
  • SNLI dataset uses NLI to evaluate domain-general approaches to semantic representation
  • NLI is used for validating answers from QA systems, evaluating generated summaries, and understanding knowledge-grounded dialog
  • NLI is used for attribution and factual consistency
  • NLI datasets target short premises
  • WICE dataset is for verification of real-world claims in Wikipedia with fine-grained annotations
  • CLAIM-SPLIT decomposes complex claims into simpler independent sub-claims
  • WICE is a challenging problem
  • Off-the-shelf models perform poorly at the claim level
  • Chunk-level processing is a strong starting point for future systems

Background

Past NLI datasets

  • NLI datasets involve single-sentence premises and hypotheses
  • Some datasets involve multi-sentence premises, but they are often short paragraphs
  • NLI datasets favor inference types like hypernymy that are not necessarily matched to the attribution problem
  • Fact checking datasets focus on the multihop verification problem, but only a small number of claims require multiple sentences as evidence

Hypothesis decomposition

  • Breaking down complex hypotheses into smaller units has been studied
  • Pyramid method proposed one way to decompose a summary into semantic content units
  • Frameworks have looked at breaking statements down into propositions
  • Factual consistency in summarization has looked at entailment of sub-sentence units
  • Question-answer pairs used to isolate specific pieces of information
  • SuperPAL and PropSegmEnt propose methods of retrieving sub-sentences for factuality evaluation
  • WICE dataset collects real world claims from Wikipedia grounded in cited documents
  • WICE provides annotation of entailment labels, supporting sentences, and non-supported tokens

Dataset preprocessing

  • Uses same base set of Wikipedia claims as SIDE dataset
  • Automatically parses cited articles’ HTML to extract article text
  • Uses GPT-3 to decompose claims into sub-claims
  • Filters out trivially entailed claims using RoBERTa-large
  • Retains 16.3% of claims after filtering
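
The filtering step above can be sketched as follows. This is only an illustration: `entail_prob` stands in for a baseline entailment model such as RoBERTa-large, `toy_entail_prob` is a hypothetical placeholder scorer, and the 0.5 threshold is an assumption rather than a value taken from the paper.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (claim, evidence text)


def filter_trivially_entailed(
    examples: List[Example],
    entail_prob: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, not from the paper
) -> List[Example]:
    """Keep only examples the baseline model does not already judge as entailed."""
    return [ex for ex in examples if entail_prob(*ex) < threshold]


def toy_entail_prob(claim: str, evidence: str) -> float:
    # Hypothetical stand-in scorer: 1.0 when the claim appears verbatim
    # in the evidence, 0.0 otherwise. A real pipeline would call a model.
    return 1.0 if claim.lower() in evidence.lower() else 0.0


examples = [
    ("Paris is in France", "We know Paris is in France."),
    ("Paris hosted the 1900 Olympics", "Paris is in France."),
]
kept = filter_trivially_entailed(examples, toy_entail_prob)
```

Here only the second example survives, because the toy scorer treats the first claim as trivially entailed.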

Dataset collection

  • Task interface presents evidence sentences and corresponding claim to crowd workers
  • Annotation done at sub-claim level
  • Entailment classification: annotator provides entailment label (SUPPORTED, PARTIALLY-SUPPORTED, NOT-SUPPORTED)
  • Supporting sentences: if entailment label is SUPPORTED or PARTIALLY-SUPPORTED, annotator selects subset of evidence sentences
  • Unsupported tokens in sub-claim: if entailment label is PARTIALLY-SUPPORTED, annotator highlights tokens not supported by evidence sentences
  • Workers recruited using Amazon Mechanical Turk
  • Inter-annotator agreement: Krippendorff’s α = 0.62
  • Aggregating worker annotations: majority vote for entailment label, union of supporting sentences
  • Claim-level labels from sub-claim annotations: union of sub-claim level supporting sentences
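
A minimal sketch of the two aggregation steps above, under stated assumptions. Restricting the sentence union to workers who chose the majority label, the tie-breaking behaviour, and the rule that maps sub-claim labels to a claim label are all assumptions made for illustration; the summary only states "majority vote" and "union of supporting sentences". The function names are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

Annotation = Tuple[str, Set[int]]  # (entailment label, supporting sentence ids)


def aggregate_workers(annotations: List[Annotation]) -> Annotation:
    """Majority-vote label plus union of supporting sentences for one sub-claim."""
    counts = Counter(label for label, _ in annotations)
    # Ties resolve to the label seen first (an assumption of this sketch).
    majority = counts.most_common(1)[0][0]
    supporting: Set[int] = set()
    for label, sentences in annotations:
        if label == majority:  # assumption: ignore dissenting workers' sentences
            supporting |= sentences
    return majority, supporting


def aggregate_claim(subclaims: List[Annotation]) -> Annotation:
    """Derive a claim-level label and sentence set from sub-claim annotations."""
    labels = {label for label, _ in subclaims}
    if labels == {"SUPPORTED"}:
        claim_label = "SUPPORTED"
    elif labels == {"NOT-SUPPORTED"}:
        claim_label = "NOT-SUPPORTED"
    else:  # assumed rule: any mixture counts as partial support
        claim_label = "PARTIALLY-SUPPORTED"
    supporting = set().union(*(sentences for _, sentences in subclaims))
    return claim_label, supporting
```

For example, three workers labelling a sub-claim SUPPORTED, SUPPORTED, and NOT-SUPPORTED yield SUPPORTED, with the sentences selected by the two agreeing workers merged.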

Tasks

  • Intended to support three tasks: entailment, sentence retrieval, and non-supported tokens
  • Given a claim and evidence articles, models must return a label, select a group of supporting sentences, and return a group of tokens not supported by the evidence
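
The three outputs described above could be represented with a small container like the one below. The class and field names are hypothetical and do not reflect the released dataset format; sentences and tokens are assumed to be referenced by index.

```python
from dataclasses import dataclass, field
from typing import List

LABELS = ("SUPPORTED", "PARTIALLY-SUPPORTED", "NOT-SUPPORTED")


@dataclass
class Prediction:
    """Hypothetical model output covering the three tasks."""
    label: str                                                     # entailment task
    supporting_sentences: List[int] = field(default_factory=list)  # retrieval task
    unsupported_tokens: List[int] = field(default_factory=list)    # token task

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


pred = Prediction("PARTIALLY-SUPPORTED", supporting_sentences=[0, 3], unsupported_tokens=[7, 8])
```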

Dataset statistics

  • Average of 3.0 sub-claims per claim
  • Approximately 5.9K sub-claim level examples
  • 56% of sub-claims supported by evidence sentences
  • 33% of claims supported by evidence sentences
  • 1.9 evidence sentences per sub-claim, 3.1 per claim
  • 90% of claims and 64% of sub-claims require multiple supporting sentences
  • 374 partially supported sub-claims in training data
  • 12.8 tokens per sub-claim in training data
  • 3.3 non-supported tokens per sub-claim in dev set

Analysis of phenomena

  • WICE dataset contains contextualized claims that require multiple supporting sentences for verification
  • WICE verification problems are categorized into six types: compression, compression with decontextualization, paraphrasing, calculation, inference, and background knowledge
  • 50 randomly sampled claims from the development set of each of the three datasets were manually analyzed
  • Table 4 shows the estimated distribution of verification types in WICE, FEVER, and VitaminC
  • CLAIM-SPLIT decomposes claims into sub-claims to provide a fine-grained view and improve annotation agreement
  • Human annotation study was conducted to evaluate the performance of CLAIM-SPLIT
  • 7.7% of the claims fail the completeness criterion and 2.3% fail the correctness criterion
  • CLAIM-SPLIT is tested on VitaminC and PAWS datasets
  • T5-large and T5-3B models are fine-tuned on ANLI dataset
  • Table 5 shows that using CLAIM-SPLIT and aggregating sub-claim scores improves performance on the entailment classification task
  • CLAIM-SPLIT is effective at simplifying the entailment classification problem and improving performance
  • Better prompts and aggregation methods could lead to further improvement
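
To illustrate what "aggregating sub-claim scores" might look like, the sketch below combines per-sub-claim entailment probabilities into one claim-level score. The summary does not say which aggregation the paper uses, so the minimum and the harmonic mean shown here are assumptions, and `aggregate_subclaim_scores` is a hypothetical helper.

```python
from statistics import harmonic_mean
from typing import List


def aggregate_subclaim_scores(scores: List[float], method: str = "harmonic") -> float:
    """Combine sub-claim entailment probabilities into a claim-level score."""
    if not scores:
        raise ValueError("need at least one sub-claim score")
    if method == "min":
        # A claim is only as well supported as its weakest sub-claim.
        return min(scores)
    if method == "harmonic":
        # Softer than min, but still dominated by low scores.
        return harmonic_mean(scores)
    raise ValueError(f"unknown method: {method!r}")


claim_score = aggregate_subclaim_scores([0.9, 0.3])
```

Both choices penalize a claim when any single sub-claim is poorly supported, which matches the intuition that every part of a claim must be verified.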

Experiments

  • How well do existing entailment models do in off-the-shelf settings when using the “stretching” paradigm?
  • How much can fine-tuning on the dataset improve accuracy?
  • Would retrieving relevant context sentences improve accuracy further?

Experimental setup

  • Benchmarked performance of baseline entailment models on WICE dataset
  • Considered multiple combinations of transformer encoders and entailment datasets
  • Defaulted to T5-3B as it performed best
  • Evaluated models using F1 score and accuracy
  • Used “stretching” technique to make document-level judgments
  • Fine-tuned T5 models on WICE dataset
  • Off-the-shelf results showed that sentence-level models performed worse than chunk-level models
  • Fine-tuning improved on off-the-shelf performance but remained below human performance
  • Used weighted sum of prediction probability for SUPPORTED and PARTIALLY-SUPPORTED classes as retrieval score
  • Retrieving sentences automatically before prediction did not outperform the MAX strategy
  • Oracle retrieved sentences substantially improved performance
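
A rough sketch of the MAX strategy and the retrieval score mentioned above. The scorer `toy_entail_prob` is a hypothetical word-overlap placeholder for a real entailment model, and the 0.5 weight on the PARTIALLY-SUPPORTED probability is an assumption, since the summary does not give the weights.

```python
from typing import Callable, List, Tuple


def toy_entail_prob(premise: str, claim: str) -> float:
    # Hypothetical placeholder: fraction of claim tokens found in the premise.
    claim_tokens = claim.lower().split()
    premise_tokens = set(premise.lower().split())
    return sum(t in premise_tokens for t in claim_tokens) / len(claim_tokens)


def max_strategy(
    chunks: List[str], claim: str, entail_prob: Callable[[str, str], float]
) -> Tuple[float, int]:
    """Score the claim against each chunk; the document score is the maximum."""
    scores = [entail_prob(chunk, claim) for chunk in chunks]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


def retrieval_score(p_supported: float, p_partial: float, partial_weight: float = 0.5) -> float:
    """Weighted sum of class probabilities (the weight is an assumed value)."""
    return p_supported + partial_weight * p_partial


chunks = ["the cat sat", "the dog ran fast"]
doc_score, best_chunk = max_strategy(chunks, "the dog ran", toy_entail_prob)
```

With the toy scorer, the second chunk contains every claim token, so it determines the document-level score.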

Task: supporting sentence retrieval

  • Performed intrinsic evaluation of supporting sentence retrieval task
  • Retrieve sentences with retrieval scores larger than a threshold τ
  • For each example, take the maximum F1 score over the reference sets of supporting sentences
  • The threshold τ used on the test set is the value that maximizes F1 on the dev set
  • Investigated performance of retrieving correct sentences
  • Model with context improves retrieval performance
  • Performance lower than human performance
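
The evaluation procedure above can be sketched as follows. The helper names, the candidate thresholds, and the use of a dev-set mean of per-example scores are illustrative assumptions; sentences are referenced by index.

```python
from typing import Iterable, List, Set, Tuple


def f1(pred: Set[int], gold: Set[int]) -> float:
    """Set-level F1 between predicted and reference sentence ids."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def example_score(pred: Set[int], references: List[Set[int]]) -> float:
    """Maximum F1 over the reference sets for one example."""
    return max(f1(pred, ref) for ref in references)


def retrieve(scores: List[float], tau: float) -> Set[int]:
    """Select sentences whose retrieval score is larger than tau."""
    return {i for i, s in enumerate(scores) if s > tau}


def pick_threshold(dev: List[Tuple[List[float], List[Set[int]]]], candidates: Iterable[float]) -> float:
    """Return the candidate tau with the highest mean F1 on the dev set."""
    def mean_f1(tau: float) -> float:
        return sum(example_score(retrieve(s, tau), refs) for s, refs in dev) / len(dev)
    return max(candidates, key=mean_f1)


dev = [
    ([0.9, 0.2, 0.6], [{0, 2}]),
    ([0.1, 0.8], [{1}, {0, 1}]),
]
best_tau = pick_threshold(dev, [0.1, 0.5, 0.7])
```

The selected τ would then be applied unchanged when retrieving sentences on the test set.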

Conclusion

  • WICE is a new dataset constructed from claims on Wikipedia
  • WICE contains a variety of real-world entailment phenomena distinct from prior annotated datasets
  • Decomposing sentences into sub-claims can be a valuable preprocessing step
  • WICE consists of sentences with either 1 or 2 cited articles
  • Performance of finetuning on WICE is saturating with current training procedure and models
  • Filtering by using the baseline entailment classification model removes 45% of the data
  • Median work time for each HIT was about 5 minutes
  • Annotation interface includes cited websites and a paragraph and sentence in a Wikipedia article
  • Supported and partially supported sub-claims have almost the same number of supporting sentences
  • Word overlap between claims and evidence in WICE is relatively low
  • Claims in different entailment labels have similar word overlap distributions
  • GPT-3 used to split sentences into sub-claims
  • Human annotators used to detect mistakes in dev and test sets
  • Mistakes caused by removing first or intermediate clauses, parentheses, and “and”
  • Mistakes caused by over-splitting and multiple decomposed sentences
  • PyTorch and Hugging Face Transformers libraries used in implementation
  • 4 NVIDIA Quadro RTX 8000 used in experiments
  • Publicly available pretrained models in Hugging Face Hub used for evaluation
  • Sentence- and chunk-level data used for training
  • Evidence is split into overlapping chunks shorter than 256 tokens
  • Oracle train data includes chunks of all supporting sentences
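
The overlapping-chunk preprocessing mentioned above might look like the sketch below. The stride of 128 tokens is an assumption, and splitting is shown on a pre-tokenized list; the paper's actual tokenizer and chunk boundaries are not specified in this summary.

```python
from typing import List, Sequence


def overlapping_chunks(tokens: Sequence, max_len: int = 255, stride: int = 128) -> List[Sequence]:
    """Split tokens into overlapping windows of at most max_len tokens.

    max_len=255 keeps each chunk strictly shorter than 256 tokens;
    stride is an assumed value controlling how far each window advances.
    """
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("require max_len > 0 and 0 < stride <= max_len")
    if not tokens:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end
        start += stride
    return chunks


tokens = list(range(600))  # stand-in for a tokenized evidence article
chunks = overlapping_chunks(tokens)
```

Consecutive chunks share `max_len - stride` tokens, so a supporting sentence that straddles one boundary is still seen whole in a neighbouring chunk.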