Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Pattern-based models are good at precision, while learning-based models are better at recall.
  • There are two kinds of recall: d-recall (diversity) and e-recall (exhaustiveness).
  • Neural methods are better at d-recall, but pattern-based methods can be better at e-recall.
  • Evaluations should aim for both kinds of recall.

Paper Content

Introduction

  • Pattern-based methods are more precise, while learning-based methods have better recall
  • Recent advances in neural-network based models have made learning-based methods more precise
  • There are two kinds of recall: diversity and exhaustiveness
  • Pattern-based methods are better at exhaustiveness, while learning-based methods are better at diversity
  • Current datasets and evaluation methods focus primarily on diversity recall
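The distinction between the two kinds of recall can be made concrete with a toy calculation. The sketch below is only an illustration under stated assumptions: the paper does not give formal metric definitions, so the function names, the idea of grouping gold instances by "form", and the toy data are all hypothetical. Here d-recall is approximated as the fraction of forms for which at least one instance is recovered, and e-recall as the fraction of instances recovered within one specific form.

```python
from collections import defaultdict


def recall_by_form(gold, predicted):
    """Per-form recall: recovered instances / gold instances, for each form.

    gold: dict mapping instance id -> form label (hypothetical labels).
    predicted: set of instance ids that a system recovered.
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for instance_id, form in gold.items():
        total[form] += 1
        if instance_id in predicted:
            found[form] += 1
    return {form: found[form] / total[form] for form in total}


def d_recall(per_form):
    """Assumed proxy for diversity: share of forms with at least one hit."""
    return sum(1 for r in per_form.values() if r > 0) / len(per_form)


def e_recall(per_form, form):
    """Assumed proxy for exhaustiveness: share of one form's instances found."""
    return per_form[form]


# Toy gold data: instance id -> the form in which the fact is expressed.
gold = {1: "active", 2: "active", 3: "active", 4: "passive", 5: "nominal"}
neural_like = {1, 4, 5}   # reaches every form, misses instances within a form
pattern_like = {1, 2, 3}  # exhausts one form, ignores the others

neural = recall_by_form(gold, neural_like)
pattern = recall_by_form(gold, pattern_like)

print("neural-like :", d_recall(neural), e_recall(neural, "active"))
print("pattern-like:", d_recall(pattern), e_recall(pattern, "active"))
```

On this toy data the neural-like system reaches all three forms but recovers only one of the three "active" instances, while the pattern-like system recovers every "active" instance and nothing else. A single aggregate recall number would hide that difference.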

Background

Dependency trees and syntactic patterns

Extractive question answering and the SQuAD dataset

  • SQuAD is a collection of over 150,000 <question, passage> pairs
  • The dataset is used to train machine learning models to perform extractive QA
  • SQuAD v2.0 also includes <question, passage> pairs where the question is not answerable by the passage
  • Human annotators were asked to ask a question and mark its answer in the text
  • The SQuAD dataset and the extractive-QA task became very popular
  • Models “solve” the SQuAD benchmark with very high accuracies
  • SQuAD-based models are often used as off-the-shelf NLP components

Experiments

  • Experiments are performed using the SPIKE system
  • The SPIKE system allows queries over corpora with linguistic structures
  • Syntactic annotation in SPIKE-indexed corpora is based on the Universal Dependencies v1 scheme
  • Additional non-tree arcs are added by the pyBART system
  • SQuAD experiments are based on a model from the Hugging Face model hub
  • The model is a RoBERTa-large pre-trained transformer
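A minimal sketch of how such a QA model might be queried one sentence at a time, assuming the Hugging Face `transformers` question-answering pipeline. The summary does not name the exact checkpoint, so `deepset/roberta-large-squad2` is an assumption used only for illustration, and the helper names `load_qa` and `answer_or_none` are hypothetical.

```python
def load_qa(model_name="deepset/roberta-large-squad2"):
    """Build an extractive-QA pipeline.

    The default checkpoint name is an assumption, not the one confirmed
    by the paper. Requires `pip install transformers` and downloads the
    model on first use, so the import is kept local to this function.
    """
    from transformers import pipeline
    return pipeline("question-answering", model=model_name)


def answer_or_none(qa, question, sentence):
    """Return the predicted answer span, or None for a No-Answer prediction.

    With handle_impossible_answer=True the pipeline may return an empty
    string as the answer, which is treated here as No-Answer.
    """
    result = qa(question=question, context=sentence,
                handle_impossible_answer=True)
    text = result["answer"].strip()
    return text or None


# Usage (commented out because it downloads a large model):
# qa = load_qa()
# print(answer_or_none(qa, "Who was arrested?",
#                      "Police arrested John Smith on Tuesday."))
```

Keeping the pipeline as an argument means the same helper can be exercised with a stub in place of the real model.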

Demonstrating d-recall

  • Retrieve PubMed sentences containing “pain” and “molecule”
  • The QA model returned an answer span for 98 sentences, and most of those answers were correct
  • Answers were diverse in terms of structure and words
  • Retrieve Wikipedia sentences including “police”, “arrest”, and a named person
  • Answers demonstrate the d-recall abilities of neural QA models

Demonstrating (lack of) e-recall

  • Syntactic patterns were used to query a Wikipedia corpus and a PubMed corpus
  • The SQuAD model was asked questions based on the syntactic patterns
  • The SQuAD model identified 611 sentences as No-Answer cases
  • The No-Answer cases contained explicit and clear answers to the questions
  • The SQuAD model failed to identify many answers recovered by the syntactic pattern query
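The comparison described above can be sketched as a small bookkeeping step: take the sentences a syntactic pattern matched, look up what the QA model returned for each, and count the sentences where the model returned nothing. This is an illustrative sketch rather than the paper's procedure; the function names and the toy data are hypothetical, and it counts only No-Answer cases, not wrong answers.

```python
def missed_by_qa(pattern_hits, qa_answers):
    """Ids of sentences the pattern matched but the QA model left unanswered.

    pattern_hits: dict mapping sentence id -> span captured by the pattern.
    qa_answers: dict mapping sentence id -> QA answer string, or None
                when the model predicted No-Answer.
    """
    return [sid for sid in pattern_hits if qa_answers.get(sid) is None]


def answered_fraction(pattern_hits, qa_answers):
    """Share of pattern-matched sentences for which the QA model answered.

    This is a rough upper bound on e-recall for the pattern, because a
    returned answer may still be incorrect.
    """
    if not pattern_hits:
        raise ValueError("pattern_hits is empty")
    missed = missed_by_qa(pattern_hits, qa_answers)
    return (len(pattern_hits) - len(missed)) / len(pattern_hits)


# Toy data (hypothetical): four sentences matched by one syntactic pattern.
pattern_hits = {"s1": "John Smith", "s2": "Mary Jones",
                "s3": "Alan Brown", "s4": "Eve Black"}
qa_answers = {"s1": "John Smith", "s2": None,
              "s3": "Alan Brown", "s4": None}

print(missed_by_qa(pattern_hits, qa_answers))
print(answered_fraction(pattern_hits, qa_answers))
```

Inspecting the sentences returned by `missed_by_qa` by hand is what reveals whether they contain explicit, clear answers that the model should have found.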

Discussion

  • Neural QA models have blind spots that can lead to low e-recall.
  • Adding pattern-based queries to the training data may hide the underlying problem but will not solve it.

Going forward

  • There are two kinds of recall: diversity-based and exhaustiveness-based
  • Pattern-based methods are better at exhaustiveness, while neural methods struggle
  • It is hard to measure exhaustiveness
  • Establishing vocabulary is a good start, but not enough
  • Need to develop robust methods of assessing e-recall

Limitations

  • The paper's opinion should be assessed, not taken for granted
  • Evidence is reported anecdotally, not quantitatively
  • Whether this is a limitation or a stylistic choice depends on the aim of the paper

Appendix A: Reviews and (some) responses

  • Paper argues for two kinds of recall: d-recall (favoring diversity) and e-recall (favoring exhaustiveness)
  • Paper presents anecdotal evidence that neural systems are good at d-recall, but less so at e-recall
  • Paper does not put in effort to warrant publication
  • Two kinds of recall are not clearly defined evaluation metrics
  • Results presented are qualitative
  • Anecdotal evidence and cherry-picked examples are not conclusive and not rigorous enough for scientific publication
  • Quantitative results presented are not conclusive
  • Paper only considers one task, one type of neural model, and one type of pattern-based model
  • Paper does not put in effort to support claims with relevant references
  • Paper does not provide a figure visualizing the difference between d-recall and e-recall
  • Section 2 could use more attention to organization
  • Section 3 results need to be presented differently