Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Pattern-based models are good at precision, while learning-based models are better at recall.
- There are two kinds of recall: d-recall (diversity) and e-recall (exhaustiveness).
- Neural methods are better at d-recall, but pattern-based methods can be better at e-recall.
- Evaluations should aim for both kinds of recall.
Paper Content
Introduction
- Pattern-based methods are more precise, while learning-based methods have better recall
- Recent advances in neural-network-based models have made learning-based methods more precise
- There are two kinds of recall: diversity and exhaustiveness
- Pattern-based methods are better at exhaustiveness, while learning-based methods are better at diversity
- Current datasets and evaluation methods focus primarily on diversity recall
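To make the distinction concrete, here is a minimal sketch of one way the two notions could be operationalized. It is not taken from the paper, which introduces the terms as vocabulary rather than as fixed metrics. The sketch assumes each gold instance is tagged with the surface pattern that expresses it, reads d-recall as the fraction of distinct patterns for which at least one instance was retrieved, and reads e-recall as the fraction of instances retrieved within a single pattern. The function names, pattern labels, and toy data are all hypothetical.

```python
def d_recall(gold, found):
    """Fraction of distinct patterns with at least one retrieved instance.

    gold:  dict mapping instance id -> pattern label (hypothetical tagging)
    found: set of retrieved instance ids
    """
    all_patterns = set(gold.values())
    covered = {gold[i] for i in found if i in gold}
    return len(covered) / len(all_patterns) if all_patterns else 0.0


def e_recall(gold, found, pattern):
    """Fraction of the instances of one pattern that were retrieved."""
    instances = {i for i, p in gold.items() if p == pattern}
    return len(instances & found) / len(instances) if instances else 0.0


# Toy data (assumption): six gold instances expressed through three patterns.
gold = {
    1: "X founded Y", 2: "X founded Y", 3: "X founded Y", 4: "X founded Y",
    5: "Y, founded by X",
    6: "X, the founder of Y",
}

neural_found = {1, 5, 6}      # touches every pattern, misses most of the first
pattern_found = {1, 2, 3, 4}  # exhausts one pattern, never sees the others

print(d_recall(gold, neural_found), e_recall(gold, neural_found, "X founded Y"))
print(d_recall(gold, pattern_found), e_recall(gold, pattern_found, "X founded Y"))
```

Under these assumptions the first retriever covers all patterns but only a quarter of the most frequent one, while the second covers one pattern completely and nothing else, which mirrors the trade-off described in the bullets above.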
Background
Dependency trees and syntactic patterns
Extractive question answering and the SQuAD dataset
- SQuAD is a collection of over 150,000 <question, passage> pairs
- The dataset is used to train machine learning models to perform extractive QA
- SQuAD v2.0 also includes <question, passage> pairs where the question is not answerable by the passage
- Human annotators were asked to ask a question and mark its answer in the text
- The SQuAD dataset and the extractive-QA task became very popular
- Models “solve” the SQuAD benchmark with very high accuracies
- SQuAD-based models are often used as off-the-shelf NLP components
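For readers unfamiliar with the task format, the sketch below shows what an extractive-QA record can look like. The field names (`question`, `context`, and `answers` with `text` and `answer_start`) are assumed to follow the layout of the Hugging Face `squad_v2` dataset; the passage and questions are invented for illustration and are not drawn from SQuAD. The helper `answer_span` is hypothetical.

```python
def answer_span(record):
    """Return the (start, end) character span of the first answer,
    or None when the question is unanswerable (SQuAD v2.0 style)."""
    texts = record["answers"]["text"]
    starts = record["answers"]["answer_start"]
    if not texts:  # empty answer lists mark an unanswerable question
        return None
    start = starts[0]
    return start, start + len(texts[0])


# Invented passage and questions (assumption, not SQuAD data).
context = "Marie Curie won the Nobel Prize in Physics in 1903."

answerable = {
    "question": "When did Marie Curie win the Nobel Prize in Physics?",
    "context": context,
    "answers": {"text": ["1903"], "answer_start": [context.index("1903")]},
}
unanswerable = {
    "question": "Where was Marie Curie born?",
    "context": context,
    "answers": {"text": [], "answer_start": []},
}

span = answer_span(answerable)
extracted = context[span[0]:span[1]]  # the answer is a literal substring of the passage
print(extracted, answer_span(unanswerable))
```

The point of the format is that an answer is always a span of the passage, so a model either points at characters in the text or declares the question unanswerable.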
Experiments
- Experiments are performed using the SPIKE system
- The SPIKE system allows queries over corpora with linguistic structures
- Syntactic annotation in SPIKE-indexed corpora is based on the Universal Dependencies v1 scheme
- Additional non-tree arcs are added by the pyBART system
- SQuAD experiments are based on the Hugging Face model hub
- The model is the RoBERTa-large pre-trained transformer
Demonstrating d-recall
- Retrieve PubMed sentences containing “pain” and “molecule”
- An answer span was identified in 98 sentences, and most of these answers were correct
- Answers were diverse in terms of structure and words
- Retrieve Wikipedia sentences including “police”, “arrest”, and a named person
- Answers demonstrate the d-recall abilities of neural QA models
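The retrieval step in these demonstrations can be sketched as a plain keyword filter. This is only a stand-in under stated assumptions: the paper uses SPIKE for retrieval, whereas the code below does exact whole-word matching in pure Python, and the sentences are invented rather than taken from PubMed. The helper names `contains_all` and `retrieve` are hypothetical.

```python
import re


def contains_all(sentence, keywords):
    """True if every keyword occurs in the sentence as a whole word (case-insensitive)."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return all(k.lower() in tokens for k in keywords)


def retrieve(corpus, keywords):
    """Keep only the sentences that mention all keywords."""
    return [s for s in corpus if contains_all(s, keywords)]


# Invented sentences standing in for a PubMed-style corpus (assumption).
corpus = [
    "The molecule reduced pain in treated mice.",
    "Chronic pain affects sleep quality.",
    "This small molecule binds the receptor.",
    "Pain relief was linked to a lipid molecule.",
]

hits = retrieve(corpus, ["pain", "molecule"])
for sentence in hits:
    print(sentence)
```

Each retrieved sentence would then be passed to the QA model together with a question. Because the filter constrains only which words appear, not how they are arranged, the surviving sentences can express the relation in many different ways, which is what makes this setup a test of diversity.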
Demonstrating (lack of) e-recall
- Syntactic patterns were used to query a Wikipedia corpus and a PubMed corpus
- The SQuAD model was asked questions based on the syntactic patterns
- The SQuAD model identified 611 sentences as No-Answer cases
- The No-Answer cases contained explicit and clear answers to the questions
- The SQuAD model failed to identify many answers recovered by the syntactic pattern query
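The comparison above can be sketched as a small procedure: take the sentences a syntactic pattern recovers, ask the QA model the corresponding question about each, and count how often it returns No-Answer. The sketch below is illustrative only. The function `no_answer_rate`, the stub model, the question, and the sentences are all hypothetical, and the stub deliberately recognizes a single phrasing so that the miss counting is visible; it is not the SQuAD model or the patterns used in the paper.

```python
def no_answer_rate(pattern_hits, qa_model, question):
    """Share of pattern-matched sentences for which the QA model returns no answer.

    pattern_hits: list of (sentence, gold_answer) pairs recovered by a syntactic pattern
    qa_model:     callable (question, sentence) -> answer string, or None for No-Answer
    """
    missed = [s for s, _ in pattern_hits if qa_model(question, s) is None]
    rate = len(missed) / len(pattern_hits) if pattern_hits else 0.0
    return rate, missed


def stub_qa(question, sentence):
    """Stub standing in for a QA model (assumption): it only handles one phrasing."""
    marker = " was founded by "
    if marker in sentence:
        return sentence.split(marker, 1)[1].rstrip(".")
    return None


# Invented pattern hits (assumption): every sentence states the answer explicitly.
pattern_hits = [
    ("Acme was founded by Jane Roe.", "Jane Roe"),
    ("Jane Roe founded Acme.", "Jane Roe"),
    ("John Doe founded Initech.", "John Doe"),
    ("Globex was founded by Ann Lee.", "Ann Lee"),
]

rate, missed = no_answer_rate(pattern_hits, stub_qa, "Who founded the company?")
print(f"No-Answer on {len(missed)} of {len(pattern_hits)} pattern hits ({rate:.0%})")
```

Since every pattern hit contains an explicit answer, each No-Answer case counts directly against exhaustiveness.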
Discussion
- Neural QA models have blind spots that can lead to low e-recall.
- Adding pattern-based queries to the training data may hide the underlying problem, but not solve it.
Going forward
- There are two kinds of recall: diversity-based and exhaustiveness-based
- Pattern-based methods are better at exhaustiveness, while neural methods struggle
- It is hard to measure exhaustiveness
- Establishing vocabulary is a good start, but not enough
- Need to develop robust methods of assessing e-recall
Limitations
- The paper states an opinion, which should be assessed rather than taken for granted
- Evidence is reported anecdotally, not quantitatively
- Whether this is a limitation or a stylistic choice depends on the aim of the paper
Reviews and (some) responses
- Paper argues for two kinds of recall: d-recall (favoring diversity) and e-recall (favoring exhaustiveness)
- Paper presents anecdotal evidence that neural systems are good at d-recall, but less so at e-recall
- Paper does not put in enough effort to warrant publication
- Two kinds of recall are not clearly defined evaluation metrics
- Results presented are qualitative
- Anecdotal evidence and cherry-picked examples are not conclusive and not rigorous enough for scientific publication
- Quantitative results presented are not conclusive
- Paper only considers one task, one type of neural model, and one type of pattern-based model
- Paper makes little effort to support its claims with relevant references
- Paper does not provide a figure visualizing the difference between d-recall and e-recall
- Section 2 could use more attention to organization
- Section 3 results need to be presented differently