Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Pattern-based models are good at precision, while learning based models are better at recall.
There are two kinds of recall: d-recall (diversity) and e-recall (exhaustiveness).
Neural methods are better at d-recall, but pattern-based methods can be better at e-recall.
Evaluations should aim for both kinds of recall.

Paper Content

Introduction

Pattern-based methods are more precise, while learning-based methods have better recall
Recent advances in neural-network based models have made learning-based methods more precise
There are two kinds of recall: diversity and exhaustiveness
Pattern-based methods are better at exhaustiveness, while learning-based methods are better at diversity
Current datasets and evaluation methods focus primarily on diversity recall

Background

Dependency trees and syntactic patterns

Extractive question answering and the squad dataset

SQuAD is a collection of over 150,000 <question, passage> pairs
The dataset is used to train machine learning models to perform extractive QA
SQuAD v 2.0 also includes <question, passage> pairs where the question is not answerable by the passage
Human annotators were asked to ask a question and mark its answer in the text
SQuAD dataset and the extractive-QA task became very popular
Models “solve” the SQuAD benchmark with very high accuracies
SQuaD-based models are often used as of-the-shelf NLP components

Experiments

Experiments are performed using the SPIKE system
SPIKE system allows queries over corpora with linguistic structures
Syntactic annotation in SPIKE-indexed corpora based on universal dependencies v1 scheme
Additional non-tree arcs added by pyBART system
SQuAD experiments based on Huggingface model hub
Model is RoBERTA large pre-trained transformer

Demonstrating d-recall

Retrieve PubMed sentences containing “pain” and “molecule”
98 sentences identified an answer span, most of which were correct
Answers were diverse in terms of structure and words
Retrieve Wikipedia sentences including “police”, “arrest”, and a named person
Answers demonstrate the d-recall abilities of neural QA models

Demonstrating (lack of) e-recall

Syntactic patterns were used to query a Wikipedia corpus and a PubMed corpus
The SQuAD model was asked questions based on the syntactic patterns
The SQuAD model identified 611 sentences as No-Answer cases
The No-Answer cases contained explicit and clear answers to the questions
The SQuAD model failed to identify many answers recovered by the syntactic pattern query

Discussion

Neural QA models have blind-spots that can lead to low e-recall.
Adding pattern-based queries to the training data may hide the underlying problem, but not solve it.

Going forward

There are two kinds of recall: diversity-based and exhaustiveness-based
Pattern-based methods are better at exhaustiveness, while neural methods struggle
It is hard to measure exhaustiveness
Establishing vocabulary is a good start, but not enough
Need to develop robust methods of assessing e-recall

Limitations

Opinion should be assessed, not taken for granted
Evidence is reported anecdotally, not quantitatively
Limitation or stylistic choice depending on the aim of the paper

A reviews and (some) responses

Paper argues for two kinds of recall: d-recall (favoring diversity) and e-recall (favoring exhaustiveness)
Paper presents anecdotal evidence that neural systems are good at d-recall, but less so at e-recall
Paper does not put in effort to warrant publication
Two kinds of recall are not clearly defined evaluation metrics
Results presented are qualitative
Anecdotal evidence and cherry-picked examples are not conclusive and not rigorous enough for scientific publication
Quantitative results presented are not conclusive
Paper only considers one task, one type of neural model, and one type of pattern-based model
Paper does not put in effort to support claims with relevant references
Paper does not provide figure visualizing the difference between d-recall and e-recall
Section 2 could use more attention to organization
Section 3 results need to be presented differently

Link to paper#

Abstract#

Paper Content#

Introduction#

Background#

Dependency trees and syntactic patterns#

Extractive question answering and the squad dataset#

Experiments#

Demonstrating d-recall#

Demonstrating (lack of) e-recall#

Discussion#

Going forward#

Limitations#

A reviews and (some) responses#

Link to paper

Abstract

Paper Content

Introduction

Background

Dependency trees and syntactic patterns

Extractive question answering and the squad dataset

Experiments

Demonstrating d-recall

Demonstrating (lack of) e-recall

Discussion

Going forward

Limitations

A reviews and (some) responses