Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs have shown impressive results with little or no direct supervision
LLMs may have potential in information-seeking scenarios
Attributed QA is a key first step in developing attributed LLMs
Evaluation framework uses human annotations and automatic metric
Benchmark a broad set of architectures for the task

Paper Content

Introduction

Large language models (LLMs) have shown impressive results across a variety of natural language tasks
LLMs require little or no direct supervision
LLMs have potential in information-seeking scenarios
LLMs can produce compelling output in scenarios such as question answering and dialog
Difficult to construct labeled datasets for complex tasks
LLMs need to attribute text they generate
Attributed question answering task proposed
Evaluation framework proposed using human annotations and automatic metric
Analysis of systems based on state-of-the-art components
Possibility of post-hoc attribution of LLM-generated answers

Related work in computer science
Key areas of related work

Question answering tasks

Question answering is a key way to discover and demonstrate advances in large language models.
Reading comprehension is a type of QA task.
SQuAD was the first large-scale, human-created reading comprehension dataset.
Natural Questions is a large reading comprehension dataset based on real information-seeking queries.
Open-domain QA is a task in which a system receives an input query and must return its answer based on a provided corpus of documents.
T5 can produce answers to questions without access to any corpus at inference time.
Natural Questions dataset has a subset of examples implicitly gold-labeled for attribution.

Llms with attribution

Evaluated single system: hybrid Google search/LLM system
Evaluation limited: 307 questions presented to raters, only 115 questions retained for evaluation
Recent demos use commercial search engine and LLM to generate factual responses
Systematic study of pros and cons of different architecture decisions for attributed LLM vision

Attributed question answering

Defines attributed question-answering task
Gives discussion

Task definition

Attributed QA task is defined with a set C of units to which answers can be attributed
Input is a question x, output is a pair (a, c)
Two evaluation metrics: human ratings and automatic evaluation
Human ratings use AIS evaluation definitions and guidelines
Automatic evaluation uses NLI classifier
Test accuracy is proportion of test examples where majority of raters judge system output to be attributable

Discussion

Task complexity is likely to be high due to multiple paragraphs that support an answer and multiple answers that have some supporting paragraph
Attribution allows users to assess trustworthiness and other aspects of the underlying source
Attribution allows users to better appreciate nuances of an answer
System developers can evaluate answer quality through measures such as AIS
Closed-book QA evaluations do not require attribution and depend on gold-curated labels
Human ratings are used to measure system performance
Attributed QA is closely related to the problem of attributing statements made by an LLM
Progress on Attributed QA will extend naturally into more complex tasks

Approaches to attributed qa

Three architecture classes of systems investigated in the paper
Systems differentiated in terms of type and quantity of supervision used

Architectures

Retrieve-then-read (RTR) models first retrieve k relevant passages based on the input question
A second-stage model takes P ⊂ k retrieved passages to generate a short answer
Post-hoc retrieval uses an LLM to generate an answer to the input question, then concatenates the question and answer to form a query to sparse or dense retrieval
LLM-as-retriever models use an LLM to generate both an answer and a pointer into the attribution corpus
Three system architectures are plausible for Attributed QA

Supervision

Systems can be differentiated based on the type and quantity of supervision used
NQ-64 systems use very limited supervision in the form of 64 randomly chosen training examples
NQ-full systems assume access to the full NQ training set
NQ-64 or NQ-full data can be used for finetuning or prompting

Best systems

Four systems achieve highest AIS score for their architecture class
Best RTR system uses GTR for retrieval and NQ-reranker for reranking
Best post-hoc retrieval system uses prompted PaLM and GTR for retrieval
Best low-resource system uses prompted PaLM and BM25 for retrieval
Best LLM-as-retriever system uses fine-tuned PaLM and BM25 for retrieval
AutoAIS reranked variants use AutoAIS to select attribution passages

Experiments

Technical details of datasets and evaluation metrics used for Attributed QA task
System results for Attributed QA task
Analysis of evaluation metrics for Attributed QA task

Datasets

Questions from the validation set of Natural Questions are used for evaluation.
Evaluation is restricted to the short-answer portion of the dataset.
The answer to the example question is ‘Antarctica’.
Attribution Corpus is derived from a snapshot of Wikipedia.

Evaluation metrics

Three metrics are reported for all experiments: AIS, AutoAIS, and Exact Match
AIS is assessed by human raters using the guidelines from Rashkin et al. (2021)
AutoAIS is a Natural Language Inference task that uses a T5 model
Exact Match is used to compare to prior work

System results

EM score does not necessarily correlate with AIS score
EM correlates moderately with human judgment of AIS
EM has limitations for Attributed QA evaluation
Best RTR achieves highest performance
RTR approaches require large amounts of explicit supervision
Best Post-hoc achieves relatively high EM with minimal supervision
Best Low Resource performs competitively with Best Post-hoc on AIS
End-to-end models have potential benefit of not requiring retrieval
Best LLM-as-retriever is competitive with low-resource post-hoc attribution
RTR-10 outperforms RTR-4 by more than 17 points AIS
Post-8 vs Post-4 AIS difference is not significant but AutoAIS difference is
NQ-full gives a 10 point boost to Exact Match
AutoAIS correlates well with human judgments of AIS
AutoAIS is fit-for-purpose as a development metric
AutoAIS is noisier than system-level AutoAIS
Correlation between system AIS and AutoAIS scores is strong
Correlation between system AIS and EM scores is moderate

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Question answering tasks#

Llms with attribution#

Attributed question answering#

Task definition#

Discussion#

Approaches to attributed qa#

Architectures#

Supervision#

Best systems#

Experiments#

Datasets#

Evaluation metrics#

System results#

Link to paper

Abstract

Paper Content

Introduction

Related work

Question answering tasks

Llms with attribution

Attributed question answering

Task definition

Discussion

Approaches to attributed qa

Architectures

Supervision

Best systems

Experiments

Datasets

Evaluation metrics

System results