Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs have shown impressive results with little or no direct supervision
- LLMs may have potential in information-seeking scenarios
- Attributed QA is a key first step in developing attributed LLMs
- Evaluation framework uses human annotations and automatic metric
- Benchmark a broad set of architectures for the task
Paper Content
Introduction
- Large language models (LLMs) have shown impressive results across a variety of natural language tasks
- LLMs require little or no direct supervision
- LLMs have potential in information-seeking scenarios
- LLMs can produce compelling output in scenarios such as question answering and dialog
- Difficult to construct labeled datasets for complex tasks
- LLMs need to attribute text they generate
- Attributed question answering task proposed
- Evaluation framework proposed using human annotations and automatic metric
- Analysis of systems based on state-of-the-art components
- Possibility of post-hoc attribution of LLM-generated answers
Related work
- Related work in computer science
- Key areas of related work
Question answering tasks
- Question answering is a key way to discover and demonstrate advances in large language models.
- Reading comprehension is a type of QA task.
- SQuAD was the first large-scale, human-created reading comprehension dataset.
- Natural Questions is a large reading comprehension dataset based on real information-seeking queries.
- Open-domain QA is a task in which a system receives an input query and must return its answer based on a provided corpus of documents.
- T5 can produce answers to questions without access to any corpus at inference time.
- Natural Questions dataset has a subset of examples implicitly gold-labeled for attribution.
Llms with attribution
- Evaluated single system: hybrid Google search/LLM system
- Evaluation limited: 307 questions presented to raters, only 115 questions retained for evaluation
- Recent demos use commercial search engine and LLM to generate factual responses
- Systematic study of pros and cons of different architecture decisions for attributed LLM vision
Attributed question answering
- Defines attributed question-answering task
- Gives discussion
Task definition
- Attributed QA task is defined with a set C of units to which answers can be attributed
- Input is a question x, output is a pair (a, c)
- Two evaluation metrics: human ratings and automatic evaluation
- Human ratings use AIS evaluation definitions and guidelines
- Automatic evaluation uses NLI classifier
- Test accuracy is proportion of test examples where majority of raters judge system output to be attributable
Discussion
- Task complexity is likely to be high due to multiple paragraphs that support an answer and multiple answers that have some supporting paragraph
- Attribution allows users to assess trustworthiness and other aspects of the underlying source
- Attribution allows users to better appreciate nuances of an answer
- System developers can evaluate answer quality through measures such as AIS
- Closed-book QA evaluations do not require attribution and depend on gold-curated labels
- Human ratings are used to measure system performance
- Attributed QA is closely related to the problem of attributing statements made by an LLM
- Progress on Attributed QA will extend naturally into more complex tasks
Approaches to attributed qa
- Three architecture classes of systems investigated in the paper
- Systems differentiated in terms of type and quantity of supervision used
Architectures
- Retrieve-then-read (RTR) models first retrieve k relevant passages based on the input question
- A second-stage model takes P ⊂ k retrieved passages to generate a short answer
- Post-hoc retrieval uses an LLM to generate an answer to the input question, then concatenates the question and answer to form a query to sparse or dense retrieval
- LLM-as-retriever models use an LLM to generate both an answer and a pointer into the attribution corpus
- Three system architectures are plausible for Attributed QA
Supervision
- Systems can be differentiated based on the type and quantity of supervision used
- NQ-64 systems use very limited supervision in the form of 64 randomly chosen training examples
- NQ-full systems assume access to the full NQ training set
- NQ-64 or NQ-full data can be used for finetuning or prompting
Best systems
- Four systems achieve highest AIS score for their architecture class
- Best RTR system uses GTR for retrieval and NQ-reranker for reranking
- Best post-hoc retrieval system uses prompted PaLM and GTR for retrieval
- Best low-resource system uses prompted PaLM and BM25 for retrieval
- Best LLM-as-retriever system uses fine-tuned PaLM and BM25 for retrieval
- AutoAIS reranked variants use AutoAIS to select attribution passages
Experiments
- Technical details of datasets and evaluation metrics used for Attributed QA task
- System results for Attributed QA task
- Analysis of evaluation metrics for Attributed QA task
Datasets
- Questions from the validation set of Natural Questions are used for evaluation.
- Evaluation is restricted to the short-answer portion of the dataset.
- The answer to the example question is ‘Antarctica’.
- Attribution Corpus is derived from a snapshot of Wikipedia.
Evaluation metrics
- Three metrics are reported for all experiments: AIS, AutoAIS, and Exact Match
- AIS is assessed by human raters using the guidelines from Rashkin et al. (2021)
- AutoAIS is a Natural Language Inference task that uses a T5 model
- Exact Match is used to compare to prior work
System results
- EM score does not necessarily correlate with AIS score
- EM correlates moderately with human judgment of AIS
- EM has limitations for Attributed QA evaluation
- Best RTR achieves highest performance
- RTR approaches require large amounts of explicit supervision
- Best Post-hoc achieves relatively high EM with minimal supervision
- Best Low Resource performs competitively with Best Post-hoc on AIS
- End-to-end models have potential benefit of not requiring retrieval
- Best LLM-as-retriever is competitive with low-resource post-hoc attribution
- RTR-10 outperforms RTR-4 by more than 17 points AIS
- Post-8 vs Post-4 AIS difference is not significant but AutoAIS difference is
- NQ-full gives a 10 point boost to Exact Match
- AutoAIS correlates well with human judgments of AIS
- AutoAIS is fit-for-purpose as a development metric
- AutoAIS is noisier than system-level AutoAIS
- Correlation between system AIS and AutoAIS scores is strong
- Correlation between system AIS and EM scores is moderate