Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • LLMs have shown impressive results with little or no direct supervision
  • LLMs may have potential in information-seeking scenarios
  • Attributed QA is a key first step in developing attributed LLMs
  • Evaluation framework uses human annotations and an automatic metric
  • Benchmark a broad set of architectures for the task

Paper Content


  • Large language models (LLMs) have shown impressive results across a variety of natural language tasks
  • LLMs require little or no direct supervision
  • LLMs have potential in information-seeking scenarios
  • LLMs can produce compelling output in scenarios such as question answering and dialog
  • Difficult to construct labeled datasets for complex tasks
  • LLMs need to attribute text they generate
  • Attributed question answering task proposed
  • Evaluation framework proposed using human annotations and an automatic metric
  • Analysis of systems based on state-of-the-art components
  • Possibility of post-hoc attribution of LLM-generated answers
  • Related work in computer science
  • Key areas of related work

Question answering tasks

  • Question answering is a key way to discover and demonstrate advances in large language models.
  • Reading comprehension is a type of QA task.
  • SQuAD was the first large-scale, human-created reading comprehension dataset.
  • Natural Questions is a large reading comprehension dataset based on real information-seeking queries.
  • Open-domain QA is a task in which a system receives an input query and must return its answer based on a provided corpus of documents.
  • T5 can produce answers to questions without access to any corpus at inference time.
  • Natural Questions dataset has a subset of examples implicitly gold-labeled for attribution.

LLMs with attribution

  • Evaluated single system: hybrid Google search/LLM system
  • Evaluation limited: 307 questions presented to raters, only 115 questions retained for evaluation
  • Recent demos use commercial search engine and LLM to generate factual responses
  • Systematic study of pros and cons of different architecture decisions for attributed LLM vision

Attributed question answering

  • Defines attributed question-answering task
  • Gives discussion

Task definition

  • Attributed QA task is defined with a set C of units to which answers can be attributed
  • Input is a question x, output is a pair (a, c)
  • Two evaluation metrics: human ratings and automatic evaluation
  • Human ratings use AIS evaluation definitions and guidelines
  • Automatic evaluation uses NLI classifier
  • Test accuracy is proportion of test examples where majority of raters judge system output to be attributable
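
The accuracy definition above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names, the three-rater setup, and the example votes are all hypothetical, and treating a tie as "not attributable" is an assumption.

```python
from typing import Sequence


def is_attributable(votes: Sequence[bool]) -> bool:
    """True when a strict majority of raters judged the (a, c) pair attributable.

    Assumption: a tie counts as not attributable.
    """
    return sum(votes) * 2 > len(votes)


def ais_accuracy(ratings: Sequence[Sequence[bool]]) -> float:
    """Proportion of test examples whose output a majority of raters accept."""
    if not ratings:
        raise ValueError("ratings must contain at least one example")
    return sum(is_attributable(v) for v in ratings) / len(ratings)


# Hypothetical votes: four test examples, three raters each.
example_ratings = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [False, True, False],
]
print(ais_accuracy(example_ratings))  # 0.5
```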


  • Task complexity is likely to be high due to multiple paragraphs that support an answer and multiple answers that have some supporting paragraph
  • Attribution allows users to assess trustworthiness and other aspects of the underlying source
  • Attribution allows users to better appreciate nuances of an answer
  • System developers can evaluate answer quality through measures such as AIS
  • Closed-book QA evaluations do not require attribution and depend on gold-curated labels
  • Human ratings are used to measure system performance
  • Attributed QA is closely related to the problem of attributing statements made by an LLM
  • Progress on Attributed QA will extend naturally into more complex tasks

Approaches to attributed QA

  • Three architecture classes of systems investigated in the paper
  • Systems differentiated in terms of type and quantity of supervision used


  • Retrieve-then-read (RTR) models first retrieve k relevant passages based on the input question
  • A second-stage model takes a subset P of the k retrieved passages and generates a short answer
  • Post-hoc retrieval uses an LLM to generate an answer to the input question, then concatenates the question and answer to form a query to sparse or dense retrieval
  • LLM-as-retriever models use an LLM to generate both an answer and a pointer into the attribution corpus
  • Three system architectures are plausible for Attributed QA
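
The control flow of the three architectures can be contrasted with a small sketch. Everything here is a stand-in: the token-overlap retriever replaces a real sparse or dense retriever, the `reader`, `llm`, and `llm_with_pointer` callables are hypothetical stubs rather than actual models, and the toy corpus and question are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Corpus = Dict[str, str]        # passage id -> passage text
Attributed = Tuple[str, str]   # (answer a, attribution c)


def retrieve(query: str, corpus: Corpus, k: int) -> List[str]:
    """Toy retriever: rank passage ids by token overlap with the query."""
    q = set(query.lower().split())
    return sorted(
        corpus,
        key=lambda pid: len(q & set(corpus[pid].lower().split())),
        reverse=True,
    )[:k]


def retrieve_then_read(
    question: str,
    corpus: Corpus,
    reader: Callable[[str, List[Tuple[str, str]]], Attributed],
    k: int = 2,
) -> Attributed:
    """RTR: retrieve k passages first, then let a reader produce the answer."""
    passages = [(pid, corpus[pid]) for pid in retrieve(question, corpus, k)]
    return reader(question, passages)


def post_hoc_retrieval(
    question: str, corpus: Corpus, llm: Callable[[str], str]
) -> Attributed:
    """Post-hoc: answer first, then retrieve with question + answer as the query."""
    answer = llm(question)
    return answer, retrieve(f"{question} {answer}", corpus, k=1)[0]


def llm_as_retriever(
    question: str, llm_with_pointer: Callable[[str], Attributed]
) -> Attributed:
    """LLM-as-retriever: the model emits both the answer and a corpus pointer."""
    return llm_with_pointer(question)


# Hypothetical toy corpus and stub models.
corpus = {
    "p1": "Antarctica is the coldest continent on Earth",
    "p2": "The Sahara is the largest hot desert",
}
question = "which continent is the coldest"


def stub_reader(q: str, passages: List[Tuple[str, str]]) -> Attributed:
    # Pretend the reader extracts the first word of the top passage.
    return passages[0][1].split()[0], passages[0][0]


print(retrieve_then_read(question, corpus, stub_reader))
print(post_hoc_retrieval(question, corpus, lambda q: "Antarctica"))
print(llm_as_retriever(question, lambda q: ("Antarctica", "p1")))
```

All three calls return an (answer, attribution) pair; the difference lies in whether retrieval happens before generation, after it, or not at all.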


  • Systems can be differentiated based on the type and quantity of supervision used
  • NQ-64 systems use very limited supervision in the form of 64 randomly chosen training examples
  • NQ-full systems assume access to the full NQ training set
  • NQ-64 or NQ-full data can be used for finetuning or prompting
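
As a rough illustration of the low-supervision setting, the sketch below draws 64 random training examples and formats them as a few-shot prompt. The "Q:/A:" template, the fixed seed, and the synthetic training pairs are assumptions made for the example; the summary does not specify the prompt format actually used.

```python
import random
from typing import List, Tuple

QAPair = Tuple[str, str]


def sample_nq64(train: List[QAPair], n: int = 64, seed: int = 0) -> List[QAPair]:
    """Draw n training examples uniformly at random without replacement."""
    return random.Random(seed).sample(train, n)


def build_prompt(exemplars: List[QAPair], question: str) -> str:
    """Format exemplars as few-shot demonstrations followed by the new question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"


# Synthetic stand-in for a training set (not real Natural Questions data).
train = [(f"question {i}", f"answer {i}") for i in range(1000)]
exemplars = sample_nq64(train)
prompt = build_prompt(exemplars, "which continent is the coldest")
print(len(exemplars))
print(prompt[-40:])
```

The same 64 examples could instead be used as a small finetuning set; only the prompting route is shown here.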

Best systems

  • Four systems achieve highest AIS score for their architecture class
  • Best RTR system uses GTR for retrieval and NQ-reranker for reranking
  • Best post-hoc retrieval system uses prompted PaLM and GTR for retrieval
  • Best low-resource system uses prompted PaLM and BM25 for retrieval
  • Best LLM-as-retriever system uses fine-tuned PaLM and BM25 for retrieval
  • AutoAIS reranked variants use AutoAIS to select attribution passages
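
The reranking idea can be sketched as picking the candidate passage with the highest attribution score for a given question and answer. The `toy_score` function below is a hypothetical token-overlap stand-in, not the NLI model behind AutoAIS, and the passages are invented.

```python
from typing import Callable, List

# A scorer maps (passage, question, answer) to a support score; higher is better.
Scorer = Callable[[str, str, str], float]


def rerank_by_score(
    question: str, answer: str, passages: List[str], score: Scorer
) -> List[str]:
    """Order candidate passages from most to least supportive of the answer."""
    return sorted(passages, key=lambda p: score(p, question, answer), reverse=True)


def toy_score(passage: str, question: str, answer: str) -> float:
    """Stand-in scorer: fraction of question and answer tokens found in the passage."""
    claim = set(f"{question} {answer}".lower().split())
    return len(claim & set(passage.lower().split())) / len(claim)


candidates = [
    "The Sahara is the largest hot desert",
    "Antarctica is the coldest continent on Earth",
]
ranked = rerank_by_score(
    "which continent is the coldest", "Antarctica", candidates, toy_score
)
print(ranked[0])  # the passage chosen as the attribution
```

In a real system the scorer would wrap an entailment classifier, with the passage as premise and the question-answer pair as hypothesis.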


  • Technical details of datasets and evaluation metrics used for Attributed QA task
  • System results for Attributed QA task
  • Analysis of evaluation metrics for Attributed QA task


  • Questions from the validation set of Natural Questions are used for evaluation.
  • Evaluation is restricted to the short-answer portion of the dataset.
  • The answer to the example question is ‘Antarctica’.
  • Attribution Corpus is derived from a snapshot of Wikipedia.

Evaluation metrics

  • Three metrics are reported for all experiments: AIS, AutoAIS, and Exact Match
  • AIS is assessed by human raters using the guidelines from Rashkin et al. (2021)
  • AutoAIS is formulated as a natural language inference task and computed with a T5 model
  • Exact Match is used to compare to prior work
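
For reference, Exact Match is commonly computed with SQuAD-style answer normalization, sketched below. That the paper uses exactly these normalization steps is an assumption; the function names are illustrative.

```python
import re
import string
from typing import Iterable

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Iterable[str]) -> bool:
    """True if the normalized prediction equals any normalized reference answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


print(exact_match("The Antarctica.", ["Antarctica"]))  # True
print(exact_match("Arctic", ["Antarctica"]))           # False
```

Note that EM only compares answer strings, so it says nothing about whether the attributed passage supports the answer.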

System results

  • EM score does not necessarily correlate with AIS score
  • EM correlates moderately with human judgment of AIS
  • EM has limitations for Attributed QA evaluation
  • Best RTR achieves highest performance
  • RTR approaches require large amounts of explicit supervision
  • Best Post-hoc achieves relatively high EM with minimal supervision
  • Best Low Resource performs competitively with Best Post-hoc on AIS
  • End-to-end models have potential benefit of not requiring retrieval
  • Best LLM-as-retriever is competitive with low-resource post-hoc attribution
  • RTR-10 outperforms RTR-4 by more than 17 points AIS
  • Post-8 vs Post-4 AIS difference is not significant but AutoAIS difference is
  • NQ-full gives a 10 point boost to Exact Match
  • AutoAIS correlates well with human judgments of AIS
  • AutoAIS is fit-for-purpose as a development metric
  • Instance-level AutoAIS is noisier than system-level AutoAIS
  • Correlation between system AIS and AutoAIS scores is strong
  • Correlation between system AIS and EM scores is moderate
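
A system-level correlation of this kind can be computed with a plain Pearson coefficient over per-system scores. The sketch below assumes Pearson correlation as the measure, and the score lists are made-up placeholders, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-system scores (one entry per system), for illustration only.
human_ais = [0.30, 0.45, 0.50, 0.62]
auto_ais = [0.28, 0.47, 0.55, 0.60]
print(round(pearson(human_ais, auto_ais), 3))
```

The same function applies to AIS versus EM by swapping in per-system EM scores.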