Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposed a question-answering system that can answer questions with evidence spread over multiple documents
System uses a three-step pipeline: decompose, retrieve, and aggregate
Evaluated on three datasets: IIRC, Qasper, and StrategyQA
Results suggest current retrievers are the main bottleneck
Model is more effective when it gives explanations before answering a question

QA tasks that use short contexts have seen progress in multiple domains
Necessary information to answer a question is often spread over multiple documents or long ones
QA models are based on a pipeline of a retriever and a reader component
LLMs can reduce costs for solving QA tasks by allowing implementation of QA systems for different domains without needing a specific annotated dataset
Adding a chain-of-thought reasoning step before answering significantly improves LLMs’ zero or few-shot effectiveness
Visconde is a QA system that combines a retriever and a few-shot approach to induce an LLM to generate the answer
Visconde rivals state-of-the-art supervised models in three datasets
Future work should focus on improving retrievers

Most approaches for multi-document QA use a retriever and reader component
Retriever selects relevant documents for a given question, reader infers final answer
Retrievers use dense retrievers, commercial search engines, etc.
Readers use sequence-to-sequence models, numerical reasoning models, LLMs, etc.
Recent work adds components to perform query decomposition or evidence retrieved by web search engine
Our work focuses on evaluating limitations of this method and found retrieval component needs more work

Visconde is a multi-document QA system with three steps: Question decomposition, Document Retrieval, and Aggregation
Question decomposition breaks the user question into subquestions
Document Retrieval uses an inverted index, BM25 algorithm, and a sequence-to-sequence model to retrieve and rank documents
Aggregation uses GPT-3 to generate reasoning steps and an answer
Two approaches for prompt construction: static and dynamic
Dynamic prompts use KNN algorithm to find similar questions

Qasper is a dataset for information-seeking QA.
Questions in the dataset are closed-ended and grounded in a single paper.
Document retrieval step consists of reranking paper’s paragraphs and choosing top five as context documents.
No advantage in using dynamic prompt in this dataset.

StrategyQA is a dataset focused on open-domain questions that require reasoning steps.
It has three tasks: question decomposition, evidence paragraph retrieval, and question answering.
Pre-processing involved splitting the context articles into windows of three sentences each.

Visconde outperforms baselines in IIRC dataset
Visconde approaches human performance when using gold contexts
Visconde performs better with CoT
Visconde outperforms LED-base model in StrategyQA
Visconde outperforms baselines in answer accuracy and evidence recall@10 in StrategyQA

System for multi-document question answering
Uses passage reranker to retrieve documents and large language models to reason over them and compose an answer
Rival state-of-the-art supervised models in three datasets: IIRC, Qasper, and StrategyQA
GPT-3 close to human-level performance as long as relevant passages are provided
Current retrievers are the main bottleneck
Inducing the model to give explanations before answering a question improves effectiveness