Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Proposes a question-answering system that answers questions whose evidence is spread over multiple documents
  • System uses a three-step pipeline: decompose, retrieve, and aggregate
  • Evaluated on three datasets: IIRC, Qasper, and StrategyQA
  • Results suggest current retrievers are the main bottleneck
  • Model is more effective when it gives explanations before answering a question

Paper Content


  • QA tasks that use short contexts have seen progress in multiple domains
  • The information needed to answer a question is often spread over multiple documents or a single long one
  • QA models are based on a pipeline of a retriever and a reader component
  • LLMs can cut the cost of building QA systems for new domains because they do not need a domain-specific annotated dataset
  • Adding a chain-of-thought reasoning step before answering significantly improves LLMs’ zero-shot and few-shot effectiveness
  • Visconde is a QA system that combines a retriever and a few-shot approach to induce an LLM to generate the answer
  • Visconde rivals state-of-the-art supervised models in three datasets
  • Future work should focus on improving retrievers
  • Most approaches for multi-document QA use a retriever and reader component
  • Retriever selects relevant documents for a given question, reader infers final answer
  • Retrievers include dense retrievers, commercial search engines, etc.
  • Readers include sequence-to-sequence models, numerical reasoning models, LLMs, etc.
  • Recent work adds components for query decomposition or uses evidence retrieved by a web search engine
  • Our work evaluates the limitations of this approach and finds that the retrieval component needs more work
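
The retriever and reader pattern described above, with an optional decomposition step in front, can be sketched as a small function that wires three pluggable components together. This is only an illustration under stated assumptions: `answer_question`, the toy corpus, and the `toy_*` stand-ins are hypothetical names, not code from the paper, and the stand-ins replace the real models with simple string operations so the sketch runs on its own.

```python
from typing import Callable, List

def answer_question(
    question: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    read: Callable[[str, List[str]], str],
) -> str:
    """Decompose the question, retrieve passages per subquestion, then read."""
    # Fall back to the original question if no subquestions are produced.
    subquestions = decompose(question) or [question]
    passages: List[str] = []
    for sub in subquestions:
        for passage in retrieve(sub):
            if passage not in passages:  # de-duplicate, keep first-seen order
                passages.append(passage)
    return read(question, passages)


# Hypothetical stand-ins so the sketch runs without any model or index.
CORPUS = [
    "Paris is the capital of France.",
    "France is a country in Europe.",
    "Berlin is the capital of Germany.",
]
STOPWORDS = {"is", "the", "of", "a", "in", "what", "which"}

def tokens(text: str) -> set:
    """Lowercased word set with punctuation and stopwords removed."""
    return {w.strip(".,?").lower() for w in text.split()} - STOPWORDS

def toy_decompose(question: str) -> List[str]:
    """Assumption: split on ' and ' instead of calling an LLM."""
    return [part.strip() for part in question.split(" and ") if part.strip()]

def toy_retrieve(query: str) -> List[str]:
    """Assumption: keyword overlap instead of a real retriever."""
    q = tokens(query)
    return [doc for doc in CORPUS if q & tokens(doc)]

def toy_read(question: str, passages: List[str]) -> str:
    """Placeholder reader: reports how many passages it was given."""
    return f"{len(passages)} passages used"
```

In a real system each callable would wrap a model or a search index; keeping them as parameters makes it easy to swap one component, for example the retriever, without touching the rest.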

Our method: Visconde

  • Visconde is a multi-document QA system with three steps: Question decomposition, Document Retrieval, and Aggregation
  • Question decomposition breaks the user question into subquestions
  • Document Retrieval uses an inverted index, BM25 algorithm, and a sequence-to-sequence model to retrieve and rank documents
  • Aggregation uses GPT-3 to generate reasoning steps and an answer
  • Two approaches for prompt construction: static and dynamic
  • Dynamic prompts use a k-nearest-neighbors (KNN) search to pick few-shot examples whose questions are similar to the test question
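
To make the dynamic-prompt idea concrete, here is a minimal sketch of nearest-neighbor example selection followed by prompt assembly. Several things are assumptions made for illustration: similarity is computed as bag-of-words cosine (a stand-in for whatever representation the system actually uses), the example pool and its explanations are invented placeholders, and the `[Document i]` / `Question` / `Explanation` / `Answer` layout is one plausible format rather than the paper's exact prompt.

```python
import math
from collections import Counter
from typing import Dict, List

def bow(text: str) -> Counter:
    """Bag-of-words vector (assumed stand-in for a learned embedding)."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def select_examples(question: str, pool: List[Dict[str, str]], k: int = 2):
    """Return the k pool examples whose questions are most similar."""
    q = bow(question)
    ranked = sorted(pool, key=lambda ex: cosine(q, bow(ex["question"])), reverse=True)
    return ranked[:k]

def build_prompt(question: str, passages: List[str], examples) -> str:
    """Few-shot prompt that asks for an explanation before the answer."""
    blocks = [
        f"Question: {ex['question']}\nExplanation: {ex['explanation']}\nAnswer: {ex['answer']}"
        for ex in examples
    ]
    evidence = "\n".join(f"[Document {i}]: {p}" for i, p in enumerate(passages, 1))
    # Ending on "Explanation:" nudges the model to reason before answering.
    blocks.append(f"{evidence}\nQuestion: {question}\nExplanation:")
    return "\n\n".join(blocks)

# Invented placeholder pool, for illustration only.
POOL = [
    {"question": "How old was the author when the book was published?",
     "explanation": "The author was born in 1950 and the book appeared in 1984.",
     "answer": "34"},
    {"question": "Which river flows through the capital of Austria?",
     "explanation": "The capital of Austria is Vienna, which lies on the Danube.",
     "answer": "Danube"},
    {"question": "How old was the director when the film was released?",
     "explanation": "The director was born in 1960 and the film came out in 2001.",
     "answer": "41"},
]
```

A static prompt would skip `select_examples` and always pass the same hand-picked examples to `build_prompt`.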


  • IIRC dataset consists of information-seeking questions
  • 10% of IIRC training set was automatically generated using GPT-3
  • Context articles were processed to create a searchable index
  • Framework depicted in Figure 1 was used to decompose questions
  • Document retrieval was performed on a database of Wikipedia documents
  • Four methods were applied in the aggregation step
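
The "searchable index" step can be illustrated with a small inverted index scored by BM25. This is a from-scratch sketch, not the tooling used in the paper: the class name `BM25Index`, the regex tokenizer, the `k1` and `b` values, and the Lucene-style IDF variant are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens (assumed, deliberately simple)."""
    return re.findall(r"[a-z0-9]+", text.lower())

class BM25Index:
    def __init__(self, docs: List[str], k1: float = 1.2, b: float = 0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.avgdl = sum(len(t) for t in self.doc_tokens) / len(docs)
        # Inverted index: term -> {doc_id: term frequency}
        self.postings: Dict[str, Dict[int, int]] = defaultdict(dict)
        for doc_id, toks in enumerate(self.doc_tokens):
            for term, tf in Counter(toks).items():
                self.postings[term][doc_id] = tf

    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        scores: Dict[int, float] = defaultdict(float)
        n_docs = len(self.docs)
        for term in set(tokenize(query)):
            posting = self.postings.get(term, {})
            if not posting:
                continue
            # IDF variant that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - len(posting) + 0.5) / (len(posting) + 0.5))
            for doc_id, tf in posting.items():
                dl = len(self.doc_tokens[doc_id])
                denom = tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
                scores[doc_id] += idf * tf * (self.k1 + 1) / denom
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return [(self.docs[i], s) for i, s in ranked[:top_k]]

docs = [
    "The Eiffel Tower is located in Paris.",
    "Paris is the capital of France.",
    "Mount Everest is the highest mountain on Earth.",
]
index = BM25Index(docs)
results = index.search("Where is the Eiffel Tower?")
```

Here `results[0][0]` is the Eiffel Tower passage, because the two rare query terms appear only in that document. In a two-stage setup, candidates like these would then be passed to a neural reranker.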


  • Qasper is a dataset for information-seeking QA.
  • Questions in the dataset are closed-ended and grounded in a single paper.
  • The document retrieval step consists of reranking the paper’s paragraphs and choosing the top five as context documents.
  • The dynamic prompt gave no advantage on this dataset.
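
Because each Qasper question is grounded in one paper, retrieval reduces to scoring that paper's paragraphs and keeping the best five. The sketch below shows that selection step with a pluggable scoring function. `top_k_paragraphs` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in for a neural reranker.

```python
from typing import Callable, List

def top_k_paragraphs(
    question: str,
    paragraphs: List[str],
    score: Callable[[str, str], float],
    k: int = 5,
) -> List[str]:
    """Rank paragraphs by score (ties broken by original order), keep top k."""
    scored = [(score(question, p), i, p) for i, p in enumerate(paragraphs)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [p for _, _, p in scored[:k]]

def overlap_score(question: str, paragraph: str) -> float:
    """Assumed stand-in scorer: fraction of question words found in the paragraph."""
    q = set(question.lower().split())
    p = set(paragraph.lower().split())
    return len(q & p) / len(q) if q else 0.0

paragraphs = [
    "we introduce a new model",
    "the dataset used for evaluation is squad",
    "training takes two days",
    "evaluation is done on a held-out dataset",
    "related work is discussed here",
    "results are shown in the table",
    "the optimizer used is adam",
]
context = top_k_paragraphs("which dataset is used for evaluation", paragraphs, overlap_score)
```

The five selected paragraphs would then be placed in the prompt as the context documents.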


  • StrategyQA is a dataset focused on open-domain questions that require reasoning steps.
  • It has three tasks: question decomposition, evidence paragraph retrieval, and question answering.
  • Pre-processing involved splitting the context articles into windows of three sentences each.
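
The windowing step can be sketched as follows. The regex sentence splitter is a naive assumption (a real pipeline would likely use a proper sentence segmenter), and the `stride` parameter is added for illustration: the summary only says windows hold three sentences and does not say whether they overlap, so the default here is non-overlapping.

```python
import re
from typing import List

def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def sentence_windows(text: str, size: int = 3, stride: int = 3) -> List[str]:
    """Group sentences into windows of `size`, advancing by `stride`."""
    sents = split_sentences(text)
    windows = []
    for start in range(0, len(sents), stride):
        windows.append(" ".join(sents[start:start + size]))
        if start + size >= len(sents):  # last window reached the end
            break
    return windows

article = "A one. B two. C three. D four. E five. F six. G seven."
passages = sentence_windows(article)
```

With the seven-sentence `article` above, `passages` holds three windows, the last one containing only the final sentence. Each window would be indexed as its own retrievable passage.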


  • Visconde outperforms baselines in IIRC dataset
  • Visconde approaches human performance when using gold contexts
  • Visconde performs better with CoT
  • Visconde outperforms the LED-base model in Qasper
  • Visconde outperforms baselines in answer accuracy and evidence recall@10 in StrategyQA
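
For readers unfamiliar with the two StrategyQA metrics named above, the following sketch shows one common way to compute them. It is an assumption about the general form of the metrics, not the official evaluation script: recall@k is taken as the fraction of gold evidence paragraphs that appear among the top k retrieved ones, and accuracy as the fraction of exactly matching answers. The function names and sample identifiers are hypothetical.

```python
from typing import Hashable, List, Sequence

def recall_at_k(retrieved: Sequence[Hashable], gold: Sequence[Hashable], k: int = 10) -> float:
    """Fraction of gold evidence items found in the top-k retrieved items."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(set(retrieved[:k]) & gold_set) / len(gold_set)

def accuracy(predictions: List, references: List) -> float:
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

# Hypothetical paragraph identifiers, ranked best-first.
retrieved = [f"p{i}" for i in range(12)]   # p0 ... p11
gold = ["p1", "p10", "p20"]
r10 = recall_at_k(retrieved, gold)         # only p1 is in the top 10
acc = accuracy([True, False, True, True], [True, True, True, False])
```

In this toy case `r10` is one third, since `p10` sits at rank 11 and `p20` was never retrieved, and `acc` is 0.5.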


  • System for multi-document question answering
  • Uses a passage reranker to retrieve documents and large language models to reason over them and compose an answer
  • Rivals state-of-the-art supervised models in three datasets: IIRC, Qasper, and StrategyQA
  • GPT-3 is close to human-level performance as long as relevant passages are provided
  • Current retrievers are the main bottleneck
  • Inducing the model to give explanations before answering a question improves effectiveness