Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed a question-answering system that can answer questions with evidence spread over multiple documents
- System uses a three-step pipeline: decompose, retrieve, and aggregate
- Evaluated on three datasets: IIRC, Qasper, and StrategyQA
- Results suggest current retrievers are the main bottleneck
- Model is more effective when it gives explanations before answering a question
Paper Content
Introduction
- QA tasks that use short contexts have seen progress in multiple domains
- Necessary information to answer a question is often spread over multiple documents or long ones
- QA models are based on a pipeline of a retriever and a reader component
- LLMs can reduce costs for solving QA tasks by allowing implementation of QA systems for different domains without needing a specific annotated dataset
- Adding a chain-of-thought reasoning step before answering significantly improves LLMs’ zero or few-shot effectiveness
- Visconde is a QA system that combines a retriever and a few-shot approach to induce an LLM to generate the answer
- Visconde rivals state-of-the-art supervised models in three datasets
- Future work should focus on improving retrievers
Related work
- Most approaches for multi-document QA use a retriever and reader component
- Retriever selects relevant documents for a given question, reader infers final answer
- Retrievers use dense retrievers, commercial search engines, etc.
- Readers use sequence-to-sequence models, numerical reasoning models, LLMs, etc.
- Recent work adds components to perform query decomposition or evidence retrieved by web search engine
- Our work focuses on evaluating limitations of this method and found retrieval component needs more work
Our method: visconde
- Visconde is a multi-document QA system with three steps: Question decomposition, Document Retrieval, and Aggregation
- Question decomposition breaks the user question into subquestions
- Document Retrieval uses an inverted index, BM25 algorithm, and a sequence-to-sequence model to retrieve and rank documents
- Aggregation uses GPT-3 to generate reasoning steps and an answer
- Two approaches for prompt construction: static and dynamic
- Dynamic prompts use KNN algorithm to find similar questions
Iirc
- IIRC dataset consists of information-seeking questions
- 10% of IIRC training set was automatically generated using GPT-3
- Context articles were processed to create a searchable index
- Framework depicted in Figure 1 was used to decompose questions
- Document retrieval was performed on a database of Wikipedia documents
- Four methods were applied in the aggregation step
Qasper
- Qasper is a dataset for information-seeking QA.
- Questions in the dataset are closed-ended and grounded in a single paper.
- Document retrieval step consists of reranking paper’s paragraphs and choosing top five as context documents.
- No advantage in using dynamic prompt in this dataset.
Strategyqa
- StrategyQA is a dataset focused on open-domain questions that require reasoning steps.
- It has three tasks: question decomposition, evidence paragraph retrieval, and question answering.
- Pre-processing involved splitting the context articles into windows of three sentences each.
Results
- Visconde outperforms baselines in IIRC dataset
- Visconde approaches human performance when using gold contexts
- Visconde performs better with CoT
- Visconde outperforms LED-base model in StrategyQA
- Visconde outperforms baselines in answer accuracy and evidence recall@10 in StrategyQA
Conclusion
- System for multi-document question answering
- Uses passage reranker to retrieve documents and large language models to reason over them and compose an answer
- Rival state-of-the-art supervised models in three datasets: IIRC, Qasper, and StrategyQA
- GPT-3 close to human-level performance as long as relevant passages are provided
- Current retrievers are the main bottleneck
- Inducing the model to give explanations before answering a question improves effectiveness