Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent work has shown that large language models can generate natural language reasoning steps to answer multi-step questions.
- When the necessary knowledge is not available or up-to-date, an external knowledge source can be used to retrieve text and prepend it as context to the model’s input.
- A new approach, IRCoT, interleaves retrieval with CoT for multi-step QA, guiding the retrieval with CoT and using retrieved results to improve CoT.
- Experiments with GPT3 show substantial improvements in retrieval and downstream QA on four datasets.
- The method also works well for much smaller models without any additional training.
Paper Content
Introduction
- Large language models can answer complex questions by generating step-by-step natural language reasoning steps
- This approach has been used when all the information needed to answer the question is provided or assumed to be present
- For open-domain questions, knowledge is not always available or up-to-date in models’ parameters
- It is beneficial to retrieve knowledge from external sources
- How can we augment chain-of-thought prompting for open-domain, knowledge-intensive tasks?
- Retrieval from a knowledge source based on the question can successfully augment LMs
- This strategy has limitations for more complex multi-step reasoning questions
- Retrieval and reasoning steps must inform each other
- Proposed an interleaving approach to this problem
- Use retrieval to guide the chain-of-thought reasoning steps and use CoT reasoning to guide the retrieval
- Evaluated efficacy of system on 4 multi-step reasoning datasets
- Results show improved retrieval and few-shot QA performance
Related work
- LLMs can learn tasks with few examples as prompts
- LLMs can answer complex questions with step-by-step reasoning
- Smaller models can be used with additional fine-tuning
- Multi-step open-domain QA has been relatively underexplored
- GPT-3 can answer long-form questions by interacting with a browser
- Augmenting few-shot QA prompting with context from Google Search results
- SelfAsk prompts LLMs to decompose a question into subquestions
- DecomP decomposes a task and delegates sub-tasks to sub-models
- ReAct frames the problem as generating a sequence of reasoning and action steps
- Supervised multi-step open-domain QA has been explored
- Retrieve-and-read paradigm used to answer knowledge-intensive questions
Interleaving retrieval with chain-of-thought reasoning
- IRCoT is a proposed retriever method
- It is instantiated from three ingredients: a base retriever, a language model, and annotated questions
- The method involves iteratively interleaving two steps (reason and retrieve) until a termination criterion is met
Question answering reader
- QA reader answers questions using retrieved paragraphs
- Two prompting strategies used: CoT Prompting and Direct Prompting
- CoT Prompting requires model to generate full CoT from scratch
- Direct Prompting requires answer field to contain only final answer
Experimental setup
Datasets
- Evaluated method on 4 multi-step QA datasets in open-domain setting
- HotpotQA already comes with associated Wikipedia corpus
- 2WikiMulti-hopQA and MuSiQue are originally reading comprehension datasets, turned into open-domain setting by combining all supporting and non-supporting paragraphs
- IIRC is mix between reading comprehension and open-domain setting
- Sampled 100 questions from original development set for tuning hyperparameters, 500 questions from remaining development set for test set
Models
- Retriever systems compared: One-step Retriever (OneR) and IRCoT Retriever
- Language models used: OpenAI GPT3 and T5-Flan
- Retrieval metric: Recall of gold paragraphs among retrieved set
- QA reader: Direct Prompting strategy for T5-Flan and CoT Prompting strategy for GPT3
- ODQA models: OneR-[LM-name] and IRCoT-[LM-name]
- IIRC: Main passage always kept as part of input, generated titles mapped to nearest Wikipedia page titles
Results
- IRCoT outperforms one-step retrieval
- IRCoT QA outperforms NoR and OneR QA
- IRCoT is effective for smaller models too
- IRCoT improves recall metric on all datasets
- IRCoT QA outperforms OneR QA on all datasets
Ablations and other findings
- IRCoT produces a CoT as part of its retrieval process
- Ablating the reader hurts performance
- Separate reader has chance to consider all evidence together
- Compared to 4 recent approaches to using large language models for open domain QA
- IRCoT QA outperforms all other systems by a large margin
Conclusions
- Chain-of-thought prompting has improved prompting-based large language models’ ability to perform multi-step reasoning
- This work leveraged this ability to improve retrieval and QA performance for complex knowledge-intensive open-domain tasks in a fewshot setting
- One-step question based retrieval is insufficient for such tasks
- IRCoT interleaves chain-of-thought generation and retrieval steps to guide each other step-by-step
- IRCoT significantly improves retrieval performance and few-shot open-domain QA performance compared to one-step retrieval
- Direct prompting is a better choice for the reader for Flan-T5-XXL, and CoT prompting is a better choice for the reader for GPT3
- IRCoT QA > OneR QA > ZeroR QA holds up regardless of reader choice
- Manually written chain-of-thought annotations are given in Listings 1-4
- Retrieval recall is higher for IRCoT than OneR for both models and all datasets
- IRCoT with 3B model outperforms OneR with 58X larger GPT3 model