Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recent work has shown that large language models can generate natural language reasoning steps to answer multi-step questions.
  • When the necessary knowledge is not available or up-to-date, an external knowledge source can be used to retrieve text and prepend it as context to the model’s input.
  • A new approach, IRCoT, interleaves retrieval with chain-of-thought (CoT) reasoning for multi-step QA, guiding the retrieval with CoT and using retrieved results to improve CoT.
  • Experiments with GPT3 show substantial improvements in retrieval and downstream QA on four datasets.
  • The method also works well for much smaller models without any additional training.

Paper Content

Introduction

  • Large language models can answer complex questions by generating step-by-step reasoning in natural language
  • This approach has been used when all the information needed to answer the question is provided or assumed to be present
  • For open-domain questions, knowledge is not always available or up-to-date in models’ parameters
  • It is beneficial to retrieve knowledge from external sources
  • How can we augment chain-of-thought prompting for open-domain, knowledge-intensive tasks?
  • Retrieval from a knowledge source based on the question can successfully augment LMs
  • This strategy has limitations for more complex multi-step reasoning questions
  • Retrieval and reasoning steps must inform each other
  • The paper proposes an interleaving approach to this problem
  • It uses retrieval to guide the chain-of-thought reasoning steps and uses CoT reasoning to guide the retrieval
  • The system is evaluated on 4 multi-step reasoning datasets
  • Results show improved retrieval and few-shot QA performance
  • LLMs can learn tasks with few examples as prompts
  • LLMs can answer complex questions with step-by-step reasoning
  • Smaller models can be used with additional fine-tuning
  • Multi-step open-domain QA has been relatively underexplored
  • WebGPT, a fine-tuned GPT-3, can answer long-form questions by interacting with a browser
  • Few-shot QA prompting can be augmented with context from Google Search results
  • SelfAsk prompts LLMs to decompose a question into subquestions
  • DecomP decomposes a task and delegates sub-tasks to sub-models
  • ReAct frames the problem as generating a sequence of reasoning and action steps
  • Supervised multi-step open-domain QA has been explored
  • Retrieve-and-read paradigm used to answer knowledge-intensive questions

Interleaving retrieval with chain-of-thought reasoning

  • IRCoT is the retrieval method proposed in the paper
  • It is instantiated from three ingredients: a base retriever, a language model, and annotated questions
  • The method involves iteratively interleaving two steps (reason and retrieve) until a termination criterion is met
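
The reason/retrieve loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the word-overlap retriever, the scripted "language model", the toy corpus, and the specific termination rule (a stop phrase or a step limit) are all assumptions made so the example runs on its own.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical interfaces; a real system would wrap a base retriever and an LM.
Retriever = Callable[[str, int], List[str]]            # (query, k) -> paragraphs
Reasoner = Callable[[str, List[str], List[str]], str]  # (question, paragraphs, cot) -> next sentence


def ircot_retrieve(
    question: str,
    retrieve: Retriever,
    next_cot_sentence: Reasoner,
    k: int = 2,
    max_steps: int = 4,
    stop_phrase: str = "answer is",  # assumed termination marker
) -> Tuple[List[str], List[str]]:
    """Interleave reasoning and retrieval; return (paragraphs, CoT sentences)."""
    paragraphs = list(retrieve(question, k))  # initial retrieval uses the question
    cot: List[str] = []
    for _ in range(max_steps):
        # Reason: produce the next CoT sentence from everything gathered so far.
        sentence = next_cot_sentence(question, paragraphs, cot)
        cot.append(sentence)
        if stop_phrase in sentence.lower():
            break
        # Retrieve: use the newest CoT sentence as the next query.
        for paragraph in retrieve(sentence, k):
            if paragraph not in paragraphs:
                paragraphs.append(paragraph)
    return paragraphs, cot


# Toy stand-ins for the retriever and the LM (illustrative only).
CORPUS = [
    "Lost Gravity is a roller coaster manufactured by Mack Rides.",
    "Mack Rides is a company based in Germany.",
    "Paris hosts many museums.",
]


def _words(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_retrieve(query: str, k: int) -> List[str]:
    """Rank corpus paragraphs by word overlap with the query."""
    q = _words(query)
    scored = [(len(q & _words(p)), p) for p in CORPUS]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [p for _, p in ranked[:k]]


SCRIPT = [
    "Lost Gravity was manufactured by Mack Rides.",
    "Mack Rides is based in Germany, so the answer is: Germany.",
]


def scripted_reasoner(question: str, paragraphs: List[str], cot: List[str]) -> str:
    """Replays fixed sentences in place of a real language model."""
    return SCRIPT[min(len(cot), len(SCRIPT) - 1)]


if __name__ == "__main__":
    question = "What country does the maker of Lost Gravity come from?"
    found, chain = ircot_retrieve(question, toy_retrieve, scripted_reasoner)
    print(found)
    print(chain)
```

In this toy run the question alone only matches the first paragraph; the paragraph about Mack Rides is reached only after the first reasoning sentence names the manufacturer, which is the situation where one-step retrieval falls short.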

Question answering reader

  • QA reader answers questions using retrieved paragraphs
  • Two prompting strategies used: CoT Prompting and Direct Prompting
  • CoT Prompting requires model to generate full CoT from scratch
  • Direct Prompting requires answer field to contain only final answer
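
The difference between the two reader strategies can be illustrated with a small prompt builder. This is a sketch under assumptions: the `Demo` class, the `Q:`/`A:` layout, and the "So the answer is:" marker are illustrative choices, not a format confirmed by this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """One few-shot demonstration (hypothetical structure)."""
    paragraphs: List[str]
    question: str
    cot: str      # annotated reasoning chain
    answer: str   # final answer only


def format_example(paragraphs: List[str], question: str, answer_field: str) -> str:
    context = "\n\n".join(paragraphs)
    return f"{context}\n\nQ: {question}\nA: {answer_field}".rstrip()


def build_reader_prompt(
    demos: List[Demo], paragraphs: List[str], question: str, style: str
) -> str:
    """Build a few-shot prompt; style is 'cot' or 'direct'."""
    if style not in ("cot", "direct"):
        raise ValueError("style must be 'cot' or 'direct'")
    blocks = []
    for d in demos:
        if style == "cot":
            # CoT prompting: the answer field shows the full reasoning chain.
            field = f"{d.cot} So the answer is: {d.answer}."
        else:
            # Direct prompting: the answer field holds only the final answer.
            field = d.answer
        blocks.append(format_example(d.paragraphs, d.question, field))
    # The test question leaves the answer field open for the model to fill.
    blocks.append(format_example(paragraphs, question, ""))
    return "\n\n\n".join(blocks)


def extract_answer(generation: str, style: str) -> str:
    """Pull the final answer out of the model's generation."""
    text = generation.strip()
    if style == "direct":
        return text
    marker = "answer is:"
    idx = text.lower().rfind(marker)
    if idx == -1:
        return text  # no marker found; fall back to the raw text
    return text[idx + len(marker):].strip().rstrip(".")
```

With CoT prompting the answer has to be parsed out of the generated chain, whereas with direct prompting the generation is already the answer.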

Experimental setup

Datasets

  • Evaluated method on 4 multi-step QA datasets in open-domain setting
  • HotpotQA already comes with associated Wikipedia corpus
  • 2WikiMultihopQA and MuSiQue are originally reading comprehension datasets, turned into an open-domain setting by combining all supporting and non-supporting paragraphs into a corpus
  • IIRC is a mix of the reading comprehension and open-domain settings
  • Sampled 100 questions from the original development set for tuning hyperparameters, and 500 questions from the remaining development set to use as the test set
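
The corpus construction and the 100/500 split described above could look roughly like this. It is a sketch only: the field names `supporting` and `non_supporting`, the seeded shuffle, and the function names are assumptions, and the summary does not say how the questions were actually sampled.

```python
import random
from typing import Dict, List, Sequence, Tuple


def build_corpus(questions: Sequence[Dict[str, List[str]]]) -> List[str]:
    """Pool supporting and non-supporting paragraphs into one deduplicated corpus."""
    seen: Dict[str, None] = {}  # dict keeps first-seen order
    for q in questions:
        for paragraph in q["supporting"] + q["non_supporting"]:
            seen.setdefault(paragraph, None)
    return list(seen)


def split_dev(
    question_ids: Sequence, n_tune: int = 100, n_test: int = 500, seed: int = 0
) -> Tuple[list, list]:
    """Sample disjoint tuning and test subsets from the development set."""
    if len(question_ids) < n_tune + n_test:
        raise ValueError("development set is too small for the requested split")
    shuffled = list(question_ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    tune = shuffled[:n_tune]
    test = shuffled[n_tune:n_tune + n_test]  # drawn from the remaining questions
    return tune, test
```

Drawing the test questions only from what remains after the tuning sample keeps the two subsets disjoint, so hyperparameters are not tuned on test questions.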

Models

  • Retriever systems compared: One-step Retriever (OneR) and IRCoT Retriever
  • Language models used: OpenAI GPT3 and Flan-T5
  • Retrieval metric: Recall of gold paragraphs among retrieved set
  • QA reader: Direct Prompting strategy for Flan-T5 and CoT Prompting strategy for GPT3
  • ODQA models: OneR-[LM-name] and IRCoT-[LM-name]
  • IIRC: Main passage always kept as part of input, generated titles mapped to nearest Wikipedia page titles
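
Two of the items above lend themselves to a short sketch: the recall metric and the title mapping used for IIRC. Both functions are illustrative assumptions. In particular, the summary does not say how "nearest" is measured, so character-level similarity from Python's standard `difflib` is used here purely as a stand-in.

```python
import difflib
from typing import Iterable, List, Sequence


def paragraph_recall(retrieved: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of gold paragraphs that appear in the retrieved set."""
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("at least one gold paragraph is required")
    return len(gold_set & set(retrieved)) / len(gold_set)


def mean_recall(
    retrieved_per_q: Sequence[Iterable[str]], gold_per_q: Sequence[Iterable[str]]
) -> float:
    """Average per-question recall over a dataset."""
    scores = [paragraph_recall(r, g) for r, g in zip(retrieved_per_q, gold_per_q)]
    return sum(scores) / len(scores)


def nearest_title(generated: str, titles: List[str]) -> str:
    """Map a generated title to the most similar known page title.

    String similarity is an assumption; another distance could be used.
    """
    # cutoff=0.0 means the best candidate is returned even if it is a weak match.
    matches = difflib.get_close_matches(generated, titles, n=1, cutoff=0.0)
    if not matches:
        raise ValueError("no candidate titles were provided")
    return matches[0]
```

For example, if two gold paragraphs exist and only one of them is among the retrieved paragraphs, `paragraph_recall` returns 0.5.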

Results

  • IRCoT outperforms one-step retrieval
  • IRCoT QA outperforms both the no-retrieval (NoR) and one-step retrieval (OneR) QA baselines
  • IRCoT is effective for smaller models too
  • IRCoT improves recall metric on all datasets
  • IRCoT QA outperforms OneR QA on all datasets

Ablations and other findings

  • IRCoT produces a CoT as part of its retrieval process
  • Ablating the reader hurts performance
  • Separate reader has chance to consider all evidence together
  • IRCoT QA is compared with 4 recent approaches that use large language models for open-domain QA
  • IRCoT QA outperforms all other systems by a large margin

Conclusions

  • Chain-of-thought prompting has improved prompting-based large language models’ ability to perform multi-step reasoning
  • This work leveraged this ability to improve retrieval and QA performance for complex knowledge-intensive open-domain tasks in a few-shot setting
  • One-step question-based retrieval is insufficient for such tasks
  • IRCoT interleaves chain-of-thought generation and retrieval steps to guide each other step-by-step
  • IRCoT significantly improves retrieval performance and few-shot open-domain QA performance compared to one-step retrieval
  • Direct prompting is a better choice for the reader for Flan-T5-XXL, and CoT prompting is a better choice for the reader for GPT3
  • The ordering IRCoT QA > OneR QA > ZeroR QA (ZeroR being the no-retrieval baseline, called NoR above) holds up regardless of reader choice
  • Manually written chain-of-thought annotations are given in Listings 1-4
  • Retrieval recall is higher for IRCoT than OneR for both models and all datasets
  • IRCoT with 3B model outperforms OneR with 58X larger GPT3 model