Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs can have incomplete, out-of-date, or incorrect knowledge
External knowledge can be used to assist LLMs
Current methods for incorporating external knowledge require additional training or fine-tuning
Rethinking with Retrieval (RR) is a novel post-processing approach that retrieves relevant external knowledge
RR does not require additional training or fine-tuning and is not limited by the input length of LLMs
RR is evaluated on three complex reasoning tasks and improves the performance of LLMs

Paper Content

Introduction

LLMs have shown good performance without task-specific training or fine-tuning
Chain-of-thought prompting can generate explanations and predictions
Rethinking with retrieval uses decomposed reasoning steps to retrieve external knowledge
RR provides more faithful explanations and more accurate predictions
RR is evaluated on 3 complex reasoning tasks using GPT-3 175B and external knowledge sources

Retrieval-enhanced LMs have been used to improve performance
Search query generators have been used to retrieve relevant documents
Documents have been used as additional context in generation tasks
Human feedback has been used in a text-based web-browsing environment
External knowledge sources have been incorporated to enhance LMs for tabular reasoning tasks
Explicit rules have been added to inputs to improve reasoning ability
Symbolic reasoning module has been introduced to improve coherence and consistency
Careful prompting has been used to encourage LLMs to generate explanations
Sampling diverse reasoning paths has been proposed to improve LLMs
LLMs have been shown to generate incorrect supporting facts
External knowledge has been proposed to enhance LLMs without additional training or fine-tuning
Knowledge retrieval has been used to estimate the faithfulness of each reasoning path
Inference procedure has been proposed to identify the most faithful prediction

Baselines

Zero-shot/few-shot prompting: providing few in-context exemplars of input-output pairs in the prompt
Chain-of-thought prompting: feeding LLMs step-by-step reasoning examples instead of standard input-output examples
Self-consistency: sampling a diverse set of reasoning paths and selecting the most consistent answer by marginalizing the sampled paths

Commonsense reasoning

Utilize Wikipedia as external knowledge base
Use BM25 to retrieve top 10 most relevant paragraphs
Use MPNet model to select most similar paragraph
Use NLI model to obtain entailment and contradiction scores
Calculate faithfulness of each reasoning path using f KB (•)

Temporal reasoning

Experiment uses TempQuestions dataset to investigate temporal reasoning
Dataset includes 1,271 questions divided into four classes
Focus on implicit temporal questions which contain free-text temporal expressions
Utilize Wikidata as external knowledge base
Extract temporal relations from Wikidata pages and convert to sentences

Tabular reasoning

IN-FOTABS dataset consists of 23,738 hypotheses based on 2,540 tables
Development set includes 1,800 hypotheses based on 200 tables
Entailed and contradictory hypotheses are considered
External knowledge bases WordNet and ConceptNet are used
Tables are converted into textual premises
Word relation triples are retrieved and converted into sentences
Final prediction is obtained using procedure described in Section 4.2

Evaluation

Utilize GPT-3 text-davinci-002
Maximum number of tokens for generation set to 256
Temperature fixed at 0 for zero-shot, few-shot, and chain-of-thought prompting
Temperature set to 0.7 for selfconsistency and rethinking with retrieval
Evaluate performance using accuracy and exact match metric
Rethinking with retrieval outperforms all baselines on all three reasoning tasks

Analysis

Analyze RR in depth
Gain a better understanding of RR

Limitations of llms in reasoning

GPT-3 can provide reasonable explanations and correct predictions for some questions
GPT-3’s output consists of 3 components: supporting facts, chaining arguments, and a prediction
GPT-3 may occasionally produce incorrect supporting facts or make incorrect inferences

Ablation study

Decomposition-based retrieval is important for the proposed method.
Different types of knowledge have a limited impact on the performance of the proposed method.

Variations of the proposed approach

Basic approach involves weighting outputs and using a voting mechanism to select the best-supported prediction
Variant I involves selecting facts from the outputs of LLMs based on external knowledge
Variant II involves generating facts based on both the outputs of LLMs and external knowledge
Inference with supporting facts can be done with LLMs or an off-the-shelf model
Experimental settings involve using evidence paragraphs from StrategyQA and a pre-trained NLI model
Results show that the fact selection and fact generation variants improve the faithfulness of the supporting facts in explanations, leading to increased prediction accuracy
Incorporation of a voting mechanism leads to an increased prediction accuracy of 79.91%
UnifiedQA performs poorly on StrategyQA but demonstrates excellent performance with an accuracy of 90.83% when provided with gold supporting facts

Impact of the size of lms

Performance of proposed method (Variant II) was compared using various sizes of OPT models and GPT-3 (175B).
Proposed method consistently outperformed CoT prompting in terms of prediction accuracy and faithfulness of explanations, even with smaller LMs.

Conclusion

Proposed approach is a promising solution for utilizing external knowledge to assist LLMs
Does not require additional training or fine-tuning
Experiments on three reasoning tasks using GPT-3 show RR produces more faithful explanations and improves performance of LLMs
Future plans to investigate variations of RR to enhance effectiveness and efficiency

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Baselines#

Commonsense reasoning#

Temporal reasoning#

Tabular reasoning#

Evaluation#

Analysis#

Limitations of llms in reasoning#

Ablation study#

Variations of the proposed approach#

Impact of the size of lms#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Baselines

Commonsense reasoning

Temporal reasoning

Tabular reasoning

Evaluation

Analysis

Limitations of llms in reasoning

Ablation study

Variations of the proposed approach

Impact of the size of lms

Conclusion