Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs can have incomplete, out-of-date, or incorrect knowledge
- External knowledge can be used to assist LLMs
- Current methods for incorporating external knowledge require additional training or fine-tuning
- Rethinking with Retrieval (RR) is a novel post-processing approach that retrieves relevant external knowledge
- RR does not require additional training or fine-tuning and is not limited by the input length of LLMs
- RR is evaluated on three complex reasoning tasks and improves the performance of LLMs
Paper Content
Introduction
- LLMs have shown good performance without task-specific training or fine-tuning
- Chain-of-thought prompting can generate explanations and predictions
- Rethinking with retrieval uses decomposed reasoning steps to retrieve external knowledge
- RR provides more faithful explanations and more accurate predictions
- RR is evaluated on 3 complex reasoning tasks using GPT-3 175B and external knowledge sources
Related work
- Retrieval-enhanced LMs have been used to improve performance
- Search query generators have been used to retrieve relevant documents
- Documents have been used as additional context in generation tasks
- Human feedback has been used in a text-based web-browsing environment
- External knowledge sources have been incorporated to enhance LMs for tabular reasoning tasks
- Explicit rules have been added to inputs to improve reasoning ability
- Symbolic reasoning module has been introduced to improve coherence and consistency
- Careful prompting has been used to encourage LLMs to generate explanations
- Sampling diverse reasoning paths has been proposed to improve LLMs
- LLMs have been shown to generate incorrect supporting facts
- External knowledge has been proposed to enhance LLMs without additional training or fine-tuning
- Knowledge retrieval has been used to estimate the faithfulness of each reasoning path
- Inference procedure has been proposed to identify the most faithful prediction
Baselines
- Zero-shot/few-shot prompting: providing few in-context exemplars of input-output pairs in the prompt
- Chain-of-thought prompting: feeding LLMs step-by-step reasoning examples instead of standard input-output examples
- Self-consistency: sampling a diverse set of reasoning paths and selecting the most consistent answer by marginalizing the sampled paths
Commonsense reasoning
- Utilize Wikipedia as external knowledge base
- Use BM25 to retrieve top 10 most relevant paragraphs
- Use MPNet model to select most similar paragraph
- Use NLI model to obtain entailment and contradiction scores
- Calculate faithfulness of each reasoning path using f KB (โข)
Temporal reasoning
- Experiment uses TempQuestions dataset to investigate temporal reasoning
- Dataset includes 1,271 questions divided into four classes
- Focus on implicit temporal questions which contain free-text temporal expressions
- Utilize Wikidata as external knowledge base
- Extract temporal relations from Wikidata pages and convert to sentences
Tabular reasoning
- IN-FOTABS dataset consists of 23,738 hypotheses based on 2,540 tables
- Development set includes 1,800 hypotheses based on 200 tables
- Entailed and contradictory hypotheses are considered
- External knowledge bases WordNet and ConceptNet are used
- Tables are converted into textual premises
- Word relation triples are retrieved and converted into sentences
- Final prediction is obtained using procedure described in Section 4.2
Evaluation
- Utilize GPT-3 text-davinci-002
- Maximum number of tokens for generation set to 256
- Temperature fixed at 0 for zero-shot, few-shot, and chain-of-thought prompting
- Temperature set to 0.7 for selfconsistency and rethinking with retrieval
- Evaluate performance using accuracy and exact match metric
- Rethinking with retrieval outperforms all baselines on all three reasoning tasks
Analysis
- Analyze RR in depth
- Gain a better understanding of RR
Limitations of llms in reasoning
- GPT-3 can provide reasonable explanations and correct predictions for some questions
- GPT-3’s output consists of 3 components: supporting facts, chaining arguments, and a prediction
- GPT-3 may occasionally produce incorrect supporting facts or make incorrect inferences
Ablation study
- Decomposition-based retrieval is important for the proposed method.
- Different types of knowledge have a limited impact on the performance of the proposed method.
Variations of the proposed approach
- Basic approach involves weighting outputs and using a voting mechanism to select the best-supported prediction
- Variant I involves selecting facts from the outputs of LLMs based on external knowledge
- Variant II involves generating facts based on both the outputs of LLMs and external knowledge
- Inference with supporting facts can be done with LLMs or an off-the-shelf model
- Experimental settings involve using evidence paragraphs from StrategyQA and a pre-trained NLI model
- Results show that the fact selection and fact generation variants improve the faithfulness of the supporting facts in explanations, leading to increased prediction accuracy
- Incorporation of a voting mechanism leads to an increased prediction accuracy of 79.91%
- UnifiedQA performs poorly on StrategyQA but demonstrates excellent performance with an accuracy of 90.83% when provided with gold supporting facts
Impact of the size of lms
- Performance of proposed method (Variant II) was compared using various sizes of OPT models and GPT-3 (175B).
- Proposed method consistently outperformed CoT prompting in terms of prediction accuracy and faithfulness of explanations, even with smaller LMs.
Conclusion
- Proposed approach is a promising solution for utilizing external knowledge to assist LLMs
- Does not require additional training or fine-tuning
- Experiments on three reasoning tasks using GPT-3 show RR produces more faithful explanations and improves performance of LLMs
- Future plans to investigate variations of RR to enhance effectiveness and efficiency