Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Automated reasoning with unstructured natural text has made progress with the help of large language models.
- Forward direction reasoning suffers from a combinatorial explosion of the search space.
- Backward direction reasoning is more efficient for proof-finding problems.
- LAMBADA is a backward chaining algorithm that decomposes reasoning into four sub-modules.
- LAMBADA achieves accuracy boosts over forward reasoning methods on two challenging logical reasoning datasets.
Paper Content
Introduction
- Automated reasoning has been a goal for AI since its early days
- Recent advancements in natural language understanding have been made with pre-trained language models
- Performance of these models for logical reasoning still lags behind
- Logical reasoning with unstructured, natural text is a fundamental building block for automated knowledge discovery
- Scale has been observed to provide limited benefit for solving complex reasoning problems
- Finetuning initially seemed to enable logical reasoning in LMs, but mostly exploits spurious correlations
- Prompting strategies have found some success, but struggle with proof planning for more complex logical reasoning problems
- Integrating the strength and reliability of classical AI models in logical reasoning with LMs is one solution
- Two major approaches to logical reasoning are Forward Chaining and Backward Chaining
- Previous approaches to reasoning with LMs mostly incorporate elements of Forward Chaining
- Backward Chaining is better suited for text-based deductive logical reasoning
Related work
- Deep learning models can be used to solve text-based reasoning tasks
- Pretraining an LM on relevant data can lead to improvements
- Implicit reasoning approaches finetune LMs to produce desired output
- Explicit reasoning approaches generate intermediate reasoning steps
- Verifiers judge the quality of reasoning chains
- Length generalization is possible with certain models
- Modular reasoning breaks problem into smaller modules
- Natural Language Inference (NLI) is related to logical reasoning
- Example of theory with facts and rules given
Backward chaining
- Backward chaining (BC) is a strategy for reasoning that starts from the goal and recursively breaks it down into sub-goals.
- BC uses an operation called unification in logic to determine if a rule applies to a goal.
- BC has four LM-based modules: Fact Check, Rule Selection, Goal Decomposition, and Sign Agreement.
- Fact Check verifies if a goal can be proved or disproved from any of the facts.
- Rule Selection identifies rules whose consequent unifies with the goal.
- Goal Decomposition identifies sub-goals that need to be proved in order for the goal to be proved or disproved.
- Sign Agreement verifies if the sign of the rule consequent agrees or disagrees with the sign of the goal.
- Algorithm 1 integrates the four LM modules with BC to enable text-based logical reasoning.
- LAMBADA is a depth-first search algorithm over the facts and the rules.
- Rules are ranked based on their lengths with shorter rules being ranked higher.
Experimental setup
- Compared two main baselines for logical reasoning: Chain of Thought (CoT) and Selection-Inference (SI)
- Experimented with two challenging logical reasoning datasets: ProofWriter and PrOntoQA
- ProofWriter has two subsets: open-world assumption (OWA) and closed-world assumption (CWA)
- ProofWriter has three labels: PROVED, DISPROVED, UNKNOWN
- PrOntoQA has multiple variants; used fictional characters version
Experimental results
- LAMBADA outperforms two baselines on two ProofWriter datasets
- Backward chaining (LAMBADA) better than forward chaining (SI)
- CoT has short-coming in dealing with UNKNOWN labels
- SI over-predicts DISPROVED and UNKNOWN for higher depths
Proof accuracy
- CoT predicted the result correctly for 50 examples from depth-5 of the dataset
- Manually verified the proofs generated by LAMBADA
- CoT only generated the correct chain for 28% of the examples
- Main source of error was hallucinating rules or facts
- Spurious correlations in ProofWriter-PD can be exploited by CoT
- Rule Selection and Goal Decomposition modules are responsible for majority of failure cases
The role of scale
- Experiment from Section 5.2 repeated with PaLM 62B and 8B to see role of LM scale on LAMBADA
- Performance of Goal Decomposition and Sign Agreement modules remain comparable with PaLM 62B
- Performance of Fact Check and Rule Selection modules drop significantly with PaLM 62B
- Smaller LMs require further decomposition of one-to-many comparisons in selection module
Number of inference calls
- LAMBADA and SI require multiple LM inference calls per example.
- LAMBADA requires significantly fewer inference calls than SI, especially at higher depths.
Qualitative analysis
- LAMBADA calls Fact Check module which cannot prove or disprove goal
Conclusion
- Developed LAMBADA, an algorithm for text-based deductive logical reasoning
- Combines capacity of LMs to handle naturalistic text input with backward chaining algorithm for high-level reasoning
- Significantly improved prediction and proof accuracy compared to existing approaches
- Efficiently searches entire proof space to accurately conclude statement can neither be proved nor disproved
- Insight on efficacy of goal-directed reasoning with LMs widely applicable and can be adapted to other NLP tasks
- Algorithm may be called with same goal multiple times
- Lexical sensitivity of LAMBADA tested with modified test set
- Performance stays in same ballpark, showing robustness of LAMBADA
- Overall model accuracies reported in main text
- Confusion matrices reported to better understand biases of model
- CoT and SI show similar behaviour as LAMBADA on ProofWriter-PUD
- SI shows large tendency toward predicting DISPROVED for PrOntoQA
- For CoT on PrOntoQA, wording slightly changed to make chains have better flow
- Few-shot examples used/written for each dataset depth
- For selecting a fact in Fact Check, prompt looks like “is equivalent to/negation of/neither equivalent nor negation of”
- Prompts used for each of four components of model for ProofWriter dataset