Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Automated reasoning with unstructured natural text has made progress with the help of large language models.
  • Forward direction reasoning suffers from a combinatorial explosion of the search space.
  • Backward direction reasoning is more efficient for proof-finding problems.
  • LAMBADA is a backward chaining algorithm that decomposes reasoning into four sub-modules.
  • LAMBADA achieves accuracy boosts over forward reasoning methods on two challenging logical reasoning datasets.

Paper Content

Introduction

  • Automated reasoning has been a goal for AI since its early days
  • Recent advancements in natural language understanding have been made with pre-trained language models
  • Performance of these models for logical reasoning still lags behind
  • Logical reasoning with unstructured, natural text is a fundamental building block for automated knowledge discovery
  • Scale has been observed to provide limited benefit for solving complex reasoning problems
  • Finetuning initially seemed to enable logical reasoning in LMs, but mostly exploits spurious correlations
  • Prompting strategies have found some success, but struggle with proof planning for more complex logical reasoning problems
  • Integrating the strength and reliability of classical AI models in logical reasoning with LMs is one solution
  • Two major approaches to logical reasoning are Forward Chaining and Backward Chaining
  • Previous approaches to reasoning with LMs mostly incorporate elements of Forward Chaining
  • Backward Chaining is better suited for text-based deductive logical reasoning
  • Deep learning models can be used to solve text-based reasoning tasks
  • Pretraining an LM on relevant data can lead to improvements
  • Implicit reasoning approaches finetune LMs to produce desired output
  • Explicit reasoning approaches generate intermediate reasoning steps
  • Verifiers judge the quality of reasoning chains
  • Length generalization is possible with certain models
  • Modular reasoning breaks problem into smaller modules
  • Natural Language Inference (NLI) is related to logical reasoning
  • Example of theory with facts and rules given

Backward chaining

  • Backward chaining (BC) is a strategy for reasoning that starts from the goal and recursively breaks it down into sub-goals.
  • BC uses an operation called unification in logic to determine if a rule applies to a goal.
  • BC has four LM-based modules: Fact Check, Rule Selection, Goal Decomposition, and Sign Agreement.
  • Fact Check verifies if a goal can be proved or disproved from any of the facts.
  • Rule Selection identifies rules whose consequent unifies with the goal.
  • Goal Decomposition identifies sub-goals that need to be proved in order for the goal to be proved or disproved.
  • Sign Agreement verifies if the sign of the rule consequent agrees or disagrees with the sign of the goal.
  • Algorithm 1 integrates the four LM modules with BC to enable text-based logical reasoning.
  • LAMBADA is a depth-first search algorithm over the facts and the rules.
  • Rules are ranked based on their lengths with shorter rules being ranked higher.

Experimental setup

  • Compared two main baselines for logical reasoning: Chain of Thought (CoT) and Selection-Inference (SI)
  • Experimented with two challenging logical reasoning datasets: ProofWriter and PrOntoQA
  • ProofWriter has two subsets: open-world assumption (OWA) and closed-world assumption (CWA)
  • ProofWriter has three labels: PROVED, DISPROVED, UNKNOWN
  • PrOntoQA has multiple variants; used fictional characters version

Experimental results

  • LAMBADA outperforms two baselines on two ProofWriter datasets
  • Backward chaining (LAMBADA) better than forward chaining (SI)
  • CoT has short-coming in dealing with UNKNOWN labels
  • SI over-predicts DISPROVED and UNKNOWN for higher depths

Proof accuracy

  • CoT predicted the result correctly for 50 examples from depth-5 of the dataset
  • Manually verified the proofs generated by LAMBADA
  • CoT only generated the correct chain for 28% of the examples
  • Main source of error was hallucinating rules or facts
  • Spurious correlations in ProofWriter-PD can be exploited by CoT
  • Rule Selection and Goal Decomposition modules are responsible for majority of failure cases

The role of scale

  • Experiment from Section 5.2 repeated with PaLM 62B and 8B to see role of LM scale on LAMBADA
  • Performance of Goal Decomposition and Sign Agreement modules remain comparable with PaLM 62B
  • Performance of Fact Check and Rule Selection modules drop significantly with PaLM 62B
  • Smaller LMs require further decomposition of one-to-many comparisons in selection module

Number of inference calls

  • LAMBADA and SI require multiple LM inference calls per example.
  • LAMBADA requires significantly fewer inference calls than SI, especially at higher depths.

Qualitative analysis

  • LAMBADA calls Fact Check module which cannot prove or disprove goal

Conclusion

  • Developed LAMBADA, an algorithm for text-based deductive logical reasoning
  • Combines capacity of LMs to handle naturalistic text input with backward chaining algorithm for high-level reasoning
  • Significantly improved prediction and proof accuracy compared to existing approaches
  • Efficiently searches entire proof space to accurately conclude statement can neither be proved nor disproved
  • Insight on efficacy of goal-directed reasoning with LMs widely applicable and can be adapted to other NLP tasks
  • Algorithm may be called with same goal multiple times
  • Lexical sensitivity of LAMBADA tested with modified test set
  • Performance stays in same ballpark, showing robustness of LAMBADA
  • Overall model accuracies reported in main text
  • Confusion matrices reported to better understand biases of model
  • CoT and SI show similar behaviour as LAMBADA on ProofWriter-PUD
  • SI shows large tendency toward predicting DISPROVED for PrOntoQA
  • For CoT on PrOntoQA, wording slightly changed to make chains have better flow
  • Few-shot examples used/written for each dataset depth
  • For selecting a fact in Fact Check, prompt looks like “is equivalent to/negation of/neither equivalent nor negation of”
  • Prompts used for each of four components of model for ProofWriter dataset