Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Neural reasoning accuracy improves when generating intermediate steps
  • Source of improvement is unclear
  • Investigated benefit of generating intermediate steps for symbolic reasoning
  • Decomposed reasoning strategy in terms of step granularity and chaining strategy
  • Found that choice of reasoning strategies affects performance
  • Certain configurations lead to nearly perfect performance
  • Results indicate importance of exploring effective strategies for neural reasoning models

Paper Content

Introduction

  • Artificial intelligence researchers have been attempting neural-symbolic integration for a long time.
  • Neural models perform better when generating intermediate reasoning steps in addition to the answer.
  • This phenomenon was seen across various reasoning tasks.
  • The authors decomposed the neural reasoning process into two factors: output strategy and chaining strategy.
  • Iterative generation outperformed all-at-once output, and coarse-grained reasoning steps lagged behind fine-grained steps.

Experimental settings

  • Evaluated models’ ability to perform arithmetic operations over given symbols
  • Task is to answer value of target variable
  • Reasoning depth is number of equations needed to reach answer
  • Equations define assignments and modular additions
  • Contexts contain distractors not necessary to calculate answer
  • Artificial data allows easier control of reasoning depth for generalization tests
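
The task described above can be illustrated with a minimal sketch. The context format (`A=1, B=2, ...`), the modulus of 100, and the helper names `parse` and `solve` are assumptions made for illustration, not details confirmed by this summary. The sketch computes the target value and counts the equations needed to reach it, which is how reasoning depth is defined above.

```python
MOD = 100  # assumed modulus; the summary only says "modular additions"

def parse(context):
    """Map each variable to the list of operands on its right-hand side."""
    equations = {}
    for eq in context.split(","):
        lhs, rhs = eq.strip().split("=")
        equations[lhs] = rhs.split("+")
    return equations

def solve(equations, target, used=None):
    """Return (value, set of variables whose equations were needed)."""
    if used is None:
        used = set()
    used.add(target)
    total = 0
    for operand in equations[target]:
        if operand.isdigit():
            total += int(operand)
        else:
            value, _ = solve(equations, operand, used)
            total += value
    return total % MOD, used

# E and F are distractors: they are never needed to compute D.
context = "A=1, B=2, C=A+B, D=B+C, E=7, F=E+3"
value, used = solve(parse(context), "D")
print(value, len(used))  # 5 4
```

Here the answer for `D` is 5 and four equations (`A`, `B`, `C`, `D`) are needed, so the distractor equations never enter the count.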

Output strategies

  • Generating intermediate reasoning steps improves performance
  • Step-by-step works best, all-at-once works worst
  • Neural models have low symbolic reasoning ability
  • The all-at-once strategy overfits to producing reasoning chains of similar length to those in the training data
  • Step-by-step has advantage over token-by-token
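
The difference between the output strategies can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: `toy_step_model` is a hypothetical scripted stand-in for a trained seq2seq model, and the step format and the `answer:` stop marker are invented for the example. The iterative loop feeds each generated unit back into the next input. Token-by-token decoding would use the same loop with a model that returns a single token per call.

```python
def all_at_once(model, context):
    """One call: the model must emit the whole chain and the answer together."""
    return model(context)

def step_by_step(model, context, max_calls=20):
    """One reasoning step per call; each step is appended to the next input."""
    text = context
    for _ in range(max_calls):
        step = model(text)
        text = f"{text} {step}"
        if step.startswith("answer"):  # assumed stop marker
            return step
    return None  # gave up: no final answer within max_calls

def toy_step_model(text):
    """Hypothetical stand-in for a trained model: emits the next scripted step."""
    script = ["C=A+B=3", "D=B+C=5", "answer: 5"]
    done = sum(step in text for step in script)
    return script[done]

context = "A=1, B=2, C=A+B, D=B+C, D?"
print(step_by_step(toy_step_model, context))  # answer: 5
```

Because every generated step is appended to the input, the input grows with the number of steps, so the loop is bounded in practice by the model's maximum input length.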

Chaining strategies

  • Results of fixed step-by-step output strategy shown in Figure 4b and Table 1
  • Accuracy measured based on mathematical correctness, not exact match
  • Performance dropped in shortest-path setting as depth increased
  • Models successfully solved task when extrapolating to depths 6-12
  • Models correctly generated intermediate steps and final answer
  • Chaining strategies with no reasoning steps had better generalization performance
  • Appropriate output strategy improves reasoning ability of model
  • Accuracy higher when granularity of intermediate steps is finer

Results

  • Used pre-trained T5-base, T5-large, and BART-base
  • Results of BART-base in Appendix C
  • Numbers in the dataset are split into individual digits
  • Pre-trained on a simple dataset of 10K instances for 30 epochs
  • Trained on a training set of 5K instances for 2000 epochs
  • Experiment setting details in Appendix A
  • 200 test instances for each reasoning depth
  • Pretraining dataset contains two types of single-depth instances
  • Results in paper are averages of results on three different seeds
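
The digit-splitting preprocessing mentioned above can be sketched in a few lines. The space separator and the function name `split_digits` are assumptions. The summary states only that numbers are divided into digits.

```python
import re

def split_digits(text):
    """Insert spaces between the digits of every multi-digit number."""
    return re.sub(r"\d+", lambda m: " ".join(m.group()), text)

print(split_digits("A=12, B=A+34"))  # A=1 2, B=A+3 4
```

Splitting this way means each digit is presented to the tokenizer separately instead of as part of a multi-digit string.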

Error analysis

  • Copying errors were the most frequent (53%)
  • Hasty assignment was the second type of error

Models’ scalability

  • T5-large and T5-base were compared to investigate scalability.
  • T5-large had lower accuracy than T5-base on all-at-once and step-by-step.
  • T5-large had higher accuracy than T5-base on token-by-token.

Conclusions

  • Investigated and factorized reasoning strategy in symbolic numerical reasoning with neural seq2seq models
  • Combination of step-by-step output and finely granular reasoning leads to successful symbolic reasoning
  • Simple symbolic reasoning requires appropriate selection of reasoning strategy
  • Unclear if findings generalize to more complex symbolic reasoning and/or problems written in natural language
  • Iterative strategies are limited by the model's maximum input length
  • Examined learning rates of 10^-3, 10^-4, and 10^-5
  • Used four NVIDIA V100 GPUs
  • Performance drops in the shortest-path setting as reasoning depth increases
  • Exhaustive or backward chaining successfully solves the task even when extrapolating to depths 6-12
  • T5 outperforms BART