Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Mathematical reasoning is important in many fields
  • AI systems can solve math problems and prove theorems
  • Mathematics is a testbed for challenging aspects of reasoning
  • Advances in neural language models have opened up new opportunities for deep learning
  • This paper reviews tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning

Paper Content

Tasks and datasets

  • Examining tasks and datasets for mathematical reasoning using deep learning methods
  • Summary of commonly used datasets in this field found in Table 2

Math word problem solving

  • Math word problems (MWPs) have been studied by NLP researchers for decades.
  • MWPs involve characters, entities, and quantities, and can be modeled with equations.
  • Existing MWP datasets are from online learning websites, textbooks, or manually annotated.
  • Some datasets involve modalities beyond text, such as annotated equations, operation programs, and Python programs.

Theorem proving

  • Automating theorem proving is a challenge in AI
  • Theorem proving tests various skills such as choosing strategies, using background knowledge, and performing symbolic manipulations
  • Recently, there has been increased interest in using language models for theorem proving in formal interactive theorem provers
  • Data sources for neural theorem proving in ITPs include interactive learning environments and datasets derived from proofs in ITP libraries
  • Early applications of deep learning for formal theorem proving focus on selecting relevant premises
  • Informal theorem proving presents an alternative medium for theorem proving, written in natural language and symbols

Geometry problem solving

  • GPS is a long-standing AI task in mathematical reasoning research
  • GPS involves the ability to parse multimodal information, perform symbolic abstraction, utilize theorem knowledge, and conduct quantitative reasoning
  • Early datasets are relatively small or not publicly available
  • Geometry3K dataset consists of 3,002 multi-choice geometry problems
  • GeoQA, GeoQA+, and UniGeo are larger-scale datasets annotated with programs that can be learned by neural solvers

Math question answering

  • Numerical reasoning is a core ability in human intelligence
  • Many NLP tasks involve mathematical reasoning
  • Recent datasets have been presented for math QA
  • State-of-the-art models may suffer from brittleness in reasoning
  • New benchmarks have been proposed from various aspects
  • Some work incorporates tabular contexts in the question inputs
  • Others present large-scale unified benchmarks for mathematical reasoning

Other quantitative problems

  • Numbers are used in everyday tasks
  • AI systems can be evaluated for quantitative reasoning
  • Diagrams are used to convey large amounts of information
  • Various benchmarks have been developed to evaluate AI systems
  • Specific domains like finance, science, and programming involve quantitative reasoning

Graph-based networks for math

  • Seq2Seq approaches generate mathematical expressions without relying on hand-crafted features
  • Mathematical expressions can be transformed into tree-based and graph-based structures
  • Seq2Seq methods do not explicitly model this structure
  • Graph-based neural networks are developed to explicitly model the structure in expressions
  • Examples of this include Seq2Tree and Seq2DAG models
  • Graph-based information can also be embedded when encoding the input mathematical sequences

Attention-based networks for math

  • Attention mechanism has been used in natural language processing and computer vision
  • Attention mechanism has been used in mathematical reasoning tasks
  • Group-ATT and graph attention have been studied to extract better representations

Other neural networks for math

  • Deep learning approaches to mathematical reasoning tasks can use convolutional neural networks (CNN) and multimodal networks
  • Encoding input text with CNN can capture long-term relationships between symbols
  • Multimodal mathematical reasoning tasks are formalized as visual question answer (VQA) problems
  • Visual inputs are encoded using ResNet or Faster-RCNN, textual representations are obtained via GRU or LTSM
  • Graph Neural Network (GNN) used for geometry problem parsing
  • WaveNet used for theorem proving
  • Transformers outperform GRU in generating mathematical equations
  • MathDQN uses reinforcement learning for math word problem solving
  • Pre-trained language models used for math-related problems

Self-supervised learning for math

  • Self-supervised learning is a machine learning approach that does not require labeled training data
  • An example of self-supervised learning is next-token prediction
  • Table 4 provides a list of language models pre-trained with self-supervised tasks for mathematical reasoning
  • Models have become increasingly larger in the past few years
  • A study found that models that win head-to-head model comparisons for accuracy are at least 50B parameters
  • There are two types of pre-training corpus for mathematical language models: curated datasets and synthetic datasets
  • Pre-training language models have two typical self-supervised learning tasks: Masked Language Modeling and Causal Language Modeling
  • Researchers have designed customized tasks to inject mathematical reasoning capabilities into language models
  • There are studies about probing whether pre-trained language models have captured numerical commonsense knowledge

Task-specific fine-tuning for math

  • Task-specific fine-tuning is a technique to improve the performance of a pre-trained language model.
  • It is used when there is not enough data for training large models from scratch.
  • It is used for a variety of tasks, such as Math Word Problems, MathQA over financial tabular data, Geometry, Linear Algebra, and informal theorem proving.
  • It is also used to combine pre-trained language models with other modules for downstream tasks.

In-context learning for mathematical reasoning

  • GPT-3 has revolutionized NLP
  • In-context Learning (ICL) allows users to quickly build models for new use cases
  • Standard few-shot promptings have not yet proved sufficient to achieve high performance on challenging tasks
  • Chain-of-thought prompting (CoT) leverages intermediate natural language rationales as prompts
  • Recent work has focused on how to improve chain-of-thought reasoning under the few-shot setting
  • This work is divided into two parts: selecting better in-context examples and creating better reasoning chains

In-context example selection

  • Early chain-of-thought work randomly or heuristically selects in-context examples
  • Recent studies have shown this type of few-shot learning can be unstable
  • Unknown problem in literature of which in-context examples make most effective prompts
  • Various methods to optimize in-context examples selection process
  • Rubin et al. (2022) attempt to address issue by retrieving semantically similar examples
  • Zhang et al. (2022b) propose diversifying demonstration questions to improve model performance

High-quality reasoning chains

  • Early chain of thought work relies on a single human-annotated reasoning chain.
  • Manually creating reasoning chains has two disadvantages.
  • Recent studies focus on two aspects: process-based and outcome-based approaches.
  • Process-based approaches aim to improve chain-of-thought reasoning quality.

Discussion

Analysis of benchmarks

  • Existing benchmarks for mathematical reasoning have targeted the textual-only modality
  • Visual elements can provide a rich source of quantitative information
  • Mathematical reasoning in low-resource settings is under-explored
  • Datasets have been annotated with intermediate rationales such as logic forms, programs, and reasoning graphs

Analysis of deep learning methods

  • Current representation of numeracy is not sufficient
  • Deep learning techniques treat numbers the same as words
  • Numbers are often collapsed into an “UNK” token
  • Tokenization approaches are suboptimal
  • Lack of consistent representation makes it difficult for deep learning models to process numbers
  • Insufficient representations of numbers can lead to out-of-distribution problems
  • Deep learning methods for mathematical reasoning are not robust and susceptible to adversarial attacks

Generalization and robustness

  • Neural models have difficulty generalizing to larger numbers and remaining robust to nearby problems.
  • Strategies such as inference-time and fine-tuning can be explored to improve generalization.
  • Memorization may play a role in complex solutions, but more analysis is needed.

Trustworthy reasoning

  • Language models can generate ungrounded answers that users must verify.
  • Recent prompting strategies provide rationales before making predictions, but language models still often hallucinate statements and produce wrong answers.
  • Methods that enable more trustworthy reasoning are needed, such as using language models to provide evidence, incorporating a mechanism to make a judgment when the model is unsure, and using a model to detect and locate mistakes.

Learning from feedback

  • Language models can be improved by learning from feedback
  • Reinforcement learning from human feedback (RLHF) can be used to align language models with instructions
  • Online learning and incorporating humans in the loop are related to this research direction
  • Feedback can come from humans, theorem-proof engines, or execution results of model-generated scripts

Multi-modal mathematical reasoning

  • Growing interest in multi-modal mathematical reasoning
  • Challenges and opportunities for further research
  • Currently available datasets are small, generated from templates, or focus on specific topics
  • VQA-based frameworks used to analyze figures and plots, but can result in semantic gaps
  • Converting tables and natural images into text descriptions can lead to important information being lost
  • Future work involves creating unified models and better evaluation benchmarks

Conclusion

  • Comprehensive survey of deep learning for mathematical reasoning
  • Review of tasks, datasets, and approaches
  • Gaps in existing datasets and methods
  • Outline of directions for future research
  • Reading list and GitHub repository created