Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Mathematical reasoning is important in many fields
- AI systems can solve math problems and prove theorems
- Mathematics is a testbed for challenging aspects of reasoning
- Advances in neural language models have opened up new opportunities for deep learning
- This paper reviews tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning
Paper Content
Tasks and datasets
- Examining tasks and datasets for mathematical reasoning using deep learning methods
- A summary of commonly used datasets in this field is given in Table 2 of the paper
Math word problem solving
- Math word problems (MWPs) have been studied by NLP researchers for decades.
- MWPs involve characters, entities, and quantities, and can be modeled with equations.
- Existing MWP datasets are from online learning websites, textbooks, or manually annotated.
- Some datasets involve modalities beyond text, such as annotated equations, operation programs, and Python programs.
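As a rough illustration of how one word problem can carry both an equation annotation and a program annotation, the sketch below uses a made-up problem. The problem text, the dictionary layout, and the function name `solve_apples` are hypothetical and are not drawn from any dataset in the survey.

```python
# Hypothetical math word problem with two kinds of annotation.
problem = {
    "text": "Tom has 3 apples. He buys 4 bags with 5 apples each. "
            "How many apples does he have now?",
    "quantities": [3, 4, 5],        # numbers mentioned in the text, in order
    "equation": "x = 3 + 4 * 5",    # equation-style annotation
}

def solve_apples(initial, bags, per_bag):
    """Program-style annotation: the same solution as executable code."""
    return initial + bags * per_bag

answer = solve_apples(*problem["quantities"])
print(answer)
```

The equation and the program encode the same solution; a solver trained on either annotation has to recover the quantities and the operations that connect them.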
Theorem proving
- Automating theorem proving is a challenge in AI
- Theorem proving tests various skills such as choosing strategies, using background knowledge, and performing symbolic manipulations
- Recently, there has been increased interest in using language models for theorem proving in formal interactive theorem provers
- Data sources for neural theorem proving in ITPs include interactive learning environments and datasets derived from proofs in ITP libraries
- Early applications of deep learning for formal theorem proving focus on selecting relevant premises
- Informal theorem proving presents an alternative medium for theorem proving, written in natural language and symbols
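To make the contrast between formal and informal proving concrete, here is a small formal statement as it might be written in Lean 4. The theorem name `add_comm_example` is made up for illustration; `Nat.add_comm` is the existing library lemma that serves as the premise. The informal counterpart would simply read "for all natural numbers a and b, a + b = b + a".

```lean
-- Formal statement and proof: addition on natural numbers is commutative.
-- The proof consists of selecting one relevant premise from the library.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Even in this tiny case the prover has to pick the right premise from a large library, which is the selection problem the early deep learning work targeted.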
Geometry problem solving
- GPS is a long-standing AI task in mathematical reasoning research
- GPS involves the ability to parse multimodal information, perform symbolic abstraction, utilize theorem knowledge, and conduct quantitative reasoning
- Early datasets are relatively small or not publicly available
- Geometry3K dataset consists of 3,002 multi-choice geometry problems
- GeoQA, GeoQA+, and UniGeo are larger-scale datasets annotated with programs that can be learned by neural solvers
Math question answering
- Numerical reasoning is a core ability in human intelligence
- Many NLP tasks involve mathematical reasoning
- Recent datasets have been presented for math QA
- State-of-the-art models may suffer from brittleness in reasoning
- New benchmarks have been proposed from various aspects
- Some work incorporates tabular contexts in the question inputs
- Others present large-scale unified benchmarks for mathematical reasoning
Other quantitative problems
- Numbers are used in everyday tasks
- AI systems can be evaluated for quantitative reasoning
- Diagrams are used to convey large amounts of information
- Various benchmarks have been developed to evaluate AI systems
- Specific domains like finance, science, and programming involve quantitative reasoning
Graph-based networks for math
- Seq2Seq approaches generate mathematical expressions without relying on hand-crafted features
- Mathematical expressions can be transformed into tree-based and graph-based structures
- Seq2Seq methods do not explicitly model this structure
- Graph-based neural networks are developed to explicitly model the structure in expressions
- Examples of this include Seq2Tree and Seq2DAG models
- Graph-based information can also be embedded when encoding the input mathematical sequences
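The tree structure that sequence models leave implicit can be shown with a short sketch. It uses Python's standard `ast` module to parse an expression into a nested tuple and then flattens it into prefix order. The helper names `to_tree` and `prefix` are illustrative, and the prefix ordering is just one possible way to serialize a tree, not the output format of any specific Seq2Tree model.

```python
import ast

# Map Python AST operator classes to operator symbols.
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def to_tree(node):
    """Convert an AST expression node into a nested (op, left, right) tuple."""
    if isinstance(node, ast.BinOp):
        return (OPS[type(node.op)], to_tree(node.left), to_tree(node.right))
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return node.id
    raise ValueError(f"unsupported node: {type(node).__name__}")

def prefix(tree):
    """Flatten the tree into prefix (pre-order) notation."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return [op] + prefix(left) + prefix(right)
    return [str(tree)]

tree = to_tree(ast.parse("3 + 4 * 5", mode="eval").body)
print(tree)          # ('+', 3, ('*', 4, 5))
print(prefix(tree))  # ['+', '3', '*', '4', '5']
```

The flat token sequence `3 + 4 * 5` hides the fact that `4 * 5` is a subexpression; the tree makes that grouping explicit.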
Attention-based networks for math
- Attention mechanism has been used in natural language processing and computer vision
- Attention mechanism has been used in mathematical reasoning tasks
- Group-ATT and graph attention have been studied to extract better representations
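For readers unfamiliar with the mechanism, the sketch below implements generic scaled dot-product attention for a single query in plain Python. It is not Group-ATT or graph attention; it only shows the basic weighting step that those variants build on. The function names and the toy vectors are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the attention weights and the weighted sum of the values.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy example: the query matches the first key more closely than the second.
weights, output = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0], [20.0]])
print(weights, output)
```

The output is pulled toward the value whose key best matches the query, which is how attention lets a model focus on the relevant quantities in a problem.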
Other neural networks for math
- Deep learning approaches to mathematical reasoning tasks can use convolutional neural networks (CNN) and multimodal networks
- Encoding input text with CNN can capture long-term relationships between symbols
- Multimodal mathematical reasoning tasks are formalized as visual question answering (VQA) problems
- Visual inputs are encoded using ResNet or Faster-RCNN, while textual representations are obtained via GRU or LSTM
- Graph Neural Networks (GNNs) are used for geometry problem parsing
- WaveNet is used for theorem proving
- Transformers outperform GRU in generating mathematical equations
- MathDQN uses reinforcement learning for math word problem solving
- Pre-trained language models are used for math-related problems
Self-supervised learning for math
- Self-supervised learning is a machine learning approach that does not require labeled training data
- An example of self-supervised learning is next-token prediction
- Table 4 provides a list of language models pre-trained with self-supervised tasks for mathematical reasoning
- Models have become increasingly larger in the past few years
- A study found that models that win head-to-head model comparisons for accuracy are at least 50B parameters
- There are two types of pre-training corpus for mathematical language models: curated datasets and synthetic datasets
- Language model pre-training typically relies on two self-supervised learning tasks: Masked Language Modeling (MLM) and Causal Language Modeling (CLM)
- Researchers have designed customized tasks to inject mathematical reasoning capabilities into language models
- There are studies about probing whether pre-trained language models have captured numerical commonsense knowledge
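The two pre-training objectives named above differ mainly in how training examples are built from raw text. The sketch below constructs both kinds of example from a tokenized arithmetic statement. The function names, the `[MASK]` string, and the masking probability used here are assumptions for illustration; real models use subword tokenizers and more elaborate masking rules.

```python
import random

def causal_lm_pairs(tokens):
    """Causal LM (next-token prediction): each prefix predicts the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Masked LM: hide random tokens and record the originals as targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok   # position -> original token to predict
        else:
            masked.append(tok)
    return masked, targets

tokens = "2 + 3 = 5".split()
pairs = causal_lm_pairs(tokens)
masked, targets = masked_lm_example(tokens, mask_prob=0.5)
print(pairs[0])
print(masked, targets)
```

Neither objective needs human labels: the text itself supplies the targets, which is what makes the approach self-supervised.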
Task-specific fine-tuning for math
- Task-specific fine-tuning is a technique to improve the performance of a pre-trained language model.
- It is used when there is not enough data for training large models from scratch.
- It is used for a variety of tasks, such as Math Word Problems, MathQA over financial tabular data, Geometry, Linear Algebra, and informal theorem proving.
- It is also used to combine pre-trained language models with other modules for downstream tasks.
In-context learning for mathematical reasoning
- GPT-3 has revolutionized NLP
- In-context Learning (ICL) allows users to quickly build models for new use cases
- Standard few-shot prompting has not yet proved sufficient to achieve high performance on challenging tasks
- Chain-of-thought prompting (CoT) leverages intermediate natural language rationales as prompts
- Recent work has focused on how to improve chain-of-thought reasoning under the few-shot setting
- This work is divided into two parts: selecting better in-context examples and creating better reasoning chains
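A chain-of-thought prompt differs from a standard few-shot prompt only in that each demonstration includes a written rationale before the answer. The sketch below assembles such a prompt as a string. The `Q:`/`A:` layout, the function name `build_cot_prompt`, and the demonstration content are illustrative assumptions; no model is called.

```python
def build_cot_prompt(demos, question):
    """Assemble a few-shot prompt whose demonstrations include rationales."""
    parts = []
    for d in demos:
        parts.append(
            f"Q: {d['question']}\n"
            f"A: {d['rationale']} The answer is {d['answer']}."
        )
    # The new question ends with an open "A:" for the model to continue.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

demos = [
    {
        "question": "Tom has 3 apples and buys 4 bags of 5 apples. How many apples now?",
        "rationale": "4 bags of 5 apples is 4 * 5 = 20 apples. 3 + 20 = 23.",
        "answer": "23",
    }
]
prompt = build_cot_prompt(demos, "Ann has 2 pens and buys 3 packs of 4 pens. How many pens now?")
print(prompt)
```

Removing the `rationale` field from the template would turn this back into a standard few-shot prompt, which makes the two settings easy to compare.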
In-context example selection
- Early chain-of-thought work randomly or heuristically selects in-context examples
- Recent studies have shown this type of few-shot learning can be unstable
- Which in-context examples make the most effective prompts remains an open problem in the literature
- Various methods have been proposed to optimize the in-context example selection process
- Rubin et al. (2022) attempt to address this issue by retrieving semantically similar examples
- Zhang et al. (2022b) propose diversifying demonstration questions to improve model performance
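Similarity-based selection can be sketched with a very simple retriever. The code below ranks candidate demonstrations by bag-of-words cosine similarity to the test question. This is a stand-in chosen for brevity, not the retriever used in the cited work, and the function names and example pool are hypothetical.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(query, pool, k=2):
    """Return the k pool questions most similar to the query."""
    q = bow(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bow(ex)), reverse=True)
    return ranked[:k]

pool = [
    "How many apples does Tom have after buying 4 more ?",
    "What is the area of a circle with radius 3 ?",
    "How many apples are left after Tom eats 2 ?",
]
query = "How many apples does Ann have ?"
selected = select_examples(query, pool, k=2)
print(selected)
```

A diversity-oriented strategy would instead try to cover different problem types in the selected set, so the two ideas pull in different directions and can be combined.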
High-quality reasoning chains
- Early chain-of-thought work relies on a single human-annotated reasoning chain.
- Manually creating reasoning chains has two disadvantages.
- Recent studies focus on two aspects: process-based and outcome-based approaches.
- Process-based approaches aim to improve chain-of-thought reasoning quality.
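One simple way an outcome-based approach could work is to sample several reasoning chains and keep the final answer that occurs most often. The sketch below shows only that aggregation step. The voting rule, the regex-based answer extraction, and the sample chains are assumptions made for illustration and are not a method named in the summary above.

```python
import re
from collections import Counter

def final_answer(chain):
    """Return the last number mentioned in a reasoning chain, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None

def majority_answer(chains):
    """Pick the most frequent final answer across sampled chains."""
    answers = [a for a in map(final_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

chains = [
    "3 + 4 * 5 = 23. The answer is 23.",
    "4 * 5 = 20, plus 3 is 23.",
    "3 + 4 = 7, times 5 is 35.",   # a faulty chain that gets outvoted
]
print(majority_answer(chains))  # 23
```

The vote only looks at outcomes, so a chain with flawed intermediate steps can still contribute a correct answer; process-based approaches target exactly those intermediate steps.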
Discussion
Analysis of benchmarks
- Existing benchmarks for mathematical reasoning have mostly targeted the text-only modality
- Visual elements can provide a rich source of quantitative information
- Mathematical reasoning in low-resource settings is under-explored
- Datasets have been annotated with intermediate rationales such as logic forms, programs, and reasoning graphs
Analysis of deep learning methods
- Current representation of numeracy is not sufficient
- Deep learning techniques treat numbers the same as words
- Numbers are often collapsed into an “UNK” token
- Tokenization approaches are suboptimal
- Lack of consistent representation makes it difficult for deep learning models to process numbers
- Insufficient representations of numbers can lead to out-of-distribution problems
- Deep learning methods for mathematical reasoning are not robust and are susceptible to adversarial attacks
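The number-representation problem can be seen in a toy setting. The sketch below contrasts a word-level vocabulary lookup, where an unseen number collapses into an unknown token, with a digit-level split that keeps the number recoverable. The vocabulary, the function names, and the `[UNK]` string are illustrative assumptions rather than the behavior of any particular tokenizer.

```python
def word_level(tokens, vocab, unk="[UNK]"):
    """Word-level lookup: any token outside the vocabulary becomes UNK."""
    return [t if t in vocab else unk for t in tokens]

def digit_level(tokens):
    """Split purely numeric tokens into individual digits."""
    out = []
    for t in tokens:
        if t.isdigit():
            out.extend(list(t))
        else:
            out.append(t)
    return out

vocab = {"tom", "has", "apples", "3"}
tokens = "tom has 4817 apples".split()

print(word_level(tokens, vocab))  # ['tom', 'has', '[UNK]', 'apples']
print(digit_level(tokens))        # ['tom', 'has', '4', '8', '1', '7', 'apples']
```

With the word-level scheme the model cannot distinguish 4817 from any other unseen number, which is one source of the out-of-distribution failures listed above.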
Generalization and robustness
- Neural models have difficulty generalizing to larger numbers and remaining robust to nearby problems.
- Both inference-time strategies and fine-tuning strategies can be explored to improve generalization.
- Memorization may play a role in complex solutions, but more analysis is needed.
Trustworthy reasoning
- Language models can generate ungrounded answers that users must verify.
- Recent prompting strategies provide rationales before making predictions, but language models still often hallucinate statements and produce wrong answers.
- Methods that enable more trustworthy reasoning are needed, such as using language models to provide evidence, incorporating a mechanism to make a judgment when the model is unsure, and using a model to detect and locate mistakes.
Learning from feedback
- Language models can be improved by learning from feedback
- Reinforcement learning from human feedback (RLHF) can be used to align language models with instructions
- Online learning and incorporating humans in the loop are related to this research direction
- Feedback can come from humans, theorem-proof engines, or execution results of model-generated scripts
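Feedback from execution results can be sketched as a function that runs a generated script and turns the outcome into a scalar signal. The binary reward, the convention that the script stores its result in a variable named `answer`, and the function name `execution_feedback` are all assumptions for illustration. Note that `exec` offers no sandboxing, so this pattern should only be used on trusted code or inside an isolated environment.

```python
def execution_feedback(program, expected, result_name="answer"):
    """Run a generated script and return (reward, message).

    WARNING: exec is not sandboxed; run untrusted code only in isolation.
    """
    namespace = {}
    try:
        exec(program, namespace)
    except Exception as exc:  # syntax and runtime errors both yield zero reward
        return 0.0, f"error: {type(exc).__name__}"
    if result_name not in namespace:
        return 0.0, "no result variable defined"
    if namespace[result_name] == expected:
        return 1.0, "correct"
    return 0.0, f"wrong result: {namespace[result_name]!r}"

print(execution_feedback("answer = 3 + 4 * 5", 23))  # (1.0, 'correct')
print(execution_feedback("answer = 3 / 0", 23))      # (0.0, 'error: ZeroDivisionError')
```

The returned message carries more information than the reward alone, so it could also be fed back to the model as text in a later attempt.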
Multi-modal mathematical reasoning
- Growing interest in multi-modal mathematical reasoning
- Challenges and opportunities for further research
- Currently available datasets are small, generated from templates, or focus on specific topics
- VQA-based frameworks are used to analyze figures and plots, but can result in semantic gaps
- Converting tables and natural images into text descriptions can lead to important information being lost
- Future work involves creating unified models and better evaluation benchmarks
Conclusion
- Comprehensive survey of deep learning for mathematical reasoning
- Review of tasks, datasets, and approaches
- Gaps in existing datasets and methods
- Outline of directions for future research
- Reading list and GitHub repository created