Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Investigated mathematical capabilities of ChatGPT
- Tested on publicly available and hand-crafted datasets
- Measured performance against other models
- Tested usefulness to professional mathematicians
- Current datasets only cover elementary mathematics
- Introduced new dataset GHOSTS to cover graduate-level mathematics
- Benchmarked ChatGPT on GHOSTS
- ChatGPT’s mathematical abilities are below those of an average mathematics graduate student
Paper Content
Introduction
- ChatGPT is a widely known question-and-answer dialogue system
- It is the most talked about language model on Twitter
- It has been tested in a number of exam-related use cases
- It is believed to be used as an assistant by many professionals
- This paper focuses on analyzing the mathematical capabilities of ChatGPT
- It introduces new natural-language math datasets to benchmark ChatGPT’s performance
Related work
- ChatGPT is a large language model that can be used to perform mathematical reasoning
- Mathematical reasoning has been studied since 1959
- Classical approaches using symbolic encoding have reached a plateau
- There is a growing body of literature on learning mathematical relationships directly
- Most recently published large language models are tested on elementary-level mathematical reasoning datasets
- Variations of BERT have been shown to solve between 28% and 37% of problems on the AQuA-RAT dataset
- Minerva, based on PaLM, achieved a score of 50% on MATH dataset
- Supervised approaches have outperformed classical solvers
- An up-to-date survey on mathematical datasets and performance of LLMs can be found in [23]
- Investigations related to ChatGPT’s performance consist of anecdotal evidence
- Ideas used in this article are echoed in [31] for formal mathematics
Datasets
Dataset creation
- Created collection of 728 prompts
- Manually rated by experts
- The MATH and Symbolic-Integration subdatasets are taken from existing datasets
- Minerva and a supervised-learning approach used for comparison
- Hand-crafted datasets by authors
- Creation of datasets requires advanced mathematical insight
Format
- GHOSTS dataset consists of multiple JSON-formatted files
- Each datapoint in a JSON file has a prompt, reference, MSC code, confidence, and timestamp
- Rating is a number from 1 to 5
- Errorcodes and warningcodes highlight failure modes of ChatGPT
- Comment field can provide context
- MSC codes indicate areas where ChatGPT performs better
- Prompt engineering is allowed
- Chats with ChatGPT are “cold”
- ChatGPT must provide correct solution without clarification
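To make the record layout concrete, here is a minimal Python sketch of one datapoint and a sanity check on it. The key names (`prompt`, `output`, `rating`, `errorcodes`, `warningcodes`, `comment`, `msc`, `ref`, `confidence`, `timestamp`), the sample values, and the helper `validate_datapoint` are assumptions made for illustration; the actual GHOSTS files may use different keys and codes.

```python
import json

# Hypothetical datapoint; key names and values are illustrative assumptions,
# not copied from the GHOSTS files.
SAMPLE = """
{
  "prompt": "Show that every finite integral domain is a field.",
  "output": "(model answer goes here)",
  "rating": 4,
  "errorcodes": [],
  "warningcodes": ["w1"],
  "comment": "Correct idea, one step left unjustified.",
  "msc": ["13G05"],
  "ref": "(source of the exercise)",
  "confidence": "high",
  "timestamp": "2023-01-15"
}
"""

# Fields the summary above names explicitly, plus the rating.
REQUIRED_KEYS = {"prompt", "rating", "msc", "ref", "confidence", "timestamp"}


def validate_datapoint(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    rating = record.get("rating")
    # The rating is described as a number from 1 to 5.
    if not isinstance(rating, int) or not 1 <= rating <= 5:
        problems.append(f"rating out of range: {rating!r}")
    return problems


record = json.loads(SAMPLE)
print(validate_datapoint(record))
```

A check like this is only a convenience for anyone loading the files; it says nothing about how the authors validated their own data.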
The subdatasets
- Used LaTeX to encode mathematical input
- Experiments showed ChatGPT can process LaTeX-encoded mathematics
- Used exercises from textbooks that are used to teach undergraduate and graduate courses in mathematics
- Used exercises from Problem-Solving Strategies book for mathematical competitions
- Prompted ChatGPT to fill gaps in proofs from math.stackexchange.com, books, and MATH dataset
- Random sample of prompts from MATH dataset with level of difficulty
- Random samples of integrals from test set of [18]
- Problems generated by human expert in field
- Prompted ChatGPT to provide proof outlines of various theorems
- Prompted ChatGPT to correctly state various definitions
- Verified whether ChatGPT could deduce the name of a mathematical object from a description of its properties
- Collected statistics on output length, stability of answer, and how close to correct answer
Results
- ChatGPT cannot help you pass a university math class.
- Copying from an average peer is a better option.
Grad-text
- ChatGPT performed best on simple set-theory and logic questions
- On other books, it performed worse
- It never failed to understand a query
- It struggled to understand unusual puzzles and strange situations
- It had difficulty integrating surprising information into its answers
- It was strong at recognizing the context
- Its ability to execute algebraic manipulations was inconsistent
Symbolic-integration
- ChatGPT was outperformed by systems trained specifically to solve integration problems
- ChatGPT got the structure of terms right but failed at computations
- ChatGPT had good performance when retrieving definitions
- ChatGPT had strongest performance when recovering definitions from descriptions
Overall performance
- ChatGPT performs badly on problems in the style of mathematical olympiads
- Rating corresponds closely to the ranking of mathematical difficulty
- Results for different mathematical fields analyzed in Figure 3
- Prompt length has no effect on rating
- ChatGPT’s rating reflective of mathematical difficulty
- ChatGPT usually very confident, unlike other GPT-like models
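The per-subdataset comparison above boils down to averaging expert ratings within each group. A minimal sketch of that aggregation follows; the group names are used as placeholders, the rating values are invented for illustration and are not results from the paper, and `average_rating_by_group` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (subdataset, rating) pairs; the numbers are made up
# for illustration and do not come from the paper.
ratings = [
    ("Grad-Text", 4), ("Grad-Text", 5), ("Grad-Text", 3),
    ("Olympiad-Problem-Solving", 2), ("Olympiad-Problem-Solving", 1),
    ("Symbolic-Integration", 3), ("Symbolic-Integration", 2),
]


def average_rating_by_group(pairs):
    """Group ratings by name and return the mean rating of each group."""
    groups = defaultdict(list)
    for name, rating in pairs:
        groups[name].append(rating)
    return {name: mean(values) for name, values in groups.items()}


averages = average_rating_by_group(ratings)
# Print groups from lowest to highest average rating.
for name, avg in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{name}: {avg:.2f}")
```

The same grouping would work with an MSC code or a file name as the key instead of the subdataset name.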
Conclusion
- Examined behavior of ChatGPT across various datasets
- Not yet ready to deliver high-quality proofs or calculations consistently
- Quality of answer can be positively surprising
- Best responses justify media sensation
- Inconsistently bad at advanced mathematics
- Performance not as good as models specifically trained for one task
- Ability to search for mathematical objects is where ChatGPT shines
- Received highest scores on Reverse Definition Retrieval files
- Dataset not large enough to fine-tune LLMs
- Prompt engineering reduces error codes but not average rating
- Focus on the 9 January 2023 version of ChatGPT
- Compared outputs of 9th and 30th January versions
- No substantial differences in average rating
- Collected further figures and descriptive statistics
- Prompts and answers lightly modified for readability
- Used SHA256 hash function for copyright protected prompts
- Best and worst answers of ChatGPT collected
- Figure 2 shows average rating for each file in each subdataset
- Figure 5 shows effect of prompt engineering on rating
- Figure 6 shows error types per dataset
- Figure 7 shows relative frequencies of error codes
- Figure 8 shows rating by MSC codes
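The note above about SHA256 and copyright-protected prompts can be illustrated as follows. This is a sketch under assumptions: the helper names `hash_prompt` and `matches` are hypothetical, the prompt text is a placeholder, and the authors' exact encoding and normalization of prompts before hashing are not stated here, so UTF-8 without normalization is assumed.

```python
import hashlib


def hash_prompt(prompt: str) -> str:
    """Return the hex SHA-256 digest of a prompt (UTF-8 encoding assumed)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def matches(candidate: str, published_digest: str) -> bool:
    """Check whether a candidate prompt reproduces a published digest."""
    return hash_prompt(candidate) == published_digest


# Placeholder text standing in for a copyright-protected exercise.
protected_prompt = "Placeholder text of a copyright-protected exercise."
digest = hash_prompt(protected_prompt)

print(digest)
print(matches(protected_prompt, digest))
```

Publishing only the digest lets someone who owns the source book confirm which exercise was used, without the exercise text itself being redistributed.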