Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Investigated mathematical capabilities of ChatGPT
Tested on publicly available and hand-crafted datasets
Measured performance against other models
Tested usefulness to professional mathematicians
Current datasets only cover elementary mathematics
Introduced new dataset GHOSTS to cover graduate-level mathematics
Benchmarked ChatGPT on GHOSTS
ChatGPT’s mathematical abilities are significantly below those of an average mathematics graduate student

Paper Content

Introduction

ChatGPT is a widely known question-and-answer dialogue system
It is the most talked about language model on Twitter
It has been tested in a number of exam-related use cases
It is believed to be used as an assistant by many professionals
This paper focuses on analyzing the mathematical capabilities of ChatGPT
It introduces new natural-language math datasets to benchmark ChatGPT’s performance

Related work

ChatGPT is a large language model that can be used to perform mathematical reasoning
Mathematical reasoning has been studied since 1959
Classical approaches using symbolic encoding have reached a plateau
There is a growing body of literature on learning mathematical relationships directly
Most recently published large language models are tested on elementary-level mathematical reasoning datasets
Variations of BERT have been shown to solve between 28% and 37% of the problems in the AQuA-RAT dataset
Minerva, based on PaLM, achieved a score of 50% on MATH dataset
Supervised approaches have outperformed classical solvers
An up-to-date survey on mathematical datasets and performance of LLMs can be found in [23]
Investigations related to ChatGPT’s performance consist of anecdotal evidence
Ideas used in this article are echoed in [31] for formal mathematics

Datasets

Dataset creation

Created collection of 728 prompts
Manually rated by experts

• symbolic-integration

MATH dataset and Symbolic-Integration subdataset taken from existing datasets
Minerva and a supervised-learning approach used for comparison
Hand-crafted datasets by authors
Creation of datasets requires advanced mathematical insight

Format

GHOSTS dataset consists of multiple JSON-formatted files
Each datapoint in a JSON file has a prompt, reference, MSC code, confidence, and timestamp
Rating is a number from 1 to 5
Errorcodes and warningcodes highlight failure modes of ChatGPT
Comment field can provide context
MSC codes indicate areas where ChatGPT performs better
Prompt engineering is allowed
Chats with ChatGPT are “cold”
ChatGPT must provide correct solution without clarification
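The datapoint layout described above can be sketched in Python. This is a minimal illustration, not the dataset's actual schema: the key names (`prompt`, `output`, `rating`, `errorcodes`, `warningcodes`, `comment`, `msc`, `ref`, `confidence`, `timestamp`) are assumptions based on the field descriptions in this section, and every value in `SAMPLE` is made up for the example.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Datapoint:
    """One rated prompt/answer pair. All field names are assumed, not official."""
    prompt: str
    output: str
    rating: int                                   # expected to be 1..5
    errorcodes: List[str] = field(default_factory=list)
    warningcodes: List[str] = field(default_factory=list)
    comment: str = ""
    msc: List[str] = field(default_factory=list)  # MSC classification codes
    ref: str = ""
    confidence: str = ""
    timestamp: str = ""

    def __post_init__(self) -> None:
        # Enforce the 1-to-5 rating scale described in the text.
        if not 1 <= self.rating <= 5:
            raise ValueError(f"rating must be between 1 and 5, got {self.rating}")


def parse_datapoints(text: str) -> List[Datapoint]:
    """Parse a JSON array of objects into Datapoint instances."""
    return [Datapoint(**item) for item in json.loads(text)]


# Made-up sample record; the code strings "e2" and "w1" are placeholders.
SAMPLE = """
[
  {
    "prompt": "Compute the integral of x^2 from 0 to 1.",
    "output": "(model answer text)",
    "rating": 2,
    "errorcodes": ["e2"],
    "warningcodes": ["w1"],
    "comment": "Correct setup, wrong arithmetic.",
    "msc": ["26A"],
    "ref": "hypothetical reference",
    "confidence": "high",
    "timestamp": "2023-01-09"
  }
]
"""

if __name__ == "__main__":
    for point in parse_datapoints(SAMPLE):
        print(point.rating, point.msc, point.errorcodes)
```

A dataclass with a validation step is used here so that a rating outside the 1 to 5 scale fails loudly at load time. Records whose keys differ from the assumed names would raise a `TypeError` and need a mapping step first.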

The subdatasets

Used LaTeX to encode mathematical input
Experiments showed ChatGPT can process LaTeX-encoded mathematics
Used exercises from textbooks for undergraduate and graduate mathematics courses
Used exercises from Problem-Solving Strategies book for mathematical competitions
Prompted ChatGPT to fill gaps in proofs from math.stackexchange.com, books, and MATH dataset
Random sample of prompts from MATH dataset with level of difficulty
Random samples of integrals from test set of [18]
Problems generated by human expert in field
Prompted ChatGPT to provide proof outlines of various theorems
Prompted ChatGPT to state correctly various definitions
Verified whether ChatGPT could deduce name of mathematical object by describing its properties
Collected statistics on output length, stability of answer, and how close to correct answer

Results

ChatGPT cannot help you pass a university math class.
Copying from an average peer is a better option.

Grad-text

ChatGPT performed best on simple set-theory and logic questions
On other books, it performed worse
It never failed to understand a query
It struggled to understand unusual puzzles and strange situations
It had difficulty integrating surprising information into its answers
It was strong at recognizing the context
Its ability to execute algebraic manipulations was inconsistent

Symbolic-integration

ChatGPT was outperformed by a model trained specifically to solve integration problems [18]
ChatGPT got the structure of terms right but failed at computations
ChatGPT had good performance when retrieving definitions
ChatGPT had strongest performance when recovering definitions from descriptions

Overall performance

ChatGPT performs badly on problems within the style of mathematical olympiads
Rating corresponds closely to the ranking of mathematical difficulty
Results for different mathematical fields analyzed in Figure 3
Prompt length has no effect on rating
ChatGPT’s rating reflective of mathematical difficulty
ChatGPT usually very confident, unlike other GPT-like models
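The per-field comparison of ratings mentioned above amounts to grouping rated records by a key and averaging. The following is a minimal sketch under stated assumptions: the record keys `msc` and `rating` and the helper name `average_rating_by` are hypothetical, and the toy records are invented for illustration rather than taken from the paper's results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Mapping


def average_rating_by(records: Iterable[Mapping], key: str) -> Dict[str, float]:
    """Group records by records[key] and return the mean rating per group."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["rating"])
    return {name: mean(ratings) for name, ratings in groups.items()}


# Invented toy data; not results from the paper.
toy_records = [
    {"msc": "26A", "rating": 4},
    {"msc": "26A", "rating": 2},
    {"msc": "03E", "rating": 5},
]

if __name__ == "__main__":
    print(average_rating_by(toy_records, "msc"))
```

The same helper could group by a file or subdataset name instead of an MSC code, provided each record carries such a key.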

Conclusion

Examined behavior of ChatGPT across various datasets
Not yet ready to deliver high-quality proofs or calculations consistently
Quality of answer can be positively surprising
Best responses justify media sensation
Inconsistently bad at advanced mathematics
Performance not as good as models specifically trained for one task
Ability to search for mathematical objects is where ChatGPT shines
Received highest scores on Reverse Definition Retrieval files
Dataset not large enough to fine-tune LLMs
Prompt engineering reduces error codes but not average rating
Focus on 9th-January-2023 version of ChatGPT
Compared outputs of 9th and 30th January versions
No substantial differences in average rating
Collected further figures and descriptive statistics
Prompts and answers lightly modified for readability
Used SHA256 hash function for copyright protected prompts
Best and worst answers of ChatGPT collected
Figure 2 shows average rating for each file in each subdataset
Figure 5 shows effect of prompt engineering on rating
Figure 6 shows error types per dataset
Figure 7 shows relative frequencies of error codes
Figure 8 shows rating by MSC codes
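The SHA256 hashing of copyright-protected prompts mentioned in the list above can be illustrated with Python's standard `hashlib` module. This is a sketch only: the notes do not state how the text was encoded or normalized before hashing, so the UTF-8 encoding and the helper name `hash_prompt` are assumptions.

```python
import hashlib


def hash_prompt(prompt: str) -> str:
    """Return the SHA-256 hex digest of a prompt (UTF-8 encoding assumed)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Placeholder text standing in for a copyright-protected exercise.
    print(hash_prompt("Placeholder exercise text"))
```

Publishing only the digest lets someone who already has the source text confirm that it matches a dataset entry, without the dataset reproducing the protected text itself.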
