Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Investigated mathematical capabilities of ChatGPT
  • Tested on publicly available and hand-crafted datasets
  • Measured performance against other models
  • Tested usefulness to professional mathematicians
  • Current datasets only cover elementary mathematics
  • Introduced new dataset GHOSTS to cover graduate-level mathematics
  • Benchmarked ChatGPT on GHOSTS
  • ChatGPT’s mathematical abilities are significantly below those of an average mathematics graduate student

Paper Content

Introduction

  • ChatGPT is a widely known question-and-answer dialogue system
  • It is the most talked about language model on Twitter
  • It has been tested in a number of exam-related use cases
  • It is believed to be used as an assistant by many professionals
  • This paper focuses on analyzing the mathematical capabilities of ChatGPT
  • It introduces new natural-language math datasets to benchmark ChatGPT’s performance
  • ChatGPT is a large language model that can be used to perform mathematical reasoning
  • Mathematical reasoning has been studied since 1959
  • Classical approaches using symbolic encoding have reached a plateau
  • There is a growing body of literature on learning mathematical relationships directly
  • Most recently published large language models are tested on elementary-level mathematical reasoning datasets
  • Variations of BERT have been shown to solve between 28% and 37% of the problems in the AQuA-RAT dataset
  • Minerva, based on PaLM, achieved a score of 50% on the MATH dataset
  • Supervised approaches have outperformed classical solvers
  • An up-to-date survey on mathematical datasets and performance of LLMs can be found in [23]
  • Investigations related to ChatGPT’s performance consist of anecdotal evidence
  • Ideas used in this article are echoed in [31] for formal mathematics

Datasets

Dataset creation

  • Created collection of 728 prompts
  • Manually rated by experts

  • MATH dataset and Symbolic-Integration subdataset taken from existing datasets
  • Minerva and a supervised-learning approach used for comparison
  • Hand-crafted datasets by authors
  • Creation of datasets requires advanced mathematical insight

Format

  • GHOSTS dataset consists of multiple JSON-formatted files
  • Each datapoint in a JSON file has a prompt, reference, MSC code, confidence, and timestamp
  • Rating is a number from 1 to 5
  • Errorcodes and warningcodes highlight failure modes of ChatGPT
  • Comment field can provide context
  • MSC codes indicate areas where ChatGPT performs better
  • Prompt engineering is allowed
  • Chats with ChatGPT are “cold”
  • ChatGPT must provide correct solution without clarification
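
As a rough illustration of the format described above, the sketch below parses one hypothetical datapoint and checks that its rating lies in the 1 to 5 range. The key names (`prompt`, `ref`, `msc`, and so on), the assumption that ratings are integers, and every value shown are illustrative guesses, not copied from the GHOSTS files, which may name or type these fields differently.

```python
import json

# Hypothetical datapoint: key names and values are illustrative assumptions,
# not taken from the actual GHOSTS files.
raw = """
{
  "prompt": "Show that every finite integral domain is a field.",
  "ref": "hypothetical textbook exercise",
  "msc": ["13G05"],
  "confidence": "high",
  "timestamp": "2023-01-09",
  "rating": 4,
  "errorcodes": [],
  "warningcodes": [],
  "comment": ""
}
"""

# Keys this sketch insists on; the real schema may require more or fewer.
REQUIRED_KEYS = {"prompt", "ref", "msc", "confidence", "timestamp", "rating"}


def validate_datapoint(point: dict) -> list[str]:
    """Return a list of problems found in one datapoint (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - point.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    rating = point.get("rating")
    # Assumption: ratings are stored as integers from 1 to 5.
    if not isinstance(rating, int) or not 1 <= rating <= 5:
        problems.append(f"rating must be an integer from 1 to 5, got {rating!r}")
    return problems


datapoint = json.loads(raw)
print(validate_datapoint(datapoint))
```

A check like this is mainly useful when loading many files at once, since a single malformed rating would otherwise skew any averages computed later.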

The subdatasets

  • Used LaTeX to encode mathematical input
  • Experiments showed ChatGPT can process LaTeX-encoded mathematics
  • Used exercises from textbooks commonly used to teach undergraduate and graduate mathematics courses
  • Used exercises from Problem-Solving Strategies book for mathematical competitions
  • Prompted ChatGPT to fill gaps in proofs from math.stackexchange.com, books, and MATH dataset
  • Random sample of prompts from MATH dataset with level of difficulty
  • Random samples of integrals from test set of [18]
  • Problems generated by human expert in field
  • Prompted ChatGPT to provide proof outlines of various theorems
  • Prompted ChatGPT to correctly state various definitions
  • Verified whether ChatGPT could deduce name of mathematical object by describing its properties
  • Collected statistics on output length, stability of answer, and how close to correct answer

Results

  • ChatGPT cannot help you pass a university math class.
  • Copying from an average peer is a better option.

Grad-text

  • ChatGPT performed best on simple set-theory and logic questions
  • On other books, it performed worse
  • It never failed to understand a query
  • It struggled to understand unusual puzzles and strange situations
  • It had difficulty integrating surprising information into its answers
  • It was strong at recognizing the context
  • Its ability to execute algebraic manipulations was inconsistent

Symbolic-integration

  • ChatGPT was outperformed by systems trained specifically to solve integration problems
  • ChatGPT got the structure of terms right but failed at computations
  • ChatGPT had good performance when retrieving definitions
  • ChatGPT had strongest performance when recovering definitions from descriptions

Overall performance

  • ChatGPT performs badly on problems in the style of mathematical olympiads
  • Rating corresponds closely to the ranking of mathematical difficulty
  • Results for different mathematical fields analyzed in Figure 3
  • Prompt length has no effect on rating
  • ChatGPT’s rating reflective of mathematical difficulty
  • ChatGPT usually very confident, unlike other GPT-like models
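
The aggregate claims above (ratings by field, no effect of prompt length) rest on simple descriptive statistics. The sketch below shows one way such numbers could be computed: a mean rating per group and a Pearson correlation between prompt length and rating. The record layout, the function names, and the toy values are all assumptions for illustration; they are not the paper's data or its analysis code.

```python
from collections import defaultdict
from statistics import mean

# Toy records for illustration only; these are NOT results from the paper.
records = [
    {"msc": "03", "rating": 5, "prompt": "State the axiom of choice."},
    {"msc": "03", "rating": 4, "prompt": "Define a well-ordering."},
    {"msc": "26", "rating": 2, "prompt": "Compute a given integral."},
    {"msc": "26", "rating": 3, "prompt": "Prove a given inequality for all real x."},
]


def mean_rating_by(records, key):
    """Group records with `key` and return the mean rating of each group."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record["rating"])
    return {group: mean(ratings) for group, ratings in groups.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


by_field = mean_rating_by(records, key=lambda r: r["msc"])
lengths = [len(r["prompt"]) for r in records]
ratings = [r["rating"] for r in records]
print(by_field)
print(pearson(lengths, ratings))
```

A correlation near zero on the real data would be consistent with the statement that prompt length has no effect on rating, although a correlation coefficient alone cannot rule out non-linear effects.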

Conclusion

  • Examined behavior of ChatGPT across various datasets
  • Not yet ready to deliver high-quality proofs or calculations consistently
  • Quality of answer can be positively surprising
  • Best responses justify media sensation
  • Inconsistently bad at advanced mathematics
  • Performance not as good as models specifically trained for one task
  • Ability to search for mathematical objects is where ChatGPT shines
  • Received highest scores on Reverse Definition Retrieval files
  • Dataset not large enough to fine-tune LLMs
  • Prompt engineering reduces error codes but not average rating
  • Focus on 9th-January-2023 version of ChatGPT
  • Compared outputs of 9th and 30th January versions
  • No substantial differences in average rating
  • Collected further figures and descriptive statistics
  • Prompts and answers lightly modified for readability
  • Used SHA256 hash function for copyright protected prompts
  • Best and worst answers of ChatGPT collected
  • Figure 2 shows average rating for each file in each subdataset
  • Figure 5 shows effect of prompt engineering on rating
  • Figure 6 shows error types per dataset
  • Figure 7 shows relative frequencies of error codes
  • Figure 8 shows rating by MSC codes
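
The list above mentions that copyright-protected prompts were published as SHA256 hashes. A minimal sketch of that idea follows, using Python's standard `hashlib`. The function name `prompt_fingerprint` and the choice of UTF-8 encoding without any text normalization are assumptions; the summary does not say how the authors prepared the text before hashing.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """Return the SHA-256 hex digest of a prompt.

    Assumption: the prompt is hashed as UTF-8 bytes with no normalization.
    """
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


# Hypothetical prompt text, used only to show the call.
digest = prompt_fingerprint("Show that every finite integral domain is a field.")
print(digest)
```

Anyone who owns the original source can hash the same exercise text and compare digests, while the published dataset itself does not reveal the protected wording. Note that even a one-character difference in whitespace or encoding produces a completely different digest.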