Mathematical Capabilities of ChatGPT

Simon Frieder*, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Julius Berner

*Department of Computer Science, University of Oxford, Oxford, UK
simon.frieder (at) cs.ox.ac.uk
<GHOST logo here. Copyright: Frieder Simon>

TL;DR: We introduce GHOSTS, a natural-language dataset, together with a new benchmark for advanced mathematics. The dataset consists of six subdatasets (one for each letter of G-H-O-S-T-S), and we evaluate ChatGPT on each of them.


FEATURED IN THE NEWS:
  Ars Technica
  Synced Review
  German Government ("Bundestag") Report


Abstract

We investigate the mathematical capabilities of two iterations of ChatGPT (released 9-January-2023 and 30-January-2023) and of GPT-4 by testing them on publicly available datasets, as well as hand-crafted ones, using a novel methodology. In contrast to formal mathematics, where large databases of formal proofs are available (e.g., mathlib, the Lean Mathematical Library), current datasets of natural-language mathematics, used to benchmark language models, either cover only elementary mathematics or are very small. We address this by publicly releasing two new datasets: GHOSTS and miniGHOSTS. These are the first natural-language datasets curated by working researchers in mathematics that (1) aim to cover graduate-level mathematics, (2) provide a holistic overview of the mathematical capabilities of language models, and (3) distinguish multiple dimensions of mathematical reasoning. These datasets also test whether ChatGPT and GPT-4 can be helpful assistants to professional mathematicians by emulating use cases that arise in the daily professional activities of mathematicians. We benchmark the models on a range of fine-grained performance metrics. For advanced mathematics, this is the most detailed evaluation effort to date. We find that ChatGPT can be used most successfully as a mathematical assistant for querying facts, acting as a mathematical search engine and knowledge base interface. GPT-4 can additionally be used for undergraduate-level mathematics but fails on graduate-level difficulty. Contrary to many positive reports in the media about GPT-4 and ChatGPT's exam-solving abilities (a potential case of selection bias), their overall mathematical performance is well below the level of a graduate student. Hence, if your goal is to use ChatGPT to pass a graduate-level math exam, you would be better off copying from your average peer!


Selected Findings In Graphical Form

A Sankey diagram showing how scores change between the three models we tested (ChatGPT 9-January-2023, ChatGPT 30-January-2023, and GPT-4): <Sankey diagram>


The six subdatasets and the scores of the three tested models on each of them: <scores by model and subdataset>


Scores on the dataset, stratified by MSC (Mathematics Subject Classification) codes: <MSC scores>


The files that make up each subdataset, together with the errors and warnings that appeared, in absolute and relative numbers, for the ChatGPT "3.5" (9-January-2023) model: <subdataset files and error codes and warning codes>