Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Premise selection is a fundamental problem of automated theorem proving.
Traditional symbolic systems can be outperformed by a neural transformer-based approach called Magnushammer.
Magnushammer achieved a proof rate of 59.5% on the PISA benchmark, compared to 38.3% for Sledgehammer.
Combining Magnushammer with a neural formal prover based on a language model improved the proof rate from 57.0% to 71.0%.

Interactive proof assistants are software tools designed to help develop formal proofs.
Theorems are proved sequentially in Isabelle.
Proof states contain information about established facts and remaining goals.
Proof steps consist of tactics and premises.
Tactics are powerful decision procedures that can complete some proofs in one step.
Finding relevant premises is difficult as a proof context can contain thousands of them.

Sledgehammer is a premise selection tool used in Isabelle to help find and apply proof methods.
Sledgehammer has limitations, such as diminishing returns with increasing computational power and difficulty with type systems other than higher-order logic.
Deep-learning-based approaches can help overcome these limitations.
Magnushammer is a two-stage method that uses representation similarity and text similarity to select and re-rank premises.
Magnushammer is trained with a modified InfoNCE loss and a binary cross-entropy loss.
Magnushammer uses a transformer backbone pre-trained with a language modeling task.
Magnushammer is used to construct proof steps with tactics and premises, similar to Sledgehammer.
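
The two stages listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `embed` (bag-of-words counts) and `toy_pair_score` (token overlap) are hypothetical stand-ins for the transformer encoder and the re-ranking model, and `info_nce` is the plain InfoNCE loss for one positive among several negatives, not the modified variant the paper trains with.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Hypothetical stand-in for the transformer encoder: bag-of-words counts.
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def select(state: str, premises: dict[str, str], k: int) -> list[str]:
    """SELECT: embed state and premises independently, keep the top-k by cosine."""
    s = embed(state)
    ranked = sorted(
        premises, key=lambda name: cosine(s, embed(premises[name])), reverse=True
    )
    return ranked[:k]


def toy_pair_score(state: str, premise: str) -> float:
    # Hypothetical stand-in for the RERANK model, which reads the pair jointly.
    a, b = set(state.lower().split()), set(premise.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def rerank(state, candidates, premises, pair_score=toy_pair_score) -> list[str]:
    """RERANK: re-order the selected candidates by a joint (state, premise) score."""
    return sorted(
        candidates, key=lambda name: pair_score(state, premises[name]), reverse=True
    )


def info_nce(pos_sim: float, neg_sims: list[float], temperature: float = 0.07) -> float:
    """Plain InfoNCE: negative log-softmax of the positive among all candidates."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]
```

The split mirrors the cost trade-off of a two-stage design: `select` only needs one embedding per premise, so it can scan a large context cheaply, while `rerank` scores each pair jointly and is therefore applied only to the short list that `select` returns.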

A dataset of textual representations of Isabelle’s proof states and premises was created and released.
The two largest collections of Isabelle theories were used to create the dataset.
Each datapoint is a (proof state, premise) pair.
Sledgehammer was used to generate proof steps with alternative premises.
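
The datapoint structure described above can be pictured with a small sketch. The class name, field names, and the merge step are assumptions made for illustration; the released dataset's actual schema may differ, and the sketch simply assumes that machine-generated pairs are added to the human-written ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datapoint:
    # Hypothetical field names: both fields hold plain text.
    proof_state: str
    premise: str


def pairs_from_step(proof_state: str, premises_used: list[str]) -> list[Datapoint]:
    """Emit one (proof state, premise) pair per premise used in a proof step."""
    return [Datapoint(proof_state, p) for p in premises_used]


def merge_datasets(human: list[Datapoint], machine: list[Datapoint]) -> list[Datapoint]:
    """Combine human-written and machine-generated pairs, dropping duplicates
    while preserving the original order."""
    return list(dict.fromkeys(human + machine))
```

For a single proof state, a human-written step and a Sledgehammer-generated alternative may share some premises; `merge_datasets` keeps each distinct pair once.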

Magnushammer outperforms Sledgehammer by a large margin on the PISA benchmark.
Magnushammer combined with Thor sets the new state of the art on the PISA benchmark.

Magnushammer outperforms Sledgehammer in the single-step task.
Magnushammer outperforms BM25 in the single-step task.
Magnushammer outperforms Thor in the multi-step task.
Magnushammer outperforms Sledgehammer in the multi-step task with limited computational resources.
Magnushammer scales well with a medium computational budget.
Magnushammer and Sledgehammer use different computational budgets in the experiments.
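
For readers unfamiliar with the BM25 baseline mentioned above, here is a minimal sketch of Okapi BM25 ranking applied to premises treated as text. The whitespace tokenization and the parameter values `k1 = 1.5` and `b = 0.75` are common defaults assumed for illustration; the configuration used in the paper's experiments is not stated in this summary.

```python
import math
from collections import Counter


def bm25_rank(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Rank document names by Okapi BM25 score against the query, best first."""
    if not docs:
        return []
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Fall back to 1.0 so empty documents do not cause a division by zero.
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs or 1.0
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    query_terms = set(query.lower().split())  # each distinct query term counts once

    def idf(term: str) -> float:
        n = doc_freq[term]
        return math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)  # non-negative variant

    def score(toks: list[str]) -> float:
        tf = Counter(toks)
        length_norm = k1 * (1 - b + b * len(toks) / avg_len)
        return sum(
            idf(t) * tf[t] * (k1 + 1) / (tf[t] + length_norm)
            for t in query_terms
            if t in tf
        )

    return sorted(tokenized, key=lambda name: score(tokenized[name]), reverse=True)
```

BM25 relies purely on term overlap weighted by rarity, which makes it a natural non-neural point of comparison for a learned premise selector.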

We compare the Human Proofs Library (HPL) and Machine Augmented Proofs Library (MAPL) datasets to study how the amount and type of data impact the proof rate.
We use models with 38M non-embedding parameters and a computational budget of 800.
Magnushammer fine-tuned on only 0.1% of MAPL (4K samples) outperforms Sledgehammer.
Pre-training is beneficial, but the effect diminishes with more training data.
Fine-tuning on MAPL or HPL leads to subtle differences.
Proof rate correlates positively with model size.
SELECT-ONLY (Magnushammer without RERANK phase) outperforms Sledgehammer.
Magnushammer treats proof states and premises as text and can be applied to any theorem-proving environment.
We publish HPL and MAPL datasets in textual format.
We combine Magnushammer with Thor to generate alternative proof steps.
We use TPU virtual machines from GCP for pre-training, fine-tuning, and evaluation.
We use Algorithm 3 to evaluate BM25 and Magnushammer.
We combine Thor and Magnushammer in the multi-step setting.
We extend the context given to our model with a tactic prompt.
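
The idea of constructing proof steps from tactics and ranked premises, and of measuring a proof rate under a budget, can be sketched as below. This is not a reproduction of the paper's Algorithm 3: the tactic templates, the top-k schedule, the `check` callback (a hypothetical hook into the proof assistant), and the reading of the budget as a cap on attempted steps are all assumptions made for illustration.

```python
from itertools import islice
from typing import Callable, Iterator, Optional, Sequence

# Illustrative Isabelle tactic templates; the paper's tactic set is not reproduced here.
TACTIC_TEMPLATES = ("simp add: {}", "metis {}", "auto simp add: {}")


def candidate_steps(
    ranked_premises: Sequence[str],
    templates: Sequence[str] = TACTIC_TEMPLATES,
    ks: Sequence[int] = (1, 2, 4, 8),
) -> Iterator[str]:
    """Yield proof steps pairing each tactic with the top-k ranked premises."""
    seen = set()
    for template in templates:
        for k in ks:
            step = "by (" + template.format(" ".join(ranked_premises[:k])) + ")"
            if step not in seen:  # a k beyond the list length repeats a step
                seen.add(step)
                yield step


def try_prove(
    ranked_premises: Sequence[str], check: Callable[[str], bool], budget: int
) -> Optional[str]:
    """Return the first step accepted by `check`, trying at most `budget` steps."""
    for step in islice(candidate_steps(ranked_premises), budget):
        if check(step):
            return step
    return None


def proof_rate(outcomes: Sequence[Optional[str]]) -> float:
    """Fraction of theorems for which some proof step succeeded."""
    return sum(o is not None for o in outcomes) / len(outcomes) if outcomes else 0.0
```

Because the premises arrive already ranked, widening `k` trades precision for recall: small values keep the tactic focused, larger ones give it more facts to work with at a higher search cost.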