
FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?

arXiv, March 31, 2026

arXiv:2603.26996v1 (announce type: new)

Authors: Nikil Ravi, Kexing Ying, Vasilii Nesterov, Rayan Krishnan, Elif Uskuplu, Bingyu Xia, Janitha Aswedige, Langston Nashold


Abstract: We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean 4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn from qualifying exams and standard textbooks across topics including analysis, algebra, probability, and logic. We evaluate a range of frontier models with an agentic harness, and find that the best-performing foundation model achieves 33.5% accuracy, with performance dropping rapidly after that. In addition to the accuracy numbers, we also provide empirical analysis of tool use, failure modes, cost, and latency, thereby providing a thorough evaluation of the formal theorem-proving abilities of frontier models.
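To make the task format concrete, here is a minimal sketch of what a problem paired with a Lean 4 formal statement might look like. It is a hypothetical toy example and is not drawn from the benchmark, which is private and targets far harder material. The theorem names and the problem text are illustrative assumptions, and the proof uses only the core lemma Nat.add_comm so that it does not depend on any particular library version.

```lean
-- Hypothetical toy task (NOT from FormalProofBench).
-- Natural-language problem: "Show that addition of natural numbers is commutative."

-- Formal statement as it might be handed to the model, with the proof left open.
-- Lean elaborates this but emits a warning, because `sorry` is not a real proof.
theorem toy_add_comm_open (a b : Nat) : a + b = b + a := by
  sorry

-- A completed proof that the Lean 4 checker accepts.
theorem toy_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

Under this setup a submission counts as correct only if the checker accepts it without relying on `sorry`, so a plausible-looking but incomplete argument earns no credit.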

Comments: Accepted at ICLR 2026 Workshop: VerifAI-2: The Second Workshop on AI Verification in the Wild. Live leaderboard hosted here: this https URL

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)

Cite as: arXiv:2603.26996 [cs.AI] (or arXiv:2603.26996v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.26996

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Nikil Ravi [view email] [v1] Fri, 27 Mar 2026 21:14:53 UTC (430 KB)

Original source

arXiv
