AI News Hub · Eigenvector

Do We Need Frontier Models to Verify Mathematical Proofs?

arXiv cs.LG · Submitted on 2 Apr 2026 · April 6, 2026


Abstract: Advances in training, post-training, and inference-time methods have enabled frontier reasoning models to win gold medals in math competitions and settle challenging open problems. Gaining trust in the responses of these models requires that natural language proofs be checked for errors. LLM judges are increasingly being adopted to meet the growing demand for evaluating such proofs. While verification is considered easier than generation, what model capability does reliable verification actually require? We systematically evaluate four open-source and two frontier LLMs on datasets of human-graded natural language proofs of competition-level problems. We consider two key metrics: verifier accuracy and self-consistency (the rate of agreement across repeated judgments on the same proof). We observe that smaller open-source models are only up to ~10% behind frontier models in accuracy but they are up to ~25% more inconsistent. Furthermore, we see that verifier accuracy is sensitive to prompt choice across all models. We then demonstrate that the smaller models, in fact, do possess the mathematical capabilities to verify proofs at the level of frontier models, but they struggle to reliably elicit these capabilities with general judging prompts. Through an LLM-guided prompt search, we synthesize an ensemble of specialized prompts that overcome the specific failure modes of smaller models, boosting their performance by up to 9.1% in accuracy and 15.9% in self-consistency. These gains are realized across models and datasets, allowing models like Qwen3.5-35B to perform on par with frontier models such as Gemini 3.1 Pro for proof verification.
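The abstract's two metrics can be made concrete with a small sketch. The paper does not specify its exact formulas, so the following is an illustrative interpretation: verifier accuracy as the fraction of proofs where the judge's majority verdict over repeated runs matches the human grade, and self-consistency as the mean pairwise agreement among repeated verdicts on the same proof. All function and variable names here are hypothetical.

```python
from itertools import combinations

def verifier_accuracy(judgments, labels):
    """Fraction of proofs where the majority verdict over repeated
    judgments matches the human-assigned correctness label."""
    correct = 0
    for verdicts, label in zip(judgments, labels):
        majority = sum(verdicts) > len(verdicts) / 2
        correct += (majority == label)
    return correct / len(labels)

def self_consistency(judgments):
    """Mean pairwise agreement rate across repeated judgments
    on the same proof (1.0 = the judge never changes its verdict)."""
    rates = []
    for verdicts in judgments:
        pairs = list(combinations(verdicts, 2))
        rates.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(rates) / len(rates)

# Two proofs, three repeated judgments each; human labels say
# proof 1 is valid and proof 2 is not.
judgments = [[True, True, True], [True, False, True]]
labels = [True, False]
acc = verifier_accuracy(judgments, labels)   # majority wrong on proof 2
cons = self_consistency(judgments)           # proof 2 drags agreement down
```

Under this reading, a judge can score well on accuracy while still being inconsistent, which is exactly the gap the paper reports between smaller open-source models and frontier models.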

Comments: 21 pages, 11 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as: arXiv:2604.02450 [cs.LG]

(or arXiv:2604.02450v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2604.02450

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Guruprerana Shabadi [v1] Thu, 2 Apr 2026 18:31:44 UTC (806 KB)
