You test your code. Why aren't you testing your AI instructions?
Why instruction quality matters more than model choice, and a tool to measure it.
Every team using AI coding tools writes instruction files. CLAUDE.md for Claude Code, AGENTS.md for Codex, copilot-instructions.md for GitHub Copilot, .cursorrules for Cursor. You spend time crafting these files, change a paragraph, push it, and hope for the best.
Your codebase has tests. Your APIs have contracts. Your AI instructions have hope.
I built agenteval to fix that.
The variable nobody is testing
A recent study tested three agent frameworks running the same model on 731 coding problems. Same model. Same tasks. The only difference was the instruction scaffolding.
The spread was 17 points.
We obsess over which model to use. Sonnet vs Opus vs GPT-5.4. But the instructions you give the model have a bigger effect on the outcome than the model itself. And nobody tests them.
Think about that. You wouldn't deploy an API without tests. You wouldn't ship a feature without CI. But the file that controls how your AI writes code? You edit it in a text editor and hope.
What goes wrong in instruction files
I've scanned a lot of instruction files at this point. The same problems show up everywhere.
Dead references
You renamed src/auth.ts to src/authentication.ts six months ago. Your instruction file still says "see src/auth.ts for the authentication module." The AI reads that instruction, looks for a file that doesn't exist, and gets confused.
This is the most common issue. Almost every instruction file over 3 months old has at least one dead reference.
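A check like this doesn't need an LLM; it's a file-existence lookup. Here's a minimal sketch of the idea (the regex heuristic is illustrative, not agenteval's actual parser):

```python
import re
from pathlib import Path

# Matches path-like tokens such as src/auth.ts or docs/schema.md.
# A simplified heuristic, not agenteval's internal reference parser.
PATH_RE = re.compile(r'\b[\w./-]+/[\w.-]+\.\w+\b')

def dead_references(instruction_file: str, repo_root: str = ".") -> list[str]:
    """Return referenced paths that no longer exist in the repo."""
    text = Path(instruction_file).read_text(encoding="utf-8")
    root = Path(repo_root)
    return [p for p in set(PATH_RE.findall(text)) if not (root / p).exists()]
```

Run it against a CLAUDE.md and every path it mentions gets checked against the working tree; anything renamed or deleted since the file was written shows up immediately.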
Filler that eats your context budget
"Make sure to always thoroughly test everything and ensure comprehensive coverage of all edge cases in a robust manner."
That sentence burns 25 tokens and says nothing the model doesn't already know. With a 200K context window and a 30% instruction budget, you have about 60,000 tokens. Every token spent on "make sure to" is a token not available for actual code context.
The worst offenders: "it is important that", "in order to", "at the end of the day", "make sure to", "please ensure that". They're everywhere.
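Flagging them is plain string matching. A minimal sketch (the phrase list here is the one from this article, not agenteval's full list):

```python
# Filler phrases to flag; taken from the examples above, not exhaustive.
FILLER = [
    "it is important that",
    "in order to",
    "at the end of the day",
    "make sure to",
    "please ensure that",
]

def filler_report(text: str) -> dict[str, int]:
    """Count occurrences of each filler phrase, case-insensitively."""
    lowered = text.lower()
    return {p: lowered.count(p) for p in FILLER if p in lowered}
```

Because the check is deterministic, the same file always produces the same report, which is what lets it run in CI without flakiness.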
Contradictions
"Always use semicolons" in your code style section. "Follow the Prettier config" three sections later, where Prettier removes semicolons. The model gets conflicting instructions and picks one at random.
It happens more than you'd think, especially in files maintained by multiple people over months.
Context budget overruns
Your CLAUDE.md is 300 lines. Your AGENTS.md is 200 lines. Your copilot-instructions.md is 150 lines. Together they consume 40% of your model's context window before a single line of code is loaded.
The AI's performance degrades uniformly as instruction count increases. It's not that later instructions get ignored. All instructions get followed less precisely.
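You can estimate the overrun without a real tokenizer: roughly four characters per token is a common rule of thumb for English prose. A sketch of the arithmetic (the 200K window and 30% budget come from the discussion above; the chars-per-token ratio is an approximation, not agenteval's exact count):

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic for English text, not an exact tokenizer

def budget_check(files: list[str], context_window: int = 200_000,
                 budget_ratio: float = 0.30) -> tuple[int, int, bool]:
    """Return (estimated instruction tokens, token budget, over budget?)."""
    total_chars = sum(len(Path(f).read_text(encoding="utf-8")) for f in files)
    est_tokens = total_chars // CHARS_PER_TOKEN
    budget = int(context_window * budget_ratio)
    return est_tokens, budget, est_tokens > budget
```

Point it at all the instruction files in a repo and you get a single number to argue about, instead of a vague sense that the files have grown.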
Overlap between files
Your CLAUDE.md says "use TypeScript strict mode, tabs for indentation." Your AGENTS.md says the same thing. That's duplicated instructions consuming double the tokens for zero additional value. Worse, when you update one copy and forget the other, they drift apart and contradict each other.
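Catching this kind of drift is also mechanical: normalize the lines of each file and intersect. A rough sketch, assuming line-level duplication is the signal you care about (the length threshold is an arbitrary filter for headings and bullets, not agenteval's logic):

```python
from pathlib import Path

def shared_lines(file_a: str, file_b: str) -> set[str]:
    """Return non-trivial lines that appear verbatim in both files."""
    def normalized(path: str) -> set[str]:
        return {
            line.strip().lower()
            for line in Path(path).read_text(encoding="utf-8").splitlines()
            if len(line.strip()) > 20  # skip blanks, headings, short bullets
        }
    return normalized(file_a) & normalized(file_b)
```

Anything this returns is an instruction being paid for twice, and a candidate for living in exactly one file.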
What agenteval does about it
agenteval is a CLI. You install it, run one command, and see what's wrong.
```shell
curl -fsSL https://raw.githubusercontent.com/lukasmetzler/agenteval/main/install.sh | bash
agenteval lint
```
It reads your instruction files, parses the markdown, counts tokens, checks file references, and reports real problems with actionable suggestions. No LLM in the loop. Deterministic. Runs in under a second.
Here's what it found on my own project the first time I ran it:
```
CLAUDE.md
  ERROR  Referenced file "docs/schema.md" does not exist
         → Remove the reference or create the missing file
  info   Section "Testing" contains 1 filler phrase(s)
         → Rewrite without phrases like 'make sure to'
  info   Vague instructions: "be careful with error handling"
         → Replace with a specific example or threshold
```
Every issue has a suggestion. You don't need to figure out what to do about it.
It supports every major instruction format:
- CLAUDE.md (Claude Code)
- AGENTS.md (OpenAI Codex)
- .github/copilot-instructions.md and scoped .instructions.md files (GitHub Copilot)
- .cursorrules (Cursor)
- .claude/skills/*/SKILL.md (Anthropic skills)
Beyond linting: measuring instruction quality over time
The linter catches problems statically. But what if you want to know whether your instruction changes actually made the AI perform better?
agenteval has a deeper pipeline for that:
- **Harvest** scans your git history for AI-assisted commits. It detects 14 tools (Claude, Copilot, Cursor, Devin, Aider, Amazon Q, Gemini, and more) and generates replayable benchmark tasks from them. Each task includes a snapshot of what your instruction files looked like at that commit. No synthetic test cases needed. Your own git history is the benchmark.
- **Run** gives a task to an AI agent in an isolated git worktree, captures what it produces, and scores the result on four dimensions: did it change the right files (correctness), did it only change what needed changing (precision), how many tokens did it use (efficiency), and did it follow conventions (conventions).
- **Compare** puts two runs side by side. Change your instruction files, re-run the same tasks, and see if the scores improved. If both runs have instruction snapshots, it shows exactly what changed in your instructions alongside the score delta.
- **CI** runs all your tasks and fails the build if instruction quality regresses. Add one line to your GitHub Actions workflow and instruction quality becomes a merge gate, just like tests:

  ```yaml
  - run: agenteval ci
  ```

  If someone changes the instructions in a PR and quality drops below the threshold, the build fails.
- **Live review** scores your working-tree changes before you commit. Are your changes focused or scattered? Did you update tests? Any debug artifacts left in? Add --analyze and it sends the diff to your AI tool for convention-compliance scoring.
- **Trends** tracks scores over time. Is your team getting better at writing instructions this quarter? Which tasks are improving? Which are regressing?
The uncomfortable question
How much time has your team spent debating which AI model to use? Now how much time have you spent testing whether your instruction files actually work?
The model is a commodity. The instructions are your competitive advantage.
Try it
```shell
curl -fsSL https://raw.githubusercontent.com/lukasmetzler/agenteval/main/install.sh | bash
agenteval lint
```
One command. Standalone binary. Works on Linux and macOS. Point it at any project with instruction files and see what it finds.
Repo: https://github.com/lukasmetzler/agenteval
If you try it, I'd love to hear what it catches and what checks are missing.