Same Instruction File, Same Score, Completely Different Failures
Two AI coding agents were given the same task with the same 10-rule instruction file. Both scored 70% adherence. Here's the breakdown:
| Rule | Agent A | Agent B |
| --- | --- | --- |
| camelCase variables | PASS | FAIL |
| No `any` type | FAIL | PASS |
| No console.log | FAIL | PASS |
| Named exports only | PASS | FAIL |
| Max 300 lines | PASS | FAIL |
| Test files exist | FAIL | PASS |
Agent A had a type safety gap: it used `any` for request parameters even though it had defined the correct types in its own types.ts file. Agent B had a structural discipline gap: it used snake_case for a variable, added a default export (favoring Express conventions over the project rules), and generated a 338-line file by adding features beyond the task scope.
Same score. Completely different engineering weaknesses. That table came from RuleProbe.
About this case study
The comparison uses simulated agent outputs with deliberate violations, not live agent runs. Raw JSON reports are in the repo under docs/case-study-data/. This is documented in the case study.
What RuleProbe is
RuleProbe is an open source CLI that reads AI coding agent instruction files and verifies whether the agent's output followed the rules. It covers six formats: CLAUDE.md, AGENTS.md, .cursorrules, copilot-instructions.md, GEMINI.md, and .windsurfrules.
Verification is deterministic. No LLM in the pipeline. The same input produces the same report every time.
GitHub Repo
How it checks
Three methods, depending on the rule:
AST analysis via ts-morph handles code structure: variable and function naming (camelCase), type and interface naming (PascalCase), type annotations (`any` detection), export style (named vs default), JSDoc presence on public functions, and import patterns (path aliases, deep relative imports).
Filesystem inspection handles file-level rules. File naming conventions (kebab-case) and whether test files exist for source files.
Regex handles content patterns like max line length.
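A regex-style content matcher is the simplest of the three. As a sketch (again, not RuleProbe's actual code), flagging `console.log` calls line by line looks like this:

```typescript
// Sketch of a regex-based content matcher: scan each line for a forbidden
// pattern and report 1-based line numbers. Illustrative only.
export interface LineViolation {
  line: number;
  found: string;
}

export function findConsoleLogs(source: string): LineViolation[] {
  const pattern = /console\.log\s*\(/;
  return source.split("\n").flatMap((text, i) => {
    return pattern.test(text) ? [{ line: i + 1, found: text.trim() }] : [];
  });
}
```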
v0.1.0 has 15 matchers across those three methods, covering TypeScript and JavaScript. ts-morph is the AST engine, so other languages aren't supported.
Output looks like this:
```
RuleProbe Adherence Report
Rules: 14 total | 11 passed | 3 failed | Score: 79%

PASS  naming/naming-camelcase-variables-5
PASS  naming/naming-pascalcase-types-7
FAIL  forbidden-pattern/forbidden-no-any-type-1
      src/handler.ts:12 - found: req: any
      src/handler.ts:24 - found: data: any
FAIL  forbidden-pattern/forbidden-no-console-log-10
      src/handler.ts:18 - found: console.log("handling request")
FAIL  test-requirement/test-files-exist-11
      src/handler.ts - found: no test file found
```
File, line, violation. No ambiguity.
The conservative parser
This is a design choice worth explaining. When RuleProbe reads an instruction file, it only extracts rules it can map to a deterministic mechanical check. Everything else gets reported as unparseable.
```bash
ruleprobe parse CLAUDE.md --show-unparseable
```
"Write clean code" is unparseable. "Use the repository pattern" is unparseable. "Handle errors gracefully" is unparseable. These can't be verified without judgment, and judgment means variance between runs. RuleProbe doesn't do that.
The tradeoff: a 30-rule instruction file might produce 12 verified rules and 18 unparseable ones. You see both counts so you know exactly what's being checked and what isn't.
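The conservative extraction step can be sketched as a whitelist: a rule line is kept only when it matches a known, mechanically checkable pattern, and everything else lands in the unparseable bucket. The patterns below are hypothetical, not RuleProbe's actual matcher table:

```typescript
// Sketch of conservative rule extraction. Only rules matching a known
// mechanical pattern are kept; the rest are reported as unparseable.
// The pattern table below is hypothetical, for illustration only.
const MECHANICAL_PATTERNS: Array<{ id: string; match: RegExp }> = [
  { id: "forbidden-no-any-type", match: /no\s+any\b/i },
  { id: "naming-camelcase-variables", match: /camelCase/ },
  { id: "forbidden-no-console-log", match: /console\.log/ },
];

export function classifyRules(lines: string[]) {
  const parsed: string[] = [];
  const unparseable: string[] = [];
  for (const line of lines) {
    const hit = MECHANICAL_PATTERNS.find((p) => p.match.test(line));
    (hit ? parsed : unparseable).push(line);
  }
  return { parsed, unparseable };
}
```

The key property is that the default is rejection: a rule must opt in by matching a pattern, so nothing ambiguous ever reaches the verifier.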
Running it
```bash
npx ruleprobe --help
```
Parse an instruction file
```bash
ruleprobe parse CLAUDE.md
```
```
Extracted 14 rules:

forbidden-no-any-type-1
  Category: forbidden-pattern
  Verifier: ast
  Pattern: no-any (*.ts)
  Source: "- TypeScript strict mode, no any types"

naming-kebab-case-files-4
  Category: naming
  Verifier: filesystem
  Pattern: kebab-case (*.ts)
  Source: "- File names: kebab-case"
```
Verify agent output
```bash
ruleprobe verify CLAUDE.md ./agent-output --format text
```
Supports --format json, --format markdown, and --format rdjson (reviewdog-compatible). Exit code 0 means all rules passed, 1 means violations found.
Compare two agents
```bash
ruleprobe compare AGENTS.md ./claude-output ./copilot-output \
  --agents claude,copilot --format markdown
```
CI with the GitHub Action
```yaml
name: RuleProbe
on: [pull_request]

jobs:
  check-rules:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: moonrunnerkc/[email protected]
        with:
          instruction-file: AGENTS.md
          output-dir: src
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
No external API keys. Posts results as a PR comment. Supports reviewdog rdjson for inline annotations if you use reviewdog in your pipeline. Exposes score, passed, failed, and total as step outputs, so you can gate merges on adherence thresholds in downstream steps.
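For example, a downstream step could use the `score` output to gate the job on an adherence threshold. The step id and the threshold of 80 below are illustrative, assuming the RuleProbe step was given `id: ruleprobe`:

```yaml
      # Assumes the RuleProbe step above has `id: ruleprobe`.
      # The threshold (80) is an example value, not a default.
      - name: Gate on adherence score
        if: ${{ steps.ruleprobe.outputs.score < 80 }}
        run: exit 1
```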
All action inputs
| Input | Default | What it does |
| --- | --- | --- |
| `instruction-file` | (required) | Path to your instruction file |
| `output-dir` | `src` | Directory of code to verify |
| `agent` | `ci` | Agent label for report metadata |
| `model` | `unknown` | Model label for report metadata |
| `format` | `text` | `text`, `json`, or `markdown` |
| `severity` | `all` | `error`, `warning`, or `all` |
| `fail-on-violation` | `true` | Fail the check if any rule is violated |
| `post-comment` | `true` | Post results as a PR comment |
| `reviewdog-format` | `false` | Also output rdjson |
Programmatic API
Five functions if you want to integrate verification into your own tooling:
```typescript
import {
  parseInstructionFile,
  verifyOutput,
  generateReport,
  formatReport,
  extractRules,
} from 'ruleprobe';
```
parseInstructionFile reads the instruction file. verifyOutput runs the rules. generateReport builds the adherence report with summary stats. formatReport renders it as text, JSON, markdown, or rdjson. extractRules works on raw markdown content if you don't have a file path.
What it doesn't cover
15 matchers is a starting point, not full coverage. Real instruction files have rules RuleProbe can't verify yet: architectural patterns, error handling conventions, dependency constraints, API design rules. The parser will tell you what it skipped.
TypeScript and JavaScript only. ts-morph is the AST engine. Other languages would need a different parser.
No automated agent invocation. You run the agent separately and point RuleProbe at the output directory.
Security and dependencies
RuleProbe never executes scanned code, never makes network calls, never writes to the scanned directory. Paths are resolved and bounded to process.cwd(). Symlinks outside the project are skipped by default.
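The path-bounding guarantee can be sketched in a few lines: resolve every candidate path against the root and reject anything that lands outside it. Illustrative only, not RuleProbe's actual code:

```typescript
// Sketch of bounding resolved paths to a project root so a scan can
// never escape it. Illustrative only -- not RuleProbe's actual code.
import * as path from "node:path";

export function isInsideRoot(candidate: string, root: string = process.cwd()): boolean {
  const resolved = path.resolve(root, candidate);
  // In-bounds only if the resolved path equals the root or sits beneath it.
  return resolved === root || resolved.startsWith(root + path.sep);
}
```

The `path.sep` suffix matters: a naive `startsWith(root)` check would wrongly accept a sibling directory like `/project-evil` when the root is `/project`.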
Four runtime dependencies: chalk 5.6.2, commander 12.1.0, glob 11.1.0, ts-morph 24.0.0. All pinned to exact versions. No semver ranges.
npm: ruleprobe | MIT license
RuleProbe
Verify whether AI coding agents actually follow the instruction files they're given
Why
Every AI coding agent reads an instruction file. None of them prove they followed it.
You write CLAUDE.md or AGENTS.md with specific rules: camelCase variables, no any types, named exports only, test files for every source file. The agent says "Done." But did it actually follow them? Your code review catches some violations, misses others, and doesn't scale.
RuleProbe reads the same instruction file, extracts the machine-verifiable rules, and checks agent output against each one. Binary pass/fail, with file paths and line numbers as evidence. No LLM evaluation, no judgment calls. Deterministic and reproducible.
Quick Start
```bash
npm install -g ruleprobe
```
Or run it directly:
```bash
npx ruleprobe --help
```
Parse an instruction file to see what rules RuleProbe can extract:
```bash
ruleprobe parse CLAUDE.md
```
Source: https://dev.to/moonrunnerkc/same-instruction-file-same-score-completely-different-failures-46fp