
Same Instruction File, Same Score, Completely Different Failures

Dev.to AI · by Brad Kinnard · April 7, 2026 · 6 min read

Two AI coding agents were given the same task with the same 10-rule instruction file. Both scored 70% adherence. Here's the breakdown:

| Rule | Agent A | Agent B |
|---|---|---|
| camelCase variables | PASS | FAIL |
| No any type | FAIL | PASS |
| No console.log | FAIL | PASS |
| Named exports only | PASS | FAIL |
| Max 300 lines | PASS | FAIL |
| Test files exist | FAIL | PASS |

Agent A had a type-safety gap: it used any for request parameters even though it had defined the correct types in its own types.ts file. Agent B had a structural-discipline gap: it used snake_case for a variable, added a default export (following Express convention instead of the project rule), and produced a 338-line file by adding features beyond the task scope.

Same score. Completely different engineering weaknesses. That table came from RuleProbe.

About this case study

The comparison uses simulated agent outputs with deliberate violations, not live agent runs. Raw JSON reports are in the repo under docs/case-study-data/. This is documented in the case study.

What RuleProbe is

RuleProbe is an open source CLI that reads AI coding agent instruction files and verifies whether the agent's output followed the rules. It covers six formats: CLAUDE.md, AGENTS.md, .cursorrules, copilot-instructions.md, GEMINI.md, and .windsurfrules.

Verification is deterministic. No LLM in the pipeline. The same input produces the same report every time.

GitHub Repo

How it checks

Three methods, depending on the rule:

AST analysis via ts-morph handles code structure. Variable and function naming (camelCase), type and interface naming (PascalCase), type annotations (any detection), export style (named vs default), JSDoc presence on public functions, and import patterns (path aliases, deep relative imports).

Filesystem inspection handles file-level rules. File naming conventions (kebab-case) and whether test files exist for source files.

Regex handles content patterns like max line length.

v0.1.0 has 15 matchers across those three methods, covering TypeScript and JavaScript. ts-morph is the AST engine, so other languages aren't supported.

Output looks like this:

```
RuleProbe Adherence Report
Rules: 14 total | 11 passed | 3 failed | Score: 79%

PASS naming/naming-camelcase-variables-5
PASS naming/naming-pascalcase-types-7
FAIL forbidden-pattern/forbidden-no-any-type-1
  src/handler.ts:12 - found: req: any
  src/handler.ts:24 - found: data: any
FAIL forbidden-pattern/forbidden-no-console-log-10
  src/handler.ts:18 - found: console.log("handling request")
FAIL test-requirement/test-files-exist-11
  src/handler.ts - found: no test file found
```

File, line, violation. No ambiguity.

The conservative parser

This is a design choice worth explaining. When RuleProbe reads an instruction file, it only extracts rules it can map to a deterministic mechanical check. Everything else gets reported as unparseable.

```
ruleprobe parse CLAUDE.md --show-unparseable
```

"Write clean code" is unparseable. "Use the repository pattern" is unparseable. "Handle errors gracefully" is unparseable. These can't be verified without judgment, and judgment means variance between runs. RuleProbe doesn't do that.

The tradeoff: a 30-rule instruction file might produce 12 verified rules and 18 unparseable ones. You see both counts so you know exactly what's being checked and what isn't.
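The conservative-parsing idea can be sketched in a few lines. The pattern table below is entirely my own illustration, not RuleProbe's actual parser: a rule line either matches a known deterministic matcher, or it maps to nothing and gets reported as unparseable.

```typescript
// Hypothetical sketch of conservative rule extraction; the patterns and
// rule ids here are invented for illustration.
type ParsedRule = { id: string; verifier: "ast" | "filesystem" | "regex" };

const KNOWN: Array<[RegExp, ParsedRule]> = [
  [/no any types?/i, { id: "forbidden-no-any-type", verifier: "ast" }],
  [/camelCase variables?/i, { id: "naming-camelcase-variables", verifier: "ast" }],
  [/kebab-case/i, { id: "naming-kebab-case-files", verifier: "filesystem" }],
];

function mapRule(line: string): ParsedRule | null {
  for (const [pattern, rule] of KNOWN) {
    if (pattern.test(line)) return rule;
  }
  // No deterministic check exists for this line: report it as unparseable
  // rather than guessing.
  return null;
}

console.log(mapRule("- TypeScript strict mode, no any types")); // maps to an ast rule
console.log(mapRule("- Write clean code")); // null -> unparseable
```

The key property is that the mapping is a pure function of the rule text: there is no model in the loop, so the same instruction file always extracts the same rule set.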

Running it

```
npx ruleprobe --help
```

Parse an instruction file

```
ruleprobe parse CLAUDE.md
```

Extracted 14 rules:

```
forbidden-no-any-type-1
  Category: forbidden-pattern
  Verifier: ast
  Pattern: no-any (*.ts)
  Source: "- TypeScript strict mode, no any types"

naming-kebab-case-files-4
  Category: naming
  Verifier: filesystem
  Pattern: kebab-case (*.ts)
  Source: "- File names: kebab-case"
```

Verify agent output

```
ruleprobe verify CLAUDE.md ./agent-output --format text
```

Supports --format json, --format markdown, and --format rdjson (reviewdog-compatible). Exit code 0 means all rules passed, 1 means violations found.

Compare two agents

```
ruleprobe compare AGENTS.md ./claude-output ./copilot-output \
  --agents claude,copilot --format markdown
```

CI with the GitHub Action

```yaml
name: RuleProbe
on: [pull_request]
jobs:
  check-rules:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: moonrunnerkc/[email protected]
        with:
          instruction-file: AGENTS.md
          output-dir: src
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

No external API keys. Posts results as a PR comment. Supports reviewdog rdjson for inline annotations if you use reviewdog in your pipeline. Exposes score, passed, failed, and total as step outputs, so you can gate merges on adherence thresholds in downstream steps.
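Gating on the step outputs could look something like the following sketch. The step id and the 90% threshold are my own choices, not from the docs; the article only states that score, passed, failed, and total are exposed as outputs.

```yaml
# Hedged sketch: read the action's step outputs in a follow-up step and
# gate the job on an adherence-score threshold.
- id: ruleprobe
  uses: moonrunnerkc/[email protected]
  with:
    instruction-file: AGENTS.md
    output-dir: src
    # Let the gate step below decide, instead of failing on any violation.
    fail-on-violation: false
- name: Gate on adherence score
  if: ${{ steps.ruleprobe.outputs.score < 90 }}
  run: |
    echo "Adherence ${{ steps.ruleprobe.outputs.score }}% is below the 90% threshold"
    exit 1
```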

All action inputs

| Input | Default | What it does |
|---|---|---|
| instruction-file | (required) | Path to your instruction file |
| output-dir | src | Directory of code to verify |
| agent | ci | Agent label for report metadata |
| model | unknown | Model label for report metadata |
| format | text | text, json, or markdown |
| severity | all | error, warning, or all |
| fail-on-violation | true | Fail the check if any rule is violated |
| post-comment | true | Post results as a PR comment |
| reviewdog-format | false | Also output rdjson |

Programmatic API

Five functions if you want to integrate verification into your own tooling:

```typescript
import {
  parseInstructionFile,
  verifyOutput,
  generateReport,
  formatReport,
  extractRules
} from 'ruleprobe';
```

parseInstructionFile reads the instruction file. verifyOutput runs the rules. generateReport builds the adherence report with summary stats. formatReport renders it as text, JSON, markdown, or rdjson. extractRules works on raw markdown content if you don't have a file path.

What it doesn't cover

15 matchers is a starting point, not full coverage. Real instruction files have rules RuleProbe can't verify yet: architectural patterns, error handling conventions, dependency constraints, API design rules. The parser will tell you what it skipped.

TypeScript and JavaScript only. ts-morph is the AST engine. Other languages would need a different parser.

No automated agent invocation. You run the agent separately and point RuleProbe at the output directory.

Security and dependencies

RuleProbe never executes scanned code, never makes network calls, never writes to the scanned directory. Paths are resolved and bounded to process.cwd(). Symlinks outside the project are skipped by default.
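The path-bounding idea is simple to sketch with Node's standard path module. This is my own minimal illustration of the technique, not RuleProbe's code: resolve every candidate path against the working directory and reject anything that escapes it.

```typescript
import * as path from "path";

// Hedged sketch: a path is "inside" the project only if, after resolution,
// the relative path back from cwd neither climbs upward nor jumps roots.
function isInsideCwd(candidate: string, cwd: string = process.cwd()): boolean {
  const resolved = path.resolve(cwd, candidate);
  const relative = path.relative(cwd, resolved);
  // An escaping path yields a relative path starting with ".." (or, on
  // Windows, an absolute path on another drive). cwd itself is excluded.
  return relative !== "" && !relative.startsWith("..") && !path.isAbsolute(relative);
}

console.log(isInsideCwd("src/handler.ts")); // true
console.log(isInsideCwd("../secrets.env")); // false
```

Comparing the resolved path, rather than the raw string, is what defeats traversal tricks like src/../../etc.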

Four runtime dependencies: chalk 5.6.2, commander 12.1.0, glob 11.1.0, ts-morph 24.0.0. All pinned to exact versions. No semver ranges.

npm: ruleprobe | MIT license

RuleProbe

Verify whether AI coding agents actually follow the instruction files they're given

Why

Every AI coding agent reads an instruction file. None of them prove they followed it.

You write CLAUDE.md or AGENTS.md with specific rules: camelCase variables, no any types, named exports only, test files for every source file. The agent says "Done." But did it actually follow them? Your code review catches some violations, misses others, and doesn't scale.

RuleProbe reads the same instruction file, extracts the machine-verifiable rules, and checks agent output against each one. Binary pass/fail, with file paths and line numbers as evidence. No LLM evaluation, no judgment calls. Deterministic and reproducible.

Quick Start

```
npm install -g ruleprobe
```

Or run it directly:

```
npx ruleprobe --help
```

Parse an instruction file to see what rules RuleProbe can extract:

```
ruleprobe parse CLAUDE.md
```

