
Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

arXiv | Submitted on 26 Mar 2026 | March 30, 2026

arXiv:2603.25764v1 (Announce Type: cross), Aman Mehta


Abstract: As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude 4.5 Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks $\times$ 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2%) and highest accuracy (58%), GPT-5 is intermediate (CV: 32.2%, accuracy: 32%), and Llama shows the highest variance (CV: 47.0%) with the lowest accuracy (4%). However, within a model, consistency can amplify both correct and incorrect interpretations. Our analysis reveals a critical nuance: consistency amplifies outcomes rather than guaranteeing correctness. Of Claude's failures, 71% stem from a "consistent wrong interpretation": making the same incorrect assumption across all runs. Interestingly, GPT-5 reaches early strategic agreement similar to Claude's (diverging at step 3.4 vs. 3.2) but exhibits 2.1$\times$ higher variance, suggesting that divergence timing alone does not determine consistency. These findings suggest that for production deployment, interpretation accuracy matters more than execution consistency, with implications for agent evaluation and training.
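The two consistency measures named in the abstract, the coefficient of variation (CV) and the step at which runs diverge, can be illustrated with a minimal Python sketch. The abstract does not say which quantity the CV is computed over or how divergence is detected, so this sketch assumes the CV is taken over the number of steps per run and that divergence is the first step at which any two runs choose a different action. The function names and the action traces are hypothetical and are not data from the paper.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return std / mean as a percentage (population standard deviation)."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * pstdev(values) / mu


def first_divergence_step(runs):
    """Return the 1-based index of the first step where the runs disagree.

    If all runs share a common prefix and differ only in length, the
    divergence step is one past the shortest run. Returns None when all
    runs are identical.
    """
    shortest = min(len(r) for r in runs)
    for i in range(shortest):
        # More than one distinct action at step i means the runs diverged.
        if len({r[i] for r in runs}) > 1:
            return i + 1
    if any(len(r) != shortest for r in runs):
        return shortest + 1
    return None


# Hypothetical action traces for one task, five runs (assumed, not from the paper).
runs = [
    ["search", "open", "edit", "test"],
    ["search", "open", "edit", "test", "submit"],
    ["search", "open", "test", "edit", "test", "submit"],
    ["search", "open", "edit", "test"],
    ["search", "open", "edit", "edit", "test"],
]

# Assumption: CV is computed over the number of steps taken in each run.
step_counts = [len(r) for r in runs]
print(f"CV of step counts: {coefficient_of_variation(step_counts):.1f}%")
print(f"First divergence at step: {first_divergence_step(runs)}")
```

Under these assumptions, a per-model figure such as the reported 3.4 vs. 3.2 would be an average of per-task divergence steps. Note that neither measure says anything about whether the shared behavior is correct, which is the abstract's point about consistency amplifying outcomes.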

Comments: 8 pages, 8 figures

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.25764 [cs.SE]

(or arXiv:2603.25764v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2603.25764

arXiv-issued DOI via DataCite

Submission history

From: Aman Mehta
[v1] Thu, 26 Mar 2026 04:39:13 UTC (1,003 KB)
