Research Papers research paper arxiv ai artificial-intelligence

Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

arXivby [Submitted on 26 Mar 2026]March 30, 20262 min read1 views

arXiv:2603.25764v1 Announce Type: cross Abstract: As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude~4.5~Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks $\times$ 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest v — Aman Mehta

View PDF HTML (experimental)

Abstract:As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude~~4.5~~Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks $\times$ 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2%) and highest accuracy (58%), GPT-5 is intermediate (CV: 32.2%, accuracy: 32%), and Llama shows the highest variance (CV: 47.0%) with lowest accuracy (4%). However, within a model, consistency can amplify both correct and incorrect interpretations. Our analysis reveals a critical nuance: \textbf{consistency amplifies outcomes rather than guaranteeing correctness}. 71% of Claude's failures stem from "consistent wrong interpretation": making the same incorrect assumption across all runs. Interestingly, GPT-5 achieves similar early strategic agreement as Claude (diverging at step 3.4 vs.\ 3.2) but exhibits 2.1$\times$ higher variance, suggesting that divergence timing alone does not determine consistency. These findings suggest that for production deployment, interpretation accuracy matters more than execution consistency, with implications for agent evaluation and training.

Comments: 8 pages, 8 figures

Subjects:

Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.25764 [cs.SE]

(or arXiv:2603.25764v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2603.25764

arXiv-issued DOI via DataCite

Submission history

From: Aman Mehta [view email] [v1] Thu, 26 Mar 2026 04:39:13 UTC (1,003 KB)

Original source

arXiv

https://arxiv.org/abs/2603.25764

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Research PapersLive

Instance-Optimality in PageRank Computation

arXiv:2512.16087v2 Announce Type: replace Abstract: We study the problem of estimating a vertex's PageRank within a constant relative error, with constant probability. We prove that an adaptive variant of a simple, classic algorithm is instance-optimal up to a polylogarithmic factor for all directed graphs of order $n$ whose maximum in- and out-degrees are at most a constant fraction of $n$. In other words, there is no correct algorithm that can be faster than our algorithm on any such graph by more than a polylogarithmic factor. We further extend the instance-optimality to all graphs in which at most a polylogarithmic number of vertices have unbounded degrees. This covers all sparse graphs with $\tilde{O}(n)$ edges. Finally, we provide a counterexample showing that our algorithm is not in

arXiv cs.DS

1mabout 2 hours ago

ReleasesLive

Local Node Differential Privacy

arXiv:2602.15802v2 Announce Type: replace Abstract: We initiate an investigation of node differential privacy for graphs in the local model of private data analysis. In our model, dubbed LNDP*, each node sees its own edge list and releases the output of a local randomizer on this input. These outputs are aggregated by an untrusted server to obtain a final output. We develop a novel algorithmic framework for this setting that allows us to accurately answer arbitrary linear queries about the input graph's degree distribution. Our framework is based on a new object, called the blurry degree distribution, which closely approximates the degree distribution and has lower sensitivity. Instead of answering queries about the degree distribution directly, our algorithms answer queries about the blur

arXiv cs.DS

2mabout 2 hours ago

ReleasesLive

Fully Dynamic Euclidean k-Means

arXiv:2507.11256v4 Announce Type: replace Abstract: We consider the Euclidean $k$-means clustering problem in a dynamic setting, where we have to explicitly maintain a solution (a set of $k$ centers) $S \subseteq \mathbb{R}^d$ subject to point insertions/deletions in $\mathbb{R}^d$. We present a dynamic algorithm for Euclidean $k$-means with $\mathrm{poly}(1/\epsilon)$-approximation ratio, $\tilde{O}(k^{\epsilon})$ update time, and $\tilde{O}(1)$ recourse, for any $\epsilon \in (0,1)$, even when $d$ and $k$ are both part of the input. This is the first algorithm to achieve a constant ratio with $o(k)$ update time for this problem, whereas the previous $O(1)$-approximation runs in $\tilde O(k)$ update time [Bhattacharya, Costa, Farokhnejad; STOC'25]. In fact, previous algorithms cannot go b

arXiv cs.DS

2mabout 2 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 224 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersLive

Instance-Optimality in PageRank Computation

arXiv cs.DS

1mabout 2 hours ago

Research PapersLive

Near-Optimal Distributed Ruling Sets for Trees and High-Girth Graphs

arXiv:2504.21777v2 Announce Type: replace Abstract: Given a graph $G=(V,E)$, a $\beta$-ruling set is a subset $S\subseteq V$ that is i) independent, and ii) every node $v\in V$ has a node of $S$ within distance $\beta$. In this paper we present almost optimal distributed algorithms for finding ruling sets in trees and high girth graphs in the classic LOCAL model. As our first contribution we present an $O(\log\log n)$-round randomized algorithm for computing $2$-ruling sets on trees, almost matching the $\Omega(\log\log n/\log\log\log n)$ lower bound given by Balliu et al. [FOCS'20]. Second, we show that $2$-ruling sets can be solved in $\widetilde{O}(\log^{5/3}\log n)$ rounds in high-girth graphs. Lastly, we show that $O(\log\log\log n)$-ruling sets can be computed in $\widetilde{O}(\log\

arXiv cs.DS

1mabout 2 hours ago

Research PapersLive

Branch-and-Bound Algorithms as Polynomial-time Approximation Schemes

arXiv:2504.15885v3 Announce Type: replace Abstract: Branch-and-bound algorithms (B&B) and polynomial-time approximation schemes (PTAS) are two seemingly distant areas of combinatorial optimization. We intend to (partially) bridge the gap between them while expanding the boundary of theoretical knowledge on the B\&B framework. Branch-and-bound algorithms typically guarantee that an optimal solution is eventually found. However, we show that the standard implementation of branch-and-bound for certain knapsack and scheduling problems also exhibits PTAS-like behavior, yielding increasingly better solutions within polynomial time. Our findings are supported by computational experiments and comparisons with benchmark methods. This paper is an extended version of a paper accepted at ICALP 2025

arXiv cs.DS

1mabout 2 hours ago

Research PapersLive

On the Dynamics of Linear Finite Dynamical Systems Over Galois Rings

arXiv:2604.01548v1 Announce Type: cross Abstract: Linear finite dynamical systems play an important role, for example, in coding theory and simulations. Methods for analyzing such systems are often restricted to cases in which the system is defined over a field %and usually strive to achieve a complete description of the system and its dynamics. or lack practicability to effectively analyze the system's dynamical behavior. However, when analyzing and prototyping finite dynamical systems, it is often desirable to quickly obtain basic information such as the length of cycles and transients that appear in its dynamics, which is reflected in the structure of the connected components of the corresponding functional graphs. In this paper, we extend the analysis of the dynamics of linear finite d

arXiv cs.DS

1mabout 2 hours ago