Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy
arXiv:2603.25764v1 Announce Type: cross Abstract: As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude~4.5~Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks $\times$ 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest v — Aman Mehta
View PDF HTML (experimental)
Abstract:As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude
4.5Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks $\times$ 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2%) and highest accuracy (58%), GPT-5 is intermediate (CV: 32.2%, accuracy: 32%), and Llama shows the highest variance (CV: 47.0%) with lowest accuracy (4%). However, within a model, consistency can amplify both correct and incorrect interpretations. Our analysis reveals a critical nuance: \textbf{consistency amplifies outcomes rather than guaranteeing correctness}. 71% of Claude's failures stem from "consistent wrong interpretation": making the same incorrect assumption across all runs. Interestingly, GPT-5 achieves similar early strategic agreement as Claude (diverging at step 3.4 vs.\ 3.2) but exhibits 2.1$\times$ higher variance, suggesting that divergence timing alone does not determine consistency. These findings suggest that for production deployment, interpretation accuracy matters more than execution consistency, with implications for agent evaluation and training.
Comments: 8 pages, 8 figures
Subjects:
Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.25764 [cs.SE]
(or arXiv:2603.25764v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2603.25764
arXiv-issued DOI via DataCite
Submission history
From: Aman Mehta [view email] [v1] Thu, 26 Mar 2026 04:39:13 UTC (1,003 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
researchpaperarxiv
Instance-Optimality in PageRank Computation
arXiv:2512.16087v2 Announce Type: replace Abstract: We study the problem of estimating a vertex's PageRank within a constant relative error, with constant probability. We prove that an adaptive variant of a simple, classic algorithm is instance-optimal up to a polylogarithmic factor for all directed graphs of order $n$ whose maximum in- and out-degrees are at most a constant fraction of $n$. In other words, there is no correct algorithm that can be faster than our algorithm on any such graph by more than a polylogarithmic factor. We further extend the instance-optimality to all graphs in which at most a polylogarithmic number of vertices have unbounded degrees. This covers all sparse graphs with $\tilde{O}(n)$ edges. Finally, we provide a counterexample showing that our algorithm is not in

Local Node Differential Privacy
arXiv:2602.15802v2 Announce Type: replace Abstract: We initiate an investigation of node differential privacy for graphs in the local model of private data analysis. In our model, dubbed LNDP*, each node sees its own edge list and releases the output of a local randomizer on this input. These outputs are aggregated by an untrusted server to obtain a final output. We develop a novel algorithmic framework for this setting that allows us to accurately answer arbitrary linear queries about the input graph's degree distribution. Our framework is based on a new object, called the blurry degree distribution, which closely approximates the degree distribution and has lower sensitivity. Instead of answering queries about the degree distribution directly, our algorithms answer queries about the blur

Fully Dynamic Euclidean k-Means
arXiv:2507.11256v4 Announce Type: replace Abstract: We consider the Euclidean $k$-means clustering problem in a dynamic setting, where we have to explicitly maintain a solution (a set of $k$ centers) $S \subseteq \mathbb{R}^d$ subject to point insertions/deletions in $\mathbb{R}^d$. We present a dynamic algorithm for Euclidean $k$-means with $\mathrm{poly}(1/\epsilon)$-approximation ratio, $\tilde{O}(k^{\epsilon})$ update time, and $\tilde{O}(1)$ recourse, for any $\epsilon \in (0,1)$, even when $d$ and $k$ are both part of the input. This is the first algorithm to achieve a constant ratio with $o(k)$ update time for this problem, whereas the previous $O(1)$-approximation runs in $\tilde O(k)$ update time [Bhattacharya, Costa, Farokhnejad; STOC'25]. In fact, previous algorithms cannot go b
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Research Papers

Instance-Optimality in PageRank Computation
arXiv:2512.16087v2 Announce Type: replace Abstract: We study the problem of estimating a vertex's PageRank within a constant relative error, with constant probability. We prove that an adaptive variant of a simple, classic algorithm is instance-optimal up to a polylogarithmic factor for all directed graphs of order $n$ whose maximum in- and out-degrees are at most a constant fraction of $n$. In other words, there is no correct algorithm that can be faster than our algorithm on any such graph by more than a polylogarithmic factor. We further extend the instance-optimality to all graphs in which at most a polylogarithmic number of vertices have unbounded degrees. This covers all sparse graphs with $\tilde{O}(n)$ edges. Finally, we provide a counterexample showing that our algorithm is not in

Near-Optimal Distributed Ruling Sets for Trees and High-Girth Graphs
arXiv:2504.21777v2 Announce Type: replace Abstract: Given a graph $G=(V,E)$, a $\beta$-ruling set is a subset $S\subseteq V$ that is i) independent, and ii) every node $v\in V$ has a node of $S$ within distance $\beta$. In this paper we present almost optimal distributed algorithms for finding ruling sets in trees and high girth graphs in the classic LOCAL model. As our first contribution we present an $O(\log\log n)$-round randomized algorithm for computing $2$-ruling sets on trees, almost matching the $\Omega(\log\log n/\log\log\log n)$ lower bound given by Balliu et al. [FOCS'20]. Second, we show that $2$-ruling sets can be solved in $\widetilde{O}(\log^{5/3}\log n)$ rounds in high-girth graphs. Lastly, we show that $O(\log\log\log n)$-ruling sets can be computed in $\widetilde{O}(\log\

Branch-and-Bound Algorithms as Polynomial-time Approximation Schemes
arXiv:2504.15885v3 Announce Type: replace Abstract: Branch-and-bound algorithms (B&B) and polynomial-time approximation schemes (PTAS) are two seemingly distant areas of combinatorial optimization. We intend to (partially) bridge the gap between them while expanding the boundary of theoretical knowledge on the B\&B framework. Branch-and-bound algorithms typically guarantee that an optimal solution is eventually found. However, we show that the standard implementation of branch-and-bound for certain knapsack and scheduling problems also exhibits PTAS-like behavior, yielding increasingly better solutions within polynomial time. Our findings are supported by computational experiments and comparisons with benchmark methods. This paper is an extended version of a paper accepted at ICALP 2025

On the Dynamics of Linear Finite Dynamical Systems Over Galois Rings
arXiv:2604.01548v1 Announce Type: cross Abstract: Linear finite dynamical systems play an important role, for example, in coding theory and simulations. Methods for analyzing such systems are often restricted to cases in which the system is defined over a field %and usually strive to achieve a complete description of the system and its dynamics. or lack practicability to effectively analyze the system's dynamical behavior. However, when analyzing and prototyping finite dynamical systems, it is often desirable to quickly obtain basic information such as the length of cycles and transients that appear in its dynamics, which is reflected in the structure of the connected components of the corresponding functional graphs. In this paper, we extend the analysis of the dynamics of linear finite d


Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!