Decision-Centric Design for LLM Systems
arXiv:2604.00414v1 Announce Type: new
Abstract: LLM systems must make control decisions in addition to generating outputs: whether to answer, clarify, retrieve, call tools, repair, or escalate. In many current architectures, these decisions remain implicit within generation, entangling assessment and action in a single model call and making failures hard to inspect, constrain, or repair. We propose a decision-centric framework that separates decision-relevant signals from the policy that maps them to actions, turning control into an explicit and inspectable layer of the system. This separation supports attribution of failures to signal estimation, decision policy, or execution, and enables modular improvement of each component. It unifies familiar single-step settings such as routing and adaptive inference, and extends naturally to sequential settings in which actions alter the information available before acting. Across three controlled experiments, the framework reduces futile actions, improves task success, and reveals interpretable failure modes. More broadly, it offers a general architectural principle for building more reliable, controllable, and diagnosable LLM systems.
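The separation the abstract proposes — decision-relevant signals estimated apart from an explicit policy that maps them to actions — can be sketched in code. This is a minimal illustrative sketch, not the paper's implementation: the signal names, thresholds, and action set here are assumptions chosen to mirror the actions the abstract lists.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    """Control actions named in the abstract (a subset, for illustration)."""
    ANSWER = auto()
    CLARIFY = auto()
    RETRIEVE = auto()
    ESCALATE = auto()

@dataclass
class Signals:
    """Decision-relevant signals, estimated separately from generation.
    These names and their meanings are hypothetical, not from the paper."""
    answer_confidence: float  # estimated chance the model can answer correctly
    query_ambiguity: float    # estimated ambiguity of the user request
    knowledge_gap: float      # estimated need for external retrieval

def decide(s: Signals) -> Action:
    """Explicit, inspectable policy mapping signals to actions.
    Thresholds are illustrative placeholders. Because this mapping is a
    plain function rather than implicit in a model call, a failure can be
    attributed either to signal estimation (bad inputs to `decide`) or to
    the policy itself (a bad mapping), as the abstract describes."""
    if s.query_ambiguity > 0.6:
        return Action.CLARIFY
    if s.knowledge_gap > 0.5:
        return Action.RETRIEVE
    if s.answer_confidence < 0.3:
        return Action.ESCALATE
    return Action.ANSWER

if __name__ == "__main__":
    print(decide(Signals(answer_confidence=0.9,
                         query_ambiguity=0.1,
                         knowledge_gap=0.1)).name)
```

In a sequential setting, the same policy would be re-invoked after each action, since (as the abstract notes) actions like retrieval alter the signals available before the next decision.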
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2604.00414 [cs.AI]
(or arXiv:2604.00414v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2604.00414
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Wei Sun [v1] Wed, 1 Apr 2026 02:57:23 UTC (54 KB)