30 results for "Tool Use"

Agentic Tool Use in Large Language Models
arXiv:2604.00835v1 Announce Type: new Abstract: Large language models are increasingly being deployed as autonomous agents, yet their real-world effectiveness depends on reliable tools for information retrieval, computation, and external action. Existing studies remain fragmented across tasks, tool types, and training settings, lacking a unified view of how tool-use methods differ and evolve. This paper organizes the literature into three paradigms: prompting as plug-and-play, supervised tool learning, and reward-driven tool policy learning, and analyzes their methods, strengths, and failure modes — Jinchao Hu (Harbin Institute of Technology Shenzhen, Shenzhen, China), Meizhi Zhong (TikTok Inc, Beijing, China), Kehai Chen (Harbin Institute of Technology Shenzhen, Shenzhen, China), Xuefeng Bai (Harbin Institute of Technology Shenzhen, Shenzhen, China), Min Zhang (Harbin Institute of Technology Shenzhen, Shenzhen, China)
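The first paradigm the survey names, prompting as plug-and-play, can be illustrated with a minimal sketch: the model emits a JSON tool call as text, and a thin harness parses and executes it. The tool names and registry below are hypothetical, not from the paper.

```python
import json

# Hypothetical tool registry: the model selects a tool by name in its reply.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "lookup": lambda key: {"capital_of_france": "Paris"}.get(key, "unknown"),
}

def run_tool_call(model_reply: str) -> str:
    """Parse a JSON tool call emitted by the model and execute it."""
    call = json.loads(model_reply)  # e.g. {"tool": "calculator", "arg": "2+3"}
    fn = TOOLS[call["tool"]]
    return fn(call["arg"])

print(run_tool_call('{"tool": "calculator", "arg": "2+3"}'))  # -> 5
```

The other two paradigms differ in where this selection logic is learned rather than prompted: supervised tool learning trains the call format into the weights, and reward-driven policy learning optimizes when and whether to call at all.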

ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks
arXiv:2505.23752v3 Announce Type: replace Abstract: Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark designed to evaluate LLM-driven agents on remote sensing tasks via structured tool use and multi-step planning. — Akashah Shabbir, Muhammad Akhtar Munir, Akshay Dudhane, Muhammad Umer Sheikh, Muhammad Haris Khan, Paolo Fraccaro, Juan Bernabe Moreno, Fahad Shahbaz Khan, Salman Khan

The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
arXiv:2603.22862v2 Announce Type: replace Abstract: Tool use enables large language models (LLMs) to access external information, invoke software systems, and act in digital environments beyond what can be solved from model parameters alone. Early research mainly studied whether a model could select and execute a correct single tool call. As agent systems evolve, however, the central problem has shifted from isolated invocation to multi-tool orchestration over long trajectories with intermediate state, execution feedback, changing environments, and practical constraints such as safety, cost, and verifiability. We comprehensively review recent progress in multi-tool LLM agents and analyze the state of the art in this rapidly developing area. First, we unify task formulations.
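The shift the abstract describes, from one-shot invocation to orchestration with intermediate state, can be sketched as a chain where each tool consumes the previous tool's output. The tools and lookup table below are illustrative stand-ins, not from the survey.

```python
# Hypothetical two-tool chain: the output of one call feeds the next,
# with execution feedback carried as intermediate state.
def search(query):
    """Stand-in for a retrieval tool."""
    return {"cheap sota agents": "GLM-5"}.get(query, "")

def summarize(text):
    """Stand-in for a second tool consuming the first tool's output."""
    return f"summary: {text}" if text else "summary: nothing found"

def orchestrate(query):
    state = {"query": query}
    state["hit"] = search(state["query"])       # tool call 1
    state["answer"] = summarize(state["hit"])   # tool call 2, reads intermediate state
    return state["answer"]

print(orchestrate("cheap sota agents"))  # summary: GLM-5
```

Even this toy chain shows why orchestration is harder than single calls: the second step's correctness depends on the first step's output, so errors compound along the trajectory.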

Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods
arXiv:2510.16609v2 Announce Type: replace Abstract: Test-time augmentation, such as Retrieval-Augmented Generation (RAG) or tool use, critically depends on an interplay between a model's parametric knowledge and externally retrieved information. However, the theoretical underpinnings of this relationship remain poorly understood. Specifically, it is not clear how much pre-training knowledge is required to answer queries with a small number of augmentation steps, which is a desirable property in practice. To address this question, we formulate multi-step reasoning as an $s$-$t$ connectivity problem — Avrim Blum, Daniel Hsu, Cyrus Rashtchian, Donya Saless
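The $s$-$t$ connectivity framing can be made concrete: if each augmentation step expands the frontier by one hop in a knowledge graph, a query is answerable within $k$ steps exactly when the answer lies within $k$ hops of what the model already knows. A toy bounded-BFS sketch (the graph and node names are illustrative, not from the paper):

```python
from collections import deque

def reachable_within(graph, s, t, max_steps):
    """BFS: is t reachable from s in at most max_steps edge traversals?
    Each traversal stands in for one retrieval/augmentation step."""
    frontier, seen = deque([(s, 0)]), {s}
    while frontier:
        node, depth = frontier.popleft()
        if node == t:
            return True
        if depth == max_steps:
            continue  # step budget exhausted along this path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False

# Toy knowledge graph: facts the model would chain via retrieval.
kg = {"query": ["fact_a"], "fact_a": ["fact_b"], "fact_b": ["answer"]}
print(reachable_within(kg, "query", "answer", 3))  # True
print(reachable_within(kg, "query", "answer", 2))  # False
```

More parametric knowledge corresponds to extra edges (or a denser graph), which is what lets sublinear numbers of augmentation steps suffice.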

Arcee AI Releases Trinity Large Thinking: An Apache 2.0 Open Reasoning Model for Long-Horizon Agents and Tool Use
The landscape of open-source artificial intelligence has shifted from purely generative models toward systems capable of complex, multi-step reasoning. While proprietary reasoning models have dominated the conversation, Arcee AI has released Trinity Large Thinking, an open-weight reasoning model distributed under the Apache 2.0 license, positioning it as a transparent alternative for developers. The post appeared first on MarkTechPost.

Gemma 4 is efficient with thinking tokens, but it will also happily reason for 10+ minutes if you prompt it to do so.
Tested both 26b and 31b in AI Studio. The task I asked of them was to crack a cypher. The top closed-source models can crack this cypher at max thinking parameters, and Kimi 2.5 Thinking and Deepseek 3.2 are the only open-source models to crack the cypher without tool use. (Of course, with the closed models you can't rule out 'secret' tool use on the backend.) When I first asked these models to crack the cypher, they thought for a short amount of time and then both hallucinated false 'translations' of the cypher. I added this to my prompt: Spare no effort to solve this, the stakes are high. Increase your thinking length to maximum in order to solve it. Double-check and verify your results to rule out hallucination of an incorrect response. I did not expect dramatic results.

Open Models have crossed a threshold
💡 TL;DR: Open models like GLM-5 and MiniMax M2.7 now match closed frontier models on core agent tasks — file operations, tool use, and instruction following — at a fraction of the cost and latency. Here's what our evals show and how to start using them.

CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
arXiv:2512.17312v2 Announce Type: replace Abstract: Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code — Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, Yunqing Zhao

Cognitive Friction: A Decision-Theoretic Framework for Bounded Deliberation in Tool-Using Agents
arXiv:2603.30031v2 Announce Type: replace Abstract: Autonomous tool-using agents operating in networked environments must decide which information source to query and when to stop querying and act. Without principled bounds on information-acquisition costs, unconstrained agents exhibit systematic failure modes: excessive tool use under congestion, prolonged deliberation under time decay, and brittle behavior under ambiguous evidence. We propose the Triadic Cognitive Architecture (TCA), a unified decision-theoretic framework that formalizes these failure modes through the concept of Cognitive Friction — Davide Di Gioia
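The query-or-act decision the abstract describes has a standard decision-theoretic shape: consult another source only while the expected reduction in loss exceeds the cost of the query. A minimal greedy sketch under that (assumed) formalization — the numbers and cost model are illustrative, not the paper's TCA:

```python
def should_query(value_of_info: float, query_cost: float) -> bool:
    """Query another source only if the expected gain beats its cost."""
    return value_of_info > query_cost

def deliberate(sources, act_loss):
    """Greedy bounded deliberation: consult sources while worthwhile, then act.
    sources: list of (expected_loss_reduction, cost) pairs, best first."""
    total_cost, remaining_loss = 0.0, act_loss
    for gain, cost in sources:
        if not should_query(gain, cost):
            break  # stop deliberating: further queries cost more than they help
        total_cost += cost
        remaining_loss -= gain
    return remaining_loss + total_cost  # total expected loss incurred

# Two cheap informative sources, then a costly marginal one the agent skips.
print(deliberate([(0.4, 0.1), (0.2, 0.1), (0.05, 0.2)], act_loss=1.0))
```

The failure modes the abstract lists map onto this sketch directly: congestion inflates `query_cost`, time decay shrinks `gain` as deliberation drags on, and ambiguous evidence makes the `gain` estimates themselves unreliable.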

HippoCamp: Benchmarking Contextual Agents on Personal Computers
arXiv:2604.01221v1 Announce Type: new Abstract: We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning.

Liquid AI releases LFM2.5-350M -> Agentic loops at 350M parameters
LFM2.5-350M by Liquid AI was trained for reliable data extraction and tool use. At <500MB when quantized, it is built for environments where compute, memory, and latency are particularly constrained. Trained on 28T tokens with scaled RL, it outperforms larger models like Qwen3.5-0.8B in most benchmarks.

The AI industry loves token inflation. Your company shouldn’t
The AI industry has a quiet addiction problem: it is addicted to tokens. Every new generation of agentic AI seems to assume that the answer to complexity is to throw more context at the model, keep longer histories, spawn more calls, loop over more tools, and let the token meter run wild. The rise of agentic systems, and now projects like OpenClaw, makes that temptation even stronger. Once you give models more autonomy, they do not just consume tokens to answer questions. They consume them to plan, reflect, retry, summarize, call tools, inspect outputs, and keep themselves on track. OpenClaw itself describes the product as an "agent-native" gateway with sessions, memory, tool use, and multi-agent routing across messaging platforms — which tells you exactly where this is going: more autonomy.
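The article's complaint can be enforced mechanically: give the agent loop a hard token budget and stop when it is spent, rather than letting the meter run. A minimal sketch (the budget policy and step costs are illustrative, not from any product named above):

```python
def run_with_budget(steps, budget_tokens):
    """Execute planned agent steps until the token budget is exhausted.
    steps: list of (label, token_cost) pairs an agent plans to execute."""
    spent, executed = 0, []
    for label, cost in steps:
        if spent + cost > budget_tokens:
            break  # stop the loop instead of letting the token meter run wild
        spent += cost
        executed.append(label)
    return executed, spent

plan = [("retrieve", 800), ("reflect", 1200), ("retry", 1500), ("summarize", 700)]
done, used = run_with_budget(plan, budget_tokens=2500)
print(done, used)  # ['retrieve', 'reflect'] 2000
```

A real system would also need to decide which steps to drop or compress when the budget binds, but even a hard cap like this turns token spend from an emergent behavior into an explicit engineering parameter.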

PaperVoyager: Building Interactive Web with Visual Language Models
arXiv:2603.22999v2 Announce Type: replace Abstract: Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing — Dasen Dai, Biao Wu, Meng Fang, Wenhao Wang
