Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools

arXiv cs.SE. Submitted on 9 Oct 2025 (v1), last revised 1 Apr 2026 (this version, v3). Posted April 3, 2026.

Abstract: Android is the largest mobile platform, yet automatically building applications remains a practical challenge. While Large Language Models (LLMs) show promise for code repair, their use for fixing Android build errors remains underexplored. To address this gap, we first introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects. Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible. Second, we propose GradleFixer, an LLM agent with domain-specific tools for inspecting and manipulating the Gradle build environment. GradleFixer achieves a resolve rate of 81.4% (pass@1), significantly outperforming a state-of-the-art coding agent that relies on a general-purpose shell. GradleFixer's success suggests that while LLMs possess the high-level knowledge to solve these failures, they struggle to translate this knowledge into effective low-level actions using a general-purpose shell. We demonstrate the effectiveness of a strategy we term Tool Bridging, which replaces general-purpose shell commands with domain-aware abstractions. We hypothesize this approach works through two mechanisms: 1) it provides tools in an API-like format that LLMs use more reliably, and 2) it constrains the action space to relevant operations. This approach bridges the gap between the model's high-level reasoning and effective low-level execution.
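As a rough illustration of the Tool Bridging idea described in the abstract, the Python sketch below exposes a small registry of named, parameter-checked tools in place of a free-form shell. Everything in it is hypothetical: the names `Tool`, `ToolRegistry`, `set_gradle_property`, `build_registry`, and the dictionary-shaped call format are assumptions made for illustration and are not taken from GradleFixer, whose actual tool set the abstract does not enumerate. The sketch only aims to show the two hypothesized mechanisms: an API-like call format, and an action space restricted to relevant operations.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Tool:
    """A domain-aware action with a fixed, named parameter list (hypothetical)."""
    name: str
    description: str
    params: Tuple[str, ...]
    run: Callable[..., str]


def set_gradle_property(properties_text: str, key: str, value: str) -> str:
    """Return the contents of a gradle.properties file with `key` set to `value`.

    An existing assignment is replaced in place; otherwise the pair is appended.
    """
    pattern = re.compile(rf"\s*{re.escape(key)}\s*=")
    out, found = [], False
    for line in properties_text.splitlines():
        if pattern.match(line):
            out.append(f"{key}={value}")
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


class ToolRegistry:
    """Dispatches structured calls; anything not registered is rejected."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def dispatch(self, call: dict) -> str:
        name = call.get("tool")
        if name not in self._tools:
            # Constrained action space: no fallback to arbitrary shell commands.
            raise ValueError(f"unknown tool: {name!r}")
        tool = self._tools[name]
        args = call.get("args", {})
        if set(args) != set(tool.params):
            # API-like format: the call must supply exactly the declared parameters.
            raise ValueError(f"{name} expects parameters {tool.params}")
        return tool.run(**args)


def build_registry() -> ToolRegistry:
    """Assemble a registry containing the single illustrative tool."""
    registry = ToolRegistry()
    registry.register(Tool(
        name="set_gradle_property",
        description="Set one key in gradle.properties",
        params=("properties_text", "key", "value"),
        run=set_gradle_property,
    ))
    return registry
```

In this sketch, a model would emit a call such as `{"tool": "set_gradle_property", "args": {...}}` rather than composing a `sed` or `echo` command line. A call naming any unregistered tool, for example `"shell"`, raises `ValueError` instead of executing.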

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Cite as: arXiv:2510.08640 [cs.SE] (or arXiv:2510.08640v3 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2510.08640

arXiv-issued DOI via DataCite

Submission history

From: Ha Min Son
[v1] Thu, 9 Oct 2025 01:33:25 UTC (83 KB)
[v2] Wed, 19 Nov 2025 18:46:34 UTC (79 KB)
[v3] Wed, 1 Apr 2026 19:05:59 UTC (81 KB)

Original source

arXiv cs.SE

https://arxiv.org/abs/2510.08640
