Automated Functional Testing for Malleable Mobile Application Driven from User Intent

arXiv cs.SE, by Yuying Wang, Kaifeng Huang, Hao Deng, Zhiyuan Sun, Jinxuan Zhou, Shengjie Zhao. April 3, 2026

Abstract: Software malleability allows applications to be easily changed, configured, and adapted even after deployment. While prior work has explored configurable systems, adaptive recommender systems, and malleable GUIs, these approaches are often tailored to specific software and lack generalizability. In this work, we envision per-user malleable mobile applications, where end-users can specify requirements that are automatically implemented via LLM-based code generation. However, realizing this vision requires overcoming the key challenge of designing automated test generation that can reliably verify both the presence and correctness of user-specified functionalities. We propose \tool, a user-requirement-driven GUI test generation framework that incrementally navigates the UI, triggers desired functionalities, and constructs LLM-guided oracles to validate correctness. We build a benchmark spanning six popular mobile applications with both correct and faulty user-requested functionalities, demonstrating that \tool effectively validates per-user features and is practical for real-world deployment. Our work highlights the feasibility of shifting mobile app development from a product-manager-driven to an end-user-driven paradigm.
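The three stages named in the abstract (navigate the UI, trigger the requested functionality, validate with an oracle) can be illustrated with a toy sketch. This is not the authors' implementation: the dictionary-based UI model, the names find_path and run_test, and the breadth-first search are all assumptions made for illustration, and the oracle is a plain Python callable standing in for the LLM-guided oracle the paper describes.

```python
from collections import deque
from typing import Callable, Dict, List, Optional

# Hypothetical toy model: each screen maps an action label to the next screen.
UiGraph = Dict[str, Dict[str, str]]


def find_path(ui: UiGraph, start: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest action sequence from start to target."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        screen, actions = queue.popleft()
        if screen == target:
            return actions
        for action, nxt in ui.get(screen, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return None  # the target screen is unreachable


def run_test(
    ui: UiGraph,
    start: str,
    target: str,
    trigger: Callable[[], dict],
    oracle: Callable[[dict], bool],
) -> str:
    """Return 'missing' if the feature cannot be reached, else 'pass' or 'fail'."""
    if find_path(ui, start, target) is None:
        return "missing"        # presence check failed
    observed = trigger()        # exercise the requested functionality
    return "pass" if oracle(observed) else "fail"  # correctness check


# Illustrative requirement: "add a dark-mode toggle under Settings".
ui = {"home": {"tap_settings": "settings"}, "settings": {"tap_theme": "theme"}}


def oracle(state: dict) -> bool:
    # Stand-in for an LLM judgment of the observed app state.
    return state.get("theme") == "dark"


correct_result = run_test(ui, "home", "theme", lambda: {"theme": "dark"}, oracle)
faulty_result = run_test(ui, "home", "theme", lambda: {"theme": "light"}, oracle)
missing_result = run_test(ui, "home", "export", lambda: {}, oracle)
```

The three outcomes mirror the abstract's distinction between verifying presence (can the feature be reached at all) and correctness (does it behave as requested), with the correct and faulty variants playing the role of the benchmark's two kinds of implementations.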

Subjects: Software Engineering (cs.SE)

Cite as: arXiv:2604.02079 [cs.SE]

(or arXiv:2604.02079v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2604.02079

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yuying Wang [v1] Thu, 2 Apr 2026 14:10:11 UTC (1,023 KB)

Original source: arXiv cs.SE, https://arxiv.org/abs/2604.02079
