
Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition

arXiv cs.SE · by Denis Neumüller, Sebastian Boll, David Schüler, Matthias Tichy · April 6, 2026


Abstract:

Context: Since it is well-established that developers spend a substantial portion of their time understanding source code, the ability to automatically identify algorithms within source code presents a valuable opportunity. This capability can support program comprehension, facilitate maintenance, and enhance overall software quality.

Objective: We empirically evaluate how combining LLMs with static code analysis can improve the automated recognition of algorithms, while also evaluating their standalone performance and dependence on identifier names.

Method: We perform multiple experiments evaluating the combination of LLMs with static analysis using different filter patterns. We compare this combined approach against their standalone performance under various prompting strategies and investigate the impact of systematic identifier obfuscation on classification performance and runtime.

Results: The combination of LLMs with lightweight static analysis performs surprisingly well, reducing required LLM calls by 72.39-97.50% depending on the filter pattern. This not only lowers runtime significantly but also improves F1-scores by up to 12 percentage points (pp) compared to the baseline. Regarding the different prompting strategies, in-context learning with two examples provides an effective trade-off between classification performance and runtime efficiency, achieving F1-scores of 75-77% with only a modest increase in inference time. Lastly, we find that LLMs are not solely dependent on name-information as they are still able to identify most algorithm implementations when identifiers are obfuscated.

Conclusion: By combining LLMs with static analysis, we achieve substantial reductions in runtime while simultaneously improving F1-scores, underscoring the value of a hybrid approach.
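To make the hybrid idea in the abstract concrete, here is a minimal Python sketch of a filter-then-classify pipeline. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not describe the actual filter patterns, the target language, or the LLM interface. The nested-loop pattern `has_nested_loop`, the function `recognize`, the placeholder classifier `stub_llm`, and the `SAMPLE` source are all hypothetical names introduced here.

```python
import ast
from typing import Callable, Dict, Tuple


def has_nested_loop(func: ast.FunctionDef) -> bool:
    """Hypothetical filter pattern: True if any loop contains another loop."""
    loops = (ast.For, ast.While)
    for outer in ast.walk(func):
        if isinstance(outer, loops):
            for inner in ast.walk(outer):
                if inner is not outer and isinstance(inner, loops):
                    return True
    return False


def recognize(
    source: str,
    llm_classify: Callable[[str], str],
    pattern: Callable[[ast.FunctionDef], bool] = has_nested_loop,
) -> Tuple[Dict[str, str], float]:
    """Send only functions matching the static pattern to the classifier.

    Returns the labels and the percentage of LLM calls avoided compared
    with calling the classifier once per function.
    """
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    labels: Dict[str, str] = {}
    calls = 0
    for func in funcs:
        if pattern(func):  # cheap static check runs first
            calls += 1
            snippet = ast.get_source_segment(source, func) or ""
            labels[func.name] = llm_classify(snippet)
    reduction = 100.0 * (1 - calls / len(funcs)) if funcs else 0.0
    return labels, reduction


def stub_llm(code: str) -> str:
    """Placeholder for a real LLM call (assumption: returns one label)."""
    return "bubble sort"


SAMPLE = '''
def f1(xs):
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

def f2(name):
    return "hello " + name

def f3(xs):
    s = 0
    for x in xs:
        s += x
    return s

def f4(x):
    return x
'''

if __name__ == "__main__":
    labels, reduction = recognize(SAMPLE, stub_llm)
    print(labels, reduction)
```

In this toy input only one of four functions passes the filter, so three of four classifier calls are skipped. The figures reported in the abstract come from the authors' own filter patterns and dataset, which this sketch does not reproduce.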

Subjects:

Software Engineering (cs.SE)

Cite as: arXiv:2604.03048 [cs.SE]

(or arXiv:2604.03048v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2604.03048

arXiv-issued DOI via DataCite

Submission history

From: Denis Neumüller

[v1] Fri, 3 Apr 2026 13:56:39 UTC (356 KB)

Original source

arXiv cs.SE

https://arxiv.org/abs/2604.03048


