Do Audio-Visual Large Language Models Really See and Hear?
arXiv:2604.02605v1 Announce Type: new Abstract: Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through different layers of an AVLLM to produce the final text outputs. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues. We further trace this imbalance to training: the AVLLM's audio behavior strongly ma
View PDF HTML (experimental)
Abstract:Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through different layers of an AVLLM to produce the final text outputs. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues. We further trace this imbalance to training: the AVLLM's audio behavior strongly matches its vision-language base model, indicating limited additional alignment to audio supervision. Our findings reveal a fundamental modality bias in AVLLMs and provide new mechanistic insights into how multimodal LLMs integrate audio and vision.
Comments: CVPR Findings
Subjects:
Artificial Intelligence (cs.AI); Sound (cs.SD)
Cite as: arXiv:2604.02605 [cs.AI]
(or arXiv:2604.02605v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2604.02605
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Ramaneswaran Selvakumar [view email] [v1] Fri, 3 Apr 2026 00:48:49 UTC (14,633 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
modellanguage modeltraining
On Signal Peak Power Constraint of Over-the-Air Federated Learning
arXiv:2512.23381v2 Announce Type: replace Abstract: Federated learning (FL) has been considered a promising privacy preserving distributed edge learning framework. Over-the-air computation (AirComp) leveraging analog transmission enables the aggregation of local updates directly over-the-air by exploiting the superposition properties of wireless multiple-access channels, thereby alleviating the communication bottleneck issues of FL compared with digital transmission schemes. This work points out that existing AirComp-FL overlooks a key practical constraint, the instantaneous peak-power constraints due to the non-linearity of radio-frequency power amplifiers. Operating directly in non-linear region causes in-band and out-of-band distortions. We present and analyze the effect of the default

Enhanced Direction-Sensing Methods and Performance Analysis in Low-Altitude Wireless Network via a Rotation Antenna Array
arXiv:2603.20784v3 Announce Type: replace Abstract: Due to the directive property of each antenna element, the received signal power can be severely attenuated when the emitter deviates from the array boresight, which will lead to a severe degradation in sensing performance along the corresponding direction. Although existing rotatable array sensing methods such as recursive rotation (RR-Root-MUSIC) can mitigate this issue by iteratively rotating and sensing, several mechanical rotations and repeated eigendecomposition operations are required to yield a high computational complexity and low time-efficiency. To address this problem, a pre-rotation initialization with recieve power as a rule is proposed to signifcantly reduce the computational complexity and improve the time-efficiency. Usin

BIRA: A Spherical Bistatic Radar Reflectivity Measurement System
arXiv:2407.13749v5 Announce Type: replace Abstract: The upcoming 6G mobile communication standard will offer a revolutionary new feature: Integrated sensing and communication (ISAC) reuses mobile communication signals to realize multi-static radar for various applications including localization. Consequently, applied ISAC propagation research necessitates to evolve from classical monostatic radar cross section (RCS) measurement of static targets on to bistatic radar reflectivity characterization of dynamic objects. Here, we introduce our Bistatic Radar (BIRA) measurement facility for independent spherical positioning of two probes with sub-millimeter accuracy on a diameter of up to 7 m and with almost continuous frequency coverage from 0.7 up to 260 GHz. Currently, BIRA is the only bistati
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Models

Enhanced Direction-Sensing Methods and Performance Analysis in Low-Altitude Wireless Network via a Rotation Antenna Array
arXiv:2603.20784v3 Announce Type: replace Abstract: Due to the directive property of each antenna element, the received signal power can be severely attenuated when the emitter deviates from the array boresight, which will lead to a severe degradation in sensing performance along the corresponding direction. Although existing rotatable array sensing methods such as recursive rotation (RR-Root-MUSIC) can mitigate this issue by iteratively rotating and sensing, several mechanical rotations and repeated eigendecomposition operations are required to yield a high computational complexity and low time-efficiency. To address this problem, a pre-rotation initialization with recieve power as a rule is proposed to signifcantly reduce the computational complexity and improve the time-efficiency. Usin

PAFT: Preservation Aware Fine-Tuning for Minimal-Edit Program Repair
arXiv:2604.03113v1 Announce Type: new Abstract: Large language models (LLMs) are effective for automated program repair, but plausible patches that pass the full test suite often rewrite more code than necessary, increasing review and maintenance costs. This over-editing is common because most bugs are localized, while standard supervised fine-tuning provides no explicit signal about which tokens should be preserved and which should be changed. We propose PAFT, a preservation-aware fine-tuning method for minimal-edit program repair. PAFT derives token-level preservation signals by aligning buggy and fixed code, combines them with full-sequence masking, and applies an edit-difficulty curriculum. Across Defects4J and HumanEval-Java, PAFT improves pass@1 by up to 65.6% over standard supervise

SkillRT: Compiling Skills for Efficient Execution Everywhere
arXiv:2604.03088v1 Announce Type: new Abstract: LLM agents increasingly adopt skills as a reusable unit of composition. While skills are shared across diverse agent platforms, current systems treat them as raw context, causing the same skill to behave inconsistently for different agents. This fragility undermines skill portability and execution efficiency. To address this challenge, we analyze 118,000 skills and draw inspiration from traditional compiler design. We treat skills as code and LLMs as heterogeneous processors. To make portability actionable, we decompose a skill's requirements into a set of primitive capabilities, and measure how well each model-harness pair supports them. Based on these capability profiles, we propose SkillRT, a compilation and runtime system designed for por

Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition
arXiv:2604.03048v1 Announce Type: new Abstract: Context: Since it is well-established that developers spend a substantial portion of their time understanding source code, the ability to automatically identify algorithms within source code presents a valuable opportunity. This capability can support program comprehension, facilitate maintenance, and enhance overall software quality. Objective: We empirically evaluate how combining LLMs with static code analysis can improve the automated recognition of algorithms, while also evaluating their standalone performance and dependence on identifier names. Method: We perform multiple experiments evaluating the combination of LLMs with static analysis using different filter patterns. We compare this combined approach against their standalone perform


Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!