Models model language model training announce feature insight

Do Audio-Visual Large Language Models Really See and Hear?

ArXiv CS.AIby Ramaneswaran Selvakumar, Kaousheik Jayakumar, S Sakshi, Sreyan Ghosh, Ruohan Gao, Dinesh ManochaApril 6, 20261 min read0 views

Source Quiz

arXiv:2604.02605v1 Announce Type: new Abstract: Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through different layers of an AVLLM to produce the final text outputs. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues. We further trace this imbalance to training: the AVLLM's audio behavior strongly ma

View PDF HTML (experimental)

Abstract:Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through different layers of an AVLLM to produce the final text outputs. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues. We further trace this imbalance to training: the AVLLM's audio behavior strongly matches its vision-language base model, indicating limited additional alignment to audio supervision. Our findings reveal a fundamental modality bias in AVLLMs and provide new mechanistic insights into how multimodal LLMs integrate audio and vision.

Comments: CVPR Findings

Subjects:

Artificial Intelligence (cs.AI); Sound (cs.SD)

Cite as: arXiv:2604.02605 [cs.AI]

(or arXiv:2604.02605v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2604.02605

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Ramaneswaran Selvakumar [view email] [v1] Fri, 3 Apr 2026 00:48:49 UTC (14,633 KB)

Original source

ArXiv CS.AI

https://arxiv.org/abs/2604.02605

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

modellanguage modeltraining

ReleasesFresh

On Signal Peak Power Constraint of Over-the-Air Federated Learning

arXiv:2512.23381v2 Announce Type: replace Abstract: Federated learning (FL) has been considered a promising privacy preserving distributed edge learning framework. Over-the-air computation (AirComp) leveraging analog transmission enables the aggregation of local updates directly over-the-air by exploiting the superposition properties of wireless multiple-access channels, thereby alleviating the communication bottleneck issues of FL compared with digital transmission schemes. This work points out that existing AirComp-FL overlooks a key practical constraint, the instantaneous peak-power constraints due to the non-linearity of radio-frequency power amplifiers. Operating directly in non-linear region causes in-band and out-of-band distortions. We present and analyze the effect of the default

arXiv eess.SP

2mabout 3 hours ago

ModelsFresh

Enhanced Direction-Sensing Methods and Performance Analysis in Low-Altitude Wireless Network via a Rotation Antenna Array

arXiv:2603.20784v3 Announce Type: replace Abstract: Due to the directive property of each antenna element, the received signal power can be severely attenuated when the emitter deviates from the array boresight, which will lead to a severe degradation in sensing performance along the corresponding direction. Although existing rotatable array sensing methods such as recursive rotation (RR-Root-MUSIC) can mitigate this issue by iteratively rotating and sensing, several mechanical rotations and repeated eigendecomposition operations are required to yield a high computational complexity and low time-efficiency. To address this problem, a pre-rotation initialization with recieve power as a rule is proposed to signifcantly reduce the computational complexity and improve the time-efficiency. Usin

arXiv eess.SP

2mabout 3 hours ago

ProductsFresh

BIRA: A Spherical Bistatic Radar Reflectivity Measurement System

arXiv:2407.13749v5 Announce Type: replace Abstract: The upcoming 6G mobile communication standard will offer a revolutionary new feature: Integrated sensing and communication (ISAC) reuses mobile communication signals to realize multi-static radar for various applications including localization. Consequently, applied ISAC propagation research necessitates to evolve from classical monostatic radar cross section (RCS) measurement of static targets on to bistatic radar reflectivity characterization of dynamic objects. Here, we introduce our Bistatic Radar (BIRA) measurement facility for independent spherical positioning of two probes with sub-millimeter accuracy on a diameter of up to 7 m and with almost continuous frequency coverage from 0.7 up to 260 GHz. Currently, BIRA is the only bistati

arXiv eess.SP

1mabout 3 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 282 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Models

ModelsFresh

Enhanced Direction-Sensing Methods and Performance Analysis in Low-Altitude Wireless Network via a Rotation Antenna Array

arXiv eess.SP

2mabout 3 hours ago

ModelsFresh

PAFT: Preservation Aware Fine-Tuning for Minimal-Edit Program Repair

arXiv:2604.03113v1 Announce Type: new Abstract: Large language models (LLMs) are effective for automated program repair, but plausible patches that pass the full test suite often rewrite more code than necessary, increasing review and maintenance costs. This over-editing is common because most bugs are localized, while standard supervised fine-tuning provides no explicit signal about which tokens should be preserved and which should be changed. We propose PAFT, a preservation-aware fine-tuning method for minimal-edit program repair. PAFT derives token-level preservation signals by aligning buggy and fixed code, combines them with full-sequence masking, and applies an edit-difficulty curriculum. Across Defects4J and HumanEval-Java, PAFT improves pass@1 by up to 65.6% over standard supervise

arXiv cs.SE

1mabout 3 hours ago

ModelsFresh

SkillRT: Compiling Skills for Efficient Execution Everywhere

arXiv:2604.03088v1 Announce Type: new Abstract: LLM agents increasingly adopt skills as a reusable unit of composition. While skills are shared across diverse agent platforms, current systems treat them as raw context, causing the same skill to behave inconsistently for different agents. This fragility undermines skill portability and execution efficiency. To address this challenge, we analyze 118,000 skills and draw inspiration from traditional compiler design. We treat skills as code and LLMs as heterogeneous processors. To make portability actionable, we decompose a skill's requirements into a set of primitive capabilities, and measure how well each model-harness pair supports them. Based on these capability profiles, we propose SkillRT, a compilation and runtime system designed for por

arXiv cs.SE

2mabout 3 hours ago

ModelsFresh

Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition

arXiv:2604.03048v1 Announce Type: new Abstract: Context: Since it is well-established that developers spend a substantial portion of their time understanding source code, the ability to automatically identify algorithms within source code presents a valuable opportunity. This capability can support program comprehension, facilitate maintenance, and enhance overall software quality. Objective: We empirically evaluate how combining LLMs with static code analysis can improve the automated recognition of algorithms, while also evaluating their standalone performance and dependence on identifier names. Method: We perform multiple experiments evaluating the combination of LLMs with static analysis using different filter patterns. We compare this combined approach against their standalone perform

arXiv cs.SE

2mabout 3 hours ago