
Do Audio-Visual Large Language Models Really See and Hear?

ArXiv CS.AI · by Ramaneswaran Selvakumar, Kaousheik Jayakumar, S Sakshi, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha · April 6, 2026

Abstract: Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through different layers of an AVLLM to produce the final text outputs. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues. We further trace this imbalance to training: the AVLLM's audio behavior strongly matches its vision-language base model, indicating limited additional alignment to audio supervision. Our findings reveal a fundamental modality bias in AVLLMs and provide new mechanistic insights into how multimodal LLMs integrate audio and vision.
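The probing analysis described in the abstract can be illustrated with a minimal sketch: train a linear probe on hidden states from each layer and compare how decodable an audio label is at different depths. Everything below is synthetic and hypothetical — the layer names, the label, and the activations stand in for real AVLLM hidden states, which in practice would be collected via forward hooks on the model; the probe here is a simple closed-form ridge classifier, not the paper's exact method.

```python
# Hedged sketch of layer-wise linear probing. All data is synthetic:
# "middle" carries a strong audio signal, "late" a suppressed one,
# mimicking the paper's finding that deeper fusion layers wash out audio.
import numpy as np

rng = np.random.default_rng(0)

def probe_accuracy(features, labels, reg=1e-2):
    """Fit a closed-form ridge classifier on +/-1 targets and return
    training accuracy -- a rough proxy for how linearly decodable the
    label is from these activations."""
    X = np.hstack([features, np.ones((len(features), 1))])  # append bias column
    y = np.where(labels == 1, 1.0, -1.0)
    w = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)
    preds = (X @ w) > 0
    return float((preds == (labels == 1)).mean())

n, d = 200, 32
audio_label = rng.integers(0, 2, size=n)  # hypothetical audio-event label

# Simulated activations: the audio signal lives in one feature dimension,
# strong at the middle layer and attenuated at the late (fusion) layer.
signal = np.pad(audio_label[:, None] * 2.0, ((0, 0), (0, d - 1)))
layers = {
    "early": rng.normal(size=(n, d)),
    "middle": rng.normal(size=(n, d)) + signal,
    "late": rng.normal(size=(n, d)) + 0.1 * signal,
}

for name, acts in layers.items():
    print(f"{name}: probe accuracy = {probe_accuracy(acts, audio_label):.2f}")
```

On this synthetic setup the middle-layer probe scores well above chance while the early and late layers hover near it, which is the shape of evidence the paper uses to argue that latent audio information exists mid-network but fails to survive deeper fusion.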

Comments: CVPR Findings

Subjects: Artificial Intelligence (cs.AI); Sound (cs.SD)

Cite as: arXiv:2604.02605 [cs.AI]

(or arXiv:2604.02605v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2604.02605

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Ramaneswaran Selvakumar [v1] Fri, 3 Apr 2026 00:48:49 UTC (14,633 KB)
