
Speech LLMs are Contextual Reasoning Transcribers

arXiv cs.CL · by Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li · April 2, 2026 · 1 min read

Abstract: Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
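
The abstract describes the CTC-guided Modality Adapter only at a high level. As an illustration, a minimal PyTorch sketch of one plausible reading, in which per-frame non-blank probabilities from a CTC head gate the speech encoder features before they are projected into the LLM's textual embedding space, could look like the following (class and argument names are hypothetical, not taken from the paper):

import torch
import torch.nn as nn

class CTCGuidedModalityAdapter(nn.Module):
    """Hypothetical sketch: weight encoder frames by their CTC non-blank
    probability, then project them into the LLM's textual embedding space."""
    def __init__(self, encoder_dim, llm_dim, vocab_size, blank_id=0):
        super().__init__()
        self.ctc_head = nn.Linear(encoder_dim, vocab_size)  # per-frame CTC logits
        self.proj = nn.Linear(encoder_dim, llm_dim)          # map to LLM embedding size
        self.blank_id = blank_id

    def forward(self, enc_out):
        # enc_out: (batch, frames, encoder_dim) speech encoder outputs
        ctc_probs = self.ctc_head(enc_out).softmax(dim=-1)   # (B, T, vocab)
        non_blank = 1.0 - ctc_probs[..., self.blank_id]       # (B, T)
        # Frames the CTC head judges as blank are down-weighted, so the LLM
        # mainly attends to frames that carry token content.
        weighted = enc_out * non_blank.unsqueeze(-1)
        return self.proj(weighted)                             # (B, T, llm_dim)

# Example: adapt 512-dim encoder outputs to a 4096-dim LLM embedding space.
adapter = CTCGuidedModalityAdapter(encoder_dim=512, llm_dim=4096, vocab_size=5000)
speech_features = torch.randn(2, 100, 512)                     # 2 utterances, 100 frames
llm_inputs = adapter(speech_features)                          # (2, 100, 4096)

In the paper's framing, these adapted embeddings would be fed to the LLM so that it first emits its contextual analysis and then the transcription in a single decoding pass; the exact weighting scheme and any frame downsampling are details the abstract does not specify.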

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2604.00610 [cs.CL]

(or arXiv:2604.00610v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.00610

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Keqi Deng [v1] Wed, 1 Apr 2026 08:13:50 UTC (248 KB)
