
Speech LLMs are Contextual Reasoning Transcribers

arXiv cs.CL · by Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li · April 2, 2026 · 1 min read

Abstract: Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
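
The abstract describes the CTC-guided Modality Adapter only at a high level. As an illustration, a minimal PyTorch sketch of one plausible reading, in which per-frame non-blank probabilities from a CTC head gate the speech encoder features before they are projected into the LLM's textual embedding space, could look like the following (class and argument names are hypothetical, not taken from the paper):

import torch
import torch.nn as nn

class CTCGuidedModalityAdapter(nn.Module):
    """Hypothetical sketch: weight encoder frames by their CTC non-blank
    probability, then project them into the LLM's textual embedding space."""
    def __init__(self, encoder_dim, llm_dim, vocab_size, blank_id=0):
        super().__init__()
        self.ctc_head = nn.Linear(encoder_dim, vocab_size)  # per-frame CTC logits
        self.proj = nn.Linear(encoder_dim, llm_dim)          # map to LLM embedding size
        self.blank_id = blank_id

    def forward(self, enc_out):
        # enc_out: (batch, frames, encoder_dim) speech encoder outputs
        ctc_probs = self.ctc_head(enc_out).softmax(dim=-1)   # (B, T, vocab)
        non_blank = 1.0 - ctc_probs[..., self.blank_id]       # (B, T)
        # Frames the CTC head judges as blank are down-weighted, so the LLM
        # mainly attends to frames that carry token content.
        weighted = enc_out * non_blank.unsqueeze(-1)
        return self.proj(weighted)                             # (B, T, llm_dim)

# Example: adapt 512-dim encoder outputs to a 4096-dim LLM embedding space.
adapter = CTCGuidedModalityAdapter(encoder_dim=512, llm_dim=4096, vocab_size=5000)
speech_features = torch.randn(2, 100, 512)                     # 2 utterances, 100 frames
llm_inputs = adapter(speech_features)                          # (2, 100, 4096)

In the paper's framing, these adapted embeddings would be fed to the LLM so that it first emits its contextual analysis and then the transcription in a single decoding pass; the exact weighting scheme and any frame downsampling are details the abstract does not specify.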

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2604.00610 [cs.CL]

(or arXiv:2604.00610v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.00610

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Keqi Deng [v1] Wed, 1 Apr 2026 08:13:50 UTC (248 KB)
