Search AI News
Find articles across all categories and topics
74 results for "Throughput"

DHFP-PE: Dual-Precision Hybrid Floating Point Processing Element for AI Acceleration
arXiv:2604.04507v1 Announce Type: cross Abstract: The rapid adoption of low-precision arithmetic in artificial intelligence and edge computing has created a strong demand for energy-efficient and flexible floating-point multiply-accumulate (MAC) units. This paper presents a fully pipelined dual-precision floating-point MAC processing engine supporting FP8 formats (E4M3, E5M2) and FP4 formats (E2M1, E1M2), specifically optimized for low-power and high-throughput AI workloads. The proposed architecture employs a novel bit-partitioning technique that enables a single 4-bit unit multiplier to operate either as a standard 4x4 multiplier for FP8 or as two parallel 2x2 multipliers for 2-bit operands, achieving 100 percent hardware utilization without duplicating logic. Implemented in 28 nm techno
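The bit-partitioning idea above can be illustrated in software. This is a minimal sketch of the general trick of packing two narrow products into one wide multiply (a SWAR-style analogue), not the paper's actual RTL design; the guard spacing `K` and the function name are illustrative assumptions.

```python
# Sketch (illustrative only, not the paper's hardware): reuse one wide
# integer multiply to compute two independent 2x2-bit products at once.
# Operands are packed with guard bits so the two product fields and the
# cross terms land in disjoint bit ranges.

def dual_2x2_multiply(a0: int, b0: int, a1: int, b1: int) -> tuple[int, int]:
    """Compute (a0*b0, a1*b1) for unsigned 2-bit operands with one multiply."""
    assert all(0 <= x < 4 for x in (a0, b0, a1, b1))
    K = 8  # guard spacing: cross terms stay clear of both product fields
    packed = ((a1 << K) | a0) * ((b1 << K) | b0)
    return packed & 0xF, (packed >> (2 * K)) & 0xF

# Exhaustive check over all 2-bit operand combinations
for a0 in range(4):
    for b0 in range(4):
        for a1 in range(4):
            for b1 in range(4):
                assert dual_2x2_multiply(a0, b0, a1, b1) == (a0 * b0, a1 * b1)
```

In the paper's hardware the partitioning is done by gating the partial-product array of a single 4x4 unit multiplier, which is why it achieves full utilization without duplicating logic; the packing trick above only mimics that behavior arithmetically.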

VectraFlow: Long-Horizon Semantic Processing over Data and Event Streams with LLMs
arXiv:2604.03855v1 Announce Type: new Abstract: Monitoring continuous data for meaningful signals increasingly demands long-horizon, stateful reasoning over unstructured streams. However, today's LLM frameworks remain stateless and one-shot, and traditional Complex Event Processing (CEP) systems, while capable of temporal pattern detection, assume structured, typed event streams that leave unstructured text out of reach. We demonstrate VectraFlow, a semantic streaming dataflow engine, to address both gaps. VectraFlow extends traditional relational operators with LLM-powered execution over free-text streams, offering a suite of continuous semantic operators -- filter, map, aggregate, join, group-by, and window -- each with configurable throughput-accuracy tradeoffs across LLM-based, embeddi
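The continuous semantic operators listed above compose like ordinary relational operators over a text stream. VectraFlow's API is not shown in the abstract, so the operator names and the stub predicate below are assumptions; a real deployment would back the predicate with an LLM or embedding model, which is where the throughput-accuracy tradeoff enters.

```python
# Sketch (hypothetical API, not VectraFlow's): filter/map/window operators
# composed lazily over a free-text stream, with a cheap stand-in predicate
# where an LLM- or embedding-based classifier would plug in.
from typing import Callable, Iterable, Iterator

def semantic_filter(stream: Iterable[str], pred: Callable[[str], bool]) -> Iterator[str]:
    """Keep only items the (model-backed) predicate accepts."""
    return (item for item in stream if pred(item))

def semantic_map(stream: Iterable[str], fn: Callable[[str], str]) -> Iterator[str]:
    """Transform each surviving item."""
    return (fn(item) for item in stream)

def window(stream: Iterable, size: int) -> Iterator[list]:
    """Tumbling window: group the stream into fixed-size batches."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

events = ["disk full on node-3", "user logged in", "OOM kill on node-7"]
alerts = semantic_filter(events, lambda e: "node" in e)  # stub for an LLM call
batches = list(window(semantic_map(alerts, str.upper), size=2))
assert batches == [["DISK FULL ON NODE-3", "OOM KILL ON NODE-7"]]
```

The statefulness the abstract emphasizes lives in operators like `window`, which carry context across items instead of treating each LLM call as one-shot.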

Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead
arXiv:2604.03676v1 Announce Type: new Abstract: Large language model retrievers improve performance on complex queries, but their practical value depends on efficiency, robustness, and reliable confidence signals in addition to accuracy. We reproduce a reasoning-intensive retrieval benchmark (BRIGHT) across 12 tasks and 14 retrievers, and extend evaluation with cold-start indexing cost, query latency distributions and throughput, corpus scaling, robustness to controlled query perturbations, and confidence use (AUROC) for predicting query success. We also quantify \emph{reasoning overhead} by comparing standard queries to five provided reasoning-augmented variants, measuring accuracy gains relative to added latency. We find that some reasoning-specialized retrievers achieve strong effective

Photonic convolutional neural network with pre-trained in-situ training
arXiv:2604.02429v1 Announce Type: cross Abstract: Photonic computing is a computing paradigm with great potential to overcome the energy bottlenecks of the electronic von Neumann architecture. Throughput and power consumption are fundamental limitations of complementary metal-oxide-semiconductor (CMOS) chips; meanwhile, convolutional neural networks (CNNs) are revolutionising machine learning, computer vision and other image-based applications. In this work, we propose and validate a fully photonic convolutional neural network (PCNN) that performs MNIST image classification entirely in the optic — Saurabh Ranjan, Sonika Thakral, Amit Sehgal

FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving
arXiv:2604.02715v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter sizes introduce a fundamental inefficiency during inference: most expert weights remain idle in GPU memory while competing with performance-critical runtime state such as the key-value (KV) cache. Since KV cache capacity directly determines serving throughput, this mismatch leads to underutilized memory and degraded performance. In this paper, we present FluxMoE, a new MoE inference system that decouples expert pa — Qingxiu Liu, Cyril Y. He, Hanser Jiang, Zion Wang, Alan Zhao, Patrick P. C. Lee

LitMOF: An LLM Multi-Agent for Literature-Validated Metal-Organic Frameworks Database Correction and Expansion
arXiv:2512.01693v2 Announce Type: replace Abstract: Metal-organic framework (MOF) databases have grown rapidly through experimental deposition and large-scale literature extraction, but recent analyses show that nearly half of their entries contain substantial structural errors. These inaccuracies propagate through high-throughput screening and machine-learning workflows, limiting the reliability of data-driven MOF discovery. Correcting such errors is exceptionally difficult because true repairs require integrating crystallographic files, synthesis descriptions, and contextual evidence scattered across the literature. Here we introduce LitMOF, a large language model-driven multi-agent framework that validates crystallographic information directly from the original literature and cross-vali

TurboQuant on Apple Silicon: real benchmarks on Mac Mini M4 16GB and M3 Max 48GB
I’ve been testing TurboQuant this week on two machines and wanted to share the actual numbers. Why this matters: TurboQuant compresses the KV cache, not the model weights. On long contexts, KV cache can take several GB of memory, so reducing it can make a big difference even when throughput stays similar. In the setup I tested, K stays at q8_0 and V goes to turbo3 (~3-bit). That asymmetric tradeoff makes sense because errors in the keys affect attention routing more directly, while values often tolerate heavier compression better. Benchmark 1: Mac Mini M4 16GB — Qwen3-14B Q4_K_M at 8K context → Without TurboQuant: KV cache 1280 MiB, K (f16): 640 MiB, V (f16): 640 MiB — 9.95 t/s → With TurboQuant: KV cache 465 MiB, K (q8_0): 340 MiB, V (turbo3): 125 MiB — 9.25 t/s Almost 3x compression, wit
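The reported sizes are easy to sanity-check from bits per element. Assuming q8_0 costs ~8.5 bits/element (8-bit values plus a shared scale per 32-element block, as in llama.cpp) and "turbo3" lands at ~3.125 bits/element, both figures in the post reproduce exactly:

```python
# Sketch: derive the compressed KV-cache sizes from bits-per-element.
# Assumptions (not stated in the post): q8_0 ~ 8.5 bits/element,
# turbo3 ~ 3.125 bits/element, baseline f16 = 16 bits/element.
F16_BITS = 16

def compressed_mib(f16_mib: float, bits_per_element: float) -> float:
    """Scale an f16 tensor's size by the quantized bits-per-element."""
    return f16_mib * bits_per_element / F16_BITS

k = compressed_mib(640, 8.5)    # K cache: 340.0 MiB, matches the post
v = compressed_mib(640, 3.125)  # V cache: 125.0 MiB, matches the post
total = k + v                   # 465.0 MiB vs 1280 MiB baseline, ~2.75x
```

The ~2.75x overall ratio is what the post rounds to "almost 3x", and it comes almost entirely from the aggressive V-side quantization.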

Intel B70 with Qwen3.5 35B
Intel recently released support for Qwen3.5: https://github.com/intel/llm-scaler/releases/tag/vllm-0.14.0-b8.1 Anyone with a B70 willing to run a llama-benchy run with the below settings on the 35B model? uvx llama-benchy --base-url $URL --model $MODEL --depth 0 --pp 2048 --tg 512 --concurrency 1 --runs 3 --latency-mode generation --no-cache --save-total-throughput-timeseries submitted by /u/Fmstrat [link] [comments]

Why We Ditched Bedrock Agents for Nova Pro and Built a Custom Orchestrator
We're building a healthcare prior authorization platform. If you've never dealt with prior auths, congratulations, you've been spared one of the most soul-crushing workflows in American healthcare. Our platform tries to make it less painful. One of our core features is an AI assistant that helps clinical staff review denial cases, check patient eligibility, and generate appeal letters. We wanted to use Amazon Nova Pro as the foundation model for this particular feature. The reasoning was simple: it's AWS's own model. AWS removes most calls-per-minute limitations on their own models, so you're not fighting throttling issues or provisioned throughput caps. With third-party models on Bedrock you can run into rate limits that require you to request increases or provision dedicated capacity. No

Claude Code Advanced Workflow: Subagents, Commands & Multi-Session
Most Claude Code tutorials stop at "write a good CLAUDE.md and let Claude handle the rest." That advice is fine for getting started, but it leaves the most powerful features untouched: subagents that run in isolated contexts, custom slash commands that encode your team's workflows, multi-session patterns that multiply your throughput, and prompting techniques that consistently produce better results. At Effloow, we run a fully AI-powered content company with 14 agents orchestrated through Paperclip. Every agent runs Claude Code. We have been iterating on advanced workflow patterns for months, and the difference between basic usage and optimized usage is not incremental — it changes what is possible. This guide covers the adv

Gemma 4 MoE hitting 120 TPS on Dual 3090s!
Thought I'd share some benchmark numbers from my local setup. Hardware: Dual NVIDIA RTX 3090s Model: Gemma 4 (MoE architecture) Performance: ~120 Tokens Per Second The efficiency of this MoE implementation is unreal. Even with a heavy load, the throughput stays incredibly consistent. It's a massive upgrade for anyone running local LLMs for high-frequency tasks or complex agentic workflows. The speed allows for near-instantaneous reasoning, which is a total paradigm shift compared to older dense models. If you have the VRAM to spare, this is definitely the way to go. submitted by /u/AaZzEL [link] [comments]

GR4AD: Kuaishou's Production-Ready Generative Recommender for Ads Delivers 4.2% Revenue Lift
Researchers from Kuaishou present GR4AD, a generative recommendation system designed for high-throughput ad serving. It introduces innovations in tokenization (UA-SID), decoding (LazyAR), and optimization (RSPO) to balance performance with cost. Online A/B tests on 400M users show a 4.2% ad revenue improvement. The Innovation — What the Source Reports A new technical paper on arXiv, "Generative Recommendation for Large-Scale Advertising," details a production-deployed system named GR4AD (Generative Recommendation for ADvertising) from Kuaishou. The work addresses the core challenge of deploying generative recommendation—which uses sequence-to-sequence models to generate candidate items—in a real-time, large-scale advertising environment where latency and compute budgets are rigid constrai

Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models
arXiv:2604.01622v1 Announce Type: new Abstract: Diffusion language models (DLMs) enable parallel, non-autoregressive text generation, yet existing DLM mixture-of-experts (MoE) models inherit token-choice (TC) routing from autoregressive systems, leading to load imbalance and rigid computation allocation. We show that expert-choice (EC) routing is a better fit for DLMs: it provides deterministic load balancing by design, yielding higher throughput and faster convergence than TC. Building on the property that EC capacity is externally controllable, we introduce timestep-dependent expert capacity — Shuibai Zhang, Caspian Zhuang, Chihan Cui, Zhihan Yang, Fred Zhangzhi Peng, Yanxin Zhang, Haoyue Bai, Zack Jia, Yang Zhou, Guanhua Chen, Ming Liu
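Expert-choice routing inverts the usual assignment direction: instead of each token picking its top experts, each expert picks its top-`capacity` tokens, so per-expert load is fixed by construction. This is a minimal NumPy sketch of that standard EC mechanism, not the paper's DLM-specific implementation; shapes and the capacity value are illustrative.

```python
# Sketch (generic expert-choice routing, not the paper's system): every
# expert selects exactly `capacity` tokens by router score, giving
# deterministic load balance regardless of the score distribution.
import numpy as np

def expert_choice_route(scores: np.ndarray, capacity: int) -> np.ndarray:
    """scores: [num_tokens, num_experts]. Returns [num_experts, capacity]
    token indices chosen by each expert (highest score first)."""
    order = np.argsort(-scores, axis=0)  # per-expert ranking of tokens
    return order[:capacity].T

rng = np.random.default_rng(0)
scores = rng.standard_normal((16, 4))            # 16 tokens, 4 experts
assignment = expert_choice_route(scores, capacity=4)
assert assignment.shape == (4, 4)                # every expert: exactly 4 tokens
```

The external knob the abstract builds on is visible here: `capacity` is a free parameter of the router, which is what lets it be made timestep-dependent in a diffusion model.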

GPU-RMQ: Accelerating Range Minimum Queries on Modern GPUs
arXiv:2604.01811v1 Announce Type: new Abstract: Range minimum queries are frequently used in string processing and database applications including biological sequence analysis, document retrieval, and web search. Hence, various data structures have been proposed for improving their efficiency on both CPUs and GPUs. Recent work has also shown that hardware-accelerated ray tracing on modern NVIDIA RTX graphics cards can be exploited to answer range minimum queries by expressing queries as rays, which are fired into a scene of triangles representing minima of ranges at different granularities. While these approaches are promising, they suffer from at least one of three issues: severe memory overhead, high index construction time, and low query throughput. This renders these methods practically
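The three-way tradeoff the abstract cites (memory, construction time, query throughput) is visible already in the classic CPU baseline. This is a sketch of the textbook sparse-table RMQ, not GPU-RMQ itself: it spends O(n log n) memory and preprocessing to answer every query in O(1) via two overlapping power-of-two spans.

```python
# Sketch (classic sparse-table baseline, not the paper's GPU structure):
# table[j][i] holds min(a[i : i + 2**j]); any range is covered by two
# overlapping power-of-two spans, so queries are O(1).

def build_sparse_table(a):
    table = [list(a)]
    span = 2
    while span <= len(a):
        prev, half = table[-1], span // 2
        table.append([min(prev[i], prev[i + half])
                      for i in range(len(a) - span + 1)])
        span *= 2
    return table

def rmq(table, lo, hi):
    """Minimum of a[lo:hi], hi exclusive."""
    j = (hi - lo).bit_length() - 1
    return min(table[j][lo], table[j][hi - (1 << j)])

a = [5, 2, 4, 7, 1, 3, 6, 0]
t = build_sparse_table(a)
assert rmq(t, 0, 4) == 2 and rmq(t, 2, 7) == 1 and rmq(t, 0, 8) == 0
```

The memory-overhead complaint in the abstract applies directly: the table stores roughly n log n entries, which is exactly the kind of cost GPU-oriented designs try to reduce without giving up query throughput.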

