If Memory Could Compute, Would We Still Need GPUs?
The bottleneck for LLM inference isn't GPU compute. It's memory bandwidth.
A February 2026 arXiv paper (arXiv:2601.05047) states it plainly: the primary challenges for LLM inference are memory and interconnect, not computation. GPU arithmetic units spend more than half their time idle, waiting for data to arrive.
So flip the paradigm. Compute where the data lives, and data movement disappears. This is the core idea behind Processing-in-Memory (PIM). SK Hynix's AiM is shipping as a commercial product. Samsung announced LPDDR5X-PIM in February 2026. HBM4 integrates logic dies, turning the memory stack itself into a co-processor.
Is the GPU era ending? Short answer: no. But PIM will change LLM inference architecture. How far the change goes, and where it stops — that's what the papers and product data reveal.
The Memory Wall: Why GPUs Sit Idle
LLM inference has two phases with different bottlenecks:
Prefill phase (prompt processing):
- Batch-processes many tokens at once
- Matrix-matrix multiply → compute-bound
- GPU arithmetic units fully utilized

Decode phase (token generation):
- Generates one token at a time, autoregressively
- KV cache reads → memory bandwidth-bound
- GPU arithmetic units idle, waiting for data
The problem is the Decode phase. Most LLM inference time is spent in Decode. And Decode is memory bandwidth-limited.
```python
# Memory bandwidth bottleneck on RTX 4060 8GB
rtx_4060_specs = {
    "compute": "15.11 TFLOPS (FP16)",
    "memory_bandwidth": "272 GB/s",
    "required_arithmetic_intensity": "15110 / 272 = 55.6 FLOP/byte",
}

# Actual arithmetic intensity during LLM Decode
llm_decode = {
    "typical_arithmetic_intensity": "1-2 FLOP/byte",
    "bottleneck": "memory bandwidth (272 GB/s wall)",
    "gpu_utilization": "< 5% of compute capacity during decode",
}

# 95%+ of GPU compute sits idle during Decode
```
The A100 80GB has the same structure. HBM2e bandwidth: 2 TB/s. Compute: 312 TFLOPS (FP16). Required arithmetic intensity to saturate compute: 156 FLOP/byte, versus Decode's 1-2 FLOP/byte. Bandwidth falls short by roughly 80-150x.
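The ridge-point arithmetic above generalizes to any GPU via the roofline model. A minimal sketch, using the specs quoted in this article (the `bound` helper is illustrative, not from any library):

```python
# Roofline check: is a workload compute-bound or memory-bound on a given GPU?
# Specs are the ones quoted in this article; decode intensity is the 1-2 FLOP/byte range.

def bound(peak_tflops: float, bandwidth_gbs: float, workload_intensity: float) -> str:
    """Compare workload arithmetic intensity against the GPU's ridge point."""
    ridge = (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)  # FLOP/byte needed to saturate compute
    return "compute-bound" if workload_intensity >= ridge else "memory-bound"

# RTX 4060: ridge = 15110 / 272 ≈ 55.6 FLOP/byte
print(bound(15.11, 272, 1.5))    # decode → memory-bound
# A100 80GB: ridge = 312000 / 2000 = 156 FLOP/byte
print(bound(312.0, 2000, 1.5))   # decode → memory-bound
print(bound(312.0, 2000, 200))   # large-batch prefill GEMM → compute-bound
```

Any workload below the ridge point leaves compute idle no matter how fast the GPU is; decode sits far below it on both cards.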
PIM Principle: Don't Move Data, Move Computation
Traditional architecture:
```
DRAM/HBM → bus → GPU compute → bus → DRAM/HBM
```

Data travels round-trip. Bus bandwidth is the bottleneck.
PIM architecture:
```
Compute units inside DRAM/HBM → only results output
```

Data doesn't move. Computation moves to the data. Internal bandwidth is orders of magnitude higher than bus bandwidth.
HBM's internal bandwidth (aggregate across banks) is tens of times higher than external bandwidth. Computing inside HBM without moving data out eliminates the bandwidth wall.
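A back-of-envelope model shows why bandwidth maps directly to decode speed: every generated token streams the full weight set once, so tokens/sec is roughly bandwidth divided by bytes read per token. The model size and the 20x internal-bandwidth multiplier below are illustrative assumptions, not vendor figures:

```python
# Back-of-envelope: decode throughput ≈ bandwidth / bytes-read-per-token,
# since each generated token streams all weights (plus KV cache) once.

def decode_tokens_per_sec(bandwidth_gbs: float, model_bytes_gb: float) -> float:
    return bandwidth_gbs / model_bytes_gb

weights_gb = 7.0  # e.g. a 7B model at 8-bit quantization, ignoring KV cache traffic
print(decode_tokens_per_sec(272, weights_gb))       # external GDDR6 bus (RTX 4060) ≈ 38.9 tok/s
print(decode_tokens_per_sec(272 * 20, weights_gb))  # hypothetical 20x aggregate internal bandwidth
```

If PIM can exploit even a fraction of aggregate per-bank bandwidth, the ceiling on decode throughput moves up by the same fraction.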
Shipping Products
SK Hynix AiM (Accelerator in Memory):
- Commercial PIM processor based on GDDR6
- Compute units per memory bank (AiMX card shipping)
- Deployed in production environments
- Specialized for GEMV (matrix-vector multiply)
Samsung LPDDR5X-PIM (announced Feb 2026):
- In-memory compute in mobile LPDDR5X
- Significant energy efficiency improvement for edge AI inference (industry estimates: several-fold)
- Targeting smartphones and edge devices
Samsung/SK Hynix HBM4 plans:
- Logic die integrated into HBM stack
- Memory stack becomes a co-processor
- Mass production from February 2026
- Targeting NVIDIA's "Rubin" architecture
How PIM Changes LLM Inference
2025-2026 arXiv papers propose concrete PIM × LLM architectures.
HPIM: Heterogeneous PIM Integration (arXiv:2509.12993)
HPIM (Heterogeneous PIM) Architecture:
SRAM-PIM (low latency):
- Attention score computation
- Small but ultra-fast
- Equivalent to GPU L2 cache position
HBM-PIM (high bandwidth, large capacity):
- KV cache storage and processing
- Large capacity, medium speed
- Equivalent to main memory position
Parallel execution of both:
- SRAM-PIM: attention score ← low latency
- HBM-PIM: KV multiplication ← high bandwidth
→ Parallelizes serial dependencies in autoregressive Decode
This envisions PIM across the entire memory hierarchy. Both cache and main memory contribute computation, each leveraging its strengths.
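A deliberately simplified latency model illustrates the payoff: if per-head score work on SRAM-PIM can overlap value work on HBM-PIM (as the paper proposes), the decode step costs the slower of the two instead of their sum. The microsecond figures below are placeholder assumptions, not measurements from the paper:

```python
# Toy latency model for HPIM's split: attention-score work on SRAM-PIM
# overlaps KV-value multiplication on HBM-PIM instead of running serially.

def serial_latency(score_us: float, kv_us: float) -> float:
    return score_us + kv_us          # one device does both steps back to back

def hpim_latency(score_us: float, kv_us: float) -> float:
    return max(score_us, kv_us)      # two PIM tiers run in parallel; the slower dominates

score, kv = 12.0, 15.0               # microseconds per decode step (assumed)
print(serial_latency(score, kv), hpim_latency(score, kv))  # 27.0 vs 15.0
```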
PAM: Processing Across Memory Hierarchy (arXiv:2602.11521)
PAM (Processing Across Memory):
- HBM-PIM: Hot data (frequent access)
- DRAM-PIM: Warm data (moderate access)
- SSD-PIM: Cold data (rare access)

→ Optimize processing location by data temperature
→ Handle long-context LLMs (100K+ tokens)
When the entire model doesn't fit in HBM (an everyday reality for us RTX 4060 8GB users), cross-hierarchy PIM could become an alternative to partial offloading.
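A temperature-based placement policy can be sketched in a few lines. The thresholds and the routing rule here are illustrative assumptions, not the actual policy from the PAM paper:

```python
# Sketch of PAM-style placement: route each KV-cache block to a memory tier
# by how often attention actually reads it ("data temperature").

def place(access_count: int, hot_threshold: int = 100, warm_threshold: int = 10) -> str:
    if access_count >= hot_threshold:
        return "HBM-PIM"   # hot: read on almost every decode step
    if access_count >= warm_threshold:
        return "DRAM-PIM"  # warm: occasional attention hits
    return "SSD-PIM"       # cold: rarely-attended long-context history

counts = {"recent_tokens": 500, "mid_context": 40, "distant_context": 2}
print({k: place(v) for k, v in counts.items()})
# → recent tokens on HBM-PIM, mid context on DRAM-PIM, distant context on SSD-PIM
```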
Three Reasons PIM Can't Kill GPUs
PIM is compelling, but it won't make GPUs unnecessary.
1. Training Is Off the Table
LLM training is compute-bound. Large matrix multiplications, gradient computation, parameter updates — these need GPU's high compute density. PIM's in-memory units are good for matrix-vector products (GEMV) but are vastly outperformed by GPUs on matrix-matrix products (GEMM).
```python
# Training vs Inference workload characteristics
workload_characteristics = {
    "training": {
        "dominant_op": "GEMM (matrix-matrix multiply)",
        "arithmetic_intensity": "high (100+ FLOP/byte)",
        "bottleneck": "compute",
        "pim_advantage": "none (GEMM is GPU territory)",
    },
    "inference_prefill": {
        "dominant_op": "GEMM (batched)",
        "arithmetic_intensity": "medium-high",
        "bottleneck": "compute (batch-size dependent)",
        "pim_advantage": "limited",
    },
    "inference_decode": {
        "dominant_op": "GEMV (matrix-vector multiply)",
        "arithmetic_intensity": "low (1-2 FLOP/byte)",
        "bottleneck": "memory bandwidth",
        "pim_advantage": "★ significant",
    },
}
```
PIM's window of advantage is inference Decode only. Training and Prefill are GPU's domain.
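The GEMV/GEMM split above follows from first principles: a matrix-vector product does O(n²) work over O(n²) bytes, while a matrix-matrix product does O(n³) work over O(n²) bytes. A quick sanity check, assuming FP16 and ideal data reuse:

```python
# Arithmetic intensity from first principles (FP16 = 2 bytes per element).

def gemv_intensity(n: int, bytes_per_el: int = 2) -> float:
    # y = A @ x with A (n x n): 2*n^2 FLOPs; matrix plus both vectors are read/written once
    flops = 2 * n * n
    bytes_moved = bytes_per_el * (n * n + 2 * n)
    return flops / bytes_moved

def gemm_intensity(n: int, bytes_per_el: int = 2) -> float:
    # C = A @ B, all (n x n): 2*n^3 FLOPs over 3*n^2 elements (ideal reuse)
    return (2 * n**3) / (bytes_per_el * 3 * n * n)

print(gemv_intensity(4096))  # ≈ 1.0 FLOP/byte — matches the decode figure above
print(gemm_intensity(4096))  # ≈ 1365 FLOP/byte — deep in compute-bound territory
```

GEMV intensity is bounded near 1 regardless of matrix size; GEMM intensity grows with n. No amount of PIM hardware changes that asymmetry.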
2. Immature Programming Model
Leveraging PIM requires explicit programming of data placement and compute mapping. No CUDA-equivalent exists.
GPU:
- Mature software stack (CUDA: 15 years of development)
- Full framework support (PyTorch, TensorFlow, llama.cpp)
- Massive developer ecosystem

PIM:
- Vendor-specific APIs (SK Hynix AiM SDK, Samsung PIM SDK)
- No framework integration
- Small developer community
- Manual memory placement optimization required
Hardware without software is useless. CUDA made GPUs a compute platform. PIM needs its "CUDA moment." It hasn't happened yet.
3. Cost Structure
PIM costs more to manufacture than standard memory. Added die area for compute units, more complex testing, lower yields.
Standard HBM3E: ~$10-18/GB (2026 market estimates)
HBM-PIM: ~$20-30/GB (estimated, compute-unit premium)
GPU (A100): ~$10,000 (includes 80GB HBM2e)

PIM ROI condition: inference server power-cost savings > PIM premium
→ May work at datacenter scale
→ Not relevant for individual users in the near term
Will PIM Reach RTX 4060 Users?
Honestly, consumer PIM is 3-5 years away.
Currently available PIM:
- SK Hynix AiM: Datacenter only, not consumer-purchasable
- Samsung LPDDR5X-PIM: Mobile only, not in PCs
Expected 2027-2028:
- Possible PIM integration in HBM4-equipped GPUs
- NVIDIA Rubin architecture adopts HBM4
- Whether PIM features ship in consumer GPUs: unknown
Current practical solutions for consumers:
- Higher-bandwidth GPUs (RTX 5090: GDDR7 at ~1.8 TB/s)
- MoE models to reduce active parameters (lower bandwidth demand)
- Speculative decoding to improve effective bandwidth utilization
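The speculative-decoding point is worth quantifying: one full weight read can verify several drafted tokens at once, so the bytes read per accepted token drop. The draft length and acceptance rate below are illustrative assumptions, and the model ignores the draft model's own cost:

```python
# Why speculative decoding raises effective bandwidth utilization:
# one weight-streaming pass over the verifier model checks several draft tokens.

def effective_tokens_per_sec(bandwidth_gbs: float, model_bytes_gb: float,
                             draft_len: int, acceptance_rate: float) -> float:
    accepted = draft_len * acceptance_rate + 1  # +1: the verifier always emits one token
    return (bandwidth_gbs / model_bytes_gb) * accepted

base = effective_tokens_per_sec(272, 7.0, draft_len=0, acceptance_rate=0.0)  # plain decode
spec = effective_tokens_per_sec(272, 7.0, draft_len=4, acceptance_rate=0.7)  # speculative
print(base, spec)  # ≈ 38.9 vs ≈ 147.7 tok/s (draft-model overhead not modeled)
```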
Indirect benefits exist, though. As datacenter PIM adoption grows, cloud API inference costs drop. Energy efficiency improvements translate to lower per-token pricing.
Where PIM Redraws the Line
PIM won't kill GPUs. But it redraws the boundary of GPU work.
Current boundary:
- GPU = training + inference (everything)
- Memory = data storage only

Post-PIM boundary:
- GPU = training + Prefill (compute-bound)
- PIM = Decode (memory-bound)
- Memory = data storage + Decode computation
When Decode shifts to PIM, GPUs can specialize in Prefill and training. The idle-GPU problem during Decode disappears.
This also shifts semiconductor industry dynamics. Part of the inference market that NVIDIA monopolizes today moves to memory makers (Samsung, SK Hynix, Micron). TrendForce reported in March 2026 that Samsung and SK Hynix are exploring "next-gen AI memory that could challenge NVIDIA."
The Memory Wall Is Crumbling — From the Inside
Back to the opening question: "If memory could compute, would we still need GPUs?"
Yes, we need GPUs. Training and Prefill are non-negotiable. But Decode's lead actor may shift from GPU to PIM.
- PIM fundamentally solves the memory bandwidth bottleneck for inference Decode
- SK Hynix AiM is shipping, Samsung LPDDR5X-PIM is announced, HBM4 integrates logic dies
- But: can't handle training, software stack is immature, carries a cost premium
- Consumer PIM is 3-5 years out. MoE + speculative decoding remain the practical solution for now
The memory wall isn't being broken from the outside (more bandwidth). It's crumbling from the inside (putting computation in). That wave will take a while to reach individual GPUs, but in datacenters, it's already underway.
References

- "Challenges and Research Directions for Large Language Model Inference Hardware" (2026). arXiv:2601.05047
- "HPIM: Heterogeneous Processing-In-Memory-based Accelerator for LLM Inference" (2025). arXiv:2509.12993
- "PAM: Processing Across Memory Hierarchy" (2026). arXiv:2602.11521
- "Memory Is All You Need: Compute-in-Memory Architectures for LLM Inference" (2024). arXiv:2406.08413
- TrendForce, "Beyond HBM: Samsung, SK hynix Explore Next-Gen AI Memory" (2026-03-10)
- Samsung, "LPDDR5X-PIM for AI Computing" (2026-02)