If Memory Could Compute, Would We Still Need GPUs?
The bottleneck for LLM inference isn't GPU compute. It's memory bandwidth.
A February 2026 arXiv paper (arXiv:2601.05047) states it plainly: the primary challenges for LLM inference are memory and interconnect, not computation. GPU arithmetic units spend more than half their time idle, waiting for data to arrive.
So flip the paradigm. Compute where the data lives, and data movement disappears. This is the core idea behind Processing-in-Memory (PIM). SK Hynix's AiM is shipping as a commercial product. Samsung announced LPDDR5X-PIM in February 2026. HBM4 integrates logic dies, turning the memory stack itself into a co-processor.
Is the GPU era ending? Short answer: no. But PIM will change LLM inference architecture. How far the change goes, and where it stops — that's what the papers and product data reveal.
The Memory Wall: Why GPUs Sit Idle
LLM inference has two phases with different bottlenecks:
Prefill phase (prompt processing):
- Batch-processes many tokens at once
- Matrix-matrix multiply → compute-bound
- GPU arithmetic units fully utilized

Decode phase (token generation):
- Generates one token at a time, autoregressively
- KV cache reads → memory bandwidth-bound
- GPU arithmetic units idle, waiting for data
The problem is the Decode phase. Most LLM inference time is spent in Decode. And Decode is memory bandwidth-limited.
```python
# Memory bandwidth bottleneck on RTX 4060 8GB
rtx_4060_specs = {
    "compute": "15.11 TFLOPS (FP16)",
    "memory_bandwidth": "272 GB/s",
    "required_arithmetic_intensity": "15110 / 272 = 55.6 FLOP/byte",
}

# Actual arithmetic intensity during LLM Decode
llm_decode = {
    "typical_arithmetic_intensity": "1-2 FLOP/byte",
    "bottleneck": "memory bandwidth (272 GB/s wall)",
    "gpu_utilization": "< 5% of compute capacity during decode",
}

# 95%+ of GPU compute sits idle during Decode
```
The A100 80GB has the same structure. HBM2e bandwidth: 2 TB/s. Compute: 312 TFLOPS (FP16). Required arithmetic intensity to saturate compute: 156 FLOP/byte, versus Decode's 1-2 FLOP/byte. Bandwidth falls short by roughly 80-150x.
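The ridge-point arithmetic above generalizes to any GPU via the roofline model. A minimal sketch, using the specs quoted in this article (the `bound` helper is illustrative, not from any library):

```python
# Roofline check: is a workload compute-bound or memory-bound on a given GPU?
# Specs are the ones quoted in this article; decode intensity is the 1-2 FLOP/byte range.

def bound(peak_tflops: float, bandwidth_gbs: float, workload_intensity: float) -> str:
    """Compare workload arithmetic intensity against the GPU's ridge point."""
    ridge = (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)  # FLOP/byte needed to saturate compute
    return "compute-bound" if workload_intensity >= ridge else "memory-bound"

# RTX 4060: ridge = 15110 / 272 ≈ 55.6 FLOP/byte
print(bound(15.11, 272, 1.5))    # decode → memory-bound
# A100 80GB: ridge = 312000 / 2000 = 156 FLOP/byte
print(bound(312.0, 2000, 1.5))   # decode → memory-bound
print(bound(312.0, 2000, 200))   # large-batch prefill GEMM → compute-bound
```

Any workload below the ridge point leaves compute idle no matter how fast the GPU is; decode sits far below it on both cards.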
PIM Principle: Don't Move Data, Move Computation
Traditional architecture:
```
DRAM/HBM → bus → GPU compute → bus → DRAM/HBM
```

Data travels round-trip. Bus bandwidth is the bottleneck.
PIM architecture:
```
Compute units inside DRAM/HBM → only results output
```

Data doesn't move. Computation moves to the data. Internal bandwidth is orders of magnitude higher than bus bandwidth.
HBM's internal bandwidth (aggregate across banks) is tens of times higher than external bandwidth. Computing inside HBM without moving data out eliminates the bandwidth wall.
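A back-of-envelope model shows why bandwidth maps directly to decode speed: every generated token streams the full weight set once, so tokens/sec is roughly bandwidth divided by bytes read per token. The model size and the 20x internal-bandwidth multiplier below are illustrative assumptions, not vendor figures:

```python
# Back-of-envelope: decode throughput ≈ bandwidth / bytes-read-per-token,
# since each generated token streams all weights (plus KV cache) once.

def decode_tokens_per_sec(bandwidth_gbs: float, model_bytes_gb: float) -> float:
    return bandwidth_gbs / model_bytes_gb

weights_gb = 7.0  # e.g. a 7B model at 8-bit quantization, ignoring KV cache traffic
print(decode_tokens_per_sec(272, weights_gb))       # external GDDR6 bus (RTX 4060) ≈ 38.9 tok/s
print(decode_tokens_per_sec(272 * 20, weights_gb))  # hypothetical 20x aggregate internal bandwidth
```

If PIM can exploit even a fraction of aggregate per-bank bandwidth, the ceiling on decode throughput moves up by the same fraction.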
Shipping Products
SK Hynix AiM (Accelerator in Memory):
- Commercial PIM processor based on GDDR6
- Compute units per memory bank (AiMX card shipping)
- Deployed in production environments
- Specialized for GEMV (matrix-vector multiply)
Samsung LPDDR5X-PIM (announced Feb 2026):
- In-memory compute in mobile LPDDR5X
- Significant energy efficiency improvement for edge AI inference (industry estimates: several-fold)
- Targeting smartphones and edge devices
Samsung/SK Hynix HBM4 plans:
- Logic die integrated into HBM stack
- Memory stack becomes a co-processor
- Mass production from February 2026
- Targeting NVIDIA's "Rubin" architecture
How PIM Changes LLM Inference
2025-2026 arXiv papers propose concrete PIM × LLM architectures.
HPIM: Heterogeneous PIM Integration (arXiv:2509.12993)
HPIM (Heterogeneous PIM) Architecture:
SRAM-PIM (low latency):
- Attention score computation
- Small but ultra-fast
- Equivalent to GPU L2 cache position
HBM-PIM (high bandwidth, large capacity):
- KV cache storage and processing
- Large capacity, medium speed
- Equivalent to main memory position
Parallel execution of both:
- SRAM-PIM: attention score ← low latency
- HBM-PIM: KV multiplication ← high bandwidth
→ Parallelizes serial dependencies in autoregressive Decode
This envisions PIM across the entire memory hierarchy. Both cache and main memory contribute computation, each leveraging its strengths.
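A deliberately simplified latency model illustrates the payoff: if per-head score work on SRAM-PIM can overlap value work on HBM-PIM (as the paper proposes), the decode step costs the slower of the two instead of their sum. The microsecond figures below are placeholder assumptions, not measurements from the paper:

```python
# Toy latency model for HPIM's split: attention-score work on SRAM-PIM
# overlaps KV-value multiplication on HBM-PIM instead of running serially.

def serial_latency(score_us: float, kv_us: float) -> float:
    return score_us + kv_us          # one device does both steps back to back

def hpim_latency(score_us: float, kv_us: float) -> float:
    return max(score_us, kv_us)      # two PIM tiers run in parallel; the slower dominates

score, kv = 12.0, 15.0               # microseconds per decode step (assumed)
print(serial_latency(score, kv), hpim_latency(score, kv))  # 27.0 vs 15.0
```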
PAM: Processing Across Memory Hierarchy (arXiv:2602.11521)
PAM (Processing Across Memory):
- HBM-PIM: Hot data (frequent access)
- DRAM-PIM: Warm data (moderate access)
- SSD-PIM: Cold data (rare access)

→ Optimize processing location by data temperature
→ Handle long-context LLMs (100K+ tokens)
When the entire model doesn't fit in HBM (an everyday reality for us RTX 4060 8GB users), cross-hierarchy PIM could become an alternative to partial offloading.
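A temperature-based placement policy can be sketched in a few lines. The thresholds and the routing rule here are illustrative assumptions, not the actual policy from the PAM paper:

```python
# Sketch of PAM-style placement: route each KV-cache block to a memory tier
# by how often attention actually reads it ("data temperature").

def place(access_count: int, hot_threshold: int = 100, warm_threshold: int = 10) -> str:
    if access_count >= hot_threshold:
        return "HBM-PIM"   # hot: read on almost every decode step
    if access_count >= warm_threshold:
        return "DRAM-PIM"  # warm: occasional attention hits
    return "SSD-PIM"       # cold: rarely-attended long-context history

counts = {"recent_tokens": 500, "mid_context": 40, "distant_context": 2}
print({k: place(v) for k, v in counts.items()})
# → recent tokens on HBM-PIM, mid context on DRAM-PIM, distant context on SSD-PIM
```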
Three Reasons PIM Can't Kill GPUs
PIM is compelling, but it won't make GPUs unnecessary.
1. Training Is Off the Table
LLM training is compute-bound. Large matrix multiplications, gradient computation, parameter updates — these need GPU's high compute density. PIM's in-memory units are good for matrix-vector products (GEMV) but are vastly outperformed by GPUs on matrix-matrix products (GEMM).
```python
# Training vs Inference workload characteristics
workload_characteristics = {
    "training": {
        "dominant_op": "GEMM (matrix-matrix multiply)",
        "arithmetic_intensity": "high (100+ FLOP/byte)",
        "bottleneck": "compute",
        "pim_advantage": "none (GEMM is GPU territory)",
    },
    "inference_prefill": {
        "dominant_op": "GEMM (batched)",
        "arithmetic_intensity": "medium-high",
        "bottleneck": "compute (batch-size dependent)",
        "pim_advantage": "limited",
    },
    "inference_decode": {
        "dominant_op": "GEMV (matrix-vector multiply)",
        "arithmetic_intensity": "low (1-2 FLOP/byte)",
        "bottleneck": "memory bandwidth",
        "pim_advantage": "★ significant",
    },
}
```
PIM's window of advantage is inference Decode only. Training and Prefill are GPU's domain.
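The GEMV/GEMM split above follows from first principles: a matrix-vector product does O(n²) work over O(n²) bytes, while a matrix-matrix product does O(n³) work over O(n²) bytes. A quick sanity check, assuming FP16 and ideal data reuse:

```python
# Arithmetic intensity from first principles (FP16 = 2 bytes per element).

def gemv_intensity(n: int, bytes_per_el: int = 2) -> float:
    # y = A @ x with A (n x n): 2*n^2 FLOPs; matrix plus both vectors are read/written once
    flops = 2 * n * n
    bytes_moved = bytes_per_el * (n * n + 2 * n)
    return flops / bytes_moved

def gemm_intensity(n: int, bytes_per_el: int = 2) -> float:
    # C = A @ B, all (n x n): 2*n^3 FLOPs over 3*n^2 elements (ideal reuse)
    return (2 * n**3) / (bytes_per_el * 3 * n * n)

print(gemv_intensity(4096))  # ≈ 1.0 FLOP/byte — matches the decode figure above
print(gemm_intensity(4096))  # ≈ 1365 FLOP/byte — deep in compute-bound territory
```

GEMV intensity is bounded near 1 regardless of matrix size; GEMM intensity grows with n. No amount of PIM hardware changes that asymmetry.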
2. Immature Programming Model
Leveraging PIM requires explicit programming of data placement and compute mapping. No CUDA-equivalent exists.
GPU:
- Mature software stack (CUDA: 15 years of development)
- Full framework support (PyTorch, TensorFlow, llama.cpp)
- Massive developer ecosystem

PIM:
- Vendor-specific APIs (SK Hynix AiM SDK, Samsung PIM SDK)
- No framework integration
- Small developer community
- Manual memory placement optimization required
Hardware without software is useless. CUDA made GPUs a compute platform. PIM needs its "CUDA moment." It hasn't happened yet.
3. Cost Structure
PIM costs more to manufacture than standard memory. Added die area for compute units, more complex testing, lower yields.
Standard HBM3E: ~$10-18/GB (2026 market estimates)
HBM-PIM: ~$20-30/GB (estimated, compute-unit premium)
GPU (A100): ~$10,000 (includes 80GB HBM2e)

PIM ROI condition: inference server power-cost savings > PIM premium
→ May work at datacenter scale
→ Not relevant for individual users in the near term
Will PIM Reach RTX 4060 Users?
Honestly, consumer PIM is 3-5 years away.
Currently available PIM:
- SK Hynix AiM: Datacenter only, not consumer-purchasable
- Samsung LPDDR5X-PIM: Mobile only, not in PCs
Expected 2027-2028:
- Possible PIM integration in HBM4-equipped GPUs
- NVIDIA Rubin architecture adopts HBM4
- Whether PIM features ship in consumer GPUs: unknown
Current practical solutions for consumers:
- Higher-bandwidth GPUs (RTX 5090: GDDR7 at ~1.8 TB/s)
- MoE models to reduce active parameters (lower bandwidth demand)
- Speculative decoding to improve effective bandwidth utilization
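The speculative-decoding point is worth quantifying: one full weight read can verify several drafted tokens at once, so the bytes read per accepted token drop. The draft length and acceptance rate below are illustrative assumptions, and the model ignores the draft model's own cost:

```python
# Why speculative decoding raises effective bandwidth utilization:
# one weight-streaming pass over the verifier model checks several draft tokens.

def effective_tokens_per_sec(bandwidth_gbs: float, model_bytes_gb: float,
                             draft_len: int, acceptance_rate: float) -> float:
    accepted = draft_len * acceptance_rate + 1  # +1: the verifier always emits one token
    return (bandwidth_gbs / model_bytes_gb) * accepted

base = effective_tokens_per_sec(272, 7.0, draft_len=0, acceptance_rate=0.0)  # plain decode
spec = effective_tokens_per_sec(272, 7.0, draft_len=4, acceptance_rate=0.7)  # speculative
print(base, spec)  # ≈ 38.9 vs ≈ 147.7 tok/s (draft-model overhead not modeled)
```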
Indirect benefits exist, though. As datacenter PIM adoption grows, cloud API inference costs drop. Energy efficiency improvements translate to lower per-token pricing.
Where PIM Redraws the Line
PIM won't kill GPUs. But it redraws the boundary of GPU work.
Current boundary:
- GPU = training + inference (everything)
- Memory = data storage only

Post-PIM boundary:
- GPU = training + Prefill (compute-bound)
- PIM = Decode (memory-bound)
- Memory = data storage + Decode computation
When Decode shifts to PIM, GPUs can specialize in Prefill and training. The idle-GPU problem during Decode disappears.
This also shifts semiconductor industry dynamics. Part of the inference market that NVIDIA monopolizes today moves to memory makers (Samsung, SK Hynix, Micron). TrendForce reported in March 2026 that Samsung and SK Hynix are exploring "next-gen AI memory that could challenge NVIDIA."
The Memory Wall Is Crumbling — From the Inside
Back to the opening question: "If memory could compute, would we still need GPUs?"
Yes, we need GPUs. Training and Prefill are non-negotiable. But Decode's lead actor may shift from GPU to PIM.
- PIM fundamentally solves the memory bandwidth bottleneck for inference Decode
- SK Hynix AiM is shipping, Samsung LPDDR5X-PIM is announced, HBM4 integrates logic dies
- But: can't handle training, software stack is immature, carries a cost premium
- Consumer PIM is 3-5 years out. MoE + speculative decoding remain the practical solution for now
The memory wall isn't being broken from the outside (more bandwidth). It's crumbling from the inside (putting computation in). That wave will take a while to reach individual GPUs, but in datacenters, it's already underway.
References

- "Challenges and Research Directions for Large Language Model Inference Hardware" (2026). arXiv:2601.05047
- "HPIM: Heterogeneous Processing-In-Memory-based Accelerator for LLM Inference" (2025). arXiv:2509.12993
- "PAM: Processing Across Memory Hierarchy" (2026). arXiv:2602.11521
- "Memory Is All You Need: Compute-in-Memory Architectures for LLM Inference" (2024). arXiv:2406.08413
- TrendForce, "Beyond HBM: Samsung, SK hynix Explore Next-Gen AI Memory" (2026-03-10)
- Samsung, "LPDDR5X-PIM for AI Computing" (2026-02)