
If Memory Could Compute, Would We Still Need GPUs?

Dev.to AI · by plasmon · April 5, 2026 · 8 min read

The bottleneck for LLM inference isn't GPU compute. It's memory bandwidth.

A February 2026 ArXiv paper (arXiv:2601.05047) states it plainly: the primary challenges for LLM inference are memory and interconnect, not computation. GPU arithmetic units spend more than half their time idle, waiting for data to arrive.

So flip the paradigm. Compute where the data lives, and data movement disappears. This is the core idea behind Processing-in-Memory (PIM). SK Hynix's AiM is shipping as a commercial product. Samsung announced LPDDR5X-PIM in February 2026. HBM4 integrates logic dies, turning the memory stack itself into a co-processor.

Is the GPU era ending? Short answer: no. But PIM will change LLM inference architecture. How far the change goes, and where it stops — that's what the papers and product data reveal.

The Memory Wall: Why GPUs Sit Idle

LLM inference has two phases with different bottlenecks:

Prefill phase (prompt processing):

  • Batch-processes many tokens at once
  • Matrix-matrix multiply → compute-bound
  • GPU arithmetic units fully utilized

Decode phase (token generation):

  • Generates one token at a time, autoregressively
  • KV cache reads → memory bandwidth-bound
  • GPU arithmetic units idle, waiting for data

The problem is the Decode phase. Most LLM inference time is spent in Decode. And Decode is memory bandwidth-limited.

```python
# Memory bandwidth bottleneck on RTX 4060 8GB
rtx_4060_specs = {
    "compute": "15.11 TFLOPS (FP16)",
    "memory_bandwidth": "272 GB/s",
    "required_arithmetic_intensity": "15110 / 272 = 55.6 FLOP/byte",
}

# Actual arithmetic intensity during LLM Decode
llm_decode = {
    "typical_arithmetic_intensity": "1-2 FLOP/byte",
    "bottleneck": "memory bandwidth (272 GB/s wall)",
    "gpu_utilization": "< 5% of compute capacity during decode",
}

# 95%+ of GPU compute sits idle during Decode
```

The A100 80GB has the same structure. HBM2e bandwidth: 2 TB/s. Compute: 312 TFLOPS. Required arithmetic intensity: 156 FLOP/byte vs. Decode's actual 1-2 FLOP/byte, so Decode delivers roughly 80-150x less intensity than the hardware needs to stay busy.
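A quick sanity check of that wall: during Decode, generating one token must stream essentially all model weights through the memory bus once, so throughput is capped near bandwidth divided by model size. A back-of-envelope sketch (the model sizes are illustrative, and it ignores KV-cache traffic):

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, params_billion: float,
                          bytes_per_param: float = 2.0) -> float:
    """Upper bound on decode throughput: each token streams all weights once."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# RTX 4060 (272 GB/s) running a 7B model in FP16: ~19 tokens/s ceiling
print(decode_tokens_per_sec(272, 7))
# A100 (2 TB/s) running a 70B model in FP16: ~14 tokens/s ceiling
print(decode_tokens_per_sec(2000, 70))
```

No amount of extra compute raises these ceilings; only more bandwidth, or fewer bytes per token (quantization, MoE), does.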

PIM Principle: Don't Move Data, Move Computation

Traditional architecture:

DRAM/HBM → bus → GPU compute → bus → DRAM/HBM

Data travels round-trip; bus bandwidth is the bottleneck.

PIM architecture:

Compute units inside DRAM/HBM → only results are output

Data doesn't move; computation moves to the data. Internal bandwidth is orders of magnitude higher than bus bandwidth.

HBM's internal bandwidth (aggregate across banks) is tens of times higher than external bandwidth. Computing inside HBM without moving data out eliminates the bandwidth wall.
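To make the traffic difference concrete, here is an illustrative count of bytes crossing the external bus for a single d×d FP16 weight matrix applied to one activation vector (the dimension is an assumption, picked as a typical hidden size, not a spec of any product):

```python
# Bytes crossing the external memory bus per GEMV, traditional vs. PIM
d = 4096                              # illustrative hidden size
weight_bytes = d * d * 2              # FP16 weight matrix: 32 MiB
vec_bytes = d * 2                     # FP16 activation vector: 8 KiB

traditional_traffic = weight_bytes + 2 * vec_bytes  # weights + vector in, result out
pim_traffic = 2 * vec_bytes                         # vector in, result out; weights never move
reduction = traditional_traffic / pim_traffic       # ~2049x less bus traffic
print(reduction)
```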

Shipping Products

SK Hynix AiM (Accelerator in Memory):

  • Commercial PIM processor based on GDDR6
  • Compute units per memory bank (AiMX card shipping)
  • Deployed in production environments
  • Specialized for GEMV (matrix-vector multiply)

Samsung LPDDR5X-PIM (announced Feb 2026):

  • In-memory compute in mobile LPDDR5X
  • Significant energy efficiency improvement for edge AI inference (industry estimates: several-fold)
  • Targeting smartphones and edge devices

Samsung/SK Hynix HBM4 plans:

  • Logic die integrated into HBM stack
  • Memory stack becomes a co-processor
  • Mass production from February 2026
  • Targeting NVIDIA's "Rubin" architecture

How PIM Changes LLM Inference

2025-2026 ArXiv papers propose concrete PIM × LLM architectures.

HPIM: Heterogeneous PIM Integration (arXiv:2509.12993)

HPIM (Heterogeneous PIM) Architecture:

SRAM-PIM (low latency):

  • Attention score computation
  • Small but ultra-fast
  • Equivalent to GPU L2 cache position

HBM-PIM (high bandwidth, large capacity):

  • KV cache storage and processing
  • Large capacity, medium speed
  • Equivalent to main memory position

Parallel execution of both:

  SRAM-PIM: attention scores ← low latency
  HBM-PIM: KV multiplication ← high bandwidth
  → Parallelizes serial dependencies in autoregressive Decode

This envisions PIM across the entire memory hierarchy. Both cache and main memory contribute computation, each leveraging its strengths.
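A toy latency model of that split (the microsecond figures are invented for illustration, and real HPIM scheduling has a more subtle dependency structure than a simple overlap):

```python
# Invented per-decode-step latencies, illustration only
sram_pim_us = 5     # attention-score work on SRAM-PIM: small data, low latency
hbm_pim_us = 20     # KV-cache multiply on HBM-PIM: large data, high bandwidth

serial_us = sram_pim_us + hbm_pim_us          # one device does both in sequence
overlapped_us = max(sram_pim_us, hbm_pim_us)  # HPIM overlaps independent work
speedup = serial_us / overlapped_us           # 1.25x in this toy case
print(speedup)
```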

PAM: Processing Across Memory Hierarchy (arXiv:2602.11521)

PAM (Processing Across Memory):

  • HBM-PIM: Hot data (frequent access)
  • DRAM-PIM: Warm data (moderate access)
  • SSD-PIM: Cold data (rare access)

→ Optimize processing location by data temperature
→ Handle long-context LLMs (100K+ tokens)

When the entire model doesn't fit in HBM (an everyday reality for us RTX 4060 8GB users), cross-hierarchy PIM could become an alternative to partial offloading.
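A minimal sketch of what a data-temperature placement policy could look like. The thresholds and the mapping of data kinds to tiers are assumptions for illustration, not the PAM paper's actual policy:

```python
def placement(accesses_per_sec: float) -> str:
    """Route data to a memory tier by access frequency (illustrative thresholds)."""
    if accesses_per_sec >= 100:
        return "HBM-PIM"    # hot: recent KV-cache pages, current layer weights
    if accesses_per_sec >= 1:
        return "DRAM-PIM"   # warm: mid-context KV cache
    return "SSD-PIM"        # cold: distant context in a 100K+ token window

print(placement(500), placement(10), placement(0.01))
```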

Three Reasons PIM Can't Kill GPUs

PIM is compelling, but it won't make GPUs unnecessary.

1. Training Is Off the Table

LLM training is compute-bound. Large matrix multiplications, gradient computation, parameter updates: these need the GPU's high compute density. PIM's in-memory units are good at matrix-vector products (GEMV) but are vastly outperformed by GPUs on matrix-matrix products (GEMM).

```python
# Training vs Inference workload characteristics
workload_characteristics = {
    "training": {
        "dominant_op": "GEMM (matrix-matrix multiply)",
        "arithmetic_intensity": "high (100+ FLOP/byte)",
        "bottleneck": "compute",
        "pim_advantage": "none (GEMM is GPU territory)",
    },
    "inference_prefill": {
        "dominant_op": "GEMM (batched)",
        "arithmetic_intensity": "medium-high",
        "bottleneck": "compute (batch-size dependent)",
        "pim_advantage": "limited",
    },
    "inference_decode": {
        "dominant_op": "GEMV (matrix-vector multiply)",
        "arithmetic_intensity": "low (1-2 FLOP/byte)",
        "bottleneck": "memory bandwidth",
        "pim_advantage": "★ significant",
    },
}
```

PIM's window of advantage is inference Decode only. Training and Prefill are GPU's domain.
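The intensity gap falls straight out of the operation shapes. A sketch of the standard back-of-envelope (FP16, assuming ideal reuse for GEMM):

```python
def gemv_intensity(n: int, bytes_per_el: int = 2) -> float:
    # y = A @ x: 2*n*n FLOPs, while the n*n matrix must be read once
    flops = 2 * n * n
    bytes_moved = n * n * bytes_per_el   # matrix traffic dominates
    return flops / bytes_moved           # constant: 1.0 FLOP/byte at FP16

def gemm_intensity(n: int, bytes_per_el: int = 2) -> float:
    # C = A @ B (all n x n): 2*n^3 FLOPs over three matrices moved once
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_el
    return flops / bytes_moved           # grows as n/3: reuse scales with size

print(gemv_intensity(4096))   # 1.0, stuck below the bandwidth roofline
print(gemm_intensity(4096))   # ~1365, deep in compute-bound territory
```

GEMV intensity is constant no matter how big the model gets; GEMM intensity grows with the matrix size, which is exactly why batched Prefill and training stay on the GPU.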

2. Immature Programming Model

Leveraging PIM requires explicit programming of data placement and compute mapping. No CUDA-equivalent exists.

GPU:

  • Mature software stack (CUDA: 15+ years of development)
  • Full framework support (PyTorch, TensorFlow, llama.cpp)
  • Massive developer ecosystem

PIM:

  • Vendor-specific APIs (SK Hynix AiM SDK, Samsung PIM SDK)
  • No framework integration
  • Small developer community
  • Manual memory-placement optimization required

Hardware without software is useless. CUDA made GPUs a compute platform. PIM needs its "CUDA moment." It hasn't happened yet.

3. Cost Structure

PIM costs more to manufacture than standard memory. Added die area for compute units, more complex testing, lower yields.

Standard HBM3E: ~$10-18/GB (2026 market estimates)
HBM-PIM: ~$20-30/GB (estimated, compute-unit premium)
GPU (A100): ~$10,000 (includes 80GB HBM2e)

PIM ROI condition: inference-server power-cost savings > PIM premium
→ May work at datacenter scale
→ Not relevant for individual users in the near term

Will PIM Reach RTX 4060 Users?

Honestly, consumer PIM is 3-5 years away.

Currently available PIM:

  • SK Hynix AiM: Datacenter only, not consumer-purchasable
  • Samsung LPDDR5X-PIM: Mobile only, not in PCs

Expected 2027-2028:

  • Possible PIM integration in HBM4-equipped GPUs
  • NVIDIA Rubin architecture adopts HBM4
  • Whether PIM features ship in consumer GPUs: unknown

Current practical solutions for consumers:

  1. Higher-bandwidth GPUs (RTX 5090: GDDR7 at ~1.8 TB/s)
  2. MoE models to reduce active parameters (lower bandwidth demand)
  3. Speculative decoding to improve effective bandwidth utilization
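On point 3, the reason speculative decoding helps with bandwidth: one target-model pass (one streaming of the weights) can verify several draft tokens at once. A toy expected-value sketch, assuming each draft token is accepted independently with a fixed probability (real acceptance rates vary by model pair):

```python
def tokens_per_weight_stream(draft_len: int, accept_prob: float) -> float:
    """Expected tokens produced per target-model pass with speculative decoding.

    Acceptance stops at the first rejection; a rejected position still yields
    one corrected token, hence the +1. Vanilla decode scores 1.0 here.
    """
    expected_accepted = sum(accept_prob ** i for i in range(1, draft_len + 1))
    return expected_accepted + 1.0

print(tokens_per_weight_stream(4, 0.8))  # ~3.36 tokens per weight stream
```

Under these assumptions, a 4-token draft with 80% acceptance amortizes each weight stream over roughly 3.4 tokens, tripling effective bandwidth utilization.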

Indirect benefits exist, though. As datacenter PIM adoption grows, cloud API inference costs drop. Energy efficiency improvements translate to lower per-token pricing.

Where PIM Redraws the Line

PIM won't kill GPUs. But it redraws the boundary of GPU work.

Current boundary:

  GPU = training + inference (everything)
  Memory = data storage only

Post-PIM boundary:

  GPU = training + Prefill (compute-bound)
  PIM = Decode (memory-bound)
  Memory = data storage + Decode computation

When Decode shifts to PIM, GPUs can specialize in Prefill and training. The idle-GPU problem during Decode disappears.

This also shifts semiconductor industry dynamics. Part of the inference market that NVIDIA monopolizes today moves to memory makers (Samsung, SK Hynix, Micron). TrendForce reported in March 2026 that Samsung and SK Hynix are exploring "next-gen AI memory that could challenge NVIDIA."

The Memory Wall Is Crumbling — From the Inside

Back to the opening question: "If memory could compute, would we still need GPUs?"

Yes, we need GPUs. Training and Prefill are non-negotiable. But Decode's lead actor may shift from GPU to PIM.

  • PIM fundamentally solves the memory bandwidth bottleneck for inference Decode

  • SK Hynix AiM is shipping, Samsung LPDDR5X-PIM is announced, HBM4 integrates logic dies

  • But: can't handle training, software stack is immature, carries a cost premium

  • Consumer PIM is 3-5 years out. MoE + Speculative Decoding remain the practical solution for now

The memory wall isn't being broken from the outside (more bandwidth). It's crumbling from the inside (putting computation in). That wave will take a while to reach individual GPUs, but in datacenters, it's already underway.

References

  • "Challenges and Research Directions for Large Language Model Inference Hardware" (2026) arXiv:2601.05047

  • "HPIM: Heterogeneous Processing-In-Memory-based Accelerator for LLM Inference" (2025) arXiv:2509.12993

  • "PAM: Processing Across Memory Hierarchy" (2026) arXiv:2602.11521

  • "Memory Is All You Need: Compute-in-Memory Architectures for LLM Inference" (2024) arXiv:2406.08413

  • TrendForce. "Beyond HBM: Samsung, SK hynix Explore Next-Gen AI Memory" (2026-03-10)

  • Samsung. "LPDDR5X-PIM for AI Computing" (2026-02)
