Gemma 4 is a KV_cache Pig
Ignoring the 8-bit size of Nvidia's marketed 4-bit quantization of the dense model… the dense model's KV cache architecture uses 3x or more memory than what I have seen with other models. It seems like the big choice was a 256 head dim instead of 128. I am looking at 490KB per token of 8-bit KV cache, versus 128KB on Qwen3. I am running the Nvidia weights at 4-bit on an RTX Pro 6000 with 96GB of VRAM and an 8-bit KV cache, and I still only have room for 115k tokens. I was just surprised is all. The model scales well in vLLM and seems quite smart. submitted by /u/IngeniousIdiocy
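The per-token numbers above follow from standard KV cache arithmetic: each layer stores a K and a V vector of size num_kv_heads × head_dim per token. A minimal sketch — the layer and KV-head counts below are illustrative assumptions, not confirmed specs for either model; only the 490KB and 115k-token figures come from the post itself:

```python
# Back-of-the-envelope KV cache sizing for multi-head / grouped-query
# attention. num_layers and num_kv_heads here are hypothetical values
# chosen for illustration, not actual model configurations.

def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 1) -> int:
    """Bytes of KV cache per token: K and V each hold
    num_kv_heads * head_dim elements per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Doubling head_dim from 128 to 256 doubles the per-token cache,
# all else being equal:
small = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
big = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=256)
print(big / small)  # 2.0

# Context budget implied by the post's own figures:
# 490KB per token of 8-bit KV cache, 115k tokens of room.
per_token = 490 * 1024
print(115_000 * per_token / 2**30)  # ~53.7 GiB of KV cache
```

The second print shows why the context fills up so fast: at 490KB/token, the 115k-token limit alone accounts for roughly 54GiB of the 96GB card, before weights and activations.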