Gemma 4 is a KV_cache Pig
Ignoring the 8-bit size of Nvidia's marketed 4-bit quantization of the dense model… the dense model's KV cache architecture uses 3x or more memory than what I have seen with other models. It seems like the big choice was a 256 head dim instead of 128. I am looking at 490KB per token of 8-bit KV cache, versus 128KB on Qwen3. I am running the Nvidia weights at 4-bit on an RTX Pro 6000 with 96GB of VRAM and an 8-bit KV cache, and I still only have room for 115k tokens. I was just surprised is all. The model scales well in vLLM and seems quite smart. submitted by /u/IngeniousIdiocy
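The per-token numbers above follow from standard KV cache arithmetic: each layer stores a K and a V vector of size num_kv_heads × head_dim per token. A minimal sketch — the layer and KV-head counts below are illustrative assumptions, not confirmed specs for either model; only the 490KB and 115k-token figures come from the post itself:

```python
# Back-of-the-envelope KV cache sizing for multi-head / grouped-query
# attention. num_layers and num_kv_heads here are hypothetical values
# chosen for illustration, not actual model configurations.

def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 1) -> int:
    """Bytes of KV cache per token: K and V each hold
    num_kv_heads * head_dim elements per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Doubling head_dim from 128 to 256 doubles the per-token cache,
# all else being equal:
small = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
big = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=256)
print(big / small)  # 2.0

# Context budget implied by the post's own figures:
# 490KB per token of 8-bit KV cache, 115k tokens of room.
per_token = 490 * 1024
print(115_000 * per_token / 2**30)  # ~53.7 GiB of KV cache
```

The second print shows why the context fills up so fast: at 490KB/token, the 115k-token limit alone accounts for roughly 54GiB of the 96GB card, before weights and activations.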