VRAM optimization for Gemma 4

Reddit r/LocalLLaMA, by /u/Sadman782 (https://www.reddit.com/user/Sadman782), April 3, 2026

TLDR: if you are the only user, add -np 1 to your llama.cpp launch command. It cuts SWA cache VRAM by 3x instantly.

So I was messing around with Gemma 4 and noticed the dense model hogs a massive chunk of VRAM before you even start generating anything. If you are on 16GB you might be hitting OOM and wondering why.

The culprit is the SWA (Sliding Window Attention) KV cache. It allocates in F16 and does not get quantized like the rest of the KV cache. A couple of days ago ggerganov merged a PR that accidentally made this worse by keeping the SWA portion unquantized even when you have KV cache quantization enabled. It got reverted about 2 hours later (https://github.com/ggml-org/llama.cpp/pull/21332), so make sure you are on a recent build.

A few things that actually help with VRAM:

The SWA cache s

Could not retrieve the full article text.
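
For reference, the TLDR tip could look roughly like this on the command line. This is only a sketch: it assumes you launch through llama-server, and the model path ./model.gguf is a placeholder, not something from the post. The script prints the command instead of running it so you can check it against your own setup first.

```shell
#!/bin/sh
# Placeholder model path (an assumption, not from the post); point it at your own GGUF file.
MODEL="./model.gguf"

# -np 1 asks for a single parallel slot, which is the single-user setting the post recommends.
LAUNCH_CMD="llama-server -m $MODEL -np 1"

# Print the command for review; copy it into your terminal to actually start the server.
echo "$LAUNCH_CMD"
```

If you already pass other flags (context size, GPU layers, KV cache types), keep them and just append -np 1.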

Original source

Reddit r/LocalLLaMA

https://www.reddit.com/r/LocalLLaMA/comments/1sb80yv/vram_optimization_for_gemma_4/
