A Quick Note on Gemma 4 Image Settings in Llama.cpp
In my last post, I mentioned using --image-min-tokens to increase the quality of image responses from Qwen3.5. I went to load Gemma 4 the same way, and hit an error:
```
[58175] srv  process_chun: processing image...
[58175] encoding image slice...
[58175] image slice encoded in 7490 ms
[58175] decoding image batch 1/2, n_tokens_batch = 2048
[58175] /Users/socg/llama.cpp-b8639/src/llama-context.cpp:1597: GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens") failed
[58175] WARNING: Using native backtrace. Set GGML_BACKTRACE_LLDB for more info.
[58175] WARNING: GGML_BACKTRACE_LLDB may cause native MacOS Terminal.app to crash.
[58175] See: https://github.com/ggml-org/llama.cpp/pull/17869
[58175] 0   libggml-base.0.9.11.dylib  0x0000000103a6136c ggml_print_backtrace + 276
[58175] 1   libggml-base.0.9.11.dylib  0x0000000103a61558 ggml_abort + 156
[58175] 2   libllama.0.0.0.dylib       0x0000000103eacd70 _ZN13llama_context6decodeERK11llama_batch + 5484
[58175] 3   libllama.0.0.0.dylib       0x0000000103eb098c llama_decode + 20
[58175] 4   libmtmd.0.0.0.dylib        0x0000000103b8f7e8 mtmd_helper_decode_image_chunk + 948
[58175] 5   libmtmd.0.0.0.dylib        0x0000000103b8fea4 mtmd_helper_eval_chunk_single + 536
[58175] 6   llama-server               0x0000000102fb4d94 _ZNK13server_tokens13process_chunkEP13llama_contextP12mtmd_contextmiiRm + 256
[58175] 7   llama-server               0x0000000102fe3318 _ZN19server_context_impl12update_slotsEv + 8396
[58175] 8   llama-server               0x0000000102faaca0 _ZN12server_queue10start_loopEx + 504
[58175] 9   llama-server               0x0000000102f3a610 main + 14376
[58175] 10  dyld                       0x00000001968edd54 start + 7184
srv  operator(): http client error: Failed to read connection
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
srv  operator(): instance name=gemma-4-31B-it-UD-Q8_K_XL exited with status 1
```
As you can see, the crash comes from the fact that I'm not setting the ubatch size:
```
[58175] /Users/socg/llama.cpp-b8639/src/llama-context.cpp:1597: GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens") failed
```
Gemma 4's vision encoder uses non-causal attention for image tokens, which means all of an image's tokens have to fit within a single ubatch. Since I told it the image batch had to be at least 2048 tokens, and ubatch defaults to 512, the assertion fails.
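The constraint is just the assertion's arithmetic, so here's a minimal sketch of it in shell. The variable names are my own stand-ins, not llama.cpp settings; the values are the ones from my failing run:

```shell
# The GGML_ASSERT requires: n_ubatch >= number of image tokens decoded
# in one non-causal batch. Plug in the values from the failing run:
IMAGE_TOKENS=2048   # what my --image-min-tokens setting forced per batch
UBATCH=512          # the llama-server default ubatch size
if [ "$UBATCH" -ge "$IMAGE_TOKENS" ]; then
  echo "ok: ubatch covers the image batch"
else
  echo "assert will fire: raise --ubatch-size to at least $IMAGE_TOKENS"
fi
```

With these numbers the check fails, which is exactly the crash above; either the image token count comes down or ubatch goes up.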
First, we need to make sure the model actually supports going that high. If we peek over at Unsloth's page, we'll see that's not the case:
Gemma 4 supports multiple visual token budgets:

- 70
- 140
- 280
- 560
- 1120

Use them like this:

- 70 / 140: classification, captioning, fast video understanding
- 280 / 560: general multimodal chat, charts, screens, UI reasoning
- 1120: OCR, document parsing, handwriting, small text
So our max is actually 1120 here. For my case, I'm going to set --image-min-tokens and --image-max-tokens both to 1120, and then bump the batch and ubatch sizes up to 2048.
```
./llama-server -ngl 200 --ctx-size 65535 --models-dir /Users/socg/models --models-max 1 --port 5001 --host 0.0.0.0 --jinja --image-min-tokens 1120 --image-max-tokens 1120 --ubatch-size 2048 --batch-size 2048
```
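Once the server is up, a quick smoke test is to POST an image to the OpenAI-compatible endpoint. This is just a sketch: it assumes the server is listening on localhost:5001 as in my command, and photo.png is a placeholder for whatever image you want to try.

```shell
# Hypothetical smoke test against the llama-server instance started above.
# Assumes localhost:5001 and a local photo.png; both are placeholders.
IMG_B64=$(base64 < photo.png | tr -d '\n')
curl -s http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,'"$IMG_B64"'"}}
      ]
    }]
  }'
```

If the batch settings are right, this returns a normal chat completion instead of the 500 and GGML_ASSERT crash from before.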