
A Quick Note on Gemma 4 Image Settings in Llama.cpp

DEV Community · by SomeOddCodeGuy · April 3, 2026 · 3 min read

In my last post, I mentioned using --image-min-tokens to increase the quality of image responses from Qwen3.5. I went to load Gemma 4 the same way, and hit an error:

[58175] srv process_chun: processing image...
[58175] encoding image slice...
[58175] image slice encoded in 7490 ms
[58175] decoding image batch 1/2, n_tokens_batch = 2048
[58175] /Users/socg/llama.cpp-b8639/src/llama-context.cpp:1597: GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens") failed
[58175] WARNING: Using native backtrace. Set GGML_BACKTRACE_LLDB for more info.
[58175] WARNING: GGML_BACTRACE_LLDB may cause native MacOS Terminal.app to crash.
[58175] See: https://github.com/ggml-org/llama.cpp/pull/17869
[58175]  0  libggml-base.0.9.11.dylib  0x0000000103a6136c ggml_print_backtrace + 276
[58175]  1  libggml-base.0.9.11.dylib  0x0000000103a61558 ggml_abort + 156
[58175]  2  libllama.0.0.0.dylib       0x0000000103eacd70 _ZN13llama_context6decodeERK11llama_batch + 5484
[58175]  3  libllama.0.0.0.dylib       0x0000000103eb098c llama_decode + 20
[58175]  4  libmtmd.0.0.0.dylib        0x0000000103b8f7e8 mtmd_helper_decode_image_chunk + 948
[58175]  5  libmtmd.0.0.0.dylib        0x0000000103b8fea4 mtmd_helper_eval_chunk_single + 536
[58175]  6  llama-server               0x0000000102fb4d94 _ZNK13server_tokens13process_chunkEP13llama_contextP12mtmd_contextmiiRm + 256
[58175]  7  llama-server               0x0000000102fe3318 _ZN19server_context_impl12update_slotsEv + 8396
[58175]  8  llama-server               0x0000000102faaca0 _ZN12server_queue10start_loopEx + 504
[58175]  9  llama-server               0x0000000102f3a610 main + 14376
[58175] 10  dyld                       0x00000001968edd54 start + 7184
srv  operator(): http client error: Failed to read connection
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
srv  operator(): instance name=gemma-4-31B-it-UD-Q8_K_XL exited with status 1


As you can see, the crash comes down to one failed assertion, and the cause is that I'm not setting ubatch:

[58175] /Users/socg/llama.cpp-b8639/src/llama-context.cpp:1597: GGML_ASSERT((cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens") failed


The reason is that Gemma 4's vision encoder uses non-causal attention for image tokens, which means all of the image tokens have to fit within a single ubatch. Since I told it to use at least 2048 image tokens, and ubatch defaults to 512, the assert fires.
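Spelled out, the condition the assert enforces is simple. Here's a sketch of the logic in shell (not actual llama.cpp code; the numbers come from the log above):

```shell
# Sketch of the GGML_ASSERT condition: with non-causal attention,
# the micro-batch must hold every image token in the chunk at once.
causal_attn=0       # Gemma 4's vision encoder is non-causal
n_ubatch=512        # llama.cpp default --ubatch-size
n_tokens_all=2048   # image tokens in the batch, per the log above

if [ "$causal_attn" -eq 0 ] && [ "$n_ubatch" -lt "$n_tokens_all" ]; then
  echo "assert fires: n_ubatch < n_tokens"
fi
```

So the fix is just to raise --ubatch-size until the image tokens fit.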

First, we need to make sure the model actually supports going that high. A peek at Unsloth's model page shows that it doesn't:

Gemma 4 supports multiple visual token budgets:

  • 70

  • 140

  • 280

  • 560

  • 1120

Use them like this:

  • 70 / 140: classification, captioning, fast video understanding

  • 280 / 560: general multimodal chat, charts, screens, UI reasoning

  • 1120: OCR, document parsing, handwriting, small text

So our max is actually 1120 here. For my case, I'm going to set --image-min-tokens and --image-max-tokens both to 1120, and then buffer the batch and ubatch up to 2048.

./llama-server -ngl 200 --ctx-size 65535 --models-dir /Users/socg/models --models-max 1 --port 5001 --host 0.0.0.0 --jinja --image-min-tokens 1120 --image-max-tokens 1120 --ubatch-size 2048 --batch-size 2048

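More generally, whatever visual token budget you pick, --ubatch-size just needs to be at least that large. A tiny sketch of that sizing rule (the doubling-from-512 is my habit for keeping round numbers, not a llama.cpp requirement):

```shell
# Pick an ubatch size satisfying n_ubatch >= image tokens per chunk,
# doubling from the 512 default until the chosen budget fits.
budget=1120   # Gemma 4's max visual token budget
ubatch=512    # llama.cpp default
while [ "$ubatch" -lt "$budget" ]; do
  ubatch=$((ubatch * 2))
done
echo "--ubatch-size $ubatch"
```

For the 1120 budget this lands on 2048, matching the command above.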
