🔥 google-deepmind/gemma
Gemma open-weight LLM library from Google DeepMind — trending on GitHub today with 45 new stars.
Gemma is a family of open-weights Large Language Models (LLMs) by Google DeepMind, based on Gemini research and technology.
This repository contains the implementation of the gemma PyPI package, a JAX library for using and fine-tuning Gemma.
For examples and use cases, see our documentation. Please report issues and feedback on our GitHub issue tracker.
Installation
- Install JAX for CPU, GPU, or TPU by following the instructions on the JAX website.
- Run `pip install gemma`.
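After installing, you can check that the package imports and that JAX sees the expected device. A minimal sketch (the printed device list depends on which JAX build you installed):

```python
import jax
from gemma import gm  # confirms the gemma package is importable

# Lists the devices JAX will use, e.g. [CpuDevice(id=0)] or one entry per GPU/TPU.
print(jax.devices())
```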
Examples
Here is a minimal example to have a multi-turn, multi-modal conversation with Gemma:
```python
from gemma import gm

# Model and parameters
model = gm.nn.Gemma3_4B()
params = gm.ckpts.load_params(gm.ckpts.CheckpointPath.GEMMA3_4B_IT)

# Example of multi-turn conversation
sampler = gm.text.ChatSampler(
    model=model,
    params=params,
    multi_turn=True,
)

prompt = """Which of the two images do you prefer?

Image 1: <start_of_image>
Image 2: <start_of_image>

Write your answer as a poem."""
out0 = sampler.chat(prompt, images=[image1, image2])

out1 = sampler.chat('What about the other image?')
```
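The sampler also works for plain text. Here is a minimal text-only sketch that reuses only the calls from the example above (`gm.nn.Gemma3_4B`, `gm.ckpts.load_params`, `gm.text.ChatSampler`) and assumes no additional API beyond what is shown there:

```python
from gemma import gm

# Same model and instruction-tuned checkpoint as in the example above.
model = gm.nn.Gemma3_4B()
params = gm.ckpts.load_params(gm.ckpts.CheckpointPath.GEMMA3_4B_IT)

# Text-only chat: no images are passed to chat().
sampler = gm.text.ChatSampler(model=model, params=params)
reply = sampler.chat('Give me three facts about the JAX ecosystem.')
print(reply)
```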
Our documentation contains various Colabs and tutorials, including:
- Sampling
- Multi-modal
- Fine-tuning
- LoRA (a generic background sketch follows this list)
- ...
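For background on the LoRA tutorial topic: LoRA freezes a pretrained weight matrix W and learns a low-rank update B @ A, so the effective weight becomes W + (alpha / r) * B @ A. The sketch below is a generic JAX illustration of that idea; the function and parameter names (`lora_linear`, `rank`, `alpha`) are placeholders, and it is not the API used by the gemma library or its tutorials:

```python
import jax
import jax.numpy as jnp

def init_lora(key, d_in, d_out, rank=8):
    # A gets small random values and B starts at zero, so the initial
    # update B @ A is zero and the frozen model is unchanged at step 0.
    a = jax.random.normal(key, (rank, d_in)) * 0.01
    b = jnp.zeros((d_out, rank))
    return {'a': a, 'b': b}

def lora_linear(x, w_frozen, lora, alpha=16.0, rank=8):
    # Frozen base projection plus the scaled low-rank update.
    base = x @ w_frozen.T
    update = (x @ lora['a'].T) @ lora['b'].T
    return base + (alpha / rank) * update

# Toy shapes: a batch of 4 vectors with 256 features projected to 512.
key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (512, 256)) * 0.02
lora = init_lora(key, d_in=256, d_out=512)
x = jnp.ones((4, 256))
y = lora_linear(x, w, lora)  # shape (4, 512)
```

During fine-tuning, only `lora['a']` and `lora['b']` would receive gradients while `w_frozen` stays fixed, which is what keeps the method memory-cheap.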
Additionally, the examples/ folder contains scripts to fine-tune and sample with Gemma.
Learn more about Gemma
- To use this library: Gemma documentation
- Technical reports for metrics and model capabilities: Gemma 1, Gemma 2, Gemma 3
- Other Gemma implementations and documentation on the Gemma ecosystem
Downloading the models
To download the model weights, see our documentation.
System Requirements
Gemma can run on CPU, GPU, and TPU. For GPU, we recommend 8GB+ of GPU RAM for the 2B checkpoint and 24GB+ of GPU RAM for the 7B checkpoint.
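Those figures are roughly consistent with a back-of-the-envelope estimate: weight memory is the parameter count times the bytes per parameter (about 2 bytes in bfloat16), before activations and the KV cache are counted. A small sketch of that arithmetic (the byte count and the idea of extra runtime overhead are assumptions, not measured figures):

```python
def rough_param_memory_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# 2B checkpoint in bfloat16: ~3.7 GiB of weights, leaving headroom within the
# 8GB+ guidance for activations, the KV cache, and framework overhead.
print(rough_param_memory_gb(2))  # ~3.7
print(rough_param_memory_gb(7))  # ~13.0
```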
Contributing
We welcome contributions! Please read our Contributing Guidelines before submitting a pull request.
This is not an official Google product.