🔥 NVIDIA/Model-Optimizer
A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM, TensorRT, vLLM, etc. to optimize inference speed. — Trending on GitHub today with 25 new stars.
NVIDIA Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.
[Input] Model Optimizer currently accepts Hugging Face, PyTorch, or ONNX models as input.
[Optimize] Model Optimizer provides Python APIs that let users compose the above model optimization techniques and export an optimized, quantized checkpoint. Model Optimizer is also integrated with NVIDIA Megatron-Bridge, Megatron-LM, and Hugging Face Accelerate for inference optimization techniques that require training.
[Export for deployment] Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like SGLang, TensorRT-LLM, TensorRT, or vLLM. The unified Hugging Face export API now supports both transformers and diffusers models.
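To make the quantization step concrete, here is a toy sketch of what post-training quantization does at its core: map float weights to low-precision integers with a calibrated scale, then dequantize at use time. This is plain illustrative Python, not the ModelOpt API; the function names are hypothetical.

```python
# Toy illustration of symmetric per-tensor INT8 post-training quantization.
# NOT the ModelOpt API -- just the core idea behind the [Optimize] step.

def quantize_int8(weights):
    """Map float weights to INT8 values plus a single scale factor."""
    amax = max(abs(w) for w in weights)          # calibration: absolute max
    scale = amax / 127.0 if amax > 0 else 1.0    # map [-amax, amax] -> [-127, 127]
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.5, 0.951]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
# Each recovered weight is within one quantization step of the original.
assert all(abs(a - b) <= scale for a, b in zip(weights, recovered))
```

Storing 8-bit values instead of 32-bit floats is where the 2x-4x compression in the table below comes from; real PTQ pipelines add per-channel or per-block scales and smarter calibration than a plain absolute max.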
Latest News
- [2026/03/11] Model Optimizer quantized Nemotron-3-Super checkpoints are available on Hugging Face for download: FP8, NVFP4. Learn more in the Nemotron 3 Super release blog. Check out how to quantize Nemotron 3 models for deployment acceleration here.
- [2026/03/11] NeMo Megatron Bridge now supports Nemotron-3-Super quantization (PTQ and QAT) and export workflows using the Model Optimizer library. See the Quantization (PTQ and QAT) guide for FP8/NVFP4 quantization and HF export instructions.
- [2025/12/11] BLOG: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference
- [2025/12/08] NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer.
- [2025/10/07] BLOG: Pruning and Distilling LLMs Using NVIDIA Model Optimizer
- [2025/09/17] BLOG: An Introduction to Speculative Decoding for Reducing Latency in AI Inference
- [2025/09/11] BLOG: How Quantization Aware Training Enables Low-Precision Accuracy Recovery
- [2025/08/29] BLOG: Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training
- [2025/08/01] BLOG: Optimizing LLMs for Performance and Accuracy with Post-Training Quantization
- [2025/06/24] BLOG: Introducing NVFP4 for Efficient and Accurate Low-Precision Inference
- [2025/05/14] NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series GPUs
- [2025/04/21] Adobe optimized deployment using Model Optimizer + TensorRT, leading to a 60% reduction in diffusion latency and a 40% reduction in total cost of ownership.
- [2025/04/05] NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick. Check out how to quantize Llama 4 for deployment acceleration here.
- [2025/03/18] World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell
- [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: DeepSeek-R1-FP4, Llama-3.3-70B-Instruct-FP4, Llama-3.1-405B-Instruct-FP4
- [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ here.
- [2025/01/28] Model Optimizer is now open source!
Previous News
- [2024/10/23] Model Optimizer quantized FP8 Llama-3.1 Instruct models available on Hugging Face for download: 8B, 70B, 405B.
- [2024/09/10] Post-Training Quantization of LLMs with NVIDIA NeMo and Model Optimizer.
- [2024/08/28] Boosting Llama 3.1 405B Performance up to 44% with Model Optimizer on NVIDIA H200 GPUs
- [2024/08/28] Up to 1.9X Higher Llama 3.1 Performance with Medusa
- [2024/08/15] New features in recent releases: Cache Diffusion, QLoRA workflow with NVIDIA NeMo, and more. Check out our blog for details.
- [2024/06/03] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow here.
- [2024/05/08] Announcement: Model Optimizer Now Formally Available to Further Accelerate GenAI Inference Performance
- [2024/03/27] Model Optimizer supercharges TensorRT-LLM to set MLPerf LLM inference records
- [2024/03/18] GTC Session: Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT
- [2024/03/07] Model Optimizer's 8-bit Post-Training Quantization enables TensorRT to accelerate Stable Diffusion to nearly 2x faster
- [2024/02/01] Speed up inference with Model Optimizer quantization techniques in TRT-LLM
Install
To install stable release packages for Model Optimizer with pip from PyPI:
pip install -U nvidia-modelopt[all]
To install from source in editable mode with all development dependencies or to use the latest features, run:
# Clone the Model Optimizer repository
git clone git@github.com:NVIDIA/Model-Optimizer.git
cd Model-Optimizer

# Install in editable mode with development dependencies
pip install -e .[dev]
You can also directly use the TensorRT-LLM docker images (e.g., nvcr.io/nvidia/tensorrt-llm/release:), which have Model Optimizer pre-installed. Make sure to upgrade Model Optimizer to the latest version as described above. Visit our installation guide for finer-grained control over installed dependencies, or for alternative docker images and environment variables to set up.
Techniques
| Technique | Description | Examples | Docs |
| --- | --- | --- | --- |
| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | [LLMs] [diffusers] [VLMs] [onnx] [windows] | [docs] |
| Quantization Aware Training | Refine accuracy even further with a few training steps! | [Hugging Face] | [docs] |
| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | [General] [Megatron-Bridge] | |
| Distillation | Reduce deployment model size by teaching small models to behave like larger models! | [Megatron-Bridge] [Megatron-LM] [Hugging Face] | [docs] |
| Speculative Decoding | Train draft modules to predict extra tokens during inference! | [Megatron] [Hugging Face] | [docs] |
| Sparsity | Efficiently compress your model by storing only its non-zero parameter values and their locations | [PyTorch] | [docs] |
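As a concrete illustration of the sparsity idea above, here is a toy sketch of 2:4 structured sparsity (the pattern accelerated by NVIDIA GPUs): in every group of four weights, keep the two with the largest magnitude, and store only the surviving values and their locations. This is plain illustrative Python with hypothetical helper names, not the ModelOpt API.

```python
# Toy illustration of 2:4 structured sparsity, where each group of four
# weights keeps only its two largest-magnitude entries. Illustrative only;
# not the ModelOpt API.

def prune_2_4(weights):
    """Zero out the two smallest-magnitude weights in each group of four."""
    pruned = list(weights)
    for start in range(0, len(weights), 4):
        group = list(range(start, min(start + 4, len(weights))))
        # indices of this group sorted by magnitude, smallest first
        by_mag = sorted(group, key=lambda i: abs(weights[i]))
        for i in by_mag[: max(0, len(group) - 2)]:  # drop all but the top two
            pruned[i] = 0.0
    return pruned

def compress(pruned):
    """Store only non-zero values and their locations as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(pruned) if v != 0.0]

w = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.7, 0.01]
sparse = prune_2_4(w)      # [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.7, 0.0]
packed = compress(sparse)  # only half the values need to be stored
assert len(packed) == len(w) // 2
```

Real 2:4 kernels store the surviving values densely plus a compact per-group position mask rather than explicit (index, value) pairs, but the storage and compute savings come from the same guarantee: at most two non-zeros per group of four.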
Pre-Quantized Checkpoints
- Ready-to-deploy checkpoints [🤗 Hugging Face - Nvidia Model Optimizer Collection]
- Deployable on TensorRT-LLM, vLLM, and SGLang
- More models coming soon!
Resources
- 📅 Roadmap
- 📖 Documentation
- 🎯 Benchmarks
- 💡 Release Notes
- 🐛 File a bug
- ✨ File a Feature Request
Model Support Matrix
| Model Type | Support Matrix |
| --- | --- |
| LLM Quantization | View Support Matrix |
| Diffusers Quantization | View Support Matrix |
| VLM Quantization | View Support Matrix |
| ONNX Quantization | View Support Matrix |
| Windows Quantization | View Support Matrix |
| Quantization Aware Training | View Support Matrix |
| Pruning | View Support Matrix |
| Distillation | View Support Matrix |
| Speculative Decoding | View Support Matrix |
Contributing
Model Optimizer is now open source! We welcome any feedback, feature requests, and PRs. Please read our Contributing guidelines for details on how to contribute to this project.
Top Contributors
Happy optimizing!