
Commit 74b11ed

Merge branch 'main' of github.com:NVIDIA/Model-Optimizer into chenjiel/moe2
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
2 parents df8fde3 + 1dc890d commit 74b11ed

125 files changed

Lines changed: 8564 additions & 1735 deletions


.github/workflows/unit_tests.yml

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ jobs:
     timeout-minutes: 30
     strategy:
       matrix:
-        py: [10, 11]
+        py: [10, 11, 13]
     steps:
       - uses: actions/checkout@v6
       - uses: ./.github/actions/ubuntu-setup

CHANGELOG.rst

Lines changed: 13 additions & 3 deletions
@@ -8,26 +8,36 @@ NVIDIA Model Optimizer Changelog
 
 - ONNX Runtime dependency upgraded to 1.24 to solve missing graph outputs when using the TensorRT Execution Provider.
 
+**Backward Breaking Changes**
+
+- The default ``--kv_cache_qformat`` in ``hf_ptq.py`` changed from ``fp8`` to ``fp8_cast``. Existing scripts that rely on the default will now skip KV cache calibration and use a constant amax instead. To restore the previous calibrated behavior, explicitly pass ``--kv_cache_qformat fp8``.
+- Removed KV cache scale clamping (``clamp_(min=1.0)``) in the HF checkpoint export path. Calibrated KV cache scales below 1.0 are now exported as-is. If you observe accuracy degradation with calibrated KV cache (``--kv_cache_qformat fp8`` or ``nvfp4``), consider using the casting methods (``fp8_cast`` or ``nvfp4_cast``) instead.
+
 **New Features**
 
+- Add ``fp8_cast`` and ``nvfp4_cast`` modes for ``--kv_cache_qformat`` in ``hf_ptq.py``. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new ``use_constant_amax`` field in :class:`QuantizerAttributeConfig <modelopt.torch.quantization.config.QuantizerAttributeConfig>` controls this behavior.
 - Users no longer need to manually register MoE modules to ensure expert calibration coverage in the PTQ workflow.
 - ``hf_ptq.py`` now saves the quantization summary and MoE expert token count table to the export directory.
-- Add ``--moe_calib_experts_ratio`` flag in ``hf_ptq.py`` to specify the ratio of experts to calibrate during forward pass to improve expert coverage during calibration. Default to all the experts.
+- Add ``--moe_calib_experts_ratio`` flag in ``hf_ptq.py`` to specify the ratio of experts to calibrate during the forward pass to improve expert coverage during calibration. Defaults to ``None`` (not enabled).
 - Add sparse attention optimization for transformer models (``modelopt.torch.sparsity.attention_sparsity``). This reduces computational cost by skipping attention computation. Supports calibration for threshold selection on HuggingFace models. See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
 - Add support for rotating the input before quantization for RHT.
 - Add support for advanced weight scale search for NVFP4 quantization and its export path.
 - Enable PTQ workflow for Qwen3.5 MoE models.
+- Enable PTQ workflow for the Kimi-K2.5 model.
 - Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
 - ``pass_through_bwd`` in the quantization config now defaults to ``True``. Set it to ``False`` if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
 - Add :meth:`compute_quantization_mse <modelopt.torch.quantization.model_quant.compute_quantization_mse>` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
-- **AutoQDQ**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the AutoQDQ guide in the documentation.
+- **Autotune**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the Autotune guide in the documentation.
 - Add ``get_auto_quantize_config`` API to extract a flat quantization config from ``auto_quantize`` search results, enabling re-quantization at different effective bit targets without re-running calibration.
 - Improve ``auto_quantize`` checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
-- Add NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules.
+- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization, and add NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules.
+- Add support for block-granular RHT for non-power-of-2 dimensions.
+- Replace modelopt FP8 QDQ nodes with native ONNX QDQ nodes.
 
 **Misc**
 
 - Migrated project metadata from ``setup.py`` to a fully declarative ``pyproject.toml``.
+- Enable experimental Python 3.13 wheel support and unit tests in CI/CD.
 
 0.42 (2026-02-xx)
 ^^^^^^^^^^^^^^^^^
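
The constant-amax behavior behind the new ``fp8_cast`` / ``nvfp4_cast`` KV cache modes in the changelog above can be summarized in a few lines: with amax fixed at the FP8 E4M3 maximum (448.0), the resulting scale is 1.0 and no calibration data is needed. The snippet below is a minimal standalone sketch of that arithmetic, assuming the common ``scale = amax / 448`` convention; it is not the ModelOpt implementation.

```python
import torch

FP8_E4M3_MAX = 448.0  # constant amax used by the fp8_cast / nvfp4_cast modes


def fp8_scale_from_amax(amax: float) -> float:
    # Illustrative convention only: x is quantized as clamp(x / scale) in FP8 E4M3.
    return amax / FP8_E4M3_MAX


# Calibrated path (fp8 / nvfp4): amax comes from observed KV-cache activations.
kv_sample = torch.randn(4, 128) * 3.0
calibrated_scale = fp8_scale_from_amax(kv_sample.abs().max().item())

# Cast path (fp8_cast / nvfp4_cast): amax is the FP8 E4M3 max itself, so the
# scale degenerates to 1.0 and no data-driven calibration is required.
cast_scale = fp8_scale_from_amax(FP8_E4M3_MAX)

print(f"calibrated scale: {calibrated_scale:.4f}, cast scale: {cast_scale:.4f}")
```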

README.md

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.
 
 ## Latest News
 
+- [2026/03/11] Model Optimizer quantized Nemotron-3-Super checkpoints are available on Hugging Face for download: [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4). Learn more in the [Nemotron 3 Super release blog](https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/). Check out how to quantize Nemotron 3 models for deployment acceleration [here](./examples/llm_ptq/README.md).
+- [2026/03/11] [NeMo Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) now supports Nemotron-3-Super quantization (PTQ and QAT) and export workflows using the Model Optimizer library. See the [Quantization (PTQ and QAT) guide](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/super-v3/docs/models/llm/nemotron3-super.md#quantization-ptq-and-qat) for FP8/NVFP4 quantization and HF export instructions.
 - [2025/12/11] [BLOG: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference](https://developer.nvidia.com/blog/top-5-ai-model-optimization-techniques-for-faster-smarter-inference/)
 - [2025/12/08] NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer.
 - [2025/10/07] [BLOG: Pruning and Distilling LLMs Using NVIDIA Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)

docs/source/getting_started/_installation_for_Linux.rst

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Latest Model Optimizer (``nvidia-modelopt``) currently has the following system
 +-------------------------+-----------------------------+
 | Architecture            | x86_64, aarch64 (SBSA)      |
 +-------------------------+-----------------------------+
-| Python                  | >=3.10,<3.13                |
+| Python                  | >=3.10,<3.14                |
 +-------------------------+-----------------------------+
 | CUDA                    | 12.x, 13.x                  |
 +-------------------------+-----------------------------+
Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 ===============================================
-Automated Q/DQ Placement Optimization (ONNX)
+Autotune (ONNX)
 ===============================================
 
 .. contents:: Table of Contents
@@ -9,7 +9,7 @@ Automated Q/DQ Placement Optimization (ONNX)
 Overview
 ========
 
-The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.
+The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement optimization in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.
 
 **Key Features:**
 
examples/cnn_qat/README.md

Lines changed: 1 addition & 1 deletion
@@ -143,4 +143,4 @@ Your actual results will vary based on the dataset, specific hyperparameters, an
 
 ## Deployment with TensorRT
 
-The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying a ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.
+The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying an ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.

examples/diffusers/quantization/models_utils.py

Lines changed: 20 additions & 2 deletions
@@ -25,6 +25,11 @@
     StableDiffusion3Pipeline,
     WanPipeline,
 )
+
+try:
+    from diffusers import Flux2Pipeline
+except ImportError:
+    Flux2Pipeline = None
 from utils import (
     filter_func_default,
     filter_func_flux_dev,
@@ -42,6 +47,7 @@ class ModelType(str, Enum):
     SD35_MEDIUM = "sd3.5-medium"
     FLUX_DEV = "flux-dev"
     FLUX_SCHNELL = "flux-schnell"
+    FLUX2_DEV = "flux2-dev"
     LTX_VIDEO_DEV = "ltx-video-dev"
     LTX2 = "ltx-2"
     WAN22_T2V_14b = "wan2.2-t2v-14b"
@@ -61,6 +67,7 @@ def get_model_filter_func(model_type: ModelType) -> Callable[[str], bool]:
     filter_func_map = {
         ModelType.FLUX_DEV: filter_func_flux_dev,
         ModelType.FLUX_SCHNELL: filter_func_default,
+        ModelType.FLUX2_DEV: filter_func_flux_dev,
         ModelType.SDXL_BASE: filter_func_default,
         ModelType.SDXL_TURBO: filter_func_default,
         ModelType.SD3_MEDIUM: filter_func_default,
@@ -82,6 +89,7 @@ def get_model_filter_func(model_type: ModelType) -> Callable[[str], bool]:
     ModelType.SD35_MEDIUM: "stabilityai/stable-diffusion-3.5-medium",
     ModelType.FLUX_DEV: "black-forest-labs/FLUX.1-dev",
     ModelType.FLUX_SCHNELL: "black-forest-labs/FLUX.1-schnell",
+    ModelType.FLUX2_DEV: "black-forest-labs/FLUX.2-dev",
     ModelType.LTX_VIDEO_DEV: "Lightricks/LTX-Video-0.9.7-dev",
     ModelType.LTX2: "Lightricks/LTX-2",
     ModelType.WAN22_T2V_14b: "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
@@ -95,6 +103,7 @@ def get_model_filter_func(model_type: ModelType) -> Callable[[str], bool]:
     ModelType.SD35_MEDIUM: StableDiffusion3Pipeline,
     ModelType.FLUX_DEV: FluxPipeline,
     ModelType.FLUX_SCHNELL: FluxPipeline,
+    ModelType.FLUX2_DEV: Flux2Pipeline,
     ModelType.LTX_VIDEO_DEV: LTXConditionPipeline,
     ModelType.LTX2: None,
     ModelType.WAN22_T2V_14b: WanPipeline,
@@ -149,9 +158,18 @@ def get_model_filter_func(model_type: ModelType) -> Callable[[str], bool]:
     ModelType.SD35_MEDIUM: _SD3_BASE_CONFIG,
     ModelType.FLUX_DEV: _FLUX_BASE_CONFIG,
     ModelType.FLUX_SCHNELL: _FLUX_BASE_CONFIG,
-    ModelType.LTX_VIDEO_DEV: {
+    ModelType.FLUX2_DEV: {
         "backbone": "transformer",
         "dataset": _SD_PROMPTS_DATASET,
+        "inference_extra_args": {
+            "height": 768,
+            "width": 1024,
+            "guidance_scale": 4.0,
+        },
+    },
+    ModelType.LTX_VIDEO_DEV: {
+        "backbone": "transformer",
+        "dataset": _OPENVID_DATASET,
         "inference_extra_args": {
             "height": 512,
             "width": 704,
@@ -161,7 +179,7 @@ def get_model_filter_func(model_type: ModelType) -> Callable[[str], bool]:
     },
     ModelType.LTX2: {
         "backbone": "transformer",
-        "dataset": _SD_PROMPTS_DATASET,
+        "dataset": _OPENVID_DATASET,
         "inference_extra_args": {
             "height": 768,
             "width": 1280,

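The guarded import of `Flux2Pipeline` in the diff above keeps the example importable on `diffusers` releases that predate FLUX.2, mapping the pipeline class to `None` when it is missing. Below is a minimal sketch of how such a guard is typically consumed; `load_flux2` is a hypothetical helper for illustration, not code from this commit.

```python
try:
    from diffusers import Flux2Pipeline  # only present in newer diffusers releases
except ImportError:
    Flux2Pipeline = None


def load_flux2(model_id: str = "black-forest-labs/FLUX.2-dev"):
    """Instantiate the FLUX.2 pipeline, or fail with an actionable message."""
    if Flux2Pipeline is None:
        raise RuntimeError(
            "Flux2Pipeline is not available; upgrade diffusers to a release that ships FLUX.2 support."
        )
    return Flux2Pipeline.from_pretrained(model_id)
```
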
examples/diffusers/quantization/utils.py

Lines changed: 13 additions & 7 deletions
@@ -55,11 +55,15 @@ def check_conv_and_mha(backbone, if_fp4, quantize_mha):
         elif isinstance(module, (Attention, AttentionModuleMixin)):
             head_size = int(module.inner_dim / module.heads)
             if not quantize_mha or head_size % 16 != 0:
-                module.q_bmm_quantizer.disable()
-                module.k_bmm_quantizer.disable()
-                module.v_bmm_quantizer.disable()
-                module.softmax_quantizer.disable()
-                module.bmm2_output_quantizer.disable()
+                for attr in (
+                    "q_bmm_quantizer",
+                    "k_bmm_quantizer",
+                    "v_bmm_quantizer",
+                    "softmax_quantizer",
+                    "bmm2_output_quantizer",
+                ):
+                    if hasattr(module, attr):
+                        getattr(module, attr).disable()
                 setattr(module, "_disable_fp8_mha", True)
 
                 print(f"Disabled Attention layer quantization for layer {name}")
@@ -70,14 +74,16 @@ def check_conv_and_mha(backbone, if_fp4, quantize_mha):
 def filter_func_ltx_video(name: str) -> bool:
     """Filter function specifically for LTX-Video models."""
     pattern = re.compile(
-        r".*(proj_in|time_embed|caption_projection|proj_out|patchify_proj|adaln_single).*"
+        r".*(proj_in|time_embed|caption_projection|proj_out|patchify_proj|adaln_single|transformer_blocks\.(0|1|2|45|46|47)\.).*"
     )
     return pattern.match(name) is not None
 
 
 def filter_func_flux_dev(name: str) -> bool:
     """Filter function specifically for Flux-dev models."""
-    pattern = re.compile(r"(proj_out.*|.*(time_text_embed|context_embedder|x_embedder|norm_out).*)")
+    pattern = re.compile(
+        r"(proj_out.*|.*(time_text_embed|context_embedder|x_embedder|norm_out|time_guidance_embed|stream_modulation).*)"
+    )
     return pattern.match(name) is not None
 
 
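
The updated `filter_func_flux_dev` pattern above now also matches `time_guidance_embed` and `stream_modulation` modules; in these example scripts a match means the layer is excluded from quantization. The short check below reuses the pattern from the diff against a few made-up module names to show which ones it catches.

```python
import re

# Pattern copied from the updated filter_func_flux_dev above.
pattern = re.compile(
    r"(proj_out.*|.*(time_text_embed|context_embedder|x_embedder|norm_out|time_guidance_embed|stream_modulation).*)"
)

# Hypothetical module names, purely to illustrate the matching behavior.
for name in (
    "transformer_blocks.0.attn.to_q",
    "time_guidance_embed.linear_1",
    "stream_modulation.proj",
    "proj_out",
):
    print(f"{name}: {'excluded' if pattern.match(name) else 'quantized'}")
```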

examples/llm_ptq/README.md

Lines changed: 5 additions & 4 deletions
@@ -116,6 +116,7 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http
 | MiniMax M2.1 | - | - | - | - ||
 | T5 ||||| - |
 | Whisper ||||| - |
+| Nemotron-3 ||||||
 
 > *This is a subset of the models supported. For the full list please check the [TensorRT-LLM support matrix](https://nvidia.github.io/TensorRT-LLM/reference/precision.html#support-matrix)*
 
@@ -240,7 +241,7 @@ Here is an example usage for `AutoQuantize` algorithm (Please see [auto_quantize
 
 ### AutoQuantize for Hugging Face models
 
-`AutoQuantize` can be performed for Huggingface LLM models like [Llama-3](https://huggingface.co/meta-llama) as shown below:
+`AutoQuantize` can be performed for Hugging Face LLM models like [Qwen](https://huggingface.co/Qwen/Qwen3-8B) / [Nemotron](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) as shown below:
 
 [Script](./scripts/huggingface_example.sh)
 
@@ -249,11 +250,11 @@ export HF_PATH=<the downloaded LLaMA checkpoint from the Hugging Face hub, or si
 # --auto_quantize_bits specifies the constraint for `AutoQuantize`
 # --quant specifies the formats to be searched for `AutoQuantize`
 # NOTE: auto_quantize_bits cannot be lower than the number of bits for the smallest quantization format in --quant
-scripts/huggingface_example.sh --model $HF_PATH --quant w4a8_awq,fp8 --auto_quantize_bits 4.8 --calib_batch_size 4
+scripts/huggingface_example.sh --model $HF_PATH --quant nvfp4_mse,fp8 --auto_quantize_bits 4.75 --calib_batch_size 4
 ```
 
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
-are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).
+The above example performs `AutoQuantize`, where the layers less sensitive to quantization are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers
+are kept unquantized such that the effective bit width is 4.75 (specified by `--auto_quantize_bits 4.75`).
 
 The example scripts above also have an additional flag `--tasks`, where the actual tasks run in the script can be customized. The allowed tasks are `quant,mmlu,lm_eval,livecodebench` specified in the script [parser](./scripts/parser.sh). The tasks combo can be specified with a comma-separated task list. Some tasks like mmlu can take a long time to run. To run lm_eval tasks, please also specify the `--lm_eval_tasks` flag with comma-separated lm_eval tasks [here](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks).
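
To make the `--auto_quantize_bits 4.75` constraint in the updated example concrete, here is a rough, self-contained sketch of the effective-bits arithmetic. It assumes effective bits is the parameter-weighted average of per-layer weight bit widths; the layer names and sizes below are made up for illustration and do not come from this commit.

```python
# Hypothetical per-layer assignment that an AutoQuantize search might produce.
layers = [
    ("attn.qkv_proj", 65_000_000, 4),    # nvfp4_mse (~4-bit weights)
    ("attn.out_proj", 20_000_000, 4),
    ("mlp.up_proj", 120_000_000, 4),
    ("mlp.down_proj", 120_000_000, 4),
    ("lm_head", 75_000_000, 8),          # more sensitive layer kept at fp8
]

total_params = sum(numel for _, numel, _ in layers)
effective_bits = sum(numel * bits for _, numel, bits in layers) / total_params
print(f"effective bits: {effective_bits:.2f}")  # 4.75, within --auto_quantize_bits 4.75
```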
