CHANGELOG.rst (13 additions, 3 deletions)
@@ -8,26 +8,36 @@ NVIDIA Model Optimizer Changelog
 
 - ONNX Runtime dependency upgraded to 1.24 to solve missing graph outputs when using the TensorRT Execution Provider.
 
+**Backward Breaking Changes**
+
+- Default ``--kv_cache_qformat`` in ``hf_ptq.py`` changed from ``fp8`` to ``fp8_cast``. Existing scripts that rely on the default will now skip KV cache calibration and use a constant amax instead. To restore the previous calibrated behavior, explicitly pass ``--kv_cache_qformat fp8``.
+- Removed KV cache scale clamping (``clamp_(min=1.0)``) in the HF checkpoint export path. Calibrated KV cache scales below 1.0 are now exported as-is. If you observe accuracy degradation with calibrated KV cache (``--kv_cache_qformat fp8`` or ``nvfp4``), consider using the casting methods (``fp8_cast`` or ``nvfp4_cast``) instead.
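To restore the previous calibrated KV-cache behavior in scripts affected by the default change above, a minimal sketch (only ``--kv_cache_qformat`` is the flag discussed in the entry; the remaining flags and paths follow the existing ``hf_ptq.py`` interface and are illustrative placeholders):

```bash
# Sketch: explicitly request the calibrated FP8 KV cache instead of the new
# fp8_cast default. Checkpoint path, weight format, and export path are placeholders.
python hf_ptq.py \
    --pyt_ckpt_path /path/to/hf_checkpoint \
    --qformat fp8 \
    --kv_cache_qformat fp8 \
    --export_path /path/to/quantized_checkpoint
```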
 **New Features**
 
+- Add ``fp8_cast`` and ``nvfp4_cast`` modes for ``--kv_cache_qformat`` in ``hf_ptq.py``. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new ``use_constant_amax`` field in :class:`QuantizerAttributeConfig <modelopt.torch.quantization.config.QuantizerAttributeConfig>` controls this behavior (see the usage sketch after this list).
 - Users no longer need to manually register MoE modules to ensure expert calibration coverage in the PTQ workflow.
 - ``hf_ptq.py`` now saves the quantization summary and the MoE expert token count table to the export directory.
-- Add ``--moe_calib_experts_ratio`` flag in ``hf_ptq.py`` to specify the ratio of experts to calibrate during forward pass to improve expert coverage during calibration. Default to all the experts.
+- Add ``--moe_calib_experts_ratio`` flag in ``hf_ptq.py`` to specify the ratio of experts to calibrate during the forward pass, improving expert coverage during calibration. Defaults to ``None`` (disabled).
 - Add sparse attention optimization for transformer models (``modelopt.torch.sparsity.attention_sparsity``). This reduces computational cost by skipping portions of the attention computation. Supports calibration for threshold selection on HuggingFace models. See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
 - Add support for rotating the input before quantization for RHT.
 - Add support for advanced weight scale search for NVFP4 quantization and its export path.
 - Enable PTQ workflow for Qwen3.5 MoE models.
+- Enable PTQ workflow for the Kimi-K2.5 model.
 - Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
 - ``pass_through_bwd`` in the quantization config now defaults to ``True``. Set it to ``False`` if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
 - Add :meth:`compute_quantization_mse <modelopt.torch.quantization.model_quant.compute_quantization_mse>` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
-- **AutoQDQ**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the AutoQDQ guide in the documentation.
+- **Autotune**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, a pattern cache for warm-starting on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the Autotune guide in the documentation.
 - Add ``get_auto_quantize_config`` API to extract a flat quantization config from ``auto_quantize`` search results, enabling re-quantization at different effective bit targets without re-running calibration.
 - Improve ``auto_quantize`` checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
-- Add NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules.
+- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization, and NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules.
+- Add support for block-granular RHT for non-power-of-2 dimensions.
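A minimal sketch combining the new KV-cache cast mode and the MoE calibration flag from the list above (the checkpoint path, weight format, and the 0.5 ratio are illustrative; flag names other than ``--kv_cache_qformat`` and ``--moe_calib_experts_ratio`` follow the existing ``hf_ptq.py`` interface and may differ across versions):

```bash
# Sketch: NVFP4 weights, constant-amax FP8 KV cache (no KV-cache calibration data
# needed), and calibration of 50% of the experts per forward pass. Values are illustrative.
python hf_ptq.py \
    --pyt_ckpt_path /path/to/hf_checkpoint \
    --qformat nvfp4 \
    --kv_cache_qformat fp8_cast \
    --moe_calib_experts_ratio 0.5 \
    --export_path /path/to/quantized_checkpoint
# The quantization summary and MoE expert token count table are written to the export directory.
```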
README.md (2 additions, 0 deletions)
@@ -26,6 +26,8 @@ Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)
 
 ## Latest News
 
+- [2026/03/11] Model Optimizer quantized Nemotron-3-Super checkpoints are available on Hugging Face for download: [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4). Learn more in the [Nemotron 3 Super release blog](https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/). Check out how to quantize Nemotron 3 models for deployment acceleration [here](./examples/llm_ptq/README.md).
+- [2026/03/11] [NeMo Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) now supports Nemotron-3-Super quantization (PTQ and QAT) and export workflows using the Model Optimizer library. See the [Quantization (PTQ and QAT) guide](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/super-v3/docs/models/llm/nemotron3-super.md#quantization-ptq-and-qat) for FP8/NVFP4 quantization and HF export instructions.
 - [2025/12/11] [BLOG: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference](https://developer.nvidia.com/blog/top-5-ai-model-optimization-techniques-for-faster-smarter-inference/)
 - [2025/12/08] NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer.
 - [2025/10/07] [BLOG: Pruning and Distilling LLMs Using NVIDIA Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
Autotune guide (documentation)

-The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.
+The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement optimization in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.
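Based only on the CLI entry point named in the changelog, an invocation might look like the sketch below. The flags in the commented example are hypothetical illustrations, not the documented interface; `--help` lists the real options.

```bash
# List the actual CLI options (an argparse-style --help is assumed to be available).
python -m modelopt.onnx.quantization.autotune --help

# Hypothetical example only -- flag names below are placeholders, see the Autotune guide:
# python -m modelopt.onnx.quantization.autotune \
#     --onnx_path model.onnx \
#     --quantize_mode int8 \
#     --output_dir autotune_results
```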
examples/cnn_qat/README.md (1 addition, 1 deletion)
@@ -143,4 +143,4 @@ Your actual results will vary based on the dataset, specific hyperparameters, an
 
 ## Deployment with TensorRT
 
-The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying a ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.
+The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying an ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.
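As a rough sketch of that deployment step (the ONNX file name is a placeholder, the QAT model is assumed to have been exported to ONNX already, and the exact `trtexec` flags depend on your TensorRT version):

```bash
# Build a TensorRT engine from the exported QAT ONNX model. Explicitly quantized
# (Q/DQ) networks can also be built with --int8 on older TensorRT versions.
trtexec --onnx=model_qat.onnx --saveEngine=model_qat.engine --stronglyTyped
```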
LLM PTQ example README (AutoQuantize section)

> *This is a subset of the models supported. For the full list please check the [TensorRT-LLM support matrix](https://nvidia.github.io/TensorRT-LLM/reference/precision.html#support-matrix)*
@@ -240,7 +241,7 @@ Here is an example usage for `AutoQuantize` algorithm (Please see [auto_quantize
 
 ### AutoQuantize for Hugging Face models
 
-`AutoQuantize` can be performed for Huggingface LLM models like [Llama-3](https://huggingface.co/meta-llama) as shown below:
+`AutoQuantize` can be performed for Hugging Face LLM models such as [Qwen](https://huggingface.co/Qwen/Qwen3-8B) or [Nemotron](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16), as shown below:
 
 [Script](./scripts/huggingface_example.sh)
@@ -249,11 +250,11 @@ export HF_PATH=<the downloaded LLaMA checkpoint from the Hugging Face hub, or si
 # --auto_quantize_bits specifies the constraint for `AutoQuantize`
 # --quant specifies the formats to be searched for `AutoQuantize`
 # NOTE: auto_quantize_bits cannot be lower than the number of bits for the smallest quantization format in --quant
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
-are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).
+The above example performs `AutoQuantize`, where the layers that are less sensitive to quantization are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers
+are kept unquantized so that the effective bits target is 4.75 (specified by `--auto_quantize_bits 4.75`).
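A sketch of the full invocation those comments describe (`--quant`, `--auto_quantize_bits`, and `--tasks` are the options named in this README; the `--model` flag name is an assumption, so check the script's usage):

```bash
# Sketch: AutoQuantize search over nvfp4_mse vs. unquantized layers at an
# effective 4.75 bits, running only the quantization task.
export HF_PATH=<Hugging Face model name or local checkpoint path>
bash scripts/huggingface_example.sh \
    --model $HF_PATH \
    --quant nvfp4_mse \
    --auto_quantize_bits 4.75 \
    --tasks quant
```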
The example scripts above also accept a `--tasks` flag to customize which tasks are actually run. The allowed tasks are `quant,mmlu,lm_eval,livecodebench`, as specified in the script [parser](./scripts/parser.sh); combinations are given as a comma-separated task list. Some tasks, such as mmlu, can take a long time to run. To run lm_eval tasks, also specify the `--lm_eval_tasks` flag with comma-separated lm_eval tasks from [here](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks).