Commit c8ae68a

Update CHANGELOG.rst
Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
1 parent a32a409 commit c8ae68a

1 file changed

Lines changed: 1 addition & 1 deletion

File tree

CHANGELOG.rst

@@ -25,7 +25,7 @@ NVIDIA Model Optimizer Changelog
 - Enable PTQ workflow for Qwen3.5 MoE models.
 - Enable PTQ workflow for the Kimi-K2.5 model.
 - Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
-- Add ``NVFP4_EXPERTS_ONLY_CFG`` quantization config that targets only MoE expert layers with NVFP4 (W4A4) quantization.
+- Add ``nvfp4_experts_only`` quantization config that targets only MoE routed expert layers (excluding shared experts) with NVFP4 quantization.
 - ``pass_through_bwd`` in the quantization config now defaults to True. Set it to False to use STE with zeroed outlier gradients for potentially better QAT accuracy.
 - Add :meth:`compute_quantization_mse <modelopt.torch.quantization.model_quant.compute_quantization_mse>` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
 - **Autotune**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. It uses TensorRT latency measurements to choose insertion schemes that minimize inference time, discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, a pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the Autotune guide in the documentation.
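The ``pass_through_bwd`` entry in the diff above contrasts two backward-pass conventions for fake quantization: letting the upstream gradient pass through unchanged, versus an STE variant that zeroes the gradient for clipped outliers. A minimal NumPy sketch of that distinction (a conceptual illustration only, not ModelOpt's implementation; the function names here are made up):

```python
import numpy as np

def fake_quant(x, clip=1.0, levels=16):
    """Forward pass: uniform fake-quantization of x over [-clip, clip]."""
    step = 2 * clip / (levels - 1)
    return np.clip(np.round(np.clip(x, -clip, clip) / step) * step, -clip, clip)

def bwd_grad(x, upstream, clip=1.0, pass_through=True):
    """Gradient of fake_quant w.r.t. x under the two conventions.

    pass_through=True : upstream gradient flows through everywhere.
    pass_through=False: STE that zeroes gradients for clipped outliers.
    """
    if pass_through:
        return upstream
    mask = (np.abs(x) <= clip).astype(upstream.dtype)  # outliers -> 0 grad
    return upstream * mask

x = np.array([-2.0, -0.5, 0.3, 1.7])   # -2.0 and 1.7 are outliers for clip=1.0
g = np.ones_like(x)
print(bwd_grad(x, g, pass_through=True))   # [1. 1. 1. 1.]
print(bwd_grad(x, g, pass_through=False))  # [0. 1. 1. 0.]
```

With pass-through (the new default), clipped weights still receive gradient and can move back inside the quantization range during QAT; the STE variant freezes them instead.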
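The ``compute_quantization_mse`` entry describes measuring per-quantizer mean-squared quantization error with wildcard filtering. The actual API's signature is not shown in this commit, so the following is a rough, self-contained illustration of the underlying idea (simulated quantizer, made-up tensor names, ``fnmatch``-style wildcards):

```python
import fnmatch
import numpy as np

def fake_quant(x, clip=1.0, levels=16):
    """Stand-in uniform quantizer used only to illustrate the metric."""
    step = 2 * clip / (levels - 1)
    return np.clip(np.round(np.clip(x, -clip, clip) / step) * step, -clip, clip)

def per_quantizer_mse(tensors, pattern="*"):
    """Mean-squared error between each tensor and its quantized version,
    restricted to names matching the wildcard pattern."""
    return {
        name: float(np.mean((t - fake_quant(t)) ** 2))
        for name, t in tensors.items()
        if fnmatch.fnmatch(name, pattern)
    }

rng = np.random.default_rng(0)
tensors = {
    "model.layers.0.mlp.experts.0.weight": rng.normal(size=256),
    "model.layers.0.self_attn.o_proj.weight": rng.normal(size=256),
}
# Wildcard filtering keeps only the expert layer's error.
print(per_quantizer_mse(tensors, pattern="*experts*"))
```

A per-quantizer breakdown like this is what lets you spot which layers suffer most under a given format before committing to a config such as ``nvfp4_experts_only``.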
