NVIDIA · ajrasane · May 29, 2026
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -34,10 +34,12 @@ Changelog
 - Add ``DATASET_COMBOS`` to ``modelopt.torch.utils.dataset_utils`` — single ``--dataset`` tokens that fan out to multiple registered datasets; per-entry ``num_samples`` is split evenly across the members. Initial combos: ``cnn_nemotron_v2_mix`` (``cnn_dailymail`` + ``nemotron-post-training-dataset-v2``, used by ``hf_ptq.py`` when no ``--dataset`` is provided) and ``nemotron-post-training-v3`` (the seven ``nvidia/Nemotron-*`` SFT datasets added in #1498, mirroring the `nemotron-post-training-v3 collection <https://huggingface.co/collections/nvidia/nemotron-post-training-v3>`_). Combo names are listed by ``get_supported_datasets()`` and surfaced in ``--dataset`` help. ``get_dataset_dataloader`` rejects inputs that mix a combo with one of its member datasets (e.g. ``cnn_dailymail,cnn_nemotron_v2_mix``) to avoid double-sampling, and ``get_dataset_samples`` rejects combo names so callers route through the dataloader. ``hf_ptq.py`` default ``--calib_size`` is bumped from ``512`` to ``1024`` so the total calibration sample count under the new default combo matches the previous two-dataset fallback.
 - The ``nemotron-sft-agentic-v2`` registered dataset (added in #1498) now uses only the ``search`` split. The previously configured ``interactive_agent`` and ``tool_calling`` splits contain content-level defects (heterogeneous schema and a malformed JSON row, respectively) that cause pyarrow's streaming JSON reader to fail deterministically.
 - Add quantized ``nn.Embedding`` support. ``nn.Embedding`` is now registered in ``QuantModuleRegistry`` and exposes ``weight_quantizer`` (embedding table), ``output_quantizer`` (lookup activations), and a permanently disabled ``input_quantizer`` placeholder — embedding inputs are integer indices and cannot be fake-quantized, so direct ``enable*()`` calls raise. ``export_hf_checkpoint`` packs quantized embedding weights alongside Linear layers. Embedding quantizers are opt-in (``parent_class: nn.Embedding`` disabled by default).
+- Add Torch-TensorRT FP8 / NVFP4 deployment example for HuggingFace ViT (``examples/torch_trt/``) covering ``mtq.quantize`` → ``torch_tensorrt.compile(ir="dynamo")``. Ships two ViT-tuned PTQ recipes under ``modelopt_recipes/huggingface/vit/ptq/`` (``fp8.yaml``, ``nvfp4.yaml``) — encoder Linear weights+inputs quantized; attention Q/K/V BMMs, softmax, and per-block LayerNorm outputs at FP8; patch-embed ``nn.Conv2d``, ``classifier``, and the final ``vit.layernorm`` left FP16. Verified on ``google/vit-base-patch16-224`` (ImageNet-1k 50k validation): FP8 stays within 0.13 pp Top-1 of the FP16 baseline.
 
 **Bug Fixes**
 
 - Fix Megatron-Core HF importer to load fused ``TELayerNormColumnParallelLinear.layer_norm_weight`` from HF for GPT-family models (Qwen3 etc.) under ``--export-default-te-spec``. Importer now prefers per-context keys ``fused_input_layernorm`` / ``fused_pre_mlp_layernorm`` (fallback ``fused_norm`` for Nemotron-H backward compatibility); ``mcore_qwen.py`` provides the new rules. Without this fix, post-prune MMLU sat at chance.
+- Apply ``self.softmax_quantizer`` on the standard (non-kitchen) attention forward path in ``_QuantAttention._quantized_attention`` for HuggingFace transformer attention modules. Previously the slot was created on every registered attention class but only invoked through the optional Kitchen MXFP8 branch, so FP8 / NVFP4 recipes that enabled ``*softmax_quantizer`` saw it stay uncalibrated (``amax=None``) and emitted no Q/DQ around the softmax output during ONNX / Torch-TRT export. The fix temporarily replaces ``torch.nn.functional.softmax`` with a wrapper that pipes its output through ``self.softmax_quantizer`` while the original attention interface runs; the patch is reverted as soon as the attention call returns, and short-circuits to the unwrapped call when the quantizer is disabled (zero-overhead). SDPA-fused softmax inside the C++ kernel is unaffected.
 
 0.44 (2026-05-14)
 ^^^^^^^^^^^^^^^^^

@@ -0,0 +1,103 @@
+# ModelOpt + Torch-TensorRT Deployment
+
+End-to-end examples that quantize a PyTorch model with NVIDIA ModelOpt and
+then compile the quantized graph with
+[Torch-TensorRT](https://docs.pytorch.org/TensorRT/) for deployment.
+
+The flow follows the
+[Torch-TensorRT quantization guide](https://docs.pytorch.org/TensorRT/user_guide/shapes_precision/quantization.html):
+ModelOpt inserts Q/DQ nodes into the eager PyTorch graph, then
+`torch_tensorrt.compile(ir="dynamo")` converts those Q/DQ nodes into native
+TensorRT precision layers.
+
+## Setup
+
+```bash
+# From the NVIDIA TensorRT docker image (recommended):
+docker run --gpus all -it --rm -v $(pwd):/workspace -w /workspace nvcr.io/nvidia/tensorrt:26.02-py3 bash
+
+pip install -U "nvidia-modelopt"
+pip install -r examples/torch_trt/requirements.txt
+```
+
+Install Torch-TensorRT itself by following the
+[official install instructions](https://docs.pytorch.org/TensorRT/getting_started/installation.html);
+the version pulled by `pip` must match your installed PyTorch.
+
+## Usage
+
+```bash
+# FP8 (pass --precision nvfp4 for NVFP4); the default model is google/vit-large-patch16-224
+python examples/torch_trt/torch_tensorrt_ptq.py \
+    --precision fp8 \
+    --calib_samples 128 \
+    --batch_size 1
+
+# Quantize but skip the TRT compile step (handy on a host without TensorRT)
+python examples/torch_trt/torch_tensorrt_ptq.py \
+    --precision fp8 \
+    --skip_trt
+
+# Custom model + custom recipe
+python examples/torch_trt/torch_tensorrt_ptq.py \
+    --model_id <huggingface/model-id> \
+    --recipe <recipe-path-relative-to-modelopt_recipes-or-absolute-yaml>
+```
+
+## What the example does
+
+1. Loads a HuggingFace model (default: `google/vit-large-patch16-224`).
+2. Builds a tiny calibration loader from `zh-plus/tiny-imagenet` (avoids the
+   gated `ILSVRC/imagenet-1k` repo so the example runs unauthenticated).
+3. Runs `mtq.quantize` with one of the recipes shipped under
+   [`modelopt_recipes/`](../../modelopt_recipes/). The default recipes
+   target ViT; pass `--recipe <path>` to use a different one for a
+   different model.
+4. Compiles the quantized model with `torch_tensorrt.compile` and verifies
+   that the compiled-model argmax matches the fake-quant argmax on a sample
+   input.
+
+## ViT-specific recipes shipped with the example
+
+These are the recipes the CLI selects by default when `--model_id` points
+at a HF ViT classifier. They are **not** thin wrappers around the modelopt
+defaults — they're tuned for the HF ViT module layout.
+
+| Flag | Recipe path | Key differences from the default |
+|------|-------------|----------------------------------|
+| `--precision fp8` | `huggingface/vit/ptq/fp8` | W8A8 FP8 **plus** MHA-aware FP8 on every per-block `nn.LayerNorm` output (shared Q/DQ feeds Q/K/V + MLP), FP8 attention Q/K/V BMM + softmax slots, patch-embedding `nn.Conv2d` left FP16, `classifier` head left in FP16, final `vit.layernorm` left FP16. |
+| `--precision nvfp4` | `huggingface/vit/ptq/nvfp4` | Same skip list as the FP8 recipe; encoder Linear weights/inputs run NVFP4 W4A4 (E2M1, block 16, FP8 scales). Attention BMMs, softmax, and per-block LayerNorm outputs stay at FP8 — NVFP4 is too aggressive there. Uses `awq_lite` calibration. |
+
+Each recipe is self-contained (no `$import` of shared snippets) and uses
+the "specific-enable" style: narrow `parent_class` + path scoping on the
+enable rules means no `enable: false` carve-outs are needed.
+
+## Hardware requirements
+
+| Recipe | Minimum GPU |
+|--------|-------------|
+| `fp8`   | Hopper (H100) / Ada (RTX 4090 / 6000 Ada) — compute capability 8.9+ |
+| `nvfp4` | Blackwell (B100/B200) — TRT ≥ 10.8 |
+
+On older GPUs `mtq.quantize` still succeeds (it only emits fake-quant
+nodes in PyTorch), but `torch_tensorrt.compile` will not find a native
+low-precision kernel for them.
+
+## Resuming from a saved checkpoint
+
+Pass `--save_dir <path>` to persist the modelopt-quantized model
+(`vit_modelopt_state.pt`). To reload without recalibrating, restore it
+before the TRT compile step with:
+
+```python
+import modelopt.torch.opt as mto
+mto.restore(model, "vit_modelopt_state.pt")  # model: freshly loaded, not yet quantized
+```
+
+## Custom recipes
+
+Use `--recipe <path>` to plug in a different recipe — either a path
+relative to `modelopt_recipes/` (resolved against the built-in library) or
+an absolute filesystem path to a YAML file. The recipe must declare
+`metadata.recipe_type: ptq` and a `quantize:` section; see existing
+`modelopt_recipes/huggingface/vit/ptq/*.yaml` for the patterns used here.
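Review note on the custom-recipe requirements: the two conditions stated in the README (`metadata.recipe_type: ptq` and a `quantize:` section) can be checked on an already-parsed recipe mapping as sketched below. `check_recipe` is a hypothetical helper written for this note, not ModelOpt's recipe loader, and the example mapping is a made-up minimal recipe whose `quantize` contents are deliberately left empty.

```python
def check_recipe(recipe):
    """Return a list of problems found in a parsed recipe mapping (empty if none)."""
    problems = []
    metadata = recipe.get("metadata") or {}
    if metadata.get("recipe_type") != "ptq":
        problems.append("metadata.recipe_type must be 'ptq'")
    if "quantize" not in recipe:
        problems.append("missing top-level 'quantize' section")
    return problems


# Hypothetical minimal recipe as it might look after YAML parsing.
# The real contents of `quantize` are not reproduced here; see the shipped ViT recipes.
example = {"metadata": {"recipe_type": "ptq"}, "quantize": {}}
issues = check_recipe(example)
```

A pre-flight check like this only confirms the two documented fields are present. It says nothing about whether the rules inside `quantize` match the target model's module layout.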
@@ -0,0 +1,3 @@
+datasets>=2.14.4
+torch-tensorrt>=2.4.0
+transformers>=4.56
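Review note on the README's hardware table: a script could warn before the compile step when the GPU is below the recipe's minimum. The sketch below assumes the capability is supplied as a `(major, minor)` tuple, which is the shape `torch.cuda.get_device_capability()` returns. The `(8, 9)` threshold for `fp8` comes from the table. The `(10, 0)` value for Blackwell B100/B200 is an assumption of this sketch, and the TRT ≥ 10.8 requirement for `nvfp4` is not checked. `MIN_CAPABILITY` and `has_native_kernels` are hypothetical names.

```python
# Minimum compute capability per recipe.
# fp8: taken from the README table. nvfp4: (10, 0) is assumed for Blackwell B100/B200.
MIN_CAPABILITY = {"fp8": (8, 9), "nvfp4": (10, 0)}


def has_native_kernels(precision, capability):
    """Return True if a GPU with `capability` (major, minor) meets the recipe's minimum."""
    return tuple(capability) >= MIN_CAPABILITY[precision]


# Example: an Ada-class GPU (8.9) meets the fp8 minimum but not the nvfp4 one.
ada_fp8 = has_native_kernels("fp8", (8, 9))
ada_nvfp4 = has_native_kernels("nvfp4", (8, 9))
```

Tuple comparison orders by major version first and minor version second, so no manual version parsing is needed.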