2 changes: 2 additions & 0 deletions CHANGELOG.rst
@@ -34,10 +34,12 @@ Changelog
- Add ``DATASET_COMBOS`` to ``modelopt.torch.utils.dataset_utils`` — single ``--dataset`` tokens that fan out to multiple registered datasets; per-entry ``num_samples`` is split evenly across the members. Initial combos: ``cnn_nemotron_v2_mix`` (``cnn_dailymail`` + ``nemotron-post-training-dataset-v2``, used by ``hf_ptq.py`` when no ``--dataset`` is provided) and ``nemotron-post-training-v3`` (the seven ``nvidia/Nemotron-*`` SFT datasets added in #1498, mirroring the `nemotron-post-training-v3 collection <https://huggingface.co/collections/nvidia/nemotron-post-training-v3>`_). Combo names are listed by ``get_supported_datasets()`` and surfaced in ``--dataset`` help. ``get_dataset_dataloader`` rejects inputs that mix a combo with one of its member datasets (e.g. ``cnn_dailymail,cnn_nemotron_v2_mix``) to avoid double-sampling, and ``get_dataset_samples`` rejects combo names so callers route through the dataloader. ``hf_ptq.py`` default ``--calib_size`` is bumped from ``512`` to ``1024`` so the total calibration sample count under the new default combo matches the previous two-dataset fallback.
- The ``nemotron-sft-agentic-v2`` registered dataset (added in #1498) now uses only the ``search`` split. The previously configured ``interactive_agent`` and ``tool_calling`` splits contain content-level defects (heterogeneous schema and a malformed JSON row, respectively) that cause pyarrow's streaming JSON reader to fail deterministically.
- Add quantized ``nn.Embedding`` support. ``nn.Embedding`` is now registered in ``QuantModuleRegistry`` and exposes ``weight_quantizer`` (embedding table), ``output_quantizer`` (lookup activations), and a permanently disabled ``input_quantizer`` placeholder — embedding inputs are integer indices and cannot be fake-quantized, so direct ``enable*()`` calls raise. ``export_hf_checkpoint`` packs quantized embedding weights alongside Linear layers. Embedding quantizers are opt-in (``parent_class: nn.Embedding`` disabled by default).
- Add Torch-TensorRT FP8 / NVFP4 deployment example for HuggingFace ViT (``examples/torch_trt/``) covering ``mtq.quantize`` → ``torch_tensorrt.compile(ir="dynamo")``. Ships two ViT-tuned PTQ recipes under ``modelopt_recipes/huggingface/vit/ptq/`` (``fp8.yaml``, ``nvfp4.yaml``) — encoder Linear weights+inputs quantized; attention Q/K/V BMMs, softmax, and per-block LayerNorm outputs at FP8; patch-embed ``nn.Conv2d``, ``classifier``, and the final ``vit.layernorm`` left FP16. Verified on ``google/vit-base-patch16-224`` (ImageNet-1k 50k validation): FP8 stays within 0.13 pp Top-1 of the FP16 baseline.

**Bug Fixes**

- Fix Megatron-Core HF importer to load fused ``TELayerNormColumnParallelLinear.layer_norm_weight`` from HF for GPT-family models (Qwen3 etc.) under ``--export-default-te-spec``. Importer now prefers per-context keys ``fused_input_layernorm`` / ``fused_pre_mlp_layernorm`` (fallback ``fused_norm`` for Nemotron-H backward compatibility); ``mcore_qwen.py`` provides the new rules. Without this fix, post-prune MMLU sat at chance.
- Apply ``self.softmax_quantizer`` on the standard (non-kitchen) attention forward path in ``_QuantAttention._quantized_attention`` for HuggingFace transformer attention modules. Previously the slot was created on every registered attention class but only invoked through the optional Kitchen MXFP8 branch, so FP8 / NVFP4 recipes that enabled ``*softmax_quantizer`` saw it stay uncalibrated (``amax=None``) and emitted no Q/DQ around the softmax output during ONNX / Torch-TRT export. The fix temporarily replaces ``torch.nn.functional.softmax`` with a wrapper that pipes its output through ``self.softmax_quantizer`` while the original attention interface runs; the patch is reverted as soon as the attention call returns, and short-circuits to the unwrapped call when the quantizer is disabled (zero-overhead). SDPA-fused softmax inside the C++ kernel is unaffected.

0.44 (2026-05-14)
^^^^^^^^^^^^^^^^^
103 changes: 103 additions & 0 deletions examples/torch_trt/README.md
@@ -0,0 +1,103 @@
# ModelOpt + Torch-TensorRT Deployment

End-to-end examples that quantize a PyTorch model with NVIDIA ModelOpt and
then compile the quantized graph with
[Torch-TensorRT](https://docs.pytorch.org/TensorRT/) for deployment.

The flow follows the
[Torch-TensorRT quantization guide](https://docs.pytorch.org/TensorRT/user_guide/shapes_precision/quantization.html):
ModelOpt inserts Q/DQ nodes into the eager PyTorch graph, then
`torch_tensorrt.compile(ir="dynamo")` converts those Q/DQ nodes into native
TensorRT precision layers.

## Setup

```bash
# From the NVIDIA TensorRT docker image (recommended):
docker run --gpus all -it --rm -v $(pwd):/workspace -w /workspace nvcr.io/nvidia/tensorrt:26.02-py3 bash

pip install -U "nvidia-modelopt[torch]"
# Review comment (Collaborator), suggested change:
#   - pip install -U "nvidia-modelopt[torch]"
#   + pip install -U "nvidia-modelopt"
# Bot comment: the suggested change from the previous review
# (pip install -U "nvidia-modelopt") was not applied; the line above still
# reads "nvidia-modelopt[torch]".

pip install -r examples/torch_trt/requirements.txt
```

Install Torch-TensorRT itself by following the
[official install instructions](https://docs.pytorch.org/TensorRT/getting_started/installation.html);
the version that `pip` pulls must match your installed PyTorch.

## Usage

```bash
# The default model for both FP8 and NVFP4 is google/vit-large-patch16-224.
# --precision takes either fp8 or nvfp4.
python examples/torch_trt/torch_tensorrt_ptq.py \
    --precision fp8 \
    --calib_samples 128 \
    --batch_size 1

# Quantize but skip the TRT compile step (handy on a host without TensorRT)
python examples/torch_trt/torch_tensorrt_ptq.py \
    --precision fp8 \
    --skip_trt

# Custom model + custom recipe
python examples/torch_trt/torch_tensorrt_ptq.py \
--model_id <huggingface/model-id> \
--recipe <recipe-path-relative-to-modelopt_recipes-or-absolute-yaml>
```

## What the example does

1. Loads a HuggingFace model (default: `google/vit-large-patch16-224`).
2. Builds a tiny calibration loader from `zh-plus/tiny-imagenet` (avoids the
gated `ILSVRC/imagenet-1k` repo so the example runs unauthenticated).
3. Runs `mtq.quantize` with one of the recipes shipped under
[`modelopt_recipes/`](../../modelopt_recipes/). The default recipes
target ViT; pass `--recipe <path>` to use a different one for a
different model.
4. Compiles the quantized model with `torch_tensorrt.compile` and verifies
that the compiled-model argmax matches the fake-quant argmax on a sample
input.
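
The check in step 4 can be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes the logits are available as plain nested lists, and the helper names `argmax` and `argmax_agreement` are hypothetical.

```python
from typing import Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in one row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def argmax_agreement(
    reference_logits: Sequence[Sequence[float]],
    compiled_logits: Sequence[Sequence[float]],
) -> float:
    """Fraction of samples whose predicted class matches between two runs."""
    if len(reference_logits) != len(compiled_logits):
        raise ValueError("both runs must score the same number of samples")
    matches = sum(
        argmax(ref) == argmax(out)
        for ref, out in zip(reference_logits, compiled_logits)
    )
    return matches / len(reference_logits)


# Hypothetical logits: small numeric drift, same winning class per sample.
fake_quant = [[0.10, 2.30, -1.00], [1.50, 0.20, 0.10]]
trt_compiled = [[0.12, 2.25, -0.98], [1.47, 0.25, 0.08]]
agreement = argmax_agreement(fake_quant, trt_compiled)
```

Comparing the argmax rather than the raw logits tolerates the small numeric drift expected between fake-quant PyTorch and real low-precision TensorRT kernels, while still catching a compile that changes the predicted class.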

## ViT-specific recipes shipped with the example

These are the recipes the CLI selects by default when `--model_id` points
at a HF ViT classifier. They are **not** thin wrappers around the modelopt
defaults — they're tuned for the HF ViT module layout.

| Flag | Recipe path | Key differences from the default |
|------|-------------|----------------------------------|
| `--precision fp8` | `huggingface/vit/ptq/fp8` | W8A8 FP8 **plus** MHA-aware FP8 on every per-block `nn.LayerNorm` output (shared Q/DQ feeds Q/K/V + MLP), FP8 attention Q/K/V BMM + softmax slots, patch-embedding `nn.Conv2d` left FP16, `classifier` head left in FP16, final `vit.layernorm` left FP16. |
| `--precision nvfp4` | `huggingface/vit/ptq/nvfp4` | Same skip list as the FP8 recipe; encoder Linear weights/inputs run NVFP4 W4A4 (E2M1, block 16, FP8 scales). Attention BMMs, softmax, and per-block LayerNorm outputs stay at FP8 — NVFP4 is too aggressive there. Uses `awq_lite` calibration. |

Each recipe is self-contained (no `$import` of shared snippets) and uses
the "specific-enable" style: narrow `parent_class` + path scoping on the
enable rules means no `enable: false` carve-outs are needed.

## Hardware requirements

| Recipe | Minimum GPU |
|--------|-------------|
| `fp8` | Hopper (H100) / Ada (RTX 4090 / 6000 Ada) — compute capability 8.9+ |
| `nvfp4` | Blackwell (B100/B200) — TRT ≥ 10.8 |

On older GPUs `mtq.quantize` still succeeds, because it only inserts
fake-quant nodes in PyTorch, but `torch_tensorrt.compile` will not find a
native low-precision kernel for the quantized layers.
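
A pre-flight check along these lines could catch an unsupported GPU before calibration starts. It is a sketch under stated assumptions: `deployable_precisions` is a hypothetical helper, the `fp8` threshold is taken from the table above, and the `(10, 0)` threshold for `nvfp4` assumes that B100/B200 report compute capability 10.0. The TensorRT version requirement is not checked here.

```python
from typing import List, Tuple

# Minimum compute capability per recipe. The fp8 value comes from the
# hardware table; (10, 0) for nvfp4 is an assumption (B100/B200 = 10.0).
MIN_CAPABILITY = {"fp8": (8, 9), "nvfp4": (10, 0)}


def deployable_precisions(capability: Tuple[int, int]) -> List[str]:
    """Recipes whose minimum compute capability the given GPU meets."""
    return [
        name
        for name, minimum in MIN_CAPABILITY.items()
        if capability >= minimum  # tuples compare (major, minor) in order
    ]


# Hypothetical devices, written as (major, minor) compute capability.
ampere = deployable_precisions((8, 0))
hopper = deployable_precisions((9, 0))
blackwell = deployable_precisions((10, 0))
```

In a real script the tuple would typically come from `torch.cuda.get_device_capability()`, which returns `(major, minor)` for the current device.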

### Resuming from a saved checkpoint

Pass `--save_dir <path>` to persist the modelopt-quantized model
(`vit_modelopt_state.pt`). To reload without recalibrating, restore it
before the TRT compile step with:

```python
import modelopt.torch.opt as mto

# `model` is a freshly loaded, un-quantized instance of the same architecture.
mto.restore(model, "vit_modelopt_state.pt")
```

## Custom recipes

Use `--recipe <path>` to plug in a different recipe — either a path
relative to `modelopt_recipes/` (resolved against the built-in library) or
an absolute filesystem path to a YAML file. The recipe must declare
`metadata.recipe_type: ptq` and a `quantize:` section; see existing
`modelopt_recipes/huggingface/vit/ptq/*.yaml` for the patterns used here.
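
The two rules above (path resolution and the required recipe keys) can be sketched as below. This is an illustration only, not the loader ModelOpt uses: `resolve_recipe_path` and `check_ptq_recipe` are hypothetical helpers, the recipe is assumed to be already parsed into a dictionary (for example with `yaml.safe_load`), and the sketch does not handle a recipe name given without its `.yaml` suffix.

```python
from pathlib import Path
from typing import Any, Mapping


def resolve_recipe_path(recipe: str, library_root: Path) -> Path:
    """Absolute paths are used as-is; others resolve against the library."""
    candidate = Path(recipe)
    return candidate if candidate.is_absolute() else library_root / candidate


def check_ptq_recipe(recipe: Mapping[str, Any]) -> None:
    """Raise ValueError if the parsed recipe lacks the required PTQ keys."""
    metadata = recipe.get("metadata") or {}
    if metadata.get("recipe_type") != "ptq":
        raise ValueError("recipe must declare metadata.recipe_type: ptq")
    if "quantize" not in recipe:
        raise ValueError("recipe must contain a quantize: section")


# Hypothetical inputs: a library-relative recipe and a minimal parsed recipe.
library_root = Path("modelopt_recipes")
builtin = resolve_recipe_path("huggingface/vit/ptq/fp8.yaml", library_root)
custom = resolve_recipe_path("/tmp/my_recipe.yaml", library_root)
check_ptq_recipe({"metadata": {"recipe_type": "ptq"}, "quantize": {}})
```

Validating the two keys before calibration gives an early, readable error instead of a failure partway through `mtq.quantize`.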
3 changes: 3 additions & 0 deletions examples/torch_trt/requirements.txt
@@ -0,0 +1,3 @@
datasets>=2.14.4
torch-tensorrt>=2.4.0
transformers>=4.40

Review comment (Collaborator), suggested change:
- transformers>=4.40
+ transformers>=4.56

Bot comment: the suggested bump to transformers>=4.56 from the previous review was not applied; the requirement still reads transformers>=4.40.
