diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index 304f800644a..b0888efd714 100755
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -34,10 +34,12 @@ Changelog
 - Add ``DATASET_COMBOS`` to ``modelopt.torch.utils.dataset_utils`` — single ``--dataset`` tokens that fan out to multiple registered datasets; per-entry ``num_samples`` is split evenly across the members. Initial combos: ``cnn_nemotron_v2_mix`` (``cnn_dailymail`` + ``nemotron-post-training-dataset-v2``, used by ``hf_ptq.py`` when no ``--dataset`` is provided) and ``nemotron-post-training-v3`` (the seven ``nvidia/Nemotron-*`` SFT datasets added in #1498, mirroring the `nemotron-post-training-v3 collection <https://huggingface.co/collections/nvidia/nemotron-post-training-v3>`_). Combo names are listed by ``get_supported_datasets()`` and surfaced in ``--dataset`` help. ``get_dataset_dataloader`` rejects inputs that mix a combo with one of its member datasets (e.g. ``cnn_dailymail,cnn_nemotron_v2_mix``) to avoid double-sampling, and ``get_dataset_samples`` rejects combo names so callers route through the dataloader. ``hf_ptq.py`` default ``--calib_size`` is bumped from ``512`` to ``1024`` so the total calibration sample count under the new default combo matches the previous two-dataset fallback.
 - The ``nemotron-sft-agentic-v2`` registered dataset (added in #1498) now uses only the ``search`` split. The previously configured ``interactive_agent`` and ``tool_calling`` splits contain content-level defects (heterogeneous schema and a malformed JSON row, respectively) that cause pyarrow's streaming JSON reader to fail deterministically.
 - Add quantized ``nn.Embedding`` support. ``nn.Embedding`` is now registered in ``QuantModuleRegistry`` and exposes ``weight_quantizer`` (embedding table), ``output_quantizer`` (lookup activations), and a permanently disabled ``input_quantizer`` placeholder — embedding inputs are integer indices and cannot be fake-quantized, so direct ``enable*()`` calls raise. ``export_hf_checkpoint`` packs quantized embedding weights alongside Linear layers. Embedding quantizers are opt-in (``parent_class: nn.Embedding`` disabled by default).
+- Add Torch-TensorRT FP8 / NVFP4 deployment example for HuggingFace ViT (``examples/torch_trt/``) covering ``mtq.quantize`` → ``torch_tensorrt.compile(ir="dynamo")``. Ships two ViT-tuned PTQ recipes under ``modelopt_recipes/huggingface/vit/ptq/`` (``fp8.yaml``, ``nvfp4.yaml``) — encoder Linear weights+inputs quantized; attention Q/K/V BMMs, softmax, and per-block LayerNorm outputs at FP8; patch-embed ``nn.Conv2d``, ``classifier``, and the final ``vit.layernorm`` left FP16. Verified on ``google/vit-base-patch16-224`` (ImageNet-1k 50k validation): FP8 stays within 0.13 pp Top-1 of the FP16 baseline.
 
 **Bug Fixes**
 
 - Fix Megatron-Core HF importer to load fused ``TELayerNormColumnParallelLinear.layer_norm_weight`` from HF for GPT-family models (Qwen3 etc.) under ``--export-default-te-spec``. Importer now prefers per-context keys ``fused_input_layernorm`` / ``fused_pre_mlp_layernorm`` (fallback ``fused_norm`` for Nemotron-H backward compatibility); ``mcore_qwen.py`` provides the new rules. Without this fix, post-prune MMLU sat at chance.
+- Apply ``self.softmax_quantizer`` on the standard (non-kitchen) attention forward path in ``_QuantAttention._quantized_attention`` for HuggingFace transformer attention modules. Previously the slot was created on every registered attention class but only invoked through the optional Kitchen MXFP8 branch, so FP8 / NVFP4 recipes that enabled ``*softmax_quantizer`` saw it stay uncalibrated (``amax=None``) and emitted no Q/DQ around the softmax output during ONNX / Torch-TRT export. The fix temporarily replaces ``torch.nn.functional.softmax`` with a wrapper that pipes its output through ``self.softmax_quantizer`` while the original attention interface runs; the patch is reverted as soon as the attention call returns, and short-circuits to the unwrapped call when the quantizer is disabled (zero-overhead). SDPA-fused softmax inside the C++ kernel is unaffected.
 
 0.44 (2026-05-14)
 ^^^^^^^^^^^^^^^^^
diff --git a/examples/torch_trt/README.md b/examples/torch_trt/README.md
new file mode 100644
index 00000000000..d826f7fc3d6
--- /dev/null
+++ b/examples/torch_trt/README.md
@@ -0,0 +1,103 @@
+# ModelOpt + Torch-TensorRT Deployment
+
+End-to-end examples that quantize a PyTorch model with NVIDIA ModelOpt and
+then compile the quantized graph with
+[Torch-TensorRT](https://docs.pytorch.org/TensorRT/) for deployment.
+
+The flow follows the
+[Torch-TensorRT quantization guide](https://docs.pytorch.org/TensorRT/user_guide/shapes_precision/quantization.html):
+ModelOpt inserts Q/DQ nodes into the eager PyTorch graph, then
+`torch_tensorrt.compile(ir="dynamo")` converts those Q/DQ nodes into native
+TensorRT precision layers.
+
+## Setup
+
+```bash
+# From the NVIDIA TensorRT docker image (recommended):
+docker run --gpus all -it --rm -v $(pwd):/workspace -w /workspace nvcr.io/nvidia/tensorrt:26.02-py3 bash
+
+pip install -U "nvidia-modelopt[torch]"
+pip install -r examples/torch_trt/requirements.txt
+```
+
+Torch-TensorRT itself follows the
+[official install instructions](https://docs.pytorch.org/TensorRT/getting_started/installation.html) —
+the version pulled by `pip` must match your installed PyTorch.
+
+## Usage
+
+```bash
+# FP8 / NVFP4 default model is google/vit-large-patch16-224
+python examples/torch_trt/torch_tensorrt_ptq.py \
+    --precision fp8/nvfp4 \
+    --calib_samples 128 \
+    --batch_size 1
+
+# Quantize but don't TRT-compile (handy on a non-TRT host)
+python examples/torch_trt/torch_tensorrt_ptq.py \
+    --precision fp8/nvfp4 \
+    --skip_trt
+
+# Custom model + custom recipe
+python examples/torch_trt/torch_tensorrt_ptq.py \
+    --model_id <huggingface/model-id> \
+    --recipe <recipe-path-relative-to-modelopt_recipes-or-absolute-yaml>
+```
+
+## What the example does
+
+1. Loads a HuggingFace model (default: `google/vit-large-patch16-224`).
+2. Builds a tiny calibration loader from `zh-plus/tiny-imagenet` (avoids the
+   gated `ILSVRC/imagenet-1k` repo so the example runs unauthenticated).
+3. Runs `mtq.quantize` with one of the recipes shipped under
+   [`modelopt_recipes/`](../../modelopt_recipes/). The default recipes
+   target ViT; pass `--recipe <path>` to use a different one for a
+   different model.
+4. Compiles the quantized model with `torch_tensorrt.compile` and verifies
+   that the compiled-model argmax matches the fake-quant argmax on a sample
+   input.
+
+## ViT-specific recipes shipped with the example
+
+These are the recipes the CLI selects by default when `--model_id` points
+at a HF ViT classifier. They are **not** thin wrappers around the modelopt
+defaults — they're tuned for the HF ViT module layout.
+
+| Flag | Recipe path | Key differences from the default |
+|------|-------------|----------------------------------|
+| `--precision fp8` | `huggingface/vit/ptq/fp8` | W8A8 FP8 **plus** MHA-aware FP8 on every per-block `nn.LayerNorm` output (shared Q/DQ feeds Q/K/V + MLP), FP8 attention Q/K/V BMM + softmax slots, patch-embedding `nn.Conv2d` left FP16, `classifier` head left in FP16, final `vit.layernorm` left FP16. |
+| `--precision nvfp4` | `huggingface/vit/ptq/nvfp4` | Same skip list as the FP8 recipe; encoder Linear weights/inputs run NVFP4 W4A4 (E2M1, block 16, FP8 scales). Attention BMMs, softmax, and per-block LayerNorm outputs stay at FP8 — NVFP4 is too aggressive there. Uses `awq_lite` calibration. |
+
+Each recipe is self-contained (no `$import` of shared snippets) and uses
+the "specific-enable" style: narrow `parent_class` + path scoping on the
+enable rules means no `enable: false` carve-outs are needed.
+
+## Hardware requirements
+
+| Recipe | Minimum GPU |
+|--------|-------------|
+| `fp8`   | Hopper (H100) / Ada (RTX 4090 / 6000 Ada) — compute capability 8.9+ |
+| `nvfp4` | Blackwell (B100/B200) — TRT ≥ 10.8 |
+
+Older GPUs will still let `mtq.quantize` succeed (it emits fake-quant
+nodes in PyTorch), but `torch_tensorrt.compile` will not find a real
+low-precision kernel.
+
+### Resuming from a saved checkpoint
+
+Pass `--save_dir <path>` to persist the modelopt-quantized model
+(`vit_modelopt_state.pt`). To reload without recalibrating, restore it
+before the TRT compile step with:
+
+```python
+import modelopt.torch.opt as mto
+mto.restore(model, "vit_modelopt_state.pt")
+```
+
+## Custom recipes
+
+Use `--recipe <path>` to plug in a different recipe — either a path
+relative to `modelopt_recipes/` (resolved against the built-in library) or
+an absolute filesystem path to a YAML file. The recipe must declare
+`metadata.recipe_type: ptq` and a `quantize:` section; see existing
+`modelopt_recipes/huggingface/vit/ptq/*.yaml` for the patterns used here.
diff --git a/examples/torch_trt/requirements.txt b/examples/torch_trt/requirements.txt
new file mode 100644
index 00000000000..8caf5b5fd93
--- /dev/null
+++ b/examples/torch_trt/requirements.txt
@@ -0,0 +1,3 @@
+datasets>=2.14.4
+torch-tensorrt>=2.4.0
+transformers>=4.40
diff --git a/examples/torch_trt/torch_tensorrt_ptq.py b/examples/torch_trt/torch_tensorrt_ptq.py
new file mode 100644
index 00000000000..da584928865
--- /dev/null
+++ b/examples/torch_trt/torch_tensorrt_ptq.py
@@ -0,0 +1,298 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Quantize a HuggingFace ViT model with ModelOpt and compile with Torch-TensorRT.
+
+Pipeline:
+
+1. Load ``google/vit-large-patch16-224`` (`ViTForImageClassification`) from HF.
+2. Build a calibration loader from `zh-plus/tiny-imagenet` so the recipe runs
+   end-to-end without ImageNet access.
+3. Run ``mtq.quantize`` with one of the ViT-specific recipes under
+   `modelopt_recipes/huggingface/vit/ptq/` (FP8 or NVFP4).
+4. Compile the quantized model with ``torch_tensorrt.compile(ir="dynamo",
+   min_block_size=1)`` and verify the compiled-model argmax matches the
+   fake-quant argmax on a sample input.
+
+The quantized graph keeps Q/DQ nodes; the TRT compile step is what turns
+them into TRT precision layers.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import torch
+from datasets import load_dataset
+from transformers import AutoImageProcessor, ViTConfig, ViTForImageClassification
+
+import modelopt.torch.opt as mto
+import modelopt.torch.quantization as mtq
+from modelopt.recipe import ModelOptPTQRecipe, load_recipe
+from modelopt.torch.quantization.utils import export_torch_mode
+
+# Maps the user-facing precision flag to the ViT-specific recipe under
+# `modelopt_recipes/huggingface/vit/ptq/`. The recipe loader resolves this
+# relative path against the built-in recipe library.
+PRECISION_TO_RECIPE: dict[str, str] = {
+    "fp8": "huggingface/vit/ptq/fp8",
+    "nvfp4": "huggingface/vit/ptq/nvfp4",
+}
+
+
+def load_model_and_processor(
+    model_id: str,
+    device: torch.device,
+    dtype: torch.dtype,
+    pretrained: bool = True,
+    config_overrides: dict | None = None,
+):
+    """Pull the HF ViT classifier and its preprocessor.
+
+    With ``pretrained=False`` the model is built from a config with random
+    weights (test path); ``config_overrides`` lets the caller shrink it
+    (e.g. ``{"num_hidden_layers": 1, "hidden_size": 64, ...}``). The
+    preprocessor is always loaded from ``model_id`` since it only carries
+    a small JSON config.
+    """
+    print(f"Loading {model_id} (dtype={dtype}, pretrained={pretrained})...")
+    processor = AutoImageProcessor.from_pretrained(model_id)
+    if pretrained:
+        model = ViTForImageClassification.from_pretrained(model_id, torch_dtype=dtype)
+    else:
+        config = ViTConfig.from_pretrained(model_id)
+        for k, v in (config_overrides or {}).items():
+            setattr(config, k, v)
+        model = ViTForImageClassification(config).to(dtype)
+    model.eval().to(device)
+    return model, processor
+
+
+def build_calibration_loader(
+    processor,
+    num_samples: int,
+    batch_size: int,
+    device: torch.device,
+    dtype: torch.dtype,
+):
+    """Build a calibration tensor stream from tiny-imagenet."""
+    print(f"Loading calibration data ({num_samples} samples)...")
+    dataset = load_dataset("zh-plus/tiny-imagenet", split="train")
+    dataset = dataset.shuffle(seed=42).select(range(num_samples))
+
+    tensors: list[torch.Tensor] = []
+    for sample in dataset:
+        image = sample["image"]
+        if image.mode != "RGB":
+            image = image.convert("RGB")
+        pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]
+        tensors.append(pixel_values.squeeze(0))
+
+    batched = torch.stack(tensors).to(device=device, dtype=dtype)
+    return torch.split(batched, batch_size)
+
+
+def quantize_with_recipe(model, recipe_path: str, calib_batches):
+    """Resolve the YAML recipe and run `mtq.quantize`."""
+    print(f"Loading recipe: {recipe_path}")
+    recipe = load_recipe(recipe_path)
+    if not isinstance(recipe, ModelOptPTQRecipe):
+        raise TypeError(f"Expected PTQ recipe, got {type(recipe).__name__}")
+    quant_cfg = recipe.quantize.model_dump()
+
+    def forward_loop(model_):
+        with torch.no_grad():
+            for batch in calib_batches:
+                model_(pixel_values=batch)
+
+    print("Running mtq.quantize ...")
+    mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
+    mtq.print_quant_summary(model)
+    return model
+
+
+class ViTLogitsWrapper(torch.nn.Module):
+    """Returns raw logits as a single tensor.
+
+    HF's `ViTForImageClassification.forward` returns an `ImageClassifierOutput`
+    dataclass. `torch_tensorrt.compile` (and `torch.export`) need a tensor-tree
+    return, so we unwrap it here.
+    """
+
+    def __init__(self, vit_model: torch.nn.Module):
+        super().__init__()
+        self.vit = vit_model
+
+    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        return self.vit(pixel_values=pixel_values).logits
+
+
+def compile_with_torch_tensorrt(model: torch.nn.Module, example_input: torch.Tensor):
+    """Compile the quantized model with Torch-TensorRT (Dynamo IR, strongly-typed).
+
+    `min_block_size=1` follows the Torch-TRT quantization guide so single-node
+    Q/DQ + matmul subgraphs become TRT precision layers. `export_torch_mode`
+    makes modelopt emit Q/DQ in the TRT-friendly form during `torch.export`.
+    """
+    import torch_tensorrt
+
+    print("Compiling with torch_tensorrt.compile (Dynamo IR)...")
+    # `aten.cat.default` is force-executed in PyTorch because torch_tensorrt
+    # 2.10's cat converter chokes on the cls-token + patch-embedding concat
+    # in HF ViT (BFloat16 path: `TypeError: Got unsupported ScalarType
+    # BFloat16`; FP16 path: rank-(-1) TRT tensor that trips the downstream
+    # `embeddings + position_embeddings` add). The cat is a tiny [1,1,H]
+    # + [1,N,H] concat that runs once per forward, so falling back to
+    # PyTorch costs essentially nothing.
+    with export_torch_mode():
+        trt_model = torch_tensorrt.compile(
+            model,
+            ir="dynamo",
+            arg_inputs=[example_input],
+            min_block_size=1,
+            truncate_double=True,
+            torch_executed_ops={torch.ops.aten.cat.default},
+        )
+    return trt_model
+
+
+def _argmax_logits(out) -> torch.Tensor:
+    """Handle either an HF `ImageClassifierOutput` or a raw tensor."""
+    logits = out.logits if hasattr(out, "logits") else out
+    return logits.argmax(dim=-1)
+
+
+def main():
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "--model_id",
+        default="google/vit-large-patch16-224",
+        help="HuggingFace model id of the ViT classifier to quantize.",
+    )
+    parser.add_argument(
+        "--precision",
+        choices=sorted(PRECISION_TO_RECIPE),
+        default="fp8",
+        help="Which ViT recipe variant to apply.",
+    )
+    parser.add_argument(
+        "--recipe",
+        default=None,
+        help="Override the recipe path (relative to modelopt_recipes/ or absolute). "
+        "If unset, the recipe is picked by --precision.",
+    )
+    parser.add_argument(
+        "--calib_samples",
+        type=int,
+        default=128,
+        help="Number of tiny-imagenet samples to use for calibration.",
+    )
+    parser.add_argument(
+        "--batch_size",
+        type=int,
+        default=1,
+        help="Batch size for calibration / TRT compile.",
+    )
+    parser.add_argument(
+        "--save_dir",
+        type=str,
+        default=None,
+        help="If set, save the quantized modelopt state-dict here (BF16 weights "
+        "+ Q/DQ metadata) — re-usable across runs without recalibration.",
+    )
+    parser.add_argument(
+        "--skip_trt",
+        action="store_true",
+        help="Quantize + run the fake-quant model only; skip torch_tensorrt.compile. "
+        "Useful for environments without torch_tensorrt installed.",
+    )
+    parser.add_argument(
+        "--no_pretrained",
+        action="store_true",
+        help="Build the model from config with random weights instead of "
+        "downloading pretrained weights. Useful for fast e2e tests.",
+    )
+    parser.add_argument(
+        "--model_kwargs",
+        type=str,
+        default=None,
+        help="JSON string of ViTConfig overrides applied when --no_pretrained "
+        'is set (e.g. \'{"num_hidden_layers": 1, "hidden_size": 64, '
+        '"intermediate_size": 128, "num_attention_heads": 2}\').',
+    )
+    args = parser.parse_args()
+
+    if not torch.cuda.is_available():
+        raise SystemExit("This example requires a CUDA-capable GPU.")
+    device = torch.device("cuda")
+    dtype = torch.bfloat16
+
+    config_overrides = json.loads(args.model_kwargs) if args.model_kwargs else None
+    model, processor = load_model_and_processor(
+        args.model_id,
+        device,
+        dtype,
+        pretrained=not args.no_pretrained,
+        config_overrides=config_overrides,
+    )
+    image_size = model.config.image_size
+    num_channels = model.config.num_channels
+    example_input = torch.randn(
+        args.batch_size, num_channels, image_size, image_size, device=device, dtype=dtype
+    )
+
+    print("\n=== Baseline (BF16) ===")
+    with torch.no_grad():
+        baseline_pred = _argmax_logits(model(example_input))
+    print(f"Baseline argmax class: {baseline_pred.tolist()}")
+
+    calib_batches = build_calibration_loader(
+        processor, args.calib_samples, args.batch_size, device, dtype
+    )
+
+    recipe_path = args.recipe or PRECISION_TO_RECIPE[args.precision]
+    quantize_with_recipe(model, recipe_path, calib_batches)
+
+    if args.save_dir:
+        save_path = Path(args.save_dir)
+        save_path.mkdir(parents=True, exist_ok=True)
+        ckpt = save_path / "vit_modelopt_state.pt"
+        mto.save(model, ckpt)
+        print(f"Saved quantized modelopt state to {ckpt}")
+
+    print("\n=== Fake-quant (modelopt) ===")
+    with torch.no_grad():
+        fq_pred = _argmax_logits(model(example_input))
+    fq_match = (fq_pred == baseline_pred).all().item()
+    print(f"Quantized argmax class: {fq_pred.tolist()} (matches baseline: {fq_match})")
+
+    if args.skip_trt:
+        print("\n--skip_trt set; not compiling with Torch-TensorRT.")
+        return
+
+    wrapped = ViTLogitsWrapper(model).to(device).eval()
+    trt_model = compile_with_torch_tensorrt(wrapped, example_input)
+
+    print("\n=== Torch-TensorRT compiled ===")
+    with torch.no_grad():
+        trt_pred = trt_model(example_input).argmax(dim=-1)
+    trt_match = (trt_pred == baseline_pred).all().item()
+    print(f"TRT argmax class: {trt_pred.tolist()} (matches baseline: {trt_match})")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/modelopt/torch/quantization/plugins/huggingface.py b/modelopt/torch/quantization/plugins/huggingface.py
index 1873ecda528..ceede41a891 100644
--- a/modelopt/torch/quantization/plugins/huggingface.py
+++ b/modelopt/torch/quantization/plugins/huggingface.py
@@ -134,6 +134,17 @@ def _quantized_attention(
         key_states = self.k_bmm_quantizer(key_states)
         value_states = self.v_bmm_quantizer(value_states)
         if not self.use_kitchen:
+            if self.softmax_quantizer.is_enabled:
+                _sq = self.softmax_quantizer
+                _orig_softmax = torch.nn.functional.softmax
+
+                def _quantized_softmax(*s_args, **s_kwargs):
+                    return _sq(_orig_softmax(*s_args, **s_kwargs))
+
+                with replace_function(torch.nn.functional, "softmax", _quantized_softmax):
+                    return original_attention_interface(
+                        self, query_states, key_states, value_states, *args, **kwargs
+                    )
             return original_attention_interface(
                 self, query_states, key_states, value_states, *args, **kwargs
             )
diff --git a/modelopt_recipes/huggingface/vit/ptq/fp8.yaml b/modelopt_recipes/huggingface/vit/ptq/fp8.yaml
new file mode 100644
index 00000000000..6158a99590a
--- /dev/null
+++ b/modelopt_recipes/huggingface/vit/ptq/fp8.yaml
@@ -0,0 +1,55 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+metadata:
+  recipe_type: ptq
+  description: >-
+    HuggingFace ViT FP8 PTQ recipe matching modelopt.onnx FP8 layout:
+    per-tensor FP8 E4M3 on encoder Linear weights + inputs, attention QKV
+    BMMs and softmax outputs, MHA-aware per-block LayerNorm outputs;
+    patch-embed Conv2d, classifier head, and the final LayerNorm are
+    left FP16. Uses max calibration.
+quantize:
+  algorithm: max
+  quant_cfg:
+    - quantizer_name: '*'
+      enable: false
+
+    - parent_class: 'nn.Linear'
+      quantizer_name: 'vit.encoder.layer.*.weight_quantizer'
+      cfg:
+        num_bits: e4m3
+        axis:
+    - parent_class: 'nn.Linear'
+      quantizer_name: 'vit.encoder.layer.*.input_quantizer'
+      cfg:
+        num_bits: e4m3
+        axis:
+
+    - quantizer_name: '*[qkv]_bmm_quantizer'
+      cfg:
+        num_bits: e4m3
+        axis:
+
+    - quantizer_name: '*softmax_quantizer'
+      cfg:
+        num_bits: e4m3
+        axis:
+
+    - parent_class: 'nn.LayerNorm'
+      quantizer_name: 'vit.encoder.layer.*.output_quantizer'
+      cfg:
+        num_bits: e4m3
+        axis:
diff --git a/modelopt_recipes/huggingface/vit/ptq/nvfp4.yaml b/modelopt_recipes/huggingface/vit/ptq/nvfp4.yaml
new file mode 100644
index 00000000000..e30f7184dd6
--- /dev/null
+++ b/modelopt_recipes/huggingface/vit/ptq/nvfp4.yaml
@@ -0,0 +1,63 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+metadata:
+  recipe_type: ptq
+  description: >-
+    HuggingFace ViT NVFP4 W4A4 PTQ recipe. Skip list mirrors modelopt.onnx
+    FP8 (patch-embed Conv2d, classifier, and final vit.layernorm stay
+    FP16). Encoder Linear weights/inputs run NVFP4; attention BMMs,
+    softmax, and per-block LayerNorm outputs stay at FP8.
+    Uses AWQ-lite calibration.
+quantize:
+  algorithm: awq_lite
+  quant_cfg:
+    - quantizer_name: '*'
+      enable: false
+
+    - parent_class: 'nn.Linear'
+      quantizer_name: 'vit.encoder.layer.*.weight_quantizer'
+      cfg:
+        num_bits: e2m1
+        axis:
+        block_sizes:
+          -1: 16
+          type: dynamic
+          scale_bits: e4m3
+    - parent_class: 'nn.Linear'
+      quantizer_name: 'vit.encoder.layer.*.input_quantizer'
+      cfg:
+        num_bits: e2m1
+        axis:
+        block_sizes:
+          -1: 16
+          type: dynamic
+          scale_bits: e4m3
+
+    - quantizer_name: '*[qkv]_bmm_quantizer'
+      cfg:
+        num_bits: e4m3
+        axis:
+
+    - quantizer_name: '*softmax_quantizer'
+      cfg:
+        num_bits: e4m3
+        axis:
+
+    - parent_class: 'nn.LayerNorm'
+      quantizer_name: 'vit.encoder.layer.*.output_quantizer'
+      cfg:
+        num_bits: e4m3
+        axis:
diff --git a/tests/examples/torch_trt/test_torch_tensorrt_ptq.py b/tests/examples/torch_trt/test_torch_tensorrt_ptq.py
new file mode 100644
index 00000000000..f0350d525ad
--- /dev/null
+++ b/tests/examples/torch_trt/test_torch_tensorrt_ptq.py
@@ -0,0 +1,59 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+
+import pytest
+from _test_utils.examples.run_command import extend_cmd_parts, run_example_command
+
+# Recipe variants the example ships. Mirrors the parametrization style of
+# ``tests/examples/torch_onnx/test_torch_quant_to_onnx.py``.
+_PRECISIONS = ["fp8", "nvfp4"]
+
+# Tiny ViT config (~1 encoder block) so the test stays under a few seconds
+# of GPU time while exercising every code path the recipe touches: encoder
+# Linear weight/input quantizers, attention BMM + softmax quantizers,
+# per-block LayerNorm output quantizer, and the patch-embed Conv / final
+# vit.layernorm / classifier skip rules.
+_TINY_VIT_KWARGS = {
+    "num_hidden_layers": 1,
+    "hidden_size": 64,
+    "intermediate_size": 128,
+    "num_attention_heads": 2,
+}
+
+
+@pytest.mark.parametrize("precision", _PRECISIONS)
+def test_torch_tensorrt_ptq(precision):
+    """End-to-end: load tiny ViT -> mtq.quantize via recipe -> torch_tensorrt.compile.
+
+    Runs against the smallest viable ``ViTForImageClassification`` config so
+    the test stays fast; ``--no_pretrained`` skips the multi-GB pretrained
+    download. The example's CLI exits non-zero if any step (calibration,
+    quantization, TRT compile) fails or if the compiled-model argmax doesn't
+    match the fake-quant argmax on the sample input.
+    """
+    pytest.importorskip("torch_tensorrt")
+
+    cmd_parts = extend_cmd_parts(
+        ["python", "torch_tensorrt_ptq.py"],
+        model_id="google/vit-base-patch16-224",
+        precision=precision,
+        calib_samples="4",
+        batch_size="1",
+        model_kwargs=json.dumps(_TINY_VIT_KWARGS),
+    )
+    cmd_parts.append("--no_pretrained")
+    run_example_command(cmd_parts, "torch_trt")