NVIDIA · realAsma · Jun 3, 2026 · Jun 2, 2026
@@ -109,7 +109,7 @@ repos:
               examples/llm_eval/lm_eval_hf.py|
               examples/llm_eval/mmlu.py|
               examples/llm_eval/modeling.py|
-              examples/llm_qat/main.py|
+              examples/llm_qat/train.py|
               examples/llm_sparsity/weight_sparsity/finetune.py|
               examples/specdec_bench/specdec_bench/models/specbench_medusa.py|
               examples/speculative_decoding/main.py|
@@ -137,6 +137,21 @@ repos:
         args: ["-c", "pyproject.toml", "-q"]
         additional_dependencies: ["bandit[toml]"]
 
+  - repo: local
+    hooks:
+      - id: generate-arguments-md
+        name: Regenerate examples/llm_qat/ARGUMENTS.md
+        entry: bash -c 'python examples/llm_qat/arguments.py --generate_docs examples/llm_qat/ARGUMENTS.md'
+        language: system
+        files: >-
+          (?x)^(
+            examples/llm_qat/arguments\.py|
+            modelopt/torch/distill/plugins/huggingface\.py|
+            modelopt/torch/opt/plugins/transformers\.py|
+            modelopt/torch/quantization/plugins/transformers_trainer\.py
+          )$
+        pass_filenames: false
+
   - repo: https://github.com/DavidAnson/markdownlint-cli2
     rev: v0.18.1
     hooks:
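
Reviewer note on the `generate-arguments-md` hook above: pre-commit treats `files` as a Python regular expression, so the filter can be checked in isolation. The sketch below copies the alternation from the hook and tests a few paths against it. The helper name `hook_would_run` is hypothetical and exists only for this illustration; it is not part of pre-commit or of this repository.

```python
import re

# Same alternation as the hook's `files:` value. YAML's `>-` folds the lines
# into a single string; under (?x) the extra whitespace is ignored either way.
FILES_PATTERN = re.compile(
    r"""(?x)^(
        examples/llm_qat/arguments\.py|
        modelopt/torch/distill/plugins/huggingface\.py|
        modelopt/torch/opt/plugins/transformers\.py|
        modelopt/torch/quantization/plugins/transformers_trainer\.py
    )$"""
)


def hook_would_run(changed_paths):
    """Return True if any changed path matches the hook's file filter.

    Hypothetical helper: it only mimics the path filtering step.
    """
    return any(FILES_PATTERN.search(path) for path in changed_paths)


if __name__ == "__main__":
    print(hook_would_run(["examples/llm_qat/arguments.py"]))
    print(hook_would_run(["examples/llm_qat/train.py"]))
```

Because `pass_filenames: false` is set, the matched paths only decide whether the hook fires; the `entry` command always regenerates the whole `ARGUMENTS.md`.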

diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -43,12 +43,20 @@ Changelog
 - Add Nemotron-3-Super-120B-A12B PTQ recipes ``modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`` (MSE-mixed) and ``super-nvfp4-max-calib.yaml`` (max-calib mixed): NVFP4 W4A4 routed experts + FP8 per-tensor shared experts / Mamba in/out_proj + FP8 KV cache.
 - Add quantized ``nn.Embedding`` support. ``nn.Embedding`` is now registered in ``QuantModuleRegistry`` and exposes ``weight_quantizer`` (embedding table), ``output_quantizer`` (lookup activations), and a permanently disabled ``input_quantizer`` placeholder — embedding inputs are integer indices and cannot be fake-quantized, so direct ``enable*()`` calls raise. ``export_hf_checkpoint`` packs quantized embedding weights alongside Linear layers. Embedding quantizers are opt-in (``parent_class: nn.Embedding`` disabled by default).
 - Add post-training quantization (PTQ) example for the Megatron-Bridge framework: ``examples/megatron_bridge/quantize.py`` calibrates an HF model (via ``--quant_cfg`` alias / full config name or a ``--recipe`` YAML, with optional KV-cache quant, weight-only, compression, and MoE expert-ratio calibration) and saves a Megatron checkpoint (tensor / pipeline / expert parallelism supported), and ``examples/megatron_bridge/export.py`` converts that checkpoint to a deployable HuggingFace (unified) checkpoint for TensorRT-LLM / vLLM / SGLang. See `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for details.
+- Refactor ``llm_qat`` example with unified YAML-based configuration and flexible dataset blending.
+  ``ModelOptArgParser`` adds ``--config`` YAML support with CLI overrides and auto-generates ``ARGUMENTS.md`` from dataclass definitions.
+  Dataset blending (``configs/dataset/blend.yaml``) supports HuggingFace datasets, local JSON/JSONL/Parquet files, and weighted multi-source blends.
+  The legacy FSDP1 accelerate config is removed; ``llm_qat`` now documents FSDP2, DeepSpeed, and DDP backends.
 
 **Bug Fixes**
 
 - In Megatron-Core only do EP amax sync for routed expert weights if ``sync_expert_weight_amax=True``. Previously EP amax sync would sync routed expert weights across EP ranks even when ``sync_expert_weight_amax`` was False.
 - Fix Megatron-Core HF importer to load fused ``TELayerNormColumnParallelLinear.layer_norm_weight`` from HF for GPT-family models (Qwen3 etc.) under ``--export-default-te-spec``. Importer now prefers per-context keys ``fused_input_layernorm`` / ``fused_pre_mlp_layernorm`` (fallback ``fused_norm`` for Nemotron-H backward compatibility); ``mcore_qwen.py`` provides the new rules. Without this fix, post-prune MMLU sat at chance.
 
+**Deprecations**
+
+- Deprecate the public ``QuantizationArgumentsWithConfig`` name in ``modelopt.torch.quantization.plugins.transformers_trainer``; it now aliases ``QuantizationArguments`` and will be removed in a future release.
+
 0.44 (2026-05-14)
 ^^^^^^^^^^^^^^^^^
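
Reviewer note on the ``llm_qat`` changelog entry above: "``--config`` YAML support with CLI overrides" implies a precedence of built-in defaults, then config file, then command-line flags. The sketch below shows one way to get that ordering with plain ``argparse``. It is an assumption about the behavior, not the ``ModelOptArgParser`` implementation: ``parse_with_config`` is a hypothetical helper, and ``config_values`` stands in for the dict a YAML loader would return. The argument names and defaults are taken from ``ARGUMENTS.md``.

```python
import argparse


def parse_with_config(argv, config_values):
    """Resolve arguments as: built-in defaults < config file < CLI flags.

    Hypothetical helper. `config_values` stands in for a parsed YAML file.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_samples", type=int, default=20000)
    parser.add_argument("--eval_samples", type=int, default=2000)
    parser.add_argument("--dataset_seed", type=int, default=42)

    # Config values replace the built-in defaults ...
    parser.set_defaults(**config_values)
    # ... and anything given explicitly on the command line wins over both.
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_with_config(
        ["--eval_samples", "100"],
        {"train_samples": 5000, "eval_samples": 500},
    )
    print(args.train_samples, args.eval_samples, args.dataset_seed)
```

Here ``train_samples`` comes from the config, ``eval_samples`` from the CLI flag, and ``dataset_seed`` keeps its default.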
 

@@ -2,6 +2,8 @@
 
 Quantization-Aware Distillation (QAD) training scripts for language models using Megatron-LM. These scripts enable training quantized (e.g., NVFP4) student models with knowledge distillation from full-precision teacher models.
 
+> **Note:** For Hugging Face LLM QAD, see the [QAD section of the LLM QAT README](../llm_qat/README.md#end-to-end-qad-example).
+
 ## Overview
 
 | Script | Purpose |

@@ -0,0 +1,2 @@
+.cache/
+.dataset_cache/
@@ -0,0 +1,63 @@
+# Argument Reference
+
+<!-- Auto-generated — do not edit by hand. Regenerate with: python examples/llm_qat/arguments.py --generate_docs examples/llm_qat/ARGUMENTS.md -->
+
+## Arguments by Script
+
+| Argument group | `quantize.py` | `train.py` |
+|---|:---:|:---:|
+| ModelArguments | ✅ | ✅ |
+| TrainingArguments | - | ✅ |
+| DataArguments | ✅ | ✅ |
+| DistillArguments | - | ✅ |
+| QuantizeArguments | ✅ | - |
+
+**Note:** Each script accepts only the arguments in the groups marked ✅ for it. Arguments from other groups are ignored (if set in a `--config` YAML) or cause an error (if passed as a CLI flag).
+
+## DistillArguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--distill` | `bool` | `False` | Enable training with knowledge distillation. |
+| `--teacher_model` | `str` | `None` | The name or path of the teacher model to use for distillation. |
+| `--criterion` | `str` | `"logits_loss"` | Distillation loss criterion. Currently only 'logits_loss' is supported. |
+
+## DataArguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--dataset_config` | `str` | `"configs/dataset/blend.yaml"` | Path to a dataset blend YAML config file. |
+| `--train_samples` | `int` | `20000` | Number of training samples to use. |
+| `--eval_samples` | `int` | `2000` | Number of evaluation samples to use. |
+| `--dataset_seed` | `int` | `42` | Random seed for dataset shuffling. |
+| `--dataset_cache_dir` | `str` | `".dataset_cache/tokenized"` | Directory for caching tokenized datasets. |
+| `--shuffle` | `bool` | `True` | Whether to shuffle dataset sources (reservoir sampling). |
+| `--shuffle_buffer` | `int` | `10000` | Buffer size for streaming shuffle. |
+| `--num_proc` | `int` | `16` | Number of CPU workers for tokenization. |
+
+## ModelArguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--model_name_or_path` | `str` | `"Qwen/Qwen3-8B"` | HuggingFace model name or local path to the base model to quantize/train. |
+| `--model_max_length` | `int` | `4096` | Maximum sequence length. Sequences will be right-padded (and possibly truncated). |
+
+## QuantizeArguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--recipe` | `str` | `None` | Path to a quantization recipe YAML file (built-in or custom). Built-in recipes can be specified by relative path, e.g. 'general/ptq/nvfp4_default-kv_fp8'. Replaces the deprecated --quant_cfg flag. |
+| `--quant_cfg` | `modelopt.torch.quantization.config.QuantizeConfig` | `None` | Deprecated: pre-quantize the model with a separate quantization step instead. Specify the quantization format for PTQ/QAT by name (e.g. NVFP4_DEFAULT_CFG). |
+| `--calib_size` | `int` | `512` | Specify the calibration size for quantization. The calibration dataset is used to set up the quantization scale parameters for PTQ/QAT. |
+| `--compress` | `bool` | `False` | Whether to compress the model weights after quantization for QLoRA. This is useful for reducing the model size. |
+| `--calib_batch_size` | `int` | `1` | Batch size for calibration data during quantization. |
+| `--output_dir` | `str` | `"quantized_model"` | Directory to save the quantized model checkpoint. |
+
+## TrainingArguments
+
+Extends [HuggingFace TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). Only additional arguments are shown below.
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--cache_dir` | `str` | `None` |  |
+| `--lora` | `bool` | `False` | Whether to add LoRA (Low-Rank Adaptation) adapter before training. When using real quantization, the LoRA adapter must be set, as quantized weights will be frozen during training. |
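
Reviewer note on the "Arguments by Script" rule in `ARGUMENTS.md`: keys from unsupported groups are ignored when they come from a `--config` YAML but raise an error when passed as CLI flags. The sketch below illustrates that asymmetry with plain dataclasses. Everything here is an assumption for illustration: `resolve` and `SCRIPT_GROUPS` are hypothetical names, only a subset of each group's fields is reproduced, and the real parser may be structured differently.

```python
from dataclasses import dataclass, fields


@dataclass
class ModelArguments:
    model_name_or_path: str = "Qwen/Qwen3-8B"
    model_max_length: int = 4096


@dataclass
class QuantizeArguments:
    calib_size: int = 512
    calib_batch_size: int = 1


@dataclass
class DistillArguments:
    distill: bool = False
    criterion: str = "logits_loss"


# Subset of the "Arguments by Script" table, for illustration only.
SCRIPT_GROUPS = {
    "quantize.py": (ModelArguments, QuantizeArguments),
    "train.py": (ModelArguments, DistillArguments),
}


def resolve(script, config_values, cli_values):
    """Build one dataclass instance per group the script supports."""
    groups = SCRIPT_GROUPS[script]
    allowed = {f.name for group in groups for f in fields(group)}

    # CLI flags outside the supported groups are an error.
    unknown_cli = sorted(set(cli_values) - allowed)
    if unknown_cli:
        raise ValueError(f"{script} does not accept: {unknown_cli}")

    # Config keys outside the supported groups are silently dropped,
    # and CLI values override whatever the config provided.
    merged = {k: v for k, v in config_values.items() if k in allowed}
    merged.update(cli_values)

    return tuple(
        group(**{f.name: merged[f.name] for f in fields(group) if f.name in merged})
        for group in groups
    )


if __name__ == "__main__":
    model_args, quant_args = resolve(
        "quantize.py",
        {"calib_size": 256, "distill": True},  # `distill` is ignored here
        {"model_max_length": 2048},
    )
    print(model_args, quant_args)
```

This lets one shared YAML file drive both `quantize.py` and `train.py`, while a mistyped or misplaced CLI flag still fails loudly.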