17 changes: 16 additions & 1 deletion .pre-commit-config.yaml
@@ -109,7 +109,7 @@ repos:
examples/llm_eval/lm_eval_hf.py|
examples/llm_eval/mmlu.py|
examples/llm_eval/modeling.py|
- examples/llm_qat/main.py|
+ examples/llm_qat/train.py|
examples/llm_sparsity/weight_sparsity/finetune.py|
examples/specdec_bench/specdec_bench/models/specbench_medusa.py|
examples/speculative_decoding/main.py|
@@ -137,6 +137,21 @@ repos:
args: ["-c", "pyproject.toml", "-q"]
additional_dependencies: ["bandit[toml]"]

- repo: local
hooks:
- id: generate-arguments-md
name: Regenerate examples/llm_qat/ARGUMENTS.md
entry: bash -c 'python examples/llm_qat/arguments.py --generate_docs examples/llm_qat/ARGUMENTS.md'
language: system
files: >-
(?x)^(
examples/llm_qat/arguments\.py|
modelopt/torch/distill/plugins/huggingface\.py|
modelopt/torch/opt/plugins/transformers\.py|
modelopt/torch/quantization/plugins/transformers_trainer\.py
)$
pass_filenames: false

- repo: https://github.com/DavidAnson/markdownlint-cli2
rev: v0.18.1
hooks:
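
As a quick check of the `files` pattern in the new `generate-arguments-md` hook, the sketch below compiles the same verbose regex with Python's `re` module and tests a few paths against it. It assumes that pre-commit matches `files` with `re.search` against repo-relative paths; the helper name `triggers_hook` and the non-matching sample paths are made up for illustration.

```python
import re

# Same verbose pattern as the hook's `files` key.
# (?x) makes whitespace and newlines inside the pattern insignificant.
FILES_PATTERN = re.compile(
    r"""(?x)^(
        examples/llm_qat/arguments\.py|
        modelopt/torch/distill/plugins/huggingface\.py|
        modelopt/torch/opt/plugins/transformers\.py|
        modelopt/torch/quantization/plugins/transformers_trainer\.py
    )$"""
)


def triggers_hook(path: str) -> bool:
    """Return True if a changed file at `path` would match the hook's pattern."""
    return FILES_PATTERN.search(path) is not None


if __name__ == "__main__":
    samples = [
        "examples/llm_qat/arguments.py",       # listed in the hook
        "examples/llm_qat/train.py",           # not listed
        "src/examples/llm_qat/arguments.py",   # blocked by the ^ anchor
    ]
    for sample in samples:
        print(f"{sample}: {triggers_hook(sample)}")
```

Because the alternation is anchored with `^` and `$`, only the four listed files regenerate `ARGUMENTS.md`; edits to other files under `examples/llm_qat/` do not.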
8 changes: 8 additions & 0 deletions CHANGELOG.rst
@@ -43,12 +43,20 @@ Changelog
- Add Nemotron-3-Super-120B-A12B PTQ recipes ``modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`` (MSE-mixed) and ``super-nvfp4-max-calib.yaml`` (max-calib mixed): NVFP4 W4A4 routed experts + FP8 per-tensor shared experts / Mamba in/out_proj + FP8 KV cache.
- Add quantized ``nn.Embedding`` support. ``nn.Embedding`` is now registered in ``QuantModuleRegistry`` and exposes ``weight_quantizer`` (embedding table), ``output_quantizer`` (lookup activations), and a permanently disabled ``input_quantizer`` placeholder — embedding inputs are integer indices and cannot be fake-quantized, so direct ``enable*()`` calls raise. ``export_hf_checkpoint`` packs quantized embedding weights alongside Linear layers. Embedding quantizers are opt-in (``parent_class: nn.Embedding`` disabled by default).
- Add post-training quantization (PTQ) example for the Megatron-Bridge framework: ``examples/megatron_bridge/quantize.py`` calibrates an HF model (via ``--quant_cfg`` alias / full config name or a ``--recipe`` YAML, with optional KV-cache quant, weight-only, compression, and MoE expert-ratio calibration) and saves a Megatron checkpoint (tensor / pipeline / expert parallelism supported), and ``examples/megatron_bridge/export.py`` converts that checkpoint to a deployable HuggingFace (unified) checkpoint for TensorRT-LLM / vLLM / SGLang. See `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for details.
- Refactor ``llm_qat`` example with unified YAML-based configuration and flexible dataset blending.
``ModelOptArgParser`` adds ``--config`` YAML support with CLI overrides and auto-generates ``ARGUMENTS.md`` from dataclass definitions.
Dataset blending (``configs/dataset/blend.yaml``) supports HuggingFace datasets, local JSON/JSONL/Parquet files, and weighted multi-source blends.
The legacy FSDP1 accelerate config is removed; ``llm_qat`` now documents FSDP2, DeepSpeed, and DDP backends.
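
To make "weighted multi-source blends" concrete, here is a minimal sketch of how a fixed sample budget could be split across sources in proportion to their weights. It is an illustration only, not the logic used in `llm_qat`: the function name `allocate_samples`, the source names, and the integer weights are all hypothetical, and the actual `blend.yaml` schema is not shown in this diff.

```python
def allocate_samples(weights: dict[str, int], total: int) -> dict[str, int]:
    """Split `total` samples across sources proportionally to integer weights.

    Uses the largest-remainder method so the counts always sum to `total`.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")

    # Floor share per source, computed with integer arithmetic only.
    counts = {name: total * w // weight_sum for name, w in weights.items()}
    # Remainder per source, used to decide who receives the leftover samples.
    remainders = {name: total * w % weight_sum for name, w in weights.items()}

    leftover = total - sum(counts.values())
    # sorted() is stable, so ties keep the original source order.
    for name in sorted(weights, key=lambda n: -remainders[n])[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical 3:1 blend of two sources over 20000 training samples.
    print(allocate_samples({"source_a": 3, "source_b": 1}, 20000))
```

Integer weights keep the split exact; fractional weights would need a rounding policy of their own.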

**Bug Fixes**

- In Megatron-Core only do EP amax sync for routed expert weights if ``sync_expert_weight_amax=True``. Previously EP amax sync would sync routed expert weights across EP ranks even when ``sync_expert_weight_amax`` was False.
- Fix Megatron-Core HF importer to load fused ``TELayerNormColumnParallelLinear.layer_norm_weight`` from HF for GPT-family models (Qwen3 etc.) under ``--export-default-te-spec``. Importer now prefers per-context keys ``fused_input_layernorm`` / ``fused_pre_mlp_layernorm`` (fallback ``fused_norm`` for Nemotron-H backward compatibility); ``mcore_qwen.py`` provides the new rules. Without this fix, post-prune MMLU sat at chance.

**Deprecations**

- Deprecate the public ``QuantizationArgumentsWithConfig`` name in ``modelopt.torch.quantization.plugins.transformers_trainer``; it now aliases ``QuantizationArguments`` and will be removed in a future release.

0.44 (2026-05-14)
^^^^^^^^^^^^^^^^^

2 changes: 2 additions & 0 deletions examples/llm_qad/README.md
@@ -2,6 +2,8 @@

Quantization-Aware Distillation (QAD) training scripts for language models using Megatron-LM. These scripts enable training quantized (e.g., NVFP4) student models with knowledge distillation from full-precision teacher models.

> **Note:** For Hugging Face LLM QAD, see the [LLM QAT QAD section](../llm_qat/README.md#end-to-end-qad-example).

## Overview

| Script | Purpose |
2 changes: 2 additions & 0 deletions examples/llm_qat/.gitignore
@@ -0,0 +1,2 @@
.cache/
.dataset_cache/
63 changes: 63 additions & 0 deletions examples/llm_qat/ARGUMENTS.md
@@ -0,0 +1,63 @@
# Argument Reference

<!-- Auto-generated — do not edit by hand. Regenerate with: python examples/llm_qat/arguments.py --generate_docs examples/llm_qat/ARGUMENTS.md -->

## Arguments by Script

| Argument group | `quantize.py` | `train.py` |
|---|:---:|:---:|
| ModelArguments | ✅ | ✅ |
| TrainingArguments | - | ✅ |
| DataArguments | ✅ | ✅ |
| DistillArguments | - | ✅ |
| QuantizeArguments | ✅ | - |

**Note:** Each script accepts only the arguments in the groups marked ✅ for it. Arguments from other groups are ignored if set in a `--config` YAML, and cause an error if passed as a CLI flag.
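
The behavior in the note above (YAML keys from other groups are ignored, unknown CLI flags are errors, and CLI flags override YAML values) can be sketched with plain `argparse` and dataclasses. This is not `ModelOptArgParser`: the `parse` helper is hypothetical, the dataclasses are trimmed copies of two groups from the tables below, and a plain dict stands in for the parsed `--config` YAML.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class ModelArguments:
    model_name_or_path: str = "Qwen/Qwen3-8B"
    model_max_length: int = 4096


@dataclass
class QuantizeArguments:
    calib_size: int = 512
    output_dir: str = "quantized_model"


def parse(group_types, yaml_config, argv):
    """Resolve values with precedence: dataclass default < YAML config < CLI flag."""
    supported = {f.name: f for t in group_types for f in fields(t)}

    parser = argparse.ArgumentParser()
    for name, f in supported.items():
        # SUPPRESS keeps flags that were not passed out of the namespace,
        # so they cannot clobber YAML values with defaults.
        parser.add_argument(f"--{name}", type=f.type, default=argparse.SUPPRESS)
    # Flags outside the supported groups make argparse exit with an error.
    cli = vars(parser.parse_args(argv))

    merged = {name: f.default for name, f in supported.items()}
    # YAML keys from unsupported groups are silently dropped.
    merged.update({k: v for k, v in yaml_config.items() if k in supported})
    merged.update(cli)
    return merged


if __name__ == "__main__":
    config = {"model_max_length": 2048, "calib_size": 128, "distill": True}
    print(parse([ModelArguments, QuantizeArguments], config, ["--calib_size", "256"]))
```

Here `distill` belongs to `DistillArguments`, which `quantize.py` does not accept, so it is dropped from the config dict; passing `--distill` on the command line would instead stop the program with an argparse error.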

## DistillArguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--distill` | `bool` | `False` | Enable training with knowledge distillation. |
| `--teacher_model` | `str` | `None` | The name or path of the teacher model to use for distillation. |
| `--criterion` | `str` | `"logits_loss"` | Distillation loss criterion. Currently only 'logits_loss' is supported. |

## DataArguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--dataset_config` | `str` | `"configs/dataset/blend.yaml"` | Path to a dataset blend YAML config file. |
| `--train_samples` | `int` | `20000` | Number of training samples to use. |
| `--eval_samples` | `int` | `2000` | Number of evaluation samples to use. |
| `--dataset_seed` | `int` | `42` | Random seed for dataset shuffling. |
| `--dataset_cache_dir` | `str` | `".dataset_cache/tokenized"` | Directory for caching tokenized datasets. |
| `--shuffle` | `bool` | `True` | Whether to shuffle dataset sources (reservoir sampling). |
| `--shuffle_buffer` | `int` | `10000` | Buffer size for streaming shuffle. |
| `--num_proc` | `int` | `16` | Number of CPU workers for tokenization. |
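
For readers unfamiliar with buffered streaming shuffles, the sketch below shows the general technique that a `--shuffle_buffer` setting usually controls. It is an assumption that `llm_qat` behaves this way; the function `buffered_shuffle` is hypothetical and not taken from the example code.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 42) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer.

    Memory use is bounded by `buffer_size`; larger buffers mix items more thoroughly.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: list[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    print(list(buffered_shuffle(range(20), buffer_size=5)))
```

A buffer of size 1 leaves the order unchanged, while a buffer at least as large as the stream gives a full shuffle; a fixed seed makes the order reproducible across runs.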

## ModelArguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--model_name_or_path` | `str` | `"Qwen/Qwen3-8B"` | HuggingFace model name or local path to the base model to quantize/train. |
| `--model_max_length` | `int` | `4096` | Maximum sequence length. Sequences will be right-padded (and possibly truncated). |

## QuantizeArguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--recipe` | `str` | `None` | Path to a quantization recipe YAML file (built-in or custom). Built-in recipes can be specified by relative path, e.g. 'general/ptq/nvfp4_default-kv_fp8'. Replaces the deprecated --quant_cfg flag. |
| `--quant_cfg` | `modelopt.torch.quantization.config.QuantizeConfig` | `None` | Deprecated: pre-quantize the model with a separate quantization step instead. Specify the quantization format for PTQ/QAT by name (e.g. NVFP4_DEFAULT_CFG). |
| `--calib_size` | `int` | `512` | Specify the calibration size for quantization. The calibration dataset is used to set up the quantization scale parameters for PTQ/QAT. |
| `--compress` | `bool` | `False` | Whether to compress the model weights after quantization for QLoRA. This is useful for reducing the model size. |
| `--calib_batch_size` | `int` | `1` | Batch size for calibration data during quantization. |
| `--output_dir` | `str` | `"quantized_model"` | Directory to save the quantized model checkpoint. |

## TrainingArguments

Extends [HuggingFace TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). Only additional arguments are shown below.

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--cache_dir` | `str` | `None` | |
| `--lora` | `bool` | `False` | Whether to add a LoRA (Low-Rank Adaptation) adapter before training. When using real quantization, the LoRA adapter must be set, as quantized weights will be frozen during training. |