
Add NVFP4_EXPERTS_ONLY_CFG quantization config and YAML recipe #1030

Open

cjluo-nv wants to merge 6 commits into main from chenjiel/moe2

Conversation

@cjluo-nv
Collaborator

@cjluo-nv cjluo-nv commented Mar 12, 2026

What does this PR do?

Type of change: New feature

Add NVFP4_EXPERTS_ONLY_CFG quantization config that targets only MoE expert layers (*mlp.experts* and *block_sparse_moe*) with NVFP4 (W4A4) quantization, leaving all other layers (including non-expert MLP) unquantized. This is useful for MoE models where selectively quantizing only expert layers provides a good accuracy-performance tradeoff.

Changes:

  • Refactored _nvfp4_experts_only_quant_cfg as a reusable building block in config.py, with _nvfp4_mlp_only_quant_cfg now composing on top of it
  • Added NVFP4_EXPERTS_ONLY_CFG to the Python config choices
  • Added corresponding nvfp4_experts_only-fp8_kv.yml YAML recipe to the new recipe system (modelopt_recipes/general/ptq/)
  • Updated hf_ptq.py, multinode_ptq.py, example scripts, and README to include the new config

Usage

import modelopt.torch.quantization as mtq

model = mtq.quantize(model, mtq.NVFP4_EXPERTS_ONLY_CFG, forward_loop)

Or via the YAML recipe system:

from modelopt.recipe import load_recipe

recipe = load_recipe("general/ptq/nvfp4_experts_only-fp8_kv")

Testing

  • Verified the YAML recipe matches the Python config definition
  • Existing unit tests cover the quantization config infrastructure

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (config is exercised by existing quantization tests)
  • Did you update Changelog?: ❌ (minor config addition)

Additional Information

The experts_only config is a subset of mlp_only: it quantizes *mlp.experts* and *block_sparse_moe* patterns but not the broader *mlp* pattern. The Python config was refactored so _nvfp4_mlp_only_quant_cfg composes on top of _nvfp4_experts_only_quant_cfg.
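Because the quant_cfg keys are wildcard patterns, the subset relationship can be checked directly. A minimal sketch, assuming fnmatch-style wildcard matching; the module names below are hypothetical illustrations, not taken from any specific model:

```python
# Sketch only: checks which (hypothetical) module names each pattern
# set would cover, assuming fnmatch-style wildcard matching.
from fnmatch import fnmatch

experts_only_patterns = ["*mlp.experts*", "*block_sparse_moe*"]
mlp_only_patterns = ["*mlp*", "*block_sparse_moe*"]

def matches(name: str, patterns: list[str]) -> bool:
    return any(fnmatch(name, p) for p in patterns)

# Hypothetical module names for illustration.
expert = "model.layers.0.mlp.experts.3.down_proj"
dense_mlp = "model.layers.0.mlp.gate_proj"
attention = "model.layers.0.self_attn.q_proj"

print(matches(expert, experts_only_patterns))     # True: expert layer quantized
print(matches(dense_mlp, experts_only_patterns))  # False: dense MLP skipped
print(matches(dense_mlp, mlp_only_patterns))      # True: mlp_only also covers dense MLP
print(matches(attention, mlp_only_patterns))      # False: attention untouched by both
```

Every name matched by the experts_only patterns is also matched by the mlp_only patterns, which is the subset relationship described above.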

Summary by CodeRabbit

  • New Features

    • Added an "experts-only" NVFP4 quantization option that selectively quantizes MoE expert layers (preserving dense MLP/attention) for improved PTQ accuracy.
    • Added a PTQ recipe enabling expert-only W4A4 quantization with FP8 KV cache support and corresponding CLI/selectable quantization choice.
  • Documentation

    • Updated README, examples, scripts, and changelog to document and surface the new experts-only quantization option.

@copy-pr-bot

copy-pr-bot bot commented Mar 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1c1d9d9b-d567-4f65-838c-298888def8b0

📥 Commits

Reviewing files that changed from the base of the PR and between a32a409 and c8ae68a.

📒 Files selected for processing (1)
  • CHANGELOG.rst
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.rst

📝 Walkthrough

Walkthrough

Adds a new NVFP4 experts-only quantization configuration (NVFP4_EXPERTS_ONLY_CFG) and integrates it across docs, example scripts, the core quantization config generator, a new PTQ recipe, and a CLI helper; it targets MoE expert layers while leaving dense MLP/attention unchanged.

Changes

Cohort / File(s) Summary
Docs & Shell
examples/llm_ptq/README.md, examples/llm_ptq/scripts/huggingface_example.sh
Documented and added nvfp4_experts_only to examples and the shell QFORMAT validation/listing.
Example Python CLI Scripts
examples/llm_ptq/hf_ptq.py, examples/llm_ptq/multinode_ptq.py
Added "nvfp4_experts_only": mtq.NVFP4_EXPERTS_ONLY_CFG to QUANT_CFG_CHOICES and included it in auto-quantize/CLI --qformat choices.
Core Quantization Config
modelopt/torch/quantization/config.py
Added _nvfp4_selective_quant_cfg(...) helper and _nvfp4_quantizer_bs32, refactored NVFP4 recipe constructions to use the selector, introduced NVFP4_EXPERTS_ONLY_CFG, and updated exported choices.
PTQ Recipe
modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yml
New PTQ recipe for NVFP4 experts-only W4A4 with FP8 KV cache: enables expert-related quantizers, disables non-expert modules, and sets algorithm max.
Changelog
CHANGELOG.rst
Added entry noting the new NVFP4_EXPERTS_ONLY_CFG under 0.43 “New Features”.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name             | Status    | Explanation
Description Check      | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title check            | ✅ Passed | The title accurately and concisely summarizes the main change: adding a new NVFP4_EXPERTS_ONLY_CFG quantization configuration with its corresponding YAML recipe.
Docstring Coverage     | ✅ Passed | Docstring coverage is 100.00%, which meets the 80.00% threshold.
Security Anti-Patterns | ✅ Passed | No security anti-patterns detected in modified Python files (no torch.load with weights_only=False, numpy.load with allow_pickle=True, trust_remote_code=True, eval/exec usage, or nosec comments).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
…l/moe2

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv changed the title Add experts_only mode Add nvfp4_experts_only quant config Mar 18, 2026
@cjluo-nv cjluo-nv changed the title Add nvfp4_experts_only quant config Add NVFP4_EXPERTS_ONLY_CFG quantization config and YAML recipe Mar 18, 2026
@cjluo-nv cjluo-nv marked this pull request as ready for review March 18, 2026 17:50
@cjluo-nv cjluo-nv requested review from a team as code owners March 18, 2026 17:50
Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
examples/llm_ptq/hf_ptq.py (1)

266-284: Avoid duplicating the auto-quantize qformat allow-list.

The hardcoded list can drift from QUANT_CFG_CHOICES. Centralizing the allowed set keeps additions/removals safer.

♻️ Suggested refactor
 QUANT_CFG_CHOICES: dict[str, dict[str, Any]] = {
@@
     "mxfp8": mtq.MXFP8_DEFAULT_CFG,
 }
 
+AUTO_QUANTIZE_QFORMAT_CHOICES = {
+    "fp8",
+    "int8_sq",
+    "int8_wo",
+    "int4_awq",
+    "nvfp4",
+    "nvfp4_awq",
+    "w4a8_awq",
+    "fp8_pb_wo",
+    "w4a8_mxfp4_fp8",
+    "nvfp4_mlp_only",
+    "nvfp4_experts_only",
+    "nvfp4_omlp_only",
+    "mxfp8",
+}
+
@@
-    assert all(
-        qformat
-        in [
-            "fp8",
-            "int8_sq",
-            "int8_wo",
-            "int4_awq",
-            "nvfp4",
-            "nvfp4_awq",
-            "w4a8_awq",
-            "fp8_pb_wo",
-            "w4a8_mxfp4_fp8",
-            "nvfp4_mlp_only",
-            "nvfp4_experts_only",
-            "nvfp4_omlp_only",
-            "mxfp8",
-        ]
-        for qformat in qformat_list
-    ), "One or more quantization formats provided are not supported for unified checkpoint export"
+    assert set(qformat_list).issubset(AUTO_QUANTIZE_QFORMAT_CHOICES), (
+        "One or more quantization formats provided are not supported for unified checkpoint export"
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 266 - 284, The assertion hardcodes
the allowed qformat strings for qformat_list, which duplicates and can drift
from the canonical QUANT_CFG_CHOICES; replace the hardcoded list usage in the
assert with a check against the canonical set (e.g., QUANT_CFG_CHOICES) so the
assertion becomes something like verifying every qformat in qformat_list is in
QUANT_CFG_CHOICES (or its setified form) and keep the error message the same;
update the assertion that references qformat_list to use QUANT_CFG_CHOICES to
centralize allowed formats.
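The suggested refactor can be sketched standalone: validate against one canonical set with set operations instead of a duplicated inline list. The names below (ALLOWED_QFORMATS, validate_qformats) are hypothetical stand-ins, and the set contents are a small illustrative subset, not the full list from hf_ptq.py:

```python
# Sketch of the suggested refactor: one canonical allow-list checked
# with set operations. Contents are an illustrative subset only.
ALLOWED_QFORMATS = {"fp8", "nvfp4", "nvfp4_mlp_only", "nvfp4_experts_only"}

def validate_qformats(qformat_list: list[str]) -> None:
    # Set difference reports every unsupported format at once,
    # rather than failing on the first mismatch.
    unknown = set(qformat_list) - ALLOWED_QFORMATS
    if unknown:
        raise ValueError(
            f"Unsupported quantization format(s) for unified checkpoint export: {sorted(unknown)}"
        )

validate_qformats(["fp8", "nvfp4_experts_only"])  # passes silently
```

Adding a new format then means touching only the canonical set, so the assertion can never drift out of sync with it.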
modelopt/torch/quantization/config.py (1)

658-661: Add a focused regression test for NVFP4_EXPERTS_ONLY_CFG matching behavior.

A small test that asserts expert quantizers are enabled while dense MLP/attention stay disabled would guard against future wildcard-regression drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 658 - 661, Add a focused
regression test named something like test_nvfp4_experts_only_cfg_behavior that
loads NVFP4_EXPERTS_ONLY_CFG (and its inner _nvfp4_experts_only_quant_cfg) and
asserts the expected toggles: expert-related quantizers are enabled while dense
MLP and dense attention quantization flags remain disabled; specifically inspect
NVFP4_EXPERTS_ONLY_CFG["quant_cfg"] for the expert quantizer entries and verify
their enable/active flags are true and verify dense MLP/attention entries are
false (or absent) so the behavior is explicitly captured and fails if future
wildcard changes re-enable dense paths.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c9f0763b-454f-4a74-a271-bef65bf12618

📥 Commits

Reviewing files that changed from the base of the PR and between 1dc890d and 28be434.

📒 Files selected for processing (6)
  • examples/llm_ptq/README.md
  • examples/llm_ptq/hf_ptq.py
  • examples/llm_ptq/multinode_ptq.py
  • examples/llm_ptq/scripts/huggingface_example.sh
  • modelopt/torch/quantization/config.py
  • modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yml

Contributor

@Edwardf0t1 Edwardf0t1 left a comment

Review: Consider consolidating layer-selective NVFP4 configs with a factory function

The new NVFP4_EXPERTS_ONLY_CFG config is correct and well-integrated across the codebase, but this PR highlights a growing pattern of near-identical configs that only differ in layer glob patterns:

Config                       | Layer patterns                    | Quantizes
NVFP4_MLP_ONLY_CFG           | *mlp*, *block_sparse_moe*         | weight + input
NVFP4_EXPERTS_ONLY_CFG (new) | *mlp.experts*, *block_sparse_moe* | weight + input
NVFP4_OMLP_ONLY_CFG          | *o_proj* + MLP_ONLY patterns      | weight + input
NVFP4_MLP_WEIGHT_ONLY_CFG    | *mlp*, *block_sparse_moe*         | weight only

Each one is 10+ lines of copy-pasted dict with _nvfp4_quantizer repeated for each pattern. A factory function would eliminate the boilerplate and make adding new variants trivial:

def _nvfp4_selective_quant_cfg(
    layer_patterns: list[str],
    *,
    weight_only: bool = False,
    algorithm: str | dict = "max",
) -> dict:
    """Build an NVFP4 config that quantizes only the specified layer patterns."""
    quant_cfg = {}
    for pattern in layer_patterns:
        quant_cfg[f"{pattern}weight_quantizer"] = _nvfp4_quantizer
        if not weight_only:
            quant_cfg[f"{pattern}input_quantizer"] = _nvfp4_quantizer
    quant_cfg.update(_default_disabled_quantizer_cfg)
    return {"quant_cfg": quant_cfg, "algorithm": algorithm}


# Named constants become one-liners — public API unchanged
NVFP4_EXPERTS_ONLY_CFG = _nvfp4_selective_quant_cfg(["*mlp.experts*", "*block_sparse_moe*"])
NVFP4_MLP_ONLY_CFG = _nvfp4_selective_quant_cfg(["*mlp*", "*block_sparse_moe*"])
NVFP4_OMLP_ONLY_CFG = _nvfp4_selective_quant_cfg(["*o_proj*", "*mlp*", "*block_sparse_moe*"])
NVFP4_MLP_WEIGHT_ONLY_CFG = _nvfp4_selective_quant_cfg(
    ["*mlp*", "*block_sparse_moe*"], weight_only=True
)

Benefits:

  • Adding a new variant is a single line instead of 10+ lines of copy-paste
  • The relationship between configs is explicit (e.g., OMLP = MLP + o_proj)
  • Quantizer settings (_nvfp4_quantizer, _default_disabled_quantizer_cfg) defined in one place with no risk of drift
  • Public API (NVFP4_MLP_ONLY_CFG, etc.) is completely unchanged

Since this PR is already touching the composition of these configs (refactoring _nvfp4_mlp_only_quant_cfg to build on _nvfp4_experts_only_quant_cfg), it's a natural place to introduce this consolidation.
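To make the factory's behavior concrete, here is a runnable sketch with stub dicts standing in for ModelOpt's _nvfp4_quantizer and _default_disabled_quantizer_cfg (the stub values are illustrative, not real NVFP4 settings):

```python
# Stub quantizer settings -- illustrative values, not ModelOpt's real NVFP4 config.
_nvfp4_quantizer = {"num_bits": (2, 1), "block_sizes": {-1: 16}, "enable": True}
_default_disabled_quantizer_cfg = {"default": {"enable": False}}

def _nvfp4_selective_quant_cfg(layer_patterns, *, weight_only=False, algorithm="max"):
    """Build a config that quantizes only the given layer patterns."""
    quant_cfg = {}
    for pattern in layer_patterns:
        quant_cfg[f"{pattern}weight_quantizer"] = _nvfp4_quantizer
        if not weight_only:
            quant_cfg[f"{pattern}input_quantizer"] = _nvfp4_quantizer
    quant_cfg.update(_default_disabled_quantizer_cfg)
    return {"quant_cfg": quant_cfg, "algorithm": algorithm}

experts_cfg = _nvfp4_selective_quant_cfg(["*mlp.experts*", "*block_sparse_moe*"])
weight_only_cfg = _nvfp4_selective_quant_cfg(["*mlp*"], weight_only=True)

print(sorted(experts_cfg["quant_cfg"]))     # weight + input quantizers per pattern
print(sorted(weight_only_cfg["quant_cfg"])) # weight quantizers only
```

The weight_only flag is what lets NVFP4_MLP_WEIGHT_ONLY_CFG share the same builder as the weight+input variants.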

Minor items

  • YAML recipe copyright year: The new nvfp4_experts_only-fp8_kv.yml has Copyright (c) 2024 — should be updated for a new file.
  • multinode_ptq.py: nvfp4_omlp_only is missing from this file's choices — pre-existing gap but worth fixing while you're here.
  • CHANGELOG: For a new public config exported in __all__, a one-liner in the changelog would be appropriate.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
modelopt/torch/quantization/config.py (1)

431-440: Avoid shared mutable quantizer references in the selective builder.

Line 438 and Line 440 reuse the same quantizer dict instance across all matched keys. Any downstream mutation of one entry can unintentionally affect others in the same config.

Proposed fix
+import copy
+
 def _nvfp4_selective_quant_cfg(
     layer_patterns: list[str],
     *,
     quantizer: dict = _nvfp4_quantizer,
     weight_only: bool = False,
     algorithm: str | dict = "max",
 ) -> dict:
     """Build an NVFP4 config that quantizes only the specified layer patterns."""
     quant_cfg: dict[str, object] = {}
     for pattern in layer_patterns:
-        quant_cfg[f"{pattern}weight_quantizer"] = quantizer
+        quant_cfg[f"{pattern}weight_quantizer"] = copy.deepcopy(quantizer)
         if not weight_only:
-            quant_cfg[f"{pattern}input_quantizer"] = quantizer
+            quant_cfg[f"{pattern}input_quantizer"] = copy.deepcopy(quantizer)
     quant_cfg.update(_default_disabled_quantizer_cfg)
     return {"quant_cfg": quant_cfg, "algorithm": algorithm}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 431 - 440, The selective
NVFP4 config builder (the function with signature taking quantizer: dict =
_nvfp4_quantizer and iterating over layer_patterns to populate quant_cfg)
currently assigns the same quantizer dict instance to multiple keys, risking
shared-mutation bugs; update the assignments for f"{pattern}weight_quantizer"
and f"{pattern}input_quantizer" to store a fresh copy of the quantizer for each
key (e.g., use copy.deepcopy(quantizer) or quantizer.copy()) and add the
appropriate import for copy if using deepcopy so each entry is independent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 94c56063-d790-45ca-b7eb-8bc613192cf8

📥 Commits

Reviewing files that changed from the base of the PR and between 28be434 and a32a409.

📒 Files selected for processing (4)
  • CHANGELOG.rst
  • examples/llm_ptq/multinode_ptq.py
  • modelopt/torch/quantization/config.py
  • modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yml
✅ Files skipped from review due to trivial changes (2)
  • CHANGELOG.rst
  • modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/llm_ptq/multinode_ptq.py

@codecov

codecov bot commented Mar 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.30%. Comparing base (1dc890d) to head (c8ae68a).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1030   +/-   ##
=======================================
  Coverage   70.30%   70.30%           
=======================================
  Files         227      227           
  Lines       25854    25866   +12     
=======================================
+ Hits        18176    18185    +9     
- Misses       7678     7681    +3     

Contributor

@Edwardf0t1 Edwardf0t1 left a comment

LGTM

Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
@cjluo-nv cjluo-nv enabled auto-merge (squash) March 19, 2026 23:26