
Add NVFP4_EXPERTS_ONLY_CFG quantization config and YAML recipe #1030

Open

cjluo-nv wants to merge 6 commits into main from chenjiel/moe2

Conversation

@cjluo-nv
Collaborator

@cjluo-nv cjluo-nv commented Mar 12, 2026

What does this PR do?

Type of change: New feature

Add NVFP4_EXPERTS_ONLY_CFG quantization config that targets only MoE expert layers (*mlp.experts* and *block_sparse_moe*) with NVFP4 (W4A4) quantization, leaving all other layers (including non-expert MLP) unquantized. This is useful for MoE models where selectively quantizing only expert layers provides a good accuracy-performance tradeoff.

Changes:

  • Refactored _nvfp4_experts_only_quant_cfg as a reusable building block in config.py, with _nvfp4_mlp_only_quant_cfg now composing on top of it
  • Added NVFP4_EXPERTS_ONLY_CFG to the Python config choices
  • Added corresponding nvfp4_experts_only-fp8_kv.yml YAML recipe to the new recipe system (modelopt_recipes/general/ptq/)
  • Updated hf_ptq.py, multinode_ptq.py, example scripts, and README to include the new config

Usage

import modelopt.torch.quantization as mtq

model = mtq.quantize(model, mtq.NVFP4_EXPERTS_ONLY_CFG, forward_loop)

Or via the YAML recipe system:

from modelopt.recipe import load_recipe

recipe = load_recipe("general/ptq/nvfp4_experts_only-fp8_kv")

Testing

  • Verified the YAML recipe matches the Python config definition
  • Existing unit tests cover the quantization config infrastructure

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (config is exercised by existing quantization tests)
  • Did you update Changelog?: ❌ (minor config addition)

Additional Information

The experts_only config is a subset of mlp_only: it quantizes *mlp.experts* and *block_sparse_moe* patterns but not the broader *mlp* pattern. The Python config was refactored so _nvfp4_mlp_only_quant_cfg composes on top of _nvfp4_experts_only_quant_cfg.
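Because the quant_cfg keys are wildcard patterns, the subset relationship can be checked directly. A minimal sketch, assuming fnmatch-style wildcard matching; the module names below are hypothetical illustrations, not taken from any specific model:

```python
# Sketch only: checks which (hypothetical) module names each pattern
# set would cover, assuming fnmatch-style wildcard matching.
from fnmatch import fnmatch

experts_only_patterns = ["*mlp.experts*", "*block_sparse_moe*"]
mlp_only_patterns = ["*mlp*", "*block_sparse_moe*"]

def matches(name: str, patterns: list[str]) -> bool:
    return any(fnmatch(name, p) for p in patterns)

# Hypothetical module names for illustration.
expert = "model.layers.0.mlp.experts.3.down_proj"
dense_mlp = "model.layers.0.mlp.gate_proj"
attention = "model.layers.0.self_attn.q_proj"

print(matches(expert, experts_only_patterns))     # True: expert layer quantized
print(matches(dense_mlp, experts_only_patterns))  # False: dense MLP skipped
print(matches(dense_mlp, mlp_only_patterns))      # True: mlp_only also covers dense MLP
print(matches(attention, mlp_only_patterns))      # False: attention untouched by both
```

Every name matched by the experts_only patterns is also matched by the mlp_only patterns, which is the subset relationship described above.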

Summary by CodeRabbit

  • New Features

    • Added an "experts-only" NVFP4 quantization option that selectively quantizes MoE expert layers (preserving dense MLP/attention) for improved PTQ accuracy.
    • Added a PTQ recipe enabling expert-only W4A4 quantization with FP8 KV cache support and corresponding CLI/selectable quantization choice.
  • Documentation

    • Updated README, examples, scripts, and changelog to document and surface the new experts-only quantization option.

@copy-pr-bot

copy-pr-bot bot commented Mar 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1c1d9d9b-d567-4f65-838c-298888def8b0

📥 Commits

Reviewing files that changed from the base of the PR and between a32a409 and c8ae68a.

📒 Files selected for processing (1)
  • CHANGELOG.rst
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.rst

📝 Walkthrough

Walkthrough

Adds a new NVFP4 experts-only quantization configuration (NVFP4_EXPERTS_ONLY_CFG) and integrates it across docs, example scripts, the core quantization config generator, a new PTQ recipe, and a CLI helper; it targets MoE expert layers while leaving dense MLP/attention unchanged.

Changes

Cohort / File(s) Summary
Docs & Shell
examples/llm_ptq/README.md, examples/llm_ptq/scripts/huggingface_example.sh
Documented and added nvfp4_experts_only to examples and the shell QFORMAT validation/listing.
Example Python CLI Scripts
examples/llm_ptq/hf_ptq.py, examples/llm_ptq/multinode_ptq.py
Added "nvfp4_experts_only": mtq.NVFP4_EXPERTS_ONLY_CFG to QUANT_CFG_CHOICES and included it in auto-quantize/CLI --qformat choices.
Core Quantization Config
modelopt/torch/quantization/config.py
Added _nvfp4_selective_quant_cfg(...) helper and _nvfp4_quantizer_bs32, refactored NVFP4 recipe constructions to use the selector, introduced NVFP4_EXPERTS_ONLY_CFG, and updated exported choices.
PTQ Recipe
modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yml
New PTQ recipe for NVFP4 experts-only W4A4 with FP8 KV cache: enables expert-related quantizers, disables non-expert modules, and sets algorithm max.
Changelog
CHANGELOG.rst
Added entry noting the new NVFP4_EXPERTS_ONLY_CFG under 0.43 “New Features”.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name             | Status    | Explanation
Description Check      | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title check            | ✅ Passed | The title accurately and concisely summarizes the main change: adding a new NVFP4_EXPERTS_ONLY_CFG quantization configuration with its corresponding YAML recipe.
Docstring Coverage     | ✅ Passed | Docstring coverage is 100.00%, which meets the 80.00% threshold.
Security Anti-Patterns | ✅ Passed | No security anti-patterns detected in modified Python files (no torch.load with weights_only=False, numpy.load with allow_pickle=True, trust_remote_code=True, eval/exec usage, or nosec comments).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
…l/moe2

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv changed the title Add experts_only mode Add nvfp4_experts_only quant config Mar 18, 2026
@cjluo-nv cjluo-nv changed the title Add nvfp4_experts_only quant config Add NVFP4_EXPERTS_ONLY_CFG quantization config and YAML recipe Mar 18, 2026
@cjluo-nv cjluo-nv marked this pull request as ready for review March 18, 2026 17:50
@cjluo-nv cjluo-nv requested review from a team as code owners March 18, 2026 17:50
Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
examples/llm_ptq/hf_ptq.py (1)

266-284: Avoid duplicating the auto-quantize qformat allow-list.

The hardcoded list can drift from QUANT_CFG_CHOICES. Centralizing the allowed set keeps additions/removals safer.

♻️ Suggested refactor
 QUANT_CFG_CHOICES: dict[str, dict[str, Any]] = {
@@
     "mxfp8": mtq.MXFP8_DEFAULT_CFG,
 }
 
+AUTO_QUANTIZE_QFORMAT_CHOICES = {
+    "fp8",
+    "int8_sq",
+    "int8_wo",
+    "int4_awq",
+    "nvfp4",
+    "nvfp4_awq",
+    "w4a8_awq",
+    "fp8_pb_wo",
+    "w4a8_mxfp4_fp8",
+    "nvfp4_mlp_only",
+    "nvfp4_experts_only",
+    "nvfp4_omlp_only",
+    "mxfp8",
+}
+
@@
-    assert all(
-        qformat
-        in [
-            "fp8",
-            "int8_sq",
-            "int8_wo",
-            "int4_awq",
-            "nvfp4",
-            "nvfp4_awq",
-            "w4a8_awq",
-            "fp8_pb_wo",
-            "w4a8_mxfp4_fp8",
-            "nvfp4_mlp_only",
-            "nvfp4_experts_only",
-            "nvfp4_omlp_only",
-            "mxfp8",
-        ]
-        for qformat in qformat_list
-    ), "One or more quantization formats provided are not supported for unified checkpoint export"
+    assert set(qformat_list).issubset(AUTO_QUANTIZE_QFORMAT_CHOICES), (
+        "One or more quantization formats provided are not supported for unified checkpoint export"
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 266 - 284, The assertion hardcodes
the allowed qformat strings for qformat_list, which duplicates and can drift
from the canonical QUANT_CFG_CHOICES; replace the hardcoded list usage in the
assert with a check against the canonical set (e.g., QUANT_CFG_CHOICES) so the
assertion becomes something like verifying every qformat in qformat_list is in
QUANT_CFG_CHOICES (or its setified form) and keep the error message the same;
update the assertion that references qformat_list to use QUANT_CFG_CHOICES to
centralize allowed formats.
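The suggested refactor can be sketched standalone: validate against one canonical set with set operations instead of a duplicated inline list. The names below (ALLOWED_QFORMATS, validate_qformats) are hypothetical stand-ins, and the set contents are a small illustrative subset, not the full list from hf_ptq.py:

```python
# Sketch of the suggested refactor: one canonical allow-list checked
# with set operations. Contents are an illustrative subset only.
ALLOWED_QFORMATS = {"fp8", "nvfp4", "nvfp4_mlp_only", "nvfp4_experts_only"}

def validate_qformats(qformat_list: list[str]) -> None:
    # Set difference reports every unsupported format at once,
    # rather than failing on the first mismatch.
    unknown = set(qformat_list) - ALLOWED_QFORMATS
    if unknown:
        raise ValueError(
            f"Unsupported quantization format(s) for unified checkpoint export: {sorted(unknown)}"
        )

validate_qformats(["fp8", "nvfp4_experts_only"])  # passes silently
```

Adding a new format then means touching only the canonical set, so the assertion can never drift out of sync with it.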
modelopt/torch/quantization/config.py (1)

658-661: Add a focused regression test for NVFP4_EXPERTS_ONLY_CFG matching behavior.

A small test that asserts expert quantizers are enabled while dense MLP/attention stay disabled would guard against future wildcard-regression drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 658 - 661, Add a focused
regression test named something like test_nvfp4_experts_only_cfg_behavior that
loads NVFP4_EXPERTS_ONLY_CFG (and its inner _nvfp4_experts_only_quant_cfg) and
asserts the expected toggles: expert-related quantizers are enabled while dense
MLP and dense attention quantization flags remain disabled; specifically inspect
NVFP4_EXPERTS_ONLY_CFG["quant_cfg"] for the expert quantizer entries and verify
their enable/active flags are true and verify dense MLP/attention entries are
false (or absent) so the behavior is explicitly captured and fails if future
wildcard changes re-enable dense paths.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c9f0763b-454f-4a74-a271-bef65bf12618

📥 Commits

Reviewing files that changed from the base of the PR and between 1dc890d and 28be434.

📒 Files selected for processing (6)
  • examples/llm_ptq/README.md
  • examples/llm_ptq/hf_ptq.py
  • examples/llm_ptq/multinode_ptq.py
  • examples/llm_ptq/scripts/huggingface_example.sh
  • modelopt/torch/quantization/config.py
  • modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yml

Contributor

@Edwardf0t1 Edwardf0t1 left a comment

Review: Consider consolidating layer-selective NVFP4 configs with a factory function

The new NVFP4_EXPERTS_ONLY_CFG config is correct and well-integrated across the codebase, but this PR highlights a growing pattern of near-identical configs that only differ in layer glob patterns:

Config                       | Layer patterns                    | Quantizes
NVFP4_MLP_ONLY_CFG           | *mlp*, *block_sparse_moe*         | weight + input
NVFP4_EXPERTS_ONLY_CFG (new) | *mlp.experts*, *block_sparse_moe* | weight + input
NVFP4_OMLP_ONLY_CFG          | *o_proj* + MLP_ONLY patterns      | weight + input
NVFP4_MLP_WEIGHT_ONLY_CFG    | *mlp*, *block_sparse_moe*         | weight only

Each one is 10+ lines of copy-pasted dict with _nvfp4_quantizer repeated for each pattern. A factory function would eliminate the boilerplate and make adding new variants trivial:

def _nvfp4_selective_quant_cfg(
    layer_patterns: list[str],
    *,
    weight_only: bool = False,
    algorithm: str | dict = "max",
) -> dict:
    """Build an NVFP4 config that quantizes only the specified layer patterns."""
    quant_cfg = {}
    for pattern in layer_patterns:
        quant_cfg[f"{pattern}weight_quantizer"] = _nvfp4_quantizer
        if not weight_only:
            quant_cfg[f"{pattern}input_quantizer"] = _nvfp4_quantizer
    quant_cfg.update(_default_disabled_quantizer_cfg)
    return {"quant_cfg": quant_cfg, "algorithm": algorithm}


# Named constants become one-liners — public API unchanged
NVFP4_EXPERTS_ONLY_CFG = _nvfp4_selective_quant_cfg(["*mlp.experts*", "*block_sparse_moe*"])
NVFP4_MLP_ONLY_CFG = _nvfp4_selective_quant_cfg(["*mlp*", "*block_sparse_moe*"])
NVFP4_OMLP_ONLY_CFG = _nvfp4_selective_quant_cfg(["*o_proj*", "*mlp*", "*block_sparse_moe*"])
NVFP4_MLP_WEIGHT_ONLY_CFG = _nvfp4_selective_quant_cfg(
    ["*mlp*", "*block_sparse_moe*"], weight_only=True
)

Benefits:

  • Adding a new variant is a single line instead of 10+ lines of copy-paste
  • The relationship between configs is explicit (e.g., OMLP = MLP + o_proj)
  • Quantizer settings (_nvfp4_quantizer, _default_disabled_quantizer_cfg) defined in one place with no risk of drift
  • Public API (NVFP4_MLP_ONLY_CFG, etc.) is completely unchanged

Since this PR is already touching the composition of these configs (refactoring _nvfp4_mlp_only_quant_cfg to build on _nvfp4_experts_only_quant_cfg), it's a natural place to introduce this consolidation.
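To make the factory's behavior concrete, here is a runnable sketch with stub dicts standing in for ModelOpt's _nvfp4_quantizer and _default_disabled_quantizer_cfg (the stub values are illustrative, not real NVFP4 settings):

```python
# Stub quantizer settings -- illustrative values, not ModelOpt's real NVFP4 config.
_nvfp4_quantizer = {"num_bits": (2, 1), "block_sizes": {-1: 16}, "enable": True}
_default_disabled_quantizer_cfg = {"default": {"enable": False}}

def _nvfp4_selective_quant_cfg(layer_patterns, *, weight_only=False, algorithm="max"):
    """Build a config that quantizes only the given layer patterns."""
    quant_cfg = {}
    for pattern in layer_patterns:
        quant_cfg[f"{pattern}weight_quantizer"] = _nvfp4_quantizer
        if not weight_only:
            quant_cfg[f"{pattern}input_quantizer"] = _nvfp4_quantizer
    quant_cfg.update(_default_disabled_quantizer_cfg)
    return {"quant_cfg": quant_cfg, "algorithm": algorithm}

experts_cfg = _nvfp4_selective_quant_cfg(["*mlp.experts*", "*block_sparse_moe*"])
weight_only_cfg = _nvfp4_selective_quant_cfg(["*mlp*"], weight_only=True)

print(sorted(experts_cfg["quant_cfg"]))     # weight + input quantizers per pattern
print(sorted(weight_only_cfg["quant_cfg"])) # weight quantizers only
```

The weight_only flag is what lets NVFP4_MLP_WEIGHT_ONLY_CFG share the same builder as the weight+input variants.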

Minor items

  • YAML recipe copyright year: The new nvfp4_experts_only-fp8_kv.yml has Copyright (c) 2024 — should be updated for a new file.
  • multinode_ptq.py: nvfp4_omlp_only is missing from this file's choices — pre-existing gap but worth fixing while you're here.
  • CHANGELOG: For a new public config exported in __all__, a one-liner in the changelog would be appropriate.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
modelopt/torch/quantization/config.py (1)

431-440: Avoid shared mutable quantizer references in the selective builder.

Line 438 and Line 440 reuse the same quantizer dict instance across all matched keys. Any downstream mutation of one entry can unintentionally affect others in the same config.

Proposed fix
+import copy
+
 def _nvfp4_selective_quant_cfg(
     layer_patterns: list[str],
     *,
     quantizer: dict = _nvfp4_quantizer,
     weight_only: bool = False,
     algorithm: str | dict = "max",
 ) -> dict:
     """Build an NVFP4 config that quantizes only the specified layer patterns."""
     quant_cfg: dict[str, object] = {}
     for pattern in layer_patterns:
-        quant_cfg[f"{pattern}weight_quantizer"] = quantizer
+        quant_cfg[f"{pattern}weight_quantizer"] = copy.deepcopy(quantizer)
         if not weight_only:
-            quant_cfg[f"{pattern}input_quantizer"] = quantizer
+            quant_cfg[f"{pattern}input_quantizer"] = copy.deepcopy(quantizer)
     quant_cfg.update(_default_disabled_quantizer_cfg)
     return {"quant_cfg": quant_cfg, "algorithm": algorithm}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 431 - 440, The selective
NVFP4 config builder (the function with signature taking quantizer: dict =
_nvfp4_quantizer and iterating over layer_patterns to populate quant_cfg)
currently assigns the same quantizer dict instance to multiple keys, risking
shared-mutation bugs; update the assignments for f"{pattern}weight_quantizer"
and f"{pattern}input_quantizer" to store a fresh copy of the quantizer for each
key (e.g., use copy.deepcopy(quantizer) or quantizer.copy()) and add the
appropriate import for copy if using deepcopy so each entry is independent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 94c56063-d790-45ca-b7eb-8bc613192cf8

📥 Commits

Reviewing files that changed from the base of the PR and between 28be434 and a32a409.

📒 Files selected for processing (4)
  • CHANGELOG.rst
  • examples/llm_ptq/multinode_ptq.py
  • modelopt/torch/quantization/config.py
  • modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yml
✅ Files skipped from review due to trivial changes (2)
  • CHANGELOG.rst
  • modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/llm_ptq/multinode_ptq.py

@codecov

codecov bot commented Mar 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.30%. Comparing base (1dc890d) to head (c8ae68a).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1030   +/-   ##
=======================================
  Coverage   70.30%   70.30%           
=======================================
  Files         227      227           
  Lines       25854    25866   +12     
=======================================
+ Hits        18176    18185    +9     
- Misses       7678     7681    +3     

Contributor

@Edwardf0t1 Edwardf0t1 left a comment

LGTM

Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
@cjluo-nv cjluo-nv enabled auto-merge (squash) March 19, 2026 23:26