Add NVFP4_EXPERTS_ONLY_CFG quantization config and YAML recipe#1030
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough

Adds a new NVFP4 experts-only quantization configuration (`NVFP4_EXPERTS_ONLY_CFG`) and integrates it across docs, example scripts, the core quantization config generator, a new PTQ recipe, and a CLI helper; it targets MoE expert layers while leaving dense MLP/attention unchanged.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
…l/moe2 Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
🧹 Nitpick comments (2)
examples/llm_ptq/hf_ptq.py (1)
**266-284: Avoid duplicating the auto-quantize qformat allow-list.**

The hardcoded list can drift from `QUANT_CFG_CHOICES`. Centralizing the allowed set keeps additions/removals safer.

♻️ Suggested refactor

```diff
 QUANT_CFG_CHOICES: dict[str, dict[str, Any]] = {
@@
     "mxfp8": mtq.MXFP8_DEFAULT_CFG,
 }
+AUTO_QUANTIZE_QFORMAT_CHOICES = {
+    "fp8",
+    "int8_sq",
+    "int8_wo",
+    "int4_awq",
+    "nvfp4",
+    "nvfp4_awq",
+    "w4a8_awq",
+    "fp8_pb_wo",
+    "w4a8_mxfp4_fp8",
+    "nvfp4_mlp_only",
+    "nvfp4_experts_only",
+    "nvfp4_omlp_only",
+    "mxfp8",
+}
+
@@
-    assert all(
-        qformat
-        in [
-            "fp8",
-            "int8_sq",
-            "int8_wo",
-            "int4_awq",
-            "nvfp4",
-            "nvfp4_awq",
-            "w4a8_awq",
-            "fp8_pb_wo",
-            "w4a8_mxfp4_fp8",
-            "nvfp4_mlp_only",
-            "nvfp4_experts_only",
-            "nvfp4_omlp_only",
-            "mxfp8",
-        ]
-        for qformat in qformat_list
-    ), "One or more quantization formats provided are not supported for unified checkpoint export"
+    assert set(qformat_list).issubset(AUTO_QUANTIZE_QFORMAT_CHOICES), (
+        "One or more quantization formats provided are not supported for unified checkpoint export"
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `examples/llm_ptq/hf_ptq.py` around lines 266-284: the assertion hardcodes the allowed qformat strings for `qformat_list`, which duplicates and can drift from the canonical `QUANT_CFG_CHOICES`; replace the hardcoded list in the assert with a check against the canonical set (e.g., `QUANT_CFG_CHOICES` or its setified form) so the assertion verifies every qformat in `qformat_list` is an allowed choice, keeping the error message the same.

modelopt/torch/quantization/config.py (1)
**658-661: Add a focused regression test for `NVFP4_EXPERTS_ONLY_CFG` matching behavior.**

A small test that asserts expert quantizers are enabled while dense MLP/attention stay disabled would guard against future wildcard-regression drift.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/config.py` around lines 658 - 661, Add a focused regression test named something like test_nvfp4_experts_only_cfg_behavior that loads NVFP4_EXPERTS_ONLY_CFG (and its inner _nvfp4_experts_only_quant_cfg) and asserts the expected toggles: expert-related quantizers are enabled while dense MLP and dense attention quantization flags remain disabled; specifically inspect NVFP4_EXPERTS_ONLY_CFG["quant_cfg"] for the expert quantizer entries and verify their enable/active flags are true and verify dense MLP/attention entries are false (or absent) so the behavior is explicitly captured and fails if future wildcard changes re-enable dense paths.
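A sketch of what such a test could look like, using `fnmatch`-style wildcard resolution and a stand-in dict in place of the real `NVFP4_EXPERTS_ONLY_CFG`; the actual quantizer attributes and matching logic inside ModelOpt may differ.

```python
from fnmatch import fnmatch

# Stand-in mirroring the described structure of NVFP4_EXPERTS_ONLY_CFG:
# expert patterns carry an enabled NVFP4 quantizer, everything else falls
# through to a disabled default. Attribute names/values are placeholders.
NVFP4_EXPERTS_ONLY_CFG = {
    "quant_cfg": {
        "*mlp.experts*weight_quantizer": {"num_bits": (2, 1), "enable": True},
        "*mlp.experts*input_quantizer": {"num_bits": (2, 1), "enable": True},
        "*block_sparse_moe*weight_quantizer": {"num_bits": (2, 1), "enable": True},
        "*block_sparse_moe*input_quantizer": {"num_bits": (2, 1), "enable": True},
        "default": {"enable": False},
    },
    "algorithm": "max",
}


def resolve(name: str) -> dict:
    """Return the quantizer settings the config would apply to a module name."""
    for pattern, cfg in NVFP4_EXPERTS_ONLY_CFG["quant_cfg"].items():
        if pattern != "default" and fnmatch(name, pattern):
            return cfg
    return NVFP4_EXPERTS_ONLY_CFG["quant_cfg"]["default"]


def test_experts_only_matching():
    # Expert layers are quantized ...
    assert resolve("model.layers.0.mlp.experts.7.down_proj.weight_quantizer")["enable"]
    # ... while dense MLP and attention fall through to the disabled default.
    assert not resolve("model.layers.0.mlp.gate_proj.weight_quantizer")["enable"]
    assert not resolve("model.layers.0.self_attn.q_proj.weight_quantizer")["enable"]


test_experts_only_matching()
```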
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c9f0763b-454f-4a74-a271-bef65bf12618
📒 Files selected for processing (6)
- examples/llm_ptq/README.md
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/multinode_ptq.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/quantization/config.py
- modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yml
Edwardf0t1 left a comment
Review: Consider consolidating layer-selective NVFP4 configs with a factory function
The new `NVFP4_EXPERTS_ONLY_CFG` config is correct and well-integrated across the codebase, but this PR highlights a growing pattern of near-identical configs that only differ in layer glob patterns:

| Config | Layer patterns | Quantizes |
|---|---|---|
| `NVFP4_MLP_ONLY_CFG` | `*mlp*`, `*block_sparse_moe*` | weight + input |
| `NVFP4_EXPERTS_ONLY_CFG` (new) | `*mlp.experts*`, `*block_sparse_moe*` | weight + input |
| `NVFP4_OMLP_ONLY_CFG` | `*o_proj*` + MLP_ONLY patterns | weight + input |
| `NVFP4_MLP_WEIGHT_ONLY_CFG` | `*mlp*`, `*block_sparse_moe*` | weight only |
Each one is 10+ lines of copy-pasted dict with `_nvfp4_quantizer` repeated for each pattern. A factory function would eliminate the boilerplate and make adding new variants trivial:

```python
def _nvfp4_selective_quant_cfg(
    layer_patterns: list[str],
    *,
    weight_only: bool = False,
    algorithm: str | dict = "max",
) -> dict:
    """Build an NVFP4 config that quantizes only the specified layer patterns."""
    quant_cfg = {}
    for pattern in layer_patterns:
        quant_cfg[f"{pattern}weight_quantizer"] = _nvfp4_quantizer
        if not weight_only:
            quant_cfg[f"{pattern}input_quantizer"] = _nvfp4_quantizer
    quant_cfg.update(_default_disabled_quantizer_cfg)
    return {"quant_cfg": quant_cfg, "algorithm": algorithm}


# Named constants become one-liners; public API unchanged
NVFP4_EXPERTS_ONLY_CFG = _nvfp4_selective_quant_cfg(["*mlp.experts*", "*block_sparse_moe*"])
NVFP4_MLP_ONLY_CFG = _nvfp4_selective_quant_cfg(["*mlp*", "*block_sparse_moe*"])
NVFP4_OMLP_ONLY_CFG = _nvfp4_selective_quant_cfg(["*o_proj*", "*mlp*", "*block_sparse_moe*"])
NVFP4_MLP_WEIGHT_ONLY_CFG = _nvfp4_selective_quant_cfg(
    ["*mlp*", "*block_sparse_moe*"], weight_only=True
)
```

Benefits:
- Adding a new variant is a single line instead of 10+ lines of copy-paste
- The relationship between configs is explicit (e.g., OMLP = MLP + `o_proj`)
- Quantizer settings (`_nvfp4_quantizer`, `_default_disabled_quantizer_cfg`) are defined in one place with no risk of drift
- Public API (`NVFP4_MLP_ONLY_CFG`, etc.) is completely unchanged
Since this PR is already touching the composition of these configs (refactoring `_nvfp4_mlp_only_quant_cfg` to build on `_nvfp4_experts_only_quant_cfg`), it's a natural place to introduce this consolidation.
Minor items
- YAML recipe copyright year: the new `nvfp4_experts_only-fp8_kv.yml` has `Copyright (c) 2024`, which should be updated for a new file.
- `multinode_ptq.py`: `nvfp4_omlp_only` is missing from this file's choices; a pre-existing gap, but worth fixing while you're here.
- CHANGELOG: for a new public config exported in `__all__`, a one-liner in the changelog would be appropriate.
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
🧹 Nitpick comments (1)
modelopt/torch/quantization/config.py (1)
**431-440: Avoid shared mutable quantizer references in the selective builder.**

Line 438 and Line 440 reuse the same `quantizer` dict instance across all matched keys. Any downstream mutation of one entry can unintentionally affect others in the same config.

Proposed fix

```diff
+import copy
+
 def _nvfp4_selective_quant_cfg(
     layer_patterns: list[str],
     *,
     quantizer: dict = _nvfp4_quantizer,
     weight_only: bool = False,
     algorithm: str | dict = "max",
 ) -> dict:
     """Build an NVFP4 config that quantizes only the specified layer patterns."""
     quant_cfg: dict[str, object] = {}
     for pattern in layer_patterns:
-        quant_cfg[f"{pattern}weight_quantizer"] = quantizer
+        quant_cfg[f"{pattern}weight_quantizer"] = copy.deepcopy(quantizer)
         if not weight_only:
-            quant_cfg[f"{pattern}input_quantizer"] = quantizer
+            quant_cfg[f"{pattern}input_quantizer"] = copy.deepcopy(quantizer)
     quant_cfg.update(_default_disabled_quantizer_cfg)
     return {"quant_cfg": quant_cfg, "algorithm": algorithm}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/config.py` around lines 431 - 440, The selective NVFP4 config builder (the function with signature taking quantizer: dict = _nvfp4_quantizer and iterating over layer_patterns to populate quant_cfg) currently assigns the same quantizer dict instance to multiple keys, risking shared-mutation bugs; update the assignments for f"{pattern}weight_quantizer" and f"{pattern}input_quantizer" to store a fresh copy of the quantizer for each key (e.g., use copy.deepcopy(quantizer) or quantizer.copy()) and add the appropriate import for copy if using deepcopy so each entry is independent.
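The hazard flagged here is ordinary Python reference semantics, easy to demonstrate in isolation (the dict contents below are placeholders, not real quantizer settings):

```python
import copy

quantizer = {"num_bits": (2, 1), "enable": True}

# Shared reference: both entries point at the same dict object,
# so mutating one entry silently changes the other.
shared = {"a.weight_quantizer": quantizer, "a.input_quantizer": quantizer}
shared["a.weight_quantizer"]["enable"] = False
print(shared["a.input_quantizer"]["enable"])  # False -- the mutation leaked

# Independent deep copies: mutating one entry leaves the other untouched.
quantizer = {"num_bits": (2, 1), "enable": True}
isolated = {
    "a.weight_quantizer": copy.deepcopy(quantizer),
    "a.input_quantizer": copy.deepcopy(quantizer),
}
isolated["a.weight_quantizer"]["enable"] = False
print(isolated["a.input_quantizer"]["enable"])  # True
```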
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 94c56063-d790-45ca-b7eb-8bc613192cf8
📒 Files selected for processing (4)
- CHANGELOG.rst
- examples/llm_ptq/multinode_ptq.py
- modelopt/torch/quantization/config.py
- modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yml
✅ Files skipped from review due to trivial changes (2)
- CHANGELOG.rst
- modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yml
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/llm_ptq/multinode_ptq.py
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@           Coverage Diff           @@
##             main    #1030   +/-   ##
=======================================
  Coverage   70.30%   70.30%
=======================================
  Files         227      227
  Lines       25854    25866     +12
=======================================
+ Hits        18176    18185      +9
- Misses       7678     7681      +3
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
What does this PR do?

Type of change: New feature

Add `NVFP4_EXPERTS_ONLY_CFG` quantization config that targets only MoE expert layers (`*mlp.experts*` and `*block_sparse_moe*`) with NVFP4 (W4A4) quantization, leaving all other layers (including non-expert MLP) unquantized. This is useful for MoE models where selectively quantizing only expert layers provides a good accuracy-performance tradeoff.

Changes:
- Added `_nvfp4_experts_only_quant_cfg` as a reusable building block in `config.py`, with `_nvfp4_mlp_only_quant_cfg` now composing on top of it
- Added `NVFP4_EXPERTS_ONLY_CFG` to the Python config choices
- Added the `nvfp4_experts_only-fp8_kv.yml` YAML recipe to the new recipe system (`modelopt_recipes/general/ptq/`)
- Updated `hf_ptq.py`, `multinode_ptq.py`, example scripts, and README to include the new config

Usage
Or via the YAML recipe system:
Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: N/A

Additional Information
The `experts_only` config is a subset of `mlp_only`: it quantizes `*mlp.experts*` and `*block_sparse_moe*` patterns but not the broader `*mlp*` pattern. The Python config was refactored so `_nvfp4_mlp_only_quant_cfg` composes on top of `_nvfp4_experts_only_quant_cfg`.
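To illustrate the subset relationship, assuming the wildcard keys behave like `fnmatch`-style globs against module names (the module names below are made up for the example):

```python
from fnmatch import fnmatch

EXPERTS_ONLY = ["*mlp.experts*", "*block_sparse_moe*"]
MLP_ONLY = ["*mlp*", "*block_sparse_moe*"]


def matched(name: str, patterns: list[str]) -> bool:
    """True if any glob pattern matches the module name."""
    return any(fnmatch(name, p) for p in patterns)


expert_proj = "model.layers.0.mlp.experts.3.down_proj"
dense_proj = "model.layers.0.mlp.gate_proj"

# Expert layers match both configs ...
assert matched(expert_proj, EXPERTS_ONLY) and matched(expert_proj, MLP_ONLY)
# ... but a dense MLP projection matches only the broader *mlp* pattern,
# so experts_only leaves it unquantized while mlp_only quantizes it.
assert not matched(dense_proj, EXPERTS_ONLY)
assert matched(dense_proj, MLP_ONLY)
```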