Step3.5 MoE support #1063
base: main
Changes from all commits
Commits: 8634500 · 4cd8893 · ccc7ae1 · 7d2b032 · 3ce0335 · 04f99a4 · b0d8adc · d6d71a7 · 6a79b41
```diff
@@ -641,6 +641,13 @@
     "*mlp*input_quantizer": _nvfp4_quantizer,
     "*block_sparse_moe*weight_quantizer": _nvfp4_quantizer,
     "*block_sparse_moe*input_quantizer": _nvfp4_quantizer,
+    # Step3p5 MoE experts: MoELinear lives at *.moe.{up,gate,down}_proj
+    "*moe*weight_quantizer": _nvfp4_quantizer,
+    "*moe*input_quantizer": _nvfp4_quantizer,
+    # Disable *moe.gate.* for the router
+    "*moe.gate.*": {"enable": False},
+    # Disable share_expert (dense MLP alongside MoE, not in MLP-only quant scope)
+    "*share_expert*": {"enable": False},
     **_default_disabled_quantizer_cfg,
 }
```

**Contributor** (on lines +645 to +646): Is this wildcard pattern `*moe*` too broad? It matches ANY module with "moe" anywhere in its path.

**Collaborator** (on the `*share_expert*` entry): should we move it to the experts_only cfg?
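The reviewer's breadth concern can be illustrated with Python's `fnmatch`, used here as a stand-in for modelopt's actual wildcard matcher (an assumption; the module paths below are hypothetical examples, not names from this PR):

```python
from fnmatch import fnmatch

PATTERN = "*moe*weight_quantizer"

# Intended target: a Step3p5 expert projection under `.moe.`
assert fnmatch("model.layers.3.moe.up_proj.weight_quantizer", PATTERN)

# But "moe" can appear anywhere in the path, so the pattern also overlaps
# the existing *block_sparse_moe* rule on other architectures
assert fnmatch("model.layers.3.block_sparse_moe.experts.0.weight_quantizer", PATTERN)

# The router under moe.gate matches the broad pattern too, which is why the
# config needs the explicit `"*moe.gate.*": {"enable": False}` override
assert fnmatch("model.layers.3.moe.gate.weight_quantizer", PATTERN)
assert fnmatch("model.layers.3.moe.gate.weight_quantizer", "*moe.gate.*")
```

Whether the broad match causes harm depends on which rule wins when multiple patterns apply, so a narrower pattern such as `*moe.*_proj*weight_quantizer` may be the safer choice.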
New file (`@@ -0,0 +1,84 @@`):

```yaml
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

metadata:
  recipe_type: ptq
  description: NVFP4 static weight and dynamic activation for all linear layers (W4A4), FP8 KV cache, max calibration.
ptq_cfg:
  algorithm: max
  quant_cfg:
    '*moe*weight_quantizer':
      block_sizes:
        -1: 16
        type: dynamic
        scale_bits: e4m3
      num_bits: e2m1
      enable: true
    '*moe*input_quantizer':
      block_sizes:
        -1: 16
        type: dynamic
        scale_bits: e4m3
      num_bits: e2m1
      enable: true
    '*mlp*weight_quantizer':
      block_sizes:
        -1: 16
        type: dynamic
        scale_bits: e4m3
      num_bits: e2m1
      enable: true
    '*mlp*input_quantizer':
      block_sizes:
        -1: 16
        type: dynamic
        scale_bits: e4m3
      num_bits: e2m1
      enable: true
    '*share_expert.*':
      enable: false
    '*moe.gate.*':
      enable: false
    default:
      enable: false
    '*linear_attn.conv1d*':
      enable: false
    '*lm_head*':
      enable: false
    '*mixer.conv1d*':
      enable: false
    '*output_layer*':
      enable: false
    '*proj_out.*':
      enable: false
    '*router*':
      enable: false
    output.*:
      enable: false
    nn.BatchNorm1d:
      '*':
        enable: false
    nn.BatchNorm2d:
      '*':
        enable: false
    nn.BatchNorm3d:
      '*':
        enable: false
    nn.LeakyReLU:
      '*':
        enable: false
    '*[kv]_bmm_quantizer':
      num_bits: e4m3
      enable: true
```

**Collaborator** (on the file path): @shengliangxu should we go with `/models/Step3.5-Flash/nvfp4-mlp-only.yaml` instead?

**Contributor** (on the `description` field): Align the recipe description with the actual quantization scope. The description says "for all linear layers," but `default: enable: false` disables everything except the MoE/MLP projections.

📝 Suggested wording:

```diff
-  description: NVFP4 static weight and dynamic activation for all linear layers (W4A4), FP8 KV cache, max calibration.
+  description: NVFP4 W4A4 for MoE/MLP projections, FP8 KV cache, max calibration.
```

Also applies to: 54-55.
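The NVFP4 entries above (`num_bits: e2m1`, `block_sizes: {-1: 16}`, `scale_bits: e4m3`) describe 4-bit floating-point values quantized in blocks of 16 with a shared per-block scale. A minimal sketch of the per-block arithmetic, under the assumption that the scale maps the block's amax onto e2m1's largest magnitude; real NVFP4 additionally rounds the scale itself to e4m3 (plus a global FP32 scale), which this sketch skips:

```python
# Magnitudes representable in FP4 e2m1 (1 sign, 2 exponent, 1 mantissa bit)
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_nvfp4(block):
    """Fake-quantize one block of (up to) 16 values to the e2m1 grid.

    Sketch only: the shared scale maps the block's amax onto e2m1's max
    magnitude (6.0); the e4m3 rounding of the scale is omitted.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block), 0.0
    scale = amax / 6.0
    out = []
    for x in block:
        # round |x|/scale to the nearest representable e2m1 magnitude
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out, scale

dequant, scale = quantize_block_nvfp4([6.0, 3.0, -1.5, 0.4])
# scale == 1.0 here; 0.4 snaps to the nearest grid point, 0.5
```

With `type: dynamic` on the input quantizers, this scale is recomputed per block at inference time rather than frozen during calibration.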
**Contributor** (on the export path): Handle untouched experts before stacking `input_scale`. This code assumes every Step3p5 expert exported the same scale buffers. Unlike the other MoE export paths earlier in this file, `QuantMoELinear.experts` never get a `set_expert_quantizer_amax()` fallback, so a calibration run that leaves some experts untouched can leave `input_scale` missing on only a subset of experts. That turns export into either an `AttributeError` here, or a silently missing stacked `input_scale` when expert 0 was never hit. Please backfill/validate buffer presence before this stack.
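A hedged sketch of the backfill the reviewer asks for, using plain floats in place of tensor buffers (`backfill_expert_input_scales` and the max-of-observed fallback policy are illustrative assumptions, not code from this PR):

```python
def backfill_expert_input_scales(scales):
    """Backfill missing per-expert input_scale values before stacking.

    `scales` holds one entry per expert; None marks an expert that
    calibration never touched. Assumption: reusing the max of the observed
    scales is a conservative fallback for uncalibrated experts, since a
    larger scale widens the representable range rather than clipping.
    """
    present = [s for s in scales if s is not None]
    if not present:
        raise ValueError("no expert has a calibrated input_scale to backfill from")
    fallback = max(present)
    return [fallback if s is None else s for s in scales]
```

With a pass like this before the stack, export no longer depends on expert 0 (or any particular expert) having been routed to during calibration.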