[Feature] Support FP4 communication quantization and dense block_wise_fp8 and moe nvfp4 #7817
lizexu123 wants to merge 31 commits into
Thanks for your contribution!
CI report, generated from the following code (refreshed every 30 minutes):
1 Task overview: 2 Required tasks failed and must be fixed before this PR can be merged.
2 Task status summary
2.1 Required tasks: 8/10 passed
2.2 Optional tasks: 28/30 passed
3 Failure details (Required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: insufficient code coverage (confidence: high)
Failed cases: none (all unit tests passed, TEST_EXIT_CODE=0)
Approval: insufficient approvals (confidence: high)
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7817 +/- ##
==========================================
Coverage ? 63.40%
==========================================
Files ? 462
Lines ? 64413
Branches ? 9879
==========================================
Hits ? 40840
Misses ? 20794
Partials ? 2779
root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
CI report, generated from the following code (refreshed every 30 minutes):
1 Task overview: 2 Required tasks failed and need to be handled first.
2 Task status summary
2.1 Required tasks: 7/9 passed
2.2 Optional tasks: 29/31 passed
3 Failure details (Required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: code coverage below the threshold (confidence: high)
Fix summary: add unit tests for newly added files such as nvfp4.py and mix_quant.py
Pre Commit: code style (confidence: high)
Fix summary: add a trailing newline at the end of helper.h and resubmit
CI report, generated from the following code (refreshed every 30 minutes):
1 Task overview: 3 Required tasks failed; they block the merge and need to be handled first.
2 Task status summary
2.1 Required tasks: 7/10 passed
2.2 Optional tasks: 26/28 passed
3 Failure details (Required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below the threshold (confidence: high)
Fix summary: add unit tests for the newly added FP4 quantization code such as nvfp4.py
Approval: approval not granted (confidence: high)
Fix summary: ask RD members (jiangjiajun/liuyuanle and others) to approve the PR
Pre Commit: code style (confidence: high)
Fix summary: at the end of helper.h
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-19 18:21:26
📋 Review summary
PR overview: adds FP4 communication quantization, a mixed quantization configuration (offline NVFP4 MoE + online block_wise_fp8 dense), and block-wise CUDA Graph; also fixes the error raised when loading eb5 with audio_token_num=None.
Scope of changes: custom_ops/gpu_ops/moe/, model_executor/layers/quantization/, model_executor/utils.py, worker/gpu_model_runner.py, docs/
Impact tags: [Quantization] [OP] [Graph Optimization] [Docs]
Issues
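| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | nvfp4.py:431 | Weight dtype assertions removed, silently masking weight-loading errors |
| 🟡 Suggestion | nvfp4.py:669 | fc1_latent_proj/fc2_latent_proj parameters are never used (dead code) |
| ❓ Question | prefill_permute_to_masked_gemm.cu:338 | BFLOAT16+FLOAT32 case lacks a break, triggering a -Wimplicit-fallthrough warning |
| ❓ Question | utils.py:135 | Uses the Paddle private method _is_initialized(); stability is uncertain |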
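📝 PR convention check
The title format is compliant ([Feature] is an official tag ✓). However, the ## Modifications, ## Usage or Command, and ## Accuracy Tests sections are all empty (only HTML comment placeholders), and none of the Checklist items are checked.
Suggested title (can be copied directly):
[Feature][BugFix] Support FP4 communication quantization, block_wise_fp8 dense + NVFP4 MoE mix_quant, and fix audio_token_num NoneType bug
Suggested PR description (can be copied directly):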
## Motivation
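1. Fix the bug where `audio_token_num` is `None` when loading the eb5 flagship model, which raises a `NoneType > 0` error
2. Support a mixed configuration of offline NVFP4 MoE weights + online block_wise_fp8 dense quantization (`mix_quant` overrides the NVFP4 checkpoint)
3. Support FP4 communication quantization (`FD_DISPATCH_USE_FP4=1`): hidden states are quantized to FP4 before dispatch, reducing communication volume by about 2x
4. `prefill_permute_to_masked_gemm` gains `make_scale_interleaved` support, so scales are written directly in the flashinfer swizzled layout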
## Modifications
- `fastdeploy/model_executor/forward_meta.py`: change the default value of the `audio_token_num` field to `0`, fixing the `NoneType > 0` exception
- `fastdeploy/model_executor/layers/quantization/__init__.py`: add the `mix_quant_overrides_nvfp4` logic so that the CLI `mix_quant` configuration can override the NVFP4 configuration in the model's config.json
- `fastdeploy/model_executor/layers/quantization/mix_quant.py`: add the `moe_quant_config` field and the `_build_moe_sub_config()` method so the MoE offline quantization sub-config is passed correctly
- `fastdeploy/model_executor/layers/quantization/nvfp4.py`: `apply_ep_prefill` gains an FP4 communication quantization path; `call_prefill_permute_to_masked_gemm` gains a `make_scale_interleaved` parameter
- `custom_ops/gpu_ops/moe/prefill_permute_to_masked_gemm.cu`: add the `MAKE_SCALE_INTERLEAVED` template parameter and a UINT8/BF16 mixed-dtype dispatch path
- `fastdeploy/model_executor/utils.py`: fix the loading order of MoE sub-layer weights under hybrid mix_quant
- `fastdeploy/worker/gpu_model_runner.py`: add block-wise CUDA graph cleanup logic
- `fastdeploy/envs.py`: add the `FD_DISPATCH_USE_FP4` environment variable
- `docs/`: update nvfp4.md (Chinese and English) with a usage example for the flashinfer-cutedsl backend
## Usage or Command
Enable block-wise CUDA Graph:
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048,4096"
Enable FP4 communication quantization:
export FD_DISPATCH_USE_FP4=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
--ports "9811,9812,9813,9814" \
--num-servers 4 \
--model /path/to/eb5-fp4-model \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 512 \
--ep-prefill-use-worst-num-tokens \
--quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}'
## Accuracy Tests
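N/A (this PR is mainly a performance optimization and bug fix and does not change the model's computation semantics; the PR description includes nsys profile comparison screenshots showing that the gaps between kernels in the prefill stage have been eliminated)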
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
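- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The PR's overall approach is clear. The FP4 communication quantization path and the mix_quant mixed-configuration logic are fairly complete, and the bug fix (audio_token_num) is concise and effective. Main concerns: removing the weight dtype assertions in apply() weakens error detection, so keeping them conditionally is recommended; the unused fc1/fc2_latent_proj parameters need either a comment explaining their intent or an implementation.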
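To illustrate the "keep the check conditionally" idea raised in this review, here is a minimal sketch. The helper name `check_weight_dtypes`, the `enabled` switch, and the stand-in tensors are all hypothetical and not part of FastDeploy; plain strings replace real Paddle dtypes so the snippet runs without Paddle installed.

```python
from types import SimpleNamespace


# Hypothetical helper (not FastDeploy code): validates dtypes with an explicit
# error instead of a bare assert, and can be disabled for a specific path.
def check_weight_dtypes(layer, expected, enabled=True):
    """Raise TypeError if an attribute of `layer` has an unexpected dtype."""
    if not enabled:
        return
    for name, want in expected.items():
        got = getattr(layer, name).dtype
        if got != want:
            raise TypeError(f"{name}: expected dtype {want}, got {got}")


# Stand-in for a tensor; only the `dtype` attribute matters here.
def fake_tensor(dtype):
    return SimpleNamespace(dtype=dtype)


# Attribute names follow the review comment; dtype strings are placeholders.
EXPECTED = {
    "weight": "uint8",
    "weight_scale_interleaved": "float8_e4m3fn",
    "alpha": "float32",
}

good = SimpleNamespace(
    weight=fake_tensor("uint8"),
    weight_scale_interleaved=fake_tensor("float8_e4m3fn"),
    alpha=fake_tensor("float32"),
)
bad = SimpleNamespace(
    weight=fake_tensor("bfloat16"),
    weight_scale_interleaved=fake_tensor("float8_e4m3fn"),
    alpha=fake_tensor("float32"),
)

check_weight_dtypes(good, EXPECTED)                # passes silently
check_weight_dtypes(bad, EXPECTED, enabled=False)  # check skipped on request
```

Unlike `assert`, an explicit `raise` is not removed when Python runs with `-O`, which may matter if the guard is meant to protect production inference.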
@@ -425,10 +429,6 @@ def apply(
    x_fp4, x_scale_interleaved = fp4_quantize(x, layer.input_scale_inv)

    assert x_fp4.dtype == paddle.uint8
🟡 Suggestion: three weight dtype assertions were removed, which may mask weight-loading errors
The original assertions on layer.weight.dtype, layer.weight_scale_interleaved.dtype, and layer.alpha.dtype were key guards on the inference path: they fail fast when a weight has the wrong type. They are now removed unconditionally, so if NVFP4 weight loading fails (dtype mismatch), inference runs silently and produces wrong results.
Suggested fix: skip the assertions only on the hybrid mix_quant path, or turn them into a conditional check:
    if not getattr(self, 'skip_weight_dtype_check', False):
        assert layer.weight.dtype == paddle.uint8
        assert layer.weight_scale_interleaved.dtype == paddle.float8_e4m3fn
        assert layer.alpha.dtype == paddle.float32

gate: nn.Layer,
topk_ids_hookfunc: Callable = None,
shared_experts: nn.Layer = None,
fc1_latent_proj: nn.Layer = None,
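🟡 Suggestion: the fc1_latent_proj and fc2_latent_proj parameters are never used in the function body (dead code)
Both apply_ep_prefill and apply_ep_decode add these two parameters, but neither function body uses them. If they are reserved for a future interface, add a comment stating the intent; if the implementation was omitted by mistake, add the missing logic.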
switch (topk) {
  DISPATCH_TOPK(
      paddle::DataType::BFLOAT16, paddle::DataType::UINT8, 4)
  DISPATCH_TOPK(
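❓ Question: the case block for BFLOAT16 x + FLOAT32 scale lacks a break, leaving an implicit fall-through
The case paddle::DataType::FLOAT32: block has no break at its end and is immediately followed by the newly added case paddle::DataType::UINT8: branch, which triggers the -Wimplicit-fallthrough compiler warning. Every actual execution path terminates through return (the DISPATCH_TOPK macro) or PD_THROW, so no real fall-through occurs, but adding a break or a [[fallthrough]] annotation is recommended to make the intent explicit and remove the warning.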
@@ -133,6 +133,9 @@ def slice_fn(weight_or_parameter, output_dim, start, end, step=1):

def process_weight_transpose(layer, weight_name):
    weight = getattr(layer, weight_name)
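❓ Question: uses the Paddle private method _is_initialized()
_is_initialized() is an underscore-prefixed private/internal method that may be renamed or removed when Paddle is upgraded, so its stability is not guaranteed. Consider switching to a public API (or catching AttributeError, for example), or confirm that the method exists stably in the Paddle versions that FastDeploy supports.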
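One way to act on this comment is to probe for the private method defensively rather than calling it directly. The sketch below is only an illustration: the wrapper name `is_initialized`, its `default` fallback, and the two demo classes are hypothetical, and whether falling back to `True` is appropriate depends on the real call site.

```python
def is_initialized(tensor, default=True):
    """Best-effort check that tolerates a missing `_is_initialized` method.

    `default` is returned when the private method is absent; which value is
    safe is an assumption that depends on the caller.
    """
    probe = getattr(tensor, "_is_initialized", None)
    if callable(probe):
        return bool(probe())
    return default


# Demo stand-ins (not Paddle classes): one exposes the private method,
# the other mimics a future version where it has been removed.
class WithProbe:
    def __init__(self, ready):
        self._ready = ready

    def _is_initialized(self):
        return self._ready


class WithoutProbe:
    pass


print(is_initialized(WithProbe(False)))   # False
print(is_initialized(WithoutProbe()))     # True (falls back to default)
```

Centralizing the access in one wrapper also means only a single place needs updating if the method is renamed.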
Motivation
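1. Fix the bug where audio_token_num is None when running fp4 on eb5, which led to a `NoneType > 0` comparison, and fix loading of the eb5 flagship model.
Support fp4 communication quantization, taking hidden_size = 7168 as an example.
2. FastDeploy's CUDA Graph capture currently works at the whole-model level, which is coarse-grained and limits flexibility. This PR introduces a Block-wise CUDA Graph mechanism that captures and replays CUDA Graphs independently at the level of a single operator or layer (such as Linear or RMSNorm), enabling finer-grained graph optimization and improving inference performance in the prefill stage.
3. Support the configuration of block_wise_fp8 dense online quantization + nvfp4 offline quantization:
--quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}' \
4. Support fp4 deepep communication.
Enable fp4 communication quantization: export FD_DISPATCH_USE_FP4=1
The prefill stage can now enter cuda_graph, and the gaps between kernels are reduced, as shown in the figures below.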


上图为之前的空隙
优化后基本无空隙
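For the hidden_size = 7168 example mentioned above, a rough per-token payload estimate can be worked out as follows. This is a back-of-the-envelope sketch under stated assumptions, not the PR's actual wire format: it assumes FP4 values are packed two per byte with one 1-byte scale per 16 values (the usual NVFP4 block layout), and, for comparison, FP8 with one 4-byte scale per 128 values. The helper `bytes_per_token` is hypothetical.

```python
def bytes_per_token(hidden_size, bits_per_value, block_size=None, scale_bytes=0):
    """Payload bytes for one token's hidden state, plus per-block scales."""
    payload = hidden_size * bits_per_value // 8
    scales = (hidden_size // block_size) * scale_bytes if block_size else 0
    return payload + scales


HIDDEN = 7168

bf16 = bytes_per_token(HIDDEN, 16)                               # no scales
fp8 = bytes_per_token(HIDDEN, 8, block_size=128, scale_bytes=4)  # assumed layout
fp4 = bytes_per_token(HIDDEN, 4, block_size=16, scale_bytes=1)   # assumed layout

print(bf16, fp8, fp4)            # 14336 7392 4032
print(round(fp8 / fp4, 2))       # 1.83
print(round(bf16 / fp4, 2))      # 3.56
```

Under these assumptions the FP4 payload is a little under half the FP8 payload, because the finer-grained scales take back part of the saving.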
Modifications
Usage or Command
Accuracy Tests
Checklist
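- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.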