[Feature] Support FP4 communication quantization and dense block_wise_fp8 and moe nvfp4 by lizexu123 · Pull Request #7817 · PaddlePaddle/FastDeploy

lizexu123 · 2026-05-14T09:21:01Z

Motivation

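1. Fix a bug when running FP4 on eb5: audio_token_num is None, which leads to a `NoneType > 0` comparison. Also fix loading of the eb5 flagship model.
Support FP4 communication quantization, taking hidden_size = 7168 as an example.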
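
To put the hidden_size = 7168 example in numbers, the sketch below estimates the per-token dispatch payload under three layouts. The layouts are assumptions, not something this PR states: BF16 with no scales, FP8 with one float32 scale per 128 elements, and NVFP4 with 4-bit values plus one 1-byte FP8 scale per 16 elements. The actual wire format of the dispatch path may differ.

```python
# Rough per-token payload estimate for hidden_size = 7168.
# All three layouts below are assumptions for illustration only.

HIDDEN_SIZE = 7168


def bf16_bytes(hidden: int) -> int:
    # 2 bytes per element, no scale factors.
    return hidden * 2


def fp8_bytes(hidden: int, group: int = 128) -> int:
    # 1 byte per element plus one 4-byte float32 scale per group.
    return hidden + (hidden // group) * 4


def nvfp4_bytes(hidden: int, group: int = 16) -> int:
    # Two 4-bit values per byte plus one 1-byte FP8 scale per group.
    return hidden // 2 + hidden // group


payload = {
    "bf16": bf16_bytes(HIDDEN_SIZE),
    "fp8": fp8_bytes(HIDDEN_SIZE),
    "nvfp4": nvfp4_bytes(HIDDEN_SIZE),
}

for name, size in payload.items():
    print(f"{name:>5}: {size} bytes/token")
```

Under these assumptions NVFP4 needs 4032 bytes per token, versus 7392 for FP8 and 14336 for BF16.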

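2. FastDeploy currently captures CUDA Graphs at whole-model granularity, which is coarse and limits flexibility. This PR introduces a block-wise CUDA Graph mechanism that captures and replays CUDA Graphs independently at the level of a single operator or layer (such as Linear or RMSNorm), enabling finer-grained graph optimization and improving inference performance in the prefill stage.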

3. Support a configuration that combines online block_wise_fp8 quantization for dense layers with offline nvfp4 quantization:
--quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}' \
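
Because the `--quantization` value is a JSON string, building it programmatically avoids quoting mistakes. The sketch below only uses the keys and values shown in the command above; how FastDeploy validates them is not covered here.

```python
import json

# Keys and values are copied from the --quantization argument above.
quant_config = {
    "quantization": "mix_quant",
    "dense_quant_type": "block_wise_fp8",
    "is_moe_quantized": True,
    "moe_quant_type": "modelopt_fp4",
}

# json.dumps emits lowercase `true`, as the JSON argument requires.
quant_arg = json.dumps(quant_config)

# Round-trip to confirm the string is valid JSON.
parsed = json.loads(quant_arg)

print(f"--quantization '{quant_arg}'")
```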

4. Support FP4 DeepEP communication.
Enable FP4 communication quantization: export FD_DISPATCH_USE_FP4=1

# Enable block-wise CUDA Graph
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1

# Customize the token counts to pre-capture (optional)
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048"

# To check which prefill linear layers enter the CUDA graph
export FD_BLOCK_WISE_DEBUG=1

# Enable FP4 communication quantization
export FD_USE_NVFP4_COMM_QUANT=1
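
A list of pre-captured sizes is commonly used by rounding a runtime token count up to the nearest captured size. The sketch below shows that pattern for the `FD_BLOCK_WISE_CUDA_GRAPH_SIZES` format. It is an illustration, not FastDeploy's implementation: `parse_capture_sizes` and `pick_bucket` are hypothetical names, and whether FastDeploy pads to a bucket or falls back to eager execution is not stated in this PR.

```python
import bisect
from typing import List, Optional

DEFAULT_SIZES = "1,2,4,8,16,32,64,128,256,512,1024,2048"


def parse_capture_sizes(raw: str) -> List[int]:
    # Accept a comma-separated list; drop blanks and duplicates, sort ascending.
    return sorted({int(tok) for tok in raw.split(",") if tok.strip()})


def pick_bucket(num_tokens: int, sizes: List[int]) -> Optional[int]:
    # Smallest captured size that can hold num_tokens, or None if too large.
    idx = bisect.bisect_left(sizes, num_tokens)
    return sizes[idx] if idx < len(sizes) else None


sizes = parse_capture_sizes(DEFAULT_SIZES)
for n in (1, 3, 100, 2048, 5000):
    print(n, "->", pick_bucket(n, sizes))
```

Returning `None` for a token count above the largest captured size leaves the fallback decision to the caller.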

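The prefill stage now runs inside a CUDA graph, which reduces the gaps between kernels, as shown in the figures below.

The figure above shows the gaps before the change.

After the optimization there are almost no gaps.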

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

…into kc_d

paddle-bot · 2026-05-14T09:21:08Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-14T09:55:37Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-20 08:23:16

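This CI report was generated from the following code (refreshed every 30 minutes):

PR commit: e20676d
Merge base: bda1756 (branch: develop)
View full diff
CI details

1 Job overview

2 required jobs failed and must be fixed before merging.

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
58(18)	40	36	4	0	0	0

2 Job status summary

2.1 Required jobs: 8/10 passed

Required jobs block merging; fix failures first.

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`run_tests_with_coverage`	1h21m	PR issue: diff coverage is only 35%, below the 80% threshold	Add tests for nvfp4.py and the other files, or request a waiver	Job	🔄×1
❌	`Approval`	10s	PR issue: the change to envs.py lacks approval from a designated RD	Ask jiangjiajun or another designated RD to approve	Job	-
✅	The other 8 required jobs passed	-	-	-	-	-

2.2 Optional jobs: 28/30 passed

Optional jobs do not block merging; failures are informational only.

Status	Job	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	2m35s	Job	-
❌	`Check PR Template`	14s	Job	-
✅	The other 28 optional jobs passed	-	-	-

3 Failure details (required only)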

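Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: insufficient code coverage (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Status: ❌ Failed
Error type: insufficient code coverage
Confidence: high
Root cause summary: diff coverage of the quantization code added by this PR is only 35%, below the 80% threshold
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases: none (all unit tests passed, TEST_EXIT_CODE=0)

Root cause details:
All unit tests passed (TEST_EXIT_CODE=0), but the coverage check failed (COVERAGE_EXIT_CODE=9). The PR adds FP4/FP8 quantization code. Of the 645 changed lines in the diff, 53 lines are measured for coverage and 34 of them are uncovered, giving 35% coverage, below the 80% threshold. Files with the lowest coverage: nvfp4.py (0%, 16 lines uncovered), fused_moe_deepgemm_backend.py (0%, 1 line uncovered), gpu_model_runner.py (0%, 3 lines uncovered), mix_quant.py (44.4%), quantization/__init__.py (33.3%).

Suggested fixes:

Add unit tests for fastdeploy/model_executor/layers/quantization/nvfp4.py (0% coverage, 16 lines uncovered)
Add unit tests for fastdeploy/model_executor/layers/moe/fused_moe_deepgemm_backend.py (0% coverage)
If this code can only be tested on real GPU hardware (H100/H200), request a coverage waiver from the CI maintainers

Related changes: fastdeploy/model_executor/layers/quantization/nvfp4.py, fastdeploy/model_executor/layers/moe/fused_moe_deepgemm_backend.py, fastdeploy/worker/gpu_model_runner.py, fastdeploy/model_executor/layers/quantization/mix_quant.py, fastdeploy/model_executor/layers/quantization/__init__.py

Link: view log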

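Approval: insufficient approvals (confidence: high)

Approval

Status: ❌ Failed
Error type: insufficient approvals
Confidence: high
Root cause summary: modifying fastdeploy/envs.py requires approval from a designated RD; 1 approval is currently missing
Analyzer: ci_analyze_infra

Root cause details:
The PR adds the FD_DISPATCH_USE_FP4 environment variable, which modifies fastdeploy/envs.py. Under the repository protection rules, a change to this file needs an Approve from one of the following designated FastDeploy RDs: Jiang-Jia-Jun (jiangjiajun), yuanlehome (liuyuanle), rainyfly (chenjian26), Wanglongzhi2001 (wanglongzhi). One required approval is currently missing (exit code 6).

Suggested fixes:

Ask any one of jiangjiajun / liuyuanle / chenjian26 / wanglongzhi to review and approve this PR

Link: view log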

codecov-commenter · 2026-05-14T10:08:49Z

Codecov Report

❌ Patch coverage is 28.30189% with 38 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@bda1756). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...deploy/model_executor/layers/quantization/nvfp4.py	0.00%	16 Missing ⚠️
...loy/model_executor/layers/quantization/__init__.py	16.66%	8 Missing and 2 partials ⚠️
...oy/model_executor/layers/quantization/mix_quant.py	33.33%	5 Missing and 1 partial ⚠️
fastdeploy/worker/gpu_model_runner.py	0.00%	3 Missing ⚠️
fastdeploy/model_executor/utils.py	81.81%	1 Missing and 1 partial ⚠️
..._executor/layers/moe/fused_moe_deepgemm_backend.py	0.00%	1 Missing ⚠️
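
The two coverage figures in this thread (35% from the CI job, 28.30% from Codecov) can both be reproduced from the same 53 measured lines. The sketch below does the arithmetic. It assumes the CI job counts only the 34 fully missing lines, that Codecov also counts the 4 partial lines as missing (38 in total), and that the CI figure is truncated rather than rounded. These assumptions are inferred from the reported numbers, not confirmed by either tool's output.

```python
def diff_coverage_percent(measured_lines: int, uncovered_lines: int) -> float:
    # Share of measured diff lines that were executed by the tests.
    covered = measured_lines - uncovered_lines
    return 100.0 * covered / measured_lines


MEASURED = 53

# CI job: 34 violations out of 53 measured lines.
ci_percent = diff_coverage_percent(MEASURED, 34)

# Codecov: 34 missing lines + 4 partials = 38 lines lacking coverage.
codecov_percent = diff_coverage_percent(MEASURED, 34 + 4)

print(f"CI job : {int(ci_percent)}% (exact {ci_percent:.2f}%)")
print(f"Codecov: {codecov_percent:.5f}%")
print("meets 80% threshold:", ci_percent >= 80)
```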

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7817   +/-   ##
==========================================
  Coverage           ?   63.40%           
==========================================
  Files              ?      462           
  Lines              ?    64413           
  Branches           ?     9879           
==========================================
  Hits               ?    40840           
  Misses             ?    20794           
  Partials           ?     2779

Flag	Coverage Δ
GPU	`72.51% <28.30%> (?)`
XPU	`7.12% <0.00%> (?)`


CLAassistant · 2026-05-18T11:00:53Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ lonelygsh
✅ lizexu123
❌ root

root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

…into kkc

PaddlePaddle-bot · 2026-05-18T16:13:50Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 00:11:42

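This CI report was generated from the following code (refreshed every 30 minutes):

PR commit: d2545c1
Merge base: e3541c2 (branch: develop)
View full diff
CI details

1 Job overview

2 required jobs failed and must be fixed first.

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
40(0)	40	36	3	0	1	0

2 Job status summary

2.1 Required jobs: 7/9 passed

Required jobs block merging; fix failures first.

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	1h20m	PR issue: coverage of new code is 35%, below the 80% threshold	Add unit tests for nvfp4.py and the other new files	Job	-
❌	`Pre Commit`	40s	PR issue: helper.h lacks a trailing newline	Add a newline at the end of helper.h and resubmit	Job	-
✅	The other 7 required jobs passed	-	-	-	-	-

2.2 Optional jobs: 29/31 passed

Optional jobs do not block merging; failures are informational only.

Status	Job	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	11m17s	Job	-
⏸️	`CI_HPU`	-	-	-
✅	The other 29 optional jobs passed	-	-	-

3 Failure details (required only)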

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: code coverage below threshold (confidence: high)

run_tests_with_coverage analysis

Status: ❌ Failed
Error type: code coverage below threshold
Confidence: high
Root cause summary: overall coverage of the new code is only 35%, below the 80% threshold; nvfp4.py and other files have 0% coverage
Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
This PR adds code for FP4 communication quantization, dense block_wise_fp8, and MoE nvfp4, touching 636 changed lines, but the diff coverage result is only 35% (34 of 53 measured lines uncovered), far below the 80% threshold. The problems are concentrated in fastdeploy/model_executor/layers/quantization/nvfp4.py (0%, 16 lines uncovered), along with __init__.py (33%) and mix_quant.py (44%). The unit test step itself passed (TEST_EXIT_CODE=0); only the coverage check failed (COVERAGE_EXIT_CODE=9).

Key log:

COVERAGE_EXIT_CODE: 9
"fastdeploy/model_executor/layers/quantization/nvfp4.py": {"percent_covered": 0.0, "violation_lines": [109, 679, 681, 684, ...]},
"fastdeploy/model_executor/layers/quantization/__init__.py": {"percent_covered": 33.33},
"fastdeploy/model_executor/layers/quantization/mix_quant.py": {"percent_covered": 44.44},
"total_percent_covered": 35
##[error]Process completed with exit code 9.

Suggested fixes:

Add unit tests for fastdeploy/model_executor/layers/quantization/nvfp4.py (focus on L109, L679-L694, L774-L799)
Add test coverage for fastdeploy/model_executor/layers/quantization/__init__.py (L101-L118) and mix_quant.py (L99-L103)
If the new code cannot be tested for now, request a coverage waiver

Fix summary: add unit tests for nvfp4.py, mix_quant.py, and the other new files

Related changes: nvfp4.py, __init__.py, mix_quant.py, gpu_model_runner.py
Link: view log

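Pre Commit: code style (confidence: high)

Pre Commit analysis

Status: ❌ Failed
Error type: code style
Confidence: high
Root cause summary: helper.h lacks a trailing newline, so the end-of-file-fixer hook failed
Analyzer: ci_analyze_infra

Root cause details:
The pre-commit end-of-file-fixer hook detected that custom_ops/gpu_ops/helper.h lacks a trailing newline (the file ends with #endif but no \n). The black, isort, flake8, ruff, clang-format, PyMarkdown, and other checks all passed; only this trailing-newline issue caused the failure.

Key log:

fix end of files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
Fixing custom_ops/gpu_ops/helper.h

-#endif
\ No newline at end of file
+#endif

Suggested fixes:

Add a newline after the final #endif line of custom_ops/gpu_ops/helper.h

Run pre-commit locally to format the file, then resubmit: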

pip install pre-commit==4.2.0 clang-format==13.0.0
pre-commit run --files custom_ops/gpu_ops/helper.h
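
For readers who want to see what the missing-newline fix amounts to, here is a minimal Python sketch that appends a newline when a file does not end with one. It covers only that case and is not the hook's implementation. `ensure_trailing_newline` is a hypothetical helper, and the demo runs on a temporary file rather than on helper.h.

```python
import os
import tempfile


def ensure_trailing_newline(path: str) -> bool:
    """Append a newline if the file is non-empty and lacks one.

    Returns True when the file was modified.
    """
    with open(path, "rb") as f:
        data = f.read()
    if not data or data.endswith(b"\n"):
        return False
    with open(path, "ab") as f:
        f.write(b"\n")
    return True


# Demo on a throwaway file that ends with "#endif" and no newline.
fd, demo_path = tempfile.mkstemp(suffix=".h")
with os.fdopen(fd, "wb") as f:
    f.write(b"#pragma once\n#endif")

first_run = ensure_trailing_newline(demo_path)   # file gets fixed
second_run = ensure_trailing_newline(demo_path)  # nothing left to fix
with open(demo_path, "rb") as f:
    final_bytes = f.read()
os.remove(demo_path)

print(first_run, second_run, final_bytes)
```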

Fix summary: add a trailing newline to helper.h and resubmit

Related changes: custom_ops/gpu_ops/helper.h
Link: view log

PaddlePaddle-bot · 2026-05-19T00:36:29Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 08:29:24

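This CI report was generated from the following code (refreshed every 30 minutes):

PR commit: d2545c1
Merge base: e3541c2 (branch: develop)
View full diff
CI details

1 Job overview

3 required jobs failed; they block merging and must be fixed first.

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
38(0)	38	33	5	0	0	0

2 Job status summary

2.1 Required jobs: 7/10 passed

Required jobs block merging; fix failures first.

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`run_tests_with_coverage`	1h20m	PR issue: diff coverage is 35%, below the 80% threshold	Add unit tests for nvfp4.py and the other new files	Job	-
❌	`Approval`	8s	PR issue: modifying envs.py requires FastDeploy RD approval	Ask jiangjiajun/liuyuanle/chenjian26/wanglongzhi to approve	Job	-
❌	`Pre Commit`	40s	PR issue: helper.h lacks a trailing newline	Add a newline at the end of helper.h and resubmit	Job	-
✅	The other 7 required jobs passed	-	-	-	-	-

2.2 Optional jobs: 26/28 passed

Optional jobs do not block merging; failures are informational only.

Status	Job	Duration	Log	Rerun
❌	`Check PR Template`	10s	Job	-
❌	`CI_HPU`	1h4m	Job	-
✅	The other 26 optional jobs passed	-	-	-

3 Failure details (required only)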

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below threshold (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Status: ❌ Failed
Error type: code coverage below threshold
Confidence: high
Root cause summary: diff coverage is 35%, below the 80% threshold (636 changed lines, 53 lines measured for coverage)
Analyzer: generic analysis (fallback)

Root cause details:
This PR adds FP4 quantization code (nvfp4.py, mix_quant.py, and others), but overall diff coverage is only 35% (34 of 53 measured lines uncovered), which fails the 80% coverage threshold check (exit code 9). Of the uncovered lines, 16 are in nvfp4.py, 3 in gpu_model_runner.py, and 5 in mix_quant.py.

Uncovered files:

File	Coverage	Uncovered lines
`fastdeploy/model_executor/layers/quantization/nvfp4.py`	0%	L109, L679-694, L774-799, etc. (16 lines)
`fastdeploy/worker/gpu_model_runner.py`	0%	L3056, L3057, L3061
`fastdeploy/model_executor/layers/moe/fused_moe_deepgemm_backend.py`	0%	L83
`fastdeploy/model_executor/layers/quantization/mix_quant.py`	44%	L99-L103
`fastdeploy/model_executor/layers/quantization/__init__.py`	33%	L101-L118, etc. (8 lines)

Key log:

COVERAGE_EXIT_CODE: 9
Coverage generation failed (exit code 9)
GPU Patch Coverage Details:
{"total_num_lines": 53, "total_num_violations": 34, "total_percent_covered": 35, "num_changed_lines": 636}
##[error]Process completed with exit code 9.

Suggested fixes:

Write unit tests for the new functions in fastdeploy/model_executor/layers/quantization/nvfp4.py (L679-694, L774-799)
Add test cases for the new FP4 communication quantization branch in fastdeploy/worker/gpu_model_runner.py, L3056-L3061
Add tests for the nvfp4 branch in fastdeploy/model_executor/layers/quantization/mix_quant.py, L99-L103
If the feature depends on special hardware and is hard to test, request a waiver in the CI configuration

Fix summary: add unit tests for nvfp4.py and the other new FP4 quantization code

Related changes: fastdeploy/model_executor/layers/quantization/nvfp4.py, fastdeploy/worker/gpu_model_runner.py, fastdeploy/model_executor/layers/quantization/mix_quant.py

Link: view log

Approval: approval not granted (confidence: high)

Approval

Status: ❌ Failed
Error type: missing approval
Confidence: high
Root cause summary: fastdeploy/envs.py was modified, which requires approval from a FastDeploy RD member (exit code 6)
Analyzer: generic analysis (fallback)

Root cause details:
This PR modifies fastdeploy/envs.py, a protected file that needs approval from at least one member of the FastDeploy RD team. The PR does not yet have a qualifying approver, so the check_approval.sh script detected 1 approval error and exited with exit code 6.

Key log:

0. You must have one FastDeploy RD (Jiang-Jia-Jun(jiangjiajun), yuanlehome(liuyuanle), 
   rainyfly(chenjian26), Wanglongzhi2001(wanglongzhi)) approval for modifying [fastdeploy/envs.py].
There are 1 approved errors.
##[error]Process completed with exit code 6.

Suggested fixes:

Ask any one of @Jiang-Jia-Jun, @yuanlehome, @rainyfly, or @Wanglongzhi2001 to approve the PR

Fix summary: ask an RD member (jiangjiajun/liuyuanle, etc.) to approve the PR

Related changes: fastdeploy/envs.py

Link: view log

Pre Commit: code style (confidence: high)

Pre Commit

Status: ❌ Failed
Error type: code style
Confidence: high
Root cause summary: custom_ops/gpu_ops/helper.h lacks a trailing newline, so the end-of-file-fixer check failed
Analyzer: generic analysis (fallback)

Root cause details:
The pre-commit "fix end of files" hook detected that custom_ops/gpu_ops/helper.h lacks a trailing newline (no newline after #endif). The hook fixed the file automatically, and the developer must apply the same fix locally and resubmit. The other hooks (black, isort, flake8, ruff, PyMarkdown, etc.) all passed.

Key log:

fix end of files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook
Fixing custom_ops/gpu_ops/helper.h
diff: -#endif\  No newline at end of file
      +#endif
##[error]Process completed with exit code 1.

Suggested fixes:

Add a newline after the final #endif line of custom_ops/gpu_ops/helper.h

Install pre-commit locally and run the check:

pip install pre-commit==4.2.0 clang-format==13.0.0
pre-commit install
pre-commit run --files custom_ops/gpu_ops/helper.h

Fix summary: add a newline after the final #endif in helper.h and resubmit

Related changes: custom_ops/gpu_ops/helper.h

Link: view log

…into kkc

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-19 18:21:26

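📋 Review summary

PR overview: supports FP4 communication quantization, a mixed quantization configuration of offline NVFP4 MoE + online block_wise_fp8 dense, and block-wise CUDA Graph; also fixes the audio_token_num=None error when loading eb5.
Change scope: custom_ops/gpu_ops/moe/, model_executor/layers/quantization/, model_executor/utils.py, worker/gpu_model_runner.py, docs/
Affected tags: [Quantization] [OP] [Graph Optimization] [Docs]

Issues

Severity	File	Summary
🟡 Suggestion	`nvfp4.py:431`	Weight dtype assertions removed, which silently masks weight-loading errors
🟡 Suggestion	`nvfp4.py:669`	`fc1_latent_proj`/`fc2_latent_proj` parameters are never used (dead code)
❓ Question	`prefill_permute_to_masked_gemm.cu:338`	BFLOAT16+FLOAT32 case lacks `break`, triggering a `-Wimplicit-fallthrough` warning
❓ Question	`utils.py:135`	Uses the Paddle private method `_is_initialized()`; stability is uncertain

📝 PR convention check

The title format is compliant ([Feature] is an official tag ✓). However, the ## Modifications, ## Usage or Command, and ## Accuracy Tests sections are all empty (HTML comment placeholders only), and no Checklist item is ticked.

Suggested title (can be copied as-is):

[Feature][BugFix] Support FP4 communication quantization, block_wise_fp8 dense + NVFP4 MoE mix_quant, and fix audio_token_num NoneType bug

Suggested PR description (can be copied as-is):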

## Motivation
1. Fix the bug where `audio_token_num` is `None` when loading the eb5 flagship model, causing a `NoneType > 0` error
2. Support a mixed configuration of offline NVFP4 MoE weights + online block_wise_fp8 dense quantization (`mix_quant` overrides the NVFP4 checkpoint)
3. Support FP4 communication quantization (`FD_DISPATCH_USE_FP4=1`): hidden states are quantized to FP4 before dispatch, reducing communication volume by about 2x
4. `prefill_permute_to_masked_gemm` gains `make_scale_interleaved` support, so scales are written directly in the flashinfer swizzled layout

## Modifications
- `fastdeploy/model_executor/forward_meta.py`: change the default of the `audio_token_num` field to `0`, fixing the `NoneType > 0` exception
- `fastdeploy/model_executor/layers/quantization/__init__.py`: add `mix_quant_overrides_nvfp4` logic so that the CLI `mix_quant` configuration can override the NVFP4 configuration in the model's config.json
- `fastdeploy/model_executor/layers/quantization/mix_quant.py`: add the `moe_quant_config` field and the `_build_moe_sub_config()` method to pass the MoE offline quantization sub-config correctly
- `fastdeploy/model_executor/layers/quantization/nvfp4.py`: add an FP4 communication quantization path to `apply_ep_prefill`; add a `make_scale_interleaved` parameter to `call_prefill_permute_to_masked_gemm`
- `custom_ops/gpu_ops/moe/prefill_permute_to_masked_gemm.cu`: add the `MAKE_SCALE_INTERLEAVED` template parameter and a mixed UINT8/BF16 dtype dispatch path
- `fastdeploy/model_executor/utils.py`: fix the weight-loading order of MoE sublayers under hybrid mix_quant
- `fastdeploy/worker/gpu_model_runner.py`: add block-wise CUDA graph cleanup logic
- `fastdeploy/envs.py`: add the `FD_DISPATCH_USE_FP4` environment variable
- `docs/`: update nvfp4.md (Chinese and English) with a flashinfer-cutedsl backend usage example

## Usage or Command

Enable block-wise CUDA Graph:
    export FD_USE_BLOCK_WISE_CUDA_GRAPH=1
    export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048,4096"

Enable FP4 communication quantization:
    export FD_DISPATCH_USE_FP4=1

    python -m fastdeploy.entrypoints.openai.multi_api_server \
           --ports "9811,9812,9813,9814" \
           --num-servers 4 \
           --model /path/to/eb5-fp4-model \
           --gpu-memory-utilization 0.9 \
           --max-num-batched-tokens 512 \
           --ep-prefill-use-worst-num-tokens \
           --quantization '{"quantization": "mix_quant", "dense_quant_type":"block_wise_fp8", "is_moe_quantized":true,"moe_quant_type":"modelopt_fp4"}'

## Accuracy Tests
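N/A (this PR is mainly a performance optimization and bug fix and does not change the model's computation semantics; the PR description includes nsys profile comparison screenshots showing that the gaps between kernels in the prefill stage have disappeared)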

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

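Overall assessment

The PR's overall approach is clear. The FP4 communication quantization path and the mix_quant hybrid configuration logic are fairly complete, and the bug fix (audio_token_num) is concise and effective. Main concerns: removing the weight dtype assertions in apply() weakens error detection, so keeping them conditionally is recommended; the unused fc1/fc2_latent_proj parameters need either a note explaining their intent or an implementation.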

PaddlePaddle-bot · 2026-05-19T10:29:13Z

@@ -425,10 +429,6 @@ def apply(
        x_fp4, x_scale_interleaved = fp4_quantize(x, layer.input_scale_inv)

        assert x_fp4.dtype == paddle.uint8


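🟡 Suggestion: three weight dtype assertions were removed, which may mask weight-loading errors

The original assertions on layer.weight.dtype, layer.weight_scale_interleaved.dtype, and layer.alpha.dtype were key guards on the inference path that fail fast when a weight has the wrong type. They are now removed unconditionally, so if NVFP4 weight loading goes wrong (dtype mismatch), inference runs silently and produces incorrect results.

Suggested fix: skip the assertions only on the hybrid mix_quant path, or turn them into a conditional check:

    if not getattr(self, 'skip_weight_dtype_check', False):
        assert layer.weight.dtype == paddle.uint8
        assert layer.weight_scale_interleaved.dtype == paddle.float8_e4m3fn
        assert layer.alpha.dtype == paddle.float32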
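
One variation on the conditional check is to raise an explicit exception instead of using `assert`, since assertions are stripped when Python runs with `-O`. The sketch below illustrates the idea with plain strings standing in for Paddle dtypes so that it runs without Paddle installed. `check_weight_dtypes` and the `SimpleNamespace` layers are hypothetical, not code from this PR.

```python
from types import SimpleNamespace

# Expected dtypes, taken from the three assertions quoted above.
# Strings are stand-ins for paddle.uint8 etc. in this sketch.
EXPECTED_DTYPES = {
    "weight": "uint8",
    "weight_scale_interleaved": "float8_e4m3fn",
    "alpha": "float32",
}


def check_weight_dtypes(layer, skip: bool = False) -> None:
    # `skip` plays the role of skip_weight_dtype_check in the suggestion.
    if skip:
        return
    problems = []
    for name, expected in EXPECTED_DTYPES.items():
        actual = getattr(layer, name).dtype
        if actual != expected:
            problems.append(f"{name}: expected {expected}, got {actual}")
    if problems:
        raise TypeError("unexpected weight dtypes: " + "; ".join(problems))


def make_layer(weight_dtype: str):
    # Hypothetical stand-in for a quantized layer.
    return SimpleNamespace(
        weight=SimpleNamespace(dtype=weight_dtype),
        weight_scale_interleaved=SimpleNamespace(dtype="float8_e4m3fn"),
        alpha=SimpleNamespace(dtype="float32"),
    )


check_weight_dtypes(make_layer("uint8"))               # passes silently
check_weight_dtypes(make_layer("bfloat16"), skip=True)  # check skipped
try:
    check_weight_dtypes(make_layer("bfloat16"))
except TypeError as exc:
    print(exc)
```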

PaddlePaddle-bot · 2026-05-19T10:29:13Z

        gate: nn.Layer,
        topk_ids_hookfunc: Callable = None,
        shared_experts: nn.Layer = None,
+        fc1_latent_proj: nn.Layer = None,


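🟡 Suggestion: the fc1_latent_proj and fc2_latent_proj parameters are never used in the function body and are dead code

Both apply_ep_prefill and apply_ep_decode gained these two parameters, but neither function body uses them. If they are reserved for a future interface, add a comment explaining the intent; if the implementation was left out, add the missing logic.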

PaddlePaddle-bot · 2026-05-19T10:29:13Z

+          switch (topk) {
+            DISPATCH_TOPK(
+                paddle::DataType::BFLOAT16, paddle::DataType::UINT8, 4)
+            DISPATCH_TOPK(


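❓ Question: the case block for BFLOAT16 x + FLOAT32 scale lacks a break, leaving an implicit fall-through

The case paddle::DataType::FLOAT32: block does not end with a break and is followed directly by the new case paddle::DataType::UINT8: branch, which triggers a -Wimplicit-fallthrough compiler warning. In practice every execution path ends with a return (from the DISPATCH_TOPK macro) or a PD_THROW, so no real fall-through occurs, but adding a break or a [[fallthrough]] annotation would make the intent explicit and silence the warning.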

PaddlePaddle-bot · 2026-05-19T10:29:13Z

@@ -133,6 +133,9 @@ def slice_fn(weight_or_parameter, output_dim, start, end, step=1):

 def process_weight_transpose(layer, weight_name):
    weight = getattr(layer, weight_name)


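❓ Question: the Paddle private method _is_initialized() is used

_is_initialized() is a private/internal method (underscore prefix) that may be renamed or removed when Paddle is upgraded, so its stability is not guaranteed. Consider switching to a public API (or, for example, catching AttributeError), or confirm that the method exists in every Paddle version that FastDeploy supports.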
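
One defensive way to call a private method that may disappear is to probe for it with `getattr` and fall back when it is absent. The sketch below shows the pattern with stand-in classes so it runs without Paddle. `is_initialized`, `WithProbe`, and `WithoutProbe` are hypothetical names, and treating a missing method as "initialized" is an assumption that the real call site would need to verify.

```python
def is_initialized(tensor) -> bool:
    """Call tensor._is_initialized() when it exists.

    Falling back to True when the method is missing is an assumption
    made for this sketch; the right default depends on the caller.
    """
    probe = getattr(tensor, "_is_initialized", None)
    if callable(probe):
        return bool(probe())
    return True


class WithProbe:
    # Hypothetical stand-in for a tensor that exposes the private method.
    def __init__(self, ready: bool):
        self._ready = ready

    def _is_initialized(self) -> bool:
        return self._ready


class WithoutProbe:
    # Hypothetical stand-in for a version where the method was removed.
    pass


print(is_initialized(WithProbe(True)))
print(is_initialized(WithProbe(False)))
print(is_initialized(WithoutProbe()))
```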

lonelygsh added 13 commits April 15, 2026 15:15

support eb5 fp4 cuda_graph

9f6c3c0

update

55d1a05

merge develop

3509714

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

deebd2a

…into kc_d

Support FP4 communication quantization

dd4118d

fix

3fdbc08

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

1226b27

…into kc_d

update

6c3cc4b

fix

e89dff7

support mix_quant and nvfp4

24d07c6

support prefill cuda_graph

19a7019

support fp4 communication quantization

842feba

support

141ac55

lizexu123 had a problem deploying to Metax_ci May 14, 2026 09:21 — with GitHub Actions Failure

lizexu123 changed the title ~~Kkc~~ [Feature] Support FP4 communication quantization and block_wise_cuda_graph May 14, 2026

fix

4c076ce

lizexu123 had a problem deploying to Metax_ci May 15, 2026 07:16 — with GitHub Actions Error

fix

b643683

lizexu123 temporarily deployed to Metax_ci May 15, 2026 07:17 — with GitHub Actions Inactive

lizexu123 changed the title ~~[Feature] Support FP4 communication quantization and block_wise_cuda_graph~~ [Feature] Support FP4 communication quantization and dense block_wise_fp8 and moe nvfp4 May 15, 2026

add test

2f4151c

lizexu123 had a problem deploying to Metax_ci May 15, 2026 08:30 — with GitHub Actions Failure

merge develop

22ac5a0

lizexu123 had a problem deploying to Metax_ci May 15, 2026 08:32 — with GitHub Actions Failure

update develop

8443d62

lizexu123 had a problem deploying to Metax_ci May 18, 2026 11:00 — with GitHub Actions Error

update

9205ac7

lizexu123 had a problem deploying to Metax_ci May 18, 2026 11:19 — with GitHub Actions Error

update

3d64926

lizexu123 had a problem deploying to Metax_ci May 18, 2026 11:20 — with GitHub Actions Error

add document

d2545c1

lizexu123 temporarily deployed to Metax_ci May 18, 2026 11:34 — with GitHub Actions Inactive

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

ca03742

…into kkc

lizexu123 had a problem deploying to Metax_ci May 19, 2026 03:41 — with GitHub Actions Error

lizexu123 force-pushed the kkc branch from a767d7c to ca03742 Compare May 19, 2026 03:44

lizexu123 had a problem deploying to Metax_ci May 19, 2026 03:44 — with GitHub Actions Error

fix

7a58d12

lizexu123 had a problem deploying to Metax_ci May 19, 2026 03:46 — with GitHub Actions Error

revert helper.h to develop

aad041d

lizexu123 had a problem deploying to Metax_ci May 19, 2026 03:50 — with GitHub Actions Failure

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

fb82351

…into kkc

lizexu123 had a problem deploying to Metax_ci May 19, 2026 05:47 — with GitHub Actions Failure

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

e282850

…into kkc

lizexu123 temporarily deployed to Metax_ci May 19, 2026 07:33 — with GitHub Actions Inactive

make_scale_interleaved

a96f2a5

lizexu123 had a problem deploying to Metax_ci May 19, 2026 09:09 — with GitHub Actions Error

fix

e20676d

lizexu123 temporarily deployed to Metax_ci May 19, 2026 09:12 — with GitHub Actions Inactive

PaddlePaddle-bot reviewed May 19, 2026
