[Cherry-Pick][Optimization] Reduce logprob processing overhead by using actual topk instead of fixed K+1 (#7860) by Sunny-bot1 · Pull Request #7861 · PaddlePaddle/FastDeploy

Sunny-bot1 · 2026-05-19T12:57:11Z

Motivation

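With logprob enabled (top_logprobs=0), performance drops noticeably: TPOT is about 5 ms higher than with logprob disabled. Analysis shows that one major source of the overhead is that logprob data transfer and processing always handle a fixed K+1=21 columns, while the user actually needs only actual_topk=1 column, which amounts to roughly 10x redundant computation.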

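In the save_output_topk C++ op:

The sender loop always writes K+1=21 columns, with the stride fixed at K+1.
mtext[1] stores only bsz, so the actual_topk information is lost.

In the get_output_topk C++ op:

The receiver loop always reads K+1=21 columns, with the stride fixed at K+1.

In token_processor.py:

The reshape always expands to K+1=21 columns, so output_scores.numpy() copies batch*21 floats (10752), and the per-request tolist() processes 21 elements per row.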
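The element counts above follow from simple arithmetic. A minimal sketch, assuming the batch dimension equals MAX_BSZ = 512 (the value quoted later in the review, and consistent with 10752 = 512 * 21); the helper name `score_elements` is hypothetical and not part of FastDeploy:

```python
K = 20          # fixed top-k width quoted in this PR
MAX_BSZ = 512   # assumed batch dimension (value quoted in the review comment)


def score_elements(batch: int, columns: int) -> int:
    """Number of float scores copied for one decoding step."""
    return batch * columns


fixed = score_elements(MAX_BSZ, K + 1)  # fixed-width layout: 21 columns per row
actual = score_elements(MAX_BSZ, 1)     # actual_topk = 1 when top_logprobs = 0
print(fixed, actual)
```

Under these assumptions the fixed layout copies 10752 floats per step, while the dynamic layout copies 512.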

Modifications

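save_output_msg_with_topk.cc

mtext[1] now uses bit-packed storage: bsz (low 16 bits) | actual_topk (high 16 bits).
The sender loop now runs max_num_logprobs times, and the stride is now max_num_logprobs.

get_output_msg_with_topk.cc

Unpacks bsz and actual_topk from mtext[1].
The receiver loop now runs actual_topk times, and the stride is now actual_topk.

token_processor.py

Unpacks batch and actual_topk from the packed mtext[1].
The output_scores.numpy() slice shrinks from batch*21 to batch*actual_topk.
The reshape column count changes from the fixed K+1 to the dynamic actual_topk.

The msgsnd/msgrcv message struct size is unchanged, so the change is backward compatible.
actual_topk is passed through the bit-packed mtext[1]; no extra field is added.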
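The packing scheme described above can be sketched in a few lines. This is an illustration of the bit layout only (bsz in the low 16 bits, actual_topk in the high 16 bits), not the code from the PR; the function names `pack_header` and `unpack_header` are hypothetical:

```python
from typing import Tuple


def pack_header(bsz: int, actual_topk: int) -> int:
    """Pack bsz into the low 16 bits and actual_topk into the high 16 bits."""
    return (actual_topk << 16) | bsz


def unpack_header(packed: int) -> Tuple[int, int]:
    """Recover (bsz, actual_topk) from a packed header value."""
    bsz = packed & 0xFFFF
    actual_topk = (packed >> 16) & 0xFFFF
    return bsz, actual_topk


header = pack_header(256, 1)         # e.g. 256 requests, one logprob column
bsz, actual_topk = unpack_header(header)
print(bsz, actual_topk)
```

Because both values share one existing integer slot, the message struct does not grow, which matches the backward-compatibility note above.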

top_logprobs=0, concurrency=256, GLM-4.5-Air, TP8:

Metric	Before	After	Change
Average TPOT	33.80ms	31.41ms	-2.39ms
Average decoding speed	30.72	32.79	+6%
QPS	0.571 req/s	0.590 req/s	+3%

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-19T12:57:48Z

Thanks for your contribution!

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-19 21:04:51

📋 Review Summary

PR overview: Uses bit-packing to encode the actual number of topk columns (actual_topk) into the high 16 bits of mtext[1] in the IPC message, so that the save/get_output_topk C++ ops and token_processor.py all process logprob data by actual_topk (rather than the fixed K+1=21), eliminating roughly 10× redundant copying and computation.

Scope of changes: custom_ops/gpu_ops/save_output_msg_with_topk.cc, custom_ops/gpu_ops/get_output_msg_with_topk.cc, fastdeploy/output/token_processor.py

Impact tags: [OP] [DataProcessor]

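Issues

Severity	File	Summary
❓ Question	`custom_ops/gpu_ops/save_output_msg_with_topk.cc:114`	Bit-packing has no bounds assertion; the value is silently corrupted when bsz / max_num_logprobs ≥ 65536
🟡 Suggestion	`tests/operators/` (missing)	Both the custom op data format and the stride changed, but no unit tests were added (checklist A3)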

❓ `save_output_msg_with_topk.cc:114`: bit-packing has no bounds protection

msg_sed.mtext[1] = bsz | (max_num_logprobs << 16);

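With the current #define MAX_BSZ 512 / #define K 20, the problem cannot be triggered in practice. However, the code contains no PADDLE_ENFORCE or assert guard. If bsz or max_num_logprobs ever exceeds 65535, the low and high 16 bits will overwrite each other and silently corrupt the value, which is extremely hard to debug.

Suggest adding defensive assertions before packing: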

PADDLE_ENFORCE_LT(bsz, 65536,
    phi::errors::InvalidArgument("bsz %d exceeds 16-bit field for mtext[1] packing", bsz));
PADDLE_ENFORCE_LT(max_num_logprobs, 65536,
    phi::errors::InvalidArgument("max_num_logprobs %d exceeds 16-bit field", max_num_logprobs));

Or at least state the precondition bsz < 65536 && max_num_logprobs < 65536 explicitly in a comment.
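The same precondition can be expressed on the Python side. The following is a hedged Python analogue of the suggested checks, not code from the PR; `pack_header_checked` and `FIELD_LIMIT` are hypothetical names:

```python
FIELD_LIMIT = 1 << 16  # each field must fit in 16 bits


def pack_header_checked(bsz: int, max_num_logprobs: int) -> int:
    """Pack two 16-bit fields into one integer, rejecting out-of-range values."""
    for name, value in (("bsz", bsz), ("max_num_logprobs", max_num_logprobs)):
        if not 0 <= value < FIELD_LIMIT:
            raise ValueError(f"{name} {value} does not fit in a 16-bit field")
    return bsz | (max_num_logprobs << 16)


packed = pack_header_checked(512, 21)  # MAX_BSZ and K+1 both fit comfortably
print(hex(packed))
```

Failing loudly at pack time keeps a future increase of MAX_BSZ or K from turning into a silent data corruption on the receiving side.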

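🟡 `tests/operators/`: missing unit tests for save/get_output_msg_with_topk

This change modifies both of the following:

The IPC message data format (the meaning of mtext[1] changes from "bsz only" to "bit-packed bsz|topk")
The stride logic of the sender and receiver (from the fixed K+1 to the dynamic actual_topk)

The new bit-pack / unpack logic is coupled across both ends (C++ sender + Python receiver), and a bit-manipulation error on either end is hard to locate directly at runtime. According to checklist §A3, op logic changes under custom_ops/ should be accompanied by unit tests under tests/operators/.

Suggested fix strategy: add test cases under tests/operators/ that cover:

Pack/unpack round-trip correctness for different top_logprobs values (0, 1, K=20)
Multi-batch scenarios with bsz=1 and bsz>1, verifying stride and data alignment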
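The coverage listed above can be prototyped without the real ops. The sketch below simulates a sender and receiver with plain Python lists; it assumes actual_topk = top_logprobs + 1 (consistent with top_logprobs=0 giving one column and K=20 giving 21), and every function name in it is hypothetical:

```python
def pack(bsz, topk):
    return (topk << 16) | bsz


def unpack(packed):
    return packed & 0xFFFF, (packed >> 16) & 0xFFFF


def send(rows):
    """Flatten rows row-major, using the column count as the stride."""
    header = pack(len(rows), len(rows[0]))
    flat = [value for row in rows for value in row]
    return header, flat


def receive(header, flat):
    """Rebuild the rows from the flat buffer using the unpacked stride."""
    bsz, topk = unpack(header)
    return [flat[i * topk:(i + 1) * topk] for i in range(bsz)]


def check_round_trip(bsz, top_logprobs):
    cols = top_logprobs + 1  # assumed relation between top_logprobs and actual_topk
    rows = [[i * 100 + j for j in range(cols)] for i in range(bsz)]
    return receive(*send(rows)) == rows


results = {(b, t): check_round_trip(b, t) for b in (1, 4) for t in (0, 1, 20)}
print(results)
```

A real test would replace `send` and `receive` with the actual custom ops, but the same matrix of bsz and top_logprobs values applies.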

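📝 PR Convention Check

Title format ✅: [Cherry-Pick][Optimization] ... (#7860) conforms to the Cherry-Pick format convention.

Description structure issue: the ## Usage or Command and ## Accuracy Tests sections contain only HTML comments and are not filled in (write N/A, or move the performance benchmark table from Modifications to Accuracy Tests); nothing under ## Checklist is checked, and at least [x] Add at least a tag and [x] Cherry-pick to release should be checked.

Suggested title (can be copied directly):

[Cherry-Pick][Optimization] Reduce logprob processing overhead by using actual topk instead of fixed K+1 (#7860) (the current title already complies; no change needed)

Suggested PR description (can be copied directly):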

## Motivation

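With logprob enabled (top_logprobs=0), performance drops noticeably: TPOT is about 5 ms higher than with logprob disabled. Analysis shows that one major source of the overhead is that logprob data transfer and processing always handle a fixed K+1=21 columns, while the user actually needs only actual_topk=1 column, which amounts to roughly 10x redundant computation.

In the save_output_topk C++ op:
- The sender loop always writes K+1=21 columns, with the stride fixed at K+1
- mtext[1] stores only bsz, so the actual_topk information is lost

In the get_output_topk C++ op:
- The receiver loop always reads K+1=21 columns, with the stride fixed at K+1

In token_processor.py:
- The reshape always expands to K+1=21 columns, so output_scores.numpy() copies batch*21 floats, and the per-request tolist() processes 21 elements per row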

## Modifications

**save_output_msg_with_topk.cc**
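- mtext[1] now uses bit-packed storage: bsz (low 16 bits) | actual_topk (high 16 bits)
- The sender loop now runs max_num_logprobs times, and the stride is now max_num_logprobs

**get_output_msg_with_topk.cc**
- Unpacks bsz and actual_topk from mtext[1]
- The receiver loop now runs actual_topk times, and the stride is now actual_topk

**token_processor.py**
- Unpacks batch and actual_topk from the packed mtext[1]
- The output_scores.numpy() slice shrinks from batch*21 to batch*actual_topk
- The reshape column count changes from the fixed K+1 to the dynamic actual_topk

> The msgsnd/msgrcv message struct size is unchanged, so the change is backward compatible. actual_topk is passed through the bit-packed mtext[1]; no extra field is added.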

## Usage or Command

N/A

## Accuracy Tests

top_logprobs=0, concurrency=256, GLM-4.5-Air, TP8:

| Metric | Before | After | Change |
|------|--------|--------|------|
| Average TPOT | 33.80ms | 31.41ms | -2.39ms |
| Average decoding speed | 30.72 | 32.79 | +6% |
| QPS | 0.571 req/s | 0.590 req/s | +3% |

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

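Overall Assessment

The optimization approach is clear: it passes actual_topk via bit-packing and reduces the amount of data transferred, and along the way it fixes a potential logic error in the original get_output_topk caused by the inconsistency between k (lowercase) and K (uppercase). Overall the change is of good quality. Main suggestions: add bit-pack bounds assertions and the corresponding unit tests to improve robustness and maintainability.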

PaddlePaddle-bot · 2026-05-19T13:48:24Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 23:49:06

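This CI report is generated from the following code (updated every 30 minutes):

PR commit: 97dc738
Merge base: 41d44d6 (branch: release/2.6)
View full diff
CI details

1 Task Overview

2 Required tasks have failed and are blocking the merge; they need to be addressed first.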

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
34(0)	34	29	5	0	0	0

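2 Task Status Summary

2.1 Required tasks: 8/10 passed

Required tasks block the merge; failures need to be addressed first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	1h12m	PR issue: the test was not adapted to the packed format; actual_topk decodes to 0, causing an IndexError	Change output_tokens[1,0] in the test to the packed format (K+1)<<16\|1	Job	-
❌	`xpu_4cards_case_test / run_xpu_4cards_cases`	32m3s	PR issue: actual_topk unpacking anomaly; XPU 4-card logprob inference hangs and times out	Check the actual_topk unpacking logic to prevent a decoded 0 from causing a deadlock on XPU	Job	-
✅	The other 8 required tasks passed	-	-	-	-	-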

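2.2 Optional tasks: 21/24 passed

Optional tasks do not block the merge; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	23m18s	Job	-
❌	`Check PR Template`	10s	Job	-
❌	`Trigger Jenkins for PR`	1m11s	Job	-
✅	The other 21 optional tasks passed	-	-	-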

3 Failure Details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)

Status: ❌ Failed
Error type: Test failure
Confidence: High
Root cause summary: The test was not adapted to the new packed format of output_tokens[1,0]; actual_topk decodes to 0, causing an IndexError
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`tests/output/test_token_processor.py::test_process_batch_output_logprob_records_topk_and_caching`	IndexError: index 0 is out of bounds for axis 1 with size 0	output_tokens[1,0]=1 does not pack actual_topk; the new code decodes topk=0, producing an empty tensor

Root cause details:
The PR changes output_tokens[1, 0] from "batch_size only" to the packed=(actual_topk<<16)|batch format in order to support the dynamic topk optimization. However, the test test_process_batch_output_logprob_records_topk_and_caching still uses the old format processor.output_tokens[1, 0] = 1 (batch=1 only). When the new code unpacks it, actual_topk = (1 >> 16) & 0xFFFF = 0, tokens is reshaped into an empty tensor [1, 0], and an IndexError is raised at L926, token_id = int(tokens[i, 0]).

Key logs:

E               IndexError: index 0 is out of bounds for axis 1 with size 0
fastdeploy/output/token_processor.py:926: IndexError
>           processor._process_batch_output()
tests/output/test_token_processor.py:733:

Suggested fix:

Near L733 of tests/output/test_token_processor.py: change processor.output_tokens[1, 0] = 1 to processor.output_tokens[1, 0] = ((K + 1) << 16) | 1 to match the new packed format (batch=1, actual_topk=K+1)
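The failure mode and the suggested fix can be reproduced in isolation. This is a minimal sketch in which plain Python lists stand in for the tensors; it is not the test or the processor code from the repository, and the helper `unpack` is hypothetical:

```python
def unpack(packed):
    """Return (batch, actual_topk) from a packed header value."""
    return packed & 0xFFFF, (packed >> 16) & 0xFFFF


K = 20
flat_tokens = list(range(K + 1))  # stand-in for one row of token ids

# Legacy header: only the batch size, nothing in the high 16 bits.
batch, actual_topk = unpack(1)
rows = [flat_tokens[i * actual_topk:(i + 1) * actual_topk] for i in range(batch)]
try:
    token_id = rows[0][0]  # mirrors tokens[i, 0] on an empty row
    error = None
except IndexError as exc:
    error = exc

# Packed header as suggested for the test: batch=1, actual_topk=K+1.
fixed_batch, fixed_topk = unpack(((K + 1) << 16) | 1)
fixed_rows = [flat_tokens[i * fixed_topk:(i + 1) * fixed_topk] for i in range(fixed_batch)]
print(type(error).__name__, fixed_rows[0][0])
```

With the legacy header the row has zero columns and indexing fails; with the packed header the first token id is readable again.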

Fix summary: Change output_tokens[1,0] in the test to the packed format (K+1)<<16|1

Related changes: fastdeploy/output/token_processor.py L857-858 (introduced by the change to the packed unpacking logic)
Link: View log

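xpu_4cards_case_test / run_xpu_4cards_cases: test failure (confidence: medium)

Status: ❌ Failed
Error type: Test failure (request timeout)
Confidence: Medium
Root cause summary: The PR changes the logprob stride to actual_topk; logprob requests hang and time out on XPU with 4 cards
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_logprobs-topk_21b_tp4.py::test_logprobs_mode`	ReadTimeout: 300s	The service starts normally, but the logprob inference request gets no response

Root cause details:
The service health check passed (110s) and the other 14 test cases in the same batch all passed; only test_logprobs_mode (top_logprobs=3) timed out at requests.post(..., timeout=300). The PR changes the logprob processing stride in token_processor.py (from the fixed K+1=21 to the actual_topk obtained by bit-pack unpacking), as well as the receiver loop in get_output_msg_with_topk.cc. On XPU with 4 cards, the actual_topk unpacking may yield an abnormal value (for example, decoding 0 and deadlocking downstream logic), causing the inference side to hang without returning a response.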

Key logs:

FAILED tests/xpu_ci/4cards_cases/test_logprobs-topk_21b_tp4.py::test_logprobs_mode
E   requests.exceptions.ReadTimeout: HTTPConnectionPool(host='127.0.0.1', port=8188): Read timed out. (read timeout=300)
================== 1 failed, 14 passed in 1755.12s (0:29:15) ===================

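Suggested fix:

Check custom_ops/gpu_ops/get_output_msg_with_topk.cc L91-104: confirm that on XPU the unpacked value actual_topk = (msg_rcv.mtext[1] >> 16) & 0xFFFF is non-zero; if it is 0, add the defensive clamp actual_topk = max(actual_topk, 1)
Check fastdeploy/output/token_processor.py L857-858: when actual_topk is 0, reshape([batch, 0]) produces an empty tensor that deadlocks subsequent processing; a guard is needed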
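One possible shape for such a guard is sketched below. It is an assumption-laden illustration, not the project's code: the function name `unpack_header_checked` and its `max_topk` parameter are hypothetical, and it raises an explicit error instead of clamping with max(actual_topk, 1), so that a malformed header is reported rather than masked:

```python
from typing import Tuple


def unpack_header_checked(packed: int, max_topk: int) -> Tuple[int, int]:
    """Unpack (batch, actual_topk) and reject headers that cannot be valid."""
    batch = packed & 0xFFFF
    actual_topk = (packed >> 16) & 0xFFFF
    if not 1 <= actual_topk <= max_topk:
        raise ValueError(
            f"actual_topk={actual_topk} outside [1, {max_topk}]; header={packed:#x}"
        )
    return batch, actual_topk


K = 20
ok = unpack_header_checked((4 << 16) | 2, K + 1)  # batch=2, four logprob columns
print(ok)
```

Whether to fail fast or clamp is a design choice: clamping keeps the service responding, while raising makes a sender that still emits the old format visible immediately.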

Fix summary: Check the actual_topk unpacking logic to prevent a decoded 0 from causing a deadlock on XPU

Related changes: custom_ops/gpu_ops/get_output_msg_with_topk.cc L91-104; fastdeploy/output/token_processor.py L854-862
Link: View log

codecov-commenter · 2026-05-19T14:21:19Z

Codecov Report

❌ Patch coverage is 66.66667% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@41d44d6). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/output/token_processor.py	66.66%	4 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff               @@
##             release/2.6    #7861   +/-   ##
==============================================
  Coverage               ?   72.35%           
==============================================
  Files                  ?      381           
  Lines                  ?    54220           
  Branches               ?     8473           
==============================================
  Hits                   ?    39231           
  Misses                 ?    12225           
  Partials               ?     2764

Flag	Coverage Δ
GPU	`72.35% <66.66%> (?)`


opt logprob process

97dc738

Sunny-bot1 had a problem deploying to Metax_ci May 19, 2026 12:57 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 19, 2026


Sunny-bot1 wants to merge 1 commit into PaddlePaddle:release/2.6 from Sunny-bot1:opt_logprob_process_26
