[Speculative Decoding]【Hackathon 10th Spring No.54】hybrid_mtp_ngram end-to-end validation by NKNaN · Pull Request #7849 · PaddlePaddle/FastDeploy

NKNaN · 2026-05-19T03:49:38Z

Motivation

PaddlePaddle/community#1372

Modifications

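Operator interface (ngram_match_mixed.cu, cpp_extensions.cc): input_ids/input_ids_len are replaced by token_ids_all/prompt_lens; pre_ids is kept for now and is expected to be removed in the next PR.
Copy elimination in the Python call path (mtp_cuda.py): the two .cuda() calls in _extend_draft_token_with_ngram_match are replaced with tensors that already reside on the GPU.
Code cleanup (mtp.py): removed the .cpu() D→H copy and the writes to input_ids_cpu/input_ids_len in insert_tasks_v1.
ProposerInputBatch changes (input_batch.py): token_ids_all now references the target tensor instead of cloning it; the redundant fields input_ids_cpu/input_ids_len and their maintenance in swap/reset are removed.
New hybrid E2E test (test_ernie_21b_mtp_ngram.py): covers overlap + cudagraph + logprob.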
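
For readers unfamiliar with n-gram drafting, the idea behind searching a request's token history can be sketched on the CPU as below. This is a generic prompt-lookup illustration, not the logic of the kernel in ngram_match_mixed.cu; the function name `ngram_draft` and the parameters `ngram_size` and `max_draft` are hypothetical and chosen only for this example.

```python
from typing import List


def ngram_draft(tokens: List[int], ngram_size: int, max_draft: int) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in `tokens`.

    Hypothetical helper: `tokens` stands in for one request's row of token
    history (prompt followed by generated tokens).
    """
    if ngram_size <= 0 or len(tokens) <= ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Walk backwards so the most recent earlier occurrence wins; the start
    # index range excludes the suffix's own position.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            follow = start + ngram_size
            # The tokens that followed the earlier occurrence become the draft.
            return tokens[follow:follow + max_draft]
    return []


history = [1, 2, 3, 4, 5, 1, 2, 3]
draft = ngram_draft(history, ngram_size=3, max_draft=2)
```

In a hybrid setup, tokens proposed this way would extend the draft produced by the MTP model; how the two are combined is decided by the operator and is not shown here.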

Usage or Command

N/A

Accuracy Tests

tests/e2e/test_ernie_21b_mtp_ngram.py

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-19T03:49:48Z

Thanks for your contribution!

codecov-commenter · 2026-05-19T05:23:08Z

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@bda1756). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/worker/input_batch.py	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7849   +/-   ##
==========================================
  Coverage           ?   63.29%           
==========================================
  Files              ?      462           
  Lines              ?    64359           
  Branches           ?     9870           
==========================================
  Hits               ?    40737           
  Misses             ?    20857           
  Partials           ?     2765

Flag	Coverage Δ
GPU	`72.40% <50.00%> (?)`
XPU	`7.12% <0.00%> (?)`

Copilot

Pull request overview

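This PR performs end-to-end validation and code cleanup for the hybrid_mtp_ngram (Hybrid MTP + Ngram) path: it unifies the parameter semantics of the operator interface (migrating from input_ids/input_ids_len to token_ids_all/prompt_lens), eliminates unnecessary D2H/H2D copies in the MTP hybrid path, and adds E2E coverage for the overlap + cudagraph + logprob scenarios.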

Changes:

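Updates the hybrid_mtp_ngram CUDA operator interface and internal implementation: the prompt search source becomes token_ids_all + prompt_lens.
Removes the CPU buffers and .cpu()/.cuda() copies related to input_ids_cpu/input_ids_len from the MTP hybrid path, and adjusts the ProposerInputBatch initialization/reset logic accordingly.
Updates the related unit tests and adds a hybrid MTP-Ngram E2E test case for ERNIE 21B.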

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
tests/spec_decode/test_ngram_gpu_kernel.py	Updates the CPU reference implementation and data construction to match the `token_ids_all/prompt_lens` interface
tests/operators/test_hybrid_mtp_ngram.py	Updates the operator unit test input fields and comments to match the new interface
tests/e2e/test_ernie_21b_mtp_ngram.py	Adds E2E coverage for hybrid MTP-Ngram (stream/non-stream, speculate_metrics, logprobs)
fastdeploy/worker/input_batch.py	`ProposerInputBatch` no longer maintains `input_ids_cpu/input_ids_len`; `token_ids_all` now references the target batch directly
fastdeploy/spec_decode/mtp.py	Removes the writes to `input_ids_len` and `input_ids_cpu` and the D2H copy in the insert/prepare stages
fastdeploy/spec_decode/mtp_cuda.py	The hybrid ngram extension call now uses the on-GPU `token_ids_all/prompt_lens` directly
custom_ops/gpu_ops/speculate_decoding/draft_model/ngram_match_mixed.cu	The CUDA and CPU paths both switch to `token_ids_all/prompt_lens`; kernel parameter meanings are updated
custom_ops/gpu_ops/cpp_extensions.cc	Updates the `HybridMtpNgram` C++ declaration signature to match

Comments suppressed due to low confidence (1)

tests/e2e/test_ernie_21b_mtp_ngram.py:259

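This performs a strict whole-dict equality comparison (==) on speculate_metrics. If the floating-point values returned by the server differ by rounding, or if field order or additional fields change slightly, the test becomes unstable. If the goal is to guard key behavior against regressions, consider instead: comparing integer statistics exactly; comparing floating-point fields such as accept_ratio/average_accept_length with a tolerance; or managing updatable baseline data through BaselineManager.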

    # Baseline comparison — exact match against the values captured in the reference environment.
    if BASELINE_SPECULATE_METRICS is not None:
        assert speculate_metrics == BASELINE_SPECULATE_METRICS, (
            f"speculate_metrics mismatch\n"
            f"got:      {json.dumps(speculate_metrics, indent=2)}\n"
            f"baseline: {json.dumps(BASELINE_SPECULATE_METRICS, indent=2)}"
        )
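
The tolerance-based comparison suggested above could look roughly like the sketch below. It assumes the metrics are a flat dict of numbers; the helper name `metrics_match`, the integer field `accepted_tokens`, and the sample values are hypothetical, while accept_ratio is a field named in this thread.

```python
import math
from typing import Any, Dict


def metrics_match(got: Dict[str, Any], baseline: Dict[str, Any],
                  rel_tol: float = 1e-6) -> bool:
    """Compare only the baseline's keys: floats with tolerance, the rest exactly.

    Hypothetical helper; extra keys in `got` are ignored so that newly added
    server-side fields do not break the test.
    """
    for key, expected in baseline.items():
        if key not in got:
            return False
        actual = got[key]
        if isinstance(expected, float):
            # Float statistics may differ by rounding between environments.
            if not isinstance(actual, (int, float)):
                return False
            if not math.isclose(actual, expected, rel_tol=rel_tol):
                return False
        elif actual != expected:
            # Integer counters are expected to match exactly.
            return False
    return True


baseline = {"accepted_tokens": 120, "accept_ratio": 0.75}
got = {"accepted_tokens": 120, "accept_ratio": 0.75 + 1e-9, "extra_field": 1}
ok = metrics_match(got, baseline)
```

Ignoring unknown keys is a design choice of this sketch; a stricter test could also assert that the key sets are equal.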

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

PaddlePaddle-bot

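🤖 Paddle-CI-Agent | pr_review | 2026-05-19 17:51:47

📋 Review Summary

PR overview: refactors the hybrid MTP+Ngram operator interface from input_ids/input_ids_len to token_ids_all/prompt_lens, eliminates CPU↔GPU data copies, and adds an E2E test.
Scope of changes: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/spec_decode/, fastdeploy/worker/, tests/
Impact tags: [Speculative Decoding] [OP]

Issues

Severity	File	Summary
🔴 Bug	`fastdeploy/worker/input_batch.py:773`	`token_ids_all` now references the target tensor directly, but `swap_data` still writes in place, corrupting the target tensor

📝 PR Convention Check

The PR body is structurally complete: it contains the five required sections ## Motivation, ## Modifications, ## Usage or Command, ## Accuracy Tests, and ## Checklist. The content is specific, the boxes are ticked correctly, and it meets the §D2 template requirements.

The PR title contains the non-standard segment 【Hackathon 10th Spring No.54】, which does not follow the standard format of [Tag] followed by a title description. Suggested replacement:

Suggested title (can be copied directly):

[Speculative Decoding] hybrid_mtp_ngram end-to-end validation: operator interface refactor (token_ids_all/prompt_lens) + elimination of CPU↔GPU copies

Overall assessment

The interface refactor in this PR is clearly reasoned, it removes the D→H→D round-trip copies through input_ids_cpu/.cuda(), and the logic alignment is verified by several levels of test coverage. However, now that token_ids_all references target_model_input_batch["token_ids_all"] directly instead of using paddle.clone(), the call swap_data(self.token_ids_all, i1, i2) in swap_batch_slot writes into the target tensor in place. With multiple concurrent requests arriving out of order this corrupts the target data, so it must be fixed before merging.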

PaddlePaddle-bot · 2026-05-19T09:54:12Z

            self.batch_id_per_token_output = paddle.clone(self.target_model_input_batch["batch_id_per_token_output"])
            if "token_ids_all" in self.target_model_input_batch:
-                self.token_ids_all = paddle.clone(self.target_model_input_batch["token_ids_all"])
+                self.token_ids_all = self.target_model_input_batch["token_ids_all"]


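🔴 Bug token_ids_all now references the target tensor directly, but swap_data(self.token_ids_all, i1, i2) in swap_batch_slot (line 394) assigns into that tensor in place (tensor[idx1] = tensor[idx2].clone()), which directly corrupts target_model_input_batch["token_ids_all"].

Impact: with multiple concurrent requests, a call to swap_batch_slot reorders the rows of the target tensor in place, so the token_ids_all data used by subsequent prefill/decode steps is wrong. As a result ngram match hits the wrong tokens, and the speculative draft quality degrades or the output may even become garbled. The same problem exists at the corresponding location in reset_model_inputs (line 973).

Suggested fixes:

Option A (recommended): keep paddle.clone() for token_ids_all so that it stays independent of the target data and swaps cannot corrupt it:

self.token_ids_all = paddle.clone(self.target_model_input_batch["token_ids_all"])

Option B: remove the swap_data(self.token_ids_all, i1, i2) call and have swap_batch_slot read and write the corresponding rows of target_model_input_batch directly. This also requires evaluating how the ordering of token_ids_all is maintained in target_model_input_batch.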
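
The aliasing hazard described in this comment can be reproduced without a GPU. The sketch below uses nested Python lists as a stand-in for the tensors, and `swap_rows` and `target_after_swap` are hypothetical names; it only illustrates why an in-place row swap through a reference changes the original, while the same swap on an independent copy does not.

```python
import copy
from typing import List


def swap_rows(rows: List[List[int]], i1: int, i2: int) -> None:
    """Swap two rows in place, in the spirit of tensor[idx1] = tensor[idx2].clone()."""
    tmp = list(rows[i1])
    rows[i1] = list(rows[i2])
    rows[i2] = tmp


def target_after_swap(use_copy: bool) -> List[List[int]]:
    """Return the target rows after the proposer swaps slots 0 and 1."""
    target = [[1, 2, 3], [4, 5, 6]]
    # use_copy=True mimics keeping a clone; use_copy=False mimics holding a
    # direct reference to the target's storage.
    proposer_view = copy.deepcopy(target) if use_copy else target
    swap_rows(proposer_view, 0, 1)
    return target


aliased = target_after_swap(use_copy=False)   # target rows are reordered
cloned = target_after_swap(use_copy=True)     # target rows are untouched
```

Whether the reordering of the shared tensor is actually harmful depends on whether the target batch performs its own swap for the same slots, which is what Option B asks to evaluate.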

PaddlePaddle-bot · 2026-05-19T10:09:17Z

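🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 18:06:28

This CI report is generated from the following code (refreshed every 30 minutes):

PR commit: f718837
Merge base: bda1756 (branch: develop)
View full diff
CI details

1 Task Overview

2 Required tasks failed and block the merge; they must be handled first.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Waiting	Skipped
42(0)	42	39	2	1	0	0

2 Task Status Summary

2.1 Required tasks: 8/10 passed

Required tasks block the merge; failures must be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	1h23m	PR issue: the new test calls json.dumps(ApproxScalar), which raises TypeError	Change json.dumps(BASELINE) at L284 to str(BASELINE)	Job	-
❌	`Approval`	9s	Process issue: the PR is awaiting manual approval, exit code 6	Wait for reviewer approval	Job	-
✅	The other 8 required tasks passed	-	-	-	-	-

2.2 Optional tasks: 31/32 passed

Optional tasks do not block the merge; failures are for reference only.

Status	Task	Duration	Log	Rerun
⏳	`CI_HPU`	-	Job	-
✅	The other 31 optional tasks passed	-	-	-

3 Failure Details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Status: ❌ Failed
Error type: test failure
Confidence: high
Root-cause summary: the assertion failure message in the new test calls json.dumps on a pytest.approx object, which raises TypeError
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`e2e/test_ernie_21b_mtp_ngram.py::test_mtp_ngram_speculate_metrics`	TypeError: ApproxScalar is not JSON serializable	Formatting the assertion failure message crashes

Root-cause details:
This PR adds the function _build_speculate_metrics_baseline() in tests/e2e/test_ernie_21b_mtp_ngram.py, which wraps floating-point fields (accept_ratio, average_accept_length, etc.) in pytest.approx() to allow approximate matching. When the assertion speculate_metrics == BASELINE_SPECULATE_METRICS fails, the code formats the error message with json.dumps(BASELINE_SPECULATE_METRICS, indent=2) (L284). The ApproxScalar objects returned by pytest.approx() cannot be serialized by the standard json.JSONEncoder, so a TypeError masks the original assertion error and the test terminates in an unexpected way.

Key logs: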

Unit tests failed (exit code 8)
Failed test cases:
tests/e2e/test_ernie_21b_mtp_ngram.py
E       TypeError: Object of type ApproxScalar is not JSON serializable
tests/e2e/test_ernie_21b_mtp_ngram.py:284: TypeError

Suggested fixes:

tests/e2e/test_ernie_21b_mtp_ngram.py L284: change f"baseline: {json.dumps(BASELINE_SPECULATE_METRICS, indent=2)}" to f"baseline: {BASELINE_SPECULATE_METRICS}" (the same applies to the identical usage in test_mtp_ngram_speculate_metrics_with_logprobs, around L331)
If JSON formatting must be kept, serialize the raw values extracted from BASELINE_SPECULATE_METRICS instead of passing in ApproxScalar objects

Fix summary: at L284, change json.dumps(BASELINE) to str(BASELINE)
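
The failure mode and one way around it can be illustrated with the standard library alone. In the sketch below, `ApproxLike` is a hypothetical stand-in for a tolerance wrapper such as the object returned by pytest.approx(), and `format_baseline` is a hypothetical helper; passing `default=repr` to json.dumps is one option for keeping the JSON layout, alongside the str()-based fix suggested above.

```python
import json


class ApproxLike:
    """Hypothetical stand-in for a tolerance wrapper; not the pytest class."""

    def __init__(self, expected: float, rel: float = 1e-6):
        self.expected = expected
        self.rel = rel

    def __eq__(self, other):
        return abs(other - self.expected) <= self.rel * abs(self.expected)

    def __repr__(self):
        return f"approx({self.expected})"


baseline = {"accept_ratio": ApproxLike(0.75)}


def format_baseline(data: dict) -> str:
    # `default` is called for every object the encoder cannot serialize;
    # returning its repr keeps the message readable instead of raising.
    return json.dumps(data, indent=2, default=repr)


try:
    json.dumps(baseline, indent=2)
    plain_dump_failed = False
except TypeError:
    # The encoder rejects the wrapper object, hiding the real assertion message.
    plain_dump_failed = True

message = f"baseline: {format_baseline(baseline)}"
```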

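Related changes: tests/e2e/test_ernie_21b_mtp_ngram.py L55-98 (_build_speculate_metrics_baseline and the BASELINE definition), L270-284 (assertion logic)
Link: view log

Approval: process approval (confidence: high)

Approval

Status: ❌ Failed
Error type: process approval
Confidence: high
Root-cause summary: the PR has not yet received manual approval; the Approval check returned exit code 6
Analyzer: ci_analyze_infra

Key logs:

Process completed with exit code 6.

Suggested fixes:

Ask the relevant reviewers to review and approve this PR; the Approval workflow will then pass automatically

Fix summary: wait for reviewer approval

Link: view log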

freeliuzc · 2026-05-19T13:00:47Z

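The overall implementation looks fine, but a credible performance and acceptance-rate report is missing to substantiate that the feature works correctly.
Following FastDeploy/benchmarks/README.md, use the filtered_sharedgpt_2000_input_1136_output_200_fd dataset to produce a performance report for non-spec / ngram / mtp (1 step and 3 steps) / mtp (3 steps) + hybrid, together with the acceptance-rate statistics from speculate.log.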

NKNaN had a problem deploying to Metax_ci May 19, 2026 03:49 — with GitHub Actions Error

paddle-bot Bot added the contributor External developers label May 19, 2026

NKNaN had a problem deploying to Metax_ci May 19, 2026 04:14 — with GitHub Actions Failure

freeliuzc requested a review from Copilot May 19, 2026 07:17

Copilot started reviewing on behalf of freeliuzc May 19, 2026 07:18 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread tests/e2e/test_ernie_21b_mtp_ngram.py Outdated

NKNaN had a problem deploying to Metax_ci May 19, 2026 07:48 — with GitHub Actions Error

NKNaN and others added 3 commits May 19, 2026 15:51

update hybrid mtp ngram kernel signature

2c9cd6d

update unittests

e35cfc5

Potential fix for pull request finding

87b0941

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

NKNaN force-pushed the spec-mtp-ngram branch from 0488653 to 87b0941 Compare May 19, 2026 07:52

NKNaN had a problem deploying to Metax_ci May 19, 2026 07:52 — with GitHub Actions Error

codestyle fix

f718837

NKNaN temporarily deployed to Metax_ci May 19, 2026 07:56 — with GitHub Actions Inactive

PaddlePaddle-bot suggested changes May 19, 2026

View reviewed changes

NKNaN wants to merge 4 commits into PaddlePaddle:develop from NKNaN:spec-mtp-ngram

5 participants
