
Use put_along_axis for Paddle routing metadata#16

Merged
SigureMo merged 1 commit into hybrid-ep-paddle from sigure/hybridep-put-along-axis
May 19, 2026

@SigureMo SigureMo commented May 19, 2026

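Background

The Paddle version of HybridEP's indices_to_map() previously built the dense routing map/probs with scatter_nd_add. This worked around a compatibility problem with torch.scatter under Paddle compat, but the implementation is more convoluted than the upstream scatter semantics and introduces extra index expansion and temporary tensors.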

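Changes

  • Replace the scatter_nd_add path in indices_to_map() with paddle.put_along_axis.
  • Reuse topk_idx.to(torch.int64) to avoid repeated conversions.
  • Keep the uint8 -> bool construction of the routing map: Paddle currently has no CUDA bool put_along_axis kernel, so the upstream dtype=torch.bool scatter form cannot be restored directly.
  • Consolidate tensor creation on device="cuda", which is closer to the upstream code.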
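
The construction described above can be sketched as follows. This is a minimal illustration, not the PR's code: it uses NumPy's `np.put_along_axis` as a CPU stand-in for `paddle.put_along_axis`, and the function name `indices_to_map_sketch`, its parameter names, and the toy inputs are assumptions made for the example. It shows the single int64 conversion being reused and the routing map being written as uint8 before the cast to bool.

```python
import numpy as np


def indices_to_map_sketch(topk_idx, topk_weights, num_tokens, num_experts):
    """Build a dense routing map and dense probs from top-k expert indices.

    Hypothetical helper for illustration; NumPy stands in for Paddle here.
    """
    # Convert the indices once and reuse the result for both writes.
    idx = topk_idx.astype(np.int64)

    # Write 1s into a uint8 buffer, then cast to bool afterwards
    # (mirrors the uint8 -> bool approach described in the PR).
    routing_u8 = np.zeros((num_tokens, num_experts), dtype=np.uint8)
    np.put_along_axis(routing_u8, idx, 1, axis=1)
    routing_map = routing_u8.astype(bool)

    probs = None
    if topk_weights is not None:
        probs = np.zeros((num_tokens, num_experts), dtype=np.float32)
        np.put_along_axis(probs, idx, topk_weights.astype(np.float32), axis=1)
    return routing_map, probs


# Toy input: 3 tokens, 4 experts, topk=2.
topk_idx = np.array([[0, 2], [1, 3], [3, 0]], dtype=np.int32)
topk_weights = np.array([[0.7, 0.3], [0.6, 0.4], [0.9, 0.1]], dtype=np.float32)
routing_map, probs = indices_to_map_sketch(topk_idx, topk_weights, 3, 4)
print(routing_map)
print(probs)
```

Each row of `topk_idx` names the experts chosen for one token, so a single along-axis write fills the whole dense row without expanding explicit (token, expert) coordinate pairs.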

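Validation

Bitwise alignment:

  • Re-ran A1B with topk=2 on two machines (2x8); results are bitwise aligned.
  • final_layernorm_output MD5: ordered_unique_neq = 0/100 on both rank 0 and rank 8.
  • tr_loss_before_reduce: paired_neq = 0/50 on both rank 0 and rank 8.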
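
A bitwise check of this kind can be sketched with the standard library. The PR does not show its comparison tooling, so the helper names `md5_of_floats` and `count_mismatches`, the float32 packing, and the toy per-step values below are all assumptions; the printed label borrows `paired_neq` from the report only as an interpretation of that metric (mismatches over steps paired by position).

```python
import hashlib
import struct


def md5_of_floats(values):
    """Return the MD5 hex digest of the values packed as little-endian float32."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return hashlib.md5(raw).hexdigest()


def count_mismatches(run_a, run_b):
    """Compare two runs step by step; return (mismatched_steps, total_steps)."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same number of steps")
    neq = sum(a != b for a, b in zip(run_a, run_b))
    return neq, len(run_a)


# Toy data: one digest per training step for each of two runs.
baseline = [md5_of_floats([0.5 * step, 1.25, -3.0]) for step in range(5)]
candidate = [md5_of_floats([0.5 * step, 1.25, -3.0]) for step in range(5)]

neq, total = count_mismatches(baseline, candidate)
print(f"paired_neq = {neq}/{total}")
```

Hashing the raw bytes rather than comparing with a tolerance means any single-bit difference in a step's output counts as a mismatch.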

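Performance validation:

  • A/B comparison of put_along_axis against the old scatter_nd_add implementation under the same configuration, measured over steps 51-100.
  • global_steps_per_second: 0.359827 vs 0.349351, about +2.91%
  • tokens_per_sec_per_card: 5895.404 vs 5723.763, about +2.91%
  • dispatch/combine times are close overall, with a small improvement in end-to-end throughput.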
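
The relative change can be recomputed from the reported figures. The sketch below is only arithmetic on the numbers above; the helper name `relative_change` and its `base` parameter are assumptions for illustration. It evaluates the change with either the old or the new value as the denominator, since the two conventions give slightly different percentages (the reported +2.91% matches division by the new value, while division by the old baseline gives about +3.00%).

```python
def relative_change(new, old, base="old"):
    """Percentage change from old to new, relative to the chosen base value."""
    denom = old if base == "old" else new
    return (new - old) / denom * 100.0


# (put_along_axis, scatter_nd_add) figures as reported for steps 51-100.
metrics = {
    "global_steps_per_second": (0.359827, 0.349351),
    "tokens_per_sec_per_card": (5895.404, 5723.763),
}

for name, (new, old) in metrics.items():
    vs_old = relative_change(new, old, base="old")
    vs_new = relative_change(new, old, base="new")
    print(f"{name}: {vs_old:+.2f}% vs old baseline, {vs_new:+.2f}% vs new value")
```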
This PR is authored by @codex (gpt-5.5 xhigh)

Replace the temporary scatter_nd_add construction in Paddle HybridEP indices_to_map with put_along_axis, matching the upstream scatter semantics while keeping uint8 routing storage because Paddle does not provide a CUDA bool put_along_axis kernel.

Validation:
- 2x8 A1B topk=2 DeepEP vs HybridEP 50-step bitwise check: final_layernorm_output MD5 matched 100/100 for ranks 0 and 8; tr_loss_before_reduce matched 50/50 for ranks 0 and 8.

Co-authored-by: Codex <noreply@openai.com>
Copilot AI review requested due to automatic review settings May 19, 2026 06:46
@SigureMo SigureMo marked this pull request as draft May 19, 2026 06:47
Copilot AI left a comment

Pull request overview

This PR updates the Paddle HybridEP indices_to_map() helper to build dense routing metadata using paddle.put_along_axis instead of the previous scatter_nd_add-based construction, aiming to reduce index expansion and intermediate tensor overhead under Paddle compat.

Changes:

  • Replaced the scatter_nd_add-based dense routing map/prob construction with paddle.put_along_axis.
  • Reused a single topk_idx int64 conversion to avoid repeated casts.
  • Kept routing map materialization via uint8 -> bool to work around missing CUDA bool put_along_axis support.

@SigureMo SigureMo marked this pull request as ready for review May 19, 2026 07:12
@SigureMo SigureMo merged commit 834a754 into hybrid-ep-paddle May 19, 2026
1 check passed