[paddle-adapt] Fix CUDAGraphMoE for Paddle compat: ExternalStream + CUDA graph capture#27

Open
BingooYang wants to merge 1 commit into PFCCLab:0.6 from BingooYang:adapt/nvfp4_renorm

@BingooYang

Description

Adapt test_renormalize_routing[BF16_logits-Swiglu-Shuffled_MajorK-Renorm-NvFP4xNvFP4-384-1024-8-RandomHiddenStates] so that it passes under Paddle compat mode. This PR fixes two compatibility issues in CUDAGraphMoE:

§53 — torch.cuda.ExternalStream unavailable in Paddle compat

torch.cuda.ExternalStream is not implemented in PaddlePaddle compat. Fix: create the stream with torch.cuda.Stream() (Paddle-managed), extract the raw CUDA pointer via .stream_base.cuda_stream, and wrap it with runtime.cudaStream_t(raw_ptr). Because Paddle owns these streams, cleanup() no longer calls cudaStreamDestroy on them.
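The ownership rule in this fix (destroy only streams you created yourself) can be illustrated with a small pure-Python model. This is a sketch, not the code in the PR: `StreamHandle`, `GraphResources`, `adopt_framework_stream`, `take_owned_stream`, and `destroy_fn` are hypothetical names, and `destroy_fn` merely stands in for cudaStreamDestroy so the example runs without a GPU.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StreamHandle:
    """Raw stream pointer plus a flag recording who owns it (hypothetical type)."""
    raw_ptr: int
    owned: bool  # True only if we created the stream and must destroy it


class GraphResources:
    """Holds one stream and destroys it on cleanup only when we own it."""

    def __init__(self, destroy_fn: Callable[[int], None]) -> None:
        # destroy_fn is a stand-in for cudaStreamDestroy (assumption for this sketch).
        self._destroy_fn = destroy_fn
        self.stream: Optional[StreamHandle] = None

    def adopt_framework_stream(self, raw_ptr: int) -> None:
        # Framework-managed stream: borrow the pointer, never destroy it.
        self.stream = StreamHandle(raw_ptr, owned=False)

    def take_owned_stream(self, raw_ptr: int) -> None:
        # Stream we created ourselves: we are responsible for destroying it.
        self.stream = StreamHandle(raw_ptr, owned=True)

    def cleanup(self) -> None:
        if self.stream is not None and self.stream.owned:
            self._destroy_fn(self.stream.raw_ptr)
        self.stream = None
```

Recording ownership next to the pointer keeps cleanup() free of framework-specific checks: the decision is made once, at the point where the stream is obtained.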

§54 — torch.empty() triggers cudaMemAlloc during CUDA stream capture (error 900)

In Paddle compat, torch.empty() calls cudaMemAlloc directly, which is forbidden during stream capture and fails with cudaErrorStreamCaptureUnsupported. Fix: move all tensor allocations outside the capture window:

  • Pre-run quantize_inputs() before cudaStreamBeginCapture, store in self._static_quantized
  • During capture, _run_moe_computation() uses pre-allocated static buffers directly
  • In launch(), re-quantize outside graph and update buffers via copy_()
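The three steps above can be sketched with a pure-Python model of the allocation discipline. Everything except the names taken from the description (`quantize_inputs`, `_run_moe_computation`, `_static_quantized`, `launch`) is an assumption made for illustration: `FakeDevice`, `CaptureError`, `GraphRunner`, and the toy arithmetic do not come from the PR, and no real CUDA graph is involved. The fake device simply raises when an allocation is attempted inside the capture window, mimicking error 900.

```python
from contextlib import contextmanager


class CaptureError(RuntimeError):
    """Stand-in for error 900 (cudaErrorStreamCaptureUnsupported)."""


class FakeDevice:
    """Toy device that forbids allocation while a capture is active."""

    def __init__(self):
        self.capturing = False

    def empty(self, n):
        if self.capturing:
            raise CaptureError("allocation during stream capture")
        return [0.0] * n

    @contextmanager
    def capture(self):
        self.capturing = True
        try:
            yield
        finally:
            self.capturing = False


def quantize_inputs(device, values):
    """Toy quantizer: allocates its output, so it must run outside capture."""
    out = device.empty(len(values))
    for i, v in enumerate(values):
        out[i] = float(round(v))
    return out


class GraphRunner:
    def __init__(self, device, sample):
        self.device = device
        # Step 1: allocate every buffer before the capture window opens.
        self._static_quantized = quantize_inputs(device, sample)
        self._static_output = device.empty(len(sample))
        # Step 2: inside the window, only in-place work on static buffers.
        with device.capture():
            self._run_moe_computation()

    def _run_moe_computation(self):
        # Toy stand-in for the MoE kernels: writes in place, allocates nothing.
        for i, q in enumerate(self._static_quantized):
            self._static_output[i] = 2.0 * q

    def launch(self, values):
        # Step 3: re-quantize outside the graph, then copy into the static
        # buffer (slice assignment plays the role of copy_()).
        fresh = quantize_inputs(self.device, values)
        self._static_quantized[:] = fresh
        self._run_moe_computation()  # models replaying the captured graph
        return list(self._static_output)
```

The key property is that the buffer objects seen during capture are the same objects updated in launch(), so a replayed graph would read the new inputs without any allocation taking place inside the capture window.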

Related Issues

Paddle compat adaptation series for PFCCLab/flashinfer.

Tests

  • Target test: 1 passed in 10.62s
  • Full regression via scripts/paddle_all_test_cases.sh — 0 failures (attention: 72 passed, moe: 9 passed, comm: 1 passed, norm+gemm: 515 passed)
  • New test added to scripts/paddle_all_test_cases.sh

Reviewer Notes

Changes confined to tests/moe/test_trtllm_gen_fused_moe.py (test layer only) and scripts/paddle_all_test_cases.sh. No core API or kernel code modified.

Generated with Claude Code

…§53: torch.cuda.ExternalStream unavailable in Paddle compat.
  Use torch.cuda.Stream() + stream_base.cuda_stream + runtime.cudaStream_t(raw_ptr).
  Skip cudaStreamDestroy for Paddle-managed streams in cleanup().
- §54: torch.empty() in Paddle compat calls cudaMemAlloc, forbidden during
  CUDA stream capture (error 900 cudaErrorStreamCaptureUnsupported).
  Fix: pre-allocate quantized input buffers before capture via quantize_inputs();
  _run_moe_computation() reuses static buffers (no alloc in capture window);
  launch() re-quantizes outside graph and updates buffers via copy_().
- Regression: paddle_all_test_cases.sh all PASS (0 failed, skips are hw limits).
- New test added to paddle_all_test_cases.sh.

Refs: MISMATCH_EXPERIMENT §53, §54