Add DSv4-Pro FP4 GB200 SGLang disagg + MTP config by Ankur-singh · Pull Request #1676 · SemiAnalysisAI/InferenceX

Ankur-singh · 2026-06-05T23:25:47Z

Summary

Adds the dsv4-fp4-gb200-dynamo-sglang-mtp config: the MTP-speculative-decoding variant of dsv4-fp4-gb200-dynamo-sglang on GB200 with SGLang at 8k/1k.
8 prefill/decode topologies: low-latency 1p1d-tp8-tp8 and 1p6d-dep8-tp8; mid-curve 1p1d through 6p1d-dep8-dep16. Each scenario sets spec-decoding: "mtp".
Container image: lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85 (matches the non-MTP companion PR Add DSv4-Pro FP4 GB200 SGLang disagg config #1675).

8 new recipe YAMLs land under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/. The same dynamo-sglang + dsv4 branch in runners/launch_gb200-nv.sh serves both PRs.
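For orientation, the eight recipe paths can be enumerated with a short sketch. The file names follow the pattern quoted in the review comments on this PR (disagg-gb200-low-latency-*-mtp.yaml and disagg-gb200-mid-curve-{1..6}p1d-dep8-dep16-mtp.yaml); the helper name `gb200_mtp_recipe_names` is hypothetical and not part of the repo.

```python
# Directory named in the PR summary.
RECIPE_DIR = "benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k"


def gb200_mtp_recipe_names():
    """Return the eight GB200 MTP recipe file names (hypothetical helper)."""
    # Two low-latency topologies.
    low_latency = [
        "disagg-gb200-low-latency-1p1d-tp8-tp8-mtp.yaml",
        "disagg-gb200-low-latency-1p6d-dep8-tp8-mtp.yaml",
    ]
    # Six mid-curve topologies: 1p1d through 6p1d, DEP8 prefill, DEP16 decode.
    mid_curve = [
        f"disagg-gb200-mid-curve-{n}p1d-dep8-dep16-mtp.yaml" for n in range(1, 7)
    ]
    return low_latency + mid_curve


recipe_paths = [f"{RECIPE_DIR}/{name}" for name in gb200_mtp_recipe_names()]
```

Each path would correspond to one CONFIG_FILE entry in the matrix, assuming the naming in the review comments matches the files as committed.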

Note

Low Risk
Benchmark and CI launch configuration only; no production serving or auth paths touched.

Overview
Adds dsv4-fp4-gb200-dynamo-sglang-mtp, the MTP speculative-decoding matrix entry for DeepSeek-V4-Pro FP4 disaggregated SGLang on GB200 at 8k/1k. Eight fixed-seq-len scenarios mirror the non-MTP topologies (low-latency 1p1d-tp8-tp8 and 1p6d-dep8-tp8; mid-curve 1p1d–6p1d with DEP8 prefill and DEP16 decode), each with spec-decoding: "mtp" and matching CONFIG_FILE recipe paths.

Eight new Slurm recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/ wire Dynamo + Mooncake disagg, GB200 resource counts, and EAGLE draft settings on decode (speculative-algo, steps/topk/draft tokens) alongside sa-bench concurrency targets per topology.

perf-changelog.yaml documents the initial submission. runners/launch_gb200-nv.sh drops the sa-submission-q2-2026 checkout for dynamo-sglang + dsv4 and relies on default NVIDIA/srt-slurm main while overlaying the local SGLang DSV4 recipes.

Reviewed by Cursor Bugbot for commit e222dc7. Bugbot is set up for automated code reviews on this repo.

Initial submission of the MTP-decoded variant of the DSv4-Pro FP4 disagg GB200 SGLang config at 8k/1k. Eight prefill/decode topologies: two low-latency (1p1d-tp8-tp8, 1p6d-dep8-tp8) and six mid-curve points (1p1d through 6p1d-dep8-dep16). Each scenario sets `spec-decoding: "mtp"` so the matrix turns on the MTP speculative-decode path; chat template enabled accordingly. Image: lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85

github-actions · 2026-06-05T23:25:55Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-06-05T23:26:30Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27045382415
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27045382415

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5c201f6.

cursor · 2026-06-05T23:28:13Z

+    MC_FORCE_MNNVL: "1"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"


Missing multinode all-reduce disable

High Severity

The low-latency GB200 MTP recipes run TP8 decode across two nodes per worker but their decode_environment blocks never set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 to 0. The mid-curve GB200 MTP recipes added in the same change do set that flag, and existing DSV4 slurm recipe notes tie custom all-reduce v2 to incorrect results on multi-node decode.

Additional Locations (1)

benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb200-low-latency-1p6d-dep8-tp8-mtp.yaml#L54-L68

claude · 2026-06-05T23:36:58Z

+  decode_environment:
+    PYTHONUNBUFFERED: "1"
+    SGLANG_RADIX_FORCE_MISS: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DEFAULT_THINKING: "1"
+    SGLANG_DSV4_REASONING_EFFORT: "max"
+    SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1"
+    SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"


🔴 The two new low-latency GB200 recipes (this file and disagg-gb200-low-latency-1p6d-dep8-tp8-mtp.yaml) configure decode at TP=8 on nodes with gpus_per_node: 4, so each decode worker spans 2 nodes, but their decode_environment does not set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0". SGLang's default for CAR_V2 is on, and existing recipes in this directory document that CAR_V2 "is single-node only and corrupts results in 2-node decode setups". The six new GB200 mid-curve recipes in this same PR all correctly set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0"; please mirror that in both low-latency decode envs as well.

Extended reasoning...

What the bug is. Both new low-latency GB200 files in this PR (disagg-gb200-low-latency-1p1d-tp8-tp8-mtp.yaml and disagg-gb200-low-latency-1p6d-dep8-tp8-mtp.yaml) declare gpu_type: "gb200", gpus_per_node: 4, and decode tensor-parallel-size: 8. Each decode worker therefore spans 8/4 = 2 nodes. In that regime, every other recipe in this directory explicitly disables SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 in decode_environment, but these two new files do not.

Why existing code doesn't prevent it. The default for CAR_V2 in SGLang is on. perf-changelog.yaml:3221 (PR #1506) explicitly notes "Remove env vars redundant with sglang defaults (..., SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2)", and the single-node benchmark benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200.sh:29 sets SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1 to match. So omitting the var in the YAML means CAR_V2 runs.

Documented impact. disagg-low-latency-1p1d-tp4-tp4-mtp.yaml carries this comment for the omitted var:

> # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2
> # is single-node only and corrupts results in 2-node decode setups.

That comment is fine for the GB300 1p1d-tp4-tp4 / 1p6d-dep4-tp4 files because their decode TP=4 fits a single 4-GPU node. It does not apply to the new GB200 low-latency files, whose decode TP=8 always spans 2 nodes.

Inconsistency with the rest of this same PR. All six new GB200 mid-curve recipes added here (disagg-gb200-mid-curve-{1..6}p1d-dep8-dep16-mtp.yaml) set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0" in decode_environment (lines ~75 of each). Their decode is TP=16 (4 nodes); the same multi-node constraint applies. The low-latency files are the only outliers in the PR, a clear copy/paste oversight rather than an intentional choice.

Step-by-step proof for disagg-gb200-low-latency-1p1d-tp8-tp8-mtp.yaml.

1. The file declares gpus_per_node: 4 (line 23) and decode tensor-parallel-size: 8 (line 100, in the decode block of sglang_config).
2. With 4 GPUs per node and TP=8, the decode worker spans 8/4 = 2 nodes, confirmed by the file's own decode_nodes: 2, decode_workers: 1 (lines 26–28).
3. The container is lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85 whose default for SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 is 1 (on) per the perf-changelog #1506 entry above.
4. decode_environment (lines 50–64) does not override this var, so CAR_V2 is active on a 2-node decode all-reduce, the exact regime the in-repo comments say "corrupts results."
5. The same logic applies to disagg-gb200-low-latency-1p6d-dep8-tp8-mtp.yaml, whose decode workers also run TP=8 on gpus_per_node: 4.

Fix. Add a single line to the decode_environment of both new GB200 low-latency files (mirroring the mid-curve siblings):

```yaml
    SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0"  # CAR_V2 is single-node only.
```

Leaving the bug in place means the low-latency points on the published curve this PR adds will be silently incorrect rather than crashing, which is the worst failure mode for a benchmark recipe.
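The node-span arithmetic in this comment can be expressed as a small check. This is only a sketch under stated assumptions: the function names are hypothetical, the recipe values (decode TP, gpus_per_node, decode_environment) are passed in as plain Python values rather than parsed from the YAML files, and the environment dictionaries are abbreviated to a few of the variables shown in the diff.

```python
import math

# Env var named in the review; "0" disables custom all-reduce v2.
CAR_V2_VAR = "SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2"


def nodes_per_decode_worker(decode_tp: int, gpus_per_node: int) -> int:
    """Number of nodes one decode worker spans (hypothetical helper)."""
    return math.ceil(decode_tp / gpus_per_node)


def needs_car_v2_disabled(decode_tp: int, gpus_per_node: int,
                          decode_environment: dict) -> bool:
    """True when decode spans more than one node and the env does not set
    the CAR_V2 variable to "0" (hypothetical lint rule)."""
    multi_node = nodes_per_decode_worker(decode_tp, gpus_per_node) > 1
    return multi_node and decode_environment.get(CAR_V2_VAR) != "0"


# Abbreviated stand-ins for the decode_environment blocks discussed above.
low_latency_env = {"MC_FORCE_MNNVL": "1", "NCCL_MNNVL_ENABLE": "1"}
mid_curve_env = {**low_latency_env, CAR_V2_VAR: "0"}

flag_low_latency = needs_car_v2_disabled(8, 4, low_latency_env)  # TP=8, 2 nodes
flag_mid_curve = needs_car_v2_disabled(16, 4, mid_curve_env)     # TP=16, 4 nodes
flag_single_node = needs_car_v2_disabled(4, 4, {})               # TP=4, 1 node
```

Under these assumptions only the low-latency case is flagged, which matches the review's conclusion; a real check would need to load the recipe YAML and read the decode block of sglang_config.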

github-actions · 2026-06-05T23:38:12Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27045383269
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27045383269

github-actions · 2026-06-05T23:44:05Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27045710438
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27045710438

Ankur-singh requested a review from a team June 5, 2026 23:25

Ankur-singh requested a review from kedarpotdar-nv as a code owner June 5, 2026 23:25

github-project-automation Bot added this to InferenceMAX Board Jun 5, 2026

Ankur-singh requested a review from jgangani as a code owner June 5, 2026 23:25

Update perf-changelog pr-link for #1676

5c201f6

Ankur-singh added the full-sweep-enabled label Jun 5, 2026

cursor Bot reviewed Jun 5, 2026

Use NVIDIA/srt-slurm:main for DSv4 SGLang clone (drop submission-branch pin)

e222dc7

claude Bot reviewed Jun 5, 2026

Ankur-singh wants to merge 3 commits into main from dsv4-fp4-gb200-dynamo-sglang-mtp
