Add DSv4-Pro FP4 GB200 SGLang disagg + MTP config #1676

Open
Ankur-singh wants to merge 3 commits into main from dsv4-fp4-gb200-dynamo-sglang-mtp

Conversation

@Ankur-singh

@Ankur-singh Ankur-singh commented Jun 5, 2026

Collaborator

Summary

  • Adds the dsv4-fp4-gb200-dynamo-sglang-mtp config: the MTP-speculative-decoding variant of dsv4-fp4-gb200-dynamo-sglang on GB200 with SGLang at 8k/1k.
  • 8 prefill/decode topologies: low-latency 1p1d-tp8-tp8 and 1p6d-dep8-tp8; mid-curve 1p1d through 6p1d-dep8-dep16. Each scenario sets spec-decoding: "mtp".
  • Container image: lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85 (matches the non-MTP companion PR #1675, "Add DSv4-Pro FP4 GB200 SGLang disagg config").

Eight new recipe YAMLs land under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/. The same dynamo-sglang & dsv4 branch in runners/launch_gb200-nv.sh serves both PRs.


Note

Low Risk
Benchmark and CI launch configuration only; no production serving or auth paths touched.

Overview
Adds dsv4-fp4-gb200-dynamo-sglang-mtp, the MTP speculative-decoding matrix entry for DeepSeek-V4-Pro FP4 disaggregated SGLang on GB200 at 8k/1k. Eight fixed-seq-len scenarios mirror the non-MTP topologies (low-latency 1p1d-tp8-tp8 and 1p6d-dep8-tp8; mid-curve 1p1d through 6p1d with DEP8 prefill and DEP16 decode), each with spec-decoding: "mtp" and matching CONFIG_FILE recipe paths.
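
For readers unfamiliar with the topology names, the sketch below shows one way to decode them. It assumes the naming scheme is `<P>p<D>d-<prefill mode><size>-<decode mode><size>`, where P and D are prefill and decode worker counts and the mode is `tp` or `dep`. That reading is inferred from this description, not taken from the repository's tooling, and `Topology` and `parse_topology` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Assumed naming scheme (inferred from the PR text, not from repo tooling):
#   "<P>p<D>d-<prefill mode><size>-<decode mode><size>", e.g. "1p6d-dep8-tp8"
_TOPOLOGY = re.compile(r"^(\d+)p(\d+)d-(tp|dep)(\d+)-(tp|dep)(\d+)$")


@dataclass(frozen=True)
class Topology:
    """Hypothetical container for one prefill/decode topology."""
    prefill_workers: int
    decode_workers: int
    prefill_mode: str
    prefill_size: int
    decode_mode: str
    decode_size: int


def parse_topology(name: str) -> Topology:
    """Split a topology name into worker counts and parallelism settings."""
    match = _TOPOLOGY.match(name)
    if match is None:
        raise ValueError(f"unrecognized topology name: {name!r}")
    p, d, p_mode, p_size, d_mode, d_size = match.groups()
    return Topology(int(p), int(d), p_mode, int(p_size), d_mode, int(d_size))


# The eight scenarios named in this PR: two low-latency, six mid-curve.
low_latency = [parse_topology(n) for n in ("1p1d-tp8-tp8", "1p6d-dep8-tp8")]
mid_curve = [parse_topology(f"{p}p1d-dep8-dep16") for p in range(1, 7)]
```

Under this reading, `1p6d-dep8-tp8` is one DEP8 prefill worker feeding six TP8 decode workers, and the mid-curve points vary only the prefill worker count from 1 to 6.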

Eight new Slurm recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/ wire Dynamo + Mooncake disagg, GB200 resource counts, and EAGLE draft settings on decode (speculative-algo, steps/topk/draft tokens) alongside sa-bench concurrency targets per topology.

perf-changelog.yaml documents the initial submission. runners/launch_gb200-nv.sh drops the sa-submission-q2-2026 checkout for dynamo-sglang + dsv4 and relies on default NVIDIA/srt-slurm main while overlaying the local SGLang DSV4 recipes.

Reviewed by Cursor Bugbot for commit e222dc7. Bugbot is set up for automated code reviews on this repo. Configure here.

Initial submission of the MTP-decoded variant of the DSv4-Pro FP4 disagg
GB200 SGLang config at 8k/1k. Eight prefill/decode topologies: two
low-latency (1p1d-tp8-tp8, 1p6d-dep8-tp8) and six mid-curve points
(1p1d through 6p1d-dep8-dep16). Each scenario sets
`spec-decoding: "mtp"` so the matrix turns on the MTP speculative-decode
path; chat template enabled accordingly.

Image: lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85
@github-actions

github-actions Bot commented Jun 5, 2026

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5c201f6. Configure here.

MC_FORCE_MNNVL: "1"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"

Missing multinode all-reduce disable

High Severity

The low-latency GB200 MTP recipes run TP8 decode across two nodes per worker but their decode_environment blocks never set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 to 0. The mid-curve GB200 MTP recipes added in the same change do set that flag, and existing DSV4 slurm recipe notes tie custom all-reduce v2 to incorrect results on multi-node decode.

Comment on lines +50 to +64
decode_environment:
  PYTHONUNBUFFERED: "1"
  SGLANG_RADIX_FORCE_MISS: "1"
  SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
  SGLANG_DEFAULT_THINKING: "1"
  SGLANG_DSV4_REASONING_EFFORT: "max"
  SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1"
  SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN: "1"
  NCCL_MNNVL_ENABLE: "1"
  NCCL_CUMEM_ENABLE: "1"
  SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
  MC_FORCE_MNNVL: "1"
  SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
  SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
  SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"

🔴 The two new low-latency GB200 recipes (this file and disagg-gb200-low-latency-1p6d-dep8-tp8-mtp.yaml) configure decode at TP=8 on GB200 nodes with gpus_per_node: 4, so each decode worker spans 2 nodes, but their decode_environment does not set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0". SGLang's default for CAR_V2 is on, and existing recipes in this directory document that CAR_V2 "is single-node only and corrupts results in 2-node decode setups". The six new GB200 mid-curve recipes in this same PR all correctly set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0"; please mirror that in both low-latency decode envs as well.

Extended reasoning...

What the bug is. Both new low-latency GB200 files in this PR (disagg-gb200-low-latency-1p1d-tp8-tp8-mtp.yaml and disagg-gb200-low-latency-1p6d-dep8-tp8-mtp.yaml) declare gpu_type: "gb200", gpus_per_node: 4, and decode tensor-parallel-size: 8. Each decode worker therefore spans 8/4 = 2 nodes. In that regime, every other recipe in this directory explicitly disables SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 in decode_environment, but these two new files do not.

Why existing code doesn't prevent it. The default for CAR_V2 in SGLang is on. perf-changelog.yaml:3221 (PR #1506) explicitly notes "Remove env vars redundant with sglang defaults (..., SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2)", and the single-node benchmark benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200.sh:29 sets SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1 to match. So omitting the var in the YAML means CAR_V2 runs.

Documented impact. disagg-low-latency-1p1d-tp4-tp4-mtp.yaml carries this comment for the omitted var:

> # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2
> # is single-node only and corrupts results in 2-node decode setups.

That comment is fine for the GB300 1p1d-tp4-tp4 / 1p6d-dep4-tp4 files because their decode TP=4 fits a single 4-GPU node. It does not apply to the new GB200 low-latency files, whose decode TP=8 always spans 2 nodes.

Inconsistency with the rest of this same PR. All six new GB200 mid-curve recipes added here (disagg-gb200-mid-curve-{1..6}p1d-dep8-dep16-mtp.yaml) set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0" in decode_environment (lines ~75 of each). Their decode is TP=16 (4 nodes); the same multi-node constraint applies. The low-latency files are the only outliers in the PR, a clear copy/paste oversight rather than an intentional choice.

Step-by-step proof for disagg-gb200-low-latency-1p1d-tp8-tp8-mtp.yaml.
1. The file declares gpus_per_node: 4 (line 23) and decode tensor-parallel-size: 8 (line 100, in the decode block of sglang_config).
2. With 4 GPUs per node and TP=8, the decode worker spans 8/4 = 2 nodes, confirmed by the file's own decode_nodes: 2, decode_workers: 1 (lines 26–28).
3. The container is lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85, whose default for SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 is 1 (on) per the perf-changelog #1506 entry above.
4. decode_environment (lines 50–64) does not override this var, so CAR_V2 is active on a 2-node decode all-reduce, the exact regime the in-repo comments say "corrupts results."
5. The same logic applies to disagg-gb200-low-latency-1p6d-dep8-tp8-mtp.yaml, whose decode workers also run TP=8 on gpus_per_node: 4.

Fix. Add a single line to the decode_environment of both new GB200 low-latency files (mirroring the mid-curve siblings):

    SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0" # CAR_V2 is single-node only.

Leaving the bug in place means the low-latency points on the published curve this PR adds will be silently incorrect rather than crashing, which is the worst failure mode for a benchmark recipe.
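
The reasoning above could be turned into a pre-merge check along these lines. This is only a sketch: the flattened `recipe` dict with a `decode_tp` key is a hypothetical stand-in for the real recipe YAML (where tensor-parallel-size sits inside sglang_config), and `decode_nodes_per_worker` and `missing_car_v2_override` are illustrative names, not existing helpers in this repo.

```python
import math

CAR_V2_VAR = "SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2"


def decode_nodes_per_worker(tp_size: int, gpus_per_node: int) -> int:
    """Number of nodes one decode worker spans, e.g. TP=8 on 4-GPU nodes -> 2."""
    return math.ceil(tp_size / gpus_per_node)


def missing_car_v2_override(recipe: dict) -> bool:
    """True when decode spans several nodes but CAR_V2 is not forced to "0".

    `recipe` is a hypothetical flattened view of a recipe file with the keys
    gpus_per_node, decode_tp, and decode_environment.
    """
    span = decode_nodes_per_worker(recipe["decode_tp"], recipe["gpus_per_node"])
    env = recipe.get("decode_environment", {})
    return span > 1 and str(env.get(CAR_V2_VAR, "")) != "0"


# Illustrative inputs modelled on the two cases discussed in this review.
low_latency = {
    "gpus_per_node": 4,
    "decode_tp": 8,
    "decode_environment": {"MC_FORCE_MNNVL": "1"},  # override absent
}
mid_curve = {
    "gpus_per_node": 4,
    "decode_tp": 16,
    "decode_environment": {CAR_V2_VAR: "0"},  # override present
}

for name, recipe in (("low-latency", low_latency), ("mid-curve", mid_curve)):
    status = "MISSING override" if missing_car_v2_override(recipe) else "ok"
    print(f"{name}: {status}")
```

A single-node decode (for example TP=4 on 4-GPU nodes) passes the check without the variable, which matches the GB300 tp4 recipes described above.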

Comment thread perf-changelog.yaml