Add DSv4-Pro FP4 GB200 SGLang disagg config #1675
Initial submission of the DeepSeek-V4-Pro FP4 disagg config on GB200 with SGLang at 8k/1k. Eight prefill/decode topologies span the low-latency 1p1d-tp8-tp8 point through the max-throughput 6p1d-dep8-dep12 point. Image: lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 2b6a334. Configure here.
osl: 1024
concurrencies: "1"
req_rate: "inf"
use_chat_template: false
Missing DSV4 custom tokenizer
High Severity
Each new GB200 SGLang sa-bench block sets use_chat_template: false but omits custom_tokenizer, unlike every other DeepSeek-V4 SGLang and vLLM recipe here. With skip-tokenizer-init on the server, the load generator likely won’t match the model’s tokenization, so fixed 8k/1k runs may not hit the intended sequence lengths.
type: "sa-bench"
isl: 8192
osl: 1024
concurrencies: "1024"
req_rate: "inf"
use_chat_template: false
🔴 All 8 new GB200 yaml files omit custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer" from the benchmark: block, while all 14 existing DSv4 SGLang sibling configs in the same directory set it. Since 7 of the 8 new configs set skip-tokenizer-init: true on both prefill and decode (SGLang exchanges raw token IDs with the client), the sa-bench client must own DSv4-Pro tokenization end-to-end — without this field, it falls back to a default tokenizer producing wrong token-id streams and skewing ISL/OSL accounting, making the new GB200 numbers non-comparable to the GB300 baseline. Affected files: disagg-gb200-1p1d-tp8-tp8-4-c1.yaml, 1p4d-dep8-tp8-10-c64.yaml, 1p2d-dep8-dep16-10-c256.yaml, 1p1d-dep8-dep16-6-c1024.yaml, 2p1d-dep8-dep16-8-c2048.yaml, 4p1d-dep8-dep16-12-c4096.yaml, 5p1d-dep8-dep16-14-c8192.yaml, 6p1d-dep8-dep12-15-c8192.yaml. Fix: add custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer" to each benchmark: block.
Extended reasoning...
What the bug is
The benchmark: block at the bottom of every one of the 8 new GB200 DSv4 SGLang yaml files looks like:
benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "1024"
  req_rate: "inf"
  use_chat_template: false

The custom_tokenizer field is missing. Every other DSv4 SGLang recipe in the same directory (benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/) sets it, including the direct GB300 counterparts that this sweep is meant to complement:
| File | Line |
|---|---|
| disagg-gb300-1p1d-dep4-dep16-5-c1024.yaml | 182 |
| disagg-gb300-1p1d-tp4-tp4-2-c1.yaml | 165 |
| disagg-gb300-4p1d-dep4-dep16-8-c1024.yaml | 182 |
| disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml | 182 |
| disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml | 182 |
| disagg-gb300-12p1d-dep4-dep12-15-c21504.yaml | 182 |
| 8 mid-curve / high-conc / low-latency variants | 121–141 |
That is 14/14 existing configs setting custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer" and 0/8 new ones.
How it manifests
7 of the 8 new wideep configs (everything except 1p1d-tp8-tp8-4-c1.yaml) set skip-tokenizer-init: true on both prefill and decode sglang_config: blocks. With that flag, SGLang refuses to do tokenization itself and expects the client to send raw token-id streams — DSv4-Pro's own tokenizer with its special tokens (<|begin▁of▁sentence|>, <|User|>, <|Assistant|>, thinking/DSML/task tokens) must be applied client-side. Without custom_tokenizer, sa-bench falls back to its default tokenizer, which does not encode these specials correctly. The result: wrong token-id streams, ISL/OSL accounting that does not match what the GB300 baseline measured, and 8k/1k numbers that cannot be apples-to-apples compared with the existing GB300 sweep.
Even the 1p1d-tp8-tp8-4-c1.yaml low-latency config (which uses disable-radix-cache: true instead of skip-tokenizer-init: true) still breaks the established convention — its GB300 sibling at disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:165 does set custom_tokenizer.
Why existing code does not prevent it
The benchmark: block has no schema validation that would flag a missing custom tokenizer; sa-bench just picks a default. Failure is silent — the run completes, numbers come back, and only a careful reviewer comparing against the GB300 baseline would notice the discrepancy.
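A pre-merge check along the following lines could turn the silent omission into a loud one. This is a minimal sketch, not part of sa-bench or this repo: the function names are hypothetical, and the line scanner assumes benchmark: is an unindented key with a flat mapping beneath it (a real check would more likely use a YAML parser).

```python
import re

REQUIRED_KEY = "custom_tokenizer"


def benchmark_block_keys(yaml_text: str) -> set[str]:
    """Return the keys found under an unindented `benchmark:` mapping.

    Naive line scanner (assumption: flat block, no YAML library needed).
    """
    keys = set()
    in_block = False
    for line in yaml_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if not line[0].isspace():
            # Any unindented line starts a new top-level key.
            in_block = line.split("#")[0].strip() == "benchmark:"
            continue
        if in_block:
            match = re.match(r"\s+([A-Za-z_][\w-]*)\s*:", line)
            if match:
                keys.add(match.group(1))
    return keys


def missing_custom_tokenizer(yaml_text: str) -> bool:
    """True when a benchmark: block exists but does not set custom_tokenizer."""
    keys = benchmark_block_keys(yaml_text)
    return bool(keys) and REQUIRED_KEY not in keys
```

Run over every recipe in the 8k1k/ directory, a check like this would flag the 8 new GB200 files and pass the existing GB300 ones, assuming they are laid out as described above.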
Step-by-step proof of impact
- Launch sweep picks up the new dsv4-fp4-gb200-dynamo-sglang config from .github/configs/nvidia-master.yaml.
- For e.g. concurrency 1024, the harness selects disagg-gb200-1p1d-dep8-dep16-6-c1024.yaml.
- SGLang prefill and decode are started with skip-tokenizer-init: true (lines 108, 131 of that file), so the server now accepts only token ids, not text.
- sa-bench reads the benchmark: block. Because custom_tokenizer is absent, it falls back to a generic HF tokenizer (or whatever its default is).
- The generic tokenizer encodes the 8k input prompt differently from SGLangDeepseekV4Tokenizer: DSv4-specific specials are tokenized as multiple BPE pieces instead of the intended single ids.
- The token stream sent to SGLang has a different length and content than the GB300 sweep produced. ISL is no longer exactly 8192 in DSv4 tokenization, OSL accounting is correspondingly off, and per-token latency / throughput numbers are skewed relative to the GB300 baseline.
- The PR description explicitly positions this as the GB200 counterpart to the existing GB300 sweep, so apples-to-apples comparison is the whole point, and it is silently broken.
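The length mismatch in the tokenization step can be illustrated with a toy example. Everything here is hypothetical: the two encoders are character-level stand-ins rather than the real DSv4 or sa-bench tokenizers, and the ids are made up. The sketch only shows why a tokenizer that does not know the special tokens yields a longer id stream for the same prompt.

```python
import re

# Hypothetical special tokens and ids, for illustration only.
SPECIALS = {"<|User|>": 1, "<|Assistant|>": 2}


def encode_with_specials(text: str) -> list[int]:
    """Toy encoder: one id per special token, one id per other character."""
    pattern = "(" + "|".join(re.escape(s) for s in SPECIALS) + ")"
    ids = []
    for piece in re.split(pattern, text):
        if piece in SPECIALS:
            ids.append(SPECIALS[piece])
        else:
            ids.extend(1000 + ord(ch) for ch in piece)
    return ids


def encode_generic(text: str) -> list[int]:
    """Toy encoder with no knowledge of the specials: one id per character."""
    return [1000 + ord(ch) for ch in text]


prompt = "<|User|>hi<|Assistant|>"
aware = encode_with_specials(prompt)
generic = encode_generic(prompt)
print(len(aware), len(generic))  # the generic stream is longer
```

With a real BPE tokenizer the gap per special token would be smaller than in this character-level toy, but any such difference means the request no longer carries the sequence length the benchmark intended.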
Fix
Add one line to each of the 8 new yaml benchmark: blocks, matching the existing convention:
benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "..."
  req_rate: "inf"
  use_chat_template: false
  custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"
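Since the same line has to land in all 8 files, the edit could be scripted. The helper below is a hedged sketch, not tooling from this repo: add_custom_tokenizer is a hypothetical name, and it assumes benchmark: is an unindented key whose children are indented lines, as in the block above.

```python
TOKENIZER_LINE = (
    'custom_tokenizer: '
    '"sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"'
)


def add_custom_tokenizer(yaml_text: str) -> str:
    """Append custom_tokenizer to the benchmark: block if it is not set."""
    lines = yaml_text.splitlines()
    in_block = False
    last_child = None  # index of the last indented line inside benchmark:
    indent = "  "
    for i, line in enumerate(lines):
        if not line.strip():
            continue
        if not line[0].isspace():
            # Any unindented line starts a new top-level key.
            in_block = line.strip() == "benchmark:"
            continue
        if in_block:
            if line.strip().startswith("custom_tokenizer:"):
                return yaml_text  # already set; leave the text untouched
            if last_child is None:
                # Reuse the indentation of the first child key.
                indent = line[: len(line) - len(line.lstrip())]
            last_child = i
    if last_child is None:
        return yaml_text  # no benchmark: block with children found
    lines.insert(last_child + 1, indent + TOKENIZER_LINE)
    return "\n".join(lines) + "\n"
```

The function is idempotent, so applying it to a file that already sets the field returns the text unchanged.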
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27044724564
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27045729036
…on GB200 external
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27046356300


Summary
- Adds the dsv4-fp4-gb200-dynamo-sglang config: DeepSeek-V4-Pro FP4 disagg on GB200 with SGLang at 8k/1k.
- Eight prefill/decode topologies, from low-latency 1p1d-tp8-tp8 through max-throughput 6p1d-dep8-dep12.
- Image: lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85.

8 new recipe YAMLs land under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/ and are picked up by the existing dynamo-sglang & dsv4 branch in runners/launch_gb200-nv.sh.

Note
Low Risk
Benchmark and CI launcher configuration only; no application runtime or auth changes.
Overview
Adds dsv4-fp4-gb200-dynamo-sglang to the NVIDIA master benchmark matrix: DeepSeek-V4-Pro FP4, disaggregated prefill/decode on GB200 via Dynamo + SGLang at 8k/1k, using image lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85. The search space defines eight topology/concurrency points from low-latency 1p1d-tp8-tp8 (c=1) through high-throughput WideEP layouts up to 6p1d-dep8-dep12 (c=8192).

Eight matching Slurm recipe YAMLs are added under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/.

runners/launch_gb200-nv.sh is adjusted for this path: DSV4 FP4 weights load from /mnt/lustre01/models/deepseek-v4-pro instead of compute-local NVMe, and the SGLang DSV4 flow clones NVIDIA/srt-slurm on main (drops the sa-submission-q2-2026 pin) while still overlaying the local recipes.

perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 567f0ec. Bugbot is set up for automated code reviews on this repo.