Add DSv4-Pro FP4 GB200 SGLang disagg config by Ankur-singh · Pull Request #1675 · SemiAnalysisAI/InferenceX

Ankur-singh · 2026-06-05T23:05:13Z

Summary

Adds the dsv4-fp4-gb200-dynamo-sglang config: DeepSeek-V4-Pro FP4 disagg on GB200 with SGLang at 8k/1k.
8 prefill/decode topologies in the search space — low-latency 1p1d-tp8-tp8 through max-throughput 6p1d-dep8-dep12.
Container image: lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85.

8 new recipe YAMLs land under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/ and are picked up by the existing dynamo-sglang & dsv4 branch in runners/launch_gb200-nv.sh.

Note

Low Risk
Benchmark and CI launcher configuration only; no application runtime or auth changes.

Overview
Adds dsv4-fp4-gb200-dynamo-sglang to the NVIDIA master benchmark matrix: DeepSeek-V4-Pro FP4, disaggregated prefill/decode on GB200 via Dynamo + SGLang at 8k/1k, using image lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85. The search space defines eight topology/concurrency points from low-latency 1p1d-tp8-tp8 (c=1) through high-throughput WideEP layouts up to 6p1d-dep8-dep12 (c=8192).

Eight matching Slurm recipe YAMLs are added under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/. runners/launch_gb200-nv.sh is adjusted for this path: DSV4 FP4 weights load from /mnt/lustre01/models/deepseek-v4-pro instead of compute-local NVMe, and the SGLang DSV4 flow clones NVIDIA/srt-slurm on main (drops the sa-submission-q2-2026 pin) while still overlaying the local recipes. perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 567f0ec. Bugbot is set up for automated code reviews on this repo.

Initial submission of the DeepSeek-V4-Pro FP4 disagg config on GB200 with SGLang at 8k/1k. Eight prefill/decode topologies span the low-latency 1p1d-tp8-tp8 point through the max-throughput 6p1d-dep8-dep12 point. Image: lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85

github-actions · 2026-06-05T23:05:20Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 2b6a334.

cursor · 2026-06-05T23:06:56Z

+  osl: 1024
+  concurrencies: "1"
+  req_rate: "inf"
+  use_chat_template: false


Missing DSV4 custom tokenizer

High Severity

Each new GB200 SGLang sa-bench block sets use_chat_template: false but omits custom_tokenizer, unlike every other DeepSeek-V4 SGLang and vLLM recipe here. With skip-tokenizer-init on the server, the load generator likely won’t match the model’s tokenization, so fixed 8k/1k runs may not hit the intended sequence lengths.

Additional Locations (2)

benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb200-1p1d-dep8-dep16-6-c1024.yaml#L152-L159

benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb200-1p4d-dep8-tp8-10-c64.yaml#L138-L145

claude · 2026-06-05T23:14:37Z

+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "1024"
+  req_rate: "inf"
+  use_chat_template: false


🔴 All 8 new GB200 yaml files omit custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer" from the benchmark: block, while all 14 existing DSv4 SGLang sibling configs in the same directory set it. Since 7 of the 8 new configs set skip-tokenizer-init: true on both prefill and decode (SGLang exchanges raw token IDs with the client), the sa-bench client must own DSv4-Pro tokenization end-to-end — without this field, it falls back to a default tokenizer producing wrong token-id streams and skewing ISL/OSL accounting, making the new GB200 numbers non-comparable to the GB300 baseline. Affected files: disagg-gb200-1p1d-tp8-tp8-4-c1.yaml, 1p4d-dep8-tp8-10-c64.yaml, 1p2d-dep8-dep16-10-c256.yaml, 1p1d-dep8-dep16-6-c1024.yaml, 2p1d-dep8-dep16-8-c2048.yaml, 4p1d-dep8-dep16-12-c4096.yaml, 5p1d-dep8-dep16-14-c8192.yaml, 6p1d-dep8-dep12-15-c8192.yaml. Fix: add custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer" to each benchmark: block.

Extended reasoning...

What the bug is

The benchmark: block at the bottom of every one of the 8 new GB200 DSv4 SGLang yaml files looks like:

benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "1024"
  req_rate: "inf"
  use_chat_template: false

The custom_tokenizer field is missing. Every other DSv4 SGLang recipe in the same directory — benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/ — sets it, including the direct GB300 counterparts that this sweep is meant to complement:

File	Line
disagg-gb300-1p1d-dep4-dep16-5-c1024.yaml	182
disagg-gb300-1p1d-tp4-tp4-2-c1.yaml	165
disagg-gb300-4p1d-dep4-dep16-8-c1024.yaml	182
disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml	182
disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml	182
disagg-gb300-12p1d-dep4-dep12-15-c21504.yaml	182
8 mid-curve / high-conc / low-latency variants	121–141

That is 14/14 existing configs setting custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer" and 0/8 new ones.

How it manifests

7 of the 8 new configs (the WideEP ones, i.e. everything except 1p1d-tp8-tp8-4-c1.yaml) set skip-tokenizer-init: true in both the prefill and decode sglang_config: blocks. With that flag, SGLang does not initialize its own tokenizer and expects the client to send raw token IDs, so DSv4-Pro's tokenizer, with its special tokens (<｜begin▁of▁sentence｜>, <｜User｜>, <｜Assistant｜>, thinking/DSML/task tokens), must be applied client-side. Without custom_tokenizer, sa-bench falls back to its default tokenizer, which does not encode these specials correctly. The result: wrong token-ID streams, ISL/OSL accounting that does not match what the GB300 baseline measured, and 8k/1k numbers that cannot be compared apples-to-apples with the existing GB300 sweep.

Even the 1p1d-tp8-tp8-4-c1.yaml low-latency config (which uses disable-radix-cache: true instead of skip-tokenizer-init: true) still breaks the established convention — its GB300 sibling at disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:165 does set custom_tokenizer.

Why existing code does not prevent it

The benchmark: block has no schema validation that would flag a missing custom tokenizer; sa-bench just picks a default. Failure is silent — the run completes, numbers come back, and only a careful reviewer comparing against the GB300 baseline would notice the discrepancy.
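A check along the following lines could turn the silent omission into a loud one. This is only a sketch, not part of sa-bench or srt-slurm: the function names, the required-key set, and the line-based scan (used here to avoid a YAML library dependency) are all assumptions. It presumes benchmark: is a top-level key with its fields indented beneath it, as in the block quoted in this review.

```python
from typing import Iterable

# Assumption: keys a DSv4 SGLang recipe is expected to carry in its
# benchmark: block. Only custom_tokenizer comes from this review.
REQUIRED_KEYS = {"custom_tokenizer"}


def benchmark_keys(yaml_text: str) -> set[str]:
    """Collect the keys indented under a top-level `benchmark:` line."""
    keys: set[str] = set()
    in_block = False
    for raw in yaml_text.splitlines():
        # Naive comment stripping; good enough for keys, not for values with '#'.
        line = raw.split("#", 1)[0].rstrip()
        if not line.strip():
            continue
        if not line[0].isspace():
            # Any unindented line starts a new top-level key.
            in_block = line.startswith("benchmark:")
            continue
        if in_block and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return keys


def missing_benchmark_keys(
    yaml_text: str, required: Iterable[str] = REQUIRED_KEYS
) -> list[str]:
    """Return the required keys absent from the benchmark block, sorted."""
    return sorted(set(required) - benchmark_keys(yaml_text))


# Hypothetical recipe text mirroring the block quoted in the review.
NEW_RECIPE = """\
name: example
benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  use_chat_template: false
"""

FIXED_RECIPE = NEW_RECIPE + (
    '  custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"\n'
)

print(missing_benchmark_keys(NEW_RECIPE))    # ['custom_tokenizer']
print(missing_benchmark_keys(FIXED_RECIPE))  # []
```

Run over every recipe in the 8k1k directory in CI, a non-empty result could fail the job before any benchmark numbers are produced.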

Step-by-step proof of impact

1. The launch sweep picks up the new dsv4-fp4-gb200-dynamo-sglang config from .github/configs/nvidia-master.yaml.

2. For concurrency 1024, for example, the harness selects disagg-gb200-1p1d-dep8-dep16-6-c1024.yaml.

3. SGLang prefill and decode are started with skip-tokenizer-init: true (lines 108 and 131 of that file), so the server now accepts only token IDs, not text.

4. sa-bench reads the benchmark: block. Because custom_tokenizer is absent, it falls back to a generic HF tokenizer (or whatever its default is).

5. The generic tokenizer encodes the 8k input prompt differently from SGLangDeepseekV4Tokenizer: DSv4-specific specials are tokenized as multiple BPE pieces instead of the intended single IDs.

6. The token stream sent to SGLang has a different length and content than the GB300 sweep produced. ISL is no longer exactly 8192 in DSv4 tokenization, OSL accounting is correspondingly off, and per-token latency and throughput numbers are skewed relative to the GB300 baseline.
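The length drift described in the last two steps can be illustrated with a toy model. Nothing below is the real DSv4-Pro tokenizer or sa-bench's default: both tokenizers are hypothetical stand-ins, `<|User|>` and `<|Assistant|>` are ASCII placeholders for special tokens, and ordinary text is counted as one token per word purely for simplicity. The only point is that a tokenizer that knows the specials and one that splits them into pieces disagree on how many tokens the same prompt contains.

```python
import re

# ASCII placeholders for special tokens; not the real DSv4 vocabulary.
SPECIALS = ["<|User|>", "<|Assistant|>"]
SPECIAL_RE = re.compile("(" + "|".join(re.escape(s) for s in SPECIALS) + ")")


def special_aware_tokenize(text: str) -> list[str]:
    """Toy tokenizer: each special marker is one token, other text splits on whitespace."""
    tokens: list[str] = []
    for chunk in SPECIAL_RE.split(text):
        if chunk in SPECIALS:
            tokens.append(chunk)
        else:
            tokens.extend(chunk.split())
    return tokens


def generic_tokenize(text: str) -> list[str]:
    """Toy fallback: unaware of specials, so it splits them into punctuation and word pieces."""
    return re.findall(r"\w+|[^\w\s]", text)


prompt = "<|User|>hello world<|Assistant|>"
aware = special_aware_tokenize(prompt)
generic = generic_tokenize(prompt)

print(len(aware), aware)
print(len(generic), generic)
```

In this toy, the same prompt counts as 4 tokens under the special-aware tokenizer and 12 under the generic one. A load generator that builds prompts to a fixed length with one tokenizer would therefore miss that length when the result is measured with the other.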

The PR description explicitly positions this as the GB200 counterpart to the existing GB300 sweep, so apples-to-apples comparison is the whole point — and it is silently broken.

Fix

Add one line to each of the 8 new yaml benchmark: blocks, matching the existing convention:

benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "..."
  req_rate: "inf"
  use_chat_template: false
  custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"

github-actions · 2026-06-05T23:37:45Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27044724564
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27044724564

github-actions · 2026-06-05T23:42:42Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27045729036
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27045729036

github-actions · 2026-06-06T04:54:52Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27046356300
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27046356300

Ankur-singh requested a review from a team June 5, 2026 23:05

Ankur-singh requested review from jgangani and kedarpotdar-nv as code owners June 5, 2026 23:05

github-project-automation Bot added this to InferenceMAX Board Jun 5, 2026

Update perf-changelog pr-link for #1675

2b6a334

Ankur-singh added the full-sweep-enabled label Jun 5, 2026

cursor Bot reviewed Jun 5, 2026

claude Bot reviewed Jun 5, 2026

Ankur-singh mentioned this pull request Jun 5, 2026

Add DSv4-Pro FP4 GB200 SGLang disagg + MTP config #1676

Open

Use NVIDIA/srt-slurm:main for DSv4 SGLang clone (drop submission-branch pin)

1786080

Point DSv4 SGLang MODEL_PATH at /mnt/lustre01/models/deepseek-v4-pro on GB200 external

567f0ec

Ankur-singh wants to merge 4 commits into main from dsv4-fp4-gb200-dynamo-sglang-new
