Add DSv4-Pro FP4 GB200 SGLang disagg config#1675

Open
Ankur-singh wants to merge 4 commits into main from dsv4-fp4-gb200-dynamo-sglang-new

Ankur-singh (Collaborator) commented Jun 5, 2026

Summary

  • Adds the dsv4-fp4-gb200-dynamo-sglang config: DeepSeek-V4-Pro FP4 disagg on GB200 with SGLang at 8k/1k.
  • 8 prefill/decode topologies in the search space — low-latency 1p1d-tp8-tp8 through max-throughput 6p1d-dep8-dep12.
  • Container image: lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85.

8 new recipe YAMLs land under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/ and are picked up by the existing dynamo-sglang & dsv4 branch in runners/launch_gb200-nv.sh.


Note

Low Risk
Benchmark and CI launcher configuration only; no application runtime or auth changes.

Overview
Adds dsv4-fp4-gb200-dynamo-sglang to the NVIDIA master benchmark matrix: DeepSeek-V4-Pro FP4, disaggregated prefill/decode on GB200 via Dynamo + SGLang at 8k/1k, using image lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85. The search space defines eight topology/concurrency points from low-latency 1p1d-tp8-tp8 (c=1) through high-throughput WideEP layouts up to 6p1d-dep8-dep12 (c=8192).

Eight matching Slurm recipe YAMLs are added under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/. runners/launch_gb200-nv.sh is adjusted for this path: DSV4 FP4 weights load from /mnt/lustre01/models/deepseek-v4-pro instead of compute-local NVMe, and the SGLang DSV4 flow clones NVIDIA/srt-slurm on main (drops the sa-submission-q2-2026 pin) while still overlaying the local recipes. perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 567f0ec.

Initial submission of the DeepSeek-V4-Pro FP4 disagg config on GB200 with
SGLang at 8k/1k. Eight prefill/decode topologies span the low-latency
1p1d-tp8-tp8 point through the max-throughput 6p1d-dep8-dep12 point.

Image: lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85
github-actions Bot (Contributor) commented Jun 5, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first, before we merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 2b6a334.

osl: 1024
concurrencies: "1"
req_rate: "inf"
use_chat_template: false

Missing DSV4 custom tokenizer

High Severity

Each new GB200 SGLang sa-bench block sets use_chat_template: false but omits custom_tokenizer, unlike every other DeepSeek-V4 SGLang and vLLM recipe here. With skip-tokenizer-init on the server, the load generator likely won’t match the model’s tokenization, so fixed 8k/1k runs may not hit the intended sequence lengths.


Comment on lines +154 to +159
type: "sa-bench"
isl: 8192
osl: 1024
concurrencies: "1024"
req_rate: "inf"
use_chat_template: false

🔴 All 8 new GB200 yaml files omit custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer" from the benchmark: block, while all 14 existing DSv4 SGLang sibling configs in the same directory set it. Since 7 of the 8 new configs set skip-tokenizer-init: true on both prefill and decode (SGLang exchanges raw token IDs with the client), the sa-bench client must own DSv4-Pro tokenization end-to-end — without this field, it falls back to a default tokenizer producing wrong token-id streams and skewing ISL/OSL accounting, making the new GB200 numbers non-comparable to the GB300 baseline. Affected files: disagg-gb200-1p1d-tp8-tp8-4-c1.yaml, 1p4d-dep8-tp8-10-c64.yaml, 1p2d-dep8-dep16-10-c256.yaml, 1p1d-dep8-dep16-6-c1024.yaml, 2p1d-dep8-dep16-8-c2048.yaml, 4p1d-dep8-dep16-12-c4096.yaml, 5p1d-dep8-dep16-14-c8192.yaml, 6p1d-dep8-dep12-15-c8192.yaml. Fix: add custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer" to each benchmark: block.
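The affected file names above pack the topology into the name itself. As a rough sketch, assuming the pattern is `<P>p<D>d-<prefill>-<decode>-<nodes>-c<concurrency>.yaml` (inferred from the names, not stated in the PR) and assuming 4 GPUs per GB200 node, the names can be decoded and sanity-checked like this. `Recipe`, `parse_recipe`, and `GPUS_PER_NODE` are hypothetical names used only for illustration.

```python
import re
from dataclasses import dataclass

# Assumed naming pattern (inferred from the file names, not documented in the PR):
#   [disagg-gb200-]<P>p<D>d-<prefill>-<decode>-<nodes>-c<concurrency>.yaml
RECIPE_RE = re.compile(
    r"^(?:disagg-gb200-)?"
    r"(?P<p>\d+)p(?P<d>\d+)d-"
    r"(?P<pmode>tp|dep)(?P<psize>\d+)-"
    r"(?P<dmode>tp|dep)(?P<dsize>\d+)-"
    r"(?P<nodes>\d+)-c(?P<conc>\d+)\.yaml$"
)

GPUS_PER_NODE = 4  # assumption, not taken from the PR


@dataclass(frozen=True)
class Recipe:
    prefill_workers: int
    prefill_mode: str
    prefill_size: int
    decode_workers: int
    decode_mode: str
    decode_size: int
    nodes: int
    concurrency: int

    @property
    def gpus(self) -> int:
        """Total GPUs implied by worker counts times per-worker parallel size."""
        return (self.prefill_workers * self.prefill_size
                + self.decode_workers * self.decode_size)


def parse_recipe(name: str) -> Recipe:
    """Decode a recipe file name; raises ValueError if the pattern does not match."""
    m = RECIPE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognised recipe name: {name}")
    return Recipe(
        prefill_workers=int(m["p"]), prefill_mode=m["pmode"], prefill_size=int(m["psize"]),
        decode_workers=int(m["d"]), decode_mode=m["dmode"], decode_size=int(m["dsize"]),
        nodes=int(m["nodes"]), concurrency=int(m["conc"]),
    )
```

Under these assumptions, all eight names listed above satisfy `recipe.gpus == recipe.nodes * GPUS_PER_NODE`, which suggests (but does not prove) that the trailing number before `-c` is the node count.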

Extended reasoning...

What the bug is

The benchmark: block at the bottom of every one of the 8 new GB200 DSv4 SGLang yaml files looks like:

benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "1024"
  req_rate: "inf"
  use_chat_template: false

The custom_tokenizer field is missing. Every other DSv4 SGLang recipe in the same directory — benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/ — sets it, including the direct GB300 counterparts that this sweep is meant to complement:

| File | Line |
| --- | --- |
| disagg-gb300-1p1d-dep4-dep16-5-c1024.yaml | 182 |
| disagg-gb300-1p1d-tp4-tp4-2-c1.yaml | 165 |
| disagg-gb300-4p1d-dep4-dep16-8-c1024.yaml | 182 |
| disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml | 182 |
| disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml | 182 |
| disagg-gb300-12p1d-dep4-dep12-15-c21504.yaml | 182 |
| 8 mid-curve / high-conc / low-latency variants | 121–141 |

That is 14/14 existing configs setting custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer" and 0/8 new ones.

How it manifests

7 of the 8 new configs (the wideep ones, i.e. everything except 1p1d-tp8-tp8-4-c1.yaml) set skip-tokenizer-init: true on both the prefill and decode sglang_config: blocks. With that flag, SGLang does no tokenization itself and expects the client to send raw token-id streams, so DSv4-Pro's own tokenizer, with its special tokens (<|begin▁of▁sentence|>, <|User|>, <|Assistant|>, thinking/DSML/task tokens), must be applied client-side. Without custom_tokenizer, sa-bench falls back to its default tokenizer, which does not encode these specials correctly. The result: wrong token-id streams, ISL/OSL accounting that does not match what the GB300 baseline measured, and 8k/1k numbers that cannot be compared apples-to-apples with the existing GB300 sweep.

Even the 1p1d-tp8-tp8-4-c1.yaml low-latency config (which uses disable-radix-cache: true instead of skip-tokenizer-init: true) still breaks the established convention — its GB300 sibling at disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:165 does set custom_tokenizer.

Why existing code does not prevent it

The benchmark: block has no schema validation that would flag a missing custom tokenizer; sa-bench just picks a default. Failure is silent — the run completes, numbers come back, and only a careful reviewer comparing against the GB300 baseline would notice the discrepancy.
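A lightweight pre-merge check could close this gap. The following is a minimal sketch, not part of the repo: it scans the text of a recipe for a top-level `benchmark:` block and reports whether `custom_tokenizer` is absent. It uses only the standard library and a simple line-based scan (assuming a flat, space-indented block) rather than a real YAML parser; `benchmark_keys`, `missing_tokenizer`, and `find_offenders` are hypothetical helper names.

```python
from pathlib import Path


def benchmark_keys(yaml_text: str) -> set[str]:
    """Collect the keys found under the top-level `benchmark:` block.

    Line-based scan; assumes the block is flat and indented with spaces or tabs.
    """
    keys: set[str] = set()
    in_block = False
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if line[0] not in " \t":
            # A new top-level key starts; we are in the block only if it is `benchmark:`.
            in_block = stripped.split("#", 1)[0].strip() == "benchmark:"
        elif in_block and ":" in stripped:
            keys.add(stripped.split(":", 1)[0].strip())
    return keys


def missing_tokenizer(yaml_text: str) -> bool:
    """True if a benchmark block exists but does not set custom_tokenizer."""
    keys = benchmark_keys(yaml_text)
    return bool(keys) and "custom_tokenizer" not in keys


def find_offenders(recipe_dir: Path, pattern: str = "*.yaml") -> list[Path]:
    """Return recipe files under recipe_dir whose benchmark block lacks custom_tokenizer."""
    return sorted(
        path for path in recipe_dir.glob(pattern)
        if missing_tokenizer(path.read_text(encoding="utf-8"))
    )
```

Pointing `find_offenders` at the 8k1k recipe directory from a CI step would turn the currently silent omission into a visible failure; whether that belongs in CI or in sa-bench itself is a design choice for the maintainers.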

Step-by-step proof of impact

  1. Launch sweep picks up the new dsv4-fp4-gb200-dynamo-sglang config from .github/configs/nvidia-master.yaml.
  2. For e.g. concurrency 1024, the harness selects disagg-gb200-1p1d-dep8-dep16-6-c1024.yaml.
  3. SGLang prefill and decode are started with skip-tokenizer-init: true (lines 108, 131 of that file) — the server now accepts only token ids, not text.
  4. sa-bench reads the benchmark: block. Because custom_tokenizer is absent, it falls back to a generic HF tokenizer (or whatever its default is).
  5. The generic tokenizer encodes the 8k input prompt differently from SGLangDeepseekV4Tokenizer — DSv4-specific specials are tokenized as multiple BPE pieces instead of the intended single ids.
  6. The token stream sent to SGLang has a different length and content than the GB300 sweep produced. ISL is no longer exactly 8192 in DSv4 tokenization, OSL accounting is correspondingly off, and per-token latency / throughput numbers are skewed relative to the GB300 baseline.
  7. The PR description explicitly positions this as the GB200 counterpart to the existing GB300 sweep, so apples-to-apples comparison is the whole point — and it is silently broken.
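Step 5 can be illustrated with a toy example. The sketch below is not the DSv4 tokenizer and not sa-bench's default; both encoders are deliberately crude stand-ins, written only to show why a tokenizer that treats special tokens as single units yields a different sequence length from one that does not.

```python
import re

# Special tokens named in the review comment; treated as atomic by the "aware" encoder.
SPECIALS = ["<|begin▁of▁sentence|>", "<|User|>", "<|Assistant|>"]
SPECIAL_RE = re.compile("(" + "|".join(re.escape(s) for s in SPECIALS) + ")")


def encode_special_aware(text: str) -> list[str]:
    """Toy encoder: each special token is one token; other text splits on whitespace."""
    tokens: list[str] = []
    for part in SPECIAL_RE.split(text):
        if not part:
            continue
        if part in SPECIALS:
            tokens.append(part)
        else:
            tokens.extend(part.split())
    return tokens


def encode_generic(text: str) -> list[str]:
    """Toy stand-in for a tokenizer unaware of the specials: they fragment into pieces."""
    return re.findall(r"\w+|[^\w\s]", text)


prompt = "<|begin▁of▁sentence|><|User|>hello world<|Assistant|>"
aware = encode_special_aware(prompt)
generic = encode_generic(prompt)
print(len(aware), len(generic))  # the generic count is larger for the same text
```

The same text therefore maps to different token counts, which is the mechanism by which a nominal 8192-token input stops being 8192 tokens in the model's own tokenization.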

Fix

Add one line to each of the 8 new yaml benchmark: blocks, matching the existing convention:

benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "..."
  req_rate: "inf"
  use_chat_template: false
  custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"
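Applying the one-line fix by hand to eight files is easy to get wrong, so a small script may help. This is a sketch under assumptions: it treats the recipe as text, assumes a top-level `benchmark:` block with two-space indentation as in the snippet above, and appends the line only if it is missing. `add_custom_tokenizer` is a hypothetical helper, not something in the repo.

```python
TOKENIZER_LINE = (
    '  custom_tokenizer: '
    '"sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"'
)


def add_custom_tokenizer(yaml_text: str) -> str:
    """Append custom_tokenizer to the top-level benchmark block if it is absent.

    Returns the text unchanged when there is no benchmark block or the key
    is already set, so the function is safe to run repeatedly.
    """
    lines = yaml_text.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.rstrip() == "benchmark:"), None
    )
    if start is None:
        return yaml_text

    # The block extends over following lines that are blank or indented.
    end = start + 1
    while end < len(lines) and (not lines[end].strip() or lines[end][0] in " \t"):
        end += 1

    if any(l.strip().startswith("custom_tokenizer:") for l in lines[start + 1:end]):
        return yaml_text

    # Insert after the last non-blank line of the block.
    insert_at = end
    while insert_at > start + 1 and not lines[insert_at - 1].strip():
        insert_at -= 1
    lines.insert(insert_at, TOKENIZER_LINE)

    trailing_newline = "\n" if yaml_text.endswith("\n") else ""
    return "\n".join(lines) + trailing_newline
```

A caller would read each of the eight recipe files, pass the contents through `add_custom_tokenizer`, and write the result back; reviewing the resulting diff before committing is still advisable, since the scan is text-based rather than a full YAML parse.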

