Add DSv4-Pro FP4 GB200 SGLang disagg config #1675
Open
Ankur-singh wants to merge 4 commits into main from dsv4-fp4-gb200-dynamo-sglang-new
Commits:
9d916cf Add DSv4-Pro FP4 GB200 SGLang disagg config (Ankur-singh)
2b6a334 Update perf-changelog pr-link for #1675 (Ankur-singh)
1786080 Use NVIDIA/srt-slurm:main for DSv4 SGLang clone (drop submission-bran… (Ankur-singh)
567f0ec Point DSv4 SGLang MODEL_PATH at /mnt/lustre01/models/deepseek-v4-pro … (Ankur-singh)
159 changes: 159 additions & 0 deletions
..._node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb200-1p1d-dep8-dep16-6-c1024.yaml
@@ -0,0 +1,159 @@
+name: "disagg-gb200-1p1d-dep8-dep16-6-c1024"
+
+model:
+  path: "deepseek-v4-pro"
+  container: "lmsysorg/sglang:nightly-dev-cu13-20260528-0abe6a85"
+  precision: "fp4"
+
+dynamo:
+  hash: "92f5b3b8d7dd5ab9179d4b1034bd2c1c0803693e"
+  install: true
+
+slurm:
+  time_limit: "03:00:00"
+
+sbatch_directives:
+  cpus-per-task: "144"
+  mem: "0"
+
+resources:
+  gpu_type: "gb200"
+  gpus_per_node: 4
+  prefill_nodes: 2
+  prefill_workers: 1
+  gpus_per_prefill: 8
+  decode_nodes: 4
+  decode_workers: 1
+  gpus_per_decode: 16
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: false
+  env:
+    DYN_ROUTER_LOAD_BLOCK_SIZE: "1"
+  args:
+    router-mode: "kv"
+    router-kv-overlap-score-weight: 0
+    router-queue-threshold: 64
+    router-temperature: 0.5
+    no-kv-events: true
+
+backend:
+  type: sglang
+
+  prefill_environment:
+    PYTHONUNBUFFERED: "1"
+    SGLANG_RADIX_FORCE_MISS: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_ENABLE_THINKING: "1"
+    SGLANG_REASONING_EFFORT: "max"
+    SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1"
+    SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN: "1"
+    SGLANG_OPT_FIX_HASH_MEGA_MOE: "1"
+    SGLANG_OPT_USE_FAST_MASK_EP: "1"
+    SGLANG_OPT_FIX_MEGA_MOE_MEMORY: "1"
+    SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: "8192"
+    SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS: "1"
+    SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND: "1"
+    SGLANG_OPT_FIX_NEXTN_MEGA_MOE: "1"
+    SGLANG_OPT_USE_ONLINE_COMPRESS: "1"
+    SGLANG_OPT_FP8_WO_A_GEMM: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_LOG_FORWARD_ITERS: "1"
+    SGLANG_LOG_MS: "1"
+    SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"
+
+  decode_environment:
+    PYTHONUNBUFFERED: "1"
+    SGLANG_RADIX_FORCE_MISS: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_ENABLE_THINKING: "1"
+    SGLANG_REASONING_EFFORT: "max"
+    SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1"
+    SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN: "1"
+    SGLANG_OPT_FIX_HASH_MEGA_MOE: "1"
+    SGLANG_OPT_USE_FAST_MASK_EP: "1"
+    SGLANG_OPT_FIX_MEGA_MOE_MEMORY: "1"
+    SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: "1280"
+    SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS: "1"
+    SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND: "1"
+    SGLANG_OPT_FIX_NEXTN_MEGA_MOE: "1"
+    SGLANG_OPT_USE_ONLINE_COMPRESS: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION: "8"
+    MC_FORCE_MNNVL: "1"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_LOG_FORWARD_ITERS: "1"
+    SGLANG_LOG_MS: "1"
+    SGLANG_REQUEST_STATE_WAIT_TIMEOUT: "60"
+
+  sglang_config:
+    prefill:
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      trust-remote-code: true
+      watchdog-timeout: 86400
+      skip-tokenizer-init: true
+      stream-interval: 60
+
+      tensor-parallel-size: 8
+      data-parallel-size: 8
+      expert-parallel-size: 8
+
+      enable-dp-attention: true
+      moe-a2a-backend: "megamoe"
+      deepep-config: '{"normal_dispatch":{"num_sms":88,"num_max_nvl_chunked_send_tokens":28,"num_max_nvl_chunked_recv_tokens":512},"normal_combine": {"num_sms":88,"num_max_nvl_chunked_send_tokens":16,"num_max_nvl_chunked_recv_tokens":512}}'
+      moe-dense-tp-size: 1
+
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: mooncake
+
+      mem-fraction-static: 0.80
+      max-running-requests: 1024
+      chunked-prefill-size: 65536
+
+    decode:
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      trust-remote-code: true
+      watchdog-timeout: 86400
+      skip-tokenizer-init: true
+      stream-interval: 60
+
+      load-balance-method: "total_requests"
+      moe-a2a-backend: "megamoe"
+
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: mooncake
+      disaggregation-decode-polling-interval: 8
+
+      mem-fraction-static: 0.94
+      swa-full-tokens-ratio: 0.056
+      context-length: 9216
+      tensor-parallel-size: 16
+      data-parallel-size: 16
+      expert-parallel-size: 16
+      enable-dp-attention: true
+      enable-dp-lm-head: true
+      max-running-requests: 21504
+      cuda-graph-max-bs: 1280
+
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "1024"
+  req_rate: "inf"
+  use_chat_template: false
🔴 All 8 new GB200 yaml files omit `custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"` from the `benchmark:` block, while all 14 existing DSv4 SGLang sibling configs in the same directory set it. Since 7 of the 8 new configs set `skip-tokenizer-init: true` on both prefill and decode (SGLang exchanges raw token IDs with the client), the sa-bench client must own DSv4-Pro tokenization end-to-end. Without this field, it falls back to a default tokenizer, producing wrong token-id streams and skewing ISL/OSL accounting, which makes the new GB200 numbers non-comparable to the GB300 baseline.

Affected files: `disagg-gb200-1p1d-tp8-tp8-4-c1.yaml`, `1p4d-dep8-tp8-10-c64.yaml`, `1p2d-dep8-dep16-10-c256.yaml`, `1p1d-dep8-dep16-6-c1024.yaml`, `2p1d-dep8-dep16-8-c2048.yaml`, `4p1d-dep8-dep16-12-c4096.yaml`, `5p1d-dep8-dep16-14-c8192.yaml`, `6p1d-dep8-dep12-15-c8192.yaml`.

Fix: add `custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"` to each `benchmark:` block.

Extended reasoning...

What the bug is

The `benchmark:` block at the bottom of every one of the 8 new GB200 DSv4 SGLang yaml files looks like the one at the end of the diff above: the `custom_tokenizer` field is missing. Every other DSv4 SGLang recipe in the same directory, `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/`, sets it, including the direct GB300 counterparts that this sweep is meant to complement:

- `disagg-gb300-1p1d-dep4-dep16-5-c1024.yaml`
- `disagg-gb300-1p1d-tp4-tp4-2-c1.yaml`
- `disagg-gb300-4p1d-dep4-dep16-8-c1024.yaml`
- `disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml`
- `disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml`
- `disagg-gb300-12p1d-dep4-dep12-15-c21504.yaml`
- `mid-curve` / `high-conc` / `low-latency` variants

That is 14/14 existing configs setting `custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"` and 0/8 new ones.

How it manifests

7 of the 8 new wideep configs (everything except `1p1d-tp8-tp8-4-c1.yaml`) set `skip-tokenizer-init: true` on both the prefill and decode `sglang_config:` blocks. With that flag, SGLang does no tokenization itself and expects the client to send raw token-id streams, so DSv4-Pro's own tokenizer with its special tokens (`<|begin▁of▁sentence|>`, `<|User|>`, `<|Assistant|>`, thinking/DSML/task tokens) must be applied client-side. Without `custom_tokenizer`, sa-bench falls back to its default tokenizer, which does not encode these specials correctly. The result: wrong token-id streams, ISL/OSL accounting that does not match what the GB300 baseline measured, and 8k/1k numbers that cannot be compared apples-to-apples with the existing GB300 sweep.

Even the `1p1d-tp8-tp8-4-c1.yaml` low-latency config (which uses `disable-radix-cache: true` instead of `skip-tokenizer-init: true`) still breaks the established convention: its GB300 sibling at `disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:165` does set `custom_tokenizer`.

Why existing code does not prevent it

The `benchmark:` block has no schema validation that would flag a missing custom tokenizer; sa-bench just picks a default. The failure is silent: the run completes, numbers come back, and only a careful reviewer comparing against the GB300 baseline would notice the discrepancy.

Step-by-step proof of impact

1. `dsv4-fp4-gb200-dynamo-sglang` config from `.github/configs/nvidia-master.yaml`.
2. `disagg-gb200-1p1d-dep8-dep16-6-c1024.yaml`.
3. `skip-tokenizer-init: true` (lines 108, 131 of that file): the server now accepts only token ids, not text.
4. `benchmark:` block. Because `custom_tokenizer` is absent, it falls back to a generic HF tokenizer (or whatever its default is).
5. `SGLangDeepseekV4Tokenizer`: DSv4-specific specials are tokenized as multiple BPE pieces instead of the intended single ids.

Fix

Add one line, `custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"`, to each of the 8 new yaml `benchmark:` blocks, matching the existing convention.