[AMD][MI355X] Bump qwen3.5-bf16 single-node SGLang image to v0.5.12.post1#1673
ChangLiu0709 wants to merge 1 commit
…ost1
Pin both qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x (was lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517) so the e2e matrix runs on the image where we already measured the MTP EAGLE acceleration.
Measured on a single MI355X node (mia1-p01-g09), Qwen/Qwen3.5-397B-A17B, 1k/1k, TP=8, EP=1, no DP-attn, --attention-backend triton, EAGLE num_steps=3 / eagle_topk=1 / num_draft_tokens=4.
MTP delivers +34..69% total token throughput and -28..42% median TPOT over non-MTP for conc 1..32. The conc=64 row is depressed on tok/s (+6.9%) because EAGLE silently caps max_running_requests=48, so 16 of 64 requests queue (TPOT speedup unchanged at 1.39x).
Co-authored-by: Cursor <cursoragent@cursor.com>
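For reviewers who want to sanity-check the figures in the commit message, here is a minimal sketch. It assumes that TPOT speedup is simply baseline TPOT divided by MTP TPOT, and that any requests beyond max_running_requests wait in the queue. The function names are illustrative and not part of SGLang.

```python
def tpot_speedup(reduction: float) -> float:
    """Convert a fractional TPOT reduction (0.28 for -28%) into a speedup factor.

    Assumption: speedup = baseline_tpot / mtp_tpot = 1 / (1 - reduction).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def queued_requests(concurrency: int, max_running: int) -> int:
    """Requests that cannot run concurrently and must wait (assumed simple cap)."""
    return max(0, concurrency - max_running)


if __name__ == "__main__":
    # The -28% and -42% endpoints quoted in the commit message.
    for reduction in (0.28, 0.42):
        print(f"-{reduction:.0%} median TPOT -> {tpot_speedup(reduction):.2f}x")
    # conc=64 against the max_running_requests=48 cap from the commit message.
    print("queued at conc=64:", queued_requests(64, 48))
```

Under these assumptions a 28% TPOT reduction works out to roughly 1.39x, which is consistent with the speedup quoted for the conc=64 row, and a cap of 48 leaves 16 of 64 requests queued.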
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If you re-run failed jobs, you are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
LGTM, straightforward image tag bump for two benchmark matrix entries.
Overview
This PR updates the docker image tag for the qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp entries in .github/configs/amd-master.yaml from lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x to align CI with the image where MTP/EAGLE acceleration was validated on hardware. The diff is exactly four lines (two pairs of image: value changes); no model, runner, or search-space metadata changes.
Security risks
None. This is a CI benchmark matrix config; the image references a published lmsysorg tag and there are no auth, crypto, or permissions changes.
Level of scrutiny
Low. This file is a CI-only sweep configuration for AMD benchmarks; the change is purely a tag bump of an existing well-known publisher image. The PR author already validated locally on mia1-p01-g09 and provided detailed perf data. The pattern matches several adjacent recent image bumps in the same file (e.g., the qwen3.5-fp8-mi355x-sglang entries already use the same post1 family).
Other factors
No bugs were flagged by the bug hunter. The bot's recipe reminder is informational and routine for AMD/SGLang PRs. No outstanding reviewer comments.
Summary
The motivation for this PR is a performance drop observed in qwen3.5-bf16-mi355x-sglang-mtp compared with qwen3.5-bf16-mi355x-sglang; more details are in the issue.
The fix is to update the docker image for both qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp from lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x. The root cause of the perf drop is potentially a regression introduced by the SGLang docker image update, or a docker setting misalignment. The e2e smoke test runs on the image where the MTP EAGLE acceleration was empirically validated; see the table below for details.
Measured MTP performance on the new image
Setup: single MI355X node, Qwen/Qwen3.5-397B-A17B, 1k input / 1k output, tp=8, ep=1, dp-attn=false, --attention-backend triton, EAGLE num_steps=3 / eagle_topk=1 / num_draft_tokens=4. 14-run sweep with bench_serving, paired non-MTP and MTP back-to-back per concurrency.
EAGLE acceptance rate ≈ 1.4–1.7 verified tokens per main forward (theoretical max = 4 with num_draft_tokens=4), monotonically declining with conc as expected.
Commands used
Serve (non-MTP):
python3 -m sglang.launch_server \
  --attention-backend triton \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --host 0.0.0.0 --port 8888 \
  --tensor-parallel-size 8 --ep-size 1 \
  --trust-remote-code \
  --tokenizer-worker-num 6 \
  --enable-aiter-allreduce-fusion \
  --cuda-graph-max-bs 64 \
  --disable-radix-cache \
  --max-prefill-tokens 32768 \
  --scheduler-recv-interval 30 \
  --mem-fraction-static 0.8 \
  --context-length 2068
Serve (with MTP; adds 4 EAGLE flags):
python3 -m sglang.launch_server \
  --attention-backend triton \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --host 0.0.0.0 --port 8888 \
  --tensor-parallel-size 8 --ep-size 1 \
  --trust-remote-code \
  --tokenizer-worker-num 6 \
  --enable-aiter-allreduce-fusion \
  --cuda-graph-max-bs 64 \
  --disable-radix-cache \
  --max-prefill-tokens 32768 \
  --scheduler-recv-interval 30 \
  --mem-fraction-static 0.8 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --context-length 2068
Co-authors
@ChangLiu0709
@chunfangamd
Test plan
python3 utils/matrix_logic/generate_sweep_configs.py test-config --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-bf16-mi355x-sglang-mtp --config-files .github/configs/amd-master.yaml validates and resolves both entries to the new image.
mia1-p01-g09 for both none and mtp spec-decoding (results JSONs above are from these runs).
End-to-End Tests GH Action with test-config --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-bf16-mi355x-sglang-mtp --config-files .github/configs/amd-master.yaml and confirm the per-conc result JSONs match the local-smoke numbers within the usual noise envelope.
Made with Cursor
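The last test-plan item compares the CI results against the local-smoke numbers per concurrency. One way that check could be scripted is sketched below. The 5% relative tolerance is an assumption standing in for "the usual noise envelope", and the inputs are plain concurrency-to-metric mappings rather than the actual result JSON schema, which is not shown in this PR.

```python
def within_noise(local: float, ci: float, rel_tol: float = 0.05) -> bool:
    """True when the CI value is within rel_tol of the local-smoke value.

    rel_tol=0.05 is an assumed noise envelope, not a project-defined threshold.
    """
    if local == 0:
        return ci == 0
    return abs(ci - local) / abs(local) <= rel_tol


def compare_runs(local: dict, ci: dict, rel_tol: float = 0.05) -> dict:
    """Map each concurrency present in both runs to a pass/fail flag."""
    shared = sorted(local.keys() & ci.keys())
    return {conc: within_noise(local[conc], ci[conc], rel_tol) for conc in shared}


if __name__ == "__main__":
    # Placeholder values for illustration only, not measurements from this PR.
    local_smoke = {1: 100.0, 2: 200.0}
    ci_results = {1: 103.0, 2: 230.0}
    print(compare_runs(local_smoke, ci_results))
```

Concurrencies that appear in only one of the two runs are skipped, so a missing CI row would need a separate check.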
Note
Low Risk
Config-only Docker image pin for two benchmark matrix keys; no application or auth logic changes.
Overview
Updates the AMD e2e matrix in .github/configs/amd-master.yaml so Qwen3.5 BF16 single-node MI355X SGLang runs use lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x instead of the dated lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 image.
The same image bump applies to qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp; model, runner, and fixed-seq-len search spaces are unchanged. The intent is to align CI with the build where MTP / EAGLE acceleration was validated on hardware.
Reviewed by Cursor Bugbot for commit 6c9c79f. Bugbot is set up for automated code reviews on this repo.