[AMD][MI355X] Bump qwen3.5-bf16 single-node SGLang image to v0.5.12.post1 #1673

Open
ChangLiu0709 wants to merge 1 commit into main from chang/qwen3.5-bf16-sglang-mtp-perf-drop-fix

ChangLiu0709 commented Jun 5, 2026

Collaborator

Summary

This PR addresses a performance drop observed in qwen3.5-bf16-mi355x-sglang-mtp relative to qwen3.5-bf16-mi355x-sglang; see the issue for details.

The fix updates the Docker image for both qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp from lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x. The likely root cause of the perf drop is either a regression introduced by an SGLang Docker image update or a misalignment in Docker settings. With this change, the e2e smoke test runs on the image where MTP EAGLE acceleration was empirically validated; see the table below for details.

Measured MTP performance on the new image

Setup: single MI355X node, Qwen/Qwen3.5-397B-A17B, 1k input / 1k output, tp=8, ep=1, dp-attn=false, --attention-backend triton, EAGLE num_steps=3 / eagle_topk=1 / num_draft_tokens=4. 14-run sweep with bench_serving, paired non-MTP and MTP back-to-back per concurrency.

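| conc | non-MTP tok/s | MTP tok/s | tok/s speedup | non-MTP TPOT (ms) | MTP TPOT (ms) | TPOT speedup |
|------|---------------|-----------|---------------|-------------------|---------------|--------------|
| 1    | 223.12        | 367.33    | 1.65×         | 8.72              | 5.05          | 1.73×        |
| 2    | 421.49        | 712.84    | 1.69×         | 9.14              | 5.27          | 1.73×        |
| 4    | 781.42        | 1226.04   | 1.57×         | 9.75              | 6.02          | 1.62×        |
| 8    | 1425.58       | 2046.63   | 1.44×         | 10.70             | 7.22          | 1.48×        |
| 16   | 2545.06       | 3413.03   | 1.34×         | 12.14             | 8.65          | 1.40×        |
| 32   | 3980.16       | 5367.44   | 1.35×         | 15.49             | 10.78         | 1.44×        |
| 64   | 6217.16       | 6645.44   | 1.07×         | 19.58             | 14.05         | 1.39×        |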

EAGLE acceptance rate ≈ 1.4–1.7 verified tokens per main forward (theoretical max = 4 with num_draft_tokens=4), monotonically declining with conc as expected.
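
For readers who want to re-derive the speedup columns, the following is a minimal sketch that recomputes them from the raw tok/s and TPOT values in the table. The function name `speedups` and the tuple layout are illustrative choices, not part of the benchmark tooling.

```python
# Rows copied from the table above:
# (conc, non-MTP tok/s, MTP tok/s, non-MTP TPOT ms, MTP TPOT ms)
ROWS = [
    (1, 223.12, 367.33, 8.72, 5.05),
    (2, 421.49, 712.84, 9.14, 5.27),
    (4, 781.42, 1226.04, 9.75, 6.02),
    (8, 1425.58, 2046.63, 10.70, 7.22),
    (16, 2545.06, 3413.03, 12.14, 8.65),
    (32, 3980.16, 5367.44, 15.49, 10.78),
    (64, 6217.16, 6645.44, 19.58, 14.05),
]


def speedups(rows):
    """Return {conc: (tok/s speedup, TPOT speedup)}, rounded to 2 decimals."""
    out = {}
    for conc, base_tps, mtp_tps, base_tpot, mtp_tpot in rows:
        # Throughput: higher is better, so divide MTP by non-MTP.
        # TPOT: lower is better, so divide non-MTP by MTP.
        out[conc] = (round(mtp_tps / base_tps, 2), round(base_tpot / mtp_tpot, 2))
    return out


if __name__ == "__main__":
    for conc, (tps_x, tpot_x) in speedups(ROWS).items():
        print(f"conc={conc:>2}  tok/s speedup={tps_x:.2f}x  TPOT speedup={tpot_x:.2f}x")
```

The two ratios are inverted relative to each other because throughput improves upward while TPOT improves downward; both then read as "higher is better".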

Commands used

Serve (non-MTP):

python3 -m sglang.launch_server \
    --attention-backend triton \
    --model-path Qwen/Qwen3.5-397B-A17B \
    --host 0.0.0.0 --port 8888 \
    --tensor-parallel-size 8 --ep-size 1 \
    --trust-remote-code \
    --tokenizer-worker-num 6 \
    --enable-aiter-allreduce-fusion \
    --cuda-graph-max-bs 64 \
    --disable-radix-cache \
    --max-prefill-tokens 32768 \
    --scheduler-recv-interval 30 \
    --mem-fraction-static 0.8 \
    --context-length 2068

Serve (with MTP — adds 4 EAGLE flags):

python3 -m sglang.launch_server \
    --attention-backend triton \
    --model-path Qwen/Qwen3.5-397B-A17B \
    --host 0.0.0.0 --port 8888 \
    --tensor-parallel-size 8 --ep-size 1 \
    --trust-remote-code \
    --tokenizer-worker-num 6 \
    --enable-aiter-allreduce-fusion \
    --cuda-graph-max-bs 64 \
    --disable-radix-cache \
    --max-prefill-tokens 32768 \
    --scheduler-recv-interval 30 \
    --mem-fraction-static 0.8 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --context-length 2068
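
Since the two launch commands differ only in the four EAGLE flags, one way to keep them in sync is to build both from a shared argument list. This is a rough sketch rather than how the benchmark scripts actually assemble the command; `BASE_ARGS`, `EAGLE_ARGS`, and `build_cmd` are hypothetical names, while the flag values are copied from the commands above.

```python
import shlex

# Arguments shared by the non-MTP and MTP commands shown above.
BASE_ARGS = [
    "python3", "-m", "sglang.launch_server",
    "--attention-backend", "triton",
    "--model-path", "Qwen/Qwen3.5-397B-A17B",
    "--host", "0.0.0.0", "--port", "8888",
    "--tensor-parallel-size", "8", "--ep-size", "1",
    "--trust-remote-code",
    "--tokenizer-worker-num", "6",
    "--enable-aiter-allreduce-fusion",
    "--cuda-graph-max-bs", "64",
    "--disable-radix-cache",
    "--max-prefill-tokens", "32768",
    "--scheduler-recv-interval", "30",
    "--mem-fraction-static", "0.8",
]

# The four speculative-decoding flags that only the MTP command adds.
EAGLE_ARGS = [
    "--speculative-algorithm", "EAGLE",
    "--speculative-num-steps", "3",
    "--speculative-eagle-topk", "1",
    "--speculative-num-draft-tokens", "4",
]

# Both commands end with the same context-length setting.
TAIL_ARGS = ["--context-length", "2068"]


def build_cmd(mtp: bool) -> list[str]:
    """Return the launch command as an argv list, with EAGLE flags if mtp is True."""
    return BASE_ARGS + (EAGLE_ARGS if mtp else []) + TAIL_ARGS


if __name__ == "__main__":
    # shlex.join produces a shell-safe one-line rendering of each command.
    print(shlex.join(build_cmd(mtp=False)))
    print(shlex.join(build_cmd(mtp=True)))
```

Keeping the shared flags in one list makes it harder for the paired runs to drift apart on an unrelated setting, which matters when the comparison is meant to isolate the effect of MTP.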

Co-authors

@ChangLiu0709
@chunfangamd

Test plan

  • python3 utils/matrix_logic/generate_sweep_configs.py test-config --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-bf16-mi355x-sglang-mtp --config-files .github/configs/amd-master.yaml validates and resolves both entries to the new image.
  • Local smoke at 1k1k / conc=64 succeeds on mia1-p01-g09 for both none and mtp spec-decoding (results JSONs above are from these runs).
  • Trigger End-to-End Tests GH Action with test-config --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-bf16-mi355x-sglang-mtp --config-files .github/configs/amd-master.yaml and confirm the per-conc result JSONs match the local-smoke numbers within the usual noise envelope.
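
The last check above, comparing CI results against the local-smoke numbers, could be sketched roughly as follows. The result-JSON schema is not shown in this PR, so the metric names `tok_s` and `tpot_ms` are hypothetical, and the 5% relative tolerance is an assumed stand-in for "the usual noise envelope", which is not quantified here.

```python
# Hypothetical metric names; the real result-JSON schema is not shown in this PR.
METRICS = ("tok_s", "tpot_ms")


def within_envelope(local: float, ci: float, rel_tol: float) -> bool:
    """True if the CI value is within rel_tol (relative) of the local value."""
    return abs(ci - local) <= rel_tol * abs(local)


def out_of_envelope(local_runs: dict, ci_runs: dict, rel_tol: float = 0.05) -> dict:
    """Return {conc: [metric, ...]} for metrics whose CI value drifts beyond rel_tol."""
    drift = {}
    for conc, local in local_runs.items():
        ci = ci_runs[conc]
        bad = [m for m in METRICS if not within_envelope(local[m], ci[m], rel_tol)]
        if bad:
            drift[conc] = bad
    return drift


if __name__ == "__main__":
    # Local-smoke MTP numbers at conc=64, taken from the table in this PR.
    local = {64: {"tok_s": 6645.44, "tpot_ms": 14.05}}
    # Placeholder CI numbers, invented purely to exercise the check.
    ci = {64: {"tok_s": 6500.00, "tpot_ms": 15.20}}
    print(out_of_envelope(local, ci))
```

An empty result would mean every metric at every concurrency landed inside the assumed envelope; the tolerance should be replaced with whatever threshold the team actually uses.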

Made with Cursor


Note

Low Risk
Config-only Docker image pin for two benchmark matrix keys; no application or auth logic changes.

Overview
Updates the AMD e2e matrix in .github/configs/amd-master.yaml so Qwen3.5 BF16 single-node MI355X SGLang runs use lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x instead of the dated lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 image.

The same image bump applies to qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp; model, runner, and fixed-seq-len search spaces are unchanged. The intent is to align CI with the build where MTP / EAGLE acceleration was validated on hardware.

Reviewed by Cursor Bugbot for commit 6c9c79f.

…ost1

Pin both qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp
to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x (was
lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517) so the e2e matrix
runs on the image where we already measured the MTP EAGLE acceleration.

Measured on a single MI355X (mia1-p01-g09), Qwen/Qwen3.5-397B-A17B,
1k/1k, TP=8, EP=1, no DP-attn, --attention-backend triton, EAGLE
num_steps=3 / eagle_topk=1 / num_draft_tokens=4. MTP delivers
+34..69% total token throughput and -28..42% median TPOT over non-MTP
for conc 1..32; the conc=64 row is depressed on tok/s (+6.9%) because
EAGLE silently caps max_running_requests=48 and 16 of 64 requests queue
(TPOT speedup unchanged at 1.39x).

Co-authored-by: Cursor <cursoragent@cursor.com>
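
The conc=64 explanation in the commit message comes down to simple arithmetic, sketched below. The cap of 48 running requests is taken from the commit message as stated; the helper names `queued_requests` and `pct_change` are illustrative only.

```python
def queued_requests(concurrency: int, max_running: int) -> int:
    """Requests that must wait because the server caps concurrent running requests."""
    return max(0, concurrency - max_running)


def pct_change(new: float, old: float) -> float:
    """Signed percentage change of new relative to old."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    # Cap of 48 as reported in the commit message for the EAGLE configuration.
    print("queued at conc=64:", queued_requests(64, 48))
    # tok/s at conc=64: MTP 6645.44 vs non-MTP 6217.16 (from the PR table).
    print(f"tok/s change at conc=64: {pct_change(6645.44, 6217.16):+.1f}%")
    # Median TPOT at conc=1: MTP 5.05 ms vs non-MTP 8.72 ms (from the PR table).
    print(f"TPOT change at conc=1: {pct_change(5.05, 8.72):+.1f}%")
```

At concurrencies at or below the cap, `queued_requests` returns 0, which is consistent with the conc 1..32 rows not showing the same throughput depression.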
github-actions Bot commented Jun 5, 2026

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@claude claude Bot left a comment

Contributor

LGTM, straightforward image tag bump for two benchmark matrix entries.

Extended reasoning...

Overview

This PR updates the docker image tag for the qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp entries in .github/configs/amd-master.yaml from lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x to align CI with the image where MTP/EAGLE acceleration was validated on hardware. The diff is exactly four lines (two pairs of image: value changes); no model, runner, or search-space metadata changes.

Security risks

None. This is a CI benchmark matrix config; the image references a published lmsysorg tag and there are no auth, crypto, or permissions changes.

Level of scrutiny

Low. This file is a CI-only sweep configuration for AMD benchmarks; the change is purely a tag bump of an existing well-known publisher image. The PR author already validated locally on mia1-p01-g09 and provided detailed perf data. The pattern matches several adjacent recent image bumps in the same file (e.g., the qwen3.5-fp8-mi355x-sglang entries already use the same post1 family).

Other factors

No bugs were flagged by the bug hunter. The bot's recipe reminder is informational and routine for AMD/SGLang PRs. No outstanding reviewer comments.
