[AMD][MI355X] Bump qwen3.5-bf16 single-node SGLang image to v0.5.12.post1 by ChangLiu0709 · Pull Request #1673 · SemiAnalysisAI/InferenceX

ChangLiu0709 · 2026-06-05T20:48:52Z

Summary

This PR is motivated by a performance drop observed in qwen3.5-bf16-mi355x-sglang-mtp relative to qwen3.5-bf16-mi355x-sglang; see the issue for details.

The fix updates the Docker image for both qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp from lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x. The perf drop is potentially caused by a regression introduced in an SGLang Docker image update or by misaligned Docker settings; the root cause is not confirmed. The e2e smoke test runs on the image where the MTP EAGLE acceleration was empirically validated; see the table below for details.

Measured MTP performance on the new image

Setup: single MI355X node, Qwen/Qwen3.5-397B-A17B, 1k input / 1k output, tp=8, ep=1, dp-attn=false, --attention-backend triton, EAGLE num_steps=3 / eagle_topk=1 / num_draft_tokens=4. 14-run sweep with bench_serving, paired non-MTP and MTP back-to-back per concurrency.

conc	non-MTP tok/s	MTP tok/s	tok/s speedup	non-MTP TPOT (ms)	MTP TPOT (ms)	TPOT speedup
1	223.12	367.33	1.65×	8.72	5.05	1.73×
2	421.49	712.84	1.69×	9.14	5.27	1.73×
4	781.42	1226.04	1.57×	9.75	6.02	1.62×
8	1425.58	2046.63	1.44×	10.70	7.22	1.48×
16	2545.06	3413.03	1.34×	12.14	8.65	1.40×
32	3980.16	5367.44	1.35×	15.49	10.78	1.44×
64	6217.16	6645.44	1.07×	19.58	14.05	1.39×
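
The speedup columns can be rechecked from the raw columns. The following is a minimal sketch, not part of the PR's tooling: the row values are copied from the table above, while the names `ROWS` and `speedups` are illustrative. Throughput speedup is MTP divided by non-MTP; TPOT speedup is the inverse ratio because lower TPOT is better.

```python
# Rows copied from the table above:
# (conc, non-MTP tok/s, MTP tok/s, non-MTP TPOT ms, MTP TPOT ms)
ROWS = [
    (1, 223.12, 367.33, 8.72, 5.05),
    (2, 421.49, 712.84, 9.14, 5.27),
    (4, 781.42, 1226.04, 9.75, 6.02),
    (8, 1425.58, 2046.63, 10.70, 7.22),
    (16, 2545.06, 3413.03, 12.14, 8.65),
    (32, 3980.16, 5367.44, 15.49, 10.78),
    (64, 6217.16, 6645.44, 19.58, 14.05),
]


def speedups(rows):
    """Return {conc: (tok/s speedup, TPOT speedup)}, rounded to 2 decimals."""
    out = {}
    for conc, base_tps, mtp_tps, base_tpot, mtp_tpot in rows:
        # Throughput: higher is better, so MTP / baseline.
        # TPOT: lower is better, so baseline / MTP.
        out[conc] = (round(mtp_tps / base_tps, 2), round(base_tpot / mtp_tpot, 2))
    return out


if __name__ == "__main__":
    for conc, (tps_x, tpot_x) in speedups(ROWS).items():
        print(f"conc={conc:>2}  tok/s x{tps_x:.2f}  TPOT x{tpot_x:.2f}")
```

The recomputed ratios agree with the speedup columns as reported.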

EAGLE acceptance rate ≈ 1.4–1.7 verified tokens per main forward pass (theoretical max = 4 with num_draft_tokens=4), monotonically declining with concurrency as expected.

Commands used

Serve (non-MTP):

python3 -m sglang.launch_server \
    --attention-backend triton \
    --model-path Qwen/Qwen3.5-397B-A17B \
    --host 0.0.0.0 --port 8888 \
    --tensor-parallel-size 8 --ep-size 1 \
    --trust-remote-code \
    --tokenizer-worker-num 6 \
    --enable-aiter-allreduce-fusion \
    --cuda-graph-max-bs 64 \
    --disable-radix-cache \
    --max-prefill-tokens 32768 \
    --scheduler-recv-interval 30 \
    --mem-fraction-static 0.8 \
    --context-length 2068

Serve (with MTP — adds 4 EAGLE flags):

python3 -m sglang.launch_server \
    --attention-backend triton \
    --model-path Qwen/Qwen3.5-397B-A17B \
    --host 0.0.0.0 --port 8888 \
    --tensor-parallel-size 8 --ep-size 1 \
    --trust-remote-code \
    --tokenizer-worker-num 6 \
    --enable-aiter-allreduce-fusion \
    --cuda-graph-max-bs 64 \
    --disable-radix-cache \
    --max-prefill-tokens 32768 \
    --scheduler-recv-interval 30 \
    --mem-fraction-static 0.8 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --context-length 2068
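
Since the two serve commands differ only in the four EAGLE flags, they can be generated from one shared argument list so the paired runs cannot drift apart. This is a minimal sketch under that assumption; all flags and values are copied from the commands above, and the names `BASE_ARGS`, `EAGLE_ARGS`, `CONTEXT_ARGS`, and `build_command` are illustrative, not part of the repository.

```python
import shlex

# Arguments shared by both runs, copied from the commands above.
BASE_ARGS = [
    "python3", "-m", "sglang.launch_server",
    "--attention-backend", "triton",
    "--model-path", "Qwen/Qwen3.5-397B-A17B",
    "--host", "0.0.0.0", "--port", "8888",
    "--tensor-parallel-size", "8", "--ep-size", "1",
    "--trust-remote-code",
    "--tokenizer-worker-num", "6",
    "--enable-aiter-allreduce-fusion",
    "--cuda-graph-max-bs", "64",
    "--disable-radix-cache",
    "--max-prefill-tokens", "32768",
    "--scheduler-recv-interval", "30",
    "--mem-fraction-static", "0.8",
]

# The four flags that enable MTP via EAGLE speculative decoding.
EAGLE_ARGS = [
    "--speculative-algorithm", "EAGLE",
    "--speculative-num-steps", "3",
    "--speculative-eagle-topk", "1",
    "--speculative-num-draft-tokens", "4",
]

# Kept last to mirror the ordering of the original commands.
CONTEXT_ARGS = ["--context-length", "2068"]


def build_command(mtp: bool) -> list[str]:
    """Return the argv list for the non-MTP or MTP server launch."""
    return BASE_ARGS + (EAGLE_ARGS if mtp else []) + CONTEXT_ARGS


if __name__ == "__main__":
    print(shlex.join(build_command(mtp=False)))
    print(shlex.join(build_command(mtp=True)))
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell quoting issues.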

Co-authors

@ChangLiu0709
@chunfangamd

Test plan

python3 utils/matrix_logic/generate_sweep_configs.py test-config --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-bf16-mi355x-sglang-mtp --config-files .github/configs/amd-master.yaml validates and resolves both entries to the new image.
Local smoke at 1k1k / conc=64 succeeds on mia1-p01-g09 for both none and mtp spec-decoding (results JSONs above are from these runs).
Trigger End-to-End Tests GH Action with test-config --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-bf16-mi355x-sglang-mtp --config-files .github/configs/amd-master.yaml and confirm the per-conc result JSONs match the local-smoke numbers within the usual noise envelope.
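
The last step, confirming that CI results match the local-smoke numbers, could be automated along the following lines. This is a hedged sketch only: the PR does not define the "usual noise envelope" or the result JSON schema, so the 5% default tolerance, the function name `outside_noise`, and the plain `{conc: tok/s}` dictionaries are all assumptions made for illustration.

```python
def outside_noise(local, ci, rel_tol=0.05):
    """Return {conc: relative deviation} for entries that differ by more than rel_tol.

    `local` and `ci` map concurrency to a metric such as total tok/s.
    A concurrency missing from `ci` is reported with a deviation of None.
    rel_tol=0.05 is an assumed placeholder, not a value from the PR.
    """
    failures = {}
    for conc, local_val in local.items():
        if conc not in ci:
            failures[conc] = None
            continue
        deviation = abs(ci[conc] - local_val) / local_val
        if deviation > rel_tol:
            failures[conc] = deviation
    return failures


if __name__ == "__main__":
    # Local MTP tok/s at conc=64 is taken from the table in the PR description.
    local = {64: 6645.44}
    # Placeholder CI value (local + 2%), used only to exercise the check.
    ci = {64: 6645.44 * 1.02}
    print(outside_noise(local, ci))
```

An empty dictionary means every concurrency level is within the chosen tolerance.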

Made with Cursor

Note

Low Risk
Config-only Docker image pin for two benchmark matrix keys; no application or auth logic changes.

Overview
Updates the AMD e2e matrix in .github/configs/amd-master.yaml so Qwen3.5 BF16 single-node MI355X SGLang runs use lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x instead of the dated lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 image.

The same image bump applies to qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp; model, runner, and fixed-seq-len search spaces are unchanged. The intent is to align CI with the build where MTP / EAGLE acceleration was validated on hardware.

Reviewed by Cursor Bugbot for commit 6c9c79f. Bugbot is set up for automated code reviews on this repo.

…ost1

Pin both qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x (was lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517) so the e2e matrix runs on the image where we already measured the MTP EAGLE acceleration.

Measured on a single MI355X (mia1-p01-g09), Qwen/Qwen3.5-397B-A17B, 1k/1k, TP=8, EP=1, no DP-attn, --attention-backend triton, EAGLE num_steps=3 / eagle_topk=1 / num_draft_tokens=4. MTP delivers +34..69% total token throughput and -28..42% median TPOT over non-MTP for conc 1..32; the conc=64 row is depressed on tok/s (+6.9%) because EAGLE silently caps max_running_requests=48 and 16 of 64 requests queue (TPOT speedup unchanged at 1.39x).

Co-authored-by: Cursor <cursoragent@cursor.com>
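
The percentages in this commit message follow from the table in the PR description. As a small worked check (function names here are illustrative; the input numbers are the table's conc=2, 16, and 64 rows plus the 48-request cap stated above):

```python
def pct_gain(baseline, new):
    """Percent increase of `new` over `baseline` (used for throughput)."""
    return (new / baseline - 1.0) * 100.0


def pct_reduction(baseline, new):
    """Percent decrease of `new` relative to `baseline` (used for TPOT)."""
    return (1.0 - new / baseline) * 100.0


def queued_requests(concurrency, max_running):
    """Requests that must wait when concurrency exceeds the running cap."""
    return max(0, concurrency - max_running)


if __name__ == "__main__":
    # Throughput gain range over conc 1..32: largest at conc=2, smallest at conc=16.
    print(f"conc=2  tok/s gain: {pct_gain(421.49, 712.84):.1f}%")
    print(f"conc=16 tok/s gain: {pct_gain(2545.06, 3413.03):.1f}%")
    # TPOT reduction range over conc 1..32: largest at conc=2, smallest at conc=16.
    print(f"conc=2  TPOT reduction: {pct_reduction(9.14, 5.27):.1f}%")
    print(f"conc=16 TPOT reduction: {pct_reduction(12.14, 8.65):.1f}%")
    # conc=64: throughput gain and the number of requests left queued.
    print(f"conc=64 tok/s gain: {pct_gain(6217.16, 6645.44):.1f}%")
    print(f"conc=64 queued: {queued_requests(64, 48)}")
```

Note that a TPOT speedup of s corresponds to a reduction of 1 - 1/s, so the 1.40× to 1.73× speedups map to roughly 28% to 42% lower TPOT.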

github-actions · 2026-06-05T20:49:00Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude

LGTM, straightforward image tag bump for two benchmark matrix entries.

Extended reasoning...

Overview

This PR updates the docker image tag for the qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp entries in .github/configs/amd-master.yaml from lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517 to lmsysorg/sglang:v0.5.12.post1-rocm720-mi35x to align CI with the image where MTP/EAGLE acceleration was validated on hardware. The diff is exactly four lines (two pairs of image: value changes); no model, runner, or search-space metadata changes.

Security risks

None. This is a CI benchmark matrix config; the image references a published lmsysorg tag and there are no auth, crypto, or permissions changes.

Level of scrutiny

Low. This file is a CI-only sweep configuration for AMD benchmarks; the change is purely a tag bump of an existing well-known publisher image. The PR author already validated locally on mia1-p01-g09 and provided detailed perf data. The pattern matches several adjacent recent image bumps in the same file (e.g., the qwen3.5-fp8-mi355x-sglang entries already use the same post1 family).

Other factors

No bugs were flagged by the bug hunter. The bot's recipe reminder is informational and routine for AMD/SGLang PRs. No outstanding reviewer comments.

ChangLiu0709 requested a review from a team June 5, 2026 20:48

ChangLiu0709 requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners June 5, 2026 20:48

github-project-automation Bot added this to InferenceMAX Board Jun 5, 2026

claude Bot reviewed Jun 5, 2026

ChangLiu0709 mentioned this pull request Jun 8, 2026

[Bug] qwen3.5 bf16 MTP broken sgl-project/sglang#23123

Open

5 tasks

ChangLiu0709 wants to merge 1 commit into main from chang/qwen3.5-bf16-sglang-mtp-perf-drop-fix