[AMD] dsv4-fp4-mi355x-atom-disagg, add multi-node ATOM/mooncake disaggregation support#1683
seungrokj wants to merge 8 commits
Conversation
- server_atom.sh: new launcher using atom.entrypoints.openai_server with mooncake kv-transfer-config, atomesh router, dynamic xP/yD topology
- env_atom.sh: ATOM-specific env (mooncake LD_LIBRARY_PATH, ATOM_MOE_GU_ITLV, etc.)
- server.sh: add atom-disagg dispatch branch
- job.slurm: add atom-disagg DOCKER_ENV_ENGINE block, fix FRAMEWORK->ENGINE for RDMA mounts guard, route atom-disagg to models_atom.yaml
- models_atom.yaml: per-model configs for ATOM engine
- dsv4_fp4_mi355x_atom-disagg.sh: submit script for DSv4 2P+1D atom-disagg run
- amd-master.yaml: add atom-disagg sweep config
- launch_mi355x-amds.sh: exclude mia1-p01-g37
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
- else
-   MODELS_YAML="$(pwd)/models.yaml"
+ elif [[ "$ENGINE" == "atom-disagg" ]]; then
+   MODELS_YAML="$(pwd)/models_atom.yaml"
sglang models.yaml path dropped
High Severity
Replacing the final else that set MODELS_YAML to models.yaml with only vllm-disagg and atom-disagg branches leaves the default ENGINE (sglang-disagg) without a MODELS_YAML value, so model validation fails before the job starts.
Reviewed by Cursor Bugbot for commit 4d6b3f3. Configure here.
bnxt_re*) ((bnxt++)) ;; mlx5*) ((mlx5++)) ;; ionic*) ((ionic++)) ;;
*)
  local drv; drv=\$(basename \"\$(readlink -f \"\$dev/device/driver\" 2>/dev/null)\" 2>/dev/null || true)
  case \"\$drv\" in bnxt*) ((bnxt++)) ;; mlx5*) ((mlx5++)) ;; ionic*) ((ionic++)) ;; esac ;;
set -e breaks NIC counting
Medium Severity
Inside the new RDMA helper, ((bnxt++)) / ((ionic++)) under set -euo pipefail can exit with status 1 when a counter is still zero, aborting atom-disagg container startup during NIC detection.
Reviewed by Cursor Bugbot for commit 4d6b3f3. Configure here.
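The behaviour flagged here comes from how bash reports the status of arithmetic commands: `(( expr ))` returns 1 when the expression evaluates to 0, and a post-increment evaluates to the old value. The following is a minimal sketch, not the PR's code; the device names are made up, and the unsafe form runs in a child bash process so the demo itself keeps going.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Unsafe form: n is 0, so (( n++ )) evaluates to 0 and returns status 1;
# with `set -e` the child shell exits before the echo.
rc=0
bash -c 'set -e; n=0; (( n++ )); echo "not reached"' || rc=$?
echo "unsafe form exit status: $rc"

# Safe form: an arithmetic assignment always returns status 0.
# (Another option is to keep (( n++ )) and append `|| true`.)
bnxt=0; mlx5=0; ionic=0
for dev in bnxt_re0 mlx5_0 ionic_0 ionic_1; do   # hypothetical device names
  case "$dev" in
    bnxt_re*) bnxt=$((bnxt + 1)) ;;
    mlx5*)    mlx5=$((mlx5 + 1)) ;;
    ionic*)   ionic=$((ionic + 1)) ;;
  esac
done
echo "bnxt=$bnxt mlx5=$mlx5 ionic=$ionic"
```

Only the first increment of each counter is at risk, which is why the problem can stay hidden on hosts where the first device seen happens to hit a different code path.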
launch_mi355x-amds.sh only routes to benchmarks/multi_node/ when FRAMEWORK==atom-disagg. With framework: atom, it fell into single_node/ and called the single-node atom script, which requires TP/CONC/EP_SIZE/DP_ATTENTION, vars not set for disagg configs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
  --node-ips ${IPADDRS} \
  --node-ports ${PREFILL_PORT} \
  --wait-for-all-ports \
  --timeout 1800"
Wrong port readiness barrier
High Severity
Node 0 waits for every address in IPADDRS to open PREFILL_PORT, but decode nodes only listen on DECODE_PORT. Decode hosts never expose the prefill port, so the barrier times out and the atomesh router never starts on multi-node disagg topologies like 2P+1D.
Reviewed by Cursor Bugbot for commit ff429df. Configure here.
# Determine which prefill worker this node belongs to, and its headnode IP
prefill_worker_idx=$((NODE_RANK / PREFILL_NODES_PER_WORKER))
PREFILL_HEADNODE_IP="${PREFILL_IPS[$prefill_worker_idx]}"
Unused prefill head IP
Low Severity
PREFILL_HEADNODE_IP is assigned from PREFILL_IPS on secondary prefill ranks but never referenced afterward, unlike the analogous dist-init logic in server_sglang.sh for multi-node prefill workers.
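For context, the index computation in the snippet maps consecutive node ranks onto prefill workers by integer division. Below is a minimal sketch with made-up IPs, assuming `PREFILL_IPS` holds one head-node IP per prefill worker as the snippet's indexing suggests; how the value should be consumed afterwards is not shown in this thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical topology: 2 prefill workers, each spanning 2 nodes.
PREFILL_NODES_PER_WORKER=2
PREFILL_IPS=("10.0.0.1" "10.0.0.3")   # assumed: head-node IP of each worker

head_ip_for_rank() {
  local node_rank=$1
  # Integer division: ranks 0-1 map to worker 0, ranks 2-3 to worker 1.
  local idx=$(( node_rank / PREFILL_NODES_PER_WORKER ))
  printf '%s\n' "${PREFILL_IPS[$idx]}"
}

for rank in 0 1 2 3; do
  echo "rank $rank -> head $(head_ip_for_rank "$rank")"
done
```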
Reviewed by Cursor Bugbot for commit ff429df. Configure here.
image: rocm/atom-dev:nightly_202606071111-Jasen-fix_dockerfile
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: mi355x
🔴 The new dsv4-fp4-mi355x-atom-disagg entry has framework: atom, but the companion edit to runners/launch_mi355x-amds.sh:80 (also in this PR) explicitly routes only atom-disagg (not atom) to BENCHMARK_SUBDIR=multi_node, and line 79 builds SCRIPT_NAME=dsv4_fp4_mi355x_${FRAMEWORK}.sh — so with framework: atom the launcher resolves to the existing single-node script benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_atom.sh instead of the newly-added benchmarks/multi_node/dsv4_fp4_mi355x_atom-disagg.sh. Every other multi-node disagg recipe in this file uses <engine>-disagg (sglang-disagg / vllm-disagg). Fix: change framework: atom → framework: atom-disagg on line 2725.
Extended reasoning...
What the bug is
.github/configs/amd-master.yaml defines a new sweep recipe dsv4-fp4-mi355x-atom-disagg with multinode: true and disagg: true, but sets framework: atom (line 2725). The value should be atom-disagg to match the <engine>-disagg convention used by every other multi-node disagg recipe in this file (e.g. sglang-disagg, vllm-disagg).
Why this is broken end-to-end
The framework field flows directly into the runner as FRAMEWORK. In runners/launch_mi355x-amds.sh, the multinode branch does two things that both key off this value:
# line 79
SCRIPT_NAME="${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}.sh"
# line 80 (this PR explicitly added the 'atom-disagg' clause)
if [[ "$FRAMEWORK" == "sglang-disagg" ]] || [[ "$FRAMEWORK" == "vllm-disagg" ]] || [[ "$FRAMEWORK" == "atom-disagg" ]]; then
  BENCHMARK_SUBDIR="multi_node"
else
  BENCHMARK_SUBDIR="single_node/fixed_seq_len"
fi
The fact that this same PR added atom-disagg to line 80 makes the intended value self-evident: if the YAML were meant to say atom, the runner edit wouldn't be needed at all.
Step-by-step proof
With the current framework: atom value in the new entry, here's what happens when the sweep dispatches:
- EXP_NAME=dsv4-fp4-mi355x-atom-disagg, PRECISION=fp4, FRAMEWORK=atom.
- Line 79: SCRIPT_NAME = "${EXP_NAME%%_*}_fp4_mi355x_atom.sh". ${EXP_NAME%%_*} strips the longest _* suffix, but EXP_NAME uses hyphens, not underscores, so the result is dsv4-fp4-mi355x-atom-disagg. Either way, the ${FRAMEWORK} substitution gives ...mi355x_atom.sh instead of ...mi355x_atom-disagg.sh.
- Line 80: FRAMEWORK=atom does not match any of sglang-disagg | vllm-disagg | atom-disagg → BENCHMARK_SUBDIR=single_node/fixed_seq_len.
- Final dispatch: bash benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_atom.sh. That file already exists on main (it's the existing single-node atom recipe for DSv4), so the runner silently runs the wrong, single-node script instead of the newly-added benchmarks/multi_node/dsv4_fp4_mi355x_atom-disagg.sh.
The newly-added multi-node launcher (which is the point of this PR) is never invoked.
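The dispatch described in the steps above can be reproduced in isolation. This sketch mirrors the two quoted launcher lines inside a hypothetical wrapper function; the sample `EXP_NAME` value `dsv4_1k1k` is an assumption chosen only to show what `%%_*` strips, not a value taken from the PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper around the launcher's SCRIPT_NAME / BENCHMARK_SUBDIR logic.
resolve_script() {
  local exp_name=$1 precision=$2 framework=$3 subdir
  # %%_* removes the longest suffix that starts at an underscore.
  local script_name="${exp_name%%_*}_${precision}_mi355x_${framework}.sh"
  case "$framework" in
    sglang-disagg|vllm-disagg|atom-disagg) subdir="multi_node" ;;
    *)                                     subdir="single_node/fixed_seq_len" ;;
  esac
  printf 'benchmarks/%s/%s\n' "$subdir" "$script_name"
}

resolve_script dsv4_1k1k fp4 atom          # resolves to the single-node script
resolve_script dsv4_1k1k fp4 atom-disagg   # resolves to the multi-node script
```

With `framework: atom` both the script name and the subdirectory change together, which is why the wrong script is found and run without any error.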
Impact
The new disagg recipe is non-functional as written. CI will appear to schedule the sweep but will dispatch the existing single-node atom benchmark, producing misleading results that look like atom-disagg numbers but are actually single-node atom runs.
Fix
One-line change in .github/configs/amd-master.yaml (line 2725):
- framework: atom
+ framework: atom-disagg
After the fix: SCRIPT_NAME resolves to dsv4_fp4_mi355x_atom-disagg.sh (the file added in this PR), BENCHMARK_SUBDIR=multi_node, and the multinode dispatch chain (job.slurm → server.sh → server_atom.sh), all of which test ENGINE/FRAMEWORK == "atom-disagg", kicks in correctly.
  if [[ "$ENGINE" == "vllm-disagg" ]]; then
    MODELS_YAML="$(pwd)/models_vllm.yaml"
- else
-   MODELS_YAML="$(pwd)/models.yaml"
+ elif [[ "$ENGINE" == "atom-disagg" ]]; then
+   MODELS_YAML="$(pwd)/models_atom.yaml"
  fi
🔴 Regression: MODELS_YAML is unset for the default ENGINE=sglang-disagg, breaking every existing sglang-disagg recipe. The new if/elif/fi at benchmarks/multi_node/amd_utils/job.slurm:26-30 dropped the else MODELS_YAML="$(pwd)/models.yaml" branch, so when ENGINE is sglang-disagg (the default at line 11) neither arm assigns MODELS_YAML. The very next check at line 32 ([[ ! -f "$MODELS_YAML" ]]) then fires for the empty path and aborts with Error: models YAML not found at . Fix: add else MODELS_YAML="$(pwd)/models.yaml" before the fi to restore the sglang-disagg default.
Extended reasoning...
Bug
benchmarks/multi_node/amd_utils/job.slurm:11 defaults the engine: ENGINE="${ENGINE:-sglang-disagg}". The previous MODELS_YAML selection had a two-arm if/else that always assigned MODELS_YAML — models_vllm.yaml for vllm-disagg, else the SGLang default models.yaml. This PR replaced the else with an elif for the new atom-disagg engine, leaving no fallback assignment:
if [[ "$ENGINE" == "vllm-disagg" ]]; then
  MODELS_YAML="$(pwd)/models_vllm.yaml"
elif [[ "$ENGINE" == "atom-disagg" ]]; then
  MODELS_YAML="$(pwd)/models_atom.yaml"
fi
When ENGINE=sglang-disagg (the default), neither branch matches and MODELS_YAML stays unset. The top-level script does not enable set -u (only the inner srun bash -lc does), so $MODELS_YAML expands to the empty string.
Trigger
Immediately after, at job.slurm:32-35:
if [[ ! -f "$MODELS_YAML" ]]; then
  echo "Error: models YAML not found at $MODELS_YAML"
  exit 1
fi
[[ ! -f "" ]] evaluates true (the empty string is not a regular file), so the script prints Error: models YAML not found at (trailing empty) and exits 1.
Step-by-step proof
- User submits any existing sglang-disagg recipe (e.g. dsr1-fp4-mi355x-sglang-disagg from .github/configs/amd-master.yaml). The launcher does not override ENGINE.
- job.slurm:11 runs ENGINE="${ENGINE:-sglang-disagg}" → ENGINE=sglang-disagg.
- Line 26-30: $ENGINE is neither vllm-disagg nor atom-disagg. The if/elif/fi falls through with no else clause. MODELS_YAML is never assigned.
- Line 32: [[ ! -f "$MODELS_YAML" ]] expands to [[ ! -f "" ]] → true.
- Line 33: prints Error: models YAML not found at to stderr.
- Line 34: exit 1, so the job aborts before any docker/srun work.
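The fall-through in these steps is easy to reproduce outside Slurm. Below is a minimal sketch, assuming a hypothetical helper `select_models_yaml` that is not part of the PR; it uses a `case` statement with a default arm, which is one way to guarantee that every engine value yields a path.

```shell
#!/usr/bin/env bash
set -euo pipefail

# An empty path is never a regular file, which is what trips the -f check.
unset MODELS_YAML
if [[ ! -f "${MODELS_YAML:-}" ]]; then
  echo "empty MODELS_YAML fails the -f check"
fi

# Hypothetical helper: the default arm keeps sglang-disagg (and any other
# engine) pointed at models.yaml.
select_models_yaml() {
  local engine=${1:-sglang-disagg}   # same default as the quoted job.slurm line
  case "$engine" in
    vllm-disagg) echo "$PWD/models_vllm.yaml" ;;
    atom-disagg) echo "$PWD/models_atom.yaml" ;;
    *)           echo "$PWD/models.yaml" ;;
  esac
}

for e in sglang-disagg vllm-disagg atom-disagg; do
  echo "$e -> $(select_models_yaml "$e")"
done
```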
Impact
This regresses every pre-existing sglang-disagg recipe in .github/configs/amd-master.yaml:
- dsr1-fp4-mi355x-sglang-disagg
- dsr1-fp8-mi355x-sglang-disagg (and -mtp variant)
- dsr1-fp4-mi355x-sglang-disagg-mtp, -1k1k-mtp, -8k1k-mtp
- glm5-fp8-mi355x-sglang-disagg
- qwen3.5-fp4-mi355x-sglang-disagg
- qwen3.5-fp8-mi355x-sglang-disagg
models.yaml is still on disk (benchmarks/multi_node/amd_utils/models.yaml, 14125 bytes), so this is purely a dispatch regression — the file the original code expected exists, but the script no longer points at it.
Fix
Add an else branch restoring the sglang default:
if [[ "$ENGINE" == "vllm-disagg" ]]; then
  MODELS_YAML="$(pwd)/models_vllm.yaml"
elif [[ "$ENGINE" == "atom-disagg" ]]; then
  MODELS_YAML="$(pwd)/models_atom.yaml"
else
  MODELS_YAML="$(pwd)/models.yaml"
fi

# Wait for all prefill and decode servers to be ready
echo "Waiting for all servers to be up..."
BARRIER_CMD="python3 $ATOM_WS_PATH/sync.py barrier \
  --node-ips ${IPADDRS} \
  --node-ports ${PREFILL_PORT} \
  --wait-for-all-ports \
  --timeout 1800"

if [[ "$DRY_RUN" -eq 1 ]]; then
  echo "DRY RUN: $BARRIER_CMD"
else
  eval "$BARRIER_CMD"
fi
🔴 Rank 0's post-prefill barrier passes --node-ports ${PREFILL_PORT} (single value) to sync.py barrier, which broadcasts that one port to every IP in IPADDRS — including the decode nodes, which only ever bind DECODE_PORT (8020 by default, not 8010). The barrier therefore polls <decode-ip>:8010 indefinitely, times out after 1800s, and rank 0 never starts the router — so the new atom-disagg recipe (yD≥1, including the 2P+1D dsv4 config added in this PR) cannot complete startup. Fix: pass a per-node port list aligned with IPADDRS (PREFILL_PORT for prefill IPs, DECODE_PORT for decode IPs), or run two separate barriers.
Extended reasoning...
Bug
In benchmarks/multi_node/amd_utils/server_atom.sh, rank 0 issues this barrier after launching its prefill server (lines 156-168):
BARRIER_CMD="python3 $ATOM_WS_PATH/sync.py barrier \
  --node-ips ${IPADDRS} \
  --node-ports ${PREFILL_PORT} \
  --wait-for-all-ports \
  --timeout 1800"
It passes all node IPs (IPADDRS is the comma-joined union of prefill + decode IPs, set in job.slurm from IPS[*] across SELECTED_NODES) but a single port value (PREFILL_PORT, default 8010).
Why it fires
sync.py cmd_barrier (lines 51-52) explicitly broadcasts a single-port argument to every IP:
if len(NODE_PORTS) == 1:
    NODE_PORTS *= len(NODE_IPS)
So the resulting check is is_port_open(ip, 8010) for every IP in IPADDRS. But the decode branch of this same script launches the decode server with only:
--host 0.0.0.0 --server-port ${DECODE_PORT} # default 8020
— it never binds PREFILL_PORT. The script itself documents the split in its header comment (Prefill port: $PREFILL_PORT (default 8010) / Decode port: $DECODE_PORT (default 8020)). After binding DECODE_PORT, the decode node enters its own wait for router barrier on NODE0_ADDR:ROUTER_PORT — it never opens 8010.
The barrier hits time.time() - start_time >= timeout (sync.py:87-94) and sys.exit(1). eval "$BARRIER_CMD" propagates the non-zero rc on rank 0, the router is never launched, and decode/non-rank-0 prefill nodes (which are themselves blocked on the router-port barrier at NODE0_ADDR:ROUTER_PORT) eventually time out as well.
Step-by-step proof (dsv4 2P+1D recipe added in this PR)
- The new sweep config (amd-master.yaml → dsv4-fp4-mi355x-atom-disagg) sets xP=2, yD=1, so NUM_NODES=3 in job.slurm.
- job.slurm selects 3 nodes; IPADDRS="ip_p0,ip_p1,ip_d0" (prefill node 0, prefill node 1, decode node 0).
- Rank 0 (ip_p0) launches its prefill server bound to 0.0.0.0:8010 and then runs the barrier with --node-ips ip_p0,ip_p1,ip_d0 --node-ports 8010.
- sync.py cmd_barrier expands NODE_PORTS to [8010, 8010, 8010] and starts checking each (ip, port) pair.
- ip_p0:8010 ✓ (rank 0's own prefill).
- ip_p1:8010 ✓ once rank-1 prefill finishes loading.
- ip_d0:8010 never opens. Rank NODE_OFFSET=2 is the decode node, which only binds 0.0.0.0:8020 (DECODE_PORT) and then waits on NODE0_ADDR:ROUTER_PORT.
- After 1800s, sync.py exits 1 with ERROR: Timeout after 1800 seconds waiting for ports to open listing ip_d0:8010. Rank 0 skips router launch; decode/non-rank-0 prefill nodes are stuck on the NODE0_ADDR:ROUTER_PORT barrier and eventually fail with their own 1800s timeout. The disagg topology never assembles.
Why the sister scripts do not hit this
server_sglang.sh happens to use port 8000 for both the prefill and decode roles, so broadcasting a single port across all IPs is correct there. The atom port split (8010 / 8020) — explicitly introduced by this PR — is incompatible with the single-port-broadcast pattern this script copied from the sglang variant. Any topology with yD>=1 (which is every real recipe — the only sweep added here is 2P+1D) hits this.
Fix
Use sync.py's already-supported per-node port list (the elif len(NODE_PORTS) != len(NODE_IPS) branch at sync.py:53 confirms a same-length list is the documented API). For example, build a parallel ports list:
PORTS_LIST=""
for ip in "${PREFILL_IPS[@]}"; do PORTS_LIST+="${PREFILL_PORT},"; done
for ip in "${DECODE_IPS[@]}"; do PORTS_LIST+="${DECODE_PORT},"; done
PORTS_LIST="${PORTS_LIST%,}"
BARRIER_CMD="python3 $ATOM_WS_PATH/sync.py barrier \
  --node-ips ${IPADDRS} \
  --node-ports ${PORTS_LIST} \
  --wait-for-all-ports \
  --timeout 1800"
Alternatively, run two sequential barriers: one against PREFILL_IPS on PREFILL_PORT, another against DECODE_IPS on DECODE_PORT.
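The two-barrier alternative could look roughly like the sketch below. It reuses only the `sync.py barrier` flags quoted in this thread and relies on the single-port broadcast described above; the IPs, the workspace path, and the helpers `join_by_comma` and `barrier_cmd` are hypothetical, and `DRY_RUN=1` means nothing is actually executed here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical values; in the PR these come from job.slurm / server_atom.sh.
ATOM_WS_PATH=/workspace
PREFILL_PORT=8010
DECODE_PORT=8020
PREFILL_IPS=("10.0.0.1" "10.0.0.2")
DECODE_IPS=("10.0.0.3")
DRY_RUN=1

# Join all arguments with commas.
join_by_comma() { local IFS=,; printf '%s' "$*"; }

# $1 = port shared by the group, remaining args = IPs in that group.
barrier_cmd() {
  local port=$1; shift
  printf 'python3 %s/sync.py barrier --node-ips %s --node-ports %s --wait-for-all-ports --timeout 1800' \
    "$ATOM_WS_PATH" "$(join_by_comma "$@")" "$port"
}

PREFILL_BARRIER=$(barrier_cmd "$PREFILL_PORT" "${PREFILL_IPS[@]}")
DECODE_BARRIER=$(barrier_cmd "$DECODE_PORT" "${DECODE_IPS[@]}")

for cmd in "$PREFILL_BARRIER" "$DECODE_BARRIER"; do
  if [[ "$DRY_RUN" -eq 1 ]]; then
    echo "DRY RUN: $cmd"
  else
    eval "$cmd"
  fi
done
```

Compared with the per-node port list, this keeps each barrier on a single port at the cost of waiting on the two groups one after the other.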
# Enable out-of-tree RDMA library mounts for atom-disagg (mooncake requires host RDMA stack)
RDMA_MOUNTS=()
if [[ "$ENGINE" == "atom-disagg" ]]; then
🔴 Missing fi to close the outer if [[ "$ENGINE" == "atom-disagg" ]]; then block opened at line 443: the gate never closes before # Pre-clean (idempotent) at line 517. The parser has to find the closing fi before bash can run any part of that if command, so the unclosed block swallows the rest of the inner script (through line 597 of the srun bash -lc "…" string) and every rank fails with syntax error: unexpected end of file before it reaches the docker invocation, regardless of ENGINE. This breaks sglang-disagg and vllm-disagg in addition to the new atom-disagg path. Add an fi just before line 517 to close the outer atom-disagg conditional.
Extended reasoning...
What breaks. Line 443 opens if [[ "$ENGINE" == "atom-disagg" ]]; then to gate the entire new RDMA-mount-detection block. Inside that block, the diff defines two helper functions and runs a series of nested if/fi blocks (line 453/463, line 464/466 inside a single-line if at 451, line 479/484, line 481/483 nested inside 479, line 486/509 with internal 495/499 and an elif branch, line 511/515). All of those inner pairs balance correctly. But the outer gate from line 443 is never closed before line 517 # Pre-clean (idempotent), where the script transitions out of atom-disagg-specific logic into the engine-agnostic docker pre-clean / run sequence.

Why every engine breaks, not just atom-disagg. This inner script lives inside an srun … bash -lc "…" string at lines 432-598. The whole quoted body is passed to bash -c as a single argument. bash reads each complete command before executing it, and with no closing fi the if opened at line 443 extends to the end of the string, so the parser hits end of input and reports a syntax error regardless of the runtime value of $ENGINE. That means ENGINE=sglang-disagg and ENGINE=vllm-disagg also fail: even though they would never enter the atom-disagg branch logically, the parser still has to read past the opening if at line 443 looking for a closer that doesn't exist, and nothing after line 443 (including the docker run) is executed.

Why the outer bash -n job.slurm doesn't catch it. The unmatched if lives inside a double-quoted string at the outer level, so outer-script lexical analysis sees it as ordinary string content. The bug only surfaces when bash -c parses the inner content at runtime, which is exactly what every rank does under srun.

Proof. I extracted the body of the bash -lc "…" string (lines 433-597), unescaped \$ → $, \" → ", \\ → \, and \` → ` (standard bash double-quote unescaping rules), and ran bash -n on the result:

```
$ bash -n /tmp/inner_script.sh
/tmp/inner_script.sh: line N: syntax error: unexpected end of file
$ echo $?
2
```

Counting line-leading control tokens in the unescaped script confirms 12 if openers vs 11 fi closers: exactly one missing closer, and the only opener with no matching closer is line 443.

Impact. No docker container will launch on any rank for any disaggregated engine after this PR. srun's bash -c exits with status 2 before reaching the $MAYBE_EXEC $DOCKER_CMD run … line. The PR description's test plan (atom-disagg dry-run, sglang-disagg baseline comparison) cannot complete; nor can any prior sglang-disagg or vllm-disagg CI run that targets this file post-merge.

Fix. Add a closing fi immediately before line 517 to terminate the outer atom-disagg gate, so the if/fi structure becomes if atom-disagg; then [RDMA detection helpers + mount-building blocks]; fi; # Pre-clean (idempotent); …
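The effect of an unclosed `if` inside a `bash -c` string can be checked with a toy inner script. This is a sketch with made-up contents, not the PR's inner script: the guard is false at runtime, yet everything after the `if` is swallowed by the parser and never runs, while the command before it still does.

```shell
#!/usr/bin/env bash

# Inner script with an unclosed `if`; only the structure matters here.
broken='echo "before"
if [[ "${ENGINE:-sglang-disagg}" == "atom-disagg" ]]; then
  echo "atom-only setup"
echo "pre-clean"'

broken_out=$(ENGINE=sglang-disagg bash -c "$broken" 2>/dev/null); broken_rc=$?
echo "broken: status=$broken_rc stdout='$broken_out'"

# Same script with the missing `fi` restored.
fixed='echo "before"
if [[ "${ENGINE:-sglang-disagg}" == "atom-disagg" ]]; then
  echo "atom-only setup"
fi
echo "pre-clean"'

fixed_out=$(ENGINE=sglang-disagg bash -c "$fixed"); fixed_rc=$?
echo "fixed: status=$fixed_rc"
```

Running `bash -n` on the extracted inner script, as the proof above does, catches the same problem without executing anything.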
Model is stored in HF hub format at /it-share/hf-hub-cache, not at /it-share/data/DeepSeek-V4-Pro. Mount the cache into the container and pass MODEL_HF_ID / HF_HUB_CACHE so atom server resolves via HuggingFace hub rather than falling through to the failing SGLang filesystem check.
- job.slurm: add atom-disagg branch in model resolution (skip srun check), add HF_HUB_CACHE/MODEL_HF_ID to DOCKER_ENV_ENGINE, bind-mount /it-share/hf-hub-cache:/hf-hub-cache into the container
- server_atom.sh: use MODEL_HF_ID when set, fallback to MODEL_DIR/MODEL_NAME
- dsv4_fp4_mi355x_atom-disagg.sh: export MODEL_HF_ID from GHA MODEL var
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ine script
The if [[ "$ENGINE" == "atom-disagg" ]]; then block inside the bash -lc heredoc passed to srun was never closed, causing bash to report "unexpected end of file" (syntax error) on every atom-disagg run. Add the missing fi after the RDMA mount logging block.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 6 total unresolved issues (including 4 from previous reviews).
Reviewed by Cursor Bugbot for commit 36268fe. Configure here.
env_atom.sh called ibv_devinfo inside the container to detect RDMA devices, but: (1) IBDEVICES was never passed into the container from job.slurm, and (2) ibv_devinfo may not see the ionic NIC devices in the container environment.
For ATOM/mooncake, IBDEVICES is not needed as a server argument (unlike SGLang's --disaggregation-ib-device). Mooncake uses proxy_ip/handshake_port for KV transfer routing.
- job.slurm: pass IBDEVICES into the container via DOCKER_ENV_ENGINE for atom-disagg
- env_atom.sh: make IBDEVICES detection non-fatal with a warning instead of exit 1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n container
The ATOM Docker image doesn't have the iproute2 `ip` command, so `ip route get 1.1.1.1` failed silently, leaving host_ip empty. This caused:
- sync.py barrier to error: --local-ip expected one argument
- proxy_ip in --kv-transfer-config to be "" instead of the real IP
Fall back to `hostname -I` when `ip` is unavailable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
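A fallback of this kind might be structured as in the sketch below. It is an illustration under assumptions, not the committed code: the function names `parse_src_ip` and `detect_host_ip` are hypothetical, and the way the `src` field is pulled out of the `ip route get` output is one common approach rather than the one the script necessarily uses.

```shell
#!/usr/bin/env bash

# Extract the address that follows "src" in `ip route get` output (stdin).
parse_src_ip() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}

detect_host_ip() {
  local addr=""
  if command -v ip >/dev/null 2>&1; then
    addr=$(ip route get 1.1.1.1 2>/dev/null | parse_src_ip)
  fi
  # Fallback for images without iproute2: first address from hostname -I.
  if [[ -z "$addr" ]]; then
    addr=$(hostname -I 2>/dev/null | awk '{ print $1 }')
  fi
  printf '%s\n' "$addr"
}

host_ip=$(detect_host_ip)
if [[ -z "$host_ip" ]]; then
  # Fail loudly instead of passing an empty value to later commands.
  echo "WARNING: could not determine host IP" >&2
fi
echo "host_ip=${host_ip}"
```

Checking for an empty result right after detection surfaces the problem at its source instead of as an argument-parsing error in a later command.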
Revert HF hub cache approach. atom-disagg now resolves the model the same
way as sglang-disagg: via MODEL_DIR/MODEL_NAME mounted at /models in the
container (MODEL_DIR defaults to /it-share/data on mia1 nodes).
- job.slurm: remove atom-disagg model-path branch (falls through to SGLang
filesystem check), drop HF_HUB_CACHE/MODEL_HF_ID from DOCKER_ENV_ENGINE,
remove /it-share/hf-hub-cache bind-mount
- server_atom.sh: revert --model arg to ${MODEL_DIR}/${MODEL_NAME}
- dsv4_fp4_mi355x_atom-disagg.sh: remove MODEL_HF_ID export
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch from 2P1D to 1P1D (prefill-num-worker: 2 -> 1, PREFILL_NODES=2 -> 1) for initial spot-check of the atom-disagg pipeline.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
hi @seungrokj thanks for the contribution can u add


Summary
- server_atom.sh: multi-node ATOM engine launcher using atom.entrypoints.openai_server with mooncake RDMA KV transfer (--kv-transfer-config) and atomesh router; mirrors server_sglang.sh dynamic xP/yD topology
- env_atom.sh: ATOM/mooncake-specific environment (mooncake LD_LIBRARY_PATH, ATOM_MOE_GU_ITLV=1, AITER_BF16_FP8_MOE_BOUND=0)
- models_atom.yaml: per-model configs for ATOM engine (DeepSeek-V4-Pro initial entry)
- server.sh: add atom-disagg dispatch branch
- job.slurm: add atom-disagg DOCKER_ENV_ENGINE block (ports, MEM_FRACTION, KV_CACHE_DTYPE, BLOCK_SIZE, MAX_NUM_SEQS); fix FRAMEWORK→ENGINE in RDMA mounts guard; route atom-disagg to models_atom.yaml
- dsv4_fp4_mi355x_atom-disagg.sh: submit script for DSv4 2P+1D atom-disagg benchmark on MI355X
- amd-master.yaml: add atom-disagg sweep config
- launch_mi355x-amds.sh: exclude mia1-p01-g37
Test plan
- … ENGINE=atom-disagg with DRY_RUN=1 to verify server/router command construction
- … dsv4_fp4_mi355x_atom-disagg.sh on MI355X cluster (mia1 nodes)
- … kv_producer) and decode (kv_consumer) mooncake handshake completes
- … ROUTER_PORT
🤖 Generated with Claude Code
Co-authored-by: Jasen2201 <Jasen2201@users.noreply.github.com>
Co-authored-by: seungrokj <seungrokj@users.noreply.github.com>
Note
Medium Risk
New multi-node RDMA/mooncake path and Slurm/Docker changes are complex; job.slurm may regress sglang-disagg if models.yaml is unset for that engine.
Overview
Adds multi-node prefill/decode benchmarking for the ATOM engine on MI355X, alongside existing SGLang and vLLM disagg paths.
New atom-disagg plumbing: server_atom.sh launches atom.entrypoints.openai_server with mooncake KV transfer (kv_producer/kv_consumer) and an atomesh PD router on rank 0; env_atom.sh sets mooncake/ATOM env; models_atom.yaml validates models (DeepSeek-V4-Pro); server.sh dispatches to the ATOM launcher.
job.slurm passes ATOM-specific Docker env (ports, mem/KV tuning) and, for atom-disagg, bind-mounts host RDMA userspace libs (ionic/bnxt) so mooncake matches the host NIC stack.
dsv4_fp4_mi355x_atom-disagg.sh wires CI to submit.sh; amd-master.yaml adds dsv4-fp4-mi355x-atom-disagg (1P1D TP8, DP-attn, conc 256, temp dev image). launch_mi355x-amds.sh routes atom-disagg to the multi_node benchmark scripts.
Note: job.slurm model YAML selection no longer sets models.yaml for default sglang-disagg (only vllm-disagg and atom-disagg branches remain); worth confirming that is intentional.
Reviewed by Cursor Bugbot for commit 0eea91c. Bugbot is set up for automated code reviews on this repo. Configure here.