[AMD] dsv4-fp4-mi355x-atom-disagg, add multi-node ATOM/mooncake disaggregation support by seungrokj · Pull Request #1683 · SemiAnalysisAI/InferenceX

seungrokj · 2026-06-08T13:28:54Z

Summary

Add server_atom.sh: multi-node ATOM engine launcher using atom.entrypoints.openai_server with mooncake RDMA KV transfer (--kv-transfer-config) and atomesh router; mirrors server_sglang.sh dynamic xP/yD topology
Add env_atom.sh: ATOM/mooncake-specific environment (mooncake LD_LIBRARY_PATH, ATOM_MOE_GU_ITLV=1, AITER_BF16_FP8_MOE_BOUND=0)
Add models_atom.yaml: per-model configs for ATOM engine (DeepSeek-V4-Pro initial entry)
Update server.sh: add atom-disagg dispatch branch
Update job.slurm: add atom-disagg DOCKER_ENV_ENGINE block (ports, MEM_FRACTION, KV_CACHE_DTYPE, BLOCK_SIZE, MAX_NUM_SEQS); fix FRAMEWORK → ENGINE in RDMA mounts guard; route atom-disagg to models_atom.yaml
Add dsv4_fp4_mi355x_atom-disagg.sh: submit script for DSv4 2P+1D atom-disagg benchmark on MI355X
Update amd-master.yaml: add atom-disagg sweep config
Update launch_mi355x-amds.sh: exclude mia1-p01-g37

Test plan

Dry-run ENGINE=atom-disagg with DRY_RUN=1 to verify server/router command construction
Submit dsv4_fp4_mi355x_atom-disagg.sh on MI355X cluster (mia1 nodes)
Verify prefill (kv_producer) and decode (kv_consumer) mooncake handshake completes
Confirm atomesh router accepts traffic on ROUTER_PORT
Run benchmark and compare throughput vs sglang-disagg baseline

🤖 Generated with Claude Code

Co-authored-by: Jasen2201 Jasen2201@users.noreply.github.com
Co-authored-by: seungrokj seungrokj@users.noreply.github.com

Note

Medium Risk
New multi-node RDMA/mooncake path and Slurm/Docker changes are complex; job.slurm may regress sglang-disagg if models.yaml is unset for that engine.

Overview
Adds multi-node prefill/decode benchmarking for the ATOM engine on MI355X, alongside existing SGLang and vLLM disagg paths.

New atom-disagg plumbing: server_atom.sh launches atom.entrypoints.openai_server with mooncake KV transfer (kv_producer / kv_consumer) and an atomesh PD router on rank 0; env_atom.sh sets mooncake/ATOM env; models_atom.yaml validates models (DeepSeek-V4-Pro); server.sh dispatches to the ATOM launcher. job.slurm passes ATOM-specific Docker env (ports, mem/KV tuning) and, for atom-disagg, bind-mounts host RDMA userspace libs (ionic/bnxt) so mooncake matches the host NIC stack.

dsv4_fp4_mi355x_atom-disagg.sh wires CI to submit.sh; amd-master.yaml adds dsv4-fp4-mi355x-atom-disagg (1P1D TP8, DP-attn, conc 256, temp dev image). launch_mi355x-amds.sh routes atom-disagg to the multi_node benchmark scripts.

Note: job.slurm model YAML selection no longer sets models.yaml for default sglang-disagg (only vllm-disagg and atom-disagg branches remain)—worth confirming that is intentional.

^{Reviewed by Cursor Bugbot for commit 0eea91c. Bugbot is set up for automated code reviews on this repo. Configure here.}

- server_atom.sh: new launcher using atom.entrypoints.openai_server with mooncake kv-transfer-config, atomesh router, dynamic xP/yD topology - env_atom.sh: ATOM-specific env (mooncake LD_LIBRARY_PATH, ATOM_MOE_GU_ITLV, etc.) - server.sh: add atom-disagg dispatch branch - job.slurm: add atom-disagg DOCKER_ENV_ENGINE block, fix FRAMEWORK->ENGINE for RDMA mounts guard, route atom-disagg to models_atom.yaml - models_atom.yaml: per-model configs for ATOM engine - dsv4_fp4_mi355x_atom-disagg.sh: submit script for DSv4 2P+1D atom-disagg run - amd-master.yaml: add atom-disagg sweep config - launch_mi355x-amds.sh: exclude mia1-p01-g37 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-06-08T13:29:05Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

cursor · 2026-06-08T13:31:31Z

-else
-    MODELS_YAML="$(pwd)/models.yaml"
+elif [[ "$ENGINE" == "atom-disagg" ]]; then
+    MODELS_YAML="$(pwd)/models_atom.yaml"


sglang models.yaml path dropped

High Severity

Replacing the final else that set MODELS_YAML to models.yaml with only vllm-disagg and atom-disagg branches leaves the default ENGINE (sglang-disagg) without a MODELS_YAML value, so model validation fails before the job starts.

^{Reviewed by Cursor Bugbot for commit 4d6b3f3. Configure here.}

cursor · 2026-06-08T13:31:31Z

+                bnxt_re*) ((bnxt++)) ;; mlx5*) ((mlx5++)) ;; ionic*) ((ionic++)) ;;
+                *)
+                    local drv; drv=\$(basename \"\$(readlink -f \"\$dev/device/driver\" 2>/dev/null)\" 2>/dev/null || true)
+                    case \"\$drv\" in bnxt*) ((bnxt++)) ;; mlx5*) ((mlx5++)) ;; ionic*) ((ionic++)) ;; esac ;;


set -e breaks NIC counting

Medium Severity

Inside the new RDMA helper, ((bnxt++)) / ((ionic++)) under set -euo pipefail can exit with status 1 when a counter is still zero, aborting atom-disagg container startup during NIC detection.

^{Reviewed by Cursor Bugbot for commit 4d6b3f3. Configure here.}

launch_mi355x-amds.sh only routes to benchmarks/multi_node/ when FRAMEWORK==atom-disagg. With framework: atom, it fell into single_node/ and called the single-node atom script which requires TP/CONC/EP_SIZE/ DP_ATTENTION — vars not set for disagg configs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cursor · 2026-06-08T13:38:18Z

+        --node-ips ${IPADDRS} \
+        --node-ports ${PREFILL_PORT} \
+        --wait-for-all-ports \
+        --timeout 1800"


Wrong port readiness barrier

High Severity

Node 0 waits for every address in IPADDRS to open PREFILL_PORT, but decode nodes only listen on DECODE_PORT. Decode hosts never expose the prefill port, so the barrier times out and the atomesh router never starts on multi-node disagg topologies like 2P+1D.

^{Reviewed by Cursor Bugbot for commit ff429df. Configure here.}

cursor · 2026-06-08T13:38:18Z

+
+    # Determine which prefill worker this node belongs to, and its headnode IP
+    prefill_worker_idx=$((NODE_RANK / PREFILL_NODES_PER_WORKER))
+    PREFILL_HEADNODE_IP="${PREFILL_IPS[$prefill_worker_idx]}"


Unused prefill head IP

Low Severity

PREFILL_HEADNODE_IP is assigned from PREFILL_IPS on secondary prefill ranks but never referenced afterward, unlike the analogous dist-init logic in server_sglang.sh for multi-node prefill workers.

^{Reviewed by Cursor Bugbot for commit ff429df. Configure here.}

claude · 2026-06-08T13:41:00Z

+  image: rocm/atom-dev:nightly_202606071111-Jasen-fix_dockerfile
+  model: deepseek-ai/DeepSeek-V4-Pro
+  model-prefix: dsv4
+  runner: mi355x


🔴 The new dsv4-fp4-mi355x-atom-disagg entry has framework: atom, but the companion edit to runners/launch_mi355x-amds.sh:80 (also in this PR) explicitly routes only atom-disagg (not atom) to BENCHMARK_SUBDIR=multi_node, and line 79 builds SCRIPT_NAME=dsv4_fp4_mi355x_${FRAMEWORK}.sh — so with framework: atom the launcher resolves to the existing single-node script benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_atom.sh instead of the newly-added benchmarks/multi_node/dsv4_fp4_mi355x_atom-disagg.sh. Every other multi-node disagg recipe in this file uses <engine>-disagg (sglang-disagg / vllm-disagg). Fix: change framework: atom → framework: atom-disagg on line 2725.

Extended reasoning...

What the bug is

.github/configs/amd-master.yaml defines a new sweep recipe dsv4-fp4-mi355x-atom-disagg with multinode: true and disagg: true, but sets framework: atom (line 2725). The value should be atom-disagg to match the <engine>-disagg convention used by every other multi-node disagg recipe in this file (e.g. sglang-disagg, vllm-disagg).

Why this is broken end-to-end

The framework field flows directly into the runner as FRAMEWORK. In runners/launch_mi355x-amds.sh, the multinode branch does two things that both key off this value:

# line 79 SCRIPT_NAME="${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}.sh" # line 80 (this PR explicitly added the 'atom-disagg' clause) if [[ "$FRAMEWORK" == "sglang-disagg" ]] || [[ "$FRAMEWORK" == "vllm-disagg" ]] || [[ "$FRAMEWORK" == "atom-disagg" ]]; then BENCHMARK_SUBDIR="multi_node" else BENCHMARK_SUBDIR="single_node/fixed_seq_len" fi

The fact that this same PR added atom-disagg to line 80 makes the intended value self-evident — if the YAML were meant to say atom, the runner edit wouldn't be needed at all.

Step-by-step proof

With the current framework: atom value in the new entry, here's what happens when the sweep dispatches:

EXP_NAME=dsv4-fp4-mi355x-atom-disagg, PRECISION=fp4, FRAMEWORK=atom.

Line 79: SCRIPT_NAME = "${EXP_NAME%%_*}_fp4_mi355x_atom.sh". ${EXP_NAME%%_*} strips the longest _* suffix, but EXP_NAME uses hyphens, not underscores, so the result is dsv4-fp4-mi355x-atom-disagg. Either way, the ${FRAMEWORK} substitution gives ...mi355x_atom.sh instead of ...mi355x_atom-disagg.sh.

Line 80: FRAMEWORK=atom does not match any of sglang-disagg | vllm-disagg | atom-disagg → BENCHMARK_SUBDIR=single_node/fixed_seq_len.

Final dispatch: bash benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_atom.sh. That file already exists on main (it's the existing single-node atom recipe for DSv4) — so the runner silently runs the wrong, single-node script instead of the newly-added benchmarks/multi_node/dsv4_fp4_mi355x_atom-disagg.sh.

The newly-added multi-node launcher (which is the point of this PR) is never invoked.

Impact

The new disagg recipe is non-functional as written. CI will appear to schedule the sweep but will dispatch the existing single-node atom benchmark, producing misleading results that look like atom-disagg numbers but are actually single-node atom runs.

Fix

One-line change in .github/configs/amd-master.yaml (line 2725):

- framework: atom + framework: atom-disagg

After the fix: SCRIPT_NAME resolves to dsv4_fp4_mi355x_atom-disagg.sh (the file added in this PR), BENCHMARK_SUBDIR=multi_node, and the multinode dispatch chain (job.slurm → server.sh → server_atom.sh) — all of which test ENGINE/FRAMEWORK == "atom-disagg" — kicks in correctly.

claude · 2026-06-08T13:41:00Z

 if [[ "$ENGINE" == "vllm-disagg" ]]; then
    MODELS_YAML="$(pwd)/models_vllm.yaml"
-else
-    MODELS_YAML="$(pwd)/models.yaml"
+elif [[ "$ENGINE" == "atom-disagg" ]]; then
+    MODELS_YAML="$(pwd)/models_atom.yaml"
 fi


🔴 Regression: MODELS_YAML is unset for the default ENGINE=sglang-disagg, breaking every existing sglang-disagg recipe. The new if/elif/fi at benchmarks/multi_node/amd_utils/job.slurm:26-30 dropped the else MODELS_YAML="$(pwd)/models.yaml" branch, so when ENGINE is sglang-disagg (the default at line 11) neither arm assigns MODELS_YAML. The very next check at line 32 ([[ ! -f "$MODELS_YAML" ]]) then fires for the empty path and aborts with Error: models YAML not found at . Fix: add else MODELS_YAML="$(pwd)/models.yaml" before the fi to restore the sglang-disagg default.

Extended reasoning...

Bug

benchmarks/multi_node/amd_utils/job.slurm:11 defaults the engine: ENGINE="${ENGINE:-sglang-disagg}". The previous MODELS_YAML selection had a two-arm if/else that always assigned MODELS_YAML — models_vllm.yaml for vllm-disagg, else the SGLang default models.yaml. This PR replaced the else with an elif for the new atom-disagg engine, leaving no fallback assignment:

if [[ "$ENGINE" == "vllm-disagg" ]]; then MODELS_YAML="$(pwd)/models_vllm.yaml" elif [[ "$ENGINE" == "atom-disagg" ]]; then MODELS_YAML="$(pwd)/models_atom.yaml" fi

When ENGINE=sglang-disagg (the default), neither branch matches and MODELS_YAML stays unset. The top-level script does not enable set -u (only the inner srun bash -lc does), so $MODELS_YAML expands to the empty string.

Trigger

Immediately after, at job.slurm:32-35:

if [[ ! -f "$MODELS_YAML" ]]; then echo "Error: models YAML not found at $MODELS_YAML" exit 1 fi

[[ ! -f "" ]] evaluates true (the empty string is not a regular file), so the script prints Error: models YAML not found at (trailing empty) and exits 1.

Step-by-step proof

User submits any existing sglang-disagg recipe (e.g. dsr1-fp4-mi355x-sglang-disagg from .github/configs/amd-master.yaml). The launcher does not override ENGINE.

job.slurm:11 runs ENGINE="${ENGINE:-sglang-disagg}" → ENGINE=sglang-disagg.

Line 26-30: $ENGINE is neither vllm-disagg nor atom-disagg. The if/elif/fi falls through with no else clause. MODELS_YAML is never assigned.

Line 32: [[ ! -f "$MODELS_YAML" ]] expands to [[ ! -f "" ]] → true.

Line 33: prints Error: models YAML not found at to stderr.

Line 34: exit 1 — job aborts before any docker/srun work.

Impact

This regresses every pre-existing sglang-disagg recipe in .github/configs/amd-master.yaml:

dsr1-fp4-mi355x-sglang-disagg

dsr1-fp8-mi355x-sglang-disagg (and -mtp variant)

dsr1-fp4-mi355x-sglang-disagg-mtp, -1k1k-mtp, -8k1k-mtp

glm5-fp8-mi355x-sglang-disagg

qwen3.5-fp4-mi355x-sglang-disagg

qwen3.5-fp8-mi355x-sglang-disagg

models.yaml is still on disk (benchmarks/multi_node/amd_utils/models.yaml, 14125 bytes), so this is purely a dispatch regression — the file the original code expected exists, but the script no longer points at it.

Fix

Add an else branch restoring the sglang default:

if [[ "$ENGINE" == "vllm-disagg" ]]; then MODELS_YAML="$(pwd)/models_vllm.yaml" elif [[ "$ENGINE" == "atom-disagg" ]]; then MODELS_YAML="$(pwd)/models_atom.yaml" else MODELS_YAML="$(pwd)/models.yaml" fi

claude · 2026-06-08T13:41:00Z

+    # Wait for all prefill and decode servers to be ready
+    echo "Waiting for all servers to be up..."
+    BARRIER_CMD="python3 $ATOM_WS_PATH/sync.py barrier \
+        --node-ips ${IPADDRS} \
+        --node-ports ${PREFILL_PORT} \
+        --wait-for-all-ports \
+        --timeout 1800"
+
+    if [[ "$DRY_RUN" -eq 1 ]]; then
+        echo "DRY RUN: $BARRIER_CMD"
+    else
+        eval "$BARRIER_CMD"
+    fi


🔴 Rank 0's post-prefill barrier passes --node-ports ${PREFILL_PORT} (single value) to sync.py barrier, which broadcasts that one port to every IP in IPADDRS — including the decode nodes, which only ever bind DECODE_PORT (8020 by default, not 8010). The barrier therefore polls <decode-ip>:8010 indefinitely, times out after 1800s, and rank 0 never starts the router — so the new atom-disagg recipe (yD≥1, including the 2P+1D dsv4 config added in this PR) cannot complete startup. Fix: pass a per-node port list aligned with IPADDRS (PREFILL_PORT for prefill IPs, DECODE_PORT for decode IPs), or run two separate barriers.

Extended reasoning...

Bug

In benchmarks/multi_node/amd_utils/server_atom.sh, rank 0 issues this barrier after launching its prefill server (lines 156-168):

BARRIER_CMD="python3 $ATOM_WS_PATH/sync.py barrier \ --node-ips ${IPADDRS} \ --node-ports ${PREFILL_PORT} \ --wait-for-all-ports \ --timeout 1800"

It passes all node IPs (IPADDRS is the comma-joined union of prefill + decode IPs, set in job.slurm from IPS[*] across SELECTED_NODES) but a single port value (PREFILL_PORT, default 8010).

Why it fires

sync.py cmd_barrier (lines 51-52) explicitly broadcasts a single-port argument to every IP:

if len(NODE_PORTS) == 1: NODE_PORTS *= len(NODE_IPS)

So the resulting check is is_port_open(ip, 8010) for every IP in IPADDRS. But the decode branch of this same script launches the decode server with only:

--host 0.0.0.0 --server-port ${DECODE_PORT} # default 8020

— it never binds PREFILL_PORT. The script itself documents the split in its header comment (Prefill port: $PREFILL_PORT (default 8010) / Decode port: $DECODE_PORT (default 8020)). After binding DECODE_PORT, the decode node enters its own wait for router barrier on NODE0_ADDR:ROUTER_PORT — it never opens 8010.

The barrier hits time.time() - start_time >= timeout (sync.py:87-94) and sys.exit(1). eval "$BARRIER_CMD" propagates the non-zero rc on rank 0, the router is never launched, and decode/non-rank-0 prefill nodes (which are themselves blocked on the router-port barrier at NODE0_ADDR:ROUTER_PORT) eventually time out as well.

Step-by-step proof (dsv4 2P+1D recipe added in this PR)

The new sweep config (amd-master.yaml → dsv4-fp4-mi355x-atom-disagg) sets xP=2, yD=1, so NUM_NODES=3 in job.slurm.

job.slurm selects 3 nodes; IPADDRS="ip_p0,ip_p1,ip_d0" (prefill node 0, prefill node 1, decode node 0).

Rank 0 (ip_p0) launches its prefill server bound to 0.0.0.0:8010 and then runs the barrier with --node-ips ip_p0,ip_p1,ip_d0 --node-ports 8010.

sync.py cmd_barrier expands NODE_PORTS to [8010, 8010, 8010] and starts checking each (ip, port) pair.

ip_p0:8010 ✓ (rank 0s own prefill).

ip_p1:8010 ✓ once rank-1 prefill finishes loading.

ip_d0:8010 — never opens. Rank NODE_OFFSET=2 is the decode node, which only binds 0.0.0.0:8020 (DECODE_PORT) and then waits on NODE0_ADDR:ROUTER_PORT.

After 1800s, sync.py exits 1 with ERROR: Timeout after 1800 seconds waiting for ports to open listing ip_d0:8010. Rank 0 skips router launch; decode/non-rank-0 prefill nodes are stuck on the NODE0_ADDR:ROUTER_PORT barrier and eventually fail with their own 1800s timeout. The disagg topology never assembles.

Why the sister scripts do not hit this

server_sglang.sh happens to use port 8000 for both the prefill and decode roles, so broadcasting a single port across all IPs is correct there. The atom port split (8010 / 8020) — explicitly introduced by this PR — is incompatible with the single-port-broadcast pattern this script copied from the sglang variant. Any topology with yD>=1 (which is every real recipe — the only sweep added here is 2P+1D) hits this.

Fix

Use sync.pys already-supported per-node port list (the elif len(NODE_PORTS) != len(NODE_IPS) branch at sync.py:53 confirms a same-length list is the documented API). For example, build a parallel ports list:

PORTS_LIST="" for ip in "${PREFILL_IPS[@]}"; do PORTS_LIST+="${PREFILL_PORT},"; done for ip in "${DECODE_IPS[@]}"; do PORTS_LIST+="${DECODE_PORT},"; done PORTS_LIST="${PORTS_LIST%,}" BARRIER_CMD="python3 $ATOM_WS_PATH/sync.py barrier \ --node-ips ${IPADDRS} \ --node-ports ${PORTS_LIST} \ --wait-for-all-ports \ --timeout 1800"

Alternatively, run two sequential barriers — one against PREFILL_IPS on PREFILL_PORT, another against DECODE_IPS on DECODE_PORT.

claude · 2026-06-08T13:41:00Z


+# Enable out-of-tree RDMA library mounts for atom-disagg (mooncake requires host RDMA stack)
+RDMA_MOUNTS=()
+if [[ "$ENGINE" == "atom-disagg" ]]; then


🔴 Missing fi to close the outer if [[ "$ENGINE" == "atom-disagg" ]]; then block opened at line 443 — the gate never closes before # Pre-clean (idempotent) at line 517. Because bash -c parses the entire inner script (lines 433-597 of the srun bash -lc "…" string) before executing, every rank's docker invocation will fail with syntax error: unexpected end of file regardless of ENGINE, breaking sglang-disagg and vllm-disagg in addition to the new atom-disagg path. Add an fi just before line 517 to close the outer atom-disagg conditional.

Extended reasoning...

What breaks. Line 443 opens if [[ "$ENGINE" == "atom-disagg" ]]; then to gate the entire new RDMA-mount-detection block. Inside that block, the diff defines two helper functions and runs a series of nested if/fi blocks (line 453/463, line 464/466 inside a single-line if at 451, line 479/484, line 481/483 nested inside 479, line 486/509 with internal 495/499 and an elif branch, line 511/515). All of those inner pairs balance correctly. But the outer gate from line 443 is never closed before line 517 # Pre-clean (idempotent), where the script transitions out of atom-disagg-specific logic into the engine-agnostic docker pre-clean / run sequence.\n\nWhy every engine breaks, not just atom-disagg. This inner script lives inside an srun … bash -lc "…" string at lines 432-598. The whole quoted body is passed to bash -c as a single argument. bash -c does a full parse of the input before it executes anything, so a structural unbalance fires a parse error regardless of the runtime value of $ENGINE. That means ENGINE=sglang-disagg and ENGINE=vllm-disagg also explode — even though they never enter the atom-disagg branch logically, the parser still has to read past the opening if at line 443 looking for a closer that doesn't exist.\n\nWhy the outer bash -n job.slurm doesn't catch it. The unmatched if lives inside a double-quoted string at the outer level, so outer-script lexical analysis sees it as ordinary string content. The bug only surfaces when bash -c parses the inner content at runtime, which is exactly what every rank does under srun.\n\nProof. I extracted the body of the bash -lc "…" string (lines 433-597), unescaped \\$ → $, \\" → ", \\\\ → \\, and \\`` → ```` (standard bash double-quote unescaping rules), and ran bash -non the result:\n\n```\n$ bash -n /tmp/inner_script.sh\n/tmp/inner_script.sh: line N: syntax error: unexpected end of file\n$ echo $?\n2\n```\n\nCounting line-leading control tokens in the unescaped script confirms 12ifopeners vs 11ficlosers — exactly one missing closer, and the only opener with no matching closer is line 443.\n\n**Impact.** No docker container will launch on any rank for any disaggregated engine after this PR.srun's bash -cexits with status 2 before reaching the$MAYBE_EXEC $DOCKER_CMD run …line. The PR description's test plan (atom-disagg dry-run, sglang-disagg baseline comparison) cannot complete; nor can any prior sglang-disagg or vllm-disagg CI run that targets this file post-merge.\n\n**Fix.** Add a closingfiimmediately before line 517 to terminate the outer atom-disagg gate, so the if/fi structure becomesif atom-disagg; then [RDMA detection helpers + mount-building blocks]; fi; # Pre-clean (idempotent); …`.

Model is stored in HF hub format at /it-share/hf-hub-cache, not at /it-share/data/DeepSeek-V4-Pro. Mount the cache into the container and pass MODEL_HF_ID / HF_HUB_CACHE so atom server resolves via HuggingFace hub rather than falling through to the failing SGLang filesystem check. - job.slurm: add atom-disagg branch in model resolution (skip srun check), add HF_HUB_CACHE/MODEL_HF_ID to DOCKER_ENV_ENGINE, bind-mount /it-share/hf-hub-cache:/hf-hub-cache into the container - server_atom.sh: use MODEL_HF_ID when set, fallback to MODEL_DIR/MODEL_NAME - dsv4_fp4_mi355x_atom-disagg.sh: export MODEL_HF_ID from GHA MODEL var Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ine script The if [[ "$ENGINE" == "atom-disagg" ]]; then block inside the bash -lc heredoc passed to srun was never closed, causing bash to report "unexpected end of file" (syntax error) on every atom-disagg run. Add the missing fi after the RDMA mount logging block. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 6 total unresolved issues (including 4 from previous reviews).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 36268fe. Configure here.}

env_atom.sh called ibv_devinfo inside the container to detect RDMA devices, but: (1) IBDEVICES was never passed into the container from job.slurm, and (2) ibv_devinfo may not see the ionic NIC devices in the container environment. For ATOM/mooncake, IBDEVICES is not needed as a server argument (unlike SGLang's --disaggregation-ib-device). Mooncake uses proxy_ip/handshake_port for KV transfer routing. - job.slurm: pass IBDEVICES into the container via DOCKER_ENV_ENGINE for atom-disagg - env_atom.sh: make IBDEVICES detection non-fatal with a warning instead of exit 1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…n container The ATOM Docker image doesn't have the iproute2 `ip` command, so `ip route get 1.1.1.1` failed silently, leaving host_ip empty. This caused: - sync.py barrier to error: --local-ip expected one argument - proxy_ip in --kv-transfer-config to be "" instead of the real IP Fall back to `hostname -I` when `ip` is unavailable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Revert HF hub cache approach. atom-disagg now resolves the model the same way as sglang-disagg: via MODEL_DIR/MODEL_NAME mounted at /models in the container (MODEL_DIR defaults to /it-share/data on mia1 nodes). - job.slurm: remove atom-disagg model-path branch (falls through to SGLang filesystem check), drop HF_HUB_CACHE/MODEL_HF_ID from DOCKER_ENV_ENGINE, remove /it-share/hf-hub-cache bind-mount - server_atom.sh: revert --model arg to ${MODEL_DIR}/${MODEL_NAME} - dsv4_fp4_mi355x_atom-disagg.sh: remove MODEL_HF_ID export Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Switch from 2P1D to 1P1D (prefill-num-worker: 2 -> 1, PREFILL_NODES=2 -> 1) for initial spot-check of the atom-disagg pipeline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

functionstackx · 2026-06-08T17:10:37Z

hi @seungrokj thanks for the contribution can u add full-sweep-enabled label to this so that u can validate this PR works?

seungrokj requested a review from a team June 8, 2026 13:28

seungrokj requested review from 1am9trash, billishyahao, chunfangamd and yctseng0211 as code owners June 8, 2026 13:28

github-project-automation Bot added this to InferenceMAX Board Jun 8, 2026

cursor Bot reviewed Jun 8, 2026

View reviewed changes

claude Bot reviewed Jun 8, 2026

View reviewed changes

seungrokj and others added 2 commits June 8, 2026 22:47

cursor Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread benchmarks/multi_node/amd_utils/server_atom.sh

Comment thread benchmarks/multi_node/amd_utils/server_atom.sh

seungrokj and others added 3 commits June 8, 2026 23:40

seungrokj changed the title ~~[AMD] atom-disagg: add multi-node ATOM/mooncake disaggregation support~~ [AMD] dsv4-fp4-mi355x-atom-disagg, add multi-node ATOM/mooncake disaggregation support Jun 8, 2026

seungrokj added the AMD label Jun 8, 2026

[AMD] amd-master: change dsv4-fp4-mi355x-atom-disagg to 1P1D config

0eea91c

Switch from 2P1D to 1P1D (prefill-num-worker: 2 -> 1, PREFILL_NODES=2 -> 1) for initial spot-check of the atom-disagg pipeline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Conversation

seungrokj commented Jun 8, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

github-actions Bot commented Jun 8, 2026

Uh oh!

cursor Bot Jun 8, 2026

Choose a reason for hiding this comment

sglang models.yaml path dropped

Uh oh!

Uh oh!

Uh oh!

cursor Bot Jun 8, 2026

Choose a reason for hiding this comment

set -e breaks NIC counting

Uh oh!

cursor Bot Jun 8, 2026

Choose a reason for hiding this comment

Wrong port readiness barrier

Uh oh!

cursor Bot Jun 8, 2026

Choose a reason for hiding this comment

Unused prefill head IP

Uh oh!

claude Bot Jun 8, 2026

Choose a reason for hiding this comment

What the bug is

Why this is broken end-to-end

Step-by-step proof

Impact

Fix

Uh oh!

claude Bot Jun 8, 2026

Choose a reason for hiding this comment

Bug

Trigger

Step-by-step proof

Impact

Fix

Uh oh!

claude Bot Jun 8, 2026

Choose a reason for hiding this comment

Bug

Why it fires

Step-by-step proof (dsv4 2P+1D recipe added in this PR)

Why the sister scripts do not hit this

Fix

Uh oh!

claude Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

functionstackx commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

seungrokj commented Jun 8, 2026 •

edited by cursor Bot

Loading