Skip to content

[AMD] dsv4-fp4-mi355x-atom-disagg, add multi-node ATOM/mooncake disaggregation support#1683

Open
seungrokj wants to merge 8 commits into
mainfrom
amd/atom_mesh_pre
Open

[AMD] dsv4-fp4-mi355x-atom-disagg, add multi-node ATOM/mooncake disaggregation support#1683
seungrokj wants to merge 8 commits into
mainfrom
amd/atom_mesh_pre

Conversation

@seungrokj
Copy link
Copy Markdown
Collaborator

@seungrokj seungrokj commented Jun 8, 2026

Summary

  • Add server_atom.sh: multi-node ATOM engine launcher using atom.entrypoints.openai_server with mooncake RDMA KV transfer (--kv-transfer-config) and atomesh router; mirrors server_sglang.sh dynamic xP/yD topology
  • Add env_atom.sh: ATOM/mooncake-specific environment (mooncake LD_LIBRARY_PATH, ATOM_MOE_GU_ITLV=1, AITER_BF16_FP8_MOE_BOUND=0)
  • Add models_atom.yaml: per-model configs for ATOM engine (DeepSeek-V4-Pro initial entry)
  • Update server.sh: add atom-disagg dispatch branch
  • Update job.slurm: add atom-disagg DOCKER_ENV_ENGINE block (ports, MEM_FRACTION, KV_CACHE_DTYPE, BLOCK_SIZE, MAX_NUM_SEQS); fix FRAMEWORKENGINE in RDMA mounts guard; route atom-disagg to models_atom.yaml
  • Add dsv4_fp4_mi355x_atom-disagg.sh: submit script for DSv4 2P+1D atom-disagg benchmark on MI355X
  • Update amd-master.yaml: add atom-disagg sweep config
  • Update launch_mi355x-amds.sh: exclude mia1-p01-g37

Test plan

  • Dry-run ENGINE=atom-disagg with DRY_RUN=1 to verify server/router command construction
  • Submit dsv4_fp4_mi355x_atom-disagg.sh on MI355X cluster (mia1 nodes)
  • Verify prefill (kv_producer) and decode (kv_consumer) mooncake handshake completes
  • Confirm atomesh router accepts traffic on ROUTER_PORT
  • Run benchmark and compare throughput vs sglang-disagg baseline

🤖 Generated with Claude Code

Co-authored-by: Jasen2201 Jasen2201@users.noreply.github.com
Co-authored-by: seungrokj seungrokj@users.noreply.github.com


Note

Medium Risk
New multi-node RDMA/mooncake path and Slurm/Docker changes are complex; job.slurm may regress sglang-disagg if models.yaml is unset for that engine.

Overview
Adds multi-node prefill/decode benchmarking for the ATOM engine on MI355X, alongside existing SGLang and vLLM disagg paths.

New atom-disagg plumbing: server_atom.sh launches atom.entrypoints.openai_server with mooncake KV transfer (kv_producer / kv_consumer) and an atomesh PD router on rank 0; env_atom.sh sets mooncake/ATOM env; models_atom.yaml validates models (DeepSeek-V4-Pro); server.sh dispatches to the ATOM launcher. job.slurm passes ATOM-specific Docker env (ports, mem/KV tuning) and, for atom-disagg, bind-mounts host RDMA userspace libs (ionic/bnxt) so mooncake matches the host NIC stack.

dsv4_fp4_mi355x_atom-disagg.sh wires CI to submit.sh; amd-master.yaml adds dsv4-fp4-mi355x-atom-disagg (1P1D TP8, DP-attn, conc 256, temp dev image). launch_mi355x-amds.sh routes atom-disagg to the multi_node benchmark scripts.

Note: job.slurm model YAML selection no longer sets models.yaml for default sglang-disagg (only vllm-disagg and atom-disagg branches remain)—worth confirming that is intentional.

Reviewed by Cursor Bugbot for commit 0eea91c. Bugbot is set up for automated code reviews on this repo. Configure here.

- server_atom.sh: new launcher using atom.entrypoints.openai_server
  with mooncake kv-transfer-config, atomesh router, dynamic xP/yD topology
- env_atom.sh: ATOM-specific env (mooncake LD_LIBRARY_PATH, ATOM_MOE_GU_ITLV, etc.)
- server.sh: add atom-disagg dispatch branch
- job.slurm: add atom-disagg DOCKER_ENV_ENGINE block, fix FRAMEWORK->ENGINE
  for RDMA mounts guard, route atom-disagg to models_atom.yaml
- models_atom.yaml: per-model configs for ATOM engine
- dsv4_fp4_mi355x_atom-disagg.sh: submit script for DSv4 2P+1D atom-disagg run
- amd-master.yaml: add atom-disagg sweep config
- launch_mi355x-amds.sh: exclude mia1-p01-g37

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 8, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

else
MODELS_YAML="$(pwd)/models.yaml"
elif [[ "$ENGINE" == "atom-disagg" ]]; then
MODELS_YAML="$(pwd)/models_atom.yaml"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sglang models.yaml path dropped

High Severity

Replacing the final else that set MODELS_YAML to models.yaml with only vllm-disagg and atom-disagg branches leaves the default ENGINE (sglang-disagg) without a MODELS_YAML value, so model validation fails before the job starts.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4d6b3f3. Configure here.

Comment thread benchmarks/multi_node/amd_utils/job.slurm
Comment thread .github/configs/amd-master.yaml Outdated
bnxt_re*) ((bnxt++)) ;; mlx5*) ((mlx5++)) ;; ionic*) ((ionic++)) ;;
*)
local drv; drv=\$(basename \"\$(readlink -f \"\$dev/device/driver\" 2>/dev/null)\" 2>/dev/null || true)
case \"\$drv\" in bnxt*) ((bnxt++)) ;; mlx5*) ((mlx5++)) ;; ionic*) ((ionic++)) ;; esac ;;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set -e breaks NIC counting

Medium Severity

Inside the new RDMA helper, ((bnxt++)) / ((ionic++)) under set -euo pipefail can exit with status 1 when a counter is still zero, aborting atom-disagg container startup during NIC detection.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4d6b3f3. Configure here.

launch_mi355x-amds.sh only routes to benchmarks/multi_node/ when
FRAMEWORK==atom-disagg. With framework: atom, it fell into single_node/
and called the single-node atom script which requires TP/CONC/EP_SIZE/
DP_ATTENTION — vars not set for disagg configs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
--node-ips ${IPADDRS} \
--node-ports ${PREFILL_PORT} \
--wait-for-all-ports \
--timeout 1800"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong port readiness barrier

High Severity

Node 0 waits for every address in IPADDRS to open PREFILL_PORT, but decode nodes only listen on DECODE_PORT. Decode hosts never expose the prefill port, so the barrier times out and the atomesh router never starts on multi-node disagg topologies like 2P+1D.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ff429df. Configure here.


# Determine which prefill worker this node belongs to, and its headnode IP
prefill_worker_idx=$((NODE_RANK / PREFILL_NODES_PER_WORKER))
PREFILL_HEADNODE_IP="${PREFILL_IPS[$prefill_worker_idx]}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused prefill head IP

Low Severity

PREFILL_HEADNODE_IP is assigned from PREFILL_IPS on secondary prefill ranks but never referenced afterward, unlike the analogous dist-init logic in server_sglang.sh for multi-node prefill workers.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ff429df. Configure here.

image: rocm/atom-dev:nightly_202606071111-Jasen-fix_dockerfile
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: mi355x
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new dsv4-fp4-mi355x-atom-disagg entry has framework: atom, but the companion edit to runners/launch_mi355x-amds.sh:80 (also in this PR) explicitly routes only atom-disagg (not atom) to BENCHMARK_SUBDIR=multi_node, and line 79 builds SCRIPT_NAME=dsv4_fp4_mi355x_${FRAMEWORK}.sh — so with framework: atom the launcher resolves to the existing single-node script benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_atom.sh instead of the newly-added benchmarks/multi_node/dsv4_fp4_mi355x_atom-disagg.sh. Every other multi-node disagg recipe in this file uses <engine>-disagg (sglang-disagg / vllm-disagg). Fix: change framework: atomframework: atom-disagg on line 2725.

Extended reasoning...

What the bug is

.github/configs/amd-master.yaml defines a new sweep recipe dsv4-fp4-mi355x-atom-disagg with multinode: true and disagg: true, but sets framework: atom (line 2725). The value should be atom-disagg to match the <engine>-disagg convention used by every other multi-node disagg recipe in this file (e.g. sglang-disagg, vllm-disagg).

Why this is broken end-to-end

The framework field flows directly into the runner as FRAMEWORK. In runners/launch_mi355x-amds.sh, the multinode branch does two things that both key off this value:

# line 79
SCRIPT_NAME="${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}.sh"
# line 80 (this PR explicitly added the 'atom-disagg' clause)
if [[ "$FRAMEWORK" == "sglang-disagg" ]] || [[ "$FRAMEWORK" == "vllm-disagg" ]] || [[ "$FRAMEWORK" == "atom-disagg" ]]; then
    BENCHMARK_SUBDIR="multi_node"
else
    BENCHMARK_SUBDIR="single_node/fixed_seq_len"
fi

The fact that this same PR added atom-disagg to line 80 makes the intended value self-evident — if the YAML were meant to say atom, the runner edit wouldn't be needed at all.

Step-by-step proof

With the current framework: atom value in the new entry, here's what happens when the sweep dispatches:

  1. EXP_NAME=dsv4-fp4-mi355x-atom-disagg, PRECISION=fp4, FRAMEWORK=atom.
  2. Line 79: SCRIPT_NAME = "${EXP_NAME%%_*}_fp4_mi355x_atom.sh". ${EXP_NAME%%_*} strips the longest _* suffix, but EXP_NAME uses hyphens, not underscores, so the result is dsv4-fp4-mi355x-atom-disagg. Either way, the ${FRAMEWORK} substitution gives ...mi355x_atom.sh instead of ...mi355x_atom-disagg.sh.
  3. Line 80: FRAMEWORK=atom does not match any of sglang-disagg | vllm-disagg | atom-disaggBENCHMARK_SUBDIR=single_node/fixed_seq_len.
  4. Final dispatch: bash benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_atom.sh. That file already exists on main (it's the existing single-node atom recipe for DSv4) — so the runner silently runs the wrong, single-node script instead of the newly-added benchmarks/multi_node/dsv4_fp4_mi355x_atom-disagg.sh.

The newly-added multi-node launcher (which is the point of this PR) is never invoked.

Impact

The new disagg recipe is non-functional as written. CI will appear to schedule the sweep but will dispatch the existing single-node atom benchmark, producing misleading results that look like atom-disagg numbers but are actually single-node atom runs.

Fix

One-line change in .github/configs/amd-master.yaml (line 2725):

-  framework: atom
+  framework: atom-disagg

After the fix: SCRIPT_NAME resolves to dsv4_fp4_mi355x_atom-disagg.sh (the file added in this PR), BENCHMARK_SUBDIR=multi_node, and the multinode dispatch chain (job.slurmserver.shserver_atom.sh) — all of which test ENGINE/FRAMEWORK == "atom-disagg" — kicks in correctly.

Comment on lines 26 to 30
if [[ "$ENGINE" == "vllm-disagg" ]]; then
MODELS_YAML="$(pwd)/models_vllm.yaml"
else
MODELS_YAML="$(pwd)/models.yaml"
elif [[ "$ENGINE" == "atom-disagg" ]]; then
MODELS_YAML="$(pwd)/models_atom.yaml"
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Regression: MODELS_YAML is unset for the default ENGINE=sglang-disagg, breaking every existing sglang-disagg recipe. The new if/elif/fi at benchmarks/multi_node/amd_utils/job.slurm:26-30 dropped the else MODELS_YAML="$(pwd)/models.yaml" branch, so when ENGINE is sglang-disagg (the default at line 11) neither arm assigns MODELS_YAML. The very next check at line 32 ([[ ! -f "$MODELS_YAML" ]]) then fires for the empty path and aborts with Error: models YAML not found at . Fix: add else MODELS_YAML="$(pwd)/models.yaml" before the fi to restore the sglang-disagg default.

Extended reasoning...

Bug

benchmarks/multi_node/amd_utils/job.slurm:11 defaults the engine: ENGINE="${ENGINE:-sglang-disagg}". The previous MODELS_YAML selection had a two-arm if/else that always assigned MODELS_YAMLmodels_vllm.yaml for vllm-disagg, else the SGLang default models.yaml. This PR replaced the else with an elif for the new atom-disagg engine, leaving no fallback assignment:

if [[ "$ENGINE" == "vllm-disagg" ]]; then
    MODELS_YAML="$(pwd)/models_vllm.yaml"
elif [[ "$ENGINE" == "atom-disagg" ]]; then
    MODELS_YAML="$(pwd)/models_atom.yaml"
fi

When ENGINE=sglang-disagg (the default), neither branch matches and MODELS_YAML stays unset. The top-level script does not enable set -u (only the inner srun bash -lc does), so $MODELS_YAML expands to the empty string.

Trigger

Immediately after, at job.slurm:32-35:

if [[ ! -f "$MODELS_YAML" ]]; then
    echo "Error: models YAML not found at $MODELS_YAML"
    exit 1
fi

[[ ! -f "" ]] evaluates true (the empty string is not a regular file), so the script prints Error: models YAML not found at (trailing empty) and exits 1.

Step-by-step proof

  1. User submits any existing sglang-disagg recipe (e.g. dsr1-fp4-mi355x-sglang-disagg from .github/configs/amd-master.yaml). The launcher does not override ENGINE.
  2. job.slurm:11 runs ENGINE="${ENGINE:-sglang-disagg}"ENGINE=sglang-disagg.
  3. Line 26-30: $ENGINE is neither vllm-disagg nor atom-disagg. The if/elif/fi falls through with no else clause. MODELS_YAML is never assigned.
  4. Line 32: [[ ! -f "$MODELS_YAML" ]] expands to [[ ! -f "" ]] → true.
  5. Line 33: prints Error: models YAML not found at to stderr.
  6. Line 34: exit 1 — job aborts before any docker/srun work.

Impact

This regresses every pre-existing sglang-disagg recipe in .github/configs/amd-master.yaml:

  • dsr1-fp4-mi355x-sglang-disagg
  • dsr1-fp8-mi355x-sglang-disagg (and -mtp variant)
  • dsr1-fp4-mi355x-sglang-disagg-mtp, -1k1k-mtp, -8k1k-mtp
  • glm5-fp8-mi355x-sglang-disagg
  • qwen3.5-fp4-mi355x-sglang-disagg
  • qwen3.5-fp8-mi355x-sglang-disagg

models.yaml is still on disk (benchmarks/multi_node/amd_utils/models.yaml, 14125 bytes), so this is purely a dispatch regression — the file the original code expected exists, but the script no longer points at it.

Fix

Add an else branch restoring the sglang default:

if [[ "$ENGINE" == "vllm-disagg" ]]; then
    MODELS_YAML="$(pwd)/models_vllm.yaml"
elif [[ "$ENGINE" == "atom-disagg" ]]; then
    MODELS_YAML="$(pwd)/models_atom.yaml"
else
    MODELS_YAML="$(pwd)/models.yaml"
fi

Comment on lines +156 to +168
# Wait for all prefill and decode servers to be ready
echo "Waiting for all servers to be up..."
BARRIER_CMD="python3 $ATOM_WS_PATH/sync.py barrier \
--node-ips ${IPADDRS} \
--node-ports ${PREFILL_PORT} \
--wait-for-all-ports \
--timeout 1800"

if [[ "$DRY_RUN" -eq 1 ]]; then
echo "DRY RUN: $BARRIER_CMD"
else
eval "$BARRIER_CMD"
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Rank 0's post-prefill barrier passes --node-ports ${PREFILL_PORT} (single value) to sync.py barrier, which broadcasts that one port to every IP in IPADDRS — including the decode nodes, which only ever bind DECODE_PORT (8020 by default, not 8010). The barrier therefore polls <decode-ip>:8010 indefinitely, times out after 1800s, and rank 0 never starts the router — so the new atom-disagg recipe (yD≥1, including the 2P+1D dsv4 config added in this PR) cannot complete startup. Fix: pass a per-node port list aligned with IPADDRS (PREFILL_PORT for prefill IPs, DECODE_PORT for decode IPs), or run two separate barriers.

Extended reasoning...

Bug

In benchmarks/multi_node/amd_utils/server_atom.sh, rank 0 issues this barrier after launching its prefill server (lines 156-168):

BARRIER_CMD="python3 $ATOM_WS_PATH/sync.py barrier \
    --node-ips ${IPADDRS} \
    --node-ports ${PREFILL_PORT} \
    --wait-for-all-ports \
    --timeout 1800"

It passes all node IPs (IPADDRS is the comma-joined union of prefill + decode IPs, set in job.slurm from IPS[*] across SELECTED_NODES) but a single port value (PREFILL_PORT, default 8010).

Why it fires

sync.py cmd_barrier (lines 51-52) explicitly broadcasts a single-port argument to every IP:

if len(NODE_PORTS) == 1:
    NODE_PORTS *= len(NODE_IPS)

So the resulting check is is_port_open(ip, 8010) for every IP in IPADDRS. But the decode branch of this same script launches the decode server with only:

--host 0.0.0.0 --server-port ${DECODE_PORT}    # default 8020

— it never binds PREFILL_PORT. The script itself documents the split in its header comment (Prefill port: $PREFILL_PORT (default 8010) / Decode port: $DECODE_PORT (default 8020)). After binding DECODE_PORT, the decode node enters its own wait for router barrier on NODE0_ADDR:ROUTER_PORT — it never opens 8010.

The barrier hits time.time() - start_time >= timeout (sync.py:87-94) and sys.exit(1). eval "$BARRIER_CMD" propagates the non-zero rc on rank 0, the router is never launched, and decode/non-rank-0 prefill nodes (which are themselves blocked on the router-port barrier at NODE0_ADDR:ROUTER_PORT) eventually time out as well.

Step-by-step proof (dsv4 2P+1D recipe added in this PR)

  1. The new sweep config (amd-master.yamldsv4-fp4-mi355x-atom-disagg) sets xP=2, yD=1, so NUM_NODES=3 in job.slurm.
  2. job.slurm selects 3 nodes; IPADDRS="ip_p0,ip_p1,ip_d0" (prefill node 0, prefill node 1, decode node 0).
  3. Rank 0 (ip_p0) launches its prefill server bound to 0.0.0.0:8010 and then runs the barrier with --node-ips ip_p0,ip_p1,ip_d0 --node-ports 8010.
  4. sync.py cmd_barrier expands NODE_PORTS to [8010, 8010, 8010] and starts checking each (ip, port) pair.
  5. ip_p0:8010 ✓ (rank 0s own prefill).
  6. ip_p1:8010 ✓ once rank-1 prefill finishes loading.
  7. ip_d0:8010never opens. Rank NODE_OFFSET=2 is the decode node, which only binds 0.0.0.0:8020 (DECODE_PORT) and then waits on NODE0_ADDR:ROUTER_PORT.
  8. After 1800s, sync.py exits 1 with ERROR: Timeout after 1800 seconds waiting for ports to open listing ip_d0:8010. Rank 0 skips router launch; decode/non-rank-0 prefill nodes are stuck on the NODE0_ADDR:ROUTER_PORT barrier and eventually fail with their own 1800s timeout. The disagg topology never assembles.

Why the sister scripts do not hit this

server_sglang.sh happens to use port 8000 for both the prefill and decode roles, so broadcasting a single port across all IPs is correct there. The atom port split (8010 / 8020) — explicitly introduced by this PR — is incompatible with the single-port-broadcast pattern this script copied from the sglang variant. Any topology with yD>=1 (which is every real recipe — the only sweep added here is 2P+1D) hits this.

Fix

Use sync.pys already-supported per-node port list (the elif len(NODE_PORTS) != len(NODE_IPS) branch at sync.py:53 confirms a same-length list is the documented API). For example, build a parallel ports list:

PORTS_LIST=""
for ip in "${PREFILL_IPS[@]}"; do PORTS_LIST+="${PREFILL_PORT},"; done
for ip in "${DECODE_IPS[@]}";  do PORTS_LIST+="${DECODE_PORT},"; done
PORTS_LIST="${PORTS_LIST%,}"

BARRIER_CMD="python3 $ATOM_WS_PATH/sync.py barrier \
    --node-ips ${IPADDRS} \
    --node-ports ${PORTS_LIST} \
    --wait-for-all-ports \
    --timeout 1800"

Alternatively, run two sequential barriers — one against PREFILL_IPS on PREFILL_PORT, another against DECODE_IPS on DECODE_PORT.


# Enable out-of-tree RDMA library mounts for atom-disagg (mooncake requires host RDMA stack)
RDMA_MOUNTS=()
if [[ "$ENGINE" == "atom-disagg" ]]; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Missing fi to close the outer if [[ "$ENGINE" == "atom-disagg" ]]; then block opened at line 443 — the gate never closes before # Pre-clean (idempotent) at line 517. Because bash -c parses the entire inner script (lines 433-597 of the srun bash -lc "…" string) before executing, every rank's docker invocation will fail with syntax error: unexpected end of file regardless of ENGINE, breaking sglang-disagg and vllm-disagg in addition to the new atom-disagg path. Add an fi just before line 517 to close the outer atom-disagg conditional.

Extended reasoning...

What breaks. Line 443 opens if [[ "$ENGINE" == "atom-disagg" ]]; then to gate the entire new RDMA-mount-detection block. Inside that block, the diff defines two helper functions and runs a series of nested if/fi blocks (line 453/463, line 464/466 inside a single-line if at 451, line 479/484, line 481/483 nested inside 479, line 486/509 with internal 495/499 and an elif branch, line 511/515). All of those inner pairs balance correctly. But the outer gate from line 443 is never closed before line 517 # Pre-clean (idempotent), where the script transitions out of atom-disagg-specific logic into the engine-agnostic docker pre-clean / run sequence.\n\nWhy every engine breaks, not just atom-disagg. This inner script lives inside an srun … bash -lc "…" string at lines 432-598. The whole quoted body is passed to bash -c as a single argument. bash -c does a full parse of the input before it executes anything, so a structural unbalance fires a parse error regardless of the runtime value of $ENGINE. That means ENGINE=sglang-disagg and ENGINE=vllm-disagg also explode — even though they never enter the atom-disagg branch logically, the parser still has to read past the opening if at line 443 looking for a closer that doesn't exist.\n\nWhy the outer bash -n job.slurm doesn't catch it. The unmatched if lives inside a double-quoted string at the outer level, so outer-script lexical analysis sees it as ordinary string content. The bug only surfaces when bash -c parses the inner content at runtime, which is exactly what every rank does under srun.\n\nProof. I extracted the body of the bash -lc "…" string (lines 433-597), unescaped \\$$, \\"", \\\\\\, and \\`` → ```` (standard bash double-quote unescaping rules), and ran bash -non the result:\n\n```\n$ bash -n /tmp/inner_script.sh\n/tmp/inner_script.sh: line N: syntax error: unexpected end of file\n$ echo $?\n2\n```\n\nCounting line-leading control tokens in the unescaped script confirms 12ifopeners vs 11ficlosers — exactly one missing closer, and the only opener with no matching closer is line 443.\n\n**Impact.** No docker container will launch on any rank for any disaggregated engine after this PR.srun's bash -cexits with status 2 before reaching the$MAYBE_EXEC $DOCKER_CMD run …line. The PR description's test plan (atom-disagg dry-run, sglang-disagg baseline comparison) cannot complete; nor can any prior sglang-disagg or vllm-disagg CI run that targets this file post-merge.\n\n**Fix.** Add a closingfiimmediately before line 517 to terminate the outer atom-disagg gate, so the if/fi structure becomesif atom-disagg; then [RDMA detection helpers + mount-building blocks]; fi; # Pre-clean (idempotent); …`.

seungrokj and others added 2 commits June 8, 2026 22:47
Model is stored in HF hub format at /it-share/hf-hub-cache, not at
/it-share/data/DeepSeek-V4-Pro. Mount the cache into the container and
pass MODEL_HF_ID / HF_HUB_CACHE so atom server resolves via HuggingFace
hub rather than falling through to the failing SGLang filesystem check.

- job.slurm: add atom-disagg branch in model resolution (skip srun check),
  add HF_HUB_CACHE/MODEL_HF_ID to DOCKER_ENV_ENGINE, bind-mount
  /it-share/hf-hub-cache:/hf-hub-cache into the container
- server_atom.sh: use MODEL_HF_ID when set, fallback to MODEL_DIR/MODEL_NAME
- dsv4_fp4_mi355x_atom-disagg.sh: export MODEL_HF_ID from GHA MODEL var

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ine script

The if [[ "$ENGINE" == "atom-disagg" ]]; then block inside the bash -lc
heredoc passed to srun was never closed, causing bash to report
"unexpected end of file" (syntax error) on every atom-disagg run.

Add the missing fi after the RDMA mount logging block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 6 total unresolved issues (including 4 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 36268fe. Configure here.

Comment thread benchmarks/multi_node/amd_utils/server_atom.sh
Comment thread benchmarks/multi_node/amd_utils/server_atom.sh
seungrokj and others added 3 commits June 8, 2026 23:40
env_atom.sh called ibv_devinfo inside the container to detect RDMA devices,
but: (1) IBDEVICES was never passed into the container from job.slurm, and
(2) ibv_devinfo may not see the ionic NIC devices in the container environment.

For ATOM/mooncake, IBDEVICES is not needed as a server argument (unlike SGLang's
--disaggregation-ib-device). Mooncake uses proxy_ip/handshake_port for KV
transfer routing.

- job.slurm: pass IBDEVICES into the container via DOCKER_ENV_ENGINE for atom-disagg
- env_atom.sh: make IBDEVICES detection non-fatal with a warning instead of exit 1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n container

The ATOM Docker image doesn't have the iproute2 `ip` command, so
`ip route get 1.1.1.1` failed silently, leaving host_ip empty. This caused:
- sync.py barrier to error: --local-ip expected one argument
- proxy_ip in --kv-transfer-config to be "" instead of the real IP

Fall back to `hostname -I` when `ip` is unavailable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Revert HF hub cache approach. atom-disagg now resolves the model the same
way as sglang-disagg: via MODEL_DIR/MODEL_NAME mounted at /models in the
container (MODEL_DIR defaults to /it-share/data on mia1 nodes).

- job.slurm: remove atom-disagg model-path branch (falls through to SGLang
  filesystem check), drop HF_HUB_CACHE/MODEL_HF_ID from DOCKER_ENV_ENGINE,
  remove /it-share/hf-hub-cache bind-mount
- server_atom.sh: revert --model arg to ${MODEL_DIR}/${MODEL_NAME}
- dsv4_fp4_mi355x_atom-disagg.sh: remove MODEL_HF_ID export

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@seungrokj seungrokj changed the title [AMD] atom-disagg: add multi-node ATOM/mooncake disaggregation support [AMD] dsv4-fp4-mi355x-atom-disagg, add multi-node ATOM/mooncake disaggregation support Jun 8, 2026
@seungrokj seungrokj added the AMD label Jun 8, 2026
Switch from 2P1D to 1P1D (prefill-num-worker: 2 -> 1, PREFILL_NODES=2 -> 1)
for initial spot-check of the atom-disagg pipeline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@functionstackx
Copy link
Copy Markdown
Collaborator

hi @seungrokj thanks for the contribution can u add full-sweep-enabled label to this so that u can validate this PR works?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants