[AMD] dsv4-fp4-mi355x-atom-disagg, add multi-node ATOM/mooncake disaggregation support #1683
base: main
Changes from all commits
4d6b3f3
ff429df
3167f19
36268fe
5104b00
fcddddf
e20731a
0eea91c
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| #!/bin/bash | ||
| # ATOM/mooncake-specific environment setup for multi-node disaggregated serving. | ||
| # | ||
| # Sourced by server_atom.sh in place of env.sh (which is SGLang/MoRI-specific). | ||
| # | ||
| # REQUIRED ENVIRONMENT VARIABLES: | ||
| # IBDEVICES - RDMA/InfiniBand device names (e.g., ionic_0,ionic_1,...) | ||
| # Set by runner or auto-detected from hostname. | ||
|
|
||
| set -x | ||
|
|
||
| export PYTHONUNBUFFERED=1 | ||
| export PYTHONDONTWRITEBYTECODE=1 | ||
|
|
||
| # ============================================================================= | ||
| # IBDEVICES detection (same as env.sh) | ||
| # ============================================================================= | ||
|
|
||
| if [[ -z "$IBDEVICES" ]]; then | ||
| DETECTED=$(ibv_devinfo 2>/dev/null | grep "hca_id:" | awk '{print $2}' | paste -sd',') | ||
| if [[ -n "$DETECTED" ]]; then | ||
| export IBDEVICES="$DETECTED" | ||
| echo "[INFO] Auto-detected IBDEVICES=$IBDEVICES via ibv_devinfo on $(hostname -s)" | ||
| else | ||
| # ATOM uses mooncake proxy_ip/handshake_port for KV transfer — IBDEVICES is | ||
| # not passed as a server argument (unlike SGLang --disaggregation-ib-device). | ||
| # Log a warning but do not fail; mooncake will use its own RDMA device selection. | ||
| echo "[WARN] Unable to detect RDMA devices via ibv_devinfo; IBDEVICES unset (non-fatal for ATOM/mooncake)" >&2 | ||
| fi | ||
| else | ||
| echo "[INFO] Using IBDEVICES=$IBDEVICES (set by runner or environment)" | ||
| fi | ||
| export IBDEVICES | ||
|
|
||
| # ============================================================================= | ||
| # ATOM/mooncake-specific environment | ||
| # ============================================================================= | ||
|
|
||
| # mooncake RDMA KV transfer library path | ||
| export LD_LIBRARY_PATH=/opt/venv/lib/python3.10/site-packages/mooncake:/opt/rocm/lib:${LD_LIBRARY_PATH:-} | ||
|
|
||
| # ATOM MoE gather/scatter interleave optimization | ||
| export ATOM_MOE_GU_ITLV=1 | ||
|
|
||
| # ATOM_HOST_IP is set per-node in server_atom.sh (= host_ip, used as handshake IP) | ||
|
|
||
| # aiter logging (WARNING to reduce noise; use DEBUG for troubleshooting) | ||
| export AITER_LOG_LEVEL=WARNING | ||
|
|
||
| # Disable bf16->fp8 MoE bound (matches reference script) | ||
| export AITER_BF16_FP8_MOE_BOUND=0 | ||
|
|
||
| # Clear stale ATOM cache on startup (server_atom.sh handles this via rm -rf) | ||
| # No env var needed; documented here for reference. | ||
|
|
||
| set +x | ||
|
|
||
| echo "[INFO] ATOM env: IBDEVICES=$IBDEVICES LD_LIBRARY_PATH includes mooncake" |
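The auto-detection pipeline in this env script can be exercised without RDMA hardware by feeding it captured text. A minimal sketch, assuming `ibv_devinfo` prints one `hca_id:` line per device, as the script's own `grep` implies; the function name `parse_hca_ids` and the sample text are hypothetical and not part of the PR:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): turn ibv_devinfo-style text on stdin
# into a comma-separated device list, mirroring the grep | awk | paste pipeline.
parse_hca_ids() {
  awk '/hca_id:/ {print $2}' | paste -sd, -
}

# Illustrative input only; real ibv_devinfo output has more fields per device.
sample_devinfo='hca_id: ionic_0
    transport: InfiniBand (0)
hca_id: ionic_1
    transport: InfiniBand (0)'

printf '%s\n' "$sample_devinfo" | parse_hca_ids
```

Passing `-` to `paste` makes the stdin operand explicit, which some non-GNU `paste` implementations require.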
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -25,8 +25,8 @@ echo "" | |
| # at runtime, but the CWD remains the submit-time directory (amd_utils/). | ||
| if [[ "$ENGINE" == "vllm-disagg" ]]; then | ||
| MODELS_YAML="$(pwd)/models_vllm.yaml" | ||
| else | ||
| MODELS_YAML="$(pwd)/models.yaml" | ||
| elif [[ "$ENGINE" == "atom-disagg" ]]; then | ||
| MODELS_YAML="$(pwd)/models_atom.yaml" | ||
|
sglang models.yaml path dropped (High Severity). Replacing the final … Reviewed by Cursor Bugbot for commit 4d6b3f3. |
||
| fi | ||
|
Comment on lines 26 to 30
Contributor
🔴 Regression: …
Extended reasoning...
Bug
if [[ "$ENGINE" == "vllm-disagg" ]]; then
  MODELS_YAML="$(pwd)/models_vllm.yaml"
elif [[ "$ENGINE" == "atom-disagg" ]]; then
  MODELS_YAML="$(pwd)/models_atom.yaml"
fi
When …
Trigger
Immediately after, at …
if [[ ! -f "$MODELS_YAML" ]]; then
  echo "Error: models YAML not found at $MODELS_YAML"
  exit 1
fi
Step-by-step proof
Impact
This regresses every pre-existing sglang-disagg recipe in …
Fix
Add an `else` branch:
if [[ "$ENGINE" == "vllm-disagg" ]]; then
  MODELS_YAML="$(pwd)/models_vllm.yaml"
elif [[ "$ENGINE" == "atom-disagg" ]]; then
  MODELS_YAML="$(pwd)/models_atom.yaml"
else
  MODELS_YAML="$(pwd)/models.yaml"
fi |
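The fix proposed in this thread can also be written as a `case` statement, where the default arm makes the sglang fallback hard to drop by accident. A sketch only: the function name `models_yaml_for` and the `/work` directory are hypothetical, while the file names are the ones shown in the diff:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): map an ENGINE value to its models YAML.
# The default arm keeps sglang-disagg, and an empty ENGINE, on models.yaml.
models_yaml_for() {
  local engine=${1:-} dir=${2:-$PWD}
  case "$engine" in
    vllm-disagg) echo "$dir/models_vllm.yaml" ;;
    atom-disagg) echo "$dir/models_atom.yaml" ;;
    *)           echo "$dir/models.yaml" ;;
  esac
}

models_yaml_for sglang-disagg /work
models_yaml_for atom-disagg /work
```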
||
|
|
||
| if [[ ! -f "$MODELS_YAML" ]]; then | ||
|
|
@@ -402,6 +402,20 @@ if [[ "$ENGINE" == "vllm-disagg" ]]; then | |
| -e VLLM_MORIIO_CONNECTOR_READ_MODE=\${VLLM_MORIIO_CONNECTOR_READ_MODE:-1} | ||
| -e PYTHONPYCACHEPREFIX=/tmp/pycache | ||
| ) | ||
| elif [[ "$ENGINE" == "atom-disagg" ]]; then | ||
| DOCKER_ENV_ENGINE=( | ||
| -e ATOM_WS_PATH=${WS_PATH} | ||
| -e PREFILL_PORT=${PREFILL_PORT:-8010} | ||
| -e DECODE_PORT=${DECODE_PORT:-8020} | ||
| -e ROUTER_PORT=${ROUTER_PORT:-30000} | ||
| -e HANDSHAKE_PORT=${HANDSHAKE_PORT:-6301} | ||
| -e MEM_FRACTION=${MEM_FRACTION:-0.85} | ||
| -e KV_CACHE_DTYPE=${KV_CACHE_DTYPE:-fp8} | ||
| -e BLOCK_SIZE=${BLOCK_SIZE:-16} | ||
| -e MAX_NUM_SEQS=${MAX_NUM_SEQS:-256} | ||
| -e EXTRA_SERVER_ARGS=\${EXTRA_SERVER_ARGS:-} | ||
| -e IBDEVICES=${IBDEVICES:-} | ||
| ) | ||
| else | ||
| DOCKER_ENV_ENGINE=( | ||
| -e SGLANG_WS_PATH=${WS_PATH} | ||
|
|
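In the hunk above, most defaults are written as `${VAR:-default}` while `EXTRA_SERVER_ARGS` is written as `\${EXTRA_SERVER_ARGS:-}`. The sketch below shows what that difference means if the surrounding text is emitted from an unquoted heredoc or double-quoted string. That is an assumption suggested by the `\"` and `\$` escapes elsewhere in the diff; the generator function `render_env` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical generator (not the PR's code): emit two -e flags from an
# unquoted heredoc. The unescaped default is resolved while generating the
# text; the escaped one is left for whichever shell runs the generated text.
render_env() {
  cat <<EOF
-e PREFILL_PORT=${PREFILL_PORT:-8010}
-e EXTRA_SERVER_ARGS=\${EXTRA_SERVER_ARGS:-}
EOF
}

unset PREFILL_PORT EXTRA_SERVER_ARGS
render_env
```

Under that assumption, the port defaults are fixed when the job script is generated, whereas `EXTRA_SERVER_ARGS` is read from the environment of the node that runs it.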
@@ -425,6 +439,83 @@ echo \"Rank \$SLURM_PROCID on \$(hostname)\" | |
| eval \"\$DOCKER_CMD_DETECT\" | ||
| echo \"[docker-detect] rank \$SLURM_PROCID: DOCKER_CMD=\$DOCKER_CMD\" | ||
|
|
||
| # Enable out-of-tree RDMA library mounts for atom-disagg (mooncake requires host RDMA stack) | ||
| RDMA_MOUNTS=() | ||
| if [[ "$ENGINE" == "atom-disagg" ]]; then | ||
|
Contributor
🔴 Missing … Extended reasoning... What breaks. Line 443 opens … |
||
|
|
||
| # When the container base OS differs from the host (e.g. Ubuntu 24.04 image | ||
| # on a 22.04 host), the container's bundled libibverbs/libionic may be | ||
| # ABI-incompatible with the host kernel drivers. Detect the NIC type and | ||
| # bind-mount the host's out-of-tree RDMA userspace libraries into the | ||
| # container so the RDMA stack always matches the running kernel. | ||
| _detect_nic_type() { | ||
| if [[ -n \"\${MORI_NIC_TYPE:-}\" ]]; then echo \"\$MORI_NIC_TYPE\"; return; fi | ||
| local bnxt=0 mlx5=0 ionic=0 | ||
| if [[ -d /sys/class/infiniband ]]; then | ||
| for dev in /sys/class/infiniband/*; do | ||
| local name; name=\$(basename \"\$dev\") | ||
| case \"\$name\" in | ||
| bnxt_re*) ((bnxt++)) ;; mlx5*) ((mlx5++)) ;; ionic*) ((ionic++)) ;; | ||
| *) | ||
| local drv; drv=\$(basename \"\$(readlink -f \"\$dev/device/driver\" 2>/dev/null)\" 2>/dev/null || true) | ||
| case \"\$drv\" in bnxt*) ((bnxt++)) ;; mlx5*) ((mlx5++)) ;; ionic*) ((ionic++)) ;; esac ;; | ||
|
set -e breaks NIC counting (Medium Severity). Inside the new RDMA helper, … Reviewed by Cursor Bugbot for commit 4d6b3f3. |
||
| esac | ||
| done | ||
| fi | ||
| if (( bnxt >= mlx5 && bnxt >= ionic && bnxt > 0 )); then echo bnxt | ||
| elif (( ionic >= mlx5 && ionic > 0 )); then echo ionic | ||
| else echo mlx5; fi | ||
| } | ||
|
|
||
| _find_host_ibverbs() { | ||
| for c in /usr/lib64/libibverbs.so.1 /lib/x86_64-linux-gnu/libibverbs.so.1 /usr/lib/x86_64-linux-gnu/libibverbs.so.1; do | ||
| local r; r=\$(readlink -f \"\$c\" 2>/dev/null || true) | ||
| if [[ -f \"\$r\" ]]; then echo \"\$r\"; return; fi | ||
| done | ||
| } | ||
|
|
||
| _NIC_TYPE=\$(_detect_nic_type) | ||
| echo \"[rdma] NIC type: \${_NIC_TYPE} on \$(hostname)\" | ||
|
|
||
| if [[ \"\$_NIC_TYPE\" == \"ionic\" || \"\$_NIC_TYPE\" == \"bnxt\" ]]; then | ||
| _host_ibv=\$(_find_host_ibverbs) | ||
| if [[ -n \"\$_host_ibv\" ]]; then | ||
| RDMA_MOUNTS+=(-v \"\$_host_ibv:/lib/x86_64-linux-gnu/libibverbs.so.1\") | ||
| fi | ||
| fi | ||
|
|
||
| if [[ \"\$_NIC_TYPE\" == \"ionic\" ]]; then | ||
| for _dir in /usr/local/lib /usr/lib/x86_64-linux-gnu; do | ||
| for _lib in \"\$_dir\"/libionic*.so; do | ||
| [[ -f \"\$_lib\" ]] || continue | ||
| _real=\$(readlink -f \"\$_lib\") | ||
| [[ -f \"\$_real\" ]] && RDMA_MOUNTS+=(-v \"\$_real:\$_real\") | ||
| RDMA_MOUNTS+=(-v \"\$_lib:/usr/lib/x86_64-linux-gnu/\$(basename \"\$_lib\")\") | ||
| done | ||
| done | ||
| if [[ -d /usr/lib/x86_64-linux-gnu/libibverbs ]]; then | ||
| for _lib in /usr/lib/x86_64-linux-gnu/libibverbs/libionic-rdmav*.so; do | ||
| [[ -f \"\$_lib\" ]] && RDMA_MOUNTS+=(-v \"\$_lib:\$_lib\") | ||
| done | ||
| fi | ||
| [[ -d /etc/libibverbs.d ]] && RDMA_MOUNTS+=(-v /etc/libibverbs.d:/etc/libibverbs.d:ro) | ||
| elif [[ \"\$_NIC_TYPE\" == \"bnxt\" ]]; then | ||
| for _lib in /usr/local/lib/libbnxt_re-rdmav*.so; do | ||
| [[ -f \"\$_lib\" ]] && RDMA_MOUNTS+=(-v \"\$_lib:/usr/lib/x86_64-linux-gnu/libibverbs/\$(basename \"\$_lib\")\") | ||
| done | ||
| for _lib in /usr/local/lib/libbnxt_re.so; do | ||
| [[ -f \"\$_lib\" ]] && RDMA_MOUNTS+=(-v \"\$_lib:/usr/lib/x86_64-linux-gnu/\$(basename \"\$_lib\")\") | ||
| done | ||
| [[ -d /etc/libibverbs.d ]] && RDMA_MOUNTS+=(-v /etc/libibverbs.d:/etc/libibverbs.d:ro) | ||
| fi | ||
|
|
||
| if [[ \${#RDMA_MOUNTS[@]} -gt 0 ]]; then | ||
| echo \"[rdma] bind-mounts: \${RDMA_MOUNTS[*]}\" | ||
| else | ||
| echo \"[rdma] no out-of-tree RDMA mounts needed\" | ||
| fi | ||
|
cursor[bot] marked this conversation as resolved.
|
||
| fi # end: if ENGINE == atom-disagg | ||
|
|
||
| # Pre-clean (idempotent) | ||
| \$DOCKER_CMD ps -aq --filter \"$CONT_FILTER\" | xargs -r \$DOCKER_CMD rm -f || true | ||
| \$DOCKER_CMD ps -aq | xargs -r \$DOCKER_CMD stop || true | ||
|
|
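On the "set -e breaks NIC counting" comment in this hunk: in bash, `((n++))` returns a non-zero status whenever the old value of `n` is 0, so the first increment of each counter would abort a script running under `set -e`. The sketch below counts with plain assignments instead. It assumes `set -e` is active where `_detect_nic_type` runs, as the comment title suggests; `count_nics` is a hypothetical stand-in that takes device names as arguments rather than reading `/sys/class/infiniband`:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Demonstration: the post-increment evaluates to the old value (0), which
# (( )) reports as failure. The || guard keeps set -e from exiting here.
n=0
((n++)) || echo "((n++)) returned non-zero because the old value was 0"

# Hypothetical stand-in for _detect_nic_type: same majority rule as the diff,
# but increments are assignments, which always succeed.
count_nics() {
  local bnxt=0 mlx5=0 ionic=0 name
  for name in "$@"; do
    case "$name" in
      bnxt_re*) bnxt=$((bnxt + 1)) ;;
      mlx5*)    mlx5=$((mlx5 + 1)) ;;
      ionic*)   ionic=$((ionic + 1)) ;;
    esac
  done
  if (( bnxt >= mlx5 && bnxt >= ionic && bnxt > 0 )); then echo bnxt
  elif (( ionic >= mlx5 && ionic > 0 )); then echo ionic
  else echo mlx5; fi
}

count_nics ionic_0 ionic_1 mlx5_0
```

The `(( ... ))` tests in the `if`/`elif` chain are safe under `set -e` because commands in a condition position do not trigger an exit.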
@@ -490,6 +581,7 @@ fi | |
| -v ${BENCHMARK_LOGS_DIR}:/benchmark_logs \ | ||
| -v ${DI_REPO_DIR}:${DOCKER_MOUNT_PATH} \ | ||
| ${EXTRA_DOCKER_MOUNTS:-} \ | ||
| \${RDMA_MOUNTS[@]+"\${RDMA_MOUNTS[@]}"} \ | ||
| ${DOCKER_ENV_COMMON[*]} \ | ||
| ${DOCKER_ENV_ENGINE[*]} \ | ||
| --name \"$DOCKER_CONT_NAME\" \ | ||
|
|
||
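The `${RDMA_MOUNTS[@]+"${RDMA_MOUNTS[@]}"}` form added to the `docker run` arguments is a common bash idiom for expanding a possibly empty array: it yields no words when the array has no elements, avoids the unbound-variable error that older bash versions raise for `"${arr[@]}"` under `set -u`, and otherwise keeps each element as one word. A small sketch with a hypothetical array and a made-up mount path:

```bash
#!/usr/bin/env bash
set -u

# Print how many arguments were received.
count_args() { echo "$#"; }

mounts=()
# Empty array: the expansion contributes no words at all.
count_args ${mounts[@]+"${mounts[@]}"}

# Made-up path containing a space, to show elements are not re-split.
mounts+=(-v "/host/lib dir/libfoo.so:/usr/lib/libfoo.so")
count_args ${mounts[@]+"${mounts[@]}"}
```

The first call reports 0 arguments and the second reports 2. Whether the quoting survives in the PR depends on how the generated `docker run` line is finally evaluated, which this excerpt does not show.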
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,72 @@ | ||
| # Model-specific SGLang server configurations for disaggregated inference. | ||
| # | ||
| # Each top-level key is a MODEL_NAME value (must match the directory name under MODEL_DIR). | ||
| # | ||
| # To add a new model: add a new top-level entry following the same schema. | ||
| # No script changes are required. | ||
| # | ||
| # Schema: | ||
| # <model-name>: | ||
| # base_flags: str # Common flags for both prefill and decode | ||
| # mtp_flags: str # Appended to decode when DECODE_MTP_SIZE > 0 | ||
| # dp_flags: str # Appended when DP is enabled (prefill or decode) | ||
| # prefill: | ||
| # mem_fraction_static: float | ||
| # disable_radix_cache: bool | ||
| # dp: # Config when data-parallel attention is enabled | ||
| # max_running_requests: int | ||
| # chunked_prefill_size: str # Can be integer or bash arithmetic expression | ||
| # cuda_graph_bs: str # Space-separated values | ||
| # no_dp: # Config when data-parallel attention is disabled | ||
| # max_running_requests: int | ||
| # chunked_prefill_size: int | ||
| # cuda_graph_bs_range: str # "start-end" expanded via seq | ||
| # decode: | ||
| # mem_fraction_static: float | ||
| # prefill_round_robin_balance: bool | ||
| # dp: | ||
| # max_running_requests: int | ||
| # chunked_prefill_size: str | ||
| # cuda_graph_bs_range: str | ||
| # ep_only: # Config when EP is enabled but DP is disabled | ||
| # max_running_requests: int | ||
| # chunked_prefill_size: int | ||
| # cuda_graph_bs_range: str | ||
| # no_dp: | ||
| # max_running_requests: int | ||
| # chunked_prefill_size: int | ||
| # cuda_graph_bs_range: str | ||
|
|
||
| DeepSeek-V4-Pro: | ||
| # ATOM engine (atom-disagg): server_atom.sh uses MEM_FRACTION/KV_CACHE_DTYPE/BLOCK_SIZE/MAX_NUM_SEQS | ||
| # directly from env vars (defaulting to 0.85/fp8/16/256). base_flags/dp_flags are not used by | ||
| # server_atom.sh; they are kept here for documentation and potential future use. | ||
| base_flags: "" | ||
| mtp_flags: "" | ||
| dp_flags: "" | ||
| prefill: | ||
| mem_fraction_static: 0.85 | ||
| disable_radix_cache: true | ||
| dp: | ||
| max_running_requests: 256 | ||
| chunked_prefill_size: 262144 | ||
| cuda_graph_bs: "1 2 3" | ||
| no_dp: | ||
| max_running_requests: 256 | ||
| chunked_prefill_size: 262144 | ||
| cuda_graph_bs_range: "1-128" | ||
| decode: | ||
| mem_fraction_static: 0.85 | ||
| prefill_round_robin_balance: false | ||
| dp: | ||
| max_running_requests: 256 | ||
| chunked_prefill_size: 262144 | ||
| cuda_graph_bs_range: "1-256" | ||
| ep_only: | ||
| max_running_requests: 256 | ||
| chunked_prefill_size: 262144 | ||
| cuda_graph_bs_range: "1-256" | ||
| no_dp: | ||
| max_running_requests: 256 | ||
| chunked_prefill_size: 262144 | ||
| cuda_graph_bs_range: "1-256" |
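The schema comment says `cuda_graph_bs_range` is a `"start-end"` string expanded via `seq`. How the server scripts perform that expansion is not shown in this PR; the following is one way it could look, with `expand_bs_range` as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): expand "start-end" into a
# space-separated list of batch sizes using seq.
expand_bs_range() {
  local range=$1
  local start=${range%-*} end=${range#*-}
  seq "$start" "$end" | paste -sd' ' -
}

expand_bs_range "1-4"
```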
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,19 +1,23 @@ | ||
| #!/bin/bash | ||
| # Dual-Engine Disaggregated Server Dispatcher | ||
| # Multi-Engine Disaggregated Server Dispatcher | ||
| # ============================================================================= | ||
| # Dispatches to the engine-specific server launcher based on ENGINE env var. | ||
| # ENGINE=sglang-disagg (default) -> server_sglang.sh (SGLang + MoRI) | ||
| # ENGINE=vllm-disagg -> server_vllm.sh (vLLM + Nixl/MoRI-IO) | ||
| # ENGINE=atom-disagg -> server_atom.sh (ATOM + mooncake) | ||
| # ============================================================================= | ||
|
|
||
| ENGINE="${ENGINE:-sglang-disagg}" | ||
| WS_PATH="${WS_PATH:-${SGLANG_WS_PATH:-${VLLM_WS_PATH:-$(dirname "${BASH_SOURCE[0]}")}}}" | ||
| WS_PATH="${WS_PATH:-${SGLANG_WS_PATH:-${VLLM_WS_PATH:-${ATOM_WS_PATH:-$(dirname "${BASH_SOURCE[0]}")}}}}" | ||
| export WS_PATH ENGINE | ||
|
|
||
| echo "[DISPATCHER] ENGINE=$ENGINE WS_PATH=$WS_PATH" | ||
|
|
||
| if [[ "$ENGINE" == "vllm-disagg" ]]; then | ||
| source "$WS_PATH/server_vllm.sh" | ||
| elif [[ "$ENGINE" == "atom-disagg" ]]; then | ||
| export ATOM_WS_PATH="$WS_PATH" | ||
| source "$WS_PATH/server_atom.sh" | ||
| else | ||
| source "$WS_PATH/server_sglang.sh" | ||
| fi |
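The dispatcher makes two decisions: pick the first non-empty workspace variable, then map `ENGINE` to a launcher file. The sketch below restates both as small functions that can be checked in isolation. The function names and the `/opt/atom_ws` and `/fallback` paths are hypothetical; the variable and file names come from the diff:

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not in the PR) restating the dispatcher's logic.

# First non-empty of WS_PATH, SGLANG_WS_PATH, VLLM_WS_PATH, ATOM_WS_PATH;
# otherwise the fallback passed as $1 (the script's own directory in the PR).
resolve_ws_path() {
  echo "${WS_PATH:-${SGLANG_WS_PATH:-${VLLM_WS_PATH:-${ATOM_WS_PATH:-$1}}}}"
}

# Map an ENGINE value to its launcher; unknown or empty values use SGLang.
launcher_for() {
  case "${1:-sglang-disagg}" in
    vllm-disagg) echo server_vllm.sh ;;
    atom-disagg) echo server_atom.sh ;;
    *)           echo server_sglang.sh ;;
  esac
}

unset WS_PATH SGLANG_WS_PATH VLLM_WS_PATH
ATOM_WS_PATH=/opt/atom_ws
echo "$(resolve_ws_path /fallback)/$(launcher_for atom-disagg)"
```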


🔴 The new `dsv4-fp4-mi355x-atom-disagg` entry has `framework: atom`, but the companion edit to `runners/launch_mi355x-amds.sh:80` (also in this PR) explicitly routes only `atom-disagg` (not `atom`) to `BENCHMARK_SUBDIR=multi_node`, and line 79 builds `SCRIPT_NAME=dsv4_fp4_mi355x_${FRAMEWORK}.sh`, so with `framework: atom` the launcher resolves to the existing single-node script `benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_atom.sh` instead of the newly added `benchmarks/multi_node/dsv4_fp4_mi355x_atom-disagg.sh`. Every other multi-node disagg recipe in this file uses `<engine>-disagg` (sglang-disagg / vllm-disagg). Fix: change `framework: atom` → `framework: atom-disagg` on line 2725.

Extended reasoning...

What the bug is

`.github/configs/amd-master.yaml` defines a new sweep recipe `dsv4-fp4-mi355x-atom-disagg` with `multinode: true` and `disagg: true`, but sets `framework: atom` (line 2725). The value should be `atom-disagg` to match the `<engine>-disagg` convention used by every other multi-node disagg recipe in this file (e.g. `sglang-disagg`, `vllm-disagg`).

Why this is broken end-to-end

The `framework` field flows directly into the runner as `FRAMEWORK`. In `runners/launch_mi355x-amds.sh`, the multinode branch does two things that both key off this value: it builds `SCRIPT_NAME` from `${FRAMEWORK}` (line 79) and it selects `BENCHMARK_SUBDIR` (line 80).

The fact that this same PR added `atom-disagg` to line 80 makes the intended value self-evident: if the YAML were meant to say `atom`, the runner edit wouldn't be needed at all.

Step-by-step proof

With the current `framework: atom` value in the new entry, here's what happens when the sweep dispatches:

1. `EXP_NAME=dsv4-fp4-mi355x-atom-disagg`, `PRECISION=fp4`, `FRAMEWORK=atom`.
2. `SCRIPT_NAME = "${EXP_NAME%%_*}_fp4_mi355x_atom.sh"`. `${EXP_NAME%%_*}` strips the longest `_*` suffix, but `EXP_NAME` uses hyphens, not underscores, so the result is `dsv4-fp4-mi355x-atom-disagg`. Either way, the `${FRAMEWORK}` substitution gives `...mi355x_atom.sh` instead of `...mi355x_atom-disagg.sh`.
3. `FRAMEWORK=atom` does not match any of `sglang-disagg | vllm-disagg | atom-disagg` → `BENCHMARK_SUBDIR=single_node/fixed_seq_len`.
4. `bash benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_atom.sh`. That file already exists on main (it's the existing single-node atom recipe for DSv4), so the runner silently runs the wrong, single-node script instead of the newly added `benchmarks/multi_node/dsv4_fp4_mi355x_atom-disagg.sh`.

The newly added multi-node launcher (which is the point of this PR) is never invoked.

Impact

The new disagg recipe is non-functional as written. CI will appear to schedule the sweep but will dispatch the existing single-node atom benchmark, producing misleading results that look like atom-disagg numbers but are actually single-node `atom` runs.

Fix

One-line change in `.github/configs/amd-master.yaml` (line 2725): `framework: atom` → `framework: atom-disagg`.

After the fix: `SCRIPT_NAME` resolves to `dsv4_fp4_mi355x_atom-disagg.sh` (the file added in this PR), `BENCHMARK_SUBDIR=multi_node`, and the multinode dispatch chain (`job.slurm` → `server.sh` → `server_atom.sh`), all of which test `ENGINE`/`FRAMEWORK == "atom-disagg"`, kicks in correctly.
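The routing described in this comment can be reproduced in a few lines. This is a paraphrase of the reviewer's description, not the actual contents of `runners/launch_mi355x-amds.sh`; the function name `resolve_script` is hypothetical and the `dsv4_fp4_mi355x` prefix is hard-coded for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical restatement of the runner's routing as described in the review:
# the framework value selects both the benchmark subdirectory and the script name.
resolve_script() {
  local framework=$1 subdir
  case "$framework" in
    sglang-disagg|vllm-disagg|atom-disagg) subdir=multi_node ;;
    *)                                     subdir=single_node/fixed_seq_len ;;
  esac
  echo "benchmarks/$subdir/dsv4_fp4_mi355x_${framework}.sh"
}

resolve_script atom          # value currently in the YAML entry
resolve_script atom-disagg   # value proposed by the reviewer
```

With `atom` the sketch resolves to the single-node path named in the review; with `atom-disagg` it resolves to the multi-node script added by this PR.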