33 changes: 33 additions & 0 deletions .github/configs/amd-master.yaml
@@ -2716,3 +2716,36 @@ dsv4-fp4-mi355x-sglang-agentic:
# async scheduling, max-num-seqs=128, max-num-batched-tokens=8192,
# gpu-mem-util=0.6. TP8 sweeps conc 4-64; DEP8 has a single conc=64
# probe to validate the ROCm DP+EP path.

dsv4-fp4-mi355x-atom-disagg:
  #TODO: (srok), temporary dev img. will update
  image: rocm/atom-dev:nightly_202606071111-Jasen-fix_dockerfile
  model: deepseek-ai/DeepSeek-V4-Pro
  model-prefix: dsv4
  runner: mi355x

🔴 The new dsv4-fp4-mi355x-atom-disagg entry has framework: atom, but the companion edit to runners/launch_mi355x-amds.sh:80 (also in this PR) explicitly routes only atom-disagg (not atom) to BENCHMARK_SUBDIR=multi_node, and line 79 builds SCRIPT_NAME=dsv4_fp4_mi355x_${FRAMEWORK}.sh. With framework: atom, the launcher therefore resolves to the existing single-node script benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_atom.sh instead of the newly added benchmarks/multi_node/dsv4_fp4_mi355x_atom-disagg.sh. Every other multi-node disagg recipe in this file uses <engine>-disagg (sglang-disagg / vllm-disagg). Fix: change framework: atom → framework: atom-disagg on line 2725.

Extended reasoning...

What the bug is

.github/configs/amd-master.yaml defines a new sweep recipe dsv4-fp4-mi355x-atom-disagg with multinode: true and disagg: true, but sets framework: atom (line 2725). The value should be atom-disagg to match the <engine>-disagg convention used by every other multi-node disagg recipe in this file (e.g. sglang-disagg, vllm-disagg).

Why this is broken end-to-end

The framework field flows directly into the runner as FRAMEWORK. In runners/launch_mi355x-amds.sh, the multinode branch does two things that both key off this value:

# line 79
SCRIPT_NAME="${EXP_NAME%%_*}_${PRECISION}_mi355x_${FRAMEWORK}.sh"
# line 80 (this PR explicitly added the 'atom-disagg' clause)
if [[ "$FRAMEWORK" == "sglang-disagg" ]] || [[ "$FRAMEWORK" == "vllm-disagg" ]] || [[ "$FRAMEWORK" == "atom-disagg" ]]; then
    BENCHMARK_SUBDIR="multi_node"
else
    BENCHMARK_SUBDIR="single_node/fixed_seq_len"
fi

The fact that this same PR added atom-disagg to line 80 makes the intended value self-evident — if the YAML were meant to say atom, the runner edit wouldn't be needed at all.
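
The routing on line 80 can be exercised in isolation. The sketch below is a minimal stand-in, not the launcher itself: resolve_subdir is a hypothetical helper name, and the case arms simply mirror the three engine names quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the line-80 condition; it is not a function
# that exists in runners/launch_mi355x-amds.sh.
resolve_subdir() {
    local framework="$1"
    case "$framework" in
        sglang-disagg|vllm-disagg|atom-disagg) echo "multi_node" ;;
        *)                                     echo "single_node/fixed_seq_len" ;;
    esac
}

# Compare the value currently in the YAML with the intended one.
for fw in atom atom-disagg; do
    printf '%-12s -> %s\n' "$fw" "$(resolve_subdir "$fw")"
done
```

Only the atom-disagg value reaches multi_node; plain atom falls through to the single-node default.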

Step-by-step proof

With the current framework: atom value in the new entry, here's what happens when the sweep dispatches:

  1. EXP_NAME=dsv4-fp4-mi355x-atom-disagg, PRECISION=fp4, FRAMEWORK=atom.
  2. Line 79: SCRIPT_NAME = "${EXP_NAME%%_*}_fp4_mi355x_atom.sh". ${EXP_NAME%%_*} strips the longest _* suffix, but EXP_NAME uses hyphens, not underscores, so the result is dsv4-fp4-mi355x-atom-disagg. Either way, the ${FRAMEWORK} substitution gives ...mi355x_atom.sh instead of ...mi355x_atom-disagg.sh.
  3. Line 80: FRAMEWORK=atom does not match any of sglang-disagg | vllm-disagg | atom-disagg → BENCHMARK_SUBDIR=single_node/fixed_seq_len.
  4. Final dispatch: bash benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_atom.sh. That file already exists on main (it's the existing single-node atom recipe for DSv4) — so the runner silently runs the wrong, single-node script instead of the newly-added benchmarks/multi_node/dsv4_fp4_mi355x_atom-disagg.sh.

The newly-added multi-node launcher (which is the point of this PR) is never invoked.
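
Step 2 leans on the ${EXP_NAME%%_*} expansion, which is easy to misread. A small sketch of what it does follows; the second input is a made-up name containing an underscore, included only for contrast, since the exact EXP_NAME the launcher receives is not shown in this thread.

```shell
#!/usr/bin/env bash
# ${name%%_*} removes the longest suffix matching "_*", i.e. everything from
# the first underscore onward. Without an underscore the value is unchanged.
strip_from_first_underscore() {
    local name="$1"
    echo "${name%%_*}"
}

strip_from_first_underscore "dsv4-fp4-mi355x-atom-disagg"  # hyphens only
strip_from_first_underscore "dsv4_example_suffix"          # hypothetical name
```

Whatever the prefix turns out to be, the trailing ${FRAMEWORK} substitution is what decides between the atom and atom-disagg script names.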

Impact

The new disagg recipe is non-functional as written. CI will appear to schedule the sweep but will dispatch the existing single-node atom benchmark, producing misleading results that look like atom-disagg numbers but are actually single-node atom runs.

Fix

One-line change in .github/configs/amd-master.yaml (line 2725):

-  framework: atom
+  framework: atom-disagg

After the fix: SCRIPT_NAME resolves to dsv4_fp4_mi355x_atom-disagg.sh (the file added in this PR), BENCHMARK_SUBDIR=multi_node, and the multinode dispatch chain (job.slurm → server.sh → server_atom.sh), all of which test ENGINE/FRAMEWORK == "atom-disagg", kicks in correctly.
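
A mismatch like this could also be caught before dispatch. The following is only a sketch of such a guard, assuming the naming convention described above holds for every disagg recipe; check_recipe_framework is a hypothetical name and nothing like it exists in the repository as far as this thread shows.

```shell
#!/usr/bin/env bash
# Hypothetical lint helper: a recipe whose name contains "-disagg" is
# expected to use a framework value that ends in "-disagg".
check_recipe_framework() {
    local recipe="$1" framework="$2"
    if [[ "$recipe" == *-disagg* && "$framework" != *-disagg ]]; then
        echo "mismatch: $recipe uses framework '$framework'" >&2
        return 1
    fi
    return 0
}

check_recipe_framework "dsv4-fp4-mi355x-atom-disagg" "atom-disagg" && echo "ok"
check_recipe_framework "dsv4-fp4-mi355x-atom-disagg" "atom" 2>/dev/null || echo "flagged"
```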

  precision: fp4
  framework: atom-disagg
  multinode: true
  disagg: true
  scenarios:
    fixed-seq-len:
    - isl: 1024
      osl: 1024
      search-space:
      # 1P1D DP+TP8
      # TODO: (srok), spot check
      - conc-list: [ 256 ]
        prefill:
          num-worker: 1
          tp: 8
          ep: 1
          dp-attn: true
          additional-settings:
          - "PREFILL_NODES=1"
        decode:
          num-worker: 1
          tp: 8
          ep: 1
          dp-attn: true
          additional-settings:
          - "DECODE_NODES=1"
58 changes: 58 additions & 0 deletions benchmarks/multi_node/amd_utils/env_atom.sh
@@ -0,0 +1,58 @@
#!/bin/bash
# ATOM/mooncake-specific environment setup for multi-node disaggregated serving.
#
# Sourced by server_atom.sh in place of env.sh (which is SGLang/MoRI-specific).
#
# REQUIRED ENVIRONMENT VARIABLES:
# IBDEVICES - RDMA/InfiniBand device names (e.g., ionic_0,ionic_1,...)
# Set by runner or auto-detected from hostname.

set -x

export PYTHONUNBUFFERED=1
export PYTHONDONTWRITEBYTECODE=1

# =============================================================================
# IBDEVICES detection (same as env.sh)
# =============================================================================

if [[ -z "$IBDEVICES" ]]; then
    DETECTED=$(ibv_devinfo 2>/dev/null | grep "hca_id:" | awk '{print $2}' | paste -sd',')
    if [[ -n "$DETECTED" ]]; then
        export IBDEVICES="$DETECTED"
        echo "[INFO] Auto-detected IBDEVICES=$IBDEVICES via ibv_devinfo on $(hostname -s)"
    else
        # ATOM uses mooncake proxy_ip/handshake_port for KV transfer — IBDEVICES is
        # not passed as a server argument (unlike SGLang --disaggregation-ib-device).
        # Log a warning but do not fail; mooncake will use its own RDMA device selection.
        echo "[WARN] Unable to detect RDMA devices via ibv_devinfo; IBDEVICES unset (non-fatal for ATOM/mooncake)" >&2
    fi
else
    echo "[INFO] Using IBDEVICES=$IBDEVICES (set by runner or environment)"
fi
export IBDEVICES

# =============================================================================
# ATOM/mooncake-specific environment
# =============================================================================

# mooncake RDMA KV transfer library path
export LD_LIBRARY_PATH=/opt/venv/lib/python3.10/site-packages/mooncake:/opt/rocm/lib:${LD_LIBRARY_PATH:-}

# ATOM MoE gather/scatter interleave optimization
export ATOM_MOE_GU_ITLV=1

# ATOM_HOST_IP is set per-node in server_atom.sh (= host_ip, used as handshake IP)

# aiter logging (WARNING to reduce noise; use DEBUG for troubleshooting)
export AITER_LOG_LEVEL=WARNING

# Disable bf16->fp8 MoE bound (matches reference script)
export AITER_BF16_FP8_MOE_BOUND=0

# Clear stale ATOM cache on startup (server_atom.sh handles this via rm -rf)
# No env var needed; documented here for reference.

set +x

echo "[INFO] ATOM env: IBDEVICES=$IBDEVICES LD_LIBRARY_PATH includes mooncake"
96 changes: 94 additions & 2 deletions benchmarks/multi_node/amd_utils/job.slurm
@@ -25,8 +25,8 @@ echo ""
# at runtime, but the CWD remains the submit-time directory (amd_utils/).
if [[ "$ENGINE" == "vllm-disagg" ]]; then
MODELS_YAML="$(pwd)/models_vllm.yaml"
- else
-     MODELS_YAML="$(pwd)/models.yaml"
+ elif [[ "$ENGINE" == "atom-disagg" ]]; then
+     MODELS_YAML="$(pwd)/models_atom.yaml"

sglang models.yaml path dropped

High Severity

Replacing the final else that set MODELS_YAML to models.yaml with only vllm-disagg and atom-disagg branches leaves the default ENGINE (sglang-disagg) without a MODELS_YAML value, so model validation fails before the job starts.

Reviewed by Cursor Bugbot for commit 4d6b3f3.

fi
Comment on lines 26 to 30
The reason will be displayed to describe this comment to others. Learn more.

🔴 Regression: MODELS_YAML is unset for the default ENGINE=sglang-disagg, breaking every existing sglang-disagg recipe. The new if/elif/fi at benchmarks/multi_node/amd_utils/job.slurm:26-30 dropped the else MODELS_YAML="$(pwd)/models.yaml" branch, so when ENGINE is sglang-disagg (the default at line 11) neither arm assigns MODELS_YAML. The very next check at line 32 ([[ ! -f "$MODELS_YAML" ]]) then fires for the empty path and aborts with Error: models YAML not found at . Fix: add else MODELS_YAML="$(pwd)/models.yaml" before the fi to restore the sglang-disagg default.

Extended reasoning...

Bug

benchmarks/multi_node/amd_utils/job.slurm:11 defaults the engine: ENGINE="${ENGINE:-sglang-disagg}". The previous MODELS_YAML selection had a two-arm if/else that always assigned MODELS_YAML: models_vllm.yaml for vllm-disagg, otherwise the SGLang default models.yaml. This PR replaced the else with an elif for the new atom-disagg engine, leaving no fallback assignment:

if [[ "$ENGINE" == "vllm-disagg" ]]; then
    MODELS_YAML="$(pwd)/models_vllm.yaml"
elif [[ "$ENGINE" == "atom-disagg" ]]; then
    MODELS_YAML="$(pwd)/models_atom.yaml"
fi

When ENGINE=sglang-disagg (the default), neither branch matches and MODELS_YAML stays unset. The top-level script does not enable set -u (only the inner srun bash -lc does), so $MODELS_YAML expands to the empty string.

Trigger

Immediately after, at job.slurm:32-35:

if [[ ! -f "$MODELS_YAML" ]]; then
    echo "Error: models YAML not found at $MODELS_YAML"
    exit 1
fi

[[ ! -f "" ]] evaluates true (the empty string is not a regular file), so the script prints Error: models YAML not found at (trailing empty) and exits 1.
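
The failure mode can be reproduced without the rest of job.slurm. The sketch below assumes nothing beyond standard bash behaviour; path_check_fires and DEMO_MODELS_YAML are hypothetical names used only for the demonstration.

```shell
#!/usr/bin/env bash
# The existence check from job.slurm, reduced to its condition.
path_check_fires() {
    [[ ! -f "$1" ]]
}

# An empty path is never a regular file, so the error branch is taken.
if path_check_fires ""; then
    echo "check fires for empty path"
fi

# A stricter alternative: ${var:?message} aborts the shell when the variable
# is unset or empty, which names the real problem instead of a missing file.
unset DEMO_MODELS_YAML
( : "${DEMO_MODELS_YAML:?was never assigned}" ) 2>/dev/null || echo "guard tripped"
```

Either enabling set -u at the top level or using the :? form would surface the missing assignment directly, rather than as a confusing empty path in the error message.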

Step-by-step proof

  1. User submits any existing sglang-disagg recipe (e.g. dsr1-fp4-mi355x-sglang-disagg from .github/configs/amd-master.yaml). The launcher does not override ENGINE.
  2. job.slurm:11 runs ENGINE="${ENGINE:-sglang-disagg}" → ENGINE=sglang-disagg.
  3. Line 26-30: $ENGINE is neither vllm-disagg nor atom-disagg. The if/elif/fi falls through with no else clause. MODELS_YAML is never assigned.
  4. Line 32: [[ ! -f "$MODELS_YAML" ]] expands to [[ ! -f "" ]] → true.
  5. Line 33: prints Error: models YAML not found at (with an empty path) to stdout, since the echo has no >&2 redirect.
  6. Line 34: exit 1 — job aborts before any docker/srun work.

Impact

This regresses every pre-existing sglang-disagg recipe in .github/configs/amd-master.yaml:

  • dsr1-fp4-mi355x-sglang-disagg
  • dsr1-fp8-mi355x-sglang-disagg (and -mtp variant)
  • dsr1-fp4-mi355x-sglang-disagg-mtp, -1k1k-mtp, -8k1k-mtp
  • glm5-fp8-mi355x-sglang-disagg
  • qwen3.5-fp4-mi355x-sglang-disagg
  • qwen3.5-fp8-mi355x-sglang-disagg

models.yaml is still on disk (benchmarks/multi_node/amd_utils/models.yaml, 14125 bytes), so this is purely a dispatch regression — the file the original code expected exists, but the script no longer points at it.

Fix

Add an else branch restoring the sglang default:

if [[ "$ENGINE" == "vllm-disagg" ]]; then
    MODELS_YAML="$(pwd)/models_vllm.yaml"
elif [[ "$ENGINE" == "atom-disagg" ]]; then
    MODELS_YAML="$(pwd)/models_atom.yaml"
else
    MODELS_YAML="$(pwd)/models.yaml"
fi
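
An equivalent way to express the same selection is a case statement with an explicit default arm, which makes a missing fallback harder to introduce when another engine is added. This is a sketch of that alternative, not the change made in the PR; select_models_yaml is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an engine name to its models YAML path.
# The default arm keeps the SGLang file for sglang-disagg and anything unknown.
select_models_yaml() {
    local engine="${1:-sglang-disagg}" dir="${2:-$(pwd)}"
    case "$engine" in
        vllm-disagg) echo "$dir/models_vllm.yaml" ;;
        atom-disagg) echo "$dir/models_atom.yaml" ;;
        *)           echo "$dir/models.yaml" ;;
    esac
}

for engine in sglang-disagg vllm-disagg atom-disagg; do
    printf '%-14s -> %s\n' "$engine" "$(select_models_yaml "$engine" /amd_utils)"
done
```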


if [[ ! -f "$MODELS_YAML" ]]; then
@@ -402,6 +402,20 @@ if [[ "$ENGINE" == "vllm-disagg" ]]; then
-e VLLM_MORIIO_CONNECTOR_READ_MODE=\${VLLM_MORIIO_CONNECTOR_READ_MODE:-1}
-e PYTHONPYCACHEPREFIX=/tmp/pycache
)
elif [[ "$ENGINE" == "atom-disagg" ]]; then
DOCKER_ENV_ENGINE=(
-e ATOM_WS_PATH=${WS_PATH}
-e PREFILL_PORT=${PREFILL_PORT:-8010}
-e DECODE_PORT=${DECODE_PORT:-8020}
-e ROUTER_PORT=${ROUTER_PORT:-30000}
-e HANDSHAKE_PORT=${HANDSHAKE_PORT:-6301}
-e MEM_FRACTION=${MEM_FRACTION:-0.85}
-e KV_CACHE_DTYPE=${KV_CACHE_DTYPE:-fp8}
-e BLOCK_SIZE=${BLOCK_SIZE:-16}
-e MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
-e EXTRA_SERVER_ARGS=\${EXTRA_SERVER_ARGS:-}
-e IBDEVICES=${IBDEVICES:-}
)
else
DOCKER_ENV_ENGINE=(
-e SGLANG_WS_PATH=${WS_PATH}
@@ -425,6 +439,83 @@ echo \"Rank \$SLURM_PROCID on \$(hostname)\"
eval \"\$DOCKER_CMD_DETECT\"
echo \"[docker-detect] rank \$SLURM_PROCID: DOCKER_CMD=\$DOCKER_CMD\"

# Enable out-of-tree RDMA library mounts for atom-disagg (mooncake requires host RDMA stack)
RDMA_MOUNTS=()
if [[ "$ENGINE" == "atom-disagg" ]]; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Missing fi to close the outer if [[ "$ENGINE" == "atom-disagg" ]]; then block opened at line 443: the gate never closes before # Pre-clean (idempotent) at line 517. Because bash must parse a compound if command in full before executing any of it, the parser runs through the rest of the inner script (lines 433-597 of the srun bash -lc "…" string) looking for the closer and fails with syntax error: unexpected end of file regardless of ENGINE. No rank ever reaches its docker invocation, which breaks sglang-disagg and vllm-disagg in addition to the new atom-disagg path. Add an fi just before line 517 to close the outer atom-disagg conditional.

Extended reasoning...

What breaks. Line 443 opens if [[ "$ENGINE" == "atom-disagg" ]]; then to gate the entire new RDMA-mount-detection block. Inside that block, the diff defines two helper functions and runs a series of nested if/fi blocks (line 453/463, line 464/466 inside a single-line if at 451, line 479/484, line 481/483 nested inside 479, line 486/509 with internal 495/499 and an elif branch, line 511/515). All of those inner pairs balance correctly. But the outer gate from line 443 is never closed before line 517 # Pre-clean (idempotent), where the script transitions out of atom-disagg-specific logic into the engine-agnostic docker pre-clean / run sequence.

Why every engine breaks, not just atom-disagg. This inner script lives inside an srun … bash -lc "…" string at lines 432-598. The whole quoted body is passed to bash -c as a single argument. bash parses each compound command in full before executing it, so the unclosed if swallows the rest of the string and the parser hits end of input, whatever the runtime value of $ENGINE. That means ENGINE=sglang-disagg and ENGINE=vllm-disagg also fail: even though they would never enter the atom-disagg branch logically, the parser still has to read past the opening if at line 443 looking for a closer that doesn't exist.

Why the outer bash -n job.slurm doesn't catch it. The unmatched if lives inside a double-quoted string at the outer level, so outer-script lexical analysis sees it as ordinary string content. The bug only surfaces when bash -c parses the inner content at runtime, which is exactly what every rank does under srun.

Proof. I extracted the body of the bash -lc "…" string (lines 433-597), unescaped \$ → $, \" → ", \\ → \, and \` → ` (standard bash double-quote unescaping rules), and ran bash -n on the result:

    $ bash -n /tmp/inner_script.sh
    /tmp/inner_script.sh: line N: syntax error: unexpected end of file
    $ echo $?
    2

Counting line-leading control tokens in the unescaped script confirms 12 if openers vs 11 fi closers: exactly one missing closer, and the only opener with no matching closer is line 443.

Impact. No docker container will launch on any rank for any disaggregated engine after this PR. srun's bash -c exits with status 2 before reaching the $MAYBE_EXEC $DOCKER_CMD run … line. The PR description's test plan (atom-disagg dry-run, sglang-disagg baseline comparison) cannot complete; nor can any prior sglang-disagg or vllm-disagg CI run that targets this file post-merge.

Fix. Add a closing fi immediately before line 517 to terminate the outer atom-disagg gate, so the if/fi structure becomes if atom-disagg; then [RDMA detection helpers + mount-building blocks]; fi; # Pre-clean (idempotent); ….
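
The effect is reproducible on a toy script. The sketch below is a stand-in for the real inner script, assuming only standard bash behaviour: it passes a string with an unclosed if to bash -c with ENGINE set to a value that would never take the branch.

```shell
#!/usr/bin/env bash
# Toy inner script: the gate is opened but never closed with fi.
inner_script='echo "before gate"
if [[ "$ENGINE" == "atom-disagg" ]]; then
    echo "rdma setup"
echo "docker run"'

# Run it the way srun would, with an engine that does not match the gate.
status=0
output=$(ENGINE=sglang-disagg bash -c "$inner_script" 2>&1) || status=$?

echo "$output"
echo "exit status: $status"
```

Commands on lines before the gate can still run, because bash reads and executes a -c string one command at a time. Everything from the unclosed if onward, including the stand-in for the docker run line, is never executed, and the shell exits non-zero.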


# When the container base OS differs from the host (e.g. Ubuntu 24.04 image
# on a 22.04 host), the container's bundled libibverbs/libionic may be
# ABI-incompatible with the host kernel drivers. Detect the NIC type and
# bind-mount the host's out-of-tree RDMA userspace libraries into the
# container so the RDMA stack always matches the running kernel.
_detect_nic_type() {
if [[ -n \"\${MORI_NIC_TYPE:-}\" ]]; then echo \"\$MORI_NIC_TYPE\"; return; fi
local bnxt=0 mlx5=0 ionic=0
if [[ -d /sys/class/infiniband ]]; then
for dev in /sys/class/infiniband/*; do
local name; name=\$(basename \"\$dev\")
case \"\$name\" in
bnxt_re*) ((bnxt++)) ;; mlx5*) ((mlx5++)) ;; ionic*) ((ionic++)) ;;
*)
local drv; drv=\$(basename \"\$(readlink -f \"\$dev/device/driver\" 2>/dev/null)\" 2>/dev/null || true)
case \"\$drv\" in bnxt*) ((bnxt++)) ;; mlx5*) ((mlx5++)) ;; ionic*) ((ionic++)) ;; esac ;;

set -e breaks NIC counting

Medium Severity

Inside the new RDMA helper, ((bnxt++)) / ((ionic++)) under set -euo pipefail can exit with status 1 when a counter is still zero, aborting atom-disagg container startup during NIC detection.

Reviewed by Cursor Bugbot for commit 4d6b3f3.
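
The arithmetic behaviour behind this report can be checked in isolation. The sketch below uses made-up variable names and assumes nothing about the surrounding job.slurm code.

```shell
#!/usr/bin/env bash
# Post-increment evaluates to the OLD value, so starting from 0 the (( ))
# command returns exit status 1 even though the counter was incremented.
count=0
rc=0
(( count++ )) || rc=$?
echo "post-increment from 0: status=$rc count=$count"

# In a fresh shell with set -e, that status ends the script immediately.
sub_rc=0
out=$(bash -c 'set -e; n=0; (( n++ )); echo "not reached"') || sub_rc=$?
echo "set -e shell: status=$sub_rc output='$out'"

# A plain assignment always returns 0, so it is safe for any starting value.
safe=0
safe=$(( safe + 1 ))
echo "assignment form: safe=$safe"
```

Whether the abort actually fires in the helper depends on how it is called: outside POSIX mode, bash clears -e inside command substitutions unless inherit_errexit is enabled. The assignment form avoids the question entirely.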

esac
done
fi
if (( bnxt >= mlx5 && bnxt >= ionic && bnxt > 0 )); then echo bnxt
elif (( ionic >= mlx5 && ionic > 0 )); then echo ionic
else echo mlx5; fi
}

_find_host_ibverbs() {
for c in /usr/lib64/libibverbs.so.1 /lib/x86_64-linux-gnu/libibverbs.so.1 /usr/lib/x86_64-linux-gnu/libibverbs.so.1; do
local r; r=\$(readlink -f \"\$c\" 2>/dev/null || true)
if [[ -f \"\$r\" ]]; then echo \"\$r\"; return; fi
done
}

_NIC_TYPE=\$(_detect_nic_type)
echo \"[rdma] NIC type: \${_NIC_TYPE} on \$(hostname)\"

if [[ \"\$_NIC_TYPE\" == \"ionic\" || \"\$_NIC_TYPE\" == \"bnxt\" ]]; then
_host_ibv=\$(_find_host_ibverbs)
if [[ -n \"\$_host_ibv\" ]]; then
RDMA_MOUNTS+=(-v \"\$_host_ibv:/lib/x86_64-linux-gnu/libibverbs.so.1\")
fi
fi

if [[ \"\$_NIC_TYPE\" == \"ionic\" ]]; then
for _dir in /usr/local/lib /usr/lib/x86_64-linux-gnu; do
for _lib in \"\$_dir\"/libionic*.so; do
[[ -f \"\$_lib\" ]] || continue
_real=\$(readlink -f \"\$_lib\")
[[ -f \"\$_real\" ]] && RDMA_MOUNTS+=(-v \"\$_real:\$_real\")
RDMA_MOUNTS+=(-v \"\$_lib:/usr/lib/x86_64-linux-gnu/\$(basename \"\$_lib\")\")
done
done
if [[ -d /usr/lib/x86_64-linux-gnu/libibverbs ]]; then
for _lib in /usr/lib/x86_64-linux-gnu/libibverbs/libionic-rdmav*.so; do
[[ -f \"\$_lib\" ]] && RDMA_MOUNTS+=(-v \"\$_lib:\$_lib\")
done
fi
[[ -d /etc/libibverbs.d ]] && RDMA_MOUNTS+=(-v /etc/libibverbs.d:/etc/libibverbs.d:ro)
elif [[ \"\$_NIC_TYPE\" == \"bnxt\" ]]; then
for _lib in /usr/local/lib/libbnxt_re-rdmav*.so; do
[[ -f \"\$_lib\" ]] && RDMA_MOUNTS+=(-v \"\$_lib:/usr/lib/x86_64-linux-gnu/libibverbs/\$(basename \"\$_lib\")\")
done
for _lib in /usr/local/lib/libbnxt_re.so; do
[[ -f \"\$_lib\" ]] && RDMA_MOUNTS+=(-v \"\$_lib:/usr/lib/x86_64-linux-gnu/\$(basename \"\$_lib\")\")
done
[[ -d /etc/libibverbs.d ]] && RDMA_MOUNTS+=(-v /etc/libibverbs.d:/etc/libibverbs.d:ro)
fi

if [[ \${#RDMA_MOUNTS[@]} -gt 0 ]]; then
echo \"[rdma] bind-mounts: \${RDMA_MOUNTS[*]}\"
else
echo \"[rdma] no out-of-tree RDMA mounts needed\"
fi
Comment thread
cursor[bot] marked this conversation as resolved.
fi # end: if ENGINE == atom-disagg

# Pre-clean (idempotent)
\$DOCKER_CMD ps -aq --filter \"$CONT_FILTER\" | xargs -r \$DOCKER_CMD rm -f || true
\$DOCKER_CMD ps -aq | xargs -r \$DOCKER_CMD stop || true
@@ -490,6 +581,7 @@ fi
-v ${BENCHMARK_LOGS_DIR}:/benchmark_logs \
-v ${DI_REPO_DIR}:${DOCKER_MOUNT_PATH} \
${EXTRA_DOCKER_MOUNTS:-} \
\${RDMA_MOUNTS[@]+"\${RDMA_MOUNTS[@]}"} \
${DOCKER_ENV_COMMON[*]} \
${DOCKER_ENV_ENGINE[*]} \
--name \"$DOCKER_CONT_NAME\" \
72 changes: 72 additions & 0 deletions benchmarks/multi_node/amd_utils/models_atom.yaml
@@ -0,0 +1,72 @@
# Model-specific SGLang server configurations for disaggregated inference.
#
# Each top-level key is a MODEL_NAME value (must match the directory name under MODEL_DIR).
#
# To add a new model: add a new top-level entry following the same schema.
# No script changes are required.
#
# Schema:
#   <model-name>:
#     base_flags: str # Common flags for both prefill and decode
#     mtp_flags: str # Appended to decode when DECODE_MTP_SIZE > 0
#     dp_flags: str # Appended when DP is enabled (prefill or decode)
#     prefill:
#       mem_fraction_static: float
#       disable_radix_cache: bool
#       dp: # Config when data-parallel attention is enabled
#         max_running_requests: int
#         chunked_prefill_size: str # Can be integer or bash arithmetic expression
#         cuda_graph_bs: str # Space-separated values
#       no_dp: # Config when data-parallel attention is disabled
#         max_running_requests: int
#         chunked_prefill_size: int
#         cuda_graph_bs_range: str # "start-end" expanded via seq
#     decode:
#       mem_fraction_static: float
#       prefill_round_robin_balance: bool
#       dp:
#         max_running_requests: int
#         chunked_prefill_size: str
#         cuda_graph_bs_range: str
#       ep_only: # Config when EP is enabled but DP is disabled
#         max_running_requests: int
#         chunked_prefill_size: int
#         cuda_graph_bs_range: str
#       no_dp:
#         max_running_requests: int
#         chunked_prefill_size: int
#         cuda_graph_bs_range: str

DeepSeek-V4-Pro:
  # ATOM engine (atom-disagg): server_atom.sh uses MEM_FRACTION/KV_CACHE_DTYPE/BLOCK_SIZE/MAX_NUM_SEQS
  # directly from env vars (defaulting to 0.85/fp8/16/256). base_flags/dp_flags are not used by
  # server_atom.sh; they are kept here for documentation and potential future use.
  base_flags: ""
  mtp_flags: ""
  dp_flags: ""
  prefill:
    mem_fraction_static: 0.85
    disable_radix_cache: true
    dp:
      max_running_requests: 256
      chunked_prefill_size: 262144
      cuda_graph_bs: "1 2 3"
    no_dp:
      max_running_requests: 256
      chunked_prefill_size: 262144
      cuda_graph_bs_range: "1-128"
  decode:
    mem_fraction_static: 0.85
    prefill_round_robin_balance: false
    dp:
      max_running_requests: 256
      chunked_prefill_size: 262144
      cuda_graph_bs_range: "1-256"
    ep_only:
      max_running_requests: 256
      chunked_prefill_size: 262144
      cuda_graph_bs_range: "1-256"
    no_dp:
      max_running_requests: 256
      chunked_prefill_size: 262144
      cuda_graph_bs_range: "1-256"
8 changes: 6 additions & 2 deletions benchmarks/multi_node/amd_utils/server.sh
@@ -1,19 +1,23 @@
#!/bin/bash
- # Dual-Engine Disaggregated Server Dispatcher
+ # Multi-Engine Disaggregated Server Dispatcher
# =============================================================================
# Dispatches to the engine-specific server launcher based on ENGINE env var.
# ENGINE=sglang-disagg (default) -> server_sglang.sh (SGLang + MoRI)
# ENGINE=vllm-disagg -> server_vllm.sh (vLLM + Nixl/MoRI-IO)
# ENGINE=atom-disagg -> server_atom.sh (ATOM + mooncake)
# =============================================================================

ENGINE="${ENGINE:-sglang-disagg}"
- WS_PATH="${WS_PATH:-${SGLANG_WS_PATH:-${VLLM_WS_PATH:-$(dirname "${BASH_SOURCE[0]}")}}}"
+ WS_PATH="${WS_PATH:-${SGLANG_WS_PATH:-${VLLM_WS_PATH:-${ATOM_WS_PATH:-$(dirname "${BASH_SOURCE[0]}")}}}}"
export WS_PATH ENGINE

echo "[DISPATCHER] ENGINE=$ENGINE WS_PATH=$WS_PATH"

if [[ "$ENGINE" == "vllm-disagg" ]]; then
    source "$WS_PATH/server_vllm.sh"
elif [[ "$ENGINE" == "atom-disagg" ]]; then
    export ATOM_WS_PATH="$WS_PATH"
    source "$WS_PATH/server_atom.sh"
else
    source "$WS_PATH/server_sglang.sh"
fi