38 changes: 38 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -11497,3 +11497,41 @@ minimaxm2.5-fp8-gb300-dynamo-vllm:
tp: 4
ep: 4
dp-attn: true

# --- Measured-power campaign (dsr1-disagg-NVIDIA) ---------------------------
# Minimal single-job validation of the perfmon plumbing on GB300 before the
# full dsr1-disagg-NVIDIA measured sweep. Identical to the 1k1k low-latency
# scenario of dsr1-fp4-gb300-dynamo-sglang (same recipe + topology), trimmed
# to one concurrency so it runs as ONE gb300 job. runner: gb300-cw (CoreWeave
# fallback — the gb300-nv fleet was wedged on pre-run cleanup) routes to
# runners/launch_gb300-cw.sh, whose new dsr1 branch clones the perfmon fork and
# injects monitoring: -> per-node perf_samples_*.csv -> measured per-phase
# board power. Remove once the path is proven (or keep as a power canary).
dsr1-fp4-gb300-dynamo-sglang-powercheck:
  image: "lmsysorg/sglang:v0.5.8.post1-cu130-runtime"
  model: nvidia/DeepSeek-R1-0528-NVFP4-v2
  model-prefix: dsr1
  runner: gb300-cw
  precision: fp4
  framework: dynamo-sglang
  multinode: true
  disagg: true
  scenarios:
    fixed-seq-len:
    - isl: 1024
      osl: 1024
      search-space:
      - spec-decoding: "none"
        conc-list: [ 8 ]
        prefill:
          num-worker: 1
          tp: 4
          ep: 1
          dp-attn: false
          additional-settings:
          - "CONFIG_FILE=recipes/gb300-fp4/1k1k/low_latency.yaml"
        decode:
          num-worker: 2
          tp: 4
          ep: 1
          dp-attn: false
7 changes: 7 additions & 0 deletions .github/workflows/benchmark-multinode-tmpl.yml
@@ -178,6 +178,13 @@ jobs:
sleep 5
done
fi
# Drop root-owned leftovers from a prior (often cancelled) multinode
# run. The benchmark container runs as root and writes benchmark_logs/;
# if the job was cancelled its cleanup trap never ran, leaving
# root-owned dirs that actions/checkout (clean: true) can't rmdir
# (EACCES) — which then poison-fails EVERY subsequent job on that
# runner. Runs in both pre- and post-run cleanup (shared anchor).
sudo rm -rf "${GITHUB_WORKSPACE}/benchmark_logs" 2>/dev/null || true

- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -3531,3 +3531,11 @@
- "The Rust frontend replaces only the Python serving/API layer (HTTP, tokenization, scheduling glue, detokenization) and spawns the same Python EngineCore, so GPU kernels/attention/MoE GEMM/KV cache are untouched"
- "A/B sweep (28 single-node points, 1k1k + 8k1k, TP 1/2/4) vs the Python-frontend baseline (run 26696260751): throughput Pareto-neutral (peak tok/s/GPU within <1.5%, frontiers coincident) and TPOT flat (+-0.5%); TTFT improves ~8% at 1k1k and ~22% at 8k1k (every point), the expected signature of lower frontend CPU latency before first token, scaling with input length"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1634

- config-keys:
- dsr1-fp4-gb300-dynamo-sglang-powercheck
description:
- "Minimal single-job GB300 validation of the measured-power perfmon plumbing before the full dsr1-disagg-NVIDIA sweep (NVIDIA analogue of the AMD smoke run in PR #1574). Same recipe + disagg topology as the 1k/1k low-latency cell of dsr1-fp4-gb300-dynamo-sglang (1 prefill TP4 + 2 decode TP4), trimmed to one concurrency (8) so the changelog matrix expands to exactly ONE gb300 job and the shared cluster stays clear."
- "Exercises the runner-side wiring added to runners/launch_gb300-nv.sh: the dsr1 branch clones SemiAnalysisAI/srt-slurm@feat/inferencex-perfmon (NVIDIA/srt-slurm PR #35) instead of upstream sa-submission, recursively injects `monitoring:` into every recipes/<hw>/<seq>/*.yaml (find -type f, never a flat glob — the flat glob is what silently produced 0 power rows in sweep #26548110246), and stages the per-node perf_samples_*.csv to $GITHUB_WORKSPACE before `rm -rf outputs`, setting GPU_METRICS_CSV_GLOB for the Process-result step."
- "Success criteria: job green AND the agg JSON patched with avg_power_w + per-stage prefill_avg_power_w/decode_avg_power_w + workers[] (role-labelled prefill/decode) from utils/aggregate_power.py. If those fields are absent the plumbing is not yet proven and the full dsr1-disagg-NVIDIA sweep stays gated. Remove this key (or keep as a GB300 power canary) once validated."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1686
79 changes: 75 additions & 4 deletions runners/launch_gb300-cw.sh
@@ -59,8 +59,22 @@ elif [[ $MODEL_PREFIX == "glm5" && $PRECISION == "fp8" ]]; then
echo "Unsupported framework on gb300-cw for glm5/fp8: $FRAMEWORK. Currently supported: dynamo-sglang"
exit 1
fi
elif [[ $MODEL_PREFIX == "dsr1" && $PRECISION == "fp4" ]]; then
# MEASURED-POWER CAMPAIGN — CoreWeave fallback for dsr1 (the gb300-nv fleet
# was wedged on its pre-run cleanup). Clone the perfmon fork directly: it
# already carries BOTH the gb300-fp4 dsr1 recipes (recipes/gb300-fp4/1k1k/
# *.yaml) AND the nvidia-smi perfmon machinery, so no hand-rolled recipe
# overlay is needed. Weights are pre-staged on CoreWeave at
# /mnt/vast/models/dsr1-fp4; the recipe's `model.path: dsr1` alias is mapped
# to that path in the srtslurm.yaml model_paths block below.
export MODEL_PATH="/mnt/vast/models/dsr1-fp4"
SRT_SLURM_RECIPES_REPO="https://github.com/SemiAnalysisAI/srt-slurm.git"
SRT_SLURM_RECIPES_REF="feat/inferencex-perfmon"
SRT_RECIPE_SRC="" # fork ships the recipes; skip the overlay step
SRT_RECIPE_DST=""
export PERFMON_ENABLED=1
else
- echo "Unsupported model prefix/precision combination on gb300-cw: $MODEL_PREFIX/$PRECISION. Currently supported: dsv4/fp4, glm5/fp8"
+ echo "Unsupported model prefix/precision combination on gb300-cw: $MODEL_PREFIX/$PRECISION. Currently supported: dsv4/fp4, glm5/fp8, dsr1/fp4"
exit 1
fi

@@ -145,9 +159,44 @@ git clone "$SRT_SLURM_RECIPES_REPO" "$SRT_REPO_DIR"
cd "$SRT_REPO_DIR"
git checkout "$SRT_SLURM_RECIPES_REF"

- # Overlay the hand-rolled DSV4 recipes onto the selected srt-slurm checkout.
- mkdir -p "$SRT_RECIPE_DST"
- cp -rT "$SRT_RECIPE_SRC" "$SRT_RECIPE_DST"
+ # Overlay the hand-rolled recipes onto the selected srt-slurm checkout — unless
+ # the chosen repo already ships them (the dsr1 perfmon fork leaves
+ # SRT_RECIPE_SRC empty because its recipes/gb300-fp4 dsr1 recipes are built in).
+ if [ -n "$SRT_RECIPE_SRC" ]; then
+ mkdir -p "$SRT_RECIPE_DST"
+ cp -rT "$SRT_RECIPE_SRC" "$SRT_RECIPE_DST"
+ fi

# MEASURED-POWER CAMPAIGN: enable per-node nvidia-smi perfmon. `monitoring` is a
# top-level SrtConfig field (defaults to None) that the orchestrator reads to
# spawn perfmon.py, which writes perf_samples_*.csv. Inject it into the chosen
# CONFIG_FILE recipe when running the perfmon fork (idempotent). srtctl applies
# only this CONFIG_FILE, so injecting here is sufficient.
if [ "${PERFMON_ENABLED:-0}" = "1" ] && [ -n "$CONFIG_FILE" ] && [ -f "$CONFIG_FILE" ]; then
if ! grep -q '^monitoring:' "$CONFIG_FILE"; then
printf '\nmonitoring:\n enabled: true\n sample_interval: 1.0\n' >> "$CONFIG_FILE"
echo "[perfmon] injected monitoring: into $CONFIG_FILE"
fi
# CoreWeave: request unlimited node memory so pyxis/enroot can extract the
# 15-30 GB container squashfs without the cgroup OOM-killing unsquashfs
# (exit 137 -> prefill container never starts -> etcd down -> decode workers
# crash with "Could not connect to etcd"). The fork's gb300-fp4 recipe omits
# this; every working CW recipe (dsv4/glm5) sets sbatch_directives.mem: "0".
if ! grep -q '^sbatch_directives:' "$CONFIG_FILE"; then
printf '\nsbatch_directives:\n mem: "0"\n' >> "$CONFIG_FILE"
echo "[perfmon] injected sbatch_directives.mem=0 into $CONFIG_FILE"
fi
# dsr1 warmup (load 671B weights + FlashInfer autotune + capture 36 CUDA-graph
# batch sizes up to cuda-graph-max-bs 256) takes ~35 min, exceeding the default
# health_check ceiling (max_attempts 180 x interval 10 = 1800s/30min) — the
# orchestrator timed out and killed etcd mid-capture. Raise the ceiling to
# 90 min. It is a CEILING, not a fixed wait: the sweep proceeds the instant the
# servers report healthy, so generous headroom costs nothing.
if ! grep -q '^health_check:' "$CONFIG_FILE"; then
printf '\nhealth_check:\n max_attempts: 540\n interval_seconds: 10\n' >> "$CONFIG_FILE"
echo "[perfmon] injected health_check (90min ceiling) into $CONFIG_FILE"
fi
fi

echo "Installing srtctl..."
# CRITICAL — uv install location.
@@ -233,6 +282,10 @@ model_paths:
deepseek-v4-pro: "${MODEL_PATH}"
# GLM-5 FP8 sglang recipes use `model.path: glm-5-fp8`.
glm-5-fp8: "${MODEL_PATH}"
# dsr1 fp4 sglang recipe (perfmon fork) uses `model.path: dsr1`; for dsr1 runs
# MODEL_PATH points at /mnt/vast/models/dsr1-fp4. Harmless on dsv4/glm5 runs
# (their recipes never reference the `dsr1` alias).
dsr1: "${MODEL_PATH}"
containers:
dynamo-trtllm: ${SQUASH_FILE}
dynamo-sglang: ${SQUASH_FILE}
@@ -347,6 +400,24 @@ else
echo "Warning: Logs directory not found at $LOGS_DIR"
fi

# MEASURED-POWER CAMPAIGN: stage per-node perfmon CSVs for the downstream
# "Process result" workflow step. The fork's perfmon writes perf_samples_*.csv
# under the srt-slurm job logs dir; copy them into $GITHUB_WORKSPACE and export
# GPU_METRICS_CSV_GLOB so process_result.py runs aggregate_power.py over them and
# patches the agg JSON with measured power + temp/util/mem. Guarded on
# PERFMON_ENABLED so dsv4/glm5 runs on this launcher are unaffected.
if [ "${PERFMON_ENABLED:-0}" = "1" ] && [ -d "$LOGS_DIR" ]; then
if find "$LOGS_DIR" -name 'perf_samples_*.csv' 2>/dev/null | grep -q .; then
mkdir -p "$GITHUB_WORKSPACE/perf_samples"
find "$LOGS_DIR" -name 'perf_samples_*.csv' -exec cp {} "$GITHUB_WORKSPACE/perf_samples/" \;
perf_csv_count=$(ls "$GITHUB_WORKSPACE/perf_samples"/perf_samples_*.csv 2>/dev/null | wc -l | tr -d ' ')
echo "GPU_METRICS_CSV_GLOB=$GITHUB_WORKSPACE/perf_samples/perf_samples_*.csv" >> "$GITHUB_ENV"
echo "[perfmon] staged $perf_csv_count per-node perf_samples_*.csv to \$GITHUB_WORKSPACE/perf_samples/"
else
echo "[perfmon] WARNING: monitoring enabled but no perf_samples_*.csv found under $LOGS_DIR — measured power aggregation will be skipped" >&2
fi
fi

if [[ "${EVAL_ONLY:-false}" != "true" ]]; then
if [ ! -d "$LOGS_DIR" ]; then
exit 1
52 changes: 52 additions & 0 deletions runners/launch_gb300-nv.sh
@@ -155,6 +155,39 @@ elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" && $PRECIS
git checkout main
mkdir -p recipes/vllm/minimax-m2.5
cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5" recipes/vllm/minimax-m2.5
elif [[ $MODEL_PREFIX == "dsr1" ]]; then
# MEASURED-POWER CAMPAIGN (dsr1-disagg-NVIDIA): clone the SemiAnalysisAI
# fork pinned to NVIDIA/srt-slurm PR #35 (feat/inferencex-perfmon) instead
# of upstream sa-submission-q2-2026 (the else branch). The fork is a
# SUPERSET: byte-identical dsr1 recipes (verified — under recipes/gb300-*
# only glm5.yaml differs, never the dsr1 recipes) PLUS the per-node
# nvidia-smi perfmon machinery (src/srtctl/monitor/perfmon.py) that writes
# perf_samples_*.csv when a recipe declares `monitoring:`. Pointing dsr1
# here is what turns the dsr1-disagg-NVIDIA energy charts MEASURED.
git clone https://github.com/SemiAnalysisAI/srt-slurm.git "$SRT_REPO_DIR"
cd "$SRT_REPO_DIR"
git checkout feat/inferencex-perfmon
export PERFMON_ENABLED=1
# Enable per-node GPU perfmon on every recipe. `monitoring` is a top-level
# SrtConfig field defaulting to None, so without this the orchestrator's
# _start_perf_monitor short-circuits and no perf_samples_*.csv is written.
# Idempotent. Use a RECURSIVE find, never a flat *.yaml glob: recipes live
# in recipes/<hw>/<seq>/*.yaml, and a flat glob matched zero files in PR
# #1574's sweep #26548110246 — it completed "success" with NO power data.
FOUND_COUNT=0
INJECTED_COUNT=0
while IFS= read -r recipe; do
FOUND_COUNT=$((FOUND_COUNT + 1))
if ! grep -q '^monitoring:' "$recipe"; then
printf '\nmonitoring:\n enabled: true\n sample_interval: 1.0\n' >> "$recipe"
INJECTED_COUNT=$((INJECTED_COUNT + 1))
fi
done < <(find recipes -type f -name '*.yaml')
if [ "$FOUND_COUNT" -eq 0 ]; then
echo "[perfmon] WARNING: zero recipe YAMLs found under recipes/ — power data will be MISSING from this run." >&2
else
echo "[perfmon] injected monitoring: into $INJECTED_COUNT of $FOUND_COUNT recipes."
fi

dsr1 branch ignores framework

High Severity

The new dsr1 clone path keys only on MODEL_PREFIX, so any GB300-NV job with model-prefix: dsr1—including dynamo-trt configs—checks out the perfmon srt-slurm fork, enables PERFMON_ENABLED, and rewrites every recipe YAML. TRT (and other non–dynamo-sglang) dsr1 runs previously used the default sa-submission-q2-2026 checkout and can fail or run the wrong stack.
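One possible shape for the fix, as a minimal sketch: gate the perfmon fork on both the model prefix and the framework, so dsr1 jobs on other frameworks keep the default checkout. The function name `select_srt_checkout` and the labels it prints are hypothetical and exist only for illustration; the launcher itself would clone and check out the matching repository in each branch.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): decide which srt-slurm checkout
# a job should use from its model prefix and framework.
select_srt_checkout() {
    local model_prefix="$1" framework="$2"
    # Only dsr1 on dynamo-sglang opts into the perfmon fork; every other
    # combination, including dsr1 on dynamo-trt, keeps the default checkout.
    if [[ "$model_prefix" == "dsr1" && "$framework" == "dynamo-sglang" ]]; then
        echo "perfmon-fork"
    else
        echo "upstream-default"
    fi
}

# Example calls: the first selects the fork, the second falls through.
select_srt_checkout dsr1 dynamo-sglang
select_srt_checkout dsr1 dynamo-trt
```

In the launcher this would amount to adding a `$FRAMEWORK == "dynamo-sglang"` condition to the existing `$MODEL_PREFIX == "dsr1"` test.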

Reviewed by Cursor Bugbot for commit ce2c4ba.

else
git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
cd "$SRT_REPO_DIR"
@@ -408,6 +441,25 @@ fi
# Without this inline call, R25 lost both 1p6d shards' logs.
_snapshot_server_logs

# MEASURED-POWER CAMPAIGN: stage per-node perfmon CSVs to a cleanup-proof
# location and hand them to the downstream "Process result" step (a SEPARATE
# workflow job step, run after this launch step finishes). srt-slurm perfmon
# writes perf_samples_*.csv into $LOGS_DIR; the `rm -rf outputs` below would
# delete them before Process result reads GPU_METRICS_CSV_GLOB — unlike
# gb300-cw, which keeps outputs/ — so copy them under $GITHUB_WORKSPACE first.
# Guarded on PERFMON_ENABLED so non-dsr1 models on this runner stay silent.
if [ "${PERFMON_ENABLED:-0}" = "1" ] && [ -d "$LOGS_DIR" ]; then
perf_csv_count=$(ls "$LOGS_DIR"/perf_samples_*.csv 2>/dev/null | wc -l | tr -d ' ')
if [ "$perf_csv_count" -gt 0 ]; then
mkdir -p "$GITHUB_WORKSPACE/perf_samples"
cp "$LOGS_DIR"/perf_samples_*.csv "$GITHUB_WORKSPACE/perf_samples/"

Stale perf CSVs not cleared

Medium Severity

Measured-power staging copies into $GITHUB_WORKSPACE/perf_samples/ without clearing that directory first. process_result.py globs every perf_samples_*.csv there, so files left from an earlier job on the same runner can be aggregated with the current run and skew avg_power_w, worker counts, and joules fields.
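A minimal sketch of one way to address this, assuming the staging directory is owned exclusively by this step: recreate it empty before copying, so only the current run's CSVs can match the glob. The helper name `stage_perf_csvs` and its two arguments are hypothetical; in the launcher they would correspond to `$LOGS_DIR` and `$GITHUB_WORKSPACE/perf_samples`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): stage_perf_csvs <logs_dir> <stage_dir>
# Copies perf_samples_*.csv from logs_dir into a freshly emptied stage_dir
# and prints the number of files staged.
stage_perf_csvs() {
    local logs_dir="$1" stage_dir="$2"
    # Start from an empty directory so CSVs left by an earlier job on the
    # same runner cannot be picked up by the downstream glob.
    rm -rf -- "$stage_dir"
    mkdir -p -- "$stage_dir"
    # Recursive search, in case the CSVs sit in per-node subdirectories.
    find "$logs_dir" -type f -name 'perf_samples_*.csv' -exec cp -- {} "$stage_dir/" \;
    # Count what actually landed in the staging directory.
    find "$stage_dir" -type f -name 'perf_samples_*.csv' | wc -l | tr -d ' '
}
```

Counting files in the staging directory rather than in the logs directory means the reported number reflects what the downstream step will actually read.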

Reviewed by Cursor Bugbot for commit 9839a0c.

echo "GPU_METRICS_CSV_GLOB=$GITHUB_WORKSPACE/perf_samples/perf_samples_*.csv" >> "$GITHUB_ENV"
echo "[perfmon] staged $perf_csv_count per-node perf_samples_*.csv to \$GITHUB_WORKSPACE/perf_samples/"
else
echo "[perfmon] WARNING: monitoring enabled but no perf_samples_*.csv found in $LOGS_DIR — measured power aggregation will be skipped" >&2
fi

Perfmon failure omits glob guard

Medium Severity

When PERFMON_ENABLED is set but no perf_samples_*.csv files are found, the launcher logs a warning and does not set GPU_METRICS_CSV_GLOB. process_result.py then uses the single-node gpu_metrics.csv fallback, which can patch multinode agg JSON with stale single-node power instead of omitting telemetry.
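One way this could be guarded, sketched under a loud assumption: that process_result.py skips power aggregation when GPU_METRICS_CSV_GLOB is set but matches no files, rather than falling back to gpu_metrics.csv. That behaviour is not confirmed by this diff and would need checking. The helper name `emit_metrics_glob` and its arguments are hypothetical; in the launcher the second argument would be `$GITHUB_ENV`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): emit_metrics_glob <stage_dir> <env_file>
# Always records GPU_METRICS_CSV_GLOB, and returns non-zero when nothing was
# staged so the caller can choose between warning and failing the job.
emit_metrics_glob() {
    local stage_dir="$1" env_file="$2"
    local glob="$stage_dir/perf_samples_*.csv"
    # Record the glob unconditionally. ASSUMPTION: a set-but-empty glob makes
    # the downstream step omit telemetry instead of using a single-node fallback.
    echo "GPU_METRICS_CSV_GLOB=$glob" >> "$env_file"
    if ! find "$stage_dir" -type f -name 'perf_samples_*.csv' 2>/dev/null | grep -q .; then
        echo "[perfmon] WARNING: no perf_samples_*.csv staged; glob matches nothing" >&2
        return 1
    fi
}
```

If the downstream assumption does not hold, the same return code could instead be used to fail the launch step outright, which keeps stale single-node power out of the multinode result.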

Reviewed by Cursor Bugbot for commit 5b3eb12.

fi

# Clean up srt-slurm outputs to prevent NFS silly-rename lock files
# from blocking the next job's checkout on this runner
echo "Cleaning up srt-slurm outputs..."