SemiAnalysisAI · arygupt · Jun 8, 2026
@@ -11497,3 +11497,41 @@ minimaxm2.5-fp8-gb300-dynamo-vllm:
           tp: 4
           ep: 4
           dp-attn: true
+
+# --- Measured-power campaign (dsr1-disagg-NVIDIA) ---------------------------
+# Minimal single-job validation of the perfmon plumbing on GB300 before the
+# full dsr1-disagg-NVIDIA measured sweep. Identical to the 1k1k low-latency
+# scenario of dsr1-fp4-gb300-dynamo-sglang (same recipe + topology), trimmed
+# to one concurrency so it runs as ONE gb300 job. runner: gb300-cw (CoreWeave
+# fallback — the gb300-nv fleet was wedged on pre-run cleanup) routes to
+# runners/launch_gb300-cw.sh, whose new dsr1 branch clones the perfmon fork and
+# injects monitoring: -> per-node perf_samples_*.csv -> measured per-phase
+# board power. Remove once the path is proven (or keep as a power canary).
+dsr1-fp4-gb300-dynamo-sglang-powercheck:
+  image: "lmsysorg/sglang:v0.5.8.post1-cu130-runtime"
+  model: nvidia/DeepSeek-R1-0528-NVFP4-v2
+  model-prefix: dsr1
+  runner: gb300-cw
+  precision: fp4
+  framework: dynamo-sglang
+  multinode: true
+  disagg: true
+  scenarios:
+    fixed-seq-len:
+    - isl: 1024
+      osl: 1024
+      search-space:
+      - spec-decoding: "none"
+        conc-list: [ 8 ]
+        prefill:
+          num-worker: 1
+          tp: 4
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "CONFIG_FILE=recipes/gb300-fp4/1k1k/low_latency.yaml"
+        decode:
+          num-worker: 2
+          tp: 4
+          ep: 1
+          dp-attn: false
diff --git a/.github/workflows/benchmark-multinode-tmpl.yml b/.github/workflows/benchmark-multinode-tmpl.yml
@@ -178,6 +178,13 @@ jobs:
               sleep 5
             done
           fi
+          # Drop root-owned leftovers from a prior (often cancelled) multinode
+          # run. The benchmark container runs as root and writes benchmark_logs/;
+          # if the job was cancelled its cleanup trap never ran, leaving
+          # root-owned dirs that actions/checkout (clean: true) can't rmdir
+          # (EACCES) — which then poison-fails EVERY subsequent job on that
+          # runner. Runs in both pre- and post-run cleanup (shared anchor).
+          sudo rm -rf "${GITHUB_WORKSPACE}/benchmark_logs" 2>/dev/null || true
 
       - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
         with:

diff --git a/perf-changelog.yaml b/perf-changelog.yaml
@@ -3531,3 +3531,11 @@
     - "The Rust frontend replaces only the Python serving/API layer (HTTP, tokenization, scheduling glue, detokenization) and spawns the same Python EngineCore, so GPU kernels/attention/MoE GEMM/KV cache are untouched"
     - "A/B sweep (28 single-node points, 1k1k + 8k1k, TP 1/2/4) vs the Python-frontend baseline (run 26696260751): throughput Pareto-neutral (peak tok/s/GPU within <1.5%, frontiers coincident) and TPOT flat (+-0.5%); TTFT improves ~8% at 1k1k and ~22% at 8k1k (every point), the expected signature of lower frontend CPU latency before first token, scaling with input length"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1634
+
+- config-keys:
+    - dsr1-fp4-gb300-dynamo-sglang-powercheck
+  description:
+    - "Minimal single-job GB300 validation of the measured-power perfmon plumbing before the full dsr1-disagg-NVIDIA sweep (NVIDIA analogue of the AMD smoke run in PR #1574). Same recipe + disagg topology as the 1k/1k low-latency cell of dsr1-fp4-gb300-dynamo-sglang (1 prefill TP4 + 2 decode TP4), trimmed to one concurrency (8) so the changelog matrix expands to exactly ONE gb300 job and the shared cluster stays clear."
+    - "Exercises the runner-side wiring for dsr1. This key runs on gb300-cw, so it goes through runners/launch_gb300-cw.sh: the dsr1/fp4 branch clones SemiAnalysisAI/srt-slurm@feat/inferencex-perfmon (NVIDIA/srt-slurm PR #35), injects `monitoring:` into the selected CONFIG_FILE recipe, and stages the per-node perf_samples_*.csv under $GITHUB_WORKSPACE, setting GPU_METRICS_CSV_GLOB for the Process-result step. runners/launch_gb300-nv.sh gets the equivalent wiring for the full sweep: its dsr1 branch clones the same fork instead of upstream sa-submission, recursively injects `monitoring:` into every recipes/<hw>/<seq>/*.yaml (find -type f, never a flat glob; the flat glob is what silently produced 0 power rows in sweep #26548110246), and stages the CSVs to $GITHUB_WORKSPACE before `rm -rf outputs`."
+    - "Success criteria: job green AND the agg JSON patched with avg_power_w + per-stage prefill_avg_power_w/decode_avg_power_w + workers[] (role-labelled prefill/decode) from utils/aggregate_power.py. If those fields are absent the plumbing is not yet proven and the full dsr1-disagg-NVIDIA sweep stays gated. Remove this key (or keep as a GB300 power canary) once validated."
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1686
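
The success criteria in the last description entry can be checked mechanically. The following is a minimal sketch, not part of this PR: the helper name `has_power_fields` and the sample file are hypothetical, and it assumes the aggregated result is a JSON file in which the four keys appear as quoted member names. A plain `grep` only confirms that the keys exist somewhere in the file; it does not check nesting or values.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every measured-power key named in the
# changelog entry appears in the given JSON file.
has_power_fields() {
    json_file=$1
    for key in avg_power_w prefill_avg_power_w decode_avg_power_w workers; do
        # Match the key as a quoted JSON member name followed by a colon, so
        # "prefill_avg_power_w" does not satisfy the check for "avg_power_w".
        grep -q "\"$key\"[[:space:]]*:" "$json_file" || {
            echo "missing field: $key" >&2
            return 1
        }
    done
}

# Demo on a throwaway file (values are made up, not measured results).
demo_dir=$(mktemp -d)
printf '{"avg_power_w": 1.0, "prefill_avg_power_w": 1.0, "decode_avg_power_w": 1.0, "workers": []}\n' > "$demo_dir/agg.json"
if has_power_fields "$demo_dir/agg.json"; then
    echo "power fields present"
fi
```

A gate like this would fail loudly in the case the entry warns about, where the job is green but the power fields are absent.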
diff --git a/runners/launch_gb300-cw.sh b/runners/launch_gb300-cw.sh
@@ -59,8 +59,22 @@ elif [[ $MODEL_PREFIX == "glm5" && $PRECISION == "fp8" ]]; then
         echo "Unsupported framework on gb300-cw for glm5/fp8: $FRAMEWORK. Currently supported: dynamo-sglang"
         exit 1
     fi
+elif [[ $MODEL_PREFIX == "dsr1" && $PRECISION == "fp4" ]]; then
+    # MEASURED-POWER CAMPAIGN — CoreWeave fallback for dsr1 (the gb300-nv fleet
+    # was wedged on its pre-run cleanup). Clone the perfmon fork directly: it
+    # already carries BOTH the gb300-fp4 dsr1 recipes (recipes/gb300-fp4/1k1k/
+    # *.yaml) AND the nvidia-smi perfmon machinery, so no hand-rolled recipe
+    # overlay is needed. Weights are pre-staged on CoreWeave at
+    # /mnt/vast/models/dsr1-fp4; the recipe's `model.path: dsr1` alias is mapped
+    # to that path in the srtslurm.yaml model_paths block below.
+    export MODEL_PATH="/mnt/vast/models/dsr1-fp4"
+    SRT_SLURM_RECIPES_REPO="https://github.com/SemiAnalysisAI/srt-slurm.git"
+    SRT_SLURM_RECIPES_REF="feat/inferencex-perfmon"
+    SRT_RECIPE_SRC=""   # fork ships the recipes; skip the overlay step
+    SRT_RECIPE_DST=""
+    export PERFMON_ENABLED=1
 else
-    echo "Unsupported model prefix/precision combination on gb300-cw: $MODEL_PREFIX/$PRECISION. Currently supported: dsv4/fp4, glm5/fp8"
+    echo "Unsupported model prefix/precision combination on gb300-cw: $MODEL_PREFIX/$PRECISION. Currently supported: dsv4/fp4, glm5/fp8, dsr1/fp4"
     exit 1
 fi
 
@@ -145,9 +159,44 @@ git clone "$SRT_SLURM_RECIPES_REPO" "$SRT_REPO_DIR"
 cd "$SRT_REPO_DIR"
 git checkout "$SRT_SLURM_RECIPES_REF"
 
-# Overlay the hand-rolled DSV4 recipes onto the selected srt-slurm checkout.
-mkdir -p "$SRT_RECIPE_DST"
-cp -rT "$SRT_RECIPE_SRC" "$SRT_RECIPE_DST"
+# Overlay the hand-rolled recipes onto the selected srt-slurm checkout — unless
+# the chosen repo already ships them (the dsr1 perfmon fork leaves
+# SRT_RECIPE_SRC empty because its recipes/gb300-fp4 dsr1 recipes are built in).
+if [ -n "$SRT_RECIPE_SRC" ]; then
+    mkdir -p "$SRT_RECIPE_DST"
+    cp -rT "$SRT_RECIPE_SRC" "$SRT_RECIPE_DST"
+fi
+
+# MEASURED-POWER CAMPAIGN: enable per-node nvidia-smi perfmon. `monitoring` is a
+# top-level SrtConfig field (defaults to None) that the orchestrator reads to
+# spawn perfmon.py, which writes perf_samples_*.csv. Inject it into the chosen
+# CONFIG_FILE recipe when running the perfmon fork (idempotent). srtctl applies
+# only this CONFIG_FILE, so injecting here is sufficient.
+if [ "${PERFMON_ENABLED:-0}" = "1" ] && [ -n "$CONFIG_FILE" ] && [ -f "$CONFIG_FILE" ]; then
+    if ! grep -q '^monitoring:' "$CONFIG_FILE"; then
+        printf '\nmonitoring:\n  enabled: true\n  sample_interval: 1.0\n' >> "$CONFIG_FILE"
+        echo "[perfmon] injected monitoring: into $CONFIG_FILE"
+    fi
+    # CoreWeave: request unlimited node memory so pyxis/enroot can extract the
+    # 15-30 GB container squashfs without the cgroup OOM-killing unsquashfs
+    # (exit 137 -> prefill container never starts -> etcd down -> decode workers
+    # crash with "Could not connect to etcd"). The fork's gb300-fp4 recipe omits
+    # this; every working CW recipe (dsv4/glm5) sets sbatch_directives.mem: "0".
+    if ! grep -q '^sbatch_directives:' "$CONFIG_FILE"; then
+        printf '\nsbatch_directives:\n  mem: "0"\n' >> "$CONFIG_FILE"
+        echo "[perfmon] injected sbatch_directives.mem=0 into $CONFIG_FILE"
+    fi
+    # dsr1 warmup (load 671B weights + FlashInfer autotune + capture 36 CUDA-graph
+    # batch sizes up to cuda-graph-max-bs 256) takes ~35 min, exceeding the default
+    # health_check ceiling (max_attempts 180 x interval 10 = 1800s/30min) — the
+    # orchestrator timed out and killed etcd mid-capture. Raise the ceiling to
+    # 90 min. It is a CEILING, not a fixed wait: the sweep proceeds the instant the
+    # servers report healthy, so generous headroom costs nothing.
+    if ! grep -q '^health_check:' "$CONFIG_FILE"; then
+        printf '\nhealth_check:\n  max_attempts: 540\n  interval_seconds: 10\n' >> "$CONFIG_FILE"
+        echo "[perfmon] injected health_check (90min ceiling) into $CONFIG_FILE"
+    fi
+fi
 
 echo "Installing srtctl..."
 # CRITICAL — uv install location.
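
The three guarded `printf ... >>` injections in this hunk share one pattern: append a top-level YAML block only if no line already starts with that key. The sketch below isolates the pattern and reproduces the health-check arithmetic from the comment. The helper names `append_block_once` and `health_ceiling_minutes` are hypothetical and do not exist in the launcher, and the recipe is a throwaway temp file.

```shell
#!/bin/sh
# Hypothetical helper mirroring the launcher's guard: append a top-level YAML
# block only when no line in the file already starts with "<key>:".
append_block_once() {
    key=$1
    block=$2
    file=$3
    if grep -q "^$key:" "$file"; then
        return 0        # already present, leave the file untouched
    fi
    printf '\n%s\n' "$block" >> "$file"
}

# Health-check ceiling in whole minutes: attempts x interval (seconds) / 60.
health_ceiling_minutes() {
    echo $(( $1 * $2 / 60 ))
}

recipe=$(mktemp)
printf 'name: demo\n' > "$recipe"
monitoring_block='monitoring:
  enabled: true
  sample_interval: 1.0'
append_block_once monitoring "$monitoring_block" "$recipe"
append_block_once monitoring "$monitoring_block" "$recipe"   # second call is a no-op

grep -c '^monitoring:' "$recipe"     # prints 1
health_ceiling_minutes 180 10        # default ceiling: prints 30
health_ceiling_minutes 540 10        # raised ceiling: prints 90
```

The guard only matches a key at column 0, so a recipe that nests `monitoring:` under another key would still receive the appended top-level block.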
@@ -233,6 +282,10 @@ model_paths:
   deepseek-v4-pro: "${MODEL_PATH}"
   # GLM-5 FP8 sglang recipes use `model.path: glm-5-fp8`.
   glm-5-fp8: "${MODEL_PATH}"
+  # dsr1 fp4 sglang recipe (perfmon fork) uses `model.path: dsr1`; for dsr1 runs
+  # MODEL_PATH points at /mnt/vast/models/dsr1-fp4. Harmless on dsv4/glm5 runs
+  # (their recipes never reference the `dsr1` alias).
+  dsr1: "${MODEL_PATH}"
 containers:
   dynamo-trtllm: ${SQUASH_FILE}
   dynamo-sglang: ${SQUASH_FILE}
@@ -347,6 +400,24 @@ else
     echo "Warning: Logs directory not found at $LOGS_DIR"
 fi
 
+# MEASURED-POWER CAMPAIGN: stage per-node perfmon CSVs for the downstream
+# "Process result" workflow step. The fork's perfmon writes perf_samples_*.csv
+# under the srt-slurm job logs dir; copy them into $GITHUB_WORKSPACE and export
+# GPU_METRICS_CSV_GLOB so process_result.py runs aggregate_power.py over them and
+# patches the agg JSON with measured power + temp/util/mem. Guarded on
+# PERFMON_ENABLED so dsv4/glm5 runs on this launcher are unaffected.
+if [ "${PERFMON_ENABLED:-0}" = "1" ] && [ -d "$LOGS_DIR" ]; then
+    if find "$LOGS_DIR" -name 'perf_samples_*.csv' 2>/dev/null | grep -q .; then
+        mkdir -p "$GITHUB_WORKSPACE/perf_samples"
+        find "$LOGS_DIR" -name 'perf_samples_*.csv' -exec cp {} "$GITHUB_WORKSPACE/perf_samples/" \;
+        perf_csv_count=$(ls "$GITHUB_WORKSPACE/perf_samples"/perf_samples_*.csv 2>/dev/null | wc -l | tr -d ' ')
+        echo "GPU_METRICS_CSV_GLOB=$GITHUB_WORKSPACE/perf_samples/perf_samples_*.csv" >> "$GITHUB_ENV"
+        echo "[perfmon] staged $perf_csv_count per-node perf_samples_*.csv to \$GITHUB_WORKSPACE/perf_samples/"
+    else
+        echo "[perfmon] WARNING: monitoring enabled but no perf_samples_*.csv found under $LOGS_DIR — measured power aggregation will be skipped" >&2
+    fi
+fi
+
 if [[ "${EVAL_ONLY:-false}" != "true" ]]; then
     if [ ! -d "$LOGS_DIR" ]; then
         exit 1
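
The staging block in this hunk flattens CSVs found anywhere under the logs directory into a single directory. A minimal sketch of that step follows, run against a throwaway tree. The function name `stage_perf_csvs` and the node directory names are hypothetical. One caveat, stated as an assumption: flattening is only safe if every perfmon CSV has a unique basename (for example, one that includes the node name), since two files with the same name in different subdirectories would overwrite each other.

```shell
#!/bin/sh
# Hypothetical helper: copy every perf_samples_*.csv found anywhere under a
# logs directory into one staging directory and print how many were staged.
stage_perf_csvs() {
    logs_dir=$1
    stage_dir=$2
    mkdir -p "$stage_dir"
    # Recursive search, so CSVs in per-node subdirectories are picked up too.
    find "$logs_dir" -type f -name 'perf_samples_*.csv' -exec cp {} "$stage_dir/" \;
    find "$stage_dir" -type f -name 'perf_samples_*.csv' | wc -l | tr -d ' '
}

work=$(mktemp -d)
mkdir -p "$work/logs/node-a" "$work/logs/node-b"
: > "$work/logs/node-a/perf_samples_node-a.csv"
: > "$work/logs/node-b/perf_samples_node-b.csv"
: > "$work/logs/node-b/unrelated.csv"      # must not be staged

stage_perf_csvs "$work/logs" "$work/staged"   # prints 2
```

Unlike the launcher, this sketch creates the staging directory even when nothing matches; the launcher checks for matches first and emits a warning instead.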

diff --git a/runners/launch_gb300-nv.sh b/runners/launch_gb300-nv.sh
@@ -155,6 +155,39 @@ elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" && $PRECIS
     git checkout main
     mkdir -p recipes/vllm/minimax-m2.5
     cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5" recipes/vllm/minimax-m2.5
+elif [[ $MODEL_PREFIX == "dsr1" ]]; then
+    # MEASURED-POWER CAMPAIGN (dsr1-disagg-NVIDIA): clone the SemiAnalysisAI
+    # fork pinned to NVIDIA/srt-slurm PR #35 (feat/inferencex-perfmon) instead
+    # of upstream sa-submission-q2-2026 (the else branch). The fork is a
+    # SUPERSET: byte-identical dsr1 recipes (verified — under recipes/gb300-*
+    # only glm5.yaml differs, never the dsr1 recipes) PLUS the per-node
+    # nvidia-smi perfmon machinery (src/srtctl/monitor/perfmon.py) that writes
+    # perf_samples_*.csv when a recipe declares `monitoring:`. Pointing dsr1
+    # here is what turns the dsr1-disagg-NVIDIA energy charts MEASURED.
+    git clone https://github.com/SemiAnalysisAI/srt-slurm.git "$SRT_REPO_DIR"
+    cd "$SRT_REPO_DIR"
+    git checkout feat/inferencex-perfmon
+    export PERFMON_ENABLED=1
+    # Enable per-node GPU perfmon on every recipe. `monitoring` is a top-level
+    # SrtConfig field defaulting to None, so without this the orchestrator's
+    # _start_perf_monitor short-circuits and no perf_samples_*.csv is written.
+    # Idempotent. Use a RECURSIVE find, never a flat *.yaml glob: recipes live
+    # in recipes/<hw>/<seq>/*.yaml, and a flat glob matched zero files in PR
+    # #1574's sweep #26548110246 — it completed "success" with NO power data.
+    FOUND_COUNT=0
+    INJECTED_COUNT=0
+    while IFS= read -r recipe; do
+        FOUND_COUNT=$((FOUND_COUNT + 1))
+        if ! grep -q '^monitoring:' "$recipe"; then
+            printf '\nmonitoring:\n  enabled: true\n  sample_interval: 1.0\n' >> "$recipe"
+            INJECTED_COUNT=$((INJECTED_COUNT + 1))
+        fi
+    done < <(find recipes -type f -name '*.yaml')
+    if [ "$FOUND_COUNT" -eq 0 ]; then
+        echo "[perfmon] WARNING: zero recipe YAMLs found under recipes/ — power data will be MISSING from this run." >&2
+    else
+        echo "[perfmon] injected monitoring: into $INJECTED_COUNT of $FOUND_COUNT recipes."
+    fi
 else
     git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
     cd "$SRT_REPO_DIR"
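
The comment in this hunk insists on a recursive `find` over a flat `*.yaml` glob. The sketch below shows why on a throwaway tree shaped like `recipes/<hw>/<seq>/*.yaml`. The directory and file names are made up, and `count_flat` and `count_recursive` are hypothetical helpers, not functions from the launcher.

```shell
#!/bin/sh
# Build a throwaway tree shaped like recipes/<hw>/<seq>/*.yaml (names are made up).
tree=$(mktemp -d)
mkdir -p "$tree/recipes/hw-a/1k1k" "$tree/recipes/hw-b/8k1k"
: > "$tree/recipes/hw-a/1k1k/low_latency.yaml"
: > "$tree/recipes/hw-b/8k1k/max_tpt.yaml"

# Flat glob: only looks directly inside recipes/, where no YAML lives.
count_flat() {
    n=0
    for f in "$1"/recipes/*.yaml; do
        # With no match the pattern stays literal, so test that it is a file.
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Recursive find: descends into every subdirectory.
count_recursive() {
    find "$1/recipes" -type f -name '*.yaml' | wc -l | tr -d ' '
}

echo "flat glob:      $(count_flat "$tree")"        # 0
echo "recursive find: $(count_recursive "$tree")"   # 2
```

A loop over the flat glob runs without error and simply injects nothing, which matches the "success with no power data" failure the comment describes. The launcher's FOUND_COUNT warning exists to surface that case.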
@@ -408,6 +441,25 @@ fi
 # Without this inline call, R25 lost both 1p6d shards' logs.
 _snapshot_server_logs
 
+# MEASURED-POWER CAMPAIGN: stage per-node perfmon CSVs to a cleanup-proof
+# location and hand them to the downstream "Process result" step (a SEPARATE
+# workflow job step, run after this launch step finishes). srt-slurm perfmon
+# writes perf_samples_*.csv into $LOGS_DIR; the `rm -rf outputs` below would
+# delete them before Process result reads GPU_METRICS_CSV_GLOB — unlike
+# gb300-cw, which keeps outputs/ — so copy them under $GITHUB_WORKSPACE first.
+# Guarded on PERFMON_ENABLED so non-dsr1 models on this runner stay silent.
+if [ "${PERFMON_ENABLED:-0}" = "1" ] && [ -d "$LOGS_DIR" ]; then
+    perf_csv_count=$(ls "$LOGS_DIR"/perf_samples_*.csv 2>/dev/null | wc -l | tr -d ' ')
+    if [ "$perf_csv_count" -gt 0 ]; then
+        mkdir -p "$GITHUB_WORKSPACE/perf_samples"
+        cp "$LOGS_DIR"/perf_samples_*.csv "$GITHUB_WORKSPACE/perf_samples/"
+        echo "GPU_METRICS_CSV_GLOB=$GITHUB_WORKSPACE/perf_samples/perf_samples_*.csv" >> "$GITHUB_ENV"
+        echo "[perfmon] staged $perf_csv_count per-node perf_samples_*.csv to \$GITHUB_WORKSPACE/perf_samples/"
+    else
+        echo "[perfmon] WARNING: monitoring enabled but no perf_samples_*.csv found in $LOGS_DIR — measured power aggregation will be skipped" >&2
+    fi
+fi
+
 # Clean up srt-slurm outputs to prevent NFS silly-rename lock files
 # from blocking the next job's checkout on this runner
 echo "Cleaning up srt-slurm outputs..."