
ci: remove nick-fields/retry wrapper and add shared sinfo-based GPU partition selection #1299

Open
sbryngelson wants to merge 23 commits into MFlowCode:master from sbryngelson:retryci

Conversation

@sbryngelson
Member

@sbryngelson sbryngelson commented Mar 10, 2026

Summary

  • Replace nick-fields/retry JS action with plain run: steps for Frontier builds (test.yml) and bench builds (bench.yml). The JS action wrapper was getting SIGKILL'd on Frontier login nodes after the build completed successfully, causing false build failures. Retry logic is handled by retry_build() in .github/scripts/retry-build.sh, which all cluster build.sh scripts already call.
  • Unified CI scripts — replace per-cluster submit/test/bench scripts with parameterized versions:
    • submit-slurm-job.sh: single submit+monitor script for all clusters (replaces phoenix/submit-job.sh, phoenix/submit.sh, frontier/submit.sh). Cluster config (account, QOS, partitions, time limits) selected via case block. Idempotent stale-job cancellation now applies to all clusters.
    • common/test.sh: unified test script with conditional build, cluster-aware GPU detection, thread counts, RDMA, and sharding.
    • common/bench.sh: unified bench script with conditional build, TMPDIR management (Phoenix-only), and cluster-aware bench flags.
  • Shared select-gpu-partition.sh script for sinfo-based GPU partition selection, used by both test and benchmark jobs. GPU partition priority: gpu-rtx6000 → gpu-l40s → gpu-v100 → gpu-h200 → gpu-h100 → gpu-a100.
  • Parallel benchmark jobs require 2 idle/mix nodes (GPU_PARTITION_MIN_NODES=2) before selecting a partition, since PR and master benchmark jobs run concurrently.
  • Exclude dead GPU node atl1-1-03-002-29-0 (persistent cuInit error 999).
  • Delete dead code: run-tests-with-retry.sh (never called).

Workflow simplification

  • test.yml self job: 5 conditional steps → 2 (Build + Test)
  • test.yml case-opt job: 5 conditional steps → 3
  • 11 per-cluster scripts deleted, 3 unified scripts added

Test plan

  • Trigger CI on a PR and verify Frontier build steps pass without false failures
  • Verify Phoenix test jobs land on an available GPU partition via sinfo selection
  • Verify Phoenix benchmark jobs (PR + master) land on the same partition with 2 available nodes
  • Verify all cluster/device/interface combinations produce correct SLURM submissions

Spencer Bryngelson added 3 commits March 9, 2026 23:36
…nodes

The JS action wrapper gets SIGKILL'd on Frontier login nodes under memory
pressure, falsely failing the Build step even when build.sh succeeds.
retry_build() inside build.sh already handles 2-attempt retry with
rm -rf build between attempts.

Also move gpu-v100 to last in Phoenix GPU partition priority so SLURM
prefers newer GPU nodes (a100/h100/l40s/h200) over the aging V100s that
have had recurring driver issues.
Extract partition selection into select-gpu-partition.sh so both test
jobs (submit-job.sh) and benchmark jobs (run_parallel_benchmarks.sh)
use the same sinfo-based logic with a consistent priority order:
  gpu-rtx6000 -> gpu-l40s -> gpu-v100 -> gpu-h200 -> gpu-h100 -> gpu-a100

Tests now dynamically pick the best available partition rather than
submitting to a static multi-partition list, matching the benchmark
approach. Bench still exports BENCH_GPU_PARTITION so PR and master
land on the same GPU type for fair comparisons.
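
For readers unfamiliar with the helper, a minimal sketch of the sinfo-based selection described above follows. Only the priority order and the SELECTED_GPU_PARTITION output variable come from this PR; the internal variable names and the fallback behavior are illustrative, not the exact script contents.

```bash
# Sketch of a sourced partition picker; exports SELECTED_GPU_PARTITION.
_priority="gpu-rtx6000 gpu-l40s gpu-v100 gpu-h200 gpu-h100 gpu-a100"
SELECTED_GPU_PARTITION=""
for _part in $_priority; do
    # Count nodes in this partition reporting idle or mixed (partially allocated) state.
    _avail=$(sinfo -p "$_part" --noheader -o "%t" 2>/dev/null | grep -cE "^(idle|mix)" || true)
    if [ "${_avail:-0}" -ge 1 ]; then
        SELECTED_GPU_PARTITION="$_part"
        break
    fi
done
# Fall back to the first partition in the list if nothing reports idle/mix nodes.
[ -n "$SELECTED_GPU_PARTITION" ] || SELECTED_GPU_PARTITION="${_priority%% *}"
echo "Selected GPU partition: $SELECTED_GPU_PARTITION"
```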
Copilot AI review requested due to automatic review settings March 10, 2026 03:45
Contributor

Copilot AI left a comment

Pull request overview

This PR updates CI/SLURM automation to centralize Phoenix GPU partition selection logic and simplify the non-Phoenix build step in GitHub Actions.

Changes:

  • Replace the nick-fields/retry wrapper in test.yml with a direct run step + timeout-minutes.
  • Introduce a reusable .github/scripts/select-gpu-partition.sh and use it from both Phoenix benchmark and test submission paths.
  • Simplify .github/scripts/retry-build.sh by removing support for a post-build validation hook.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:

  • .github/workflows/test.yml: Removes external retry wrapper around non-Phoenix build step.
  • .github/workflows/phoenix/submit-job.sh: Uses shared GPU partition selection for non-benchmark Phoenix submissions.
  • .github/scripts/select-gpu-partition.sh: New shared helper to pick an available Phoenix GPU partition via sinfo.
  • .github/scripts/run_parallel_benchmarks.sh: Refactors inline partition selection into the shared helper script.
  • .github/scripts/retry-build.sh: Removes RETRY_VALIDATE_CMD-based post-build validation behavior from retry loop.

Comment on lines 2 to 14
# Provides retry_build(): 2-attempt loop.
# On failure of attempt 1, nukes the entire build directory before attempt 2.
# Set RETRY_VALIDATE_CMD to run a post-build validation; failure triggers a retry.
# Usage: source .github/scripts/retry-build.sh
#        retry_build ./mfc.sh build -j 8 --gpu acc

retry_build() {
    local validate_cmd="${RETRY_VALIDATE_CMD:-}"
    local max_attempts=2
    local attempt=1
    while [ $attempt -le $max_attempts ]; do
        echo "Build attempt $attempt of $max_attempts..."
        if "$@"; then
            if [ -n "$validate_cmd" ]; then
                if ! eval "$validate_cmd"; then
                    echo "Post-build validation failed on attempt $attempt."
                    if [ $attempt -lt $max_attempts ]; then
                        echo "  Nuking build directory before retry..."
                        rm -rf build 2>/dev/null || true
                        sleep 5
                        attempt=$((attempt + 1))
                        continue
                    else
                        echo "Validation still failing after $max_attempts attempts."
                        return 1
                    fi
                fi
            fi
            echo "Build succeeded on attempt $attempt."
            return 0

This comment was marked as off-topic.

Comment on lines +37 to +46
if [ "$job_type" = "bench" ]; then
bench_partition="${BENCH_GPU_PARTITION:-gpu-rtx6000}"
echo "Submitting bench GPU job to partition: $bench_partition (BENCH_GPU_PARTITION=${BENCH_GPU_PARTITION:-<unset, using default>})"
sbatch_gpu_opts="\
#SBATCH -p $bench_partition
#SBATCH --ntasks-per-node=4 # Number of cores per node required
#SBATCH -G2\
"
# BENCH_GPU_PARTITION is pre-selected by run_parallel_benchmarks.sh so both
# PR and master jobs land on the same GPU type for a fair comparison.
gpu_partition="${BENCH_GPU_PARTITION:-gpu-rtx6000}"
echo "Submitting bench GPU job to partition: $gpu_partition (BENCH_GPU_PARTITION=${BENCH_GPU_PARTITION:-<unset, using default>})"
sbatch_time="#SBATCH -t 04:00:00"
else
sbatch_gpu_opts="\
#SBATCH -p gpu-v100,gpu-a100,gpu-h100,gpu-l40s,gpu-h200
source "$(dirname "${BASH_SOURCE[0]}")/../../scripts/select-gpu-partition.sh"
gpu_partition="$SELECTED_GPU_PARTITION"
sbatch_time="#SBATCH -t 03:00:00"

This comment was marked as off-topic.

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
.github/workflows/phoenix/submit-job.sh (1)

44-44: Consider using a more robust path resolution.

The relative path ../../scripts/select-gpu-partition.sh works correctly but is fragile if the script hierarchy changes. Consider extracting the repository root and using an absolute path pattern:

♻️ Optional: Use repo-root-relative path
 else
-    source "$(dirname "${BASH_SOURCE[0]}")/../../scripts/select-gpu-partition.sh"
+    _repo_root="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../.." && pwd)"
+    source "${_repo_root}/.github/scripts/select-gpu-partition.sh"
     gpu_partition="$SELECTED_GPU_PARTITION"
     sbatch_time="#SBATCH -t 03:00:00"
 fi

📥 Commits

Reviewing files that changed from the base of the PR and between edff972 and 24185f8.

📒 Files selected for processing (5)
  • .github/scripts/retry-build.sh
  • .github/scripts/run_parallel_benchmarks.sh
  • .github/scripts/select-gpu-partition.sh
  • .github/workflows/phoenix/submit-job.sh
  • .github/workflows/test.yml
💤 Files with no reviewable changes (1)
  • .github/scripts/retry-build.sh


@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

This comment was marked as off-topic.

Spencer Bryngelson and others added 2 commits March 9, 2026 23:51
Make bench jobs use sinfo-based GPU partition selection (via
select-gpu-partition.sh) as a baseline, then override with
BENCH_GPU_PARTITION only when run_parallel_benchmarks.sh has
pre-selected a partition for PR/master consistency. Previously
bench jobs fell back to a hardcoded gpu-rtx6000 when
BENCH_GPU_PARTITION was unset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…selection

For parallel benchmarks (PR + master), both jobs need a GPU node
concurrently, so require at least 2 idle/mix nodes before selecting
a partition. Add GPU_PARTITION_MIN_NODES parameter to
select-gpu-partition.sh (defaults to 1 for single-job test use).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
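
Roughly, the threshold described in this commit slots into the selection loop sketched earlier like so (sketch only; the real script's internals may differ):

```bash
# Require at least GPU_PARTITION_MIN_NODES idle/mix nodes before accepting a
# partition; defaults to 1 so single-job test submissions behave as before.
_min_nodes="${GPU_PARTITION_MIN_NODES:-1}"
_avail=$(sinfo -p "$_part" --noheader -o "%t" 2>/dev/null | grep -cE "^(idle|mix)" || true)
if [ "${_avail:-0}" -ge "$_min_nodes" ]; then
    SELECTED_GPU_PARTITION="$_part"
fi
```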
@sbryngelson sbryngelson marked this pull request as draft March 10, 2026 03:56
@sbryngelson sbryngelson marked this pull request as ready for review March 10, 2026 03:57
phoenix/test.sh relies on RETRY_VALIDATE_CMD to smoke-test the
freshly built syscheck binary and trigger a rebuild on failure,
catching architecture mismatches (SIGILL) from binaries compiled
on a different compute node. Mistakenly removed in the previous
commit as 'unused'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace per-cluster submit/test/bench scripts with unified versions:
- submit-slurm-job.sh: single parameterized submit+monitor script for
  all clusters (replaces phoenix/submit-job.sh, phoenix/submit.sh,
  frontier/submit.sh). Cluster config (account, QOS, partitions, time
  limits) is selected via a case block. Idempotent stale-job cancellation
  now applies to all clusters, not just Phoenix.
- common/test.sh: unified test script with conditional build (skips if
  build/ exists from Frontier login-node build), cluster-aware GPU
  detection, thread counts, RDMA, and sharding.
- common/bench.sh: unified bench script with conditional build, TMPDIR
  management (Phoenix-only), and cluster-aware bench flags.

Also removes nick-fields/retry from bench.yml (frontier build.sh
already uses retry_build internally) and deletes dead code
(run-tests-with-retry.sh).

test.yml self job: 5 conditional steps -> 2 steps (Build + Test).
test.yml case-opt job: 5 conditional steps -> 3 steps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove no-op 'rm -rf build' inside 'if [ ! -d build ]' guard
  in common/test.sh and common/bench.sh.
- Default gpu_partition to 'batch' before dynamic selection to
  prevent unbound variable error if a new cluster is added.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Spencer Bryngelson and others added 2 commits March 10, 2026 01:34
On Phoenix, test.yml uses clean:false so build/ can persist across
reruns. If the prior run built on a different CPU microarchitecture,
the stale binaries would SIGILL. Run syscheck on any existing build
and nuke build/ on failure so the rebuild block fires.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same ISA mismatch fix as test.sh: always rm -rf build on Phoenix.
Also add trap EXIT for TMPDIR cleanup so early failures don't leak
temp directories under /storage/project.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Spencer Bryngelson and others added 2 commits March 10, 2026 01:52
bench.py spawns ./mfc.sh run as a subprocess without forwarding -j,
so the flag was silently ignored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Claude Code Review

Head SHA: b7e8e75

Files changed: 20

  • .github/scripts/retry-build.sh (modified)
  • .github/scripts/run-tests-with-retry.sh (deleted)
  • .github/scripts/run_parallel_benchmarks.sh (modified)
  • .github/scripts/select-gpu-partition.sh (new)
  • .github/scripts/submit-slurm-job.sh (new)
  • .github/scripts/submit_and_monitor_bench.sh (modified)
  • .github/workflows/bench.yml (modified)
  • .github/workflows/common/bench.sh (new)
  • .github/workflows/common/test.sh (new)
  • .github/workflows/frontier/bench.sh, frontier/submit.sh, frontier/test.sh, frontier_amd/*, phoenix/* (deleted — 9 files)

Summary

  • Removes nick-fields/retry JS wrapper (SIGKILL root cause fix) in favour of plain run: steps; retry logic for builds moves into retry_build() inside SLURM jobs.
  • Replaces 9 per-cluster scripts with 3 unified scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh).
  • Extracts shared sinfo-based GPU partition selection into select-gpu-partition.sh with a configurable minimum-node threshold for concurrent bench jobs.
  • Stale-job idempotency (was Phoenix-only) now applies to all clusters via the unified submit script.
  • Dead code deleted: run-tests-with-retry.sh, phoenix/submit-job.sh, cluster-specific test/bench stubs.

Findings

1. common/bench.sh drops -j $n_jobs from the bench run command (behavioural change)
Old phoenix/bench.sh: ./mfc.sh bench $bench_opts -j $n_jobs -o ... (parallelism = capped nproc, up to 64).
New common/bench.sh (lines 45–48): ./mfc.sh bench --mem 4 -o ... -- -c $bench_cluster ... (the -j flag is absent).
If ./mfc.sh bench uses -j to run benchmark cases in parallel, this silently serialises them and will inflate wall-clock times. Please confirm whether the flag is intentionally dropped or should be -j $n_jobs.

2. submit-slurm-job.sh: no catch-all in device-opts case blocks (lines 85–115)
Both the cpu and gpu device branches contain case "$cluster" in phoenix) ... frontier|frontier_amd) ... esac with no *) fallback. If a new cluster is added to the case list without updating this file, sbatch_device_opts is left unset and the generated SBATCH script will be silently malformed. A *) echo "ERROR: no sbatch_device_opts for cluster '$cluster'" ; exit 1 ;; guard would make this fail loudly.
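
For illustration, a fail-loud default branch along the lines the review suggests could look like the following (partition/option values are placeholders, not the actual cluster config):

```bash
case "$cluster" in
    phoenix)
        sbatch_device_opts="#SBATCH -G2" ;;                 # placeholder value
    frontier|frontier_amd)
        sbatch_device_opts="#SBATCH --gpus-per-node=8" ;;   # placeholder value
    *)
        # Fail loudly instead of generating a malformed SBATCH script.
        echo "ERROR: no sbatch_device_opts defined for cluster '$cluster'" >&2
        exit 1 ;;
esac
```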

3. common/test.sh Phoenix validate-cmd is a no-op when syscheck is absent (line 27)

validate_cmd='syscheck_bin=$(find build/install -name syscheck ...); [ -z "$syscheck_bin" ] || "$syscheck_bin" ...'

If syscheck isn't installed (e.g., clean CI environment), the validate command silently succeeds ([ -z "" ] is true) and the architecture-mismatch check is skipped entirely. This is documented by the comment, so it's intentional, but worth verifying the binary actually exists in the Phoenix install tree so the guard fires when needed.
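
If a hard failure on a missing binary were preferred, a stricter variant could look like this (hypothetical; the path and the syscheck invocation are illustrative):

```bash
# Evaluates to non-zero when syscheck is absent, so retry_build treats a
# missing binary as a validation failure rather than a silent pass.
validate_cmd='
syscheck_bin=$(find build/install -name syscheck -type f 2>/dev/null | head -n 1)
[ -n "$syscheck_bin" ] && "$syscheck_bin"
'
RETRY_VALIDATE_CMD="$validate_cmd"
```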

4. bench.yml: build-failure cleanup path removed (intentional but worth confirming)
The old nick-fields/retry had on_retry_command: rm -rf pr/build master/build. The new plain run: step has no cleanup on failure (lines 108–115 of bench.yml). For Frontier login-node pre-builds that fail, a subsequent re-run of the workflow will encounter a stale build/ directory. The Phoenix path is fine (build directory is always nuked inside SLURM by common/bench.sh). For Frontier, build/ persistence across runs is the intended behaviour (cache), so this is likely fine — just confirming it's a conscious decision.

5. Minor: select-gpu-partition.sh node exclusion not reflected in sinfo count
The dead node atl1-1-03-002-29-0 is excluded via --exclude= in the SBATCH header (submit-slurm-job.sh line ~105), but sinfo in select-gpu-partition.sh does not filter it out. A node with cuInit error 999 is almost certainly in down or drain state (not idle/mix), so the idle count should be accurate in practice — no immediate fix required.


Improvement opportunities

  • Consider quoting the mkdir -p $tmpbuild and mkdir -p $currentdir calls in common/bench.sh (lines 19–20) since set -u is active and these paths contain a variable.
  • select-gpu-partition.sh (line 23): add sinfo -p "$_part" validation to handle partitions that no longer exist on the cluster (the || true suppresses this already, so it degrades gracefully — just noting it).

@codecov

codecov bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.94%. Comparing base (edff972) to head (61b84cf).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1299   +/-   ##
=======================================
  Coverage   44.94%   44.94%           
=======================================
  Files          70       70           
  Lines       20504    20504           
  Branches     1946     1946           
=======================================
  Hits         9216     9216           
  Misses      10166    10166           
  Partials     1122     1122           

☔ View full report in Codecov by Sentry.

submit_and_monitor_bench.sh cd's into master/ before calling
submit-slurm-job.sh, which reads the bench script via cat. Since
master branch doesn't have common/bench.sh yet, the cat fails.
Fix by resolving the bench script path from the PR tree (absolute
path) so it works regardless of cwd.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Claude Code Review

Head SHA: 61b84cfdf1a75a4811f4577af409f8f6abf18182
Files changed: 20 (387 additions, 461 deletions — net reduction)

Changed files:

  • .github/scripts/submit-slurm-job.sh (new — unified SLURM submit+monitor)
  • .github/workflows/common/test.sh (new — unified test script)
  • .github/workflows/common/bench.sh (new — unified bench script)
  • .github/scripts/select-gpu-partition.sh (new — sinfo-based GPU selection)
  • .github/workflows/test.yml, bench.yml (workflow simplification)
  • 11 per-cluster scripts deleted

Summary

  • Replaces nick-fields/retry JS wrapper (SIGKILL'd on Frontier login nodes) with plain run: steps; retry logic stays inside SLURM jobs via retry_build().
  • Consolidates 11 per-cluster scripts into 3 unified parameterized scripts; cluster-specific config lives in a case block in submit-slurm-job.sh.
  • Extracts GPU partition selection into a reusable select-gpu-partition.sh; GPU_PARTITION_MIN_NODES=2 ensures PR and master bench jobs land on the same partition.
  • Idempotent stale-job cancellation now covers all clusters uniformly.
  • Net code reduction of ~25% with improved consistency.

Findings

.github/workflows/common/bench.sh:50 — dropped -j $n_ranks from GPU bench command (possible regression)

Old frontier/bench.sh:

./mfc.sh bench --mem 4 -j $n_ranks -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks

New common/bench.sh:

./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

The -j $n_ranks flag is gone in the GPU branch. If this controls harness-level parallelism for the benchmark runner, removing it changes behavior. Worth confirming — if -n already handles this, a short comment would clarify intent.


.github/scripts/select-gpu-partition.sh:20 — sinfo node-state regex misses modifier suffixes

_idle=$(sinfo -p "$_part" --noheader -o "%t" 2>/dev/null | grep -cE "^(idle|mix)" || true)

SLURM can emit states with modifiers like idle~ (power-saving) or idle+ (not-responding). These are not matched by ^(idle|mix). This is inherited from the old run_parallel_benchmarks.sh logic, not a regression introduced here — noting it as an improvement opportunity.


.github/workflows/common/bench.sh:15 — TMPDIR collision risk on Phoenix

currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))

The random suffix space (0–8999) is small. Concurrent jobs on the same node can collide. Using $$ (PID) or mktemp -d would eliminate this. Pre-existing behavior from old phoenix scripts, but the unification is a good chance to fix it.
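
A collision-free version using mktemp -d, assuming the same storage prefix, might be:

```bash
tmpbuild=/storage/project/r-sbryngelson3-0/sbryngelson3/mytmp_build
mkdir -p "$tmpbuild"
# mktemp -d yields a unique directory even when several jobs share a node.
currentdir=$(mktemp -d "$tmpbuild/run-XXXXXX")
trap 'rm -rf "$currentdir"' EXIT
export TMPDIR="$currentdir"
```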


.github/scripts/submit-slurm-job.sh:155 — sbatch heredoc expands script contents in outer shell

submit_output=$(sbatch <<EOT
...
$sbatch_script_contents
EOT
)

Using an unquoted EOT delimiter causes the outer shell to expand $(), backticks, and ${var} in the embedded script before sbatch sees it. This is intentional here (to interpolate $job_slug, $job_device, etc. from the outer scope), but it means any future script sourced via this mechanism must not contain unexpanded ${} that differ from the outer shell's env. A comment explaining this design choice would help contributors avoid subtle bugs.
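
As a standalone illustration of the quoting distinction (the job name here is only an example):

```bash
# Unquoted delimiter: $job_slug and $(date) are expanded by the submitting
# shell before sbatch ever sees the script body.
sbatch <<EOT
#!/bin/bash
#SBATCH -J $job_slug
echo "submitted at $(date)"
EOT

# Quoted delimiter: the body is passed through literally, so $(hostname)
# runs inside the SLURM job instead of on the login node.
sbatch <<'EOT'
#!/bin/bash
echo "running on $(hostname)"
EOT
```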


Minor Improvement Opportunities

  1. submit-slurm-job.sh:60 — A new cluster added to the outer case "$cluster" block must also be added to the inner device case blocks or sbatch_device_opts will be unset (fails with -u). A comment noting this coupling would help.

  2. bench.yml:115 — The bench Setup & Build step lost retry without explanation. Adding # retry handled inside build.sh via retry_build() would make the intent clear.

  3. common/test.sh:27 — validate_cmd="" then RETRY_VALIDATE_CMD="$validate_cmd" for non-Phoenix paths. Since retry-build.sh reads ${RETRY_VALIDATE_CMD:-}, an empty assignment and unset behave the same today, but this is fragile. Prefer unset RETRY_VALIDATE_CMD for the non-Phoenix branch.


Overall this is a well-motivated refactor — the unified scripts are easier to reason about than 11 near-duplicate per-cluster files. The dropped -j bench flag is the only item I would want a clear answer on before merging.

sbryngelson and others added 2 commits March 10, 2026 15:49
RTX 6000 nodes can't finish the full test suite within the 3-hour
SLURM wall time. Use gpu-l40s as the new fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Claude Code Review

Head SHA: 1c4bd73c02dc2ed894d649c12f932bacd09e059f

Files changed: 20

Changed files
  • .github/scripts/retry-build.sh (comment/doc update)
  • .github/scripts/run-tests-with-retry.sh (deleted)
  • .github/scripts/run_parallel_benchmarks.sh (refactored to use shared script)
  • .github/scripts/select-gpu-partition.sh (new)
  • .github/scripts/submit-slurm-job.sh (new, replaces 3 per-cluster scripts)
  • .github/scripts/submit_and_monitor_bench.sh
  • .github/workflows/bench.yml
  • .github/workflows/common/bench.sh (new)
  • .github/workflows/common/test.sh (new)
  • 11 deleted per-cluster scripts (frontier/{bench,submit,test}.sh, frontier_amd symlinks, phoenix/{bench,submit-job,submit,test}.sh)
  • .github/workflows/test.yml

Summary

  • Removes the nick-fields/retry JS wrapper that was getting SIGKILL'd on Frontier login nodes post-build, replacing it with plain run: steps — the underlying retry logic is provided by retry_build() in retry-build.sh.
  • Consolidates 11 per-cluster SLURM scripts into 3 unified scripts: submit-slurm-job.sh, common/test.sh, common/bench.sh.
  • Adds shared select-gpu-partition.sh for sinfo-based partition selection, used by both test and bench jobs.
  • Excludes the dead GPU node atl1-1-03-002-29-0 (cuInit 999) in all Phoenix GPU jobs.
  • Reduces CI workflow steps from 5 → 2 for the self job, 5 → 3 for the case-opt job.

Findings

[Medium] common/test.sh: Phoenix GPU tests may rebuild without GPU flags

File: .github/workflows/common/test.sh, line 536

Old phoenix/test.sh passed build_opts (containing --gpu acc) to the test command:

# phoenix/test.sh (deleted)
./mfc.sh test -v --max-attempts 3 -a -j $n_test_threads $device_opts ${build_opts:---no-gpu} -- -c phoenix

New common/test.sh does not pass the build interface flag to the test run:

# common/test.sh line 536
./mfc.sh test -v --max-attempts 3 -a -j $n_test_threads $rdma_opts $device_opts $shard_opts -- -c $job_cluster

build_opts (set at line 475 from gpu-opts.sh) is used for the dry-run build at line 496, but is not forwarded to the live test command. If ./mfc.sh test triggers an incremental rebuild, it would do so without --gpu acc / --gpu omp, potentially creating CPU-only binaries. Please verify that the cached CMake configuration prevents recompilation, or explicitly pass $build_opts (or an equivalent flag like --no-build) here.


[Minor] GPU partition priority change may shift benchmark baselines

File: .github/scripts/select-gpu-partition.sh, line 115

New priority: gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100

Old run_parallel_benchmarks.sh priority: gpu-rtx6000 gpu-l40s gpu-v100 gpu-h200 gpu-h100 gpu-a100

The gpu-v100 moved from position 3 to last. Bench jobs that previously landed on v100 for reproducibility will now prefer h200/h100/a100. This is benign if intentional, but benchmark baselines in the historical record will appear as regressions/improvements when the partition switches. Consider adding a note to the PR/changelog that bench baselines will need to be re-established after this change.


[Minor] common/bench.sh drops -j flag from ./mfc.sh bench

File: .github/workflows/common/bench.sh, lines 451–453

Old frontier/bench.sh:

./mfc.sh bench --mem 4 -j $n_ranks -o "$job_slug.yaml" -- -c $job_cluster ...

New:

./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

The -j $n_ranks flag was removed from ./mfc.sh bench. If this flag controls build parallelism for the bench step (vs. the separate n_jobs for the build step above), this may serialize benchmark compilation. Intentional?


[Nit] bench.yml Setup & Build: no retry for Frontier login-node build failures

File: .github/workflows/bench.yml, lines 389–397

The nick-fields/retry with on_retry_command: rm -rf pr/build master/build has been removed. The PR explanation is correct (SIGKILL after successful build caused false failures). However, transient login-node build failures on Frontier now fail the job outright with no retry. This is an acceptable trade-off given the described failure mode.


Overall this is a well-motivated cleanup that reduces duplication and eliminates a known false-failure mode. The main item to verify before merging is the missing build_opts in the Phoenix test command (#1 above).

@github-actions

Claude Code Review

Head SHA: 1c4bd73c02dc2ed894d649c12f932bacd09e059f

Files changed: 20

File (+/-):
  • .github/scripts/submit-slurm-job.sh: +200 (new)
  • .github/workflows/common/test.sh: +70 (new)
  • .github/workflows/common/bench.sh: +54 (new)
  • .github/scripts/select-gpu-partition.sh: +34 (new)
  • .github/scripts/retry-build.sh: +5/-2
  • .github/workflows/test.yml: +7/-34
  • .github/workflows/bench.yml: +9/-15
  • 11 deleted per-cluster scripts: -849

Summary

  • Replaces nick-fields/retry JS action (SIGKILL'd on Frontier login nodes after successful builds) with plain run: steps, eliminating false CI failures.
  • Consolidates 11 per-cluster scripts into 3 unified scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh), significantly reducing duplication.
  • Extracts sinfo-based GPU partition selection into a shared select-gpu-partition.sh, with a GPU_PARTITION_MIN_NODES=2 guard so parallel benchmark jobs land on the same partition.
  • Hardcodes exclusion of broken node atl1-1-03-002-29-0 (persistent cuInit error 999).
  • Deletes dead code (run-tests-with-retry.sh).

Findings

[Medium] bench.yml: Frontier login-node builds now have no retry

.github/workflows/bench.yml, new "Setup & Build" step removes nick-fields/retry with max_attempts: 2 and on_retry_command: rm -rf pr/build master/build. If a Frontier bench login-node build hits a transient failure (not a SIGKILL on the Actions runner side), there is no retry. The PR states that retry_build() inside cluster build.sh scripts handles retries — please confirm all build.sh scripts referenced by matrix.build_script in bench.yml internally call retry_build(), otherwise a transient build failure will fail the entire bench job with no recovery.


[Minor] common/bench.sh lines 424-427: Unquoted variables in TMPDIR setup

tmpbuild=/storage/project/r-sbryngelson3-0/sbryngelson3/mytmp_build
currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))
mkdir -p $tmpbuild
mkdir -p $currentdir

$tmpbuild and $currentdir should be quoted (mkdir -p "$tmpbuild" / mkdir -p "$currentdir"). The trap 'rm -rf "$currentdir"' on line 427 is correctly quoted, making this inconsistent. Low practical risk given the fixed path, but worth fixing.


[Minor] select-gpu-partition.sh line 115: Partition priority change may affect benchmark baseline consistency

_GPU_PARTITION_PRIORITY="gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100"

RTX 6000 is excluded entirely (comment: "too slow for the test suite time limit"). The old run_parallel_benchmarks.sh preferred gpu-rtx6000 first for benchmarks ("gives the most consistent baselines"). This shared script is now used for both tests and benchmarks. If historical benchmark baselines were collected on rtx6000, the shift to l40s/h-series may affect PR-vs-master comparisons. Confirm this is acceptable for bench jobs (4-hour time limit is more generous than the 3-hour test limit).


[Minor] common/bench.sh lines 451/453: -j parallelism flag dropped from ./mfc.sh bench

Old phoenix/bench.sh passed -j $n_jobs (CPU cores) before -- to ./mfc.sh bench. The new unified script omits this. If ./mfc.sh bench uses -j for build or execution parallelism, Phoenix bench jobs may run slower. The old frontier/bench.sh used -j $n_ranks (MPI ranks, not CPU cores — apparent copy-paste bug), so dropping it for Frontier is likely correct. Worth verifying the Phoenix case is intentional.


Overall a clean, well-motivated CI refactor. The no-retry concern for Frontier bench login-node builds is the main thing to confirm before merge.

The dry-run build uses build_opts but the live test command didn't.
CMake caches the config, but passing it explicitly is safer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Claude Code Review

Head SHA: b51f214469a1296bcfd2ef39678317454a67b06e

Files changed: 20

  • .github/scripts/retry-build.sh
  • .github/scripts/run-tests-with-retry.sh (deleted)
  • .github/scripts/run_parallel_benchmarks.sh
  • .github/scripts/select-gpu-partition.sh (new)
  • .github/scripts/submit-slurm-job.sh (new)
  • .github/scripts/submit_and_monitor_bench.sh
  • .github/workflows/bench.yml
  • .github/workflows/common/bench.sh (new)
  • .github/workflows/common/test.sh (new)
  • .github/workflows/frontier/{bench,submit,test}.sh (deleted)
  • .github/workflows/frontier_amd/{bench,submit,test}.sh (deleted, were symlinks)
  • .github/workflows/phoenix/{bench,submit-job,submit,test}.sh (deleted)
  • .github/workflows/test.yml

Summary:

  • Removes nick-fields/retry JS action (SIGKILL culprit on Frontier) and replaces with plain run: steps backed by retry_build() inside SLURM jobs.
  • Consolidates 11 per-cluster scripts into 3 unified scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh) + shared select-gpu-partition.sh.
  • Phoenix build/test/bench flow now runs entirely inside SLURM (no split submit+monitor steps).
  • GPU partition selection extracted to a reusable script; gpu-rtx6000 excluded; parallel bench jobs require 2 idle/mix nodes (GPU_PARTITION_MIN_NODES=2).
  • Dead node atl1-1-03-002-29-0 excluded via #SBATCH --exclude= in the unified submit script.

Findings

1. Missing -j flag in unified bench command (behavioral regression risk)

.github/workflows/common/bench.sh, lines 41–45:

if [ "$job_device" = "gpu" ]; then
    ./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
else
    ./mfc.sh bench --mem 1 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks
fi

Both old scripts passed a -j flag: phoenix/bench.sh used -j $n_jobs (capped at 64) and frontier/bench.sh used -j $n_ranks. The new unified script drops -j entirely. If mfc.sh bench defaults to nproc this is fine on small nodes but could be unexpectedly aggressive on Frontier's 192-core GNR nodes. Worth verifying the default or restoring an explicit -j $n_jobs.

2. No retry for login-node build in bench workflow

.github/workflows/bench.yml, lines 105–119:
The old nick-fields/retry wrapper included on_retry_command: rm -rf pr/build master/build. The new plain run: step has no retry. For Frontier (which pre-builds on a login node), a transient build failure now fails CI with no second chance. This is intentional given the SIGKILL issue, but it means flaky Frontier login-node builds will show as real failures. Consider wrapping with retry_build or documenting that Frontier re-runs are expected to self-heal via the job's own build logic when build/ is absent.

3. Inconsistency in GPU partition priority between test and bench contexts

.github/scripts/select-gpu-partition.sh, line 14:

_GPU_PARTITION_PRIORITY="gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100"

The old run_parallel_benchmarks.sh explicitly preferred gpu-rtx6000 first for benchmark consistency ("most nodes, most consistent baselines"). The new unified script excludes gpu-rtx6000 entirely (comment says "too slow for the test suite time limit"). Using the same script for both test and bench jobs means bench jobs may now land on L40S or H200 nodes — faster, but potentially less stable baselines compared to historical RTX 6000 runs. If benchmark trend data is used for regression detection, this partition change will create a discontinuity. Consider whether bench-specific partition preferences should be passed via GPU_PARTITION_MIN_NODES / a separate priority list.
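
One way to decouple the two, sketched here as a hypothetical override rather than the PR's actual behavior, is to make the priority list itself a parameter:

```bash
# In select-gpu-partition.sh: allow callers to override the priority list.
_GPU_PARTITION_PRIORITY="${GPU_PARTITION_PRIORITY:-gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100}"

# In the bench path: prefer the historical baseline partition first (illustrative ordering).
GPU_PARTITION_PRIORITY="gpu-rtx6000 gpu-l40s gpu-v100"
source .github/scripts/select-gpu-partition.sh
```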

4. select-gpu-partition.sh is sourced with source but not marked executable

.github/scripts/select-gpu-partition.sh is created without an execute bit (no chmod +x in the diff, and the file itself has no shebang-based #!/bin/bash protection since it's always sourced). This is fine as long as callers always use source/. , which they do — just confirm it's not accidentally called with bash anywhere.


Minor / no action needed

  • RANDOM % 9000 in common/bench.sh (vs old % 900) — more unique tmp dirs, harmless improvement.
  • The frontier_amd stubs (symlinks → deleted) are correctly replaced by cluster dispatch in submit-slurm-job.sh.
  • The new submit-slurm-job.sh heredoc uses #SBATCH -o ${output_file} (space after -o); old scripts used #SBATCH -o$output_file — both are valid SBATCH syntax.

Overall this is a clean, well-motivated simplification. The main item to verify before merge is whether dropping -j from the bench command is intentional.

sbryngelson and others added 3 commits March 10, 2026 16:00
Under set -e, 'wait $pid' returning non-zero aborts the script
before the exit code is captured, leaving the second parallel job
unmonitored. Use 'wait $pid || exit=$?' so both jobs are always
waited on.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
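
The pattern this commit describes, reduced to a self-contained sketch with placeholder background jobs standing in for the PR and master monitors:

```bash
#!/bin/bash
set -e
# Placeholder jobs; the real script backgrounds two submit-and-monitor calls.
(sleep 2; exit 0) & pr_pid=$!
(sleep 2; exit 1) & master_pid=$!

pr_status=0; master_status=0
# Without '|| ...', the first failing 'wait' would abort the script under
# set -e and the second job would never be waited on.
wait "$pr_pid"     || pr_status=$?
wait "$master_pid" || master_status=$?

echo "pr=$pr_status master=$master_status"
exit $(( pr_status || master_status ))
```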
- select-gpu-partition.sh: priority comment said 'smaller/older' but
  list is now L40S/H200/H100/A100/V100
- submit-slurm-job.sh: 'Idempotent' → 'Rerun-safe' (it always
  submits a new job)
- bench.sh: n_jobs only used for build, not bench

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For non-Phoenix GPU jobs, both device_opts and build_opts resolved to
the same --gpu flag. Let build_opts carry it; device_opts is only for
cluster-specific runtime flags like -g (Phoenix GPU IDs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Claude Code Review

Head SHA: 698bd2e7f642c4f2fa6ffb24523109ea5b6e558e

Files changed: 20

  • .github/scripts/retry-build.sh (modified)
  • .github/scripts/run-tests-with-retry.sh (deleted)
  • .github/scripts/run_parallel_benchmarks.sh (modified)
  • .github/scripts/select-gpu-partition.sh (new)
  • .github/scripts/submit-slurm-job.sh (new)
  • .github/scripts/submit_and_monitor_bench.sh (modified)
  • .github/workflows/bench.yml (modified)
  • .github/workflows/common/bench.sh (new)
  • .github/workflows/common/test.sh (new)
  • .github/workflows/frontier/bench.sh (deleted)
  • .github/workflows/frontier/submit.sh (deleted)
  • .github/workflows/frontier/test.sh (deleted)
  • .github/workflows/frontier_amd/{bench,submit,test}.sh (deleted symlinks)
  • .github/workflows/phoenix/{bench,submit-job,submit,test}.sh (deleted)
  • .github/workflows/test.yml (modified)

Summary

  • Replaces nick-fields/retry JS action (SIGKILL'd on Frontier login nodes) with plain run: steps; retry is now handled entirely inside retry_build() in shell scripts.
  • Consolidates 11 per-cluster scripts into 3 unified scripts (submit-slurm-job.sh, common/test.sh, common/bench.sh), reducing duplication significantly.
  • Adds select-gpu-partition.sh for sinfo-based GPU partition selection, shared between test and bench jobs; requires 2 idle/mix nodes for bench to keep PR+master jobs on the same partition.
  • Excludes dead node atl1-1-03-002-29-0 (persistent cuInit 999) from Phoenix GPU sbatch options.
  • Deletes dead code run-tests-with-retry.sh (was never called).

Findings

1. common/bench.sh:482-484 — -j (parallel-jobs) flag dropped from ./mfc.sh bench

Old phoenix/bench.sh:

./mfc.sh bench --mem 4 -j $n_jobs -o "$job_slug.yaml" -- -c phoenix-bench $device_opts -n $n_ranks

Old frontier/bench.sh (GPU path):

./mfc.sh bench --mem 4 -j $n_ranks -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks

New common/bench.sh:

./mfc.sh bench --mem 4 -o "$job_slug.yaml" -- -c $bench_cluster $device_opts -n $n_ranks

The -j flag controlling build/bench worker parallelism is absent in the unified script. If ./mfc.sh bench has a built-in default, this may be fine, but it is a behavioural change that could affect benchmark throughput or reproducibility. Worth confirming the default is acceptable (or re-adding -j $n_jobs).

2. select-gpu-partition.sh:135,142 — Comment says "smaller/older first" but gpu-v100 is last

# Priority order prefers smaller/older nodes to leave modern GPUs free
_GPU_PARTITION_PRIORITY="gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100"

gpu-v100 (older/smaller than h100/h200/a100) is at the end, contradicting the stated intent. If the aim is to leave modern GPUs free for production, gpu-v100 should be earlier in the list (e.g., gpu-l40s gpu-v100 gpu-h200 gpu-h100 gpu-a100). This also diverges from the old bench ordering (rtx6000 → l40s → v100 → h200 → h100 → a100). Minor but could affect which GPU is chosen for benchmarks.

3. submit-slurm-job.sh:326-354 — Unquoted heredoc with script content expansion

sbatch_script_contents=$(cat "$script_path")
...
submit_output=$(sbatch <<EOT
...
$sbatch_script_contents
EOT
)

The unquoted EOT delimiter means bash expands $sbatch_script_contents during heredoc evaluation. Any $(...) or backtick constructs in the bench/test scripts execute in the submission shell. This is a pre-existing pattern (identical to the old frontier/submit.sh), so it is low risk for trusted internal scripts, but it is worth being aware of if scripts ever include command substitutions meant to run inside the SLURM job.

4. test.yml:237-252 — Frontier login-node build loses GitHub-Actions-level retry

The old step used nick-fields/retry with on_retry_command: rm -rf build. The new step is a plain run: with no retry wrapper. The PR notes that build.sh scripts call retry_build() internally — confirmed by the codebase — so this is intentional and correct. Just noting it for reviewers who might question why the external retry was dropped without replacement.


Overall: Clean, well-scoped CI infrastructure simplification. The -j omission in common/bench.sh is the only change that could silently degrade benchmark performance. The GPU partition priority order is worth a second look for correctness.
