
Conversation


Copilot AI commented Feb 11, 2026

SLURM job scripts in research/ and scripts/ cannot be validated locally before cluster submission, causing slow iteration and wasted resources on configuration errors.

Implementation

Docker Compose setup (docker-compose.slurm.yml)

  • SLURM controller (slurmctld) and compute node (slurmd) containers
  • Project volume-mounted at /workspace for script access
  • GPU passthrough optional (commented by default, requires NVIDIA Docker runtime)
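
For reference, the compose layout is roughly of the following shape. Everything here other than the ami-ml-slurmctld container name and the c1 compute node is illustrative and may differ from the shipped docker-compose.slurm.yml (in particular, the role environment variable name is an assumption):

```yaml
services:
  slurmctld:
    build: ./docker/slurm
    container_name: ami-ml-slurmctld   # name used in the docker exec examples below
    hostname: slurmctld
    environment:
      - SLURM_ROLE=controller          # assumed name for the entrypoint's role switch
    volumes:
      - .:/workspace                   # project mounted so sbatch can reach the job scripts

  c1:
    build: ./docker/slurm
    container_name: ami-ml-slurmd-c1
    hostname: c1
    environment:
      - SLURM_ROLE=compute
    volumes:
      - .:/workspace
    # GPU passthrough stays commented out by default; see the review note further
    # down about using `gpus: all` rather than a swarm-only `deploy:` section.
```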

SLURM configuration (docker/slurm/slurm.conf)

  • Matches DRAC/Compute Canada partition structure: main (48h), long, long-cpu
  • Fair-share scheduling with priority policies per DRAC spec
  • Process tracking via pgid (cgroup unavailable in Docker)
  • Resources: 4 CPUs, 16GB RAM (configurable)
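
The settings listed above correspond to slurm.conf entries roughly like the following. This is a sketch, not the shipped docker/slurm/slurm.conf; the time limits on the long partitions in particular are assumptions:

```conf
ClusterName=ami-ml-local
SlurmctldHost=slurmctld

# Process tracking via process groups; cgroups are not reliably available in Docker
ProctrackType=proctrack/pgid

# Backfill scheduling with multifactor (fair-share) priority, per the DRAC docs
SchedulerType=sched/backfill
PriorityType=priority/multifactor

# No database accounting in the local environment
AccountingStorageType=accounting_storage/none

# Single compute node with modest, configurable resources (4 CPUs, ~16 GB RAM)
NodeName=c1 CPUs=4 RealMemory=16000 State=UNKNOWN

# Partitions mirroring the DRAC/Compute Canada layout
PartitionName=main Nodes=c1 Default=YES MaxTime=48:00:00 State=UP
PartitionName=long Nodes=c1 MaxTime=7-00:00:00 State=UP
PartitionName=long-cpu Nodes=c1 MaxTime=7-00:00:00 State=UP
```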

Example job scripts (docker/slurm/examples/)

  • Basic validation (job_hello.sh)
  • Environment setup testing (job_test_env.sh, job_test_ami_ml.sh)
  • GPU testing (job_gpu_test.sh)
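
Each example follows the standard sbatch header convention; a job along the lines of job_hello.sh looks roughly like this (the shipped script may use different directives):

```bash
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=main
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00
#SBATCH --output=hello_%j.out

# Confirm the scheduler ran us on the compute node and the workspace mount is visible
echo "Hello from $(hostname) at $(date)"
echo "SLURM job ID: ${SLURM_JOB_ID}"
ls /workspace | head
```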

CI integration (.github/workflows/test-slurm-jobs.yml)

  • Automated cluster build and test job execution

Usage

# Start cluster
docker compose -f docker-compose.slurm.yml up -d

# Submit jobs
docker exec -it ami-ml-slurmctld bash
cd /workspace
sbatch research/order_level_classifier/job_train_classifier.sh

# Monitor
sinfo    # cluster status
squeue   # job queue
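
To check a job's output once it has run, something like the following works; the filename below assumes SLURM's default slurm-<jobid>.out naming, while the example scripts may set their own --output patterns:

```bash
# Inspect the state of a specific job while SLURM still has it in memory
scontrol show job 1

# Read the job's output file from the mounted workspace
cat slurm-1.out
```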

See docker/slurm/README.md for adapting real job scripts (module loading, data paths, resource limits).
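
The kind of adaptation involved is roughly the following; the "before" lines are a hypothetical DRAC-style header, not taken from an actual script in this repo:

```bash
# On DRAC / Compute Canada (hypothetical excerpt):
#   module load python/3.10 cuda/12.2
#   DATA_DIR=$SCRATCH/ami/data

# In the local Docker SLURM environment there is no module system or scratch
# filesystem, so the same step becomes conda activation plus a workspace path:
eval "$(conda shell.bash hook)"
conda activate ami-ml
DATA_DIR=/workspace/data
```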

Limitations

  • Single compute node vs. multi-node cluster
  • No environment modules system (conda is used directly)
  • No HPC filesystem hierarchy (scratch/project)
  • No database accounting
  • Process isolation via pgid instead of cgroup

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • c1
    • Triggering command: /usr/sbin/slurmctld -D -vvv (dns block)
  • download.schedmd.com
    • Triggering command: /usr/bin/wget REDACTED (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This section details the original issue you should resolve

<issue_title>feat: add Docker Compose environment to simulate SLURM for local testing</issue_title>
<issue_description>## Context

Training jobs are submitted to HPC clusters via SLURM scripts in scripts/ and research/. There is no way to test these job scripts locally before scheduling real jobs, which makes iteration slow and wastes cluster resources on configuration errors.

Proposed Changes

Add a Docker Compose setup that simulates a minimal SLURM environment locally:

  • review the DRAC / Compute Canada SLURM documentation to match their config as closely as possible: https://docs.alliancecan.ca/wiki/Running_jobs. See the bash scripts in the research/ folder of this repo for real examples of how the pipeline is scheduled in DRAC's SLURM, e.g. research/order_level_classifier/job_train_classifier.sh
  • A container with a SLURM controller and single compute node (e.g., using giovtorres/slurm-docker-cluster or similar)
  • The project mounted as a volume so job scripts can be submitted with sbatch
  • GPU passthrough optional (for CPU-only smoke tests, training can run for 1-2 epochs)
  • A README explaining how to start the environment, submit jobs, and check output
  • Create a GitHub workflow to test SLURM jobs

This would allow developers to validate SLURM scripts, environment setup, and pipeline orchestration before submitting to the real cluster.

Related

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 5 commits February 11, 2026 03:23
Co-authored-by: mihow <158175+mihow@users.noreply.github.com>
Co-authored-by: mihow <158175+mihow@users.noreply.github.com>
Co-authored-by: mihow <158175+mihow@users.noreply.github.com>
Co-authored-by: mihow <158175+mihow@users.noreply.github.com>
Co-authored-by: mihow <158175+mihow@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add Docker Compose environment to simulate SLURM for local testing" to "feat: add Docker Compose SLURM environment for local job testing" Feb 11, 2026
Copilot AI requested a review from mihow February 11, 2026 03:35
@mihow mihow requested review from Copilot and removed request for mihow February 11, 2026 03:48

Copilot AI left a comment


Pull request overview

Adds a local Docker Compose–based SLURM “mini cluster” so developers can validate SLURM job scripts (from research/ and scripts/) before submitting to DRAC/Compute Canada, plus a CI workflow to smoke-test job submission.

Changes:

  • Introduces a SLURM controller + single compute node Compose stack with Ubuntu-based SLURM image and entrypoint wiring (munge + slurm daemons).
  • Adds SLURM configs and example job scripts (hello/env/ami-ml/GPU) to exercise common workflows.
  • Adds a GitHub Actions workflow to build the cluster and submit a basic test job.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 11 comments.

| File | Description |
|------|-------------|
| docker-compose.slurm.yml | Compose definition for controller/compute services, volumes, and optional GPU configuration. |
| docker/slurm/Dockerfile | Builds the SLURM image (Ubuntu + slurm + munge + miniconda + poetry) and installs configs/entrypoint. |
| docker/slurm/entrypoint.sh | Container entrypoint that starts munge and runs slurmctld/slurmd based on role env var. |
| docker/slurm/slurm.conf | SLURM cluster config (partitions, scheduling, accounting off, pgid proctrack, node resources). |
| docker/slurm/cgroup.conf | cgroup constraint config file (currently not aligned with pgid proctrack in slurm.conf). |
| docker/slurm/gres.conf | Example GPU GRES device mapping for node c1. |
| docker/slurm/examples/job_hello.sh | Minimal “hello world” SLURM job for basic submission validation. |
| docker/slurm/examples/job_test_env.sh | Example job intended to validate conda/poetry/workspace availability. |
| docker/slurm/examples/job_test_ami_ml.sh | Example job intended to validate installing/running ami-ml CLI entrypoints. |
| docker/slurm/examples/job_gpu_test.sh | Example job for verifying GPU passthrough and torch CUDA visibility. |
| docker/slurm/README.md | Documentation for running the local SLURM environment and troubleshooting/usage guidance. |
| README.md | Adds a repo-level pointer to the SLURM local testing docs and quick-start commands. |
| .github/workflows/test-slurm-jobs.yml | CI workflow that builds the image, starts the cluster, and submits smoke-test jobs. |


```bash
# Check GPU availability
scontrol show node c1 | grep Gres
```


Copilot AI Feb 11, 2026


The GPU section suggests sbatch --gres=gpu:1 ..., but in the default config GresTypes=gpu is commented out and gres.conf isn't copied into the image by default. As written, GPU jobs are likely to stay pending or be rejected until those pieces are enabled. Consider adding explicit steps here (what to uncomment/copy in slurm.conf, Dockerfile, and/or compose) so users can actually get GRES working.

Suggested change
To make `--gres=gpu` work in this environment, you also need to:
- Ensure `GresTypes=gpu` is enabled in your `slurm.conf` (uncomment or add a line such as `GresTypes=gpu`).
- Provide a `gres.conf` file that defines the GPUs on the compute node and make it available inside the Slurm containers (for example by adding `COPY gres.conf /etc/slurm/gres.conf` in the Slurm Dockerfile or by bind-mounting it via Docker Compose).
- Include an entry for the compute node in `gres.conf`, for example:
  ```conf
  NodeName=c1 Name=gpu File=/dev/nvidia0
  ```
- Restart the Slurm Docker stack after changing configuration files, e.g.:

      docker compose down
      docker compose up -d

conda --version

# Create and activate a test environment
echo "Creating test conda environment..."

Copilot AI Feb 11, 2026


conda activate test-env will fail in a non-interactive script unless the conda shell hook is initialized (e.g., eval "$(/opt/miniconda3/bin/conda shell.bash hook)" or sourcing conda.sh). Add the conda init step before calling conda activate, similar to job_test_ami_ml.sh.

Suggested change
echo "Creating test conda environment..."
echo "Creating test conda environment..."
# Initialize conda for this non-interactive shell
eval "$(conda shell.bash hook)"


- **Authentication**: MUNGE for inter-process communication
- **Scheduling**: Backfill scheduler with fair-share policies
- **Resource Management**: Cgroup-based process tracking and resource constraints

Copilot AI Feb 11, 2026


The README claims "Cgroup-based process tracking and resource constraints", but docker/slurm/slurm.conf explicitly configures ProctrackType=proctrack/pgid and comments out TaskPlugin=task/cgroup for Docker compatibility. Update the README to match the actual configuration (or enable cgroup tracking and document the required Docker/cgroup setup).

Suggested change
- **Resource Management**: Cgroup-based process tracking and resource constraints
- **Resource Management**: Process tracking via Linux process groups (`ProctrackType=proctrack/pgid`); cgroup-based tracking is disabled in this Docker setup for compatibility.

1. **Scale**: Single node vs. multi-node cluster
2. **Modules**: No `module load` environment modules (uses conda directly)
3. **Storage**: No scratch/project filesystem hierarchy
4. **Accounting**: No database accounting (uses flat file)

Copilot AI Feb 11, 2026


This says accounting "uses flat file", but the shipped slurm.conf sets AccountingStorageType=accounting_storage/none (no accounting storage). Either update the README wording or change the SLURM config to match what's documented.

Suggested change
4. **Accounting**: No database accounting (uses flat file)
4. **Accounting**: Disabled (no accounting storage; `AccountingStorageType=accounting_storage/none`)

Comment on lines +35 to +39
docker-compose -f docker-compose.slurm.yml build

- name: Start SLURM cluster
run: |
docker-compose -f docker-compose.slurm.yml up -d

Copilot AI Feb 11, 2026


The workflow uses the legacy docker-compose binary. On GitHub-hosted runners (and many local setups) only the Compose v2 plugin (docker compose) is guaranteed to be present; docker-compose may be missing. Consider switching these commands to docker compose ... for more reliable CI execution.

Suggested change
docker-compose -f docker-compose.slurm.yml build
- name: Start SLURM cluster
run: |
docker-compose -f docker-compose.slurm.yml up -d
docker compose -f docker-compose.slurm.yml build
- name: Start SLURM cluster
run: |
docker compose -f docker-compose.slurm.yml up -d

Quick start:
```bash
# Build and start the SLURM cluster
docker-compose -f docker-compose.slurm.yml up -d
```

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs use the legacy docker-compose binary. Consider using the Compose v2 plugin syntax (docker compose -f docker-compose.slurm.yml ...) to match modern Docker installs where docker-compose may not be available.

Suggested change
docker-compose -f docker-compose.slurm.yml up -d
docker compose -f docker-compose.slurm.yml up -d

Comment on lines +18 to +41
eval "$(/opt/miniconda3/bin/conda shell.bash hook)"
conda activate ami-ml || {
echo "Creating ami-ml environment..."
conda create -n ami-ml python=3.9 -y
conda activate ami-ml

# Install dependencies with Poetry
poetry install
}

# Load environment variables
if [ -f .env ]; then
set -o allexport
source .env
set +o allexport
echo "Loaded environment variables from .env"
else
echo "Warning: .env file not found"
fi

# Test that ami-ml commands are available
echo "Testing ami-ml commands..."
ami-classification --help || echo "ami-classification command not available"
ami-dataset --help || echo "ami-dataset command not available"

Copilot AI Feb 11, 2026


poetry install will, by default, create an isolated virtualenv and install ami-dataset/ami-classification entrypoints there. Since the script later calls ami-classification/ami-dataset directly (not via poetry run), these commands are likely to be missing even after install. Either run the checks via poetry run ... or configure Poetry to install into the active conda env (e.g., disable Poetry venv creation) and consider failing the job when the commands aren't found so this script actually validates the setup.
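
A sketch of the two ways this could be addressed, assuming the job keeps using the conda environment created above:

```bash
# Option 1: tell Poetry to install into the currently active environment
# instead of creating its own virtualenv, so the entrypoints land on PATH
poetry config virtualenvs.create false
poetry install

# Option 2: keep Poetry's virtualenv but run the entrypoints through it,
# and fail the job if they are missing so the script actually validates the setup
poetry run ami-dataset --help || exit 1
poetry run ami-classification --help || exit 1
```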

Comment on lines +96 to +112
# Wait for job to complete (with timeout)
timeout=60
elapsed=0
while [ $elapsed -lt $timeout ]; do
status=$(docker exec ami-ml-slurmctld squeue -j $JOB_ID -h -o "%T" 2>/dev/null || echo "COMPLETED")
if [ "$status" = "COMPLETED" ] || [ -z "$status" ]; then
echo "Job $JOB_ID completed"
break
fi
echo "Job $JOB_ID status: $status (waiting...)"
sleep 5
elapsed=$((elapsed + 5))
done

# Display job output
docker exec ami-ml-slurmctld bash -c "cd /workspace && cat test_simple_*.out"


Copilot AI Feb 11, 2026


In the polling loop, if the job never completes within the timeout, the script exits the loop and continues without failing explicitly. Add a check after the loop to fail the step when elapsed >= timeout (and/or when the final state is not COMPLETED), so CI deterministically reports a failure instead of hanging or passing with partial output.
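
A sketch of the kind of check being suggested, reusing the variables from the excerpt above:

```bash
# After the polling loop: fail explicitly if the job never finished in time
if [ "$elapsed" -ge "$timeout" ]; then
  echo "Job $JOB_ID did not complete within ${timeout}s"
  docker exec ami-ml-slurmctld scontrol show job "$JOB_ID" || true
  exit 1
fi
```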

Comment on lines +44 to +53
# Uncomment the deploy section below to enable GPU support
# Requires NVIDIA Docker runtime to be installed
# deploy:
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: all
# capabilities: [gpu]


Copilot AI Feb 11, 2026


The commented GPU passthrough example uses the deploy: section. For docker compose up (non-Swarm), deploy is typically ignored, so uncommenting this may not actually provide GPUs to the container. Prefer the Compose v2 GPU mechanism (e.g., gpus: all / device requests) and document the exact command/runtime requirement.

Suggested change
# Uncomment the deploy section below to enable GPU support
# Requires NVIDIA Docker runtime to be installed
# deploy:
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: all
# capabilities: [gpu]
# Uncomment the line below to enable GPU support with Docker Compose v2
# Requires NVIDIA Container Toolkit / runtime to be installed and configured
# Run with: docker compose -f docker-compose.slurm.yml up
# gpus: all

Comment on lines +42 to +46
```bash
# From the repository root
docker-compose -f docker-compose.slurm.yml build
docker-compose -f docker-compose.slurm.yml up -d
```

Copilot AI Feb 11, 2026


These instructions use docker-compose. If the project wants to be compatible with Docker Compose v2-only installs, consider updating the docs to use docker compose ... (or note that either works if a compatibility shim is installed).
