
fix: TCL-6370 revert login pod to configless-only #34

Open
sagrawal-byte wants to merge 1 commit into slurm-1.0-together-changes from sagrawal/tcl-6370-login-pod-configless


@sagrawal-byte

Summary

  • Revert the login pod to upstream Slinky's fully-configless layout
  • Drop the static slurm.conf mount at /etc/slurm/ that was intercepting srun's config lookup
  • Drop the initContainer + emptyDir chmod dance from commit 57c7bdc
  • /etc/slurm/ on the login pod now contains only slurm.key (the sackd auth key); everything else comes from sackd's configless fetch into /run/slurm/conf/

Why

TCL-6370 reported that Pyxis stopped loading on v1.0 login pods. Root-cause analysis (see Linear comment):

The Together fork's login-deployment.yaml (introduced in commit 910e26f, 2025-11-23) statically mounts the slurm-config ConfigMap at /etc/slurm/, but does not mount slurm-config-extra, which holds plugstack.conf. As a result, /etc/slurm/plugstack.conf is missing, and srun's default-path lookup for the SPANK config fails.

Before commit 57c7bdc, srun could not read the mode-0600 slurm.conf and fell through to /run/slurm/conf/ (where sackd's configless distribution places plugstack.conf), so Pyxis worked by accident. Once 57c7bdc made slurm.conf readable to non-root users, srun stopped falling through and the latent bug surfaced.

How this fixes it

With only slurm.key at /etc/slurm/:

  1. srun looks for /etc/slurm/slurm.conf → missing
  2. srun falls through to /run/slurm/conf/slurm.conf (precedence #4b in Slurm's lookup chain)
  3. Sibling /run/slurm/conf/plugstack.conf is there (sackd put it there via configless fetch)
  4. srun dlopens spank_pyxis.so → Pyxis loads → --container-image works
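
The fall-through in the steps above can be modeled with a small Python sketch. This is a toy model of the behaviour as this PR describes it, not Slurm's actual implementation: the two-directory search order, the rule that plugstack.conf is looked up beside the chosen slurm.conf, and the names `SEARCH_DIRS`, `resolve_config`, and `scenarios` are all assumptions made for illustration.

```python
# Toy model of the lookup behaviour described in this PR. It is NOT Slurm's
# implementation; the search order and the sibling-directory rule for
# plugstack.conf are assumptions taken from the PR description.
import posixpath

SEARCH_DIRS = ["/etc/slurm", "/run/slurm/conf"]  # order assumed from this PR


def resolve_config(readable_paths):
    """Return (slurm_conf_path, plugstack_found) for a set of readable files.

    readable_paths: absolute paths that the calling user is able to read.
    """
    for directory in SEARCH_DIRS:
        conf = posixpath.join(directory, "slurm.conf")
        if conf in readable_paths:
            plugstack = posixpath.join(directory, "plugstack.conf")
            return conf, plugstack in readable_paths
    return None, False


# Files that sackd's configless fetch provides, per the PR description.
CONFIGLESS = {"/run/slurm/conf/slurm.conf", "/run/slurm/conf/plugstack.conf"}

scenarios = {
    # mode-0600 slurm.conf exists but is unreadable to the user, so it is
    # left out of the readable set and the lookup falls through
    "before 57c7bdc": CONFIGLESS,
    # slurm.conf is readable at /etc/slurm/, with no plugstack.conf beside it
    "after 57c7bdc": CONFIGLESS | {"/etc/slurm/slurm.conf"},
    # this PR: only slurm.key lives in /etc/slurm/
    "this PR": CONFIGLESS | {"/etc/slurm/slurm.key"},
}

for name, paths in scenarios.items():
    print(name, resolve_config(paths))
```

In this model, only the "after 57c7bdc" scenario resolves to /etc/slurm/slurm.conf without a sibling plugstack.conf, which matches the regression reported in TCL-6370.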

This also fixes the original LDAP permission problem that 57c7bdc was solving, because the mixed-permission static mount no longer exists. sackd writes configs to /run/slurm/conf/ at default mode 0644, so LDAP users can read them naturally.
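
The permission difference between mode 0600 and mode 0644 can be checked with Python's standard `stat` module. This is a minimal sketch that assumes an LDAP user is neither the owner of the file nor a member of its group, so only the "other" permission bits apply; the helper names `readable_by_others` and `describe` are hypothetical.

```python
import stat


def readable_by_others(mode):
    """True if the 'other' read bit is set in a POSIX permission mode."""
    return bool(mode & stat.S_IROTH)


def describe(mode):
    """Render a regular-file mode the way `ls -l` would show it."""
    return stat.filemode(stat.S_IFREG | mode)


# Assumption: the LDAP user is neither the file owner nor in its group.
for mode in (0o600, 0o644):
    print(oct(mode), describe(mode), readable_by_others(mode))
```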

Test plan

After deploying this change to a v1.0 cluster:

  • kubectl exec slurm-login-X -- ls -la /etc/slurm/ — should show only slurm.key
  • kubectl exec slurm-login-X -- ls -la /run/slurm/conf/ — should show slurm.conf, plugstack.conf, cgroup.conf, etc.
  • As an LDAP user via SSH: sinfo — should print partitions, not Permission denied
  • As an LDAP user via SSH: srun --container-image=ubuntu:24.04 cat /etc/os-release — should print Ubuntu 24.04 (Pyxis loaded)
  • As an LDAP user via SSH: srun hostname — should print the slurmd pod's hostname
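
The first two checks could be scripted roughly as follows. This is a sketch, not part of the PR: the function names are hypothetical, `slurm-login-X` is the same placeholder pod name used above, and it uses a plain `ls` (rather than `ls -la`) so that hidden entries Kubernetes may create inside mounted volumes are not counted.

```python
import subprocess

# Files the test plan expects sackd to have fetched (cgroup.conf etc. may
# also be present; only these two are required by this sketch).
REQUIRED_RUN_CONF = {"slurm.conf", "plugstack.conf"}


def etc_slurm_ok(names):
    """/etc/slurm/ should contain only slurm.key."""
    return set(names) == {"slurm.key"}


def run_conf_ok(names):
    """/run/slurm/conf/ should contain at least slurm.conf and plugstack.conf."""
    return REQUIRED_RUN_CONF <= set(names)


def list_dir(pod, path):
    """List visible entries of a directory inside a pod (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "exec", pod, "--", "ls", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def check_pod(pod):
    """Run both directory checks against one login pod."""
    return (etc_slurm_ok(list_dir(pod, "/etc/slurm/"))
            and run_conf_ok(list_dir(pod, "/run/slurm/conf/")))
```

Calling `check_pod("slurm-login-X")` with a real pod name would return True only when both directory checks pass. The sinfo and srun checks still need to be run as an LDAP user over SSH.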

Related

  • Linear: TCL-6370
  • Introduced by: commit 910e26f (created Together's custom login-deployment.yaml)
  • Made worse by: commit 57c7bdc (TCL-4402 split-permissions initContainer)

🤖 Generated with Claude Code

The login pod was statically mounting slurm.conf at /etc/slurm/ via the
slurm-config ConfigMap, with an initContainer chmod'ing files for mixed
permissions. This intercepted srun's config lookup at /etc/slurm/slurm.conf
and prevented the configless fallback to /run/slurm/conf/slurm.conf.

The static mount also did not include plugstack.conf (which lives in the
separate slurm-config-extra ConfigMap), so srun could not find SPANK plugin
config -> Pyxis did not load -> --container-image options failed silently.

Revert to upstream Slinky's fully-configless login pod design:
- /etc/slurm/ contains only slurm.key (the sackd auth key).
- sackd fetches slurm.conf, plugstack.conf, etc. via --conf-server into
  /run/slurm/conf/ at default Slurm mode 0644.
- srun's lookup falls through /etc/slurm/slurm.conf (missing) to
  /run/slurm/conf/slurm.conf, which is readable by LDAP users and has
  plugstack.conf in the same dir -> Pyxis loads automatically.

This also eliminates the original LDAP permission problem that the
initContainer chmod dance was solving, because the problematic
mixed-permission static mount no longer exists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>