Commit dac6cd9
feat(gpu): disable NFD/GFD and remove nodeAffinity from device plugin chart (#497)
Disables GPU Feature Discovery and Node Feature Discovery DaemonSets and overrides the device plugin's default nodeAffinity to null so it schedules unconditionally on the single-node gateway without requiring NFD/GFD labels. Setting affinity to an empty map ({}) does not override the chart defaults because Helm deep-merges user values with chart defaults. Using null explicitly removes the key, causing the chart template to skip the affinity block entirely.
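The merge behavior described above can be sketched in a few lines. This is a simplified approximation of Helm's value-coalescing rules (not Helm's actual code): maps merge recursively, and a `null` override deletes the key instead of merging.

```python
def helm_merge(defaults, overrides):
    """Approximate Helm's user-values-over-chart-defaults coalescing:
    dicts deep-merge, and an explicit null (None) deletes the key."""
    result = dict(defaults)
    for key, value in overrides.items():
        if value is None:
            result.pop(key, None)  # null removes the key entirely
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = helm_merge(result[key], value)  # deep merge
        else:
            result[key] = value
    return result

# Illustrative stand-in for the chart's default affinity block:
chart_defaults = {"affinity": {"nodeAffinity": "chart-default"}}

# {} deep-merges into the default and changes nothing:
print(helm_merge(chart_defaults, {"affinity": {}}))
# null deletes the key, so the chart template's affinity guard is skipped:
print(helm_merge(chart_defaults, {"affinity": None}))
```

Because `{}` has no keys to merge, the chart's default `nodeAffinity` survives untouched; only `null` removes it.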
1 parent eff88b7 commit dac6cd9

3 files changed: +9 −5 lines
.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -303,6 +303,7 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w
 | mTLS secrets missing | Bootstrap couldn't apply secrets (namespace not ready) | Check deploy logs and verify `openshell` namespace exists (Step 6) |
 | mTLS mismatch after redeploy | PKI rotated but workload not restarted, or rollout failed | Check that all three TLS secrets exist and that the openshell pod restarted after cert rotation (Step 6) |
 | Helm install job failed | Chart values error or dependency issue | `openshell doctor exec -- kubectl -n kube-system logs -l job-name=helm-install-openshell` |
+| NFD/GFD DaemonSets present (`node-feature-discovery`, `gpu-feature-discovery`) | Cluster was deployed before NFD/GFD were disabled (pre-simplify-device-plugin change) | These are harmless but add overhead. Clean up: `openshell doctor exec -- kubectl delete daemonset -n nvidia-device-plugin -l app.kubernetes.io/name=node-feature-discovery` and similarly for GFD. The `nvidia.com/gpu.present` node label is no longer applied; device plugin scheduling no longer requires it. |
 | Architecture mismatch (remote) | Built on arm64, deploying to amd64 | Cross-build the image for the target architecture |
 | Port conflict | Another service on the configured gateway host port (default 8080) | Stop conflicting service or use `--port` on `openshell gateway start` to pick a different host port |
 | gRPC connect refused to `127.0.0.1:443` in CI | Docker daemon is remote (`DOCKER_HOST=tcp://...`) but metadata still points to loopback | Verify metadata endpoint host matches `DOCKER_HOST` and includes non-loopback host |
```

architecture/gateway-single-node.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -300,7 +300,7 @@ GPU support is part of the single-node gateway bootstrap path rather than a sepa
 - When enabled, the cluster container is created with Docker `DeviceRequests`, which is the API equivalent of `docker run --gpus all`.
 - `deploy/docker/Dockerfile.images` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image.
 - `deploy/docker/cluster-entrypoint.sh` checks `GPU_ENABLED=true` and copies GPU-only manifests from `/opt/openshell/gpu-manifests/` into k3s's manifests directory.
-- `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`, along with GPU Feature Discovery and Node Feature Discovery.
+- `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`. NFD and GFD are disabled; the device plugin's default `nodeAffinity` (which requires `feature.node.kubernetes.io/pci-10de.present=true` or `nvidia.com/gpu.present=true` from NFD/GFD) is overridden to empty so the DaemonSet schedules on the single-node cluster without requiring those labels.
 - k3s auto-detects `nvidia-container-runtime` on `PATH`, registers the `nvidia` containerd runtime, and creates the `nvidia` `RuntimeClass` automatically.
 - The OpenShell Helm chart grants the gateway service account cluster-scoped read access to `node.k8s.io/runtimeclasses` and core `nodes` so GPU sandbox admission can verify both the `nvidia` `RuntimeClass` and allocatable GPU capacity before creating a sandbox.
```
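The `DeviceRequests` equivalence mentioned in the first bullet can be sketched as the JSON payload `docker run --gpus all` produces at the Docker Engine API level. This is an illustrative sketch of the documented request shape, not OpenShell's actual code:

```python
def gpus_all_device_request():
    """Build the DeviceRequest that `--gpus all` maps to: a single request
    asking the "gpu" capability for all devices (Count = -1)."""
    return {
        "Driver": "",               # empty lets the daemon choose (nvidia via the toolkit)
        "Count": -1,                # -1 means "all available GPUs"
        "DeviceIDs": None,          # unused when Count is set
        "Capabilities": [["gpu"]],  # OR-of-AND sets; one set requiring "gpu"
        "Options": {},
    }

req = gpus_all_device_request()
print(req["Count"])  # → -1
```

A container created with this entry in `HostConfig.DeviceRequests` gets the same GPU access as one started with `--gpus all` on the CLI.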

deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml

Lines changed: 7 additions & 4 deletions
```diff
@@ -7,8 +7,10 @@
 #
 # The chart installs:
 # - NVIDIA device plugin DaemonSet (advertises nvidia.com/gpu resources)
-# - GPU Feature Discovery (labels nodes with GPU properties)
-# - Node Feature Discovery (dependency for GFD)
+#
+# NFD and GFD are disabled; the device plugin's default nodeAffinity
+# (which requires nvidia.com/gpu.present=true) is overridden to empty
+# so it schedules on any node without requiring NFD/GFD labels.
 #
 # k3s auto-detects nvidia-container-runtime on PATH and registers the "nvidia"
 # RuntimeClass automatically, so no manual RuntimeClass manifest is needed.
@@ -27,6 +29,7 @@ spec:
   valuesContent: |-
     runtimeClassName: nvidia
     gfd:
-      enabled: true
+      enabled: false
     nfd:
-      enabled: true
+      enabled: false
+    affinity: null
```
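After this change, the values passed to the chart via `valuesContent` amount to the following (reconstructed from the hunk above; indentation within the `|-` block is illustrative):

```yaml
runtimeClassName: nvidia
gfd:
  enabled: false    # GPU Feature Discovery off
nfd:
  enabled: false    # Node Feature Discovery off
affinity: null      # null (not {}) so Helm drops the chart's default nodeAffinity
```

With `affinity` removed, the device plugin DaemonSet has no node-label requirement and schedules unconditionally on the single-node cluster.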
