Commit dac6cd9
feat(gpu): disable NFD/GFD and remove nodeAffinity from device plugin chart (#497)
Disables GPU Feature Discovery and Node Feature Discovery DaemonSets and overrides the device plugin's default nodeAffinity to null so it schedules unconditionally on the single-node gateway without requiring NFD/GFD labels. Setting affinity to an empty map ({}) does not override the chart defaults because Helm deep-merges user values with chart defaults. Using null explicitly removes the key, causing the chart template to skip the affinity block entirely.
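The merge behavior described above can be sketched in a few lines. This is a simplified approximation of Helm's value-coalescing rules (not Helm's actual code): maps merge recursively, and a `null` override deletes the key instead of merging.

```python
def helm_merge(defaults, overrides):
    """Approximate Helm's user-values-over-chart-defaults coalescing:
    dicts deep-merge, and an explicit null (None) deletes the key."""
    result = dict(defaults)
    for key, value in overrides.items():
        if value is None:
            result.pop(key, None)  # null removes the key entirely
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = helm_merge(result[key], value)  # deep merge
        else:
            result[key] = value
    return result

# Illustrative stand-in for the chart's default affinity block:
chart_defaults = {"affinity": {"nodeAffinity": "chart-default"}}

# {} deep-merges into the default and changes nothing:
print(helm_merge(chart_defaults, {"affinity": {}}))
# null deletes the key, so the chart template's affinity guard is skipped:
print(helm_merge(chart_defaults, {"affinity": None}))
```

Because `{}` has no keys to merge, the chart's default `nodeAffinity` survives untouched; only `null` removes it.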
1 parent eff88b7 commit dac6cd9

3 files changed: +9 −5 lines
.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -303,6 +303,7 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w
 | mTLS secrets missing | Bootstrap couldn't apply secrets (namespace not ready) | Check deploy logs and verify `openshell` namespace exists (Step 6) |
 | mTLS mismatch after redeploy | PKI rotated but workload not restarted, or rollout failed | Check that all three TLS secrets exist and that the openshell pod restarted after cert rotation (Step 6) |
 | Helm install job failed | Chart values error or dependency issue | `openshell doctor exec -- kubectl -n kube-system logs -l job-name=helm-install-openshell` |
+| NFD/GFD DaemonSets present (`node-feature-discovery`, `gpu-feature-discovery`) | Cluster was deployed before NFD/GFD were disabled (pre-simplify-device-plugin change) | These are harmless but add overhead. Clean up: `openshell doctor exec -- kubectl delete daemonset -n nvidia-device-plugin -l app.kubernetes.io/name=node-feature-discovery` and similarly for GFD. The `nvidia.com/gpu.present` node label is no longer applied; device plugin scheduling no longer requires it. |
 | Architecture mismatch (remote) | Built on arm64, deploying to amd64 | Cross-build the image for the target architecture |
 | Port conflict | Another service on the configured gateway host port (default 8080) | Stop conflicting service or use `--port` on `openshell gateway start` to pick a different host port |
 | gRPC connect refused to `127.0.0.1:443` in CI | Docker daemon is remote (`DOCKER_HOST=tcp://...`) but metadata still points to loopback | Verify metadata endpoint host matches `DOCKER_HOST` and includes non-loopback host |
```

architecture/gateway-single-node.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -300,7 +300,7 @@ GPU support is part of the single-node gateway bootstrap path rather than a sepa
 - When enabled, the cluster container is created with Docker `DeviceRequests`, which is the API equivalent of `docker run --gpus all`.
 - `deploy/docker/Dockerfile.images` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image.
 - `deploy/docker/cluster-entrypoint.sh` checks `GPU_ENABLED=true` and copies GPU-only manifests from `/opt/openshell/gpu-manifests/` into k3s's manifests directory.
-- `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`, along with GPU Feature Discovery and Node Feature Discovery.
+- `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`. NFD and GFD are disabled; the device plugin's default `nodeAffinity` (which requires `feature.node.kubernetes.io/pci-10de.present=true` or `nvidia.com/gpu.present=true` from NFD/GFD) is overridden to empty so the DaemonSet schedules on the single-node cluster without requiring those labels.
 - k3s auto-detects `nvidia-container-runtime` on `PATH`, registers the `nvidia` containerd runtime, and creates the `nvidia` `RuntimeClass` automatically.
 - The OpenShell Helm chart grants the gateway service account cluster-scoped read access to `node.k8s.io/runtimeclasses` and core `nodes` so GPU sandbox admission can verify both the `nvidia` `RuntimeClass` and allocatable GPU capacity before creating a sandbox.
```
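The `DeviceRequests` equivalence mentioned in the first bullet can be sketched as the JSON payload `docker run --gpus all` produces at the Docker Engine API level. This is an illustrative sketch of the documented request shape, not OpenShell's actual code:

```python
def gpus_all_device_request():
    """Build the DeviceRequest that `--gpus all` maps to: a single request
    asking the "gpu" capability for all devices (Count = -1)."""
    return {
        "Driver": "",               # empty lets the daemon choose (nvidia via the toolkit)
        "Count": -1,                # -1 means "all available GPUs"
        "DeviceIDs": None,          # unused when Count is set
        "Capabilities": [["gpu"]],  # OR-of-AND sets; one set requiring "gpu"
        "Options": {},
    }

req = gpus_all_device_request()
print(req["Count"])  # → -1
```

A container created with this entry in `HostConfig.DeviceRequests` gets the same GPU access as one started with `--gpus all` on the CLI.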

deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml

Lines changed: 7 additions & 4 deletions
```diff
@@ -7,8 +7,10 @@
 #
 # The chart installs:
 # - NVIDIA device plugin DaemonSet (advertises nvidia.com/gpu resources)
-# - GPU Feature Discovery (labels nodes with GPU properties)
-# - Node Feature Discovery (dependency for GFD)
+#
+# NFD and GFD are disabled; the device plugin's default nodeAffinity
+# (which requires nvidia.com/gpu.present=true) is overridden to empty
+# so it schedules on any node without requiring NFD/GFD labels.
 #
 # k3s auto-detects nvidia-container-runtime on PATH and registers the "nvidia"
 # RuntimeClass automatically, so no manual RuntimeClass manifest is needed.
@@ -27,6 +29,7 @@ spec:
   valuesContent: |-
     runtimeClassName: nvidia
     gfd:
-      enabled: true
+      enabled: false
     nfd:
-      enabled: true
+      enabled: false
+    affinity: null
```
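After this change, the values passed to the chart via `valuesContent` amount to the following (reconstructed from the hunk above; indentation within the `|-` block is illustrative):

```yaml
runtimeClassName: nvidia
gfd:
  enabled: false    # GPU Feature Discovery off
nfd:
  enabled: false    # Node Feature Discovery off
affinity: null      # null (not {}) so Helm drops the chart's default nodeAffinity
```

With `affinity` removed, the device plugin DaemonSet has no node-label requirement and schedules unconditionally on the single-node cluster.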
