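The deep_health_check_passed_nodes_only option generates an incorrect nodeSelector in the Kubernetes job manifest. Because the rendered label does not match the label actually set on the nodes, the pods cannot be scheduled and the job stays Pending indefinitely.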
The Jinja template currently renders:
nodeSelector:
  deep-health-check-passed: "true"
But the actual label on HyperPod EKS nodes is:
nodeSelector:
  sagemaker.amazonaws.com/deep-health-check-status: "Passed"
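To illustrate why the job never leaves Pending, here is a minimal sketch of nodeSelector matching: a pod can only be placed on a node whose labels contain every key/value pair in the selector. The helper name selector_matches and the extra kubernetes.io/os label are illustrative assumptions; only the two selectors and the deep-health-check label come from this report. This is not the CLI's or the scheduler's actual code.

```python
def selector_matches(node_selector: dict, node_labels: dict) -> bool:
    """Hypothetical helper: True if every selector pair is present on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Labels on a healthy node. The deep-health-check label is the one reported
# above; the second label is only there to make the example realistic.
node_labels = {
    "sagemaker.amazonaws.com/deep-health-check-status": "Passed",
    "kubernetes.io/os": "linux",
}

# Selector the template currently renders (per this report).
rendered_selector = {"deep-health-check-passed": "true"}

# Selector that matches the label actually present on the nodes.
expected_selector = {"sagemaker.amazonaws.com/deep-health-check-status": "Passed"}

print(selector_matches(rendered_selector, node_labels))  # False
print(selector_matches(expected_selector, node_labels))  # True
```

With the rendered selector, no node qualifies, which is consistent with the job staying Pending. Rendering the second selector instead should let the pods schedule on nodes that passed the deep health check.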
CLI version: v3.7.0