fix(cli): make cluster-running probe authoritative via kubectl, not file presence#537
Open
bussyjd wants to merge 1 commit into
Open
fix(cli): make cluster-running probe authoritative via kubectl, not file presence#537bussyjd wants to merge 1 commit into
bussyjd wants to merge 1 commit into
Conversation
…ocker labels The previous EnsureCluster only stat'd kubeconfig.yaml on disk. After `k3d cluster stop && k3d cluster start` the kubeconfig still exists but the k3d API port can drift (pitfall #1), so the next kubectl exec gets "connection refused" and wrapClusterDown wrongly tells the user the cluster is stopped — even though `kubectl get pods -A` against a refreshed kubeconfig succeeds. Replace the file-presence check with an active probe of the K8s API server (`kubectl version --request-timeout=3s`). On a cluster-down signature, attempt one best-effort `k3d kubeconfig write --overwrite` and re-probe before giving up. Non-cluster-down probe failures (e.g. missing kubectl binary) pass through verbatim instead of being masked by the misleading "cluster appears to be stopped" hint. Probe and refresh are swappable via package-level vars so the recovery branches are fully unit-testable without a live cluster.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Repro
k3d cluster stop obol-stack-<name>k3d cluster start obol-stack-<name>kubectl get pods -A(against$OBOL_CONFIG_DIR/kubeconfig.yaml) succeedsobol sell http <name> --upstream foo --port 80 --namespace default --per-request 0.001 --chain base-sepolia --wallet 0x...fails with:Same false-positive surfaces on every
obol sell *,obol network *,obol model *,obol agent *, and most other gated subcommands (every caller ofkubectl.EnsureCluster).What the old probe checked + why it lied
internal/kubectl/kubectl.EnsureClusteronly did:i.e. it asserted "cluster is up" purely from the kubeconfig file existing. The actual
cluster appears to be stoppedmessage comes fromwrapClusterDown, which seesconnection refusedon the first kubectl exec the subcommand attempts (e.g.kubectl applyfor theServiceOffer).After
k3d cluster stop && k3d cluster startthe kubeconfig file is still on disk but its embeddedserver: https://0.0.0.0:<port>line points at the previous k3d API port. New port → connection refused → falsecluster appears to be stopped. This is pitfall #1 in the project CLAUDE.md ("Kubeconfig port drift — k3d API port can change between restarts").What the new probe checks + why it's authoritative
EnsureClusternow:kubectl version --request-timeout=3s -o jsonagainst$OBOL_CONFIG_DIR/kubeconfig.yaml. The same path every downstream kubectl exec will take — if this succeeds, every other call against the same kubeconfig will too.connection refused/no route to host/Unable to connect to the serverfailure (the existingwrapClusterDownsignature), runs one best-effortk3d kubeconfig write <cluster> -o <kubeconfig> --overwriteto recover from the port drift case, then re-probes.ErrClusterDown.kubectlbinary) pass through verbatim instead of being masked as "cluster appears to be stopped".The refresh helper declines silently if prerequisites are missing (no
k3dbinary, no.stack-id, or.stack-backendis set to a non-k3d backend likek3s) — so the change is safe for the k3s backend and for early-init states.The probe and refresh are swappable via two package-level vars (
probeAPIServerFn,refreshKubeconfigFn), so the recovery branches are fully unit-tested without a live cluster.Test plan
go build ./...cleango test ./internal/kubectl/... -count=1green (new table covers: probe success, port-drift recovery, refresh skipped, refresh ran but probe still failing, non-cluster-down passthrough, refresh prerequisite checks)go test ./cmd/obol/... -count=1green (no existing test regressions; tests that seed a stub kubeconfig do not hitEnsureClusterpaths that depended on the old no-op behavior)go test ./internal/... -count=1green (one pre-existing failure ininternal/stack/TestWarnIfNoChatModel_EmitsWarnWhenNoModelsreproduces onmainunmodified — unrelated to this change)Manual repro test
Live verification on a real cluster (recommend executing before merge):
Expected: succeeds after the kubeconfig auto-refresh, no
cluster appears to be stoppedmessage. (Old behavior: false positive.)To confirm the kubeconfig was actually refreshed, diff
$OBOL_CONFIG_DIR/kubeconfig.yamlbefore/after the failing sequence — theserver:URL port will have updated.