
fix(cli): make cluster-running probe authoritative via kubectl, not file presence #537

Open
bussyjd wants to merge 1 commit into main from fix/cli-cluster-running-probe


@bussyjd bussyjd commented May 24, 2026

Repro

  1. k3d cluster stop obol-stack-<name>
  2. k3d cluster start obol-stack-<name>
  3. kubectl get pods -A (against $OBOL_CONFIG_DIR/kubeconfig.yaml) succeeds
  4. obol sell http <name> --upstream foo --port 80 --namespace default --per-request 0.001 --chain base-sepolia --wallet 0x... fails with:
✗ cluster appears to be stopped — run 'obol stack up' before creating an HTTP service offer

The same false positive surfaces on every obol sell *, obol network *, obol model *, obol agent *, and most other gated subcommands (every caller of kubectl.EnsureCluster).

What the old probe checked + why it lied

internal/kubectl/kubectl.EnsureCluster only did:

if _, err := os.Stat(filepath.Join(cfg.ConfigDir, "kubeconfig.yaml")); os.IsNotExist(err) {
    return errors.New("cluster not running. Run 'obol stack up' first")
}
return nil

i.e. it asserted "cluster is up" purely from the kubeconfig file existing. The actual "cluster appears to be stopped" message comes from wrapClusterDown, which sees "connection refused" on the first kubectl exec the subcommand attempts (e.g. kubectl apply for the ServiceOffer).

After k3d cluster stop && k3d cluster start the kubeconfig file is still on disk but its embedded server: https://0.0.0.0:<port> line points at the previous k3d API port. New port → connection refused → false cluster appears to be stopped. This is pitfall #1 in the project CLAUDE.md ("Kubeconfig port drift — k3d API port can change between restarts").

What the new probe checks + why it's authoritative

EnsureCluster now:

  1. Confirms the kubeconfig file exists (unchanged early-out for never-up clusters).
  2. Actively probes the Kubernetes API server via kubectl version --request-timeout=3s -o json against $OBOL_CONFIG_DIR/kubeconfig.yaml. This is the same path every downstream kubectl exec takes, so a successful probe means later calls against the same kubeconfig can reach the API server too.
  3. On a connection refused / no route to host / Unable to connect to the server failure (the existing wrapClusterDown signature), runs one best-effort k3d kubeconfig write <cluster> -o <kubeconfig> --overwrite to recover from the port drift case, then re-probes.
  4. Only if the post-refresh probe still fails does it return ErrClusterDown.
  5. Non-cluster-down failures (e.g. missing kubectl binary) pass through verbatim instead of being masked as "cluster appears to be stopped".

The refresh helper declines silently if prerequisites are missing (no k3d binary, no .stack-id, or .stack-backend is set to a non-k3d backend like k3s) — so the change is safe for the k3s backend and for early-init states.

The probe and refresh are swappable via two package-level vars (probeAPIServerFn, refreshKubeconfigFn), so the recovery branches are fully unit-tested without a live cluster.

Test plan

  • go build ./... clean
  • go test ./internal/kubectl/... -count=1 green (new table covers: probe success, port-drift recovery, refresh skipped, refresh ran but probe still failing, non-cluster-down passthrough, refresh prerequisite checks)
  • go test ./cmd/obol/... -count=1 green (no existing test regressions; tests that seed a stub kubeconfig do not hit EnsureCluster paths that depended on the old no-op behavior)
  • go test ./internal/... -count=1 green (one pre-existing failure in internal/stack/TestWarnIfNoChatModel_EmitsWarnWhenNoModels reproduces on main unmodified — unrelated to this change)

Manual repro test

Live verification on a real cluster (recommend executing before merge):

obol stack up
k3d cluster stop obol-stack-<id>
k3d cluster start obol-stack-<id>
obol sell http demo --upstream ollama --port 11434 --namespace llm \
  --per-request 0.001 --chain base-sepolia --wallet 0xYOUR_WALLET

Expected: succeeds after the kubeconfig auto-refresh, with no "cluster appears to be stopped" message. (Old behavior: false positive.)

To confirm the kubeconfig was actually refreshed, diff $OBOL_CONFIG_DIR/kubeconfig.yaml before/after the failing sequence — the server: URL port will have updated.

…ocker labels

The previous EnsureCluster only stat'd kubeconfig.yaml on disk. After
`k3d cluster stop && k3d cluster start` the kubeconfig still exists but
the k3d API port can drift (pitfall #1), so the next kubectl exec gets
"connection refused" and wrapClusterDown wrongly tells the user the
cluster is stopped — even though `kubectl get pods -A` against a
refreshed kubeconfig succeeds.

Replace the file-presence check with an active probe of the K8s API
server (`kubectl version --request-timeout=3s`). On a cluster-down
signature, attempt one best-effort `k3d kubeconfig write --overwrite`
and re-probe before giving up. Non-cluster-down probe failures (e.g.
missing kubectl binary) pass through verbatim instead of being masked
by the misleading "cluster appears to be stopped" hint.

Probe and refresh are swappable via package-level vars so the recovery
branches are fully unit-testable without a live cluster.