fix(cli): make cluster-running probe authoritative via kubectl, not file presence by bussyjd · Pull Request #537 · ObolNetwork/obol-stack

bussyjd · 2026-05-24T10:12:35Z

Repro

k3d cluster stop obol-stack-<name>
k3d cluster start obol-stack-<name>
kubectl get pods -A (against $OBOL_CONFIG_DIR/kubeconfig.yaml) succeeds
obol sell http <name> --upstream foo --port 80 --namespace default --per-request 0.001 --chain base-sepolia --wallet 0x... fails with:

✗ cluster appears to be stopped — run 'obol stack up' before creating an HTTP service offer

The same false positive surfaces on every obol sell *, obol network *, obol model *, and obol agent * invocation, and on most other gated subcommands (every caller of kubectl.EnsureCluster).

What the old probe checked + why it lied

internal/kubectl/kubectl.EnsureCluster only did:

if _, err := os.Stat(filepath.Join(cfg.ConfigDir, "kubeconfig.yaml")); os.IsNotExist(err) {
    return errors.New("cluster not running. Run 'obol stack up' first")
}
return nil

That is, it asserted "cluster is up" purely from the kubeconfig file existing. The "cluster appears to be stopped" message actually comes from wrapClusterDown, which sees "connection refused" on the first kubectl exec the subcommand attempts (e.g. kubectl apply for the ServiceOffer).

After k3d cluster stop && k3d cluster start the kubeconfig file is still on disk but its embedded server: https://0.0.0.0:<port> line points at the previous k3d API port. New port → connection refused → false cluster appears to be stopped. This is pitfall #1 in the project CLAUDE.md ("Kubeconfig port drift — k3d API port can change between restarts").

What the new probe checks + why it's authoritative

EnsureCluster now:

Confirms the kubeconfig file exists (unchanged early-out for never-up clusters).
Actively probes the Kubernetes API server via kubectl version --request-timeout=3s -o json against $OBOL_CONFIG_DIR/kubeconfig.yaml. This is the same path every downstream kubectl exec takes, so a successful probe means later calls against the same kubeconfig can reach the API server too.
On a connection refused / no route to host / Unable to connect to the server failure (the existing wrapClusterDown signature), runs one best-effort k3d kubeconfig write <cluster> -o <kubeconfig> --overwrite to recover from the port drift case, then re-probes.
Only if the post-refresh probe still fails does it return ErrClusterDown.
Non-cluster-down failures (e.g. missing kubectl binary) pass through verbatim instead of being masked as "cluster appears to be stopped".

The refresh helper declines silently if prerequisites are missing (no k3d binary, no .stack-id, or .stack-backend is set to a non-k3d backend like k3s) — so the change is safe for the k3s backend and for early-init states.
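The prerequisite checks described above could be sketched roughly as follows. This is an illustration rather than the code in this PR: the function names (shouldRefresh, refreshPrereqsMet, readTrimmed) are hypothetical, the location of .stack-id and .stack-backend directly under the config directory is assumed, and treating a missing or empty .stack-backend as k3d is also an assumption.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// shouldRefresh is the pure decision: refresh only when the k3d binary is
// available, a stack ID is known, and the backend is k3d.
// Assumption: an empty backend value is treated as k3d.
func shouldRefresh(hasK3d bool, stackID, backend string) bool {
	if !hasK3d || strings.TrimSpace(stackID) == "" {
		return false
	}
	b := strings.TrimSpace(backend)
	return b == "" || b == "k3d"
}

// readTrimmed returns the trimmed file contents, or "" if the file cannot be read.
func readTrimmed(path string) string {
	data, err := os.ReadFile(path)
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(data))
}

// refreshPrereqsMet gathers the inputs from the environment and the config
// directory. The file layout is assumed, not taken from the PR.
func refreshPrereqsMet(configDir string) bool {
	_, err := exec.LookPath("k3d")
	return shouldRefresh(
		err == nil,
		readTrimmed(filepath.Join(configDir, ".stack-id")),
		readTrimmed(filepath.Join(configDir, ".stack-backend")),
	)
}

func main() {
	fmt.Println("k3d backend:", shouldRefresh(true, "demo", "k3d"))
	fmt.Println("k3s backend:", shouldRefresh(true, "demo", "k3s"))
	fmt.Println("no stack id:", shouldRefresh(true, "", "k3d"))
	// Result depends on the local machine (k3d on PATH, files present).
	fmt.Println("temp dir:", refreshPrereqsMet(os.TempDir()))
}
```

Keeping the decision in a pure function separates it from the filesystem and PATH lookups, so each declined case can be checked without a k3d install.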

The probe and refresh are swappable via two package-level vars (probeAPIServerFn, refreshKubeconfigFn), so the recovery branches are fully unit-tested without a live cluster.
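As a rough sketch of the probe, refresh, re-probe flow and of how the two hooks make it testable, consider the following. Only the names probeAPIServerFn, refreshKubeconfigFn, and ErrClusterDown come from this description; their signatures, the helpers isClusterDown, ensureCluster, and stubHooks, and the sample error values are assumptions made for illustration. The kubeconfig file-existence early-out is omitted to keep the example free of filesystem access.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrClusterDown stands in for the sentinel named in the PR text.
var ErrClusterDown = errors.New("cluster appears to be stopped")

// Made-up sample failures used by the demo below.
var (
	errRefused   = errors.New("dial tcp 0.0.0.0:6443: connect: connection refused")
	errNoKubectl = errors.New(`exec: "kubectl": executable file not found in $PATH`)
)

// Hook names are from the PR; these signatures are assumed.
var (
	probeAPIServerFn    = func(kubeconfig string) error { return nil }
	refreshKubeconfigFn = func(kubeconfig string) bool { return false }
)

// isClusterDown matches the connection-failure signatures listed above.
func isClusterDown(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	for _, sig := range []string{
		"connection refused",
		"no route to host",
		"unable to connect to the server",
	} {
		if strings.Contains(msg, sig) {
			return true
		}
	}
	return false
}

// ensureCluster probes once, attempts a single refresh on a cluster-down
// signature, re-probes, and only then reports ErrClusterDown.
func ensureCluster(kubeconfig string) error {
	err := probeAPIServerFn(kubeconfig)
	if err == nil || !isClusterDown(err) {
		return err // success, or a non-cluster-down failure passed through verbatim
	}
	if refreshKubeconfigFn(kubeconfig) && probeAPIServerFn(kubeconfig) == nil {
		return nil // recovered, e.g. from port drift
	}
	return fmt.Errorf("%w: %v", ErrClusterDown, err)
}

// stubHooks swaps both hooks and returns a function that restores them.
func stubHooks(probe func(string) error, refresh func(string) bool) func() {
	oldProbe, oldRefresh := probeAPIServerFn, refreshKubeconfigFn
	probeAPIServerFn, refreshKubeconfigFn = probe, refresh
	return func() { probeAPIServerFn, refreshKubeconfigFn = oldProbe, oldRefresh }
}

func main() {
	// Port drift: the first probe is refused, the refresh runs, the second probe succeeds.
	calls := 0
	restore := stubHooks(func(string) error {
		calls++
		if calls == 1 {
			return errRefused
		}
		return nil
	}, func(string) bool { return true })
	fmt.Println("port drift:", ensureCluster("kubeconfig.yaml"))
	restore()

	// A missing kubectl binary is not a cluster-down signature, so it passes through.
	restore = stubHooks(func(string) error { return errNoKubectl }, func(string) bool { return true })
	fmt.Println("passthrough:", ensureCluster("kubeconfig.yaml"))
	restore()
}
```

In a real test file the same swap would typically be paired with t.Cleanup so the original hooks are restored even when a case fails.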

Test plan

go build ./... clean
go test ./internal/kubectl/... -count=1 green (new table covers: probe success, port-drift recovery, refresh skipped, refresh ran but probe still failing, non-cluster-down passthrough, refresh prerequisite checks)
go test ./cmd/obol/... -count=1 green (no existing test regressions; tests that seed a stub kubeconfig do not hit EnsureCluster paths that depended on the old no-op behavior)
go test ./internal/... -count=1 green (one pre-existing failure in internal/stack/TestWarnIfNoChatModel_EmitsWarnWhenNoModels reproduces on main unmodified — unrelated to this change)

Manual repro test

Live verification on a real cluster (recommend executing before merge):

obol stack up
k3d cluster stop obol-stack-<id>
k3d cluster start obol-stack-<id>
obol sell http demo --upstream ollama --port 11434 --namespace llm \
  --per-request 0.001 --chain base-sepolia --wallet 0xYOUR_WALLET

Expected: succeeds after the kubeconfig auto-refresh, no cluster appears to be stopped message. (Old behavior: false positive.)

To confirm the kubeconfig was actually refreshed, diff $OBOL_CONFIG_DIR/kubeconfig.yaml before/after the failing sequence — the server: URL port will have updated.
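If a plain diff is too noisy, the port comparison could be done with a small helper along these lines. This is a minimal sketch: serverPort is a hypothetical function, it scans for the first server: line instead of parsing YAML properly, and the two port numbers in the demo are invented.

```go
package main

import (
	"bufio"
	"fmt"
	"net/url"
	"strings"
)

// serverPort returns the port of the first "server:" URL found in the
// kubeconfig text. It is a line scan, not a YAML parser.
func serverPort(kubeconfig string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(kubeconfig))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "server:") {
			continue
		}
		raw := strings.Trim(strings.TrimSpace(strings.TrimPrefix(line, "server:")), `"'`)
		u, err := url.Parse(raw)
		if err != nil || u.Port() == "" {
			continue
		}
		return u.Port(), true
	}
	return "", false
}

func main() {
	// Invented before/after snapshots of the relevant kubeconfig line.
	before := "clusters:\n- cluster:\n    server: https://0.0.0.0:55001\n"
	after := "clusters:\n- cluster:\n    server: https://0.0.0.0:55002\n"

	oldPort, _ := serverPort(before)
	newPort, _ := serverPort(after)
	fmt.Printf("port before=%s after=%s changed=%t\n", oldPort, newPort, oldPort != newPort)
}
```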

…ocker labels

The previous EnsureCluster only stat'd kubeconfig.yaml on disk. After `k3d cluster stop && k3d cluster start` the kubeconfig still exists but the k3d API port can drift (pitfall #1), so the next kubectl exec gets "connection refused" and wrapClusterDown wrongly tells the user the cluster is stopped, even though `kubectl get pods -A` against a refreshed kubeconfig succeeds.

Replace the file-presence check with an active probe of the K8s API server (`kubectl version --request-timeout=3s`). On a cluster-down signature, attempt one best-effort `k3d kubeconfig write --overwrite` and re-probe before giving up.

Non-cluster-down probe failures (e.g. missing kubectl binary) pass through verbatim instead of being masked by the misleading "cluster appears to be stopped" hint.

Probe and refresh are swappable via package-level vars so the recovery branches are fully unit-testable without a live cluster.

bussyjd wants to merge 1 commit into main from fix/cli-cluster-running-probe
