fix(stack): tolerate transient helm repo update failures, keep cluster running#538
Open
bussyjd wants to merge 1 commit into
Open
fix(stack): tolerate transient helm repo update failures, keep cluster running#538bussyjd wants to merge 1 commit into
bussyjd wants to merge 1 commit into
Conversation
…r running helm aborts the WHOLE `helm repo update` if any single registered repo fails (e.g. kubernetes-dashboard started returning 404 in 2025). That cascaded: helmfile sync propagated exit 1, syncDefaults called Down() as "cleanup", and `obol stack up` ended up running `k3d cluster delete` — destroying PVCs, agent wallets, and registered services for a transient external-repo flake. Two layers of defense: A. Tolerant repo-update preflight (helmcmd.UpdateRepos): run `helm repo update --fail-on-repo-update-fail=false <managed-repos...>` before helmfile, then pass --skip-deps so helmfile doesn't re-run its own strict update. Only the repos our embedded helmfile.yaml declares are touched; unrelated repos in the user's global helm config are ignored. B. Don't stop the cluster on helmfile failure: remove the Down() call from the error branch in syncDefaults. A failed sync is almost always recoverable in place; tearing down the cluster destroys unrelated state. We surface inspect/retry hints instead. Tests: - helmcmd: ParseHelmfileRepos / UpdateRepos tolerant-flag wiring (fake helm). - stack: preflightHelmRepos success + failure paths against a fake helm that exits 1 on `repo update` (mirrors the real kubernetes-dashboard incident). - stack: source-level regression guard against re-introducing Down(cfg, u) in syncDefaults. Confirmed the guard fires against the pre-fix code.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
obol stack upinvokeshelmfile sync, which callshelm repo update. Helm aborts the entire repo update if any single repo fails — including external repos we don't manage. Whenkubernetes-dashboard's GitHub Pages repo started returning 404 in mid-2025, the cascade was:helm repo update→ exit 1 (one bad repo abort the whole list)helmfile sync→ propagates exit 1obol stack up→ callsDown()as "cleanup"k3d cluster delete→ user loses PVCs, agent wallets, registered services, everythingWorkaround until now was:
helm repo remove kubernetes-dashboard && obol stack up. Losing all cluster state because a tertiary external repo started 404ing is a wildly disproportionate failure mode.Reproducer:
Why the cluster-stop was the wrong call
The old
syncDefaultstreated everyhelmfile syncfailure as fatal enough to warrantDown(). But helmfile failures are almost always recoverable in place — a transient repo flake, a single misconfigured release, a CRD race. The destructive cleanup destroyed state the user cares about (PVCs holding agent wallets, ServiceOffers, ERC-8004 registrations) to "fix" a problem that didn't need a clean slate.Fix strategy (both A and B)
A. Tolerate per-repo update failures. New
helmcmd.UpdateReposrunshelm repo update --fail-on-repo-update-fail=false <managed-repos…>targeting only the repos our embeddedhelmfile.yamldeclares. Unrelated repos in the user's global helm config can't break us. The sync then runs with--skip-depsso helmfile doesn't re-run its own strict update.B. Don't stop the cluster on helmfile failure. Removed the
Down(cfg, u)call fromsyncDefaults. The error path now prints clear inspect/retry hints and leaves the cluster intact so the user can debug without losing state.Both fixes are independent improvements and ship together.
Files changed
internal/helmcmd/helmcmd.go— new helpers:ParseHelmfileRepos,ManagedRepoNames,EnsureRepos,UpdateRepos(passes--fail-on-repo-update-fail=falsewhen helm ≥ 3 detected).internal/helmcmd/helmcmd_test.go— table-driven tests for parsing + fake-helm test for tolerant-flag wiring.internal/stack/stack.go—syncDefaultsnow runspreflightHelmRepos, passes--skip-deps, and on failure prints a retry hint instead of callingDown().internal/stack/stack_test.go—TestPreflightHelmRepos_SuccessAllowsSkipDeps,TestPreflightHelmRepos_FailureFallsBackGracefully(fake helm exits 1 onrepo update, mirroring the real incident),TestSyncDefaults_DoesNotCallDownOnHelmfileFailure(source-level regression guard).internal/stack/backend_k3d.go— comment-only update (the dev-registry comment referenced the now-removedDown()cleanup path).Test plan
go build ./...cleango vet ./internal/stack/... ./internal/helmcmd/...cleanTestPreflightHelmRepos_SuccessAllowsSkipDeps— verifies the happy path passes the tolerant flag and targets only our managed reposTestPreflightHelmRepos_FailureFallsBackGracefully— verifies a failing helm preflight returnsfalse(so helmfile falls back to its own resolution) and does not propagate as a fatal errorTestSyncDefaults_DoesNotCallDownOnHelmfileFailure— source-level regression guard. Confirmed this guard fails against the pre-fix code (Down(cfg, u)present) and passes against the new code.TestParseHelmfileReposBytes— repo extraction is robust to missing fields and OCI-only entries.TestUpdateRepos_TolerantArgsConstructed/TestUpdateRepos_NoNamesIsNoop— argv wiring for the fake helm binary.helm repo add deadrepo https://example.invalid/charts; obol stack upshould now succeed (preflight ignores the dead repo, helmfile sync runs with--skip-deps). To be validated on a real host.Notes
--fail-on-repo-update-fail. obol pins helm 3.20.1 viaobolup.sh, so we always have the flag.UpdateReposstill version-gates the flag for safety if a user has helm overridden globally.TestWarnIfNoChatModel_EmitsWarnWhenNoModelsis failing onmain(string drift: expects "No chat-capable LLM detected", code says "No chat-capable model detected"). Out of scope for this PR.