Skip to content

fix(stack): tolerate transient helm repo update failures, keep cluster running#538

Open
bussyjd wants to merge 1 commit into
mainfrom
fix/tolerant-helm-repo-update
Open

fix(stack): tolerate transient helm repo update failures, keep cluster running#538
bussyjd wants to merge 1 commit into
mainfrom
fix/tolerant-helm-repo-update

Conversation

@bussyjd
Copy link
Copy Markdown
Collaborator

@bussyjd bussyjd commented May 24, 2026

Summary

obol stack up invokes helmfile sync, which calls helm repo update. Helm aborts the entire repo update if any single repo fails — including external repos we don't manage. When kubernetes-dashboard's GitHub Pages repo started returning 404 in mid-2025, the cascade was:

  1. helm repo update → exit 1 (one bad repo abort the whole list)
  2. helmfile sync → propagates exit 1
  3. obol stack up → calls Down() as "cleanup"
  4. k3d cluster delete → user loses PVCs, agent wallets, registered services, everything

Workaround until now was: helm repo remove kubernetes-dashboard && obol stack up. Losing all cluster state because a tertiary external repo started 404ing is a wildly disproportionate failure mode.

Reproducer:

helm repo add deadrepo https://example.invalid/charts
obol stack up
# fails with "failed to update the following repositories: [https://example.invalid/charts]"
# THEN destroys the cluster

Why the cluster-stop was the wrong call

The old syncDefaults treated every helmfile sync failure as fatal enough to warrant Down(). But helmfile failures are almost always recoverable in place — a transient repo flake, a single misconfigured release, a CRD race. The destructive cleanup destroyed state the user cares about (PVCs holding agent wallets, ServiceOffers, ERC-8004 registrations) to "fix" a problem that didn't need a clean slate.

Fix strategy (both A and B)

A. Tolerate per-repo update failures. New helmcmd.UpdateRepos runs helm repo update --fail-on-repo-update-fail=false <managed-repos…> targeting only the repos our embedded helmfile.yaml declares. Unrelated repos in the user's global helm config can't break us. The sync then runs with --skip-deps so helmfile doesn't re-run its own strict update.

B. Don't stop the cluster on helmfile failure. Removed the Down(cfg, u) call from syncDefaults. The error path now prints clear inspect/retry hints and leaves the cluster intact so the user can debug without losing state.

Both fixes are independent improvements and ship together.

Files changed

  • internal/helmcmd/helmcmd.go — new helpers: ParseHelmfileRepos, ManagedRepoNames, EnsureRepos, UpdateRepos (passes --fail-on-repo-update-fail=false when helm ≥ 3 detected).
  • internal/helmcmd/helmcmd_test.go — table-driven tests for parsing + fake-helm test for tolerant-flag wiring.
  • internal/stack/stack.gosyncDefaults now runs preflightHelmRepos, passes --skip-deps, and on failure prints a retry hint instead of calling Down().
  • internal/stack/stack_test.goTestPreflightHelmRepos_SuccessAllowsSkipDeps, TestPreflightHelmRepos_FailureFallsBackGracefully (fake helm exits 1 on repo update, mirroring the real incident), TestSyncDefaults_DoesNotCallDownOnHelmfileFailure (source-level regression guard).
  • internal/stack/backend_k3d.go — comment-only update (the dev-registry comment referenced the now-removed Down() cleanup path).

Test plan

  • go build ./... clean
  • go vet ./internal/stack/... ./internal/helmcmd/... clean
  • TestPreflightHelmRepos_SuccessAllowsSkipDeps — verifies the happy path passes the tolerant flag and targets only our managed repos
  • TestPreflightHelmRepos_FailureFallsBackGracefully — verifies a failing helm preflight returns false (so helmfile falls back to its own resolution) and does not propagate as a fatal error
  • TestSyncDefaults_DoesNotCallDownOnHelmfileFailure — source-level regression guard. Confirmed this guard fails against the pre-fix code (Down(cfg, u) present) and passes against the new code.
  • TestParseHelmfileReposBytes — repo extraction is robust to missing fields and OCI-only entries.
  • TestUpdateRepos_TolerantArgsConstructed / TestUpdateRepos_NoNamesIsNoop — argv wiring for the fake helm binary.
  • Manual: helm repo add deadrepo https://example.invalid/charts; obol stack up should now succeed (preflight ignores the dead repo, helmfile sync runs with --skip-deps). To be validated on a real host.
  • Manual: force a real helmfile sync failure (e.g. invalid chart values) and confirm the cluster stays up and the retry hint is printed.

Notes

  • Helm 3.14+ supports --fail-on-repo-update-fail. obol pins helm 3.20.1 via obolup.sh, so we always have the flag. UpdateRepos still version-gates the flag for safety if a user has helm overridden globally.
  • Pre-existing test TestWarnIfNoChatModel_EmitsWarnWhenNoModels is failing on main (string drift: expects "No chat-capable LLM detected", code says "No chat-capable model detected"). Out of scope for this PR.

…r running

helm aborts the WHOLE `helm repo update` if any single registered repo fails
(e.g. kubernetes-dashboard started returning 404 in 2025). That cascaded:
helmfile sync propagated exit 1, syncDefaults called Down() as "cleanup",
and `obol stack up` ended up running `k3d cluster delete` — destroying PVCs,
agent wallets, and registered services for a transient external-repo flake.

Two layers of defense:

A. Tolerant repo-update preflight (helmcmd.UpdateRepos): run
   `helm repo update --fail-on-repo-update-fail=false <managed-repos...>`
   before helmfile, then pass --skip-deps so helmfile doesn't re-run its own
   strict update. Only the repos our embedded helmfile.yaml declares are
   touched; unrelated repos in the user's global helm config are ignored.

B. Don't stop the cluster on helmfile failure: remove the Down() call from
   the error branch in syncDefaults. A failed sync is almost always
   recoverable in place; tearing down the cluster destroys unrelated state.
   We surface inspect/retry hints instead.

Tests:
- helmcmd: ParseHelmfileRepos / UpdateRepos tolerant-flag wiring (fake helm).
- stack: preflightHelmRepos success + failure paths against a fake helm that
  exits 1 on `repo update` (mirrors the real kubernetes-dashboard incident).
- stack: source-level regression guard against re-introducing Down(cfg, u)
  in syncDefaults. Confirmed the guard fires against the pre-fix code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant