fix(stack): tolerate transient helm repo update failures, keep cluster running by bussyjd · Pull Request #538 · ObolNetwork/obol-stack

bussyjd · 2026-05-24T10:14:53Z

Summary

obol stack up invokes helmfile sync, which calls helm repo update. Helm aborts the entire repo update if any single repo fails — including external repos we don't manage. When kubernetes-dashboard's GitHub Pages repo started returning 404 in mid-2025, the cascade was:

helm repo update → exit 1 (one bad repo abort the whole list)
helmfile sync → propagates exit 1
obol stack up → calls Down() as "cleanup"
k3d cluster delete → user loses PVCs, agent wallets, registered services, everything

Workaround until now was: helm repo remove kubernetes-dashboard && obol stack up. Losing all cluster state because a tertiary external repo started 404ing is a wildly disproportionate failure mode.

Reproducer:

helm repo add deadrepo https://example.invalid/charts
obol stack up
# fails with "failed to update the following repositories: [https://example.invalid/charts]"
# THEN destroys the cluster

Why the cluster-stop was the wrong call

The old syncDefaults treated every helmfile sync failure as fatal enough to warrant Down(). But helmfile failures are almost always recoverable in place — a transient repo flake, a single misconfigured release, a CRD race. The destructive cleanup destroyed state the user cares about (PVCs holding agent wallets, ServiceOffers, ERC-8004 registrations) to "fix" a problem that didn't need a clean slate.

Fix strategy (both A and B)

A. Tolerate per-repo update failures. New helmcmd.UpdateRepos runs helm repo update --fail-on-repo-update-fail=false <managed-repos…> targeting only the repos our embedded helmfile.yaml declares. Unrelated repos in the user's global helm config can't break us. The sync then runs with --skip-deps so helmfile doesn't re-run its own strict update.

B. Don't stop the cluster on helmfile failure. Removed the Down(cfg, u) call from syncDefaults. The error path now prints clear inspect/retry hints and leaves the cluster intact so the user can debug without losing state.

Both fixes are independent improvements and ship together.

Files changed

internal/helmcmd/helmcmd.go — new helpers: ParseHelmfileRepos, ManagedRepoNames, EnsureRepos, UpdateRepos (passes --fail-on-repo-update-fail=false when helm ≥ 3 detected).
internal/helmcmd/helmcmd_test.go — table-driven tests for parsing + fake-helm test for tolerant-flag wiring.
internal/stack/stack.go — syncDefaults now runs preflightHelmRepos, passes --skip-deps, and on failure prints a retry hint instead of calling Down().
internal/stack/stack_test.go — TestPreflightHelmRepos_SuccessAllowsSkipDeps, TestPreflightHelmRepos_FailureFallsBackGracefully (fake helm exits 1 on repo update, mirroring the real incident), TestSyncDefaults_DoesNotCallDownOnHelmfileFailure (source-level regression guard).
internal/stack/backend_k3d.go — comment-only update (the dev-registry comment referenced the now-removed Down() cleanup path).

Test plan

go build ./... clean
go vet ./internal/stack/... ./internal/helmcmd/... clean
TestPreflightHelmRepos_SuccessAllowsSkipDeps — verifies the happy path passes the tolerant flag and targets only our managed repos
TestPreflightHelmRepos_FailureFallsBackGracefully — verifies a failing helm preflight returns false (so helmfile falls back to its own resolution) and does not propagate as a fatal error
TestSyncDefaults_DoesNotCallDownOnHelmfileFailure — source-level regression guard. Confirmed this guard fails against the pre-fix code (Down(cfg, u) present) and passes against the new code.
TestParseHelmfileReposBytes — repo extraction is robust to missing fields and OCI-only entries.
TestUpdateRepos_TolerantArgsConstructed / TestUpdateRepos_NoNamesIsNoop — argv wiring for the fake helm binary.
Manual: helm repo add deadrepo https://example.invalid/charts; obol stack up should now succeed (preflight ignores the dead repo, helmfile sync runs with --skip-deps). To be validated on a real host.
Manual: force a real helmfile sync failure (e.g. invalid chart values) and confirm the cluster stays up and the retry hint is printed.

Notes

Helm 3.14+ supports --fail-on-repo-update-fail. obol pins helm 3.20.1 via obolup.sh, so we always have the flag. UpdateRepos still version-gates the flag for safety if a user has helm overridden globally.
Pre-existing test TestWarnIfNoChatModel_EmitsWarnWhenNoModels is failing on main (string drift: expects "No chat-capable LLM detected", code says "No chat-capable model detected"). Out of scope for this PR.

…r running helm aborts the WHOLE `helm repo update` if any single registered repo fails (e.g. kubernetes-dashboard started returning 404 in 2025). That cascaded: helmfile sync propagated exit 1, syncDefaults called Down() as "cleanup", and `obol stack up` ended up running `k3d cluster delete` — destroying PVCs, agent wallets, and registered services for a transient external-repo flake. Two layers of defense: A. Tolerant repo-update preflight (helmcmd.UpdateRepos): run `helm repo update --fail-on-repo-update-fail=false <managed-repos...>` before helmfile, then pass --skip-deps so helmfile doesn't re-run its own strict update. Only the repos our embedded helmfile.yaml declares are touched; unrelated repos in the user's global helm config are ignored. B. Don't stop the cluster on helmfile failure: remove the Down() call from the error branch in syncDefaults. A failed sync is almost always recoverable in place; tearing down the cluster destroys unrelated state. We surface inspect/retry hints instead. Tests: - helmcmd: ParseHelmfileRepos / UpdateRepos tolerant-flag wiring (fake helm). - stack: preflightHelmRepos success + failure paths against a fake helm that exits 1 on `repo update` (mirrors the real kubernetes-dashboard incident). - stack: source-level regression guard against re-introducing Down(cfg, u) in syncDefaults. Confirmed the guard fires against the pre-fix code.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(stack): tolerate transient helm repo update failures, keep cluster running#538

fix(stack): tolerate transient helm repo update failures, keep cluster running#538
bussyjd wants to merge 1 commit into
mainfrom
fix/tolerant-helm-repo-update

bussyjd commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bussyjd commented May 24, 2026

Summary

Why the cluster-stop was the wrong call

Fix strategy (both A and B)

Files changed

Test plan

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant