feat(controller): wire client-go leader-election so HA scaling is safe #518
Closed
bussyjd wants to merge 1 commit into
Today the serviceoffer-controller is pinned at replicas: 1 with a
"Do not scale" comment in x402.yaml. The RBAC for leases is already
granted (x402.yaml:176-178) — pre-positioned and unused. An accidental
`kubectl scale --replicas=2` or HPA misconfiguration produces
split-brain finalizers and double on-chain ERC-8004 registration
(real gas spend + duplicate registry entries).
This wires client-go tools/leaderelection so multi-replica deployment
is safe-by-correctness, not safe-by-comment.
- cmd/serviceoffer-controller/main.go:
  - Read POD_NAME / POD_NAMESPACE from downward API env.
  - Acquire Lease "serviceoffer-controller" in POD_NAMESPACE
    before running the reconcile loop.
  - On lost leadership, os.Exit(1); the kubelet restarts the pod,
    which re-elects from scratch.
  - --leader-elect flag (default true) so local dev can bypass.
- x402.yaml:
  - Add downward-API POD_NAME env to the controller Deployment
    (POD_NAMESPACE was already wired).
  - Update the "Do not scale" comment to "Single replica by
    default; bumping to 2+ is now safe — leader election prevents
    split-brain on the reconcile loop."
- Lease parameters chosen for fast failover on k3d (lease=30s,
  renew=20s, retry=5s). Tunable via flag if a multi-zone deployment
  ever needs longer.
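The main.go items above could be wired roughly as follows. This is a stdlib-only sketch, not the PR's code: the `electionConfig` type, the `loadElectionConfig` helper, and the `--lease-duration`, `--renew-deadline`, and `--retry-period` flag names are hypothetical (the commit message only says the parameters are tunable via flag). Only `--leader-elect`, `POD_NAME`, `POD_NAMESPACE`, the Lease name, and the default durations come from the description.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"time"
)

// electionConfig collects the settings named in the commit message.
// The type and its field names are hypothetical.
type electionConfig struct {
	Enabled       bool
	Identity      string // POD_NAME, used as the lease holder identity
	Namespace     string // POD_NAMESPACE, where the Lease lives
	LeaseName     string
	LeaseDuration time.Duration
	RenewDeadline time.Duration
	RetryPeriod   time.Duration
}

// loadElectionConfig parses flags from args and reads the pod identity
// through getenv, so tests can supply a fake environment.
func loadElectionConfig(args []string, getenv func(string) string) (electionConfig, error) {
	cfg := electionConfig{LeaseName: "serviceoffer-controller"}
	fs := flag.NewFlagSet("serviceoffer-controller", flag.ContinueOnError)
	fs.BoolVar(&cfg.Enabled, "leader-elect", true, "acquire the Lease before reconciling")
	// The three duration flag names below are assumptions for illustration.
	fs.DurationVar(&cfg.LeaseDuration, "lease-duration", 30*time.Second, "lease validity")
	fs.DurationVar(&cfg.RenewDeadline, "renew-deadline", 20*time.Second, "leader renew deadline")
	fs.DurationVar(&cfg.RetryPeriod, "retry-period", 5*time.Second, "interval between attempts")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	if !cfg.Enabled {
		// Local dev bypass: no pod identity is required.
		return cfg, nil
	}
	cfg.Identity = getenv("POD_NAME")
	cfg.Namespace = getenv("POD_NAMESPACE")
	if cfg.Identity == "" || cfg.Namespace == "" {
		return cfg, errors.New("POD_NAME and POD_NAMESPACE must be set when --leader-elect is true")
	}
	return cfg, nil
}

func main() {
	// Demo with a fake environment; a real binary would pass
	// os.Args[1:] and os.Getenv instead.
	env := map[string]string{"POD_NAME": "pod-a", "POD_NAMESPACE": "x402"}
	cfg, err := loadElectionConfig(nil, func(k string) string { return env[k] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Failing fast on a missing POD_NAME is a design choice in this sketch: an empty holder identity would make replicas indistinguishable in the Lease, which defeats the point of the election.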
Uses client-go directly rather than controller-runtime Manager to
minimize churn — controller is currently raw client-go workqueues,
not controller-runtime. Migration to controller-runtime is a separate
much larger workstream and not necessary just for leader election.
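For readers who want the mechanism without pulling in client-go, here is a toy in-memory model of the lease rule that gates the reconcile loop. It is a deliberate simplification and not how client-go implements it (the real elector stores the record in a `coordination.k8s.io` Lease and reasons about locally observed renew times); the `lease` type, `tryAcquireOrRenew`, and `secs` are hypothetical names, and the clock is a fake offset from the start of the simulation.

```go
package main

import (
	"fmt"
	"time"
)

// lease is an in-memory stand-in for a Lease object.
type lease struct {
	holder    string
	renewedAt time.Duration // fake clock: time since the simulation started
	duration  time.Duration
}

// tryAcquireOrRenew lets id take or keep the lease when it is unheld,
// already held by id, or when the other holder's last renewal is at
// least duration old. It reports whether id holds the lease afterwards.
func (l *lease) tryAcquireOrRenew(id string, now time.Duration) bool {
	heldByOther := l.holder != "" && l.holder != id
	if heldByOther && now-l.renewedAt < l.duration {
		return false
	}
	l.holder, l.renewedAt = id, now
	return true
}

// secs converts whole seconds to a time.Duration.
func secs(n int) time.Duration { return time.Duration(n) * time.Second }

func main() {
	l := &lease{duration: secs(30)}
	// pod-a leads and renews once at t=10s, then stops (for example, it was deleted).
	steps := []struct {
		at int
		id string
	}{
		{0, "pod-a"}, {5, "pod-b"}, {10, "pod-a"}, {35, "pod-b"}, {40, "pod-b"},
	}
	for _, s := range steps {
		ok := l.tryAcquireOrRenew(s.id, secs(s.at))
		fmt.Printf("t=%2ds %s acquired=%v holder=%s\n", s.at, s.id, ok, l.holder)
	}
}
```

In this run pod-b is refused at 5s and 35s and takes over at 40s, which is 30s after pod-a's last renewal. Only the current holder would run the reconcile loop, so at no point do two replicas write concurrently.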
Superseded by bundle PR #536; closing in favor of the consolidated merge target. Original branch and history preserved.
Why
Today the serviceoffer-controller is `replicas: 1` with a "Do not scale" comment. RBAC for `coordination.k8s.io/leases` is already granted but unused. An accidental `kubectl scale --replicas=2` produces split-brain finalizers and double on-chain ERC-8004 registration (real gas spend + duplicate registry entries).
This wires leader-election so multi-replica is safe-by-correctness, not safe-by-comment.
Before
After
What changed
- `cmd/serviceoffer-controller/main.go`: wraps `controller.Run` in `leaderelection.RunOrDie`. POD_NAME/POD_NAMESPACE from downward API. `--leader-elect=false` flag for local dev.
- `x402.yaml`: adds downward-API `POD_NAME` env to controller Deployment (POD_NAMESPACE was already wired); updates "Do not scale" comment to reflect that scaling is now safe.
Lease parameters
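The commit message gives the values as lease=30s, renew=20s, retry=5s. client-go's elector rejects configurations where the lease duration is not greater than the renew deadline, or where the renew deadline is not greater than the retry period times its jitter factor (1.2). The sketch below re-implements those checks with the standard library only, to show the chosen values pass; the `leaseTimings` type and its helpers are hypothetical names, not part of client-go or this PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// jitterFactor mirrors the value of client-go's leaderelection.JitterFactor.
const jitterFactor = 1.2

// leaseTimings is a hypothetical container for the three parameters.
type leaseTimings struct {
	LeaseDuration time.Duration
	RenewDeadline time.Duration
	RetryPeriod   time.Duration
}

// prDefaults returns the values quoted in the commit message.
func prDefaults() leaseTimings {
	return leaseTimings{
		LeaseDuration: 30 * time.Second,
		RenewDeadline: 20 * time.Second,
		RetryPeriod:   5 * time.Second,
	}
}

// validate applies the ordering constraints described above.
func (t leaseTimings) validate() error {
	if t.LeaseDuration <= 0 || t.RenewDeadline <= 0 || t.RetryPeriod <= 0 {
		return errors.New("all durations must be positive")
	}
	if t.LeaseDuration <= t.RenewDeadline {
		return errors.New("lease duration must be greater than renew deadline")
	}
	if t.RenewDeadline <= time.Duration(jitterFactor*float64(t.RetryPeriod)) {
		return errors.New("renew deadline must be greater than retry period * jitter factor")
	}
	return nil
}

func main() {
	if err := prDefaults().validate(); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("lease=30s renew=20s retry=5s satisfies the ordering constraints")
}
```

If the parameters are exposed as flags, running a check like this at startup would turn a bad multi-zone override into a clear error message before the elector is constructed.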
Test plan
- `go build ./...` clean
- `go test ./internal/serviceoffercontroller/... ./cmd/serviceoffer-controller/...` green
- `go test ./internal/embed/...` green (embedded manifest still parses)
- `kubectl scale deploy/serviceoffer-controller -n x402 --replicas=2`: confirm pod-B logs "new leader is pod-A"
- `kubectl delete pod -n x402 -l app=serviceoffer-controller --field-selector=metadata.name=pod-A`: pod-B should take leadership within ~30s
Pairs with
PR #515 (verifier replicas: 1). The verifier needs per-pod metric correctness so replicas: 1 stays; the controller's correctness requirement was different (write-side races), now solved by leader-election.