feat(controller): wire client-go leader-election so HA scaling is safe by bussyjd · Pull Request #518 · ObolNetwork/obol-stack

bussyjd · 2026-05-23T18:45:50Z

Why

Today the serviceoffer-controller runs at replicas: 1 with a "Do not scale" comment. RBAC for coordination.k8s.io/leases is already granted but unused. An accidental kubectl scale --replicas=2 would produce split-brain finalizers and double on-chain ERC-8004 registration (real gas spend plus duplicate registry entries).

This wires leader-election so multi-replica is safe-by-correctness, not safe-by-comment.

Before

   Operator: kubectl scale deploy/serviceoffer-controller --replicas=2
                            │
                            ▼
       ┌───────────────────┴───────────────────┐
       ▼                                       ▼
   controller-pod-A                       controller-pod-B
   reconciles offer X                     reconciles offer X
       │                                       │
       ▼                                       ▼
   creates HTTPRoute, ReferenceGrant,     same — race on Update
   RegistrationRequest                    finalizer set both, removed
   submits ERC-8004 tx (gas spent)        submits ERC-8004 tx (gas spent)
       │                                       │
       └─────────────┬─────────────────────────┘
                     ▼
       2 on-chain registrations for same offer
       2 stale HTTPRoute generations
       Finalizer thrash

After

   Operator: kubectl scale deploy/serviceoffer-controller --replicas=2
                            │
                            ▼
       ┌───────────────────┴───────────────────┐
       ▼                                       ▼
   controller-pod-A                       controller-pod-B
   acquires Lease "serviceoffer-controller" in x402 ns
       │                                       │
       ▼                                       ▼
   OnStartedLeading() → runs reconciler     OnNewLeader(A) → standby
                                              (watches Lease, does not reconcile)
   ...                                          ...
   pod-A dies                              acquires Lease within ~30s
                                           OnStartedLeading() → runs

What changed

cmd/serviceoffer-controller/main.go — wraps controller.Run in leaderelection.RunOrDie. POD_NAME/POD_NAMESPACE from downward API. --leader-elect=false flag for local dev.
x402.yaml — adds downward-API POD_NAME env to controller Deployment (POD_NAMESPACE was already wired); updates "Do not scale" comment to reflect that scaling is now safe.
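
As a rough illustration of the startup inputs listed above, the sketch below parses a `--leader-elect` flag (default true) and reads POD_NAME / POD_NAMESPACE through an injected lookup function. The `electionConfig` type and `loadElectionConfig` function are hypothetical names, and failing fast when the variables are missing is an assumption rather than something the PR states.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// electionConfig is a hypothetical holder for the election inputs.
type electionConfig struct {
	Enabled   bool
	Identity  string // from POD_NAME (downward API)
	Namespace string // from POD_NAMESPACE (downward API)
}

// loadElectionConfig parses args and reads the pod identity through getenv.
// Injecting getenv keeps the function testable without the real environment.
func loadElectionConfig(args []string, getenv func(string) string) (electionConfig, error) {
	fs := flag.NewFlagSet("serviceoffer-controller", flag.ContinueOnError)
	enabled := fs.Bool("leader-elect", true, "acquire the Lease before reconciling")
	if err := fs.Parse(args); err != nil {
		return electionConfig{}, err
	}
	cfg := electionConfig{
		Enabled:   *enabled,
		Identity:  getenv("POD_NAME"),
		Namespace: getenv("POD_NAMESPACE"),
	}
	// Assumption: with election enabled, a missing identity is a hard error.
	if cfg.Enabled && (cfg.Identity == "" || cfg.Namespace == "") {
		return cfg, errors.New("POD_NAME and POD_NAMESPACE must be set when --leader-elect is true")
	}
	return cfg, nil
}

func main() {
	// Stub environment for the demo; the controller would pass os.Getenv.
	env := map[string]string{"POD_NAME": "pod-A", "POD_NAMESPACE": "x402"}
	getenv := func(k string) string { return env[k] }

	cfg, err := loadElectionConfig([]string{"--leader-elect=true"}, getenv)
	fmt.Println(cfg.Enabled, cfg.Identity, cfg.Namespace, err)

	// Local dev: election bypassed, so no pod identity is required.
	cfg, err = loadElectionConfig([]string{"--leader-elect=false"}, func(string) string { return "" })
	fmt.Println(cfg.Enabled, err)
}
```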

Lease parameters

LeaseDuration 30s, RenewDeadline 20s, RetryPeriod 5s — fast failover on k3d single-node. Tunable.
ReleaseOnCancel: true — graceful shutdown releases the lease immediately, no wait for expiry.

Test plan

go build ./... clean
go test ./internal/serviceoffercontroller/... ./cmd/serviceoffer-controller/... green
go test ./internal/embed/... green (embedded manifest still parses)
Manual: kubectl scale deploy/serviceoffer-controller -n x402 --replicas=2 — confirm pod-B logs "new leader is pod-A"
Manual: kubectl delete pod -n x402 -l app=serviceoffer-controller --field-selector=metadata.name=pod-A — pod-B should take leadership within ~30s

Pairs with

PR #515 (verifier replicas: 1). The verifier needs per-pod metric correctness so replicas: 1 stays; the controller's correctness requirement was different (write-side races), now solved by leader-election.

Today the serviceoffer-controller is pinned at replicas: 1 with a "Do not scale" comment in x402.yaml. The RBAC for leases is already granted (x402.yaml:176-178), pre-positioned and unused. An accidental `kubectl scale --replicas=2` or HPA misconfiguration produces split-brain finalizers and double on-chain ERC-8004 registration (real gas spend + duplicate registry entries).

This wires client-go tools/leaderelection so multi-replica deployment is safe-by-correctness, not safe-by-comment.

- cmd/serviceoffer-controller/main.go:
  - Read POD_NAME / POD_NAMESPACE from downward API env.
  - Acquire Lease "serviceoffer-controller" in POD_NAMESPACE before running the reconcile loop.
  - On lost leadership, os.Exit(1): kubelet restarts the pod, which re-elects from scratch.
  - --leader-elect flag (default true) so local dev can bypass.
- x402.yaml:
  - Add downward-API POD_NAME env to the controller Deployment (POD_NAMESPACE was already wired).
  - Update the "Do not scale" comment to "Single replica by default; bumping to 2+ is now safe — leader election prevents split-brain on the reconcile loop."
- Lease parameters chosen for fast failover on k3d (lease=30s, renew=20s, retry=5s). Tunable via flag if a multi-zone deployment ever needs longer.

Uses client-go directly rather than controller-runtime Manager to minimize churn: the controller is currently raw client-go workqueues, not controller-runtime. Migration to controller-runtime is a separate, much larger workstream and not necessary just for leader election.

bussyjd · 2026-05-24T09:13:23Z

Superseded by bundle PR #536 — closing in favor of the consolidated merge target. Original branch and history preserved.

bussyjd mentioned this pull request on May 24, 2026, in "feat: x402 marketplace + architecture review bundle (#513-#535)" #536 (merged).

bussyjd closed this on May 24, 2026.

bussyjd mentioned this pull request on May 24, 2026, in "fix: resolve marketplace bundle architecture blockers" #541 (merged).

bussyjd wants to merge 1 commit into main from feat/controller-leader-election
