fix(x402): verifier replicas: 2 → 1 to keep metric GC correct #515
Closed
bussyjd wants to merge 1 commit into
Commit 0fbb99a (fix(x402): GC verifier metric series for deleted offers) added pruneSeriesNotIn to Verifier.load. Each verifier pod runs its own informer + its own metric registry, so the GC is per-pod. With replicas: 2 + ServiceMonitor (round-robin scrape over Endpoints), Prometheus sees:

* one pod's registry on scrape N (pruned correctly),
* the other pod's on scrape N+1 (may still hold a deleted offer's series until that pod's informer also sees the delete).

Result: deleted offers' last_payment_success_seconds gauge and charged_requests_total counters reappear every other scrape, polluting dashboards and creating spurious alert state.

Cheapest correct fix is replicas: 1. The verifier is on the request path but single-node k3d gains no HA from 2 replicas. Drop the PodDisruptionBudget too: minAvailable:1 at replicas:1 just blocks voluntary drains on the only pod, which is useless on k3d.

If/when the stack ever runs multi-node and HA replicas are wanted, the right pattern is ServiceMonitor → PodMonitor with a `pod` label and recording rules using `sum without(pod)`. That's a future change; right now correctness > theoretical HA.
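The flapping described above can be illustrated with a toy model. This is only a sketch of the scrape pattern this description assumes (alternating scrapes landing on different pods); the `registry` type, the offer names, and the `scrape`/`visibility` helpers are hypothetical and are not taken from the verifier code or from Prometheus.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for one pod's metric registry:
// the set of offer names that currently have exported series.
type registry map[string]bool

// scrape models the assumed round-robin behaviour: scrape number n
// lands on pods[n % len(pods)] and reports whether the offer's
// series is present in that pod's registry.
func scrape(pods []registry, n int, offer string) bool {
	return pods[n%len(pods)][offer]
}

// visibility returns, for scrapes 0..count-1, whether the offer's
// series was visible to the scraper.
func visibility(pods []registry, count int, offer string) []bool {
	out := make([]bool, count)
	for n := range out {
		out[n] = scrape(pods, n, offer)
	}
	return out
}

func main() {
	// Pod A has already seen the delete and pruned the series;
	// pod B's informer has not caught up yet.
	podA := registry{"offer-live": true}
	podB := registry{"offer-live": true, "offer-deleted": true}

	fmt.Println("replicas=2:", visibility([]registry{podA, podB}, 4, "offer-deleted"))
	fmt.Println("replicas=1:", visibility([]registry{podA}, 4, "offer-deleted"))
}
```

In this model the two-pod case alternates between absent and present, while the single-pod case stays absent once that pod has pruned, which is the property the `replicas: 1` change relies on.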
This was referenced May 23, 2026
Superseded by bundle PR #536; closing in favor of the consolidated merge target. Original branch and history preserved.
Why
`0fbb99a` (this PR's parent) shipped `pruneSeriesNotIn` to GC verifier metric series when offers are deleted. The GC is per-pod (each pod runs its own informer + its own registry). With `replicas: 2` + ServiceMonitor scraping Endpoints (round-robin), Prometheus sees inconsistent series: deleted offers come back every other scrape.
Before
After
What changed
- `x402.yaml` verifier Deployment: `replicas: 2 → 1`
- Dropped the PodDisruptionBudget (`minAvailable: 1` at `replicas: 1` blocks voluntary drains on the only pod, useless on single-node k3d)
Future HA path
If/when this stack runs multi-node and wants 2+ verifier replicas, the correct pattern is ServiceMonitor → PodMonitor (each pod scraped with a `pod` label) + recording rules using `sum without(pod)`. Not now; correctness > theoretical HA on a single-node k3d.
Test plan
- `go build ./...` clean
- `go test ./internal/x402/...` green
- Scrape `/metrics` from the verifier, delete a ServiceOffer, scrape again: series should disappear and STAY disappeared
Stacks on
PR #513 (introduces `pruneSeriesNotIn`). Will rebase onto main after #513 merges.
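For readers without #513 at hand, here is a minimal sketch of what a per-pod prune step in the spirit of `pruneSeriesNotIn` might look like. The real signature and registry type are not shown in this PR, so the `seriesStore` type, its field, and the return value below are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// seriesStore is a hypothetical per-pod store keyed by offer name.
// It stands in for whatever registry the verifier actually uses.
type seriesStore struct {
	series map[string]float64
}

// pruneSeriesNotIn deletes every series whose offer is absent from
// live and returns the removed offer names in sorted order.
// This is an assumed shape, not the function from #513.
func (s *seriesStore) pruneSeriesNotIn(live map[string]struct{}) []string {
	var removed []string
	for offer := range s.series {
		if _, ok := live[offer]; !ok {
			delete(s.series, offer) // deleting during range is safe in Go
			removed = append(removed, offer)
		}
	}
	sort.Strings(removed)
	return removed
}

func main() {
	s := &seriesStore{series: map[string]float64{"offer-a": 12, "offer-b": 7}}

	// Only offer-a still exists according to this pod's informer.
	live := map[string]struct{}{"offer-a": {}}

	fmt.Println("removed:", s.pruneSeriesNotIn(live))
	fmt.Println("remaining:", len(s.series))
}
```

Because the store and the `live` set both belong to a single pod, the prune is only as current as that pod's informer, which is why the GC is per-pod. With `prometheus/client_golang`, the deletion itself would typically go through a metric vector's `DeleteLabelValues` rather than a plain map.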