
fix(x402): verifier replicas: 2 → 1 to keep metric GC correct #515

Closed
bussyjd wants to merge 1 commit into feat/x402-marketplace-metrics from fix/verifier-single-replica

bussyjd commented May 23, 2026

Why

0fbb99a (this PR's parent) shipped pruneSeriesNotIn to GC verifier metric series when offers are deleted. The GC is per-pod: each pod runs its own informer and its own registry. With replicas: 2 and a ServiceMonitor scraping Endpoints round-robin, Prometheus sees inconsistent series, and deleted offers come back every other scrape.

Before

                 ServiceMonitor scrape (round-robin Endpoints)
                            │
                    ┌───────┴───────┐
                    ▼               ▼
              verifier-pod-A   verifier-pod-B
              metric registry  metric registry
              GC ran on        GC ran on
              Reload event ✓   Reload event ?
                    │               │
                    ▼               ▼
              clean series     stale series for
              for deleted X    deleted offer X
                            │
                    Prometheus sees flip-flop
                    → alerts fire/silence on dead labels
                    → dashboards show resurrected offers
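
The flip-flop in the diagram can be reproduced with a toy model. This is a minimal sketch, not the verifier's code: `registry`, `scrape`, and `roundRobin` are hypothetical names, the `pruneSeriesNotIn` signature is assumed (the real helper may differ), and the alternating scrape order is taken from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// registry is a toy stand-in for one pod's metric registry:
// it maps an offer name to the latest sample for that offer.
type registry map[string]float64

// pruneSeriesNotIn deletes every series whose offer is not in live.
// Hypothetical signature; the real helper in the repository may differ.
func pruneSeriesNotIn(r registry, live map[string]bool) {
	for offer := range r {
		if !live[offer] {
			delete(r, offer)
		}
	}
}

// scrape returns the sorted offer names currently exposed by a registry.
func scrape(r registry) []string {
	offers := make([]string, 0, len(r))
	for offer := range r {
		offers = append(offers, offer)
	}
	sort.Strings(offers)
	return offers
}

// roundRobin models the scrape pattern described above:
// scrape n is served by pod n % len(pods).
func roundRobin(pods []registry, scrapes int) [][]string {
	out := make([][]string, 0, scrapes)
	for n := 0; n < scrapes; n++ {
		out = append(out, scrape(pods[n%len(pods)]))
	}
	return out
}

func main() {
	podA := registry{"offer-x": 1, "offer-y": 1}
	podB := registry{"offer-x": 1, "offer-y": 1}

	// offer-x is deleted; only pod A has processed the reload so far.
	pruneSeriesNotIn(podA, map[string]bool{"offer-y": true})

	for n, seen := range roundRobin([]registry{podA, podB}, 4) {
		fmt.Printf("scrape %d: %v\n", n, seen)
	}
}
```

In this model the even scrapes show only offer-y while the odd scrapes still show offer-x, which is the "resurrected offer" pattern. With a single registry there is nothing to alternate between.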

After

              ServiceMonitor scrape → only verifier-pod-A
                            │
                            ▼
                      single registry
                      single GC source
                            │
                            ▼
                   deterministic series state
                   alerts trustworthy
                   dashboards consistent

What changed

  • x402.yaml verifier Deployment: replicas: 2 → 1
  • Removed the verifier PDB (minAvailable: 1 with replicas: 1 would block voluntary drains of the only pod; useless on single-node k3d)
  • Added explanatory comment so this isn't re-bumped without thought
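
The PDB point follows from simple arithmetic. A minimal sketch, assuming the usual `minAvailable` semantics (evictions are allowed only while healthy pods exceed the minimum); `allowedDisruptions` is a hypothetical helper, not code from this repository.

```go
package main

import "fmt"

// allowedDisruptions mirrors the PodDisruptionBudget arithmetic for
// minAvailable: voluntary evictions are allowed only while the number of
// healthy pods exceeds minAvailable.
func allowedDisruptions(healthyPods, minAvailable int) int {
	if d := healthyPods - minAvailable; d > 0 {
		return d
	}
	return 0
}

func main() {
	fmt.Println("replicas: 2 ->", allowedDisruptions(2, 1))
	fmt.Println("replicas: 1 ->", allowedDisruptions(1, 1))
}
```

At replicas: 1 the budget never permits an eviction, so a node drain would wait on the verifier pod indefinitely.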

Future HA path

If/when this stack runs multi-node and wants 2+ verifier replicas, the correct pattern is to replace the ServiceMonitor with a PodMonitor (each pod scraped with a pod label) and add recording rules using sum without(pod). Not now; correctness > theoretical HA on a single-node k3d.
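
To show what the recording rule would compute, here is a minimal Go sketch of `sum without(pod)` for a single metric: group by every label except `pod`, then sum each group. The `sample` type, the helper names, and the `offer` label are assumptions for illustration. Summing suits a counter such as charged_requests_total; a timestamp gauge would likely need a different aggregation.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is one scraped series: a label set plus a value.
type sample struct {
	labels map[string]string
	value  float64
}

// keyWithout builds a stable grouping key from all labels except drop.
func keyWithout(labels map[string]string, drop string) string {
	parts := make([]string, 0, len(labels))
	for k, v := range labels {
		if k != drop {
			parts = append(parts, k+"="+v)
		}
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// sumWithout groups samples by every label except drop and sums each group,
// which is what PromQL's `sum without(pod)` does for a single metric.
func sumWithout(samples []sample, drop string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		out[keyWithout(s.labels, drop)] += s.value
	}
	return out
}

func main() {
	// Hypothetical per-pod samples of one counter.
	samples := []sample{
		{map[string]string{"offer": "offer-y", "pod": "verifier-a"}, 3},
		{map[string]string{"offer": "offer-y", "pod": "verifier-b"}, 4},
		{map[string]string{"offer": "offer-z", "pod": "verifier-a"}, 5},
	}
	for key, total := range sumWithout(samples, "pod") {
		fmt.Printf("{%s} %v\n", key, total)
	}
}
```

Because each pod keeps its own labelled series, a stale series on one pod stays attributable to that pod instead of alternating in and out of a shared series.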

Test plan

  • go build ./... clean
  • go test ./internal/x402/... green
  • Manual on next stack up: scrape /metrics from the verifier, delete a ServiceOffer, scrape again — series should disappear and STAY disappeared
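
The manual check could be scripted roughly as follows. This is a sketch under assumptions: it takes two already-fetched /metrics bodies in the Prometheus text format, and it assumes the per-offer label is named `offer`. The real label name and any metric-name prefix may differ. `seriesForOffer` is a hypothetical helper.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// seriesForOffer returns the sample lines in a Prometheus text-format scrape
// that carry the given value for the (assumed) `offer` label.
func seriesForOffer(body, offer string) []string {
	needle := fmt.Sprintf(`offer=%q`, offer)
	var hits []string
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "#") { // skip HELP/TYPE comment lines
			continue
		}
		if strings.Contains(line, needle) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	// Hypothetical scrape bodies taken before and after deleting offer-x.
	before := `# TYPE charged_requests_total counter
charged_requests_total{offer="offer-x"} 3
charged_requests_total{offer="offer-y"} 7
`
	after := `# TYPE charged_requests_total counter
charged_requests_total{offer="offer-y"} 7
`
	fmt.Println("before delete:", len(seriesForOffer(before, "offer-x")), "series")
	fmt.Println("after delete: ", len(seriesForOffer(after, "offer-x")), "series")
}
```

Repeating the second check over several consecutive scrapes covers the "stay disappeared" part of the test.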

Stacks on

PR #513 (introduces pruneSeriesNotIn). Will rebase onto main after #513 merges.

Commit 0fbb99a (fix(x402): GC verifier metric series for deleted offers)
added pruneSeriesNotIn to Verifier.load. Each verifier pod runs its
own informer + its own metric registry, so the GC is per-pod. With
replicas: 2 + ServiceMonitor (round-robin scrape over Endpoints),
Prometheus sees:

  * one pod's registry on scrape N (pruned correctly),
  * the other pod's on scrape N+1 (may still hold a deleted offer's
    series until that pod's informer also sees the delete).

Result: deleted offers' last_payment_success_seconds gauge and
charged_requests_total counters reappear every other scrape, polluting
dashboards and creating spurious alert state.

Cheapest correct fix is replicas: 1. The verifier is on the request
path but single-node k3d gains no HA from 2 replicas. Drop the
PodDisruptionBudget too — minAvailable:1 at replicas:1 just blocks
voluntary drains on the only pod, useless on k3d.

If/when the stack ever runs multi-node and HA replicas are wanted,
the right pattern is ServiceMonitor → PodMonitor with a `pod` label
and recording rules using `sum without(pod)`. That's a future change;
right now correctness > theoretical HA.

bussyjd commented May 24, 2026

Superseded by bundle PR #536 — closing in favor of the consolidated merge target. Original branch and history preserved.
