feat(x402): chain label, last-settlement gauge, PurchaseAutoRefill sync + verifier PodMonitor + obol-frontend RBAC for purchaserequests#513

Closed
bussyjd wants to merge 5 commits into main from feat/x402-marketplace-metrics

Conversation

@bussyjd
Collaborator

@bussyjd bussyjd commented May 23, 2026

Summary

Powers the new My Listings → EarningsStrip and My Purchases → expansion drawer surfaces in obol-stack-front-end#parity/results-bar by closing three data-pipeline gaps:

  1. chain label is now on every obol_x402_buyer_* and obol_x402_verifier_* series — required for per-chain aggregation in the EarningsStrip / WalletStrip.
  2. obol_x402_verifier_last_payment_success_seconds gauge stamps the most recent settled payment per route — powers the "Last settlement · 2m ago" cell.
  3. PurchaseAutoRefill Go-type sync with the CRD spec — adds MaxTotal + MaxSpendPerDay fields that already existed in purchaserequest-crd.yaml but were unread by the controller / frontend.

The PR also adds the x402-verifier PodMonitor that the namespace was missing (only litellm-x402-buyer was scraped) and grants the obol-frontend ClusterRole read access to purchaserequests, so the new /api/marketplace/purchases route can list them.

What changed

| # | File | Why |
|---|------|-----|
| 1 | internal/x402/buyer/metrics.go | Add chain to all 9 counter/gauge label sets |
| 2 | internal/x402/buyer/proxy.go | Thread cfg.Network through prometheusLabels() at all 3 callsites |
| 3 | internal/x402/metrics.go | Add lastPaymentSuccess GaugeVec + register it |
| 4 | internal/x402/verifier.go | chain label in prometheusLabels(rule); SetToCurrentTime() on success in both ForwardAuth + proxy-mode paths |
| 5 | internal/monetizeapi/types.go | PurchaseAutoRefill.MaxTotal int + MaxSpendPerDay string |
| 6 | internal/x402/verifier_test.go | Update label maps for chain; new TestVerifier_LastPaymentSuccessGauge (3 subtests) |
| 7 | internal/x402/buyer/proxy_test.go | Update 4 existing TestProxy_* label-map assertions for the new chain label |
| 8 | internal/x402/buyer/metrics_test.go (new) | TestPrometheusLabels_ChainPropagation (3) + TestMetrics_ChainLabelScrapeRoundtrip (2) |
| 9 | internal/monetizeapi/types_test.go (new) | TestPurchaseAutoRefill_JSONRoundTrip (5) + TestPurchaseAutoRefill_UnmarshalAcceptsCRDForm |
| 10 | internal/embed/infrastructure/helmfile.yaml | Add x402-verifier PodMonitor (release: monitoring) + grant obol-frontend ClusterRole get/list/watch on purchaserequests |

Live validation (base-sepolia)

Funded the default obol-agent wallet (0xBb0a70F713334401063c9A8519014F6F026E1c5e) with 0.005 ETH and 3 USDC from the smoke-test seller key, then drove two paid requests through buy.py pay against demo-hello:

Settlement tx 1   0x52140ce18c64cb96170dc54bccbbdfcbe8e0426a161318c8adf687639ee3ac31
Settlement tx 2   (second buy.py pay)

obol_x402_verifier_charged_requests_total{
  chain="eip155:84532",
  offer_namespace="demo", offer_name="demo-hello",
  route="/services/demo-hello/*"
}  =  2

obol_x402_verifier_last_payment_success_seconds{same labels}
  =  1779554031.04   (matches the on-chain block timestamp)

increase(charged_requests_total[7d]){chain="eip155:84532"}  ≈  1.08
   (Prometheus extrapolates from 2 samples; close enough for the
    EarningsStrip's per-chain × price display)

The new verifier image was rebuilt locally (localhost:54103/x402-verifier:dev, sha256:af11911f…) and imported into the k3d cluster. The PodMonitor target was confirmed healthy, with up{job="x402/x402-verifier"} = 1 for both replicas, before traffic flowed.
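For readers who want to sanity-check the gauge value reported in the validation output above, here is a minimal sketch of turning a *_seconds sample into a timestamp and a "2m ago" style label. It uses only the Go standard library. The helper names gaugeToTime and settlementAge are hypothetical, and the frontend's actual formatting is not shown in this PR.

```go
package main

import (
	"fmt"
	"time"
)

// gaugeToTime converts a *_seconds gauge sample (Unix seconds, possibly
// fractional) into a UTC time.Time. Hypothetical helper for illustration.
func gaugeToTime(sample float64) time.Time {
	sec := int64(sample)
	nsec := int64((sample - float64(sec)) * 1e9)
	return time.Unix(sec, nsec).UTC()
}

// settlementAge renders a coarse relative age such as "2m ago".
// Hypothetical helper; the real UI formatting is an assumption.
func settlementAge(last, now time.Time) string {
	d := now.Sub(last)
	switch {
	case d < 0:
		return "just now"
	case d < time.Minute:
		return fmt.Sprintf("%ds ago", int(d.Seconds()))
	case d < time.Hour:
		return fmt.Sprintf("%dm ago", int(d.Minutes()))
	case d < 24*time.Hour:
		return fmt.Sprintf("%dh ago", int(d.Hours()))
	default:
		return fmt.Sprintf("%dd ago", int(d.Hours()/24))
	}
}

func main() {
	// Sample value taken from the validation run above.
	last := gaugeToTime(1779554031.04)
	fmt.Println(last.Format(time.RFC3339)) // 2026-05-23T16:33:51Z

	// Pretend "now" is two minutes after the settlement.
	fmt.Println("Last settlement ·", settlementAge(last, last.Add(2*time.Minute)))
}
```

Because the gauge stores an absolute timestamp rather than an age, the consumer computes the age at read time, so the displayed value stays correct between scrapes.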

CRD-cleanliness notes

  • No CRD schema change. Pause/resume continues to ride the existing metadata.annotations["obol.org/paused"] mechanism (already honored by the controller at internal/serviceoffercontroller/controller.go:458–463). The frontend's new PATCH route writes the annotation; the controller does the rest.
  • PurchaseAutoRefill schema in purchaserequest-crd.yaml already declared maxTotal (integer) and maxSpendPerDay (string) — the Go types just hadn't been updated. This commit closes the drift, which TestPurchaseAutoRefill_UnmarshalAcceptsCRDForm now guards against re-occurring.
  • Earnings / 24h-traffic / last-settlement intentionally stay in Prometheus, not on the ServiceOffer / PurchaseRequest status. The recommendation from the up-front mapping work was that CR status should not be a metrics store; the new obol_x402_verifier_last_payment_success_seconds gauge is the cleanest place for the one timestamp the design canvas asks for.
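The Go-type/CRD drift described in the second bullet can be illustrated with a small sketch. The struct below is a hypothetical two-field subset of PurchaseAutoRefill: the JSON tags maxTotal and maxSpendPerDay are assumed to match the CRD property names, omitempty is assumed from the test description, and the sample values are invented. decodeStrict is an illustrative helper, not the test in this PR.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// PurchaseAutoRefill is a hypothetical subset of the real struct in
// internal/monetizeapi/types.go; only the two newly synced fields appear.
type PurchaseAutoRefill struct {
	MaxTotal       int    `json:"maxTotal,omitempty"`
	MaxSpendPerDay string `json:"maxSpendPerDay,omitempty"`
}

// decodeStrict rejects keys the struct does not declare, so a CRD-shaped
// payload fails loudly if the Go type is missing or misnames a field.
func decodeStrict(raw []byte) (PurchaseAutoRefill, error) {
	var out PurchaseAutoRefill
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	err := dec.Decode(&out)
	return out, err
}

// encode marshals the struct back to JSON, returning "" on error.
func encode(v PurchaseAutoRefill) string {
	b, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// CRD-form payload with made-up values.
	refill, err := decodeStrict([]byte(`{"maxTotal": 50, "maxSpendPerDay": "2.50"}`))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(encode(refill))               // round trip keeps both caps
	fmt.Println(encode(PurchaseAutoRefill{})) // zero values are omitted

	// A key the struct does not know about is treated as drift.
	_, err = decodeStrict([]byte(`{"max_total": 50}`))
	fmt.Println("drifted key rejected:", err != nil)
}
```

With the default lenient decoder an unknown key is silently dropped, which is how a field can exist in the CRD yet go unread; strict decoding in a test is one way to surface that.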

Test plan

  • go test ./internal/x402/... ./internal/monetizeapi/... — green (14 new subtests, 4 pre-existing buyer-proxy assertions updated, no skips)
  • go vet ./internal/x402/... ./internal/monetizeapi/... — clean
  • go build ./... — clean
  • Live: paid request through buy.py pay against demo-hello on base-sepolia — verifier metrics scraped, frontend /api/sell/list returns non-zero chargedByChain + lastSettlementUnix
  • Release-smoke (flow-08/14/15) — out of scope for this PR; the buyer-proxy test update means flow-08 should now pass label assertions without manual edits

Related

Pairs with frontend PR https://github.com/ObolNetwork/obol-stack-front-end/pull/new/parity/results-bar which consumes these labels via three new PrometheusClient methods (chargedSalesByOfferAndChain, chargedRequests24hByOffer, lastSettlementByOffer).

bussyjd added 4 commits May 23, 2026 20:41
…oRefill

Unblocks per-chain earnings/spend aggregation in the frontend's My Listings
and My Purchases pages.

  - obol_x402_buyer_* metrics now carry a `chain` label sourced from
    UpstreamConfig.Network (already in payload, just wasn't on the labels).
  - obol_x402_verifier_* metrics now carry a `chain` label sourced from
    RouteRule.Network. Existing verifier metric tests updated to assert the
    new label (empty string when no Network is set on the rule).
  - internal/monetizeapi/types.go PurchaseAutoRefill struct now mirrors the
    CRD spec (purchaserequest-crd.yaml lines 93-96) by including MaxTotal +
    MaxSpendPerDay. The CRD already accepts these, the Go types just
    weren't reading them.

Together this means the frontend can soon switch the EarningsStrip /
WalletStrip from zeroed placeholders to real PromQL aggregates such as:

  sum by (chain) (increase(obol_x402_buyer_payment_success_total[7d]))

Closes the data loop for the frontend My Listings EarningsStrip + the
"Last settlement" timestamp the design canvas wants.

  - New gauge obol_x402_verifier_last_payment_success_seconds, labeled
    by (route, offer_namespace, offer_name, chain). Stamped via
    SetToCurrentTime() in both ForwardAuth and proxy-mode paths
    whenever a paid request reaches the seller successfully.
  - helmfile.yaml grows an x402-verifier PodMonitor (the namespace was
    previously scraping only litellm-x402-buyer). Same release: monitoring
    label so kube-prometheus-stack picks it up.

The frontend already has matching consumers
(chargedSalesByOfferAndChain, chargedRequests24hByOffer,
lastSettlementByOffer in PrometheusClient) — without this scrape the
metrics never reach the dashboard.
…Refill

Closes the test gap left open by the recent chain-label + last-settlement
gauge work. 14 new subtests across three packages plus four pre-existing
buyer-proxy assertions updated to carry the new chain label.

New tests:
  - internal/x402/verifier_test.go
      TestVerifier_LastPaymentSuccessGauge (3 subtests):
        successful payment stamps gauge within ±5s of time.Now(),
        unpaid 402 leaves it untouched, rejected payment leaves it untouched.
      findVerifierMetricValue helper for time-window assertions.
  - internal/x402/buyer/metrics_test.go
      TestPrometheusLabels_ChainPropagation (3 subtests):
        base-sepolia / base mainnet / empty chain.
      TestMetrics_ChainLabelScrapeRoundtrip (2 subtests):
        scrape /metrics through the registry, assert every counter +
        gauge series carries the expected chain label.
  - internal/monetizeapi/types_test.go
      TestPurchaseAutoRefill_JSONRoundTrip (5 subtests):
        full population, only new caps, all-zero omitempty, single fields.
      TestPurchaseAutoRefill_UnmarshalAcceptsCRDForm:
        catches json-tag drift between the Go struct and CRD spec.

Pre-existing fix:
  - internal/x402/buyer/proxy_test.go — four TestProxy_* assertions had
    label maps without `chain`; tests use Network "base-sepolia" so the
    expected chain is now spelled out alongside upstream + remote_model.

RBAC:
  - helmfile.yaml: obol-frontend ClusterRole grows read access for
    purchaserequests + purchaserequests/status (frontend My Purchases
    needs list; agent buy.py + controller remain the only writers).
    Live-patched into the running cluster too.
…ording rules

Phase 1 + Phase 2 hardening on top of the chain-label/last-settlement work,
incorporating findings from the 4-agent K8s architecture review. Skips the
auth-on-mutating-endpoints item per operator clarification: the obol-stack
frontend is local-only behind the obol.stack hostname restriction, so it's
not the primary trust boundary.

RBAC trims:
  - Drop `secrets get/list` from obol-frontend-openclaw-discovery
    ClusterRole; pre-existing dangling grant, no code reads them.
  - Drop /status subresource from purchaserequests rule; frontend never
    writes status (only the controller does).

Monitoring + RBAC co-location (kills 3 bedag/raw helmfile releases):
  - x402-verifier: PodMonitor -> ServiceMonitor in base/templates/x402.yaml.
    Verifier has a stable Service on port http:8080; ServiceMonitor scrapes
    the endpoint cleanly across replicas.
  - litellm-x402-buyer: PodMonitor moved into base/templates/llm.yaml.
    Stays a PodMonitor because the sidecar's port 8402 is per-pod, not
    fronted by a Service.
  - obol-frontend RBAC moved into base/templates/obol-frontend-rbac.yaml
    next to the workload it grants.

Label cardinality:
  - Drop `route` label from verifier metrics. (offer_namespace, offer_name,
    chain) already uniquely scopes a paid route; `route` (= rule.Pattern)
    was redundant and, because it embeds path fragments, unbounded in
    cardinality.

PrometheusRule (new base/templates/x402-prometheus-rules.yaml):
  - Recording: x402:revenue:24h_by_offer_chain,
    x402:revenue:7d_by_offer_chain, x402:revenue:lifetime_by_offer,
    x402:settlement_rate:1h_by_offer_chain. The frontend's PrometheusClient
    reads these so renaming raw metrics no longer breaks the UI, and the
    `increase()` 2-sample minimum no longer leaves cold offers at "0" for
    the first 30s of traffic.
  - Alerting: X402PaymentFailureRateHigh (>10% over 1h),
    X402NoSettlementsAfterChallenge (402s issued, no charges).

Deferred (out of scope for this hardening pass):
  - Frontend-egress NetworkPolicy: on k3s + Flannel the apiserver Service
    endpoints point at the host process, outside the cluster pod/service
    CIDRs. A clean allowlist policy can't target the apiserver portably
    without an install-specific ipBlock; revisit when obol-stack ships a
    non-k3s deployment surface.
  - obol-marketplace-api aggregator service: overkill for the local
    single-operator context.
  - Three-deployment-paths consolidation (helmfile + bedag/raw + Go
    `EnsureVerifier`): larger refactor; tracked as separate workstream.

Live validation:
  - 2 paid requests against demo-hello survive both the RBAC trims and
    the ServiceMonitor swap. `x402:revenue:7d_by_offer_chain` returns
    1.0076 for chain=eip155:84532 (matches the underlying
    obol_x402_verifier_charged_requests_total counter at value 2 over
    2 samples).
  - /api/marketplace/purchases still returns 200 after dropping the
    /status grant.
  - /api/agents/wallets returns the agent wallet via the new batched
    listAllWalletMetadata path (1 ConfigMap list vs N+1 per-instance).
@bussyjd
Collaborator Author

bussyjd commented May 23, 2026

Phase 1+2 hardening pushed (commit 27e1ac5)

Picked up findings from a 4-agent K8s architecture review (observability, RBAC, helmfile, frontend boundary). What landed in this commit:

RBAC trims (defense-in-depth — the obol-stack frontend is local-only behind the hostnames: ["obol.stack"] HTTPRoute restriction):

  • Drop secrets get/list from obol-frontend-openclaw-discovery ClusterRole. Pre-existing dangling grant; no code reads them.
  • Drop /status subresource from purchaserequests rule. Frontend never writes status; only the controller does.

Monitoring + RBAC co-location (kills 3 bedag/raw helmfile releases):

  • x402-verifier: PodMonitor → ServiceMonitor in base/templates/x402.yaml. Verifier has a stable Service on http:8080; scrapes the endpoint cleanly across replicas.
  • litellm-x402-buyer: PodMonitor moved into base/templates/llm.yaml. Stays PodMonitor (sidecar port not in a Service).
  • obol-frontend-rbac ClusterRole + binding moved into new base/templates/obol-frontend-rbac.yaml.

Label cardinality:

  • Drop route label from verifier metrics. (offer_namespace, offer_name, chain) already uniquely scopes a paid route; route (= rule.Pattern) was redundant and, because it embeds path fragments, unbounded in cardinality.

PrometheusRule (new base/templates/x402-prometheus-rules.yaml):

  • Recording: x402:revenue:{24h,7d}_by_offer_chain, x402:revenue:lifetime_by_offer, x402:settlement_rate:1h_by_offer_chain. Frontend reads these so renaming raw metrics no longer breaks the UI; increase() 2-sample minimum no longer leaves cold offers at "0" for first 30s.
  • Alerting: X402PaymentFailureRateHigh (>10% over 1h), X402NoSettlementsAfterChallenge (402s issued, no settlements).

Live-validated: 2 paid requests against demo-hello survive RBAC trims + the ServiceMonitor swap. x402:revenue:7d_by_offer_chain returns 1.0076 for chain=eip155:84532 (matches the underlying counter). /api/marketplace/purchases still 200 after the /status grant trim.

Explicitly deferred (out of scope, documented inline):

  • Frontend-egress NetworkPolicy: tried and reverted. On k3s + Flannel the apiserver Service Endpoints point at the host process, outside the cluster pod/service CIDRs. A clean allowlist can't target the apiserver portably without an install-specific ipBlock. Tracked for the day obol-stack ships a non-k3s deployment surface.
  • obol-marketplace-api Go aggregator: overkill for the local single-operator context.
  • Three-deployment-paths consolidation: larger refactor, separate workstream.

Pairs with the frontend hardening commit on https://github.com/ObolNetwork/obol-stack-front-end/pull/330

The verifier's per-offer counters and the last_payment_success_seconds
gauge were created on first use and never removed. Deleting an offer
(via `obol sell delete`, ServiceOffer CR deletion, or pricing config
edit) left stale series in the registry forever, which:

  * pollutes My Listings / dashboards with rows for offers that no
    longer exist,
  * lets X402NoSettlementsAfterChallenge keep referencing dead labels,
  * silently inflates the "last successful charge" gauge with timestamps
    from offers the operator already retired.

Verifier.load() now diffs the incoming route set against the live label
tuples in the registry and calls DeletePartialMatch on each vec for
every (offer_namespace, offer_name, chain) triple that is no longer
served. Both reload paths (file config watcher and the kube
ServiceOffer informer via ConfigAccumulator) funnel through load(), so
one hook covers everything.

Also fixes a guard test from the prior hardening commit that was still
asserting the old "no ServiceMonitor here" invariant after we
intentionally relocated the ServiceMonitor into this manifest. Flipped
to assert presence so a future cleanup can't silently drop it.

Test:
  TestVerifier_Reload_PrunesDeletedOfferSeries stamps two offers' worth
  of metrics, reloads with one removed, and asserts the removed offer
  is gone from all six vecs while the kept offer survives.
@bussyjd
Collaborator Author

bussyjd commented May 23, 2026

Follow-up: GC for stale verifier metric series (commit 0fbb99a)

Closes the last deferred item from the hardening pass.

Problem: deleting an offer left its per-label series in the verifier registry forever — most notably obol_x402_verifier_last_payment_success_seconds, which would keep advertising the deleted offer's last-success timestamp and falsely satisfy "recent activity" dashboards/alerts.

Fix: Verifier.load() now diffs the incoming route set against the live (offer_namespace, offer_name, chain) tuples in the registry and calls DeletePartialMatch on each vec for every triple that is no longer served. Both reload paths (the file-config watcher and the ServiceOffer informer via ConfigAccumulator.SetRoutes) funnel through load(), so one hook covers everything and the informer needs no separate OnDelete handler.

Test: TestVerifier_Reload_PrunesDeletedOfferSeries stamps two offers' worth of metrics with paid requests, reloads with one removed, asserts the removed offer is gone from all six vecs (requests_total, payment_required_total, payment_verified_total, payment_failed_total, charged_requests_total, last_payment_success_seconds) while the kept offer survives.

Also: fixed a guard test (TestX402Manifest_UsesServiceOfferControllerModel) that was still asserting "no ServiceMonitor in this manifest" after the prior commit intentionally relocated one in. Flipped to assert presence so a future cleanup can't silently drop it.
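The prune-on-reload idea described in this comment can be sketched without the Prometheus client. In the sketch below, registry is a plain map standing in for the verifier's metric vecs, and offerKey, prune, and the demo-retired offer are hypothetical names. The real change deletes series through DeletePartialMatch on each vec, which this stand-in only imitates.

```go
package main

import "fmt"

// offerKey is the (offer_namespace, offer_name, chain) triple that scopes
// one offer's series. Hypothetical type for illustration.
type offerKey struct {
	Namespace, Name, Chain string
}

// registry is a stand-in for the metric vecs: metric name -> series by key.
type registry map[string]map[offerKey]float64

// prune deletes every series whose key is absent from keep and returns the
// number of series removed across all metrics.
func (r registry) prune(keep []offerKey) int {
	keepSet := make(map[offerKey]struct{}, len(keep))
	for _, k := range keep {
		keepSet[k] = struct{}{}
	}
	deleted := 0
	for _, series := range r {
		for k := range series {
			if _, ok := keepSet[k]; !ok {
				delete(series, k) // deleting while ranging over a map is safe in Go
				deleted++
			}
		}
	}
	return deleted
}

func main() {
	hello := offerKey{"demo", "demo-hello", "eip155:84532"}
	retired := offerKey{"demo", "demo-retired", "eip155:84532"} // hypothetical offer

	reg := registry{
		"charged_requests_total":       {hello: 2, retired: 5},
		"last_payment_success_seconds": {hello: 1779554031, retired: 1779550000},
	}

	// Reload with only demo-hello still served.
	removed := reg.prune([]offerKey{hello})
	fmt.Println("series removed:", removed)
	fmt.Println("series left per metric:",
		len(reg["charged_requests_total"]),
		len(reg["last_payment_success_seconds"]))
}
```

Building the keep-set from the incoming routes and deleting everything else means the caller never has to know which offer was removed or why, which is what lets a single hook in load() cover both reload paths.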

@TateLyman

One accounting nit before merge: routes that inherit the global chain still emit chain="".

In this PR, matchPaidRouteFull correctly resolves chainName := cfg.Chain when rule.Network == "", but the metric label still comes from prometheusLabels(rule), which returns "chain": rule.Network at internal/x402/verifier.go:464-473. The reload keep-set uses the same raw value at internal/x402/verifier.go:72-78, and the new prune test currently asserts chain: "" for default-chain routes (internal/x402/verifier_test.go:953-954).

That means the new per-chain recordings can produce a blank-chain bucket for the normal/default route case instead of base-sepolia / eip155:84532 / whichever canonical label the frontend expects. It also makes dashboards mix "route inherited default chain" with "route genuinely has no chain", which weakens the EarningsStrip/WalletStrip story in the PR description.

Suggested shape: thread the resolved chainName (or chain.CAIP2Network, if the UI wants CAIP-2) into the labels and the prune keep-key, then flip the regression test from chain: "" to the resolved default chain. The buyer-side proxy already passes cfg.Network into its label helper, so this is mostly a verifier-side inherited-chain case.
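A minimal sketch of the suggested shape follows, assuming trimmed stand-ins for the real types. rule.Network and cfg.Chain come from the discussion above, while the OfferNamespace and OfferName field names, the resolveChain helper, and the two-argument prometheusLabels signature are assumptions for illustration only.

```go
package main

import "fmt"

// Config is a trimmed, hypothetical stand-in for the verifier config.
type Config struct {
	Chain string // global default chain
}

// RouteRule is a trimmed, hypothetical stand-in for a paid route rule.
type RouteRule struct {
	OfferNamespace string
	OfferName      string
	Network        string // empty means "inherit the global chain"
}

// resolveChain returns the rule's own network, falling back to the global
// chain when the rule does not set one.
func resolveChain(cfg Config, rule RouteRule) string {
	if rule.Network != "" {
		return rule.Network
	}
	return cfg.Chain
}

// prometheusLabels builds the label set from the resolved chain, so that
// inherited-chain routes never emit chain="".
func prometheusLabels(cfg Config, rule RouteRule) map[string]string {
	return map[string]string{
		"offer_namespace": rule.OfferNamespace,
		"offer_name":      rule.OfferName,
		"chain":           resolveChain(cfg, rule),
	}
}

func main() {
	cfg := Config{Chain: "base-sepolia"}

	inherited := RouteRule{OfferNamespace: "demo", OfferName: "demo-hello"}
	explicit := RouteRule{OfferNamespace: "demo", OfferName: "demo-main", Network: "base"}

	fmt.Println(prometheusLabels(cfg, inherited)["chain"]) // inherited default
	fmt.Println(prometheusLabels(cfg, explicit)["chain"])  // explicit override
}
```

Using the same resolveChain result for both the metric labels and the prune keep-key would keep the two in agreement, so a default-chain route is neither mislabeled nor pruned by mistake.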

@bussyjd
Collaborator Author

bussyjd commented May 24, 2026

Bundle PR #536 includes this PR's commits along with the full marketplace architecture-review roadmap (#515-#535, minus #526 obsoleted by #535, minus #514 already merged). Leaving #513 open per request — close manually if the bundle merge satisfies it.

@bussyjd
Collaborator Author

bussyjd commented May 24, 2026

Closing as superseded by #536. The head of feat/x402-marketplace-metrics is already included in feat/marketplace-bundle, and #536 is now the single review/merge target for the marketplace bundle.

@bussyjd bussyjd closed this May 24, 2026