feat: x402 marketplace + architecture review bundle (#513-#535) by bussyjd · Pull Request #536 · ObolNetwork/obol-stack

bussyjd · 2026-05-24T09:13:06Z

Summary

Consolidates 22 stacked PRs (#513-#535, excluding #514 already merged and #526 obsoleted by #535) into a single mergeable bundle. Replaces the fragile cross-branch stack with one merge target.

Collapse update:

fix: resolve marketplace bundle architecture blockers #541 has been merged into feat/marketplace-bundle as the architecture-review fixup.
fix(rbac): grant frontend read access to PurchaseRequest + RegistrationRequest #540 is closed as superseded by fix: resolve marketplace bundle architecture blockers #541; its RBAC permission change is included in the merged fixup.
feat(x402): chain label, last-settlement gauge, PurchaseAutoRefill sync + verifier PodMonitor + obol-frontend RBAC for purchaserequests #513 is closed as superseded; feat/x402-marketplace-metrics is already an ancestor of this bundle branch.

Original PR roadmap:

Architecture review base: feat(x402): chain label, last-settlement gauge, PurchaseAutoRefill sync + verifier PodMonitor + obol-frontend RBAC for purchaserequests #513 (x402 marketplace metrics)
Verifier hardening: fix(x402): verifier replicas: 2 → 1 to keep metric GC correct #515, fix(x402): gate verifier /readyz on informer cache sync #519, fix(x402): fail-closed when URI is under a paid prefix but no rule matches #524 (single-replica GC, /readyz informer sync, fail-closed on paid prefix)
Prometheus observability: fix(x402-metrics): align Prometheus retention with recording-rule windows; rename mis-named lifetime rule #516, fix(prometheus-rules): escape PromQL $labels for Helm rendering #527, fix(prometheus-rules): use increase() for the per-offer revenue rule #530, feat(x402-metrics): add asset_symbol label for per-token queries #531, fix(prometheus-rules): use epsilon floor not 1.0 to avoid under-reporting low-traffic alerts #532 (retention alignment, Helm escape, increase() counter math, asset_symbol label, clamp_min epsilon)
Controller safety: feat(controller): wire client-go leader-election so HA scaling is safe #518 (leader election), feat(monetizeapi): controller-gen as canonical CRD schema source #525 (controller-gen)
Image + deploy hygiene: chore(images): digest-pin verifier, controller, litellm, cloudflared #517 (digest pinning), refactor(x402): drive verifier deployment from helmfile, not Go-side kubectl apply #520 (helmfile-driven verifier), refactor: relocate remaining bedag/raw helmfile releases into base chart #523/docs(migration): bedag/raw → base release ownership transfer script #528 (bedag/raw cleanup), ci: add helm-template-smoke job to catch chart-render parse errors #533 (helm-template CI)
Pod Security Standard: feat(security): Restricted Pod Security Standard across embedded workloads #521, fix(controller/render): Restricted PSS securityContext on httpd workloads #529
Buyer state durability: fix(x402-buyer): persist consumed-nonce state to PVC instead of emptyDir #522 (PVC-backed nonce state)
Pause -> drain redesign: feat(monetize): replace pause annotation with ERC-8004-friendly drain #535 (replaces feat(api): spec.paused + metav1.Condition with listType=map (CRD v1alpha2-prep) #526 - ERC-8004-friendly drain instead of route on/off toggle)
Observability doc: docs(observability): record the thin-layer architecture decisions #534

Merge-commit graph

git log --first-parent --oneline feat/marketplace-bundle ^main

Conflicts resolved

internal/embed/infrastructure/base/templates/serviceoffer-crd.yaml (feat(monetize): replace pause annotation with ERC-8004-friendly drain #535 merge): kept HEAD's controller-gen output structure from feat(monetizeapi): controller-gen as canonical CRD schema source #525; regenerated all CRDs via controller-gen with merged types.go (which contains DrainAt fields).
internal/embed/infrastructure/base/templates/llm.yaml (feat(x402): chain label, last-settlement gauge, PurchaseAutoRefill sync + verifier PodMonitor + obol-frontend RBAC for purchaserequests #513 merge): combined refactor: relocate remaining bedag/raw helmfile releases into base chart #523 relocation comment with feat(x402): chain label, last-settlement gauge, PurchaseAutoRefill sync + verifier PodMonitor + obol-frontend RBAC for purchaserequests #513 PodMonitor rationale; kept app: litellm label.
internal/embed/infrastructure/helmfile.yaml (feat(x402): chain label, last-settlement gauge, PurchaseAutoRefill sync + verifier PodMonitor + obol-frontend RBAC for purchaserequests #513 merge): kept refactor: relocate remaining bedag/raw helmfile releases into base chart #523's removal of bedag/raw releases (already moved into base templates); kept feat(x402): chain label, last-settlement gauge, PurchaseAutoRefill sync + verifier PodMonitor + obol-frontend RBAC for purchaserequests #513's NOTE explaining the obol-frontend RBAC relocation.
internal/monetizeapi/types.go (feat(x402): chain label, last-settlement gauge, PurchaseAutoRefill sync + verifier PodMonitor + obol-frontend RBAC for purchaserequests #513 merge): kept HEAD (feat(monetizeapi): controller-gen as canonical CRD schema source #525) kubebuilder annotations on PurchaseAutoRefill; regenerated CRDs.
internal/x402/verifier.go (fix(x402): fail-closed when URI is under a paid prefix but no rule matches #524 merge): kept both Verifier fields (routesLoaded from fix(x402): gate verifier /readyz on informer cache sync #519 + paidPrefixes from fix(x402): fail-closed when URI is under a paid prefix but no rule matches #524); both behaviors land.

Additional fix:

internal/stack/stack_test.go: loosened TestLLMTemplate_IncludesPaidRouteAndBuyerSidecar to accept multi-line emptyDir: after feat(security): Restricted Pod Security Standard across embedded workloads #521's PSS sweep added sizeLimit values.
fix: resolve marketplace bundle architecture blockers #541: resolved bundle-level architecture blockers found during review: duplicate frontend RBAC ownership, x402 runtime ConfigMap preservation, ERC-8004 drain visibility, leader-failure semantics, stale route teardown, observability docs drift, Helm duplicate-object CI, and pre-release ownership-warning cleanup.

Test plan

go build ./... clean
go test ./... green except pre-existing TestWarnIfNoChatModel_EmitsWarnWhenNoModels (unrelated; message text drift)
go test -tags integration ./internal/openclaw/... (needs cluster)
release-smoke flow-07 (validated separately)
release-smoke flow-08, 11, 13, 14, 15 (separate test run)
Deploy to live cluster, exercise drain + EarningsStrip + asset_symbol metrics

Closes

Closes #513 #515 #516 #517 #518 #519 #520 #521 #522 #523 #524 #525 #527 #528 #529 #530 #531 #532 #533 #534 #535

Supersedes #526 (replaced by drain semantics in #535).

…oRefill

Unblocks per-chain earnings/spend aggregation in the frontend's My Listings and My Purchases pages.

- obol_x402_buyer_* metrics now carry a `chain` label sourced from UpstreamConfig.Network (already in payload, just wasn't on the labels).
- obol_x402_verifier_* metrics now carry a `chain` label sourced from RouteRule.Network. Existing verifier metric tests updated to assert the new label (empty string when no Network is set on the rule).
- internal/monetizeapi/types.go PurchaseAutoRefill struct now mirrors the CRD spec (purchaserequest-crd.yaml lines 93-96) by including MaxTotal + MaxSpendPerDay. The CRD already accepts these, the Go types just weren't reading them.

Together this means the frontend can soon switch the EarningsStrip / WalletStrip from zeroed placeholders to real PromQL aggregates such as:

sum by (chain) (increase(obol_x402_buyer_payment_success_total[7d]))

Closes the data loop for the frontend My Listings EarningsStrip + the "Last settlement" timestamp the design canvas wants. - New gauge obol_x402_verifier_last_payment_success_seconds, labeled by (route, offer_namespace, offer_name, chain). Stamped via SetToCurrentTime() in both ForwardAuth and proxy-mode paths whenever a paid request reaches the seller successfully. - helmfile.yaml grows an x402-verifier PodMonitor (the namespace was previously scraping only litellm-x402-buyer). Same release: monitoring label so kube-prometheus-stack picks it up. The frontend already has matching consumers (chargedSalesByOfferAndChain, chargedRequests24hByOffer, lastSettlementByOffer in PrometheusClient) — without this scrape the metrics never reach the dashboard.

…Refill Closes the test gap left open by the recent chain-label + last-settlement gauge work. 14 new subtests across three packages plus four pre-existing buyer-proxy assertions updated to carry the new chain label. New tests: - internal/x402/verifier_test.go TestVerifier_LastPaymentSuccessGauge (3 subtests): successful payment stamps gauge within ±5s of time.Now(), unpaid 402 leaves it untouched, rejected payment leaves it untouched. findVerifierMetricValue helper for time-window assertions. - internal/x402/buyer/metrics_test.go TestPrometheusLabels_ChainPropagation (3 subtests): base-sepolia / base mainnet / empty chain. TestMetrics_ChainLabelScrapeRoundtrip (2 subtests): scrape /metrics through the registry, assert every counter + gauge series carries the expected chain label. - internal/monetizeapi/types_test.go TestPurchaseAutoRefill_JSONRoundTrip (5 subtests): full population, only new caps, all-zero omitempty, single fields. TestPurchaseAutoRefill_UnmarshalAcceptsCRDForm: catches json-tag drift between the Go struct and CRD spec. Pre-existing fix: - internal/x402/buyer/proxy_test.go — four TestProxy_* assertions had label maps without `chain`; tests use Network "base-sepolia" so the expected chain is now spelled out alongside upstream + remote_model. RBAC: - helmfile.yaml: obol-frontend ClusterRole grows read access for purchaserequests + purchaserequests/status (frontend My Purchases needs list; agent buy.py + controller remain the only writers). Live-patched into the running cluster too.

…ording rules Phase 1 + Phase 2 hardening on top of the chain-label/last-settlement work, incorporating findings from the 4-agent K8s architecture review. Skips the auth-on-mutating-endpoints item per operator clarification: the obol-stack frontend is local-only behind the obol.stack hostname restriction, so it's not the primary trust boundary. RBAC trims: - Drop `secrets get/list` from obol-frontend-openclaw-discovery ClusterRole; pre-existing dangling grant, no code reads them. - Drop /status subresource from purchaserequests rule; frontend never writes status (only the controller does). Monitoring + RBAC co-location (kills 3 bedag/raw helmfile releases): - x402-verifier: PodMonitor -> ServiceMonitor in base/templates/x402.yaml. Verifier has a stable Service on port http:8080; ServiceMonitor scrapes the endpoint cleanly across replicas. - litellm-x402-buyer: PodMonitor moved into base/templates/llm.yaml. Stays a PodMonitor because the sidecar's port 8402 is per-pod, not fronted by a Service. - obol-frontend RBAC moved into base/templates/obol-frontend-rbac.yaml next to the workload it grants. Label cardinality: - Drop `route` label from verifier metrics. (offer_namespace, offer_name, chain) already uniquely scopes a paid route; `route` (= rule.Pattern) was redundant and unbounded by path fragments. PrometheusRule (new base/templates/x402-prometheus-rules.yaml): - Recording: x402:revenue:24h_by_offer_chain, x402:revenue:7d_by_offer_chain, x402:revenue:lifetime_by_offer, x402:settlement_rate:1h_by_offer_chain. The frontend's PrometheusClient reads these so renaming raw metrics no longer breaks the UI, and the `increase()` 2-sample minimum no longer leaves cold offers at "0" for the first 30s of traffic. - Alerting: X402PaymentFailureRateHigh (>10% over 1h), X402NoSettlementsAfterChallenge (402s issued, no charges). 
Deferred (out of scope for this hardening pass): - Frontend-egress NetworkPolicy: on k3s + Flannel the apiserver Service endpoints point at the host process, outside the cluster pod/service CIDRs. A clean allowlist policy can't target the apiserver portably without an install-specific ipBlock; revisit when obol-stack ships a non-k3s deployment surface. - obol-marketplace-api aggregator service: overkill for the local single-operator context. - Three-deployment-paths consolidation (helmfile + bedag/raw + Go `EnsureVerifier`): larger refactor; tracked as separate workstream. Live validation: - 2 paid requests against demo-hello survive both the RBAC trims and the ServiceMonitor swap. `x402:revenue:7d_by_offer_chain` returns 1.0076 for chain=eip155:84532 (matches the underlying obol_x402_verifier_charged_requests_total counter at value 2 over 2 samples). - /api/marketplace/purchases still returns 200 after dropping the /status grant. - /api/agents/wallets returns the agent wallet via the new batched listAllWalletMetadata path (1 ConfigMap list vs N+1 per-instance).

The verifier's per-offer counters and the last_payment_success_seconds gauge were created on first use and never removed. Deleting an offer (via `obol sell delete`, ServiceOffer CR deletion, or pricing config edit) left stale series in the registry forever, which:

* pollutes My Listings / dashboards with rows for offers that no longer exist,
* lets X402NoSettlementsAfterChallenge keep referencing dead labels,
* silently inflates the "last successful charge" gauge with timestamps from offers the operator already retired.

Verifier.load() now diffs the incoming route set against the live label tuples in the registry and calls DeletePartialMatch on each vec for every (offer_namespace, offer_name, chain) triple that is no longer served. Both reload paths (file config watcher and the kube ServiceOffer informer via ConfigAccumulator) funnel through load(), so one hook covers everything.

Also fixes a guard test from the prior hardening commit that was still asserting the old "no ServiceMonitor here" invariant after we intentionally relocated the ServiceMonitor into this manifest. Flipped to assert presence so a future cleanup can't silently drop it.

Test: TestVerifier_Reload_PrunesDeletedOfferSeries stamps two offers' worth of metrics, reloads with one removed, and asserts the removed offer is gone from all six vecs while the kept offer survives.
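The diff-and-prune step can be sketched with plain maps. This is a minimal illustration, not the verifier's code: `offerKey`, `staleKeys`, and `prune` are hypothetical names, and a `map` stands in for the Prometheus metric vecs that the real change prunes via DeletePartialMatch.

```go
package main

import "fmt"

// offerKey is the (offer_namespace, offer_name, chain) triple that scopes
// a paid route's metric series. Hypothetical type for illustration.
type offerKey struct {
	Namespace, Name, Chain string
}

// staleKeys returns every key that still has a live series but is absent
// from the freshly loaded route set.
func staleKeys(live map[offerKey]float64, served []offerKey) []offerKey {
	keep := make(map[offerKey]struct{}, len(served))
	for _, k := range served {
		keep[k] = struct{}{}
	}
	var stale []offerKey
	for k := range live {
		if _, ok := keep[k]; !ok {
			stale = append(stale, k)
		}
	}
	return stale
}

// prune removes the stale series in place, mimicking a per-vec delete.
func prune(live map[offerKey]float64, served []offerKey) {
	for _, k := range staleKeys(live, served) {
		delete(live, k)
	}
}

func main() {
	kept := offerKey{"demo", "hello", "eip155:84532"}
	gone := offerKey{"demo", "retired", "eip155:84532"}
	live := map[offerKey]float64{kept: 2, gone: 7}

	prune(live, []offerKey{kept}) // reload with one offer removed
	fmt.Println(len(live), live[kept])
}
```

Running the prune from the single reload hook is what lets one code path cover both the file watcher and the informer.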

Commit 0fbb99a (fix(x402): GC verifier metric series for deleted offers) added pruneSeriesNotIn to Verifier.load. Each verifier pod runs its own informer + its own metric registry, so the GC is per-pod. With replicas: 2 + ServiceMonitor (round-robin scrape over Endpoints), Prometheus sees:

* one pod's registry on scrape N (pruned correctly),
* the other pod's on scrape N+1 (may still hold a deleted offer's series until that pod's informer also sees the delete).

Result: deleted offers' last_payment_success_seconds gauge and charged_requests_total counters reappear every other scrape, polluting dashboards and creating spurious alert state.

Cheapest correct fix is replicas: 1. The verifier is on the request path but single-node k3d gains no HA from 2 replicas. Drop the PodDisruptionBudget too: minAvailable:1 at replicas:1 just blocks voluntary drains on the only pod, useless on k3d.

If/when the stack ever runs multi-node and HA replicas are wanted, the right pattern is ServiceMonitor → PodMonitor with a `pod` label and recording rules using `sum without(pod)`. That's a future change; right now correctness > theoretical HA.

…dows; rename mis-named lifetime rule

Two related metric-correctness fixes layered on top of the recording rules added in 27e1ac5:

1. Retention 6h → 8d. The recording rules added in 27e1ac5 use [24h] and [7d] windows. `increase(x[24h])` against a 6h-retention TSDB silently returns "last 6h extrapolated to 24h" with no error. The frontend displays that result as "24h revenue", which is wrong by 4x. 8d (= 7d + 1d safety margin) keeps the [7d] rule valid across a brief Prometheus outage.
2. `x402:revenue:lifetime_by_offer` → `x402:revenue:total_by_offer_current`. The original expression was `sum(counter)` (not `sum(increase[lifetime])`), so it:
   * is NOT lifetime: it's "sum across currently-alive verifier replicas of their since-last-restart counts",
   * drops ~50% on every replica rollout,
   * compounds with the per-pod-registry issue addressed by the replicas:1 fix.

   Renaming makes the semantic explicit. True lifetime queries should use `sum_over_time(...[Nd])` against a long-retention store.

Retention bump increases Prometheus disk footprint roughly proportional to (8d/6h) ≈ 32x. The local-only kube-prometheus-stack PVC sizing in monitoring.yaml.gotmpl needs review on next `obol stack up` if disk pressure shows up; currently no PVC size cap is set, so it inherits the storageClass default.
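The window-versus-retention arithmetic above can be checked with a few lines of Go. This is only a sanity-check sketch; `coveredBy` and the constant names are hypothetical and not part of the stack.

```go
package main

import (
	"fmt"
	"time"
)

const (
	day          = 24 * time.Hour
	oldRetention = 6 * time.Hour
	newRetention = 8 * day
)

// coveredBy reports whether a range-vector window fits inside the TSDB
// retention, i.e. whether samples for the whole window can still exist.
func coveredBy(window, retention time.Duration) bool {
	return window <= retention
}

func main() {
	for _, w := range []time.Duration{day, 7 * day} {
		fmt.Printf("window=%v old=%v new=%v\n",
			w, coveredBy(w, oldRetention), coveredBy(w, newRetention))
	}
	// The [24h] window is 4x the old retention; the new retention is 32x the old.
	fmt.Println(int64(day/oldRetention), int64(newRetention/oldRetention))
}
```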

@sha256

Extends the @sha256 digest discipline that x402-buyer and the frontend already carry to the remaining four images that ship as part of the embedded infrastructure. Tag-only refs (e.g. ghcr.io/obolnetwork/x402-verifier:b13254e) are vulnerable to mutable-tag rewrites, the class of bug CLAUDE.md pitfall #12 documented as a real production fire.

Pins:

- x402-verifier:b13254e @ sha256:a8a7aa0ca4c35b0ddf6983fa6e3e5f8a3f64e44d8e506ebfd55e39de2bc0342d
- serviceoffer-controller:b13254e @ sha256:f83bd7e55bdc5d87edb49c04e7fd9257097364e2d43e769c19dfd7c8b47d07af
- litellm:sha-c16b156 @ sha256:9f112b51ac5a57d73cdd54103fb98d24eabaddd8689a9a285884dca6456dc86e
- cloudflared:2026.3.0 @ sha256:6b599ca3e974349ead3286d178da61d291961182ec3fe9c505e1dd02c8ac31b0

Adds a regression test asserting every embedded manifest carries @sha256: on its image refs so a future dependency bump can't silently revert to tag-only.

Dev-rewrite invariant (defaults.go:124 + setup.go:74 alternation regex) verified intact via go test ./internal/defaults/... ./internal/x402/...
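A digest-pin check of the kind the regression test performs might look like the sketch below. The regular expression and the `isDigestPinned` helper are assumptions for illustration; the actual test in the repository may parse manifests differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// digestPinned matches an image reference that ends in a sha256 digest
// (64 lowercase hex characters). Hypothetical pattern for illustration.
var digestPinned = regexp.MustCompile(`@sha256:[0-9a-f]{64}$`)

// isDigestPinned reports whether ref is pinned by digest rather than tag only.
func isDigestPinned(ref string) bool {
	return digestPinned.MatchString(ref)
}

func main() {
	refs := []string{
		"ghcr.io/obolnetwork/x402-verifier:b13254e",
		"ghcr.io/obolnetwork/x402-verifier:b13254e@sha256:a8a7aa0ca4c35b0ddf6983fa6e3e5f8a3f64e44d8e506ebfd55e39de2bc0342d",
	}
	for _, ref := range refs {
		fmt.Println(isDigestPinned(ref), ref)
	}
}
```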

Today the serviceoffer-controller is pinned at replicas: 1 with a "Do not scale" comment in x402.yaml. The RBAC for leases is already granted (x402.yaml:176-178), pre-positioned and unused. An accidental `kubectl scale --replicas=2` or HPA misconfiguration produces split-brain finalizers and double on-chain ERC-8004 registration (real gas spend + duplicate registry entries).

This wires client-go tools/leaderelection so multi-replica deployment is safe-by-correctness, not safe-by-comment.

- cmd/serviceoffer-controller/main.go:
  - Read POD_NAME / POD_NAMESPACE from downward API env.
  - Acquire Lease "serviceoffer-controller" in POD_NAMESPACE before running the reconcile loop.
  - On lost leadership, os.Exit(1); kubelet restarts the pod, which re-elects from scratch.
  - --leader-elect flag (default true) so local dev can bypass.
- x402.yaml:
  - Add downward-API POD_NAME env to the controller Deployment (POD_NAMESPACE was already wired).
  - Update the "Do not scale" comment to "Single replica by default; bumping to 2+ is now safe — leader election prevents split-brain on the reconcile loop."
- Lease parameters chosen for fast failover on k3d (lease=30s, renew=20s, retry=5s). Tunable via flag if a multi-zone deployment ever needs longer.

Uses client-go directly rather than controller-runtime Manager to minimize churn: the controller is currently raw client-go workqueues, not controller-runtime. Migration to controller-runtime is a separate, much larger workstream and not necessary just for leader election.
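The chosen lease parameters can be sanity-checked against the ordering constraints that client-go's leaderelection package enforces when an elector is constructed (lease duration above renew deadline, renew deadline above retry period times its 1.2 jitter factor). The sketch below re-implements that check with the standard library only; `leaseTiming`, `validate`, and `k3dTiming` are hypothetical names, not the controller's code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// jitterFactor mirrors the JitterFactor constant (1.2) used by client-go's
// leaderelection package.
const jitterFactor = 1.2

// leaseTiming holds the three knobs named in the commit message.
type leaseTiming struct {
	LeaseDuration time.Duration
	RenewDeadline time.Duration
	RetryPeriod   time.Duration
}

// validate applies the same ordering rules as client-go's elector setup.
func (t leaseTiming) validate() error {
	if t.LeaseDuration <= t.RenewDeadline {
		return errors.New("leaseDuration must be greater than renewDeadline")
	}
	if t.RenewDeadline <= time.Duration(jitterFactor*float64(t.RetryPeriod)) {
		return errors.New("renewDeadline must be greater than retryPeriod*jitterFactor")
	}
	return nil
}

// k3dTiming is the lease=30s, renew=20s, retry=5s combination from the text.
var k3dTiming = leaseTiming{
	LeaseDuration: 30 * time.Second,
	RenewDeadline: 20 * time.Second,
	RetryPeriod:   5 * time.Second,
}

func main() {
	fmt.Println("k3d timing valid:", k3dTiming.validate() == nil)
}
```

A flag-driven override for multi-zone deployments would need to pass the same check before the elector starts.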

Closes the root cause of CLAUDE.md pitfall #14 ("first-request flake on freshly-deployed verifier"). Previously /readyz returned 200 the moment config.Load() became non-nil, but routes from the ServiceOffer informer load later. Between those two events the pod is Ready from kubelet's view, receives Service traffic, and matchPaidRoute returns "no rule -> 200" for paid routes. The release-smoke flows hide this behind 12x5s retry loops; the actual fix is to not be Ready until routes are loaded.

- Adds routesLoaded atomic.Bool to Verifier.
- HandleReadyz returns 503 until BOTH config and routes loaded, with a body that distinguishes the two cases for kubectl describe debuggability.
- WatchServiceOffers takes an optional onFirstApply callback, invoked after the post-WaitForCacheSync refresh succeeds.
- main.go wires v.MarkRoutesLoaded as the callback for kube source, or invokes it directly after NewVerifier for file source (the file source has no informer; routes are loaded synchronously).

Pairs with PR #515 (replicas: 1): at single replica the rollout window for this race shrinks from "some scrapes" to "first ~5-10s", but it's still a bug; this PR closes it.
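The two-flag readiness gate can be sketched with the standard library alone. The `readiness` type, its field names, and the response bodies are illustrative assumptions; only the idea (503 until both config and routes are loaded, with distinguishable bodies) comes from the commit message.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readiness tracks the two independent conditions the probe waits for.
type readiness struct {
	configLoaded atomic.Bool
	routesLoaded atomic.Bool
}

// handleReadyz answers 503 with a distinct body for each missing condition
// and 200 only once both are satisfied.
func (r *readiness) handleReadyz(w http.ResponseWriter, _ *http.Request) {
	switch {
	case !r.configLoaded.Load():
		http.Error(w, "config not loaded", http.StatusServiceUnavailable)
	case !r.routesLoaded.Load():
		http.Error(w, "routes not loaded", http.StatusServiceUnavailable)
	default:
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
}

// probe issues one in-process request and returns the status code.
func probe(r *readiness) int {
	rec := httptest.NewRecorder()
	r.handleReadyz(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	var r readiness
	fmt.Println(probe(&r)) // neither condition met
	r.configLoaded.Store(true)
	fmt.Println(probe(&r)) // config only
	r.routesLoaded.Store(true)
	fmt.Println(probe(&r)) // both met
}
```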

…kubectl apply

Kills CLAUDE.md pitfall #9 forever. The previous code path had two problems that compounded:

1. EnsureVerifier did kubectl apply of embed.FS x402.yaml directly, overwriting whatever helmfile had installed. Under OBOL_DEVELOPMENT=true, this stripped local-build image pins back to registry-pinned digests, silently bypassing every dev edit to the verifier.
2. To work around (1), setup.go carried a DUPLICATE copy of the image-pin rewrite regex from internal/defaults/defaults.go (with a code comment confessing "duplicated here to avoid an import cycle"). Every fix to the regex (e.g. pitfall #12's alternation-order fix) had to be applied in two places, which is exactly the kind of footgun that produces silent bypasses.

Now EnsureVerifier shells out to helmfile --selector name=base sync against the helmfile state already used by obol stack up. Since helmfile reads the manifests from $OBOL_CONFIG_DIR/defaults/ (which is populated by defaults.CopyInfrastructure with the canonical regex already applied), the dev-rewrite happens exactly once, in exactly one place.

- Deletes the duplicate devLocallyBuiltImageBases + regex from internal/x402/setup.go.
- EnsureVerifier now: RefreshInfrastructureIfChanged(); helmfile sync --selector name=base.
- Deletes internal/x402/manifest_devmode_test.go; the canonical regression test is internal/defaults/defaults_test.go::TestCopyInfrastructure_DevModeRewritesDigestPins, which still guards the rewrite at its single source.
- Adds a structural test (setup_structure_test.go) asserting setup.go does not import the regexp package, making re-introduction of the duplicate fail at test time.

The duplicate-regex footgun is now structurally impossible to re-introduce.

…loads Brings every embedded Deployment shipped by obol-stack up to PSS Restricted: - runAsNonRoot: true with fixed non-zero UID/GID (65532) - allowPrivilegeEscalation: false - capabilities.drop: [ALL] - seccompProfile: RuntimeDefault - readOnlyRootFilesystem: true (with named emptyDir mounts where Python needs writeable /tmp and HOME/.cache) PSS labels (enforce=restricted, audit/warn=restricted) added to the x402 and llm namespaces so future Deployment edits that omit per-pod securityContext are rejected at admission. Also switches the serviceoffer-controller Dockerfile from gcr.io/distroless/static-debian12 (UID 0) to ...:nonroot (UID 65532). Container escape via a Go runtime CVE on a UID-0 / no-seccomp / no-cap-drop / RW-rootfs container was the easiest path to host pivot on k3s single-node; this closes it. Files touched: - Dockerfile.serviceoffer-controller (:nonroot base) - internal/embed/infrastructure/base/templates/x402.yaml (verifier + controller securityContext blocks, x402 ns PSS label) - internal/embed/infrastructure/base/templates/llm.yaml (litellm + x402-buyer securityContext, litellm-tmp + litellm-home emptyDir mounts with HOME/XDG_CACHE_HOME/HF_HOME redirection, llm ns PSS label) Scope notes: - local-path-provisioner lives in kube-system (k3d-managed); not relabeled per PSS guidance to skip system namespaces. - hermes-obol-agent runtime is generated dynamically by serviceoffer-controller (internal/serviceoffercontroller/agent_render.go and internal/hermes/hermes.go), not from the embedded templates; its init-hermes-perms initContainer legitimately runs as UID 0 for /data chown and is intentionally left out of this PR's scope. - cloudflared chart (internal/embed/infrastructure/cloudflared/...) is a separate Helm chart and not in this PR's file list. 
What may break: - LiteLLM with readOnlyRootFilesystem may fail if it writes outside /tmp or $HOME — watch the next release-smoke for permission-denied errors and add named emptyDir mounts for any new write paths.

Today the x402-buyer sidecar's /state directory is an emptyDir. When the litellm pod restarts (rollout, OOM, node drain), consumed.json is gone. The pre-signed auth pool reloads from the ConfigMap the controller manages, and the buyer treats every auth as unconsumed — attempting to spend nonces that the facilitator already marked used. Cascade: facilitator returns 400 "nonce already used" -> buyer 402 back to LiteLLM -> caller retry -> same 400 -> eventually buyer pool exhausted -> 503 until manual `buy.py process --all` reseeds. Fix: convert /state to a PVC backed by local-path-provisioner (the storage class already deployed via base/templates/local-path.yaml). 50Mi request; consumed.json is tiny but room left for log growth. Deployment strategy switched to Recreate because a RWO PVC can't be co-mounted during a RollingUpdate surge. Litellm is replicas: 1 so this just means rollouts have a ~5s gap instead of an overlap — acceptable. What this does NOT solve: - Multi-replica litellm. RWO PVC works only for replicas: 1; would need RWX (which local-path doesn't support — needs NFS/Longhorn) or per-replica state via StatefulSet. Out of scope; litellm has no current scaling need. - Hard node loss. local-path PVCs are node-local; if the k3d node is destroyed, state is gone (along with the rest of the cluster). For local-only operator that's the expected blast radius. PSS compatibility note: the PVC mount works under PSS Restricted as long as the buyer container runs with appropriate fsGroup. PR #12 (Restricted PSS sweep) handles that separately and will verify mount permissions when it lands.
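Why file-backed state survives a restart while in-memory state does not can be shown with a small sketch. Everything here is hypothetical: the `consumedStore` type, the JSON layout (a map of nonce to `true`), and the temp-file-plus-rename write are illustrative choices, not the buyer sidecar's actual consumed.json format.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// consumedStore keeps the set of spent nonces and mirrors it to disk.
type consumedStore struct {
	path   string
	nonces map[string]bool
}

// openStore loads existing state, or starts empty if the file is absent.
func openStore(path string) (*consumedStore, error) {
	s := &consumedStore{path: path, nonces: map[string]bool{}}
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return s, nil
	}
	if err != nil {
		return nil, err
	}
	if err := json.Unmarshal(data, &s.nonces); err != nil {
		return nil, err
	}
	return s, nil
}

// markConsumed records a nonce and persists via temp file + rename so a
// crash mid-write cannot leave a truncated state file behind.
func (s *consumedStore) markConsumed(nonce string) error {
	s.nonces[nonce] = true
	data, err := json.Marshal(s.nonces)
	if err != nil {
		return err
	}
	tmp := s.path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, s.path)
}

// restartDemo marks a nonce, reopens the store as a fresh process would,
// and reports whether the nonce is still known to be consumed.
func restartDemo() (bool, error) {
	dir, err := os.MkdirTemp("", "buyer-state")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "consumed.json")

	first, err := openStore(path)
	if err != nil {
		return false, err
	}
	if err := first.markConsumed("nonce-1"); err != nil {
		return false, err
	}
	second, err := openStore(path) // simulated pod restart
	if err != nil {
		return false, err
	}
	return second.nonces["nonce-1"], nil
}

func main() {
	ok, err := restartDemo()
	fmt.Println(ok, err)
}
```

With an emptyDir the directory itself disappears on pod replacement, so the reopen step would start from an empty set; the PVC keeps the file in place across that boundary.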

The infrastructure helmfile shipped 6 `bedag/raw` releases — a wrapper chart whose only job is to apply inline YAML through helmfile. With the `base` release already rendering every other YAML in `base/templates/`, the inline approach has zero remaining justification. This PR finishes the job by relocating all 6: - llm-buyer-podmonitor → base/templates/llm.yaml (appended) - erpc-httproute → base/templates/erpc.yaml (new file) - erpc-x402-middleware → base/templates/erpc.yaml - erpc-metadata → base/templates/erpc.yaml - obol-frontend-httproute → base/templates/obol-frontend.yaml (new file) - obol-frontend-rbac → base/templates/obol-frontend.yaml Net change to the rendering: zero. Same YAML, just sourced from the chart's templates directory instead of inlined in helmfile.yaml through the bedag/raw wrapper chart. Each relocated YAML carries a provenance comment. DAG: `base` now `needs: [traefik/traefik, monitoring/monitoring]` so the Traefik (Middleware) / Gateway API (HTTPRoute) / Prometheus operator (PodMonitor) CRDs are guaranteed present before the relocated templates apply. New Namespace docs for `erpc` and `obol-frontend` make the `base` release self-contained — the upstream chart releases that originally created those namespaces still set `createNamespace: true`, which is a no-op against an existing namespace. The `bedag` repository entry is removed (no infrastructure release uses it anymore). Network helmfiles + hermes still use bedag/raw — out of scope for this PR. `migrateDefaultsHTTPRouteHostnames` in internal/stack/stack.go targets the old in-helmfile HTTPRoute indentation pattern; it is a no-op against the relocated templates and against the new helmfile, preserved unchanged for users upgrading from older stacks. The `hostnames: ["obol.stack"]` restriction is preserved on every relocated HTTPRoute per CLAUDE.md guidance — removing it would expose the frontend / eRPC to the public cloudflared tunnel. 
`TestHelmfile_IncludesBuyerPodMonitor` rewired to read `base/templates/llm.yaml`. All embed CRD tests, stack tests, and go build are green.

…tches

Today HandleVerify returns 200 whenever matchPaidRoute returns nil. Combined with Traefik ForwardAuth's "200 = allow" semantics, this means a misconfigured Middleware on a paid route OR a code bug where the route was supposed to match but didn't silently makes the route FREE: revenue loss with no signal.

- Adds paidPrefixes atomic.Pointer[[]string] to Verifier.
- Verifier.load() derives prefixes from cfg.Routes patterns: "/services/foo/*" -> "/services/foo/" (trailing slash kept so HasPrefix doesn't false-match /services/foobar/).
- HandleVerify: when matchRoute returns nil, check if URI is under any tracked prefix. If yes -> 403. If no -> 200 (legitimately free).

Complementary to PR #519 (gate /readyz on informer sync):

- PR #519 ensures the pod isn't Ready until routes are loaded (closes the bootstrap-window leak).
- This PR ensures that after Ready, any prefix the verifier KNOWS about that doesn't have a matching rule is fail-closed (closes the steady-state-bug leak).

Together they cover the "rule should match but doesn't" gap.
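The prefix derivation and the fail-closed decision can be sketched as two small functions. The names `derivePaidPrefixes` and `statusForUnmatched` are hypothetical, and the sketch assumes patterns end in "/*" as in the example given in the commit message.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// derivePaidPrefixes turns route patterns such as "/services/foo/*" into
// prefixes such as "/services/foo/". The trailing slash is kept so that
// "/services/foobar/" does not match the "/services/foo/" prefix.
func derivePaidPrefixes(patterns []string) []string {
	prefixes := make([]string, 0, len(patterns))
	for _, p := range patterns {
		if strings.HasSuffix(p, "/*") {
			prefixes = append(prefixes, strings.TrimSuffix(p, "*"))
		}
	}
	return prefixes
}

// statusForUnmatched is the decision taken when no rule matched the URI:
// 403 under a known paid prefix (fail closed), 200 otherwise (free route).
func statusForUnmatched(uri string, paidPrefixes []string) int {
	for _, prefix := range paidPrefixes {
		if strings.HasPrefix(uri, prefix) {
			return http.StatusForbidden
		}
	}
	return http.StatusOK
}

func main() {
	prefixes := derivePaidPrefixes([]string{"/services/foo/*"})
	for _, uri := range []string{"/services/foo/v1/chat", "/services/foobar/v1", "/healthz"} {
		fmt.Println(uri, statusForUnmatched(uri, prefixes))
	}
}
```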

Closes the entire class of "CRD YAML and Go struct drifted" bugs. PurchaseAutoRefill.MaxTotal was the most recent instance: it existed in purchaserequest-crd.yaml for months while internal/monetizeapi/types.go didn't have the corresponding field. Without this commit, that pattern recurs by design: two sources of truth, one hand-maintained, no enforcement of agreement.

Now Go is the single source of truth:

- kubebuilder markers on every CRD-backed struct in types.go (validation, required, enum, pattern, printer columns, subresources)
- `just generate` regenerates *-crd.yaml from those markers + zz_generated.deepcopy.go from object:generate=true
- CI fails if `git status` is non-empty after `just generate` runs

This commit also fixes the documented MaxTotal / MaxSpendPerDay drift by adding both fields to PurchaseAutoRefill; the generated CRD now matches the prior hand-written one and the controller can read them.

Pinned controller-tools at v0.16.5 in tools/tools.go (compatible with client-go v0.34.x; a newer release would force prometheus/common through a panicking validation-scheme change). Generation is deterministic; running locally produces no diff after a clean checkout.

For future CRD edits:

1. Edit types.go (add/change a field, update markers)
2. `just generate`
3. Commit both the Go and YAML diffs
4. CI verifies the YAML was committed

PreSignedAuth.Payment is map[string]interface{} (opaque x402 payload), which controller-gen cannot deep-copy automatically; a hand-written DeepCopy lives in deepcopy_manual.go and the type is flagged object:generate=false.

The hack/boilerplate.go.txt file is force-added past *.txt gitignore; it's an empty marker for now. Add a copyright header later if the repo settles on one.

PrometheusRule annotations use {{ $labels.X }} which Prometheus evaluates at alert-firing time. When this file is rendered through Helm (via chart: ./base in helmfile.yaml), Helm's Go-template engine tries to evaluate $labels at chart-render time and fails with: Error: UPGRADE FAILED: parse error at (base-infra/templates/x402-prometheus-rules.yaml:N): undefined variable "$labels" Wrap each templated brace pair as {{ "{{" }}...{{ "}}" }} so Helm emits literal Prometheus template syntax verbatim into the YAML output, where Prometheus picks it up at alert-eval time. Bug surfaced by integration-branch full stack-up; not caught by `go test ./...` (unit tests don't render Helm) nor by the agent worktree validation (which only checked Go-side compilation). Recommend adding a CI smoke that pipes embedded *.yaml templates through `helm template ./base` to catch this class going forward. Stacks on PR #513 (which introduced the file in commit 27e1ac5).
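Because Helm renders charts with Go's text/template, the failure and the fix can be reproduced with the standard library. The `render` helper and the sample annotation strings below are illustrative assumptions; only the escaping pattern itself comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// render parses and executes src the way a Go-template engine would,
// returning the rendered text or the parse/execute error.
func render(src string) (string, error) {
	t, err := template.New("rule").Parse(src)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := t.Execute(&b, nil); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	// Unescaped: the template engine tries to resolve $labels itself.
	raw := `summary: "failures on {{ $labels.offer_name }}"`
	_, err := render(raw)
	fmt.Println("raw error:", err)

	// Escaped: each brace pair is emitted as a string literal, so the
	// Prometheus template syntax reaches the output untouched.
	escaped := `summary: "failures on {{ "{{" }} $labels.offer_name {{ "}}" }}"`
	out, err := render(escaped)
	fmt.Println(out, err)
}
```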

PR #523 moved 6 bedag/raw helmfile releases into the base chart so there's one source of truth for what ships in each namespace. Fresh installs work. EXISTING clusters being upgraded from pre-#523 obol-stack fail at `helm upgrade base` with: Error: UPGRADE FAILED: <resource> exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "base" This blocks `obol stack up` until the operator manually re-annotates ~10 resources (Namespaces, HTTPRoutes, Middlewares, ConfigMaps, PrometheusRule, PodMonitor, ClusterRole/Binding). Adds hack/migrate-bedag-raw-to-base.sh which finds all such orphans and re-annotates them in bulk. Idempotent — safe to re-run. Surfaced by the 14-PR integration test campaign; see plans/integration-test-results-final-20260524.md Bug #2.

…oads

PR #521 enforces Restricted Pod Security Standard on x402 + llm namespaces. The controller renders two httpd-based Deployments (obol-skill-md publisher + agentidentity-default-registration well-known/agent-registration.json publisher) without securityContext, so PSS admission rejects them and they never start. Result: marketplace API returns STACK_UNREACHABLE because skill-md isn't reachable.

Adds Restricted-compliant securityContext to both renderers:

- pod: runAsNonRoot, runAsUser=1000, runAsGroup=1000, seccompProfile=RuntimeDefault, fsGroup=1000
- container: allowPrivilegeEscalation=false, drop ALL capabilities

Both Deployments already bind httpd to 8080, which is non-root safe, so no port change is required.

Surfaced by the 14-PR integration test campaign. The integration test workaround patched the running Deployments manually: plans/integration-test-results-final-20260524.md Bug #3.

Recording rule was sum(counter), which is wrong for any metric where the counter resets across pod restarts: Prometheus counters are per-process by design. The TSDB is the canonical persistence layer; rate() and increase() perform reset detection at query time across the samples the TSDB holds.

- Renames the rule to x402:revenue:7d_by_offer (name matches what it returns; the old "lifetime" / "total_by_offer_current" names were aspirational against a finite retention window).
- Expression:

      sum by (offer_namespace, offer_name) (
        increase(obol_x402_verifier_charged_requests_total[7d])
      )

- 7d inside 8d retention gives 1-day headroom so reset detection has both-side samples at the window's left edge.

Per Robust Perception's "avoiding the counter-reset undercount" canonical guidance. Zero new components: uses only native Prometheus + recording-rule primitives.

Found by the 14-PR integration test (plans/integration-test-results-final-20260524.md). The OBOL parity smoke surfaced it more visibly when a verifier restart produced a "0 req·24h" UI display on a row with real on-chain traffic.

Stacks on PR #527 (Helm-escape fix for the same file).
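To make the reset-detection argument concrete, here is a simplified sketch of the arithmetic. It is not Prometheus's implementation: real increase() also extrapolates to the window boundaries, which is omitted here, and the sample values are invented.

```go
package main

import "fmt"

// lastValue is what reading the raw counter sees: only the current
// per-process value, which restarts from zero after a pod restart.
func lastValue(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	return samples[len(samples)-1]
}

// increaseWithResets sums the growth between consecutive samples. A sample
// lower than its predecessor is treated as a counter reset, so the new
// value is counted as growth from zero.
func increaseWithResets(samples []float64) float64 {
	var total float64
	for i := 1; i < len(samples); i++ {
		if samples[i] < samples[i-1] {
			total += samples[i] // reset: counter restarted at 0
		} else {
			total += samples[i] - samples[i-1]
		}
	}
	return total
}

func main() {
	// The verifier restarts between the third and fourth scrape.
	samples := []float64{10, 25, 40, 3, 12}
	fmt.Println("raw counter:", lastValue(samples))
	fmt.Println("reset-aware increase:", increaseWithResets(samples))
}
```

With these samples the raw counter reads 12 while the reset-aware increase is 42, which is the kind of undercount the old rule produced after a restart.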

Currently the verifier emits (offer_namespace, offer_name, chain). Answering "what's my OBOL revenue?" requires joining metrics with the ServiceOffer CR's spec.payment.asset.symbol at the frontend. With asset_symbol on the label set, the answer is a direct PromQL aggregation.

Cardinality cost: zero. Each offer pins exactly one asset (A=1 per offer), so the new dimension is functionally constant within the existing (ns, name) group, with no series multiplication. The "don't label what you can derive" guidance exists to prevent *multiplicative* blowups (chain x pod x pod_owner style); the single-asset-per-offer invariant means there's no multiplication to prevent.

The argument for adding asset_symbol is identical to the argument that already justifies `chain` on these vecs: both are CR-derived, both are query-meaningful, both have bounded values.

Changes:

- 6 metric vecs: label slice gains "asset_symbol"
- pruneSeriesNotIn key now (ns, name, chain, asset_symbol) so asset-repin doesn't leak the old series
- verifier.load() live-set built with the same 4-tuple
- prometheusLabels() emits rule.AssetSymbol (or "unknown" if empty as defensive fallback)
- New _asset_symbol-suffixed recording rules added side-by-side with existing rules; existing rules unchanged (non-breaking)
- Tests: emission asserts asset_symbol; prune test asserts asset-repin doesn't leak

Frontend can simplify the existing metric x CR join in a future PR once it migrates to the _asset_symbol-suffixed rule.

Findings from: plans/integration-test-L7-paid-flow-20260524.md (OBOL parity smoke surfaced this as a real gap when validating the WalletStrip / EarningsStrip per-token columns).
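The reason the prune key has to include asset_symbol can be sketched with plain maps. This is an illustration only, not the verifier's code: `seriesKey` and `pruneNotIn` are hypothetical names, the real pruneSeriesNotIn operates on Prometheus metric vecs, and the offer values are invented.

```go
package main

import "fmt"

// seriesKey is a hypothetical stand-in for the 4-tuple label set.
type seriesKey struct {
	Namespace, Name, Chain, AssetSymbol string
}

// pruneNotIn deletes every tracked series whose key is absent from the
// live set and reports how many were removed.
func pruneNotIn(tracked map[seriesKey]float64, live map[seriesKey]struct{}) int {
	removed := 0
	for k := range tracked {
		if _, ok := live[k]; !ok {
			delete(tracked, k)
			removed++
		}
	}
	return removed
}

func main() {
	// A series emitted while the offer was priced in USDC.
	tracked := map[seriesKey]float64{
		{"llm", "example-offer", "base", "USDC"}: 12,
	}
	// The offer is re-pinned to OBOL, so the live set carries the new symbol.
	live := map[seriesKey]struct{}{
		{"llm", "example-offer", "base", "OBOL"}: {},
	}
	removed := pruneNotIn(tracked, live)
	fmt.Println("removed:", removed, "remaining:", len(tracked))
}
```

If the key were only (ns, name, chain), the old USDC series would match the live set and survive the prune, which is the leak the 4-tuple avoids.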

…ting low-traffic alerts

X402PaymentFailureRateHigh and the settlement_rate recording rule used clamp_min(denominator, 1) as a div-by-zero guard. For paid endpoints under light load (sub-1 req/s), the floor of 1.0 replaces the true denominator, so numerator/denominator comes out near zero even when 50%+ of requests are failing, and the alert never fires.

Switch the floor to 1e-9. Epsilon prevents division-by-zero while keeping the actual ratio accurate at any non-zero traffic level.

Surfaced by Expert #2 review of the PromQL design (plans/integration-test-L7-paid-flow-20260524.md follow-ups).

Stacks on PR #531 (asset_symbol label) which is the tip of the rules-file chain. Will rebase onto main as the chain merges.
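A worked example of the under-reporting, as a small sketch. The rates are invented and `failureRatio` is a hypothetical helper; `clampMin` mirrors PromQL's clamp_min, which returns the larger of the value and the floor.

```go
package main

import (
	"fmt"
	"math"
)

// clampMin mirrors PromQL clamp_min(v, floor).
func clampMin(v, floor float64) float64 {
	return math.Max(v, floor)
}

// failureRatio (hypothetical helper) divides the failure rate by the total
// request rate, guarded against division by zero with the given floor.
func failureRatio(failedPerSec, totalPerSec, floor float64) float64 {
	return failedPerSec / clampMin(totalPerSec, floor)
}

func main() {
	// Light load: 0.125 req/s in total, half of them failing.
	failed, total := 0.0625, 0.125

	fmt.Println("floor 1.0: ", failureRatio(failed, total, 1))
	fmt.Println("floor 1e-9:", failureRatio(failed, total, 1e-9))

	// Zero traffic still yields a defined ratio rather than NaN.
	fmt.Println("idle:      ", failureRatio(0, 0, 1e-9))
}
```

With a floor of 1.0 the ratio comes out as 0.0625 even though half the requests fail; with the epsilon floor it is 0.5, and an idle endpoint still evaluates to 0.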

PR #527 fixed an unescaped {{ $labels }} in a PrometheusRule annotation that broke `helm upgrade base` on every `obol stack up`. The bug shipped to integration testing because go test ./... doesn't exercise Helm rendering.

This job pipes the embedded base chart through `helm template` on every PR; parse errors fail the build before merge.

- Runs against ./internal/embed/infrastructure/base
- Uses helm v3.20.1 (matches obolup.sh pinned version)
- Also runs `helm lint` for chart-structure issues
- Substitutes {{OLLAMA_HOST_IP}}/{{CLUSTER_ID}} stubs in a temp copy of the chart (mirroring what `obol stack init` does via internal/defaults/defaults.go::InfrastructureReplacements)
- Future: pair with a helmfile-lint job for state-value tests

If we ever land a chart-template change that this doesn't catch, expand the helm-template invocation with --set values mimicking what `obol stack up` provides.

After the OBOL parity smoke + Prometheus expert review, we made explicit design choices worth recording so they don't get re-litigated:

1. Counters are intentionally per-process (Prometheus design). Pod restarts reset them; rate()/increase() handle this at query time via the TSDB's reset detection. Don't add persistence to the counter itself.
2. Prometheus = recent operational telemetry (bounded by retention). On-chain settlement TXs = canonical lifetime financial record.
3. Recording rules use the convention <level>:<metric>:<operations>; name the window (7d_by_offer, not lifetime_by_offer).
4. Add labels you'd query by directly (chain, asset_symbol: both CR-derived, both query-meaningful, both bounded).
5. div-by-zero guards use epsilon (1e-9), not 1.0.
6. CRD versioning stance: stay on v1alpha1 during active dev; the alpha promise IS "no compat". Graduate only when an external operator commits to depending on the schema.

The PVC-backed counter persistence option was considered and rejected for our single-operator local-k3d use case. The doc walks through why, what would change that decision, and where the canonical "lifetime" answer comes from.

Adds CLAUDE.md pointer so future contributors land here first.

The legacy obol.org/paused annotation tore down HTTPRoutes immediately, which is indistinguishable from a crash to remote x402 buyers and ERC-8004 reputation scorers. obol sell stop was also broken: it patched status.conditions which the controller immediately overwrote.

This replaces both with a real drain:

- New ServiceOffer spec.drainAt (date-time) + spec.drainGracePeriod (duration; default 1h) mark an offer as winding down.
- While draining, /skill.md and /.well-known/agent-registration.json advertise the offer with available=false and drainEndsAt set, so external discovery can react before traffic disappears.
- The HTTPRoute + payment gate stay up until DrainEndsAt, letting in-flight buyers complete payments.
- After the grace period, the controller tears down the route, sets Draining=False reason=Drained, and leaves the CR (delete is the canonical removal command).

obol sell stop sets spec.drainAt, supports --grace <duration> and --force/--now (zero grace = abrupt teardown for behavior parity with the old annotation).

bussyjd added 30 commits May 23, 2026 20:41

merge: chore/digest-pin-cluster-images (#517) - digest-pin verifier, …

ffdd459

…controller, litellm, cloudflared

merge: feat/controller-leader-election (#518) - wire client-go leader…

b8f0e09

… election

merge: refactor/ensure-verifier-via-helmfile (#520) - drive verifier …

3693513

…deploy from helmfile

merge: feat/restricted-pss-sweep (#521) - Restricted PSS across embed…

4b58459

…ded workloads

merge: fix/x402-buyer-state-pvc (#522) - persist consumed-nonce state…

22971d7

… to PVC

github-advanced-security AI found potential problems May 24, 2026

Comment thread .github/workflows/helm-template-smoke.yml Fixed

Comment thread .github/workflows/lint-test.yaml Fixed

fix: resolve marketplace bundle architecture blockers

c3ba469

bussyjd mentioned this pull request May 24, 2026

fix: resolve marketplace bundle architecture blockers #541

Merged

bussyjd and others added 5 commits May 24, 2026 17:37

chore: remove pre-release migration script

82cbfae

docs: warn pre-release testers about stack reset

94418db

docs: clarify pre-release ownership warning

46189cd

merge: fix/marketplace-bundle-architecture-review (#541) - resolve ar…

1dbbf60

…chitecture blockers Fold architecture-review fixes into feat/marketplace-bundle before the bundle lands.

ci: restrict workflow token permissions

7453339

bussyjd wants to merge 53 commits into main from feat/marketplace-bundle
