feat: x402 marketplace + architecture review bundle (#513-#535) #536
Open
bussyjd wants to merge 53 commits into
…oRefill
Unblocks per-chain earnings/spend aggregation in the frontend's My Listings
and My Purchases pages.
- obol_x402_buyer_* metrics now carry a `chain` label sourced from
UpstreamConfig.Network (already in payload, just wasn't on the labels).
- obol_x402_verifier_* metrics now carry a `chain` label sourced from
RouteRule.Network. Existing verifier metric tests updated to assert the
new label (empty string when no Network is set on the rule).
- internal/monetizeapi/types.go PurchaseAutoRefill struct now mirrors the
CRD spec (purchaserequest-crd.yaml lines 93-96) by including MaxTotal +
MaxSpendPerDay. The CRD already accepts these; the Go types just
weren't reading them.
Together this means the frontend can soon switch the EarningsStrip /
WalletStrip from zeroed placeholders to real PromQL aggregates such as:
sum by (chain) (increase(obol_x402_buyer_payment_success_total[7d]))
Closes the data loop for the frontend My Listings EarningsStrip + the
"Last settlement" timestamp the design canvas wants.
- New gauge obol_x402_verifier_last_payment_success_seconds, labeled
by (route, offer_namespace, offer_name, chain). Stamped via
SetToCurrentTime() in both ForwardAuth and proxy-mode paths
whenever a paid request reaches the seller successfully.
- helmfile.yaml grows an x402-verifier PodMonitor (the namespace was
previously scraping only litellm-x402-buyer). It carries the same
`release: monitoring` label so kube-prometheus-stack picks it up.
The frontend already has matching consumers
(chargedSalesByOfferAndChain, chargedRequests24hByOffer,
lastSettlementByOffer in PrometheusClient) — without this scrape the
metrics never reach the dashboard.
…Refill
Closes the test gap left open by the recent chain-label + last-settlement
gauge work. 14 new subtests across three packages plus four pre-existing
buyer-proxy assertions updated to carry the new chain label.
New tests:
- internal/x402/verifier_test.go
TestVerifier_LastPaymentSuccessGauge (3 subtests):
successful payment stamps gauge within ±5s of time.Now(),
unpaid 402 leaves it untouched, rejected payment leaves it untouched.
findVerifierMetricValue helper for time-window assertions.
- internal/x402/buyer/metrics_test.go
TestPrometheusLabels_ChainPropagation (3 subtests):
base-sepolia / base mainnet / empty chain.
TestMetrics_ChainLabelScrapeRoundtrip (2 subtests):
scrape /metrics through the registry, assert every counter +
gauge series carries the expected chain label.
- internal/monetizeapi/types_test.go
TestPurchaseAutoRefill_JSONRoundTrip (5 subtests):
full population, only new caps, all-zero omitempty, single fields.
TestPurchaseAutoRefill_UnmarshalAcceptsCRDForm:
catches json-tag drift between the Go struct and CRD spec.
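A minimal sketch of the round-trip and drift checks described above. The struct here is a hypothetical stand-in: the field types (strings) and the JSON tag names (`maxTotal`, `maxSpendPerDay`) are assumptions, and `decodeStrict` is an illustrative helper, not a function from the repository.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// PurchaseAutoRefill is a stand-in for the type in internal/monetizeapi/types.go.
// Field types and JSON tag names are assumed for illustration only.
type PurchaseAutoRefill struct {
	MaxTotal       string `json:"maxTotal,omitempty"`
	MaxSpendPerDay string `json:"maxSpendPerDay,omitempty"`
}

// decodeStrict unmarshals CRD-form JSON and rejects unknown keys, so a
// tag that drifts away from the CRD spelling fails instead of being dropped.
func decodeStrict(data []byte) (PurchaseAutoRefill, error) {
	var out PurchaseAutoRefill
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	err := dec.Decode(&out)
	return out, err
}

func main() {
	in := PurchaseAutoRefill{MaxTotal: "100", MaxSpendPerDay: "10"}
	data, err := json.Marshal(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data)) // {"maxTotal":"100","maxSpendPerDay":"10"}

	out, err := decodeStrict(data)
	fmt.Println(out == in, err)

	// All-zero values vanish under omitempty.
	empty, _ := json.Marshal(PurchaseAutoRefill{})
	fmt.Println(string(empty)) // {}

	// A key spelled differently from the struct tag is reported as an error.
	_, err = decodeStrict([]byte(`{"max_total":"100"}`))
	fmt.Println(err != nil)
}
```

Rejecting unknown fields is one way to surface tag drift in a test; the actual tests may assert it differently.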
Pre-existing fix:
- internal/x402/buyer/proxy_test.go — four TestProxy_* assertions had
label maps without `chain`; tests use Network "base-sepolia" so the
expected chain is now spelled out alongside upstream + remote_model.
RBAC:
- helmfile.yaml: obol-frontend ClusterRole grows read access for
purchaserequests + purchaserequests/status (frontend My Purchases
needs list; agent buy.py + controller remain the only writers).
Live-patched into the running cluster too.
…ording rules
Phase 1 + Phase 2 hardening on top of the chain-label/last-settlement work,
incorporating findings from the 4-agent K8s architecture review. Skips the
auth-on-mutating-endpoints item per operator clarification: the obol-stack
frontend is local-only behind the obol.stack hostname restriction, so it's
not the primary trust boundary.
RBAC trims:
- Drop `secrets get/list` from obol-frontend-openclaw-discovery
ClusterRole; pre-existing dangling grant, no code reads them.
- Drop /status subresource from purchaserequests rule; frontend never
writes status (only the controller does).
Monitoring + RBAC co-location (kills 3 bedag/raw helmfile releases):
- x402-verifier: PodMonitor -> ServiceMonitor in base/templates/x402.yaml.
Verifier has a stable Service on port http:8080; ServiceMonitor scrapes
the endpoint cleanly across replicas.
- litellm-x402-buyer: PodMonitor moved into base/templates/llm.yaml.
Stays a PodMonitor because the sidecar's port 8402 is per-pod, not
fronted by a Service.
- obol-frontend RBAC moved into base/templates/obol-frontend-rbac.yaml
next to the workload it grants.
Label cardinality:
- Drop `route` label from verifier metrics. (offer_namespace, offer_name,
chain) already uniquely scopes a paid route; `route` (= rule.Pattern)
was redundant and unbounded by path fragments.
PrometheusRule (new base/templates/x402-prometheus-rules.yaml):
- Recording: x402:revenue:24h_by_offer_chain,
x402:revenue:7d_by_offer_chain, x402:revenue:lifetime_by_offer,
x402:settlement_rate:1h_by_offer_chain. The frontend's PrometheusClient
reads these so renaming raw metrics no longer breaks the UI, and the
`increase()` 2-sample minimum no longer leaves cold offers at "0" for
the first 30s of traffic.
- Alerting: X402PaymentFailureRateHigh (>10% over 1h),
X402NoSettlementsAfterChallenge (402s issued, no charges).
Deferred (out of scope for this hardening pass):
- Frontend-egress NetworkPolicy: on k3s + Flannel the apiserver Service
endpoints point at the host process, outside the cluster pod/service
CIDRs. A clean allowlist policy can't target the apiserver portably
without an install-specific ipBlock; revisit when obol-stack ships a
non-k3s deployment surface.
- obol-marketplace-api aggregator service: overkill for the local
single-operator context.
- Three-deployment-paths consolidation (helmfile + bedag/raw + Go
`EnsureVerifier`): larger refactor; tracked as separate workstream.
Live validation:
- 2 paid requests against demo-hello survive both the RBAC trims and
the ServiceMonitor swap. `x402:revenue:7d_by_offer_chain` returns
1.0076 for chain=eip155:84532 (matches the underlying
obol_x402_verifier_charged_requests_total counter at value 2 over
2 samples).
- /api/marketplace/purchases still returns 200 after dropping the
/status grant.
- /api/agents/wallets returns the agent wallet via the new batched
listAllWalletMetadata path (1 ConfigMap list vs N+1 per-instance).
The verifier's per-offer counters and the last_payment_success_seconds
gauge were created on first use and never removed. Deleting an offer
(via `obol sell delete`, ServiceOffer CR deletion, or pricing config
edit) left stale series in the registry forever, which:
* pollutes My Listings / dashboards with rows for offers that no
longer exist,
* lets X402NoSettlementsAfterChallenge keep referencing dead labels,
* silently inflates the "last successful charge" gauge with timestamps
from offers the operator already retired.
Verifier.load() now diffs the incoming route set against the live label
tuples in the registry and calls DeletePartialMatch on each vec for
every (offer_namespace, offer_name, chain) triple that is no longer
served. Both reload paths (file config watcher and the kube
ServiceOffer informer via ConfigAccumulator) funnel through load(), so
one hook covers everything.
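The diff step can be sketched with the standard library alone. `seriesKey` and `staleKeys` are hypothetical names; in the verifier each stale triple would then be handed to DeletePartialMatch on every vec, which is omitted here so the example has no external dependencies.

```go
package main

import "fmt"

// seriesKey is the (offer_namespace, offer_name, chain) triple.
type seriesKey struct {
	Namespace, Name, Chain string
}

// staleKeys returns every key in live that is absent from served,
// preserving the order of live.
func staleKeys(live, served []seriesKey) []seriesKey {
	keep := make(map[seriesKey]struct{}, len(served))
	for _, k := range served {
		keep[k] = struct{}{}
	}
	var stale []seriesKey
	for _, k := range live {
		if _, ok := keep[k]; !ok {
			stale = append(stale, k)
		}
	}
	return stale
}

func main() {
	live := []seriesKey{
		{"demo", "hello", "base-sepolia"},
		{"demo", "retired", "base-sepolia"},
	}
	served := []seriesKey{{"demo", "hello", "base-sepolia"}}

	for _, k := range staleKeys(live, served) {
		// The real reload path would delete the matching series here.
		fmt.Printf("prune %s/%s on %s\n", k.Namespace, k.Name, k.Chain)
	}
}
```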
Also fixes a guard test from the prior hardening commit that was still
asserting the old "no ServiceMonitor here" invariant after we
intentionally relocated the ServiceMonitor into this manifest. Flipped
to assert presence so a future cleanup can't silently drop it.
Test:
TestVerifier_Reload_PrunesDeletedOfferSeries stamps two offers' worth
of metrics, reloads with one removed, and asserts the removed offer
is gone from all six vecs while the kept offer survives.
Commit 0fbb99a (fix(x402): GC verifier metric series for deleted offers)
added pruneSeriesNotIn to Verifier.load. Each verifier pod runs its own
informer + its own metric registry, so the GC is per-pod.
With replicas: 2 + ServiceMonitor (round-robin scrape over Endpoints),
Prometheus sees:
* one pod's registry on scrape N (pruned correctly),
* the other pod's on scrape N+1 (may still hold a deleted offer's
series until that pod's informer also sees the delete).
Result: deleted offers' last_payment_success_seconds gauge and
charged_requests_total counters reappear every other scrape, polluting
dashboards and creating spurious alert state.
Cheapest correct fix is replicas: 1. The verifier is on the request
path but single-node k3d gains no HA from 2 replicas. Drop the
PodDisruptionBudget too: minAvailable:1 at replicas:1 just blocks
voluntary drains on the only pod, useless on k3d.
If/when the stack ever runs multi-node and HA replicas are wanted, the
right pattern is ServiceMonitor → PodMonitor with a `pod` label and
recording rules using `sum without(pod)`. That's a future change; right
now correctness > theoretical HA.
…dows; rename mis-named lifetime rule
Two related metric-correctness fixes layered on top of the recording
rules added in 27e1ac5:
1. Retention 6h → 8d. The recording rules added in 27e1ac5 use [24h]
and [7d] windows. `increase(x[24h])` against a 6h-retention TSDB
silently computes over only the ~6h of samples that still exist, with
no error. The frontend displays that result as "24h revenue", which can
be wrong by up to 4x. 8d (= 7d + 1d safety margin) keeps the [7d] rule
valid across a brief Prometheus outage.
2. `x402:revenue:lifetime_by_offer` →
`x402:revenue:total_by_offer_current`. The original expression was
`sum(counter)` (not `sum(increase[lifetime])`), so it:
* is NOT lifetime: it's "sum across currently-alive verifier replicas
of their since-last-restart counts",
* drops ~50% on every replica rollout,
* compounds with the per-pod-registry issue addressed by the
replicas:1 fix.
Renaming makes the semantic explicit. True lifetime queries should use
`sum_over_time(...[Nd])` against a long-retention store.
Retention bump increases Prometheus disk footprint roughly proportional
to (8d/6h) = 32x. The local-only kube-prometheus-stack PVC sizing in
monitoring.yaml.gotmpl needs review on next `obol stack up` if disk
pressure shows up. Currently no PVC size cap is set, so it inherits the
storageClass default.
Extends the @sha256 digest discipline that x402-buyer and the frontend
already carry to the remaining four images that ship as part of the
embedded infrastructure. Tag-only refs (e.g.
ghcr.io/obolnetwork/x402-verifier:b13254e) are vulnerable to
mutable-tag rewrites, the class of bug CLAUDE.md pitfall #12 documented
as a real production fire.
Pins:
- x402-verifier:b13254e @ sha256:a8a7aa0ca4c35b0ddf6983fa6e3e5f8a3f64e44d8e506ebfd55e39de2bc0342d
- serviceoffer-controller:b13254e @ sha256:f83bd7e55bdc5d87edb49c04e7fd9257097364e2d43e769c19dfd7c8b47d07af
- litellm:sha-c16b156 @ sha256:9f112b51ac5a57d73cdd54103fb98d24eabaddd8689a9a285884dca6456dc86e
- cloudflared:2026.3.0 @ sha256:6b599ca3e974349ead3286d178da61d291961182ec3fe9c505e1dd02c8ac31b0
Adds a regression test asserting every embedded manifest carries
@sha256: on its image refs so a future dependency bump can't silently
revert to tag-only.
Dev-rewrite invariant (defaults.go:124 + setup.go:74 alternation regex)
verified intact via go test ./internal/defaults/... ./internal/x402/...
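A sketch of what such a digest-pin regression check could look like. The function name `unpinnedImages`, the two regular expressions, and the sample manifest (including the `example.invalid` image) are assumptions for illustration; the real test walks the embedded manifests rather than an inline string.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// imageLine matches a YAML `image:` key (optionally a list item) and
// captures the reference that follows it.
var imageLine = regexp.MustCompile(`^\s*(?:-\s*)?image:\s*["']?([^"'\s]+)`)

// digestSuffix matches a reference that ends in a full sha256 digest.
var digestSuffix = regexp.MustCompile(`@sha256:[a-f0-9]{64}$`)

// unpinnedImages returns every image reference in the manifest that
// does not carry an @sha256 digest.
func unpinnedImages(manifest string) []string {
	var bad []string
	sc := bufio.NewScanner(strings.NewReader(manifest))
	for sc.Scan() {
		m := imageLine.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		if !digestSuffix.MatchString(m[1]) {
			bad = append(bad, m[1])
		}
	}
	return bad
}

// sampleManifest is illustrative: one pinned ref and one tag-only ref.
const sampleManifest = `
containers:
  - name: verifier
    image: ghcr.io/obolnetwork/x402-verifier:b13254e@sha256:a8a7aa0ca4c35b0ddf6983fa6e3e5f8a3f64e44d8e506ebfd55e39de2bc0342d
  - name: demo
    image: example.invalid/demo:latest
`

func main() {
	for _, ref := range unpinnedImages(sampleManifest) {
		fmt.Println("tag-only image ref:", ref)
	}
}
```

A line-based scan is a simplification; it would miss image refs built by templating or set through values.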
Today the serviceoffer-controller is pinned at replicas: 1 with a
"Do not scale" comment in x402.yaml. The RBAC for leases is already
granted (x402.yaml:176-178) — pre-positioned and unused. An accidental
`kubectl scale --replicas=2` or HPA misconfiguration produces
split-brain finalizers and double on-chain ERC-8004 registration
(real gas spend + duplicate registry entries).
This wires client-go tools/leaderelection so multi-replica deployment
is safe-by-correctness, not safe-by-comment.
- cmd/serviceoffer-controller/main.go:
- Read POD_NAME / POD_NAMESPACE from downward API env.
- Acquire Lease "serviceoffer-controller" in POD_NAMESPACE
before running the reconcile loop.
- On lost leadership, os.Exit(1) — kubelet restarts the pod
which re-elects from scratch.
- --leader-elect flag (default true) so local dev can bypass.
- x402.yaml:
- Add downward-API POD_NAME env to the controller Deployment
(POD_NAMESPACE was already wired).
- Update the "Do not scale" comment to "Single replica by
default; bumping to 2+ is now safe — leader election prevents
split-brain on the reconcile loop."
- Lease parameters chosen for fast failover on k3d (lease=30s,
renew=20s, retry=5s). Tunable via flag if a multi-zone deployment
ever needs longer.
Uses client-go directly rather than controller-runtime Manager to
minimize churn — controller is currently raw client-go workqueues,
not controller-runtime. Migration to controller-runtime is a separate
much larger workstream and not necessary just for leader election.
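The chosen lease parameters can be sanity-checked against the ordering constraints client-go's leader elector enforces (lease duration above renew deadline, renew deadline above the retry period times a 1.2 jitter factor). The sketch below re-implements that check with the standard library only; `validateLeaseTimings` is a hypothetical helper, not part of the controller.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// jitterFactor mirrors the 1.2 multiplier applied to the retry period.
const jitterFactor = 1.2

// validateLeaseTimings reports whether the three timings are ordered
// the way a leader elector needs them to be.
func validateLeaseTimings(lease, renew, retry time.Duration) error {
	if lease <= renew {
		return errors.New("lease duration must be greater than renew deadline")
	}
	if renew <= time.Duration(jitterFactor*float64(retry)) {
		return errors.New("renew deadline must be greater than retry period * jitter factor")
	}
	return nil
}

func main() {
	// Values from the commit message: lease=30s, renew=20s, retry=5s.
	fmt.Println(validateLeaseTimings(30*time.Second, 20*time.Second, 5*time.Second))

	// A lease shorter than the renew deadline is rejected.
	fmt.Println(validateLeaseTimings(15*time.Second, 20*time.Second, 5*time.Second))
}
```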
Closes the root cause of CLAUDE.md pitfall #14 ("first-request flake on
freshly-deployed verifier"). Previously /readyz returned 200 the moment
config.Load() became non-nil, but routes from the ServiceOffer informer
load later. Between those two events the pod is Ready from kubelet's
view, receives Service traffic, and matchPaidRoute returns
"no rule -> 200" for paid routes. The release-smoke flows hide this
behind 12x5s retry loops; the actual fix is to not be Ready until
routes are loaded.
- Adds routesLoaded atomic.Bool to Verifier.
- HandleReadyz returns 503 until BOTH config and routes loaded, with a
body that distinguishes the two cases for kubectl describe
debuggability.
- WatchServiceOffers takes an optional onFirstApply callback, invoked
after the post-WaitForCacheSync refresh succeeds.
- main.go wires v.MarkRoutesLoaded as the callback for kube source, or
invokes it directly after NewVerifier for file source (the file source
has no informer; routes are loaded synchronously).
Pairs with PR #515 (replicas: 1): at single replica the rollout window
for this race shrinks from "some scrapes" to "first ~5-10s", but it's
still a bug; this PR closes it.
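The readiness gate can be sketched as follows. This is a simplified stand-in for the verifier: `configLoaded` replaces the real `config.Load() != nil` check, the response bodies are invented, and `probe` is a hypothetical test helper.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// Verifier holds only the two readiness flags needed for this sketch.
type Verifier struct {
	configLoaded atomic.Bool // assumption: stands in for config.Load() != nil
	routesLoaded atomic.Bool
}

// MarkRoutesLoaded is what the informer's first-apply callback would invoke.
func (v *Verifier) MarkRoutesLoaded() { v.routesLoaded.Store(true) }

// HandleReadyz returns 503 until both config and routes are loaded,
// with a body that says which of the two is still missing.
func (v *Verifier) HandleReadyz(w http.ResponseWriter, _ *http.Request) {
	switch {
	case !v.configLoaded.Load():
		http.Error(w, "config not loaded", http.StatusServiceUnavailable)
	case !v.routesLoaded.Load():
		http.Error(w, "routes not loaded", http.StatusServiceUnavailable)
	default:
		fmt.Fprintln(w, "ok")
	}
}

// probe calls the handler in-process and returns status code and body.
func probe(v *Verifier) (int, string) {
	rec := httptest.NewRecorder()
	v.HandleReadyz(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	v := &Verifier{}
	fmt.Println(probe(v)) // config missing

	v.configLoaded.Store(true)
	fmt.Println(probe(v)) // routes missing

	v.MarkRoutesLoaded()
	fmt.Println(probe(v)) // ready
}
```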
…kubectl apply
Kills CLAUDE.md pitfall #9 forever. The previous code path had two
problems that compounded:
1. EnsureVerifier did kubectl apply of embed.FS x402.yaml directly,
overwriting whatever helmfile had installed. Under
OBOL_DEVELOPMENT=true, this stripped local-build image pins back to
registry-pinned digests, silently bypassing every dev edit to the
verifier.
2. To work around (1), setup.go carried a DUPLICATE copy of the
image-pin rewrite regex from internal/defaults/defaults.go (with a code
comment confessing "duplicated here to avoid an import cycle"). Every
fix to the regex (e.g. pitfall #12's alternation-order fix) had to be
applied in two places, which is exactly the kind of footgun that
produces silent bypasses.
Now EnsureVerifier shells out to helmfile --selector name=base sync
against the helmfile state already used by obol stack up. Since
helmfile reads the manifests from $OBOL_CONFIG_DIR/defaults/ (which is
populated by defaults.CopyInfrastructure with the canonical regex
already applied), the dev-rewrite happens exactly once, in exactly one
place.
- Deletes the duplicate devLocallyBuiltImageBases + regex from
internal/x402/setup.go.
- EnsureVerifier now: RefreshInfrastructureIfChanged(); helmfile sync
--selector name=base.
- Deletes internal/x402/manifest_devmode_test.go. The canonical
regression test is
internal/defaults/defaults_test.go::TestCopyInfrastructure_DevModeRewritesDigestPins,
which still guards the rewrite at its single source.
- Adds a structural test (setup_structure_test.go) asserting setup.go
does not import the regexp package, making re-introduction of the
duplicate fail at test time.
The duplicate-regex footgun is now structurally impossible to
re-introduce.
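One way to write a structural "does not import regexp" check is with go/parser in imports-only mode. The sketch below parses source text passed as a string; `importsPackage` and the two sample sources are hypothetical, and the actual setup_structure_test.go may be implemented differently.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
)

// importsPackage reports whether the Go source imports the given path.
// Only the import block is parsed, so function bodies are never inspected.
func importsPackage(src, pkg string) (bool, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "setup.go", src, parser.ImportsOnly)
	if err != nil {
		return false, err
	}
	for _, imp := range f.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return false, err
		}
		if path == pkg {
			return true, nil
		}
	}
	return false, nil
}

// Illustrative sources: one clean, one that re-introduces regexp.
const cleanSrc = "package x402\n\nimport (\n\t\"os/exec\"\n)\n"
const dirtySrc = "package x402\n\nimport (\n\t\"os/exec\"\n\t\"regexp\"\n)\n"

func main() {
	for name, src := range map[string]string{"clean": cleanSrc, "dirty": dirtySrc} {
		found, err := importsPackage(src, "regexp")
		fmt.Println(name, found, err)
	}
}
```

In a real test the source would be read from disk and the check would fail the test when the import is found.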
…loads
Brings every embedded Deployment shipped by obol-stack up to PSS Restricted:
- runAsNonRoot: true with fixed non-zero UID/GID (65532)
- allowPrivilegeEscalation: false
- capabilities.drop: [ALL]
- seccompProfile: RuntimeDefault
- readOnlyRootFilesystem: true (with named emptyDir mounts where Python
needs writeable /tmp and HOME/.cache)
PSS labels (enforce=restricted, audit/warn=restricted) added to the x402
and llm namespaces so future Deployment edits that omit per-pod
securityContext are rejected at admission.
Also switches the serviceoffer-controller Dockerfile from
gcr.io/distroless/static-debian12 (UID 0) to ...:nonroot (UID 65532).
Container escape via a Go runtime CVE on a UID-0 / no-seccomp /
no-cap-drop / RW-rootfs container was the easiest path to host pivot
on k3s single-node; this closes it.
Files touched:
- Dockerfile.serviceoffer-controller (:nonroot base)
- internal/embed/infrastructure/base/templates/x402.yaml
(verifier + controller securityContext blocks, x402 ns PSS label)
- internal/embed/infrastructure/base/templates/llm.yaml
(litellm + x402-buyer securityContext, litellm-tmp + litellm-home
emptyDir mounts with HOME/XDG_CACHE_HOME/HF_HOME redirection,
llm ns PSS label)
Scope notes:
- local-path-provisioner lives in kube-system (k3d-managed); not
relabeled per PSS guidance to skip system namespaces.
- hermes-obol-agent runtime is generated dynamically by
serviceoffer-controller (internal/serviceoffercontroller/agent_render.go
and internal/hermes/hermes.go), not from the embedded templates;
its init-hermes-perms initContainer legitimately runs as UID 0
for /data chown and is intentionally left out of this PR's scope.
- cloudflared chart (internal/embed/infrastructure/cloudflared/...)
is a separate Helm chart and not in this PR's file list.
What may break:
- LiteLLM with readOnlyRootFilesystem may fail if it writes outside
/tmp or $HOME — watch the next release-smoke for permission-denied
errors and add named emptyDir mounts for any new write paths.
Today the x402-buyer sidecar's /state directory is an emptyDir. When
the litellm pod restarts (rollout, OOM, node drain), consumed.json is
gone. The pre-signed auth pool reloads from the ConfigMap the
controller manages, and the buyer treats every auth as unconsumed —
attempting to spend nonces that the facilitator already marked used.
Cascade: facilitator returns 400 "nonce already used" -> buyer 402
back to LiteLLM -> caller retry -> same 400 -> eventually buyer pool
exhausted -> 503 until manual `buy.py process --all` reseeds.
Fix: convert /state to a PVC backed by local-path-provisioner (the
storage class already deployed via base/templates/local-path.yaml).
50Mi request; consumed.json is tiny but room left for log growth.
Deployment strategy switched to Recreate because a RWO PVC can't
be co-mounted during a RollingUpdate surge. Litellm is replicas: 1
so this just means rollouts have a ~5s gap instead of an overlap —
acceptable.
What this does NOT solve:
- Multi-replica litellm. RWO PVC works only for replicas: 1; would
need RWX (which local-path doesn't support — needs NFS/Longhorn)
or per-replica state via StatefulSet. Out of scope; litellm has
no current scaling need.
- Hard node loss. local-path PVCs are node-local; if the k3d node
is destroyed, state is gone (along with the rest of the cluster).
For local-only operator that's the expected blast radius.
PSS compatibility note: the PVC mount works under PSS Restricted as
long as the buyer container runs with appropriate fsGroup. PR #12
(Restricted PSS sweep) handles that separately and will verify mount
permissions when it lands.
The infrastructure helmfile shipped 6 `bedag/raw` releases: a wrapper
chart whose only job is to apply inline YAML through helmfile. With the
`base` release already rendering every other YAML in `base/templates/`,
the inline approach has zero remaining justification. This PR finishes
the job by relocating all 6:
- llm-buyer-podmonitor → base/templates/llm.yaml (appended)
- erpc-httproute → base/templates/erpc.yaml (new file)
- erpc-x402-middleware → base/templates/erpc.yaml
- erpc-metadata → base/templates/erpc.yaml
- obol-frontend-httproute → base/templates/obol-frontend.yaml (new file)
- obol-frontend-rbac → base/templates/obol-frontend.yaml
Net change to the rendering: zero. Same YAML, just sourced from the
chart's templates directory instead of inlined in helmfile.yaml through
the bedag/raw wrapper chart. Each relocated YAML carries a provenance
comment.
DAG: `base` now `needs: [traefik/traefik, monitoring/monitoring]` so
the Traefik (Middleware) / Gateway API (HTTPRoute) / Prometheus
operator (PodMonitor) CRDs are guaranteed present before the relocated
templates apply.
New Namespace docs for `erpc` and `obol-frontend` make the `base`
release self-contained. The upstream chart releases that originally
created those namespaces still set `createNamespace: true`, which is a
no-op against an existing namespace.
The `bedag` repository entry is removed (no infrastructure release uses
it anymore). Network helmfiles + hermes still use bedag/raw, which is
out of scope for this PR.
`migrateDefaultsHTTPRouteHostnames` in internal/stack/stack.go targets
the old in-helmfile HTTPRoute indentation pattern; it is a no-op
against the relocated templates and against the new helmfile, preserved
unchanged for users upgrading from older stacks.
The `hostnames: ["obol.stack"]` restriction is preserved on every
relocated HTTPRoute per CLAUDE.md guidance; removing it would expose
the frontend / eRPC to the public cloudflared tunnel.
`TestHelmfile_IncludesBuyerPodMonitor` rewired to read
`base/templates/llm.yaml`. All embed CRD tests, stack tests, and go
build are green.
…tches
Today HandleVerify returns 200 whenever matchPaidRoute returns nil.
Combined with Traefik ForwardAuth's "200 = allow" semantics, this
means that either a misconfigured Middleware on a paid route or a code
bug where the route was supposed to match but didn't will silently make
the route FREE: revenue loss with no signal.
- Adds paidPrefixes atomic.Pointer[[]string] to Verifier.
- Verifier.load() derives prefixes from cfg.Routes patterns:
"/services/foo/*" -> "/services/foo/" (trailing slash kept so
HasPrefix doesn't false-match /services/foobar/).
- HandleVerify: when matchPaidRoute returns nil, check if URI is under
any tracked prefix. If yes -> 403. If no -> 200 (legitimately free).
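The prefix derivation and the fail-closed decision can be sketched like this. `prefixFromPattern` and `statusForUnmatched` are hypothetical names, and handling only patterns that end in `/*` is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// prefixFromPattern turns "/services/foo/*" into "/services/foo/".
// The trailing slash is kept so "/services/foobar/" does not match.
func prefixFromPattern(pattern string) (string, bool) {
	if !strings.HasSuffix(pattern, "/*") {
		return "", false
	}
	return strings.TrimSuffix(pattern, "*"), true
}

// statusForUnmatched is the decision taken when no paid route matched:
// 403 under a known paid prefix, 200 otherwise.
func statusForUnmatched(uri string, prefixes []string) int {
	for _, p := range prefixes {
		if strings.HasPrefix(uri, p) {
			return http.StatusForbidden
		}
	}
	return http.StatusOK
}

func main() {
	prefix, _ := prefixFromPattern("/services/foo/*")
	prefixes := []string{prefix}

	for _, uri := range []string{
		"/services/foo/chat",    // under a paid prefix: fail closed
		"/services/foobar/chat", // different service: stays free
		"/healthz",              // unrelated path: stays free
	} {
		fmt.Println(uri, statusForUnmatched(uri, prefixes))
	}
}
```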
Complementary to PR #519 (gate /readyz on informer sync):
- PR #519 ensures the pod isn't Ready until routes are loaded
(closes the bootstrap-window leak).
- This PR ensures that after Ready, any prefix the verifier KNOWS
about that doesn't have a matching rule is fail-closed
(closes the steady-state-bug leak).
Together they cover the "rule should match but doesn't" gap.
Closes the entire class of "CRD YAML and Go struct drifted" bugs.
PurchaseAutoRefill.MaxTotal was the most recent instance: it existed
in purchaserequest-crd.yaml for months while
internal/monetizeapi/types.go didn't have the corresponding field.
Without this commit, that pattern recurs by design: two sources of
truth, one hand-maintained, no enforcement of agreement.
Now Go is the single source of truth:
- kubebuilder markers on every CRD-backed struct in types.go
(validation, required, enum, pattern, printer columns, subresources)
- `just generate` regenerates *-crd.yaml from those markers
+ zz_generated.deepcopy.go from object:generate=true
- CI fails if `git status` is non-empty after `just generate` runs
This commit also fixes the documented MaxTotal / MaxSpendPerDay drift
by adding both fields to PurchaseAutoRefill — the generated CRD now
matches the prior hand-written one and the controller can read them.
Pinned controller-tools at v0.16.5 in tools/tools.go (compatible with
client-go v0.34.x; a newer release would force prometheus/common
through a panicking validation-scheme change). Generation is
deterministic; running locally produces no diff after a clean
checkout.
For future CRD edits:
1. Edit types.go (add/change a field, update markers)
2. `just generate`
3. Commit both the Go and YAML diffs
4. CI verifies the YAML was committed
PreSignedAuth.Payment is map[string]interface{} (opaque x402
payload), which controller-gen cannot deep-copy automatically; a
hand-written DeepCopy lives in deepcopy_manual.go and the type is
flagged object:generate=false.
The hack/boilerplate.go.txt file is force-added past *.txt gitignore;
it's an empty marker for now — add a copyright header later if the
repo settles on one.
PrometheusRule annotations use {{ $labels.X }} which Prometheus
evaluates at alert-firing time. When this file is rendered through
Helm (via chart: ./base in helmfile.yaml), Helm's Go-template engine
tries to evaluate $labels at chart-render time and fails with:
Error: UPGRADE FAILED: parse error at
(base-infra/templates/x402-prometheus-rules.yaml:N):
undefined variable "$labels"
Wrap each templated brace pair as {{ "{{" }}...{{ "}}" }} so Helm
emits literal Prometheus template syntax verbatim into the YAML
output, where Prometheus picks it up at alert-eval time.
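Because Helm renders charts with Go's text/template, both the failure and the fix can be reproduced with the standard library. The annotation text and the `offer_name` label in this sketch are illustrative, and `render` is a hypothetical helper.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render parses and executes a template the way a chart renderer would,
// returning either the output or the parse/exec error.
func render(src string) (string, error) {
	t, err := template.New("rule").Parse(src)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, nil); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// unescaped is what the PrometheusRule annotation looked like before the fix.
const unescaped = `summary: "offer {{ $labels.offer_name }} failing"`

// escaped wraps each brace pair in a string literal action.
const escaped = `summary: "offer {{ "{{" }} $labels.offer_name {{ "}}" }} failing"`

func main() {
	_, err := render(unescaped)
	fmt.Println("unescaped:", err) // parse error mentioning the undefined variable

	out, err := render(escaped)
	fmt.Println("escaped:", out, err)
}
```

The escaped form renders to the literal `{{ $labels.offer_name }}`, which is what Prometheus expects to find in the rule file.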
Bug surfaced by integration-branch full stack-up; not caught by
`go test ./...` (unit tests don't render Helm) nor by the agent
worktree validation (which only checked Go-side compilation).
Recommend adding a CI smoke that pipes embedded *.yaml templates
through `helm template ./base` to catch this class going forward.
Stacks on PR #513 (which introduced the file in commit 27e1ac5).
PR #523 moved 6 bedag/raw helmfile releases into the base chart so
there's one source of truth for what ships in each namespace. Fresh
installs work. EXISTING clusters being upgraded from pre-#523
obol-stack fail at `helm upgrade base` with:
Error: UPGRADE FAILED: <resource> exists and cannot be imported into
the current release: invalid ownership metadata; annotation validation
error: key "meta.helm.sh/release-name" must equal "base"
This blocks `obol stack up` until the operator manually re-annotates
~10 resources (Namespaces, HTTPRoutes, Middlewares, ConfigMaps,
PrometheusRule, PodMonitor, ClusterRole/Binding).
Adds hack/migrate-bedag-raw-to-base.sh which finds all such orphans and
re-annotates them in bulk. Idempotent, so safe to re-run.
Surfaced by the 14-PR integration test campaign; see
plans/integration-test-results-final-20260524.md Bug #2.
…oads
PR #521 enforces Restricted Pod Security Standard on x402 + llm
namespaces. The controller renders two httpd-based Deployments
(obol-skill-md publisher + agentidentity-default-registration
well-known/agent-registration.json publisher) without securityContext,
so PSS admission rejects them and they never start. Result: marketplace
API returns STACK_UNREACHABLE because skill-md isn't reachable.
Adds Restricted-compliant securityContext to both renderers:
- pod: runAsNonRoot, runAsUser=1000, runAsGroup=1000,
seccompProfile=RuntimeDefault, fsGroup=1000
- container: allowPrivilegeEscalation=false, drop ALL capabilities
Both Deployments already bind httpd to 8080, which is non-root safe, so
no port change is required.
Surfaced by the 14-PR integration test campaign. The integration test
workaround patched the running Deployments manually:
plans/integration-test-results-final-20260524.md Bug #3.
Recording rule was sum(counter), which is wrong for any metric where
the counter resets across pod restarts — Prometheus counters are
per-process by design. The TSDB is the canonical persistence layer;
rate() and increase() perform reset detection at query time across
the samples the TSDB holds.
- Renames the rule to x402:revenue:7d_by_offer (name matches what
it returns; the old "lifetime" / "total_by_offer_current" names
were aspirational against a finite retention window).
- Expression: sum by (offer_namespace, offer_name) (
increase(obol_x402_verifier_charged_requests_total[7d])
)
- 7d inside 8d retention gives 1-day headroom so reset detection
has both-side samples at the window's left edge.
Per Robust Perception's "avoiding the counter-reset undercount"
canonical guidance. Zero new components — uses only native Prometheus
+ recording-rule primitives.
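A simplified model of why increase() survives a restart while a raw sum of the counter does not. The sample values are made up, and the sketch ignores the extrapolation Prometheus applies at window edges; `increaseWithResets` is a hypothetical name.

```go
package main

import "fmt"

// increaseWithResets adds up the growth between consecutive samples.
// A drop in value is treated as a counter reset: the process restarted
// from zero, so the whole post-reset value counts as new growth.
func increaseWithResets(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		if samples[i] < samples[i-1] {
			total += samples[i] // reset detected
		} else {
			total += samples[i] - samples[i-1]
		}
	}
	return total
}

func main() {
	// Counter climbs to 5, the pod restarts, then it climbs to 4 again.
	samples := []float64{0, 3, 5, 1, 4}

	fmt.Println("reset-aware increase:", increaseWithResets(samples)) // 9
	fmt.Println("latest raw value:", samples[len(samples)-1])         // 4
}
```

The raw value only reflects charges since the last restart, which is the undercount the renamed rule avoids.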
Found by the 14-PR integration test
(plans/integration-test-results-final-20260524.md). The OBOL parity
smoke surfaced it more visibly when a verifier restart produced a
"0 req·24h" UI display on a row with real on-chain traffic.
Stacks on PR #527 (Helm-escape fix for the same file).
Currently the verifier emits (offer_namespace, offer_name, chain).
Answering "what's my OBOL revenue?" requires joining metrics with
the ServiceOffer CR's spec.payment.asset.symbol at the frontend.
With asset_symbol on the label set, the answer is a direct PromQL
aggregation.
Cardinality cost: zero. Each offer pins exactly one asset (A=1
per offer), so the new dimension is functionally constant within
the existing (ns, name) group — no series multiplication. The
"don't label what you can derive" guidance exists to prevent
*multiplicative* blowups (chain x pod x pod_owner style); the
single-asset-per-offer invariant means there's no multiplication
to prevent.
The argument for adding asset_symbol is identical to the argument
that already justifies `chain` on these vecs: both are
CR-derived, both are query-meaningful, both have bounded values.
Changes:
- 6 metric vecs: label slice gains "asset_symbol"
- pruneSeriesNotIn key now (ns, name, chain, asset_symbol) so
asset-repin doesn't leak the old series
- verifier.load() live-set built with the same 4-tuple
- prometheusLabels() emits rule.AssetSymbol (or "unknown" if
empty as defensive fallback)
- New _asset_symbol-suffixed recording rules added side-by-side
with existing rules; existing rules unchanged (non-breaking)
- Tests: emission asserts asset_symbol; prune test asserts
asset-repin doesn't leak
Frontend can simplify the existing metric x CR join in a future PR
once it migrates to the _asset_symbol-suffixed rule.
Findings from: plans/integration-test-L7-paid-flow-20260524.md
(OBOL parity smoke surfaced this as a real gap when validating
the WalletStrip / EarningsStrip per-token columns).
…ting low-traffic alerts
X402PaymentFailureRateHigh and the settlement_rate recording rule used
clamp_min(denominator, 1) as a div-by-zero guard. For paid endpoints
under light load (sub-1 req/s), the floor is 1.0 instead of the true
denominator, so the ratio numerator/denominator returns near-zero even
when 50%+ of requests are failing, and the alert never fires.
Switch the floor to 1e-9. Epsilon prevents division-by-zero while
keeping the actual ratio accurate at any non-zero traffic level.
Surfaced by Expert #2 review of the PromQL design
(plans/integration-test-L7-paid-flow-20260524.md follow-ups).
Stacks on PR #531 (asset_symbol label) which is the tip of the
rules-file chain. Will rebase onto main as the chain merges.
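The effect of the floor can be checked with plain arithmetic. The rates below are invented for illustration (0.02 req/s total, half of them failing), and `failureRatio` is a hypothetical stand-in for the PromQL expression, with math.Max playing the role of clamp_min.

```go
package main

import (
	"fmt"
	"math"
)

// failureRatio divides the failure rate by the total rate, with the
// denominator floored the way clamp_min(denominator, floor) would.
func failureRatio(failRate, totalRate, floor float64) float64 {
	return failRate / math.Max(totalRate, floor)
}

func main() {
	const threshold = 0.10 // alert fires above 10%

	failRate, totalRate := 0.01, 0.02 // req/s, so 50% of requests fail

	old := failureRatio(failRate, totalRate, 1)
	fixed := failureRatio(failRate, totalRate, 1e-9)

	fmt.Printf("floor=1:    ratio=%.2f fires=%v\n", old, old > threshold)
	fmt.Printf("floor=1e-9: ratio=%.2f fires=%v\n", fixed, fixed > threshold)

	// With no traffic at all the epsilon floor still avoids dividing by zero.
	fmt.Println("idle:", failureRatio(0, 0, 1e-9))
}
```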
PR #527 fixed an unescaped {{ $labels }} in a PrometheusRule annotation
that broke `helm upgrade base` on every `obol stack up`. The bug
shipped to integration testing because go test ./... doesn't exercise
Helm rendering. This job pipes the embedded base chart through
`helm template` on every PR; parse errors fail the build before merge.
- Runs against ./internal/embed/infrastructure/base
- Uses helm v3.20.1 (matches obolup.sh pinned version)
- Also runs `helm lint` for chart-structure issues
- Substitutes {{OLLAMA_HOST_IP}}/{{CLUSTER_ID}} stubs in a temp copy of
the chart (mirroring what `obol stack init` does via
internal/defaults/defaults.go::InfrastructureReplacements)
- Future: pair with a helmfile-lint job for state-value tests
If we ever land a chart-template change that this doesn't catch, expand
the helm-template invocation with --set values mimicking what
`obol stack up` provides.
After the OBOL parity smoke + Prometheus expert review, we made
explicit design choices worth recording so they don't get
re-litigated:
1. Counters are intentionally per-process — Prometheus design.
Pod restarts reset them; rate()/increase() handle this at
query time via the TSDB's reset detection. Don't add
persistence to the counter itself.
2. Prometheus = recent operational telemetry (bounded by retention).
On-chain settlement TXs = canonical lifetime financial record.
3. Recording rules use the convention <level>:<metric>:<operations>;
name the window (7d_by_offer, not lifetime_by_offer).
4. Add labels you'd query by directly (chain, asset_symbol —
both CR-derived, both query-meaningful, both bounded).
5. div-by-zero guards use epsilon (1e-9), not 1.0.
6. CRD versioning stance: stay on v1alpha1 during active dev;
the alpha promise IS "no compat". Graduate only when an
external operator commits to depending on the schema.
The PVC-backed counter persistence option was considered and
rejected for our single-operator local-k3d use case. The doc
walks through why, what would change that decision, and where
the canonical "lifetime" answer comes from.
Adds CLAUDE.md pointer so future contributors land here first.
The legacy obol.org/paused annotation tore down HTTPRoutes immediately, which is indistinguishable from a crash to remote x402 buyers and ERC-8004 reputation scorers. obol sell stop was also broken: it patched status.conditions, which the controller immediately overwrote.

This replaces both with a real drain:

- New ServiceOffer spec.drainAt (date-time) + spec.drainGracePeriod (duration; default 1h) mark an offer as winding down.
- While draining, /skill.md and /.well-known/agent-registration.json advertise the offer with available=false and drainEndsAt set, so external discovery can react before traffic disappears.
- The HTTPRoute + payment gate stay up until DrainEndsAt, letting in-flight buyers complete payments.
- After the grace period, the controller tears down the route, sets Draining=False reason=Drained, and leaves the CR (delete is the canonical removal command).

obol sell stop sets spec.drainAt, supports --grace <duration> and --force/--now (zero grace = abrupt teardown for behavior parity with the old annotation).
…controller, litellm, cloudflared
…deploy from helmfile
This was referenced May 24, 2026
…chitecture blockers Fold architecture-review fixes into feat/marketplace-bundle before the bundle lands.
Summary
Consolidates 21 stacked PRs (#513-#535, excluding #514 already merged and #526 obsoleted by #535) into a single mergeable bundle. Replaces the fragile cross-branch stack with one merge target.
Collapse update:
feat/marketplace-bundle as the architecture-review fixup.
feat/x402-marketplace-metrics is already an ancestor of this bundle branch.
Original PR roadmap:
Merge-commit graph
git log --first-parent --oneline feat/marketplace-bundle ^main
Conflicts resolved
- `internal/embed/infrastructure/base/templates/serviceoffer-crd.yaml` (#535 merge, feat(monetize): replace pause annotation with ERC-8004-friendly drain): kept HEAD's controller-gen output structure from #525 (feat(monetizeapi): controller-gen as canonical CRD schema source); regenerated all CRDs via `controller-gen` with the merged types.go (which contains the DrainAt fields).
- `internal/embed/infrastructure/base/templates/llm.yaml` (#513 merge, feat(x402): chain label, last-settlement gauge, PurchaseAutoRefill sync + verifier PodMonitor + obol-frontend RBAC for purchaserequests): combined the #523 relocation comment (refactor: relocate remaining bedag/raw helmfile releases into base chart) with the #513 PodMonitor rationale; kept the `app: litellm` label.
- `internal/embed/infrastructure/helmfile.yaml` (#513 merge): kept #523's removal of bedag/raw releases (already moved into base templates); kept #513's NOTE explaining the obol-frontend RBAC relocation.
- `internal/monetizeapi/types.go` (#513 merge): kept HEAD (#525) kubebuilder annotations on PurchaseAutoRefill; regenerated CRDs.
- `internal/x402/verifier.go` (#524 merge, fix(x402): fail-closed when URI is under a paid prefix but no rule matches): kept both Verifier fields (`routesLoaded` from #519, fix(x402): gate verifier /readyz on informer cache sync, and `paidPrefixes` from #524); both behaviors land.

Additional fix:
- `internal/stack/stack_test.go`: loosened `TestLLMTemplate_IncludesPaidRouteAndBuyerSidecar` to accept multi-line `emptyDir:` after #521's PSS sweep (feat(security): Restricted Pod Security Standard across embedded workloads) added `sizeLimit` values.

Test plan
- `go build ./...` clean
- `go test ./...` green except pre-existing `TestWarnIfNoChatModel_EmitsWarnWhenNoModels` (unrelated; message text drift)
- `go test -tags integration ./internal/openclaw/...` (needs cluster)

Closes
Closes #513 #515 #516 #517 #518 #519 #520 #521 #522 #523 #524 #525 #527 #528 #529 #530 #531 #532 #533 #534 #535
Supersedes #526 (replaced by drain semantics in #535).