
feat: x402 marketplace + architecture review bundle (#513-#535) #536

Open
bussyjd wants to merge 53 commits into main from feat/marketplace-bundle

Conversation

@bussyjd
Collaborator

@bussyjd bussyjd commented May 24, 2026

Summary

Consolidates 22 stacked PRs (#513-#535, excluding #514 already merged and #526 obsoleted by #535) into a single mergeable bundle. Replaces the fragile cross-branch stack with one merge target.

Collapse update:

Original PR roadmap:

Merge-commit graph

git log --first-parent --oneline feat/marketplace-bundle ^main

Conflicts resolved

Additional fix:

Test plan

  • go build ./... clean
  • go test ./... green except pre-existing TestWarnIfNoChatModel_EmitsWarnWhenNoModels (unrelated; message text drift)
  • go test -tags integration ./internal/openclaw/... (needs cluster)
  • release-smoke flow-07 (validated separately)
  • release-smoke flow-08, 11, 13, 14, 15 (separate test run)
  • Deploy to live cluster, exercise drain + EarningsStrip + asset_symbol metrics

Closes

Closes #513 #515 #516 #517 #518 #519 #520 #521 #522 #523 #524 #525 #527 #528 #529 #530 #531 #532 #533 #534 #535

Supersedes #526 (replaced by drain semantics in #535).

bussyjd added 30 commits May 23, 2026 20:41
…oRefill

Unblocks per-chain earnings/spend aggregation in the frontend's My Listings
and My Purchases pages.

  - obol_x402_buyer_* metrics now carry a `chain` label sourced from
    UpstreamConfig.Network (already in payload, just wasn't on the labels).
  - obol_x402_verifier_* metrics now carry a `chain` label sourced from
    RouteRule.Network. Existing verifier metric tests updated to assert the
    new label (empty string when no Network is set on the rule).
  - internal/monetizeapi/types.go PurchaseAutoRefill struct now mirrors the
    CRD spec (purchaserequest-crd.yaml lines 93-96) by including MaxTotal +
    MaxSpendPerDay. The CRD already accepts these, the Go types just
    weren't reading them.

Together this means the frontend can soon switch the EarningsStrip /
WalletStrip from zeroed placeholders to real PromQL aggregates such as:

  sum by (chain) (increase(obol_x402_buyer_payment_success_total[7d]))
Closes the data loop for the frontend My Listings EarningsStrip + the
"Last settlement" timestamp the design canvas wants.

  - New gauge obol_x402_verifier_last_payment_success_seconds, labeled
    by (route, offer_namespace, offer_name, chain). Stamped via
    SetToCurrentTime() in both ForwardAuth and proxy-mode paths
    whenever a paid request reaches the seller successfully.
  - helmfile.yaml grows an x402-verifier PodMonitor (the namespace was
    previously scraping only litellm-x402-buyer). It carries the same
    `release: monitoring` label so kube-prometheus-stack picks it up.

The frontend already has matching consumers
(chargedSalesByOfferAndChain, chargedRequests24hByOffer,
lastSettlementByOffer in PrometheusClient) — without this scrape the
metrics never reach the dashboard.
…Refill

Closes the test gap left open by the recent chain-label + last-settlement
gauge work. 14 new subtests across three packages plus four pre-existing
buyer-proxy assertions updated to carry the new chain label.

New tests:
  - internal/x402/verifier_test.go
      TestVerifier_LastPaymentSuccessGauge (3 subtests):
        successful payment stamps gauge within ±5s of time.Now(),
        unpaid 402 leaves it untouched, rejected payment leaves it untouched.
      findVerifierMetricValue helper for time-window assertions.
  - internal/x402/buyer/metrics_test.go
      TestPrometheusLabels_ChainPropagation (3 subtests):
        base-sepolia / base mainnet / empty chain.
      TestMetrics_ChainLabelScrapeRoundtrip (2 subtests):
        scrape /metrics through the registry, assert every counter +
        gauge series carries the expected chain label.
  - internal/monetizeapi/types_test.go
      TestPurchaseAutoRefill_JSONRoundTrip (5 subtests):
        full population, only new caps, all-zero omitempty, single fields.
      TestPurchaseAutoRefill_UnmarshalAcceptsCRDForm:
        catches json-tag drift between the Go struct and CRD spec.
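
The round-trip and CRD-form checks described above can be sketched roughly as follows. This is a minimal illustration, not the repository's test code: only the `MaxTotal` and `MaxSpendPerDay` field names come from the commit message, while the `Enabled` field, the JSON keys, and the string types are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PurchaseAutoRefill is a hypothetical reduction of the struct in
// internal/monetizeapi/types.go. The JSON keys and field types are assumed.
type PurchaseAutoRefill struct {
	Enabled        bool   `json:"enabled,omitempty"`
	MaxTotal       string `json:"maxTotal,omitempty"`
	MaxSpendPerDay string `json:"maxSpendPerDay,omitempty"`
}

// roundTrip marshals in, decodes the result into a fresh value, and returns
// both the wire form and the decoded copy.
func roundTrip(in PurchaseAutoRefill) (string, PurchaseAutoRefill, error) {
	raw, err := json.Marshal(in)
	if err != nil {
		return "", PurchaseAutoRefill{}, err
	}
	var out PurchaseAutoRefill
	if err := json.Unmarshal(raw, &out); err != nil {
		return "", PurchaseAutoRefill{}, err
	}
	return string(raw), out, nil
}

// decode parses a CRD-shaped JSON document into the Go struct. A key that
// drifted from the struct tags would silently decode to the zero value.
func decode(raw string) (PurchaseAutoRefill, error) {
	var out PurchaseAutoRefill
	err := json.Unmarshal([]byte(raw), &out)
	return out, err
}

func main() {
	wire, out, err := roundTrip(PurchaseAutoRefill{MaxTotal: "50"})
	fmt.Println(wire, out, err)

	crd, err := decode(`{"maxTotal":"50","maxSpendPerDay":"5"}`)
	fmt.Println(crd, err)
}
```

Asserting on the decoded fields rather than on the absence of an error is what catches tag drift, since `encoding/json` ignores unknown keys by default.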

Pre-existing fix:
  - internal/x402/buyer/proxy_test.go — four TestProxy_* assertions had
    label maps without `chain`; tests use Network "base-sepolia" so the
    expected chain is now spelled out alongside upstream + remote_model.

RBAC:
  - helmfile.yaml: obol-frontend ClusterRole grows read access for
    purchaserequests + purchaserequests/status (frontend My Purchases
    needs list; agent buy.py + controller remain the only writers).
    Live-patched into the running cluster too.
…ording rules

Phase 1 + Phase 2 hardening on top of the chain-label/last-settlement work,
incorporating findings from the 4-agent K8s architecture review. Skips the
auth-on-mutating-endpoints item per operator clarification: the obol-stack
frontend is local-only behind the obol.stack hostname restriction, so it's
not the primary trust boundary.

RBAC trims:
  - Drop `secrets get/list` from obol-frontend-openclaw-discovery
    ClusterRole; pre-existing dangling grant, no code reads them.
  - Drop /status subresource from purchaserequests rule; frontend never
    writes status (only the controller does).

Monitoring + RBAC co-location (kills 3 bedag/raw helmfile releases):
  - x402-verifier: PodMonitor -> ServiceMonitor in base/templates/x402.yaml.
    Verifier has a stable Service on port http:8080; ServiceMonitor scrapes
    the endpoint cleanly across replicas.
  - litellm-x402-buyer: PodMonitor moved into base/templates/llm.yaml.
    Stays a PodMonitor because the sidecar's port 8402 is per-pod, not
    fronted by a Service.
  - obol-frontend RBAC moved into base/templates/obol-frontend-rbac.yaml
    next to the workload it grants.

Label cardinality:
  - Drop `route` label from verifier metrics. (offer_namespace, offer_name,
    chain) already uniquely scopes a paid route; `route` (= rule.Pattern)
    was redundant and unbounded by path fragments.

PrometheusRule (new base/templates/x402-prometheus-rules.yaml):
  - Recording: x402:revenue:24h_by_offer_chain,
    x402:revenue:7d_by_offer_chain, x402:revenue:lifetime_by_offer,
    x402:settlement_rate:1h_by_offer_chain. The frontend's PrometheusClient
    reads these so renaming raw metrics no longer breaks the UI, and the
    `increase()` 2-sample minimum no longer leaves cold offers at "0" for
    the first 30s of traffic.
  - Alerting: X402PaymentFailureRateHigh (>10% over 1h),
    X402NoSettlementsAfterChallenge (402s issued, no charges).

Deferred (out of scope for this hardening pass):
  - Frontend-egress NetworkPolicy: on k3s + Flannel the apiserver Service
    endpoints point at the host process, outside the cluster pod/service
    CIDRs. A clean allowlist policy can't target the apiserver portably
    without an install-specific ipBlock; revisit when obol-stack ships a
    non-k3s deployment surface.
  - obol-marketplace-api aggregator service: overkill for the local
    single-operator context.
  - Three-deployment-paths consolidation (helmfile + bedag/raw + Go
    `EnsureVerifier`): larger refactor; tracked as separate workstream.

Live validation:
  - 2 paid requests against demo-hello survive both the RBAC trims and
    the ServiceMonitor swap. `x402:revenue:7d_by_offer_chain` returns
    1.0076 for chain=eip155:84532 (matches the underlying
    obol_x402_verifier_charged_requests_total counter at value 2 over
    2 samples).
  - /api/marketplace/purchases still returns 200 after dropping the
    /status grant.
  - /api/agents/wallets returns the agent wallet via the new batched
    listAllWalletMetadata path (1 ConfigMap list vs N+1 per-instance).
The verifier's per-offer counters and the last_payment_success_seconds
gauge were created on first use and never removed. Deleting an offer
(via `obol sell delete`, ServiceOffer CR deletion, or pricing config
edit) left stale series in the registry forever, which:

  * pollutes My Listings / dashboards with rows for offers that no
    longer exist,
  * lets X402NoSettlementsAfterChallenge keep referencing dead labels,
  * silently inflates the "last successful charge" gauge with timestamps
    from offers the operator already retired.

Verifier.load() now diffs the incoming route set against the live label
tuples in the registry and calls DeletePartialMatch on each vec for
every (offer_namespace, offer_name, chain) triple that is no longer
served. Both reload paths (file config watcher and the kube
ServiceOffer informer via ConfigAccumulator) funnel through load(), so
one hook covers everything.
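
A minimal sketch of the diff-and-prune step, using plain maps as a stand-in for the Prometheus metric vecs. The `offerKey` and `registry` types are hypothetical; the real code calls DeletePartialMatch on each vec rather than deleting map entries.

```go
package main

import "fmt"

// offerKey is the (offer_namespace, offer_name, chain) triple that scopes a
// paid route in the verifier metrics.
type offerKey struct{ Namespace, Name, Chain string }

// registry stands in for one metric vec: one value per label tuple.
type registry map[offerKey]float64

// pruneSeriesNotIn removes every series whose label tuple is absent from the
// live route set and reports how many series were dropped.
func pruneSeriesNotIn(vecs []registry, live map[offerKey]struct{}) int {
	removed := 0
	for _, vec := range vecs {
		for key := range vec {
			if _, ok := live[key]; !ok {
				delete(vec, key) // deleting during range is safe in Go
				removed++
			}
		}
	}
	return removed
}

func main() {
	kept := offerKey{"default", "demo-hello", "eip155:84532"}
	gone := offerKey{"default", "retired-offer", "eip155:84532"}

	charged := registry{kept: 2, gone: 5}
	lastSuccess := registry{kept: 1, gone: 1}

	// The reload only serves the kept offer.
	live := map[offerKey]struct{}{kept: {}}
	n := pruneSeriesNotIn([]registry{charged, lastSuccess}, live)
	fmt.Println("pruned:", n, "remaining:", len(charged), len(lastSuccess))
}
```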

Also fixes a guard test from the prior hardening commit that was still
asserting the old "no ServiceMonitor here" invariant after we
intentionally relocated the ServiceMonitor into this manifest. Flipped
to assert presence so a future cleanup can't silently drop it.

Test:
  TestVerifier_Reload_PrunesDeletedOfferSeries stamps two offers' worth
  of metrics, reloads with one removed, and asserts the removed offer
  is gone from all six vecs while the kept offer survives.
Commit 0fbb99a (fix(x402): GC verifier metric series for deleted offers)
added pruneSeriesNotIn to Verifier.load. Each verifier pod runs its
own informer + its own metric registry, so the GC is per-pod. With
replicas: 2 + ServiceMonitor (round-robin scrape over Endpoints),
Prometheus sees:

  * one pod's registry on scrape N (pruned correctly),
  * the other pod's on scrape N+1 (may still hold a deleted offer's
    series until that pod's informer also sees the delete).

Result: deleted offers' last_payment_success_seconds gauge and
charged_requests_total counters reappear every other scrape, polluting
dashboards and creating spurious alert state.

Cheapest correct fix is replicas: 1. The verifier is on the request
path but single-node k3d gains no HA from 2 replicas. Drop the
PodDisruptionBudget too — minAvailable:1 at replicas:1 just blocks
voluntary drains on the only pod, useless on k3d.

If/when the stack ever runs multi-node and HA replicas are wanted,
the right pattern is ServiceMonitor → PodMonitor with a `pod` label
and recording rules using `sum without(pod)`. That's a future change;
right now correctness > theoretical HA.
…dows; rename mis-named lifetime rule

Two related metric-correctness fixes layered on top of the recording
rules added in 27e1ac5:

1. Retention 6h → 8d. The recording rules added in 27e1ac5 use [24h]
   and [7d] windows. `increase(x[24h])` against a 6h-retention TSDB
   silently covers only the ~6h of samples that still exist, with no
   error. The frontend displays that result as "24h revenue", which
   under-reports by up to 4x. 8d (= 7d + 1d safety margin) keeps the
   [7d] rule valid across a brief Prometheus outage.

2. `x402:revenue:lifetime_by_offer` → `x402:revenue:total_by_offer_current`.
   The original expression was `sum(counter)` (not `sum(increase[lifetime])`),
   so it:
     * is NOT lifetime — it's "sum across currently-alive verifier
       replicas of their since-last-restart counts",
     * drops ~50% on every replica rollout,
     * compounds with the per-pod-registry issue addressed by the
       replicas:1 fix.

   Renaming makes the semantic explicit. True lifetime queries should
   use `sum_over_time(...[Nd])` against a long-retention store.

The retention bump increases Prometheus disk footprint roughly in
proportion to 8d/6h = 32x. The local-only kube-prometheus-stack PVC sizing in
monitoring.yaml.gotmpl needs review on next `obol stack up` if disk
pressure shows up — currently no PVC size cap set, so it inherits the
storageClass default.
Extends the @sha256 digest discipline that x402-buyer and the frontend
already carry to the remaining four images that ship as part of the
embedded infrastructure. Tag-only refs (e.g. ghcr.io/obolnetwork/
x402-verifier:b13254e) are vulnerable to mutable-tag rewrites — the
class of bug CLAUDE.md pitfall #12 documented as a real production fire.

Pins:
  - x402-verifier:b13254e             @ sha256:a8a7aa0ca4c35b0ddf6983fa6e3e5f8a3f64e44d8e506ebfd55e39de2bc0342d
  - serviceoffer-controller:b13254e   @ sha256:f83bd7e55bdc5d87edb49c04e7fd9257097364e2d43e769c19dfd7c8b47d07af
  - litellm:sha-c16b156               @ sha256:9f112b51ac5a57d73cdd54103fb98d24eabaddd8689a9a285884dca6456dc86e
  - cloudflared:2026.3.0              @ sha256:6b599ca3e974349ead3286d178da61d291961182ec3fe9c505e1dd02c8ac31b0

Adds a regression test asserting every embedded manifest carries
@sha256: on its image refs so a future dependency bump can't silently
revert to tag-only.
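
A rough sketch of what such a regression check could look like. The regex and the `unpinnedImages` helper are assumptions for illustration; the actual test in the repository may walk the embedded filesystem and handle templated image values differently.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// imageLine matches YAML lines of the form `image: <ref>`, optionally quoted
// or introduced as a list item. Hypothetical pattern, not the repo's own.
var imageLine = regexp.MustCompile(`^\s*(?:-\s*)?image:\s*["']?([^"'\s]+)`)

// unpinnedImages returns every image reference in the manifest that lacks an
// @sha256: digest.
func unpinnedImages(manifest string) []string {
	var bad []string
	sc := bufio.NewScanner(strings.NewReader(manifest))
	for sc.Scan() {
		m := imageLine.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		if !strings.Contains(m[1], "@sha256:") {
			bad = append(bad, m[1])
		}
	}
	return bad
}

func main() {
	manifest := `
containers:
  - name: verifier
    image: ghcr.io/obolnetwork/x402-verifier:b13254e@sha256:a8a7aa0ca4c35b0ddf6983fa6e3e5f8a3f64e44d8e506ebfd55e39de2bc0342d
  - name: tag-only
    image: ghcr.io/obolnetwork/x402-verifier:b13254e
`
	fmt.Println(unpinnedImages(manifest))
}
```

A test built on this would fail whenever the returned slice is non-empty, which is what stops a dependency bump from quietly reverting to a mutable tag.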

Dev-rewrite invariant (defaults.go:124 + setup.go:74 alternation regex)
verified intact via go test ./internal/defaults/... ./internal/x402/...
Today the serviceoffer-controller is pinned at replicas: 1 with a
"Do not scale" comment in x402.yaml. The RBAC for leases is already
granted (x402.yaml:176-178) — pre-positioned and unused. An accidental
`kubectl scale --replicas=2` or HPA misconfiguration produces
split-brain finalizers and double on-chain ERC-8004 registration
(real gas spend + duplicate registry entries).

This wires client-go tools/leaderelection so multi-replica deployment
is safe-by-correctness, not safe-by-comment.

  - cmd/serviceoffer-controller/main.go:
      - Read POD_NAME / POD_NAMESPACE from downward API env.
      - Acquire Lease "serviceoffer-controller" in POD_NAMESPACE
        before running the reconcile loop.
      - On lost leadership, os.Exit(1) — kubelet restarts the pod
        which re-elects from scratch.
      - --leader-elect flag (default true) so local dev can bypass.

  - x402.yaml:
      - Add downward-API POD_NAME env to the controller Deployment
        (POD_NAMESPACE was already wired).
      - Update the "Do not scale" comment to "Single replica by
        default; bumping to 2+ is now safe — leader election prevents
        split-brain on the reconcile loop."

  - Lease parameters chosen for fast failover on k3d (lease=30s,
    renew=20s, retry=5s). Tunable via flag if a multi-zone deployment
    ever needs longer.
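
A small sanity check on the chosen timings, as a sketch. To the best of my knowledge client-go's leader elector rejects configurations where the lease duration does not exceed the renew deadline, or where the renew deadline does not exceed the retry period scaled by a 1.2 jitter factor; the `leaseTiming` type and the constant below mirror that rule as an assumption and are not taken from the controller code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// jitterFactor is assumed to match the multiplier client-go applies to the
// retry period when validating a leader-election config.
const jitterFactor = 1.2

// leaseTiming is a hypothetical holder for the three tunables.
type leaseTiming struct {
	LeaseDuration time.Duration
	RenewDeadline time.Duration
	RetryPeriod   time.Duration
}

// validate checks the ordering lease > renew > jitter * retry.
func (t leaseTiming) validate() error {
	if t.LeaseDuration <= t.RenewDeadline {
		return errors.New("lease duration must be greater than renew deadline")
	}
	if t.RenewDeadline <= time.Duration(jitterFactor*float64(t.RetryPeriod)) {
		return errors.New("renew deadline must be greater than retry period * jitter factor")
	}
	return nil
}

func main() {
	chosen := leaseTiming{30 * time.Second, 20 * time.Second, 5 * time.Second}
	fmt.Println("lease=30s renew=20s retry=5s:", chosen.validate())

	swapped := leaseTiming{20 * time.Second, 30 * time.Second, 5 * time.Second}
	fmt.Println("lease=20s renew=30s retry=5s:", swapped.validate())
}
```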

Uses client-go directly rather than controller-runtime Manager to
minimize churn — controller is currently raw client-go workqueues,
not controller-runtime. Migration to controller-runtime is a separate
much larger workstream and not necessary just for leader election.
Closes the root cause of CLAUDE.md pitfall #14 ("first-request flake
on freshly-deployed verifier"). Previously /readyz returned 200 the
moment config.Load() became non-nil, but routes from the ServiceOffer
informer load later — between those two events the pod is Ready from
kubelet's view, receives Service traffic, and matchPaidRoute returns
"no rule -> 200" for paid routes. The release-smoke flows hide this
behind 12x5s retry loops; the actual fix is to not be Ready until
routes are loaded.

  - Adds routesLoaded atomic.Bool to Verifier.
  - HandleReadyz returns 503 until BOTH config and routes loaded,
    with a body that distinguishes the two cases for kubectl describe
    debuggability.
  - WatchServiceOffers takes an optional onFirstApply callback,
    invoked after the post-WaitForCacheSync refresh succeeds.
  - main.go wires v.MarkRoutesLoaded as the callback for kube source,
    or invokes it directly after NewVerifier for file source (the
    file source has no informer; routes are loaded synchronously).
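
The two-stage gate can be sketched with the standard library alone. `routesLoaded`, `MarkRoutesLoaded`, and `HandleReadyz` are named in this commit; the `configLoaded` flag and the exact response bodies are assumptions made for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Verifier is a reduced stand-in holding only the readiness flags.
type Verifier struct {
	configLoaded atomic.Bool // hypothetical; the real code checks config != nil
	routesLoaded atomic.Bool
}

// MarkRoutesLoaded is the callback fired after the first successful route apply.
func (v *Verifier) MarkRoutesLoaded() { v.routesLoaded.Store(true) }

// HandleReadyz returns 503 until both config and routes are loaded, with a
// body that says which of the two is still missing.
func (v *Verifier) HandleReadyz(w http.ResponseWriter, _ *http.Request) {
	switch {
	case !v.configLoaded.Load():
		http.Error(w, "config not loaded", http.StatusServiceUnavailable)
	case !v.routesLoaded.Load():
		http.Error(w, "routes not loaded", http.StatusServiceUnavailable)
	default:
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
}

// probe simulates a kubelet GET /readyz and returns the status code.
func probe(v *Verifier) int {
	rec := httptest.NewRecorder()
	v.HandleReadyz(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	v := &Verifier{}
	fmt.Println("cold start:", probe(v))
	v.configLoaded.Store(true)
	fmt.Println("config only:", probe(v))
	v.MarkRoutesLoaded()
	fmt.Println("config + routes:", probe(v))
}
```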

Pairs with PR #515 (replicas: 1) — at single replica the rollout
window for this race shrinks from "some scrapes" to "first ~5-10s",
but it's still a bug; this PR closes it.
…kubectl apply

Kills CLAUDE.md pitfall #9 forever. The previous code path had two
problems that compounded:

  1. EnsureVerifier did kubectl apply of embed.FS x402.yaml directly,
     overwriting whatever helmfile had installed. Under
     OBOL_DEVELOPMENT=true, this stripped local-build image pins back
     to registry-pinned digests — silently bypassing every dev edit
     to the verifier.

  2. To work around (1), setup.go carried a DUPLICATE copy of the
     image-pin rewrite regex from internal/defaults/defaults.go (with
     a code comment confessing "duplicated here to avoid an import
     cycle"). Every fix to the regex (e.g. pitfall #12's alternation-
     order fix) had to be applied in two places — which is exactly
     the kind of footgun that produces silent bypasses.

Now EnsureVerifier shells out to helmfile --selector name=base sync
against the helmfile state already used by obol stack up. Since
helmfile reads the manifests from \$OBOL_CONFIG_DIR/defaults/ — which
is populated by defaults.CopyInfrastructure with the canonical regex
already applied — the dev-rewrite happens exactly once, in exactly
one place.

  - Deletes the duplicate devLocallyBuiltImageBases + regex from
    internal/x402/setup.go.
  - EnsureVerifier now: RefreshInfrastructureIfChanged(); helmfile
    sync --selector name=base.
  - Deletes internal/x402/manifest_devmode_test.go — the canonical
    regression test is internal/defaults/defaults_test.go::
    TestCopyInfrastructure_DevModeRewritesDigestPins which still
    guards the rewrite at its single source.
  - Adds a structural test (setup_structure_test.go) asserting
    setup.go does not import the regexp package, making
    re-introduction of the duplicate fail at test time.
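
One way such a structural test could work, sketched with go/parser. The `importsPackage` helper is hypothetical and takes the source as a string so the example is self-contained; the real test presumably reads internal/x402/setup.go from disk.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
)

// importsPackage reports whether the Go source imports the given path.
// Only the import block is parsed, so the rest of the file need not compile.
func importsPackage(src, path string) (bool, error) {
	f, err := parser.ParseFile(token.NewFileSet(), "setup.go", src, parser.ImportsOnly)
	if err != nil {
		return false, err
	}
	for _, imp := range f.Imports {
		p, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return false, err
		}
		if p == path {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	clean := "package x402\n\nimport (\n\t\"os/exec\"\n)\n"
	regressed := "package x402\n\nimport (\n\t\"os/exec\"\n\t\"regexp\"\n)\n"

	for name, src := range map[string]string{"clean": clean, "regressed": regressed} {
		found, err := importsPackage(src, "regexp")
		fmt.Println(name, "imports regexp:", found, err)
	}
}
```

Checking the import list is a coarse guard: it blocks any use of regexp in the file, not only the duplicated image-pin pattern, which matches the intent stated above.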

The duplicate-regex footgun is now structurally impossible to
re-introduce.
…loads

Brings every embedded Deployment shipped by obol-stack up to PSS Restricted:
  - runAsNonRoot: true with fixed non-zero UID/GID (65532)
  - allowPrivilegeEscalation: false
  - capabilities.drop: [ALL]
  - seccompProfile: RuntimeDefault
  - readOnlyRootFilesystem: true (with named emptyDir mounts where Python
    needs writeable /tmp and HOME/.cache)

PSS labels (enforce=restricted, audit/warn=restricted) added to the x402
and llm namespaces so future Deployment edits that omit per-pod
securityContext are rejected at admission.

Also switches the serviceoffer-controller Dockerfile from
gcr.io/distroless/static-debian12 (UID 0) to ...:nonroot (UID 65532).
Container escape via a Go runtime CVE on a UID-0 / no-seccomp /
no-cap-drop / RW-rootfs container was the easiest path to host pivot
on k3s single-node; this closes it.

Files touched:
  - Dockerfile.serviceoffer-controller (:nonroot base)
  - internal/embed/infrastructure/base/templates/x402.yaml
    (verifier + controller securityContext blocks, x402 ns PSS label)
  - internal/embed/infrastructure/base/templates/llm.yaml
    (litellm + x402-buyer securityContext, litellm-tmp + litellm-home
     emptyDir mounts with HOME/XDG_CACHE_HOME/HF_HOME redirection,
     llm ns PSS label)

Scope notes:
  - local-path-provisioner lives in kube-system (k3d-managed); not
    relabeled per PSS guidance to skip system namespaces.
  - hermes-obol-agent runtime is generated dynamically by
    serviceoffer-controller (internal/serviceoffercontroller/agent_render.go
    and internal/hermes/hermes.go), not from the embedded templates;
    its init-hermes-perms initContainer legitimately runs as UID 0
    for /data chown and is intentionally left out of this PR's scope.
  - cloudflared chart (internal/embed/infrastructure/cloudflared/...)
    is a separate Helm chart and not in this PR's file list.

What may break:
  - LiteLLM with readOnlyRootFilesystem may fail if it writes outside
    /tmp or $HOME — watch the next release-smoke for permission-denied
    errors and add named emptyDir mounts for any new write paths.
Today the x402-buyer sidecar's /state directory is an emptyDir. When
the litellm pod restarts (rollout, OOM, node drain), consumed.json is
gone. The pre-signed auth pool reloads from the ConfigMap the
controller manages, and the buyer treats every auth as unconsumed —
attempting to spend nonces that the facilitator already marked used.

Cascade: facilitator returns 400 "nonce already used" -> buyer 402
back to LiteLLM -> caller retry -> same 400 -> eventually buyer pool
exhausted -> 503 until manual `buy.py process --all` reseeds.

Fix: convert /state to a PVC backed by local-path-provisioner (the
storage class already deployed via base/templates/local-path.yaml).
50Mi request; consumed.json is tiny but room left for log growth.
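
Why persistence matters here can be shown with a small sketch. The on-disk format of consumed.json is not described in this PR, so the JSON object of nonce to bool below is an assumption, as are all helper names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// loadConsumed reads the consumed-nonce set; a missing file means nothing has
// been spent yet (or, on an emptyDir, that the state was lost).
func loadConsumed(path string) (map[string]bool, error) {
	raw, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return map[string]bool{}, nil
	}
	if err != nil {
		return nil, err
	}
	consumed := map[string]bool{}
	if err := json.Unmarshal(raw, &consumed); err != nil {
		return nil, err
	}
	return consumed, nil
}

// saveConsumed writes the set via a temp file and rename so a crash cannot
// leave a half-written state file behind.
func saveConsumed(path string, consumed map[string]bool) error {
	raw, err := json.Marshal(consumed)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, raw, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// nextUnconsumed returns the first pre-signed auth that has not been spent.
func nextUnconsumed(pool []string, consumed map[string]bool) (string, bool) {
	for _, nonce := range pool {
		if !consumed[nonce] {
			return nonce, true
		}
	}
	return "", false
}

// simulateRestart spends one auth, persists, reloads as a fresh process
// would, and returns the auth selected after the restart.
func simulateRestart(pool []string) (string, error) {
	dir, err := os.MkdirTemp("", "x402-state")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "consumed.json")

	consumed, err := loadConsumed(path)
	if err != nil {
		return "", err
	}
	if first, ok := nextUnconsumed(pool, consumed); ok {
		consumed[first] = true
	}
	if err := saveConsumed(path, consumed); err != nil {
		return "", err
	}
	reloaded, err := loadConsumed(path)
	if err != nil {
		return "", err
	}
	next, _ := nextUnconsumed(pool, reloaded)
	return next, nil
}

func main() {
	next, err := simulateRestart([]string{"nonce-a", "nonce-b"})
	fmt.Println("after restart the buyer picks:", next, err)
}
```

With the state file surviving the restart, the buyer moves on to the second auth instead of replaying the first one against the facilitator.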

Deployment strategy switched to Recreate because a RWO PVC can't
be co-mounted during a RollingUpdate surge. Litellm is replicas: 1
so this just means rollouts have a ~5s gap instead of an overlap —
acceptable.

What this does NOT solve:
  - Multi-replica litellm. RWO PVC works only for replicas: 1; would
    need RWX (which local-path doesn't support — needs NFS/Longhorn)
    or per-replica state via StatefulSet. Out of scope; litellm has
    no current scaling need.
  - Hard node loss. local-path PVCs are node-local; if the k3d node
    is destroyed, state is gone (along with the rest of the cluster).
    For a local-only operator, that's the expected blast radius.

PSS compatibility note: the PVC mount works under PSS Restricted as
long as the buyer container runs with appropriate fsGroup. PR #12
(Restricted PSS sweep) handles that separately and will verify mount
permissions when it lands.
The infrastructure helmfile shipped 6 `bedag/raw` releases — a wrapper
chart whose only job is to apply inline YAML through helmfile. With
the `base` release already rendering every other YAML in
`base/templates/`, the inline approach has zero remaining
justification. This PR finishes the job by relocating all 6:

  - llm-buyer-podmonitor    → base/templates/llm.yaml (appended)
  - erpc-httproute          → base/templates/erpc.yaml (new file)
  - erpc-x402-middleware    → base/templates/erpc.yaml
  - erpc-metadata           → base/templates/erpc.yaml
  - obol-frontend-httproute → base/templates/obol-frontend.yaml (new file)
  - obol-frontend-rbac      → base/templates/obol-frontend.yaml

Net change to the rendering: zero. Same YAML, just sourced from the
chart's templates directory instead of inlined in helmfile.yaml
through the bedag/raw wrapper chart. Each relocated YAML carries a
provenance comment.

DAG: `base` now `needs: [traefik/traefik, monitoring/monitoring]` so
the Traefik (Middleware) / Gateway API (HTTPRoute) / Prometheus
operator (PodMonitor) CRDs are guaranteed present before the
relocated templates apply. New Namespace docs for `erpc` and
`obol-frontend` make the `base` release self-contained — the
upstream chart releases that originally created those namespaces
still set `createNamespace: true`, which is a no-op against an
existing namespace.

The `bedag` repository entry is removed (no infrastructure release
uses it anymore). Network helmfiles + hermes still use bedag/raw —
out of scope for this PR.

`migrateDefaultsHTTPRouteHostnames` in internal/stack/stack.go
targets the old in-helmfile HTTPRoute indentation pattern; it is a
no-op against the relocated templates and against the new helmfile,
preserved unchanged for users upgrading from older stacks. The
`hostnames: ["obol.stack"]` restriction is preserved on every
relocated HTTPRoute per CLAUDE.md guidance — removing it would
expose the frontend / eRPC to the public cloudflared tunnel.

`TestHelmfile_IncludesBuyerPodMonitor` rewired to read
`base/templates/llm.yaml`. All embed CRD tests, stack tests, and
go build are green.
…tches

Today HandleVerify returns 200 whenever matchPaidRoute returns nil.
Combined with Traefik ForwardAuth's "200 = allow" semantics, this
means a misconfigured Middleware on a paid route OR a code bug where
the route was supposed to match but didn't silently makes the route
FREE — revenue loss with no signal.

  - Adds paidPrefixes atomic.Pointer[[]string] to Verifier.
  - Verifier.load() derives prefixes from cfg.Routes patterns:
    "/services/foo/*" -> "/services/foo/" (trailing slash kept so
    HasPrefix doesn't false-match /services/foobar/).
  - HandleVerify: when matchPaidRoute returns nil, check if URI is under
    any tracked prefix. If yes -> 403. If no -> 200 (legitimately free).
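
The prefix derivation and the fail-closed decision can be sketched as below. The function names are hypothetical, and the sketch only handles patterns ending in "/*", which is the one shape this commit describes.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// paidPrefixes turns route patterns such as "/services/foo/*" into prefixes
// such as "/services/foo/". The trailing slash is kept on purpose.
func paidPrefixes(patterns []string) []string {
	var out []string
	for _, p := range patterns {
		if strings.HasSuffix(p, "/*") {
			out = append(out, strings.TrimSuffix(p, "*"))
		}
	}
	return out
}

// statusForUnmatched decides the response for a request that matched no rule:
// 403 if it sits under a known paid prefix, 200 if it is legitimately free.
func statusForUnmatched(uri string, prefixes []string) int {
	for _, p := range prefixes {
		if strings.HasPrefix(uri, p) {
			return http.StatusForbidden
		}
	}
	return http.StatusOK
}

func main() {
	prefixes := paidPrefixes([]string{"/services/foo/*"})
	for _, uri := range []string{"/services/foo/v1/chat", "/services/foobar/v1/chat", "/healthz"} {
		fmt.Println(uri, "->", statusForUnmatched(uri, prefixes))
	}
}
```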

Complementary to PR #519 (gate /readyz on informer sync):
  - PR #519 ensures the pod isn't Ready until routes are loaded
    (closes the bootstrap-window leak).
  - This PR ensures that after Ready, any prefix the verifier KNOWS
    about that doesn't have a matching rule is fail-closed
    (closes the steady-state-bug leak).

Together they cover the "rule should match but doesn't" gap.
Closes the entire class of "CRD YAML and Go struct drifted" bugs.
PurchaseAutoRefill.MaxTotal was the most recent instance — it existed
in purchaserequest-crd.yaml for months while internal/monetizeapi/
types.go didn't have the corresponding field. Without this commit,
that pattern recurs by design: two sources of truth, one hand-
maintained, no enforcement of agreement.

Now Go is the single source of truth:
  - kubebuilder markers on every CRD-backed struct in types.go
    (validation, required, enum, pattern, printer columns, subresources)
  - `just generate` regenerates *-crd.yaml from those markers
    + zz_generated.deepcopy.go from object:generate=true
  - CI fails if `git status` is non-empty after `just generate` runs

This commit also fixes the documented MaxTotal / MaxSpendPerDay drift
by adding both fields to PurchaseAutoRefill — the generated CRD now
matches the prior hand-written one and the controller can read them.

Pinned controller-tools at v0.16.5 in tools/tools.go (compatible with
client-go v0.34.x; a newer release would force prometheus/common
through a panicking validation-scheme change). Generation is
deterministic; running locally produces no diff after a clean
checkout.

For future CRD edits:
  1. Edit types.go (add/change a field, update markers)
  2. `just generate`
  3. Commit both the Go and YAML diffs
  4. CI verifies the YAML was committed

PreSignedAuth.Payment is map[string]interface{} (opaque x402
payload), which controller-gen cannot deep-copy automatically; a
hand-written DeepCopy lives in deepcopy_manual.go and the type is
flagged object:generate=false.

The hack/boilerplate.go.txt file is force-added past *.txt gitignore;
it's an empty marker for now — add a copyright header later if the
repo settles on one.
PrometheusRule annotations use {{ $labels.X }} which Prometheus
evaluates at alert-firing time. When this file is rendered through
Helm (via chart: ./base in helmfile.yaml), Helm's Go-template engine
tries to evaluate $labels at chart-render time and fails with:

  Error: UPGRADE FAILED: parse error at
  (base-infra/templates/x402-prometheus-rules.yaml:N):
  undefined variable "$labels"

Wrap each templated brace pair as {{ "{{" }}...{{ "}}" }} so Helm
emits literal Prometheus template syntax verbatim into the YAML
output, where Prometheus picks it up at alert-eval time.
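
Helm renders charts with Go's text/template engine, so the failure and the fix can be reproduced with the standard library. This is a sketch: the `summary` annotation text is made up for the example, and Helm adds its own functions and values on top of what is shown here.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render parses and executes src the way a chart template would be rendered,
// minus Helm's extra functions and values.
func render(src string) (string, error) {
	t, err := template.New("x402-prometheus-rules.yaml").Parse(src)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, nil); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Unescaped: the template engine tries to resolve $labels itself.
	_, err := render(`summary: "{{ $labels.offer_name }} failing"`)
	fmt.Println("unescaped error:", err)

	// Escaped: each brace pair is emitted as a literal string.
	out, err := render(`summary: "{{ "{{" }} $labels.offer_name {{ "}}" }} failing"`)
	fmt.Println("escaped output:", out, err)
}
```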

Bug surfaced by integration-branch full stack-up; not caught by
`go test ./...` (unit tests don't render Helm) nor by the agent
worktree validation (which only checked Go-side compilation).

Recommend adding a CI smoke that pipes embedded *.yaml templates
through `helm template ./base` to catch this class going forward.

Stacks on PR #513 (which introduced the file in commit 27e1ac5).
PR #523 moved 6 bedag/raw helmfile releases into the base chart so
there's one source of truth for what ships in each namespace. Fresh
installs work. EXISTING clusters being upgraded from pre-#523
obol-stack fail at `helm upgrade base` with:

  Error: UPGRADE FAILED: <resource> exists and cannot be imported
  into the current release: invalid ownership metadata; annotation
  validation error: key "meta.helm.sh/release-name" must equal "base"

This blocks `obol stack up` until the operator manually re-annotates
~10 resources (Namespaces, HTTPRoutes, Middlewares, ConfigMaps,
PrometheusRule, PodMonitor, ClusterRole/Binding).

Adds hack/migrate-bedag-raw-to-base.sh which finds all such orphans
and re-annotates them in bulk. Idempotent — safe to re-run.

Surfaced by the 14-PR integration test campaign; see
plans/integration-test-results-final-20260524.md Bug #2.
…oads

PR #521 enforces Restricted Pod Security Standard on x402 + llm
namespaces. The controller renders two httpd-based Deployments
(obol-skill-md publisher + agentidentity-default-registration well-
known/agent-registration.json publisher) without securityContext,
so PSS admission rejects them and they never start. Result:
marketplace API returns STACK_UNREACHABLE because skill-md isn't
reachable.

Adds Restricted-compliant securityContext to both renderers:
  pod:        runAsNonRoot, runAsUser=1000, runAsGroup=1000,
              seccompProfile=RuntimeDefault, fsGroup=1000
  container:  allowPrivilegeEscalation=false, drop ALL capabilities

Both Deployments already bind httpd to 8080, which is non-root
safe, so no port change is required.

Surfaced by the 14-PR integration test campaign. The integration
test workaround patched the running Deployments manually:
plans/integration-test-results-final-20260524.md Bug #3.
Recording rule was sum(counter), which is wrong for any metric where
the counter resets across pod restarts — Prometheus counters are
per-process by design. The TSDB is the canonical persistence layer;
rate() and increase() perform reset detection at query time across
the samples the TSDB holds.

  - Renames the rule to x402:revenue:7d_by_offer (name matches what
    it returns; the old "lifetime" / "total_by_offer_current" names
    were aspirational against a finite retention window).
  - Expression: sum by (offer_namespace, offer_name) (
      increase(obol_x402_verifier_charged_requests_total[7d])
    )
  - 7d inside 8d retention gives 1-day headroom so reset detection
    has both-side samples at the window's left edge.
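
The difference between reading the raw counter and a reset-aware increase can be illustrated with a simplified sketch. It sums deltas between adjacent samples and treats any drop as a restart; real `increase()` additionally extrapolates toward the window boundaries, which is left out here.

```go
package main

import "fmt"

// lastValue is what a sum(counter) style rule effectively reports for one
// series: whatever the process has counted since it last started.
func lastValue(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	return samples[len(samples)-1]
}

// increaseWithResets sums the growth between adjacent samples. A drop means
// the counter restarted from zero, so the post-reset value counts in full.
func increaseWithResets(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			delta = samples[i]
		}
		total += delta
	}
	return total
}

func main() {
	// Counter samples across a window; the pod restarts between 5 and 1.
	samples := []float64{0, 3, 5, 1, 4}
	fmt.Println("raw counter:", lastValue(samples))
	fmt.Println("reset-aware increase:", increaseWithResets(samples))
}
```

In this example the raw counter shows 4 while the reset-aware sum shows 9, which is the undercount the renamed rule avoids.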

Per Robust Perception's "avoiding the counter-reset undercount"
canonical guidance. Zero new components — uses only native Prometheus
+ recording-rule primitives.

Found by the 14-PR integration test (plans/integration-test-results-
final-20260524.md). The OBOL parity smoke surfaced it more visibly
when a verifier restart produced a "0 req·24h" UI display on a row
with real on-chain traffic.

Stacks on PR #527 (Helm-escape fix for the same file).
Currently the verifier emits (offer_namespace, offer_name, chain).
Answering "what's my OBOL revenue?" requires joining metrics with
the ServiceOffer CR's spec.payment.asset.symbol at the frontend.
With asset_symbol on the label set, the answer is a direct PromQL
aggregation.

Cardinality cost: zero. Each offer pins exactly one asset (A=1
per offer), so the new dimension is functionally constant within
the existing (ns, name) group — no series multiplication. The
"don't label what you can derive" guidance exists to prevent
*multiplicative* blowups (chain x pod x pod_owner style); the
single-asset-per-offer invariant means there's no multiplication
to prevent.

The argument for adding asset_symbol is identical to the argument
that already justifies `chain` on these vecs: both are
CR-derived, both are query-meaningful, both have bounded values.

Changes:
  - 6 metric vecs: label slice gains "asset_symbol"
  - pruneSeriesNotIn key now (ns, name, chain, asset_symbol) so
    asset-repin doesn't leak the old series
  - verifier.load() live-set built with the same 4-tuple
  - prometheusLabels() emits rule.AssetSymbol (or "unknown" if
    empty as defensive fallback)
  - New _asset_symbol-suffixed recording rules added side-by-side
    with existing rules; existing rules unchanged (non-breaking)
  - Tests: emission asserts asset_symbol; prune test asserts
    asset-repin doesn't leak
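
A minimal sketch of the label emission with the defensive fallback. `prometheusLabels`, `rule.AssetSymbol`, and `RouteRule.Network` are named in these commits; the `OfferNamespace` and `OfferName` field names are assumptions.

```go
package main

import "fmt"

// RouteRule is a reduced, partly hypothetical view of the verifier's rule.
type RouteRule struct {
	OfferNamespace string // assumed field name
	OfferName      string // assumed field name
	Network        string
	AssetSymbol    string
}

// prometheusLabels builds the 4-tuple label set. An empty asset symbol is
// reported as "unknown" so the series key stays well-formed.
func prometheusLabels(rule RouteRule) map[string]string {
	symbol := rule.AssetSymbol
	if symbol == "" {
		symbol = "unknown"
	}
	return map[string]string{
		"offer_namespace": rule.OfferNamespace,
		"offer_name":      rule.OfferName,
		"chain":           rule.Network,
		"asset_symbol":    symbol,
	}
}

func main() {
	withAsset := RouteRule{"default", "demo-hello", "eip155:84532", "OBOL"}
	withoutAsset := RouteRule{"default", "demo-hello", "eip155:84532", ""}
	fmt.Println(prometheusLabels(withAsset))
	fmt.Println(prometheusLabels(withoutAsset))
}
```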

Frontend can simplify the existing metric x CR join in a future PR
once it migrates to the _asset_symbol-suffixed rule.

Findings from: plans/integration-test-L7-paid-flow-20260524.md
(OBOL parity smoke surfaced this as a real gap when validating
the WalletStrip / EarningsStrip per-token columns).
…ting low-traffic alerts

X402PaymentFailureRateHigh and the settlement_rate recording rule
used clamp_min(denominator, 1) as a div-by-zero guard. For paid
endpoints under light load (below 1 req/s), the clamp substitutes
1.0 for the true denominator, so the computed ratio collapses to the
raw numerator rate and stays near zero even when 50%+ of requests
are failing. The alert never fires.

Switch the floor to 1e-9. Epsilon prevents division-by-zero while
keeping the actual ratio accurate at any non-zero traffic level.
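The effect of the floor can be worked through numerically. The sketch below mimics `numerator / clamp_min(denominator, floor)` in Go; the traffic figures are invented for illustration and `ratio` is not a function from this repository.

```go
package main

import (
	"fmt"
	"math"
)

// ratio mimics the PromQL expression numerator / clamp_min(denominator, floor).
func ratio(numerator, denominator, floor float64) float64 {
	return numerator / math.Max(denominator, floor)
}

func main() {
	// Hypothetical light load: 0.10 req/s total, half of them failing.
	failed, total := 0.05, 0.10

	fmt.Printf("floor=1:    %.2f\n", ratio(failed, total, 1))
	fmt.Printf("floor=1e-9: %.2f\n", ratio(failed, total, 1e-9))
	// With no traffic at all the numerator is also zero, so the guard
	// still yields a finite result instead of NaN.
	fmt.Printf("no traffic: %.2f\n", ratio(0, 0, 1e-9))
}
```

With the old floor the half-failing endpoint reports a ratio of about 0.05, well under any sensible alert threshold. With the epsilon floor it reports about 0.5.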

Surfaced by Expert #2 review of the PromQL design
(plans/integration-test-L7-paid-flow-20260524.md follow-ups).

Stacks on PR #531 (asset_symbol label) which is the tip of the
rules-file chain. Will rebase onto main as the chain merges.

PR #527 fixed an unescaped {{ $labels }} in a PrometheusRule
annotation that broke `helm upgrade base` on every `obol stack up`.
The bug shipped to integration testing because go test ./...
doesn't exercise Helm rendering.

This job pipes the embedded base chart through `helm template`
on every PR; parse errors fail the build before merge.

  - Runs against ./internal/embed/infrastructure/base
  - Uses helm v3.20.1 (matches obolup.sh pinned version)
  - Also runs `helm lint` for chart-structure issues
  - Substitutes {{OLLAMA_HOST_IP}}/{{CLUSTER_ID}} stubs in a temp
    copy of the chart (mirroring what `obol stack init` does via
    internal/defaults/defaults.go::InfrastructureReplacements)
  - Future: pair with a helmfile-lint job for state-value tests
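The stub-substitution step in the list above could look roughly like this. It is a sketch under assumptions: `substituteStubs` and the replacement values are hypothetical, and the real mapping lives in InfrastructureReplacements, which is not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// substituteStubs replaces {{KEY}} placeholders with stand-in values so the
// chart can be rendered by `helm template` without a real stack.
// Helm expressions such as {{ .Values.x }} do not match and are left alone.
func substituteStubs(chart string, values map[string]string) string {
	pairs := make([]string, 0, len(values)*2)
	for k, v := range values {
		pairs = append(pairs, "{{"+k+"}}", v)
	}
	return strings.NewReplacer(pairs...).Replace(chart)
}

func main() {
	chart := "ollamaHost: {{OLLAMA_HOST_IP}}\n" +
		"clusterID: {{CLUSTER_ID}}\n" +
		"replicas: {{ .Values.replicas }}\n"

	// Stand-in values for CI only; any syntactically valid value works.
	stubs := map[string]string{
		"OLLAMA_HOST_IP": "127.0.0.1",
		"CLUSTER_ID":     "smoke-test",
	}
	fmt.Print(substituteStubs(chart, stubs))
}
```

The substituted copy would then be written to a temp directory and passed to `helm template` and `helm lint`, as the job description says.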

If we ever land a chart-template change that this doesn't catch,
expand the helm-template invocation with --set values mimicking
what `obol stack up` provides.

After the OBOL parity smoke + Prometheus expert review, we made
explicit design choices worth recording so they don't get
re-litigated:

  1. Counters are intentionally per-process — Prometheus design.
     Pod restarts reset them; rate()/increase() handle this at
     query time via the TSDB's reset detection. Don't add
     persistence to the counter itself.

  2. Prometheus = recent operational telemetry (bounded by retention).
     On-chain settlement TXs = canonical lifetime financial record.

  3. Recording rules use the convention <level>:<metric>:<operations>;
     name the window (7d_by_offer, not lifetime_by_offer).

  4. Add labels you'd query by directly (chain, asset_symbol —
     both CR-derived, both query-meaningful, both bounded).

  5. div-by-zero guards use epsilon (1e-9), not 1.0.

  6. CRD versioning stance: stay on v1alpha1 during active dev;
     the alpha promise IS "no compat". Graduate only when an
     external operator commits to depending on the schema.

The PVC-backed counter persistence option was considered and
rejected for our single-operator local-k3d use case. The doc
walks through why, what would change that decision, and where
the canonical "lifetime" answer comes from.
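Item 1 relies on reset detection at query time. The sketch below shows the core idea in simplified form: any drop between consecutive samples is treated as a restart from zero. It is an approximation for illustration only; Prometheus's actual increase() also extrapolates to the window boundaries, which is omitted here.

```go
package main

import "fmt"

// increase returns the total growth across samples of a counter,
// treating any decrease as a counter reset (process restart).
func increase(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			// The counter went backwards, so the process restarted from
			// zero and the whole new value counts as growth.
			delta = samples[i]
		}
		total += delta
	}
	return total
}

func main() {
	// Hypothetical scrape values; the pod restarts between 20 and 3.
	samples := []float64{10, 15, 20, 3, 8}
	fmt.Println(increase(samples)) // 5 + 5 + 3 + 5
}
```

Because the reset is absorbed when the query runs, persisting the raw counter across restarts would add state without changing what rate() or increase() report inside the retention window.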

Adds CLAUDE.md pointer so future contributors land here first.

The legacy obol.org/paused annotation tore down HTTPRoutes immediately,
which is indistinguishable from a crash to remote x402 buyers and ERC-8004
reputation scorers. obol sell stop was also broken: it patched
status.conditions which the controller immediately overwrote.

This replaces both with a real drain:

- New ServiceOffer spec.drainAt (date-time) + spec.drainGracePeriod
  (duration; default 1h) mark an offer as winding down.
- While draining, /skill.md and /.well-known/agent-registration.json
  advertise the offer with available=false and drainEndsAt set, so
  external discovery can react before traffic disappears.
- The HTTPRoute + payment gate stay up until DrainEndsAt, letting
  in-flight buyers complete payments.
- After the grace period, the controller tears down the route, sets
  Draining=False reason=Drained, and leaves the CR (delete is the
  canonical removal command).

obol sell stop sets spec.drainAt, supports --grace <duration> and
--force/--now (zero grace = abrupt teardown for behavior parity with
the old annotation).
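The drain timeline above can be sketched as a small state function. The `offerSpec` struct, `phase` names, and `drainPhase` are hypothetical stand-ins for the controller logic. Treating a drainAt in the future as still active is an assumption, since the description does not say how that case is handled.

```go
package main

import (
	"fmt"
	"time"
)

type phase string

const (
	phaseActive   phase = "Active"
	phaseDraining phase = "Draining"
	phaseDrained  phase = "Drained"
)

// defaultGrace matches the 1h default stated for spec.drainGracePeriod.
const defaultGrace = time.Hour

// offerSpec models only the two drain fields (hypothetical Go shape).
type offerSpec struct {
	DrainAt          *time.Time     // nil means the offer is not draining
	DrainGracePeriod *time.Duration // nil means use defaultGrace
}

// drainPhase reports the phase at time now and the computed drainEndsAt.
func drainPhase(spec offerSpec, now time.Time) (phase, time.Time) {
	if spec.DrainAt == nil || now.Before(*spec.DrainAt) {
		return phaseActive, time.Time{} // assumption: future drainAt is still active
	}
	grace := defaultGrace
	if spec.DrainGracePeriod != nil {
		grace = *spec.DrainGracePeriod
	}
	endsAt := spec.DrainAt.Add(grace)
	if now.Before(endsAt) {
		return phaseDraining, endsAt // route and payment gate stay up
	}
	return phaseDrained, endsAt // route torn down, CR left in place
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	drainAt := mustParse("2026-05-24T12:00:00Z")
	now := mustParse("2026-05-24T12:30:00Z")

	p, endsAt := drainPhase(offerSpec{DrainAt: &drainAt}, now)
	fmt.Println(p, endsAt.Format(time.RFC3339))

	zero := time.Duration(0) // models --force / --now
	p, _ = drainPhase(offerSpec{DrainAt: &drainAt, DrainGracePeriod: &zero}, now)
	fmt.Println(p)
}
```

A zero grace period makes drainEndsAt equal to drainAt, so the offer moves straight to the drained phase. That is how the sketch models the abrupt-teardown parity with the old annotation.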
Comment thread .github/workflows/helm-template-smoke.yml Fixed
Comment thread .github/workflows/lint-test.yaml Fixed