ObolNetwork · bussyjd · May 24, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -342,6 +342,8 @@ A registry digest pin instead of `:latest` on the verifier means your dev rewrit
 
 For a fuller debug catalog with symptom→fix mapping, see `.agents/skills/obol-stack-dev/references/release-smoke-debugging.md`.
 
+For observability architecture decisions (Prometheus retention vs. on-chain canonical record, counter-reset semantics, recording-rule naming, label conventions, CRD versioning stance, `clamp_min` epsilon), see `docs/observability.md` — read this before adding a new metric, recording rule, or proposing counter persistence.
+
 ### Security: Tunnel Exposure
 
 The Cloudflare tunnel exposes the cluster to the public internet. Only x402-gated endpoints and discovery metadata should be reachable via the tunnel hostname. Internal services (frontend, eRPC, LiteLLM, monitoring) MUST have `hostnames: ["obol.stack"]` on their HTTPRoutes to restrict them to local access.

diff --git a/docs/observability.md b/docs/observability.md
@@ -0,0 +1,369 @@
+# Observability architecture
+
+Operator-facing reference for how Obol Stack records, queries, and reasons about
+payment-flow telemetry. Read this before adding a new metric, a new recording
+rule, or proposing "let's persist the counter to a PVC."
+
+## TL;DR
+
+- **Prometheus is for recent operational telemetry**, bounded by TSDB retention
+  (currently 8d in our cluster values).
+- **On-chain settlement TXs are the canonical record for lifetime financial
+  state.** Every settled x402 payment leaves an immutable on-chain trace via
+  `X-PAYMENT-RESPONSE` (settle tx hash, asset, amount, payer, payee).
+- **Counters reset on every pod restart. That is intentional.** Prometheus
+  counters are per-process by design. Use `increase()` / `rate()` at query
+  time — they detect resets in the TSDB and stitch ranges back together.
+- Recording rules use `<level>:<metric>:<operations>` (Prometheus convention)
+  and **name the window** in the rule (`7d_by_offer`, not `lifetime_by_offer`).
+- Div-by-zero guards use a small epsilon (`1e-9`), **never `1.0`**.
+- CRDs stay on `v1alpha1` during active dev — the alpha promise IS "no compat",
+  and we have no external operators yet.
+
+If you find yourself asking "how do we compute lifetime revenue for offer X
+since the project started," the answer is **not** a recording rule — it is a
+chain indexer over settle TXs.
+
+---
+
+## Why counters reset (and why that's fine)
+
+Prometheus counters are stored per-process. When a verifier pod restarts (rollout,
+node drain, OOM, image bump), the in-memory counter goes back to zero. This is
+**not** a bug to engineer around at write time. The Prometheus query engine
+already knows about it:
+
+- `rate(counter[5m])` and `increase(counter[5m])` perform **reset detection**:
+  if the last sample is less than the previous sample inside the window, the
+  engine assumes a reset and stitches the two ranges together rather than
+  emitting a negative delta.
+- This is the well-documented "counter reset semantics." See Robust Perception:
+  *Avoiding the counter-reset undercount* — the canonical writeup of why you
+  must always range-query counters rather than `sum()`'ing them raw.
+
+The corollary is the rule that bit us in PR #530:
+
+> Never write a recording rule of the form `sum(my_counter_total) by (...)`.
+> Always write `sum(increase(my_counter_total[<window>])) by (...)`.
+
+`sum(counter)` collapses to "whatever value the live samples currently hold,"
+which means **every pod restart silently zeros the recorded series**. The
+expert review caught a recording rule shipped in that exact broken form;
+PR #530 swapped it to `increase()` over an explicit window.
+
+---
+
+## The thin-layer architecture
+
+```
+                +------------------------------+
+                | x402-verifier (stateless)    |
+                |   - in-memory counters       |
+                |   - labels: route,           |
+                |     offer_namespace,         |
+                |     offer_name, chain,       |
+                |     asset_symbol             |
+                +---------------+--------------+
+                                |
+                                | /metrics scrape (Prometheus)
+                                v
+                +------------------------------+
+                | Prometheus TSDB (retention)  |
+                |   - 8d rolling window        |
+                |   - reset detection built in |
+                +---------------+--------------+
+                                |
+                                | recording rules with increase()
+                                v
+                +------------------------------+
+                | Pre-aggregated series        |
+                |   offer:x402_revenue:7d_by_offer
+                |   offer:x402_paid_requests:7d_by_offer
+                +---------------+--------------+
+                                |
+                                | PromQL queries
+                                v
+                +------------------------------+
+                | Frontend / dashboards        |
+                |   - reads pre-aggregated     |
+                |   - cheap, scoped to window  |
+                +------------------------------+
+
+
+   Parallel canonical path (for lifetime financial truth):
+
+                +------------------------------+
+                | x402-buyer / facilitator     |
+                +---------------+--------------+
+                                |
+                                | settle tx (on-chain)
+                                v
+                +------------------------------+
+                | Base / Base Sepolia          |
+                |   ERC-20 Transfer events     |
+                |   X-PAYMENT-RESPONSE header  |
+                |   carries settle tx hash     |
+                +------------------------------+
+                                |
+                                | chain indexer / explorer
+                                v
+                +------------------------------+
+                | Lifetime per-offer revenue   |
+                | "since first deploy" answer  |
+                +------------------------------+
+```
+
+The two paths answer **different questions**:
+
+- Prometheus answers "what is the system doing in the last N hours/days?" with
+  cheap, second-resolution queries and label-faceting.
+- On-chain answers "what was every payment that ever settled for offer X?" with
+  immutability and full historical depth, at the cost of being slower and
+  requiring an indexer.
+
+Mixing them is a category error. Don't try to make Prometheus answer the
+lifetime question, and don't try to make the chain answer "what is the current
+402-rate this minute?"
+
+---
+
+## When NOT to add persistence to the counter itself
+
+Three options come up repeatedly in design discussions. We rejected all three
+for the current use case:
+
+### PVC-backed verifier state
+**Why it's tempting**: counters survive restart, no `increase()` gymnastics.
+
+**Why we rejected it**: it bolts a stateful primitive onto a stateless
+component. `x402-verifier` is currently safe to scale, rollout, evict, and
+re-image freely. A PVC turns every restart into a sequence-recovery problem
+(double-counting on a torn write, undercounting on a crash before flush).
+Prometheus already solves reset detection correctly; we'd be reimplementing
+it badly and introducing a new failure mode.
+
+### Pushgateway
+**Why it's tempting**: "decouple short-lived job state from scrape."
+
+**Why we rejected it**: Pushgateway is for batch-job final values, not for
+long-running services. Using it for a live verifier inverts the ownership
+model (Pushgateway becomes the source of truth, verifier becomes a writer),
+loses per-pod identity, and adds a single-point-of-failure that, if it
+restarts, **also** zeros the counter — without `rate()` knowing about it.
+
+### OTel collector with `cumulativetodelta`
+**Why it's tempting**: collector-side reset stitching, hand off deltas to a
+downstream store.
+
+**Why we rejected it**: it solves a problem we don't have (we're not sending
+deltas to a backend that needs them), at the cost of a new infrastructure
+component to operate. For a single-operator local-k3d stack, this is over-
+engineering. If we ever export to an OTel-native backend, revisit.
+
+---
+
+## When you WOULD want persistence
+
+The only legitimate driver is an explicit **billing or compliance requirement
+to report "totals since first deploy" that exceeds Prometheus retention.**
+
+We do not have this requirement today. If we ever do:
+
+1. **Derive it from on-chain TXs**, not from metrics. Every paid request leaves
+   an `X-PAYMENT-RESPONSE` with a settle tx hash; an indexer over those is the
+   canonical answer.
+2. Only fall back to a persisted counter if for some reason the chain trace is
+   unavailable for the offer in question — and even then, treat the indexed
+   chain data as the source of truth and the counter as a soft mirror.
+
+The architecture review's framing was right: if you find yourself wanting
+Prometheus to answer a lifetime question, you've picked the wrong tool.
+
+---
+
+## Recording rule conventions
+
+Naming follows the standard Prometheus pattern:
+
+```
+<level>:<metric>:<operations>
+```
+
+Examples we ship:
+
+- `offer:x402_revenue:7d_by_offer` — revenue aggregated to the `offer` level,
+  base metric is `x402_revenue`, operation is `increase` over `7d` grouped
+  `by_offer`.
+- `offer:x402_paid_requests:7d_by_offer` — same shape for paid request count.
+
+Rules:
+
+1. **Name the window in the rule.** `7d_by_offer` is honest; `lifetime_by_offer`
+   is a lie (Prometheus has no "lifetime"). The window in the name must match
+   the window in the expression.
+2. **Use `increase()` over an explicit range, not `sum()` of the raw counter.**
+   See PR #530 — the original rule did `sum(by offer) (x402_revenue_total)` and
+   silently zeroed every time the verifier pod restarted. The fixed rule is
+   `sum by (offer_namespace, offer_name) (increase(x402_revenue_total[7d]))`.
+3. **Keep the window aligned with retention.** Recording a `30d` rule with 8d
+   retention is a footgun: the rule sees nulls and silently produces nothing.
+
+---
+
+## Label conventions
+
+Labels are the query interface. The rule of thumb:
+
+- **Add a label if it's an attribute you'd want to facet by directly and the
+  cardinality is bounded.** "Bounded" means you can write down all possible
+  values: chains, asset symbols, offer names. Not user addresses, not request
+  IDs, not arbitrary route paths beyond what the offer CR enumerates.
+- **Don't add a label that multiplies cardinality.** Every unique combination
+  of label values is a separate time series in TSDB. A label that adds 100
+  values multiplies storage by 100×.
+
+Concrete examples:
+
+| Label             | Source           | Why include it                            |
+|-------------------|------------------|-------------------------------------------|
+| `route`           | offer CR pattern | Direct query facet, bounded by # offers   |
+| `offer_namespace` | offer CR meta    | Tenancy facet                             |
+| `offer_name`      | offer CR meta    | Per-offer breakdown                       |
+| `chain`           | offer CR payment | "Revenue by chain" is a real question     |
+| `asset_symbol`    | offer CR payment | Added in PR #531 — per-token facet        |
+
+`chain` and `asset_symbol` are both CR-derived (operator-set, bounded) and
+query-meaningful ("how much USDC vs OBOL did we earn on Base last week?"). They
+both belong. PR #531 added `asset_symbol` for exactly this reason — the prior
+schema collapsed all asset types into one bucket.
+
+Anti-pattern: labeling by `payer_address` or `tx_hash`. Those are unbounded and
+belong on the chain trace, not on the metric.
+
+---
+
+## CRD versioning: stay on `v1alpha1` during active dev
+
+For the current single-operator local-stack development, the alpha-stays-alpha
+approach matches the design intent. Concretely:
+
+- **While in active dev with no external operators, stay on `v1alpha1` and edit
+  the schema in place.** The alpha promise IS "no compat" — that's the whole
+  point of the version channel. Renaming a field, dropping a field, tightening
+  validation: all fair game at `v1alpha1`.
+- **Bump to `v1alpha2` only when** you need both versions to coexist briefly to
+  validate a conversion path (which requires standing up a conversion webhook),
+  or to checkpoint a major redesign you want to land alongside the old shape.
+- **Graduate to `v1beta1` only when** all three are true:
+  1. The schema has been stable for ~2 releases (no breaking edits).
+  2. An external operator has committed to depending on it.
+  3. You're committing to backwards-compat for at least one release, with
+     deprecation warnings for any field you eventually want to remove.
+
+The architecture review surfaced "should we graduate to `v1beta1`?" as a flag.
+That was a "what if we ship externally" hypothetical, not an action item — and
+graduating prematurely locks us into compat overhead before the schema has
+earned it. The current ServiceOffer / RegistrationRequest / PurchaseRequest
+CRDs all stay on `v1alpha1` until the three conditions above hold.
+
+---
+
+## `clamp_min(..., 1)` is an anti-pattern
+
+Div-by-zero guards in PromQL exist because dividing by an empty counter
+produces a `NaN`. The naive fix is:
+
+```promql
+# WRONG
+my_success_rate
+  /
+ignoring(...) clamp_min(my_request_total, 1)
+```
+
+That `1` is poison under low traffic. Suppose the real request rate over 5m is
+3 successful out of 4 total (75%). With `clamp_min(..., 1)` and a window in
+which the counter shows `0` total requests (e.g. between scrapes), the formula
+returns `3/1 = 3.0` — a 300% success rate that breaks any alert downstream of
+it. More commonly: the **denominator is clamped to 1 when it should be e.g.
+0.5**, and your "success rate" reports half its real value, **causing
+low-traffic alerts to under-report and stay silent during exactly the windows
+when traffic is degraded**.
+
+The fix is to use an epsilon that's small enough never to dominate the real
+denominator:
+
+```promql
+# RIGHT
+my_success_rate
+  /
+ignoring(...) clamp_min(my_request_total, 1e-9)
+```
+
+`1e-9` keeps the division finite without distorting the result. Pick `1e-9` (or
+smaller) as the project-wide epsilon and use it consistently. **Never `1.0`,
+never `0.001`, never "a reasonable small number" — pick the smallest value that
+avoids NaN and stick with it.**
+
+This was fixed in the same review pass that produced this doc. Future
+contributors: if you write a guarded division, the epsilon is `1e-9`.
+
+---
+
+## Cross-references
+
+### Code
+
+- `internal/x402/metrics.go` — verifier metric definitions
+  (`obol_x402_verifier_requests_total`, `_payment_required_total`,
+  `_payment_verified_total`, `_payment_failed_total`, `_charged_requests_total`).
+- `internal/x402/verifier.go` — `prometheusLabels()` controls the verifier
+  label set; this is the canonical place to add a new bounded label.
+- `internal/x402/buyer/metrics.go` — buyer-side counters
+  (`payment_attempts`, `payment_success_total`, `payment_failure_total`,
+  `confirm_spend_failure_total`, `payment_unsettled_confirmations`) plus
+  gauges (`auth_remaining`, `auth_spent`, `active_model_mappings`).
+- `internal/x402/buyer/proxy.go` — `prometheusLabels()` for the buyer side.
+
+### Infrastructure
+
+- `internal/embed/infrastructure/values/monitoring.yaml.gotmpl` — Prometheus
+  values, including retention and recording rule wiring.
+- `internal/embed/infrastructure/base/templates/x402.yaml` — verifier
+  Deployment, ServiceMonitor / PodMonitor.
+
+### Pull requests that shaped this
+
+- **PR #527** — `fix(prometheus-rules): escape PromQL $labels for Helm
+  rendering`. Helm was interpreting `$labels` as a Helm template variable and
+  blanking it; the fix is to escape so the literal `$labels` reaches the
+  Prometheus rule engine.
+- **PR #530** — `fix(prometheus-rules): use increase() for the per-offer
+  revenue rule`. The original rule did `sum(counter)`, which silently zeroed
+  on verifier restart. Now uses `sum(increase(counter[7d]))` per the rules
+  above.
+- **PR #531** — `feat(x402-metrics): add asset_symbol label for per-token
+  queries`. Unlocks "USDC vs OBOL revenue by chain" without needing a
+  downstream join.
+
+### Reports
+
+- The OBOL parity integration test report — see `plans/` for the most recent
+  `release-smoke-hardening-*.md` and `post-490-integration-*.md` entries that
+  reference the metric audits behind PRs #527 / #530 / #531.
+
+---
+
+## Quick checklist for the next change
+
+Before opening a PR that touches metrics:
+
+- [ ] New label is bounded and CR-derived (or otherwise enumerable).
+- [ ] No label that could grow unbounded (payer address, tx hash, free-form
+      path beyond CR enumeration).
+- [ ] New recording rule uses `increase()` over an explicit window.
+- [ ] Window in the rule name matches window in the expression
+      (no `lifetime_*`).
+- [ ] Window is within Prometheus retention.
+- [ ] Any guarded division uses `1e-9` as the clamp floor.
+- [ ] If the new metric tries to answer a "lifetime" question, you've stopped
+      and reconsidered using on-chain data instead.