Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -342,6 +342,8 @@ A registry digest pin instead of `:latest` on the verifier means your dev rewrit

For a fuller debug catalog with symptom→fix mapping, see `.agents/skills/obol-stack-dev/references/release-smoke-debugging.md`.

For observability architecture decisions (Prometheus retention vs. on-chain canonical record, counter-reset semantics, recording-rule naming, label conventions, CRD versioning stance, `clamp_min` epsilon), see `docs/observability.md` — read this before adding a new metric, recording rule, or proposing counter persistence.

### Security: Tunnel Exposure

The Cloudflare tunnel exposes the cluster to the public internet. Only x402-gated endpoints and discovery metadata should be reachable via the tunnel hostname. Internal services (frontend, eRPC, LiteLLM, monitoring) MUST have `hostnames: ["obol.stack"]` on their HTTPRoutes to restrict them to local access.
Expand Down
369 changes: 369 additions & 0 deletions docs/observability.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,369 @@
# Observability architecture

Operator-facing reference for how Obol Stack records, queries, and reasons about
payment-flow telemetry. Read this before adding a new metric, a new recording
rule, or proposing "let's persist the counter to a PVC."

## TL;DR

- **Prometheus is for recent operational telemetry**, bounded by TSDB retention
(currently 8d in our cluster values).
- **On-chain settlement TXs are the canonical record for lifetime financial
state.** Every settled x402 payment leaves an immutable on-chain trace via
`X-PAYMENT-RESPONSE` (settle tx hash, asset, amount, payer, payee).
- **Counters reset on every pod restart. That is intentional.** Prometheus
counters are per-process by design. Use `increase()` / `rate()` at query
time — they detect resets in the TSDB and stitch ranges back together.
- Recording rules use `<level>:<metric>:<operations>` (Prometheus convention)
and **name the window** in the rule (`7d_by_offer`, not `lifetime_by_offer`).
- Div-by-zero guards use a small epsilon (`1e-9`), **never `1.0`**.
- CRDs stay on `v1alpha1` during active dev — the alpha promise IS "no compat",
and we have no external operators yet.

If you find yourself asking "how do we compute lifetime revenue for offer X
since the project started," the answer is **not** a recording rule — it is a
chain indexer over settle TXs.

---

## Why counters reset (and why that's fine)

Prometheus counters are stored per-process. When a verifier pod restarts (rollout,
node drain, OOM, image bump), the in-memory counter goes back to zero. This is
**not** a bug to engineer around at write time. The Prometheus query engine
already knows about it:

- `rate(counter[5m])` and `increase(counter[5m])` perform **reset detection**:
if the last sample is less than the previous sample inside the window, the
engine assumes a reset and stitches the two ranges together rather than
emitting a negative delta.
- This is the well-documented "counter reset semantics." See Robust Perception:
*Avoiding the counter-reset undercount* — the canonical writeup of why you
must always range-query counters rather than `sum()`'ing them raw.

The corollary is the rule that bit us in PR #530:

> Never write a recording rule of the form `sum(my_counter_total) by (...)`.
> Always write `sum(increase(my_counter_total[<window>])) by (...)`.

`sum(counter)` collapses to "whatever value the live samples currently hold,"
which means **every pod restart silently zeros the recorded series**. The
expert review caught a recording rule shipped in that exact broken form;
PR #530 swapped it to `increase()` over an explicit window.

---

## The thin-layer architecture

```
+------------------------------+
| x402-verifier (stateless) |
| - in-memory counters |
| - labels: route, |
| offer_namespace, |
| offer_name, chain, |
| asset_symbol |
+---------------+--------------+
|
| /metrics scrape (Prometheus)
v
+------------------------------+
| Prometheus TSDB (retention) |
| - 8d rolling window |
| - reset detection built in |
+---------------+--------------+
|
| recording rules with increase()
v
+------------------------------+
| Pre-aggregated series |
| offer:x402_revenue:7d_by_offer
| offer:x402_paid_requests:7d_by_offer
+---------------+--------------+
|
| PromQL queries
v
+------------------------------+
| Frontend / dashboards |
| - reads pre-aggregated |
| - cheap, scoped to window |
+------------------------------+


Parallel canonical path (for lifetime financial truth):

+------------------------------+
| x402-buyer / facilitator |
+---------------+--------------+
|
| settle tx (on-chain)
v
+------------------------------+
| Base / Base Sepolia |
| ERC-20 Transfer events |
| X-PAYMENT-RESPONSE header |
| carries settle tx hash |
+------------------------------+
|
| chain indexer / explorer
v
+------------------------------+
| Lifetime per-offer revenue |
| "since first deploy" answer |
+------------------------------+
```

The two paths answer **different questions**:

- Prometheus answers "what is the system doing in the last N hours/days?" with
cheap, second-resolution queries and label-faceting.
- On-chain answers "what was every payment that ever settled for offer X?" with
immutability and full historical depth, at the cost of being slower and
requiring an indexer.

Mixing them is a category error. Don't try to make Prometheus answer the
lifetime question, and don't try to make the chain answer "what is the current
402-rate this minute?"

---

## When NOT to add persistence to the counter itself

Three options come up repeatedly in design discussions. We rejected all three
for the current use case:

### PVC-backed verifier state
**Why it's tempting**: counters survive restart, no `increase()` gymnastics.

**Why we rejected it**: it bolts a stateful primitive onto a stateless
component. `x402-verifier` is currently safe to scale, rollout, evict, and
re-image freely. A PVC turns every restart into a sequence-recovery problem
(double-counting on a torn write, undercounting on a crash before flush).
Prometheus already solves reset detection correctly; we'd be reimplementing
it badly and introducing a new failure mode.

### Pushgateway
**Why it's tempting**: "decouple short-lived job state from scrape."

**Why we rejected it**: Pushgateway is for batch-job final values, not for
long-running services. Using it for a live verifier inverts the ownership
model (Pushgateway becomes the source of truth, verifier becomes a writer),
loses per-pod identity, and adds a single-point-of-failure that, if it
restarts, **also** zeros the counter — without `rate()` knowing about it.

### OTel collector with `cumulativetodelta`
**Why it's tempting**: collector-side reset stitching, hand off deltas to a
downstream store.

**Why we rejected it**: it solves a problem we don't have (we're not sending
deltas to a backend that needs them), at the cost of a new infrastructure
component to operate. For a single-operator local-k3d stack, this is over-
engineering. If we ever export to an OTel-native backend, revisit.

---

## When you WOULD want persistence

The only legitimate driver is an explicit **billing or compliance requirement
to report "totals since first deploy" that exceeds Prometheus retention.**

We do not have this requirement today. If we ever do:

1. **Derive it from on-chain TXs**, not from metrics. Every paid request leaves
an `X-PAYMENT-RESPONSE` with a settle tx hash; an indexer over those is the
canonical answer.
2. Only fall back to a persisted counter if for some reason the chain trace is
unavailable for the offer in question — and even then, treat the indexed
chain data as the source of truth and the counter as a soft mirror.

The architecture review's framing was right: if you find yourself wanting
Prometheus to answer a lifetime question, you've picked the wrong tool.

---

## Recording rule conventions

Naming follows the standard Prometheus pattern:

```
<level>:<metric>:<operations>
```

Examples we ship:

- `offer:x402_revenue:7d_by_offer` — revenue aggregated to the `offer` level,
base metric is `x402_revenue`, operation is `increase` over `7d` grouped
`by_offer`.
- `offer:x402_paid_requests:7d_by_offer` — same shape for paid request count.

Rules:

1. **Name the window in the rule.** `7d_by_offer` is honest; `lifetime_by_offer`
is a lie (Prometheus has no "lifetime"). The window in the name must match
the window in the expression.
2. **Use `increase()` over an explicit range, not `sum()` of the raw counter.**
See PR #530 — the original rule did `sum(by offer) (x402_revenue_total)` and
silently zeroed every time the verifier pod restarted. The fixed rule is
`sum by (offer_namespace, offer_name) (increase(x402_revenue_total[7d]))`.
3. **Keep the window aligned with retention.** Recording a `30d` rule with 8d
retention is a footgun: the rule sees nulls and silently produces nothing.

---

## Label conventions

Labels are the query interface. The rule of thumb:

- **Add a label if it's an attribute you'd want to facet by directly and the
cardinality is bounded.** "Bounded" means you can write down all possible
values: chains, asset symbols, offer names. Not user addresses, not request
IDs, not arbitrary route paths beyond what the offer CR enumerates.
- **Don't add a label that multiplies cardinality.** Every unique combination
of label values is a separate time series in TSDB. A label that adds 100
values multiplies storage by 100×.

Concrete examples:

| Label | Source | Why include it |
|-------------------|------------------|-------------------------------------------|
| `route` | offer CR pattern | Direct query facet, bounded by # offers |
| `offer_namespace` | offer CR meta | Tenancy facet |
| `offer_name` | offer CR meta | Per-offer breakdown |
| `chain` | offer CR payment | "Revenue by chain" is a real question |
| `asset_symbol` | offer CR payment | Added in PR #531 — per-token facet |

`chain` and `asset_symbol` are both CR-derived (operator-set, bounded) and
query-meaningful ("how much USDC vs OBOL did we earn on Base last week?"). They
both belong. PR #531 added `asset_symbol` for exactly this reason — the prior
schema collapsed all asset types into one bucket.

Anti-pattern: labeling by `payer_address` or `tx_hash`. Those are unbounded and
belong on the chain trace, not on the metric.

---

## CRD versioning: stay on `v1alpha1` during active dev

For the current single-operator local-stack development, the alpha-stays-alpha
approach matches the design intent. Concretely:

- **While in active dev with no external operators, stay on `v1alpha1` and edit
the schema in place.** The alpha promise IS "no compat" — that's the whole
point of the version channel. Renaming a field, dropping a field, tightening
validation: all fair game at `v1alpha1`.
- **Bump to `v1alpha2` only when** you need both versions to coexist briefly to
validate a conversion path (which requires standing up a conversion webhook),
or to checkpoint a major redesign you want to land alongside the old shape.
- **Graduate to `v1beta1` only when** all three are true:
1. The schema has been stable for ~2 releases (no breaking edits).
2. An external operator has committed to depending on it.
3. You're committing to backwards-compat for at least one release, with
deprecation warnings for any field you eventually want to remove.

The architecture review surfaced "should we graduate to `v1beta1`?" as a flag.
That was a "what if we ship externally" hypothetical, not an action item — and
graduating prematurely locks us into compat overhead before the schema has
earned it. The current ServiceOffer / RegistrationRequest / PurchaseRequest
CRDs all stay on `v1alpha1` until the three conditions above hold.

---

## `clamp_min(..., 1)` is an anti-pattern

Div-by-zero guards in PromQL exist because dividing by an empty counter
produces a `NaN`. The naive fix is:

```promql
# WRONG
my_success_rate
/
ignoring(...) clamp_min(my_request_total, 1)
```

That `1` is poison under low traffic. Suppose the real request rate over 5m is
3 successful out of 4 total (75%). With `clamp_min(..., 1)` and a window in
which the counter shows `0` total requests (e.g. between scrapes), the formula
returns `3/1 = 3.0` — a 300% success rate that breaks any alert downstream of
it. More commonly: the **denominator is clamped to 1 when it should be e.g.
0.5**, and your "success rate" reports half its real value, **causing
low-traffic alerts to under-report and stay silent during exactly the windows
when traffic is degraded**.

The fix is to use an epsilon that's small enough never to dominate the real
denominator:

```promql
# RIGHT
my_success_rate
/
ignoring(...) clamp_min(my_request_total, 1e-9)
```

`1e-9` keeps the division finite without distorting the result. Pick `1e-9` (or
smaller) as the project-wide epsilon and use it consistently. **Never `1.0`,
never `0.001`, never "a reasonable small number" — pick the smallest value that
avoids NaN and stick with it.**

This was fixed in the same review pass that produced this doc. Future
contributors: if you write a guarded division, the epsilon is `1e-9`.

---

## Cross-references

### Code

- `internal/x402/metrics.go` — verifier metric definitions
(`obol_x402_verifier_requests_total`, `_payment_required_total`,
`_payment_verified_total`, `_payment_failed_total`, `_charged_requests_total`).
- `internal/x402/verifier.go` — `prometheusLabels()` controls the verifier
label set; this is the canonical place to add a new bounded label.
- `internal/x402/buyer/metrics.go` — buyer-side counters
(`payment_attempts`, `payment_success_total`, `payment_failure_total`,
`confirm_spend_failure_total`, `payment_unsettled_confirmations`) plus
gauges (`auth_remaining`, `auth_spent`, `active_model_mappings`).
- `internal/x402/buyer/proxy.go` — `prometheusLabels()` for the buyer side.

### Infrastructure

- `internal/embed/infrastructure/values/monitoring.yaml.gotmpl` — Prometheus
values, including retention and recording rule wiring.
- `internal/embed/infrastructure/base/templates/x402.yaml` — verifier
Deployment, ServiceMonitor / PodMonitor.

### Pull requests that shaped this

- **PR #527** — `fix(prometheus-rules): escape PromQL $labels for Helm
rendering`. Helm was interpreting `$labels` as a Helm template variable and
blanking it; the fix is to escape so the literal `$labels` reaches the
Prometheus rule engine.
- **PR #530** — `fix(prometheus-rules): use increase() for the per-offer
revenue rule`. The original rule did `sum(counter)`, which silently zeroed
on verifier restart. Now uses `sum(increase(counter[7d]))` per the rules
above.
- **PR #531** — `feat(x402-metrics): add asset_symbol label for per-token
queries`. Unlocks "USDC vs OBOL revenue by chain" without needing a
downstream join.

### Reports

- The OBOL parity integration test report — see `plans/` for the most recent
`release-smoke-hardening-*.md` and `post-490-integration-*.md` entries that
reference the metric audits behind PRs #527 / #530 / #531.

---

## Quick checklist for the next change

Before opening a PR that touches metrics:

- [ ] New label is bounded and CR-derived (or otherwise enumerable).
- [ ] No label that could grow unbounded (payer address, tx hash, free-form
path beyond CR enumeration).
- [ ] New recording rule uses `increase()` over an explicit window.
- [ ] Window in the rule name matches window in the expression
(no `lifetime_*`).
- [ ] Window is within Prometheus retention.
- [ ] Any guarded division uses `1e-9` as the clamp floor.
- [ ] If the new metric tries to answer a "lifetime" question, you've stopped
and reconsidered using on-chain data instead.
Loading