fix(x402-metrics): align Prometheus retention with recording-rule windows; rename mis-named lifetime rule #516

Closed
bussyjd wants to merge 1 commit into feat/x402-marketplace-metrics from fix/prom-retention-window-alignment

@bussyjd bussyjd commented May 23, 2026

Why

27e1ac5 (this PR's parent) added recording rules over [24h] and [7d]. Prometheus retention was 6h. increase(x[24h]) over a 6h TSDB silently returns a value computed from only the ~6h of samples that still exist, with no error. The frontend reads it and shows the wrong number with no warning.

The same commit's x402:revenue:lifetime_by_offer rule is sum(counter), not sum(increase[lifetime]). It resets on every replica restart, so "lifetime" is a misnomer.
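The retention mismatch can be reproduced with a small model. This is a Python sketch, not project code: `modeled_increase` is a hypothetical, simplified stand-in for PromQL's `increase()`. It assumes a steadily growing counter with no resets, a 30s scrape interval, and Prometheus's behavior of stretching the result to a window edge only when a sample sits close to that edge.

```python
def modeled_increase(samples, window_start, window_end):
    """Approximate PromQL increase() for one counter series (simplified model).

    samples: list of (timestamp_seconds, value), sorted by time, no resets.
    The raw delta is stretched to a window edge only when the nearest sample
    is within ~1.1 average scrape intervals of that edge; otherwise it is
    stretched by half an interval. The zero-crossing limit is not modeled.
    """
    in_window = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(in_window) < 2:
        return None
    (t_first, v_first), (t_last, v_last) = in_window[0], in_window[-1]
    sampled = t_last - t_first
    avg_step = sampled / (len(in_window) - 1)
    threshold = avg_step * 1.1
    covered = sampled
    for gap in (t_first - window_start, window_end - t_last):
        covered += gap if gap < threshold else avg_step / 2
    return (v_last - v_first) * covered / sampled


SCRAPE = 30   # assumed scrape interval, seconds
HOUR = 3600

# Counter that gains 1 per scrape for 24h, starting at an arbitrary 1000.
all_samples = [(i * SCRAPE, 1000 + i) for i in range(24 * HOUR // SCRAPE + 1)]
# With 6h retention, only samples from the last 6h still exist.
retained = [(t, v) for t, v in all_samples if t >= 18 * HOUR]

full = modeled_increase(all_samples, 0, 24 * HOUR)
truncated = modeled_increase(retained, 0, 24 * HOUR)
print(full, truncated, full / truncated)
```

Under these assumptions the 6h-retention result is about a quarter of the true 24h increase, and nothing in the returned value signals that the window was only partly covered.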

Before

   recording rule expressions      |   actual TSDB retention
   ---------------------------------------------------------
   increase(x[24h])                |   <- reads last 6h only
   increase(x[7d])                 |   <- reads last 6h only
   sum(charged_requests_total)     |   <- survives, but resets
                                   |     on every verifier restart

   Frontend "24h revenue" / "7d earnings" / "lifetime" cells all wrong

After

   recording rule expressions      |   actual TSDB retention
   ---------------------------------------------------------
   increase(x[24h])                |   <- 8d retention, window valid
   increase(x[7d])                 |   <- 8d retention, window valid
   sum(charged_requests_total)     |   <- renamed to
                                   |     x402:revenue:total_by_offer_current,
                                   |     with a doc comment that says
                                   |     "this resets on rollout"

   Cells match reality. Naming matches semantics.

What changed

  • monitoring.yaml.gotmpl: retention: 6h -> 8d (= 7d max window + 1d safety)
  • x402-prometheus-rules.yaml: rename rule + add explanatory comment
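The "retention must cover the longest rule window" invariant behind the 8d value could be checked mechanically. The following is a minimal Python sketch with hypothetical helper names; it only understands single-unit durations such as `6h` or `8d`, not compound ones like `1h30m`, and it is not part of this PR.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def duration_seconds(text):
    """Convert a single-unit Prometheus duration such as '6h' or '8d' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def range_windows(expr):
    """Return every [window] duration that appears in a PromQL expression."""
    return re.findall(r"\[(\d+[smhdw])\]", expr)


def uncovered_windows(retention, exprs):
    """List (expr, window) pairs whose window is longer than the retention."""
    limit = duration_seconds(retention)
    return [
        (expr, window)
        for expr in exprs
        for window in range_windows(expr)
        if duration_seconds(window) > limit
    ]


RULES = ["increase(x[24h])", "increase(x[7d])"]
print(uncovered_windows("6h", RULES))  # both rules are flagged
print(uncovered_windows("8d", RULES))  # nothing is flagged
```

A check like this could run in CI against the rendered rules file and the retention value, so a future rule with a longer window fails loudly instead of silently under-reporting.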

Disk impact

~32x increase in Prometheus disk footprint (proportional to retention). Local-only kube-prometheus-stack PVC currently has no size cap (inherits storageClass default). Watch for PVC-full on next obol stack up and add a cap if it bites.
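The 32x figure follows from the sizing rule of thumb in the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample). The Python sketch below applies it; the ingestion rate and the 2 bytes per sample are placeholder assumptions, not measurements from this stack, and WAL or compaction overhead is not modeled.

```python
HOUR = 3600
DAY = 24 * HOUR


def estimated_tsdb_bytes(retention_seconds, samples_per_second, bytes_per_sample=2.0):
    """Rough TSDB size: retention * ingestion rate * average bytes per sample."""
    return retention_seconds * samples_per_second * bytes_per_sample


SAMPLES_PER_SECOND = 1000  # hypothetical ingestion rate

before = estimated_tsdb_bytes(6 * HOUR, SAMPLES_PER_SECOND)
after = estimated_tsdb_bytes(8 * DAY, SAMPLES_PER_SECOND)
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB ({after / before:.0f}x)")
```

The ratio is independent of the assumed ingestion rate, so only the absolute numbers need real measurements before choosing a PVC cap.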

Frontend consumer

obol-stack-front-end/src/lib/services/prometheus.ts (lines 49, 61) references x402:revenue:lifetime_by_offer in a doc comment and in an OR-fallback expression for the lifetime-earnings query. Architecture-review notes flagged that the OR fallback added in #330 was unused on the consumer side anyway, so this rename has zero immediate UI impact. A small follow-up PR on the frontend repo should update the name and comment to x402:revenue:total_by_offer_current. Not modified in this PR (separate repo).

Test plan

  • go build ./... clean
  • go test ./internal/embed/... ./internal/x402/... green
  • On next stack up: PromQL x402:revenue:7d_by_offer_chain should return non-extrapolated values
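The last check can be scripted against the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). This is a Python sketch under assumptions: the base URL, the label names, and the sample payload are made up for illustration, and only the response envelope follows the documented API format.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body):
    """Return [(labels, value)] from an instant-vector query response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise RuntimeError(f"unexpected result type: {data['resultType']}")
    return [(series["metric"], float(series["value"][1])) for series in data["result"]]


# Hypothetical base URL; point this at wherever Prometheus is reachable.
url = instant_query_url("http://localhost:9090", "x402:revenue:7d_by_offer_chain")
print(url)

# Illustrative response only; label names and the value are invented.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "x402:revenue:7d_by_offer_chain",
                           "offer": "demo", "chain": "example"},
                "value": [1779500000, "12.5"],
            }
        ],
    },
})
rows = parse_instant_vector(SAMPLE_BODY)
print(rows)
```

In a real run the body would come from something like `urllib.request.urlopen(url).read()`, and the returned values would be compared against an independently known revenue figure for the same period.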

Stacks on

PR #513. Rebase onto main after #513 merges.

…dows; rename mis-named lifetime rule

Two related metric-correctness fixes layered on top of the recording
rules added in 27e1ac5:

1. Retention 6h → 8d. The recording rules added in 27e1ac5 use [24h]
   and [7d] windows. `increase(x[24h])` against a 6h-retention TSDB
   silently returns a value based on only the last ~6h of samples,
   with no error. The frontend displays that result as "24h revenue",
   which is low by roughly 4x under steady traffic.
   8d (= 7d + 1d safety margin) keeps the [7d] rule valid across a
   brief Prometheus outage.

2. `x402:revenue:lifetime_by_offer` → `x402:revenue:total_by_offer_current`.
   The original expression was `sum(counter)` (not `sum(increase[lifetime])`),
   so it:
     * is NOT lifetime — it's "sum across currently-alive verifier
       replicas of their since-last-restart counts",
     * drops ~50% on every replica rollout,
     * compounds with the per-pod-registry issue addressed by the
       replicas:1 fix.

   Renaming makes the semantic explicit. True lifetime queries should
   use `sum_over_time(...[Nd])` against a long-retention store.
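The difference between `sum(counter)` and a reset-aware total can be illustrated with a toy simulation. This is a Python sketch with made-up pod names and counts; the two-replica setup with equal counts is an assumption chosen to reproduce the ~50% drop, and `reset_aware_increase` only mimics the usual counter-reset rule (any decrease is treated as a restart from zero).

```python
def sum_of_counters(replicas):
    """What sum(counter) sees: raw values on the currently alive replicas."""
    return sum(replicas.values())


def reset_aware_increase(samples):
    """Total increase of one counter series, treating any drop as a restart."""
    total = 0
    prev = samples[0]
    for cur in samples[1:]:
        # After a reset the counter restarted from zero, so count `cur` itself.
        total += (cur - prev) if cur >= prev else cur
        prev = cur
    return total


# Hypothetical replicas, each having charged 500 requests since its last start.
before_rollout = {"verifier-0": 500, "verifier-1": 500}
# One replica is replaced during a rollout and starts counting from zero.
after_rollout = {"verifier-0": 500, "verifier-1": 0}

print(sum_of_counters(before_rollout), sum_of_counters(after_rollout))

# One replica's scraped values across a restart: 480 -> 500 -> (restart) 0 -> 20.
print(reset_aware_increase([480, 500, 0, 20]))
```

The raw sum halves at the rollout even though no revenue was lost, while the reset-aware total keeps growing, which is why the renamed rule is documented as a current total and not a lifetime figure.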

Retention bump increases Prometheus disk footprint roughly proportional
to (8d/6h) ≈ 32x. The local-only kube-prometheus-stack PVC sizing in
monitoring.yaml.gotmpl needs review on next `obol stack up` if disk
pressure shows up — currently no PVC size cap set, so it inherits the
storageClass default.

bussyjd commented May 24, 2026

Superseded by bundle PR #536 — closing in favor of the consolidated merge target. Original branch and history preserved.

@bussyjd bussyjd closed this May 24, 2026