Skip to content

fix(prometheus-rules): use increase() for the per-offer revenue rule#530

Closed
bussyjd wants to merge 1 commit into
fix/prometheus-rules-helm-template-escapefrom
fix/prometheus-rule-increase-7d
Closed

fix(prometheus-rules): use increase() for the per-offer revenue rule#530
bussyjd wants to merge 1 commit into
fix/prometheus-rules-helm-template-escapefrom
fix/prometheus-rule-increase-7d

Conversation

@bussyjd
Copy link
Copy Markdown
Collaborator

@bussyjd bussyjd commented May 24, 2026

Summary

The recording rule x402:revenue:total_by_offer_current used a naive sum(counter) expression, which is wrong for any metric where the counter can reset across pod restarts. Prometheus counters are per-process by design — they reset to zero on every pod restart (rollout, OOM, eviction, node reschedule). A sum by (...) (counter) query therefore drops to zero whenever the verifier restarts, producing a misleading "0 requests" reading on offers with real on-chain traffic.

This PR fixes the rule to use increase(), which performs reset detection at query time across the samples held in the TSDB.

Before

- record: x402:revenue:total_by_offer_current
  expr: |
    sum by (offer_namespace, offer_name) (
      obol_x402_verifier_charged_requests_total
    )

After

- record: x402:revenue:7d_by_offer
  expr: |
    sum by (offer_namespace, offer_name) (
      increase(obol_x402_verifier_charged_requests_total[7d])
    )

Why increase([7d]) not [8d]

The TSDB is the canonical persistence layer (retention 8d). increase() needs samples on both sides of the window edge to do reset detection at the left edge of the window. A 7d window inside 8d retention gives a 1-day headroom so the rule keeps working at exactly the moment data ages out, instead of silently producing NaN/undercounts at the boundary.

Why the rename

The old names (total_by_offer_current, lifetime_by_offer) were aspirational against a finite retention window. The new name (7d_by_offer) matches what the expression actually returns and makes the bounded nature of the value explicit at the call site.

Canonical reference: Robust Perception "avoiding the counter-reset undercount".

Provenance

Found by the 14-PR integration test (plans/integration-test-results-final-20260524.md). The OBOL parity smoke surfaced it more visibly when a verifier restart produced a "0 req·24h" UI display on a row with real on-chain traffic.

Stacks on #527 (Helm-escape fix for the same file). Frontend consumer PR follows.

Test plan

  • go build ./... clean
  • go test ./internal/embed/... ./internal/x402/... green
  • Verify the recording rule lands in Prometheus after obol stack up and returns a per-offer 7d count for a verifier that's been restarted at least once
  • Frontend follow-up PR points at the renamed rule

Recording rule was sum(counter), which is wrong for any metric where
the counter resets across pod restarts — Prometheus counters are
per-process by design. The TSDB is the canonical persistence layer;
rate() and increase() perform reset detection at query time across
the samples the TSDB holds.

  - Renames the rule to x402:revenue:7d_by_offer (name matches what
    it returns; the old "lifetime" / "total_by_offer_current" names
    were aspirational against a finite retention window).
  - Expression: sum by (offer_namespace, offer_name) (
      increase(obol_x402_verifier_charged_requests_total[7d])
    )
  - 7d inside 8d retention gives 1-day headroom so reset detection
    has both-side samples at the window's left edge.

Per Robust Perception's "avoiding the counter-reset undercount"
canonical guidance. Zero new components — uses only native Prometheus
+ recording-rule primitives.

Found by the 14-PR integration test (plans/integration-test-results-
final-20260524.md). The OBOL parity smoke surfaced it more visibly
when a verifier restart produced a "0 req·24h" UI display on a row
with real on-chain traffic.

Stacks on PR #527 (Helm-escape fix for the same file).
@bussyjd
Copy link
Copy Markdown
Collaborator Author

bussyjd commented May 24, 2026

Superseded by bundle PR #536 — closing in favor of the consolidated merge target. Original branch and history preserved.

@bussyjd bussyjd closed this May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant