fix(prometheus-rules): use epsilon floor not 1.0 to avoid under-reporting low-traffic alerts #532

Closed
bussyjd wants to merge 1 commit into feat/x402-asset-symbol-label from fix/alert-clamp-min-epsilon

bussyjd commented May 24, 2026

Summary

  • Replace clamp_min(denominator, 1) with clamp_min(denominator, 1e-9) in both the X402PaymentFailureRateHigh alert and the x402:settlement_rate:1h_by_offer_chain recording rule.
  • Update the comments above each rule to document why epsilon (div-by-zero guard) is the correct choice rather than 1 (which silently floored the denominator).

The bug

clamp_min(..., 1) floors the denominator at 1 req/s. The intent was to guard against division-by-zero when no samples exist in the lookback window. The effect was different: on any paid offer running below 1 req/s, the rule replaces the true denominator with 1, collapsing the ratio.

Concrete example for X402PaymentFailureRateHigh under light load:

  • failed = 0.001 req/s
  • verified = 0.001 req/s
  • true denominator = failed + verified = 0.002 req/s
  • true ratio = 0.001 / 0.002 = 0.5 (50% failure)
  • with clamp_min(..., 1): 0.001 / max(0.002, 1) = 0.001 / 1 = 0.001 (reported as 0.1% failure)
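
The arithmetic above can be reproduced with a minimal Python sketch. It is a scalar stand-in, not the rule itself: `clamp_min` here is a plain `max`, and `failure_ratio` is a hypothetical helper name chosen for illustration.

```python
def clamp_min(value: float, floor: float) -> float:
    """Scalar stand-in for PromQL's clamp_min: never return less than floor."""
    return max(value, floor)


def failure_ratio(failed: float, verified: float, floor: float) -> float:
    """failed / clamp_min(failed + verified, floor), with rates in req/s."""
    return failed / clamp_min(failed + verified, floor)


# Rates from the example above: a low-traffic offer with half its requests failing.
failed, verified = 0.001, 0.001

old_ratio = failure_ratio(failed, verified, floor=1.0)   # denominator floored to 1
new_ratio = failure_ratio(failed, verified, floor=1e-9)  # denominator left at 0.002

print(f"floor=1    -> {old_ratio:.4f}")
print(f"floor=1e-9 -> {new_ratio:.4f}")
```

With the floor at 1 the helper returns the failed rate itself (0.001); with the epsilon floor it returns the true 0.5.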

Same arithmetic for the settlement-rate recording rule: the dashboard reads "100% settlement" on a half-broken low-traffic offer.

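With the floor at 1, the ratio reduces to the failed rate itself whenever total failed + verified traffic is below 1 req/s. The 10% alert threshold (> 0.10) therefore can never fire on an offer whose failed rate is at or below 0.1 req/s, regardless of how badly it's failing.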
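
To illustrate the threshold effect, the sketch below sweeps a few low-traffic levels where half the requests fail and checks whether a `> 0.10` comparison would pass under each floor. The traffic levels and the `fires` helper are assumptions made for the example, not values from the rules file.

```python
THRESHOLD = 0.10  # alert threshold quoted in the description above


def fires(failed: float, verified: float, floor: float) -> bool:
    """True when failed / max(failed + verified, floor) exceeds the threshold."""
    return failed / max(failed + verified, floor) > THRESHOLD


# Hypothetical total request rates (req/s), each with a 50% failure fraction.
totals = [0.002, 0.02, 0.1]

results = []
for total in totals:
    failed = total / 2
    verified = total - failed
    results.append((total, fires(failed, verified, 1.0), fires(failed, verified, 1e-9)))

for total, old, new in results:
    print(f"total={total} req/s  floor=1 fires={old}  floor=1e-9 fires={new}")
```

In this sweep the floor of 1 never crosses the threshold, while the epsilon floor crosses it at every level, since the true ratio is 0.5 throughout.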

The fix

clamp_min(..., 1e-9) keeps the original div-by-zero protection (the denominator never reaches zero in the division) without distorting the ratio. At any non-zero traffic level the rule returns the true ratio; only the truly-zero case is clamped, and there the numerator is also zero, so the ratio is well-defined at 0.
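
A small sketch of the two properties claimed above, again using scalar Python as a stand-in for the PromQL expression (`guarded_ratio` and the sample values are hypothetical): the zero-traffic case evaluates to 0, and any denominator above epsilon passes through unchanged.

```python
EPSILON = 1e-9


def guarded_ratio(numerator: float, denominator: float, floor: float = EPSILON) -> float:
    """numerator / max(denominator, floor): the floor only matters near zero."""
    return numerator / max(denominator, floor)


# Zero traffic: numerator and denominator are both 0, so the result is 0 / 1e-9 = 0.
zero_case = guarded_ratio(0.0, 0.0)

# Non-zero traffic: the guard leaves the ratio identical to plain division.
samples = [(0.001, 0.002), (0.3, 0.4), (5.0, 50.0)]
unchanged = all(guarded_ratio(n, d) == n / d for n, d in samples)

print(zero_case, unchanged)
```

The only inputs the epsilon floor alters are denominators below 1e-9 req/s, which is far below any rate a lookback window would meaningfully report.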

Provenance

Surfaced by Expert #2 review of the PromQL design in plans/integration-test-L7-paid-flow-20260524.md follow-ups.

Stack

Based on feat/x402-asset-symbol-label (PR #531), the current tip of the rules-file stack (#527, #530, #531). Will rebase onto main as the chain merges.

grep clamp_min over the repo returns only the two occurrences in this file, both touched here.

Test plan

  • go build ./... clean
  • go test ./internal/embed/... ./internal/x402/... green
  • Reviewer eyeballs the PromQL diff for typos in the comments

…ting low-traffic alerts

X402PaymentFailureRateHigh and the settlement_rate recording rule
used clamp_min(denominator, 1) as a div-by-zero guard. For paid
endpoints under light load (sub-1 req/s), the floor is 1.0 instead
of the true denominator, so the ratio numerator/denominator returns
near-zero even when 50%+ of requests are failing — the alert never
fires.

Switch the floor to 1e-9. Epsilon prevents division-by-zero while
keeping the actual ratio accurate at any non-zero traffic level.

Surfaced by Expert #2 review of the PromQL design
(plans/integration-test-L7-paid-flow-20260524.md follow-ups).

Stacks on PR #531 (asset_symbol label) which is the tip of the
rules-file chain. Will rebase onto main as the chain merges.

bussyjd commented May 24, 2026

Superseded by bundle PR #536 — closing in favor of the consolidated merge target. Original branch and history preserved.

bussyjd closed this May 24, 2026