fix(prometheus-rules): use epsilon floor not 1.0 to avoid under-reporting low-traffic alerts #532
Closed
bussyjd wants to merge 1 commit into
…ting low-traffic alerts

X402PaymentFailureRateHigh and the settlement_rate recording rule used clamp_min(denominator, 1) as a div-by-zero guard. For paid endpoints under light load (sub-1 req/s), the floor is 1.0 instead of the true denominator, so the ratio numerator/denominator returns near-zero even when 50%+ of requests are failing, and the alert never fires.

Switch the floor to 1e-9. Epsilon prevents division-by-zero while keeping the actual ratio accurate at any non-zero traffic level.

Surfaced by Expert #2 review of the PromQL design (plans/integration-test-L7-paid-flow-20260524.md follow-ups).

Stacks on PR #531 (asset_symbol label) which is the tip of the rules-file chain. Will rebase onto main as the chain merges.
Superseded by bundle PR #536; closing in favor of the consolidated merge target. Original branch and history preserved.
Summary
- Replace clamp_min(denominator, 1) with clamp_min(denominator, 1e-9) in both the X402PaymentFailureRateHigh alert and the x402:settlement_rate:1h_by_offer_chain recording rule.
- … 1 (which silently floored the denominator).
The bug
clamp_min(..., 1) floors the denominator at 1 req/s. The intent was to guard against division-by-zero when no samples exist in the lookback window. The effect was different: on any paid offer running below 1 req/s, the rule replaces the true denominator with 1, collapsing the ratio.
Concrete example for X402PaymentFailureRateHigh under light load (0.001 req/s failed out of 0.002 req/s total):
- true ratio: 0.001 / 0.002 = 0.5 (50% failure)
- clamp_min(..., 1): 0.001 / max(0.002, 1) = 0.001 / 1 = 0.001 (~0% failure)
Same arithmetic for the settlement-rate recording rule: the dashboard reads "100% settlement" on a half-broken low-traffic offer.
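The arithmetic above can be reproduced with a small Go sketch. It models clamp_min on plain scalars rather than running PromQL, and the names clampMin and guardedRatio are hypothetical helpers introduced for illustration, not code from this repo.

```go
package main

import "fmt"

// clampMin mimics PromQL's clamp_min for a single sample value:
// it raises v to floor whenever v is below floor.
// (Hypothetical helper, not part of the repo.)
func clampMin(v, floor float64) float64 {
	if v < floor {
		return floor
	}
	return v
}

// guardedRatio divides numerator by the floored denominator,
// which is the shape of the guard used in the rule.
func guardedRatio(numerator, denominator, floor float64) float64 {
	return numerator / clampMin(denominator, floor)
}

func main() {
	// Rates in req/s, taken from the concrete example in this PR.
	failed, total := 0.001, 0.002

	fmt.Printf("true ratio:      %.3f\n", failed/total)
	fmt.Printf("floor of 1:      %.3f\n", guardedRatio(failed, total, 1))
	fmt.Printf("fires at > 0.10: %t\n", guardedRatio(failed, total, 1) > 0.10)
}
```

With a floor of 1 the computed value is just the failed rate in req/s (0.001), so the threshold comparison is false even though half the requests fail.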
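The 10% alert threshold (> 0.10) makes this worse: on any offer whose total failed + verified rate is below 1 req/s, the computed value is simply the failed rate in req/s, so the alert cannot fire until failures alone exceed 0.1 req/s, regardless of how badly the offer is failing in percentage terms.
The fix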
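As a rough check of the epsilon floor, the Go sketch below evaluates the same guard with 1e-9 for three cases: low traffic, no traffic, and traffic above 1 req/s. It is a scalar model only; safeRatio is a hypothetical name, and the 30/100 case is an illustrative input that does not appear in the PR.

```go
package main

import "fmt"

// epsilon is the floor this PR switches to.
const epsilon = 1e-9

// safeRatio divides by max(denominator, epsilon), so a zero denominator
// never reaches the division. (Hypothetical helper for illustration.)
func safeRatio(numerator, denominator float64) float64 {
	if denominator < epsilon {
		denominator = epsilon
	}
	return numerator / denominator
}

func main() {
	// Low traffic, half failing: the true ratio survives and crosses 0.10.
	fmt.Println(safeRatio(0.001, 0.002) > 0.10)

	// No traffic at all: numerator and denominator are both zero,
	// so the result is 0 rather than NaN.
	fmt.Println(safeRatio(0, 0))

	// At or above 1 req/s (illustrative numbers): the floor has no effect.
	fmt.Println(safeRatio(30, 100))
}
```

The only inputs the floor alters are denominators below 1e-9 req/s, which is far below any traffic level the rules would meaningfully observe.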
clamp_min(..., 1e-9) keeps the original div-by-zero protection (the denominator never reaches zero in the division) without distorting the ratio. At any non-zero traffic level the rule returns the true ratio; only the truly-zero case is clamped, and there the numerator is also zero, so the ratio is well-defined at 0.
Provenance
Surfaced by Expert #2 review of the PromQL design in plans/integration-test-L7-paid-flow-20260524.md follow-ups.
Stack
Based on feat/x402-asset-symbol-label (PR #531), the current tip of the rules-file stack (#527 → #530 → #531). Will rebase onto main as the chain merges.
grep clamp_min over the repo returns only the two occurrences in this file, both touched here.
Test plan
- go build ./... clean
- go test ./internal/embed/... ./internal/x402/... green