docs(observability): record the thin-layer architecture decisions #534

Closed
bussyjd wants to merge 1 commit into main from docs/observability-thin-layer-architecture

bussyjd commented May 24, 2026

Summary

Pure docs PR — captures the architectural decisions made during the OBOL parity integration smoke + Prometheus expert review so they don't get re-litigated.

docs/observability.md (~370 lines) establishes:

  • Two-path model: Prometheus = recent operational telemetry (bounded by 8d retention). On-chain settlement TXs = canonical lifetime financial record. The two paths answer different questions; conflating them is a category error.
  • Counter-reset semantics: counters are intentionally per-process. rate() / increase() handle resets at query time via the TSDB. The PVC-backed-state, Pushgateway, and OTel cumulativetodelta alternatives were all considered and rejected for our single-operator local-k3d use case, with reasoning so the rejection can be revisited if requirements change.
  • Recording rule conventions: <level>:<metric>:<operations>, name the window in the rule (7d_by_offer, never lifetime_*), use increase() over an explicit range, keep the range inside retention.
  • Label conventions: include only bounded, CR-derived, query-meaningful labels (chain, asset_symbol). Don't label by unbounded attributes (payer address, tx hash).
  • CRD versioning stance: stay on v1alpha1 during active dev; the alpha promise IS "no compat". Graduate to v1beta1 only when an external operator commits to depending on the schema. The architecture review's v1beta1 graduation flag was hypothetical, not actionable.
  • clamp_min(..., 1) is an anti-pattern: epsilon must be 1e-9, not 1.0. A floor of 1.0 replaces every denominator below 1 with 1, so low-traffic ratios are understated and their alerts silently under-report.
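
To make the last bullet concrete, here is a minimal Python sketch of the arithmetic behind the division guard. It is not taken from docs/observability.md: `clamp_min` here only mirrors the PromQL function for a single sample, and `error_ratio` plus the traffic numbers are hypothetical, chosen for illustration.

```python
def clamp_min(value: float, floor: float) -> float:
    """Mirror PromQL's clamp_min for one sample: never return less than floor."""
    return max(value, floor)


def error_ratio(errors_per_s: float, requests_per_s: float, floor: float) -> float:
    """Hypothetical guarded ratio: errors / clamp_min(requests, floor)."""
    return errors_per_s / clamp_min(requests_per_s, floor)


# Assumed low-traffic service: one request every 25 s, half of them failing.
errors = 0.02
requests = 0.04

# Floor of 1.0: the denominator 0.04 is replaced by 1, so the ratio collapses.
ratio_floor_one = error_ratio(errors, requests, 1.0)

# Floor of 1e-9: the real denominator is kept, so the ratio stays truthful.
ratio_epsilon = error_ratio(errors, requests, 1e-9)

# Fully idle service: the epsilon still prevents a division by zero.
ratio_idle = error_ratio(0.0, 0.0, 1e-9)

print(ratio_floor_one, ratio_epsilon, ratio_idle)
```

With these assumed numbers, an alert that fires above a 5% error ratio stays silent under the 1.0 floor even though half of all requests fail, while the 1e-9 floor leaves the ratio intact and only takes effect when the denominator is effectively zero.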

CLAUDE.md gets a one-line pointer in the Pitfalls section so future contributors land in docs/observability.md first when touching metrics.

References

Test plan

  • Markdown renders cleanly — no broken cross-refs, ASCII diagrams render in fixed-width
  • grep -n "docs/observability.md" CLAUDE.md returns the new pointer line
  • PR numbers verified via gh pr view 527 / 530 / 531
  • Doc length 369 lines — within the 250-400 target
  • Branch based on origin/main (no incidental changes)
  • Pure docs PR — no code, no test, no infrastructure changes

After the OBOL parity smoke + Prometheus expert review, we made
explicit design choices worth recording so they don't get
re-litigated:

  1. Counters are intentionally per-process, by Prometheus design.
     Pod restarts reset them; rate()/increase() handle this at
     query time via the TSDB's reset detection. Don't add
     persistence to the counter itself.

  2. Prometheus = recent operational telemetry (bounded by retention).
     On-chain settlement TXs = canonical lifetime financial record.

  3. Recording rules use the convention <level>:<metric>:<operations>;
     name the window (7d_by_offer, not lifetime_by_offer).

  4. Add labels you'd query by directly (chain, asset_symbol —
     both CR-derived, both query-meaningful, both bounded).

  5. div-by-zero guards use epsilon (1e-9), not 1.0.

  6. CRD versioning stance: stay on v1alpha1 during active dev;
     the alpha promise IS "no compat". Graduate only when an
     external operator commits to depending on the schema.
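
As an illustration of point 1, the sketch below shows in Python how a query-time reset correction can work on raw counter samples. It is a simplification under stated assumptions: `increase_with_resets` is a hypothetical helper, the samples are invented, and the real increase() additionally extrapolates to the window boundaries, which is omitted here.

```python
from typing import Sequence


def increase_with_resets(samples: Sequence[float]) -> float:
    """Sum the growth of a counter across samples, tolerating restarts.

    A drop between two consecutive samples is treated as a reset to zero,
    so the later sample's value is counted as the growth since the restart.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr < prev:
            # Counter went backwards: the process restarted and began at 0.
            total += curr
        else:
            total += curr - prev
    return total


# Assumed scrape history: the pod restarts between the 3rd and 4th sample.
scraped = [10.0, 15.0, 20.0, 3.0, 8.0]

# Growth is 5 + 5 before the restart, then 3 + 5 after it.
print(increase_with_resets(scraped))
```

Because the correction happens over the stored samples, the counter itself does not need to survive a pod restart, which is the reason persistence is kept out of the counter.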

The PVC-backed counter persistence option was considered and
rejected for our single-operator local-k3d use case. The doc
walks through why, what would change that decision, and where
the canonical "lifetime" answer comes from.

Adds CLAUDE.md pointer so future contributors land here first.

bussyjd commented May 24, 2026

Superseded by bundle PR #536 — closing in favor of the consolidated merge target. Original branch and history preserved.
