feat(observability): uptime, Prometheus /metrics, and OTLP traces across all servers #150

Open
mauripunzueta wants to merge 1 commit into main from feat/observability-uptime

@mauripunzueta

Contributor

Summary

Adds a shared helios-observability crate and wires it into all four server
binaries (hfs, hts, sof-server, fhirpath-server):

  • GET /metrics Prometheus endpoint — http_requests_total,
    http_request_duration_seconds, uptime_seconds, with a service global
    label and a templated route label.
  • /health enriched with uptime_seconds + started_at.
  • Per-request metrics + a tracing span. Tenant is a span attribute only, never a
    metric label (cardinality).
  • Feature-gated (otel) OTLP trace export via tracing-opentelemetry
    (opentelemetry 0.32). OTLP metrics are produced out-of-process by a Collector
    scraping /metrics — we avoid the unmaintained opentelemetry-prometheus
    (protobuf RUSTSEC advisory).
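
To illustrate the cardinality point above, here is a minimal std-only sketch (not the code in this PR) of counting requests by route template while keeping tenant out of the label set. `RequestCounter`, the `/Patient/{id}` route, and the tenant names are hypothetical; the real crate records through the metrics stack rather than a hand-rolled map.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a metrics recorder: one counter per route template.
pub struct RequestCounter {
    /// Global label, set once per server binary.
    service: &'static str,
    /// Keyed by the route *template*, never the raw request path.
    by_route: BTreeMap<String, u64>,
}

impl RequestCounter {
    pub fn new(service: &'static str) -> Self {
        Self { service, by_route: BTreeMap::new() }
    }

    /// `route_template` is the matched pattern, e.g. "/Patient/{id}".
    /// `_tenant` is accepted only to show that it is NOT turned into a label;
    /// in the PR it is attached to the tracing span instead.
    pub fn record(&mut self, route_template: &str, _tenant: &str) {
        *self.by_route.entry(route_template.to_string()).or_insert(0) += 1;
    }

    /// Number of distinct time series this counter would expose.
    pub fn series_count(&self) -> usize {
        self.by_route.len()
    }

    /// Render the counter in Prometheus text exposition format.
    pub fn render(&self) -> String {
        let mut out = String::from("# TYPE http_requests_total counter\n");
        for (route, count) in &self.by_route {
            out.push_str(&format!(
                "http_requests_total{{service=\"{}\",route=\"{}\"}} {}\n",
                escape(self.service),
                escape(route),
                count
            ));
        }
        out
    }
}

/// Escape a label value: backslash, double quote, and newline.
fn escape(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}
```

With this shape, two requests to the same templated route from different tenants land in a single series, so the series count grows with the number of routes rather than with the number of tenants or resource ids.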

Verification

  • Workspace cargo check (default) compiles; observability also compiles with
    --features otel. Clippy clean (CI flags) for touched crates. Unit tests pass.
  • End-to-end smoke test against fhirpath-server: /health uptime + /metrics
    counter/gauge/histogram confirmed.
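
A smoke check like the one above could be automated roughly as follows. This is a sketch under assumptions, not the procedure used for this PR: it takes an already-fetched /metrics body as a string and compares the `# TYPE` declarations against an expected list. The helper names `metric_types` and `missing_metrics` are hypothetical.

```rust
use std::collections::BTreeMap;

/// Collect `# TYPE <name> <type>` declarations from a Prometheus
/// text exposition body.
pub fn metric_types(body: &str) -> BTreeMap<String, String> {
    let mut types = BTreeMap::new();
    for line in body.lines() {
        let mut parts = line.split_whitespace();
        if parts.next() == Some("#") && parts.next() == Some("TYPE") {
            if let (Some(name), Some(kind)) = (parts.next(), parts.next()) {
                types.insert(name.to_string(), kind.to_string());
            }
        }
    }
    types
}

/// Return the expected metric names whose declared type is missing
/// or differs from the expected one.
pub fn missing_metrics(body: &str, expected: &[(&str, &str)]) -> Vec<String> {
    let types = metric_types(body);
    expected
        .iter()
        .filter(|(name, kind)| types.get(*name).map(String::as_str) != Some(*kind))
        .map(|(name, _)| name.to_string())
        .collect()
}
```

Passing the three metric names from the Summary with the types counter, histogram, and gauge yields an empty list when all three are exposed as expected.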

Notes / known gaps in local verification

  1. Full HFS server not run locally — SQLite WAL fails over the WSL filesystem
    under the Windows toolchain (environment limitation, not a code issue). The
    stateless fhirpath-server exercised the same wiring successfully.
  2. Clippy not run with --all-features locally (build window). A pre-existing
    dead_code warning on build_embedded_job_store appears only under default
    features; CI's --all-features run won't hit it. CI is the authoritative gate.

…oss all servers

Add a shared `helios-observability` crate and wire it into all four server
binaries (hfs, hts, sof-server, fhirpath-server):

- `GET /metrics` Prometheus endpoint (maintained metrics +
  metrics-exporter-prometheus stack; avoids unmaintained
  opentelemetry-prometheus / its protobuf RUSTSEC advisory).
- `/health` enriched with `uptime_seconds` + `started_at`.
- Per-request `http_requests_total` / `http_request_duration_seconds` metrics
  (templated `route` label; `service` global label) and a tracing span.
- Feature-gated (`otel`) OTLP trace export via tracing-opentelemetry; OTLP
  metrics are produced out-of-process by a Collector scraping /metrics.

Tenant is recorded as a span attribute only, never a metric label, to bound
Prometheus cardinality. Verified end-to-end against fhirpath-server.
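
For reviewers, a rough std-only sketch of how the `/health` enrichment can be derived from a start time captured once at boot. This is an illustration, not the crate's code: `StartTime`, the `status` field, and emitting `started_at` as Unix epoch seconds are assumptions (the PR does not state the timestamp format).

```rust
use std::time::{Duration, Instant, SystemTime, UNIX_EPOCH};

/// Hypothetical holder for the process start time, created once at boot.
pub struct StartTime {
    /// Monotonic clock: uptime is unaffected by wall-clock adjustments.
    monotonic: Instant,
    /// Wall clock: used only to report when the process started.
    wall: SystemTime,
}

impl StartTime {
    pub fn now() -> Self {
        Self { monotonic: Instant::now(), wall: SystemTime::now() }
    }

    /// Time elapsed since the process started.
    pub fn uptime(&self) -> Duration {
        self.monotonic.elapsed()
    }

    /// Start time as Unix epoch seconds (assumed format; 0 if the
    /// system clock is set before the epoch).
    pub fn started_at_unix(&self) -> u64 {
        self.wall
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    }

    /// Build a health payload by hand to keep the sketch dependency-free.
    pub fn health_json(&self) -> String {
        format!(
            "{{\"status\":\"ok\",\"uptime_seconds\":{},\"started_at\":{}}}",
            self.uptime().as_secs(),
            self.started_at_unix()
        )
    }
}
```

Sampling both clocks at startup keeps `uptime_seconds` monotonic while `started_at` stays a fixed value for the life of the process.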