feat(health): suppress unchanged success-only reports within configurable interval#2284
feat(health): suppress unchanged success-only reports within configurable interval#2284rsd-darshan wants to merge 5 commits into
Conversation
74f8c2c to
9b9bfee
Compare
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…able interval Health clients sent reports to core on every collection tick regardless of whether the content had changed. For stable machines the vast majority of reports are success-only and carry identical data, causing unnecessary API load even though the database deduplicates at storage time. Add a per-key last-sent cache in HealthReportSink that hashes report content (excluding timestamps, mirroring the DB-side hash_without_timestamps logic) and skips re-enqueuing a success-only report whose hash is unchanged within the configured suppress_unchanged_interval. Reports that contain any alert are always forwarded immediately. The interval defaults to 5 minutes and can be set to null to disable suppression entirely. Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
a872b57 to
4760297
Compare
|
Thanks @rsd-darshan can you also please add quick criterion test for hashing function? As it sits on hot path we may need to be sure it is not overly heavy on CPU. |
|
/ok to test 055260e |
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2284.docs.buildwithfern.com/infra-controller |
|
Thanks for the feedback! Added a criterion benchmark for the hashing function in the latest commit — it covers five report shapes: empty, small/large success-only, and small/large with alerts, to capture the realistic common case as well as worst-case scenarios. |
|
Thanks, can you post results here? Latest CI run also had lint issues. |
|
There is also one minor problem. HashMap with last seen report would grow without bounds over lifetime if report keys will drift (e.g. hardware added/removed). So maybe we can think of struct which would evict stale keys (hours old) or evict them on schedule. |
|
Benchmark results (Apple M-series, release mode):
The common case (small success-only reports) is ~1.3 µs per hash. The hashing only runs on success-only reports, so the alert path is unaffected. The large_success_only case (256 probes) at ~110 µs is an extreme outlier unlikely in practice. Unbounded map growth — good catch. Fixed in the latest commit: entries older than |
Benchmarks the report hashing function across five report shapes: empty, small/large success-only, and small/large with alerts. Covers the realistic common case (success-only) as well as alert-heavy reports to surface any CPU cost at scale. Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
The last_sent cache would grow without bound if machine keys drift over the lifetime of the process (e.g. hardware added and removed). Evict entries older than 2× the suppress interval on every insert so the map stays bounded to the set of recently active machines. Also fix the criterion benchmark for content_hash to build the sink's HealthReport directly instead of converting from the external type. Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
7736892 to
b17108f
Compare
|
@yoks both points addressed in the latest commit — stale entry eviction is in, and benchmark results are in the comment above. Let me know if anything else needs changing! |
Summary
Fixes #1811
Health clients sent reports to core on every collection tick regardless of whether content had changed. For stable machines, the vast majority of reports are success-only and carry identical data — causing unnecessary API and network load even though the database already deduplicates at storage time via
hash_without_timestamps.suppress_unchanged_intervalconfig field (default5m) toHealthReportSinkConfigHealthReportSinkthat hashes report content (excluding timestamps) and skips re-enqueuing a success-only report whose content is unchanged within the intervalsuppress_unchanged_interval = nullto disable suppressionWhat changes
crates/health/src/config.rssuppress_unchanged_interval: Option<Duration>onHealthReportSinkConfig, serialized withhumantime_serde(e.g."5m"), defaults toSome(300s)crates/health/src/sink/health_report.rscontent_hash()— deterministic hash over sorted successes and alerts, excluding timestamps; mirrors the DB-sidehash_without_timestampsapproachLastSent— lightweight per-key struct tracking content hash and send timehandle_eventbeforequeue.save_latestTest plan
cargo test -p carbide-health --lib— all 147 tests pass