Skip to content

feat(health): suppress unchanged success-only reports within configurable interval#2284

Open
rsd-darshan wants to merge 5 commits into
NVIDIA:mainfrom
rsd-darshan:feat/variable-frequency-health-reports
Open

feat(health): suppress unchanged success-only reports within configurable interval#2284
rsd-darshan wants to merge 5 commits into
NVIDIA:mainfrom
rsd-darshan:feat/variable-frequency-health-reports

Conversation

@rsd-darshan
Copy link
Copy Markdown

@rsd-darshan rsd-darshan commented Jun 6, 2026

Summary

Fixes #1811

Health clients sent reports to core on every collection tick regardless of whether content had changed. For stable machines, the vast majority of reports are success-only and carry identical data — causing unnecessary API and network load even though the database already deduplicates at storage time via hash_without_timestamps.

  • Add a suppress_unchanged_interval config field (default 5m) to HealthReportSinkConfig
  • Add a per-key last-sent cache in HealthReportSink that hashes report content (excluding timestamps) and skips re-enqueuing a success-only report whose content is unchanged within the interval
  • Reports that contain any alert bypass suppression entirely and are always forwarded immediately
  • Set suppress_unchanged_interval = null to disable suppression

What changes

crates/health/src/config.rs

  • New field suppress_unchanged_interval: Option<Duration> on HealthReportSinkConfig, serialized with humantime_serde (e.g. "5m"), defaults to Some(300s)

crates/health/src/sink/health_report.rs

  • content_hash() — deterministic hash over sorted successes and alerts, excluding timestamps; mirrors the DB-side hash_without_timestamps approach
  • LastSent — lightweight per-key struct tracking content hash and send time
  • Suppression gate in handle_event before queue.save_latest
  • 4 new unit tests: suppress within interval, alerts bypass suppression, changed content forwarded, suppression disabled

Test plan

  • cargo test -p carbide-health --lib — all 147 tests pass
  • Deploy to a staging environment and confirm API call rate drops for stable machines
  • Confirm machines with active alerts still receive timely report updates

@rsd-darshan rsd-darshan requested a review from a team as a code owner June 6, 2026 16:05
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 6, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 28115031-270b-4597-a330-b2e40d30d30d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

…able interval

Health clients sent reports to core on every collection tick regardless of
whether the content had changed. For stable machines the vast majority of
reports are success-only and carry identical data, causing unnecessary API
load even though the database deduplicates at storage time.

Add a per-key last-sent cache in HealthReportSink that hashes report content
(excluding timestamps, mirroring the DB-side hash_without_timestamps logic)
and skips re-enqueuing a success-only report whose hash is unchanged within
the configured suppress_unchanged_interval. Reports that contain any alert
are always forwarded immediately. The interval defaults to 5 minutes and can
be set to null to disable suppression entirely.

Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
@rsd-darshan rsd-darshan force-pushed the feat/variable-frequency-health-reports branch from a872b57 to 4760297 Compare June 6, 2026 16:28
@yoks
Copy link
Copy Markdown
Contributor

yoks commented Jun 7, 2026

Thanks @rsd-darshan can you also please add quick criterion test for hashing function? As it sits on hot path we may need to be sure it is not overly heavy on CPU.

@yoks
Copy link
Copy Markdown
Contributor

yoks commented Jun 7, 2026

/ok to test 055260e

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 7, 2026

@rsd-darshan
Copy link
Copy Markdown
Author

Thanks for the feedback! Added a criterion benchmark for the hashing function in the latest commit — it covers five report shapes: empty, small/large success-only, and small/large with alerts, to capture the realistic common case as well as worst-case scenarios.

@yoks
Copy link
Copy Markdown
Contributor

yoks commented Jun 7, 2026

Thanks, can you post results here? Latest CI run also had lint issues.

@yoks
Copy link
Copy Markdown
Contributor

yoks commented Jun 7, 2026

There is also one minor problem. HashMap with last seen report would grow without bounds over lifetime if report keys will drift (e.g. hardware added/removed). So maybe we can think of struct which would evict stale keys (hours old) or evict them on schedule.

@rsd-darshan
Copy link
Copy Markdown
Author

Benchmark results (Apple M-series, release mode):

Report shape Time
empty ~32 ns
small success-only (8 probes) ~1.27 µs
small with alerts (4 alerts) ~1.90 µs
large with alerts (64 alerts) ~52 µs
large success-only (256 probes) ~110 µs

The common case (small success-only reports) is ~1.3 µs per hash. The hashing only runs on success-only reports, so the alert path is unaffected. The large_success_only case (256 probes) at ~110 µs is an extreme outlier unlikely in practice.

Unbounded map growth — good catch. Fixed in the latest commit: entries older than 2 × suppress_unchanged_interval are evicted on each insert, keeping the map bounded to the set of recently active machines.

Benchmarks the report hashing function across five report shapes:
empty, small/large success-only, and small/large with alerts. Covers
the realistic common case (success-only) as well as alert-heavy reports
to surface any CPU cost at scale.

Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
The last_sent cache would grow without bound if machine keys drift over
the lifetime of the process (e.g. hardware added and removed). Evict
entries older than 2× the suppress interval on every insert so the map
stays bounded to the set of recently active machines.

Also fix the criterion benchmark for content_hash to build the sink's
HealthReport directly instead of converting from the external type.

Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
@rsd-darshan rsd-darshan force-pushed the feat/variable-frequency-health-reports branch from 7736892 to b17108f Compare June 7, 2026 08:53
@rsd-darshan
Copy link
Copy Markdown
Author

@yoks both points addressed in the latest commit — stale entry eviction is in, and benchmark results are in the comment above. Let me know if anything else needs changing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: health clients (like hw-health) should send health-reports to core with variable frequency

2 participants