feat(health): suppress unchanged success-only reports within configurable interval by rsd-darshan · Pull Request #2284 · NVIDIA/infra-controller

rsd-darshan · 2026-06-06T16:05:56Z

Summary

Health clients sent reports to core on every collection tick regardless of whether content had changed. For stable machines, the vast majority of reports are success-only and carry identical data — causing unnecessary API and network load even though the database already deduplicates at storage time via hash_without_timestamps.

Add a suppress_unchanged_interval config field (default 5m) to HealthReportSinkConfig
Add a per-key last-sent cache in HealthReportSink that hashes report content (excluding timestamps) and skips re-enqueuing a success-only report whose content is unchanged within the interval
Reports that contain any alert bypass suppression entirely and are always forwarded immediately
Set suppress_unchanged_interval = null to disable suppression

What changes

crates/health/src/config.rs

New field suppress_unchanged_interval: Option<Duration> on HealthReportSinkConfig, serialized with humantime_serde (e.g. "5m"), defaults to Some(300s)

crates/health/src/sink/health_report.rs

content_hash() — deterministic hash over sorted successes and alerts, excluding timestamps; mirrors the DB-side hash_without_timestamps approach
LastSent — lightweight per-key struct tracking content hash and send time
Suppression gate in handle_event before queue.save_latest
4 new unit tests: suppress within interval, alerts bypass suppression, changed content forwarded, suppression disabled

Test plan

cargo test -p carbide-health --lib — all 147 tests pass
Deploy to a staging environment and confirm API call rate drops for stable machines
Confirm machines with active alerts still receive timely report updates

copy-pr-bot · 2026-06-06T16:06:00Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-06T16:10:52Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 28115031-270b-4597-a330-b2e40d30d30d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…able interval Health clients sent reports to core on every collection tick regardless of whether the content had changed. For stable machines the vast majority of reports are success-only and carry identical data, causing unnecessary API load even though the database deduplicates at storage time. Add a per-key last-sent cache in HealthReportSink that hashes report content (excluding timestamps, mirroring the DB-side hash_without_timestamps logic) and skips re-enqueuing a success-only report whose hash is unchanged within the configured suppress_unchanged_interval. Reports that contain any alert are always forwarded immediately. The interval defaults to 5 minutes and can be set to null to disable suppression entirely. Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>

yoks · 2026-06-07T07:41:21Z

Thanks @rsd-darshan can you also please add quick criterion test for hashing function? As it sits on hot path we may need to be sure it is not overly heavy on CPU.

yoks · 2026-06-07T07:41:48Z

/ok to test 055260e

github-actions · 2026-06-07T07:42:54Z

🌿 Preview your docs: https://nvidia-preview-pull-request-2284.docs.buildwithfern.com/infra-controller

rsd-darshan · 2026-06-07T08:02:51Z

Thanks for the feedback! Added a criterion benchmark for the hashing function in the latest commit — it covers five report shapes: empty, small/large success-only, and small/large with alerts, to capture the realistic common case as well as worst-case scenarios.

yoks · 2026-06-07T08:05:42Z

Thanks, can you post results here? Latest CI run also had lint issues.

yoks · 2026-06-07T08:11:00Z

There is also one minor problem. HashMap with last seen report would grow without bounds over lifetime if report keys will drift (e.g. hardware added/removed). So maybe we can think of struct which would evict stale keys (hours old) or evict them on schedule.

rsd-darshan · 2026-06-07T08:35:30Z

Benchmark results (Apple M-series, release mode):

Report shape	Time
empty	~32 ns
small success-only (8 probes)	~1.27 µs
small with alerts (4 alerts)	~1.90 µs
large with alerts (64 alerts)	~52 µs
large success-only (256 probes)	~110 µs

The common case (small success-only reports) is ~1.3 µs per hash. The hashing only runs on success-only reports, so the alert path is unaffected. The large_success_only case (256 probes) at ~110 µs is an extreme outlier unlikely in practice.

Unbounded map growth — good catch. Fixed in the latest commit: entries older than 2 × suppress_unchanged_interval are evicted on each insert, keeping the map bounded to the set of recently active machines.

Benchmarks the report hashing function across five report shapes: empty, small/large success-only, and small/large with alerts. Covers the realistic common case (success-only) as well as alert-heavy reports to surface any CPU cost at scale. Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>

The last_sent cache would grow without bound if machine keys drift over the lifetime of the process (e.g. hardware added and removed). Evict entries older than 2× the suppress interval on every insert so the map stays bounded to the set of recently active machines. Also fix the criterion benchmark for content_hash to build the sink's HealthReport directly instead of converting from the external type. Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>

rsd-darshan · 2026-06-07T09:30:11Z

@yoks both points addressed in the latest commit — stale entry eviction is in, and benchmark results are in the comment above. Let me know if anything else needs changing!

rsd-darshan requested a review from a team as a code owner June 6, 2026 16:05

rsd-darshan mentioned this pull request Jun 6, 2026

feat: health clients (like hw-health) should send health-reports to core with variable frequency #1811

Open

2 tasks

rsd-darshan force-pushed the feat/variable-frequency-health-reports branch from 74f8c2c to 9b9bfee Compare June 6, 2026 16:09

rsd-darshan force-pushed the feat/variable-frequency-health-reports branch from a872b57 to 4760297 Compare June 6, 2026 16:28

Merge branch 'main' into feat/variable-frequency-health-reports

055260e

rsd-darshan added 2 commits June 7, 2026 14:38

rsd-darshan force-pushed the feat/variable-frequency-health-reports branch from 7736892 to b17108f Compare June 7, 2026 08:53

Merge branch 'main' into feat/variable-frequency-health-reports

4108f43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(health): suppress unchanged success-only reports within configurable interval#2284

feat(health): suppress unchanged success-only reports within configurable interval#2284
rsd-darshan wants to merge 5 commits into
NVIDIA:mainfrom
rsd-darshan:feat/variable-frequency-health-reports

rsd-darshan commented Jun 6, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Jun 6, 2026

Uh oh!

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading

Review skipped

Uh oh!

yoks commented Jun 7, 2026

Uh oh!

yoks commented Jun 7, 2026

Uh oh!

github-actions Bot commented Jun 7, 2026

Uh oh!

rsd-darshan commented Jun 7, 2026

Uh oh!

yoks commented Jun 7, 2026

Uh oh!

yoks commented Jun 7, 2026

Uh oh!

rsd-darshan commented Jun 7, 2026

Uh oh!

rsd-darshan commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rsd-darshan commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changes

Test plan

Uh oh!

copy-pr-bot Bot commented Jun 6, 2026

Uh oh!

coderabbitai Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

yoks commented Jun 7, 2026

Uh oh!

yoks commented Jun 7, 2026

Uh oh!

github-actions Bot commented Jun 7, 2026

Uh oh!

rsd-darshan commented Jun 7, 2026

Uh oh!

yoks commented Jun 7, 2026

Uh oh!

yoks commented Jun 7, 2026

Uh oh!

rsd-darshan commented Jun 7, 2026

Uh oh!

rsd-darshan commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rsd-darshan commented Jun 6, 2026 •

edited

Loading

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading