feat(health): Add NVOS Streaming telemetry#1975

Open
mkoci wants to merge 65 commits into
NVIDIA:mainfrom
mkoci:feature-nvos-health
Contributor

@mkoci mkoci commented May 28, 2026

Description

Adds a gNMI collector ([collectors.nvue.gnmi], disabled by default) that subscribes to NVUE over gNMI, using ON_CHANGE for /system-events and SAMPLE for the other paths. The subscribed paths are:

/components/component
/interfaces/interface
/system-events/*
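
As a rough illustration of the path-to-mode split described above, here is a minimal sketch. It assumes the description means ON_CHANGE for everything under /system-events and SAMPLE for the component and interface paths. `SubscriptionMode`, `mode_for_path`, and `is_under` are hypothetical names for this example, not identifiers from the PR.

```rust
/// The two gNMI stream subscription modes named in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionMode {
    OnChange,
    Sample,
}

/// True when `path` is `root` itself or a child of it
/// (a following `/` element or a `[key=value]` selector).
fn is_under(path: &str, root: &str) -> bool {
    match path.strip_prefix(root) {
        Some(rest) => rest.is_empty() || rest.starts_with('/') || rest.starts_with('['),
        None => false,
    }
}

/// Hypothetical helper: choose the mode for a subscribed path.
/// Returns `None` for paths this collector does not subscribe to.
pub fn mode_for_path(path: &str) -> Option<SubscriptionMode> {
    // Assumption: these two roots are the SAMPLE paths from the list above.
    const SAMPLE_ROOTS: [&str; 2] = ["/components/component", "/interfaces/interface"];

    if is_under(path, "/system-events") {
        Some(SubscriptionMode::OnChange)
    } else if SAMPLE_ROOTS.iter().any(|root| is_under(path, root)) {
        Some(SubscriptionMode::Sample)
    } else {
        None
    }
}
```

Keeping the mapping in one function makes it easy to see at a glance which paths stream on change and which are polled on an interval.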

It uses long-lived gRPC streams and reconnects with exponential backoff plus jitter.
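
For readers unfamiliar with the reconnection strategy, here is a minimal sketch of exponential backoff with "full jitter". It is not the collector's actual code: `backoff_delay` and its parameters are hypothetical, and the caller is assumed to supply the random value so the example needs no external crate.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based).
///
/// The ceiling doubles each attempt (`base * 2^attempt`) and is capped at `max`.
/// "Full jitter" then scales the ceiling by `unit_random`, which the caller
/// draws uniformly from [0.0, 1.0]. Non-finite input is treated as 0.0.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration, unit_random: f64) -> Duration {
    // Saturate instead of overflowing for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    let ceiling = base.checked_mul(factor).unwrap_or(max).min(max);

    let fraction = if unit_random.is_finite() {
        unit_random.clamp(0.0, 1.0)
    } else {
        0.0
    };
    ceiling.mul_f64(fraction)
}
```

Jitter spreads reconnect attempts out in time, so many collectors that lose their streams at once do not all hit the switch again at the same instant.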

Builds on #711 (SSE streaming + OtlpSink for logs). Protos vendored for reproducible offline builds, same as #711.

Currently supports PrometheusSink and OtlpSink.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Future

It would be useful to extend this to support HealthAlerts, especially for ON_CHANGE updates from /system-events, which carry extra information that is not available via the Switch BMC.

mkoci and others added 30 commits May 14, 2026 14:20
Signed-off-by: mkoci <mkoci@nvidia.com>
…ming

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…ls on switch hosts

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
… monitoring

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Contributor Author

mkoci commented May 29, 2026

I would like it to be merged after #1913, as it has many changes to how auth/endpoints work.

Makes sense - I'll look over that PR and make sure I understand the changes. Perhaps I can stack it on this in test.

There's more work to do for Switch Hosts outside this PR. Will have to properly map TLS certificate support for both NICo and static config.

Contributor

@kensimon kensimon left a comment

I haven't reviewed the whole change, so I don't know how important "label per instance" is to the concept, but we can't do it that way. Labels need to be bounded and well-known; we don't want to create Prometheus labels for individual objects like instances.
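
To make the cardinality concern concrete, here is a minimal sketch of one common way to keep label values bounded: map raw values onto a fixed, known set and collapse everything else into a single bucket. This is an illustration only, not necessarily the approach the reviewer has in mind. `KNOWN_COMPONENT_TYPES` and `bounded_label` are hypothetical names, and the listed values are made up for the example.

```rust
/// Hypothetical fixed set of allowed label values.
const KNOWN_COMPONENT_TYPES: [&str; 4] = ["fan", "psu", "transceiver", "asic"];

/// Map an arbitrary raw value to a bounded label value.
/// Known values are returned in canonical lowercase form; anything else,
/// such as a per-instance identifier, collapses to "other".
pub fn bounded_label(raw: &str) -> &'static str {
    KNOWN_COMPONENT_TYPES
        .iter()
        .copied()
        .find(|known| known.eq_ignore_ascii_case(raw))
        .unwrap_or("other")
}
```

With this shape, the number of distinct time series grows with the size of the fixed set rather than with the number of instances.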

Comment thread crates/health/src/collectors/nvue/gnmi/on_change_processor.rs
@kensimon kensimon dismissed their stale review June 1, 2026 19:56

See comments

@yoks yoks mentioned this pull request Jun 1, 2026
10 tasks
yoks added a commit that referenced this pull request Jun 2, 2026
## Description
Splits the Health service Prometheus endpoint into /metrics and
/telemetry.

Also removes the backward-compatibility `/` endpoint for metrics, as it was
confusing and was being misused by health probes.

This issue was raised after discussion on PR
#1975

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [x] **Change** - Changes in existing functionality  
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [x] This PR contains breaking changes

Telemetry is no longer exposed on `/metrics`.
The default `/` endpoint no longer exposes metrics.
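
The resulting path split can be sketched as follows. This is a minimal illustration rather than the service's actual routing code; `PrometheusEndpoint` and `prometheus_endpoint` are hypothetical names.

```rust
/// Which Prometheus payload a request path serves after the split.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrometheusEndpoint {
    Metrics,
    Telemetry,
}

/// Hypothetical router: only the two explicit paths expose Prometheus data.
/// `/` and every other path return `None`, so they no longer serve metrics.
pub fn prometheus_endpoint(path: &str) -> Option<PrometheusEndpoint> {
    match path {
        "/metrics" => Some(PrometheusEndpoint::Metrics),
        "/telemetry" => Some(PrometheusEndpoint::Telemetry),
        _ => None,
    }
}
```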

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated  
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->
Contributor

coderabbitai Bot commented Jun 3, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@mkoci mkoci requested review from Coco-Ben and kensimon June 7, 2026 18:49
Contributor Author

mkoci commented Jun 7, 2026

@Coco-Ben, @yoks, @kensimon - This should be in better shape now. I've merged in both #1913 and #2056.

As this is an optional feature in config, I can address any bugs found in testing in subsequent PRs. This is becoming unwieldy 😅

4 participants