feat(health): Add NVOS Streaming telemetry #1975
Signed-off-by: mkoci <mkoci@nvidia.com>
…ming Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…ls on switch hosts Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
… monitoring Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Makes sense - I'll look over that PR and make sure I understand the changes. Perhaps I can stack it on this in test. There's more work to do for Switch Hosts outside this PR. Will have to properly map TLS certificate support for both NICo and static config.
# Conflicts:
#   Cargo.lock
#   crates/health/src/collectors/runtime.rs
#   crates/health/src/discovery/spawn.rs
#   docs/architecture/health_aggregation.md
kensimon left a comment:
I haven't reviewed the whole change, so I don't know how important "label per instance" is to the concept, but we can't do it that way: labels need to be bounded and well-known, and we don't want to create Prometheus labels for individual objects like instances.
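To make the cardinality concern concrete, here is a minimal sketch in plain Rust (standard library only, no Prometheus client crate). The `EventKind` enum, its variants, and `count_by_kind` are hypothetical names chosen for illustration and are not taken from this PR. The idea is that the label value comes from a closed set, so the number of series stays fixed no matter how many instances exist.

```rust
use std::collections::BTreeMap;

/// Closed set of label values: the series count is bounded by this enum,
/// not by how many objects exist. (Hypothetical names, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum EventKind {
    LinkDown,
    FanFault,
    Other,
}

impl EventKind {
    /// Map free-form input onto the bounded set; unknown values collapse into `Other`.
    fn from_raw(raw: &str) -> Self {
        match raw {
            "link-down" => EventKind::LinkDown,
            "fan-fault" => EventKind::FanFault,
            _ => EventKind::Other,
        }
    }
}

/// Count events per kind. The instance id is deliberately not part of the key,
/// so adding instances never adds series.
fn count_by_kind(events: &[(&str, &str)]) -> BTreeMap<EventKind, u64> {
    let mut counts = BTreeMap::new();
    for (_instance_id, raw_kind) in events {
        *counts.entry(EventKind::from_raw(raw_kind)).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let events = [
        ("instance-a", "link-down"),
        ("instance-b", "link-down"),
        ("instance-c", "something-new"),
    ];
    // Three instances, but the map only ever holds keys from `EventKind`.
    let counts = count_by_kind(&events);
    println!("{counts:?}");
}
```

Per-instance detail would then travel through logs or events rather than through label values.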
## Description

Splits the Health service Prometheus endpoint into `/metrics` and `/telemetry`. Also removes the backward-compatibility `/` endpoint for metrics, as it was confusing and abused via health probes.

This issue was raised after discussion on PR #1975.

## Type of Change

- [ ] **Add** - New feature or capability
- [x] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes

- [x] This PR contains breaking changes

Telemetry is no longer exposed on `/metrics`.
The default `/` endpoint no longer exposes metrics.

## Testing

- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
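As an illustration of the endpoint split described above, the routing decision can be sketched in plain Rust. This is not the Health service's actual handler: the `Route` enum, `route`, and `status_code` are hypothetical, and returning 404 for `/` is an assumption, since the description only says that `/` no longer exposes metrics.

```rust
/// Which exposition a request path maps to. (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
enum Route {
    /// Metrics served on `/metrics`.
    Metrics,
    /// Telemetry, now served separately on `/telemetry`.
    Telemetry,
    /// Anything else, including the former `/` fallback.
    NotFound,
}

/// Exact-match routing: there is no catch-all that silently serves metrics.
fn route(path: &str) -> Route {
    match path {
        "/metrics" => Route::Metrics,
        "/telemetry" => Route::Telemetry,
        _ => Route::NotFound,
    }
}

/// Assumed status mapping: unknown paths get 404 instead of a metrics body.
fn status_code(route: &Route) -> u16 {
    match route {
        Route::NotFound => 404,
        Route::Metrics | Route::Telemetry => 200,
    }
}

fn main() {
    for path in ["/metrics", "/telemetry", "/"] {
        let r = route(path);
        println!("{path} -> {r:?} ({})", status_code(&r));
    }
}
```

With exact matching, a health probe pointed at `/` gets an explicit failure rather than a full metrics scrape, which is the abuse the description mentions.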
🌿 Preview your docs: https://nvidia-preview-pull-request-1975.docs.buildwithfern.com/infra-controller
…as it is redundant and doesn't exist in most versions
Description
gNMI collector (`[collectors.nvue.gnmi]`, disabled by default) subscribes to NVUE gNMI `ON_CHANGE` `/system-events` and `SAMPLE` paths for:

It uses long-lived gRPC streams with reconnection (exponential back-off + jitter).
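The reconnection policy mentioned above (exponential back-off plus jitter) can be sketched as follows. This is a generic illustration using only the standard library, not the collector's actual code: `backoff_delay`, its parameters, and the "scale into 50% to 100% of the capped delay" jitter scheme are assumptions. The jitter value is passed in by the caller so the sketch stays deterministic; a real collector would draw it from a random number generator on each attempt.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based): `base * 2^attempt`,
/// capped at `max`, then scaled into [50%, 100%] of that value by `jitter`,
/// which is clamped to [0.0, 1.0]. (Hypothetical helper, for illustration only.)
fn backoff_delay(attempt: u32, base: Duration, max: Duration, jitter: f64) -> Duration {
    // Saturating arithmetic keeps very large attempt counts from overflowing.
    let factor = 2u32.saturating_pow(attempt);
    let capped = base.saturating_mul(factor).min(max);
    let jitter = jitter.clamp(0.0, 1.0);
    capped.mul_f64(0.5 + 0.5 * jitter)
}

fn main() {
    let base = Duration::from_millis(500);
    let max = Duration::from_secs(30);
    // Fixed jitter for a reproducible printout; use a fresh random value
    // per attempt in practice so clients do not reconnect in lockstep.
    for attempt in 0..8 {
        let delay = backoff_delay(attempt, base, max, 0.5);
        println!("attempt {attempt}: reconnect in {delay:?}");
    }
}
```

The cap bounds the worst-case wait after a long outage, and the jitter spreads reconnects out when many streams drop at once.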
Builds on #711 (SSE streaming + `OtlpSink` for logs). Protos are vendored for reproducible offline builds, same as #711.
Currently supports `PrometheusSink` and `OtlpSink`.
Type of Change
Related Issues
Testing
Additional Notes
`num-gpus` for Switch host sdn partition response #1962
Future
It would be useful to port this to support `HealthAlerts`, especially for `ON_CHANGE` from `/system-events` - there is extra information available here that is not available via the Switch BMC.