diff --git a/content/posts/2026-03-18-network-health/index.md b/content/posts/2026-03-18-network-health/index.md new file mode 100644 index 0000000..ec20f51 --- /dev/null +++ b/content/posts/2026-03-18-network-health/index.md @@ -0,0 +1,206 @@ +--- +layout: :theme/post +title: "Network Health: one dashboard for your cluster network" +description: "An overview of the Network Health dashboard in NetObserv: built-in rules, alerts vs recording rules, and how to add your own metrics." +tags: network,health,observability,prometheus,alerts,recording,rules,dashboard +authors: [lberetta] +--- + +NetObserv now features a dedicated **Network Health** section designed to provide a high-level overview of your cluster's networking status. This interface relies on a set of predefined health rules that automatically surface potential issues by analyzing NetObserv metrics. + +Out of the box, these rules monitor several key signals such as: + +- **DNS errors and NXDOMAIN responses** +- **packet drops** +- **network policy denials** +- **latency trends** +- **ingress errors** + +These built-in rules provide immediate diagnostic value without requiring users to write complex PromQL queries. However, the real power of the Network Health system lies in its **extensibility**, allowing operators to define and integrate custom health rules tailored to the specific behavior and expectations of their infrastructure. + +The dashboard is organized by scope: **Global**, **Nodes**, **Namespaces**, and **Workloads**. The tab counts show how many items you have in each scope, so you know at a glance where to look. A green status means no violations; yellow and red reflect warning and critical severity. You can find Network Health in the NetObserv console (standalone or OpenShift at **Observe > Network Traffic**). + +## Understanding Health Rules: Alerts vs Recording Rules + +Behind the scenes, the Network Health section is powered by **PrometheusRule** resources. 
NetObserv supports two different rule modes, each designed for a different monitoring strategy. + +### Alert mode + +**Alert rules** trigger when a metric exceeds a defined threshold. + +For example: *Packet loss > 10%* + +These rules are useful for detecting immediate issues that require action, and they integrate with the existing Prometheus and Alertmanager alerting pipeline. In the Network Health dashboard, alert rules appear when they are **pending** (before the threshold is sustained) or **actively firing**. + +### Recording mode + +**Recording rules** continuously compute and store metric values in Prometheus without generating alerts. + +In the Network Health dashboard, these metrics become visible as soon as the value reaches the lowest configured severity threshold (for example the *info* level). As the value evolves, the rule may move between *info*, *warning*, and *critical* states according to the thresholds defined in its configuration. + +Recording rules are particularly useful for: + +- continuously monitoring health indicators +- tracking performance trends over time +- reducing alert fatigue + +### Key difference + +**Alert rules** highlight situations where something is already wrong. + +**Recording rules**, on the other hand, provide continuous visibility into network conditions, allowing operators to observe how metrics evolve and detect early warning signs before a critical threshold is reached. + +In the FlowCollector you choose the **mode** per template or per variant: `Alert` or `Recording`. For example: + +```yaml +spec: + processor: + metrics: + healthRules: + - template: PacketDropsByKernel + mode: Recording + variants: + - thresholds: + critical: "10" + - thresholds: + critical: "15" + warning: "10" + info: "5" + groupBy: Node +``` + +You configure built-in rules under **Processor configuration > Metrics configuration**: **healthRules** and **include list** (the metrics each rule needs). 
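As a sketch of what that looks like in practice, the following fragment enables both pieces side by side. Note the metric names under `includeList` are assumptions here — check the FlowCollector reference for the exact identifiers your operator version exposes for drop counters:

```yaml
# Hedged sketch: a healthRules template only produces results if the
# metrics it queries are enabled in the include list. The two metric
# names below are assumptions to illustrate the shape of the config.
spec:
  processor:
    metrics:
      includeList:
        - node_drop_packets_total       # assumed name: per-node drop counts
        - namespace_drop_packets_total  # assumed name: per-namespace drop counts
      healthRules:
        - template: PacketDropsByKernel
          mode: Alert
```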
Full details are in the [Health rules documentation](https://github.com/netobserv/netobserv-operator/blob/main/docs/HealthRules.md) and the [runbooks](https://github.com/openshift/runbooks/tree/master/alerts/network-observability-operator). + +## Health in the topology + +When you use the **Topology** tab and select a node, namespace, or workload, the side panel can show a **Health** tab if there are violations for that resource. So you can go from “this namespace has DNS issues” in Network Health to the topology and click the namespace to see the same violations in context. + +## Configuring custom health rules + +Custom health signals can be integrated into the Network Health dashboard by creating a **PrometheusRule** resource. You can add both **custom alerts** and **custom recording rules**. + +**Custom alerts:** Define alerting rules in your PrometheusRule (with PromQL and thresholds). To have them show in the Network Health dashboard, add the label **netobserv: "true"** on the rule and the usual annotations (**summary**, **description**, and optionally **netobserv_io_network_health** for unit, threshold, and dashboard category). PrometheusRule allows annotations on individual alert rules, so you attach this metadata directly to each rule. Sample custom alerts are available in the [operator repository](https://github.com/netobserv/netobserv-operator/tree/main/config/samples/alerts); the [Health rules documentation](https://github.com/netobserv/netobserv-operator/blob/main/docs/HealthRules.md#creating-your-own-rules-that-contribute-to-the-health-dashboard) describes the metadata in detail. + +**Custom recording rules:** For recording rules, the PrometheusRule CRD does not allow annotations on individual rules of type `record`. NetObserv therefore uses a single annotation at the **PrometheusRule metadata level**: **netobserv.io/network-health**. This annotation defines how each recorded metric is interpreted and rendered in the dashboard. 
It can contain: + +- **severity thresholds** (info, warning, critical) +- the **metric unit** +- the **dashboard category** (Namespaces, Nodes, Owners/Workloads, or Global) +- optional **links** for contextual troubleshooting + +In both cases you need the label **netobserv: "true"** on the PrometheusRule and on each rule’s `labels`. The operator lists PrometheusRules **cluster-wide**, so you can create the resource in any namespace. (If you use the NetObserv namespace, keep in mind that uninstalling the operator may remove it and your rules with it; using another namespace such as `monitoring` avoids that.) + +### Example: custom recording rule + +The following example defines a simple recording rule and shows it in the Global tab with custom thresholds: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: my-recording-rules + namespace: netobserv + labels: + netobserv: "true" + annotations: + netobserv.io/network-health: | + { + "my_simple_number": { + "summary": "Test metric (value {{ $value }})", + "description": "Numeric value to test thresholds.", + "netobserv_io_network_health": "{\"unit\":\"\",\"upperBound\":\"100\",\"recordingThresholds\":{\"info\":\"10\",\"warning\":\"25\",\"critical\":\"50\"}}" + } + } +spec: + groups: + - name: SimpleNumber + interval: 30s + rules: + - record: my_simple_number + expr: vector(25) + labels: + netobserv: "true" +``` + +The value of the annotation is a JSON object: keys are the **metric names** (the `record:` field of each rule), and each value has `summary`, `description`, and optionally **netobserv_io_network_health** (the string that holds unit, thresholds, and category). Full details, field descriptions, and lifecycle notes are in the operator docs: [External (contributed) recording rules for Network Health](https://github.com/netobserv/netobserv-operator/blob/main/docs/HealthRules.md#external-contributed-recording-rules-for-network-health). 
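Since the post only links to sample custom alerts, here is a hedged sketch of what one could look like end to end. The resource name, namespace, metric, threshold, and severity label are all hypothetical; the `netobserv: "true"` labels and the `summary`/`description` annotations follow the conventions described above:

```yaml
# Hypothetical custom alert contributed to the Network Health dashboard.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-alert-rules
  namespace: monitoring       # any namespace works; avoids losing rules on operator uninstall
  labels:
    netobserv: "true"         # required for the operator to pick up this resource
spec:
  groups:
    - name: SimpleNumberAlerts
      rules:
        - alert: SimpleNumberTooHigh
          expr: my_simple_number > 50
          for: 5m             # stays "pending" until the threshold is sustained for 5 minutes
          labels:
            netobserv: "true" # required on the individual rule as well
            severity: warning
          annotations:
            summary: "Test value is high ({{ $value }})"
            description: "my_simple_number exceeded 50 for 5 minutes."
```

Because alert rules (unlike `record` rules) accept per-rule annotations, the dashboard metadata is attached directly to each alert rather than at the PrometheusRule metadata level.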
+ +## Demo: Istio and BookInfo recording rules + +If you have **Istio** and the **BookInfo** sample app installed, you can add recording rules that use Istio metrics to detect service-level problems and have them show up in the Network Health dashboard. Two useful signals are: + +1. **5xx error rate by workload** – percentage of server errors (HTTP 5xx) over total requests per destination app. Surfaces which service is misbehaving. +2. **P99 latency by workload** – 99th percentile request duration in milliseconds per destination app. Surfaces which service is slow. + +Both rules use the **destination_app** label (productpage, details, reviews, ratings in BookInfo) so they appear in the **Workloads** tab. You can tune the thresholds (info, warning, critical) to your SLOs. + +Ensure the **Prometheus** instance that evaluates this PrometheusRule also scrapes Istio metrics (e.g. from the mesh or the namespace where Istio exposes metrics). In OpenShift with the default monitoring stack, the same Prometheus often scrapes both NetObserv and the service mesh. + +Create a file (e.g. 
`istio-network-health-rules.yaml`) and apply it with `kubectl apply -f istio-network-health-rules.yaml`: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: istio-network-health + namespace: netobserv + labels: + netobserv: "true" + annotations: + netobserv.io/network-health: | + { + "istio_5xx_error_rate_percent": { + "summary": "5xx error rate {{ $value | printf \"%.2f\" }}% for {{ $labels.destination_app }}", + "description": "Percentage of HTTP 5xx responses over total requests (5m rate) for this workload.", + "netobserv_io_network_health": "{\"unit\":\"%\",\"upperBound\":\"100\",\"workloadLabels\":[\"destination_app\"],\"recordingThresholds\":{\"info\":\"1\",\"warning\":\"5\",\"critical\":\"10\"}}" + }, + "istio_p99_latency_milliseconds": { + "summary": "P99 latency {{ $value | printf \"%.0f\" }}ms for {{ $labels.destination_app }}", + "description": "99th percentile request duration in milliseconds (5m window) for this workload.", + "netobserv_io_network_health": "{\"unit\":\"ms\",\"upperBound\":\"5000\",\"workloadLabels\":[\"destination_app\"],\"recordingThresholds\":{\"info\":\"500\",\"warning\":\"1000\",\"critical\":\"2000\"}}" + } + } +spec: + groups: + - name: istio-network-health + interval: 30s + rules: + - record: istio_5xx_error_rate_percent + expr: | + sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_app) + / sum(rate(istio_requests_total[5m])) by (destination_app) + * 100 + labels: + netobserv: "true" + - record: istio_p99_latency_milliseconds + expr: | + histogram_quantile(0.99, + sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_app) + ) + labels: + netobserv: "true" +``` + +After a short delay (evaluation interval + operator sync), the new metrics appear in the **Workloads** tab. If BookInfo is under load and a service starts returning 5xx or latency increases, the corresponding rule will move from info to warning or critical according to the thresholds. 
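One hedged way to drive the 5xx rule on demand is Istio's fault-injection API. The sketch below assumes BookInfo's `ratings` service runs in the `default` namespace; adjust host and namespace to your install:

```yaml
# Hypothetical VirtualService that aborts ~50% of requests to the
# ratings service with HTTP 500, which should push
# istio_5xx_error_rate_percent past the warning threshold.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-fault
  namespace: default   # assumption: where BookInfo is deployed
spec:
  hosts:
    - ratings
  http:
    - fault:
        abort:
          percentage:
            value: 50
          httpStatus: 500
      route:
        - destination:
            host: ratings
```

Deleting the VirtualService removes the fault, and the recorded value should drop back below the thresholds once the 5-minute rate window clears.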
You can inject faults (e.g. with Istio’s fault injection) to see the dashboard update. + +**Note:** Metric names (`istio_requests_total`, `istio_request_duration_milliseconds_bucket`) match the default Istio telemetry. If you use a different Istio version or custom metric names, adjust the PromQL accordingly. + +## Summary + +- **Network Health** provides a high-level overview of your cluster’s networking status via predefined and custom health rules. +- Rules can run in **Alert** mode (fire when a threshold is exceeded, integrate with Alertmanager) or **Recording** mode (continuous visibility, trend and baseline use cases, less alert fatigue). +- **Alert rules** show up when something is already wrong; **recording rules** let you observe how metrics evolve and spot early warning signs. +- **Custom rules:** Add custom **alerts** (annotations on each rule) or **recording rules** (**netobserv.io/network-health** annotation at PrometheusRule metadata level) so they appear in the dashboard. +- Health violations are also visible in the **Topology** view when you select a resource. + +For runbooks and default thresholds of each built-in rule, see the [NetObserv runbooks](https://github.com/openshift/runbooks/tree/master/alerts/network-observability-operator). For configuration reference, see the [FlowCollector and Health rules documentation](https://github.com/netobserv/netobserv-operator/tree/main/docs) in the netobserv-operator repo. + +## Wrapping it up + +We've seen: + +- What the Network Health dashboard is and how it surfaces built-in signals (DNS, packet drops, latency, ingress errors, and more). +- The difference between **alert** and **recording** rules, and when to use each. +- How to configure custom health rules (alerts and recording rules) so they appear in the dashboard. +- A concrete demo with Istio and BookInfo: recording rules for 5xx error rate and P99 latency by workload. 
+ +As always, you can reach out to the development team on Slack (#netobserv-project on [slack.cncf.io](https://slack.cncf.io/)) or via our [discussion pages](https://github.com/netobserv/netobserv-operator/discussions). diff --git a/data/authors.yml b/data/authors.yml index fe94b27..af5a1d2 100644 --- a/data/authors.yml +++ b/data/authors.yml @@ -42,3 +42,10 @@ stleerh: profile: "https://github.com/stleerh" nickname: "stleerh" bio: "" +lberetta: + name: "Leandro Beretta" + avatar: "https://github.com/leandroberetta.png" + profile: "https://github.com/leandroberetta" + nickname: "lberetta" + bio: | I've been a software engineer at Red Hat since 2015. I'm from Buenos Aires, Argentina. \ No newline at end of file