feat(monitor): operator Prometheus metrics with mTLS#4558
feat(monitor): operator Prometheus metrics with mTLS#4558rene-dekker wants to merge 4 commits intotigera:masterfrom
Conversation
pkg/render/monitor/monitor.go
Outdated
| rules = append(rules, | ||
| monitoringv1.Rule{ | ||
| Alert: "TLSCertExpiringWarning", | ||
| Expr: intstr.FromString("tigera_operator_tls_certificate_expiry_timestamp_seconds - time() < 29 * 24 * 3600"), |
There was a problem hiding this comment.
I picked 29 days, because we automatically renew our own certs 30d before expiry. I wanted to exclude that day to generate fewer alerts.
Add operator metrics endpoint with configurable mTLS via METRICS_SCHEME, METRICS_HOST, and METRICS_PORT env vars. The monitor controller creates a server cert, Service, and ServiceMonitor for Prometheus integration. Client auth trusts the tigera-ca-private CA rather than individual leaf certs. Includes a custom Prometheus collector for operator status gauges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the implicit metrics-enabled detection (METRICS_HOST/PORT set) with an explicit METRICS_ENABLED=true env var. Default METRICS_HOST to 0.0.0.0 and METRICS_PORT to 8484 when enabled. Log a helpful message when mTLS is enabled but the server certificate is not yet available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use port 9484 instead of 8484 to reduce the chance of conflicts on host-networked nodes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set SecureServing: true so controller-runtime actually serves TLS instead of plain HTTP when METRICS_SCHEME=https. Add egress rule in calicoSystemPrometheusPolicy to allow Prometheus to reach the operator metrics Service. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| MinVersion: tls.VersionTLS12, | ||
| }, nil | ||
| } | ||
| cfg.MinVersion = tls.VersionTLS12 |
There was a problem hiding this comment.
Do we need a configurable min TLS version (and configurable client cert checking?)
I think we have that in the other metrics endpoints (at least for OSS Calico)
|
|
||
| // metricsEnabled returns true when the operator metrics endpoint is enabled. | ||
| func metricsEnabled() bool { | ||
| return strings.EqualFold(os.Getenv("METRICS_ENABLED"), "true") |
There was a problem hiding this comment.
Technically a breaking change, since someone might be using metrics today without this set?
But probably not a big deal - easy fix.
| } | ||
|
|
||
| // metricsTLSEnabled returns true when the operator metrics endpoint should use mTLS. | ||
| func metricsTLSEnabled() bool { |
There was a problem hiding this comment.
might be worth a metrics.go in the main package to put these helpers in to keep main.go a bit less of a "God file"
| // dynamicCertLoader dynamically loads TLS certificates from Kubernetes secrets | ||
| // for the metrics endpoint. The monitor controller creates the server cert, and | ||
| // the client CA is loaded from the Prometheus client TLS secret. | ||
| type dynamicCertLoader struct { |
There was a problem hiding this comment.
I wonder, do we need all of this complexity?
Couldn't we just use an optional secret file mount, and have kubelet auto-load changes to the mounted paths?
|
|
||
| func conditionLabel(ct operatorv1.StatusConditionType) string { | ||
| switch ct { | ||
| case operatorv1.ComponentAvailable: |
There was a problem hiding this comment.
Isn't this just effectively strings.ToLower(ct)?
|
|
||
| expiry, err := time.Parse(expiryFormat, expiryStr) | ||
| if err != nil { | ||
| continue |
| continue | ||
| } | ||
|
|
||
| issuer := s.Annotations["certificates.operator.tigera.io/issuer"] |
There was a problem hiding this comment.
Do we have a const we can use? Should we have one?
| } | ||
|
|
||
| func (c *OperatorCollector) collectLicense(ctx context.Context, ch chan<- prometheus.Metric) { | ||
| license, err := utils.FetchLicenseKey(ctx, c.client) |
There was a problem hiding this comment.
Is this going to generate an API call for OSS clusters?
| } | ||
|
|
||
| // metricsEnabled returns true when the operator metrics endpoint is enabled. | ||
| func metricsEnabled() bool { |
There was a problem hiding this comment.
Looks like a duplicate of the one in the main package - would be nice to have a common impl.
| bearerTokenFile = "/var/run/secrets/kubernetes.io/serviceaccount/token" | ||
| KubeControllerMetrics = "calico-kube-controllers-metrics" | ||
|
|
||
| OperatorMetricsSecretName = "tigera-operator-tls" |
There was a problem hiding this comment.
It's a bit confusing that this is "tigera-operator-tls" but the var is "metrics" focused - I think it is correct for the secret name to NOT include the word metrics, but perhaps the variable name should match.
Would be worth some comments explaining as well. If we do end up using the cert in the future for anything else, we might need to move its creation out of this controller?
| } | ||
|
|
||
| // serviceMonitorOperator creates a ServiceMonitor for the operator's metrics endpoint. | ||
| func (mc *monitorComponent) serviceMonitorOperator() *monitoringv1.ServiceMonitor { |
There was a problem hiding this comment.
we have "opertorMetricsService" but "serviceMonitorOperator" - naming is a bit inconsistently formatted.
| ObjectMeta: metav1.ObjectMeta{ | ||
| Name: OperatorMetricsServiceName, | ||
| Namespace: common.TigeraPrometheusNamespace, | ||
| Labels: map[string]string{"team": "network-operators"}, |
There was a problem hiding this comment.
I think this stems all the way back from when everything was manifest based installed. It may even have come from the prometheus example bundle. I think we should delete them all, everywhere probably.
Summary
METRICS_HOST,METRICS_PORT, andMETRICS_SCHEMEenv varsMETRICS_SCHEME=https: server cert fromtigera-operator-tls, client auth truststigera-ca-privateCAoperator_installation_statusandoperator_tigera_statusgaugesTest plan
make buildpassesmake ut UT_DIR=./pkg/controller/metrics— 10 tests passmake ut UT_DIR=./pkg/render/monitor— 20 tests passmake ut UT_DIR=./pkg/controller/monitor— 17 tests passMETRICS_HOST/METRICS_PORTset, verify metrics scrapedMETRICS_SCHEME=https, verify mTLS handshake with Prometheus🤖 Generated with Claude Code
Example alerts:

Example metrics: