feat(monitor): operator Prometheus metrics with mTLS #4558

Open
rene-dekker wants to merge 4 commits into tigera:master from rene-dekker:EV-6493

rene-dekker commented Mar 16, 2026

Summary

  • Add configurable Prometheus metrics endpoint to the operator via METRICS_HOST, METRICS_PORT, and METRICS_SCHEME env vars
  • mTLS support when METRICS_SCHEME=https: server cert from tigera-operator-tls, client auth trusts tigera-ca-private CA
  • Monitor controller creates Service, ServiceMonitor, and server TLS cert for automatic Prometheus discovery
  • Custom Prometheus collector exposes operator_installation_status and operator_tigera_status gauges
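
To make the mTLS bullet concrete, here is a minimal sketch of a server-side `tls.Config` that requires client certificates signed by a given CA. It is not the code from this PR: `newMTLSConfig` and its parameters are hypothetical names, and the CA bundle is assumed to arrive as PEM bytes (for example, read from the tigera-ca-private secret).

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// newMTLSConfig builds a server-side TLS config that only accepts clients
// presenting a certificate signed by a CA in caPEM. getCert supplies the
// server certificate on each handshake. Both names are hypothetical.
func newMTLSConfig(caPEM []byte, getCert func(*tls.ClientHelloInfo) (*tls.Certificate, error)) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificates found in PEM input")
	}
	return &tls.Config{
		MinVersion:     tls.VersionTLS12,
		ClientAuth:     tls.RequireAndVerifyClientCert, // this is what makes it mutual TLS
		ClientCAs:      pool,
		GetCertificate: getCert,
	}, nil
}

func main() {
	// An invalid bundle is rejected up front instead of failing at handshake time.
	_, err := newMTLSConfig([]byte("not a PEM bundle"), nil)
	fmt.Println("error for invalid CA bundle:", err)
}
```

Trusting the CA rather than pinning individual leaf certificates means a rotated Prometheus client cert keeps working as long as the same CA signs it.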

Test plan

  • make build passes
  • make ut UT_DIR=./pkg/controller/metrics — 10 tests pass
  • make ut UT_DIR=./pkg/render/monitor — 20 tests pass
  • make ut UT_DIR=./pkg/controller/monitor — 17 tests pass
  • Manual: deploy with METRICS_HOST/METRICS_PORT set, verify metrics scraped
  • Manual: deploy with METRICS_SCHEME=https, verify mTLS handshake with Prometheus

🤖 Generated with Claude Code

Example alerts: [image]

Example metrics:

$ curl --cert client.crt --key client.key --cacert ca.crt https://tigera-operator-metrics.tigera-operator:9484/metrics | grep tigera_operator_tls_certificate
# HELP tigera_operator_tls_certificate_expiry_timestamp_seconds Unix timestamp of certificate expiry for operator-managed TLS secrets.
# TYPE tigera_operator_tls_certificate_expiry_timestamp_seconds gauge
tigera_operator_tls_certificate_expiry_timestamp_seconds{issuer="byo-signer",name="calico-apiserver-certs",namespace="calico-system"} 1.774828114e+09
tigera_operator_tls_certificate_expiry_timestamp_seconds{issuer="tigera-operator-signer",name="calico-apiserver-certs",namespace="tigera-operator"} 1.844609326e+09

$ curl --cert client.crt --key client.key --cacert ca.crt https://tigera-operator-metrics.tigera-operator:9484/metrics | grep tigera_operator_component_status
tigera_operator_component_status{component="apiserver",condition="available"} 1
tigera_operator_component_status{component="apiserver",condition="degraded"} 0
tigera_operator_component_status{component="apiserver",condition="progressing"} 0
tigera_operator_component_status{component="calico",condition="available"} 1
tigera_operator_component_status{component="calico",condition="degraded"} 0
tigera_operator_component_status{component="calico",condition="progressing"} 0


$ curl --cert client.crt --key client.key --cacert ca.crt https://tigera-operator-metrics.tigera-operator:9484/metrics | grep tigera_operator_license
tigera_operator_license_expiry_timestamp_seconds{package="Enterprise"} 2.051337599e+09
# HELP tigera_operator_license_valid Whether the Tigera license is valid (including grace period). 1 = valid, 0 = invalid.
tigera_operator_license_valid{package="Enterprise"} 1

rules = append(rules,
	monitoringv1.Rule{
		Alert: "TLSCertExpiringWarning",
		Expr:  intstr.FromString("tigera_operator_tls_certificate_expiry_timestamp_seconds - time() < 29 * 24 * 3600"),
Member Author


I picked 29 days, because we automatically renew our own certs 30d before expiry. I wanted to exclude that day to generate fewer alerts.

rene-dekker and others added 4 commits March 27, 2026 09:04
Add operator metrics endpoint with configurable mTLS via METRICS_SCHEME,
METRICS_HOST, and METRICS_PORT env vars. The monitor controller creates
a server cert, Service, and ServiceMonitor for Prometheus integration.
Client auth trusts the tigera-ca-private CA rather than individual leaf
certs. Includes a custom Prometheus collector for operator status gauges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the implicit metrics-enabled detection (METRICS_HOST/PORT set)
with an explicit METRICS_ENABLED=true env var. Default METRICS_HOST to
0.0.0.0 and METRICS_PORT to 8484 when enabled. Log a helpful message
when mTLS is enabled but the server certificate is not yet available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use port 9484 instead of 8484 to reduce the chance of conflicts on
host-networked nodes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Set SecureServing: true so controller-runtime actually serves TLS
instead of plain HTTP when METRICS_SCHEME=https. Add egress rule in
calicoSystemPrometheusPolicy to allow Prometheus to reach the operator
metrics Service.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
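
Taken together, the commits above describe this resolution order: metrics are off unless METRICS_ENABLED=true, and host and port fall back to 0.0.0.0 and 9484. A minimal sketch of that logic follows. `metricsBindAddress` and the injected `getenv` parameter are hypothetical and may not match how main.go structures it.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

const (
	defaultMetricsHost = "0.0.0.0"
	defaultMetricsPort = "9484"
)

// metricsBindAddress returns the listen address for the metrics endpoint and
// whether metrics are enabled. getenv is injected so the logic can be tested
// without touching the process environment.
func metricsBindAddress(getenv func(string) string) (string, bool) {
	if !strings.EqualFold(getenv("METRICS_ENABLED"), "true") {
		return "", false
	}
	host := getenv("METRICS_HOST")
	if host == "" {
		host = defaultMetricsHost
	}
	port := getenv("METRICS_PORT")
	if port == "" {
		port = defaultMetricsPort
	}
	return net.JoinHostPort(host, port), true
}

func main() {
	addr, enabled := metricsBindAddress(os.Getenv)
	fmt.Printf("enabled=%v addr=%q\n", enabled, addr)
}
```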
@rene-dekker rene-dekker marked this pull request as ready for review March 27, 2026 18:19
@rene-dekker rene-dekker requested a review from a team as a code owner March 27, 2026 18:19
MinVersion: tls.VersionTLS12,
}, nil
}
cfg.MinVersion = tls.VersionTLS12

Do we need a configurable min TLS version (and configurable client cert checking)?

I think we have that in the other metrics endpoints (at least for OSS Calico)
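
If we go that way, one possible shape is a small parser in front of `cfg.MinVersion`. This is only a sketch: the `METRICS_TLS_MIN_VERSION` variable name and `parseMinTLSVersion` are made up for illustration and are not taken from the other Calico endpoints.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"os"
	"strings"
)

// parseMinTLSVersion maps a user-supplied string to a crypto/tls version
// constant. An empty value falls back to TLS 1.2, matching the hard-coded
// MinVersion in the diff above.
func parseMinTLSVersion(v string) (uint16, error) {
	switch strings.TrimSpace(v) {
	case "", "1.2":
		return tls.VersionTLS12, nil
	case "1.3":
		return tls.VersionTLS13, nil
	default:
		return 0, fmt.Errorf("unsupported minimum TLS version %q", v)
	}
}

func main() {
	// METRICS_TLS_MIN_VERSION is a hypothetical variable name.
	v, err := parseMinTLSVersion(os.Getenv("METRICS_TLS_MIN_VERSION"))
	fmt.Printf("min version=0x%04x err=%v\n", v, err)
}
```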


// metricsEnabled returns true when the operator metrics endpoint is enabled.
func metricsEnabled() bool {
	return strings.EqualFold(os.Getenv("METRICS_ENABLED"), "true")

Technically a breaking change, since someone might be using metrics today without this set?

But probably not a big deal - easy fix.
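
If we wanted to avoid the break, a fallback along these lines might work. It is a sketch that assumes the old behaviour was "enabled when either METRICS_HOST or METRICS_PORT is set". The `getenv` parameter is hypothetical and only there for testability.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// metricsEnabled prefers the explicit METRICS_ENABLED switch and only falls
// back to the legacy implicit detection when the switch is unset.
func metricsEnabled(getenv func(string) string) bool {
	if v := getenv("METRICS_ENABLED"); v != "" {
		return strings.EqualFold(v, "true")
	}
	// Assumed legacy behaviour: setting a host or port implied "enabled".
	return getenv("METRICS_HOST") != "" || getenv("METRICS_PORT") != ""
}

func main() {
	fmt.Println("metrics enabled:", metricsEnabled(os.Getenv))
}
```

An explicit METRICS_ENABLED=false still wins over the legacy variables, so the new switch stays authoritative.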

}

// metricsTLSEnabled returns true when the operator metrics endpoint should use mTLS.
func metricsTLSEnabled() bool {

Might be worth adding a metrics.go in the main package to hold these helpers, to keep main.go a bit less of a "God file".

// dynamicCertLoader dynamically loads TLS certificates from Kubernetes secrets
// for the metrics endpoint. The monitor controller creates the server cert, and
// the client CA is loaded from the Prometheus client TLS secret.
type dynamicCertLoader struct {

I wonder, do we need all of this complexity?

Couldn't we just use an optional secret file mount, and have kubelet auto-load changes to the mounted paths?
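
A rough sketch of what the file-mount alternative could look like, assuming the secret is mounted at a path such as /etc/tigera-operator-tls (hypothetical). `fileCertLoader` is an illustrative name. The idea is to reload the key pair whenever the mounted file's modification time changes.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"os"
	"sync"
	"time"
)

// fileCertLoader serves a key pair from disk and reloads it when the
// certificate file's modification time changes.
type fileCertLoader struct {
	certPath, keyPath string

	mu      sync.Mutex
	modTime time.Time
	cert    *tls.Certificate
}

// GetCertificate has the signature expected by tls.Config.GetCertificate.
func (l *fileCertLoader) GetCertificate(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
	info, err := os.Stat(l.certPath)
	if err != nil {
		// With an optional mount the files may simply not exist yet.
		return nil, fmt.Errorf("server certificate not available yet: %w", err)
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.cert == nil || !info.ModTime().Equal(l.modTime) {
		pair, err := tls.LoadX509KeyPair(l.certPath, l.keyPath)
		if err != nil {
			return nil, err
		}
		l.cert, l.modTime = &pair, info.ModTime()
	}
	return l.cert, nil
}

func main() {
	// Paths are hypothetical mount locations.
	l := &fileCertLoader{
		certPath: "/etc/tigera-operator-tls/tls.crt",
		keyPath:  "/etc/tigera-operator-tls/tls.key",
	}
	cfg := &tls.Config{MinVersion: tls.VersionTLS12, GetCertificate: l.GetCertificate}
	fmt.Println("GetCertificate wired:", cfg.GetCertificate != nil)
}
```

The trade-off is that the kubelet only syncs mounted secrets periodically, so a freshly created cert may take a while to appear in the pod.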


func conditionLabel(ct operatorv1.StatusConditionType) string {
	switch ct {
	case operatorv1.ComponentAvailable:

Isn't this just effectively strings.ToLower(ct)?
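
For illustration, the simplified form would look roughly like this. The type and constants below are local stand-ins for the operatorv1 ones, and the values "Available", "Progressing" and "Degraded" are assumed. Because the type is a named string, the call needs an explicit `string(ct)` conversion.

```go
package main

import (
	"fmt"
	"strings"
)

// StatusConditionType is a local stand-in for operatorv1.StatusConditionType.
type StatusConditionType string

// Assumed values; check them against the operatorv1 API package.
const (
	ComponentAvailable   StatusConditionType = "Available"
	ComponentProgressing StatusConditionType = "Progressing"
	ComponentDegraded    StatusConditionType = "Degraded"
)

// conditionLabel lower-cases the condition type for use as a metric label.
func conditionLabel(ct StatusConditionType) string {
	return strings.ToLower(string(ct))
}

func main() {
	for _, ct := range []StatusConditionType{ComponentAvailable, ComponentProgressing, ComponentDegraded} {
		fmt.Println(conditionLabel(ct))
	}
}
```

One difference from a switch: unknown condition types pass through lower-cased instead of hitting a default case, which may or may not be what we want for label cardinality.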


expiry, err := time.Parse(expiryFormat, expiryStr)
if err != nil {
	continue

Maybe a debug log?
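
Something like the following, sketched with the standard library's `log/slog` rather than the operator's own logger. The `expiryFormat` value and the `parseExpiries` helper are assumptions, since the real layout constant is not shown in this hunk.

```go
package main

import (
	"fmt"
	"log/slog"
	"time"
)

// expiryFormat is an assumed layout; the real constant is not shown in the diff.
const expiryFormat = time.RFC3339

// parseExpiries converts raw expiry strings (keyed by secret name) into
// timestamps, logging and skipping any value that fails to parse.
func parseExpiries(values map[string]string) map[string]time.Time {
	out := make(map[string]time.Time, len(values))
	for name, raw := range values {
		expiry, err := time.Parse(expiryFormat, raw)
		if err != nil {
			slog.Debug("skipping secret with unparseable expiry",
				"secret", name, "value", raw, "error", err)
			continue
		}
		out[name] = expiry
	}
	return out
}

func main() {
	parsed := parseExpiries(map[string]string{
		"calico-apiserver-certs": "2026-03-29T23:48:34Z",
		"broken-secret":          "not-a-timestamp",
	})
	fmt.Println("parsed secrets:", len(parsed))
}
```

Debug level keeps a permanently malformed annotation from flooding the logs on every scrape.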

continue
}

issuer := s.Annotations["certificates.operator.tigera.io/issuer"]

Do we have a const we can use? Should we have one?

}

func (c *OperatorCollector) collectLicense(ctx context.Context, ch chan<- prometheus.Metric) {
license, err := utils.FetchLicenseKey(ctx, c.client)

Is this going to generate an API call for OSS clusters?

}

// metricsEnabled returns true when the operator metrics endpoint is enabled.
func metricsEnabled() bool {

Looks like a duplicate of the one in the main package - would be nice to have a common impl.

bearerTokenFile = "/var/run/secrets/kubernetes.io/serviceaccount/token"
KubeControllerMetrics = "calico-kube-controllers-metrics"

OperatorMetricsSecretName = "tigera-operator-tls"

It's a bit confusing that this is "tigera-operator-tls" but the var is "metrics" focused - I think it is correct for the secret name to NOT include the word metrics, but perhaps the variable name should match.

Would be worth some comments explaining as well. If we do end up using the cert in the future for anything else, we might need to move its creation out of this controller?

}

// serviceMonitorOperator creates a ServiceMonitor for the operator's metrics endpoint.
func (mc *monitorComponent) serviceMonitorOperator() *monitoringv1.ServiceMonitor {

We have "operatorMetricsService" but "serviceMonitorOperator"; the naming is inconsistent.

ObjectMeta: metav1.ObjectMeta{
	Name:      OperatorMetricsServiceName,
	Namespace: common.TigeraPrometheusNamespace,
	Labels:    map[string]string{"team": "network-operators"},

What is this label?

Member Author


I think this dates all the way back to when everything was installed from manifests. It may even have come from the Prometheus example bundle. We should probably delete these labels everywhere.
