Skip to content

STOR-2884: Add service to generate a TLS cert for the operator#106

Merged
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
jsafrane:add-service-to-generate-a-tls
Mar 23, 2026
Merged

STOR-2884: Add service to generate a TLS cert for the operator#106
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
jsafrane:add-service-to-generate-a-tls

Conversation

@jsafrane
Copy link
Copy Markdown
Contributor

@jsafrane jsafrane commented Mar 11, 2026

Add service secrets-store-csi-driver-operator-metrics to the CSI driver OLM manifests. OLM will instantiate it together with the operator Deployment.

This service causes service-ca-operator to generate a TLS key + certificate for the operator. As result, the operator stops generating a self-signed cert and uses the provided one instead.

Before this PR:

oc -n openshift-cluster-csi-drivers logs deployment/secrets-store-csi-driver-operator 

W0311 12:08:40.326159       1 cmd.go:257] Using insecure, self-signed certificates

After the PR:

I0311 12:11:26.571006       1 cmd.go:253] Using service-serving-cert provided certificates

+ the TLS credentials are created using cluster-global policy (e.g. post-quantum crypto only) + they are rotated together with the other cluster certificates.

This PR does not enable metrics collection from the operator, it just makes sure the metrics are exposed via a HTTPS endpoint with a correctly generated certificate + key.

Unlike e.g. openshift/csi-operator#522, no Dockerfile change is necessary. It already copies all yaml files:

COPY manifests/stable/*.yaml /manifests/

Summary by CodeRabbit

  • New Features

    • Secure metrics endpoint enabled for the operator with certificate-based TLS.
  • Chores

    • Deployment updated to consume serving certificates and restart when certificates change to ensure continued secure monitoring.
    • Service added to support certificate provisioning for the metrics endpoint.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 11, 2026

Walkthrough

Adds a new ClusterIP Service manifest to expose the operator's metrics (with service-ca-operator cert annotation) and updates the operator CSV to mount the serving TLS secret and terminate when the cert files change.

Changes

Cohort / File(s) Summary
Metrics Service Configuration
config/manifests/stable/secrets-store-csi-driver-operator-metrics-service.yaml
Adds a ClusterIP Service named for operator metrics, annotated for automatic serving certificate generation, exposing port https (443 -> targetPort 8443/TCP); sessionAffinity set to None.
Operator CSV / Pod spec
config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml
Adds --terminate-on-files args for the serving cert and key, adds a volumeMount at /var/run/secrets/serving-cert, and declares a secret-backed volume metrics-serving-cert sourced from secrets-store-csi-driver-operator-metrics-serving-cert.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Stable And Deterministic Test Names ✅ Passed This PR only modifies YAML manifest files and does not add or modify any Ginkgo test files or definitions.
Test Structure And Quality ✅ Passed This PR does not include any Ginkgo test code changes, only Kubernetes manifest modifications.
Title check ✅ Passed The title accurately describes the primary change: adding a Kubernetes Service to trigger TLS certificate generation for the operator via service-ca-operator.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml (1)

416-417: Consider adding optional: true to handle startup race condition.

The secret volume references secrets-store-csi-driver-operator-metrics-serving-cert, which is created by service-ca-operator after the Service is deployed. If the operator pod starts before the secret is created, it will fail to schedule.

Adding optional: true would allow the pod to start even if the secret doesn't exist yet, and the operator would use self-signed certificates until the serving cert becomes available.

♻️ Suggested change
               - name: metrics-serving-cert
                 secret:
                   secretName: secrets-store-csi-driver-operator-metrics-serving-cert
+                  optional: true

Also applies to: 427-429

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml`
around lines 416 - 417, The secret volume that mounts at mountPath
"/var/run/secrets/serving-cert" with name "metrics-serving-cert" (referencing
the secret "secrets-store-csi-driver-operator-metrics-serving-cert") should
include optional: true under its secret spec so the pod can start if the
ServiceCA-created secret isn't yet present; update the corresponding secret
volume entry (and the other similar entry around the block referencing the same
secret) to add optional: true to the secret field.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml`:
- Around line 416-417: The secret volume that mounts at mountPath
"/var/run/secrets/serving-cert" with name "metrics-serving-cert" (referencing
the secret "secrets-store-csi-driver-operator-metrics-serving-cert") should
include optional: true under its secret spec so the pod can start if the
ServiceCA-created secret isn't yet present; update the corresponding secret
volume entry (and the other similar entry around the block referencing the same
secret) to add optional: true to the secret field.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0fc7eb76-a9c9-44f9-ac5f-59b019e614a0

📥 Commits

Reviewing files that changed from the base of the PR and between 1ce018a and f813aa0.

📒 Files selected for processing (2)
  • config/manifests/stable/secrets-store-csi-driver-operator-metrics-service.yaml
  • config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml

Add service secrets-store-csi-driver-operator-metrics to the CSI driver OLM
manifests. OLM will instantiate it together with the operator Deployment.

This service causes service-ca-operator to generate a TLS key + certificate
for the operator. As result, the operator stops generating a self-signed
cert and uses the provided one instead.
@jsafrane jsafrane force-pushed the add-service-to-generate-a-tls branch from f813aa0 to 7e55ea4 Compare March 11, 2026 12:17
@jsafrane jsafrane changed the title Add service to generate a TLS cert for the operator STOR-2884: Add service to generate a TLS cert for the operator Mar 11, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 11, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Mar 11, 2026

@jsafrane: This pull request references STOR-2884 which is a valid jira issue.

Details

In response to this:

Add service secrets-store-csi-driver-operator-metrics to the CSI driver OLM manifests. OLM will instantiate it together with the operator Deployment.

This service causes service-ca-operator to generate a TLS key + certificate for the operator. As result, the operator stops generating a self-signed cert and uses the provided one instead.

Before this PR:

oc -n openshift-cluster-csi-drivers logs deployment/secrets-store-csi-driver-operator 

W0311 12:08:40.326159       1 cmd.go:257] Using insecure, self-signed certificates

After the PR:

I0311 12:11:26.571006       1 cmd.go:253] Using service-serving-cert provided certificates

+ the TLS credentials are created using cluster-global policy (e.g. post-quantum crypto only) + they are rotated together with the other cluster certificates.

This PR does not enable metrics collection from the operator, it just makes sure the metrics are exposed via a HTTPS endpoint with a correctly generated certificate + key.

Summary by CodeRabbit

  • New Features

  • Secure metrics endpoint enabled for the operator with certificate-based TLS.

  • Chores

  • Deployment updated to consume serving certificates and restart when certificates change to ensure continued secure monitoring.

  • Service added to support certificate provisioning for the metrics endpoint.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml (2)

416-417: Consider adding readOnly: true for explicit security hardening.

While secret volumes are read-only by default in Kubernetes, explicitly setting readOnly: true documents the intent and provides defense-in-depth.

🔒 Suggested change
                      - mountPath: /var/run/secrets/serving-cert
                        name: metrics-serving-cert
+                       readOnly: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml`
around lines 416 - 417, Add an explicit readOnly: true to the volumeMount entry
that mounts the serving certificate (the block containing mountPath:
/var/run/secrets/serving-cert and name: metrics-serving-cert) so the secret
volume is explicitly read-only; update the volumeMount for metrics-serving-cert
to include readOnly: true to document intent and provide defense-in-depth.

427-429: Add optional: true to prepare for future metrics exposure.

The secret secrets-store-csi-driver-operator-metrics-serving-cert is created asynchronously by service-ca-operator. Per the Service definition comment, the operator does not currently expose metrics on this port (this is a TODO for future work). However, once metrics are exposed, the volumeMount will be actively used, and a race condition could occur where the pod attempts to start before the secret exists.

Adding optional: true now ensures the deployment will be robust when metrics are eventually exposed:

Suggested fix
                  - name: metrics-serving-cert
                    secret:
                      secretName: secrets-store-csi-driver-operator-metrics-serving-cert
+                     optional: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml`
around lines 427 - 429, The volume entry named "metrics-serving-cert"
referencing secret "secrets-store-csi-driver-operator-metrics-serving-cert"
should mark the secret as optional to avoid startup races; update the volume
spec for metrics-serving-cert so the secret block includes optional: true
alongside secretName (i.e., add optional: true under the secret section for the
metrics-serving-cert volume).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml`:
- Around line 416-417: Add an explicit readOnly: true to the volumeMount entry
that mounts the serving certificate (the block containing mountPath:
/var/run/secrets/serving-cert and name: metrics-serving-cert) so the secret
volume is explicitly read-only; update the volumeMount for metrics-serving-cert
to include readOnly: true to document intent and provide defense-in-depth.
- Around line 427-429: The volume entry named "metrics-serving-cert" referencing
secret "secrets-store-csi-driver-operator-metrics-serving-cert" should mark the
secret as optional to avoid startup races; update the volume spec for
metrics-serving-cert so the secret block includes optional: true alongside
secretName (i.e., add optional: true under the secret section for the
metrics-serving-cert volume).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 50e870ea-d9a5-4c17-a4ed-4c8b0a19f921

📥 Commits

Reviewing files that changed from the base of the PR and between f813aa0 and 7e55ea4.

📒 Files selected for processing (2)
  • config/manifests/stable/secrets-store-csi-driver-operator-metrics-service.yaml
  • config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • config/manifests/stable/secrets-store-csi-driver-operator-metrics-service.yaml

@openshift-ci openshift-ci Bot requested review from mpatlasov and mytreya-rh March 11, 2026 12:20
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Mar 11, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 11, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Mar 11, 2026

@jsafrane: This pull request references STOR-2884 which is a valid jira issue.

Details

In response to this:

Add service secrets-store-csi-driver-operator-metrics to the CSI driver OLM manifests. OLM will instantiate it together with the operator Deployment.

This service causes service-ca-operator to generate a TLS key + certificate for the operator. As result, the operator stops generating a self-signed cert and uses the provided one instead.

Before this PR:

oc -n openshift-cluster-csi-drivers logs deployment/secrets-store-csi-driver-operator 

W0311 12:08:40.326159       1 cmd.go:257] Using insecure, self-signed certificates

After the PR:

I0311 12:11:26.571006       1 cmd.go:253] Using service-serving-cert provided certificates

+ the TLS credentials are created using cluster-global policy (e.g. post-quantum crypto only) + they are rotated together with the other cluster certificates.

This PR does not enable metrics collection from the operator, it just makes sure the metrics are exposed via a HTTPS endpoint with a correctly generated certificate + key.

Unlike e.g. openshift/csi-operator#522, no Dockerfile change is necessary. It already copies all yaml files:

COPY manifests/stable/*.yaml /manifests/

Summary by CodeRabbit

  • New Features

  • Secure metrics endpoint enabled for the operator with certificate-based TLS.

  • Chores

  • Deployment updated to consume serving certificates and restart when certificates change to ensure continued secure monitoring.

  • Service added to support certificate provisioning for the metrics endpoint.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

name: secrets-store-csi-driver-operator-metrics
spec:
ports:
# This is a fake port, the operator does not expose any ports currently.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like to understand what is meant by "fake port" here :) ?
Does it mean the port is not being exposed through containerPort field in the container Spec?

Because, the metrics port is served by the operator currently i think**.
But i agree it is not exposed through a service which the current manifest is doing.

** Based on the library-go code and below checks on the operator

$kubectl get pod -n openshift-cluster-csi-drivers -l app=secrets-store-csi-driver-operator -o jsonpath='{.items[0].status.podIP}{"\n"}'
10.129.0.78
$k run test --rm -it --restart=Never --image=registry.access.redhat.com/ubi9/ubi-minimal:latest -n openshift-cluster-csi-drivers -- sh -c 'curl -k https://10.129.0.78:8443/healthz; echo'
ok
pod "test" deleted
$

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean the port is not being exposed through containerPort field in the container Spec?

Yes, exactly. The operator exposes it, and if you know its number, you could access it, but nothing actually reads the port (nor the new service)

@mytreya-rh
Copy link
Copy Markdown
Contributor

w.r.t failing CI, this is being discussed in this slack thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1772715056609579

@mpatlasov
Copy link
Copy Markdown
Contributor

/lgtm
/retest-required

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 12, 2026
@mpatlasov
Copy link
Copy Markdown
Contributor

/retest-required

@mpatlasov
Copy link
Copy Markdown
Contributor

/test operator-e2e-gcp

1 similar comment
@mpatlasov
Copy link
Copy Markdown
Contributor

/test operator-e2e-gcp

@mpatlasov
Copy link
Copy Markdown
Contributor

/retest-required

1 similar comment
@mpatlasov
Copy link
Copy Markdown
Contributor

/retest-required

@chiragkyal
Copy link
Copy Markdown
Member

chiragkyal commented Mar 13, 2026

@mpatlasov, The e2e tests are failing due to a known issue, which is being discussed here: https://redhat-internal.slack.com/archives/C68TNFWA2/p1772715056609579

There is also a release blocker bug to track and fix the problem: https://issues.redhat.com/browse/OCPBUGS-77845
We'd have to wait for the fixes to land before we can properly provision a manual STS cluster. :)

@radeore
Copy link
Copy Markdown

radeore commented Mar 13, 2026

/test ?

@chiragkyal
Copy link
Copy Markdown
Member

/test all

@mpatlasov
Copy link
Copy Markdown
Contributor

@mpatlasov, The e2e tests are failing due to a known issue, which is being discussed here: https://redhat-internal.slack.com/archives/C68TNFWA2/p1772715056609579

There is also a release blocker bug to track and fix the problem: https://issues.redhat.com/browse/OCPBUGS-77845 We'd have to wait for the fixes to land before we can properly provision a manual STS cluster. :)

@chiragkyal , the bug you mentioned (OCPBUGS-77845) is in the POST state now, the PR for this bug was opened on Mar 6, but it's still not merged. Hence, it looks as we're still in "wait for the fixes" state here. However e2e for aws and gcp passed in the morning today: here an here.

This puzzles me, any ideas how it's possible that the fix is not merged, but e2e failures vanished?

@chiragkyal
Copy link
Copy Markdown
Member

@mpatlasov, The e2e tests are failing due to a known issue, which is being discussed here: https://redhat-internal.slack.com/archives/C68TNFWA2/p1772715056609579
There is also a release blocker bug to track and fix the problem: https://issues.redhat.com/browse/OCPBUGS-77845 We'd have to wait for the fixes to land before we can properly provision a manual STS cluster. :)

@chiragkyal , the bug you mentioned (OCPBUGS-77845) is in the POST state now, the PR for this bug was opened on Mar 6, but it's still not merged. Hence, it looks as we're still in "wait for the fixes" state here. However e2e for aws and gcp passed in the morning today: here an here.

This puzzles me, any ideas how it's possible that the fix is not merged, but e2e failures vanished?

I had a discussion with the Installer team and other concerned team members about this blocker. And they were kind enough to come up with a workaround openshift/release#76228 to unblock us. Consider this as a temproray workaround since the actual fix might take some time to land. PTAL at Slack Thread

@chiragkyal
Copy link
Copy Markdown
Member

/retest-required

@mpatlasov
Copy link
Copy Markdown
Contributor

/test operator-e2e-fips

4 similar comments
@radeore
Copy link
Copy Markdown

radeore commented Mar 16, 2026

/test operator-e2e-fips

@radeore
Copy link
Copy Markdown

radeore commented Mar 16, 2026

/test operator-e2e-fips

@radeore
Copy link
Copy Markdown

radeore commented Mar 17, 2026

/test operator-e2e-fips

@mpatlasov
Copy link
Copy Markdown
Contributor

/test operator-e2e-fips

@radeore
Copy link
Copy Markdown

radeore commented Mar 19, 2026

/test operator-e2e-fips

1 similar comment
@radeore
Copy link
Copy Markdown

radeore commented Mar 20, 2026

/test operator-e2e-fips

@mytreya-rh
Copy link
Copy Markdown
Contributor

@radeore the operator-e2e-fips job is failing due to a non-transient issue in gather-azure-cli job.
We explored a workaround here: openshift/release#76454
But i think we are close to a fix here: openshift/release#76450
just FYI

mytreya-rh added a commit to mytreya-rh/release that referenced this pull request Mar 23, 2026
…true on SS-CSI FIPS workflows

We have been seeing recurrent TLS handshake failures in gather-azure-cli on FIPS clusters in openshift/secrets-store-csi-driver#62 and openshift/secrets-store-csi-driver-operator#106
This PR is created to allow_best_effort_post_steps: true on SS-CSI FIPS workflows, so that failures in post step: gather-azure-cli which declares itself as best-effort are ignored.
A fix to gather-azure-cli is proposed in openshift#76450

Once that PR or an equivalent change merges, the workaround in this PR can be revered.
A tracker is created for the same: https://redhat.atlassian.net/browse/SSCSI-267
openshift-merge-bot Bot pushed a commit to openshift/release that referenced this pull request Mar 23, 2026
…true on SS-CSI FIPS workflows (#76454)

We have been seeing recurrent TLS handshake failures in gather-azure-cli on FIPS clusters in openshift/secrets-store-csi-driver#62 and openshift/secrets-store-csi-driver-operator#106
This PR is created to allow_best_effort_post_steps: true on SS-CSI FIPS workflows, so that failures in post step: gather-azure-cli which declares itself as best-effort are ignored.
A fix to gather-azure-cli is proposed in #76450

Once that PR or an equivalent change merges, the workaround in this PR can be revered.
A tracker is created for the same: https://redhat.atlassian.net/browse/SSCSI-267
@chiragkyal
Copy link
Copy Markdown
Member

/retest

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Mar 23, 2026

@jsafrane: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@radeore
Copy link
Copy Markdown

radeore commented Mar 23, 2026

@radeore
Copy link
Copy Markdown

radeore commented Mar 23, 2026

/verified by CI and @radeore

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 23, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@radeore: This PR has been marked as verified by CI and @radeore.

Details

In response to this:

/verified by CI and @radeore

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot openshift-merge-bot Bot merged commit db592fd into openshift:main Mar 23, 2026
14 checks passed
BaiyangZhou pushed a commit to BaiyangZhou/release that referenced this pull request Mar 24, 2026
…true on SS-CSI FIPS workflows (openshift#76454)

We have been seeing recurrent TLS handshake failures in gather-azure-cli on FIPS clusters in openshift/secrets-store-csi-driver#62 and openshift/secrets-store-csi-driver-operator#106
This PR is created to allow_best_effort_post_steps: true on SS-CSI FIPS workflows, so that failures in post step: gather-azure-cli which declares itself as best-effort are ignored.
A fix to gather-azure-cli is proposed in openshift#76450

Once that PR or an equivalent change merges, the workaround in this PR can be revered.
A tracker is created for the same: https://redhat.atlassian.net/browse/SSCSI-267
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants