STOR-2884: Add service to generate a TLS cert for the operator#106
Conversation
WalkthroughAdds a new ClusterIP Service manifest to expose the operator's metrics (with service-ca-operator cert annotation) and updates the operator CSV to mount the serving TLS secret and terminate when the cert files change. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml (1)
416-417: Consider addingoptional: trueto handle startup race condition.The secret volume references
secrets-store-csi-driver-operator-metrics-serving-cert, which is created by service-ca-operator after the Service is deployed. If the operator pod starts before the secret is created, it will fail to schedule.Adding
optional: truewould allow the pod to start even if the secret doesn't exist yet, and the operator would use self-signed certificates until the serving cert becomes available.♻️ Suggested change
- name: metrics-serving-cert secret: secretName: secrets-store-csi-driver-operator-metrics-serving-cert + optional: trueAlso applies to: 427-429
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml` around lines 416 - 417, The secret volume that mounts at mountPath "/var/run/secrets/serving-cert" with name "metrics-serving-cert" (referencing the secret "secrets-store-csi-driver-operator-metrics-serving-cert") should include optional: true under its secret spec so the pod can start if the ServiceCA-created secret isn't yet present; update the corresponding secret volume entry (and the other similar entry around the block referencing the same secret) to add optional: true to the secret field.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml`:
- Around line 416-417: The secret volume that mounts at mountPath
"/var/run/secrets/serving-cert" with name "metrics-serving-cert" (referencing
the secret "secrets-store-csi-driver-operator-metrics-serving-cert") should
include optional: true under its secret spec so the pod can start if the
ServiceCA-created secret isn't yet present; update the corresponding secret
volume entry (and the other similar entry around the block referencing the same
secret) to add optional: true to the secret field.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0fc7eb76-a9c9-44f9-ac5f-59b019e614a0
📒 Files selected for processing (2)
config/manifests/stable/secrets-store-csi-driver-operator-metrics-service.yamlconfig/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml
Add service secrets-store-csi-driver-operator-metrics to the CSI driver OLM manifests. OLM will instantiate it together with the operator Deployment. This service causes service-ca-operator to generate a TLS key + certificate for the operator. As result, the operator stops generating a self-signed cert and uses the provided one instead.
f813aa0 to
7e55ea4
Compare
|
@jsafrane: This pull request references STOR-2884 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml (2)
416-417: Consider addingreadOnly: truefor explicit security hardening.While secret volumes are read-only by default in Kubernetes, explicitly setting
readOnly: truedocuments the intent and provides defense-in-depth.🔒 Suggested change
- mountPath: /var/run/secrets/serving-cert name: metrics-serving-cert + readOnly: true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml` around lines 416 - 417, Add an explicit readOnly: true to the volumeMount entry that mounts the serving certificate (the block containing mountPath: /var/run/secrets/serving-cert and name: metrics-serving-cert) so the secret volume is explicitly read-only; update the volumeMount for metrics-serving-cert to include readOnly: true to document intent and provide defense-in-depth.
427-429: Addoptional: trueto prepare for future metrics exposure.The secret
secrets-store-csi-driver-operator-metrics-serving-certis created asynchronously by service-ca-operator. Per the Service definition comment, the operator does not currently expose metrics on this port (this is a TODO for future work). However, once metrics are exposed, the volumeMount will be actively used, and a race condition could occur where the pod attempts to start before the secret exists.Adding
optional: truenow ensures the deployment will be robust when metrics are eventually exposed:Suggested fix
- name: metrics-serving-cert secret: secretName: secrets-store-csi-driver-operator-metrics-serving-cert + optional: true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml` around lines 427 - 429, The volume entry named "metrics-serving-cert" referencing secret "secrets-store-csi-driver-operator-metrics-serving-cert" should mark the secret as optional to avoid startup races; update the volume spec for metrics-serving-cert so the secret block includes optional: true alongside secretName (i.e., add optional: true under the secret section for the metrics-serving-cert volume).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@config/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml`:
- Around line 416-417: Add an explicit readOnly: true to the volumeMount entry
that mounts the serving certificate (the block containing mountPath:
/var/run/secrets/serving-cert and name: metrics-serving-cert) so the secret
volume is explicitly read-only; update the volumeMount for metrics-serving-cert
to include readOnly: true to document intent and provide defense-in-depth.
- Around line 427-429: The volume entry named "metrics-serving-cert" referencing
secret "secrets-store-csi-driver-operator-metrics-serving-cert" should mark the
secret as optional to avoid startup races; update the volume spec for
metrics-serving-cert so the secret block includes optional: true alongside
secretName (i.e., add optional: true under the secret section for the
metrics-serving-cert volume).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 50e870ea-d9a5-4c17-a4ed-4c8b0a19f921
📒 Files selected for processing (2)
config/manifests/stable/secrets-store-csi-driver-operator-metrics-service.yamlconfig/manifests/stable/secrets-store-csi-driver-operator.clusterserviceversion.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- config/manifests/stable/secrets-store-csi-driver-operator-metrics-service.yaml
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jsafrane The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@jsafrane: This pull request references STOR-2884 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| name: secrets-store-csi-driver-operator-metrics | ||
| spec: | ||
| ports: | ||
| # This is a fake port, the operator does not expose any ports currently. |
There was a problem hiding this comment.
Would like to understand what is meant by "fake port" here :) ?
Does it mean the port is not being exposed through containerPort field in the container Spec?
Because, the metrics port is served by the operator currently i think**.
But i agree it is not exposed through a service which the current manifest is doing.
** Based on the library-go code and below checks on the operator
$kubectl get pod -n openshift-cluster-csi-drivers -l app=secrets-store-csi-driver-operator -o jsonpath='{.items[0].status.podIP}{"\n"}'
10.129.0.78
$k run test --rm -it --restart=Never --image=registry.access.redhat.com/ubi9/ubi-minimal:latest -n openshift-cluster-csi-drivers -- sh -c 'curl -k https://10.129.0.78:8443/healthz; echo'
ok
pod "test" deleted
$
There was a problem hiding this comment.
Does it mean the port is not being exposed through containerPort field in the container Spec?
Yes, exactly. The operator exposes it, and if you know its number, you could access it, but nothing actually reads the port (nor the new service)
|
w.r.t failing CI, this is being discussed in this slack thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1772715056609579 |
|
/lgtm |
|
/retest-required |
|
/test operator-e2e-gcp |
1 similar comment
|
/test operator-e2e-gcp |
|
/retest-required |
1 similar comment
|
/retest-required |
|
@mpatlasov, The e2e tests are failing due to a known issue, which is being discussed here: https://redhat-internal.slack.com/archives/C68TNFWA2/p1772715056609579 There is also a release blocker bug to track and fix the problem: https://issues.redhat.com/browse/OCPBUGS-77845 |
|
/test ? |
|
/test all |
@chiragkyal , the bug you mentioned (OCPBUGS-77845) is in the POST state now, the PR for this bug was opened on Mar 6, but it's still not merged. Hence, it looks as we're still in "wait for the fixes" state here. However e2e for aws and gcp passed in the morning today: here an here. This puzzles me, any ideas how it's possible that the fix is not merged, but e2e failures vanished? |
I had a discussion with the Installer team and other concerned team members about this blocker. And they were kind enough to come up with a workaround openshift/release#76228 to unblock us. Consider this as a temproray workaround since the actual fix might take some time to land. PTAL at Slack Thread |
|
/retest-required |
|
/test operator-e2e-fips |
4 similar comments
|
/test operator-e2e-fips |
|
/test operator-e2e-fips |
|
/test operator-e2e-fips |
|
/test operator-e2e-fips |
|
/test operator-e2e-fips |
1 similar comment
|
/test operator-e2e-fips |
|
@radeore the operator-e2e-fips job is failing due to a non-transient issue in gather-azure-cli job. |
…true on SS-CSI FIPS workflows We have been seeing recurrent TLS handshake failures in gather-azure-cli on FIPS clusters in openshift/secrets-store-csi-driver#62 and openshift/secrets-store-csi-driver-operator#106 This PR is created to allow_best_effort_post_steps: true on SS-CSI FIPS workflows, so that failures in post step: gather-azure-cli which declares itself as best-effort are ignored. A fix to gather-azure-cli is proposed in openshift#76450 Once that PR or an equivalent change merges, the workaround in this PR can be revered. A tracker is created for the same: https://redhat.atlassian.net/browse/SSCSI-267
…true on SS-CSI FIPS workflows (#76454) We have been seeing recurrent TLS handshake failures in gather-azure-cli on FIPS clusters in openshift/secrets-store-csi-driver#62 and openshift/secrets-store-csi-driver-operator#106 This PR is created to allow_best_effort_post_steps: true on SS-CSI FIPS workflows, so that failures in post step: gather-azure-cli which declares itself as best-effort are ignored. A fix to gather-azure-cli is proposed in #76450 Once that PR or an equivalent change merges, the workaround in this PR can be revered. A tracker is created for the same: https://redhat.atlassian.net/browse/SSCSI-267
|
/retest |
|
@jsafrane: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
'secrets-store-csi-driver-operator' is now using the service-ca-operator generated certificate as seen from the pod logs, |
|
/verified by CI and @radeore |
|
@radeore: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
…true on SS-CSI FIPS workflows (openshift#76454) We have been seeing recurrent TLS handshake failures in gather-azure-cli on FIPS clusters in openshift/secrets-store-csi-driver#62 and openshift/secrets-store-csi-driver-operator#106 This PR is created to allow_best_effort_post_steps: true on SS-CSI FIPS workflows, so that failures in post step: gather-azure-cli which declares itself as best-effort are ignored. A fix to gather-azure-cli is proposed in openshift#76450 Once that PR or an equivalent change merges, the workaround in this PR can be revered. A tracker is created for the same: https://redhat.atlassian.net/browse/SSCSI-267
Add service secrets-store-csi-driver-operator-metrics to the CSI driver OLM manifests. OLM will instantiate it together with the operator Deployment.
This service causes service-ca-operator to generate a TLS key + certificate for the operator. As result, the operator stops generating a self-signed cert and uses the provided one instead.
Before this PR:
After the PR:
+ the TLS credentials are created using cluster-global policy (e.g. post-quantum crypto only) + they are rotated together with the other cluster certificates.
This PR does not enable metrics collection from the operator, it just makes sure the metrics are exposed via a HTTPS endpoint with a correctly generated certificate + key.
Unlike e.g. openshift/csi-operator#522, no Dockerfile change is necessary. It already copies all yaml files:
secrets-store-csi-driver-operator/config/bundle.Dockerfile
Line 8 in 1ce018a
Summary by CodeRabbit
New Features
Chores