Add opt-in workaround for OCM CA bundle race condition in acm-mch step #72976
|
@ccardenosa: GitHub didn't allow me to request PR reviews from the following users: openshift/openshift-team-edge-ztp. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs. |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp |
7cf2ff7 to 915f710
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp |
✅ Workaround Verified Working
The rehearsal job confirms the workaround successfully resolves the OCM CA bundle race condition.
Execution Summary
This workaround will be needed until the upstream fix (open-cluster-management-io/ocm#1309) is merged and released in a future ACM/MCE version. |
915f710 to 79ab2e2
|
/assign @sg-rh Could you please review this workaround? |
|
/pj-rehearse ack |
|
/assign @vboulos Could you please review this workaround? |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp |
79ab2e2 to d5b6d56
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp |
d5b6d56 to 5b57089
|
/pj-rehearse ack |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp |
5b57089 to e52e1e6
|
/pj-rehearse ack |
e52e1e6 to 3c79102
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ccardenosa |
3c79102 to 2c58fda
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp |
This adds an opt-in workaround for the cluster-manager controller race condition that causes CRDs to be created with an invalid "placeholder" CA bundle.

Upstream fix: open-cluster-management-io/ocm#1309

Problem:
The cluster-manager controller may create ClusterManagementAddOn and ManagedClusterAddOn CRDs before the cert rotation controller creates the CA bundle ConfigMap. When this happens, the CRDs are created with caBundle: cGxhY2Vob2xkZXI= (base64 of "placeholder"), causing:
1. Webhook conversion fails with "InvalidCABundle"
2. CRDs not becoming Established
3. API endpoints not registered
4. MCH fails: "no matches for kind ClusterManagementAddOn"

Additionally, the cluster-manager controller reads CA from ca-bundle-configmap. If this ConfigMap doesn't exist or is empty, it keeps re-applying CRDs with the placeholder CA, overwriting any manual patches.

Opt-in Mechanism:
Workarounds are now controlled via ENABLE_WORKAROUND_LIST env var:
- Default: "[]" (no workarounds enabled)
- To enable: ENABLE_WORKAROUND_LIST="[72976]"
- Each workaround is identified by its CI PR number

This ensures workarounds are:
- Explicitly enabled and traceable
- Easy to remove once upstream fix is released
- No unexpected behavior in production jobs

Workaround (6 steps):
When MCH fails to reach Running status and workaround 72976 is enabled, detect the race condition by checking for placeholder CA bundles, then:
1. Patch webhook services with serving-cert-secret-name annotation
2. Wait for service-ca-operator to create TLS secrets
3. Create ca-bundle-configmap from the serving cert secret
4. Extract real CA bundle from secrets and patch CRDs
5. Verify CRDs become Established
6. Restart cluster-manager and force MCE operator reconciliation

Design:
- Workaround only runs if enabled via ENABLE_WORKAROUND_LIST
- Detection is specific: checks for the exact placeholder value
- Once upstream fix is released, remove 72976 from the list
- Eventually remove workaround code entirely after fix is stable

Enabled for:
- openshift-kni-eco-ci-cd-ztp-left-shifting-kpi__ci-4.21.yaml (hub deployment)

Tested on sno-vhub-0: MCE reached Available status and MCH reached Running with 22/22 components after workaround applied.

Discovered in Prow jobs:
- periodic-ci-...-telcov10n-virtualised-single-node-hub-ztp/2005051399989104640
- periodic-ci-...-telcov10n-virtualised-single-node-hub-ztp/2005219283428184064

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Carlos Cardenosa <ccardeno@redhat.com>
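The opt-in mechanism described in the commit message could be checked along the following lines. This is a minimal sketch, not the code from acm-mch-commands.sh: the helper name is_workaround_enabled is hypothetical, and it assumes the list is always written as bare integers in brackets, as in "[72976]" or "[72976, 12345]".

```shell
#!/usr/bin/env bash
# Sketch only: is_workaround_enabled is a hypothetical helper name,
# not taken from the PR. It assumes ENABLE_WORKAROUND_LIST holds bare
# integer IDs in brackets, e.g. "[72976, 12345]".
is_workaround_enabled() {
  local id="$1"
  # Fall back to the documented default of "[]" when the variable is unset.
  local list="${ENABLE_WORKAROUND_LIST:-[]}"
  # Strip brackets and spaces, leaving a comma-separated list of IDs.
  list="${list//\[/}"
  list="${list//\]/}"
  list="${list// /}"
  local entry
  local IFS=','
  for entry in $list; do
    # Compare whole entries so that 7297 does not match 72976.
    if [ "$entry" = "$id" ]; then
      return 0
    fi
  done
  return 1
}
```

A caller would then guard the workaround with something like `if is_workaround_enabled 72976; then ...; fi`, so that the default of "[]" leaves job behavior unchanged.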
🔬 Bug Still Exists - Proved via Binary Analysis (Jan 6, 2026)
Background
Rehearsal job #2008550209683984384 succeeded without triggering the workaround. To verify whether the bug still exists or was fixed upstream, I performed binary analysis on the deployed cluster-manager.
Cluster Status
Binary Analysis
I searched the binary in the running cluster-manager pod:
$ oc exec -n multicluster-engine $POD -- grep -ao "placeholder" /registration-operator | wc -l
172
Result: The buggy code is still compiled into the binary (172 occurrences of "placeholder").
CRD State Check
Despite the bug existing in code, the CRDs have real certificates:
$ oc get crd clustermanagementaddons.addon.open-cluster-management.io \
    -o jsonpath="{.spec.conversion.webhook.clientConfig.caBundle}" | base64 -d | head -c 50
-----BEGIN CERTIFICATE-----
MIIDPzCCAiegAwIBAgIIBhmpSdaTem8wDQYJKoZIhvcNAQE
✅ Real certificate (not "placeholder")
Conclusion
The race condition didn't trigger in this run.
Why This Workaround is Still Needed
The workaround remains necessary until the upstream fix is merged and released in ACM/MCE.
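The CRD state check above can be turned into a small classifier that separates the placeholder value from a real certificate. This is a sketch under stated assumptions: classify_ca_bundle and PLACEHOLDER_B64 are hypothetical names, and the function only inspects the caBundle string it is given, so it runs without cluster access.

```shell
#!/usr/bin/env bash
# Sketch only: classify_ca_bundle and PLACEHOLDER_B64 are hypothetical
# names. The input is the base64 caBundle value as stored in the CRD spec.
PLACEHOLDER_B64="cGxhY2Vob2xkZXI="   # base64 of the literal string "placeholder"

classify_ca_bundle() {
  local ca_b64="$1"
  if [ -z "$ca_b64" ]; then
    echo "empty"
  elif [ "$ca_b64" = "$PLACEHOLDER_B64" ]; then
    echo "placeholder"
  elif printf '%s' "$ca_b64" | base64 -d 2>/dev/null \
      | grep -q -e '-----BEGIN CERTIFICATE-----'; then
    # Decodes to PEM data, as in the healthy CRD shown above.
    echo "certificate"
  else
    echo "unknown"
  fi
}

# Usage against a live cluster (requires oc and cluster access):
#   ca="$(oc get crd clustermanagementaddons.addon.open-cluster-management.io \
#         -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}')"
#   classify_ca_bundle "$ca"
```

Matching the exact placeholder string, rather than merely "not a certificate", keeps the detection specific to this race condition.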
2c58fda to e1ec1f9
|
|
/pj-rehearse ack |
Summary
This PR adds an opt-in workaround to the acm-mch step to handle a race condition in the OCM cluster-manager controller that causes MultiClusterHub deployments to fail intermittently.
Opt-in Mechanism
Workarounds are controlled via the ENABLE_WORKAROUND_LIST environment variable:
- ENABLE_WORKAROUND_LIST: "[]" (default, no workarounds enabled)
- ENABLE_WORKAROUND_LIST: "[72976]" (enables this workaround)
- ENABLE_WORKAROUND_LIST: "[72976, 12345]" (enables multiple workarounds)
Benefits of Opt-in Design
Currently Enabled For
- openshift-kni-eco-ci-cd-ztp-left-shifting-kpi__ci-4.21.yaml (ztp-left-shifting-kpi)
- openshift-kni-eco-ci-cd-main__ci-4.21.yaml (main)
Related Issues
Problem
The cluster-manager controller has a race condition where it may create CRDs (ClusterManagementAddOn, ManagedClusterAddOn) before the cert rotation controller creates the CA bundle ConfigMap. When this happens:
- CRDs are created with caBundle: cGxhY2Vob2xkZXI= (base64 of the literal string "placeholder")
- Webhook conversion fails with an InvalidCABundle error
- CRDs remain in the Established: False state
- MCH fails with "no matches for kind 'ClusterManagementAddOn' in version 'addon.open-cluster-management.io/v1alpha1'"
Evidence from Failed Prow Jobs
Solution
When ENABLE_WORKAROUND_LIST includes 72976, the workaround only triggers if the initial 30-minute wait for MCH fails.
Workaround Steps (when enabled and race condition detected)
- Detect the placeholder CA bundle (cGxhY2Vob2xkZXI=)
- Add the service.beta.openshift.io/serving-cert-secret-name annotation to webhook services
- Wait for service-ca-operator to create TLS certificates
- Create ca-bundle-configmap from the serving cert secret
Design Decisions
Cleanup Path
Once ocm#1309 is merged and released in ACM/MCE:
- Remove 72976 from ENABLE_WORKAROUND_LIST in affected configs
Testing
Changes
- ci-operator/step-registry/acm/mch/acm-mch-commands.sh: ENABLE_WORKAROUND_LIST opt-in mechanism
- ci-operator/step-registry/acm/mch/acm-mch-ref.yaml: ENABLE_WORKAROUND_LIST environment variable
- ci-operator/config/openshift-kni/eco-ci-cd/openshift-kni-eco-ci-cd-ztp-left-shifting-kpi__ci-4.21.yaml: ENABLE_WORKAROUND_LIST: "[72976]"
- ci-operator/config/openshift-kni/eco-ci-cd/openshift-kni-eco-ci-cd-main__ci-4.21.yaml: ENABLE_WORKAROUND_LIST: "[72976]"
/cc @openshift/openshift-team-edge-ztp
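The gating described under Solution (apply the workaround only after the initial MCH wait fails, only when 72976 is enabled, and only when the placeholder CA is detected) can be sketched as a small decision function. Everything here is illustrative: the helper names, the MCH_WAIT_RC and PLACEHOLDER_RC variables, and the returned strings are hypothetical stand-ins for the oc-based checks a real step script would perform.

```shell
#!/usr/bin/env bash
# Sketch of the gating logic only. All helpers are hypothetical stubs;
# a real script would replace them with oc-based checks.

# Stub: succeeds when MCH reached Running within the initial wait.
wait_for_mch_running() { return "${MCH_WAIT_RC:-0}"; }

# Stub: succeeds when a CRD still carries the placeholder CA bundle.
placeholder_ca_present() { return "${PLACEHOLDER_RC:-1}"; }

# Succeeds when the given ID appears as a whole number in the list.
workaround_enabled() {
  local re="(^|[^0-9])$1([^0-9]|\$)"
  [[ "${ENABLE_WORKAROUND_LIST:-[]}" =~ $re ]]
}

decide_action() {
  if wait_for_mch_running; then
    echo "ok"                 # healthy run, workaround never considered
  elif ! workaround_enabled 72976; then
    echo "fail"               # not opted in, keep the original failure
  elif placeholder_ca_present; then
    echo "apply-workaround"   # opted in and race condition detected
  else
    echo "fail"               # opted in, but a different failure cause
  fi
}
```

Ordering the checks this way means a job that has not opted in fails exactly as it did before the PR, and an opted-in job only changes behavior when the specific placeholder symptom is present.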