
OCPBUGS-78832: control-plane-operator/controllers/hostedcontrolplane/v2/cvo: Consume include.release.openshift.io/hypershift-bootstrap annotation#7988

Open
wking wants to merge 2 commits into openshift:main from wking:narrowly-scoped-cvo-bootstrap

Conversation

@wking
Member

@wking wking commented Mar 17, 2026

What this PR does / why we need it:

The cluster-version operator has a complicated system for deciding whether a given release-image manifest should be managed in the current cluster. Implementing that system here, or even using library-go and remembering to vendor-bump here, both seem like an annoying maintenance load.

We could use the CVO's render command like the standalone installer, but that logic is fairly complicated because it needs to generate all the artifacts necessary for bootstrap MachineConfig rendering, or the production machine-config operator will complain about MachineConfigPools requesting rendered-... MachineConfig that don't exist.

All we actually need out of the bootstrap container are the resources that the cluster-version operator needs to launch and run, which are labeled with the grep target since openshift/cluster-version-operator#1352. That avoids installing anything the cluster doesn't actually need here by mistake. Once the production CVO container starts, it will apply the remaining resources that the cluster actually needs.
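The selection step this describes can be sketched in shell. This is a minimal illustration of the grep-based approach, not the actual init-container script: the payload directory here is a temporary stand-in, and the manifest file names are hypothetical.

```shell
#!/bin/sh
set -eu

# Stand-in payload directory with two hypothetical sample manifests: one
# carrying the new bootstrap annotation, one carrying only the broader
# hypershift inclusion annotation.
PAYLOAD_DIR="$(mktemp -d)"
cat > "${PAYLOAD_DIR}/0000_00_cvo-serviceaccount.yaml" <<'EOF'
metadata:
  annotations:
    include.release.openshift.io/bootstrap-cluster-version-operator: "hypershift"
EOF
cat > "${PAYLOAD_DIR}/0000_50_other-operator.yaml" <<'EOF'
metadata:
  annotations:
    include.release.openshift.io/hypershift: "true"
EOF

# Select only the manifests the CVO needs to launch and run; everything
# else is left for the production CVO container to apply later.
matches="$(grep -rl 'include.release.openshift.io/bootstrap-cluster-version-operator: .*hypershift' "${PAYLOAD_DIR}" || true)"
echo "${matches}"
# The real init container would feed these matches to 'oc apply -f'.
```

With the sample files above, only the manifest carrying the bootstrap annotation is selected.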

I'm also dropping the openshift-config and openshift-config-managed Namespace creation. They are from a30db71 (#5125), but that commit doesn't explain why they were added or hint at where they lived before (if anywhere). I would expect the cluster-version operator to be able to create those Namespaces from the release-image manifests when they are needed, as with other cluster resources.

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai
Contributor

coderabbitai Bot commented Mar 17, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped: an excluded label is present.
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: e53f36a9-6ef3-448f-892f-f01b47321e5f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Mar 17, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and muraee March 17, 2026 18:17
@openshift-ci openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Mar 17, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
Once this PR has been reviewed and has the lgtm label, please assign jparrill for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci Bot commented Mar 18, 2026

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/verify
Commit: 1a59094
Required: true
Rerun command: /test verify

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

wking added 2 commits March 18, 2026 15:55
… include.release.openshift.io/bootstrap-cluster-version-operator annotation

The cluster-version operator has a complicated system for deciding
whether a given release-image manifest should be managed in the
current cluster [1,2].  Implementing that system here, or even using
library-go and remembering to vendor-bump here, both seem like an
annoying maintenance load.

We could use the CVO's render command like the standalone installer
[3,4], but that logic is fairly complicated because it needs to
generate all the artifacts necessary for bootstrap MachineConfig
rendering, or the production machine-config operator will complain
about MachineConfigPools requesting rendered-... MachineConfig that
don't exist.

All we actually need out of the bootstrap container are the resources
that the cluster-version operator needs to launch and run, which are
labeled with the grep target since [5].  That avoids installing
anything the cluster doesn't actually need here by mistake.  Once the
production CVO container starts, it will apply the remaining resources
that the cluster actually needs.

The new "is there a .status.history entry?" guard keeps this loop from
running if we already have a functioning cluster-version operator (we
don't want to be wrestling with the CVO over the state of the
ClusterVersion CRD).  The 'oc apply' (instead of 'oc create') gives us
a clear "all of those exist now" exit code we can use to break out of
the loop during the initial setup (because this init-container needs
to complete before the long-running CVO container can start).
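The loop described above might look roughly like this. This is a control-flow sketch only: the 'oc' calls are stubbed (stub_apply and EXISTING_HISTORY are stand-ins invented for illustration, not names from the actual script).

```shell
#!/bin/sh
set -eu

# Stand-in for 'oc apply -f <manifests>': fail a couple of times to mimic
# "ClusterVersion CRD not yet established" and transient API hiccups,
# then succeed.
STUB_FAILS=2
stub_apply() {
  STUB_FAILS=$((STUB_FAILS - 1))
  [ "${STUB_FAILS}" -le 0 ]
}

applied=false
attempts=0
while [ "${applied}" = false ] && [ "${attempts}" -lt 10 ]; do
  attempts=$((attempts + 1))
  # Guard: a populated .status.history means a CVO is already managing the
  # cluster, and this loop must not wrestle with it over ClusterVersion
  # state. The real check would read something like:
  #   oc get clusterversion version -o jsonpath='{.status.history}'
  if [ -n "${EXISTING_HISTORY:-}" ]; then
    echo "functioning CVO detected; skipping bootstrap apply"
    break
  fi
  # 'oc apply' (unlike 'oc create') exits 0 once everything exists, which
  # is the clean signal that lets this init container complete.
  if stub_apply; then
    applied=true
  fi
done
echo "applied=${applied} after ${attempts} attempts"
```

With the stub failing once before succeeding, the loop exits after the second attempt, mirroring how a real run would recover from a transient apply failure.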

I'm also dropping the openshift-config and openshift-config-managed
Namespace creation.  They are from a30db71 (Refactor
cluster-version-operator, 2024-11-18, openshift#5125), but that commit doesn't
explain why they were added or hint at where they lived before (if
anywhere).  I would expect the cluster-version operator to be able to
create those Namespaces from the release-image manifests when they are
needed, as with other cluster resources.

I'm also shifting the ClusterVersion custom resource apply into the
loop, to avoid attempting to apply before the ClusterVersion CRD
exists and to more gracefully recover from temporary API hiccup sorts
of things.

I'm also adding some debugging echos and other output to make it
easier to debug "hey, why is it applying these resources that I didn't
expect it to?" or "... not applying the resources I did expect?".

[1]: https://github.com/openshift/enhancements/blob/2b38513b8661632f08e64f4acc3b856e842f8669/dev-guide/cluster-version-operator/dev/operators.md#manifest-inclusion-annotations
[2]: https://github.com/openshift/library-go/blob/ac826d10cb4081fe3034b027863c08953d95f602/pkg/manifest/manifest.go#L296-L376
[3]: https://github.com/openshift/installer/blob/a300d8c0e9d9d566a85740244a7da74d3d63e23c/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L189-L216
[4]: https://github.com/openshift/cluster-version-operator/blob/eaf28f5165bde27435b0f0c9a69458677034a58d/pkg/payload/render.go
[5]: openshift/cluster-version-operator#1352
…r-version-operator: Regenerate

Regenerate with:

  $ UPDATE=true make test
@wking wking force-pushed the narrowly-scoped-cvo-bootstrap branch from b18cd52 to 87457d8 Compare March 18, 2026 23:10
@wking wking changed the title WIP: control-plane-operator/controllers/hostedcontrolplane/v2/cvo: Consume include.release.openshift.io/hypershift-bootstrap annotation OCPBUGS-78832: control-plane-operator/controllers/hostedcontrolplane/v2/cvo: Consume include.release.openshift.io/hypershift-bootstrap annotation Mar 19, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Mar 19, 2026
@openshift-ci-robot

@wking: This pull request references Jira Issue OCPBUGS-78832, which is valid.

3 validations were run on this bug:
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


In response to this: (the PR description, quoted above).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Mar 19, 2026
@openshift-bot

Stale PRs are closed after 21d of inactivity.

If this PR is still relevant, comment to refresh it or remove the stale label.
Mark the PR as fresh by commenting /remove-lifecycle stale.

If this PR is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2026
@hypershift-jira-solve-ci



Test Failure Analysis Complete


Error

=== Job 1: ci/prow/verify ===
Commit 87457d8dd6:
  1: CT1 Title does not start with one of fix, feat, chore, docs, style, refactor, perf, test, revert, ci, build: "control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-version-operator: Regenerate"

Commit bc26dbe16b:
  1: CT1 Title does not start with one of fix, feat, chore, docs, style, refactor, perf, test, revert, ci, build
  1: T1  Title exceeds max length (144>120)
  48: B1 Line exceeds max length (175>140)
  50: B1 Line exceeds max length (160>140)

make: *** [Makefile:423: run-gitlint] Error 5

=== Job 2: ci/prow/e2e-azure-self-managed ===
--- FAIL: TestCreateCluster/ValidateHostedCluster (2841.86s)
  Failed to wait for 2 nodes to become ready in 45m0s: context deadline exceeded
  observed **v1.Node collection invalid: expected 2 nodes, got 0
  Degraded=True: UnavailableReplicas([catalog-operator, cluster-network-operator,
    cluster-storage-operator, cluster-version-operator, csi-snapshot-controller-operator,
    dns-operator, hosted-cluster-config-operator, ingress-operator, olm-operator, packageserver])

Summary

Both jobs fail due to issues introduced by PR #7988. The verify job fails because commit messages do not follow the required conventional commit format (gitlint CT1/T1/B1 violations). The e2e-azure-self-managed job fails because the PR changes the CVO bootstrap init container to grep for the annotation include.release.openshift.io/bootstrap-cluster-version-operator: .*hypershift in release payload manifests, but this annotation does not yet exist in the CI release payload. The annotation is supposed to be added by cluster-version-operator PR #1352, which has not been merged. Consequently, the CVO bootstrap init container runs indefinitely finding zero matching manifests, the CVO pod never initializes, all hosted cluster operators remain unavailable, zero nodes join, and the test times out after 45 minutes.

Root Cause

Job 1 (verify): The run-gitlint Makefile target (line 423) enforces conventional commit formatting via gitlint 0.19.1 with a custom CT1 rule. Both commits in the PR fail validation:

  • Commit 87457d8dd6 ("control-plane-operator/.../cluster-version-operator: Regenerate") — missing conventional commit prefix (e.g., chore:, feat:)
  • Commit bc26dbe16b ("control-plane-operator/.../v2/cvo: Consume include.release.openshift.io/bootstrap-cluster-version-operator annotation") — missing prefix, title exceeds 120 chars (144), and two body lines exceed 140 chars (URLs in commit message references)

Job 2 (e2e-azure-self-managed): The PR modifies the CVO deployment template to use a new bootstrap init container script that runs:

grep -rl 'include.release.openshift.io/bootstrap-cluster-version-operator: .*hypershift' /var/payload/manifests

This annotation (include.release.openshift.io/bootstrap-cluster-version-operator) is a new annotation that must be added to CVO manifests in the release payload by openshift/cluster-version-operator PR #1352. That PR is still open and has merge conflicts (mergeable_state: dirty). Since the annotation doesn't exist in any manifest in the current CI release payload:

  1. The grep -rl returns nothing → ls -l of empty expansion fails
  2. The oc apply in the loop applies zero manifests (only /tmp/clusterversion.json if it even gets there)
  3. The bootstrap init container loops forever waiting for clusterversions.config.openshift.io version to have a status.history entry
  4. The CVO pod stays in Pending phase with ContainersNotInitialized — the bootstrap init container shows ready: false, started: true, restartCount: 0, state: running since 23:55:40 and never completes
  5. Without the CVO, no cluster operators are deployed → all 10 operators remain unavailable → zero nodes join → test times out after 45 minutes

Recommendations
  1. Do not merge this PR until openshift/cluster-version-operator #1352 is merged and included in a CI release payload. The CVO PR must land first to add the include.release.openshift.io/bootstrap-cluster-version-operator annotation to the relevant manifests.

  2. Alternatively, make the bootstrap script backward-compatible — if no manifests match the new annotation, fall back to the previous annotation (include.release.openshift.io/hypershift: "true") or skip the grep entirely so the init container can complete without matching manifests.
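The backward-compatible fallback suggested in recommendation 2 could look like this. A sketch only: the annotation strings come from the analysis above, but the surrounding script, the temporary directory, and the manifest file name are hypothetical.

```shell
#!/bin/sh
set -eu

# Stand-in payload containing only a legacy-annotated manifest, to
# simulate a release image that predates the new annotation.
PAYLOAD_DIR="$(mktemp -d)"
printf 'annotations:\n  include.release.openshift.io/hypershift: "true"\n' \
  > "${PAYLOAD_DIR}/legacy-manifest.yaml"

# Prefer the new, narrowly-scoped bootstrap annotation.
matches="$(grep -rl 'include.release.openshift.io/bootstrap-cluster-version-operator: .*hypershift' "${PAYLOAD_DIR}" || true)"
if [ -z "${matches}" ]; then
  # New annotation absent from this payload: fall back to the previous
  # annotation so the init container can still complete against older
  # release images instead of looping forever.
  matches="$(grep -rl 'include.release.openshift.io/hypershift: "true"' "${PAYLOAD_DIR}" || true)"
fi
echo "${matches}"
```

Against the simulated old payload, the first grep finds nothing and the fallback selects the legacy manifest, so the init container would have something to apply either way.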

  3. Fix commit messages to satisfy gitlint:

    • Prefix both commits with a conventional type (e.g., chore: for the regeneration commit, feat: for the annotation consumption commit)
    • Shorten the second commit title to ≤120 characters (e.g., feat(cvo): consume bootstrap-cluster-version-operator annotation)
    • Shorten or wrap the body URL references to ≤140 characters per line
  4. Consider coordinating the cross-repo dependency — file a dependency note on the CVO PR or use a ci-operator payload override to test against a payload that includes the CVO changes.

Evidence
  • Verify gitlint error: CT1 Title does not start with one of fix, feat, chore, docs, style, refactor, perf, test, revert, ci, build on both commits
  • Verify title length: commit bc26dbe16b title is 144 chars (max 120)
  • Verify body lines: lines 48 and 50 exceed the 140-char limit (175 and 160 chars respectively; long GitHub URLs)
  • E2E test failure: TestCreateCluster/ValidateHostedCluster failed after 2841.86s (47.4 min)
  • Node count: expected 2 nodes, got 0; zero nodes joined the hosted cluster
  • CVO pod status: pod cluster-version-operator-665b89cb58-2pf4j stuck in Pending phase with ContainersNotInitialized
  • Bootstrap init container: state running since 2026-03-18T23:55:40Z, ready: false, restartCount: 0; never completed
  • Bootstrap script: grep -rl 'include.release.openshift.io/bootstrap-cluster-version-operator: .*hypershift' /var/payload/manifests; annotation not in payload
  • Unavailable operators: all 10 operators unavailable: catalog-operator, cluster-network-operator, cluster-storage-operator, cluster-version-operator, csi-snapshot-controller-operator, dns-operator, hosted-cluster-config-operator, ingress-operator, olm-operator, packageserver
  • Missing dependency: openshift/cluster-version-operator PR #1352, still open with merge conflicts
  • CVO conditions: ClusterVersionSucceeding, ClusterVersionProgressing, ClusterVersionAvailable, and ClusterVersionReleaseAccepted all Unknown: StatusUnknown (Condition not found in the CVO.)


Labels

  • area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
