🐛 fix(Boxcutter): Fix ClusterExtension spec changes ignored while revision is rolling out#2566
🐛 fix(Boxcutter): Fix ClusterExtension spec changes ignored while revision is rolling out#2566camilamacedo86 wants to merge 1 commit intooperator-framework:mainfrom
Conversation
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
Pull request overview
Fixes operator-controller (Boxcutter runtime) behavior where a ClusterExtension spec change could be ignored while a prior ClusterExtensionRevision is still rolling out, by re-resolving from the catalog when the rolling-out revision no longer matches the current spec.
Changes:
- Update
ResolveBundleto reuse the rolling-out revision only when it still matches the current package/version constraints; otherwise re-resolve from the catalog. - Add
rollingOutRevisionMatchesSpechelper to compare rolling-out revision metadata against spec version constraints. - Add unit and E2E coverage for “spec change while rolling out” behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
internal/operator-controller/controllers/clusterextension_reconcile_steps.go |
Implements spec-vs-rolling-out compatibility check and re-resolve behavior. |
internal/operator-controller/controllers/clusterextension_controller_test.go |
Updates an existing Boxcutter test fixture and adds 2 new tests for re-resolve vs reuse behavior. |
test/e2e/features/update.feature |
Adds E2E scenario verifying version change during a stuck rollout triggers re-resolution. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Outdated
Show resolved
Hide resolved
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Outdated
Show resolved
Hide resolved
internal/operator-controller/controllers/clusterextension_controller_test.go
Outdated
Show resolved
Hide resolved
b31a1f4 to
d31cbc3
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2566 +/- ##
==========================================
+ Coverage 64.33% 68.64% +4.31%
==========================================
Files 131 131
Lines 9333 9380 +47
==========================================
+ Hits 6004 6439 +435
+ Misses 2853 2444 -409
- Partials 476 497 +21
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
pedjak
left a comment
There was a problem hiding this comment.
/lgtm
IMHO, the added e2e test could be made faster.
d31cbc3 to
ca77804
Compare
|
New changes are detected. LGTM label has been removed. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Fixes Boxcutter ClusterExtension reconciliation so spec updates (notably version changes) aren’t silently ignored when there are existing rolling-out revisions (e.g., rollout stuck on probe failure). This ensures the controller can re-resolve from the catalog to honor updated desired state.
Changes:
- Update
ResolveBundleto reuse a rolling-out revision only if it still matches the current spec; otherwise re-resolve from the catalog. - Add helper logic to compare a rolling-out revision’s package/version against the current catalog constraints.
- Add/adjust unit and E2E coverage for “spec changed while rolling out” behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
internal/operator-controller/controllers/clusterextension_reconcile_steps.go |
Conditional short-circuit: reuse rolling-out revision only when it matches spec; otherwise re-resolve. |
internal/operator-controller/controllers/clusterextension_controller_test.go |
Adds unit tests for re-resolve on mismatch and no re-resolve on match; adjusts an existing test setup. |
test/e2e/features/update.feature |
Adds E2E scenario covering version change while rollout is stuck. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Outdated
Show resolved
Hide resolved
392ae57 to
d952dcf
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates the ClusterExtension controller’s bundle-resolution behavior during active (and potentially stuck) rollouts so that it can recover instead of indefinitely reusing a rolling-out revision.
Changes:
- Add logic to conditionally reuse a rolling-out revision vs re-resolving, based on (a) ProgressDeadlineExceeded on the latest rolling-out revision and (b) whether any rolling-out revision still matches the requested package/version constraint.
- Add unit tests covering the new “reuse vs re-resolve” decision logic for rolling-out revisions.
- Add e2e scenarios intended to validate re-resolution on spec change and on progression deadline exceeded.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
internal/operator-controller/controllers/clusterextension_reconcile_steps.go |
Introduces reuseRollingOutRevision and helpers to decide whether to reuse an in-flight revision or call the resolver again. |
internal/operator-controller/controllers/clusterextension_controller_test.go |
Adds unit tests for re-resolution triggers and matching logic during rollouts. |
test/e2e/features/update.feature |
Adds e2e coverage for spec-change and (intended) progress-deadline-based re-resolution while a rollout is stuck. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Show resolved
Hide resolved
259bebc to
9c7214a
Compare
There was a problem hiding this comment.
Pull request overview
Fixes Boxcutter rollout behavior so that, during an active/stuck rollout, the controller can re-engage bundle resolution instead of always reusing the existing rolling revision—enabling recovery from stuck rollouts and honoring certain spec changes.
Changes:
- Update
ResolveBundlelogic to conditionally reuse rolling-out revisions vs re-resolve based on (a)ProgressDeadlineExceededand (b) whether a rolling revision matches the requested version constraint. - Add unit tests covering spec/version mismatch and progress-deadline-triggered re-resolution behaviors (including version ranges and multiple rolling revisions).
- Add E2E scenarios for version-change re-resolution during stuck rollout and a progress-deadline-related scenario.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
internal/operator-controller/controllers/clusterextension_reconcile_steps.go |
Introduces “reuse vs re-resolve” decision logic while revisions are rolling out, including deadline and version-constraint matching. |
internal/operator-controller/controllers/clusterextension_controller_test.go |
Adds extensive unit coverage for the new rolling-out reuse / re-resolve decision paths. |
test/e2e/features/update.feature |
Adds E2E scenarios intended to validate spec-change and progress-deadline triggers during stuck rollouts. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Show resolved
Hide resolved
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Show resolved
Hide resolved
Generated-by: Cursor/Claude
9c7214a to
7252d79
Compare
There was a problem hiding this comment.
Pull request overview
This PR fixes a Boxcutter rollout behavior where the ClusterExtension controller would indefinitely reuse a stuck rolling-out revision, ignoring spec changes and rollout progression deadlines. It introduces targeted re-resolution triggers so the controller can recover by re-engaging the resolver when appropriate.
Changes:
- Add rollout-time decision logic to either reuse an existing rolling-out revision or re-resolve when (a) ProgressDeadlineExceeded is observed or (b) no rolling-out revision matches the current spec version constraint.
- Update resolution-error handling to use semver range satisfaction (not string equality) when deciding whether fallback to the installed revision is allowed.
- Add unit and e2e coverage for spec-mismatch and progress-deadline re-resolution behaviors.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
test/e2e/features/update.feature |
Adds e2e scenarios for re-resolution during stuck rollouts (spec-change trigger + deadline-related scenario). |
internal/operator-controller/controllers/clusterextension_reconcile_steps.go |
Implements reuseRollingOutRevision and semver-range matching for spec compatibility; updates resolution error decision logic accordingly. |
internal/operator-controller/controllers/clusterextension_controller_test.go |
Adds unit tests covering stuck-rollout re-resolution decision paths and updates an existing Boxcutter failure test setup. |
Comments suppressed due to low confidence (1)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go:353
- In
handleResolutionError, this branch now triggers whenever the installed version does not satisfy the spec’s version constraint (including invalid/unparseable constraints), but the log/error text still says “spec requests version change”. Consider updating the log message (and possibly the formatted status message) to reflect “spec version constraint not satisfied by installed bundle” so it’s accurate for range constraints and parse failures, not only explicit version changes.
// Check if the spec is requesting a version that differs from what's installed.
// Uses the same semver range matching as reuseRollingOutRevision so that ranges
// like ">=1.0.0, <2.0.0" are correctly recognized as satisfied by "1.0.0".
if !versionMatchesSpec(state.revisionStates.Installed.Version, ext) {
specVersion := ""
if ext.Spec.Source.Catalog != nil {
specVersion = ext.Spec.Source.Catalog.Version
}
installedVersion := state.revisionStates.Installed.Version
msg := fmt.Sprintf("unable to upgrade to version %s: %v (currently installed: %s)", specVersion, err, installedVersion)
l.Error(err, "resolution failed and spec requests version change - cannot fall back",
"requestedVersion", specVersion,
"installedVersion", installedVersion)
setStatusProgressing(ext, err)
setInstalledStatusFromRevisionStates(ext, state.revisionStates)
ensureFailureConditionsWithReason(ext, ocv1.ReasonRetrying, msg)
return nil, err
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| version: 1.0.2 | ||
| upgradeConstraintPolicy: SelfCertified | ||
| """ | ||
| And ClusterExtensionRevision "${NAME}-1" reports Available as False with Reason ProbeFailure | ||
| When ClusterExtension is updated to version "1.0.1" | ||
| Then ClusterExtension is rolled out | ||
| And ClusterExtension is available | ||
| And bundle "test-operator.1.0.1" is installed in version "1.0.1" |
There was a problem hiding this comment.
This is a potential downgrade from 1.0.2 to 1.0.1. What if the deployment for 1.0.2 included an initContainer that performed a data migration, which succeeded, but then the container that runs the 1.0.2 controller fails the probe because of a silly typo the user made in an envvar in their deploymentConfig?
We'd downgrade to 1.0.1, which would be unable to read the migrated data, and maybe 1.0.1 would make matters worse.
If we instead required human intervention, a human would find the log in the failing deployment, see it complaining about the envvar typo, and then go and fix it in the CE.spec.config.deploymentConfig. And then 1.0.2 would progress cleanly.
| func rollingOutRevisionMatchesSpec(rm *RevisionMetadata, ext *ocv1.ClusterExtension) bool { | ||
| if ext.Spec.Source.Catalog == nil { | ||
| return false | ||
| } | ||
| if rm.Package != ext.Spec.Source.Catalog.PackageName { | ||
| return false | ||
| } | ||
| return versionMatchesSpec(rm.Version, ext) | ||
| } |
There was a problem hiding this comment.
To me, this check feels way too implicit and unintuitive. If I understand it correctly, it requires that I specifically craft an updated version range to exclude all of the versions of the active releases?
But if I expand the range to include a new version, no dice.
|
/hold I'm concerned that this could cause unsafe roll-forward or roll-back without the explicit acknowledgement from the user that forcing a re-resolution is an unsafe thing to do. IMO, default behavior needs to remain "wait for human intervention". |
What's wrong
When a revision gets stuck rolling out (e.g., probe failure), the controller ignores everything — spec changes, progression deadlines — and keeps reusing the stuck revision forever. The admin has no way to recover without manually deleting revisions.
What this PR does
Adds two triggers that tell the controller to call the resolver again during a stuck rollout:
Progression deadline exceeded — if the revision controller marks a revision as ProgressDeadlineExceeded, re-resolve automatically. This works even if the admin hasn't changed the spec.
Spec change — if the CE consumer changes the requested version and no rolling-out revision matches the new spec, re-resolve immediately.
If neither trigger fires, the controller keeps using the current revision (same as today).
Decision Tree
Today:
Rolling out? → always reuse, never re-resolve
After:
Rolling out?
→ Deadline exceeded? → YES → re-resolve
→ Revision matches spec? → YES → reuse
→ No match → re-resolve
What's tested
Motivated by: https://redhat.atlassian.net/browse/OCPBUGS-77829