-
Notifications
You must be signed in to change notification settings - Fork 70
✨ (feat): When using Boxcutter feature-gate, use ClusterExtension ServiceAccount for revision operations #2429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
✨ (feat): When using Boxcutter feature-gate, use ClusterExtension ServiceAccount for revision operations #2429
Conversation
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
f7ebc0c to
2c82746
Compare
2c82746 to
7bcd536
Compare
7bcd536 to
d7760da
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/revision_engine_factory_test.go
Outdated
Show resolved
Hide resolved
d7760da to
586787b
Compare
586787b to
b6ad85e
Compare
| # for roles it doesn't have and grant permissions beyond its own. This is required | ||
| # because extension bundles contain their own RBAC that must be created. | ||
| # See docs/concepts/permission-model.md for details on these requirements. | ||
| verbs: [ update, create, list, watch, get, delete, patch, bind, escalate ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@perdasilva that proves it is working :-)
The permissions were missing, and e2e tests were failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we expanding the olm controller's rbac scope here? I think we should be expanding the rbac scope for the service account in .spec.serviceAccount for the CE. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the docs, we do need those permissions. When we started using the SA with Buxcutter, we began hitting issues in the experimental-e2e tests: the install gets stuck because the SA is missing required permissions. See run: https://github.com/operator-framework/operator-controller/actions/runs/20789162910/job/59706814076
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble identifying the root cause from the error messages. It just says the condition hasn't been met. My expectation would be that the only thing missing from the install service account as its modeled in main is the ability to update clusterextensionrevisions/finalizers. In principle all other permissions should be the same, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
“OLMv1 operator preflight checks should report error when {escalate, bind} is not specified”
Essentially, without bind/escalate in the ClusterRole, the controller could not proceed, causing the operator to “fail to start” and tests to timeout waiting for it to become ready.
We need that to create ClusterRoles and ClusterRoleBindings defined in that Operator’s bundle so we need that.
Now, we are using the SA permissions and without the required permissions in the SA the install will not work out. The missing permission was not obtained before because we are not using the SA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need that to create ClusterRoles and ClusterRoleBindings defined in that Operator’s bundle so we need that.
This should be true whether we use Boxcutter or not, right? Why does this test pass with the Helm applier but not with Boxcutter + scoped client? Where is the need for bind and escalate coming from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty drastic increase in permission scope. I'm just trying to understand why it wasn't needed before, but now it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After many tests and long deep analyse, and help from AI ;-) to speed up the reason for we need is:
(Adding content generated with CLAUDE to help out)
THE REAL DIFFERENCE: Server-Side Apply!
Helm Applier (Main Branch)
- Uses Helm library (helm.sh/helm/v3)
- Helm library uses traditional CREATE/UPDATE operations
- Less strict RBAC enforcement
- Permission matching works → PASSES ✅
Boxcutter Applier (PR Branch)
- Uses pkg.package-operator.run/boxcutter machinery
- Machinery uses Server-Side Apply (SSA) with field ownership
- STRICTER RBAC enforcement with SSA!
- SSA requires explicit
bindverb → FAILS ❌
Code Evidence
Boxcutter Machinery (objects.go)
ctx, desiredObject, client.Apply, // ← Server-Side Apply!
options,
client.ForceOwnership, // ← Field ownership enforcementWhy SSA Is Stricter
From Kubernetes documentation:
Server-Side Apply enforces stricter field ownership and permissions.
When using SSA to create RBAC resources like ClusterRoleBindings,
the user MUST have the explicitbindverb, even if they already
have all the permissions being granted.
Traditional Create/Update:
- Checks: "Does user have all the permissions?" → YES → Allow
Server-Side Apply:
- Checks: "Does user have
bindverb for field ownership?" → NO → Deny - Permission matching fallback doesn't work with SSA
Complete Comparison Table
| Aspect | Helm (Main) | Boxcutter (PR) |
|---|---|---|
| Client Type | SA-scoped ✅ | SA-scoped ✅ |
| Apply Method | Helm lib (Create/Update) | SSA (client.Apply) |
| RBAC Enforcement | Traditional (lenient) | SSA (strict) |
| bind verb required? | ❌ No (perms match) | ✅ YES (SSA strict) |
| Test result | ✅ PASS | ❌ FAIL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/revision_engine_factory.go
Outdated
Show resolved
Hide resolved
3d14643 to
76b836b
Compare
| - "get" | ||
| - "list" | ||
| - "watch" | ||
| EOF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal/operator-controller/controllers/revision_engine_factory.go
Outdated
Show resolved
Hide resolved
internal/operator-controller/controllers/clusterextensionrevision_controller.go
Outdated
Show resolved
Hide resolved
internal/operator-controller/controllers/revision_engine_factory.go
Outdated
Show resolved
Hide resolved
| return f.createScopedClient(saNamespace, saName) | ||
| } | ||
|
|
||
| func (f *DefaultRevisionEngineFactory) createScopedClient(namespace, serviceAccountName string) (client.Client, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should cache the clients, i.e. avoid returning a new instance on each createScopeClient invocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets disccuss in: #2429 (comment)
|
|
||
| // CreateRevisionEngine constructs a boxcutter RevisionEngine for the given ClusterExtensionRevision. | ||
| // It reads the ServiceAccount from annotations and creates a scoped client. | ||
| func (f *DefaultRevisionEngineFactory) CreateRevisionEngine(_ context.Context, rev *ocv1.ClusterExtensionRevision) (RevisionEngine, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should cache RevisionEngines, i.e. return the same one for a given ClusterExtensionRevision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not against going down this path, but since RevisionEngine doesn't really hold state I'm curious to know what's behind the suggestion. Making the resource profile less choppy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, to lower the number of created resources. With this, we are going to create new revision engine instance at each reconcile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, caching should be done in a separate PR/JIRA/task so it can get proper focus. We can track it as tech debt and (for now) I wouldn’t prioritize it until we meet the main goals first—things may still change a lot.
It’s also not as simple as it first looks:
-
CER should be independent from CE
ClusterExtensionRevision(CER) should work without depending onClusterExtension(CE).
If we cache by ServiceAccount, the same key can be shared by multiple CEs and their revisions, which makes cleanup harder. -
Cleanup is tricky
When do we evict cache entries?- on CE deletion?
- on CER deletion?
- on ServiceAccount deletion?
Since multiple CEs can share the same ServiceAccount, evicting on CE deletion could break other CEs.
We’d likely need reference counting or a smarter eviction strategy. -
Risk of unbounded growth
Without eviction, the cache could grow indefinitely in clusters with many unique ServiceAccounts, which could cause memory issues. -
Scope creep
Adding caching plus a safe eviction/lifecycle policy would significantly expand the scope of this PR.
For now: we create a new RevisionEngine per reconcile. This is safe and correct (just less optimal).
WDYT? Could we agree to revisit caching as an optimisation after we land the other changes (e.g., once SA is optional, etc.)?
My thought is: let’s keep this PR focused on the core behaviour first, and track caching as tech debt / a follow-up task once the design stabilises.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with couching it under a premature optimization. It's still not super clear how much and what we gain by caching or what the memory or latency footprint of the RE is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have similar feeling on the client side. It could potentially make sense to keep the token in memory only for as long as its needed. On the other hand, we'd be hitting the kube-api server for a new token every time (though maybe this is obviated by switching to impersonate, eventually). In any case, I think it's ok to address this as a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/revision_engine_factory.go
Outdated
Show resolved
Hide resolved
internal/operator-controller/controllers/revision_engine_factory.go
Outdated
Show resolved
Hide resolved
internal/operator-controller/controllers/revision_engine_factory.go
Outdated
Show resolved
Hide resolved
6321b79 to
000cb09
Compare
000cb09 to
55e51dc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Hi @pedjak and @perdasilva Could we move with this one? Now we are restricting the permissions for OLMv1 based on the SA provided for Boxcutter as we do for Helm applier. |
…operations Implement serviceAccount-scoped token-based authentication for ClusterExtensionRevision controller using annotation-based configuration. - Add RevisionEngineFactory with CreateRevisionEngine(ctx, rev) interface - Read ServiceAccount from annotations (no ClusterExtension dependency) - Token-based auth using TokenInjectingRoundTripper - ServiceAccount name and namespace in annotations for observability - TrackingCache uses global client for caching/cleanup - Comprehensive error path tests ClusterExtensionRevision can exist independently. Easy mode impersonation deferred until API is finalized. Assisted-by: Cursor
55e51dc to
6b03fe3
Compare
Adds documentation comments to all label/annotation constants explaining: - What each constant represents - Where they are applied (labels vs annotations) - ServiceAccount constants document their relationship to ClusterExtension spec Addresses code review feedback for improved maintainability.
The upgrade test ServiceAccount needs permissions to manage ClusterExtensionRevisions when BoxcutterRuntime is enabled. Without these permissions, the upgraded controller cannot create or update ClusterExtensionRevision resources, causing the ClusterExtension to fail reconciliation after upgrade.
6b03fe3 to
a9338e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add `bind` and `escalate` RBAC verbs to e2e test ServiceAccount to support Boxcutter applier's use of Kubernetes Server-Side Apply (SSA). Experimental e2e tests fail when Boxcutter uses ServiceAccount-scoped clients to apply bundle RBAC resources (ClusterRoles and ClusterRoleBindings): ``` clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:olmv1-e2e:olm-sa" cannot bind ClusterRole: RBAC: attempting to grant RBAC permissions not currently held ``` - Uses helm.sh/helm/v3 library - Applies resources with traditional CREATE/UPDATE operations - Kubernetes RBAC allows ClusterRoleBinding creation when the ServiceAccount already has all the permissions being granted (permission matching) - **Works WITHOUT `bind`/`escalate` verbs** ✅ - Uses pkg.package-operator.run/boxcutter machinery - Applies resources with **Server-Side Apply (SSA)** (`client.Apply`) - SSA enforces field-level ownership and **stricter RBAC enforcement** - Kubernetes API server **requires explicit `bind` verb** for ClusterRoleBindings - Permission matching fallback does NOT work reliably with SSA - **REQUIRES `bind`/`escalate` verbs** ❌ Validated by running actual tests: **Test 1: Main branch standard-e2e (Helm, NO bind/escalate)** ```bash make test-e2e ``` Result: ✅ PASS (21 scenarios passed) **Test 2: PR branch experimental-e2e (Boxcutter, NO bind/escalate)** ```bash make test-experimental-e2e ``` Result: ❌ FAIL (cannot bind ClusterRole error) **Test 3: PR branch experimental-e2e (Boxcutter, WITH bind/escalate)** Result: ✅ PASS (all tests pass) Add `bind` and `escalate` verbs to the e2e test RBAC template: ```yaml - apiGroups: ["rbac.authorization.k8s.io"] resources: [clusterroles, roles, clusterrolebindings, rolebindings] verbs: [ update, create, list, watch, get, delete, patch, bind, escalate ] ``` These verbs allow the ServiceAccount to: - `bind`: Create ClusterRoleBindings that reference roles with permissions the ServiceAccount doesn't have - `escalate`: Create ClusterRoles with permissions the ServiceAccount doesn't have This is the documented requirement in `docs/concepts/permission-model.md` for extension installers and aligns with Kubernetes RBAC best practices. 1. **Required for SSA**: Server-Side Apply has stricter RBAC enforcement 2. **Documented requirement**: OLMv1 docs specify bind/escalate as proper approach 3. **Industry best practice**: Operator installers should have these verbs 4. **Supports all operators**: Not just test-operator with matching permissions 5. **Maintains SSA benefits**: Field ownership, conflict resolution, GitOps support - Kubernetes RBAC: https://kubernetes.io/docs/reference/access-authn-authz/rbac/#privilege-escalation-prevention-and-bootstrapping - OLMv1 Permission Model: docs/concepts/permission-model.md - Boxcutter machinery: pkg.package-operator.run/boxcutter/machinery (uses client.Apply) - Testing evidence: FINAL_TESTED_ANSWER.md, SERVER_SIDE_APPLY_ANSWER.md Tested-by: Actual e2e test runs comparing Helm vs Boxcutter behavior Signed-off-by: Camila <camil@example.com>
What's New
When BoxcutterRuntime is enabled, OLM v1 now correctly uses the ServiceAccount you specify in your ClusterExtension to install and manage extensions. This ensures your extensions are installed with only the permissions you grant, not admin privileges.
What This Means for You
More Control:
Better Security:
Example
What happens: OLM installs the extension using
my-installerServiceAccount. The extension can only do what you've granted that ServiceAccount permission to do.