USHIFT-6401: Patch unbounded KAS context to break pre-hook deadlock#6635
copejon wants to merge 1 commit into
@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.
Skipping CI for Draft Pull Request.
Walkthrough

RBAC bootstrap context threading propagates the post-start hook context through role initialization.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 11 passed | ❌ 1 failed (1 warning)
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: copejon
🧹 Nitpick comments (1)
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go (1)
166-173: ⚡ Quick win
`wait.Poll` loop is not context-aware; cancellation won't short-circuit it.

The hook context is now correctly threaded into the inner function, so individual API calls will fail fast when the context is cancelled. However, `wait.Poll` itself has no awareness of the context; if the context is cancelled mid-poll-interval, the loop continues blocking for up to 30 more seconds before the next iteration observes the error. Replacing it with `wait.PollWithContext` (or `wait.PollUntilContextTimeout`) fully honors the shutdown signal.

♻️ Proposed refactor

-	err := wait.Poll(1*time.Second, 30*time.Second, func() (done bool, err error) {
+	err := wait.PollUntilContextTimeout(hookContext.Context, 1*time.Second, 30*time.Second, true, func(ctx context.Context) (done bool, err error) {
 		client, err := clientset.NewForConfig(hookContext.LoopbackClientConfig)
 		if err != nil {
 			utilruntime.HandleError(fmt.Errorf("unable to initialize client set: %v", err))
 			return false, nil
 		}
-		return ensureRBACPolicy(hookContext, p, client)
+		return ensureRBACPolicy(ctx, p, client)
 	})

Note: adjust `hookContext.Context` to `hookContext` if `PostStartHookContext` embeds `context.Context`.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go` around lines 166 - 173, The wait.Poll call in the RBAC setup loop is not context-aware and can block after cancellation; replace the wait.Poll invocation in storage_rbac.go with a context-aware variant (e.g., wait.PollWithContext or wait.PollUntilContextTimeout) so the loop short-circuits on hookContext cancellation; pass the hookContext (or hookContext.Context if PostStartHookContext embeds context.Context) as the context argument and keep the same polling interval and timeout while preserving the existing ensureRBACPolicy(hookContext, p, client) call and error handling.
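To make the difference concrete, here is a minimal standard-library sketch of what a context-aware poll loop does. `pollUntilContext` and `cancelledEarly` are hypothetical names used only for illustration; this is not the implementation in the `wait` package, just the general shape of the technique: the loop selects on `ctx.Done()` between attempts, so a cancellation is noticed in the middle of an interval.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollUntilContext is a hypothetical, stdlib-only stand-in for a
// context-aware poll helper. It runs condition immediately and then once per
// interval until condition reports done, returns an error, or ctx ends
// (either through cancellation of the parent or through the timeout).
func pollUntilContext(
	ctx context.Context,
	interval, timeout time.Duration,
	condition func(context.Context) (bool, error),
) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		done, err := condition(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			// Cancellation is observed here, while waiting for the next tick.
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// cancelledEarly cancels the parent context after 20ms and reports whether
// the poll (1s interval, 30s timeout) returned promptly with context.Canceled.
func cancelledEarly() bool {
	parent, cancel := context.WithCancel(context.Background())
	time.AfterFunc(20*time.Millisecond, cancel)

	start := time.Now()
	err := pollUntilContext(parent, time.Second, 30*time.Second,
		func(context.Context) (bool, error) { return false, nil })

	return errors.Is(err, context.Canceled) && time.Since(start) < 500*time.Millisecond
}

func main() {
	fmt.Println("returned early on cancel:", cancelledEarly())
}
```

A loop that only sleeps between attempts and never selects on the context would keep retrying until its own timeout, which is the behavior the comment above describes.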
⛔ Files ignored due to path filters (1)
`vendor/k8s.io/kubernetes/pkg/registry/rbac/rest/storage_rbac.go` is excluded by `!**/vendor/**`, `!vendor/**`
📒 Files selected for processing (2)
`deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go`
`deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go`
Force-pushed from d255cf8 to b220d71
…ootstrap

Thread context.Context through ensureRBACPolicy, primeAggregatedClusterRoles, and primeSplitClusterRoleBindings so that RBAC bootstrap API calls respect the post-start hook's cancellation signal instead of hanging indefinitely on context.TODO().

Includes carry patch (0040) so the fix survives future rebases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
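As a rough illustration of why threading the context matters, the sketch below contrasts a helper called with the caller's cancelable context against the same helper called with `context.TODO()`. `slowAPICall` and `primeRoles` are hypothetical stand-ins, not the functions touched by this commit, and the 200ms delay is an arbitrary assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowAPICall is a hypothetical stand-in for a client request: it returns nil
// once the simulated server responds, or ctx.Err() if ctx ends first.
func slowAPICall(ctx context.Context, respondAfter time.Duration) error {
	select {
	case <-time.After(respondAfter):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// primeRoles is a hypothetical helper that accepts the caller's context
// instead of creating context.TODO() internally.
func primeRoles(ctx context.Context) error {
	return slowAPICall(ctx, 200*time.Millisecond)
}

// threadedStopsOnCancel passes the cancelable context down, so the call
// returns context.Canceled shortly after cancel fires.
func threadedStopsOnCancel() bool {
	hookCtx, cancel := context.WithCancel(context.Background())
	time.AfterFunc(10*time.Millisecond, cancel)
	return errors.Is(primeRoles(hookCtx), context.Canceled)
}

// todoIgnoresCancel cancels the same kind of context but hands the helper
// context.TODO(), so the call never sees the signal and waits for the server.
func todoIgnoresCancel() bool {
	_, cancel := context.WithCancel(context.Background())
	time.AfterFunc(10*time.Millisecond, cancel)
	return primeRoles(context.TODO()) == nil
}

func main() {
	fmt.Println("threaded context stops on cancel:", threadedStopsOnCancel())
	fmt.Println("context.TODO() ignores cancel:", todoIgnoresCancel())
}
```

In the sketch the simulated server always answers eventually; against a server that never answers, the `context.TODO()` variant would block with nothing to interrupt it.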
Force-pushed from b220d71 to 1d08a19
@copejon: all tests passed!
 		return false, nil
 	}
-	return ensureRBACPolicy(p, client)
+	return ensureRBACPolicy(hookContext, p, client)
I'm wondering if that context has a deadline? Or is it just a context for signaling shutdown?
Because if it's the latter, it improves nothing - it won't expire on its own to break the deadlock.
Maybe we need to wrap the context with .WithTimeout() based on wait.Poll() params as well? Or just use the 15 seconds from previous PR.
Because if I'm correct and hookContext does not time out, then we gained nothing.
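The distinction raised here can be checked with `Deadline()`, and the suggested wrapping can be sketched with the standard library alone. Everything below is illustrative: `hasDeadline`, `boundedCall`, `blockUntilDone`, and `demo` are hypothetical names, and the 50ms timeout is an arbitrary value, not the 15 seconds or the `wait.Poll()` parameters mentioned above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// hasDeadline reports whether ctx will expire on its own.
func hasDeadline(ctx context.Context) bool {
	_, ok := ctx.Deadline()
	return ok
}

// boundedCall runs call under a timeout derived from parent, so the call ends
// when either the parent is cancelled or the timeout elapses.
func boundedCall(parent context.Context, timeout time.Duration, call func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return call(ctx)
}

// blockUntilDone simulates a request that never completes by itself.
func blockUntilDone(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// demo builds a context that only signals shutdown (cancelable, no deadline)
// and reports whether it has a deadline, and whether wrapping it with a
// timeout made the blocked call end with context.DeadlineExceeded.
func demo() (shutdownOnlyHasDeadline, wrappedTimedOut bool) {
	shutdownOnly, cancel := context.WithCancel(context.Background())
	defer cancel()

	err := boundedCall(shutdownOnly, 50*time.Millisecond, blockUntilDone)
	return hasDeadline(shutdownOnly), errors.Is(err, context.DeadlineExceeded)
}

func main() {
	noDeadline, timedOut := demo()
	fmt.Println("shutdown-only context has a deadline:", noDeadline)
	fmt.Println("wrapped call timed out:", timedOut)
}
```

Whether the real hook context carries a deadline is exactly the open question in this comment; the sketch only shows how a cancel-only context behaves and what a `WithTimeout` wrapper adds.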
Replace `context.TODO()` with the hook's cancelable context in the RBAC bootstrap post-start hook helpers (`primeAggregatedClusterRoles`, `primeSplitClusterRoleBindings`)