USHIFT-6401: Patch unbounded KAS context to break pre-hook deadlock #6635

Open
copejon wants to merge 1 commit into openshift:main from copejon:fix-USHIFT-6401-alt-fix

Conversation

@copejon
Contributor

@copejon copejon commented May 7, 2026

Replace context.TODO() with the hook's cancelable context in the RBAC bootstrap post-start hook helpers (primeAggregatedClusterRoles, primeSplitClusterRoleBindings)

Summary by CodeRabbit

  • Chores
    • Improved RBAC policy initialization to propagate operation context throughout post-start reconciliation, ensuring readiness checks and role/role-binding priming use the provided context for API calls.
    • No user-facing behavior changes; reconciliation semantics remain unchanged.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 7, 2026
@openshift-ci-robot

openshift-ci-robot commented May 7, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

Details

In response to this:

Replace context.TODO() with the hook's cancelable context in the RBAC bootstrap post-start hook helpers (primeAggregatedClusterRoles, primeSplitClusterRoleBindings)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3aea76a0-b7ec-4c4c-99f9-2b2adaf577af

📥 Commits

Reviewing files that changed from the base of the PR and between b220d71 and 1d08a19.

⛔ Files ignored due to path filters (1)
  • vendor/k8s.io/kubernetes/pkg/registry/rbac/rest/storage_rbac.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (2)
  • deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go
  • scripts/auto-rebase/rebase_patches/0040-rbac-bootstrap-hook-context-threading.patch
🚧 Files skipped from review as they are similar to previous changes (2)
  • scripts/auto-rebase/rebase_patches/0040-rbac-bootstrap-hook-context-threading.patch
  • deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go

Walkthrough

RBAC bootstrap context threading propagates the post-start hook context through role initialization. EnsureRBACPolicy() passes hookContext to ensureRBACPolicy(), which now accepts and uses context for all ClusterRole and ClusterRoleBinding API operations, replacing prior context.TODO() usage. Helper functions receive and propagate the same context.
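
The effect of threading the caller's context into a helper, rather than creating context.TODO() inside it, can be illustrated with a small standalone sketch. It does not use the Kubernetes client; listRoles, primeRoles, and cancelledCallReturns are hypothetical names, and the never-closed channel only simulates an API call that gets no response.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// listRoles is a hypothetical stand-in for a blocking API call such as a
// ClusterRole List. It blocks until the backend answers or ctx is cancelled.
func listRoles(ctx context.Context, backend <-chan struct{}) error {
	select {
	case <-backend:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// primeRoles mirrors the shape of a helper that receives the caller's
// context instead of creating context.TODO() internally.
func primeRoles(ctx context.Context, backend <-chan struct{}) error {
	return listRoles(ctx, backend)
}

// cancelledCallReturns reports whether a call against a backend that never
// responds returns with context.Canceled once the caller cancels.
func cancelledCallReturns() bool {
	ctx, cancel := context.WithCancel(context.Background())
	backend := make(chan struct{}) // never closed: simulates a stuck API server
	done := make(chan error, 1)
	go func() { done <- primeRoles(ctx, backend) }()

	cancel() // the caller signals shutdown

	select {
	case err := <-done:
		return errors.Is(err, context.Canceled)
	case <-time.After(2 * time.Second):
		return false // the call ignored cancellation
	}
}

func main() {
	fmt.Println("returned on cancel:", cancelledCallReturns())
}
```

Had primeRoles passed context.TODO() to listRoles instead, the call in this sketch would never observe the caller's cancellation and the two-second guard would fire.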

Changes

RBAC bootstrap context threading

| Layer / File(s) | Summary |
| --- | --- |
| Entry point context threading: deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go, scripts/auto-rebase/rebase_patches/0040-rbac-bootstrap-hook-context-threading.patch | EnsureRBACPolicy() now passes hookContext to ensureRBACPolicy() as the first parameter. |
| Core reconciliation context propagation: deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go, scripts/auto-rebase/rebase_patches/0040-rbac-bootstrap-hook-context-threading.patch | ensureRBACPolicy() accepts context.Context and uses it for ClusterRole/ClusterRoleBinding list operations during etcd readiness checks before priming aggregated and split roles. |
| Helper functions context propagation: deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go, scripts/auto-rebase/rebase_patches/0040-rbac-bootstrap-hook-context-threading.patch | primeAggregatedClusterRoles and primeSplitClusterRoleBindings accept and use the provided context for all ClusterRole and ClusterRoleBinding Get/Create calls. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: replacing context.TODO() with the hook's context in RBAC bootstrap helpers to fix a deadlock issue. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR does not introduce or modify any Ginkgo test names. Changes are limited to RBAC bootstrap context propagation in production code, not test files. |
| Test Structure And Quality | ✅ Passed | PR contains only production code changes (RBAC bootstrap refactoring), no Ginkgo test files to review against test quality criteria. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. Changes are only to RBAC bootstrap logic in storage_rbac.go and a patch file, not to test code. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR adds no Ginkgo e2e tests. Changes are limited to RBAC bootstrap code and patch file for context propagation fixes. Check not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only internal RBAC bootstrap code for context propagation. No deployment manifests, operators, or scheduling constraints are added. |
| Ote Binary Stdout Contract | ✅ Passed | PR makes no stdout modifications; purely context threading in vendored Kubernetes RBAC code. No fmt.Print/klog calls or logging config changes added. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR makes no changes to Ginkgo e2e tests. It only modifies RBAC bootstrap code to propagate context, fixing deadlock issues on edge devices. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=warning msg="The linter 'gomodguard' is deprecated (since v2.12.0) due to: new major version. Replaced by gomodguard_v2."
level=warning msg="Suggested new configuration:\nlinters:\n enable:\n - gomodguard_v2\n"
level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/apparentlymart/go-cidr@v1.1.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/go-systemd@v0.0.0-20190321100706-95778dfbb74e: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/google/go-cmp@v0.7.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/miekg/dns@v1.1.63: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/openshift/api@v0.0.0-20260408092441-8b086e6b9eb9: is

... [truncated 31032 characters] ...

elet: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/metrics: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/mount-utils: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/pod-security-admission: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-apiserver: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-cli-plugin: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-controller: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: copejon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go (1)

166-173: ⚡ Quick win

wait.Poll loop is not context-aware — cancellation won't short-circuit it.

The hook context is now correctly threaded into the inner function, so individual API calls will fail fast when the context is cancelled. However, wait.Poll itself has no awareness of the context; if the context is cancelled mid-poll-interval, the loop continues blocking for up to 30 more seconds before the next iteration observes the error. Replacing it with wait.PollWithContext (or wait.PollUntilContextTimeout) fully honors the shutdown signal.

♻️ Proposed refactor
-		err := wait.Poll(1*time.Second, 30*time.Second, func() (done bool, err error) {
+		err := wait.PollUntilContextTimeout(hookContext.Context, 1*time.Second, 30*time.Second, true, func(ctx context.Context) (done bool, err error) {
 			client, err := clientset.NewForConfig(hookContext.LoopbackClientConfig)
 			if err != nil {
 				utilruntime.HandleError(fmt.Errorf("unable to initialize client set: %v", err))
 				return false, nil
 			}
-			return ensureRBACPolicy(hookContext, p, client)
+			return ensureRBACPolicy(ctx, p, client)
 		})

Note: adjust hookContext.Context to hookContext if PostStartHookContext embeds context.Context.
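
The behavior this suggestion is after can be sketched with the standard library alone. The helper below, pollUntil, is a hypothetical and simplified stand-in for a context-aware poll such as wait.PollUntilContextTimeout; it is not the apimachinery implementation, and the intervals are shortened so the example finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollUntil is a hypothetical, simplified context-aware poll helper. It runs
// cond immediately and then once per interval until cond reports done, cond
// returns an error, the parent context is cancelled, or timeout elapses.
func pollUntil(parent context.Context, interval, timeout time.Duration,
	cond func(context.Context) (bool, error)) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		done, err := cond(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			// Cancellation or timeout ends the loop without waiting
			// for the remaining poll budget.
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// stopsEarlyOnCancel cancels the parent context shortly after polling starts
// and reports whether the loop returned well before its own timeout.
func stopsEarlyOnCancel() bool {
	parent, cancel := context.WithCancel(context.Background())
	time.AfterFunc(50*time.Millisecond, cancel)

	start := time.Now()
	err := pollUntil(parent, 10*time.Millisecond, 5*time.Second,
		func(context.Context) (bool, error) {
			return false, nil // never ready, like a stuck API server
		})
	return err == context.Canceled && time.Since(start) < time.Second
}

func main() {
	fmt.Println("stopped early on cancel:", stopsEarlyOnCancel())
}
```

The condition function receives the derived context, which matches the proposed change of passing ctx rather than hookContext into ensureRBACPolicy.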

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go`
around lines 166 - 173, The wait.Poll call in the RBAC setup loop is not
context-aware and can block after cancellation; replace the wait.Poll invocation
in storage_rbac.go with a context-aware variant (e.g., wait.PollWithContext or
wait.PollUntilContextTimeout) so the loop short-circuits on hookContext
cancellation; pass the hookContext (or hookContext.Context if
PostStartHookContext embeds context.Context) as the context argument and keep
the same polling interval and timeout while preserving the existing
ensureRBACPolicy(hookContext, p, client) call and error handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9f88f2a4-91ab-409f-bd6c-6d75d87351ef

📥 Commits

Reviewing files that changed from the base of the PR and between e98bbde and d255cf8.

⛔ Files ignored due to path filters (1)
  • vendor/k8s.io/kubernetes/pkg/registry/rbac/rest/storage_rbac.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (2)
  • deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go
  • deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go

@copejon copejon force-pushed the fix-USHIFT-6401-alt-fix branch from d255cf8 to b220d71 Compare May 19, 2026 22:04
@copejon copejon marked this pull request as ready for review May 19, 2026 22:06
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 19, 2026
@openshift-ci openshift-ci Bot requested review from pacevedom and vanhalenar May 19, 2026 22:06
…ootstrap

Thread context.Context through ensureRBACPolicy, primeAggregatedClusterRoles,
and primeSplitClusterRoleBindings so that RBAC bootstrap API calls respect the
post-start hook's cancellation signal instead of hanging indefinitely on
context.TODO().

Includes carry patch (0040) so the fix survives future rebases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@copejon copejon force-pushed the fix-USHIFT-6401-alt-fix branch from b220d71 to 1d08a19 Compare May 19, 2026 22:47
@openshift-ci
Contributor

openshift-ci Bot commented May 20, 2026

@copejon: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

 				return false, nil
 			}
-			return ensureRBACPolicy(p, client)
+			return ensureRBACPolicy(hookContext, p, client)
Member

Does that context have a deadline, or is it only a context for signaling shutdown?
If it's the latter, this improves nothing: the context won't expire on its own to break the deadlock.

Maybe we also need to wrap the context with context.WithTimeout(), based on the wait.Poll() params? Or just use the 15 seconds from the previous PR.
If I'm correct and hookContext does not time out, then we gained nothing.
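
The distinction raised here can be checked in isolation. The sketch below is standalone and makes no claim about what hookContext actually carries; it only shows, under that stated assumption, that a cancel-only context reports no deadline while the same kind of context wrapped with context.WithTimeout does and expires by itself. The function names are hypothetical, and the 30-second value is chosen to mirror the wait.Poll timeout.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// hasDeadline reports whether ctx will expire on its own.
func hasDeadline(ctx context.Context) bool {
	_, ok := ctx.Deadline()
	return ok
}

// shutdownOnlyHasDeadline builds a cancel-only context, a stand-in for a
// pure shutdown signal, and reports whether it carries a deadline.
func shutdownOnlyHasDeadline() bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	return hasDeadline(ctx)
}

// boundedHasDeadline wraps a cancel-only parent with a timeout and reports
// whether the derived context carries a deadline.
func boundedHasDeadline() bool {
	parent, cancel := context.WithCancel(context.Background())
	defer cancel()
	ctx, cancelTimeout := context.WithTimeout(parent, 30*time.Second)
	defer cancelTimeout()
	return hasDeadline(ctx)
}

// boundedExpires uses a very short timeout to show that the derived context
// ends with DeadlineExceeded even though the parent is never cancelled.
func boundedExpires() bool {
	parent, cancel := context.WithCancel(context.Background())
	defer cancel()
	ctx, cancelTimeout := context.WithTimeout(parent, 20*time.Millisecond)
	defer cancelTimeout()

	select {
	case <-ctx.Done():
		return ctx.Err() == context.DeadlineExceeded
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("cancel-only has deadline:", shutdownOnlyHasDeadline())
	fmt.Println("wrapped has deadline:", boundedHasDeadline())
	fmt.Println("wrapped expires on its own:", boundedExpires())
}
```

A derived context still observes the parent's cancellation, so wrapping keeps the shutdown signal while adding an upper bound on each attempt.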

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

3 participants