OCPBUGS-77056: Release lock before WaitForCacheSync in secret monitor#2132
OCPBUGS-77056: Release lock before WaitForCacheSync in secret monitor#2132bentito wants to merge 6 commits intoopenshift:masterfrom
Conversation
When many routes use spec.tls.externalCertificate, the router calls AddSecretEventHandler once per route at startup. Previously this held the global write lock for the entire duration of each call, including the WaitForCacheSync round-trip to the API server and etcd (~100-600ms per secret). This serialized all N secret registrations, causing startup time of O(N * api_latency): 112 secrets x 587ms median = 65s (healthy cluster) 1738 secrets x 1414ms avg = 40min (high etcd latency) Fix: write the new monitoredItem into s.monitors before releasing the lock (so concurrent callers for the same secret reuse the same informer), then release the lock before WaitForCacheSync. This allows concurrent AddSecretEventHandler calls for different secrets to perform their etcd round-trips in parallel, reducing startup time from O(N * api_latency) to O(api_latency). On sync failure, the map entry is cleaned up so a future caller can retry with a fresh informer. Fixes: OCPBUGS-77056 Signed-off-by: Brett Tofel <btofel@redhat.com>
|
@bentito: This pull request references Jira Issue OCPBUGS-77056, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@bentito: This pull request references Jira Issue OCPBUGS-77056, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/assign |
|
/assign @everettraven |
|
@everettraven Could you give this proposed fix a look from the standpoint of whether it makes logical sense or whether we might just be moving the problem to the API server? Thanks! |
… cache sync Signed-off-by: Brett Tofel <btofel@redhat.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: bentito The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Signed-off-by: Brett Tofel <btofel@redhat.com>
…external cert serving
|
/retest Note: The |
|
@bentito: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
Routes using
spec.tls.externalCertificatetrigger a call toAddSecretEventHandlerfor each route at router startup. The previousimplementation held the global write mutex (
s.lock) for the entire durationof the call, including the
WaitForCacheSyncround-trip to the API server andetcd (~100–600ms per secret). This caused all N secret registrations to run
strictly serially:
The customer reports ~2,000 external cert routes causing router pods to fail
their 120s startup probe repeatedly and enter CrashLoopBackOff.
Root Cause
In
addSecretEventHandler,WaitForCacheSyncwas called insidedefer s.lock.Unlock(), meaning the global write lock was held while waitingfor each secret's informer to complete its initial LIST to etcd. No other
registration could begin until the previous one finished.
Fix
Write the new
monitoredItemintos.monitorsbefore releasing the lock(so concurrent callers for the same secret safely reuse the same informer),
then release the lock before
WaitForCacheSync. This allows concurrentAddSecretEventHandlercalls for different secrets to perform their etcdround-trips in parallel, reducing startup time from O(N × api_latency) to
O(api_latency).
On sync failure the map entry is cleaned up so a future caller can retry with
a fresh informer.
Testing
All existing unit tests in
pkg/secret/...pass.A concurrent registration test should be added (tracked separately).
Fixes: OCPBUGS-77056