Increase pathological event thresholds for API server rollout scenarios#31122
Increase pathological event thresholds for API server rollout scenarios#31122gangwgr wants to merge 1 commit intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReplaces the prior unconditional ScalingReplicaSet allowance with three KMS-encryption–scoped pathological event matchers. Each matcher derives buffered KMS encryption e2e intervals from ChangesKMS-encryption Scoped Pathological Matchers
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 12✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
| // KMS encryption tests (and similar tests that trigger multiple kube-apiserver rollouts) | ||
| // cause cascading rollouts across openshift-apiserver and openshift-oauth-apiserver, | ||
| // generating ScalingReplicaSet events well above the default threshold. | ||
| registry.AddPathologicalEventMatcherOrDie(&SimplePathologicalEventMatcher{ |
There was a problem hiding this comment.
In which step enables this configuration only for KMS CI?. I think we shouldn't update for all configurations.
There was a problem hiding this comment.
Agreed, I have uupdated matchers are now scoped to only apply when encryption tests are detected in the test intervals.
f0f6117 to
9ff012b
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go (1)
446-482: ⚡ Quick winAdd focused unit tests for matcher scope + thresholds.
Please add tests asserting namespace/deployment/reason matching and repeat-threshold behavior for all three new matchers (
100/50/50) so future edits don’t unintentionally broaden allowance.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go` around lines 446 - 482, Add unit tests that exercise the three new SimplePathologicalEventMatcher entries (names: APIServerScalingReplicaSetDuringRollout, APIServerOperatorStatusChangedDuringRollout, OperatorDeploymentUpdatedDuringRollout) by locating them via the registry (registry.AddPathologicalEventMatcherOrDie/registry lookup) and verifying their locatorKeyRegexes and messageReasonRegex behave as expected: include positive tests where namespace, deployment and reason match the regexes and negative tests where one of those fields does not match; additionally assert repeatThresholdOverride equals 100 for APIServerScalingReplicaSetDuringRollout and 50 for the other two matchers, and include boundary tests to ensure the matcher does not falsely accept broader values.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`:
- Around line 446-482: Add unit tests that exercise the three new
SimplePathologicalEventMatcher entries (names:
APIServerScalingReplicaSetDuringRollout,
APIServerOperatorStatusChangedDuringRollout,
OperatorDeploymentUpdatedDuringRollout) by locating them via the registry
(registry.AddPathologicalEventMatcherOrDie/registry lookup) and verifying their
locatorKeyRegexes and messageReasonRegex behave as expected: include positive
tests where namespace, deployment and reason match the regexes and negative
tests where one of those fields does not match; additionally assert
repeatThresholdOverride equals 100 for APIServerScalingReplicaSetDuringRollout
and 50 for the other two matchers, and include boundary tests to ensure the
matcher does not falsely accept broader values.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 021666eb-9047-4696-acf6-e64f6dff4a6b
📒 Files selected for processing (1)
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go
9ff012b to
60ccd43
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`:
- Around line 1385-1393: The three new SimplePathologicalEventMatcher entries
(including the one named "APIServerScalingReplicaSetDuringEncryption" and the
two adjacent matchers added alongside it) are missing explicit
repeatThresholdOverride values, leaving them unbounded under
allowIfWithinIntervals; add the repeatThresholdOverride field to each matcher
and set it to enforce the intended caps (set 100 for the
APIServerScalingReplicaSetDuringEncryption matcher and 50/50 for the other two
matchers) so the Allows logic respects those limits.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 7bd1ea0a-7884-4eef-bc93-10a88c017895
📒 Files selected for processing (1)
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go
60ccd43 to
8861ba0
Compare
|
Scheduling required tests: |
ardaguclu
left a comment
There was a problem hiding this comment.
Overall mechanism is nice to me and better than disabling the monitors. Dropped some comments.
| encryptionTestIntervals := finalIntervals.Filter(func(eventInterval monitorapi.Interval) bool { | ||
| testName := eventInterval.Locator.Keys[monitorapi.LocatorE2ETestKey] | ||
| return eventInterval.Source == monitorapi.SourceE2ETest && | ||
| (strings.Contains(testName, "Encryption") || strings.Contains(testName, "encryption")) |
There was a problem hiding this comment.
Isn't encryption too generic that may overlap with some other tests. Is it possible to add KMS to ensure that this will cover our setup?.
There was a problem hiding this comment.
Updated. Changed to match EncryptionKMS and encryption-kms specifically so it won't overlap with other encryption-related tests.
| (strings.Contains(testName, "Encryption") || strings.Contains(testName, "encryption")) | ||
| }) | ||
| for i := range encryptionTestIntervals { | ||
| encryptionTestIntervals[i].To = encryptionTestIntervals[i].To.Add(time.Minute * 10) |
There was a problem hiding this comment.
What are we fixing by adding and subtracting 10 minutes?
There was a problem hiding this comment.
After the encryption test finishes, the API server rollout it triggered is still in progress - pods are still restarting and operators are still reacting. The +-10 minute buffer gives time for these trailing events to settle. Same buffer value used by the vsphere configuration test matcher for the same reason
| // Encryption tests trigger multiple kube-apiserver rollouts that cascade into | ||
| // these namespaces, generating ScalingReplicaSet events (observed 68-82 times). | ||
| func newEncryptionTestScalingReplicaSetMatcher(finalIntervals monitorapi.Intervals) EventMatcher { | ||
| encryptionIntervals := getEncryptionTestIntervals(finalIntervals) |
There was a problem hiding this comment.
I think we shouldn't depend on another function. We can carry the same filters here like others do.
| monitorapi.LocatorDeploymentKey: regexp.MustCompile(`^apiserver$`), | ||
| }, | ||
| messageReasonRegex: regexp.MustCompile(`^ScalingReplicaSet$`), | ||
| repeatThresholdOverride: 100, |
There was a problem hiding this comment.
We need to determine the expected threshold and update periodically after each new test.
There was a problem hiding this comment.
Agreed. These thresholds come from what we've seen in the first few CI runs - for example ScalingReplicaSet fired 68-82 times, so we set the cap at 100 to give some room. We'll keep an eye on the periodic job results and tune them if needed
| // Operators report status changes repeatedly during cascading rollouts | ||
| // triggered by encryption configuration changes (observed 22-25 times). | ||
| func newEncryptionTestOperatorStatusChangedMatcher(finalIntervals monitorapi.Intervals) EventMatcher { | ||
| encryptionIntervals := getEncryptionTestIntervals(finalIntervals) |
There was a problem hiding this comment.
Just carry the filters without depending on a function.
| encryptionIntervals := getEncryptionTestIntervals(finalIntervals) | ||
| return &OverlapOtherIntervalsPathologicalEventMatcher{ | ||
| delegate: &SimplePathologicalEventMatcher{ | ||
| name: "APIServerOperatorStatusChangedDuringEncryption", |
There was a problem hiding this comment.
Just a note for future readers: OperatorStatus is changed to progressing when roll out is needed.
| delegate: &SimplePathologicalEventMatcher{ | ||
| name: "OperatorDeploymentUpdatedDuringEncryption", | ||
| locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{ | ||
| monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^(openshift-console-operator|openshift-authentication-operator)$`), |
There was a problem hiding this comment.
Why openshift-apiserver-operator is not listed here but openshift-console-operator is?
There was a problem hiding this comment.
Different operators emit different event types. The apiserver-operator fires OperatorStatusChanged (covered by the second matcher), while console-operator and authentication-operator fire DeploymentUpdated when they react to the apiserver changes
There was a problem hiding this comment.
Do we know that why console-operator is rolled out?. We only focus on aggregated apiservers.
There was a problem hiding this comment.
Not sure I see in this job https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_clus[…]in-e2e-gcp-operator-encryption-kms-ote/2049754074668077056
namespace/openshift-console-operator deployment/console-operator - reason/DeploymentUpdated Updated Deployment.apps/console -n openshift-console because it changed (31 times)
that's why i added we can remove console-operator it has indirect effect of rollout.
As per I'm assumption he console-operator detects the auth config change and re-syncs the console deployment that's why it has rollouts
8861ba0 to
d70d6dd
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go (1)
446-451:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep the KMS-only scope out of
MatchesAny.These matchers are always registered, but
OverlapOtherIntervalsPathologicalEventMatcher.Matches()ignoresallowIfWithinIntervals. That means non-KMS jobs will still classify matchingScalingReplicaSet/OperatorStatusChanged/DeploymentUpdatedevents as matches, which is broader than the rollout-only scope described here. Gate the registrations behind the same KMS-test detection so the matchers are absent when no KMS interval exists.Suggested patch
// KMS encryption tests trigger multiple kube-apiserver rollouts that cascade across // apiserver, oauth-apiserver, and their operators. These matchers are scoped to only // apply when KMS encryption tests are detected in the intervals. - registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestScalingReplicaSetMatcher(finalIntervals)) - registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestOperatorStatusChangedMatcher(finalIntervals)) - registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestDeploymentUpdatedMatcher(finalIntervals)) + hasKMSEncryptionIntervals := len(finalIntervals.Filter(func(eventInterval monitorapi.Interval) bool { + testName := eventInterval.Locator.Keys[monitorapi.LocatorE2ETestKey] + return eventInterval.Source == monitorapi.SourceE2ETest && + (strings.Contains(testName, "EncryptionKMS") || strings.Contains(testName, "encryption-kms")) + })) > 0 + if hasKMSEncryptionIntervals { + registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestScalingReplicaSetMatcher(finalIntervals)) + registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestOperatorStatusChangedMatcher(finalIntervals)) + registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestDeploymentUpdatedMatcher(finalIntervals)) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go` around lines 446 - 451, The three KMS-specific matchers (newKMSEncryptionTestScalingReplicaSetMatcher, newKMSEncryptionTestOperatorStatusChangedMatcher, newKMSEncryptionTestDeploymentUpdatedMatcher) are always registered via registry.AddPathologicalEventMatcherOrDie but their Matches() ignores allowIfWithinIntervals so they must only be registered when KMS intervals are present; modify the registration site to check the same KMS-test detection (i.e., the presence/contents of finalIntervals or the existing KMS detection predicate used elsewhere) and only call registry.AddPathologicalEventMatcherOrDie for these three matchers when a KMS interval exists, leaving them absent for non-KMS runs.
🧹 Nitpick comments (1)
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go (1)
1365-1451: ⚡ Quick winAdd focused tests for the KMS matcher boundaries.
These three builders now encode most of the PR’s behavior: KMS test detection, ±10 minute buffering, namespace/deployment scoping, and 100/50/50 caps. A small table-driven test around
Allowswould lock down the intended cases: inside-window vs outside-window, non-KMS test intervals, and one-over-threshold counts.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go` around lines 1365 - 1451, Add table-driven unit tests that exercise the Allows behavior for newKMSEncryptionTestScalingReplicaSetMatcher, newKMSEncryptionTestOperatorStatusChangedMatcher, and newKMSEncryptionTestDeploymentUpdatedMatcher: create synthetic monitorapi.Intervals that include a matching KMS e2e test interval and variants (inside the ±10 minute buffered window, just outside the buffer, and a non-KMS test interval), then assert that event candidates matching the locator/namespace/deployment and reason are allowed only when inside the buffered window and not allowed when outside or for non-KMS intervals; also include a case that exceeds the repeatThresholdOverride by one to ensure the matcher rejects beyond the cap. Use table-driven subtests to cover each matcher, vary From/To to validate the ±10 minute logic, and construct events with the exact locator keys and message reasons referenced in the functions to target the code paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In
`@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`:
- Around line 446-451: The three KMS-specific matchers
(newKMSEncryptionTestScalingReplicaSetMatcher,
newKMSEncryptionTestOperatorStatusChangedMatcher,
newKMSEncryptionTestDeploymentUpdatedMatcher) are always registered via
registry.AddPathologicalEventMatcherOrDie but their Matches() ignores
allowIfWithinIntervals so they must only be registered when KMS intervals are
present; modify the registration site to check the same KMS-test detection
(i.e., the presence/contents of finalIntervals or the existing KMS detection
predicate used elsewhere) and only call
registry.AddPathologicalEventMatcherOrDie for these three matchers when a KMS
interval exists, leaving them absent for non-KMS runs.
---
Nitpick comments:
In
`@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`:
- Around line 1365-1451: Add table-driven unit tests that exercise the Allows
behavior for newKMSEncryptionTestScalingReplicaSetMatcher,
newKMSEncryptionTestOperatorStatusChangedMatcher, and
newKMSEncryptionTestDeploymentUpdatedMatcher: create synthetic
monitorapi.Intervals that include a matching KMS e2e test interval and variants
(inside the ±10 minute buffered window, just outside the buffer, and a non-KMS
test interval), then assert that event candidates matching the
locator/namespace/deployment and reason are allowed only when inside the
buffered window and not allowed when outside or for non-KMS intervals; also
include a case that exceeds the repeatThresholdOverride by one to ensure the
matcher rejects beyond the cap. Use table-driven subtests to cover each matcher,
vary From/To to validate the ±10 minute logic, and construct events with the
exact locator keys and message reasons referenced in the functions to target the
code paths.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 41411c95-690f-4ea4-aa36-82683a054b07
📒 Files selected for processing (1)
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go
This is already handled by the OverlapOtherIntervalsPathologicalEventMatcher itself. When no KMS test intervals exist, allowIfWithinIntervals is empty, so Allows() always returns false - the matcher has no effect. This is the same unconditional registration pattern used by all other OverlapOtherIntervalsPathologicalEventMatcher matchers in this file (vsphere, imagepolicy, sigterm, etc.). |
|
Scheduling required tests: |
|
/test e2e-gcp-csi |
|
/test e2e-gcp-ovn-upgrade |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ardaguclu, gangwgr The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| messageHumanRegex: regexp.MustCompile(`message changed from "\\ufeff`), | ||
| }) | ||
|
|
||
| // KMS encryption tests trigger multiple kube-apiserver rollouts that cascade across |
There was a problem hiding this comment.
would it make sense to link the analysis we did here in the doc ?
xref: https://docs.google.com/document/d/14EJEJ6Xi7DPRN9gIaNUdLet8BwjJlVWf16Q_md5H1xA/edit?tab=t.0
d70d6dd to
22f4314
Compare
|
New changes are detected. LGTM label has been removed. |
|
Scheduling required tests: |
|
Changes look good to me |
22f4314 to
3f656ba
Compare
|
Scheduling required tests: |
|
/test e2e-aws-ovn-fips |
|
/test e2e-gcp-ovn-upgrade |
|
/test e2e-aws-ovn-fips |
3f656ba to
ff97901
Compare
KMS encryption tests trigger multiple kube-apiserver rollouts that cascade across openshift-apiserver, openshift-oauth-apiserver, and their operators, generating ScalingReplicaSet, OperatorStatusChanged, and DeploymentUpdated events well above the default threshold of 20. Add three new matchers scoped to only apply when encryption tests are detected in the test intervals using OverlapOtherIntervalsPathological EventMatcher, following the same pattern as vsphere configuration and imagepolicy signature validation matchers: - APIServerScalingReplicaSetDuringEncryption - APIServerOperatorStatusChangedDuringEncryption - OperatorDeploymentUpdatedDuringEncryption
ff97901 to
5008288
Compare
|
@gangwgr: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Scheduling required tests: |
KMS encryption tests and similar tests that trigger multiple kube-apiserver rollouts cause cascading rollouts across openshift-apiserver, openshift-oauth-apiserver, and their operators. This generates ScalingReplicaSet, OperatorStatusChanged, and DeploymentUpdated events well above the default threshold of 20.
Add three new matchers with increased repeatThresholdOverride values:
Summary by CodeRabbit