Skip to content

Increase pathological event thresholds for API server rollout scenarios#31122

Open
gangwgr wants to merge 1 commit intoopenshift:mainfrom
gangwgr:kms-encryption-threshold
Open

Increase pathological event thresholds for API server rollout scenarios#31122
gangwgr wants to merge 1 commit intoopenshift:mainfrom
gangwgr:kms-encryption-threshold

Conversation

@gangwgr
Copy link
Copy Markdown
Contributor

@gangwgr gangwgr commented May 5, 2026

KMS encryption tests and similar tests that trigger multiple kube-apiserver rollouts cause cascading rollouts across openshift-apiserver, openshift-oauth-apiserver, and their operators. This generates ScalingReplicaSet, OperatorStatusChanged, and DeploymentUpdated events well above the default threshold of 20.

Add three new matchers with increased repeatThresholdOverride values:

  • APIServerScalingReplicaSetDuringRollout (threshold: 100)
  • APIServerOperatorStatusChangedDuringRollout (threshold: 50)
  • OperatorDeploymentUpdatedDuringRollout (threshold: 50)

Summary by CodeRabbit

  • Bug Fixes
    • Monitoring now conditionally tolerates expected cascade events only during nearby KMS encryption test windows: scaling of API server replicas (limited to openshift-apiserver / openshift-oauth-apiserver apiserver), operator status changes (openshift-apiserver-operator), and deployment updates (console and authentication operators). Each allowance is time-scoped to buffered encryption test intervals and uses adjusted repeat thresholds to reduce false positives and improve test reliability.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Replaces the prior unconditional ScalingReplicaSet allowance with three KMS-encryption–scoped pathological event matchers. Each matcher derives buffered KMS encryption e2e intervals from finalIntervals (locator contains EncryptionKMS or encryption-kms, expanded ±10 minutes) and only allows its pathological event when overlapping those intervals.

Changes

KMS-encryption Scoped Pathological Matchers

Layer / File(s) Summary
Interval Extraction / Scoping
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go
Derives buffered KMS encryption e2e-test intervals by filtering finalIntervals for intervals whose LocatorE2ETestKey contains EncryptionKMS or encryption-kms, then expanding each matched interval by ±10 minutes.
Matcher Constructors — ScalingReplicaSet
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go
Adds newKMSEncryptionTestScalingReplicaSetMatcher(finalIntervals) that permits ScalingReplicaSet with repeatThresholdOverride: 100, restricted to namespaces openshift-apiserver and openshift-oauth-apiserver, deployment key apiserver, and only when overlapping the buffered KMS encryption intervals (via OverlapOtherIntervalsPathologicalEventMatcher).
Matcher Constructors — OperatorStatusChanged
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go
Adds newKMSEncryptionTestOperatorStatusChangedMatcher(finalIntervals) that permits OperatorStatusChanged with repeatThresholdOverride: 50, restricted to openshift-apiserver-operator, and only when overlapping the buffered KMS encryption intervals.
Matcher Constructors — DeploymentUpdated
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go
Adds newKMSEncryptionTestDeploymentUpdatedMatcher(finalIntervals) that permits DeploymentUpdated with repeatThresholdOverride: 50, restricted to deployments console-operator (namespace openshift-console-operator) and authentication-operator (namespace openshift-authentication-operator), and only when overlapping the buffered KMS encryption intervals.
Registration / Wiring
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go
Removes the unconditional ScalingReplicaSet allowance in NewUniversalPathologicalEventMatchers and registers the three new KMS-encryption-scoped matchers so their allowances apply only when events overlap the derived buffered KMS encryption intervals.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Increase pathological event thresholds for API server rollout scenarios' accurately summarizes the main objective of the PR, which is to raise repeat thresholds for pathological events during API server rollouts caused by KMS encryption tests.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR contains no Ginkgo test definitions. Changes are limited to internal pathological event matcher names which are hardcoded string literals with no dynamic content.
Test Structure And Quality ✅ Passed Custom check for Ginkgo test quality is not applicable. PR modifies only duplicated_event_patterns.go, a library file with no test code. No Ginkgo tests are added or modified.
Microshift Test Compatibility ✅ Passed This PR does not add any Ginkgo e2e tests. It only modifies test monitoring library code. The check for MicroShift compatibility is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests added in this PR. Changes modify pathological event matchers only in library code used for test monitoring. SNO compatibility check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed This PR modifies test event monitoring infrastructure, not deployment manifests, operator code, or controllers. No scheduling constraints or topology-dependent workload placement logic is introduced.
Ote Binary Stdout Contract ✅ Passed Library code with no process-level entry points or stdout writes. Not an OTE binary entry point.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR does not add new Ginkgo e2e tests. Changes are in a utility library containing event matcher functions, not test code. No IPv4 or external connectivity concerns.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-merge-bot openshift-merge-bot Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label May 5, 2026
// KMS encryption tests (and similar tests that trigger multiple kube-apiserver rollouts)
// cause cascading rollouts across openshift-apiserver and openshift-oauth-apiserver,
// generating ScalingReplicaSet events well above the default threshold.
registry.AddPathologicalEventMatcherOrDie(&SimplePathologicalEventMatcher{
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which step enables this configuration only for KMS CI?. I think we shouldn't update for all configurations.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I have uupdated matchers are now scoped to only apply when encryption tests are detected in the test intervals.

@gangwgr gangwgr force-pushed the kms-encryption-threshold branch from f0f6117 to 9ff012b Compare May 5, 2026 05:50
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go (1)

446-482: ⚡ Quick win

Add focused unit tests for matcher scope + thresholds.

Please add tests asserting namespace/deployment/reason matching and repeat-threshold behavior for all three new matchers (100/50/50) so future edits don’t unintentionally broaden allowance.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`
around lines 446 - 482, Add unit tests that exercise the three new
SimplePathologicalEventMatcher entries (names:
APIServerScalingReplicaSetDuringRollout,
APIServerOperatorStatusChangedDuringRollout,
OperatorDeploymentUpdatedDuringRollout) by locating them via the registry
(registry.AddPathologicalEventMatcherOrDie/registry lookup) and verifying their
locatorKeyRegexes and messageReasonRegex behave as expected: include positive
tests where namespace, deployment and reason match the regexes and negative
tests where one of those fields does not match; additionally assert
repeatThresholdOverride equals 100 for APIServerScalingReplicaSetDuringRollout
and 50 for the other two matchers, and include boundary tests to ensure the
matcher does not falsely accept broader values.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`:
- Around line 446-482: Add unit tests that exercise the three new
SimplePathologicalEventMatcher entries (names:
APIServerScalingReplicaSetDuringRollout,
APIServerOperatorStatusChangedDuringRollout,
OperatorDeploymentUpdatedDuringRollout) by locating them via the registry
(registry.AddPathologicalEventMatcherOrDie/registry lookup) and verifying their
locatorKeyRegexes and messageReasonRegex behave as expected: include positive
tests where namespace, deployment and reason match the regexes and negative
tests where one of those fields does not match; additionally assert
repeatThresholdOverride equals 100 for APIServerScalingReplicaSetDuringRollout
and 50 for the other two matchers, and include boundary tests to ensure the
matcher does not falsely accept broader values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 021666eb-9047-4696-acf6-e64f6dff4a6b

📥 Commits

Reviewing files that changed from the base of the PR and between f0f6117 and 9ff012b.

📒 Files selected for processing (1)
  • pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go

@gangwgr gangwgr force-pushed the kms-encryption-threshold branch from 9ff012b to 60ccd43 Compare May 5, 2026 05:59
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`:
- Around line 1385-1393: The three new SimplePathologicalEventMatcher entries
(including the one named "APIServerScalingReplicaSetDuringEncryption" and the
two adjacent matchers added alongside it) are missing explicit
repeatThresholdOverride values, leaving them unbounded under
allowIfWithinIntervals; add the repeatThresholdOverride field to each matcher
and set it to enforce the intended caps (set 100 for the
APIServerScalingReplicaSetDuringEncryption matcher and 50/50 for the other two
matchers) so the Allows logic respects those limits.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7bd1ea0a-7884-4eef-bc93-10a88c017895

📥 Commits

Reviewing files that changed from the base of the PR and between 9ff012b and 60ccd43.

📒 Files selected for processing (1)
  • pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go

Comment thread pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go Outdated
@gangwgr gangwgr force-pushed the kms-encryption-threshold branch from 60ccd43 to 8861ba0 Compare May 5, 2026 06:07
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Copy link
Copy Markdown
Member

@ardaguclu ardaguclu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall mechanism is nice to me and better than disabling the monitors. Dropped some comments.

encryptionTestIntervals := finalIntervals.Filter(func(eventInterval monitorapi.Interval) bool {
testName := eventInterval.Locator.Keys[monitorapi.LocatorE2ETestKey]
return eventInterval.Source == monitorapi.SourceE2ETest &&
(strings.Contains(testName, "Encryption") || strings.Contains(testName, "encryption"))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't encryption too generic that may overlap with some other tests. Is it possible to add KMS to ensure that this will cover our setup?.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Changed to match EncryptionKMS and encryption-kms specifically so it won't overlap with other encryption-related tests.

(strings.Contains(testName, "Encryption") || strings.Contains(testName, "encryption"))
})
for i := range encryptionTestIntervals {
encryptionTestIntervals[i].To = encryptionTestIntervals[i].To.Add(time.Minute * 10)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we fixing by adding and subtracting 10 minutes?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 10 minutes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the encryption test finishes, the API server rollout it triggered is still in progress - pods are still restarting and operators are still reacting. The +-10 minute buffer gives time for these trailing events to settle. Same buffer value used by the vsphere configuration test matcher for the same reason

// Encryption tests trigger multiple kube-apiserver rollouts that cascade into
// these namespaces, generating ScalingReplicaSet events (observed 68-82 times).
func newEncryptionTestScalingReplicaSetMatcher(finalIntervals monitorapi.Intervals) EventMatcher {
encryptionIntervals := getEncryptionTestIntervals(finalIntervals)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't depend on another function. We can carry the same filters here like others do.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

monitorapi.LocatorDeploymentKey: regexp.MustCompile(`^apiserver$`),
},
messageReasonRegex: regexp.MustCompile(`^ScalingReplicaSet$`),
repeatThresholdOverride: 100,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to determine the expected threshold and update periodically after each new test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. These thresholds come from what we've seen in the first few CI runs - for example ScalingReplicaSet fired 68-82 times, so we set the cap at 100 to give some room. We'll keep an eye on the periodic job results and tune them if needed

// Operators report status changes repeatedly during cascading rollouts
// triggered by encryption configuration changes (observed 22-25 times).
func newEncryptionTestOperatorStatusChangedMatcher(finalIntervals monitorapi.Intervals) EventMatcher {
encryptionIntervals := getEncryptionTestIntervals(finalIntervals)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just carry the filters without depending on a function.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

encryptionIntervals := getEncryptionTestIntervals(finalIntervals)
return &OverlapOtherIntervalsPathologicalEventMatcher{
delegate: &SimplePathologicalEventMatcher{
name: "APIServerOperatorStatusChangedDuringEncryption",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note for future readers: OperatorStatus is changed to progressing when roll out is needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

delegate: &SimplePathologicalEventMatcher{
name: "OperatorDeploymentUpdatedDuringEncryption",
locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^(openshift-console-operator|openshift-authentication-operator)$`),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why openshift-apiserver-operator is not listed here but openshift-console-operator is?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different operators emit different event types. The apiserver-operator fires OperatorStatusChanged (covered by the second matcher), while console-operator and authentication-operator fire DeploymentUpdated when they react to the apiserver changes

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know that why console-operator is rolled out?. We only focus on aggregated apiservers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I see in this job https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_clus[…]in-e2e-gcp-operator-encryption-kms-ote/2049754074668077056
namespace/openshift-console-operator deployment/console-operator - reason/DeploymentUpdated Updated Deployment.apps/console -n openshift-console because it changed (31 times)
that's why i added we can remove console-operator it has indirect effect of rollout.

As per I'm assumption he console-operator detects the auth config change and re-syncs the console deployment that's why it has rollouts

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.

@gangwgr gangwgr force-pushed the kms-encryption-threshold branch from 8861ba0 to d70d6dd Compare May 5, 2026 07:35
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go (1)

446-451: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the KMS-only scope out of MatchesAny.

These matchers are always registered, but OverlapOtherIntervalsPathologicalEventMatcher.Matches() ignores allowIfWithinIntervals. That means non-KMS jobs will still classify matching ScalingReplicaSet / OperatorStatusChanged / DeploymentUpdated events as matches, which is broader than the rollout-only scope described here. Gate the registrations behind the same KMS-test detection so the matchers are absent when no KMS interval exists.

Suggested patch
 	// KMS encryption tests trigger multiple kube-apiserver rollouts that cascade across
 	// apiserver, oauth-apiserver, and their operators. These matchers are scoped to only
 	// apply when KMS encryption tests are detected in the intervals.
-	registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestScalingReplicaSetMatcher(finalIntervals))
-	registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestOperatorStatusChangedMatcher(finalIntervals))
-	registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestDeploymentUpdatedMatcher(finalIntervals))
+	hasKMSEncryptionIntervals := len(finalIntervals.Filter(func(eventInterval monitorapi.Interval) bool {
+		testName := eventInterval.Locator.Keys[monitorapi.LocatorE2ETestKey]
+		return eventInterval.Source == monitorapi.SourceE2ETest &&
+			(strings.Contains(testName, "EncryptionKMS") || strings.Contains(testName, "encryption-kms"))
+	})) > 0
+	if hasKMSEncryptionIntervals {
+		registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestScalingReplicaSetMatcher(finalIntervals))
+		registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestOperatorStatusChangedMatcher(finalIntervals))
+		registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestDeploymentUpdatedMatcher(finalIntervals))
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`
around lines 446 - 451, The three KMS-specific matchers
(newKMSEncryptionTestScalingReplicaSetMatcher,
newKMSEncryptionTestOperatorStatusChangedMatcher,
newKMSEncryptionTestDeploymentUpdatedMatcher) are always registered via
registry.AddPathologicalEventMatcherOrDie but their Matches() ignores
allowIfWithinIntervals so they must only be registered when KMS intervals are
present; modify the registration site to check the same KMS-test detection
(i.e., the presence/contents of finalIntervals or the existing KMS detection
predicate used elsewhere) and only call
registry.AddPathologicalEventMatcherOrDie for these three matchers when a KMS
interval exists, leaving them absent for non-KMS runs.
🧹 Nitpick comments (1)
pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go (1)

1365-1451: ⚡ Quick win

Add focused tests for the KMS matcher boundaries.

These three builders now encode most of the PR’s behavior: KMS test detection, ±10 minute buffering, namespace/deployment scoping, and 100/50/50 caps. A small table-driven test around Allows would lock down the intended cases: inside-window vs outside-window, non-KMS test intervals, and one-over-threshold counts.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`
around lines 1365 - 1451, Add table-driven unit tests that exercise the Allows
behavior for newKMSEncryptionTestScalingReplicaSetMatcher,
newKMSEncryptionTestOperatorStatusChangedMatcher, and
newKMSEncryptionTestDeploymentUpdatedMatcher: create synthetic
monitorapi.Intervals that include a matching KMS e2e test interval and variants
(inside the ±10 minute buffered window, just outside the buffer, and a non-KMS
test interval), then assert that event candidates matching the
locator/namespace/deployment and reason are allowed only when inside the
buffered window and not allowed when outside or for non-KMS intervals; also
include a case that exceeds the repeatThresholdOverride by one to ensure the
matcher rejects beyond the cap. Use table-driven subtests to cover each matcher,
vary From/To to validate the ±10 minute logic, and construct events with the
exact locator keys and message reasons referenced in the functions to target the
code paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In
`@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`:
- Around line 446-451: The three KMS-specific matchers
(newKMSEncryptionTestScalingReplicaSetMatcher,
newKMSEncryptionTestOperatorStatusChangedMatcher,
newKMSEncryptionTestDeploymentUpdatedMatcher) are always registered via
registry.AddPathologicalEventMatcherOrDie but their Matches() ignores
allowIfWithinIntervals so they must only be registered when KMS intervals are
present; modify the registration site to check the same KMS-test detection
(i.e., the presence/contents of finalIntervals or the existing KMS detection
predicate used elsewhere) and only call
registry.AddPathologicalEventMatcherOrDie for these three matchers when a KMS
interval exists, leaving them absent for non-KMS runs.

---

Nitpick comments:
In
`@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`:
- Around line 1365-1451: Add table-driven unit tests that exercise the Allows
behavior for newKMSEncryptionTestScalingReplicaSetMatcher,
newKMSEncryptionTestOperatorStatusChangedMatcher, and
newKMSEncryptionTestDeploymentUpdatedMatcher: create synthetic
monitorapi.Intervals that include a matching KMS e2e test interval and variants
(inside the ±10 minute buffered window, just outside the buffer, and a non-KMS
test interval), then assert that event candidates matching the
locator/namespace/deployment and reason are allowed only when inside the
buffered window and not allowed when outside or for non-KMS intervals; also
include a case that exceeds the repeatThresholdOverride by one to ensure the
matcher rejects beyond the cap. Use table-driven subtests to cover each matcher,
vary From/To to validate the ±10 minute logic, and construct events with the
exact locator keys and message reasons referenced in the functions to target the
code paths.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 41411c95-690f-4ea4-aa36-82683a054b07

📥 Commits

Reviewing files that changed from the base of the PR and between 8861ba0 and d70d6dd.

📒 Files selected for processing (1)
  • pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go

@gangwgr
Copy link
Copy Markdown
Contributor Author

gangwgr commented May 5, 2026

♻️ Duplicate comments (1)

pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go (1)> 446-451: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the KMS-only scope out of MatchesAny.
These matchers are always registered, but OverlapOtherIntervalsPathologicalEventMatcher.Matches() ignores allowIfWithinIntervals. That means non-KMS jobs will still classify matching ScalingReplicaSet / OperatorStatusChanged / DeploymentUpdated events as matches, which is broader than the rollout-only scope described here. Gate the registrations behind the same KMS-test detection so the matchers are absent when no KMS interval exists.

Suggested patch

 	// KMS encryption tests trigger multiple kube-apiserver rollouts that cascade across
 	// apiserver, oauth-apiserver, and their operators. These matchers are scoped to only
 	// apply when KMS encryption tests are detected in the intervals.
-	registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestScalingReplicaSetMatcher(finalIntervals))
-	registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestOperatorStatusChangedMatcher(finalIntervals))
-	registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestDeploymentUpdatedMatcher(finalIntervals))
+	hasKMSEncryptionIntervals := len(finalIntervals.Filter(func(eventInterval monitorapi.Interval) bool {
+		testName := eventInterval.Locator.Keys[monitorapi.LocatorE2ETestKey]
+		return eventInterval.Source == monitorapi.SourceE2ETest &&
+			(strings.Contains(testName, "EncryptionKMS") || strings.Contains(testName, "encryption-kms"))
+	})) > 0
+	if hasKMSEncryptionIntervals {
+		registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestScalingReplicaSetMatcher(finalIntervals))
+		registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestOperatorStatusChangedMatcher(finalIntervals))
+		registry.AddPathologicalEventMatcherOrDie(newKMSEncryptionTestDeploymentUpdatedMatcher(finalIntervals))
+	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`
around lines 446 - 451, The three KMS-specific matchers
(newKMSEncryptionTestScalingReplicaSetMatcher,
newKMSEncryptionTestOperatorStatusChangedMatcher,
newKMSEncryptionTestDeploymentUpdatedMatcher) are always registered via
registry.AddPathologicalEventMatcherOrDie but their Matches() ignores
allowIfWithinIntervals so they must only be registered when KMS intervals are
present; modify the registration site to check the same KMS-test detection
(i.e., the presence/contents of finalIntervals or the existing KMS detection
predicate used elsewhere) and only call
registry.AddPathologicalEventMatcherOrDie for these three matchers when a KMS
interval exists, leaving them absent for non-KMS runs.

🧹 Nitpick comments (1)

pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go (1)> 1365-1451: ⚡ Quick win

Add focused tests for the KMS matcher boundaries.
These three builders now encode most of the PR’s behavior: KMS test detection, ±10 minute buffering, namespace/deployment scoping, and 100/50/50 caps. A small table-driven test around Allows would lock down the intended cases: inside-window vs outside-window, non-KMS test intervals, and one-over-threshold counts.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`
around lines 1365 - 1451, Add table-driven unit tests that exercise the Allows
behavior for newKMSEncryptionTestScalingReplicaSetMatcher,
newKMSEncryptionTestOperatorStatusChangedMatcher, and
newKMSEncryptionTestDeploymentUpdatedMatcher: create synthetic
monitorapi.Intervals that include a matching KMS e2e test interval and variants
(inside the ±10 minute buffered window, just outside the buffer, and a non-KMS
test interval), then assert that event candidates matching the
locator/namespace/deployment and reason are allowed only when inside the
buffered window and not allowed when outside or for non-KMS intervals; also
include a case that exceeds the repeatThresholdOverride by one to ensure the
matcher rejects beyond the cap. Use table-driven subtests to cover each matcher,
vary From/To to validate the ±10 minute logic, and construct events with the
exact locator keys and message reasons referenced in the functions to target the
code paths.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In
`@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`:
- Around line 446-451: The three KMS-specific matchers
(newKMSEncryptionTestScalingReplicaSetMatcher,
newKMSEncryptionTestOperatorStatusChangedMatcher,
newKMSEncryptionTestDeploymentUpdatedMatcher) are always registered via
registry.AddPathologicalEventMatcherOrDie but their Matches() ignores
allowIfWithinIntervals so they must only be registered when KMS intervals are
present; modify the registration site to check the same KMS-test detection
(i.e., the presence/contents of finalIntervals or the existing KMS detection
predicate used elsewhere) and only call
registry.AddPathologicalEventMatcherOrDie for these three matchers when a KMS
interval exists, leaving them absent for non-KMS runs.

---

Nitpick comments:
In
`@pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go`:
- Around line 1365-1451: Add table-driven unit tests that exercise the Allows
behavior for newKMSEncryptionTestScalingReplicaSetMatcher,
newKMSEncryptionTestOperatorStatusChangedMatcher, and
newKMSEncryptionTestDeploymentUpdatedMatcher: create synthetic
monitorapi.Intervals that include a matching KMS e2e test interval and variants
(inside the ±10 minute buffered window, just outside the buffer, and a non-KMS
test interval), then assert that event candidates matching the
locator/namespace/deployment and reason are allowed only when inside the
buffered window and not allowed when outside or for non-KMS intervals; also
include a case that exceeds the repeatThresholdOverride by one to ensure the
matcher rejects beyond the cap. Use table-driven subtests to cover each matcher,
vary From/To to validate the ±10 minute logic, and construct events with the
exact locator keys and message reasons referenced in the functions to target the
code paths.

ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 41411c95-690f-4ea4-aa36-82683a054b07

📥 Commits
Reviewing files that changed from the base of the PR and between 8861ba0 and d70d6dd.

📒 Files selected for processing (1)

  • pkg/monitortestlibrary/pathologicaleventlibrary/duplicated_event_patterns.go

This is already handled by the OverlapOtherIntervalsPathologicalEventMatcher itself. When no KMS test intervals exist, allowIfWithinIntervals is empty, so Allows() always returns false - the matcher has no effect. This is the same unconditional registration pattern used by all other OverlapOtherIntervalsPathologicalEventMatcher matchers in this file (vsphere, imagepolicy, sigterm, etc.).

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@gangwgr
Copy link
Copy Markdown
Contributor Author

gangwgr commented May 5, 2026

/test e2e-gcp-csi

@gangwgr
Copy link
Copy Markdown
Contributor Author

gangwgr commented May 5, 2026

/test e2e-gcp-ovn-upgrade

@ardaguclu
Copy link
Copy Markdown
Member

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 5, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ardaguclu, gangwgr
Once this PR has been reviewed and has the lgtm label, please assign miyadav for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

messageHumanRegex: regexp.MustCompile(`message changed from "\\ufeff`),
})

// KMS encryption tests trigger multiple kube-apiserver rollouts that cascade across
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to link the analysis we did here in the doc ?

xref: https://docs.google.com/document/d/14EJEJ6Xi7DPRN9gIaNUdLet8BwjJlVWf16Q_md5H1xA/edit?tab=t.0

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@gangwgr gangwgr force-pushed the kms-encryption-threshold branch from d70d6dd to 22f4314 Compare May 5, 2026 10:38
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 5, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 5, 2026

New changes are detected. LGTM label has been removed.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@ardaguclu
Copy link
Copy Markdown
Member

Changes look good to me

@gangwgr gangwgr force-pushed the kms-encryption-threshold branch from 22f4314 to 3f656ba Compare May 5, 2026 13:49
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@gangwgr
Copy link
Copy Markdown
Contributor Author

gangwgr commented May 5, 2026

/test e2e-aws-ovn-fips

@gangwgr
Copy link
Copy Markdown
Contributor Author

gangwgr commented May 5, 2026

/test e2e-gcp-ovn-upgrade

@gangwgr
Copy link
Copy Markdown
Contributor Author

gangwgr commented May 5, 2026

/test e2e-aws-ovn-fips

@gangwgr gangwgr force-pushed the kms-encryption-threshold branch from 3f656ba to ff97901 Compare May 6, 2026 04:25
KMS encryption tests trigger multiple kube-apiserver rollouts that
cascade across openshift-apiserver, openshift-oauth-apiserver, and
their operators, generating ScalingReplicaSet, OperatorStatusChanged,
and DeploymentUpdated events well above the default threshold of 20.

Add three new matchers scoped to only apply when encryption tests are
detected in the test intervals using OverlapOtherIntervalsPathological
EventMatcher, following the same pattern as vsphere configuration and
imagepolicy signature validation matchers:
- APIServerScalingReplicaSetDuringEncryption
- APIServerOperatorStatusChangedDuringEncryption
- OperatorDeploymentUpdatedDuringEncryption
@gangwgr gangwgr force-pushed the kms-encryption-threshold branch from ff97901 to 5008288 Compare May 6, 2026 04:43
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 6, 2026

@gangwgr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-serial-2of2 3f656ba link true /test e2e-aws-ovn-serial-2of2
ci/prow/e2e-aws-csi 3f656ba link true /test e2e-aws-csi

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants