[release-4.21] OCPBUGS-78192: Exclude disruption during NoExecuteTaintManager serial tests #30857
The upstream Kubernetes NoExecuteTaintManager serial test applies NoExecute taints to worker nodes where its test pods are scheduled. This evicts all pods on those nodes that lack a matching toleration, including metrics-server replicas. When both metrics-server pods are evicted simultaneously, the metrics API (an aggregated API proxied through kube-apiserver) returns 503 for ~25-30s until replacement pods become ready.

With 3 worker nodes and metrics-server running 2 replicas on 2 of them, there is a ~22% chance (2/9) that both test pods land on the metrics-server nodes, causing both replicas to be evicted at once.

The existing P99-based disruption threshold does not account for this because serial jobs are a small fraction of total runs in the historical data, so the P99 is dominated by non-serial jobs that never encounter this test. The result is a very low baseline (~0-1s) with 5s of grace, which cannot absorb a 25-30s deterministic outage.

This is not a product bug: the NoExecuteTaintManager test is intentionally designed to evict pods without tolerations.

Filter out disruption intervals that overlap with the NoExecuteTaintManager test window so this expected disruption is not counted against the threshold.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-cherrypick-robot: Detected clone of Jira Issue OCPBUGS-78191 with correct target version. Will retitle the PR to link to the clone.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-78192, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@stbenjam: This pull request references Jira Issue OCPBUGS-78192, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
/lgtm
@stbenjam: This PR has been marked as verified by
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: openshift-cherrypick-robot, stbenjam. The full list of commands accepted by this bot can be found here. The pull request process is described here.
@openshift-cherrypick-robot: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Merged commit f413cd7 into openshift:release-4.21
@openshift-cherrypick-robot: Jira Issue Verification Checks: Jira Issue OCPBUGS-78192. Jira Issue OCPBUGS-78192 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
This is an automated cherry-pick of #30855
/assign openshift-ci-robot