OCPBUGS-65896: controller/workload: Rework Degraded/Progressing conditions #2128
tchap wants to merge 4 commits into openshift:master from
Conversation
@tchap: This pull request references Jira Issue OCPBUGS-65896, which is valid. 3 validations were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cc @atiratree
```go
// Update is done when all pods have been updated to the latest revision
// and the deployment controller has reported NewReplicaSetAvailable
workloadIsBeingUpdated := !workloadAtHighestGeneration || !hasDeploymentProgressed(workload.Status)
workloadIsBeingUpdatedTooLong := v1helpers.IsUpdatingTooLong(previousStatus, *deploymentProgressingCondition.Type)
```
We don't use IsUpdatingTooLong any more; the check is now based on ProgressDeadlineExceeded from the deployment controller.
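For illustration, a minimal, dependency-free sketch of that check follows. The struct fields mirror the relevant parts of `appsv1.DeploymentStatus` so the sketch runs on its own; the real code uses the `k8s.io/api/apps/v1` types, and the helper name is the one the walkthrough uses.

```go
package main

import "fmt"

// Minimal mirrors of the relevant appsv1 fields; the real helper
// operates on k8s.io/api/apps/v1 types instead.
type DeploymentCondition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

type DeploymentStatus struct {
	Conditions []DeploymentCondition
}

// hasDeploymentTimedOutProgressing reports whether the deployment
// controller itself declared the rollout stuck: the Progressing
// condition is False with reason ProgressDeadlineExceeded.
func hasDeploymentTimedOutProgressing(status DeploymentStatus) (string, bool) {
	for _, c := range status.Conditions {
		if c.Type != "Progressing" {
			continue
		}
		if c.Status == "False" && c.Reason == "ProgressDeadlineExceeded" {
			return c.Message, true
		}
	}
	return "", false
}

func main() {
	status := DeploymentStatus{Conditions: []DeploymentCondition{{
		Type:    "Progressing",
		Status:  "False",
		Reason:  "ProgressDeadlineExceeded",
		Message: "ReplicaSet has timed out progressing.",
	}}}
	msg, timedOut := hasDeploymentTimedOutProgressing(status)
	fmt.Println(timedOut, msg)
}
```

Because the deployment controller only sets this reason after `spec.progressDeadlineSeconds` elapses, the operator no longer needs its own timer.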
let's also set .spec.progressDeadlineSeconds to 15m on the consumer workloads to honor the superseded progressingConditionTimeout value
I don't think we can do that. Sync from the delegate returns the synchronized deployment already. So we would have to send another update request, I guess.
We can align this in auth when updating library-go. I will actually open a PR there once this is accepted to run CI anyway.
Yup, I meant that we should update it in the consumer repos like auth.
Might be better to open a proof PR before the library-go changes are merged.
atiratree
left a comment
Just one nit, otherwise LGTM, thanks!
Force-pushed 9a05b4b to 0b782a2 (compare).
/lgtm
So that scaling does not automatically set Progressing/Degraded, condition handling is aligned so that the Deployment generation is no longer consulted. Degraded is set only when the Deployment is not Available or it times out progressing. Progressing is set to True when the Deployment is not in NewReplicaSetAvailable and has not timed out progressing.
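The rules in the commit message can be condensed into a small decision sketch. This is an illustrative rephrasing, not the controller's actual code: `deriveConditions` and its boolean inputs are hypothetical names, while the reason strings come from the Deployment's native Progressing condition.

```go
package main

import "fmt"

// deriveConditions is a hypothetical condensation of the rules above:
// Degraded when the Deployment is unavailable or timed out progressing;
// Progressing while the rollout has not yet reached NewReplicaSetAvailable
// and has not timed out. The Deployment generation is not consulted,
// so plain scaling does not flip either condition.
func deriveConditions(available bool, progressingReason string, timedOut bool) (progressing, degraded bool) {
	progressing = progressingReason != "NewReplicaSetAvailable" && !timedOut
	degraded = !available || timedOut
	return progressing, degraded
}

func main() {
	// Mid-rollout: new ReplicaSet still being scaled up.
	p, d := deriveConditions(true, "ReplicaSetUpdated", false)
	fmt.Println("mid-rollout:", p, d)

	// Steady state: rollout complete.
	p, d = deriveConditions(true, "NewReplicaSetAvailable", false)
	fmt.Println("steady:", p, d)
}
```

Note how a timed-out rollout yields Progressing=False and Degraded=True simultaneously, matching the behavior discussed in the review thread.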
Force-pushed 0b782a2 to 08c8c15 (compare).
Walkthrough
Refactors workload controller condition evaluation: replaces generation-based Progressing checks with timeout/progress checks, simplifies Degraded detection to pod-level failures and availability, tightens version-recording gating, and adds helpers for error messages, failing-pod detection, and timeout detection. Tests expanded to cover new flows and version-recording assertions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Controller
    participant Deployment as DeploymentStatus
    participant PodLister
    participant VersionRecorder
    Controller->>Deployment: read deployment status (generation, observedGeneration, available/updated/desired replicas, conditions)
    Controller->>PodLister: list pods for deployment
    alt pod list returns error
        Controller->>Controller: collect error -> errs
        Controller->>Controller: set Degraded = True (SyncError)
    else pod list OK
        Controller->>Controller: hasFailingPods(pods) -> failingPods
        alt failingPods non-empty
            Controller->>Controller: set Degraded = True (X of Y instances unavailable...)
        else
            Controller->>Controller: hasDeploymentTimedOutProgressing(status) -> (msg, timedOut)
            alt timedOut
                Controller->>Controller: set Progressing = False (ProgressDeadlineExceeded)
                Controller->>Controller: set Degraded = True (ProgressDeadlineExceeded)
            else hasDeploymentProgressed?
                Controller->>Controller: set Progressing = True (PodsUpdating)
            else
                Controller->>Controller: set Progressing = False (AsExpected)
                Controller->>Controller: set Degraded = False (AsExpected)
            end
        end
    end
    alt generation/replica gating satisfied
        Controller->>VersionRecorder: SetVersion(operand, targetVersion)
    end
```
🧹 Nitpick comments (1)
pkg/operator/apiserver/controller/workload/workload_test.go (1)
54-67: Please add explicit coverage for the new version-recording gate.
`pkg/operator/apiserver/controller/workload/workload.go` lines 358-361 changed when `SetVersion` fires, but this table still never exercises that path: no scenario sets `operatorConfigAtHighestRevision = true`, and the test controller leaves `versionRecorder` unset. A small fake recorder plus one positive and one negative case would keep this behavior from regressing. Also applies to: 846-878.
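A fake recorder of the shape this nitpick suggests might look like the sketch below. The names `fakeVersionRecorder` and `setVersionCalls` are illustrative, not identifiers from the repository, and the real recorder interface in library-go may require additional methods beyond `SetVersion`.

```go
package main

import "fmt"

// fakeVersionRecorder is a hypothetical test double that captures
// SetVersion calls so a test table can assert whether the
// version-recording gate fired for a given scenario.
type fakeVersionRecorder struct {
	setVersionCalls map[string]string
}

func (f *fakeVersionRecorder) SetVersion(operandName, version string) {
	if f.setVersionCalls == nil {
		f.setVersionCalls = map[string]string{}
	}
	f.setVersionCalls[operandName] = version
}

func main() {
	r := &fakeVersionRecorder{}
	r.SetVersion("openshift-apiserver", "4.20.0")
	fmt.Println(len(r.setVersionCalls))
}
```

A positive case would assert `setVersionCalls` contains the operand after sync; a negative case would assert the map stays empty when a gate (e.g. generation mismatch) blocks recording.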
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 15f9a88c-d378-463c-ad2b-87fa88d3a8d7
📒 Files selected for processing (2)
pkg/operator/apiserver/controller/workload/workload.go
pkg/operator/apiserver/controller/workload/workload_test.go
🧹 Nitpick comments (1)
pkg/operator/apiserver/controller/workload/workload_test.go (1)
847-889: Add negative `SetVersion()` cases for the other readiness gates.
`SetVersion()` is also gated on `Generation == ObservedGeneration`, `AvailableReplicas == desiredReplicas`, and `UpdatedReplicas == desiredReplicas` in `pkg/operator/apiserver/controller/workload/workload.go` lines 355-364. Right now only the `operatorConfigAtHighestRevision` gate is exercised, so a regression that records the version during an incomplete rollout would still pass this suite.
Force-pushed 882ecf8 to 3ae8d51 (compare).
/hold cancel
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: atiratree, tchap. The full list of commands accepted by this bot can be found here. Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
```diff
@@ -4,7 +4,6 @@ import (
 	"context"
```
Am I right that these are (at least) the three main behavioral changes introduced in this PR?
1. Scaling the workload (up/down) won't report Progressing=True
2. A stuck rollout eventually stops reporting Progressing=True
3. A workload missing some replicas (2/3) no longer triggers Degraded=True
Yeah, right. We only degrade on Available == 0.
Also, checking the Progressing timeout now depends on the native Deployment ProgressDeadlineExceeded condition instead of a helper, and once the progress deadline is hit, we go Degraded.
Ok, I added another commit that actually makes Degraded work as expected, i.e. Degraded when not all replicas are available. This is done by checking the relevant pods and counting the failing ones, so that pods that are still starting do not feed into the Degraded condition.
New changes are detected. LGTM label has been removed.
Force-pushed dee9075 to 405d2b1 (compare).
Previously, the controller only reported DeploymentDegraded when zero replicas were available. This missed partial outages where some pods were crashlooping or stuck. Now the controller degrades whenever AvailableReplicas < desired and at least one pod shows a concrete failure: non-zero exit, crashloop (RestartCount > 1), or a non-transient waiting reason. Pods that are still starting (ContainerCreating, PodInitializing) are not treated as failures to avoid false positives during normal rollouts and scale-ups. Additionally, the SyncError degraded condition is now applied via defer to ensure it is always set regardless of the return path.
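The waiting-reason classification this commit describes might be sketched as follows. The concrete reason lists are assumptions for illustration: the commit only names ContainerCreating and PodInitializing as transient, so the failure reasons shown here (CrashLoopBackOff, image-pull errors, container-creation errors) are common examples, and the real `isFailureWaitingReason` helper may cover a different set.

```go
package main

import "fmt"

// isFailureWaitingReason sketches the filter the commit describes:
// transient startup reasons are not failures, while well-known stuck
// states are. The exact reason list in the real helper may differ;
// unknown reasons are conservatively treated as non-failures here.
func isFailureWaitingReason(reason string) bool {
	switch reason {
	case "ContainerCreating", "PodInitializing":
		// Still starting normally (e.g. during a rollout or scale-up):
		// not a failure, avoids false Degraded positives.
		return false
	case "CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull",
		"CreateContainerError", "CreateContainerConfigError":
		return true
	}
	return false
}

func main() {
	for _, r := range []string{"ContainerCreating", "CrashLoopBackOff", "ImagePullBackOff"} {
		fmt.Println(r, isFailureWaitingReason(r))
	}
}
```

Combined with the non-zero-exit and RestartCount > 1 checks, this is what lets the controller degrade on partial outages without flagging pods that are merely starting.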
Force-pushed 405d2b1 to b94b8e2 (compare).
@tchap: This pull request references Jira Issue OCPBUGS-65896, which is invalid.
Comment Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/operator/apiserver/controller/workload/workload_test.go (1)
1217-1236: Make the fake pod lister honor namespace and selectors. The new degraded-path logic is selector-sensitive, but this fake always returns every injected pod. That lets these tests pass even if production code queries the wrong namespace or builds the wrong selector.
♻️ Proposed fix

```diff
 type fakePodNamespaceLister struct {
-	lister *fakePodLister
+	lister    *fakePodLister
+	namespace string
 }

 func (f *fakePodNamespaceLister) List(selector labels.Selector) (ret []*corev1.Pod, err error) {
-	return f.lister.pods, f.lister.err
+	if f.lister.err != nil {
+		return nil, f.lister.err
+	}
+	for _, pod := range f.lister.pods {
+		if pod.Namespace != f.namespace {
+			continue
+		}
+		if selector != nil && !selector.Matches(labels.Set(pod.Labels)) {
+			continue
+		}
+		ret = append(ret, pod)
+	}
+	return ret, nil
 }

 func (f *fakePodLister) List(selector labels.Selector) (ret []*corev1.Pod, err error) {
-	return f.pods, f.err
+	if f.err != nil {
+		return nil, f.err
+	}
+	for _, pod := range f.pods {
+		if selector != nil && !selector.Matches(labels.Set(pod.Labels)) {
+			continue
+		}
+		ret = append(ret, pod)
+	}
+	return ret, nil
 }

 func (f *fakePodLister) Pods(namespace string) corev1listers.PodNamespaceLister {
 	return &fakePodNamespaceLister{
-		lister: f,
+		lister:    f,
+		namespace: namespace,
 	}
 }
```
```go
func hasFailingPods(workload *appsv1.Deployment, podsLister corev1listers.PodLister) ([]string, error) {
	pods, err := podsLister.Pods(workload.Namespace).List(labels.SelectorFromSet(workload.Spec.Template.Labels))
	if err != nil {
		return nil, err
```
Use the Deployment selector when discovering failing pods.
Line 418 matches on workload.Spec.Template.Labels, which can be stricter than the Deployment’s selector and can exclude older ReplicaSet pods during a rollout. That can hide failing owned pods and incorrectly clear DeploymentDegraded while AvailableReplicas < desiredReplicas.
🐛 Proposed fix

```diff
 func hasFailingPods(workload *appsv1.Deployment, podsLister corev1listers.PodLister) ([]string, error) {
-	pods, err := podsLister.Pods(workload.Namespace).List(labels.SelectorFromSet(workload.Spec.Template.Labels))
+	if workload.Spec.Selector == nil {
+		return nil, fmt.Errorf("deployment/%s.%s is missing spec.selector", workload.Name, workload.Namespace)
+	}
+	selector, err := metav1.LabelSelectorAsSelector(workload.Spec.Selector)
+	if err != nil {
+		return nil, err
+	}
+	pods, err := podsLister.Pods(workload.Namespace).List(selector)
 	if err != nil {
 		return nil, err
 	}
```
```go
var failureMessages []string
for _, pod := range pods {
	for _, c := range slices.Concat(pod.Status.ContainerStatuses, pod.Status.InitContainerStatuses) {
		if c.Ready {
			continue
		}

		var msg string
		switch {
		case c.State.Terminated != nil && c.State.Terminated.ExitCode != 0:
			msg = fmt.Sprintf("container %s terminated with exit code %d in pod %s", c.Name, c.State.Terminated.ExitCode, pod.Name)
		case c.RestartCount > 1:
			msg = fmt.Sprintf("container %s is crashlooping in pod %s", c.Name, pod.Name)
		case c.State.Waiting != nil && isFailureWaitingReason(c.State.Waiting.Reason):
			msg = fmt.Sprintf("container %s is waiting (%s) in pod %s", c.Name, c.State.Waiting.Reason, pod.Name)
		}
		if len(msg) > 0 {
			failureMessages = append(failureMessages, msg)
		}
	}
}
```
Treat pre-container pod failures as degraded too.
This helper only looks at container statuses. Pods stuck before container creation—most notably PodScheduled=False / Unschedulable—won’t produce any message here, so a partially available rollout can stay DeploymentDegraded=False until ProgressDeadlineExceeded fires.
🐛 Proposed fix

```diff
 var failureMessages []string
 for _, pod := range pods {
+	podFailedBeforeStart := false
+	for _, cond := range pod.Status.Conditions {
+		if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse && cond.Reason == corev1.PodReasonUnschedulable {
+			failureMessages = append(failureMessages, fmt.Sprintf("pod %s is unschedulable: %s", pod.Name, cond.Message))
+			podFailedBeforeStart = true
+			break
+		}
+	}
+	if podFailedBeforeStart {
+		continue
+	}
+
 	for _, c := range slices.Concat(pod.Status.ContainerStatuses, pod.Status.InitContainerStatuses) {
 		if c.Ready {
 			continue
 		}
```
@tchap: The following test failed, say
Full PR test history. Your PR dashboard. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
So that scaling does not automatically set Progressing/Degraded, condition handling is aligned so that the Deployment generation is no longer consulted.
Degraded only happens when the deployment is not Available or it times
out progressing. This is less strict than before, but checking available
replicas would mean we go Degraded on scaling automatically.
Progressing is set to True when the deployment is not in
NewReplicaSetAvailable and it hasn't timed out progressing.
I also improved the overall test case names in a separate commit.
This affects `authentication-operator` and `openshift-apiserver-operator`.
Summary by CodeRabbit
Bug Fixes
Tests