fix(ci): add operator health check before applying Backstage CR #4868
zdrapela wants to merge 1 commit into
Skipping CI for Draft Pull Request.
/agentic_review
/test ?
/test pull-ci-redhat-developer-rhdh-main-e2e-gke-operator-pull
Code Review by Qodo
1. Readiness check matches wrong pod
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #4868 +/- ##
===========================================
+ Coverage 41.03% 69.60% +28.57%
===========================================
Files 121 111 -10
Lines 2220 4702 +2482
Branches 562 536 -26
===========================================
+ Hits 911 3273 +2362
- Misses 1304 1428 +124
+ Partials 5 1 -4
The GKE operator nightly job failed because the operator controller-manager pod disappeared after installation, causing the Backstage CR to never be reconciled into a deployment. The deployment wait loop timed out after 5 minutes with no useful diagnostics.

Changes:
- Add 'oc wait' with label selector 'control-plane=controller-manager' in prepare_operator() to verify the controller-manager pod is running and ready after CRD becomes available (5-minute timeout)
- Add the same label-based readiness check in deploy_rhdh_operator() before applying the Backstage CR (2-minute timeout, defense in depth)
- Enhance _operator_debug_info() to show operator pod status and namespace events for better debugging of similar failures
- Align GKE operator job retry count to 3 (matching AKS and EKS)

Verified on GKE cluster gke-us-central1-c-standard-ci-rhdh:
- Positive case: correctly detects running operator pod (condition met)
- Negative case: correctly detects missing operator pod (no matching resources found)
- Recovery case: correctly waits for operator recovery after scale-up

Assisted-by: OpenCode
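The label-based readiness check described in this commit message could look roughly like the sketch below. Only the control-plane=controller-manager selector and the 5-minute and 2-minute timeouts come from the commit message; the function name wait_for_operator_pod, the KUBE_CLI override variable, and the namespace argument are hypothetical and are not taken from operator.sh.

```shell
# Sketch only: wait_for_operator_pod and KUBE_CLI are hypothetical names.

# Block until the pods matching the controller-manager label are Ready.
# Returns non-zero on timeout or when no pod matches the selector.
wait_for_operator_pod() {
  local namespace="$1"
  local timeout="${2:-300s}"   # default of 5 minutes, as in prepare_operator()
  local cli="${KUBE_CLI:-oc}"  # assumption: kubectl accepts the same arguments

  "$cli" wait pod \
    --namespace "$namespace" \
    --selector control-plane=controller-manager \
    --for=condition=Ready \
    --timeout="$timeout"
}
```

Under these assumptions, the defense-in-depth check before applying the Backstage CR would be a second call with a shorter timeout, for example wait_for_operator_pod "$namespace" 120s. Selecting by label rather than by name avoids depending on how the operator pod happens to be named.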
5404a23 to 19ef974
The container image build workflow finished with status:
Review Summary by Qodo
Add operator health checks before applying Backstage CR
Walkthroughs
Description
• Add operator health checks before applying Backstage CR
• Verify controller-manager pod readiness in prepare_operator()
• Add defense-in-depth check in deploy_rhdh_operator()
• Enhance debug info with pod status and namespace events
• Align GKE operator job retry count to 3
Diagram
flowchart LR
  A["Install RHDH Operator"] --> B["Wait for Backstage CRD"]
  B --> C["Health Check: Controller-Manager Pod Ready"]
  C --> D["Deploy Backstage CR"]
  D --> E["Defense-in-Depth: Verify Pod Still Running"]
  E --> F["Apply Backstage CR"]
  C -->|Failure| G["Collect Debug Info"]
  E -->|Failure| G
File Changes
1. .ci/pipelines/install-methods/operator.sh
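The "Collect Debug Info" step in the diagram could be approximated as follows. This is a sketch, not the actual _operator_debug_info() implementation: the function name operator_debug_info, the KUBE_CLI override, and the section headers it prints are all assumptions made for illustration.

```shell
# Sketch only: operator_debug_info is a hypothetical stand-in for
# _operator_debug_info(); KUBE_CLI is an assumed override variable.

# Print operator pod status and namespace events. Each command is allowed
# to fail so that diagnostics never mask the original error.
operator_debug_info() {
  local namespace="$1"
  local cli="${KUBE_CLI:-oc}"

  echo "=== Pods in ${namespace} ==="
  "$cli" get pods --namespace "$namespace" -o wide || true

  echo "=== Events in ${namespace} (oldest first) ==="
  "$cli" get events --namespace "$namespace" --sort-by=.lastTimestamp || true
}
```

Sorting events by timestamp puts evictions or crash loops near the end of the output, which is usually where a reader of a failed CI log looks first.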
/test e2e-gke-operator-nightly
This PR is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 21 days.



Problem
The GKE operator nightly job failed because the RHDH operator controller-manager pod disappeared after installation. When the Backstage CR was applied, there was no operator to reconcile it, so the deployment was never created. The 5-minute wait timed out with:
No E2E tests ran at all.
Root Cause
There was no health check verifying the operator controller-manager pod was actually running before applying the Backstage CR. The CRD could become available while the operator pod was still starting, crashing, or being evicted.
Changes
.ci/pipelines/install-methods/operator.sh:
- Add k8s_wait::deployment in prepare_operator() to verify the controller-manager pod is running and ready after the CRD becomes available (5-minute timeout)
- Add k8s_wait::deployment in deploy_rhdh_operator() before applying the Backstage CR as a defense-in-depth check (2-minute timeout)
- Enhance _operator_debug_info() to show operator pod status and namespace events
.ci/pipelines/jobs/gke-operator.sh:
- Align prepare_operator retry count to "3" (matching the AKS and EKS jobs, which already use 3 retries)
Verification
Tested on the GKE cluster (gke-us-central1-c-standard-ci-rhdh) with a real RHDH operator deployment:
Initial implementation used "controller-manager" as the pod name pattern for k8s_wait::deployment, but cluster testing revealed the pod is named rhdh-operator-<hash>, not *controller-manager*. Fixed to use ${OPERATOR_MANAGER} (= rhdh-operator), which correctly matches the pod name prefix.