fix(ci): add operator health check before applying Backstage CR by zdrapela · Pull Request #4868 · redhat-developer/rhdh

zdrapela · 2026-05-21T13:16:39Z

Problem

The GKE operator nightly job failed because the RHDH operator controller-manager pod disappeared after installation. When the Backstage CR was applied, there was no operator to reconcile it, so the deployment was never created. The 5-minute wait timed out with:

Timeout waiting for: Backstage deployment created by operator
Backstage deployment not created after 5 minutes
Checking operator logs...
No resources found in rhdh-operator namespace.

No E2E tests ran at all.

Root Cause

There was no health check verifying the operator controller-manager pod was actually running before applying the Backstage CR. The CRD could become available while the operator pod was still starting, crashing, or being evicted.

Changes

.ci/pipelines/install-methods/operator.sh:

- Added k8s_wait::deployment in prepare_operator() to verify the controller-manager pod is running and ready after CRD becomes available (5-minute timeout)
- Added k8s_wait::deployment in deploy_rhdh_operator() before applying the Backstage CR as a defense-in-depth check (2-minute timeout)
- Enhanced _operator_debug_info() to show operator pod status and namespace events

.ci/pipelines/jobs/gke-operator.sh:

- Aligned the prepare_operator retry count to "3", matching the AKS and EKS jobs, which already use 3 retries
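
For readers unfamiliar with the retry pattern, here is a minimal sketch of how a retry count such as "3" might be applied to a flaky step. This thread does not show how prepare_operator consumes the count, so `retry_command` and its arguments are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the repository's implementation.
# Usage: retry_command <max_attempts> <delay_seconds> <command> [args...]
retry_command() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    # Run the command as given; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is used.
    if (( attempt >= max_attempts )); then
      echo "Command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s..." >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

A call such as `retry_command 3 10 some_install_step` would run the step at most three times, pausing 10 seconds between attempts; `some_install_step` is a placeholder.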

Verification

Tested on the GKE cluster (gke-us-central1-c-standard-ci-rhdh) with a real RHDH operator deployment:

- Positive case: Correctly detects running operator pod (phase=Running, ready=True)
- Negative case: Correctly detects missing operator pod (scaled deployment to 0)
- Recovery case: Correctly detects operator recovery after scale-up

Initial implementation used "controller-manager" as the pod name pattern for k8s_wait::deployment, but cluster testing revealed the pod is named rhdh-operator-<hash>, not *controller-manager*. Fixed to use ${OPERATOR_MANAGER} (= rhdh-operator) which correctly matches the pod name prefix.
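
The naming mismatch can be reproduced without a cluster. The sketch below mimics a "first pod whose name contains the pattern" lookup over a plain list of names. `first_matching_pod` and the pod names are made up for illustration and are not the actual k8s_wait::deployment code.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a name-based pod lookup.
# Prints the first name that contains the pattern; fails if none does.
# Usage: first_matching_pod <pattern> <pod_name>...
first_matching_pod() {
  local pattern="$1"
  shift
  local name
  for name in "$@"; do
    if [[ "$name" == *"$pattern"* ]]; then
      printf '%s\n' "$name"
      return 0
    fi
  done
  return 1
}
```

With a made-up pod list, `first_matching_pod controller-manager rhdh-operator-7c9d8f-abcde` finds nothing, while `first_matching_pod rhdh-operator rhdh-operator-7c9d8f-abcde` matches. Because the first substring match wins, any other pod in the namespace that shares the prefix and is listed earlier would be selected instead.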

openshift-ci · 2026-05-21T13:16:43Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

zdrapela · 2026-05-21T13:16:48Z

/agentic_review

zdrapela · 2026-05-21T13:16:55Z

/test ?

zdrapela · 2026-05-21T13:17:01Z

/test pull-ci-redhat-developer-rhdh-main-e2e-gke-operator-pull

rhdh-qodo-merge · 2026-05-21T13:17:03Z

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)

1. Readiness check matches wrong pod 🐞 Bug ☼ Reliability

Description

The new operator “health check” calls k8s_wait::deployment with resource_name set to the operator
namespace (e.g., "rhdh-operator"), but k8s_wait::deployment only checks the first pod whose name
matches that string, not the controller-manager pod specifically. This can incorrectly succeed while
the controller-manager pod is absent/unready, allowing the Backstage CR to be applied without an
active reconciler.

Code

.ci/pipelines/install-methods/operator.sh[R65-67]

Relevance

⭐⭐⭐ High
Team previously fixed “wrong pod match” by making selection deterministic (tail -n1) and moved
toward label-based selectors.
PR-#3075
PR-#4414

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
OPERATOR_MANAGER is the operator namespace, but k8s_wait::deployment does not wait on a Deployment
resource or a specific label; it greps pod names and checks only the first match. Meanwhile,
operator debug/log collection targets the controller-manager via label selector, indicating the
intended target pod is label-identified rather than “first pod whose name contains the namespace
string.”
.ci/pipelines/env_variables.sh[61-71]
.ci/pipelines/lib/k8s-wait.sh[15-46]
.ci/pipelines/lib/k8s-wait.sh[33-41]
.ci/pipelines/install-methods/operator.sh[52-69]
.ci/pipelines/install-methods/operator.sh[90-99]
.ci/pipelines/install-methods/operator.sh[142-146]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The operator readiness gate uses `k8s_wait::deployment "$ns" "$ns" ...`, but `k8s_wait::deployment` does a grep-based lookup (`oc get pods | grep <resource_name> | head -n 1`) and verifies only that single pod’s phase/Ready condition. Because the new call passes the namespace string as the match term, the check is not guaranteed to validate the controller-manager pod.

### Issue Context
- `OPERATOR_MANAGER` is used as the operator namespace (`namespace::configure "${OPERATOR_MANAGER}"`, delete namespace, etc.).
- `_operator_debug_info()` already assumes the controller-manager is identified by label `control-plane=controller-manager`, but the new readiness wait does not use that selector.

### Fix Focus Areas
- Replace the new waits in `prepare_operator()` and `deploy_rhdh_operator()` with a label-based readiness check, e.g.:
 - `oc wait -n "$operator_ns" pod -l control-plane=controller-manager --for=condition=Ready --timeout=5m`
 - and for the defense-in-depth check: `--timeout=2m`
 - optionally assert at least one matching pod exists before waiting.
- Alternatively, enhance `k8s_wait::deployment` (or add a new helper) to support label selectors and use it here.

#### Target lines
- .ci/pipelines/install-methods/operator.sh[60-68]
- .ci/pipelines/install-methods/operator.sh[90-99]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
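
One possible shape for the label-based check suggested above, including the optional existence assertion, is sketched here. The function name `operator_ready` is hypothetical. The label `control-plane=controller-manager` and the `oc wait` flags are taken from the suggestion itself, and the snippet assumes `oc` is on the PATH and already logged in to the cluster.

```shell
#!/usr/bin/env bash
# Sketch only: operator_ready is a hypothetical name, not code from this PR.
# Usage: operator_ready <namespace> [timeout]   (timeout defaults to 5m)
operator_ready() {
  local ns="$1" timeout="${2:-5m}"
  local selector="control-plane=controller-manager"
  local pods

  # Fail fast with a clear message when no pod carries the label.
  pods="$(oc get pods -n "$ns" -l "$selector" -o name 2>/dev/null)"
  if [[ -z "$pods" ]]; then
    echo "No pods with label ${selector} in namespace ${ns}" >&2
    return 1
  fi

  # Block until every matching pod reports Ready, or the timeout expires.
  oc wait -n "$ns" pod -l "$selector" --for=condition=Ready --timeout="$timeout"
}
```

A caller could use `operator_ready "$operator_ns" 5m` after the CRD appears and `operator_ready "$operator_ns" 2m` right before applying the Backstage CR. The existence check is a single snapshot, so if the pod may not have been created yet, wrapping the call in a short polling loop may be a better fit than failing immediately.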

codecov · 2026-05-21T13:20:12Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.60%. Comparing base (b08abdf) to head (19ef974).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #4868       +/-   ##
===========================================
+ Coverage   41.03%   69.60%   +28.57%     
===========================================
  Files         121      111       -10     
  Lines        2220     4702     +2482     
  Branches      562      536       -26     
===========================================
+ Hits          911     3273     +2362     
- Misses       1304     1428      +124     
+ Partials        5        1        -4

Flag	Coverage Δ
install-dynamic-plugins	`92.44% <ø> (?)`
rhdh	`38.81% <ø> (-2.23%)`	⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b08abdf...19ef974. Read the comment docs.


The GKE operator nightly job failed because the operator controller-manager pod disappeared after installation, causing the Backstage CR to never be reconciled into a deployment. The deployment wait loop timed out after 5 minutes with no useful diagnostics.

Changes:
- Add 'oc wait' with label selector 'control-plane=controller-manager' in prepare_operator() to verify the controller-manager pod is running and ready after CRD becomes available (5-minute timeout)
- Add the same label-based readiness check in deploy_rhdh_operator() before applying the Backstage CR (2-minute timeout, defense in depth)
- Enhance _operator_debug_info() to show operator pod status and namespace events for better debugging of similar failures
- Align GKE operator job retry count to 3 (matching AKS and EKS)

Verified on GKE cluster gke-us-central1-c-standard-ci-rhdh:
- Positive case: correctly detects running operator pod (condition met)
- Negative case: correctly detects missing operator pod (no matching resources found)
- Recovery case: correctly waits for operator recovery after scale-up

Assisted-by: OpenCode

sonarqubecloud · 2026-05-21T13:30:28Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-05-21T13:30:38Z

The container image build workflow finished with status: cancelled.

rhdh-qodo-merge · 2026-05-21T13:31:41Z

Code Review by Qodo

Sorry, something went wrong

We weren't able to complete the code review on our side. Please try again

rhdh-qodo-merge · 2026-05-21T13:32:12Z

Review Summary by Qodo

Add operator health checks before applying Backstage CR

🐞 Bug fix

Walkthroughs

Description

• Add operator health checks before applying Backstage CR
• Verify controller-manager pod readiness in prepare_operator()
• Add defense-in-depth check in deploy_rhdh_operator()
• Enhance debug info with pod status and namespace events
• Align GKE operator job retry count to 3

Diagram

flowchart LR
  A["Install RHDH Operator"] --> B["Wait for Backstage CRD"]
  B --> C["Health Check: Controller-Manager Pod Ready"]
  C --> D["Deploy Backstage CR"]
  D --> E["Defense-in-Depth: Verify Pod Still Running"]
  E --> F["Apply Backstage CR"]
  C -->|Failure| G["Collect Debug Info"]
  E -->|Failure| G

File Changes

1. .ci/pipelines/install-methods/operator.sh 🐞 Bug fix +37/-1

Add operator pod health checks and enhanced debugging

• Added oc wait health check in prepare_operator() to verify controller-manager pod is ready
 after CRD becomes available (5-minute timeout)
• Added defense-in-depth health check in deploy_rhdh_operator() before applying Backstage CR
 (2-minute timeout)
• Enhanced _operator_debug_info() to display operator pod status, namespace events, and operator
 logs for better diagnostics
• Uses label selector control-plane=controller-manager to target the operator pod precisely

.ci/pipelines/install-methods/operator.sh

2. .ci/pipelines/jobs/gke-operator.sh ⚙️ Configuration changes +1/-1

Align GKE operator retry count to 3
• Updated prepare_operator call to pass retry count of "3" to align with AKS and EKS job
 configurations
.ci/pipelines/jobs/gke-operator.sh
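
As a rough illustration of the diagnostics described for _operator_debug_info() (pod status, namespace events, operator logs), a collector along these lines would cover all three. The actual function body is not shown in this thread, so `operator_debug_info_sketch`, the section headers, and the 100-line log tail are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the PR's real _operator_debug_info() may differ.
# Usage: operator_debug_info_sketch <namespace>
operator_debug_info_sketch() {
  local ns="$1"
  local selector="control-plane=controller-manager"

  # Each command is allowed to fail so one missing resource
  # does not hide the remaining diagnostics.
  echo "=== Operator pods in ${ns} ==="
  oc get pods -n "$ns" -l "$selector" -o wide || true

  echo "=== Events in ${ns} (oldest first) ==="
  oc get events -n "$ns" --sort-by=.lastTimestamp || true

  echo "=== Operator logs (last 100 lines) ==="
  oc logs -n "$ns" -l "$selector" --tail=100 || true
}
```

Sorting events by timestamp puts the most recent entries at the bottom of the output, which is where evictions or crash loops of the operator pod would most likely show up in a CI log.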

zdrapela · 2026-05-21T13:54:45Z

/test e2e-gke-operator-nightly

github-actions · 2026-05-21T14:04:02Z

Image was built and published successfully. It is available at:

github-actions · 2026-05-29T02:19:27Z

This PR is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 21 days.

openshift-ci Bot added the do-not-merge/work-in-progress label May 21, 2026

zdrapela force-pushed the fix/ci-operator-health-check branch from 5404a23 to 19ef974 Compare May 21, 2026 13:29

zdrapela marked this pull request as ready for review May 21, 2026 13:29

openshift-ci Bot removed the do-not-merge/work-in-progress label May 21, 2026

openshift-ci Bot requested review from HusneShabbir and gustavolira May 21, 2026 13:29

rhdh-qodo-merge Bot added Enhancement Other Bug fix labels May 21, 2026

github-actions Bot added Stale and removed Stale labels May 29, 2026
