fix(ci): add operator health check before applying Backstage CR #4868

Open
zdrapela wants to merge 1 commit into redhat-developer:main from zdrapela:fix/ci-operator-health-check

@zdrapela
Member

Problem

The GKE operator nightly job failed because the RHDH operator controller-manager pod disappeared after installation. When the Backstage CR was applied, there was no operator to reconcile it, so the deployment was never created. The 5-minute wait timed out with:

Timeout waiting for: Backstage deployment created by operator
Backstage deployment not created after 5 minutes
Checking operator logs...
No resources found in rhdh-operator namespace.

No E2E tests ran at all.

Root Cause

There was no health check verifying the operator controller-manager pod was actually running before applying the Backstage CR. The CRD could become available while the operator pod was still starting, crashing, or being evicted.
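
The gap described above can be illustrated with two independent gates, one for the CRD and one for the operator pod. This is a minimal sketch, not the code from this PR: the function name and the default CRD name are assumptions, while the namespace and the `control-plane=controller-manager` label are the ones mentioned in this conversation.

```shell
# Sketch: the CRD being registered says nothing about the operator pod,
# so each is checked separately.
# Assumptions: function name and default CRD name are illustrative only.
wait_for_crd_and_operator() {
  gate_ns="$1"
  gate_crd="${2:-backstages.rhdh.redhat.com}"  # assumed CRD name

  # Gate 1: the CRD is established in the API server.
  oc wait --for=condition=Established "crd/${gate_crd}" --timeout=300s || return 1

  # Gate 2: the controller-manager pod that reconciles the CR is Ready.
  oc wait -n "${gate_ns}" pod -l control-plane=controller-manager \
    --for=condition=Ready --timeout=300s || return 2
}
```

A return code of 2 corresponds to the failure mode in this job: the CRD exists, but nothing is running to reconcile a Backstage CR.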

Changes

.ci/pipelines/install-methods/operator.sh:

  • Added k8s_wait::deployment in prepare_operator() to verify the controller-manager pod is running and ready after the CRD becomes available (5-minute timeout)
  • Added k8s_wait::deployment in deploy_rhdh_operator() before applying the Backstage CR as a defense-in-depth check (2-minute timeout)
  • Enhanced _operator_debug_info() to show operator pod status and namespace events

.ci/pipelines/jobs/gke-operator.sh:

  • Aligned prepare_operator retry count to "3" (matching AKS and EKS jobs which already use 3 retries)
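
For readers unfamiliar with the retry count, the idea of bounded retries can be sketched as below. The helper is hypothetical and is not the retry logic inside prepare_operator, which is not shown in this PR; it only illustrates what "3 retries" means for a step that may fail transiently.

```shell
# Hypothetical helper: run a command up to max_attempts times,
# sleeping delay_seconds between failed attempts.
retry_command() {
  retry_max="$1"
  retry_delay="$2"
  shift 2
  retry_attempt=1
  while :; do
    # Success on any attempt ends the loop.
    "$@" && return 0
    # Give up once the attempt budget is spent.
    [ "$retry_attempt" -ge "$retry_max" ] && return 1
    retry_attempt=$((retry_attempt + 1))
    sleep "$retry_delay"
  done
}
```

A call such as `retry_command 3 10 some_install_step` (where `some_install_step` is a placeholder) would try the step at most three times with a 10-second pause between attempts.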

Verification

Tested on the GKE cluster (gke-us-central1-c-standard-ci-rhdh) with a real RHDH operator deployment:

  • Positive case: Correctly detects running operator pod (phase=Running, ready=True)
  • Negative case: Correctly detects missing operator pod (scaled deployment to 0)
  • Recovery case: Correctly detects operator recovery after scale-up
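
The negative and recovery cases could be scripted roughly as follows. This is a sketch under stated assumptions: the function name, the deployment name argument, and the settle delay are illustrative, and the readiness check is passed in as a command so any implementation can be exercised.

```shell
# Sketch: scale the operator to 0 and expect the check to fail,
# then scale back to 1 and expect the check to pass.
# Assumptions: deployment name and settle delay are supplied by the caller.
exercise_operator_health_check() {
  ex_ns="$1"
  ex_deployment="$2"
  ex_check_cmd="$3"
  ex_settle_seconds="${4:-30}"

  # Negative case: no operator pod should mean a failing check.
  oc scale "deployment/${ex_deployment}" -n "$ex_ns" --replicas=0 || return 1
  sleep "$ex_settle_seconds"  # give the pod time to terminate
  if "$ex_check_cmd"; then
    echo "FAIL: check passed while the operator was scaled to 0" >&2
    oc scale "deployment/${ex_deployment}" -n "$ex_ns" --replicas=1
    return 1
  fi

  # Recovery case: the check should pass again after scale-up.
  oc scale "deployment/${ex_deployment}" -n "$ex_ns" --replicas=1 || return 1
  if ! "$ex_check_cmd"; then
    echo "FAIL: check did not pass after scale-up" >&2
    return 1
  fi
  echo "PASS"
}
```

The check command is expected to do its own waiting in the recovery case, since a freshly scheduled pod takes time to become Ready.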

Initial implementation used "controller-manager" as the pod name pattern for k8s_wait::deployment, but cluster testing revealed the pod is named rhdh-operator-<hash>, not *controller-manager*. Fixed to use ${OPERATOR_MANAGER} (= rhdh-operator) which correctly matches the pod name prefix.

@openshift-ci

openshift-ci Bot commented May 21, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all.

@zdrapela
Member Author

/agentic_review

@zdrapela
Member Author

/test ?

@zdrapela
Member Author

/test pull-ci-redhat-developer-rhdh-main-e2e-gke-operator-pull

@rhdh-qodo-merge

rhdh-qodo-merge Bot commented May 21, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)


Remediation recommended

1. Readiness check matches wrong pod 🐞 Bug ☼ Reliability
Description
The new operator “health check” calls k8s_wait::deployment with resource_name set to the operator
namespace (e.g., "rhdh-operator"), but k8s_wait::deployment only checks the first pod whose name
matches that string, not the controller-manager pod specifically. This can incorrectly succeed while
the controller-manager pod is absent/unready, allowing the Backstage CR to be applied without an
active reconciler.
Code

.ci/pipelines/install-methods/operator.sh[R65-67]

Relevance

⭐⭐⭐ High

Team previously fixed “wrong pod match” by making selection deterministic (tail -n1) and moved
toward label-based selectors.

PR-#3075
PR-#4414

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
OPERATOR_MANAGER is the operator namespace, but k8s_wait::deployment does not wait on a Deployment
resource or a specific label; it greps pod names and checks only the first match. Meanwhile,
operator debug/log collection targets the controller-manager via label selector, indicating the
intended target pod is label-identified rather than “first pod whose name contains the namespace
string.”

.ci/pipelines/env_variables.sh[61-71]
.ci/pipelines/lib/k8s-wait.sh[15-46]
.ci/pipelines/lib/k8s-wait.sh[33-41]
.ci/pipelines/install-methods/operator.sh[52-69]
.ci/pipelines/install-methods/operator.sh[90-99]
.ci/pipelines/install-methods/operator.sh[142-146]
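
The failure mode in the evidence above can be reproduced without a cluster. The sketch below mimics the described `grep <resource_name> | head -n 1` lookup against a made-up pod listing; the function name and both pod names are hypothetical and chosen only to show how a shared name prefix selects the wrong pod.

```shell
# Mimic the name-based lookup: first pod in a listing whose line contains the pattern.
first_pod_by_name() {
  grep -- "$1" | head -n 1 | awk '{print $1}'
}

# Hypothetical listing: two pods share the "rhdh-operator" prefix,
# and only the second one is the controller.
sample_listing='rhdh-operator-helper-abc12      1/1   Running
rhdh-operator-7d9f6c5b8-x2k4q   0/1   CrashLoopBackOff'

printf '%s\n' "$sample_listing" | first_pod_by_name "rhdh-operator"
# prints: rhdh-operator-helper-abc12
```

In this made-up listing the healthy helper pod is picked, so a phase/Ready check on it would pass while the controller is crash-looping. A label selector avoids this because it does not depend on pod names or listing order.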

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The operator readiness gate uses `k8s_wait::deployment "$ns" "$ns" ...`, but `k8s_wait::deployment` does a grep-based lookup (`oc get pods | grep <resource_name> | head -n 1`) and verifies only that single pod’s phase/Ready condition. Because the new call passes the namespace string as the match term, the check is not guaranteed to validate the controller-manager pod.

### Issue Context
- `OPERATOR_MANAGER` is used as the operator namespace (`namespace::configure "${OPERATOR_MANAGER}"`, delete namespace, etc.).
- `_operator_debug_info()` already assumes the controller-manager is identified by label `control-plane=controller-manager`, but the new readiness wait does not use that selector.

### Fix Focus Areas
- Replace the new waits in `prepare_operator()` and `deploy_rhdh_operator()` with a label-based readiness check, e.g.:
 - `oc wait -n "$operator_ns" pod -l control-plane=controller-manager --for=condition=Ready --timeout=5m`
 - and for the defense-in-depth check: `--timeout=2m`
 - optionally assert at least one matching pod exists before waiting.
- Alternatively, enhance `k8s_wait::deployment` (or add a new helper) to support label selectors and use it here.

#### Target lines
- .ci/pipelines/install-methods/operator.sh[60-68]
- .ci/pipelines/install-methods/operator.sh[90-99]
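
The label-based check with the optional existence assertion could look roughly like this. It is a sketch, not the implementation that landed in the PR: the function name and the 5-second polling interval are assumptions. The existence poll is there because `oc wait` with a selector that matches nothing reports "no matching resources found" instead of waiting for a pod to appear.

```shell
# Sketch: wait for a pod selected by label to exist, then to become Ready.
# Assumptions: function name and polling interval are illustrative.
wait_for_labeled_pod_ready() {
  wait_ns="$1"
  wait_selector="$2"
  wait_timeout_seconds="${3:-300}"
  wait_elapsed=0

  # Poll until at least one pod matches the selector.
  while [ -z "$(oc get pods -n "$wait_ns" -l "$wait_selector" -o name 2>/dev/null)" ]; do
    if [ "$wait_elapsed" -ge "$wait_timeout_seconds" ]; then
      echo "No pod matching '${wait_selector}' in namespace '${wait_ns}'" >&2
      return 1
    fi
    sleep 5
    wait_elapsed=$((wait_elapsed + 5))
  done

  # A matching pod exists, so the Ready condition can be awaited.
  oc wait -n "$wait_ns" pod -l "$wait_selector" \
    --for=condition=Ready --timeout="${wait_timeout_seconds}s"
}
```

For the two call sites named in the review, this would be invoked with `control-plane=controller-manager` and a timeout of 300 or 120 seconds respectively.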


@codecov

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.60%. Comparing base (b08abdf) to head (19ef974).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4868       +/-   ##
===========================================
+ Coverage   41.03%   69.60%   +28.57%     
===========================================
  Files         121      111       -10     
  Lines        2220     4702     +2482     
  Branches      562      536       -26     
===========================================
+ Hits          911     3273     +2362     
- Misses       1304     1428      +124     
+ Partials        5        1        -4     
Flag                      Coverage Δ
install-dynamic-plugins   92.44% <ø> (?)
rhdh                      38.81% <ø> (-2.23%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b08abdf...19ef974. Read the comment docs.


The GKE operator nightly job failed because the operator controller-manager
pod disappeared after installation, causing the Backstage CR to never be
reconciled into a deployment. The deployment wait loop timed out after 5
minutes with no useful diagnostics.

Changes:
- Add 'oc wait' with label selector 'control-plane=controller-manager' in
  prepare_operator() to verify the controller-manager pod is running and
  ready after CRD becomes available (5-minute timeout)
- Add the same label-based readiness check in deploy_rhdh_operator()
  before applying the Backstage CR (2-minute timeout, defense in depth)
- Enhance _operator_debug_info() to show operator pod status and
  namespace events for better debugging of similar failures
- Align GKE operator job retry count to 3 (matching AKS and EKS)
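
The kind of diagnostics described in the third bullet could be gathered as sketched below. This is not the body of _operator_debug_info(), which is not shown here; the function name is hypothetical, and only the label selector comes from this conversation.

```shell
# Sketch: print pod status, recent events, and controller logs for a namespace.
# Each command is best-effort so one failure does not hide the rest.
operator_debug_snapshot() {
  debug_ns="$1"

  echo "--- Pods in ${debug_ns} ---"
  oc get pods -n "$debug_ns" -o wide || true

  echo "--- Events in ${debug_ns} (oldest first) ---"
  oc get events -n "$debug_ns" --sort-by=.lastTimestamp || true

  echo "--- Controller-manager logs (last 100 lines) ---"
  oc logs -n "$debug_ns" -l control-plane=controller-manager --tail=100 || true
}
```

Events are useful in the disappearing-pod scenario because evictions and scheduling failures are recorded there even after the pod itself is gone.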

Verified on GKE cluster gke-us-central1-c-standard-ci-rhdh:
- Positive case: correctly detects running operator pod (condition met)
- Negative case: correctly detects missing operator pod (no matching
  resources found)
- Recovery case: correctly waits for operator recovery after scale-up

Assisted-by: OpenCode
@zdrapela zdrapela force-pushed the fix/ci-operator-health-check branch from 5404a23 to 19ef974 Compare May 21, 2026 13:29
@zdrapela zdrapela marked this pull request as ready for review May 21, 2026 13:29
@github-actions
Contributor

The container image build workflow finished with status: cancelled.

@rhdh-qodo-merge

Code Review by Qodo

Sorry, something went wrong

We weren't able to complete the code review on our side. Please try again

@rhdh-qodo-merge

Review Summary by Qodo

Add operator health checks before applying Backstage CR

🐞 Bug fix

Walkthroughs

Description
• Add operator health checks before applying Backstage CR
• Verify controller-manager pod readiness in prepare_operator()
• Add defense-in-depth check in deploy_rhdh_operator()
• Enhance debug info with pod status and namespace events
• Align GKE operator job retry count to 3
Diagram
flowchart LR
  A["Install RHDH Operator"] --> B["Wait for Backstage CRD"]
  B --> C["Health Check: Controller-Manager Pod Ready"]
  C --> D["Deploy Backstage CR"]
  D --> E["Defense-in-Depth: Verify Pod Still Running"]
  E --> F["Apply Backstage CR"]
  C -->|Failure| G["Collect Debug Info"]
  E -->|Failure| G

File Changes

1. .ci/pipelines/install-methods/operator.sh 🐞 Bug fix +37/-1

Add operator pod health checks and enhanced debugging

• Added oc wait health check in prepare_operator() to verify controller-manager pod is ready
 after CRD becomes available (5-minute timeout)
• Added defense-in-depth health check in deploy_rhdh_operator() before applying Backstage CR
 (2-minute timeout)
• Enhanced _operator_debug_info() to display operator pod status, namespace events, and operator
 logs for better diagnostics
• Uses label selector control-plane=controller-manager to target the operator pod precisely

.ci/pipelines/install-methods/operator.sh


2. .ci/pipelines/jobs/gke-operator.sh ⚙️ Configuration changes +1/-1

Align GKE operator retry count to 3

• Updated prepare_operator call to pass retry count of "3" to align with AKS and EKS job
 configurations

.ci/pipelines/jobs/gke-operator.sh


@zdrapela
Member Author

/test e2e-gke-operator-nightly

@github-actions
Contributor

Image was built and published successfully. It is available at:

@github-actions
Contributor

This PR is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 21 days.

@github-actions github-actions Bot added Stale and removed Stale labels May 29, 2026