Skip to content

fix(ci): deploy showcase-runtime with internal DB and fix test ordering#4809

Draft
zdrapela wants to merge 2 commits into
redhat-developer:mainfrom
zdrapela:fix/showcase-runtime-test-ordering
Draft

fix(ci): deploy showcase-runtime with internal DB and fix test ordering#4809
zdrapela wants to merge 2 commits into
redhat-developer:mainfrom
zdrapela:fix/showcase-runtime-test-ordering

Conversation

@zdrapela
Copy link
Copy Markdown
Member

Summary

  • Deploy showcase-runtime with internal PostgreSQL (Helm sub-chart / operator-managed) instead of external Crunchy DB
  • Fix Playwright project dependency ordering so showcase-runtime tests actually run instead of being skipped
  • Remove Crunchy PostgresCluster dependency from runtime namespace

Problem

In CI, showcase-runtime tests (config-map, schema-mode) were always skipped because showcase-runtime-db (external DB tests) had failures that blocked the dependent showcase-runtime project via Playwright dependencies. The dependency direction was wrong.

Changes

CI pipeline:

  • Reverse Playwright project dependency: showcase-runtime runs first, showcase-runtime-db depends on it
  • New Helm values (values-showcase-runtime.yaml) with postgresql.enabled=true (internal DB)
  • New operator CR (rhdh-start-runtime-local.yaml) with internal DB, flavours: [] to disable lightspeed sidecar
  • New minimal app-config (app-config-rhdh-runtime.yaml) without backend.database
  • Operator uses separate rhdh-runtime-config secret for RHDH_RUNTIME_URL (avoids POSTGRES_* env var conflicts with internal DB)
  • Disable keycloak/github plugins in operator runtime dynamic-plugins (no config available)
  • Remove unused files: old external-DB CR, values, app-config, and dead initiate_runtime_deployment() function

E2E tests:

  • External DB tests use prepareForExternalDatabase() to switch from internal to external DB at runtime (patches ConfigMap + adds env vars via JSON patch)
  • Extract waitForRuntimeDeploymentReady() utility for reuse
  • Fix checkPodFailureStates() to treat ephemeral volume PVC provisioning as transient (not fatal)

Verification

Tested on OCP cluster with both deployment methods:

Deployment RHDH Starts config-map.spec.ts
Helm (internal PostgreSQL sub-chart) 2/2 Ready, HTTP 200 PASSED (1.3m)
Operator (internal DB, with lightspeed) 2/2 Ready, HTTP 200 PASSED (3.2m)
Operator (internal DB, flavours: []) 1/1 Ready, HTTP 200 PASSED

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 14, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@zdrapela
Copy link
Copy Markdown
Member Author

/review

@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-operator-nightly

@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-v4-18-helm-nightly

@rhdh-qodo-merge
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Warning

/review is deprecated. Use /agentic_review instead (removal date not yet scheduled).

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 Security concerns

TLS verification disabled:
.ci/pipelines/resources/rhdh-operator/rhdh-start-runtime-local.yaml sets NODE_TLS_REJECT_UNAUTHORIZED=0, which disables TLS certificate verification for Node.js. This weakens transport security and can hide real certificate/hostname problems. It should be avoided or tightly scoped to CI-only scenarios, preferring proper CA configuration (e.g., postgres-crt + NODE_EXTRA_CA_CERTS) instead.

⚡ Recommended focus areas for review

Security

The operator CR injects NODE_TLS_REJECT_UNAUTHORIZED=0, which disables TLS certificate verification for Node.js. This can mask real TLS issues in the external DB tests and is risky if it ever leaks beyond CI/test contexts. Consider removing it, scoping it to only the specific test pod/job that needs it, or relying on NODE_EXTRA_CA_CERTS/proper CA injection instead.

    envs:
      - name: NODE_TLS_REJECT_UNAUTHORIZED
        value: "0"
    secrets:
      - name: rhdh-runtime-config
  route:
    enabled: true
# Disable all default flavours (e.g. lightspeed) to avoid unnecessary sidecar
# containers and init containers that slow down startup and restarts.
# Runtime tests don't need lightspeed — they only test ConfigMap changes and DB connectivity.
flavours: []
📚 Focus areas based on broader codebase context

Possible Issue

prepareForExternalDatabase() patches app-config to set backend.database with only a connection block (host/port/user/password) and omits key fields like client, database (name), and any SSL configuration. This may leave runtime behavior dependent on whatever defaults/previous config exists, and can cause external DB tests to fail or be flaky depending on the base deployment config. (Ref 4)

export async function prepareForExternalDatabase(
  kubeClient: KubeClient,
  namespace: string,
  deploymentName: string,
): Promise<void> {
  // --- 1. Remove stale POSTGRES_* env vars patched onto the deployment ---
  // Schema-mode tests may have added individual secretKeyRef env vars pointing
  // to a *-postgresql secret. These override the bulk envFrom injection from
  // postgres-cred and must be removed before external DB tests.
  await removeSchemaModePatchedEnvVars(kubeClient, deploymentName, namespace);

  // --- 2. Patch app-config ConfigMap to use external DB connection ---
  const configMapName = await kubeClient.findAppConfigMap(namespace);
  const configMapResponse = await kubeClient.getConfigMap(
    configMapName,
    namespace,
  );
  const configMap = configMapResponse.body;
  const configKey = Object.keys(configMap.data || {}).find((key) =>
    key.includes("app-config"),
  );

  if (!configKey || !configMap.data) {
    throw new Error(
      `No app-config data key found in ConfigMap '${configMapName}'`,
    );
  }

  const appConfig = yaml.load(configMap.data[configKey]) as AppConfigYaml;

  console.log(
    "Patching app-config to use external database connection (env var placeholders)...",
  );
  appConfig.backend = appConfig.backend || {};
  appConfig.backend.database = {
    connection: {
      host: "${POSTGRES_HOST}",
      port: "${POSTGRES_PORT}",
      user: "${POSTGRES_USER}",
      password: "${POSTGRES_PASSWORD}",
    },
  };

  configMap.data[configKey] = yaml.dump(appConfig);
  delete configMap.metadata?.creationTimestamp;
  delete configMap.metadata?.resourceVersion;

  await kubeClient.coreV1Api.replaceNamespacedConfigMap(
    configMapName,
    namespace,
    configMap,
  );
  console.log("App-config patched for external database connection");

  // --- 3. Add POSTGRES_* env vars to the deployment via secretKeyRef ---
  // The deployment starts with internal DB (no postgres-cred env vars).
  // Add individual env vars pointing to the postgres-cred secret so the
  // app-config ${POSTGRES_HOST} etc. placeholders resolve correctly.
  await ensurePostgresCredEnvVars(kubeClient, deploymentName, namespace);
}

Reference reasoning: The existing schema-mode setup shows a more complete and explicit backend.database config, including client: "pg", connection.database: "${POSTGRES_DB}", and ssl settings. Aligning the external-db patch to include these fields would make the external DB switch deterministic and consistent with how the repo configures DB connections elsewhere.

📄 References
  1. redhat-developer/rhdh/e2e-tests/playwright/e2e/external-database/verify-tls-config-with-external-crunchy.spec.ts [1-37]
  2. redhat-developer/rhdh/e2e-tests/playwright.config.ts [205-217]
  3. redhat-developer/rhdh/e2e-tests/playwright/e2e/plugin-division-mode-schema/verify-schema-mode.spec.ts [78-186]
  4. redhat-developer/rhdh/e2e-tests/playwright/e2e/plugin-division-mode-schema/schema-mode-setup.ts [27-354]

@zdrapela
Copy link
Copy Markdown
Member Author

/agentic_review

@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 14, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)

Grey Divider


Remediation recommended

1. Duplicate plugin disable entries 🐞 Bug ≡ Correctness
Description
run_operator_runtime_config_change_tests() appends new dynamic-plugin entries with disabled: true
for GitHub/Keycloak modules, but those same packages already exist in the generated config (from
values_showcase.yaml) with disabled: false, creating conflicting duplicates. Depending on how
the dynamic plugin loader resolves duplicates, the plugins may remain enabled and still fail at
runtime due to missing configuration in the minimal runtime namespace.
Code

.ci/pipelines/jobs/ocp-operator.sh[R104-111]

+  # Disable plugins that require external configuration (GitHub org, Keycloak, etc.)
+  # not available in the minimal runtime namespace.
+  cat >> /tmp/configmap-dynamic-plugins-runtime.yaml << 'RUNTIME_PLUGINS_EOF'
+      - package: ./dynamic-plugins/dist/backstage-plugin-catalog-backend-module-github-dynamic
+        disabled: true
+      - package: ./dynamic-plugins/dist/backstage-community-plugin-catalog-backend-module-keycloak-dynamic
+        disabled: true
+RUNTIME_PLUGINS_EOF
Relevance

⭐⭐⭐ High

Team previously fixed CI issues from duplicate/Conflicting dynamic-plugin entries by adding
dedupe/override logic in values merging.

PR-#2417
PR-#4791

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The runtime dynamic-plugins ConfigMap is generated from values_showcase.yaml (per
HELM_CHART_VALUE_FILE_NAME) and that file already includes both packages with disabled: false;
the operator job then appends the same packages with disabled: true, resulting in duplicates with
conflicting flags.

.ci/pipelines/jobs/ocp-operator.sh[85-114]
.ci/pipelines/env_variables.sh[28-32]
.ci/pipelines/lib/config.sh[73-99]
.ci/pipelines/value_files/values_showcase.yaml[20-49]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`ocp-operator.sh` generates the runtime `dynamic-plugins` ConfigMap from `values_showcase.yaml` (which already includes the GitHub + Keycloak catalog backend modules as `disabled: false`) and then appends *additional* entries for the same packages marked `disabled: true`. This produces conflicting duplicate entries for the same package and relies on unspecified “last one wins” behavior.

## Issue Context
- Runtime dynamic-plugins ConfigMap is produced from `.global.dynamic` in the base values file, so the runtime file already contains these plugin entries.
- Appending duplicates makes the effective state ambiguous and can result in the plugins still being enabled.

## Fix Focus Areas
- .ci/pipelines/jobs/ocp-operator.sh[103-112]
- .ci/pipelines/value_files/values_showcase.yaml[20-49]
- .ci/pipelines/env_variables.sh[28-31]
- .ci/pipelines/lib/config.sh[73-99]

## Suggested fix approach
1. After generating `/tmp/configmap-dynamic-plugins-runtime.yaml`, *edit* the YAML to set `disabled: true` for existing entries whose `.package` matches the two packages (use `yq`), rather than appending new list items.
2. Alternatively, generate runtime dynamic-plugins from a dedicated runtime values file (with those packages removed/disabled) instead of mutating the generated ConfigMap via `cat >>`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. PVC failures treated transient 🐞 Bug ☼ Reliability
Description
checkPodFailureStates() treats PodScheduled=False as transient when the condition message contains
the substring persistentvolumeclaim, which can match genuine unbound-PVC scheduling failures and
suppress fast-fail detection. This can convert real infra failures into long waits until the outer
deployment-ready timeout expires.
Code

e2e-tests/playwright/utils/kube-client.ts[R630-640]

+            // Ephemeral volume PVC provisioning causes a brief Unschedulable state
+            // that resolves once the PVC is created. Treat as transient, not fatal.
+            if (
+              condition.message?.includes("ephemeral volume controller") ||
+              condition.message?.includes("persistentvolumeclaim")
+            ) {
+              console.log(
+                `Pod ${podName} waiting for ephemeral volume provisioning (transient)`,
+              );
+              continue;
+            }
Relevance

⭐⭐ Medium

Repo favors fast-fail pod scheduling/startup errors, but no prior guidance on treating
'persistentvolumeclaim' as transient.

PR-#3830
PR-#4414

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The new conditional explicitly continues (suppresses the error return) for any
PodScheduled=False with a message containing persistentvolumeclaim; since
waitForDeploymentReady() loops until timeout, this can delay failure reporting significantly.

e2e-tests/playwright/utils/kube-client.ts[623-642]
e2e-tests/playwright/utils/kube-client.ts[719-822]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`checkPodFailureStates()` now considers any `PodScheduled=False` condition whose message contains `persistentvolumeclaim` as transient and continues waiting. This substring is broad and can match real scheduling failures (e.g., PVC never binds / no storage class), which should remain fast-fail.

## Issue Context
This logic runs inside `waitForDeploymentReady()`. When a real PVC bind failure is masked, the loop continues until the overall timeout, slowing CI feedback and making failures less actionable.

## Fix Focus Areas
- e2e-tests/playwright/utils/kube-client.ts[623-642]
- e2e-tests/playwright/utils/kube-client.ts[719-822]

## Suggested fix approach
- Restrict the transient condition to a more specific message pattern tied to ephemeral volume provisioning (e.g., match only `ephemeral volume controller`), OR
- Gate the transient path behind additional checks (e.g., only for pods known to use `ephemeral.volumeClaimTemplate`, or only for messages that explicitly indicate PVC creation is in progress), and keep other `persistentvolumeclaim` scheduling failures as fatal.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-helm-nightly

@zdrapela zdrapela force-pushed the fix/showcase-runtime-test-ordering branch from 4ec83ad to d383484 Compare May 14, 2026 13:49
@github-actions
Copy link
Copy Markdown
Contributor

The container image build workflow finished with status: cancelled.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.60%. Comparing base (fe1ffcc) to head (dbba412).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4809       +/-   ##
===========================================
+ Coverage   41.09%   69.60%   +28.51%     
===========================================
  Files         118      111        -7     
  Lines        2217     4702     +2485     
  Branches      563      513       -50     
===========================================
+ Hits          911     3273     +2362     
- Misses       1301     1429      +128     
+ Partials        5        0        -5     
Flag Coverage Δ
install-dynamic-plugins 92.44% <ø> (?)
rhdh 38.81% <ø> (-2.29%) ⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe1ffcc...dbba412. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@zdrapela zdrapela force-pushed the fix/showcase-runtime-test-ordering branch from d383484 to af1c911 Compare May 14, 2026 14:06
@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-operator-nightly

@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-helm-nightly

@github-actions
Copy link
Copy Markdown
Contributor

The container image build workflow finished with status: cancelled.

@zdrapela zdrapela changed the title fix(ci): deploy showcase-runtime with internal DB and fix test ordering fix(ci): deploy showcase-runtime with internal DB and fix test ordering May 14, 2026
@zdrapela zdrapela changed the title fix(ci): deploy showcase-runtime with internal DB and fix test ordering fix(ci): deploy showcase-runtime with internal DB and fix test orderind May 14, 2026
@zdrapela zdrapela changed the title fix(ci): deploy showcase-runtime with internal DB and fix test orderind fix(ci): deploy showcase-runtime with internal DB and fix test ordering May 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Image was built and published successfully. It is available at:

@zdrapela
Copy link
Copy Markdown
Member Author

/retest

@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-operator-nightly

@zdrapela zdrapela force-pushed the fix/showcase-runtime-test-ordering branch from af1c911 to 91c4806 Compare May 15, 2026 06:10
@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-helm-nightly

@github-actions
Copy link
Copy Markdown
Contributor

Image was built and published successfully. It is available at:

@zdrapela zdrapela force-pushed the fix/showcase-runtime-test-ordering branch from 91c4806 to ce2f25e Compare May 15, 2026 15:34
@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-helm-nightly

@github-actions
Copy link
Copy Markdown
Contributor

Image was built and published successfully. It is available at:

@zdrapela zdrapela force-pushed the fix/showcase-runtime-test-ordering branch from 1ac8a0c to 57931fb Compare May 18, 2026 12:31
@github-actions
Copy link
Copy Markdown
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-helm-nightly

@zdrapela zdrapela force-pushed the fix/showcase-runtime-test-ordering branch from 57931fb to 1be8d8a Compare May 19, 2026 11:31
@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-helm-nightly

@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-operator-nightly

@github-actions
Copy link
Copy Markdown
Contributor

Image was built and published successfully. It is available at:

@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-operator-nightly

@zdrapela zdrapela force-pushed the fix/showcase-runtime-test-ordering branch from 1be8d8a to dbba412 Compare May 20, 2026 09:26
@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-operator-nightly

@github-actions
Copy link
Copy Markdown
Contributor

Image was built and published successfully. It is available at:

Deploy showcase-runtime with internal PostgreSQL (Helm sub-chart / operator-managed)
instead of external Crunchy DB. This makes the runtime deployment self-contained and
fixes the test ordering issue where showcase-runtime tests were always skipped because
showcase-runtime-db failures blocked them via Playwright project dependencies.

Changes:
- Reverse Playwright project dependency: showcase-runtime runs first (no deps),
  showcase-runtime-db depends on it (runs after)
- Pipeline calls showcase-runtime-db which triggers showcase-runtime as dependency
- New Helm values (values-showcase-runtime.yaml) with postgresql.enabled=true
- New operator CR (rhdh-start-runtime-local.yaml) with internal DB, uses
  rhdh-runtime-config secret for RHDH_RUNTIME_URL (separate from postgres-cred
  to avoid POSTGRES_* env vars overriding operator's internal DB config)
- New minimal app-config (app-config-rhdh-runtime.yaml) without backend.database
- External DB tests use prepareForExternalDatabase() to switch from internal to
  external DB at runtime (patches ConfigMap + adds env vars via JSON patch)
- Extract waitForRuntimeDeploymentReady() utility for reuse
- Remove Crunchy PostgresCluster dependency from runtime namespace

Verified on OCP cluster:
- Helm: deployment + config-map.spec.ts PASSED
- Operator: deployment succeeded (2/2 ready, HTTP 200), config-map test
  ConfigMap update worked, restart hit transient PVC provisioning delay
  (sandbox cluster infra issue, not code bug)

Assisted-by: OpenCode
@zdrapela zdrapela force-pushed the fix/showcase-runtime-test-ordering branch from dbba412 to 699ad7e Compare May 20, 2026 11:50
@github-actions
Copy link
Copy Markdown
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-operator-nightly

Make schema-mode (pluginDivisionMode: schema) tests work on operator
deployments using the internal operator-managed PostgreSQL.

Changes:
- schema-mode-setup.ts: skip ensureDeploymentEnvVars() for operator
  (operator injects env vars via envFrom from managed secret, direct
  Deployment patches would be reverted by reconciliation)
- schema-mode-setup.ts: preserve POSTGRESQL_ADMIN_PASSWORD when updating
  the operator-managed secret (PostgreSQL image needs it on startup)
- schema-mode-setup.ts: add restart retry logic (3 attempts, 30s delay)
  to handle transient operator reconciliation failures
- verify-schema-mode.spec.ts: use isOperatorDeployment() from helper.ts
  instead of INSTALL_METHOD env var for install method detection
- verify-schema-mode.spec.ts: increase test.setTimeout to 600000ms
  (operator restarts are slower due to reconciliation)
- schema-mode-env.sh: accept 3rd parameter install_method, use in log
  messages for better debugging
- ocp-operator.sh: pass 'operator' as 3rd arg to
  configure_schema_mode_runtime_env

Assisted-by: OpenCode
@github-actions
Copy link
Copy Markdown
Contributor

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

@sonarqubecloud
Copy link
Copy Markdown

@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-operator-nightly

@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-v4.18-helm-nightly

@zdrapela
Copy link
Copy Markdown
Member Author

/test e2e-ocp-helm-nightly

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 21, 2026

@zdrapela: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-ocp-helm-nightly 74b142a link false /test e2e-ocp-helm-nightly

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 26, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant