OCPEDGE-2386: Add two-node-resilience test suite with 5 etcd resilience tests.#30880
lucaconsalvi wants to merge 6 commits into openshift:main
Conversation
Introduces the openshift/two-node-regression suite with 5 regression tests that validate podman-etcd resource agent behavior under disruptive conditions:

- OCP-88178: learner_node CRM attribute cleanup during stop/start
- OCP-88179: active resource count excludes stopping resources
- OCP-88180: simultaneous stop delay prevents WAL corruption
- OCP-88181: coordinated recovery after etcd container kill
- OCP-88213: attribute retry during force-new-cluster recovery

Also adds shared pacemaker/CRM utilities to utils/common.go and updates the openshift/two-node suite qualifier to exclude regression tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lucaconsalvi. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files.
🧹 Nitpick comments (2)

test/extended/two_node/tnf_resilience.go (2)

**576-586: Redundant logging loop before assertion loop.** Lines 576-582 log node presence, and lines 583-586 assert the same condition. The logging is redundant, since assertion failures already include the failure message.

♻️ Suggested consolidation:

```diff
-	for _, node := range nodes {
-		if strings.Contains(failedSection, node.Name) {
-			framework.Logf("Coordinated failure confirmed: node %s found in Failed Resource Actions", node.Name)
-		} else {
-			framework.Logf("Node %s NOT found in Failed Resource Actions", node.Name)
-		}
-	}
 	for _, node := range nodes {
 		o.Expect(failedSection).To(o.ContainSubstring(node.Name),
 			fmt.Sprintf("Expected Failed Resource Actions to show failure on %s for coordinated recovery", node.Name))
+		framework.Logf("Coordinated failure confirmed: node %s found in Failed Resource Actions", node.Name)
 	}
```

🤖 Prompt for AI Agents: Verify this finding against the current code and only fix it if needed. In `test/extended/two_node/tnf_resilience.go` around lines 576-586: the first loop that calls framework.Logf to report whether each node is present in failedSection is redundant, because the following assertion loop (using o.Expect with o.ContainSubstring) already verifies presence and provides failure messages. Remove the preliminary logging loop, or merge the two by performing the o.Expect check and logging via framework.Logf within the same loop; keep the assertion loop intact.
**543-546: Consider using an idiomatic Gomega assertion.** Line 546 uses `.To(o.BeNil())` while the rest of the file uses `.ShouldNot(o.HaveOccurred())` for error checks. Consider using the consistent pattern for readability.

♻️ Suggested change:

```diff
 	_, err = exutil.DebugNodeRetryWithOptionsAndChroot(
 		oc, targetNode.Name, "openshift-etcd",
 		"bash", "-c", "podman kill etcd 2>/dev/null")
-	o.Expect(err).To(o.BeNil(), "Expected to kill etcd container without command errors")
+	o.Expect(err).ShouldNot(o.HaveOccurred(), "Expected to kill etcd container without command errors")
```

🤖 Prompt for AI Agents: Verify this finding against the current code and only fix it if needed. In `test/extended/two_node/tnf_resilience.go` around lines 543-546: the test currently asserts the kill-etcd error with o.Expect(err).To(o.BeNil()), which is inconsistent with the file's idiomatic Gomega usage; replace that assertion with the project's standard pattern, o.Expect(err).ShouldNot(o.HaveOccurred()) (or the equivalent Expect(...).NotTo(HaveOccurred())), so the assertion style matches other checks in this file.
ℹ️ Review info: Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml · Review profile: CHILL · Plan: Pro · Run ID: cfb85c1b-ac14-4bc8-aef5-151a5de45974
📒 Files selected for processing (3):

- pkg/testsuites/standard_suites.go
- test/extended/two_node/tnf_resilience.go
- test/extended/two_node/utils/common.go
…tomatically skip these tests on SNO environments.
Actionable comments posted: 1

🧹 Nitpick comments (1)

test/extended/two_node/tnf_resilience.go (1)

**651-656: Potential race between learner detection and attribute setting.** The test waits for the standby node to appear as a learner in etcd (lines 628-643), then immediately queries the `learner_node` CRM attribute (line 653). However, the attribute is set by the leader's monitor action, which runs asynchronously after detecting the learner. There could be a timing gap where the etcd member list shows the learner but the attribute hasn't been set yet. Consider wrapping the attribute query in an `Eventually` block to handle this race condition.

♻️ Proposed fix:

```diff
 	// Verify learner_node attribute is set before we delete it
 	g.By("Verifying learner_node CRM attribute is set")
-	attrOutput, err := utils.QueryCRMAttribute(oc, execNode.Name, crmAttributeName)
-	o.Expect(err).ShouldNot(o.HaveOccurred(),
-		"Expected learner_node attribute to exist after force-new-cluster recovery")
-	framework.Logf("learner_node attribute value: %s", attrOutput)
+	var attrOutput string
+	o.Eventually(func() error {
+		var queryErr error
+		attrOutput, queryErr = utils.QueryCRMAttribute(oc, execNode.Name, crmAttributeName)
+		return queryErr
+	}, etcdResourceRecoveryTimeout, utils.FiveSecondPollInterval).ShouldNot(
+		o.HaveOccurred(), "Expected learner_node attribute to exist after force-new-cluster recovery")
+	framework.Logf("learner_node attribute value: %s", attrOutput)
```

🤖 Prompt for AI Agents: Verify this finding against the current code and only fix it if needed. In `test/extended/two_node/tnf_resilience.go` around lines 651-656: there is a race where the etcd learner appears before the leader's monitor sets the CRM attribute. Change the direct call to utils.QueryCRMAttribute(oc, execNode.Name, crmAttributeName) into an o.Eventually that repeatedly calls utils.QueryCRMAttribute until it returns the expected non-empty value (or no error) within a reasonable timeout, then use o.Expect on the eventual result; update the block following g.By("Verifying learner_node CRM attribute is set") to poll for the attribute instead of reading it once, so the leader's asynchronous monitor has time to set it.
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:

In `test/extended/two_node/tnf_resilience.go`:

- Around lines 317-327: The AfterEach cleanup currently ignores the error returned by utils.GetNodes, which can leave nodeList nil and cause a panic when accessing nodeList.Items. Update the call in the AfterEach closure that invokes utils.GetNodes(oc, utils.AllNodes) to capture the error (e.g., `nodeList, err := utils.GetNodes(...)`), check `if err != nil`, log a descriptive message via framework.Logf and return, and also defensively verify `nodeList != nil` before checking nodeList.Items to avoid a nil dereference.
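The guard described above can be sketched in isolation. This is a minimal, self-contained illustration of the nil/error-guard pattern; `NodeList`, `getNodes`, and `cleanupNodes` are hypothetical stand-ins for the suite's real `utils.GetNodes` helper and `AfterEach` closure, not the actual code:

```go
package main

import "fmt"

// NodeList mirrors the shape assumed for the suite's node-list type:
// a wrapper whose Items slice is accessed during cleanup.
type NodeList struct {
	Items []string
}

// getNodes is a hypothetical stand-in for utils.GetNodes(oc, utils.AllNodes);
// the real helper queries the cluster and can return an error (and a nil list).
func getNodes(fail bool) (*NodeList, error) {
	if fail {
		return nil, fmt.Errorf("API server unreachable")
	}
	return &NodeList{Items: []string{"master-0", "master-1"}}, nil
}

// cleanupNodes demonstrates the guarded pattern from the review: capture the
// error, check it together with a nil/empty list, and bail out with a warning
// instead of dereferencing a nil pointer.
func cleanupNodes(fail bool) string {
	nodeList, err := getNodes(fail)
	if err != nil || nodeList == nil || len(nodeList.Items) == 0 {
		return "Warning: Could not retrieve nodes during cleanup"
	}
	return fmt.Sprintf("cleaning %d nodes", len(nodeList.Items))
}

func main() {
	fmt.Println(cleanupNodes(true))  // guarded path: warning, no panic
	fmt.Println(cleanupNodes(false)) // normal path
}
```

The key point is that the error and the nil check happen before any field access, so a failed API call degrades to a logged warning rather than a panic mid-cleanup.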
ℹ️ Review info: Configuration used: Organization UI · Review profile: CHILL · Plan: Pro · Run ID: 494acb98-87a0-40b5-93a4-4d736eda6dae

📒 Files selected for processing (1): test/extended/two_node/tnf_resilience.go
Actionable comments posted: 3

♻️ Duplicate comments (1)

test/extended/two_node/tnf_resilience.go (1)

**322-324: ⚠️ Potential issue | 🟠 Major — Handle `GetNodes` errors in `AfterEach` cleanup.** At line 322, the `utils.GetNodes` error is ignored, and line 323 dereferences `nodeList.Items` immediately. If node retrieval fails, cleanup can panic or skip critical recovery steps.

Proposed fix:

```diff
-	nodeList, _ := utils.GetNodes(oc, utils.AllNodes)
-	if len(nodeList.Items) == 0 {
+	nodeList, err := utils.GetNodes(oc, utils.AllNodes)
+	if err != nil || nodeList == nil || len(nodeList.Items) == 0 {
 		framework.Logf("Warning: Could not retrieve nodes during cleanup")
 		return
 	}
```

🤖 Prompt for AI Agents: Verify this finding against the current code and only fix it if needed. In `test/extended/two_node/tnf_resilience.go` around lines 322-324: in the AfterEach cleanup code, don't ignore the error returned by utils.GetNodes; check the error from utils.GetNodes(oc, utils.AllNodes) before accessing nodeList.Items and handle failures (e.g., log the error with framework.Logf and skip further node-dependent cleanup) to avoid panics. Capture the error, guard any use of nodeList (including len(nodeList.Items)) behind a nil/err check, and ensure cleanup exits gracefully when GetNodes fails.
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:

In `test/extended/two_node/tnf_resilience.go`:

- Around lines 250-266: expectPacemakerLogFound currently checks the whole pacemaker history and can be satisfied by stale log lines. Change it to accept a baseline marker (e.g., a sinceTime string or timestamp parameter) and pass that to getPacemakerLogGrep so the grep only returns lines after the baseline. Update callers to capture a baseline (timestamp or log offset) immediately before the disruptive action and pass it into expectPacemakerLogFound/getPacemakerLogGrep, so assertions only consider logs emitted after the disruption.
- Around lines 209-212: The negative-log assertion currently swallows errors from getPacemakerLogGrep and treats them like "no match", allowing tests to pass when log retrieval actually failed. Change the caller so that when getPacemakerLogGrep returns a non-nil error (not just an empty/negative match) the test fails (or returns the error) instead of logging and continuing. Locate the call site that logs and skips on error (the must-not-appear check that calls getPacemakerLogGrep) and replace the error-ignored path with a fail/return path so inability to read logs causes test failure; keep the existing logic that treats an empty grep result as success. getPacemakerLogGrep itself can remain unchanged, apart from improved error wrapping for context if helpful.
- Around lines 186-201: The loop in verifyEtcdCloneStartedOnAllNodes scans from the first "etcd-clone" occurrence to the end of the pcs status output, allowing unrelated "Started" lines to satisfy the check. Restrict the search to the etcd-clone resource block by finding the block end (e.g., the next resource header or a blank line) after etcdIdx and use only that slice (etcdSection) when scanning for lines that contain "Started" and node.Name. Update the code that computes etcdSection (keeping the existing node/name checks) so it stops at the resource boundary instead of the end of statusOutput.
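The block-scoping idea in the last comment can be sketched standalone. The `resourceSection` helper and the sample status fragment below are hypothetical, and the boundary heuristic (stop at the next non-indented line) is an assumption about `pcs status` formatting, not the PR's actual parsing code:

```go
package main

import (
	"fmt"
	"strings"
)

// resourceSection returns only the lines belonging to one resource block of
// `pcs status`-style output: it starts at the line naming the resource and
// stops at the next non-indented line (assumed to be the next resource
// header), so "Started" lines from unrelated resources are excluded.
func resourceSection(statusOutput, resource string) string {
	lines := strings.Split(statusOutput, "\n")
	start := -1
	for i, l := range lines {
		if strings.Contains(l, resource) {
			start = i
			break
		}
	}
	if start == -1 {
		return ""
	}
	end := len(lines)
	for i := start + 1; i < len(lines); i++ {
		if strings.TrimSpace(lines[i]) != "" && !strings.HasPrefix(lines[i], " ") {
			end = i // boundary: next top-level resource header
			break
		}
	}
	return strings.Join(lines[start:end], "\n")
}

func main() {
	// Hypothetical status fragment: an etcd-clone block followed by an
	// unrelated resource that also reports "Started".
	status := "Clone Set: etcd-clone [etcd]\n" +
		"  Started: [ master-0 master-1 ]\n" +
		"Resource: kubelet\n" +
		"  Started: [ master-2 ]\n"
	section := resourceSection(status, "etcd-clone")
	fmt.Println(strings.Contains(section, "master-2")) // false: scoped out
}
```

Scoping the substring search to this slice means a node name appearing under another resource's "Started" line can no longer satisfy the etcd-clone assertion.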
ℹ️ Review info: Configuration used: Organization UI · Review profile: CHILL · Plan: Pro · Run ID: d04935e4-057e-405a-a8fe-f7ea2413ba59

📒 Files selected for processing (1): test/extended/two_node/tnf_resilience.go
- Handle GetNodes error in AfterEach to prevent nil panic
- Scope verifyEtcdCloneStartedOnAllNodes to etcd-clone block only
- Add per-node pacemaker log baselines to prevent stale log matches
- Fail test on log retrieval errors instead of silently skipping
- Poll for learner_node attribute with Eventually for async monitor

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🧹 Nitpick comments (1)

test/extended/two_node/tnf_resilience.go (1)

**586-591: Podman kill may fail if the etcd container is already stopped or restarting.** Line 590 asserts no error from the `podman kill etcd` command. However, if etcd is in the middle of a restart cycle or already stopped, `podman kill` returns a non-zero exit code. The `2>/dev/null` suppresses stderr but doesn't suppress the exit code, which will be captured as an error. Consider allowing the command to fail gracefully, since the goal is to disrupt etcd, not necessarily to succeed at killing:

🛡️ Proposed fix:

```diff
 	g.By(fmt.Sprintf("Killing etcd container on %s", targetNode.Name))
-	_, err = exutil.DebugNodeRetryWithOptionsAndChroot(
+	output, err := exutil.DebugNodeRetryWithOptionsAndChroot(
 		oc, targetNode.Name, "openshift-etcd",
-		"bash", "-c", "podman kill etcd 2>/dev/null")
-	o.Expect(err).ShouldNot(o.HaveOccurred(), "Expected to kill etcd container without command errors")
+		"bash", "-c", "podman kill etcd 2>/dev/null; true")
+	framework.Logf("Podman kill output: %s, err: %v", output, err)
```

🤖 Prompt for AI Agents: Verify this finding against the current code and only fix it if needed. In `test/extended/two_node/tnf_resilience.go` around lines 586-591: the test currently fails if "podman kill etcd" returns a non-zero exit (e.g., container already stopped). Update the "bash -c" argument passed to exutil.DebugNodeRetryWithOptionsAndChroot to tolerate failure by making the shell command idempotent (e.g., append "|| true" or otherwise suppress the non-zero exit) so the intent to disrupt etcd is preserved without causing the test to error.
ℹ️ Review info: Configuration used: Organization UI · Review profile: CHILL · Plan: Pro · Run ID: ba3e73e0-ffae-4bc0-9af8-b592fefaa724

📒 Files selected for processing (1): test/extended/two_node/tnf_resilience.go
The podman kill command can fail if etcd is already stopped or restarting. Append '; true' to tolerate non-zero exit codes since the real assertion is whether the cluster recovers, not whether the kill command succeeded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:

In `test/extended/two_node/tnf_resilience.go`:

- Around lines 238-249: getPacemakerLogBaselines sometimes leaves baselines[node.Name] empty on error, which lets getPacemakerLogGrep search the entire log. Change the error branch in getPacemakerLogBaselines to assign a non-empty sentinel (e.g., "0") for that node and keep the warning, and update getPacemakerLogGrep to detect that sentinel (baseline == "0") and skip that node instead of searching full history. Ensure both getPacemakerLogBaselines and getPacemakerLogGrep reference the same sentinel handling, so missing baselines no longer cause time-scoped queries to fall back to the full log.
- Around lines 223-232: getPacemakerLogGrep uses `tail -n +<baseline>`, which includes the baseline line itself, and builds a pipeline whose early-stage failures can be masked. Update the command string to (1) enable bash pipefail by prefixing `set -o pipefail;` so any source-command failure (e.g., tail can't read the log) is propagated, and (2) advance the baseline by one (use shell arithmetic `$((<baseline>+1))` in the tail argument) so the baseline line itself is excluded. Keep these changes inside the existing fmt.Sprintf-based command construction used before calling DebugNodeRetryWithOptionsAndChroot.
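The suggested command construction can be sketched as a small helper. This is only an illustration of the shell shape the review asks for; the log path and grep pattern are placeholders, not values taken from the PR:

```go
package main

import "fmt"

// pacemakerGrepCmd builds the shell command shape the review describes:
// `set -o pipefail` propagates a failure of tail (e.g., unreadable log)
// through the pipe, and $((baseline+1)) excludes the baseline line itself
// so only lines emitted after the captured offset are searched.
// The log path and pattern here are illustrative placeholders.
func pacemakerGrepCmd(baseline int, pattern string) string {
	return fmt.Sprintf(
		"set -o pipefail; tail -n +$((%d+1)) /var/log/pacemaker/pacemaker.log | grep -E '%s'",
		baseline, pattern)
}

func main() {
	fmt.Println(pacemakerGrepCmd(120, "force-new-cluster"))
}
```

Without pipefail, the pipeline's exit status is grep's alone, so a failing tail would look identical to "pattern not found"; with it, a broken log read surfaces as a command error the test can fail on.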
ℹ️ Review info: Configuration used: Organization UI · Review profile: CHILL · Plan: Pro · Run ID: f14c5294-1c44-436c-b6f4-fa9361785940

📒 Files selected for processing (1): test/extended/two_node/tnf_resilience.go
@lucaconsalvi: This pull request references OCPEDGE-2386, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@lucaconsalvi: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Summary

- Adds 5 regression tests validating the `podman-etcd` resource agent in two-node fencing clusters
- Registers the `openshift/tnf-resilience` test suite (serial, disruptive, `Parallelism: 1`)
- Adds shared pacemaker/CRM utilities to `utils/common.go`
- Updates the `openshift/two-node` suite qualifier to exclude tnf-resilience tests

The tests are placed in a separate suite because they are more disruptive than the standard `openshift/two-node` tests — they stop/restart etcd resources, kill etcd containers, and trigger force-new-cluster recovery. Running them separately avoids destabilizing the cluster for other two-node tests.

Tests

- OCP-88178: `learner_node` CRM attribute is cleared on stop and start
- OCP-88179: active resource count excludes stopping resources
- OCP-88180: simultaneous stop delay prevents WAL corruption
- OCP-88181: coordinated recovery after `podman kill etcd`
- OCP-88213: retry of `learner_node` after deletion during FNC recovery

Test plan

- Ran `run-test` on a live two-node fencing cluster (4.22.0-0.nightly-2026-03-11-034211)
- Ran `./openshift-tests run openshift/tnf-resilience` on the remote instance (cross-compiled binary), confirming the suite runner respects `Parallelism: 1`