
OCPEDGE-2386: Add two-node-resilience test suite with 5 etcd resilience tests. #30880

Open

lucaconsalvi wants to merge 6 commits into openshift:main from lucaconsalvi:tnf-resilience-tests

Conversation

@lucaconsalvi

@lucaconsalvi lucaconsalvi commented Mar 13, 2026

Summary

  • Adds 5 resilience tests for the podman-etcd resource agent in two-node fencing clusters
  • Creates the openshift/tnf-resilience test suite (serial, disruptive, Parallelism: 1)
  • Adds shared pacemaker/CRM utilities to utils/common.go
  • Updates openshift/two-node suite qualifier to exclude tnf-resilience tests

The tests are placed in a separate suite because they are more disruptive than the standard openshift/two-node tests — they stop/restart etcd resources, kill etcd containers, and trigger force-new-cluster recovery. Running them separately avoids destabilizing the cluster for other two-node tests.

Tests

| ID | Test | What it validates |
| --- | --- | --- |
| OCP-88178 | Learner cleanup | `learner_node` CRM attribute is cleared on stop and start |
| OCP-88179 | Active count | Stopping resources are excluded from the active count |
| OCP-88180 | Simultaneous stop | The second node delays its stop to prevent WAL corruption |
| OCP-88181 | Coordinated recovery | Both nodes coordinate recovery after `podman kill etcd` |
| OCP-88213 | Attribute retry | The leader retries setting `learner_node` after deletion during FNC recovery |

Test plan

  • All 5 tests passed via run-test on a live two-node fencing cluster (4.22.0-0.nightly-2026-03-11-034211)
  • All 5 tests passed via ./openshift-tests run openshift/tnf-resilience on the remote instance (cross-compiled binary), confirming the suite runner respects Parallelism: 1

lucaconsalvi and others added 2 commits March 13, 2026 15:12
Introduces the openshift/two-node-regression suite with 5 regression tests
that validate podman-etcd resource agent behavior under disruptive conditions:

- OCP-88178: learner_node CRM attribute cleanup during stop/start
- OCP-88179: active resource count excludes stopping resources
- OCP-88180: simultaneous stop delay prevents WAL corruption
- OCP-88181: coordinated recovery after etcd container kill
- OCP-88213: attribute retry during force-new-cluster recovery

Also adds shared pacemaker/CRM utilities to utils/common.go and updates
the openshift/two-node suite qualifier to exclude regression tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai

coderabbitai bot commented Mar 13, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 747bde77-9701-4830-938c-10152617a7ab

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Registers a new openshift/tnf-resilience test suite, adds a comprehensive two-node etcd resilience test exercising Pacemaker/CRM operations, introduces Pacemaker/CRM helper utilities, and updates go.mod.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Test Suite Registration: `pkg/testsuites/standard_suites.go` | Added openshift/tnf-resilience to staticSuites and updated the openshift/two-node qualifiers to exclude the new suite. |
| Resilience Test Implementation: `test/extended/two_node/tnf_resilience.go` | Large new test file with a learnerCleanupResult type, CRM/pcs script builders, attribute injection/cleanup, disable/enable cycle runners, pacemaker log inspections, node readiness and etcd health checks, multiple test scenarios, setup/teardown, and extensive helper functions. |
| Cluster Utilities: `test/extended/two_node/utils/common.go` | Added helpers DisablePacemakerResource, EnablePacemakerResource, SetCRMAttribute, QueryCRMAttribute, and DeleteCRMAttribute to run remote pcs/crm_attribute commands and return captured output/errors. |
| Module file: `go.mod` | Updated module dependencies required by the added test code. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Single Node Openshift (Sno) Test Compatibility | ⚠️ Warning | The openshift/tnf-resilience test suite contains five Ginkgo tests with explicit multi-node assumptions incompatible with Single Node OpenShift. | Add the [Skipped:SingleReplicaTopology] label to the Describe block to automatically skip these tests on SNO environments. |
| Stable And Deterministic Test Names | ❓ Inconclusive | Test names must be stable and deterministic, not containing dynamic values like timestamps or random identifiers that change between test runs. | Unable to examine the test file: the location test/extended/two_node/tnf_resilience.go could not be verified in the current environment. |
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately summarizes the main change: adding a new test suite with etcd resilience tests for two-node clusters. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |
| Test Structure And Quality | ✅ Passed | The test file demonstrates solid test quality practices: clear single responsibility per test, comprehensive setup/cleanup, appropriate timeouts, meaningful assertions, and consistency with codebase patterns. |
| Microshift Test Compatibility | ✅ Passed | The test is protected from running on MicroShift by the [apigroup:config.openshift.io] tag, which is unavailable on MicroShift. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Tests contain no hardcoded IPv4 addresses, no IPv4-only IP parsing, and no external connectivity requirements. All operations are cluster-local. |

@openshift-ci openshift-ci bot requested review from eggfoobar and sjenning March 13, 2026 16:34
@openshift-ci
Contributor

openshift-ci bot commented Mar 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lucaconsalvi
Once this PR has been reviewed and has the lgtm label, please assign bertinatto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
test/extended/two_node/tnf_resilience.go (2)

576-586: Redundant logging loop before assertion loop.

Lines 576-582 log node presence, and lines 583-586 assert the same condition. The logging is redundant since assertion failures include the failure message.

♻️ Suggested consolidation
-		for _, node := range nodes {
-			if strings.Contains(failedSection, node.Name) {
-				framework.Logf("Coordinated failure confirmed: node %s found in Failed Resource Actions", node.Name)
-			} else {
-				framework.Logf("Node %s NOT found in Failed Resource Actions", node.Name)
-			}
-		}
 		for _, node := range nodes {
 			o.Expect(failedSection).To(o.ContainSubstring(node.Name),
 				fmt.Sprintf("Expected Failed Resource Actions to show failure on %s for coordinated recovery", node.Name))
+			framework.Logf("Coordinated failure confirmed: node %s found in Failed Resource Actions", node.Name)
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_resilience.go` around lines 576 - 586, The first
loop that calls framework.Logf to report whether each node is present in
failedSection is redundant because the following assertion loop (using o.Expect
with o.ContainSubstring) already verifies presence and provides failure
messages; remove the preliminary logging loop that iterates over nodes and uses
framework.Logf, or alternatively merge by performing the o.Expect check and only
logging via framework.Logf inside the assertion failure branch—target the loops
handling nodes, failedSection, framework.Logf, and the assertion calls
o.Expect/o.ContainSubstring (keep the assertion loop intact).

543-546: Consider using idiomatic Gomega assertion.

Line 546 uses .To(o.BeNil()) while the rest of the file uses .ShouldNot(o.HaveOccurred()) for error checks. Consider using the consistent pattern for readability.

♻️ Suggested change
 		_, err = exutil.DebugNodeRetryWithOptionsAndChroot(
 			oc, targetNode.Name, "openshift-etcd",
 			"bash", "-c", "podman kill etcd 2>/dev/null")
-		o.Expect(err).To(o.BeNil(), "Expected to kill etcd container without command errors")
+		o.Expect(err).ShouldNot(o.HaveOccurred(), "Expected to kill etcd container without command errors")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_resilience.go` around lines 543 - 546, The test
currently asserts the kill-etcd error with o.Expect(err).To(o.BeNil()), which is
inconsistent with the file's idiomatic Gomega usage; replace that assertion with
the project's standard pattern—use o.Expect(err).ShouldNot(o.HaveOccurred()) (or
the equivalent Expect(...).NotTo(HaveOccurred())) to check the error returned by
exutil.DebugNodeRetryWithOptionsAndChroot so the assertion style matches other
checks in this file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cfb85c1b-ac14-4bc8-aef5-151a5de45974

📥 Commits

Reviewing files that changed from the base of the PR and between b317dd5 and 00cf165.

📒 Files selected for processing (3)
  • pkg/testsuites/standard_suites.go
  • test/extended/two_node/tnf_resilience.go
  • test/extended/two_node/utils/common.go

…tomatically skip these tests on SNO environments.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
test/extended/two_node/tnf_resilience.go (1)

651-656: Potential race between learner detection and attribute setting.

The test waits for the standby node to appear as a learner in etcd (line 628-643), then immediately queries the learner_node CRM attribute (line 653). However, the attribute is set by the leader's monitor action, which runs asynchronously after detecting the learner. There could be a timing gap where the etcd member list shows the learner but the attribute hasn't been set yet.

Consider wrapping the attribute query in an Eventually block to handle this race condition.

♻️ Proposed fix
 		// Verify learner_node attribute is set before we delete it
 		g.By("Verifying learner_node CRM attribute is set")
-		attrOutput, err := utils.QueryCRMAttribute(oc, execNode.Name, crmAttributeName)
-		o.Expect(err).ShouldNot(o.HaveOccurred(),
-			"Expected learner_node attribute to exist after force-new-cluster recovery")
-		framework.Logf("learner_node attribute value: %s", attrOutput)
+		var attrOutput string
+		o.Eventually(func() error {
+			var queryErr error
+			attrOutput, queryErr = utils.QueryCRMAttribute(oc, execNode.Name, crmAttributeName)
+			return queryErr
+		}, etcdResourceRecoveryTimeout, utils.FiveSecondPollInterval).ShouldNot(
+			o.HaveOccurred(), "Expected learner_node attribute to exist after force-new-cluster recovery")
+		framework.Logf("learner_node attribute value: %s", attrOutput)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_resilience.go` around lines 651 - 656, There is a
race where the etcd learner appears before the leader's monitor sets the CRM
attribute; change the direct call to utils.QueryCRMAttribute(oc, execNode.Name,
crmAttributeName) into an o.Eventually that repeatedly calls
utils.QueryCRMAttribute until it returns the expected non-empty value (or no
error) within a reasonable timeout, then use o.Expect on the eventual result;
update the block following g.By("Verifying learner_node CRM attribute is set")
to poll for the attribute instead of reading it once so the leader's
asynchronous monitor has time to set it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/two_node/tnf_resilience.go`:
- Around line 317-327: The AfterEach cleanup currently ignores the error
returned by utils.GetNodes which can leave nodeList nil and cause a panic when
accessing nodeList.Items; update the call in the AfterEach closure that invokes
utils.GetNodes(oc, utils.AllNodes) to capture the error (e.g., nodeList, err :=
utils.GetNodes(...)), check if err != nil and log via framework.Logf (or
framework.Logf with a descriptive message) then return, and also defensively
verify nodeList != nil before checking nodeList.Items to avoid nil dereference.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 494acb98-87a0-40b5-93a4-4d736eda6dae

📥 Commits

Reviewing files that changed from the base of the PR and between 00cf165 and 778acd8.

📒 Files selected for processing (1)
  • test/extended/two_node/tnf_resilience.go


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
test/extended/two_node/tnf_resilience.go (1)

322-324: ⚠️ Potential issue | 🟠 Major

Handle GetNodes errors in AfterEach cleanup

At Line 322, the utils.GetNodes error is ignored and Line 323 dereferences nodeList.Items immediately. If node retrieval fails, cleanup can panic or skip critical recovery steps.

Proposed fix
-		nodeList, _ := utils.GetNodes(oc, utils.AllNodes)
-		if len(nodeList.Items) == 0 {
+		nodeList, err := utils.GetNodes(oc, utils.AllNodes)
+		if err != nil || nodeList == nil || len(nodeList.Items) == 0 {
 			framework.Logf("Warning: Could not retrieve nodes during cleanup")
 			return
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_resilience.go` around lines 322 - 324, In the
AfterEach cleanup code, don't ignore the error returned by utils.GetNodes; check
the error from utils.GetNodes(oc, utils.AllNodes) before accessing
nodeList.Items and handle failures (e.g., log the error with framework.Logf or
framework.Failf and skip further node-dependent cleanup) to avoid panics —
update the block that calls utils.GetNodes to capture the error, guard any use
of nodeList (including len(nodeList.Items)) behind a nil/err check, and ensure
cleanup either retries or exits gracefully when GetNodes fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/two_node/tnf_resilience.go`:
- Around line 250-266: expectPacemakerLogFound currently checks whole pacemaker
history and can be satisfied by stale log lines; change it to accept a baseline
marker (e.g., a sinceTime string or timestamp parameter) and pass that to
getPacemakerLogGrep so the grep only returns lines after the baseline; update
callers to capture a baseline (timestamp or log offset) immediately before the
disruptive action and pass it into expectPacemakerLogFound/getPacemakerLogGrep
so assertions only consider logs emitted after the disruption.
- Around line 209-212: The negative-log assertion currently swallows errors from
getPacemakerLogGrep and treats them like "no match", allowing tests to pass when
log retrieval actually failed; change the caller so that when
getPacemakerLogGrep returns a non-nil error (not just an empty/negative match)
the test fails (or returns the error) instead of logging and continuing. Locate
the call site that logs and skips on error (the must-not-appear check that calls
getPacemakerLogGrep) and replace the error-ignored path with a fail/return path
(e.g., t.Fatalf/t.Helper return or propagate the error) so inability to read
logs causes test failure; keep the existing logic that treats an empty grep
result as success. Ensure getPacemakerLogGrep remains unchanged except for
improving any error wrapping if helpful to provide context.
- Around line 186-201: The loop in verifyEtcdCloneStartedOnAllNodes scans from
the first "etcd-clone" occurrence to the end of the pcs status output, allowing
unrelated "Started" lines to satisfy the check; restrict the search to the
etcd-clone resource block by finding the block end (e.g., the next resource
header or a blank line) after etcdIdx and use only that slice (etcdSection) when
scanning for lines that contain "Started" and node.Name; update the code that
computes etcdSection (and keeps the existing node/name checks) so it stops at
the resource boundary instead of the end of statusOutput.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d04935e4-057e-405a-a8fe-f7ea2413ba59

📥 Commits

Reviewing files that changed from the base of the PR and between 778acd8 and c4b91d7.

📒 Files selected for processing (1)
  • test/extended/two_node/tnf_resilience.go

- Handle GetNodes error in AfterEach to prevent nil panic
- Scope verifyEtcdCloneStartedOnAllNodes to etcd-clone block only
- Add per-node pacemaker log baselines to prevent stale log matches
- Fail test on log retrieval errors instead of silently skipping
- Poll for learner_node attribute with Eventually for async monitor

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
test/extended/two_node/tnf_resilience.go (1)

586-591: Podman kill may fail if etcd container is already stopped or restarting.

Line 590 asserts no error from the podman kill etcd command. However, if etcd is in the middle of a restart cycle or already stopped, podman kill returns a non-zero exit code. The 2>/dev/null suppresses stderr but doesn't suppress the exit code that will be captured as an error.

Consider allowing the command to fail gracefully since the goal is to disrupt etcd, not necessarily to succeed at killing:

🛡️ Proposed fix
 		g.By(fmt.Sprintf("Killing etcd container on %s", targetNode.Name))
-		_, err = exutil.DebugNodeRetryWithOptionsAndChroot(
+		output, err := exutil.DebugNodeRetryWithOptionsAndChroot(
 			oc, targetNode.Name, "openshift-etcd",
-			"bash", "-c", "podman kill etcd 2>/dev/null")
-		o.Expect(err).ShouldNot(o.HaveOccurred(), "Expected to kill etcd container without command errors")
+			"bash", "-c", "podman kill etcd 2>/dev/null; true")
+		framework.Logf("Podman kill output: %s, err: %v", output, err)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_resilience.go` around lines 586 - 591, The test
currently fails if "podman kill etcd" returns a non-zero exit (e.g., container
already stopped); update the call that invokes
exutil.DebugNodeRetryWithOptionsAndChroot (the "bash -c" argument) to tolerate
failure by making the shell command idempotent (e.g., append "|| true" or
otherwise suppress non-zero exit) so the intent to disrupt etcd is preserved
without causing the test to error; keep the surrounding expectation
o.Expect(err).ShouldNot... as-is since the remote command will now return
success even when etcd was already stopped.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ba3e73e0-ffae-4bc0-9af8-b592fefaa724

📥 Commits

Reviewing files that changed from the base of the PR and between c4b91d7 and b6a80a6.

📒 Files selected for processing (1)
  • test/extended/two_node/tnf_resilience.go

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

The podman kill command can fail if etcd is already stopped or
restarting. Append '; true' to tolerate non-zero exit codes since
the real assertion is whether the cluster recovers, not whether
the kill command succeeded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/two_node/tnf_resilience.go`:
- Around line 238-249: getPacemakerLogBaselines sometimes leaves
baselines[node.Name] empty on error which lets getPacemakerLogGrep search the
entire log; change the error branch in getPacemakerLogBaselines to assign a
non-empty sentinel (e.g., "0") for that node and keep the warning, and update
getPacemakerLogGrep to detect that sentinel (baseline == "0") and skip that node
instead of searching full history; ensure both getPacemakerLogBaselines and
getPacemakerLogGrep reference the same sentinel handling so missing baselines no
longer cause time-scoped queries to fall back to the full log.
- Around line 223-232: The getPacemakerLogGrep function uses "tail -n
+<baseline>" which includes the baseline line and builds a pipeline whose
early-stage failures can be masked; update the command string in
getPacemakerLogGrep to (1) enable bash pipefail by prefixing "set -o pipefail;"
so any source-command failure (e.g., tail can't read the log) is propagated, and
(2) advance the baseline by one (use shell arithmetic $((<baseline>+1)) in the
tail argument) so the baseline line itself is excluded; keep these changes
inside the existing fmt.Sprintf-based command construction used before calling
DebugNodeRetryWithOptionsAndChroot.
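The two fixes above (pipefail plus a baseline advanced by one) can be sketched as a command builder. buildScopedGrepCmd is a hypothetical name and the log path in the example is an assumption, not the PR's actual constant:

```go
package main

import "fmt"

// buildScopedGrepCmd builds the shell pipeline the review suggests for
// time-scoped pacemaker log checks: `set -o pipefail` propagates a failure
// from any pipeline stage (e.g. tail unable to read the log) instead of
// only the last stage's status, and baseline+1 makes tail start one line
// past the recorded baseline so the baseline line itself is excluded.
func buildScopedGrepCmd(logPath, pattern string, baseline int) string {
	return fmt.Sprintf("set -o pipefail; tail -n +%d %s | grep -E '%s'",
		baseline+1, logPath, pattern)
}

func main() {
	fmt.Println(buildScopedGrepCmd("/var/log/pacemaker/pacemaker.log", "force-new-cluster", 120))
}
```

One caveat: grep itself exits non-zero when nothing matches, so callers that need to distinguish "no match" from "log unreadable" still have to treat an empty grep result as a separate case from a tail failure.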

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f14c5294-1c44-436c-b6f4-fa9361785940

📥 Commits

Reviewing files that changed from the base of the PR and between b6a80a6 and 462c91e.

📒 Files selected for processing (1)
  • test/extended/two_node/tnf_resilience.go

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@lucaconsalvi lucaconsalvi changed the title Add two-node-resilience test suite with 5 etcd resilience tests OCPEDGE-2386: Add two-node-resilience test suite with 5 etcd resilience tests. Mar 16, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 16, 2026
@openshift-ci-robot

@lucaconsalvi: This pull request references OCPEDGE-2386 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci bot commented Mar 16, 2026

@lucaconsalvi: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

