Skip to content

test(harness): fix release-gate replay/cli-matrix triage failures (tiebreaker, quorum, mid-sync, over-commit, ps-cdp)#156

Merged
Andrei Kvapil (kvaps) merged 3 commits into
mainfrom
fix/release-gate-replay-harness
Jun 13, 2026
Merged

test(harness): fix release-gate replay/cli-matrix triage failures (tiebreaker, quorum, mid-sync, over-commit, ps-cdp)#156
Andrei Kvapil (kvaps) merged 3 commits into
mainfrom
fix/release-gate-replay-harness

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jun 13, 2026

Copy link
Copy Markdown
Member

Harness/docs-only fixes for 7 of the 8 replay/cli-matrix failures surfaced by the release-gate stand sweep at the candidate SHA. No product code changed. (The 8th, an inactive-return-backfills-redundancy failure, is a genuine product race tracked separately — see below.)

Verdicts

Test Verdict Cause
n-rst-tiebreaker-machinery-alive harness witness r d stamps an intentional 5-min suppression window; assert waited only 120s → guaranteed timeout
r-c-with-tiebreaker-peer harness a manual --diskless node is never flagged TieBreaker, and at 1 diskful no auto-witness is managed → tiebreaker_present could never match
r-d-sub-quorum-teardown-converges harness after r d node1 (3→2 diskful) the auto-tiebreaker legally relocates onto the freed node1 → resource_absent node1 unsatisfiable
r-d-last-uptodate-midsync-rejected legit-skip FILE_THIN skip-syncs the empty replica → the mid-sync window is unobservable in CRD status; SKIP cleanly
ps-cdp-zfs legit-skip /dev/loop9 exists as a device node but is not a free PhysicalDevice; the free-device gate was too weak
vd-resize-single-replica-no-restart harness flake passes 2× in isolation; transient empty-stderr exit-10 under sweep load
rg-c-overcommit-spawn-fails harness cell pinned the upstream short-spawn contract; BS intentionally does deferred best-effort autoplace (pinned in spawn_test.go) — renamed to …-spawn-defers, known-delta row 83

Tests 1/2/3 share #154's lesson (the auto-witness legally relocates onto a freed/sole spare node; the harness must not wait for a phantom witness or resource_absent there) but live in separate replay YAMLs #154 didn't touch.

Changes

  1. Harness: node-scoped free-device gate via REST /v1/physical-storage; new skip_if_reached modifier for clean SKIP when a scenario is not reproducible on a fast stand.
  2. Replay: assertion fixes in 4 YAMLs — no weakening; each new assert is a true observable condition.
  3. cli-matrix: …-spawn-fails…-spawn-defers pinning the real BS deferred-autoplace contract; known-delta row 83.

Validated on the stand 3× each (PASS/SKIP as classified).

Product bug found, tracked separately (NOT in this PR)

inactive-return-backfills-redundancy is a genuine TOCTOU race: the witness reaper removeWitnesses does a non-atomic Get→Delete and can clobber a replica the placer just promoted, losing the redundancy backfill. Medium-risk fix on the tiebreaker+autoplace surface; needs its own PR with L6/L7 validation.

Summary by CodeRabbit

  • New Features

    • Resource group spawn-resources now uses deferred best-effort placement when over-committed, placing available replicas and returning actionable feedback instead of failing.
  • Tests

    • Updated end-to-end tests to validate new over-committed spawn behavior.
    • Enhanced test framework with conditional skip mechanism for non-applicable scenarios.
  • Documentation

    • Added CLI parity documentation for accepted over-committed resource group behavior.

Andrei Kvapil (kvaps) and others added 3 commits June 13, 2026 10:57
Two operator-replay harness robustness fixes surfaced by the
release-gate sweep on stand dev:

- fixture_device_on_any_worker probed only `test -b <dev>` on ANY
  worker. The kernel pre-creates loop device NODES whether or not
  anything is attached, so the bare loop node passed the gate on every
  stand even when nothing was attached -- and ps-cdp-zfs then FAILed
  (exit 10, "no free PhysicalDevice") instead of SKIPping. The probe is
  now node-scoped (the workflow pins the create to node1) and consults
  the controller's authoritative free physical-device view (REST GET
  /v1/physical-storage), so an absent / in-use device yields a clean
  SKIP.

- Add an optional `skip_if_reached: <disk_state>` await modifier: SKIP
  the whole workflow cleanly when the target reaches that state before
  the awaited one. Lets a scenario that needs a transient state a fast
  stand can race past (e.g. a mid-sync window on a skip-sync pool) SKIP
  as not-exercisable instead of timing out to a false FAIL.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Four release-gate replay workflows asserted conditions that the
candidate's correct, deliberate behaviour can never satisfy on stand
dev. All four are harness bugs (no assertion weakening; each new
assertion is a true, observable condition):

- r-c-with-tiebreaker-peer: after dropping the diskful peer the shape is
  1 diskful + 1 user-diskless. The hand-created node3 carries DISKLESS
  only (not TIE_BREAKER), and at 1 diskful no auto-witness is managed
  (shouldTieBreakerExist needs >=2 diskful) -- so the `tiebreaker_present`
  grep for "TieBreaker" could never match. Assert the observable truth
  (node3 stays Diskless); the Bug-327 contract (bare `r c` comes up
  DISKFUL) is unchanged.

- r-d-sub-quorum-teardown-converges: after `r d node1` on a 3-diskful RD
  the shape drops to even 2-diskful, re-arming the auto-tiebreaker, which
  re-occupies the just-freed node1 with a witness -- so `resource_absent
  node1` could never hold. Assert node1's diskful backing is gone
  (Diskless), then the second delete reaps the witness too (both nodes
  absent), the survivor stays UpToDate, and the RD deletes cleanly -- the
  real U197 sub-quorum-converges contract.

- n-rst-tiebreaker-machinery-alive: `r d` of the witness stamps the
  DELIBERATE 5-minute auto-tiebreaker SUPPRESSION window, so the witness
  cannot re-add within the old 120s await. Bump to 330s (5min window +
  margin) -- bounded on the known product constant
  autoTiebreakerSuppressionWindow, not blind inflation.

- r-d-last-uptodate-midsync-rejected: the `stand` FILE_THIN pool
  skip-syncs a fresh replica (day0 skip-initial-sync), so node2 reaches
  UpToDate in <10s without ever exposing a CRD-observable mid-sync state
  -- the U130 rejection window is not exercisable as a pure-CLI replay
  here (the contract stays pinned at L1 + the L6 cell). Add
  `skip_if_reached: UpToDate` so it SKIPs cleanly instead of timing out.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The rg-c-overcommit-spawn-fails cell asserted the UPSTREAM contract
(spawn-resources fails short with FAIL_NOT_ENOUGH_NODES / 996 and places
nothing) on an over-committed resource group. blockstor deliberately
diverges: spawn takes a best-effort DEFERRED autoplace path -- it spawns
the RD, places as many diskful replicas as the topology allows (3 of 7
on a 3-node cluster), and surfaces the shortfall as an INFO in the
SUCCESS envelope (exit 0). This is pinned at L1
(TestSpawnImpossiblePlacementReturnsActionableError) and backed by the
RGRebalanceReconciler additive backfill, so the cell was asserting a
contract the product intentionally does not implement.

Rename the cell to rg-c-overcommit-spawn-defers and rewrite it to pin
the actual contract: create accepts pc=7, spawn succeeds with the
deferred-autoplace envelope, RD survives, best-effort places 1..nodes
diskful (never over- or under-places). Document the divergence as
known-delta row 83.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR extends the operator test harness with conditional workflow skipping to handle variable hardware configurations, updates multiple replay test scenarios with corrected resource state expectations, and adds a new e2e CLI test validating deferred best-effort spawn behavior on over-committed resource groups.

Changes

Over-committed RG spawn deferred behavior

Layer / File(s) Summary
Over-committed spawn deferred behavior test
tests/e2e/cli-matrix/rg-c-overcommit-spawn-defers.sh
New e2e test that creates an over-committed RG (place_count=7 on 3-node cluster), verifies spawn-resources exits successfully with "placed N of M" envelope, confirms RD persists for CSI retry, and validates diskful replica counts are within bounds.
CLI parity whitelist entry for deferred spawn
docs/cli-parity-known-deltas.md
Adds delta #83 documenting that blockstor spawn-resources uses best-effort autoplace on over-committed RGs, with reconciler top-up behavior when nodes become available.

Test harness skip-if-reached conditional gating

Layer / File(s) Summary
Fixture and await_assertion skip_if_reached support
tests/operator-harness/lib.sh
Adds fixture_free_physical_device_on_node for node-targeted free-device REST checks, updates fixture_device_on_any_worker to use it, extends await_assertion setup and polling loop to handle optional skip_if_reached disk states, returning sentinel code 2 when matched.
run_step skip sentinel propagation
tests/operator-harness/lib.sh
Updates run_step to capture await_assertion return code, immediately propagating sentinel 2 for scenario skipping while treating other non-zero codes as failures.
Replay runner skip_if_reached documentation and control-flow
tests/operator-harness/replay-runner.sh
Documents skip_if_reached await modifier, implements early step-loop exit when run_step returns 2, and adds post-teardown skip handling (prints SKIP, exits 0, bypasses invariants).

Operator replay test scenario updates

Layer / File(s) Summary
Tiebreaker auto-readd timeout and suppression-window docs
tests/operator-harness/replay/n-rst-tiebreaker-machinery-alive.yaml
Documents auto-tiebreaker suppression-window timing (5 minutes), increases witness-auto-readded timeout from 120s to 330s to allow suppression window to lapse.
Tiebreaker peer diskless state verification
tests/operator-harness/replay/r-c-with-tiebreaker-peer.yaml
Changes post-delete assertion from tiebreaker_present to replica_diskless for node3, clarifying diskless peer classification and auto-tiebreaker management behavior.
Mid-sync assertion skip guard for UpToDate states
tests/operator-harness/replay/r-d-last-uptodate-midsync-rejected.yaml
Adds skip_if_reached: UpToDate guard to node2 await step so U130 rejection probe skips when UpToDate is reached before mid-sync SyncTarget is observable.
Sub-quorum teardown witness lifecycle and replica verification
tests/operator-harness/replay/r-d-sub-quorum-teardown-converges.yaml
Asserts node1 diskless-transition on first delete, adds explicit witness-reap step for node1 resource_absent, changes surviving-replica check from count to node3 disk_state=UpToDate validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • cozystack/blockstor#110: Modifies operator test harness control-flow in tests/operator-harness/lib.sh and tests/operator-harness/replay-runner.sh, with changes to run_step and step execution patterns.

Poem

🐰 Over-commit flows now defer with grace,
No fail-fast here in this test-harness place,
Skip-if-reached gates let scenarios breathe,
When hardware won't play, they cleanly retreat,
Replicas placed where they can be—a feat! 🌾

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the primary changes: fixing release-gate test harness failures across replay and cli-matrix tests, with specific categories listed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/release-gate-replay-harness

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the test harness and replay tests to accommodate blockstor's deliberate behavioral differences from upstream LINSTOR, such as deferred autoplacement on over-committed resource groups and auto-tiebreaker suppression windows. It introduces a skip_if_reached mechanism to cleanly skip test scenarios that are not exercisable on fast-syncing stands. The reviewer's feedback identifies two robustness improvements in tests/operator-harness/lib.sh: adding type-safety checks to an embedded Python script to prevent potential AttributeError crashes, and ensuring skip_rd and skip_node are non-empty before executing kubectl commands to avoid invalid resource name errors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +261 to +264
for g in groups if isinstance(groups, list) else []:
for d in (g.get('nodes',{}) or {}).get(node,[]) or []:
if d.get('device')==dev:
sys.exit(0)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The embedded Python script parses the REST API response for physical storage. However, if any element in the list is not a dictionary (e.g., due to an unexpected API response format), calling .get() or accessing .get('device') will raise an AttributeError and crash the script. Adding type checks ensures the script is robust against unexpected JSON structures.

Suggested change
for g in groups if isinstance(groups, list) else []:
for d in (g.get('nodes',{}) or {}).get(node,[]) or []:
if d.get('device')==dev:
sys.exit(0)
for g in groups if isinstance(groups, list) else []:
if not isinstance(g, dict): continue
nodes = g.get('nodes', {}) or {}
if not isinstance(nodes, dict): continue
for d in nodes.get(node, []) or []:
if not isinstance(d, dict): continue
if d.get('device')==dev:
sys.exit(0)

Comment on lines +335 to +342
if [[ -n "$skip_state" ]]; then
local cur_state
cur_state=$(kubectl get resource "${skip_rd}.${skip_node}" -o jsonpath='{.status.volumes[0].diskState}' 2>/dev/null || echo "")
if [[ "$cur_state" == "$skip_state" ]]; then
echo " ASSERTION SKIP: kind=$kind reached skip_if_reached=$skip_state on $skip_node before the awaited state — scenario not exercisable on this stand (skip-sync pool)" >&2
return 2
fi
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If skip_state is configured but node or rd is missing or empty in the await spec, running kubectl get resource "." will result in an invalid resource name error. Adding a check to ensure both skip_rd and skip_node are non-empty before executing the kubectl command prevents potential test harness failures.

Suggested change
if [[ -n "$skip_state" ]]; then
local cur_state
cur_state=$(kubectl get resource "${skip_rd}.${skip_node}" -o jsonpath='{.status.volumes[0].diskState}' 2>/dev/null || echo "")
if [[ "$cur_state" == "$skip_state" ]]; then
echo " ASSERTION SKIP: kind=$kind reached skip_if_reached=$skip_state on $skip_node before the awaited state — scenario not exercisable on this stand (skip-sync pool)" >&2
return 2
fi
fi
if [[ -n "$skip_state" && -n "$skip_rd" && -n "$skip_node" ]]; then
local cur_state
cur_state=$(kubectl get resource "${skip_rd}.${skip_node}" -o jsonpath='{.status.volumes[0].diskState}' 2>/dev/null || echo "")
if [[ "$cur_state" == "$skip_state" ]]; then
echo " ASSERTION SKIP: kind=$kind reached skip_if_reached=$skip_state on $skip_node before the awaited state — scenario not exercisable on this stand (skip-sync pool)" >&2
return 2
fi
fi

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/e2e/cli-matrix/rg-c-overcommit-spawn-defers.sh`:
- Around line 132-156: The script hardcodes nodes=3 which breaks on clusters
with >3 workers; replace the static assignment with a dynamic node count and use
that for the upper-bound check. Specifically, change the variable nodes
(currently set as nodes=3) to compute the actual number of cluster worker nodes
(for example nodes="$(kubectl get nodes --no-headers | wc -l)" or another
cluster-appropriate kubectl query), keeping the rest of the logic (diskful
counting and the comparisons against nodes) unchanged so the test no longer
falsely fails on 4+ node clusters.

In `@tests/operator-harness/replay-runner.sh`:
- Around line 97-109: The documentation for the await modifier `skip_if_reached`
is inconsistent: it states that `skip_if_reached` performs a clean SKIP (exit 0)
but elsewhere the contract text still claims exit 0 means a full PASS; update
the contract text so exit code semantics are unified—describe that exit 0 may
indicate either PASS or a clean SKIP caused by `skip_if_reached`, and clearly
state how automation should distinguish SKIP vs PASS (e.g., a separate SKIP
marker/logging or an explicit exit code convention if you prefer a distinct
code); make the change in the same doc block referencing `skip_if_reached` and
the await-modifier contract so both places use the same wording about exit 0
meaning "PASS or clean SKIP" or switch to a distinct non-zero/zero convention
consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3e3ad797-b5b4-4416-b94a-653580728676

📥 Commits

Reviewing files that changed from the base of the PR and between 7389580 and 6efb1e1.

📒 Files selected for processing (9)
  • docs/cli-parity-known-deltas.md
  • tests/e2e/cli-matrix/rg-c-overcommit-spawn-defers.sh
  • tests/e2e/cli-matrix/rg-c-overcommit-spawn-fails.sh
  • tests/operator-harness/lib.sh
  • tests/operator-harness/replay-runner.sh
  • tests/operator-harness/replay/n-rst-tiebreaker-machinery-alive.yaml
  • tests/operator-harness/replay/r-c-with-tiebreaker-peer.yaml
  • tests/operator-harness/replay/r-d-last-uptodate-midsync-rejected.yaml
  • tests/operator-harness/replay/r-d-sub-quorum-teardown-converges.yaml
💤 Files with no reviewable changes (1)
  • tests/e2e/cli-matrix/rg-c-overcommit-spawn-fails.sh

Comment on lines +129 to +158
# Best-effort placement: at least one diskful, never more than the
# topology allows (3 nodes). Count diskful (non-DISKLESS / non-TIE_BREAKER)
# Resource CRDs for the RD, polling briefly so the placer has landed.
nodes=3
diskful=0
for _ in $(seq 1 15); do
diskful=$(kubectl get resources.blockstor.cozystack.io -o json 2>/dev/null \
| RD="$RD" python3 -c "import json,sys,os
d=json.load(sys.stdin)
rd=os.environ['RD']
n=0
for it in d.get('items',[]):
if it.get('spec',{}).get('resourceDefinitionName')!=rd: continue
flags=it.get('spec',{}).get('flags',[]) or []
if 'DISKLESS' in flags or 'TIE_BREAKER' in flags: continue
n+=1
print(n)" 2>/dev/null || echo 0)
[[ "$diskful" -ge 1 ]] && break
sleep 2
done

if [[ "$diskful" -lt 1 ]]; then
echo "FAIL (D1): deferred spawn placed ZERO diskful replicas on a cluster that can host $nodes" >&2
exit 1
fi
if [[ "$diskful" -gt "$nodes" ]]; then
echo "FAIL (D1): deferred spawn OVER-placed $diskful diskful on a $nodes-node cluster" >&2
exit 1
fi
echo ">> deferred spawn placed $diskful diskful (best-effort, <= $nodes nodes) — OK"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add an explicit Status convergence assertion for placed replicas.

This block validates count bounds, but it does not assert replica Status convergence (e.g., placed diskful replicas reaching an expected steady state), so the scenario can pass while state is still transient.

As per coding guidelines, tests/e2e/cli-matrix/**: “L6 operator-CLI e2e tests must test real linstor CLI → REST → satellite → DRBD kernel interactions and assert Status convergence under tests/e2e/cli-matrix/.”

Source: Coding guidelines

Comment on lines +132 to +156
nodes=3
diskful=0
for _ in $(seq 1 15); do
diskful=$(kubectl get resources.blockstor.cozystack.io -o json 2>/dev/null \
| RD="$RD" python3 -c "import json,sys,os
d=json.load(sys.stdin)
rd=os.environ['RD']
n=0
for it in d.get('items',[]):
if it.get('spec',{}).get('resourceDefinitionName')!=rd: continue
flags=it.get('spec',{}).get('flags',[]) or []
if 'DISKLESS' in flags or 'TIE_BREAKER' in flags: continue
n+=1
print(n)" 2>/dev/null || echo 0)
[[ "$diskful" -ge 1 ]] && break
sleep 2
done

if [[ "$diskful" -lt 1 ]]; then
echo "FAIL (D1): deferred spawn placed ZERO diskful replicas on a cluster that can host $nodes" >&2
exit 1
fi
if [[ "$diskful" -gt "$nodes" ]]; then
echo "FAIL (D1): deferred spawn OVER-placed $diskful diskful on a $nodes-node cluster" >&2
exit 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Hardcoded nodes=3 makes this test environment-fragile.

Line [132] assumes exactly 3 workers, but Line [69] only requires a minimum of 3. On stands with 4+ workers, a valid placement can exceed 3 and trigger a false failure at Line [154].

Suggested fix
-nodes=3
+nodes=$(kubectl get nodes -l '!node-role.kubernetes.io/control-plane' --no-headers 2>/dev/null \
+    | awk '$2 == "Ready"' | wc -l | tr -d ' ')
+max_placeable=$(( nodes < 7 ? nodes : 7 ))
@@
-if [[ "$diskful" -gt "$nodes" ]]; then
-    echo "FAIL (D1): deferred spawn OVER-placed $diskful diskful on a $nodes-node cluster" >&2
+if [[ "$diskful" -gt "$max_placeable" ]]; then
+    echo "FAIL (D1): deferred spawn OVER-placed $diskful diskful (max placeable=$max_placeable, ready nodes=$nodes)" >&2
     exit 1
 fi
-echo ">> deferred spawn placed $diskful diskful (best-effort, <= $nodes nodes) — OK"
+echo ">> deferred spawn placed $diskful diskful (best-effort, <= $max_placeable) — OK"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/cli-matrix/rg-c-overcommit-spawn-defers.sh` around lines 132 - 156,
The script hardcodes nodes=3 which breaks on clusters with >3 workers; replace
the static assignment with a dynamic node count and use that for the upper-bound
check. Specifically, change the variable nodes (currently set as nodes=3) to
compute the actual number of cluster worker nodes (for example nodes="$(kubectl
get nodes --no-headers | wc -l)" or another cluster-appropriate kubectl query),
keeping the rest of the logic (diskful counting and the comparisons against
nodes) unchanged so the test no longer falsely fails on 4+ node clusters.

Comment on lines +97 to +109
# Optional await modifier (any disk_state-shaped await with a `node`):
#
# skip_if_reached: <disk_state>
# SKIP the WHOLE workflow cleanly (exit 0, neither PASS nor FAIL)
# if rd@node reaches <disk_state> before the awaited `expected`
# state. For scenarios that need a transient state which a fast
# stand can race past — e.g. the U130 mid-sync rejection needs the
# new replica observed SyncTarget, but a FILE_THIN skip-sync stand
# seeds it straight to UpToDate (no CRD-observable mid-sync). On
# such a stand the scenario is "not exercisable here", not a fault:
# `expected: SyncTarget` + `skip_if_reached: UpToDate` turns the
# would-be timeout FAIL into a clean SKIP. Teardown still runs.
#

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align exit-code documentation with SKIP semantics.

skip_if_reached now documents/implements clean SKIP with exit 0, but Line 127 still says exit 0 means full PASS only. Please unify the contract text to avoid ambiguous automation behavior.

Suggested doc patch
 # Exit codes:
 #
-#   0 — every step + assertion passed, all invariants hold
+#   0 — PASS, or clean SKIP when scenario is not exercisable on this stand
 #   1 — at least one step failed; details on stderr
 #   2 — usage / config error
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/operator-harness/replay-runner.sh` around lines 97 - 109, The
documentation for the await modifier `skip_if_reached` is inconsistent: it
states that `skip_if_reached` performs a clean SKIP (exit 0) but elsewhere the
contract text still claims exit 0 means a full PASS; update the contract text so
exit code semantics are unified—describe that exit 0 may indicate either PASS or
a clean SKIP caused by `skip_if_reached`, and clearly state how automation
should distinguish SKIP vs PASS (e.g., a separate SKIP marker/logging or an
explicit exit code convention if you prefer a distinct code); make the change in
the same doc block referencing `skip_if_reached` and the await-modifier contract
so both places use the same wording about exit 0 meaning "PASS or clean SKIP" or
switch to a distinct non-zero/zero convention consistently.

@kvaps Andrei Kvapil (kvaps) merged commit 170751b into main Jun 13, 2026
27 of 28 checks passed
@kvaps Andrei Kvapil (kvaps) deleted the fix/release-gate-replay-harness branch June 13, 2026 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant