test(harness): fix release-gate replay/cli-matrix triage failures (tiebreaker, quorum, mid-sync, over-commit, ps-cdp)#156
Conversation
Two operator-replay harness robustness fixes surfaced by the release-gate sweep on stand dev: - fixture_device_on_any_worker probed only `test -b <dev>` on ANY worker. The kernel pre-creates loop device NODES whether or not anything is attached, so the bare loop node passed the gate on every stand even when nothing was attached -- and ps-cdp-zfs then FAILed (exit 10, "no free PhysicalDevice") instead of SKIPping. The probe is now node-scoped (the workflow pins the create to node1) and consults the controller's authoritative free physical-device view (REST GET /v1/physical-storage), so an absent / in-use device yields a clean SKIP. - Add an optional `skip_if_reached: <disk_state>` await modifier: SKIP the whole workflow cleanly when the target reaches that state before the awaited one. Lets a scenario that needs a transient state a fast stand can race past (e.g. a mid-sync window on a skip-sync pool) SKIP as not-exercisable instead of timing out to a false FAIL. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Four release-gate replay workflows asserted conditions that the candidate's correct, deliberate behaviour can never satisfy on stand dev. All four are harness bugs (no assertion weakening; each new assertion is a true, observable condition): - r-c-with-tiebreaker-peer: after dropping the diskful peer the shape is 1 diskful + 1 user-diskless. The hand-created node3 carries DISKLESS only (not TIE_BREAKER), and at 1 diskful no auto-witness is managed (shouldTieBreakerExist needs >=2 diskful) -- so the `tiebreaker_present` grep for "TieBreaker" could never match. Assert the observable truth (node3 stays Diskless); the Bug-327 contract (bare `r c` comes up DISKFUL) is unchanged. - r-d-sub-quorum-teardown-converges: after `r d node1` on a 3-diskful RD the shape drops to even 2-diskful, re-arming the auto-tiebreaker, which re-occupies the just-freed node1 with a witness -- so `resource_absent node1` could never hold. Assert node1's diskful backing is gone (Diskless), then the second delete reaps the witness too (both nodes absent), the survivor stays UpToDate, and the RD deletes cleanly -- the real U197 sub-quorum-converges contract. - n-rst-tiebreaker-machinery-alive: `r d` of the witness stamps the DELIBERATE 5-minute auto-tiebreaker SUPPRESSION window, so the witness cannot re-add within the old 120s await. Bump to 330s (5min window + margin) -- bounded on the known product constant autoTiebreakerSuppressionWindow, not blind inflation. - r-d-last-uptodate-midsync-rejected: the `stand` FILE_THIN pool skip-syncs a fresh replica (day0 skip-initial-sync), so node2 reaches UpToDate in <10s without ever exposing a CRD-observable mid-sync state -- the U130 rejection window is not exercisable as a pure-CLI replay here (the contract stays pinned at L1 + the L6 cell). Add `skip_if_reached: UpToDate` so it SKIPs cleanly instead of timing out. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The rg-c-overcommit-spawn-fails cell asserted the UPSTREAM contract (spawn-resources fails short with FAIL_NOT_ENOUGH_NODES / 996 and places nothing) on an over-committed resource group. blockstor deliberately diverges: spawn takes a best-effort DEFERRED autoplace path -- it spawns the RD, places as many diskful replicas as the topology allows (3 of 7 on a 3-node cluster), and surfaces the shortfall as an INFO in the SUCCESS envelope (exit 0). This is pinned at L1 (TestSpawnImpossiblePlacementReturnsActionableError) and backed by the RGRebalanceReconciler additive backfill, so the cell was asserting a contract the product intentionally does not implement. Rename the cell to rg-c-overcommit-spawn-defers and rewrite it to pin the actual contract: create accepts pc=7, spawn succeeds with the deferred-autoplace envelope, RD survives, best-effort places 1..nodes diskful (never over- or under-places). Document the divergence as known-delta row 83. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
📝 WalkthroughWalkthroughThis PR extends the operator test harness with conditional workflow skipping to handle variable hardware configurations, updates multiple replay test scenarios with corrected resource state expectations, and adds a new e2e CLI test validating deferred best-effort spawn behavior on over-committed resource groups. ChangesOver-committed RG spawn deferred behavior
Test harness skip-if-reached conditional gating
Operator replay test scenario updates
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request updates the test harness and replay tests to accommodate blockstor's deliberate behavioral differences from upstream LINSTOR, such as deferred autoplacement on over-committed resource groups and auto-tiebreaker suppression windows. It introduces a skip_if_reached mechanism to cleanly skip test scenarios that are not exercisable on fast-syncing stands. The reviewer's feedback identifies two robustness improvements in tests/operator-harness/lib.sh: adding type-safety checks to an embedded Python script to prevent potential AttributeError crashes, and ensuring skip_rd and skip_node are non-empty before executing kubectl commands to avoid invalid resource name errors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| for g in groups if isinstance(groups, list) else []: | ||
| for d in (g.get('nodes',{}) or {}).get(node,[]) or []: | ||
| if d.get('device')==dev: | ||
| sys.exit(0) |
There was a problem hiding this comment.
The embedded Python script parses the REST API response for physical storage. However, if any element in the list is not a dictionary (e.g., due to an unexpected API response format), calling .get() or accessing .get('device') will raise an AttributeError and crash the script. Adding type checks ensures the script is robust against unexpected JSON structures.
| for g in groups if isinstance(groups, list) else []: | |
| for d in (g.get('nodes',{}) or {}).get(node,[]) or []: | |
| if d.get('device')==dev: | |
| sys.exit(0) | |
| for g in groups if isinstance(groups, list) else []: | |
| if not isinstance(g, dict): continue | |
| nodes = g.get('nodes', {}) or {} | |
| if not isinstance(nodes, dict): continue | |
| for d in nodes.get(node, []) or []: | |
| if not isinstance(d, dict): continue | |
| if d.get('device')==dev: | |
| sys.exit(0) |
| if [[ -n "$skip_state" ]]; then | ||
| local cur_state | ||
| cur_state=$(kubectl get resource "${skip_rd}.${skip_node}" -o jsonpath='{.status.volumes[0].diskState}' 2>/dev/null || echo "") | ||
| if [[ "$cur_state" == "$skip_state" ]]; then | ||
| echo " ASSERTION SKIP: kind=$kind reached skip_if_reached=$skip_state on $skip_node before the awaited state — scenario not exercisable on this stand (skip-sync pool)" >&2 | ||
| return 2 | ||
| fi | ||
| fi |
There was a problem hiding this comment.
If skip_state is configured but node or rd is missing or empty in the await spec, running kubectl get resource "." will result in an invalid resource name error. Adding a check to ensure both skip_rd and skip_node are non-empty before executing the kubectl command prevents potential test harness failures.
| if [[ -n "$skip_state" ]]; then | |
| local cur_state | |
| cur_state=$(kubectl get resource "${skip_rd}.${skip_node}" -o jsonpath='{.status.volumes[0].diskState}' 2>/dev/null || echo "") | |
| if [[ "$cur_state" == "$skip_state" ]]; then | |
| echo " ASSERTION SKIP: kind=$kind reached skip_if_reached=$skip_state on $skip_node before the awaited state — scenario not exercisable on this stand (skip-sync pool)" >&2 | |
| return 2 | |
| fi | |
| fi | |
| if [[ -n "$skip_state" && -n "$skip_rd" && -n "$skip_node" ]]; then | |
| local cur_state | |
| cur_state=$(kubectl get resource "${skip_rd}.${skip_node}" -o jsonpath='{.status.volumes[0].diskState}' 2>/dev/null || echo "") | |
| if [[ "$cur_state" == "$skip_state" ]]; then | |
| echo " ASSERTION SKIP: kind=$kind reached skip_if_reached=$skip_state on $skip_node before the awaited state — scenario not exercisable on this stand (skip-sync pool)" >&2 | |
| return 2 | |
| fi | |
| fi |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/e2e/cli-matrix/rg-c-overcommit-spawn-defers.sh`:
- Around line 132-156: The script hardcodes nodes=3 which breaks on clusters
with >3 workers; replace the static assignment with a dynamic node count and use
that for the upper-bound check. Specifically, change the variable nodes
(currently set as nodes=3) to compute the actual number of cluster worker nodes
(for example nodes="$(kubectl get nodes --no-headers | wc -l)" or another
cluster-appropriate kubectl query), keeping the rest of the logic (diskful
counting and the comparisons against nodes) unchanged so the test no longer
falsely fails on 4+ node clusters.
In `@tests/operator-harness/replay-runner.sh`:
- Around line 97-109: The documentation for the await modifier `skip_if_reached`
is inconsistent: it states that `skip_if_reached` performs a clean SKIP (exit 0)
but elsewhere the contract text still claims exit 0 means a full PASS; update
the contract text so exit code semantics are unified—describe that exit 0 may
indicate either PASS or a clean SKIP caused by `skip_if_reached`, and clearly
state how automation should distinguish SKIP vs PASS (e.g., a separate SKIP
marker/logging or an explicit exit code convention if you prefer a distinct
code); make the change in the same doc block referencing `skip_if_reached` and
the await-modifier contract so both places use the same wording about exit 0
meaning "PASS or clean SKIP" or switch to a distinct non-zero/zero convention
consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3e3ad797-b5b4-4416-b94a-653580728676
📒 Files selected for processing (9)
docs/cli-parity-known-deltas.mdtests/e2e/cli-matrix/rg-c-overcommit-spawn-defers.shtests/e2e/cli-matrix/rg-c-overcommit-spawn-fails.shtests/operator-harness/lib.shtests/operator-harness/replay-runner.shtests/operator-harness/replay/n-rst-tiebreaker-machinery-alive.yamltests/operator-harness/replay/r-c-with-tiebreaker-peer.yamltests/operator-harness/replay/r-d-last-uptodate-midsync-rejected.yamltests/operator-harness/replay/r-d-sub-quorum-teardown-converges.yaml
💤 Files with no reviewable changes (1)
- tests/e2e/cli-matrix/rg-c-overcommit-spawn-fails.sh
| # Best-effort placement: at least one diskful, never more than the | ||
| # topology allows (3 nodes). Count diskful (non-DISKLESS / non-TIE_BREAKER) | ||
| # Resource CRDs for the RD, polling briefly so the placer has landed. | ||
| nodes=3 | ||
| diskful=0 | ||
| for _ in $(seq 1 15); do | ||
| diskful=$(kubectl get resources.blockstor.cozystack.io -o json 2>/dev/null \ | ||
| | RD="$RD" python3 -c "import json,sys,os | ||
| d=json.load(sys.stdin) | ||
| rd=os.environ['RD'] | ||
| n=0 | ||
| for it in d.get('items',[]): | ||
| if it.get('spec',{}).get('resourceDefinitionName')!=rd: continue | ||
| flags=it.get('spec',{}).get('flags',[]) or [] | ||
| if 'DISKLESS' in flags or 'TIE_BREAKER' in flags: continue | ||
| n+=1 | ||
| print(n)" 2>/dev/null || echo 0) | ||
| [[ "$diskful" -ge 1 ]] && break | ||
| sleep 2 | ||
| done | ||
|
|
||
| if [[ "$diskful" -lt 1 ]]; then | ||
| echo "FAIL (D1): deferred spawn placed ZERO diskful replicas on a cluster that can host $nodes" >&2 | ||
| exit 1 | ||
| fi | ||
| if [[ "$diskful" -gt "$nodes" ]]; then | ||
| echo "FAIL (D1): deferred spawn OVER-placed $diskful diskful on a $nodes-node cluster" >&2 | ||
| exit 1 | ||
| fi | ||
| echo ">> deferred spawn placed $diskful diskful (best-effort, <= $nodes nodes) — OK" |
There was a problem hiding this comment.
Add an explicit Status convergence assertion for placed replicas.
This block validates count bounds, but it does not assert replica Status convergence (e.g., placed diskful replicas reaching an expected steady state), so the scenario can pass while state is still transient.
As per coding guidelines, tests/e2e/cli-matrix/**: “L6 operator-CLI e2e tests must test real linstor CLI → REST → satellite → DRBD kernel interactions and assert Status convergence under tests/e2e/cli-matrix/.”
Source: Coding guidelines
| nodes=3 | ||
| diskful=0 | ||
| for _ in $(seq 1 15); do | ||
| diskful=$(kubectl get resources.blockstor.cozystack.io -o json 2>/dev/null \ | ||
| | RD="$RD" python3 -c "import json,sys,os | ||
| d=json.load(sys.stdin) | ||
| rd=os.environ['RD'] | ||
| n=0 | ||
| for it in d.get('items',[]): | ||
| if it.get('spec',{}).get('resourceDefinitionName')!=rd: continue | ||
| flags=it.get('spec',{}).get('flags',[]) or [] | ||
| if 'DISKLESS' in flags or 'TIE_BREAKER' in flags: continue | ||
| n+=1 | ||
| print(n)" 2>/dev/null || echo 0) | ||
| [[ "$diskful" -ge 1 ]] && break | ||
| sleep 2 | ||
| done | ||
|
|
||
| if [[ "$diskful" -lt 1 ]]; then | ||
| echo "FAIL (D1): deferred spawn placed ZERO diskful replicas on a cluster that can host $nodes" >&2 | ||
| exit 1 | ||
| fi | ||
| if [[ "$diskful" -gt "$nodes" ]]; then | ||
| echo "FAIL (D1): deferred spawn OVER-placed $diskful diskful on a $nodes-node cluster" >&2 | ||
| exit 1 |
There was a problem hiding this comment.
Hardcoded nodes=3 makes this test environment-fragile.
Line [132] assumes exactly 3 workers, but Line [69] only requires a minimum of 3. On stands with 4+ workers, a valid placement can exceed 3 and trigger a false failure at Line [154].
Suggested fix
-nodes=3
+nodes=$(kubectl get nodes -l '!node-role.kubernetes.io/control-plane' --no-headers 2>/dev/null \
+ | awk '$2 == "Ready"' | wc -l | tr -d ' ')
+max_placeable=$(( nodes < 7 ? nodes : 7 ))
@@
-if [[ "$diskful" -gt "$nodes" ]]; then
- echo "FAIL (D1): deferred spawn OVER-placed $diskful diskful on a $nodes-node cluster" >&2
+if [[ "$diskful" -gt "$max_placeable" ]]; then
+ echo "FAIL (D1): deferred spawn OVER-placed $diskful diskful (max placeable=$max_placeable, ready nodes=$nodes)" >&2
exit 1
fi
-echo ">> deferred spawn placed $diskful diskful (best-effort, <= $nodes nodes) — OK"
+echo ">> deferred spawn placed $diskful diskful (best-effort, <= $max_placeable) — OK"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/e2e/cli-matrix/rg-c-overcommit-spawn-defers.sh` around lines 132 - 156,
The script hardcodes nodes=3 which breaks on clusters with >3 workers; replace
the static assignment with a dynamic node count and use that for the upper-bound
check. Specifically, change the variable nodes (currently set as nodes=3) to
compute the actual number of cluster worker nodes (for example nodes="$(kubectl
get nodes --no-headers | wc -l)" or another cluster-appropriate kubectl query),
keeping the rest of the logic (diskful counting and the comparisons against
nodes) unchanged so the test no longer falsely fails on 4+ node clusters.
| # Optional await modifier (any disk_state-shaped await with a `node`): | ||
| # | ||
| # skip_if_reached: <disk_state> | ||
| # SKIP the WHOLE workflow cleanly (exit 0, neither PASS nor FAIL) | ||
| # if rd@node reaches <disk_state> before the awaited `expected` | ||
| # state. For scenarios that need a transient state which a fast | ||
| # stand can race past — e.g. the U130 mid-sync rejection needs the | ||
| # new replica observed SyncTarget, but a FILE_THIN skip-sync stand | ||
| # seeds it straight to UpToDate (no CRD-observable mid-sync). On | ||
| # such a stand the scenario is "not exercisable here", not a fault: | ||
| # `expected: SyncTarget` + `skip_if_reached: UpToDate` turns the | ||
| # would-be timeout FAIL into a clean SKIP. Teardown still runs. | ||
| # |
There was a problem hiding this comment.
Align exit-code documentation with SKIP semantics.
skip_if_reached now documents/implements clean SKIP with exit 0, but Line 127 still says exit 0 means full PASS only. Please unify the contract text to avoid ambiguous automation behavior.
Suggested doc patch
# Exit codes:
#
-# 0 — every step + assertion passed, all invariants hold
+# 0 — PASS, or clean SKIP when scenario is not exercisable on this stand
# 1 — at least one step failed; details on stderr
# 2 — usage / config error🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/operator-harness/replay-runner.sh` around lines 97 - 109, The
documentation for the await modifier `skip_if_reached` is inconsistent: it
states that `skip_if_reached` performs a clean SKIP (exit 0) but elsewhere the
contract text still claims exit 0 means a full PASS; update the contract text so
exit code semantics are unified—describe that exit 0 may indicate either PASS or
a clean SKIP caused by `skip_if_reached`, and clearly state how automation
should distinguish SKIP vs PASS (e.g., a separate SKIP marker/logging or an
explicit exit code convention if you prefer a distinct code); make the change in
the same doc block referencing `skip_if_reached` and the await-modifier contract
so both places use the same wording about exit 0 meaning "PASS or clean SKIP" or
switch to a distinct non-zero/zero convention consistently.
Harness/docs-only fixes for 7 of the 8 replay/cli-matrix failures surfaced by the release-gate stand sweep at the candidate SHA. No product code changed. (The 8th, an
inactive-return-backfills-redundancyfailure, is a genuine product race tracked separately — see below.)Verdicts
r dstamps an intentional 5-min suppression window; assert waited only 120s → guaranteed timeout--disklessnode is never flaggedTieBreaker, and at 1 diskful no auto-witness is managed →tiebreaker_presentcould never matchr d node1(3→2 diskful) the auto-tiebreaker legally relocates onto the freed node1 →resource_absent node1unsatisfiable/dev/loop9exists as a device node but is not a free PhysicalDevice; the free-device gate was too weak…-spawn-defers, known-delta row 83Tests 1/2/3 share #154's lesson (the auto-witness legally relocates onto a freed/sole spare node; the harness must not wait for a phantom witness or
resource_absentthere) but live in separate replay YAMLs #154 didn't touch.Changes
/v1/physical-storage; newskip_if_reachedmodifier for clean SKIP when a scenario is not reproducible on a fast stand.…-spawn-fails→…-spawn-deferspinning the real BS deferred-autoplace contract; known-delta row 83.Validated on the stand 3× each (PASS/SKIP as classified).
Product bug found, tracked separately (NOT in this PR)
inactive-return-backfills-redundancyis a genuine TOCTOU race: the witness reaperremoveWitnessesdoes a non-atomic Get→Delete and can clobber a replica the placer just promoted, losing the redundancy backfill. Medium-risk fix on the tiebreaker+autoplace surface; needs its own PR with L6/L7 validation.Summary by CodeRabbit
New Features
Tests
Documentation