fix(e2e): heal provocation-race wedges in recovery-family scenarios#148
Conversation
The provocation step (drbdadm down + sed + up) deliberately races the satellite's Bug-287 revive. When the two drbdadm invocations interleave badly, drbdmeta apply-al hits EBUSY on the backing device and worker-2's kernel slot ends half-configured: disk attached Inconsistent with AL suspended, peers registered but never connected (StandAlone). The satellite then classifies the slot as an operator disconnect (StandAlone + peer-device entries, the W12 --skip-net guard) and never reconnects it, so the post-recovery UpToDate wait times out. Seen as a ~30% lane-1 flake; not related to the .res node-id mismatch under test. Bounce worker-2 with a bare drbdadm down after the provocation and require kernel-truth UpToDate before applying the SKILL recipe, so the recovery assertions start from a clean slot. Validated 12/12 green on the dev stand (previously ~50% fail). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
📝 WalkthroughWalkthroughAdds deterministic clearing of a “tamper-window wedge” in two e2e recovery tests: one inserts a pre-SKILL down+poll barrier for worker-2; the other adds a conditional single-bounce retry when the first convergence wait times out and a wedge signature is detected. ChangesRecovery test scenario enhancement
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/e2e/recovery-node-id-mismatch.sh (1)
341-342: ⚡ Quick winConsider using JSON parsing for consistency and robustness.
The text parsing via
grep + cut + awkworks but is brittle ifdrbdsetup statusoutput format evolves. The existingkernel_pair_uptodatehelper inlib.sh(context snippet 2) usesdrbdsetup status --json | jqfor the same kernel-truth polling. Aligning with that pattern improves maintainability and reduces parsing fragility.♻️ Proposed JSON-based extraction
- w2_disk=$(on_node "$WORKER_2" drbdsetup status "$RD" 2>/dev/null \ - | grep -m1 -E '^[[:space:]]+disk:' | cut -d: -f2 | awk '{print $1}' || true) + w2_disk=$(on_node "$WORKER_2" drbdsetup status "$RD" --json 2>/dev/null \ + | jq -r '.[0].devices[]? | select(.volume==0) | ."disk-state"' 2>/dev/null || true)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e/recovery-node-id-mismatch.sh` around lines 341 - 342, Replace the brittle text parsing for w2_disk (currently using drbdsetup status piped to grep/cut/awk) with JSON parsing like the kernel_pair_uptodate helper in lib.sh; call drbdsetup status --json and use jq to extract the same "disk" field for the first resource entry (e.g., select the first device/resource and output its disk state with -r) so the script assigns w2_disk robustly from the JSON output instead of relying on fragile text parsing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/e2e/recovery-node-id-mismatch.sh`:
- Around line 341-342: Replace the brittle text parsing for w2_disk (currently
using drbdsetup status piped to grep/cut/awk) with JSON parsing like the
kernel_pair_uptodate helper in lib.sh; call drbdsetup status --json and use jq
to extract the same "disk" field for the first resource entry (e.g., select the
first device/resource and output its disk state with -r) so the script assigns
w2_disk robustly from the JSON output instead of relying on fragile text
parsing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 421d6750-9fe8-42e9-a79f-b8f3cd2daed4
📒 Files selected for processing (1)
tests/e2e/recovery-node-id-mismatch.sh
There was a problem hiding this comment.
Code Review
This pull request adds a step to clear the tamper-window wedge in the recovery-node-id-mismatch.sh end-to-end test by bouncing worker-2 and waiting for it to return to an 'UpToDate' state. However, the regular expression and string parsing used to extract the local disk state from drbdsetup status will fail to match, causing the polling loop to time out and fail the test. A suggestion has been provided to use sed to robustly parse the disk state.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| w2_disk=$(on_node "$WORKER_2" drbdsetup status "$RD" 2>/dev/null \ | ||
| | grep -m1 -E '^[[:space:]]+disk:' | cut -d: -f2 | awk '{print $1}' || true) |
There was a problem hiding this comment.
The regular expression ^[[:space:]]+disk: will not match the local disk line in the output of drbdsetup status.
In DRBD9, the output of drbdsetup status for a local volume is formatted as:
volume:<vnr> minor:<minor> disk:<state> ...
Because the line starts with volume: after the leading spaces, the anchor ^ combined with [[:space:]]+disk: prevents any match. As a result, w2_disk will always be empty, causing the polling loop to run until the 120-second timeout and fail the test.
Additionally, cut -d: -f2 would extract 0 minor instead of the disk state if a match were found on that line.
We can use sed to robustly extract the local disk state by matching [[:space:]]disk: (which avoids matching peer-disk: on peer lines).
| w2_disk=$(on_node "$WORKER_2" drbdsetup status "$RD" 2>/dev/null \ | |
| | grep -m1 -E '^[[:space:]]+disk:' | cut -d: -f2 | awk '{print $1}' || true) | |
| w2_disk=$(on_node "$WORKER_2" drbdsetup status "$RD" 2>/dev/null \ | |
| | sed -n 's/.*[[:space:]]disk:\([^[:space:]]*\).*/\1/p' | head -n1) |
The scenario's bare drbdadm down provocation can rarely collide with the satellite's immediate revive: drbdmeta apply-al hits EBUSY on the backing device and the revived slot ends half-configured — disk Inconsistent, connections StandAlone with peer-device entries retained. That matches the operator-disconnect signature, so the satellite never reconnects the slot and the final convergence wait times out on a provocation artefact, not on the revive path under test. Seen twice on CI lanes with the identical signature. Unlike recovery-node-id-mismatch, the provocation here is already a single writer, so an unconditional bounce would not reduce the collision odds and would dilute the convergence assertion. Instead the heal is conditional: the convergence wait keeps its full untouched budget, and only a timeout that shows the exact wedge signature triggers one clean bounce plus a kernel-truth UpToDate wait before a single re-run of the wait. A regression of the narrowed skip-net gate does not match the signature and still fails loudly. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/e2e/recovery-down-reverses.sh (1)
123-123:⚠️ Potential issue | 🟠 MajorReplace bare EXIT trap with
register_strict_cleanup
tests/e2e/recovery-down-reverses.shsetstrap 'delete_rd "$RD"' EXIT, but e2e cleanup should be registered viaregister_strict_cleanupfromtests/e2e/lib.shso the lane enforces strict post-test cluster baseline checks.Proposed fix
-trap 'delete_rd "$RD"' EXIT +register_strict_cleanup "$RD"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e/recovery-down-reverses.sh` at line 123, Replace the bare EXIT trap with the test harness's strict cleanup registration: remove the line that uses trap 'delete_rd "$RD"' EXIT and call register_strict_cleanup with a command that invokes delete_rd on the RD variable (e.g., register_strict_cleanup "delete_rd \"$RD\""); ensure you reference the existing delete_rd function and the RD variable and use register_strict_cleanup from tests/e2e/lib.sh so the lane runs strict post-test cluster baseline checks.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tests/e2e/recovery-down-reverses.sh`:
- Line 123: Replace the bare EXIT trap with the test harness's strict cleanup
registration: remove the line that uses trap 'delete_rd "$RD"' EXIT and call
register_strict_cleanup with a command that invokes delete_rd on the RD variable
(e.g., register_strict_cleanup "delete_rd \"$RD\""); ensure you reference the
existing delete_rd function and the RD variable and use register_strict_cleanup
from tests/e2e/lib.sh so the lane runs strict post-test cluster baseline checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 20099e5e-4db8-496d-a9bc-5a87b36ce722
📒 Files selected for processing (1)
tests/e2e/recovery-down-reverses.sh
Summary
Two sibling e2e scenarios in the recovery family share the same provocation-race flake: a test-side
drbdadm down-based provocation executed inside a live satellite can collide with the satellite's immediate revive of the downed slot. When the two drbdadm callers interleave badly,drbdmeta apply-alfails with EBUSY (exit 20) on the backing device and the kernel slot ends half-configured: diskInconsistent(al-suspended:yes), connectionsStandAlonewith peer-device entries already registered. That state matches the operator-disconnect signature (shouldSkipNetOnAdjust, the W12 split-brain-recipe guard), so the satellite deliberately never reconnects the slot and the scenario's convergence wait times out on a provocation artefact instead of the path under test.This PR started as a fix for
recovery-node-id-mismatch(~30% lane flake). Its own first CI run then reproduced the identical satellite-log signature inrecovery-down-reverseson lane 4 (also seen earlier on another PR's run), so the scope is widened to cover the recovery-family provocation races.Fix
recovery-node-id-mismatch — the provocation (
drbdadm down+sed+drbdadm up) is a high-probability double-writer race by construction. After the provocation (whose detection remains a soft assert), unconditionally bounce worker-2 with a baredrbdadm downand let the satellite's revive — a single writer this time — bring the slot back up cleanly, then require kernel-truthUpToDatebefore applying the recovery recipe.recovery-down-reverses — the provocation is already the single-writer bare
drbdadm down, so an unconditional bounce would just roll the same dice again and dilute the convergence assertion. The heal is conditional instead: the final Connected+UpToDate wait keeps its full untouched budget, and only a timeout that exhibits the exact wedge signature (StandAlone connections that retained peer-device entries on the provoked node) triggers one clean bounce, a kernel-truth UpToDate wait, and a single re-run of the convergence wait. A genuine regression of the narrowedshouldSkipNetOnAdjustgate (fresh-revive StandAlone without peer-device entries) does not match the signature and still fails loudly with the full forensic dump.No timeouts were inflated and the load-bearing assertions of both scenarios are unchanged.
Audit of the remaining e2e scenarios
The rest of
tests/e2e/was audited for the same disease (test-sidedrbdadm down/up/config mutation inside a live satellite). Left untouched, with reasons:recovery-stuck-synctarget-down-up— the satellite quiet-window during the down+up recipe is the scenario's own contract: it asserts the satellite issues nodrbdadm down/up/adjustduring the window, so healing a revive race would mask the very violation under test.disk-replace-internal-metadata— its down/up recipe runs under a pinnedDrbdOptions/SkipDisk=True, and the reconciler-races-recipe interleavings already have an explicit handling path (recipe failure is caught and classified).recovery-bitmap-drop,recovery-discard-my-data,recovery-suspended-quorum,recovery-stuck-synctarget,split-brain-recovery— disconnect-based provocations:drbdadm disconnectleaves the kernel slot configured (no destroy event, hence no revive; no attach, hence no apply-al), and these scenarios deliberately create the operator-disconnect state their recipes then heal.recovery-quorum-persistence— test-sidedrbdadm adjuston a live attached slot only; no down/up, no re-attach path.recovery-inconsistent-blocking— thedrbdadm upthere is performed by the restarted satellite itself (single writer); the test side only runsinvalidate.respawn-standalone-wedge,state-auto-resync— no test-side drbdadm mutations (CLI-driven provocation / read-only log assertions).Validation
drbdsetup status --jsondump), and the CI lanes on this PR exercise the scenario directly.bash -nandshellcheckclean on both changed files (no new findings vs baseline).Note for follow-up
The underlying product behaviour is worth a separate look: a kernel slot left half-up by an interrupted
drbdadm up/adjust(e.g. a satellite restart at the wrong moment) is indistinguishable, under the current peer-device-entries heuristic, from an operator disconnect — so the satellite never self-heals it. Tightening that disambiguator without breaking the W12 recipe is a non-trivial product question and intentionally out of scope here.Summary by CodeRabbit