Skip to content

fix(e2e): heal provocation-race wedges in recovery-family scenarios#148

Merged
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/recovery-node-id-mismatch-wedge
Jun 12, 2026
Merged

fix(e2e): heal provocation-race wedges in recovery-family scenarios#148
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/recovery-node-id-mismatch-wedge

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jun 12, 2026

Copy link
Copy Markdown
Member

Summary

Two sibling e2e scenarios in the recovery family share the same provocation-race flake: a test-side drbdadm down-based provocation executed inside a live satellite can collide with the satellite's immediate revive of the downed slot. When the two drbdadm callers interleave badly, drbdmeta apply-al fails with EBUSY (exit 20) on the backing device and the kernel slot ends half-configured: disk Inconsistent (al-suspended:yes), connections StandAlone with peer-device entries already registered. That state matches the operator-disconnect signature (shouldSkipNetOnAdjust, the W12 split-brain-recipe guard), so the satellite deliberately never reconnects the slot and the scenario's convergence wait times out on a provocation artefact instead of the path under test.

This PR started as a fix for recovery-node-id-mismatch (~30% lane flake). Its own first CI run then reproduced the identical satellite-log signature in recovery-down-reverses on lane 4 (also seen earlier on another PR's run), so the scope is widened to cover the recovery-family provocation races.

Fix

recovery-node-id-mismatch — the provocation (drbdadm down + sed + drbdadm up) is a high-probability double-writer race by construction. After the provocation (whose detection remains a soft assert), unconditionally bounce worker-2 with a bare drbdadm down and let the satellite's revive — a single writer this time — bring the slot back up cleanly, then require kernel-truth UpToDate before applying the recovery recipe.

recovery-down-reverses — the provocation is already the single-writer bare drbdadm down, so an unconditional bounce would just roll the same dice again and dilute the convergence assertion. The heal is conditional instead: the final Connected+UpToDate wait keeps its full untouched budget, and only a timeout that exhibits the exact wedge signature (StandAlone connections that retained peer-device entries on the provoked node) triggers one clean bounce, a kernel-truth UpToDate wait, and a single re-run of the convergence wait. A genuine regression of the narrowed shouldSkipNetOnAdjust gate (fresh-revive StandAlone without peer-device entries) does not match the signature and still fails loudly with the full forensic dump.

No timeouts were inflated and the load-bearing assertions of both scenarios are unchanged.

Audit of the remaining e2e scenarios

The rest of tests/e2e/ was audited for the same disease (test-side drbdadm down/up/config mutation inside a live satellite). Left untouched, with reasons:

  • recovery-stuck-synctarget-down-up — the satellite quiet-window during the down+up recipe is the scenario's own contract: it asserts the satellite issues no drbdadm down/up/adjust during the window, so healing a revive race would mask the very violation under test.
  • disk-replace-internal-metadata — its down/up recipe runs under a pinned DrbdOptions/SkipDisk=True, and the reconciler-races-recipe interleavings already have an explicit handling path (recipe failure is caught and classified).
  • recovery-bitmap-drop, recovery-discard-my-data, recovery-suspended-quorum, recovery-stuck-synctarget, split-brain-recovery — disconnect-based provocations: drbdadm disconnect leaves the kernel slot configured (no destroy event, hence no revive; no attach, hence no apply-al), and these scenarios deliberately create the operator-disconnect state their recipes then heal.
  • recovery-quorum-persistence — test-side drbdadm adjust on a live attached slot only; no down/up, no re-attach path.
  • recovery-inconsistent-blocking — the drbdadm up there is performed by the restarted satellite itself (single writer); the test side only runs invalidate.
  • respawn-standalone-wedge, state-auto-resync — no test-side drbdadm mutations (CLI-driven provocation / read-only log assertions).

Validation

  • node-id-mismatch fix: reproduced the failure on the dev stand and captured the wedge state — identical to the CI failures. With the fix: 12/12 consecutive green runs; satellite logs confirm the apply-al EBUSY collision still occurred in about half of those runs and was healed by the bounce each time.
  • down-reverses fix: intentionally not re-validated on the stand. The heal reuses the exact bounce + kernel-truth-wait pattern already stand-proven 12/12 above, the conditional path only engages on the precise wedge signature captured from the lane-4 failure (the jq detector was verified against that captured drbdsetup status --json dump), and the CI lanes on this PR exercise the scenario directly.
  • bash -n and shellcheck clean on both changed files (no new findings vs baseline).

Note for follow-up

The underlying product behaviour is worth a separate look: a kernel slot left half-up by an interrupted drbdadm up/adjust (e.g. a satellite restart at the wrong moment) is indistinguishable, under the current peer-device-entries heuristic, from an operator disconnect — so the satellite never self-heals it. Tightening that disambiguator without breaking the W12 recipe is a non-trivial product question and intentionally out of scope here.

Summary by CodeRabbit

  • Tests
    • Improved end-to-end recovery test stability by adding a deterministic pre-recovery cleanup that clears a transient wedge and prevents recovery timeouts.
    • Expanded recovery scenario coverage to detect and handle the wedge case automatically; test now retries recovery and optionally annotates the PASS when the extra cleanup path was taken.

The provocation step (drbdadm down + sed + up) deliberately races
the satellite's Bug-287 revive. When the two drbdadm invocations
interleave badly, drbdmeta apply-al hits EBUSY on the backing
device and worker-2's kernel slot ends half-configured: disk
attached Inconsistent with AL suspended, peers registered but
never connected (StandAlone). The satellite then classifies the
slot as an operator disconnect (StandAlone + peer-device entries,
the W12 --skip-net guard) and never reconnects it, so the
post-recovery UpToDate wait times out. Seen as a ~30% lane-1
flake; not related to the .res node-id mismatch under test.

Bounce worker-2 with a bare drbdadm down after the provocation
and require kernel-truth UpToDate before applying the SKILL
recipe, so the recovery assertions start from a clean slot.

Validated 12/12 green on the dev stand (previously ~50% fail).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds deterministic clearing of a “tamper-window wedge” in two e2e recovery tests: one inserts a pre-SKILL down+poll barrier for worker-2; the other adds a conditional single-bounce retry when the first convergence wait times out and a wedge signature is detected.

Changes

Recovery test scenario enhancement

Layer / File(s) Summary
Pre-SKILL worker-2 recovery barrier
tests/e2e/recovery-node-id-mismatch.sh
Adds a step that runs drbdadm down on worker-2 and polls drbdsetup status until the local disk state is UpToDate before proceeding to the SKILL recipe.
Conditional tamper-window bounce and two-attempt convergence
tests/e2e/recovery-down-reverses.sh
Updates Steps 4–5 docs; replaces single convergence wait with a two-attempt loop that on first timeout detects a wedge via drbdsetup status --json + jq, issues one drbdadm down, polls until UpToDate, then rechecks Connected+UpToDate and annotates PASS if bounce occurred.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • cozystack/blockstor#50: Modifies the same e2e scenario script tests/e2e/recovery-down-reverses.sh and related convergence/wait handling after drbdadm down.

Poem

🐰 A wedge crept in, the race was mean,
So worker-2 took a polite downing scene.
We poll the disk until it's true,
UpToDate shows — the tests pass through.
Hop, patch, place — the logs look clean! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: fixing E2E test flakes in recovery-family scenarios by healing provocation-race wedges. It is specific and directly related to the changeset modifications.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/recovery-node-id-mismatch-wedge

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/e2e/recovery-node-id-mismatch.sh (1)

341-342: ⚡ Quick win

Consider using JSON parsing for consistency and robustness.

The text parsing via grep + cut + awk works but is brittle if drbdsetup status output format evolves. The existing kernel_pair_uptodate helper in lib.sh (context snippet 2) uses drbdsetup status --json | jq for the same kernel-truth polling. Aligning with that pattern improves maintainability and reduces parsing fragility.

♻️ Proposed JSON-based extraction
-    w2_disk=$(on_node "$WORKER_2" drbdsetup status "$RD" 2>/dev/null \
-        | grep -m1 -E '^[[:space:]]+disk:' | cut -d: -f2 | awk '{print $1}' || true)
+    w2_disk=$(on_node "$WORKER_2" drbdsetup status "$RD" --json 2>/dev/null \
+        | jq -r '.[0].devices[]? | select(.volume==0) | ."disk-state"' 2>/dev/null || true)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/recovery-node-id-mismatch.sh` around lines 341 - 342, Replace the
brittle text parsing for w2_disk (currently using drbdsetup status piped to
grep/cut/awk) with JSON parsing like the kernel_pair_uptodate helper in lib.sh;
call drbdsetup status --json and use jq to extract the same "disk" field for the
first resource entry (e.g., select the first device/resource and output its disk
state with -r) so the script assigns w2_disk robustly from the JSON output
instead of relying on fragile text parsing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/e2e/recovery-node-id-mismatch.sh`:
- Around line 341-342: Replace the brittle text parsing for w2_disk (currently
using drbdsetup status piped to grep/cut/awk) with JSON parsing like the
kernel_pair_uptodate helper in lib.sh; call drbdsetup status --json and use jq
to extract the same "disk" field for the first resource entry (e.g., select the
first device/resource and output its disk state with -r) so the script assigns
w2_disk robustly from the JSON output instead of relying on fragile text
parsing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 421d6750-9fe8-42e9-a79f-b8f3cd2daed4

📥 Commits

Reviewing files that changed from the base of the PR and between c54be3b and cb04aa2.

📒 Files selected for processing (1)
  • tests/e2e/recovery-node-id-mismatch.sh

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a step to clear the tamper-window wedge in the recovery-node-id-mismatch.sh end-to-end test by bouncing worker-2 and waiting for it to return to an 'UpToDate' state. However, the regular expression and string parsing used to extract the local disk state from drbdsetup status will fail to match, causing the polling loop to time out and fail the test. A suggestion has been provided to use sed to robustly parse the disk state.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +341 to +342
w2_disk=$(on_node "$WORKER_2" drbdsetup status "$RD" 2>/dev/null \
| grep -m1 -E '^[[:space:]]+disk:' | cut -d: -f2 | awk '{print $1}' || true)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The regular expression ^[[:space:]]+disk: will not match the local disk line in the output of drbdsetup status.

In DRBD9, the output of drbdsetup status for a local volume is formatted as:

  volume:<vnr> minor:<minor> disk:<state> ...

Because the line starts with volume: after the leading spaces, the anchor ^ combined with [[:space:]]+disk: prevents any match. As a result, w2_disk will always be empty, causing the polling loop to run until the 120-second timeout and fail the test.

Additionally, cut -d: -f2 would extract 0 minor instead of the disk state if a match were found on that line.

We can use sed to robustly extract the local disk state by matching [[:space:]]disk: (which avoids matching peer-disk: on peer lines).

Suggested change
w2_disk=$(on_node "$WORKER_2" drbdsetup status "$RD" 2>/dev/null \
| grep -m1 -E '^[[:space:]]+disk:' | cut -d: -f2 | awk '{print $1}' || true)
w2_disk=$(on_node "$WORKER_2" drbdsetup status "$RD" 2>/dev/null \
| sed -n 's/.*[[:space:]]disk:\([^[:space:]]*\).*/\1/p' | head -n1)

The scenario's bare drbdadm down provocation can rarely collide
with the satellite's immediate revive: drbdmeta apply-al hits
EBUSY on the backing device and the revived slot ends
half-configured — disk Inconsistent, connections StandAlone with
peer-device entries retained. That matches the operator-disconnect
signature, so the satellite never reconnects the slot and the
final convergence wait times out on a provocation artefact, not on
the revive path under test. Seen twice on CI lanes with the
identical signature.

Unlike recovery-node-id-mismatch, the provocation here is already
a single writer, so an unconditional bounce would not reduce the
collision odds and would dilute the convergence assertion. Instead
the heal is conditional: the convergence wait keeps its full
untouched budget, and only a timeout that shows the exact wedge
signature triggers one clean bounce plus a kernel-truth UpToDate
wait before a single re-run of the wait. A regression of the
narrowed skip-net gate does not match the signature and still
fails loudly.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps Andrei Kvapil (kvaps) changed the title fix(e2e): clear tamper-window wedge in recovery-node-id-mismatch fix(e2e): heal provocation-race wedges in recovery-family scenarios Jun 12, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/e2e/recovery-down-reverses.sh (1)

123-123: ⚠️ Potential issue | 🟠 Major

Replace bare EXIT trap with register_strict_cleanup
tests/e2e/recovery-down-reverses.sh sets trap 'delete_rd "$RD"' EXIT, but e2e cleanup should be registered via register_strict_cleanup from tests/e2e/lib.sh so the lane enforces strict post-test cluster baseline checks.

Proposed fix
-trap 'delete_rd "$RD"' EXIT
+register_strict_cleanup "$RD"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/recovery-down-reverses.sh` at line 123, Replace the bare EXIT trap
with the test harness's strict cleanup registration: remove the line that uses
trap 'delete_rd "$RD"' EXIT and call register_strict_cleanup with a command that
invokes delete_rd on the RD variable (e.g., register_strict_cleanup "delete_rd
\"$RD\""); ensure you reference the existing delete_rd function and the RD
variable and use register_strict_cleanup from tests/e2e/lib.sh so the lane runs
strict post-test cluster baseline checks.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tests/e2e/recovery-down-reverses.sh`:
- Line 123: Replace the bare EXIT trap with the test harness's strict cleanup
registration: remove the line that uses trap 'delete_rd "$RD"' EXIT and call
register_strict_cleanup with a command that invokes delete_rd on the RD variable
(e.g., register_strict_cleanup "delete_rd \"$RD\""); ensure you reference the
existing delete_rd function and the RD variable and use register_strict_cleanup
from tests/e2e/lib.sh so the lane runs strict post-test cluster baseline checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 20099e5e-4db8-496d-a9bc-5a87b36ce722

📥 Commits

Reviewing files that changed from the base of the PR and between cb04aa2 and 496abca.

📒 Files selected for processing (1)
  • tests/e2e/recovery-down-reverses.sh

@kvaps Andrei Kvapil (kvaps) merged commit 9c84104 into main Jun 12, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant