test(e2e): 2TiB quorum-loss + no-reboot recovery release gate (COV-011)#146

Merged
Andrei Kvapil (kvaps) merged 3 commits into main from test/quorum-loss-2tb-recovery on Jun 12, 2026

@kvaps (Member)

What

Adds the missing release-gate coverage for blocker COV-011: an e2e scenario (tests/e2e/quorum-loss-2tb-recovery.sh) proving that on a large (2 TiB) zfs-thick-backed volume, losing DRBD quorum under active IO suspends IO per the suspend-io policy and — critically — IO resumes and the cluster heals without any node reboot once quorum returns.

Why

The production fear is a DRBD deadlock where quorum loss under IO wedges volumes such that only node reboots recover. No existing scenario covers the large-volume quorum-loss + recovery leg; this gate pins the full contract end to end.

Scenario contract

On the canonical 2-diskful + auto-tiebreaker shape (quorum=majority, on-no-quorum=suspend-io):

  1. Preflight gates on the stand-"big" substrate: a zfs-thick StoragePool (providerKind ZFS) on all 3 workers plus 2 TiB + 5% of free pool capacity per diskful node; SKIPs elsewhere with a clear message.
  2. 2 TiB RD/VD creation must reach UpToDate inside a bounded window — thick ZFS zvols are in the skip-initial-sync class, so a full 2 TiB initial sync is itself a gate failure (the skipInitialSync=true stamp is asserted explicitly).
  3. A seeded marker region (md5 captured) plus a continuous 1-tick/s direct-IO writer run against the Primary throughout.
  4. Secondary-only outage: quorum must HOLD via the witness (writer keeps ticking, no suspension), then heal and resync clean.
  5. Secondary + witness outage mid-IO: the Primary must report quorum lost and IO must SUSPEND — the writer freezes with zero write errors (suspend-io must block, never EIO), no crash.
  6. Restore: within bounded windows quorum returns, the frozen writer resumes and completes cleanly, both replicas return to UpToDate, the marker md5 is intact, DRBD status is clean on all workers, the single-Primary invariant holds — and boot_id is unchanged (with monotonically increased uptime) on every worker: the explicit no-reboot assert.
  7. Teardown with a no-orphans assert across all three layers (CRDs, zvols, kernel slots); full diagnostics dump on any failure.
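
The no-reboot assert in step 6 can be sketched roughly as follows. This is a minimal illustration, not the scenario's actual code: the helper names (read_boot_id, read_uptime_s, assert_no_reboot) are hypothetical, and how the real script collects these values from each worker is not shown in this PR. The only assumption about the platform is the standard Linux /proc interface.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the kernel generates a fresh boot_id on every boot.
read_boot_id() {
    cat /proc/sys/kernel/random/boot_id
}

# Hypothetical helper: whole seconds of uptime, taken from /proc/uptime.
read_uptime_s() {
    local up
    read -r up _ < /proc/uptime
    printf '%s\n' "${up%%.*}"
}

# Succeeds only if boot_id is unchanged and uptime did not go backwards.
# Arguments: <boot_id before> <uptime before> <boot_id after> <uptime after>
assert_no_reboot() {
    local before_id=$1 before_up=$2 after_id=$3 after_up=$4
    if [[ "$before_id" != "$after_id" ]]; then
        echo "boot_id changed: $before_id -> $after_id" >&2
        return 1
    fi
    if (( after_up < before_up )); then
        echo "uptime went backwards: $before_up -> $after_up" >&2
        return 1
    fi
    return 0
}
```

A caller would sample both values per worker before the outage and again after the heal, then pass the four values to assert_no_reboot.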

Every wait is bounded by an explicit deadline on a concrete condition — no blind sleeps on the critical path.
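
A bounded wait of this kind might look like the sketch below. The helper name wait_until and its argument order are assumptions made for illustration; the real helpers live in lib.sh and are not reproduced in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a condition until it holds or the deadline passes.
# Usage: wait_until <timeout_s> <interval_s> <command> [args...]
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    # SECONDS is a bash builtin counting seconds since the shell started.
    local deadline=$(( SECONDS + timeout ))
    until "$@"; do
        if (( SECONDS >= deadline )); then
            echo "timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
    done
}
```

The condition is any command that exits 0 once the awaited state is reached, so a timeout is reported against a concrete predicate rather than hidden inside a fixed sleep.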

Outage mechanism

Per-link iptables DROP of the resource's DRBD mesh port (all four src/dst x sport/dport combinations), using the drop_pair recipe already proven by quorum-tiebreaker-no-return.sh and the partition scenarios. This is the strongest node-outage model reachable from a scenario on this stand. It breaks the kernel replication links the way a dead node does: there is no graceful drbdadm disconnect handshake, so DRBD must detect the loss via ping-timeout. The alternatives fall short: no VM-level kill helper is reachable from a scenario, and stopping the satellite pod would not break kernel-level replication anyway. kubectl/API traffic is untouched, so kernel truth stays observable from every node throughout the outage.
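
For illustration, the four src/dst x sport/dport combinations could be generated as in the dry-run sketch below. It only prints the rules and does not execute them. The function name drop_pair_rules, the use of the INPUT and OUTPUT chains, the TCP protocol match, and the sample IP and port are all assumptions; the actual drop_pair helper is not shown in this PR and may differ.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run: print the four DROP rules for one peer and one port.
# Nothing is applied; the output is what a node would run against its peer.
drop_pair_rules() {
    local peer_ip=$1 port=$2 flag
    for flag in --sport --dport; do
        # Inbound traffic from the peer, matched on source or destination port.
        printf 'iptables -I INPUT -s %s -p tcp %s %s -j DROP\n' "$peer_ip" "$flag" "$port"
        # Outbound traffic to the peer, matched on source or destination port.
        printf 'iptables -I OUTPUT -d %s -p tcp %s %s -j DROP\n' "$peer_ip" "$flag" "$port"
    done
}

# Placeholder peer address and port, for demonstration only.
drop_pair_rules 10.0.0.2 7000
```

Matching both the source and the destination port in each direction covers the connection regardless of which side initiated it, which is presumably why all four combinations are needed.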

Harness wiring

stand/run-scenarios-only.sh SKIP_ALLOWLIST gains the new scenario: it is stand-"big"-specific by design and must SKIP (not FAIL) on the regular CI lanes, which auto-discover the whole suite via make e2e-list.
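
The reclassification rule can be pictured with the sketch below. Only the name SKIP_ALLOWLIST comes from this PR; treating it as a bash array, the entry format, and the classify_result function are assumptions made for illustration and may not match the runner.

```shell
#!/usr/bin/env bash
# Assumed shape: an array of scenario file names that are allowed to SKIP.
SKIP_ALLOWLIST=(quorum-loss-2tb-recovery.sh)

# Hypothetical classifier: a SKIP from a scenario that is not allowlisted
# is reported as FAIL; every other outcome passes through unchanged.
classify_result() {
    local scenario=$1 outcome=$2 allowed
    if [[ "$outcome" != SKIP ]]; then
        printf '%s\n' "$outcome"
        return
    fi
    for allowed in "${SKIP_ALLOWLIST[@]}"; do
        if [[ "$scenario" == "$allowed" ]]; then
            echo SKIP
            return
        fi
    done
    echo FAIL
}
```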

Validation

  • bash -n and shellcheck -x clean (zero findings).
  • Every lib.sh helper referenced was grep-verified to exist; every wait has a bounded timeout.
  • Authoring-only PR: not yet executed on a stand. Stand "big" is pre-provisioned and waiting for the final candidate SHA; the scenario's worst case exceeds the generic 600 s lane timeout, so on stand "big" it should be invoked directly: ./tests/e2e/quorum-loss-2tb-recovery.sh .work/big

Andrei Kvapil (kvaps) and others added 2 commits June 12, 2026 11:43
Release-gate scenario for the production deadlock fear: quorum loss
under active IO wedging a large volume such that only node reboots
recover. On a 2 TiB zfs-thick volume (2 diskful + auto-tiebreaker,
on-no-quorum=suspend-io) the scenario proves:

  - bounded-time create (skip-initial-sync contract for thick zvols)
  - secondary-only outage keeps quorum via the witness, IO continues
  - losing both the secondary and the witness mid-IO suspends IO
    (writer blocks with zero errors, no crash)
  - restoring the links resumes the suspended IO and heals the
    cluster to all-UpToDate with the seeded marker intact, WITHOUT
    any node reboot (boot_id + uptime asserted on every worker)

Outage mechanism is the per-link iptables DROP of the resource's
DRBD mesh port (the drop_pair recipe already proven by
quorum-tiebreaker-no-return and the partition scenarios) — the
strongest node-outage model reachable from a scenario on this
stand, and the only one that breaks kernel replication links while
keeping observability up.

Stand-"big"-specific: SKIPs unless a zfs-thick (providerKind=ZFS)
StoragePool exists on all workers with 2 TiB of headroom.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

The COV-011 gate needs the stand-"big" substrate (zfs-thick pool,
2.2T disks) and deliberately SKIPs on the regular CI lanes; without
the allowlist entry the runner reclassifies that SKIP as FAIL.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds a new end-to-end test scenario, quorum-loss-2tb-recovery.sh, to verify that a 2 TiB quorum-loss recovery can occur without requiring a node reboot, and updates the SKIP_ALLOWLIST in stand/run-scenarios-only.sh to include this new scenario. The review feedback correctly identifies critical syntax errors in the bash script related to invalid array expansion default value fallbacks (${BLOCKED_PAIRS[@]:-} and ${kept[@]:-}), which would cause runtime failures.

Comment on lines +149 to +154
    local kept=() p
    for p in "${BLOCKED_PAIRS[@]:-}"; do
        [[ "$p" == "$node|$peer_ip" ]] || kept+=("$p")
    done
    BLOCKED_PAIRS=("${kept[@]:-}")
}

critical

Using ${BLOCKED_PAIRS[@]:-} and ${kept[@]:-} results in a runtime syntax error in Bash (operand expected (error token is "-")). Since both arrays are explicitly declared (declare -a BLOCKED_PAIRS and local kept), they are guaranteed to be defined. You can safely reference them as ${BLOCKED_PAIRS[@]} and ${kept[@]} without triggering unbound variable errors under set -u.

Suggested change
-     local kept=() p
-     for p in "${BLOCKED_PAIRS[@]:-}"; do
-         [[ "$p" == "$node|$peer_ip" ]] || kept+=("$p")
-     done
-     BLOCKED_PAIRS=("${kept[@]:-}")
- }
+     local kept=() p
+     for p in "${BLOCKED_PAIRS[@]}"; do
+         [[ "$p" == "$node|$peer_ip" ]] || kept+=("$p")
+     done
+     BLOCKED_PAIRS=("${kept[@]}")
+ }
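
For reference, the behaviour of the `:-` form on arrays can be checked with the small sketch below; the array contents and the count_words helper are placeholders. In Bash, `${name[@]:-word}` is parsed as default-value expansion and not as a substring offset, so it is not expected to raise a syntax error. On an empty array it yields a single empty word, which is what a `[[ -z "$p" ]] && continue` guard skips. Note also that plain `"${name[@]}"` on an empty array is only safe under set -u from Bash 4.4 onward; earlier versions report an unbound variable.

```shell
#!/usr/bin/env bash
set -u

# Placeholder data for the demonstration.
empty=()
full=("n1|10.0.0.2" "n2|10.0.0.3")

# Prints how many words the expansion produced.
count_words() { echo "$#"; }

count_words "${empty[@]:-}"   # empty array: one (empty) word
count_words "${full[@]:-}"    # non-empty array: one word per element

# Assigning through the default form leaves one empty element behind.
kept=("${empty[@]:-}")
echo "${#kept[@]}"
```

This prints 1, 2 and 1. The trade-off is that the default form stays portable to older Bash under set -u, at the cost of an empty placeholder element that later loops have to skip.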

Comment on lines +159 to +160
for p in "${BLOCKED_PAIRS[@]:-}"; do
[[ -z "$p" ]] && continue

critical

Using ${BLOCKED_PAIRS[@]:-} results in a runtime syntax error in Bash. Since BLOCKED_PAIRS is explicitly declared as an array, you can safely reference it as ${BLOCKED_PAIRS[@]}.

Suggested change
- for p in "${BLOCKED_PAIRS[@]:-}"; do
-     [[ -z "$p" ]] && continue
+ for p in "${BLOCKED_PAIRS[@]}"; do
+     [[ -z "$p" ]] && continue

@kvaps Andrei Kvapil (kvaps) merged commit c349d79 into main Jun 12, 2026
15 checks passed