fix(satellite): never lose the day0 first-activation mkfs (BUG-028) by kvaps · Pull Request #147 · cozystack/blockstor

Andrei Kvapil (kvaps) · 2026-06-12T08:50:09Z

What

A fresh volume with FileSystem/Type set could intermittently (~2 in 5) end up with NO filesystem forever: the day0 skip-initial-sync race let an external promoter (drbd-reactor on the piraeus RWX path) reach the device before the first-activation mkfs, after which the volume was permanently wedged in a promote -> fsck-fail -> demote loop. This surfaced in CI as the recurring "rwx-ganesha flake".

Root cause chain

Day0 skip-initial-sync brings both diskful replicas Connected+UpToDate within ~2s; the mkfs-election winner reaches its first activation after that.
The force-promote kernel veto saw peer-disk:UpToDate and could not distinguish an empty day0 sibling from real data -> first-activation mkfs silently skipped.
Each external promote/demote of the empty volume bumped the DRBD current-UUID past the computed day0 GI without writing data.
The controller treated CurrentGI != day0 as proof of data -> false Initialized latch -> dispatcher granted no auto-primary -> the mkfs retry path was permanently dead.

Fix

Day0-aware veto bypass: force-promote for first-activation mkfs proceeds when every connected peer is provably a day0-empty sibling of the same never-initialized generation (belt) and the local volume carries no filesystem signature (braces). A peer with real data still vetoes mkfs unconditionally.
Latch-free mkfs retry: the retry no longer depends on the auto-primary election / Initialized latch — it fires when a volume provably lacks its filesystem while FileSystem/Type demands one, and is safe against a foreign Primary briefly holding the device (waits for Secondary).
Probe hardening: mkfs-safety probes unblocked when wedged by a stale drbd meta signature or CRD lag.

Validation (live stand, raw logs on the dev host)

Statistical: 12/12 consecutive rwx-ganesha PASS on the final code (pre-fix failure rate was 1-in-4); two earlier validation loops caught residual races and the fix was iterated until clean. Logs: /var/lib/blockstor/work/dev/bug028-fix-validation-20260612-060631/.
Deterministic: forced-race run (winner satellite killed mid-create) PASS — both fix legs observed firing in product logs (force-promote veto bypassed — every connected peer is a day0-empty sibling, latch-free mkfs retry). Logs: /var/lib/blockstor/work/dev/bug028-force-race-062248/.
No regressions: cli-matrix/r-full-lifecycle rc=0 and replay/file-thin-day0-skip-no-initial-sync PASS on the same build (day0 skip-initial-sync contract intact); go test ./..., targeted -race, and lint green.

Safety argument

mkfs is only ever performed when (a) all connected peers are day0-empty siblings by blockstor's own generation bookkeeping AND (b) no filesystem signature exists locally. Any peer with a non-day0 generation or any local fs signature refuses mkfs exactly as before. L1 regression tests cover the race chain, the false-latch retry, and the real-data refusal counter-case.

Authorises a promote+mkfs+demote retry without the dispatcher's auto-primary blessing, but only when the kernel proves it safe: every replica Secondary, local volumes UpToDate, and every connected peer-device either a lock-step UpToDate sibling or an intentional diskless witness. A disconnected or non-UpToDate diskful peer (a potential offline data holder) refuses, as does a foreign Primary (external promoters such as drbd-reactor simply defer the retry). Groundwork for the BUG-028 day0/mkfs race fix. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

A fresh RWX RD (FileSystem/Type on the RD) intermittently came up with no filesystem, forever. Two coupled failures: 1. Day0 race: skip-initial-sync brings both diskful replicas Connected+UpToDate at the shared day0 GI before the elected mkfs winner reaches finishDRBDApply. The Bug 342 force-promote kernel veto cannot tell an empty day0 sibling from a real data peer, so the one-and-only first-activation mkfs was silently skipped. 2. False-latch terminal state: an external promoter (drbd-reactor) then bumps the current-UUID past day0 without writing data, the controller latches RD.Spec.Initialized, the dispatcher drops the auto-primary election, and the Bug-311 mkfs retry (gated solely on autoPrimaryReplica) goes permanently dead. Fix, both sides, evidence-gated: - day0EmptyMkfsBypass: the veto may be bypassed only when the controller-persisted Spec.SkipInitialSync proves a never- initialized generation, no proven data peer exists, the local metadata current-UUID still equals the deterministic day0 GI (DRBD only lets replicas sit Connected+UpToDate in the same data generation, so this proves every connected peer is a day0-empty sibling), and no volume carries a filesystem signature. - shouldRetryAutoMkfs: the Bug-311 retry now also fires without the auto-primary election, via the deterministic lowest-diskful- node-id winner + SafeForMkfsRetryPromote kernel proof + blkid fs-absence probe, throttled, deferring while a foreign Primary holds the device. Any missing evidence refuses both paths, preserving the Bug 342/356 relocate and respawn-StandAlone protections and the v0.1.11 day0 skip-initial-sync contract. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…and CRD lag Stand forensics on the loop-backed ganesha scenario showed both new BUG-028 paths permanently refusing on real hardware: 1. blkid on the BACKING device of an internal-metadata DRBD volume always reports TYPE=drbd (libblkid recognises the meta superblock at the device tail), so the naive any-TYPE fs-absence probe read every never-formatted volume as populated. New backingHasUserFilesystem treats only a non-drbd signature as user data; the post-promote probe on the DRBD device (which hides the metadata) remains the authoritative double-mkfs guard. 2. The dispatcher's PeerHasData conservatively counts an UpToDate day0 sibling whose CurrentGI backfill has not landed yet as data-bearing; refusing the bypass on it permanently cost the one-shot first-activation mkfs. The bypass now uses kernel coverage (Day0SiblingSetConnected: no Primary anywhere, local UpToDate, every connected peer-device UpToDate/Diskless, an un-handshaken peer tolerated only when it is a configured diskless witness) which, combined with the local-GI==day0 proof, strictly supersedes the CRD signal: every state PeerHasData correctly protects is still refused. Also the latch-free retry now requires only ONE volume to be missing its filesystem (volumes that already carry one are adopted untouched by the per-volume blkid probe in the mkfs runner — the Bug 311 partial-mkfs shape). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…ambiguity The bypass cannot distinguish a respawned replica joining a data-bearing-but-still-day0 survivor (a volume whose entire write history happened fully connected never mints a new current-UUID) from a fresh day0 sibling. This is the same ambiguity the day0 seed path documents and the same shape the pre-existing Bug-311 retry already had; production auto-mkfs topologies (RWX with a diskless witness) are immune because the first consumer promote mints a UUID and latches the RD. Documented as accepted residual risk. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai · 2026-06-12T08:50:18Z

Warning

Review limit reached

@kvaps, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 25 minutes and 51 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 981feeb5-cc11-400b-8ff6-5edebfbbe8b9

📥 Commits

Reviewing files that changed from the base of the PR and between c54be3b and a5ed96e.

📒 Files selected for processing (4)

pkg/drbd/drbdadm.go
pkg/drbd/safe_for_mkfs_retry_promote_test.go
pkg/satellite/reconciler.go
pkg/satellite/reconciler_bug028_test.go

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/day0-mkfs-race

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request addresses BUG-028 by introducing a day0-empty mkfs bypass and a latch-free mkfs retry mechanism. It adds SafeForMkfsRetryPromote and Day0SiblingSetConnected to pkg/drbd/drbdadm.go to safely probe the DRBD status, and integrates these checks into the satellite reconciler. Additionally, a backing-device filesystem probe is added to ignore the DRBD metadata signature. A review comment suggests using a less-than-or-equal comparison for peer node IDs in isLowestDiskfulNodeID to defensively prevent multiple nodes from simultaneously attempting to promote/mkfs in the event of duplicate node IDs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T08:51:55Z

+		peerID, peerErr := strconv.Atoi(opts["peer."+peer+".node-id"])
+		if peerErr != nil || peerID < selfID {
+			return false
+		}


There is an inconsistency in how node ID uniqueness/equality is handled between isLowestDiskfulNodeID and isLateAddWinner.

In isLateAddWinner, the comparison is *peer.NodeID <= localID (line 3379), which correctly disqualifies the local node if a peer shares the same node ID (defensive check against duplicate node IDs). However, isLowestDiskfulNodeID uses peerID < selfID (line 2981). If a duplicate node ID situation ever occurs, both nodes could evaluate isLowestDiskfulNodeID as true and attempt to promote/mkfs simultaneously, potentially leading to a split-brain or conflict.

Using <= here makes the check more robust and consistent with isLateAddWinner.

Suggested change

peerID, peerErr := strconv.Atoi(opts["peer."+peer+".node-id"])

if peerErr != nil || peerID < selfID {

return false

}

peerID, peerErr := strconv.Atoi(opts["peer."+peer+".node-id"])

if peerErr != nil || peerID <= selfID {

return false

}

Andrei Kvapil (kvaps) and others added 4 commits June 12, 2026 07:49

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

Andrei Kvapil (kvaps) merged commit 960f414 into main Jun 12, 2026
27 of 28 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(satellite): never lose the day0 first-activation mkfs (BUG-028)#147

fix(satellite): never lose the day0 first-activation mkfs (BUG-028)#147
Andrei Kvapil (kvaps) merged 4 commits into
mainfrom
fix/day0-mkfs-race

Andrei Kvapil (kvaps) commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Andrei Kvapil (kvaps) commented Jun 12, 2026

What

Root cause chain

Fix

Validation (live stand, raw logs on the dev host)

Safety argument

Uh oh!

coderabbitai Bot commented Jun 12, 2026

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant