fix(controller): close witness-reap vs redundancy-backfill race (BUG-041)#157
Conversation
Add DeleteIfTieBreaker, which deletes a replica only if it still carries the TIE_BREAKER flag at the observed object version. The k8s store guards the delete with a ResourceVersion + UID precondition; the in-memory store performs the read-check-delete under its write lock. A concurrent promotion of the same (rd, node) row to a diskful backfill replica makes the delete a benign no-op instead of clobbering the promoted replica. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
removeWitnesses did a non-atomic Get-then-Delete: it re-read the witness flags, but the Delete carried no version precondition, so a concurrent backfill promotion landing in the gap between the read and the Delete was silently deleted by name. After an inactive-replica return or node event, the placer would promote the witness to a diskful backfill replica and the reaper would then clobber it — redundancy was never restored (fewer diskful replicas than place_count) and the loss was silent. Route the reap through the new ResourceStore.DeleteIfTieBreaker so the flag re-check and the delete are one optimistic-concurrency operation: a racing promotion aborts the delete instead of clobbering it. Legitimate orphan/ghost/relocate witness-reap paths are unaffected. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Add an L1 controller regression that injects the concurrent backfill promotion at the exact race point — between the reaper's observation of the witness and the delete commit — and asserts the promoted replica survives and redundancy holds. A companion test pins that a genuine orphan witness (no racing promotion) is still reaped, guarding against over-correction. Add a DeleteIfTieBreaker contract test to the shared store suite so both the in-memory and CRD-backed (envtest) backends are pinned in lock-step: witness reaped, promoted diskful preserved, missing row a no-op. Update the pkg/rest decorator fakes to satisfy the widened interface. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The inactive-return-backfills-redundancy hang has a decision-level root cause in addition to the reaper race: with 1 active diskful + 1 INACTIVE diskful + 1 leftover auto-tiebreaker witness, ensureTiebreaker filters the INACTIVE replica out of the voting set, sees "1 diskful + 1 witness", treats the witness as a genuine orphan, and reaps it. But that witness sits on the very spare the placer's `r c --auto-place` is promoting to restore active redundancy. Reaper and placer fight over the node and the reconciler oscillates 1<->2 active diskful forever, so the backfill never stabilises (active_diskful_count never reaches 2 — the 240s timeout). Hold an existing witness whenever an INACTIVE diskful replica is present: the cluster is degraded and a backfill is expected, so the witness is the backfill target, not an orphan. This only preserves an already-present witness (the create branch is unchanged, so no spurious witness grows), and it is the forward-correct shape — when the INACTIVE replica returns (`r activate`) the cluster is immediately at the 2-diskful + 1-witness three-voter quorum with no recreate churn. Add a decision-level regression that reproduces the oscillation shape and asserts the witness survives repeated reconciles; the Bug 387 no-spurious- witness and the ghost/orphan reap suites stay green. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (9)
📝 WalkthroughWalkthroughPR adds a version-guarded ChangesWitness Deletion & Backfill Race Fix
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses Bug 393, where the auto-tiebreaker witness reaper raced with the placer's redundancy backfill, causing promoted replicas to be silently deleted and preventing redundancy from being restored. To resolve this, a new atomic/conditional delete method DeleteIfTieBreaker is introduced to ResourceStore and implemented for both in-memory and Kubernetes-backed stores. This ensures that a witness is only deleted if it still carries the TIE_BREAKER flag at the observed version, preventing racing promotions from being clobbered. Additionally, the reconciler is updated to preserve an existing witness when an inactive diskful replica is present. The review feedback suggests refining the containsInactiveReplica helper function to ensure it only matches inactive replicas that are also diskful, preventing inactive diskless replicas (witnesses) from incorrectly triggering the degraded redundancy logic.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| func containsInactiveReplica(replicas []apiv1.Resource) bool { | ||
| for i := range replicas { | ||
| if slices.Contains(replicas[i].Flags, apiv1.ResourceFlagInactive) { | ||
| return true | ||
| } | ||
| } | ||
|
|
||
| return false | ||
| } |
There was a problem hiding this comment.
The helper function containsInactiveReplica is designed to detect if active redundancy is degraded by checking for deactivated replicas. However, checking only for ResourceFlagInactive can also match inactive diskless replicas (witnesses), which do not represent a degradation of active diskful redundancy.
To align precisely with the design intent (as documented in the comments), we should ensure that the inactive replica is also diskful (i.e., does not carry the ResourceFlagDiskless flag).
| func containsInactiveReplica(replicas []apiv1.Resource) bool { | |
| for i := range replicas { | |
| if slices.Contains(replicas[i].Flags, apiv1.ResourceFlagInactive) { | |
| return true | |
| } | |
| } | |
| return false | |
| } | |
| func containsInactiveReplica(replicas []apiv1.Resource) bool { | |
| for i := range replicas { | |
| if slices.Contains(replicas[i].Flags, apiv1.ResourceFlagInactive) && | |
| !slices.Contains(replicas[i].Flags, apiv1.ResourceFlagDiskless) { | |
| return true | |
| } | |
| } | |
| return false | |
| } |
What
Durability/availability race surfaced by the release-gate sweep (scenario
replay/inactive-return-backfills-redundancy): the auto-tiebreaker witness reaperremoveWitnessesdid a non-atomic Get→Delete with no resourceVersion precondition, racing the placer's redundancy backfill. When an inactive replica returns and the placer promotes a Resource into a diskful backfill, the reaper — having read the row a moment earlier as a witness — could Delete it by name, clobbering the freshly-promoted replica. Net effect: redundancy is silently NOT restored (fewer diskful replicas than place_count). Reproduces back-to-backdeactivate+r c --auto-place.Fix
feat(store): add a conditionalDeleteIfTieBreakertoResourceStore(both InMemory and k8s backends) — deletes only if the object still carries the TIE_BREAKER flag and is still diskless, under a resourceVersion precondition, so a concurrent promotion to diskful aborts the delete.fix(controller): routeremoveWitnessesthrough the conditional delete; a row that was promoted out of witness state between the reconcile read and the delete is left intact, and the backfill survives. Legitimate witness-reap paths (orphan witness, diskful≥3, BUG-040 AutoplaceTarget, ghost-on-lost-node fix(controller): ghost tiebreaker witness re-created on a just-deleted node #140) are unchanged.Tests
pin conditional witness delete against backfill race— injects the concurrent promotion between the reaper's read and delete (decorator store) and asserts the promoted replica survives + redundancy holds.big,cli-matrix/r-full-lifecyclegreen and L7replay/inactive-return-backfills-redundancy3/3 PASS. The fullensure_tiebreaker_*/ ghost / BUG-040 witness suites stay green.Summary by CodeRabbit
Bug Fixes
Tests