fix(controller): ghost tiebreaker witness re-created on a just-deleted node#140
Conversation
…d node After `n lost` cascaded a 3-diskful RD down to 2 diskful, the witness placement picked candidates from the manager's lagging informer cache and re-created the TIE_BREAKER on the just-deleted node, and nothing ever reaped the ghost (the node row is gone, so there is no DeletionTimestamp event and no finalizer pass). Close both legs, mirroring the REST layer's Bug 174 node-deleted race guard on the controller side: - Placement guard: re-validate the chosen witness node against the authoritative reader (direct read, not the cache) right before and right after the Create; defer to the next reconcile on a miss and roll back a witness whose node vanished mid-write. - Repair leg: replicas whose node row no longer exists are treated like EVICTED/LOST placements — excluded from the voting count and, for TIE_BREAKER rows, reaped so a fresh witness can land on a healthy spare. This also cleans up ghosts created before the fix. Node deletions now also re-enqueue the witness invariant via the existing node watch so the repair runs event-driven. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
… node Pin both legs: the placement guard (stale cached node list offers a deleted node — the create is deferred and converges on the healthy spare once the list is fresh), the repair leg (a pre-existing ghost witness on a missing node is reaped and replaced), and the boundary (a stranded diskful on a missing node is excluded from the voting count but never deleted by the witness pass). All three fail without the fix. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
|
Warning Review limit reached
More reviews will be available in 18 minutes and 7 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses Bug-024 by preventing the creation of witness replicas on deleted nodes and ensuring that ghost witnesses left on missing nodes are properly reaped. It introduces a placement guard that re-validates the chosen witness node against the authoritative reader before creation, implements a rollback mechanism if the node vanishes during creation, and updates the reconciliation logic to immediately trigger on node deletion events. The review feedback highlights a performance concern in probeNodeDirect, where using APIReader.List to find a single node by name can lead to an inefficient N+1 pattern, and suggests attempting a direct Get first to avoid overloading the API server.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| func (r *ResourceDefinitionReconciler) probeNodeDirect(ctx context.Context, wireName string) (nodeProbe, error) { | ||
| if r.APIReader == nil { | ||
| node, err := r.Store.Nodes().Get(ctx, wireName) | ||
| if stderrors.Is(err, store.ErrNotFound) { | ||
| return nodeProbe{}, nil | ||
| } | ||
|
|
||
| if err != nil { | ||
| return nodeProbe{}, err | ||
| } | ||
|
|
||
| return nodeProbe{found: true, drained: isDisabledNode(&node)}, nil | ||
| } | ||
|
|
||
| var list blockstoriov1alpha1.NodeList | ||
| if err := r.APIReader.List(ctx, &list); err != nil { | ||
| return nodeProbe{}, err | ||
| } | ||
|
|
||
| for i := range list.Items { | ||
| if !nodeCRDMatchesWireName(&list.Items[i], wireName) { | ||
| continue | ||
| } | ||
|
|
||
| return nodeProbe{found: true, drained: nodeHasDrainFlag(&list.Items[i])}, nil | ||
| } | ||
|
|
||
| return nodeProbe{}, nil | ||
| } |
There was a problem hiding this comment.
Calling r.APIReader.List to find a single node by name is highly inefficient, especially in larger clusters. Since probeNodeDirect is called in a loop within confirmGhostReplicas, this can lead to an List pattern that bypasses the cache and could easily overload the Kubernetes API server under load.
To optimize this, we can attempt a direct r.APIReader.Get using the lowercase wireName first. If that succeeds, we avoid the expensive List call entirely. We should only fall back to List if the direct Get returns a NotFound error (to handle cases where the node name was slugified or modified).
func (r *ResourceDefinitionReconciler) probeNodeDirect(ctx context.Context, wireName string) (nodeProbe, error) {
if r.APIReader == nil {
node, err := r.Store.Nodes().Get(ctx, wireName)
if stderrors.Is(err, store.ErrNotFound) {
return nodeProbe{}, nil
}
if err != nil {
return nodeProbe{}, err
}
return nodeProbe{found: true, drained: isDisabledNode(&node)}, nil
}
// Try direct Get first to avoid expensive List
var node blockstoriov1alpha1.Node
err := r.APIReader.Get(ctx, client.ObjectKey{Name: strings.ToLower(wireName)}, &node)
if err == nil {
return nodeProbe{found: true, drained: nodeHasDrainFlag(&node)}, nil
}
if !apierrors.IsNotFound(err) {
return nodeProbe{}, err
}
var list blockstoriov1alpha1.NodeList
if err := r.APIReader.List(ctx, &list); err != nil {
return nodeProbe{}, err
}
for i := range list.Items {
if !nodeCRDMatchesWireName(&list.Items[i], wireName) {
continue
}
return nodeProbe{found: true, drained: nodeHasDrainFlag(&list.Items[i])}, nil
}
return nodeProbe{}, nil
}
Problem
After
node lostcascades a 3-diskful RD down to 2 diskful,ensureTiebreakerre-created the witness<rd>.<lost-node>[DISKLESS TIE_BREAKER]on the node that was just deleted: the witness placement picks candidates from the manager's lagging informer cache, which still serves the deleted node for a short window. Nothing ever reaped the ghost — the node object is gone, so there is no DeletionTimestamp event and no finalizer pass. Deterministic on a warm process; deadlock-class.The REST layer already guards the equivalent race on
r c(the Bug 174 node-deleted post-write re-check); the controller-side witness create lacked it.Fix
Both legs, mirroring the REST guard's shape on the controller side:
Node deletions now also re-enqueue the witness invariant through the existing node watch, so the repair runs event-driven instead of waiting for the periodic requeue.
Tests
L1 regression tests pin both legs and the boundary: a stale cached list offering a deleted node never produces a ghost (and the witness converges on the healthy spare once the list is fresh); a pre-existing ghost witness on a missing node is reaped and replaced; a stranded diskful on a missing node is excluded from the count but not deleted. All three fail without the fix. The full
ensure_tiebreakersuite (witness-iff-diskful==2 and friends) stays green.Verified locally:
go test ./...,go test -race ./internal/controller/...andgolangci-lint runare clean; the operator-level integration pinTestGroupKWFNodeLostCascade(PR #137 branch) goes red on main and green with this fix.