fix(controller): ghost tiebreaker witness re-created on a just-deleted node by kvaps · Pull Request #140 · cozystack/blockstor

Andrei Kvapil (kvaps) · 2026-06-12T01:58:03Z

Problem

After node lost cascades a 3-diskful RD down to 2 diskful, ensureTiebreaker re-created the witness <rd>.<lost-node> [DISKLESS TIE_BREAKER] on the node that was just deleted: the witness placement picks candidates from the manager's lagging informer cache, which still serves the deleted node for a short window. Nothing ever reaped the ghost — the node object is gone, so there is no DeletionTimestamp event and no finalizer pass. Deterministic on a warm process; deadlock-class.

The REST layer already guards the equivalent race on r c (the Bug 174 node-deleted post-write re-check); the controller-side witness create lacked it.

Fix

Both legs, mirroring the REST guard's shape on the controller side:

Placement guard: the chosen witness node is re-validated against the authoritative reader (direct apiserver read, not the cache) right before the Create, and re-checked right after it. On a miss (or a freshly-stamped drain flag) the create is deferred to the next reconcile, which re-picks from a fresh list; a witness whose node vanished mid-write is rolled back.
Repair leg: replicas whose node row no longer exists are treated like EVICTED/LOST placements — excluded from the diskful/diskless voting count, and TIE_BREAKER rows are reaped so a fresh witness lands on a healthy spare. This also cleans up ghosts created before the fix. Absence is confirmed authoritatively before any reap, so a lagging cache can never trigger a wrongful one. Stranded diskful replicas are excluded from the count but never deleted here — their teardown belongs to the lost-node cascade.

Node deletions now also re-enqueue the witness invariant through the existing node watch, so the repair runs event-driven instead of waiting for the periodic requeue.

Tests

L1 regression tests pin both legs and the boundary: a stale cached list offering a deleted node never produces a ghost (and the witness converges on the healthy spare once the list is fresh); a pre-existing ghost witness on a missing node is reaped and replaced; a stranded diskful on a missing node is excluded from the count but not deleted. All three fail without the fix. The full ensure_tiebreaker suite (witness-iff-diskful==2 and friends) stays green.

Verified locally: go test ./..., go test -race ./internal/controller/... and golangci-lint run are clean; the operator-level integration pin TestGroupKWFNodeLostCascade (PR #137 branch) goes red on main and green with this fix.

…d node After `n lost` cascaded a 3-diskful RD down to 2 diskful, the witness placement picked candidates from the manager's lagging informer cache and re-created the TIE_BREAKER on the just-deleted node, and nothing ever reaped the ghost (the node row is gone, so there is no DeletionTimestamp event and no finalizer pass). Close both legs, mirroring the REST layer's Bug 174 node-deleted race guard on the controller side: - Placement guard: re-validate the chosen witness node against the authoritative reader (direct read, not the cache) right before and right after the Create; defer to the next reconcile on a miss and roll back a witness whose node vanished mid-write. - Repair leg: replicas whose node row no longer exists are treated like EVICTED/LOST placements — excluded from the voting count and, for TIE_BREAKER rows, reaped so a fresh witness can land on a healthy spare. This also cleans up ghosts created before the fix. Node deletions now also re-enqueue the witness invariant via the existing node watch so the repair runs event-driven. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

… node Pin both legs: the placement guard (stale cached node list offers a deleted node — the create is deferred and converges on the healthy spare once the list is fresh), the repair leg (a pre-existing ghost witness on a missing node is reaped and replaced), and the boundary (a stranded diskful on a missing node is excluded from the voting count but never deleted by the witness pass). All three fail without the fix. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai · 2026-06-12T01:58:10Z

Warning

Review limit reached

@kvaps, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 18 minutes and 7 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2c596214-e7ad-413d-9b5e-af79dd58e0d8

📥 Commits

Reviewing files that changed from the base of the PR and between 910ea3a and b1f9e61.

📒 Files selected for processing (2)

internal/controller/ensure_tiebreaker_ghost_bug024_test.go
internal/controller/resourcedefinition_controller.go

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/ghost-tiebreaker-lost-node

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request addresses Bug-024 by preventing the creation of witness replicas on deleted nodes and ensuring that ghost witnesses left on missing nodes are properly reaped. It introduces a placement guard that re-validates the chosen witness node against the authoritative reader before creation, implements a rollback mechanism if the node vanishes during creation, and updates the reconciliation logic to immediately trigger on node deletion events. The review feedback highlights a performance concern in probeNodeDirect, where using APIReader.List to find a single node by name can lead to an inefficient N+1 pattern, and suggests attempting a direct Get first to avoid overloading the API server.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T01:59:20Z

+func (r *ResourceDefinitionReconciler) probeNodeDirect(ctx context.Context, wireName string) (nodeProbe, error) {
+	if r.APIReader == nil {
+		node, err := r.Store.Nodes().Get(ctx, wireName)
+		if stderrors.Is(err, store.ErrNotFound) {
+			return nodeProbe{}, nil
+		}
+
+		if err != nil {
+			return nodeProbe{}, err
+		}
+
+		return nodeProbe{found: true, drained: isDisabledNode(&node)}, nil
+	}
+
+	var list blockstoriov1alpha1.NodeList
+	if err := r.APIReader.List(ctx, &list); err != nil {
+		return nodeProbe{}, err
+	}
+
+	for i := range list.Items {
+		if !nodeCRDMatchesWireName(&list.Items[i], wireName) {
+			continue
+		}
+
+		return nodeProbe{found: true, drained: nodeHasDrainFlag(&list.Items[i])}, nil
+	}
+
+	return nodeProbe{}, nil
+}


Calling r.APIReader.List to find a single node by name is highly inefficient, especially in larger clusters. Since probeNodeDirect is called in a loop within confirmGhostReplicas, this can lead to an $N+1$ List pattern that bypasses the cache and could easily overload the Kubernetes API server under load.

To optimize this, we can attempt a direct r.APIReader.Get using the lowercase wireName first. If that succeeds, we avoid the expensive List call entirely. We should only fall back to List if the direct Get returns a NotFound error (to handle cases where the node name was slugified or modified).

func (r *ResourceDefinitionReconciler) probeNodeDirect(ctx context.Context, wireName string) (nodeProbe, error) { if r.APIReader == nil { node, err := r.Store.Nodes().Get(ctx, wireName) if stderrors.Is(err, store.ErrNotFound) { return nodeProbe{}, nil } if err != nil { return nodeProbe{}, err } return nodeProbe{found: true, drained: isDisabledNode(&node)}, nil } // Try direct Get first to avoid expensive List var node blockstoriov1alpha1.Node err := r.APIReader.Get(ctx, client.ObjectKey{Name: strings.ToLower(wireName)}, &node) if err == nil { return nodeProbe{found: true, drained: nodeHasDrainFlag(&node)}, nil } if !apierrors.IsNotFound(err) { return nodeProbe{}, err } var list blockstoriov1alpha1.NodeList if err := r.APIReader.List(ctx, &list); err != nil { return nodeProbe{}, err } for i := range list.Items { if !nodeCRDMatchesWireName(&list.Items[i], wireName) { continue } return nodeProbe{found: true, drained: nodeHasDrainFlag(&list.Items[i])}, nil } return nodeProbe{}, nil }

Andrei Kvapil (kvaps) and others added 2 commits June 12, 2026 04:55

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

Andrei Kvapil (kvaps) merged commit 10b7835 into main Jun 12, 2026
15 checks passed

Andrei Kvapil (kvaps) mentioned this pull request Jun 13, 2026

fix(controller): close witness-reap vs redundancy-backfill race (BUG-041) #157

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(controller): ghost tiebreaker witness re-created on a just-deleted node#140

fix(controller): ghost tiebreaker witness re-created on a just-deleted node#140
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/ghost-tiebreaker-lost-node

Andrei Kvapil (kvaps) commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Andrei Kvapil (kvaps) commented Jun 12, 2026

Problem

Fix

Tests

Uh oh!

coderabbitai Bot commented Jun 12, 2026

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant