Skip to content

fix(controller): ghost tiebreaker witness re-created on a just-deleted node#140

Merged
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/ghost-tiebreaker-lost-node
Jun 12, 2026
Merged

fix(controller): ghost tiebreaker witness re-created on a just-deleted node#140
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/ghost-tiebreaker-lost-node

Conversation

@kvaps

Copy link
Copy Markdown
Member

Problem

After node lost cascades a 3-diskful RD down to 2 diskful, ensureTiebreaker re-created the witness <rd>.<lost-node> [DISKLESS TIE_BREAKER] on the node that was just deleted: the witness placement picks candidates from the manager's lagging informer cache, which still serves the deleted node for a short window. Nothing ever reaped the ghost — the node object is gone, so there is no DeletionTimestamp event and no finalizer pass. Deterministic on a warm process; deadlock-class.

The REST layer already guards the equivalent race on r c (the Bug 174 node-deleted post-write re-check); the controller-side witness create lacked it.

Fix

Both legs, mirroring the REST guard's shape on the controller side:

  • Placement guard: the chosen witness node is re-validated against the authoritative reader (direct apiserver read, not the cache) right before the Create, and re-checked right after it. On a miss (or a freshly-stamped drain flag) the create is deferred to the next reconcile, which re-picks from a fresh list; a witness whose node vanished mid-write is rolled back.
  • Repair leg: replicas whose node row no longer exists are treated like EVICTED/LOST placements — excluded from the diskful/diskless voting count, and TIE_BREAKER rows are reaped so a fresh witness lands on a healthy spare. This also cleans up ghosts created before the fix. Absence is confirmed authoritatively before any reap, so a lagging cache can never trigger a wrongful one. Stranded diskful replicas are excluded from the count but never deleted here — their teardown belongs to the lost-node cascade.

Node deletions now also re-enqueue the witness invariant through the existing node watch, so the repair runs event-driven instead of waiting for the periodic requeue.

Tests

L1 regression tests pin both legs and the boundary: a stale cached list offering a deleted node never produces a ghost (and the witness converges on the healthy spare once the list is fresh); a pre-existing ghost witness on a missing node is reaped and replaced; a stranded diskful on a missing node is excluded from the count but not deleted. All three fail without the fix. The full ensure_tiebreaker suite (witness-iff-diskful==2 and friends) stays green.

Verified locally: go test ./..., go test -race ./internal/controller/... and golangci-lint run are clean; the operator-level integration pin TestGroupKWFNodeLostCascade (PR #137 branch) goes red on main and green with this fix.

Andrei Kvapil (kvaps) and others added 2 commits June 12, 2026 04:55
…d node

After `n lost` cascaded a 3-diskful RD down to 2 diskful, the
witness placement picked candidates from the manager's lagging
informer cache and re-created the TIE_BREAKER on the just-deleted
node, and nothing ever reaped the ghost (the node row is gone, so
there is no DeletionTimestamp event and no finalizer pass).

Close both legs, mirroring the REST layer's Bug 174 node-deleted
race guard on the controller side:

- Placement guard: re-validate the chosen witness node against the
  authoritative reader (direct read, not the cache) right before
  and right after the Create; defer to the next reconcile on a miss
  and roll back a witness whose node vanished mid-write.
- Repair leg: replicas whose node row no longer exists are treated
  like EVICTED/LOST placements — excluded from the voting count and,
  for TIE_BREAKER rows, reaped so a fresh witness can land on a
  healthy spare. This also cleans up ghosts created before the fix.

Node deletions now also re-enqueue the witness invariant via the
existing node watch so the repair runs event-driven.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
… node

Pin both legs: the placement guard (stale cached node list offers a
deleted node — the create is deferred and converges on the healthy
spare once the list is fresh), the repair leg (a pre-existing ghost
witness on a missing node is reaped and replaced), and the boundary
(a stranded diskful on a missing node is excluded from the voting
count but never deleted by the witness pass). All three fail
without the fix.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@kvaps, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 18 minutes and 7 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2c596214-e7ad-413d-9b5e-af79dd58e0d8

📥 Commits

Reviewing files that changed from the base of the PR and between 910ea3a and b1f9e61.

📒 Files selected for processing (2)
  • internal/controller/ensure_tiebreaker_ghost_bug024_test.go
  • internal/controller/resourcedefinition_controller.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/ghost-tiebreaker-lost-node

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses Bug-024 by preventing the creation of witness replicas on deleted nodes and ensuring that ghost witnesses left on missing nodes are properly reaped. It introduces a placement guard that re-validates the chosen witness node against the authoritative reader before creation, implements a rollback mechanism if the node vanishes during creation, and updates the reconciliation logic to immediately trigger on node deletion events. The review feedback highlights a performance concern in probeNodeDirect, where using APIReader.List to find a single node by name can lead to an inefficient N+1 pattern, and suggests attempting a direct Get first to avoid overloading the API server.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +1421 to +1449
func (r *ResourceDefinitionReconciler) probeNodeDirect(ctx context.Context, wireName string) (nodeProbe, error) {
if r.APIReader == nil {
node, err := r.Store.Nodes().Get(ctx, wireName)
if stderrors.Is(err, store.ErrNotFound) {
return nodeProbe{}, nil
}

if err != nil {
return nodeProbe{}, err
}

return nodeProbe{found: true, drained: isDisabledNode(&node)}, nil
}

var list blockstoriov1alpha1.NodeList
if err := r.APIReader.List(ctx, &list); err != nil {
return nodeProbe{}, err
}

for i := range list.Items {
if !nodeCRDMatchesWireName(&list.Items[i], wireName) {
continue
}

return nodeProbe{found: true, drained: nodeHasDrainFlag(&list.Items[i])}, nil
}

return nodeProbe{}, nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Calling r.APIReader.List to find a single node by name is highly inefficient, especially in larger clusters. Since probeNodeDirect is called in a loop within confirmGhostReplicas, this can lead to an $N+1$ List pattern that bypasses the cache and could easily overload the Kubernetes API server under load.

To optimize this, we can attempt a direct r.APIReader.Get using the lowercase wireName first. If that succeeds, we avoid the expensive List call entirely. We should only fall back to List if the direct Get returns a NotFound error (to handle cases where the node name was slugified or modified).

func (r *ResourceDefinitionReconciler) probeNodeDirect(ctx context.Context, wireName string) (nodeProbe, error) {
	if r.APIReader == nil {
		node, err := r.Store.Nodes().Get(ctx, wireName)
		if stderrors.Is(err, store.ErrNotFound) {
			return nodeProbe{}, nil
		}

		if err != nil {
			return nodeProbe{}, err
		}

		return nodeProbe{found: true, drained: isDisabledNode(&node)}, nil
	}

	// Try direct Get first to avoid expensive List
	var node blockstoriov1alpha1.Node
	err := r.APIReader.Get(ctx, client.ObjectKey{Name: strings.ToLower(wireName)}, &node)
	if err == nil {
		return nodeProbe{found: true, drained: nodeHasDrainFlag(&node)}, nil
	}
	if !apierrors.IsNotFound(err) {
		return nodeProbe{}, err
	}

	var list blockstoriov1alpha1.NodeList
	if err := r.APIReader.List(ctx, &list); err != nil {
		return nodeProbe{}, err
	}

	for i := range list.Items {
		if !nodeCRDMatchesWireName(&list.Items[i], wireName) {
			continue
		}

		return nodeProbe{found: true, drained: nodeHasDrainFlag(&list.Items[i])}, nil
	}

	return nodeProbe{}, nil
}

@kvaps Andrei Kvapil (kvaps) merged commit 10b7835 into main Jun 12, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant