[fix][broker] Fix forced topic/namespace deletion still hanging when the compaction reader reconnect stalls#26026
Open
lhotari wants to merge 1 commit into
Open
Conversation
…the compaction reader reconnect stalls Follow-up to apache#26016 for issue apache#24148. apache#26016 fixed two causes of forced topic/namespace deletion hanging while compaction is in progress (the poisoned `currentCompaction` cursor deletion, and the `receiveRawAsync` enqueue-vs-close race), but a residual race remained and `CompactionTest#testForcedNamespaceDeleteWithInflightCompaction` still failed intermittently (~50%). ### Root cause The compaction `RawReader`'s consumer (`RawReaderImpl.RawConsumerImpl`, created with `retryOnRecoverableErrors=false`) has an asymmetry between its two reconnect-failure paths. When a forced delete fences the topic and disconnects the consumer, the client reconnects, and the failure surfaces at one of two stages: - SUBSCRIBE stage (`ConsumerImpl.connectionOpened().exceptionally`) consults `isUnrecoverableError` unconditionally, so a `ServiceNotReadyException` closes the reader promptly and fails the in-flight read. - LOOKUP / connection stage (`ConsumerImpl.connectionFailed`) only consults `isUnrecoverableError` inside `if (nonRetriableError || timeout)`. For a *retriable* `ServiceNotReadyException` within the lookup deadline it just reconnects, so the in-flight read never fails, `currentCompaction` never completes, and the forced deletion blocks past the test's 20s budget. Which stage wins is a metadata-cache propagation race, hence the flakiness. apache#26016's `receiveRawAsync` re-check only covered the already-fast subscribe path. ### Modifications - `RawReaderImpl.RawConsumerImpl` overrides `connectionFailed` to honor `isUnrecoverableError` before the retriable / lookup-deadline gate, so a fenced/deleted-topic reconnect terminates the reader promptly regardless of which stage surfaces the failure (mirroring the subscribe-stage behavior). This only affects the compaction reader (`retryOnRecoverableErrors=false`); ordinary RawReaders keep retrying recoverable errors unchanged. - `PersistentTopic.asyncDeleteCursorWithCleanCompactionLedger` skips waiting on `currentCompaction` when the topic is already closing/deleting. A forced delete aborts the in-flight compaction, so the cursor deletion must not block on a compaction future that may stay pending while its reader keeps reconnecting. This is the deterministic guarantee, independent of client reconnect timing or error classification. ### Verifying this change This change added tests and can be verified as follows: - `CompactionTest#testForcedDeleteCompletesWhileCompactionStuck`: blocks the compaction at the phase-two seek so `currentCompaction` never completes, then asserts `deleteForcefully()` still completes promptly (deterministic). - `RawReaderTest#testConnectionFailureTerminatesReadWhenNotRetryingRecoverableErrors`: builds a `retryOnRecoverableErrors=false` consumer with a pending read, invokes `connectionFailed(ServiceNotReadyException)`, and asserts it terminates the reader and fails the read (deterministic). - `testForcedNamespaceDeleteWithInflightCompaction` ran 10/10 under repetition with the fix; full `CompactionTest` and `RawReaderTest` suites pass. Assisted-by: Claude Code (Opus 4.8)
merlimat
approved these changes
Jun 14, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to #26016 (which fixed #24148, "Compaction of
__change_eventstopic is blocking forcefulnamespace/topic deletion"). This addresses a residual race that #26016 left behind.
Motivation
#26016 fixed two causes of forced topic/namespace deletion hanging while compaction is in progress
(the poisoned
currentCompactioncursor deletion, and thereceiveRawAsyncenqueue-vs-close race),but
CompactionTest#testForcedNamespaceDeleteWithInflightCompactionstill failed intermittently(~50%).
The compaction
RawReader's consumer (RawReaderImpl.RawConsumerImpl, created withretryOnRecoverableErrors=false) has an asymmetry between its two reconnect-failure paths. When aforced delete fences the topic and disconnects the consumer, the client reconnects and the failure
surfaces at one of two stages:
ConsumerImpl.connectionOpened().exceptionally) consultsisUnrecoverableErrorunconditionally, so aServiceNotReadyExceptioncloses the reader promptlyand fails the in-flight read.
ConsumerImpl.connectionFailed) only consultsisUnrecoverableErrorinside
if (nonRetriableError || timeout). For a retriableServiceNotReadyExceptionwithin thelookup deadline it just reconnects — so the in-flight read never fails,
currentCompactionnevercompletes, and the forced deletion blocks past the test's 20s budget.
Which stage wins is a metadata-cache propagation race, hence the flakiness. (This is independent of
#26009, which only stops redirect-URL pinning across reconnects and does not change this
connectionFailedclassification.)Modifications
RawReaderImpl.RawConsumerImploverridesconnectionFailedto honorisUnrecoverableErrorbeforethe retriable / lookup-deadline gate, so a fenced/deleted-topic reconnect terminates the reader
promptly regardless of which stage surfaces the failure (mirroring the subscribe-stage behavior).
This only affects the compaction reader (
retryOnRecoverableErrors=false); ordinary RawReaders keepretrying recoverable errors unchanged.
PersistentTopic.asyncDeleteCursorWithCleanCompactionLedgerskips waiting oncurrentCompactionwhen the topic is already closing/deleting. A forced delete aborts the in-flight compaction, so the
cursor deletion must not block on a compaction future that may stay pending while its reader keeps
reconnecting. This is the deterministic guarantee, independent of client reconnect timing or error
classification.
Verifying this change
This change added tests and can be verified as follows:
CompactionTest#testForcedDeleteCompletesWhileCompactionStuck: blocks the compaction at thephase-two seek so
currentCompactionnever completes, then assertsdeleteForcefully()stillcompletes promptly (deterministic — exercises the broker-side gate).
RawReaderTest#testConnectionFailureTerminatesReadWhenNotRetryingRecoverableErrors: builds aretryOnRecoverableErrors=falseconsumer with a pending read, invokesconnectionFailed(ServiceNotReadyException), and asserts it terminates the reader and fails the read(deterministic — exercises the client-side override).
testForcedNamespaceDeleteWithInflightCompactionpassed 10/10 under repetition with the fix; thefull
CompactionTestandRawReaderTestsuites pass.Does this pull request potentially affect one of the following parts:
NONE of these are affected with this change.