fix: avoid infinite retry in blue-green migration test #1502
Open
nodece wants to merge 4 commits into
3c7c9e2 to
bc05b7a
Compare
Contributor
Pull request overview
This PR updates the blue/green migration integration test to avoid hanging indefinitely during broker migration/unload scenarios by adding bounded retry behavior, terminal-error exits, and stage-level timeouts/error propagation.
Changes:
- Add terminal error handling for producer send retries (stop retrying on `ErrTopicTerminated`/`ErrProducerClosed`).
- Bound producer and consumer retry loops to a 30s window per message.
- Add an error channel and a `waitWithError` helper to fail stages early and avoid waiting forever on `WaitGroup`s.
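A minimal sketch of the bounded-retry pattern described in the first two bullets, not the PR's actual code. The error values, the `isTerminal` and `sendWithRetry` helpers, and the constants are all hypothetical stand-ins; the real test would check against the client's `ErrTopicTerminated`/`ErrProducerClosed` results.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins for the client's terminal errors and for a
// transient failure; these are NOT the real pulsar-client-go values.
var (
	errTopicTerminated = errors.New("topic terminated")
	errProducerClosed  = errors.New("producer closed")
	errTransient       = errors.New("transient failure")
)

// Assumed values: the PR mentions a 30s window; the backoff is illustrative.
const (
	retryWindow  = 30 * time.Second
	retryBackoff = 10 * time.Millisecond
)

// isTerminal reports whether retrying can no longer succeed.
func isTerminal(err error) bool {
	return errors.Is(err, errTopicTerminated) || errors.Is(err, errProducerClosed)
}

// sendWithRetry calls send until it succeeds, returns a terminal error,
// or the retry window elapses. It never loops forever.
func sendWithRetry(send func() error, window, backoff time.Duration) error {
	deadline := time.Now().Add(window)
	for {
		err := send()
		if err == nil {
			return nil
		}
		if isTerminal(err) {
			return fmt.Errorf("terminal error, not retrying: %w", err)
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("retry window of %s exceeded: %w", window, err)
		}
		time.Sleep(backoff)
	}
}

func main() {
	attempts := 0
	err := sendWithRetry(func() error {
		attempts++
		if attempts < 3 {
			return errTransient // first two attempts fail transiently
		}
		return nil
	}, retryWindow, retryBackoff)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Wrapping with `%w` keeps the original error inspectable via `errors.Is`, so a caller can still tell a terminal exit from an exhausted window.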
Comments suppressed due to low confidence (2)
pulsar/blue_green_migration_test.go:193
- The pre-unload synchronization looks off by one: with `i` starting at 0, `if i == messageCountBeforeUnload { wgSendAndReceiveMessages.Done() }` signals after processing `messageCountBeforeUnload+1` messages. If the intent is to wait until exactly `messageCountBeforeUnload` messages have been sent, this should trigger at `i == messageCountBeforeUnload-1` (or increment a counter and compare to the target).
		if i == messageCountBeforeUnload {
			wgSendAndReceiveMessages.Done()
		}
	}
pulsar/blue_green_migration_test.go:228
- Same off-by-one issue on the consumer side: `i` is 0-based, so `if i == messageCountBeforeUnload { ... }` fires after `messageCountBeforeUnload+1` messages have been received/acked. If you want the unload to start after exactly `messageCountBeforeUnload` messages, adjust the condition accordingly (e.g., `messageCountBeforeUnload-1`).
		if i == messageCountBeforeUnload {
			wgSendAndReceiveMessages.Done()
		}
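The off-by-one described in both comments can be checked with a small standalone sketch. The helper `messagesProcessedAtSignal` is hypothetical and only models a 0-based loop that signals when `i == trigger`; it assumes the check runs after message `i` has been handled, as in the snippets above.

```go
package main

import "fmt"

// messagesProcessedAtSignal returns how many messages a 0-based loop has
// handled when the condition `i == trigger` first fires, or -1 if it
// never fires.
func messagesProcessedAtSignal(total, trigger int) int {
	for i := 0; i < total; i++ {
		// ... message i would be sent or received here ...
		if i == trigger {
			return i + 1 // messages 0..i have been handled
		}
	}
	return -1
}

func main() {
	const messageCountBeforeUnload = 10

	// Current condition: fires one message late.
	fmt.Println(messagesProcessedAtSignal(100, messageCountBeforeUnload)) // 11

	// Suggested condition: fires after exactly the target count.
	fmt.Println(messagesProcessedAtSignal(100, messageCountBeforeUnload-1)) // 10
}
```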
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bc05b7a to
6c34646
Compare
BewareMyPower
requested changes
Jun 1, 2026
Motivation
The extensible-load-manager CI job timed out in TestBlueGreenMigrationTestSuite/TestTopicMigration/proxyConnection after 5 minutes.
From the stack and logs, the test was stuck waiting on a WaitGroup while producer/consumer goroutines were still looping in retry paths. During migration, the producer can enter terminal states (for example TopicTerminated or ProducerClosed), but the test's retry loops had no terminal-exit logic, so they retried effectively without bound until the suite timed out.
Modifications
These changes make the test deterministic under migration failures and prevent it from hanging until the global test timeout.
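One way a stage-level wait could avoid blocking forever on a WaitGroup is sketched below. The PR's actual `waitWithError` signature is not shown here, so the parameters, the `runStage` driver, `errWorker`, and `stageTimeout` are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Hypothetical values used only by this sketch.
var errWorker = errors.New("producer closed")

const stageTimeout = 2 * time.Second

// waitWithError waits for wg, but returns early if a worker reports an
// error on errCh or if the timeout elapses. On early return, the helper
// goroutine stays blocked in wg.Wait until the workers finish.
func waitWithError(wg *sync.WaitGroup, errCh <-chan error, timeout time.Duration) error {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		// Workers may have reported an error just before finishing.
		select {
		case err := <-errCh:
			return err
		default:
			return nil
		}
	case err := <-errCh:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("stage did not finish within %s", timeout)
	}
}

// runStage starts one worker that either succeeds or reports workerErr.
func runStage(workerErr error, timeout time.Duration) error {
	var wg sync.WaitGroup
	errCh := make(chan error, 1) // buffered so the worker never blocks on send
	wg.Add(1)
	go func() {
		defer wg.Done()
		if workerErr != nil {
			errCh <- workerErr
		}
	}()
	return waitWithError(&wg, errCh, timeout)
}

func main() {
	fmt.Println("ok stage:", runStage(nil, stageTimeout))
	fmt.Println("failing stage:", runStage(errWorker, stageTimeout))
}
```

In a test, the returned error would typically be passed to the suite's failure helper so the stage fails immediately instead of waiting for the global timeout.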