Skip to content

fix: avoid infinite retry in blue-green migration test#1502

Open
nodece wants to merge 4 commits into
apache:masterfrom
nodece:fix-blue-green-ci-timeout
Open

fix: avoid infinite retry in blue-green migration test#1502
nodece wants to merge 4 commits into
apache:masterfrom
nodece:fix-blue-green-ci-timeout

Conversation

@nodece
Copy link
Copy Markdown
Member

@nodece nodece commented May 20, 2026

Motivation

The extensible-load-manager CI job timed out in TestBlueGreenMigrationTestSuite/TestTopicMigration/proxyConnection after 5 minutes.

From the stack and logs, the test was stuck waiting on WaitGroup while producer/consumer goroutines were still looping in retry paths. During migration, producer can enter terminal states (for example TopicTerminated or ProducerClosed), but the test retry loops had no terminal-exit logic, causing effectively unbounded retries and suite timeout.

Modifications

  • Add terminal error handling in producer send retry loop:
    • if error is ErrTopicTerminated or ErrProducerClosed, fail fast instead of retrying forever.
  • Add bounded retry windows for both producer and consumer loops (30 seconds per message stage).
  • Add an error channel and stage-level wait helper around WaitGroup waits to fail early on goroutine errors.
  • Add stage timeout protection while waiting for:
    • pre-unload send/receive synchronization
    • producer/consumer goroutine completion
  • Keep per-iteration context cancellation immediate (cancel after each send/receive attempt).

These changes make the test deterministic under migration failures and prevent hanging until global test timeout.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the blue/green migration integration test to avoid hanging indefinitely during broker migration/unload scenarios by adding bounded retry behavior, terminal-error exits, and stage-level timeouts/error propagation.

Changes:

  • Add terminal error handling for producer send retries (stop retrying on ErrTopicTerminated / ErrProducerClosed).
  • Bound producer and consumer retry loops to a 30s window per message.
  • Add an error channel and a waitWithError helper to fail stages early and avoid waiting forever on WaitGroups.
Comments suppressed due to low confidence (2)

pulsar/blue_green_migration_test.go:193

  • The pre-unload synchronization looks off by one: with i starting at 0, if i == messageCountBeforeUnload { wgSendAndReceiveMessages.Done() } signals after processing messageCountBeforeUnload+1 messages. If the intent is to wait until exactly messageCountBeforeUnload messages have been sent, this should trigger at i == messageCountBeforeUnload-1 (or increment a counter and compare to the target).

			if i == messageCountBeforeUnload {
				wgSendAndReceiveMessages.Done()
			}
		}

pulsar/blue_green_migration_test.go:228

  • Same off-by-one issue on the consumer side: i is 0-based, so if i == messageCountBeforeUnload { ... } fires after messageCountBeforeUnload+1 messages have been received/acked. If you want the unload to start after exactly messageCountBeforeUnload messages, adjust the condition accordingly (e.g., messageCountBeforeUnload-1).

			if i == messageCountBeforeUnload {
				wgSendAndReceiveMessages.Done()
			}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@BewareMyPower BewareMyPower added this to the 0.20.0 milestone Jun 1, 2026
Comment thread pulsar/blue_green_migration_test.go
Comment thread pulsar/blue_green_migration_test.go
Comment thread pulsar/blue_green_migration_test.go
Comment thread pulsar/blue_green_migration_test.go
nodece and others added 3 commits June 1, 2026 15:57
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nodece nodece force-pushed the fix-blue-green-ci-timeout branch from bc05b7a to 6c34646 Compare June 1, 2026 08:01
Comment thread pulsar/blue_green_migration_test.go
Comment thread pulsar/blue_green_migration_test.go Outdated
@RobertIndie RobertIndie modified the milestones: 0.20.0, 0.21.0 Jun 1, 2026
@nodece nodece requested a review from BewareMyPower June 2, 2026 10:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants