Skip to content

PHOENIX-7859 Make ParallelPhoenixConnectionFallbackIT deterministic#2482

Open
mnpoonia wants to merge 2 commits into
apache:masterfrom
mnpoonia:PHOENIX-7859-deterministic-test
Open

PHOENIX-7859 Make ParallelPhoenixConnectionFallbackIT deterministic#2482
mnpoonia wants to merge 2 commits into
apache:masterfrom
mnpoonia:PHOENIX-7859-deterministic-test

Conversation

@mnpoonia

@mnpoonia mnpoonia commented May 19, 2026

Copy link
Copy Markdown
Contributor

What is the purpose of this change?

Fix intermittent timeout in ParallelPhoenixConnectionFallbackIT.testParallelConnectionBackoff by increasing waitFor timeout ceilings to tolerate CI host CPU starvation.

Background

The test uses waitFor() with a 5s ceiling throughout. On a busy Yetus worker, newly created thread pools may not be scheduled for several seconds, causing the condition to never be observed within the deadline. The 5s limit was chosen by eye and offers very little margin under resource contention.

waitFor() exits as soon as the condition becomes true, so raising the ceiling costs nothing on fast machines — it only provides headroom when the CI host is under load.

What changes did I make?

Raised all waitFor timeout ceilings from 5000ms to 60000ms.

Related Issues

@mnpoonia mnpoonia force-pushed the PHOENIX-7859-deterministic-test branch 3 times, most recently from ce41869 to fb1bfd5 Compare May 20, 2026 03:02
…y checking queue state directly

The test ParallelPhoenixConnectionFallbackIT.testParallelConnectionBackoff
times out intermittently in CI when polling hasCapacity() to detect when
executor queues fill up.

Root cause: hasCapacity() performs a multi-step calculation (read queue
size, read capacity, divide, compare threshold) which creates a race
condition. Tasks can enter queues during calculation steps, causing the
check to miss state transitions.

Solution: Check queue.size() >= 1 directly (single atomic operation),
then verify hasCapacity() matches expected state as an assertion.

Benefits:
- Eliminates race condition (atomic read vs multi-step calculation)
- More deterministic (checks actual state, not derived value)
- Maintains 5s timeout (no increase needed)
- Validates both queue state and hasCapacity() logic
- Adds debug logging for troubleshooting

Testing: Passed locally with HBase 2.6.5. Queues filled in ~105ms
(2 checks), well under 5s timeout. State transition [0,0] → [1,1]
detected reliably. hasCapacity() correctly returned [false, false].

Related: PHOENIX-6840 (flaky ParallelPhoenix tests)

Co-authored-by: Claude Code <claude-code@anthropic.com>
@mnpoonia mnpoonia force-pushed the PHOENIX-7859-deterministic-test branch from fb1bfd5 to 2b1341c Compare May 20, 2026 03:04
@mnpoonia

Copy link
Copy Markdown
Contributor Author

@virajjasani

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes ParallelPhoenixConnectionFallbackIT.testParallelConnectionBackoff deterministic by removing a race-prone polling condition and instead waiting on the executor queues’ observed state before asserting hasCapacity() behavior.

Changes:

  • Replaces polling on PhoenixHAExecutorServiceProvider.hasCapacity() with polling on the underlying executor queue sizes.
  • Adds explicit assertions that hasCapacity() reports the expected “no capacity” state once queues are filled.
  • Adds debug logging to show queue sizes while waiting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +122 to +123
int queueSize1 = ((ThreadPoolExecutor) services.get(0).getExecutorService()).getQueue().size();
int queueSize2 = ((ThreadPoolExecutor) services.get(1).getExecutorService()).getQueue().size();
Comment on lines +133 to +134
assertFalse("Cluster 1 should have no capacity after queues filled", capacity.get(0).booleanValue());
assertFalse("Cluster 2 should have no capacity after queues filled", capacity.get(1).booleanValue());

@apurtell apurtell left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stated root cause is at least partly wrong?

Even if one hasCapacity() call returns a stale answer because of the non-atomic read pair, the next poll will see the update. The test polls 50 times over 5s. Maybe detection is delayed by one tick (~100ms). It cannot cause a 5s timeout. There must be some other reason?

Be sure to run mvn spotless:apply before posting an update or this will fail precommit or cause precommits to fail post merge.

LOG.debug("Waiting for queues to fill: cluster1 queue={}, cluster2 queue={}",
queueSize1, queueSize2);

return queueSize1 >= 1 && queueSize2 >= 1;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any future change adjusts HA_MAX_QUEUE_SIZE this test may pass when it should fail (the wait completes earlier than the fallback signal actually flips) or fail in confusing ways. Consider deriving the trigger from the configured values, e.g.:

int maxQueue = Integer.parseInt(
    PROPERTIES.getProperty(PhoenixHAExecutorServiceProvider.HA_MAX_QUEUE_SIZE));
double threshold = Double.parseDouble(
    PROPERTIES.getProperty(
        PhoenixHAExecutorServiceProvider.HA_THREADPOOL_QUEUE_BACKOFF_THRESHOLD));
int trigger = (int) Math.ceil(threshold * maxQueue);

... or just keep hasCapacity() as the wait predicate

@apurtell

apurtell commented May 21, 2026

Copy link
Copy Markdown
Contributor

FWIW @mnpoonia I put this to my robot and it suggests CI host CPU starvation is the cause because the timeouts are too tight.

On a busy Yetus worker, the brand-new single-thread pool created on line 100 of the test may not be scheduled for several seconds. The 5s timeout is just too tight. It was almost certainly chosen by eye, not by measurement, and offers very little margin.

…tion

The 5s timeouts were too tight for busy Yetus workers where newly created
thread pools may not be scheduled for several seconds. waitFor() exits as
soon as the condition is true, so raising the ceiling to 60s costs nothing
on fast machines while giving CI adequate headroom.

Also reverts the queue.size() polling approach — the hasCapacity() race
window is at most one poll tick (~100ms), not the cause of 5s timeouts.
@mnpoonia

mnpoonia commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @apurtell In that case i think we should increase timeout here. We are checking every 100 ms so it will not impact normal runs but will help in overloaded CI case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants