Skip to content

Bug: fix IPC named lock client-side robustness for Windows#1910

Merged
gnodet merged 3 commits into
apache:masterfrom
gnodet:gnodet/fix-ipc-windows
Jun 8, 2026
Merged

Bug: fix IPC named lock client-side robustness for Windows#1910
gnodet merged 3 commits into
apache:masterfrom
gnodet:gnodet/fix-ipc-windows

Conversation

@gnodet

@gnodet gnodet commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes Windows CI flakiness in IpcNamedLockFactoryIT.exclusiveAccess() (seen in PR #1909, CI run 27139141318).

Root cause: When the loser thread's client.lock() times out in doLockExclusively(), the cleanup call client.unlock(contextId) could either:

  • Hang forever — it used Long.MAX_VALUE timeout, so if the server/receiver had any issue, the thread blocked indefinitely
  • Throw RuntimeException — which propagated out of Access.run() (which only catches InterruptedException), causing the thread to terminate without calling loser.countDown(). The winner thread then waited on loser.await() forever, and the test hung until Surefire killed it at 25 seconds.

Changes

  1. IpcNamedLock: Wrap cleanup unlock in tryUnlock() — a try-catch that swallows exceptions so the calling thread always returns false instead of crashing. This is the primary fix.

  2. IpcClient.unlock(): Replace Long.MAX_VALUE timeout with 10 seconds. Normal unlock round-trips complete in microseconds; 10 seconds is generous enough while preventing infinite hangs.

  3. IpcClient.stopServer(): Replace Long.MAX_VALUE timeout with 30 seconds.

  4. IpcClient.receive(): On empty response, complete the future exceptionally instead of throwing IllegalStateException — the old behavior killed the receiver thread, breaking all subsequent operations on the shared client.

  5. IpcClient.close(): Call close() on EOF so pending futures are properly cleaned up when the server disconnects, and reset the initialized flag to prevent stale state.

Test plan

  • All 27 IPC tests pass locally (unit + integration: IpcNamedLockFactoryIT, IpcAdapterNoForkIT, IpcAdapterIT)
  • Windows CI should no longer hang on exclusiveAccess

When lock acquisition times out in IpcNamedLock, the cleanup call to
client.unlock() could either hang forever (Long.MAX_VALUE timeout) or
throw RuntimeException. On Windows, where IPC socket behavior can
differ, this causes test threads to either block indefinitely or crash
without signaling their CountDownLatch, making the test appear to hang.

Four changes:

1. IpcNamedLock: wrap cleanup unlock in try-catch (tryUnlock) so that a
   failure during timeout handling returns false instead of crashing the
   calling thread. This is the primary fix for the exclusiveAccess test
   hang on Windows.

2. IpcClient.unlock/stopServer: replace Long.MAX_VALUE timeouts with
   bounded values (10s/30s) so operations fail fast when the server is
   unresponsive.

3. IpcClient.receive: complete the future exceptionally on empty
   responses instead of throwing, which would kill the receiver thread
   and break all subsequent operations on the shared client.

4. IpcClient.close: properly call close() on EOF to clean up pending
   futures, and reset the initialized flag to prevent stale state.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@gnodet

gnodet commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

retest

The send() method read the `output` field multiple times without holding
the same lock as close(Throwable), which nulls it under synchronized(this).
Between ensureInitialized() returning and reaching synchronized(output),
the receiver thread could close the connection and null out the field,
causing a NullPointerException at output.flush().

Fix by capturing the output field in a local variable immediately after
ensureInitialized(), checking for null, and using the local throughout
the synchronized block.
@gnodet

gnodet commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Pushed a fix for the remaining NPE race condition in IpcClient.send().

Root cause: The send() method read the output field multiple times without holding the same lock as close(Throwable), which nulls the field under synchronized(this). Between ensureInitialized() returning and reaching synchronized(output), the receiver thread could call close(Throwable) — setting output = null — causing a NullPointerException at output.flush().

Fix: Capture output in a local variable immediately after ensureInitialized(), check it for null (throwing IOException("Connection closed") if the connection was already torn down), and use the local variable for both the synchronized block and all writes within it. This eliminates the TOCTOU race between the field read and its use.

@gnodet

gnodet commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

/rerun

@gnodet

gnodet commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Rerunning CI to gather additional Windows stability data.

@gnodet

gnodet commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

retest this please

@gnodet gnodet merged commit 9975d78 into apache:master Jun 8, 2026
42 checks passed
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

@gnodet Please assign appropriate label to PR according to the type of change.

@github-actions github-actions Bot added this to the 2.0.19 milestone Jun 8, 2026
@cstamas cstamas added the bug Something isn't working label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants