Bug: fix IPC named lock client-side robustness for Windows#1910
Conversation
When lock acquisition times out in IpcNamedLock, the cleanup call to client.unlock() could either hang forever (Long.MAX_VALUE timeout) or throw RuntimeException. On Windows, where IPC socket behavior can differ, this causes test threads to either block indefinitely or crash without signaling their CountDownLatch, making the test appear to hang. Four changes: 1. IpcNamedLock: wrap cleanup unlock in try-catch (tryUnlock) so that a failure during timeout handling returns false instead of crashing the calling thread. This is the primary fix for the exclusiveAccess test hang on Windows. 2. IpcClient.unlock/stopServer: replace Long.MAX_VALUE timeouts with bounded values (10s/30s) so operations fail fast when the server is unresponsive. 3. IpcClient.receive: complete the future exceptionally on empty responses instead of throwing, which would kill the receiver thread and break all subsequent operations on the shared client. 4. IpcClient.close: properly call close() on EOF to clean up pending futures, and reset the initialized flag to prevent stale state.
|
retest |
The send() method read the `output` field multiple times without holding the same lock as close(Throwable), which nulls it under synchronized(this). Between ensureInitialized() returning and reaching synchronized(output), the receiver thread could close the connection and null out the field, causing a NullPointerException at output.flush(). Fix by capturing the output field in a local variable immediately after ensureInitialized(), checking for null, and using the local throughout the synchronized block.
|
Pushed a fix for the remaining NPE race condition in Root cause: The Fix: Capture |
|
/rerun |
|
Rerunning CI to gather additional Windows stability data. |
|
retest this please |
|
@gnodet Please assign appropriate label to PR according to the type of change. |
Summary
Fixes Windows CI flakiness in
IpcNamedLockFactoryIT.exclusiveAccess()(seen in PR #1909, CI run 27139141318).Root cause: When the loser thread's
client.lock()times out indoLockExclusively(), the cleanup callclient.unlock(contextId)could either:Long.MAX_VALUEtimeout, so if the server/receiver had any issue, the thread blocked indefinitelyAccess.run()(which only catchesInterruptedException), causing the thread to terminate without callingloser.countDown(). The winner thread then waited onloser.await()forever, and the test hung until Surefire killed it at 25 seconds.Changes
IpcNamedLock: Wrap cleanup unlock intryUnlock()— a try-catch that swallows exceptions so the calling thread always returnsfalseinstead of crashing. This is the primary fix.IpcClient.unlock(): ReplaceLong.MAX_VALUEtimeout with 10 seconds. Normal unlock round-trips complete in microseconds; 10 seconds is generous enough while preventing infinite hangs.IpcClient.stopServer(): ReplaceLong.MAX_VALUEtimeout with 30 seconds.IpcClient.receive(): On empty response, complete the future exceptionally instead of throwingIllegalStateException— the old behavior killed the receiver thread, breaking all subsequent operations on the shared client.IpcClient.close(): Callclose()on EOF so pending futures are properly cleaned up when the server disconnects, and reset theinitializedflag to prevent stale state.Test plan
IpcNamedLockFactoryIT,IpcAdapterNoForkIT,IpcAdapterIT)exclusiveAccess