Skip to content

test(radius): keep test responder alive across its read timeout#1050

Open
vharseko wants to merge 3 commits into
OpenIdentityPlatform:masterfrom
vharseko:issues/fix-test
Open

test(radius): keep test responder alive across its read timeout#1050
vharseko wants to merge 3 commits into
OpenIdentityPlatform:masterfrom
vharseko:issues/fix-test

Conversation

@vharseko

@vharseko vharseko commented Jun 15, 2026

Copy link
Copy Markdown
Member

Root-cause fix for the remaining CI flake in
attackerForgedAccessAccept_withZeroAuthenticator_isRejected, which kept
failing with "No RADIUS server is online." despite the earlier static-
state isolation and the widened client timeout.

The background UDP responder runs on a server socket with a 5s SO_TIMEOUT
(@BeforeMethod). Its receive loop rethrew every IOException - including
SocketTimeoutException - which killed the responder thread whenever no
request arrived within 5s. The very first test method can be slow to
send its first packet (cold-JVM class loading, and authenticate()'s
InetAddress.getLocalHost() blocking on reverse DNS on a CI host). When
that pre-send delay exceeds 5s the responder dies first; the client's
request then goes unanswered, the client times out, marks its only
server OFFLINE and reports "No RADIUS server is online." instead of the
asserted authenticator rejection. Locally the client always sends within
5s, so the responder replies before its timeout - hence it only failed
on CI.

Treat SocketTimeoutException in the responder loop as "no request yet"
and keep waiting, so the responder survives an arbitrarily slow client.
Reproduced deterministically by injecting a 7s pre-send delay: fails
("No RADIUS server is online.", ~17s) without this change, passes with
it.

Also tighten the test client read timeout from 15s to 10s: with the
responder no longer dying, the large value is only defence-in-depth
against scheduling jitter and no longer needs to mask the real bug.

vharseko added 3 commits June 15, 2026 21:44
RadiusConn keeps server-availability state in two static fields that
outlive any single connection: the SERVER_STATUS map and the scheduled
RADIUSMonitor health-check timer. These leaked across the 13 test
methods, making the suite order-dependent and flaky under CI timing:

- failoverToSecondary permanently recorded its dead primary as OFFLINE
  and scheduled a background monitor that kept probing.
- attackerForgedAccessAccept_withZeroAuthenticator_isRejected then saw
  getOnlineServer() return null ("No RADIUS server is online.") when an
  ephemeral port number was reused, since RADIUSServer keys on host+port.
- healthCheckSurvivesClientSocketClose had its manually-driven
  monitor.run() churn through stale OFFLINE entries (multiple 2s socket
  timeouts) while racing the leaked background monitor, so the live
  server was never promoted back to ONLINE.

Add resetRadiusConnStatics(), invoked in @BeforeMethod and @AfterMethod,
which cancels and clears the static serverMonitor (unscheduling the
background timer) and clears SERVER_STATUS via reflection. Each method
now starts from a clean slate.

All 13 tests pass, green and stable across repeated runs (~4.8s vs the
20s timeout-laden CI run).
Two independent CI-only flakes in RadiusConnSecurityTest, both rooted in
test-harness timing rather than in RadiusConn's security logic.

1. Leaked static state across test methods.
   RadiusConn keeps server-availability in two static fields that outlive
   any single connection: the SERVER_STATUS map and the scheduled
   RADIUSMonitor health-check timer. failoverToSecondary marks its dead
   primary OFFLINE and schedules a background monitor; left running it
   bled into later methods, so healthCheckSurvivesClientSocketClose had
   its manually-driven monitor.run() churn through stale OFFLINE entries
   (multiple socket timeouts) while racing the leaked monitor, and the
   live server was never promoted back to ONLINE.
   Fix: resetRadiusConnStatics(), invoked in @BeforeMethod and
   @AfterMethod, cancels and nulls the static serverMonitor (unscheduling
   the background timer) and clears SERVER_STATUS via reflection, so each
   method starts from a clean slate.

2. Too-tight client read timeout on the cold first method.
   attackerForgedAccessAccept_withZeroAuthenticator_isRejected is the
   first @test in source order and pays JVM warm-up / class-loading cost.
   On a CPU-constrained CI runner the responder thread is briefly starved
   and the 2s socket timeout fired before the forged reply arrived; the
   client then marked its only server OFFLINE and surfaced "No RADIUS
   server is online." instead of the asserted authenticator rejection.
   Fix: widen the test client read timeout from 2s to 15s. Every test
   using this client receives a response and returns as soon as it
   arrives, so the larger timeout never slows the happy path; it only
   absorbs CI scheduling jitter.

Suite is green and stable across repeated local runs (~4.8s).
Root-cause fix for the remaining CI flake in
attackerForgedAccessAccept_withZeroAuthenticator_isRejected, which kept
failing with "No RADIUS server is online." despite the earlier static-
state isolation and the widened client timeout.

The background UDP responder runs on a server socket with a 5s SO_TIMEOUT
(@BeforeMethod). Its receive loop rethrew every IOException - including
SocketTimeoutException - which killed the responder thread whenever no
request arrived within 5s. The very first test method can be slow to
*send* its first packet (cold-JVM class loading, and authenticate()'s
InetAddress.getLocalHost() blocking on reverse DNS on a CI host). When
that pre-send delay exceeds 5s the responder dies first; the client's
request then goes unanswered, the client times out, marks its only
server OFFLINE and reports "No RADIUS server is online." instead of the
asserted authenticator rejection. Locally the client always sends within
5s, so the responder replies before its timeout - hence it only failed
on CI.

Treat SocketTimeoutException in the responder loop as "no request yet"
and keep waiting, so the responder survives an arbitrarily slow client.
Reproduced deterministically by injecting a 7s pre-send delay: fails
("No RADIUS server is online.", ~17s) without this change, passes with
it.

Also tighten the test client read timeout from 15s to 10s: with the
responder no longer dying, the large value is only defence-in-depth
against scheduling jitter and no longer needs to mask the real bug.
@vharseko vharseko changed the title test(radius): isolate RadiusConnSecurityTest from leaked static state test(radius): keep test responder alive across its read timeout Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant