test(server): harden setup integration tests against flaky CI#1051
Open
vharseko wants to merge 4 commits into
Open
test(server): harden setup integration tests against flaky CI#1051vharseko wants to merge 4 commits into
vharseko wants to merge 4 commits into
Conversation
RadiusConn keeps server-availability state in two static fields that
outlive any single connection: the SERVER_STATUS map and the scheduled
RADIUSMonitor health-check timer. These leaked across the 13 test
methods, making the suite order-dependent and flaky under CI timing:
- failoverToSecondary permanently recorded its dead primary as OFFLINE
and scheduled a background monitor that kept probing.
- attackerForgedAccessAccept_withZeroAuthenticator_isRejected then saw
getOnlineServer() return null ("No RADIUS server is online.") when an
ephemeral port number was reused, since RADIUSServer keys on host+port.
- healthCheckSurvivesClientSocketClose had its manually-driven
monitor.run() churn through stale OFFLINE entries (multiple 2s socket
timeouts) while racing the leaked background monitor, so the live
server was never promoted back to ONLINE.
Add resetRadiusConnStatics(), invoked in @BeforeMethod and @AfterMethod,
which cancels and clears the static serverMonitor (unscheduling the
background timer) and clears SERVER_STATUS via reflection. Each method
now starts from a clean slate.
All 13 tests pass, green and stable across repeated runs (~4.8s vs the
20s timeout-laden CI run).
Two independent CI-only flakes in RadiusConnSecurityTest, both rooted in test-harness timing rather than in RadiusConn's security logic. 1. Leaked static state across test methods. RadiusConn keeps server-availability in two static fields that outlive any single connection: the SERVER_STATUS map and the scheduled RADIUSMonitor health-check timer. failoverToSecondary marks its dead primary OFFLINE and schedules a background monitor; left running it bled into later methods, so healthCheckSurvivesClientSocketClose had its manually-driven monitor.run() churn through stale OFFLINE entries (multiple socket timeouts) while racing the leaked monitor, and the live server was never promoted back to ONLINE. Fix: resetRadiusConnStatics(), invoked in @BeforeMethod and @AfterMethod, cancels and nulls the static serverMonitor (unscheduling the background timer) and clears SERVER_STATUS via reflection, so each method starts from a clean slate. 2. Too-tight client read timeout on the cold first method. attackerForgedAccessAccept_withZeroAuthenticator_isRejected is the first @test in source order and pays JVM warm-up / class-loading cost. On a CPU-constrained CI runner the responder thread is briefly starved and the 2s socket timeout fired before the forged reply arrived; the client then marked its only server OFFLINE and surfaced "No RADIUS server is online." instead of the asserted authenticator rejection. Fix: widen the test client read timeout from 2s to 15s. Every test using this client receives a response and returns as soon as it arrives, so the larger timeout never slows the happy path; it only absorbs CI scheduling jitter. Suite is green and stable across repeated local runs (~4.8s).
Root-cause fix for the remaining CI flake in
attackerForgedAccessAccept_withZeroAuthenticator_isRejected, which kept
failing with "No RADIUS server is online." despite the earlier static-
state isolation and the widened client timeout.
The background UDP responder runs on a server socket with a 5s SO_TIMEOUT
(@BeforeMethod). Its receive loop rethrew every IOException - including
SocketTimeoutException - which killed the responder thread whenever no
request arrived within 5s. The very first test method can be slow to
*send* its first packet (cold-JVM class loading, and authenticate()'s
InetAddress.getLocalHost() blocking on reverse DNS on a CI host). When
that pre-send delay exceeds 5s the responder dies first; the client's
request then goes unanswered, the client times out, marks its only
server OFFLINE and reports "No RADIUS server is online." instead of the
asserted authenticator rejection. Locally the client always sends within
5s, so the responder replies before its timeout - hence it only failed
on CI.
Treat SocketTimeoutException in the responder loop as "no request yet"
and keep waiting, so the responder survives an arbitrarily slow client.
Reproduced deterministically by injecting a 7s pre-send delay: fails
("No RADIUS server is online.", ~17s) without this change, passes with
it.
Also tighten the test client read timeout from 15s to 10s: with the
responder no longer dying, the large value is only defence-in-depth
against scheduling jitter and no longer needs to mask the real bug.
IT_SetupWithOpenDJ.testSetupWithOpendj intermittently fails on CI with a TimeoutException while waiting for the OpenAM installation to complete. The Selenium + Docker + Cargo setup tests drive a full OpenAM install through a real browser against a Tomcat and an OpenDJ container, so they are sensitive to resource contention on a shared CI runner (a container slow to become healthy, a Tomcat slow to deploy the heavy OpenAM web app, or a ChromeDriver command stalling under load). Treat these as infra flakes rather than a product bug: - Add RetryAnalyzer + RetryListener: every integration-test method is retried once (configurable via -Dit.retry.count) before failing the build. Each retry starts from a clean config and a freshly started OpenAM instance (CargoBaseTest), so a transient infra hiccup gets a second chance without masking a reproducible failure. Wired globally via @listeners on BaseTest so both IT classes inherit it. - Raise the testSetupWithOpendj install-completion wait from 600s to 900s: this variant installs against an external OpenDJ and makes OpenAM create the base entry itself, which under contention can exceed the old 10-minute ceiling.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
IT_SetupWithOpenDJ.testSetupWithOpendj intermittently fails on CI with a
TimeoutException while waiting for the OpenAM installation to complete.
The Selenium + Docker + Cargo setup tests drive a full OpenAM install
through a real browser against a Tomcat and an OpenDJ container, so they
are sensitive to resource contention on a shared CI runner (a container
slow to become healthy, a Tomcat slow to deploy the heavy OpenAM web app,
or a ChromeDriver command stalling under load). Treat these as infra
flakes rather than a product bug:
retried once (configurable via -Dit.retry.count) before failing the
build. Each retry starts from a clean config and a freshly started
OpenAM instance (CargoBaseTest), so a transient infra hiccup gets a
second chance without masking a reproducible failure. Wired globally
via @listeners on BaseTest so both IT classes inherit it.
900s: this variant installs against an external OpenDJ and makes OpenAM
create the base entry itself, which under contention can exceed the old
10-minute ceiling.