test(server): harden setup integration tests against flaky CI by vharseko · Pull Request #1051 · OpenIdentityPlatform/OpenAM

vharseko · 2026-06-15T20:38:36Z

IT_SetupWithOpenDJ.testSetupWithOpendj intermittently fails on CI with a
TimeoutException while waiting for the OpenAM installation to complete.
The Selenium + Docker + Cargo setup tests drive a full OpenAM install
through a real browser against a Tomcat and an OpenDJ container, so they
are sensitive to resource contention on a shared CI runner (a container
slow to become healthy, a Tomcat slow to deploy the heavy OpenAM web app,
or a ChromeDriver command stalling under load). Treat these as infra
flakes rather than a product bug:

Add RetryAnalyzer + RetryListener: every integration-test method is
retried once (configurable via -Dit.retry.count) before failing the
build. Each retry starts from a clean config and a freshly started
OpenAM instance (CargoBaseTest), so a transient infra hiccup gets a
second chance without masking a reproducible failure. Wired globally
via @listeners on BaseTest so both IT classes inherit it.
Raise the testSetupWithOpendj install-completion wait from 600s to
900s: this variant installs against an external OpenDJ and makes OpenAM
create the base entry itself, which under contention can exceed the old
10-minute ceiling.

RadiusConn keeps server-availability state in two static fields that outlive any single connection: the SERVER_STATUS map and the scheduled RADIUSMonitor health-check timer. These leaked across the 13 test methods, making the suite order-dependent and flaky under CI timing: - failoverToSecondary permanently recorded its dead primary as OFFLINE and scheduled a background monitor that kept probing. - attackerForgedAccessAccept_withZeroAuthenticator_isRejected then saw getOnlineServer() return null ("No RADIUS server is online.") when an ephemeral port number was reused, since RADIUSServer keys on host+port. - healthCheckSurvivesClientSocketClose had its manually-driven monitor.run() churn through stale OFFLINE entries (multiple 2s socket timeouts) while racing the leaked background monitor, so the live server was never promoted back to ONLINE. Add resetRadiusConnStatics(), invoked in @BeforeMethod and @AfterMethod, which cancels and clears the static serverMonitor (unscheduling the background timer) and clears SERVER_STATUS via reflection. Each method now starts from a clean slate. All 13 tests pass, green and stable across repeated runs (~4.8s vs the 20s timeout-laden CI run).

@test

Two independent CI-only flakes in RadiusConnSecurityTest, both rooted in test-harness timing rather than in RadiusConn's security logic. 1. Leaked static state across test methods. RadiusConn keeps server-availability in two static fields that outlive any single connection: the SERVER_STATUS map and the scheduled RADIUSMonitor health-check timer. failoverToSecondary marks its dead primary OFFLINE and schedules a background monitor; left running it bled into later methods, so healthCheckSurvivesClientSocketClose had its manually-driven monitor.run() churn through stale OFFLINE entries (multiple socket timeouts) while racing the leaked monitor, and the live server was never promoted back to ONLINE. Fix: resetRadiusConnStatics(), invoked in @BeforeMethod and @AfterMethod, cancels and nulls the static serverMonitor (unscheduling the background timer) and clears SERVER_STATUS via reflection, so each method starts from a clean slate. 2. Too-tight client read timeout on the cold first method. attackerForgedAccessAccept_withZeroAuthenticator_isRejected is the first @test in source order and pays JVM warm-up / class-loading cost. On a CPU-constrained CI runner the responder thread is briefly starved and the 2s socket timeout fired before the forged reply arrived; the client then marked its only server OFFLINE and surfaced "No RADIUS server is online." instead of the asserted authenticator rejection. Fix: widen the test client read timeout from 2s to 15s. Every test using this client receives a response and returns as soon as it arrives, so the larger timeout never slows the happy path; it only absorbs CI scheduling jitter. Suite is green and stable across repeated local runs (~4.8s).

Root-cause fix for the remaining CI flake in attackerForgedAccessAccept_withZeroAuthenticator_isRejected, which kept failing with "No RADIUS server is online." despite the earlier static- state isolation and the widened client timeout. The background UDP responder runs on a server socket with a 5s SO_TIMEOUT (@BeforeMethod). Its receive loop rethrew every IOException - including SocketTimeoutException - which killed the responder thread whenever no request arrived within 5s. The very first test method can be slow to *send* its first packet (cold-JVM class loading, and authenticate()'s InetAddress.getLocalHost() blocking on reverse DNS on a CI host). When that pre-send delay exceeds 5s the responder dies first; the client's request then goes unanswered, the client times out, marks its only server OFFLINE and reports "No RADIUS server is online." instead of the asserted authenticator rejection. Locally the client always sends within 5s, so the responder replies before its timeout - hence it only failed on CI. Treat SocketTimeoutException in the responder loop as "no request yet" and keep waiting, so the responder survives an arbitrarily slow client. Reproduced deterministically by injecting a 7s pre-send delay: fails ("No RADIUS server is online.", ~17s) without this change, passes with it. Also tighten the test client read timeout from 15s to 10s: with the responder no longer dying, the large value is only defence-in-depth against scheduling jitter and no longer needs to mask the real bug.

@listeners

IT_SetupWithOpenDJ.testSetupWithOpendj intermittently fails on CI with a TimeoutException while waiting for the OpenAM installation to complete. The Selenium + Docker + Cargo setup tests drive a full OpenAM install through a real browser against a Tomcat and an OpenDJ container, so they are sensitive to resource contention on a shared CI runner (a container slow to become healthy, a Tomcat slow to deploy the heavy OpenAM web app, or a ChromeDriver command stalling under load). Treat these as infra flakes rather than a product bug: - Add RetryAnalyzer + RetryListener: every integration-test method is retried once (configurable via -Dit.retry.count) before failing the build. Each retry starts from a clean config and a freshly started OpenAM instance (CargoBaseTest), so a transient infra hiccup gets a second chance without masking a reproducible failure. Wired globally via @listeners on BaseTest so both IT classes inherit it. - Raise the testSetupWithOpendj install-completion wait from 600s to 900s: this variant installs against an external OpenDJ and makes OpenAM create the base entry itself, which under contention can exceed the old 10-minute ceiling.

vharseko added 4 commits June 15, 2026 21:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test(server): harden setup integration tests against flaky CI#1051

test(server): harden setup integration tests against flaky CI#1051
vharseko wants to merge 4 commits into
OpenIdentityPlatform:masterfrom
vharseko:issues/fix-test-2

vharseko commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

vharseko commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant