Skip to content

test(server): harden setup integration tests against flaky CI#1051

Open
vharseko wants to merge 4 commits into
OpenIdentityPlatform:masterfrom
vharseko:issues/fix-test-2
Open

test(server): harden setup integration tests against flaky CI#1051
vharseko wants to merge 4 commits into
OpenIdentityPlatform:masterfrom
vharseko:issues/fix-test-2

Conversation

@vharseko

Copy link
Copy Markdown
Member

IT_SetupWithOpenDJ.testSetupWithOpendj intermittently fails on CI with a
TimeoutException while waiting for the OpenAM installation to complete.
The Selenium + Docker + Cargo setup tests drive a full OpenAM install
through a real browser against a Tomcat and an OpenDJ container, so they
are sensitive to resource contention on a shared CI runner (a container
slow to become healthy, a Tomcat slow to deploy the heavy OpenAM web app,
or a ChromeDriver command stalling under load). Treat these as infra
flakes rather than a product bug:

  • Add RetryAnalyzer + RetryListener: every integration-test method is
    retried once (configurable via -Dit.retry.count) before failing the
    build. Each retry starts from a clean config and a freshly started
    OpenAM instance (CargoBaseTest), so a transient infra hiccup gets a
    second chance without masking a reproducible failure. Wired globally
    via @listeners on BaseTest so both IT classes inherit it.
  • Raise the testSetupWithOpendj install-completion wait from 600s to
    900s: this variant installs against an external OpenDJ and makes OpenAM
    create the base entry itself, which under contention can exceed the old
    10-minute ceiling.

vharseko added 4 commits June 15, 2026 21:44
RadiusConn keeps server-availability state in two static fields that
outlive any single connection: the SERVER_STATUS map and the scheduled
RADIUSMonitor health-check timer. These leaked across the 13 test
methods, making the suite order-dependent and flaky under CI timing:

- failoverToSecondary permanently recorded its dead primary as OFFLINE
  and scheduled a background monitor that kept probing.
- attackerForgedAccessAccept_withZeroAuthenticator_isRejected then saw
  getOnlineServer() return null ("No RADIUS server is online.") when an
  ephemeral port number was reused, since RADIUSServer keys on host+port.
- healthCheckSurvivesClientSocketClose had its manually-driven
  monitor.run() churn through stale OFFLINE entries (multiple 2s socket
  timeouts) while racing the leaked background monitor, so the live
  server was never promoted back to ONLINE.

Add resetRadiusConnStatics(), invoked in @BeforeMethod and @AfterMethod,
which cancels and clears the static serverMonitor (unscheduling the
background timer) and clears SERVER_STATUS via reflection. Each method
now starts from a clean slate.

All 13 tests pass, green and stable across repeated runs (~4.8s vs the
20s timeout-laden CI run).
Two independent CI-only flakes in RadiusConnSecurityTest, both rooted in
test-harness timing rather than in RadiusConn's security logic.

1. Leaked static state across test methods.
   RadiusConn keeps server-availability in two static fields that outlive
   any single connection: the SERVER_STATUS map and the scheduled
   RADIUSMonitor health-check timer. failoverToSecondary marks its dead
   primary OFFLINE and schedules a background monitor; left running it
   bled into later methods, so healthCheckSurvivesClientSocketClose had
   its manually-driven monitor.run() churn through stale OFFLINE entries
   (multiple socket timeouts) while racing the leaked monitor, and the
   live server was never promoted back to ONLINE.
   Fix: resetRadiusConnStatics(), invoked in @BeforeMethod and
   @AfterMethod, cancels and nulls the static serverMonitor (unscheduling
   the background timer) and clears SERVER_STATUS via reflection, so each
   method starts from a clean slate.

2. Too-tight client read timeout on the cold first method.
   attackerForgedAccessAccept_withZeroAuthenticator_isRejected is the
   first @test in source order and pays JVM warm-up / class-loading cost.
   On a CPU-constrained CI runner the responder thread is briefly starved
   and the 2s socket timeout fired before the forged reply arrived; the
   client then marked its only server OFFLINE and surfaced "No RADIUS
   server is online." instead of the asserted authenticator rejection.
   Fix: widen the test client read timeout from 2s to 15s. Every test
   using this client receives a response and returns as soon as it
   arrives, so the larger timeout never slows the happy path; it only
   absorbs CI scheduling jitter.

Suite is green and stable across repeated local runs (~4.8s).
Root-cause fix for the remaining CI flake in
attackerForgedAccessAccept_withZeroAuthenticator_isRejected, which kept
failing with "No RADIUS server is online." despite the earlier static-
state isolation and the widened client timeout.

The background UDP responder runs on a server socket with a 5s SO_TIMEOUT
(@BeforeMethod). Its receive loop rethrew every IOException - including
SocketTimeoutException - which killed the responder thread whenever no
request arrived within 5s. The very first test method can be slow to
*send* its first packet (cold-JVM class loading, and authenticate()'s
InetAddress.getLocalHost() blocking on reverse DNS on a CI host). When
that pre-send delay exceeds 5s the responder dies first; the client's
request then goes unanswered, the client times out, marks its only
server OFFLINE and reports "No RADIUS server is online." instead of the
asserted authenticator rejection. Locally the client always sends within
5s, so the responder replies before its timeout - hence it only failed
on CI.

Treat SocketTimeoutException in the responder loop as "no request yet"
and keep waiting, so the responder survives an arbitrarily slow client.
Reproduced deterministically by injecting a 7s pre-send delay: fails
("No RADIUS server is online.", ~17s) without this change, passes with
it.

Also tighten the test client read timeout from 15s to 10s: with the
responder no longer dying, the large value is only defence-in-depth
against scheduling jitter and no longer needs to mask the real bug.
IT_SetupWithOpenDJ.testSetupWithOpendj intermittently fails on CI with a
TimeoutException while waiting for the OpenAM installation to complete.
The Selenium + Docker + Cargo setup tests drive a full OpenAM install
through a real browser against a Tomcat and an OpenDJ container, so they
are sensitive to resource contention on a shared CI runner (a container
slow to become healthy, a Tomcat slow to deploy the heavy OpenAM web app,
or a ChromeDriver command stalling under load). Treat these as infra
flakes rather than a product bug:

- Add RetryAnalyzer + RetryListener: every integration-test method is
  retried once (configurable via -Dit.retry.count) before failing the
  build. Each retry starts from a clean config and a freshly started
  OpenAM instance (CargoBaseTest), so a transient infra hiccup gets a
  second chance without masking a reproducible failure. Wired globally
  via @listeners on BaseTest so both IT classes inherit it.
- Raise the testSetupWithOpendj install-completion wait from 600s to
  900s: this variant installs against an external OpenDJ and makes OpenAM
  create the base entry itself, which under contention can exceed the old
  10-minute ceiling.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant