fix: deflake HA full e2e suite by switching to in-proc interval-mining anvil by spalladino · Pull Request #23979 · AztecProtocol/aztec-packages

spalladino · 2026-06-10T03:25:12Z

Fixes the flaky HA full suite (e2e_ha_full) seen in http://ci.aztec-labs.com/8e1e980c4886df0d, where "should distribute work across multiple HA nodes" timed out awaiting a trigger tx. Also re-enables the suite, which #23976 had skipped.

Root cause

The HA compose suite was the only block-building suite running against an L1 with no self-advancing clock. Its anvil container ran in automine with no --block-time, and being external, it was excluded from the TestDateProvider sync that locally-spawned anvils get. L1 chain time only moved when something mined, while the shared sequencer clock free-ran. #23821 removed the AnvilTestWatcher that used to couple the two clocks in this mode and replaced it with per-iteration nudges in the test (clock warp + blind mine(8)).

Two consequences, both visible in the failed run's logs:

The mine(8) overshoot put L1 ~1.5 slots ahead of the test clock, so each iteration's first propose raced its slot boundary and was silently dropped, followed by a prune that destroyed the pipelined builders' forks ("Fork not found" on all surviving nodes). This race was lost in passing runs too.
Recovery then required the proposers' archiver-sync gate to clear, but the gate's deadline runs on the free-running test clock while nothing mines L1 during the test's waitForTx. The log line "Archiver did not sync L1 past slot 109 before slot 110 expired, discarding pipelined work" then repeated until the jest timeout. Whether a run passed or failed came down to seconds of margin on this gate.
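
The clock decoupling described above can be illustrated with a toy model. This is only a sketch: the type, the helper names, and every number (block time, slot length, start times) are assumptions made for the arithmetic, not values or code from the suite.

```typescript
// Toy model of the two clocks. All names and numbers are illustrative
// assumptions, not the suite's real configuration.
interface Clocks {
  l1TimeSec: number; // L1 chain time: timestamp of the latest mined block
  testTimeSec: number; // the sequencers' shared, free-running test clock
}

/** Cheat-mine `blocks` L1 blocks, each advancing chain time by `l1BlockTimeSec`. */
function mineBlind(c: Clocks, blocks: number, l1BlockTimeSec: number): Clocks {
  return { l1TimeSec: c.l1TimeSec + blocks * l1BlockTimeSec, testTimeSec: c.testTimeSec };
}

/** Let real time pass: only the test clock moves, because nothing mines L1. */
function waitRealTime(c: Clocks, seconds: number): Clocks {
  return { l1TimeSec: c.l1TimeSec, testTimeSec: c.testTimeSec + seconds };
}

/** How far L1 is ahead of the test clock, in L2 slots (negative means behind). */
function l1LeadInSlots(c: Clocks, l2SlotDurationSec: number): number {
  return (c.l1TimeSec - c.testTimeSec) / l2SlotDurationSec;
}

const L1_BLOCK_TIME_SEC = 12; // hypothetical
const L2_SLOT_SEC = 72; // hypothetical

let clocks: Clocks = { l1TimeSec: 1000, testTimeSec: 1000 };

// A blind mine pushes L1 ahead of the test clock.
clocks = mineBlind(clocks, 8, L1_BLOCK_TIME_SEC);
const leadAfterMine = l1LeadInSlots(clocks, L2_SLOT_SEC);

// While the test blocks on a tx wait, the test clock runs on and L1 stays put.
clocks = waitRealTime(clocks, 200);
const leadAfterWait = l1LeadInSlots(clocks, L2_SLOT_SEC);
```

In this model L1 is first ahead of the test clock and then falls behind it, which mirrors the two failure modes above: the propose races its slot boundary, then the archiver-sync gate waits on a chain that is not advancing.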

Fix

Stop emulating L1 time in the test and run the suite in the same regime as every other block-building e2e (e.g. e2e_epochs):

Drop the anvil container and ETHEREUM_HOSTS from the HA compose file. With no external L1 configured, setup() spawns anvil in-proc with interval mining (--block-time = ethereumSlotDuration) and keeps the TestDateProvider snapped to L1 block timestamps via the existing stdout listener. The sibling web3signer compose suite already works this way.
Add automineL1Setup: true so L1 contract deployment runs under temporary automine before interval mining starts.
Delete all time scaffolding from the test (clock warps, cheat-mining heartbeats, archiver sync nudges). Tests submit a tx and wait, in real time. No assertions change.
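
As a rough sketch of what "snapped to L1 block timestamps" means, the model below keeps a test clock equal to the latest block timestamp and lets it advance with real time between blocks. SnappedClock, snapTo and the fake wall clock are hypothetical stand-ins; they are not the TestDateProvider API or the actual stdout listener.

```typescript
// Hypothetical stand-in for a test clock that is snapped to L1 block
// timestamps. Not the real TestDateProvider API.
class SnappedClock {
  private offsetMs = 0;
  private readonly wallNowMs: () => number;

  constructor(wallNowMs: () => number) {
    this.wallNowMs = wallNowMs;
  }

  /** Time seen by the sequencers: wall time plus the offset of the last snap. */
  now(): number {
    return this.wallNowMs() + this.offsetMs;
  }

  /** Snap the clock so that now() equals the given L1 block timestamp. */
  snapTo(blockTimestampSec: number): void {
    this.offsetMs = blockTimestampSec * 1000 - this.wallNowMs();
  }
}

/** Timestamps of the first `count` blocks under interval mining. */
function intervalBlockTimestamps(genesisSec: number, blockTimeSec: number, count: number): number[] {
  const out: number[] = [];
  for (let i = 1; i <= count; i++) {
    out.push(genesisSec + i * blockTimeSec);
  }
  return out;
}

// Fake wall clock so the example is deterministic.
let wallMs = 5_000_000;
const clock = new SnappedClock(() => wallMs);

const driftsMs: number[] = [];
for (const ts of intervalBlockTimestamps(1_000, 12, 3)) {
  wallMs += 12_000; // one block interval of real time passes
  clock.snapTo(ts); // a new block was reported: snap to its timestamp
  driftsMs.push(clock.now() - ts * 1000);
}

// Between blocks the clock keeps moving with real time.
wallMs += 5_000;
const nowBetweenBlocksMs = clock.now();
```

Because L1 itself advances on a fixed interval, the clock never needs warping from the test: it is corrected at every block and runs in real time in between.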

No production code changes: with a self-advancing L1, the sequencer and publisher behave exactly as on a real network.

Parallelization

The suite file is renamed to e2e_ha_full.parallel.test.ts, so CI runs each of its 8 tests as an isolated job in its own compose stack instead of one 15+ minute serial job:

bootstrap.sh expands the HA suite per test name (same mechanism as the existing .parallel simple tests).
run_test.sh forwards the test name into the compose stack and namespaces the docker compose project per test so concurrent jobs on one host don't collide.
sendTriggerTx now starts the HA sequencers idempotently, since under per-test isolation the governance/reload/distribute tests run without the first test (previously the only caller of startHASequencers).
Three clock-skew test titles contained parentheses, which jest's --testNamePattern interprets as regex groups (the filter would silently match nothing); they are retitled.
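
The parentheses problem can be reproduced with a plain RegExp, since --testNamePattern takes a regex. The title below is hypothetical, and the escaping helper is shown only as an alternative; this PR retitles the tests rather than escaping the pattern.

```typescript
/** Escape regex metacharacters so a title used as a pattern matches itself literally. */
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical test title containing parentheses.
const skewTitle = "should tolerate clock skew (node ahead)";

// Used raw, "(node ahead)" is a capture group, so the pattern expects
// "skew node ahead" and does not match the literal parentheses.
const rawMatches = new RegExp(skewTitle).test(skewTitle);

// Escaping the pattern restores a literal match.
const escapedMatches = new RegExp(escapeRegExp(skewTitle)).test(skewTitle);

// Retitling without parentheses needs no escaping at all.
const retitled = "should tolerate clock skew with node ahead";
const retitledMatches = new RegExp(retitled).test(retitled);
```

Retitling keeps the shell plumbing simple, because the title can be forwarded as-is through bootstrap.sh and run_test.sh without an escaping step.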

Teardown fix (follow-up to the first CI round)

The first CI round passed every test body but three jobs (produce-blocks, governance, reload) hung in afterAll until the job timeout. Two compounding causes, both fixed here:

afterAll reset the shared TestDateProvider before stopping nodes. The reset rewinds the clock from chain time to wall time — minutes apart after the automine deploy burst — so vote submissions armed against the rewound clock pushed sequencer stops out by that gap. The old 30s abandon-race then gave up, and the abandoned nodes outlived the jest environment, keeping the worker alive until the CI timeout (jest runs without forceExit). afterAll now stops sequencers first, awaits every node stop fully, and resets the clock last. These three jobs are the ones whose tests end with sequencers still running; the distribute test (which stops nodes in-test, before any reset) passed for the same reason.
Ports fix: interrupt publisher send-at-slot sleep on sequencer stop #23990 from merge-train/spartan (not previously on the v5 line): CheckpointProposalJob.interrupt() now propagates to the publisher, cancelling the sendRequestsAt slot-deadline sleep on sequencer stop, so a pending vote submission can never block shutdown. The original PR's e2e_ha_full teardown changes are superseded by the rework above and were not ported.
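
The effect of resetting the clock before stopping nodes can be sketched as simple arithmetic on a sleep deadline. The helper and all numbers except the 30s abandon window are assumptions chosen for illustration; the real sleep lives in the publisher's sendRequestsAt.

```typescript
/** Milliseconds a sleep armed for `deadlineMs` still has to wait when the clock reads `nowMs`. */
function remainingSleepMs(deadlineMs: number, nowMs: number): number {
  return Math.max(0, deadlineMs - nowMs);
}

const ABANDON_AFTER_MS = 30_000; // the old afterAll gave up on node stops after 30s

const wallNowMs = 1_000_000; // hypothetical wall time
const chainNowMs = wallNowMs + 5 * 60_000; // assumed: chain time 5 minutes ahead after the deploy burst
const voteDeadlineMs = chainNowMs + 2_000; // assumed: a vote submission armed 2s ahead in chain time

// Clock still on chain time: the sleep is short and the stop completes quickly.
const waitOnChainClockMs = remainingSleepMs(voteDeadlineMs, chainNowMs);

// Clock reset to wall time first: the same deadline is now minutes away.
const waitAfterResetMs = remainingSleepMs(voteDeadlineMs, wallNowMs);

const oldTeardownAbandons = waitAfterResetMs > ABANDON_AFTER_MS;
```

Under these assumptions the reordered teardown waits about two seconds, while the old order waits for the whole chain-to-wall gap, longer than the abandon window.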

Verification

Three full local runs of the suite via run_test.sh ha (all 8 tests each): green in 255s / 254s / 268s of jest time (the old warp-based suite ran 10+ minutes), with zero occurrences of the old failure signatures ("Fork not found", "Archiver did not sync", "discarding pipelined work"). Passing runs of the old code showed 12+ "Fork not found" errors even when green.
One per-test CI-style run (run_test.sh ha <file> "should distribute work across multiple HA nodes"): the originally flaky test passes standalone in its own compose stack (7 skipped, 1 passed), exercising the full TEST_NAME plumbing.
yarn build, yarn format, yarn lint clean; sequencer-client unit tests pass (back to the pre-change suite after the revert).

…-work test

The HA compose suite runs anvil without interval mining, so L1 chain time advances only when blocks are mined on demand, while the sequencers' test clock free-runs. The one-shot per-iteration mine(8) overshot the test clock by ~1.5 slots, making every iteration's first propose race its slot boundary on L1, and once a race was lost the suite deadlocked: the archiver-sync gate deadline runs on the test clock while nothing advances L1 during waitForTx.

Replace the blind mine with a gap-closing one that never overshoots the test clock, make the per-iteration clock alignment forward-only so it cannot rewind below L1 time, and nudge L1 chain time plus archiver sync once per L1 slot while awaiting trigger txs, emulating a self-advancing L1.

A governance signal lands on L1 only when its block's timestamp falls within the slot it was signed for. When the HA duty race flushes the signal standalone instead of bundled with the timestamp-aligned propose tx, the per-slot retries need L1 chain time to keep tracking the test clock; with L1 frozen during the poll, every retry signs a slot the chain never reaches and the 120s poll times out.
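
The slot-matching rule in this commit message can be sketched as follows. The slot formula (fixed-length slots counted from genesis), the helper names, and all timestamps are assumptions for illustration, not the rollup's actual code.

```typescript
/** L2 slot containing a timestamp, assuming fixed-length slots counted from genesis. */
function slotAt(timestampSec: number, genesisSec: number, slotDurationSec: number): number {
  return Math.floor((timestampSec - genesisSec) / slotDurationSec);
}

interface Retry {
  signedAtSec: number; // test clock reading when the signal was signed
  minedAtSec: number; // timestamp of the L1 block that would include it
}

/** Count retries whose signed slot equals the slot of the including L1 block. */
function countLandedSignals(retries: Retry[], genesisSec: number, slotDurationSec: number): number {
  let landed = 0;
  for (const r of retries) {
    const signedSlot = slotAt(r.signedAtSec, genesisSec, slotDurationSec);
    const minedSlot = slotAt(r.minedAtSec, genesisSec, slotDurationSec);
    if (signedSlot === minedSlot) landed++;
  }
  return landed;
}

const GENESIS_SEC = 0; // hypothetical
const SLOT_SEC = 36; // hypothetical

// L1 frozen in an earlier slot while the test clock signs one retry per slot.
const frozenL1: Retry[] = [
  { signedAtSec: 1010, minedAtSec: 980 },
  { signedAtSec: 1046, minedAtSec: 980 },
  { signedAtSec: 1082, minedAtSec: 980 },
];

// L1 chain time tracking the test clock closely.
const trackingL1: Retry[] = [
  { signedAtSec: 1010, minedAtSec: 1012 },
  { signedAtSec: 1046, minedAtSec: 1048 },
  { signedAtSec: 1082, minedAtSec: 1084 },
];

const landedWhenFrozen = countLandedSignals(frozenL1, GENESIS_SEC, SLOT_SEC);
const landedWhenTracking = countLandedSignals(trackingL1, GENESIS_SEC, SLOT_SEC);
```

With L1 frozen, each retry signs a later slot than the one the chain is in, so none can land; once chain time tracks the test clock, every retry lands in its own slot.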

Governance and slashing signal signatures bind the L2 slot the tx mines in, so the one-L1-slot mempool warm-up in sendRequestsAt — meant to get proposes into the first L1 block of their slot — makes a vote-only bundle eligible for inclusion in the L1 block right before its slot starts, where signature verification fails silently inside Multicall3. Environments that mine txs on arrival (anvil automine) turn that risk into a certainty. Keep the warm-up only when the bundle contains a propose, and submit vote-only bundles at the slot boundary, matching the documented intent at the call sites.

The suite was skipped in #23976 while the HA block-building interaction with the always-enforced timetable (#23821) was diagnosed; the preceding commits fix that interaction.

This reverts commit c93af0b.

The HA compose suite was the only block-building suite running against an L1 with no self-advancing clock: an external anvil container in automine mode, excluded from the TestDateProvider sync that locally-spawned anvils get. After #23821 removed the AnvilTestWatcher that coupled the clocks, the test emulated L1 time with clock warps and cheat-mining nudges, which raced proposals against slot boundaries and starved the proposers' archiver-sync gate whenever the test thread blocked on a tx wait. Drop the anvil container and ETHEREUM_HOSTS so setup() spawns anvil in-proc with interval mining and the stdout dateProvider sync, exactly like e2e_epochs and the sibling web3signer compose suite. Delete all time scaffolding from the test; tests now submit txs and wait in real time. Deploy L1 contracts under temporary automine (automineL1Setup).

Rename the suite to .parallel.test.ts and expand it in bootstrap.sh into one job per test name, each in its own compose stack. run_test.sh forwards the test name into the container (jest --testNamePattern) and namespaces the docker compose project per test so concurrent jobs on one host don't collide. sendTriggerTx starts the HA sequencers idempotently since tests no longer share state, and three clock-skew titles lose their parentheses, which testNamePattern would parse as regex groups and match nothing.

Docker compose project names must be lowercase; test titles are not.
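
A minimal sketch of turning a test title into a compose-safe postfix is shown below. The real change lives in run_test.sh (shell); this TypeScript helper and its name are hypothetical, and it assumes the postfix is appended to a prefix that already starts with a valid character.

```typescript
/**
 * Turn a jest test title into a postfix usable in a docker compose project name.
 * Compose allows only lowercase letters, digits, dashes and underscores.
 */
function composeProjectPostfix(testTitle: string): string {
  return testTitle
    .toLowerCase() // project names must be lowercase
    .replace(/[^a-z0-9_-]+/g, "-") // collapse any other run of characters into one dash
    .replace(/^-+|-+$/g, ""); // trim dashes left at either end
}

const postfix = composeProjectPostfix("should distribute work across multiple HA nodes");
```

Deriving the postfix from the title keeps concurrent jobs on one host in separate compose projects, as long as the titles themselves are distinct.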

The afterAll hook abandoned node stops after 30s and reset the shared TestDateProvider before stopping anything. Resetting rewinds the clock from chain time to wall time (minutes apart after the automine deploy burst), so vote submissions armed against the rewound clock blocked sequencer shutdown until wall time caught up, and the abandoned nodes outlived the jest environment, keeping the worker alive until the CI job timeout. Stop sequencers first, await every stop, reset the clock last.

Port of #23990 from merge-train/spartan. Propagates CheckpointProposalJob.interrupt() to its SequencerPublisher so the publisher's sendRequestsAt slot-deadline sleep is cancelled on sequencer stop, and checks interrupted before sleeping since InterruptibleSleep.interrupt() only resolves sleeps already in flight. The e2e_ha_full teardown changes from the original PR are superseded by the afterAll rework in this branch and are not ported.

spalladino changed the title ~~fix: deflake HA full e2e suite under the always-enforced timetable~~ fix: deflake HA full e2e suite by switching to in-proc interval-mining anvil Jun 10, 2026

spalladino added 10 commits June 10, 2026 18:26

test(e2e): re-enable HA full suite

cb4b056

Revert "fix(sequencer): send vote-only bundles at the target slot start"

a91d246

fix(e2e): lowercase HA compose project name postfix

532dcd5

spalladino force-pushed the spl/deflake-ha branch from 64a6fb4 to e171113 Compare June 10, 2026 21:27

spalladino wants to merge 10 commits into merge-train/spartan-v5 from spl/deflake-ha