feat: merge-train/spartan-v5 #23975
Open
AztecBot wants to merge 8 commits into
…ted block (#23967)

## Motivation

On a staging HA validator, an archiver orphan prune triggered a storm of thousands of duplicate `chain-checkpointed` events out of p2p's `L2BlockStream`. The local tips store keeps a block number per cursor and derives the checkpoint number from a `block -> checkpoint` map that is only populated for the last block of each confirmed checkpoint. `handleChainPruned` moved the `checkpointed` and `proposedCheckpoint` cursors to the prune target unconditionally. That target is the new tip of the *proposed* chain and can be an uncheckpointed block with no mapping (in the incident, a block belonging to a not-yet-confirmed checkpoint, sitting ahead of the checkpointed tip). `getCheckpointId` then resolved that cursor to checkpoint zero, the stream computed `nextCheckpointToEmit = 0 + 1 = 1`, and it replayed every checkpoint from 1 up to the source tip.

## Approach

A prune is a rollback, so checkpoint-bearing cursors may only move *backward*. `handleChainPruned` now sets `proposed` unconditionally and clamps `checkpointed`/`proposedCheckpoint`/`proven` to the prune target only when they are ahead of it (generalizing the guard `proven` already had). In the incident the checkpointed cursor is left untouched and keeps resolving to its real checkpoint, so there is nothing to replay.

Surfacing a missing mapping *loudly* (rather than silently reporting checkpoint zero) is intentionally deferred to a stacked follow-up: doing it safely requires per-tip checkpoint ids so the store can fail loudly on genuine corruption without bricking legitimate skipped-history prunes (which would otherwise throw on the next `getL2Tips`). This PR is the minimal, behavior-preserving fix for the storm.

## Changes

- **stdlib**: `handleChainPruned` clamps checkpoint-bearing cursors backward only instead of forcing them onto the (possibly uncheckpointed) prune target.
- **stdlib (tests)**: store-level regression that a prune to an uncheckpointed block ahead of the checkpointed tip leaves the cursor intact and resolving to its real checkpoint; stream-level regression asserting no `chain-checkpointed` replay after such a prune.

Fixes A-1167
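The backward-only clamp described above can be sketched as follows. This is a minimal illustration, not the stdlib implementation: `TipsStoreSketch`, `getTip`, and `getCheckpointNumber` are hypothetical names, and the store is reduced to a block number per cursor plus the `block -> checkpoint` map.

```typescript
// Hypothetical, simplified model of the local tips store (not the real stdlib class).
type Cursor = "proposed" | "checkpointed" | "proposedCheckpoint" | "proven";
type Tips = Record<Cursor, number>;

class TipsStoreSketch {
  private tips: Tips;
  // Populated only for the last block of each confirmed checkpoint.
  private blockToCheckpoint = new Map<number, number>();

  constructor(tips: Tips, mapping: Array<[number, number]>) {
    this.tips = { ...tips };
    mapping.forEach(([block, checkpoint]) => this.blockToCheckpoint.set(block, checkpoint));
  }

  handleChainPruned(target: number): void {
    // The proposed cursor always follows the prune target.
    this.tips.proposed = target;
    // Checkpoint-bearing cursors may only move backward.
    const clamped: Cursor[] = ["checkpointed", "proposedCheckpoint", "proven"];
    clamped.forEach((cursor) => {
      if (this.tips[cursor] > target) {
        this.tips[cursor] = target;
      }
    });
  }

  getTip(cursor: Cursor): number {
    return this.tips[cursor];
  }

  // Mirrors the pre-follow-up behavior: a missing mapping silently resolves to 0.
  getCheckpointNumber(cursor: Cursor): number {
    return this.blockToCheckpoint.get(this.tips[cursor]) ?? 0;
  }
}
```

With the checkpointed tip at block 10 and a prune to uncheckpointed block 12, the `checkpointed` cursor stays at 10 and keeps resolving to its real checkpoint, so the stream has nothing to replay.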
…ration (#23821)

## Motivation

The production sequencer kept two legacy escape hatches: `enforceTimeTable=false` (unbounded block building with no deadlines) and `blockDurationMs=undefined` (single-block-per-slot mode). Both existed only to satisfy tests and the sandbox, complicated the timetable with dead branches, and let most e2e tests run under timing that production never uses.

## Approach

The timetable now always enforces sub-slot deadlines with a concrete `blockDurationMs` (required config, default 3000 ms). The only non-enforced path left is the `AutomineSequencer`: the local sandbox switches to it, which makes `AnvilTestWatcher` deletable, since it was already inert across the e2e suite (every e2e path runs anvil in interval mining). The e2e PIPELINING preset flips to enforced real timing at exactly 2 blocks per slot.

## Fee prediction changes

`sequencer-client/src/global_variable_builder/fee_provider.ts` now treats the current L1 fee snapshot as part of the predicted-fee set exposed by the node. `getPredictedMinFees()` returns the current minimum fees first, followed by the future-slot predictions from the fee predictor. This matters for local automine because a freshly mined checkpoint can make the node's current minimum fees higher than the predictor's future samples; including the current value prevents clients from quoting below what tx validation will accept. `getCurrentMinFees()` also bypasses viem's cached block number by calling `getBlockNumber({ cacheTime: 0 })`, so automine fee snapshots observe newly mined L1 blocks immediately instead of reusing a stale L1 block number.

## Public simulation global variables

`aztec-node/src/aztec-node/server.ts` no longer calls `buildGlobalVariables()` for `simulatePublicCalls`. Instead it computes a simulation target slot from local chain state and calls `buildCheckpointGlobalVariables()` with that slot, then combines those checkpoint globals with the requested simulation block number.

The target slot is the max of:

- the slot corresponding to the next L1 timestamp from the epoch cache,
- the slot after the proposed checkpoint loaded from the block source,
- the latest proposed block slot when it is ahead of the proposed checkpoint.

This keeps public simulation aligned with the same checkpoint-global construction used for block building, while avoiding a rollup-contract lookup on the simulation path.

## Automine sequencer: proving and recovery

The `AutomineSequencer` now drives epoch proving for the sandbox without a prover. `aztecNode.prove(upToCheckpoint?)` synthetically settles each checkpointed epoch (computing its out-hash, writing the outbox root and proven checkpoint to L1 via cheat codes, then calling `markAsProven`), with partial-epoch support: it can settle a prefix up to a requested checkpoint. Because that settlement mines no L1 block, it then mines one empty block so the archiver (which short-circuits its L1 sync while the block hash is unchanged) observes the new proven tip immediately, mirroring a real epoch proof landing an L1 verify tx. An optional auto-settle loop (`AUTOMINE_ENABLE_PROVE_EPOCH`, on by default for the local network) proves epochs as they close, replacing the standalone `EpochTestSettler` that used to race the build loop. On a wrong-slot or failed publish the sequencer returns the failed block's txs to the pool and retries the build rather than reorging L1, using a new `archiver.discardProposedCheckpointsAfter` to drop proposed-but-uncheckpointed blocks during recovery.

## Changes

- **stdlib**: `blockDuration` required in the checkpoint timing model, single-block branches removed; `DEFAULT_BLOCK_DURATION_MS = 3000` as single source of truth.
- **sequencer-client**: `SequencerTimetable` loses the `enforce` field; `canStartNextBlock` always returns a concrete deadline; config drops `enforceTimeTable`.
- **automine sequencer**: `prove(upToCheckpoint?)` synthetically settles epochs (partial-epoch capable) and an optional auto-settle loop (`AUTOMINE_ENABLE_PROVE_EPOCH`) advances the proven tip, mining an empty L1 block so the archiver observes it; failed/wrong-slot publishes return txs to the pool and retry instead of reorging L1 (new `archiver.discardProposedCheckpointsAfter`).
- **fees**: node fee predictions now include the current minimum fees as the first entry before future-slot predictions, and current fee snapshots bypass cached L1 block numbers so local automine fee quotes see freshly mined checkpoints.
- **p2p**: `blockDurationMs` required in proposal/attestation validators, the pipelining window, and gossipsub topic scoring.
- **foundation / aztec-node / validator-client**: `SEQ_ENFORCE_TIME_TABLE` env var removed; dead `blockDurationMs === undefined` branches simplified.
- **aztec (sandbox)**: local network runs the `AutomineSequencer` by default, including p2p-enabled local runs; local-network is not a mode for connecting to an existing Aztec network. `AnvilTestWatcher` deleted, and the standalone `EpochTestSettler` is replaced by the AutomineSequencer auto-settle loop.
- **end-to-end (tests)**: PIPELINING preset sets `blockDurationMs: 3000` (2 blocks/slot); ~30 `enforceTimeTable` call sites removed; watcher manual-proving call sites replaced with `cheatCodes.rollup.markAsProven()`; bench given explicit slot headroom; block-building regression test fixed for a min-txs remainder livelock that enforced deadlines exposed.
- **docs**: sequencer-client and gossipsub READMEs updated to the always-enforced model; sandbox/local-network docs updated to describe automine block production and the removal of `SEQ_ENFORCE_TIME_TABLE` for v5.

Breaking: `SEQ_ENFORCE_TIME_TABLE` is removed and `SEQ_BLOCK_DURATION_MS` now defaults to 3000 ms (previously unset, meaning single block per slot). The `SEQ_ENFORCE_TIME_TABLE` wiring in `spartan/` (deploy script, terraform, env files) is removed as well.

Fixes A-1148
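The simulation target-slot rule from "Public simulation global variables" can be illustrated with a small sketch. The interface and field names below are assumptions made for illustration only; the real code in `aztec-node/src/aztec-node/server.ts` reads these values from the epoch cache and the block source and may use different types.

```typescript
// Hypothetical inputs; field names are illustrative, not the node's actual API.
interface SimulationSlotInputs {
  // Slot corresponding to the next L1 timestamp (from the epoch cache).
  nextL1TimestampSlot: number;
  // Slot of the proposed checkpoint loaded from the block source.
  proposedCheckpointSlot: number;
  // Slot of the latest proposed block, if there is one.
  latestProposedBlockSlot?: number;
}

function simulationTargetSlot(inputs: SimulationSlotInputs): number {
  const candidates = [
    inputs.nextL1TimestampSlot,
    // "The slot after the proposed checkpoint".
    inputs.proposedCheckpointSlot + 1,
  ];
  // The latest proposed block only counts when it is ahead of the proposed checkpoint.
  if (
    inputs.latestProposedBlockSlot !== undefined &&
    inputs.latestProposedBlockSlot > inputs.proposedCheckpointSlot
  ) {
    candidates.push(inputs.latestProposedBlockSlot);
  }
  return Math.max(...candidates);
}
```

For example, with the next L1 timestamp in slot 10, the proposed checkpoint at slot 12, and no newer proposed block, the sketch targets slot 13.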
PhilWindle
approved these changes
Jun 9, 2026
…g test (#23976)

## Problem

CI on `merge-train/spartan-v5` (commit 609014a, [log](http://ci.aztec-labs.com/1781038207169152)) failed in the `yarn-project` build at the `yarn tsgo -b --emitDeclarationOnly` step:

```
end-to-end/src/e2e_epochs/epochs_optimistic_proving.parallel.test.ts(222,9): error TS2353: Object literal may only specify known properties, and 'enforceTimeTable' does not exist in type 'EpochsTestOpts'.
```

(also at lines 366, 473, 558, 646, 780)

## Root cause

PR #23821 (*always enforce timetable with concrete block duration*) made timetable enforcement unconditional and removed the `enforceTimeTable` option from `EpochsTestOpts`/`SetupOptions`, deleting ~30 `enforceTimeTable: true` call sites. `epochs_optimistic_proving.parallel.test.ts` landed on the v5 line separately and still passed `enforceTimeTable: true` at six sites, so it no longer type-checks.

## Fix

- Remove the six now-invalid `enforceTimeTable: true` properties. Each call site already sets a concrete `blockDurationMs: 8000`, so the change is behavior-preserving: it is the same deletion the PR applied to every other e2e test. Verified in CI: `yarn-project` now type-checks and `epochs_optimistic_proving.parallel.test.ts` passes.
- Temporarily `it.skip` the HA test `should distribute work across multiple HA nodes` in `composed/ha/e2e_ha_full.test.ts`, which fails under the always-enforced timetable (sequencer misses slots: `BlockOrCheckpointSlotExpiredError` / `no_blocks_built` / `Fork not found`). Skipped at Santiago's request, to be re-enabled after the HA block-building interaction with #23821 is fixed.
Fixes A-1157. Addresses security advisory GHSA-h4vv-85x5-6hmh.
## Problem
Peer scores decay toward zero (~0.9/minute). A peer whose score crossed
the ban threshold (`MIN_SCORE_BEFORE_BAN = -100`) recovered to a healthy
score within approximately 1 hour.
## Fix
Record a ban when a peer's score drops below the ban threshold and hold
it for a configurable duration (default 24h). Bans are kept **in memory
only** and are cleared on restart — a restarted node re-learns bad peers
from their behaviour rather than carrying stale bans across runs.
- **`PeerScoring`** records `{ score, expiry }` in an in-memory
`bannedPeers` map, so `getScore`/`getScoreState` stay **synchronous**
(required by the peer-manager hot paths, including a `.sort()`
comparator).
- While banned, `getScore` returns the **ban score** regardless of
decay, so a peer cannot recover its way out of the ban early — even
after `decayAllScores` cleans up the decayed live-score entry. Once the
ban expires it is lifted and the live (decayed) score takes over,
letting the peer recover.
- Expired bans are lifted lazily on the next score query
(`getActiveBanScore`) and swept proactively each heartbeat via
`pruneExpiredBans()`, so a banned peer that disconnects and is never
queried again does not linger in the map.
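The ban bookkeeping in the bullets above can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the actual `PeerScoring` class: `PeerBanSketch`, `updateScore`, `bannedCount`, the injected clock, and the decay cleanup cutoff are hypothetical, while `MIN_SCORE_BEFORE_BAN`, `bannedPeers`, `getScore`, `getActiveBanScore`, `decayAllScores`, and `pruneExpiredBans` take their names from the description.

```typescript
const MIN_SCORE_BEFORE_BAN = -100; // threshold quoted in the PR description

interface BanRecord {
  score: number; // score at the moment of the ban, reported while the ban is active
  expiry: number; // timestamp in ms after which the ban is lifted
}

class PeerBanSketch {
  private scores = new Map<string, number>();
  private bannedPeers = new Map<string, BanRecord>();
  private banDurationMs: number;
  private now: () => number;

  // The clock is injected so the example is deterministic (an assumption of this sketch).
  constructor(banDurationMs: number, now: () => number) {
    this.banDurationMs = banDurationMs;
    this.now = now;
  }

  updateScore(peerId: string, delta: number): void {
    const score = (this.scores.get(peerId) ?? 0) + delta;
    this.scores.set(peerId, score);
    // Record a ban once, when the score drops below the threshold.
    if (score < MIN_SCORE_BEFORE_BAN && this.getActiveBanScore(peerId) === undefined) {
      this.bannedPeers.set(peerId, { score, expiry: this.now() + this.banDurationMs });
    }
  }

  // Multiplicative decay toward zero; near-zero entries are cleaned up (cutoff is illustrative).
  decayAllScores(factor: number): void {
    this.scores.forEach((score, peerId) => {
      const decayed = score * factor;
      if (Math.abs(decayed) < 0.01) {
        this.scores.delete(peerId);
      } else {
        this.scores.set(peerId, decayed);
      }
    });
  }

  // Lazily lifts an expired ban on query; stays synchronous.
  private getActiveBanScore(peerId: string): number | undefined {
    const ban = this.bannedPeers.get(peerId);
    if (ban === undefined) return undefined;
    if (this.now() >= ban.expiry) {
      this.bannedPeers.delete(peerId);
      return undefined;
    }
    return ban.score;
  }

  // While banned, the ban score wins regardless of decay.
  getScore(peerId: string): number {
    return this.getActiveBanScore(peerId) ?? this.scores.get(peerId) ?? 0;
  }

  // Proactive sweep, intended to run once per heartbeat.
  pruneExpiredBans(): void {
    const now = this.now();
    this.bannedPeers.forEach((ban, peerId) => {
      if (now >= ban.expiry) this.bannedPeers.delete(peerId);
    });
  }

  bannedCount(): number {
    return this.bannedPeers.size;
  }
}
```

Keeping the ban record separate from the live score is what lets `getScore` keep returning the ban score after decay has removed the live entry, which is the advisory regression the tests cover.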
## Configuration
New `P2P_PEER_BAN_DURATION_SECONDS` (config field
`peerBanDurationSeconds`), default `86400` (24h). Registered in
`foundation` env vars and the P2P config mappings.
## Tests
`peer_scoring.test.ts` covers the full lifecycle, asserting both score
**values** and states:
- ban floor held through banned → recovered-live-score → expiry
transitions;
- `peerBanDurationSeconds` drives the window (60s case);
- the advisory regression: after decay cleans up the live-score entry,
`getScore` still returns the `-150` ban score (not `0`), keeping the
peer Banned;
- a peer whose previous ban has expired can be re-banned;
- `pruneExpiredBans` removes expired bans but keeps active ones.
Existing `peer_manager` and `peer_scoring` suites pass; the previously
existing "returns to Healthy after improving score" assertion was
updated to reflect the new intended behaviour (a banned peer stays
banned for the full window).
## Summary

- Rename JSON-RPC namespaces from `node_*` / `nodeAdmin_*` / `nodeDebug_*` to `aztec_*`, `aztecAdmin_*`, and `aztecDebug_*` on both client schemas and `aztec start` server registration.
- Stop mounting the standalone `p2p_*` namespace; add `getPeers` and `getCheckpointAttestationsForSlot` to `AztecNode` and delegate from the node server.
- Update node API reference generation, regenerated operator docs, e2e forward-compatibility config, and a migration note under TBD.

Fixes [A-1010](https://linear.app/aztec-labs/issue/A-1010/archiver-flag-silently-ignored-when-combined-with-node-archiver-rpc)
…all clock (#23978)

The prepare-for-slot loop in the p2p client was **not** synced with the `L2BlockStream` events, meaning the `unprotect` call could trigger before the `blocks-added` handler flagged the txs as mined. One solution would have been to add a new `slot-synced` event to the block stream, but it is simpler to remove the polling and unprotect slots when a block proposal that protected the txs fails. As a safeguard, we still call unprotect based on the slot numbers of mined blocks.

## Problem

The tx pool **protects** txs referenced by an in-flight block proposal: on gossip receipt, the proposal's txs are keyed to its slot and removed from the pending indices, so the local builder cannot re-select them and eviction cannot drop them while the proposal may still land. `prepareForSlot(S)` releases protections from slots before `S`, revalidates the txs, and returns them to pending.

Release was driven by a wall-clock slot monitor polling the epoch cache every tick. Three problems:

- **Race against mined-marking.** The monitor can fire after a proposal's checkpoint lands on L1 but before the block stream delivers `blocks-added`. The just-landed txs are unprotected into pending, where eviction or nullifier-conflict resolution can delete them; when the block then syncs there is nothing left to mark mined, and after a later reorg `handlePrunedBlocks` has nothing to restore, so the tx is lost to the pool.
- **Clock dependency.** The epoch cache is wall-clock derived and explicitly depends on system clock sync; unprotection correctness should depend on observed chain state instead.
- **Pipelining blind spot.** Gossiped proposals carry future target slots during proposer pipelining, so wall-clock release frees them late. (The old target-slot branch that tried to address this read `proposedCheckpoint` from the local tips store, where it can never lead the checkpointed tip; it was removed in #23968 as dead code.)

## Change

The protection lifecycle becomes fully event-driven and the slot monitor is deleted:

- **Protect** on gossip receipt of a block proposal (unchanged).
- **Release on local validation failure**: a proposal that fails validation immediately releases the protections it created, but only entries still keyed to that proposal's slot, so a tx also referenced by a live proposal at another slot stays protected.
- **Resolve via chain events** (unchanged): `blocks-added` marks txs mined, superseding protection; `chain-pruned` un-mines them back to pending. Proposals that landed as proposed-but-unconfirmed checkpoints and are later orphaned are fully handled by this existing lifecycle.
- **Collect silent deaths via synced block slots**: `prepareForSlot` now runs inside the `blocks-added` handler with the slot of the last synced block, after mined-marking. Any block landing at slot S releases protections from all earlier slots, covering proposals that never reached L1 at all (no quorum, proposer crash, dropped L1 tx), for which no chain event can ever fire. Because it is ordered after mined-marking in the same handler, the unprotect-before-mined race is impossible by construction.
- **Proposers are unaffected**: the sequencer already calls `prepareForSlot(targetSlot)` directly before building, which remains the one legitimate ahead-of-chain preparation.

## Trade-off accepted

During a multi-slot stall with no blocks landing anywhere, non-proposer pools retain protections until the first block lands (the wall clock used to release them mid-stall). There is no user-visible cost, because there is no chain to include txs in during a stall, and the memory held is bounded and self-healing on the first `blocks-added`. Proposers self-serve via the direct sequencer call. Protections are in-memory only, so a restart clears them.

Fixes A-1173
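The event-driven lifecycle can be sketched as a toy pool. Everything here is a simplified assumption for illustration: `TxProtectionSketch`, `protect`, `releaseFailedProposal`, `handleBlocksAdded`, and `status` are hypothetical names, txs are plain string hashes, and revalidation on release is omitted. Only `prepareForSlot` and the ordering inside the `blocks-added` handler are taken from the description.

```typescript
type TxStatus = "pending" | "protected" | "mined" | "unknown";

class TxProtectionSketch {
  private pending = new Set<string>();
  private mined = new Set<string>();
  // tx hash -> slot of the proposal that protects it
  private protectedBySlot = new Map<string, number>();

  // Gossip receipt of a block proposal: key its txs to the slot and hide them from pending.
  protect(txHashes: string[], slot: number): void {
    txHashes.forEach((tx) => {
      this.pending.delete(tx);
      this.protectedBySlot.set(tx, slot);
    });
  }

  // Local validation failure: release only entries still keyed to this proposal's slot.
  releaseFailedProposal(txHashes: string[], slot: number): void {
    txHashes.forEach((tx) => {
      if (this.protectedBySlot.get(tx) === slot) {
        this.protectedBySlot.delete(tx);
        this.pending.add(tx);
      }
    });
  }

  // blocks-added: mark mined first, then release older protections in the same handler.
  handleBlocksAdded(minedTxHashes: string[], lastBlockSlot: number): void {
    minedTxHashes.forEach((tx) => {
      this.protectedBySlot.delete(tx);
      this.pending.delete(tx);
      this.mined.add(tx);
    });
    this.prepareForSlot(lastBlockSlot);
  }

  // Releases protections from slots before `slot` back to pending (revalidation omitted).
  prepareForSlot(slot: number): void {
    this.protectedBySlot.forEach((protectedSlot, tx) => {
      if (protectedSlot < slot) {
        this.protectedBySlot.delete(tx);
        this.pending.add(tx);
      }
    });
  }

  status(tx: string): TxStatus {
    if (this.mined.has(tx)) return "mined";
    if (this.protectedBySlot.has(tx)) return "protected";
    if (this.pending.has(tx)) return "pending";
    return "unknown";
  }
}
```

Because mined-marking runs before `prepareForSlot` inside `handleBlocksAdded`, a tx that just landed can never be released to pending ahead of being marked mined, which is the ordering argument made above.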
BEGIN_COMMIT_OVERRIDE
fix(p2p): stop checkpoint-replay storm when pruning to an uncheckpointed block (#23967)
refactor(sequencer)!: always enforce timetable with concrete block duration (#23821)
fix(e2e): drop removed enforceTimeTable option from optimistic proving test (#23976)
feat: persist peer bans for a configurable duration (A-1157) (#23922)
refactor!: rename node JSON-RPC to aztec_* prefixes (#23909)
fix(p2p): drive tx protection release from synced blocks instead of wall clock (#23978)
END_COMMIT_OVERRIDE