refactor(stdlib)!: thin chain-checkpointed event, collapse sync by spalladino · Pull Request #24007 · AztecProtocol/aztec-packages

spalladino · 2026-06-11T02:14:55Z

Motivation

L2BlockStream.work() reconciled the four chain tips through independent code paths — a reorg walk, two
checkpoint loops with a shared cursor, a block loop, and a final proven/finalized diff — with optimizations
(startingBlock, skipFinalized) that muted some paths but not others. Nothing enforced that what one poll
emitted was tier-consistent, which produced a family of bugs (detailed below). The root enabler was the fat
chain-checkpointed event: because it carried a full PublishedCheckpoint, the stream had to fetch, order,
and replay every checkpoint individually, and every special case had to reason about when that replay could be
skipped.

This PR makes chain-checkpointed a thin tip event like chain-proven/chain-finalized and collapses
work() into three symmetric steps: reorg detection, one block-download loop, and unconditional end-of-pass
tier reconciliation from a single source snapshot. The two consumers that relied on per-checkpoint payload
delivery are first decoupled: the sentinel stops needing a block stream at all, and the prover-node drives its
own checkpoint catch-up from a cursor.

Bugs and races fixed

Checkpointed cursor stuck at genesis under startingBlock (A-1061). The startingBlock fast-forward
arm suppressed all checkpoint emission while the end-of-pass chain-proven still fired, leaving the local
store with proven > checkpointed and the checkpointed cursor at genesis until the source's checkpointed
tip passed startingBlock. In p2p this degraded isEpochPrune (epoch prunes misclassified as ordinary
prunes, so txPoolDeleteTxsAfterReorg was not honored during the window); the same pattern affected the
sentinel and prover-node streams. Tier reconciliation is now unconditional against the source snapshot, so
the first pass converges. Regression: emits the source checkpointed tip on the FIRST pass even when
startingBlock is past it.
Duplicate / out-of-order checkpoint events after a reorg (stale-snapshot guard). Loop 1's "blocks
already local" guard compared against the localTips snapshot taken before the prune handler rewrote the
store, so a node lagging a checkpoint that detected a reorg in the same pass could emit chain-checkpointed
for a new-fork checkpoint before its blocks, then re-emit checkpoints from the prune target upward. Loop 1
is deleted; reconciliation re-reads the local tips after a prune so it compares against the just-clamped
cursors. Regression: does not re-emit the checkpointed tip after pruning to a block ahead of it.
Same-height stale cursors never refreshed (number-only comparisons). The proven/finalized gates compared
block numbers only, so a same-number/different-hash tip after a reorg kept a stale (hash, checkpoint-id)
pair until the tip next advanced. Gates now compare (number, hash) — skipping the hash comparison when the
local hash is undefined, which world-state legitimately reports for tips ahead of its synced range
(comparing against it would re-emit the event on every poll). Regressions: re-emits the proven tip when
numbers match but the known local hash differs and does not re-emit when the local hash is undefined.
Sentinel credited attestors on reorged-out checkpoints. Its slotNumberToCheckpoint map had no
chain-pruned handling, so a reorged-out checkpoint's attestation entry lingered and getSlotActivity
could classify validators against a non-canonical checkpoint. The map (and the stream feeding it) is gone;
the sentinel fetches the checkpoint for a slot on demand, so the answer is always canonical.
Read-skew race between getL2Tips and getCheckpoints. The old checkpoint loops fetched payloads in
separate reads from the tips snapshot; a source-side reorg between the two reads could plant inconsistent
state in the local store (the reason catch-up code needed validation-and-abort machinery). Eliminated
structurally: every tier event is now built from the same getL2Tips snapshot, with no second fetch to
skew against.
Prover-node restart could have skipped unproven checkpoints (prevented by design here). With catch-up
driven by a cursor, seeding it from a checkpointed tip would silently skip the unproven checkpoints of a
partially-proven epoch on restart. The cursor seeds from the last checkpoint of the last fully-proven epoch
(or 0), and advances only after both checkpointStore.addOrUpdate and sessionManager.onCheckpointAdded
succeed, preserving the existing at-least-once retry semantics (A-1041). Epoch expiry additionally gets a
periodic tick, since it previously piggybacked on per-checkpoint event volume.
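The hash-aware gate from the "same-height stale cursors" fix can be sketched as below. This is a minimal illustration, not the code in the PR: `TipId` and `shouldEmitTip` are hypothetical names, and the sketch assumes any number mismatch counts as a change, whereas the real gate may only fire on advancement.

```typescript
// Hypothetical, simplified tip shape; the real id types live in stdlib.
type TipId = { number: number; hash: string | undefined };

// Returns true when the tier event for `source` should be (re-)emitted.
function shouldEmitTip(local: TipId, source: TipId): boolean {
  // Assumption: any number mismatch is treated as a change.
  if (local.number !== source.number) {
    return true;
  }
  // Same number, local hash unknown (e.g. a world-state-shaped provider reporting
  // a tip ahead of its synced range): skip the hash check so the event is not
  // re-emitted on every poll.
  if (local.hash === undefined) {
    return false;
  }
  // Same number, known local hash: re-emit only if the hash changed (reorg).
  return local.hash !== source.hash;
}
```

With this shape, a same-number/different-hash tip refreshes the stored pair immediately, while an undefined local hash never causes a re-emit loop.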

Approach

Sentinel (decoupled first): deletes its L2BlockStream, L2TipsMemoryStore, slot→checkpoint map, and
the manual stream sync in its work loop. getSlotActivity fetches archiver.getCheckpoint({ slot }) on
demand (the by-slot query already existed) and derives the same attestation data; the p2p-sync gate reads
the archiver tips directly instead of a stale local mirror.
Prover-node (decoupled second): on a thin chain-checkpointed tip event, walks every checkpoint between
its cursor and the tip: light getCheckpointsData metadata first, whole-epoch relevance filtering (an epoch
is skipped only if fully proven or past its proof-submission window — never individual checkpoints inside a
provable epoch, which the SessionManager's full-coverage contract requires), then a heavy getCheckpoint
fetch only for checkpoints it will actually register. This is strictly cheaper than the old stream replay,
which transferred every full checkpoint payload before the prover-node could decide to skip it.
Stream rewrite: chain-checkpointed becomes { block: L2BlockId, checkpoint: CheckpointId }, emitted at
most once per pass — symmetric with chain-proven/chain-finalized. work() is now: reorg walk-back +
chain-pruned; one getBlocks download loop (start incorporates startingBlock/skipFinalized);
end-of-pass reconciliation checkpointed → proven → finalized from one snapshot. PXE, the one remaining
payload consumer, fetches its anchor header by hash (reorg-safe) and skips the update if the block vanished.
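As a rough model of the end-of-pass reconciliation step, the sketch below derives at most one event per tier, in the order checkpointed, proven, finalized, from a single source snapshot. All names here (`Tips`, `reconcileTiers`, `changed`) are hypothetical, the checkpoint id carried by the real events is omitted for brevity, and the local tips are assumed to have been re-read after any prune handling.

```typescript
// Hypothetical, simplified shapes; checkpoint ids are left out of this sketch.
type BlockId = { number: number; hash: string | undefined };
type Tier = 'checkpointed' | 'proven' | 'finalized';
type Tips = Record<Tier, BlockId>;
type TierEvent = {
  type: 'chain-checkpointed' | 'chain-proven' | 'chain-finalized';
  block: BlockId;
};

const TIER_ORDER: Tier[] = ['checkpointed', 'proven', 'finalized'];
const EVENT_TYPE: Record<Tier, TierEvent['type']> = {
  checkpointed: 'chain-checkpointed',
  proven: 'chain-proven',
  finalized: 'chain-finalized',
};

// Compare (number, hash), skipping the hash when the local hash is unknown.
function changed(local: BlockId, source: BlockId): boolean {
  if (local.number !== source.number) return true;
  return local.hash !== undefined && local.hash !== source.hash;
}

// `localAfterPrune` is assumed to be read AFTER the prune handler ran, so the
// comparison is made against the just-clamped cursors, not a stale snapshot.
function reconcileTiers(localAfterPrune: Tips, sourceSnapshot: Tips): TierEvent[] {
  const events: TierEvent[] = [];
  for (const tier of TIER_ORDER) {
    if (changed(localAfterPrune[tier], sourceSnapshot[tier])) {
      events.push({ type: EVENT_TYPE[tier], block: sourceSnapshot[tier] });
    }
  }
  return events;
}
```

Because every event is built from the same snapshot argument, there is no second read that a source-side reorg could skew against.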

API changes

Internal API only (no RPC schema changes):

L2BlockStreamEvent: chain-checkpointed carries { block, checkpoint } ids instead of a
PublishedCheckpoint, and fires at most once per sync pass instead of once per checkpoint. Consumers
needing payloads fetch them from the block source.
L2BlockStream source narrows to Pick<L2BlockSource, 'getBlocks' | 'getBlockData' | 'getL2Tips'>; the
checkpointPrefetchLimit option and CHECKPOINT_PREFETCH_LIMIT export are gone.
LocalChainTips.checkpointed widens to { block, checkpoint } so the checkpointed tier can hash-gate like
proven/finalized (still structurally assignable from LocalL2Tips).
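For orientation, the thin event and a tips-store handler that reads ids straight off it might look like the following. Only the `{ block, checkpoint }` shape and the `handleChainCheckpointed` name come from the description above; the field layout of `L2BlockId` and `CheckpointId` and the `MemoryTipsStore` class are assumptions made for this sketch.

```typescript
// Assumed field layout; the PR only states that the event carries these two ids.
type L2BlockId = { number: number; hash: string };
type CheckpointId = { number: number; hash: string };

type ChainCheckpointedEvent = {
  type: 'chain-checkpointed';
  block: L2BlockId;
  checkpoint: CheckpointId;
};

// Hypothetical in-memory tips store, for illustration only.
class MemoryTipsStore {
  private checkpointed: { block: L2BlockId; checkpoint: CheckpointId } | undefined;

  // Reads the ids directly off the event; no payload is available to hash.
  handleChainCheckpointed(event: ChainCheckpointedEvent): void {
    this.checkpointed = { block: event.block, checkpoint: event.checkpoint };
  }

  getCheckpointed(): { block: L2BlockId; checkpoint: CheckpointId } | undefined {
    return this.checkpointed;
  }
}
```

A consumer that still needs the full checkpoint would fetch it from the block source using the id on the event.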

Simplifications

L2BlockStream.work(): ~220 → ~115 lines. Deleted: Loop 1 (already-local checkpoint backfill), Loop 2 +
prefetch buffer (checkpoint-transport block download), nextCheckpointToEmit, both startingBlock
checkpoint fast-forward arms, and the checkpoint payload fetching altogether.
Catch-up emits at most one checkpointed event per pass regardless of lag — no per-checkpoint replay, no
multi-emission warning path, no anti-spam special cases.
Sentinel: net ~50 lines removed plus a whole subsystem dependency (stream + tips store) — replaced by one
~25-line on-demand fetch.
PXE node adapter: the ~25-line getCheckpoints implementation is deleted; the telemetry stream wrapper
narrows accordingly.
Tips stores: handleChainCheckpointed reads ids straight off the event instead of recomputing the
checkpoint hash from the payload.
Stream test suite: rewritten from 1,828 to ~640 lines while adding four regression tests for the bugs above.
Net across the branch: −1,228 lines over 15 files.

Changes

stdlib: thin chain-checkpointed event; collapsed work(); hash-aware tier gates; narrowed stream
source type; tips-store handler reads event ids directly.
stdlib (tests): stream suite rewritten around the new event semantics + regression tests for A-1061,
post-prune reconciliation, and both hash-gate behaviors.
prover-node: cursor-driven checkpoint catch-up (processCheckpointJump/registerCheckpoint/
computeStartingCheckpoint), whole-epoch relevance filtering, prune clamping, periodic epoch-expiry tick.
aztec-node (sentinel): block stream, tips store, and slot→checkpoint map deleted; on-demand canonical
checkpoint fetch; direct archiver read for the p2p-sync gate.
pxe: anchor header fetched by hash on checkpointed-tip events; node adapter loses getCheckpoints.
kv-store / telemetry-client: test-suite and wrapper-type adjustments to the new event shape.

Fixes A-1061

…Stream work()

Replace the per-checkpoint fat `chain-checkpointed` event (carrying a full PublishedCheckpoint) with a thin once-per-pass tip event `{ block, checkpoint }`, symmetric with chain-proven/chain-finalized. Collapse L2BlockStream.work() to: reorg walk-back + chain-pruned; one getBlocks download loop; end-of-pass tier reconciliation (checkpointed -> proven -> finalized) from a single sourceTips snapshot. This removes the prefetch buffer, both startingBlock checkpoint fast-forward arms, nextCheckpointToEmit, Loop 1 and Loop 2, and the CHECKPOINT_PREFETCH_LIMIT option.

The stream no longer fetches checkpoint payloads, so its source type narrows to Pick<L2BlockSource, 'getBlocks'|'getBlockData'|'getL2Tips'>; PXE's node adapter and the telemetry wrapper drop getCheckpoints accordingly.

Fixes the A-1061 stuck-checkpointed-cursor livelock structurally: the checkpointed reconciliation runs unconditionally against the source snapshot, so a startingBlock past the source checkpointed tip still advances the local checkpointed cursor on the first pass. The tier reconciliation re-reads local tips after a prune so it compares against the cursors the prune handler just clamped (kills the Loop 1 stale-snapshot duplicate-emit bug), and the hash-aware diff compares (number, hash) while skipping the hash when the local hash is undefined (world-state), which re-emits same-number/different-hash tips without looping on world-state-shaped providers.

PXE advances its checkpointed anchor by fetching the tip header by hash, which is safe against same-height reorgs and skips cleanly when the block was reorged out. The tips store reads the checkpoint id straight off the thin event.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The L2BlockStream now delivers a single thin chain-checkpointed tip event per pass instead of one fat event per checkpoint, so the prover-node drives the catch-up itself: on a tip event it walks every checkpoint between its cursor and the reported tip, fetching light getCheckpointsData metadata first to decide relevance, then heavy getCheckpoint only for checkpoints in epochs that can still be proven. Whole unprovable epochs (fully proven / past the proof-submission window) are skipped; individual checkpoints inside a provable epoch are never skipped, honoring the SessionManager full-coverage contract.

The cursor seeds at start() from the last checkpoint of the last fully-proven epoch (or 0), so a restart reprocesses the partially-proven epoch rather than trusting a checkpointed tip that may lead the last proven checkpoint. It advances one checkpoint at a time and only after both checkpointStore.addOrUpdate and sessionManager.onCheckpointAdded succeed, preserving the A-1041 at-least-once semantics (a mid-jump failure leaves the cursor behind to retry). chain-pruned clamps the cursor down to the post-prune checkpointed tip.

checkEpochExpiry() now also runs from a periodic ticker, since the thin once-per-pass event no longer drives it per checkpoint and idle periods would otherwise stall expiry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
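A minimal sketch of the cursor-driven catch-up described in this commit, under stated assumptions: `CatchUpDeps`, `catchUpTo`, and the signatures of `getCheckpointsData`, `getCheckpoint`, and `isEpochProvable` are hypothetical, and `register` stands in for the pair checkpointStore.addOrUpdate + sessionManager.onCheckpointAdded. The sketch returns the cursor on failure, where the real code may rethrow instead.

```typescript
type CheckpointMeta = { number: number; epoch: number };

// Hypothetical dependency surface; real signatures may differ.
interface CatchUpDeps {
  getCheckpointsData(from: number, to: number): Promise<CheckpointMeta[]>; // light metadata
  getCheckpoint(number: number): Promise<{ number: number }>; // heavy payload
  isEpochProvable(epoch: number): boolean; // false if fully proven or past its window
  register(checkpoint: { number: number }): Promise<void>; // store + session manager
}

// Walks every checkpoint in (cursor, tip] and returns the new cursor.
async function catchUpTo(tip: number, cursor: number, deps: CatchUpDeps): Promise<number> {
  if (tip <= cursor) {
    return cursor;
  }
  const metas = await deps.getCheckpointsData(cursor + 1, tip);
  for (const meta of metas) {
    // Relevance is decided per epoch, never per checkpoint inside a provable epoch.
    if (deps.isEpochProvable(meta.epoch)) {
      try {
        const full = await deps.getCheckpoint(meta.number);
        await deps.register(full);
      } catch {
        // Leave the cursor behind so the next tip event retries this checkpoint.
        return cursor;
      }
    }
    // Advance only after registration succeeded (or the epoch was skipped whole).
    cursor = meta.number;
  }
  return cursor;
}
```

Advancing the cursor one checkpoint at a time is what keeps the retry at-least-once: a failure halfway through a jump loses no checkpoint, at the cost of possibly registering one twice.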

The sentinel's only use of its L2BlockStream was to feed a private slotNumberToCheckpoint map read solely by getSlotActivity, plus a local tips mirror for the p2p-synced gate. Delete the stream, its L2TipsMemoryStore, the map and its pruning, and the manual sync() in work().

getSlotActivity now fetches archiver.getCheckpoint({ slot }) on demand (computing checkpointNumber/archive/proposalPayloadHash/attestors as handleCheckpoint did), and the p2p-synced gate reads archiver.getL2Tips() directly.

On-demand reads are always against the canonical chain, which also fixes a latent reorg bug: the map had no chain-pruned handling, so a reorged-out checkpoint's attestation entry lingered and could credit attestors on a non-canonical checkpoint.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
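The on-demand lookup can be illustrated with a small sketch. Only the `getCheckpoint({ slot })` call shape comes from the commit message; the `CheckpointForSlot` fields, the `CheckpointBySlotSource` interface, and `getAttestorsForSlot` are hypothetical stand-ins for the sentinel's real types.

```typescript
// Assumed, trimmed-down result shape for illustration.
type CheckpointForSlot = { checkpointNumber: number; attestors: string[] };

// Hypothetical view of the archiver limited to the by-slot query.
interface CheckpointBySlotSource {
  getCheckpoint(query: { slot: number }): Promise<CheckpointForSlot | undefined>;
}

// Asks the source every time instead of consulting a locally cached map, so a
// checkpoint that was reorged out can never be returned from stale state.
async function getAttestorsForSlot(source: CheckpointBySlotSource, slot: number): Promise<string[]> {
  const checkpoint = await source.getCheckpoint({ slot });
  return checkpoint ? checkpoint.attestors : [];
}
```

The trade-off is one query per slot evaluation in exchange for dropping the stream, the tips store, and the map's missing prune handling.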

AztecBot · 2026-06-11T02:36:36Z

Flakey Tests

🤖 says: This CI run detected 1 test that failed but was tolerated due to a .test_patterns.yml entry.

FLAKED (http://ci.aztec-labs.com/3f6420a4813011d5): yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_epochs/epochs_optimistic_proving.parallel.test.ts "removes a checkpoint mid-epoch via reorg and proves with survivors" (227s) (code: 0) group:e2e-p2p-epoch-flakes

spalladino and others added 3 commits June 10, 2026 22:24

spalladino wants to merge 3 commits into spl/fix-tips-checkpoint-ids-fail-loud from spl/a1061-thin-checkpoint-events
