
refactor(stdlib)!: thin chain-checkpointed event, collapse sync #24007

Open
spalladino wants to merge 3 commits into spl/fix-tips-checkpoint-ids-fail-loud from spl/a1061-thin-checkpoint-events


@spalladino

Contributor

Motivation

L2BlockStream.work() reconciled the four chain tips through independent code paths — a reorg walk, two
checkpoint loops with a shared cursor, a block loop, and a final proven/finalized diff — with optimizations
(startingBlock, skipFinalized) that muted some paths but not others. Nothing enforced that what one poll
emitted was tier-consistent, which produced a family of bugs (detailed below). The root enabler was the fat
chain-checkpointed event: because it carried a full PublishedCheckpoint, the stream had to fetch, order,
and replay every checkpoint individually, and every special case had to reason about when that replay could be
skipped.

This PR makes chain-checkpointed a thin tip event like chain-proven/chain-finalized and collapses
work() into three symmetric steps: reorg detection, one block-download loop, and unconditional end-of-pass
tier reconciliation from a single source snapshot. The two consumers that relied on per-checkpoint payload
delivery are first decoupled: the sentinel stops needing a block stream at all, and the prover-node drives its
own checkpoint catch-up from a cursor.
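
The three-step pass described above can be sketched roughly as follows. This is a minimal illustration, not the actual L2BlockStream code: the `Source` and `Local` interfaces, the `runPass` name, the `blocks-added` event name, the `proposed` tip name, and the synchronous calls are all assumptions made to keep the example short (the real block source is asynchronous), and `startingBlock`/`skipFinalized` handling is left out.

```typescript
// All names here are illustrative assumptions, not the real stdlib API.
type BlockId = { number: number; hash: string };
type Tier = 'checkpointed' | 'proven' | 'finalized';
type Tips = Record<'proposed' | Tier, BlockId>;

type StreamEvent =
  | { type: 'chain-pruned'; toBlock: number }
  | { type: 'blocks-added'; blocks: BlockId[] }
  | { type: 'chain-checkpointed' | 'chain-proven' | 'chain-finalized'; block: BlockId };

interface Source {
  getTips(): Tips;
  getBlockHash(blockNumber: number): string | undefined;
  getBlocks(from: number, limit: number): BlockId[];
}

interface Local {
  getTips(): Tips;
  getBlockHash(blockNumber: number): string | undefined;
  apply(event: StreamEvent): void; // stands in for the handler that updates the local store
}

const TIERS: [Tier, 'chain-checkpointed' | 'chain-proven' | 'chain-finalized'][] = [
  ['checkpointed', 'chain-checkpointed'],
  ['proven', 'chain-proven'],
  ['finalized', 'chain-finalized'],
];

function runPass(source: Source, local: Local, batchSize = 10): StreamEvent[] {
  const emitted: StreamEvent[] = [];
  const emit = (event: StreamEvent) => {
    local.apply(event);
    emitted.push(event);
  };

  // Single source snapshot used for the whole pass.
  const sourceTips = source.getTips();

  // Step 1: reorg detection. Walk back until local and source hashes agree.
  const localLatest = local.getTips().proposed.number;
  let cursor = localLatest;
  while (cursor > 0 && local.getBlockHash(cursor) !== source.getBlockHash(cursor)) {
    cursor--;
  }
  if (cursor < localLatest) {
    emit({ type: 'chain-pruned', toBlock: cursor });
  }

  // Step 2: one block-download loop up to the snapshot's proposed tip.
  while (cursor < sourceTips.proposed.number) {
    const limit = Math.min(batchSize, sourceTips.proposed.number - cursor);
    const blocks = source.getBlocks(cursor + 1, limit);
    if (blocks.length === 0) break;
    emit({ type: 'blocks-added', blocks });
    cursor = blocks[blocks.length - 1].number;
  }

  // Step 3: unconditional tier reconciliation. Local tips are re-read here
  // so the comparison sees whatever the prune handler just changed.
  const localTips = local.getTips();
  for (const [tier, type] of TIERS) {
    const src = sourceTips[tier];
    const loc = localTips[tier];
    if (src.number !== loc.number || src.hash !== loc.hash) {
      emit({ type, block: src });
    }
  }
  return emitted;
}
```

Because step 3 always runs against the same snapshot, a pass emits at most one event per tier, and a second pass over an unchanged source emits nothing.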

Bugs and races fixed

  • Checkpointed cursor stuck at genesis under startingBlock (A-1061). The startingBlock fast-forward
    arm suppressed all checkpoint emission while the end-of-pass chain-proven still fired, leaving the local
    store with proven > checkpointed and the checkpointed cursor at genesis until the source's checkpointed
    tip passed startingBlock. In p2p this degraded isEpochPrune (epoch prunes misclassified as ordinary
    prunes, so txPoolDeleteTxsAfterReorg was not honored during the window); the same pattern affected the
    sentinel and prover-node streams. Tier reconciliation is now unconditional against the source snapshot, so
    the first pass converges. Regression: emits the source checkpointed tip on the FIRST pass even when
    startingBlock is past it.
  • Duplicate / out-of-order checkpoint events after a reorg (stale-snapshot guard). Loop 1's "blocks
    already local" guard compared against the localTips snapshot taken before the prune handler rewrote the
    store, so a node lagging a checkpoint that detected a reorg in the same pass could emit chain-checkpointed
    for a new-fork checkpoint before its blocks, then re-emit checkpoints from the prune target upward. Loop 1
    is deleted; reconciliation re-reads the local tips after a prune so it compares against the just-clamped
    cursors. Regression: does not re-emit the checkpointed tip after pruning to a block ahead of it.
  • Same-height stale cursors never refreshed (number-only comparisons). The proven/finalized gates compared
    block numbers only, so a same-number/different-hash tip after a reorg kept a stale (hash, checkpoint-id)
    pair until the tip next advanced. Gates now compare (number, hash) — skipping the hash comparison when the
    local hash is undefined, which world-state legitimately reports for tips ahead of its synced range
    (comparing against it would re-emit the event on every poll). Regressions: re-emits the proven tip when
    numbers match but the known local hash differs
    and does not re-emit when the local hash is undefined.
  • Sentinel credited attestors on reorged-out checkpoints. Its slotNumberToCheckpoint map had no
    chain-pruned handling, so a reorged-out checkpoint's attestation entry lingered and getSlotActivity
    could classify validators against a non-canonical checkpoint. The map (and the stream feeding it) is gone;
    the sentinel fetches the checkpoint for a slot on demand, so the answer is always canonical.
  • Read-skew race between getL2Tips and getCheckpoints. The old checkpoint loops fetched payloads in
    separate reads from the tips snapshot; a source-side reorg between the two reads could plant inconsistent
    state in the local store (the reason catch-up code needed validation-and-abort machinery). Eliminated
    structurally: every tier event is now built from the same getL2Tips snapshot, with no second fetch to
    skew against.
  • Prover-node restart could have skipped unproven checkpoints (prevented by design here). With catch-up
    driven by a cursor, seeding it from a checkpointed tip would silently skip the unproven checkpoints of a
    partially-proven epoch on restart. The cursor seeds from the last checkpoint of the last fully-proven epoch
    (or 0), and advances only after both checkpointStore.addOrUpdate and sessionManager.onCheckpointAdded
    succeed, preserving the existing at-least-once retry semantics (A-1041). Epoch expiry additionally gets a
    periodic tick, since it previously piggybacked on per-checkpoint event volume.
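
The (number, hash) gate from the "same-height stale cursors" fix can be illustrated with a small predicate. This is a sketch under assumptions: the `TipId` shape and the `shouldEmitTier` name are hypothetical, and treating any difference in block number as a change (rather than only an advance) is an assumption about the real gate.

```typescript
// Hypothetical tip shape; the hash is optional because a provider such as
// world-state may not know the hash of a tip ahead of its synced range.
type TipId = { number: number; hash: string | undefined };

/** Decides whether a tier event should be emitted for `source`, given the stored `local` tip. */
function shouldEmitTier(local: TipId, source: TipId): boolean {
  // Assumption: any difference in block number counts as a change.
  if (local.number !== source.number) {
    return true;
  }
  // Same number and an unknown local hash: skip the hash comparison,
  // otherwise the event would be re-emitted on every poll.
  if (local.hash === undefined) {
    return false;
  }
  // Same number and a known local hash: a different hash means the tip was
  // replaced at the same height, so the stored ids must be refreshed.
  return local.hash !== source.hash;
}
```

A number-only gate corresponds to returning `false` whenever the numbers match, which is what kept the stale pair in place until the tip next advanced.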

Approach

  • Sentinel (decoupled first): deletes its L2BlockStream, L2TipsMemoryStore, slot→checkpoint map, and
    the manual stream sync in its work loop. getSlotActivity fetches archiver.getCheckpoint({ slot }) on
    demand (the by-slot query already existed) and derives the same attestation data; the p2p-sync gate reads
    the archiver tips directly instead of a stale local mirror.
  • Prover-node (decoupled second): on a thin chain-checkpointed tip event, walks every checkpoint between
    its cursor and the tip: light getCheckpointsData metadata first, whole-epoch relevance filtering (an epoch
    is skipped only if fully proven or past its proof-submission window — never individual checkpoints inside a
    provable epoch, which the SessionManager's full-coverage contract requires), then a heavy getCheckpoint
    fetch only for checkpoints it will actually register. This is strictly cheaper than the old stream replay,
    which transferred every full checkpoint payload before the prover-node could decide to skip it.
  • Stream rewrite: chain-checkpointed becomes { block: L2BlockId, checkpoint: CheckpointId }, emitted at
    most once per pass — symmetric with chain-proven/chain-finalized. work() is now: reorg walk-back +
    chain-pruned; one getBlocks download loop (start incorporates startingBlock/skipFinalized);
    end-of-pass reconciliation checkpointed → proven → finalized from one snapshot. PXE, the one remaining
    payload consumer, fetches its anchor header by hash (reorg-safe) and skips the update if the block vanished.
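
The prover-node catch-up walk can be sketched as below. It is illustrative only: `CatchUpDeps`, `catchUp`, and the synchronous signatures are assumptions (the real methods are processCheckpointJump/registerCheckpoint and are asynchronous), `getMetadata` stands in for getCheckpointsData, and `register` stands in for the heavy getCheckpoint fetch followed by checkpointStore.addOrUpdate and sessionManager.onCheckpointAdded. That the cursor also advances past checkpoints in skipped epochs is likewise an assumption.

```typescript
// Hypothetical light metadata for one checkpoint.
type CheckpointMeta = { number: number; epoch: number };

interface CatchUpDeps {
  /** Light metadata for checkpoints in (from, to], in ascending order. */
  getMetadata(from: number, to: number): CheckpointMeta[];
  /** True only if the whole epoch is fully proven or past its proof-submission window. */
  isEpochSkippable(epoch: number): boolean;
  /** Heavy fetch plus registration; throws if any step fails. */
  register(checkpointNumber: number): void;
}

/** Walks from `cursor` to `tip` and returns the new cursor position. */
function catchUp(cursor: number, tip: number, deps: CatchUpDeps): number {
  for (const meta of deps.getMetadata(cursor, tip)) {
    // Relevance is decided per epoch, never per checkpoint, so every
    // checkpoint inside a provable epoch gets registered.
    if (!deps.isEpochSkippable(meta.epoch)) {
      try {
        deps.register(meta.number);
      } catch {
        // Leave the cursor behind the failed checkpoint so the next
        // tip event retries it (at-least-once).
        return cursor;
      }
    }
    // Advance one checkpoint at a time, only after success or a whole-epoch skip.
    cursor = meta.number;
  }
  return cursor;
}
```

Calling `catchUp` again with the returned cursor resumes at the checkpoint that failed, so a registration may be attempted more than once but is never skipped.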

API changes

Internal API only (no RPC schema changes):

  • L2BlockStreamEvent: chain-checkpointed carries { block, checkpoint } ids instead of a
    PublishedCheckpoint, and fires at most once per sync pass instead of once per checkpoint. Consumers
    needing payloads fetch them from the block source.
  • L2BlockStream source narrows to Pick<L2BlockSource, 'getBlocks' | 'getBlockData' | 'getL2Tips'>; the
    checkpointPrefetchLimit option and CHECKPOINT_PREFETCH_LIMIT export are gone.
  • LocalChainTips.checkpointed widens to { block, checkpoint } so the checkpointed tier can hash-gate like
    proven/finalized (still structurally assignable from LocalL2Tips).
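
As a rough picture of the new event shape, the sketch below models a tier event that carries only ids and a tips store that copies them straight off the event. The inner fields of `L2BlockId` and `CheckpointId`, the `TierEvent` and `InMemoryTips` names, and the assumption that chain-proven and chain-finalized carry the same `{ block, checkpoint }` pair are all hypothetical.

```typescript
// Assumed id shapes; the real stdlib definitions may differ.
type L2BlockId = { number: number; hash: string };
type CheckpointId = { number: number; hash: string };

type TierName = 'checkpointed' | 'proven' | 'finalized';

// Thin tier event: ids only, no checkpoint payload.
type TierEvent = {
  type: 'chain-checkpointed' | 'chain-proven' | 'chain-finalized';
  block: L2BlockId;
  checkpoint: CheckpointId;
};

type TierTip = { block: L2BlockId; checkpoint: CheckpointId };

class InMemoryTips {
  private readonly tips = new Map<TierName, TierTip>();

  /** Stores the ids exactly as delivered; nothing is recomputed from a payload. */
  handle(event: TierEvent): void {
    const tier = event.type.slice('chain-'.length) as TierName;
    this.tips.set(tier, { block: event.block, checkpoint: event.checkpoint });
  }

  get(tier: TierName): TierTip | undefined {
    return this.tips.get(tier);
  }
}
```

A consumer that still needs the checkpoint contents would use `event.checkpoint` as the key for a separate fetch from the block source.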

Simplifications

  • L2BlockStream.work(): ~220 → ~115 lines. Deleted: Loop 1 (already-local checkpoint backfill), Loop 2 +
    prefetch buffer (checkpoint-transport block download), nextCheckpointToEmit, both startingBlock
    checkpoint fast-forward arms, and the checkpoint payload fetching altogether.
  • Catch-up emits at most one checkpointed event per pass regardless of lag — no per-checkpoint replay, no
    multi-emission warning path, no anti-spam special cases.
  • Sentinel: net ~50 lines removed plus a whole subsystem dependency (stream + tips store) — replaced by one
    ~25-line on-demand fetch.
  • PXE node adapter: the ~25-line getCheckpoints implementation is deleted; the telemetry stream wrapper
    narrows accordingly.
  • Tips stores: handleChainCheckpointed reads ids straight off the event instead of recomputing the
    checkpoint hash from the payload.
  • Stream test suite: rewritten from 1,828 to ~640 lines while adding four regression tests for the bugs above.
  • Net across the branch: −1,228 lines over 15 files.

Changes

  • stdlib: thin chain-checkpointed event; collapsed work(); hash-aware tier gates; narrowed stream
    source type; tips-store handler reads event ids directly.
  • stdlib (tests): stream suite rewritten around the new event semantics + regression tests for A-1061,
    post-prune reconciliation, and both hash-gate behaviors.
  • prover-node: cursor-driven checkpoint catch-up (processCheckpointJump/registerCheckpoint/
    computeStartingCheckpoint), whole-epoch relevance filtering, prune clamping, periodic epoch-expiry tick.
  • aztec-node (sentinel): block stream, tips store, and slot→checkpoint map deleted; on-demand canonical
    checkpoint fetch; direct archiver read for the p2p-sync gate.
  • pxe: anchor header fetched by hash on checkpointed-tip events; node adapter loses getCheckpoints.
  • kv-store / telemetry-client: test-suite and wrapper-type adjustments to the new event shape.

Fixes A-1061

spalladino and others added 3 commits June 10, 2026 22:24
…Stream work()

Replace the per-checkpoint fat `chain-checkpointed` event (carrying a full
PublishedCheckpoint) with a thin once-per-pass tip event `{ block, checkpoint }`,
symmetric with chain-proven/chain-finalized. Collapse L2BlockStream.work() to:
reorg walk-back + chain-pruned; one getBlocks download loop; end-of-pass tier
reconciliation (checkpointed -> proven -> finalized) from a single sourceTips
snapshot.

This removes the prefetch buffer, both startingBlock checkpoint fast-forward
arms, nextCheckpointToEmit, Loop 1 and Loop 2, and the CHECKPOINT_PREFETCH_LIMIT
option. The stream no longer fetches checkpoint payloads, so its source type
narrows to Pick<L2BlockSource, 'getBlocks'|'getBlockData'|'getL2Tips'>; PXE's
node adapter and the telemetry wrapper drop getCheckpoints accordingly.

Fixes the A-1061 stuck-checkpointed-cursor livelock structurally: the
checkpointed reconciliation runs unconditionally against the source snapshot, so
a startingBlock past the source checkpointed tip still advances the local
checkpointed cursor on the first pass. The tier reconciliation re-reads local
tips after a prune so it compares against the cursors the prune handler just
clamped (kills the Loop 1 stale-snapshot duplicate-emit bug), and the hash-aware
diff compares (number, hash) while skipping the hash when the local hash is
undefined (world-state) — re-emitting same-number/different-hash tips without
looping on world-state-shaped providers.

PXE advances its checkpointed anchor by fetching the tip header by hash, which
is safe against same-height reorgs and skips cleanly when the block was reorged
out. The tips store reads the checkpoint id straight off the thin event.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The L2BlockStream now delivers a single thin chain-checkpointed tip event per
pass instead of one fat event per checkpoint, so the prover-node drives the
catch-up itself: on a tip event it walks every checkpoint between its cursor and
the reported tip, fetching light getCheckpointsData metadata first to decide
relevance, then heavy getCheckpoint only for checkpoints in epochs that can
still be proven. Whole unprovable epochs (fully proven / past the
proof-submission window) are skipped; individual checkpoints inside a provable
epoch are never skipped, honoring the SessionManager full-coverage contract.

The cursor seeds at start() from the last checkpoint of the last fully-proven
epoch (or 0), so a restart reprocesses the partially-proven epoch rather than
trusting a checkpointed tip that may lead the last proven checkpoint. It
advances one checkpoint at a time and only after both checkpointStore.addOrUpdate
and sessionManager.onCheckpointAdded succeed, preserving the A-1041
at-least-once semantics (a mid-jump failure leaves the cursor behind to retry).
chain-pruned clamps the cursor down to the post-prune checkpointed tip.

checkEpochExpiry() now also runs from a periodic ticker, since the thin
once-per-pass event no longer drives it per checkpoint and idle periods would
otherwise stall expiry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The sentinel's only use of its L2BlockStream was to feed a private
slotNumberToCheckpoint map read solely by getSlotActivity, plus a local tips
mirror for the p2p-synced gate. Delete the stream, its L2TipsMemoryStore, the
map and its pruning, and the manual sync() in work(). getSlotActivity now
fetches archiver.getCheckpoint({ slot }) on demand (computing
checkpointNumber/archive/proposalPayloadHash/attestors as handleCheckpoint did),
and the p2p-synced gate reads archiver.getL2Tips() directly.

On-demand reads are always against the canonical chain, which also fixes a
latent reorg bug: the map had no chain-pruned handling, so a reorged-out
checkpoint's attestation entry lingered and could credit attestors on a
non-canonical checkpoint.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@AztecBot

Collaborator

Flakey Tests

🤖 says: This CI run detected 1 test that failed but was tolerated due to a .test_patterns.yml entry.

FLAKED (http://ci.aztec-labs.com/3f6420a4813011d5): yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_epochs/epochs_optimistic_proving.parallel.test.ts "removes a checkpoint mid-epoch via reorg and proves with survivors" (227s) (code: 0) group:e2e-p2p-epoch-flakes
