refactor(stdlib)!: thin chain-checkpointed event, collapse sync #24007
Open
spalladino wants to merge 3 commits into
…Stream work()
Replace the per-checkpoint fat `chain-checkpointed` event (carrying a full
PublishedCheckpoint) with a thin once-per-pass tip event `{ block, checkpoint }`,
symmetric with chain-proven/chain-finalized. Collapse L2BlockStream.work() to:
reorg walk-back + chain-pruned; one getBlocks download loop; end-of-pass tier
reconciliation (checkpointed -> proven -> finalized) from a single sourceTips
snapshot.
This removes the prefetch buffer, both startingBlock checkpoint fast-forward
arms, nextCheckpointToEmit, Loop 1 and Loop 2, and the CHECKPOINT_PREFETCH_LIMIT
option. The stream no longer fetches checkpoint payloads, so its source type
narrows to Pick<L2BlockSource, 'getBlocks'|'getBlockData'|'getL2Tips'>; PXE's
node adapter and the telemetry wrapper drop getCheckpoints accordingly.
Fixes the A-1061 stuck-checkpointed-cursor livelock structurally: the
checkpointed reconciliation runs unconditionally against the source snapshot, so
a startingBlock past the source checkpointed tip still advances the local
checkpointed cursor on the first pass. The tier reconciliation re-reads local
tips after a prune so it compares against the cursors the prune handler just
clamped (kills the Loop 1 stale-snapshot duplicate-emit bug), and the hash-aware
diff compares (number, hash) while skipping the hash when the local hash is
undefined (world-state) — re-emitting same-number/different-hash tips without
looping on world-state-shaped providers.
PXE advances its checkpointed anchor by fetching the tip header by hash, which
is safe against same-height reorgs and skips cleanly when the block was reorged
out. The tips store reads the checkpoint id straight off the thin event.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The L2BlockStream now delivers a single thin chain-checkpointed tip event per
pass instead of one fat event per checkpoint, so the prover-node drives the
catch-up itself: on a tip event it walks every checkpoint between its cursor and
the reported tip, fetching light getCheckpointsData metadata first to decide
relevance, then heavy getCheckpoint only for checkpoints in epochs that can still
be proven. Whole unprovable epochs (fully proven / past the proof-submission
window) are skipped; individual checkpoints inside a provable epoch are never
skipped, honoring the SessionManager full-coverage contract.
The cursor seeds at start() from the last checkpoint of the last fully-proven
epoch (or 0), so a restart reprocesses the partially-proven epoch rather than
trusting a checkpointed tip that may lead the last proven checkpoint. It advances
one checkpoint at a time and only after both checkpointStore.addOrUpdate and
sessionManager.onCheckpointAdded succeed, preserving the A-1041 at-least-once
semantics (a mid-jump failure leaves the cursor behind to retry). chain-pruned
clamps the cursor down to the post-prune checkpointed tip.
checkEpochExpiry() now also runs from a periodic ticker, since the thin
once-per-pass event no longer drives it per checkpoint and idle periods would
otherwise stall expiry.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The sentinel's only use of its L2BlockStream was to feed a private
slotNumberToCheckpoint map read solely by getSlotActivity, plus a local tips
mirror for the p2p-synced gate. Delete the stream, its L2TipsMemoryStore, the
map and its pruning, and the manual sync() in work(). getSlotActivity now
fetches archiver.getCheckpoint({ slot }) on demand (computing
checkpointNumber/archive/proposalPayloadHash/attestors as handleCheckpoint did),
and the p2p-synced gate reads archiver.getL2Tips() directly.
On-demand reads are always against the canonical chain, which also fixes a
latent reorg bug: the map had no chain-pruned handling, so a reorged-out
checkpoint's attestation entry lingered and could credit attestors on a
non-canonical checkpoint.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Flakey Tests 🤖 says: This CI run detected 1 test that failed, but was tolerated due to a .test_patterns.yml entry.
Motivation
`L2BlockStream.work()` reconciled the four chain tips through independent code paths (a reorg walk, two checkpoint loops with a shared cursor, a block loop, and a final proven/finalized diff) with optimizations (`startingBlock`, `skipFinalized`) that muted some paths but not others. Nothing enforced that what one poll emitted was tier-consistent, which produced a family of bugs (detailed below). The root enabler was the fat `chain-checkpointed` event: because it carried a full `PublishedCheckpoint`, the stream had to fetch, order, and replay every checkpoint individually, and every special case had to reason about when that replay could be skipped.
This PR makes `chain-checkpointed` a thin tip event like `chain-proven`/`chain-finalized` and collapses `work()` into three symmetric steps: reorg detection, one block-download loop, and unconditional end-of-pass tier reconciliation from a single source snapshot. The two consumers that relied on per-checkpoint payload delivery are first decoupled: the sentinel stops needing a block stream at all, and the prover-node drives its own checkpoint catch-up from a cursor.
Bugs and races fixed
- `startingBlock` (A-1061). The `startingBlock` fast-forward arm suppressed all checkpoint emission while the end-of-pass `chain-proven` still fired, leaving the local store with `proven > checkpointed` and the checkpointed cursor at genesis until the source's checkpointed tip passed `startingBlock`. In p2p this degraded `isEpochPrune` (epoch prunes misclassified as ordinary prunes, so `txPoolDeleteTxsAfterReorg` was not honored during the window); the same pattern affected the sentinel and prover-node streams. Tier reconciliation is now unconditional against the source snapshot, so the first pass converges. Regression: emits the source checkpointed tip on the FIRST pass even when startingBlock is past it.
- already local" guard compared against the `localTips` snapshot taken before the prune handler rewrote the store, so a node lagging a checkpoint that detected a reorg in the same pass could emit `chain-checkpointed` for a new-fork checkpoint before its blocks, then re-emit checkpoints from the prune target upward. Loop 1 is deleted; reconciliation re-reads the local tips after a prune so it compares against the just-clamped cursors. Regression: does not re-emit the checkpointed tip after pruning to a block ahead of it.
- block numbers only, so a same-number/different-hash tip after a reorg kept a stale (hash, checkpoint-id) pair until the tip next advanced. Gates now compare (number, hash), skipping the hash comparison when the local hash is `undefined`, which world-state legitimately reports for tips ahead of its synced range (comparing against it would re-emit the event on every poll). Regressions: re-emits the proven tip when numbers match but the known local hash differs and does not re-emit when the local hash is undefined.
- `slotNumberToCheckpoint` map had no `chain-pruned` handling, so a reorged-out checkpoint's attestation entry lingered and `getSlotActivity` could classify validators against a non-canonical checkpoint. The map (and the stream feeding it) is gone; the sentinel fetches the checkpoint for a slot on demand, so the answer is always canonical.
- `getL2Tips` and `getCheckpoints`. The old checkpoint loops fetched payloads in separate reads from the tips snapshot; a source-side reorg between the two reads could plant inconsistent state in the local store (the reason catch-up code needed validation-and-abort machinery). Eliminated structurally: every tier event is now built from the same `getL2Tips` snapshot, with no second fetch to skew against.
- driven by a cursor, seeding it from a checkpointed tip would silently skip the unproven checkpoints of a partially-proven epoch on restart. The cursor seeds from the last checkpoint of the last fully-proven epoch (or 0), and advances only after both `checkpointStore.addOrUpdate` and `sessionManager.onCheckpointAdded` succeed, preserving the existing at-least-once retry semantics (A-1041). Epoch expiry additionally gets a periodic tick, since it previously piggybacked on per-checkpoint event volume.
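The hash-aware gate from the list above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `TipId` type and the `shouldEmitTip` helper are hypothetical names, and treating any number mismatch as "emit" is an assumption made for brevity.

```typescript
// Hypothetical tip shape: a block number plus an optional hash. The hash is
// optional because a world-state-shaped provider may report `undefined` for
// tips ahead of its synced range.
type TipId = { number: number; hash?: string };

// Decides whether a tier event should be emitted for the `source` tip, given
// the `local` tip currently recorded for that tier.
function shouldEmitTip(local: TipId, source: TipId): boolean {
  if (local.number !== source.number) {
    return true; // the tip moved, so the tier needs an event
  }
  if (local.hash === undefined) {
    return false; // no local hash to compare; emitting here would repeat on every poll
  }
  return local.hash !== source.hash; // same number, different hash: tip was reorged
}
```

With this shape, a same-number/different-hash tip is re-emitted only when the local hash is actually known, which matches the two regressions named above.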
Approach
- `L2BlockStream`, `L2TipsMemoryStore`, slot→checkpoint map, and the manual stream sync in its work loop. `getSlotActivity` fetches `archiver.getCheckpoint({ slot })` on demand (the by-slot query already existed) and derives the same attestation data; the p2p-sync gate reads the archiver tips directly instead of a stale local mirror.
- `chain-checkpointed` tip event, walks every checkpoint between its cursor and the tip: light `getCheckpointsData` metadata first, whole-epoch relevance filtering (an epoch is skipped only if fully proven or past its proof-submission window; individual checkpoints inside a provable epoch are never skipped, as the SessionManager's full-coverage contract requires), then a heavy `getCheckpoint` fetch only for checkpoints it will actually register. This is strictly cheaper than the old stream replay, which transferred every full checkpoint payload before the prover-node could decide to skip it.
- `chain-checkpointed` becomes `{ block: L2BlockId, checkpoint: CheckpointId }`, emitted at most once per pass, symmetric with `chain-proven`/`chain-finalized`. `work()` is now: reorg walk-back + `chain-pruned`; one `getBlocks` download loop (start incorporates `startingBlock`/`skipFinalized`); end-of-pass reconciliation checkpointed → proven → finalized from one snapshot. PXE, the one remaining payload consumer, fetches its anchor header by hash (reorg-safe) and skips the update if the block vanished.
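The cursor-driven catch-up in the prover-node item above can be sketched as below. This is an illustration under stated assumptions, not the prover-node's implementation: `CheckpointMeta`, `CatchUpDeps`, `CheckpointCursor`, and every method signature are hypothetical. `fetchMeta` stands in for the light `getCheckpointsData` read, and `register` stands in for the heavy `getCheckpoint` fetch followed by `checkpointStore.addOrUpdate` and `sessionManager.onCheckpointAdded`.

```typescript
// All names in this block are hypothetical stand-ins for illustration.
type CheckpointMeta = { number: number; epoch: number };

interface CatchUpDeps {
  // Light metadata for checkpoints in [from, to], assumed to be in ascending order.
  fetchMeta(from: number, to: number): Promise<CheckpointMeta[]>;
  // True while the epoch can still be proven (not fully proven, window still open).
  isEpochProvable(epoch: number): boolean;
  // Heavy fetch plus registration; rejects if any step fails.
  register(checkpointNumber: number): Promise<void>;
}

class CheckpointCursor {
  private cursor: number;
  private deps: CatchUpDeps;

  constructor(start: number, deps: CatchUpDeps) {
    this.cursor = start;
    this.deps = deps;
  }

  position(): number {
    return this.cursor;
  }

  // Handles a thin tip event: walks every checkpoint between the cursor and the tip.
  async onTip(tip: number): Promise<void> {
    if (tip <= this.cursor) return;
    const metas = await this.deps.fetchMeta(this.cursor + 1, tip);
    for (const meta of metas) {
      if (this.deps.isEpochProvable(meta.epoch)) {
        // A rejection here propagates before the cursor moves, so the failed
        // checkpoint is retried on the next tip event (at-least-once).
        await this.deps.register(meta.number);
      }
      this.cursor = meta.number; // advance one checkpoint at a time
    }
  }

  // Handles a prune: clamps the cursor down to the post-prune checkpointed tip.
  onPrune(checkpointedTip: number): void {
    this.cursor = Math.min(this.cursor, checkpointedTip);
  }
}
```

In this sketch the relevance check is per epoch, so checkpoints in an unprovable epoch advance the cursor without a heavy fetch, while every checkpoint in a provable epoch is registered before the cursor moves past it.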
API changes
Internal API only (no RPC schema changes):
- `L2BlockStreamEvent`: `chain-checkpointed` carries `{ block, checkpoint }` ids instead of a `PublishedCheckpoint`, and fires at most once per sync pass instead of once per checkpoint. Consumers needing payloads fetch them from the block source.
- `L2BlockStream` source narrows to `Pick<L2BlockSource, 'getBlocks' | 'getBlockData' | 'getL2Tips'>`; the `checkpointPrefetchLimit` option and `CHECKPOINT_PREFETCH_LIMIT` export are gone.
- `LocalChainTips.checkpointed` widens to `{ block, checkpoint }` so the checkpointed tier can hash-gate like proven/finalized (still structurally assignable from `LocalL2Tips`).
Simplifications
- `L2BlockStream.work()`: ~220 → ~115 lines. Deleted: Loop 1 (already-local checkpoint backfill), Loop 2 + prefetch buffer (checkpoint-transport block download), `nextCheckpointToEmit`, both `startingBlock` checkpoint fast-forward arms, and the checkpoint payload fetching altogether.
- multi-emission warning path, no anti-spam special cases.
- ~25-line on-demand fetch.
- `getCheckpoints` implementation is deleted; the telemetry stream wrapper narrows accordingly.
- `handleChainCheckpointed` reads ids straight off the event instead of recomputing the checkpoint hash from the payload.
Changes
- `chain-checkpointed` event; collapsed `work()`; hash-aware tier gates; narrowed stream source type; tips-store handler reads event ids directly.
- post-prune reconciliation, and both hash-gate behaviors.
- `processCheckpointJump`/`registerCheckpoint`/`computeStartingCheckpoint`), whole-epoch relevance filtering, prune clamping, periodic epoch-expiry tick.
- checkpoint fetch; direct archiver read for the p2p-sync gate.
- `getCheckpoints`.
Fixes A-1061