Skip to content

fix(p2p): drive tx protection release from synced blocks instead of wall clock#23978

Merged
PhilWindle merged 1 commit into
merge-train/spartan-v5from
spl/event-driven-tx-protection
Jun 10, 2026
Merged

fix(p2p): drive tx protection release from synced blocks instead of wall clock#23978
PhilWindle merged 1 commit into
merge-train/spartan-v5from
spl/event-driven-tx-protection

Conversation

@spalladino

@spalladino spalladino commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

The prepare-for-slot loop in p2p client was not synced with the L2BlockStream events, meanining the unprotect call could trigger before the blocks-added flagged the txs as mined.

One solution could've been to add a new event to the blockstream on slot-synced, but it's easier to just remove the polling, and unprotect slots when a block proposal that protected the txs fails. As a safeguard, we still call unprotect based on slot numbers on mined blocks.

Problem

The tx pool protects txs referenced by an in-flight block proposal: on gossip receipt, the proposal's txs are keyed to its slot and removed from the pending indices, so the local builder cannot re-select them and eviction cannot drop them while the proposal may still land. prepareForSlot(S) releases protections from slots before S, revalidates the txs, and returns them to pending.

Release was driven by a wall-clock slot monitor polling the epoch cache every tick. Three problems:

  • Race against mined-marking. The monitor can fire after a proposal's checkpoint lands on L1 but before the block stream delivers blocks-added. The just-landed txs are unprotected into pending, where eviction or nullifier-conflict resolution can delete them; when the block then syncs there is nothing left to mark mined, and after a later reorg handlePrunedBlocks has nothing to restore — the tx is lost to the pool.
  • Clock dependency. The epoch cache is wall-clock derived and explicitly depends on system clock sync; unprotection correctness should depend on observed chain state instead.
  • Pipelining blind spot. Gossiped proposals carry future target slots during proposer pipelining, so wall-clock release frees them late. (The old target-slot branch that tried to address this read proposedCheckpoint from the local tips store, where it can never lead the checkpointed tip — removed in fix(p2p)!: resolve checkpoint tips from stored ids #23968 as dead code.)

Change

The protection lifecycle becomes fully event-driven and the slot monitor is deleted:

  • Protect on gossip receipt of a block proposal (unchanged).
  • Release on local validation failure: a proposal that fails validation immediately releases the protections it created — only entries still keyed to that proposal's slot, so a tx also referenced by a live proposal at another slot stays protected.
  • Resolve via chain events (unchanged): blocks-added marks txs mined, superseding protection; chain-pruned un-mines them back to pending. Proposals that landed as proposed-but-unconfirmed checkpoints and are later orphaned are fully handled by this existing lifecycle.
  • Collect silent deaths via synced block slots: prepareForSlot now runs inside the blocks-added handler with the slot of the last synced block, after mined-marking. Any block landing at slot S releases protections from all earlier slots — covering proposals that never reached L1 at all (no quorum, proposer crash, dropped L1 tx), for which no chain event can ever fire. Because it is ordered after mined-marking in the same handler, the unprotect-before-mined race is impossible by construction.
  • Proposers are unaffected: the sequencer already calls prepareForSlot(targetSlot) directly before building, which remains the one legitimate ahead-of-chain preparation.

Trade-off accepted

During a multi-slot stall with no blocks landing anywhere, non-proposer pools retain protections until the first block lands (the wall clock used to release them mid-stall). There is no user-visible cost — there is no chain to include txs in during a stall — and the memory held is bounded and self-healing on the first blocks-added. Proposers self-serve via the direct sequencer call. Protections are in-memory only, so a restart clears them.

Fixes A-1173

@spalladino spalladino force-pushed the spl/event-driven-tx-protection branch 2 times, most recently from 22b27d3 to d061358 Compare June 10, 2026 01:05
@spalladino spalladino changed the base branch from spl/fix-tips-checkpoint-ids-fail-loud to merge-train/spartan-v5 June 10, 2026 01:05
…all clock

The tx pool protects txs referenced by an in-flight block proposal by keying
them to the proposal's slot and removing them from the pending indices, then
releases the protection once the slot has passed. Release used to be driven by
a wall-clock slot monitor polling the epoch cache.

That monitor could fire after a proposal's checkpoint landed on L1 but before
the block stream delivered blocks-added: the just-landed txs were unprotected
into pending, where eviction or nullifier-conflict resolution could delete them
before mined-marking ran, and a later reorg had nothing to restore.

Make the protection lifecycle fully event-driven and delete the slot monitor:

- Release on local validation failure: a proposal that fails validation
  releases only the protection entries still keyed to its slot, so a tx also
  referenced by a live proposal at another slot stays protected.
- Collect via synced block slots: prepareForSlot now runs inside the
  blocks-added handler with the last synced block's slot, after mined-marking,
  so the unprotect-before-mined race is impossible by construction.
- Mining now clears a tx's protection entry, since a mined tx supersedes any
  protection.

Protections are in-memory only and are not rehydrated on restart. Proposers are
unaffected: the sequencer still calls prepareForSlot(targetSlot) directly.
@PhilWindle PhilWindle merged commit 3867206 into merge-train/spartan-v5 Jun 10, 2026
18 checks passed
@PhilWindle PhilWindle deleted the spl/event-driven-tx-protection branch June 10, 2026 16:05
PhilWindle pushed a commit that referenced this pull request Jun 11, 2026
## Motivation

#23967 stopped the checkpoint-replay storm at its source (a prune must
not advance checkpoint-bearing cursors onto an uncheckpointed block).
This PR makes the local tips store never *silently* report checkpoint
zero for a real block, removes the machinery and degenerate fields that
made the silent path possible, and closes the gaps found in review: a
store-upgrade path that would brick p2p nodes, a transient proven-tip
overshoot on prune, and world-state fabricating checkpoint ids it never
had.

Fixes A-1174

## Commits

**1. `fix(p2p): resolve checkpoint tips from stored ids and fail loudly
on corruption`**
- `chain-proven`/`chain-finalized` carry their `CheckpointId`; the store
records a checkpoint id per cursor. `getCheckpointId` resolves genesis →
stored id → (back-compat) `block→checkpoint` mapping → **throw**, so a
real-block cursor with no resolvable checkpoint fails loudly instead of
degrading to a replay.
- A `proven`/`finalized` cursor can legitimately lead the
locally-checkpointed frontier (batch lag / `startingBlock`); carrying
the id lets it resolve without a local mapping, so the throw only fires
on genuine corruption (never on a legitimate skipped-history prune).

**2. `refactor(p2p): drop the block→checkpoint mapping and
checkpoint-object store`**
- With per-cursor stored ids, the `block→checkpoint` mapping and the
checkpoint-object store are dead weight (they existed only to feed
`getCheckpointId`). Both backing maps are removed from the KV and memory
stores.
- `chain-pruned` now carries `checkpointed: L2TipId` (the source's
confirmed checkpointed tip) instead of a bare `CheckpointId`. The prune
handler clamps any checkpoint-bearing cursor that leads that tip down to
it, always landing on a block with a recorded id — no genesis-clamp, no
mapping lookup. `handleChainFinalized` collapses to pruning block hashes
below the lowest live tip.

**3. `refactor(stdlib): drop proposedCheckpoint from the local L2 tips
provider`**
- `proposedCheckpoint` is degenerate in the local stores (always equal
to `checkpointed`) and no consumer reads it — the one reader,
`p2p_client`'s `maybeCallPrepareForSlot`, was a dead branch (always
false). `L2TipsProvider.getL2Tips` now returns `LocalL2Tips =
Omit<L2Tips, 'proposedCheckpoint'>`; the local stores stop maintaining
the cursor; the dead p2p branch is removed.
- `L2BlockSource.getL2Tips` is a separate interface and keeps the full
`L2Tips` with `proposedCheckpoint`, which the sequencer and node still
read from the archiver.

**4. `fix(p2p): bump p2p store schema version for the per-tip checkpoint
id layout`**
- An upgraded p2p store would keep its old tips with an empty per-cursor
id map, making `getL2Tips` throw on every read with no way to self-heal
— failing `P2PClient.start()` outright. Bumping the store schema version
(3 → 4) resets the store on upgrade instead.

**5. `fix(p2p): clamp the proven tip to the source proven tip on
prune`**
- `chain-pruned` carried only the checkpointed tip, so a prune that
rolled back the proven chain clamped the local proven cursor onto the
(higher) checkpointed tip, transiently reporting unproven blocks as
proven until the corrective `chain-proven` event landed at the end of
the same sync iteration. The event now also carries `proven: L2TipId`
and each cursor clamps to its own source tip.

**6. `refactor(world-state): stop fabricating checkpoint ids in the
world-state tips provider`**
- The stream's local-data-provider contract demanded full
checkpoint-bearing tips, forcing world-state to hardcode genesis
checkpoint ids and a `checkpointed` tip at block zero (violating
`finalized ≤ checkpointed`) that nothing ever read. The provider
contract is narrowed to `LocalChainTips` — the tips the stream actually
consumes, with `checkpointed` required only when emitting checkpoint
events — and the stream fails loudly if checkpoint emission is enabled
without one. World-state now reports only the proposed/proven/finalized
blocks it genuinely tracks; `L2TipsProvider` keeps the full
`LocalL2Tips` shape for the p2p/pxe tips stores.

**7. `fix(prover-node): consume the reshaped chain-pruned and
checkpoint-bearing events`**
- The CheckpointStore redesign landed a prover-node consumer of
`chain-pruned` (written against the old event shape) on the merge train
while this branch was in flight. The prune handler now reads
`event.checkpointed.checkpoint` (same semantics as the old field) and
the test event constructors are updated to the new shapes.

**8. `refactor(world-state): report unresolvable tip hashes as undefined
instead of fabricating them`**
- World-state's `getL2Tips` fabricated values for hashes it could not
resolve: an empty string for the proven/finalized tips and a non-null
assertion for the proposed tip. The honest missing case is real (a
proven tip ahead of the synced range has no archive leaf to resolve
from), so `LocalChainTips` now carries block ids with an optional hash
and world-state returns resolved hashes as-is. Local tips stores are
unaffected (`LocalL2Tips` with required hashes remains assignable), and
the stream reads only block numbers from local tips.

## Breaking / operational notes

- `PXE_DATA_SCHEMA_VERSION` bumped: existing PXE DBs are wiped and
resync on first open.
- p2p store schema version bumped: existing p2p data dirs (including the
tx pool) are wiped and resync on upgrade.
- `L2BlockStreamEvent` shape changes (internal API):
`chain-proven`/`chain-finalized` gain `checkpoint`; `chain-pruned`
carries `checkpointed`/`proven` tips instead of a bare `checkpoint`.

## Deferred

`maybeCallPrepareForSlot`'s target-slot preparation (prepare the target
slot when a proposed checkpoint exists) never worked, because the local
store cannot represent a proposed checkpoint ahead of the checkpointed
tip. This is handled in
#23978.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants