fix: prevent second batch store reset wiping postage snapshot on --resync#5499
fix: prevent second batch store reset wiping postage snapshot on --resync#5499martinconic wants to merge 1 commit into
Conversation
| // postage snapshot has already reset and rebuilt the batch store, so the live | ||
| // sync continues from the snapshot instead of wiping it (see issue #5495). | ||
| func (svc *batchService) SkipResync() { | ||
| svc.resync = false |
There was a problem hiding this comment.
to me the fact that there's two places where the batch service could get reset already smells a bit. with this change we have still two places where a reset happens + third path to disable one of them potentially. so the question is asked - why have two places that do this in the first place?
There was a problem hiding this comment.
The two reset sites aren't pure duplication. They serve different roles:
- snapshot service → resets + rebuilds from the snapshot
- live service → resets + rebuilds from chain genesis — this is the fallback path, needed when there's no snapshot
So you can't simply delete one site without merging the two services (they use different listeners — snapshot filterer vs live chain backend).
I changed the implementation to not modify at all batchservice.go
c26bb91 to
a398078
Compare
…sync When a node is started with --resync together with the postage snapshot, the batch store was reset twice. The snapshot batch service resets and rebuilds the store from the snapshot, but the live-chain batch service was constructed with the same resync flag and reset the store a second time in its Start(), discarding the freshly loaded snapshot. Live sync then began at the snapshot block height against an empty store, so the first top-up or dilute event for any batch created before the snapshot height failed with "get batch <id>: storage: not found" and the node shut down. This is why disabling the snapshot (skip-postage-snapshot) worked around it: the store was reset only once and sync replayed from contract genesis, where every create precedes its top-up. Construct the live batch service after the snapshot path has run and pass resync only when the snapshot did not rebuild the store (o.Resync && !snapshotLoaded). The store is still reset exactly once in every other case (snapshot skipped, failed, not applicable, or a non-resync start). Closes #5495
a398078 to
69df1aa
Compare
| } | ||
| ) | ||
|
|
||
| snapshotLoaded := false |
There was a problem hiding this comment.
Suggestion: Instead of having this guard, we could expose batchservice.Reset(stateStore, batchStore) and just call it before start in case if --resync is set to true
There was a problem hiding this comment.
Additional solution could be to modify the Start interface:
Start(ctx context.Context, listener Listener, startBlock uint64) errorwhere we would pass snapshotEventListener and eventListener respectfully, and then when resync is done, we could set it to false.
But this requires interface changes.
There was a problem hiding this comment.
I am not sure exposing Reset is needed. From my perspective consider this:
- if you want to reset (the flag was set) - you start the batch service the snapshot. If you don't want to reset - snapshot it nil.
- inside the constructor, it detects - if the snapshot is not nil - reset, replay events, then continue startup normally, returning the instance.
If it is really needed to be able to reset without a snapshot, i.e. force a reset that reads directly from the blockchain, pass both a flag and the snapshot (nullable). Not sure I see the need for external callers to be able to reset the state of that component in that way.
Checklist
Description
Fixes #5495: a node started with
--resyncwhile the postage snapshot is active shuts down during sync with:Root cause
The batch store is reset twice on startup in this configuration:
snapshotBatchSvc.Start()resets the store (resync) and rebuilds it from the snapshot, then setspostageSyncStart = maxBlockHeight.batchSvcwas constructed with the sameo.Resyncflag, so itsStart()reset the store again, discarding the snapshot that was just loaded.Live sync then started from the snapshot block height against an empty store, so the first top-up/dilute event referencing a batch created before the snapshot height failed with
storage: not found, and the node shut down.This matches the report exactly: two
"resync requested, resetting batch store"log lines, and the fact thatskip-postage-snapshotworks around it (only one reset, and sync replays from contract genesis where every create precedes its top-up).Note: this is independent of #5343/#5482. The flow originates in #5094 and is still present on master; the reporter's build (
2.8.0-41d6efc6) predates the #5343 revert, but reverting #5343 does not address this.Fix
The two reset sites are not redundant — one rebuilds from the snapshot, the other is the fallback rebuild for the no-snapshot / snapshot-failed / non-mainnet cases — they were just both armed at once.
Construct the live batch service after the snapshot path has run and pass resync only when the snapshot did not rebuild the store:
The reset logic stays in its single existing place (
batchservice.Start()), exactly one service is armed to reset, and there is no runtime toggle. The store is still reset exactly once in every other case (snapshot skipped, failed, not applicable, or a non-resync start).Testing
TestResyncControlsResetinbatchserviceasserts both directions of the contractnode.gorelies on:resync=true⇒ store reset onceresync=false⇒ store not reset (the post-snapshot live-service case)go test -race ./pkg/postage/batchservice/...passesgo build ./pkg/node/...passes;go vet,gofumpt, andgolangci-lintcleanOpen API Spec Version Changes (if applicable)
None.
Related Issue
Closes #5495
AI Disclosure