fix(core): preserve event-log order in hook-vs-sleep replay races#2171
fix(core): preserve event-log order in hook-vs-sleep replay races#2171TooTallNate wants to merge 1 commit into
Conversation
A buffered hook payload (a `hook_received` consumed before the workflow awaited the hook) was delivered to its consumer only at claim time (`iterator.next()`/`await`), and a concurrent `wait_completed` resolved synchronously with fewer microtask hops. When the workflow did `Promise.race([hook, sleep])`, sleep could win a race the committed event log says the hook won, surfacing as `CorruptedEventLogError` on replay (observed in production: a `step_created` for one step consumed by a different step's consumer). Fix, in three timing-independent parts: 1. Resolve a buffered hook payload through a `promiseQueue` slot chained at its log position (not at the later claim site), so resolution order stays anchored to the event log regardless of hydration/decryption time. 2. Register a per-delivery ordering barrier keyed by the source `hook_received` eventId (`ctx.pendingHookDeliveries`). A later-in-log entity (sleep's `wait_completed`) defers behind any earlier-in-log in-flight hook delivery via `awaitEarlierHookDeliveries`. 3. Release the barrier on a MACROTASK (`setTimeout(0)`) after the payload is claimed, so the consumer's branch decision — however many await hops deep through the async hook iterator — always commits before the deferring entity proceeds. This reuses the macrotask-boundary technique `scheduleWhenIdle` already relies on and is fully hop-count- and decryption-time independent (no microtask-hop heuristic). `awaitEarlierHookDeliveries` bounds its wait with a one-macrotask fallback so an unclaimed payload cannot deadlock a deferring entity. Adds characterization tests covering hook-vs-sleep (the repro), hook-vs-step, and a two-hook slow-decrypt ordering case. Pre-existing runtime bug on `stable`, independent of the OCC/fence work. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
There was a problem hiding this comment.
Pull request overview
This PR fixes a deterministic-replay divergence in @workflow/core where a buffered hook_received payload could lose a Promise.race against a concurrently-resolving sleep (wait_completed) during replay, even when the committed event log indicates the hook branch won. The change anchors buffered hook payload delivery to its event-log position and introduces a cross-entity ordering barrier so later-in-log sleeps defer until earlier hook deliveries are observed.
Changes:
- Anchor buffered
hook_receivedpayload hydration to the payload’s log position viactx.promiseQueue(instead of scheduling at claim/iterator.next()time). - Add
ctx.pendingHookDeliveriesandawaitEarlierHookDeliveries()to deferwait_completedbehind earlier-in-log buffered hook deliveries. - Add characterization/regression tests covering hook-vs-sleep, hook-vs-step, and two-hook ordering under slow async hydration.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/workflow/sleep.ts | Defers wait_completed resolution behind earlier buffered hook deliveries. |
| packages/core/src/workflow/hook.ts | Reworks buffered hook payload handling to resolve at log position and register ordering barriers. |
| packages/core/src/private.ts | Adds pendingHookDeliveries to context and implements awaitEarlierHookDeliveries(). |
| packages/core/src/workflow.ts | Initializes pendingHookDeliveries in the runtime context. |
| packages/core/src/hook-sleep-interaction.test.ts | Adds regression/characterization tests for ordering across hook/sleep/step races. |
| packages/core/src/workflow/sleep.test.ts | Updates test harness context initialization with pendingHookDeliveries. |
| packages/core/src/step.test.ts | Updates test harness context initialization with pendingHookDeliveries. |
| packages/core/src/async-deserialization-ordering.test.ts | Updates test harness context initialization with pendingHookDeliveries. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| await Promise.race([ | ||
| Promise.all(earlier), | ||
| new Promise<void>((resolve) => { | ||
| setTimeout(resolve, 0); | ||
| }), | ||
| ]); |
| * different branch or is suspending), `awaitEarlierHookDeliveries` | ||
| * bounds its wait with a one-macrotask fallback. | ||
| */ | ||
| pendingHookDeliveries: Map<string, Promise<void>>; |
| // Ordered durable history where the hook branch already won the race | ||
| // against a *step* (not a sleep): the hook payload (evnt_2) precedes | ||
| // the racing step's completion (evnt_9), and the committed branch is | ||
| // the hook (drainStep created at evnt_10). |
Summary
Fixes a replay divergence where a buffered hook payload races a concurrent
sleep, andsleepcan win aPromise.racethat the committed event log says the hook won — surfacing asCorruptedEventLogErroron replay (seen in production as astep_createdfor one step being consumed by a different step's consumer).Stacked on top of #2169 (the runtime-only repro). Base is the repro branch so this PR is repro + fix; it will reduce to the fix-only diff once #2169 merges.
Root cause
hook_receivedconsumed before the workflow awaited the hook) was delivered to its consumer only at claim time (iterator.next()/await).wait_completedresolved synchronously in itspromiseQueueslot — no hydration, fewer microtask hops — while the hook payload reaches the consumer through the async hook iterator (yield await this), which adds hops.Promise.race([hook, sleep]), sleep could therefore preempt an earlier-in-log hook payload, diverging from the committed log.This is specific to hook-vs-sleep: hook-vs-step and two-hook ordering already resolve correctly (verified by the added characterization tests), because the existing serial
promiseQueuediscipline is decryption-time independent for those.Fix (three timing-independent parts)
promiseQueueslot chained at its log position (not at the later claim site), so ordering follows the event log regardless of hydration/decryption time.hook_receivedeventId (ctx.pendingHookDeliveries). A later-in-log entity (sleep) defers behind any earlier-in-log in-flight hook delivery viaawaitEarlierHookDeliveries.setTimeout(0)) after the payload is claimed, so the consumer's branch decision — however many await hops deep — always commits before the deferring entity proceeds. This reuses the same macrotask-boundary techniquescheduleWhenIdlealready relies on; no microtask-hop heuristic.awaitEarlierHookDeliveriesbounds its wait with a one-macrotask fallback so an unclaimed payload can't deadlock a deferring entity.Why timing-independent
Decryption time, hydration time, and consumer
await-chain depth are all irrelevant: a macrotask runs only after the entire pending microtask queue drains. Empirically validated by modeling the race (the iterator path costs 3 microtask hops; a fixed-K microtask release was fragile — the macrotask boundary is not).Tests
hook-sleep-interaction.test.ts: hook-vs-sleep (the repro, both sync + async-deser modes), hook-vs-step, two-hook slow-decrypt ordering.@workflow/coresuite: 630/630, typecheck clean, stable across repeated runs, no hangs.Scope
Pre-existing runtime bug on
stable; independent of the OCC / fenced-write work, so it can ship on its own.Co-Authored-By: Claude Opus 4.8 noreply@anthropic.com