[core] Optimistic concurrency control for event writes against stale logs#2113
[core] Optimistic concurrency control for event writes against stale logs#2113VaguelySerious wants to merge 6 commits into
Conversation
The elapsed-wait scan now snapshots the loaded events' tail eventId and passes it as `lastKnownEventId` on each `wait_completed` write, so a concurrent `resumeHook` that has already advanced the canonical log is detected — the server's CAS rejects the write, we surface it as the existing `EntityConflictError`, and the next iteration re-replays against the fresh event list (mirroring the duplicate-wait fall-through that was already there). `resumeHook` sends `asOfTimestamp` (Date.now() at call time) so the server resolves the fence to the highest eventId strictly before resume time — no client-side event pre-read needed. Plumbed through `CreateEventParams` on `@workflow/world` so future worlds can forward as-is. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🦋 Changeset detectedLatest commit: 1e69c82 The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests▲ Vercel Production (1 failed)fastify (1 failed):
Details by Category❌ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
✅ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details. |
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) workflow with 10 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) workflow with 25 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) workflow with 50 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) workflow with 10 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) workflow with 25 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) workflow with 50 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) stream pipeline with 5 transform steps (1MB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) 10 parallel streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) fan-out fan-in 10 streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
❌ Some benchmark jobs failed:
Check the workflow run for details. |
The lazy-refs branch of createWorkflowRunEventInner forgot to thread `lastKnownEventId` and `asOfTimestamp` into the request body, so the fence was silently dropped for any event whose type went through the lazy path (i.e., not in `eventsNeedingResolve`). The resolve branch already had the forwarding. Caught by Vercel Agent Review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Vercel Agent review acknowledged + addressed in 1e69c82 — the lazy branch of |
|
Status after
I don't think this is from anything in this PR:
Reads like a long-standing abort-stream-propagation timing flake that just happened to fire on fastify Vercel Prod this run. Will rerun the job once the workflow run is no longer in-progress; flagging here in case a reviewer hits it before I get back to it. |
Co-authored-by: Peter Wielander <mittgfu@gmail.com> Signed-off-by: Peter Wielander <mittgfu@gmail.com>
|
Good catch — the answer is "yes, conceptually they can race the same way." Here's the breakdown of where we currently fence vs. don't, and what's at risk: Where the SDK currently sends a fenceOnly one site: the elapsed-wait scan in Where it doesn't, but could race in the same shapeAny write that the workflow runtime makes based on a branch decision driven by the loaded events array can race in the same way the production hook/sleep bug did. Concretely:
What doesn't need a fence (and why)
Why the wait_completed fence alone catches the production bug todayThe hook/sleep race needs: (1) tick takes sleep branch with stale events, (2) tick writes wait_completed, and (3) tick writes any sleep-branch follow-up events (more sleeps, the next iteration's allocations). My current fence trips at step (2): the wait_completed CAS fails, the retry loop reloads events, sees But that's specific to the hook-vs-sleep shape because RecommendationFor full coverage of the general "stale-snapshot branch decision" race, fence all 6 sites in the table above. Implementation cost is modest: the suspension-handler already has the events array available (the runtime invokes it with the loaded events); we'd thread the tail eventId in the same way I did for the elapsed-wait scan, and add the same retry-on-fence-conflict pattern to each catch block. Want me to do it in this PR or keep it as a follow-up? My slight lean is follow-up: the current PR cleanly resolves the production bug and it's already a non-trivial behavioral change; the broader fencing is mostly defensive against shapes we haven't seen in the wild. But the risk of leaving them unfenced is real if we're betting that no other workflow patterns trigger the same race. |
|
You're right, I was sloppy. Let me walk through it. Your example:
|
| if (!EntityConflictError.is(err)) { | ||
| throw err; | ||
| } | ||
| // Fence conflicts surface a specific error |
There was a problem hiding this comment.
nit: would be good to add a log message here so we can capture these in datadog
| // too, but to guarantee correctness, will be re-tried here directly. | ||
| // TODO: We can remove the retry here after extensive validation. | ||
| // The cost is low in the meantime. | ||
| const isFenceConflict = /fence conflict/i.test( |
There was a problem hiding this comment.
AI Review: brittle coupling to server error wording. /fence conflict/i.test(err.message) against the free-form 409 message means any rewording on the workflow-server side silently routes fence conflicts into the Wait already completed, skipping branch below — workflows keep moving but stale-snapshot protection is gone, and you'd only notice via the CORRUPTED_EVENT_LOG you're trying to eliminate. Prefer surfacing a typed code from the server (e.g. errorData.code === 'FENCE_CONFLICT' carried on EntityConflictError) and matching on that. Worth doing in the paired server PR so this regex never has to ship.
| continue; | ||
| if (result.event) { | ||
| fenceEventId = result.event.eventId; | ||
| } |
There was a problem hiding this comment.
AI Review: EventResultResolveWireSchema.event is .optional() (packages/world-vercel/src/events.ts:63), so on any response without event you don't refresh fenceEventId and the next iteration's write uses the now-stale tail — that triggers a spurious fence conflict and forces a full reload/backoff cycle for every subsequent wait in waitsToComplete. In practice the server probably always returns the created event, but the type allows the foot-gun. Either tighten the response schema to require event on create, or fall back to a deterministic value when missing.
| !events.some((x) => x.eventId === e.eventId) | ||
| ) { | ||
| events.push(e); | ||
| } |
There was a problem hiding this comment.
AI Review: events.some(x => x.eventId === e.eventId) inside the for makes this O(n²) over the existing log on every fence-retry reload. Event logs aren't huge today, but the retry path is exactly where they'll be longest. A Set of existing ids built once before the loop avoids it for free.
| * `hook_received` after anything the caller could have observed without paying | ||
| * for a separate read. Ignored when `lastKnownEventId` is also set. | ||
| */ | ||
| asOfTimestamp?: number; |
There was a problem hiding this comment.
AI Review: asOfTimestamp is added to the public CreateEventParams and threaded through world-vercel, but no caller in this PR uses it — the docstring points at resumeHook, which the PR explicitly keeps unfenced. Public API surface with no exerciser tends to rot (or drift from the server's interpretation) before its first real caller arrives. Consider dropping it from this PR until resumeHook (or another caller) actually needs it, or wiring up that single caller now so the contract is tested end-to-end.
Summary
Adds optimistic-concurrency fencing to the event writes that go through workflow-server, closing the hook/sleep race that produces
CORRUPTED_EVENT_LOGon production runs.eventIdand passes it aslastKnownEventIdon eachwait_completedwrite. If a concurrentresumeHookhas already advanced the canonical log, the server's CAS rejects the write.EntityConflictError, the runtime now retries in-place rather than throwing the whole tick away: it reloads events from the cursor, refreshes the fence, and tries again (up to 5x with backoff). Falling back to queue redelivery turned out to thunder-herd — every redelivery spawns another concurrent tick, which fences-conflicts again, and workflows stall inrunning. If the wait was completed by a concurrent writer between attempts, we observe it in the reloaded log and skip the write entirely.resumeHookappendshook_receivedunconditionally. ULID ordering already places this write after anything committed before us, and applying CAS would only ever reject the hook in favor of an unrelated concurrent write (which would lose the user's signal). Stale-snapshot protection lives on the tick writes that consume hooks, not on the write that delivers them.CreateEventParamson@workflow/worldgrowslastKnownEventIdandasOfTimestamp(both optional). Worlds that don't implement OCC can pass them through or ignore them.Pairs with the workflow-server PR which materializes
run.lastKnownEventIdand gates event writes on it. The server's CAS is explicit opt-in — unfenced writers (most paths) still atomically advance the materialized value so fenced writers can chain off it, but they don't reject on contention.Test plan
Stress reproduction details
The original CORRUPTED_EVENT_LOG bug was reproduced on stable at the rate of ~0.1–0.4% of runs under the following shape (
Promise.race([hook, sleep])withsleepBranchWaitCountparallel sleeps when sleep wins, fired 10 hook payloads per token atfireAfterMs=3000).Re-ran the same shape against the fix on
2026-05-27— two back-to-back cycles, 200 workflows each, identical params to the original repro:{ "count": 200, "iterations": 8, "sleepMs": 500, "sleepBranchWaitCount": 2, "sleepBranchWaitMs": 100, "drainDelayMs": 50, "fireAfterMs": 3000, "fireCount": 10, "fireBurstSpacingMs": 0 }Results across 400 workflows:
completedrunningat final check (low-priority queue tail)failedwitherrorCode: CORRUPTED_EVENT_LOGfailedwitherrorCode: USER_ERRORfailedwitherrorCode: WORLD_CONTRACT_ERRORThe target bug —
CORRUPTED_EVENT_LOGfrom the hook/sleep race — does not reproduce.The remaining failures are a different pattern: in workflows whose
Promise.all([sleep, sleep, …])(sleep-branch waits) commits twowait_createdevents microseconds apart, sometimes only one of them gets await_completedand the workflow hangs on the unresolved promise until it eventually surfaces asUSER_ERROR/WORLD_CONTRACT_ERROR. Root cause looks like the runtime's broadened "treat any non-fence 409 as already-completed" branch (eats a genuine conflict that should be retried). Tracking as a follow-up — the fix here closes the original CORRUPTED_EVENT_LOG bug, which is the production-visible defect.🤖 Generated with Claude Code