[core] Optimistic concurrency control for event writes against stale logs#2113

Open
VaguelySerious wants to merge 6 commits into main from peter/sdk-event-write-cas

@VaguelySerious VaguelySerious commented May 26, 2026

Summary

Adds optimistic-concurrency fencing to the event writes that go through workflow-server, closing the hook/sleep race that produces CORRUPTED_EVENT_LOG on production runs.

  • The elapsed-wait scan snapshots the loaded events' tail eventId and passes it as lastKnownEventId on each wait_completed write. If a concurrent resumeHook has already advanced the canonical log, the server's CAS rejects the write.
  • On a fence-conflict EntityConflictError, the runtime now retries in place rather than throwing the whole tick away: it reloads events from the cursor, refreshes the fence, and tries again (up to 5 times with backoff). Falling back to queue redelivery caused a thundering herd: every redelivery spawns another concurrent tick, which hits a fence conflict again, and workflows stall in running. If a concurrent writer completed the wait between attempts, we observe it in the reloaded log and skip the write entirely.
  • resumeHook appends hook_received unconditionally. ULID ordering already places this write after anything committed before us, and applying CAS would only ever reject the hook in favor of an unrelated concurrent write (which would lose the user's signal). Stale-snapshot protection lives on the tick writes that consume hooks, not on the write that delivers them.
  • CreateEventParams on @workflow/world grows lastKnownEventId and asOfTimestamp (both optional). Worlds that don't implement OCC can pass them through or ignore them.
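
The ULID-ordering argument in the resumeHook bullet rests on the ULID layout: the first 10 characters encode a 48-bit millisecond timestamp in Crockford base32, so ids sort lexicographically by creation time. A minimal sketch of just that timestamp prefix follows. The function name `encodeUlidTime` is illustrative and is not taken from this PR; ordering of ids created within the same millisecond depends on the remaining 16 characters and is not modeled here.

```typescript
// Crockford base32 alphabet used by ULIDs (no I, L, O, U).
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Hypothetical helper: encodes a millisecond timestamp as the 10-character
// time prefix of a ULID. Only the prefix is produced, not a full ULID.
function encodeUlidTime(ms: number): string {
  let out = "";
  let rest = ms;
  for (let i = 0; i < 10; i++) {
    out = CROCKFORD.charAt(rest % 32) + out;
    rest = Math.floor(rest / 32);
  }
  return out;
}

// A later timestamp yields a lexicographically greater prefix, which is why
// an id minted now sorts after ids minted in any earlier millisecond.
const earlier = encodeUlidTime(1_000);
const later = encodeUlidTime(1_001);
console.log(earlier < later);
```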

Pairs with the workflow-server PR which materializes run.lastKnownEventId and gates event writes on it. The server's CAS is explicit opt-in — unfenced writers (most paths) still atomically advance the materialized value so fenced writers can chain off it, but they don't reject on contention.
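
The fencing and retry behavior described above can be sketched with a small in-memory model. Everything here is an assumption made for illustration: `InMemoryRunLog`, `completeElapsedWait`, and the `beforeWrite` hook are hypothetical names, numeric ids stand in for ULID eventIds, a result value stands in for EntityConflictError, and the backoff delay is omitted so the control flow stays synchronous.

```typescript
type LogEvent = { eventId: number; type: string; correlationId: string };
type WriteResult = { ok: true; event: LogEvent } | { ok: false; reason: "fence_conflict" };

class InMemoryRunLog {
  readonly events: LogEvent[] = [];

  private tailId(): number | undefined {
    const tail = this.events[this.events.length - 1];
    return tail ? tail.eventId : undefined;
  }

  // The fence is opt-in. Without lastKnownEventId the write always lands and
  // advances the tail; with it, the write is rejected if the tail has moved.
  create(params: { type: string; correlationId: string; lastKnownEventId?: number }): WriteResult {
    if (params.lastKnownEventId !== undefined && params.lastKnownEventId !== this.tailId()) {
      return { ok: false, reason: "fence_conflict" };
    }
    const event: LogEvent = {
      eventId: this.events.length + 1,
      type: params.type,
      correlationId: params.correlationId,
    };
    this.events.push(event);
    return { ok: true, event };
  }
}

// Retries in place: reload, skip if someone else completed the wait,
// otherwise refresh the fence from the reloaded tail and try again.
function completeElapsedWait(
  log: InMemoryRunLog,
  waitId: string,
  beforeWrite: (attempt: number) => void = () => {}, // lets a caller simulate a concurrent writer
  maxAttempts = 5,
): "written" | "skipped" | "gave_up" {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const loaded = log.events.slice();
    const done = loaded.some((e) => e.type === "wait_completed" && e.correlationId === waitId);
    if (done) return "skipped";
    const tail = loaded[loaded.length - 1];
    beforeWrite(attempt);
    const result = log.create({
      type: "wait_completed",
      correlationId: waitId,
      lastKnownEventId: tail ? tail.eventId : undefined,
    });
    if (result.ok) return "written";
    // Fence conflict: a real runtime would back off here before the next attempt.
  }
  return "gave_up";
}
```

In this model a hook_received that lands between the snapshot and the write causes one rejected attempt, after which the second attempt succeeds against the refreshed tail.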

Test plan

  • All 1013 core unit tests passing
  • Typecheck clean
  • Changeset included
  • End-to-end repro against a Vercel preview deployment of this branch + the matching workflow-server preview:

Stress reproduction details

The original CORRUPTED_EVENT_LOG bug reproduced on stable in ~0.1–0.4% of runs under the following shape: Promise.race([hook, sleep]), with sleepBranchWaitCount parallel sleeps when sleep wins, and 10 hook payloads fired per token at fireAfterMs=3000.

Re-ran the same shape against the fix on 2026-05-27 — two back-to-back cycles, 200 workflows each, identical params to the original repro:

{
  "count": 200, "iterations": 8, "sleepMs": 500,
  "sleepBranchWaitCount": 2, "sleepBranchWaitMs": 100,
  "drainDelayMs": 50, "fireAfterMs": 3000,
  "fireCount": 10, "fireBurstSpacingMs": 0
}

Results across 400 workflows:

| Outcome | Count |
| --- | --- |
| completed | 241 |
| Still running at final check (low-priority queue tail) | 132 |
| failed with errorCode: CORRUPTED_EVENT_LOG | 0 |
| failed with errorCode: USER_ERROR | 23 |
| failed with errorCode: WORLD_CONTRACT_ERROR | 4 |

The target bug — CORRUPTED_EVENT_LOG from the hook/sleep race — does not reproduce.

The remaining 27 failures (23 USER_ERROR, 4 WORLD_CONTRACT_ERROR) follow a different pattern. In workflows whose Promise.all([sleep, sleep, …]) (the sleep-branch waits) commits two wait_created events microseconds apart, sometimes only one of them gets a wait_completed, and the workflow hangs on the unresolved promise until it eventually surfaces as USER_ERROR or WORLD_CONTRACT_ERROR. The root cause looks like the runtime's broadened "treat any non-fence 409 as already-completed" branch, which appears to swallow a genuine conflict that should be retried. Tracking as a follow-up; the fix here closes the original CORRUPTED_EVENT_LOG bug, which is the production-visible defect.

🤖 Generated with Claude Code

The elapsed-wait scan now snapshots the loaded events' tail eventId and
passes it as `lastKnownEventId` on each `wait_completed` write, so a
concurrent `resumeHook` that has already advanced the canonical log is
detected — the server's CAS rejects the write, we surface it as the
existing `EntityConflictError`, and the next iteration re-replays
against the fresh event list (mirroring the duplicate-wait fall-through
that was already there).

`resumeHook` sends `asOfTimestamp` (Date.now() at call time) so the
server resolves the fence to the highest eventId strictly before
resume time — no client-side event pre-read needed.

Plumbed through `CreateEventParams` on `@workflow/world` so future
worlds can forward as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

changeset-bot Bot commented May 26, 2026

🦋 Changeset detected

Latest commit: 1e69c82

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 20 packages
| Name | Type |
| --- | --- |
| @workflow/core | Patch |
| @workflow/world | Patch |
| @workflow/world-vercel | Patch |
| @workflow/builders | Patch |
| @workflow/cli | Patch |
| @workflow/next | Patch |
| @workflow/nitro | Patch |
| @workflow/vitest | Patch |
| @workflow/web-shared | Patch |
| @workflow/web | Patch |
| workflow | Patch |
| @workflow/world-testing | Patch |
| @workflow/world-local | Patch |
| @workflow/world-postgres | Patch |
| @workflow/astro | Patch |
| @workflow/nest | Patch |
| @workflow/rollup | Patch |
| @workflow/sveltekit | Patch |
| @workflow/vite | Patch |
| @workflow/nuxt | Patch |

vercel Bot commented May 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| example-nextjs-workflow-turbopack | Ready | Preview, Comment | May 27, 2026 9:58am |
| example-nextjs-workflow-webpack | Ready | Preview, Comment | May 27, 2026 9:58am |
| example-workflow | Ready | Preview, Comment | May 27, 2026 9:58am |
| workbench-astro-workflow | Ready | Preview, Comment | May 27, 2026 9:58am |
| workbench-express-workflow | Ready | Preview, Comment | May 27, 2026 9:58am |
| workbench-fastify-workflow | Ready | Preview, Comment | May 27, 2026 9:58am |
| workbench-hono-workflow | Ready | Preview, Comment | May 27, 2026 9:58am |
| workbench-nitro-workflow | Ready | Preview, Comment | May 27, 2026 9:58am |
| workbench-nuxt-workflow | Ready | Preview, Comment | May 27, 2026 9:58am |
| workbench-sveltekit-workflow | Ready | Preview, Comment | May 27, 2026 9:58am |
| workbench-tanstack-start-workflow | Ready | Preview, Comment | May 27, 2026 9:58am |
| workbench-vite-workflow | Ready | Preview, Comment | May 27, 2026 9:58am |
| workflow-docs | Ready | Preview, Comment, Open in v0 | May 27, 2026 9:58am |
| workflow-swc-playground | Ready | Preview, Comment | May 27, 2026 9:58am |
| workflow-tarballs | Ready | Preview, Comment | May 27, 2026 9:58am |
| workflow-web | Ready | Preview, Comment | May 27, 2026 9:58am |


github-actions Bot commented May 26, 2026

🧪 E2E Test Results

Some tests failed

Summary

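|  | Passed | Failed | Skipped | Total |
| --- | --- | --- | --- | --- |
| ❌ ▲ Vercel Production | 1221 | 1 | 219 | 1441 |
| ✅ 💻 Local Development | 1615 | 0 | 219 | 1834 |
| ✅ 📦 Local Production | 1615 | 0 | 219 | 1834 |
| ✅ 🐘 Local Postgres | 1615 | 0 | 219 | 1834 |
| ✅ 🪟 Windows | 131 | 0 | 0 | 131 |
| ✅ 📋 Other | 741 | 0 | 176 | 917 |
| Total | 6938 | 1 | 1052 | 7991 |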

❌ Failed Tests

▲ Vercel Production (1 failed)

fastify (1 failed):

Details by Category

❌ ▲ Vercel Production
App Passed Failed Skipped
✅ astro 105 0 26
✅ example 105 0 26
✅ express 105 0 26
❌ fastify 104 1 26
✅ hono 105 0 26
✅ nextjs-turbopack 129 0 2
✅ nextjs-webpack 129 0 2
✅ nitro 105 0 26
✅ nuxt 105 0 26
✅ sveltekit 124 0 7
✅ vite 105 0 26
✅ 💻 Local Development
App Passed Failed Skipped
✅ astro-stable 106 0 25
✅ express-stable 106 0 25
✅ fastify-stable 106 0 25
✅ hono-stable 106 0 25
✅ nextjs-turbopack-canary 112 0 19
✅ nextjs-turbopack-stable-lazy-discovery-disabled 131 0 0
✅ nextjs-turbopack-stable-lazy-discovery-enabled 131 0 0
✅ nextjs-webpack-canary 112 0 19
✅ nextjs-webpack-stable-lazy-discovery-disabled 131 0 0
✅ nextjs-webpack-stable-lazy-discovery-enabled 131 0 0
✅ nitro-stable 106 0 25
✅ nuxt-stable 106 0 25
✅ sveltekit-stable 125 0 6
✅ vite-stable 106 0 25
✅ 📦 Local Production
App Passed Failed Skipped
✅ astro-stable 106 0 25
✅ express-stable 106 0 25
✅ fastify-stable 106 0 25
✅ hono-stable 106 0 25
✅ nextjs-turbopack-canary 112 0 19
✅ nextjs-turbopack-stable-lazy-discovery-disabled 131 0 0
✅ nextjs-turbopack-stable-lazy-discovery-enabled 131 0 0
✅ nextjs-webpack-canary 112 0 19
✅ nextjs-webpack-stable-lazy-discovery-disabled 131 0 0
✅ nextjs-webpack-stable-lazy-discovery-enabled 131 0 0
✅ nitro-stable 106 0 25
✅ nuxt-stable 106 0 25
✅ sveltekit-stable 125 0 6
✅ vite-stable 106 0 25
✅ 🐘 Local Postgres
App Passed Failed Skipped
✅ astro-stable 106 0 25
✅ express-stable 106 0 25
✅ fastify-stable 106 0 25
✅ hono-stable 106 0 25
✅ nextjs-turbopack-canary 112 0 19
✅ nextjs-turbopack-stable-lazy-discovery-disabled 131 0 0
✅ nextjs-turbopack-stable-lazy-discovery-enabled 131 0 0
✅ nextjs-webpack-canary 112 0 19
✅ nextjs-webpack-stable-lazy-discovery-disabled 131 0 0
✅ nextjs-webpack-stable-lazy-discovery-enabled 131 0 0
✅ nitro-stable 106 0 25
✅ nuxt-stable 106 0 25
✅ sveltekit-stable 125 0 6
✅ vite-stable 106 0 25
✅ 🪟 Windows
App Passed Failed Skipped
✅ nextjs-turbopack 131 0 0
✅ 📋 Other
App Passed Failed Skipped
✅ e2e-local-dev-nest-stable 106 0 25
✅ e2e-local-dev-tanstack-start- 106 0 25
✅ e2e-local-postgres-nest-stable 106 0 25
✅ e2e-local-postgres-tanstack-start- 106 0 25
✅ e2e-local-prod-nest-stable 106 0 25
✅ e2e-local-prod-tanstack-start- 106 0 25
✅ e2e-vercel-prod-tanstack-start 105 0 26

📋 View full workflow run


Some E2E test jobs failed:

  • Vercel Prod: failure
  • Local Dev: success
  • Local Prod: success
  • Local Postgres: success
  • Windows: success

Check the workflow run for details.


github-actions Bot commented May 26, 2026

📊 Benchmark Results

📈 Comparing against baseline from main branch. Green 🟢 = faster, Red 🔺 = slower.

workflow with no steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Nitro 0.030s (-29.5% 🟢) 1.006s (~) 0.975s 10 1.00x
💻 Local Express 0.031s (-30.5% 🟢) 1.006s (~) 0.975s 10 1.01x
🐘 Postgres Nitro 0.046s (-51.4% 🟢) 1.011s (-3.1%) 0.965s 10 1.52x
💻 Local Next.js (Turbopack) 0.047s 1.005s 0.958s 10 1.56x
🐘 Postgres Express 0.049s (-14.8% 🟢) 1.012s (~) 0.962s 10 1.63x
🐘 Postgres Next.js (Turbopack) 0.057s 1.013s 0.955s 10 1.88x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 0.270s (+7.3% 🔺) 2.339s (~) 2.069s 10 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 1 step

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Nitro 1.069s (-5.5% 🟢) 2.007s (~) 0.938s 10 1.00x
💻 Local Express 1.075s (-4.5%) 2.006s (~) 0.931s 10 1.01x
🐘 Postgres Nitro 1.081s (-5.2% 🟢) 2.010s (~) 0.929s 10 1.01x
🐘 Postgres Express 1.083s (-5.6% 🟢) 2.010s (~) 0.928s 10 1.01x
💻 Local Next.js (Turbopack) 1.116s 2.005s 0.889s 10 1.04x
🐘 Postgres Next.js (Turbopack) 1.124s 2.009s 0.885s 10 1.05x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 1.510s (-25.8% 🟢) 3.632s (-5.2% 🟢) 2.122s 10 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 10 sequential steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 10.395s (-5.2% 🟢) 11.015s (~) 0.620s 3 1.00x
🐘 Postgres Nitro 10.412s (-4.2%) 11.016s (~) 0.604s 3 1.00x
💻 Local Nitro 10.414s (-4.9%) 11.021s (~) 0.607s 3 1.00x
💻 Local Express 10.419s (-4.6%) 11.021s (~) 0.602s 3 1.00x
💻 Local Next.js (Turbopack) 10.648s 11.020s 0.372s 3 1.02x
🐘 Postgres Next.js (Turbopack) 10.721s 11.020s 0.299s 3 1.03x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 13.431s (-22.5% 🟢) 15.343s (-20.9% 🟢) 1.912s 2 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 25 sequential steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Nitro 13.459s (-10.6% 🟢) 14.026s (-12.5% 🟢) 0.568s 5 1.00x
🐘 Postgres Express 13.474s (-7.6% 🟢) 14.019s (-6.7% 🟢) 0.545s 5 1.00x
🐘 Postgres Nitro 13.477s (-7.7% 🟢) 14.021s (-6.7% 🟢) 0.544s 5 1.00x
💻 Local Express 13.490s (-9.9% 🟢) 14.027s (-6.7% 🟢) 0.537s 5 1.00x
💻 Local Next.js (Turbopack) 14.019s 14.628s 0.608s 5 1.04x
🐘 Postgres Next.js (Turbopack) 14.123s 15.019s 0.896s 4 1.05x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 21.775s (-58.6% 🟢) 23.776s (-56.5% 🟢) 2.002s 3 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 50 sequential steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Nitro 11.882s (-29.2% 🟢) 12.022s (-29.4% 🟢) 0.140s 8 1.00x
🐘 Postgres Express 11.892s (-15.1% 🟢) 12.141s (-16.8% 🟢) 0.249s 8 1.00x
💻 Local Express 11.905s (-28.3% 🟢) 12.147s (-28.7% 🟢) 0.242s 8 1.00x
🐘 Postgres Nitro 12.053s (-13.7% 🟢) 12.644s (-11.6% 🟢) 0.590s 8 1.01x
💻 Local Next.js (Turbopack) 12.941s 13.166s 0.225s 7 1.09x
🐘 Postgres Next.js (Turbopack) 13.139s 14.016s 0.876s 7 1.11x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 30.903s (-92.1% 🟢) 33.322s (-91.6% 🟢) 2.420s 3 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Promise.all with 10 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.145s (-9.2% 🟢) 2.006s (~) 0.862s 15 1.00x
🐘 Postgres Nitro 1.149s (-9.8% 🟢) 2.008s (~) 0.858s 15 1.00x
💻 Local Express 1.175s (-21.1% 🟢) 2.007s (~) 0.832s 15 1.03x
💻 Local Nitro 1.178s (-27.8% 🟢) 2.006s (-3.3%) 0.828s 15 1.03x
🐘 Postgres Next.js (Turbopack) 1.210s 2.007s 0.797s 15 1.06x
💻 Local Next.js (Turbopack) 1.284s 2.005s 0.722s 15 1.12x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 2.458s (-27.7% 🟢) 3.840s (-22.1% 🟢) 1.383s 8 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Promise.all with 25 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.206s (-48.9% 🟢) 2.007s (-33.3% 🟢) 0.801s 15 1.00x
🐘 Postgres Nitro 1.221s (-48.1% 🟢) 2.008s (-33.2% 🟢) 0.787s 15 1.01x
🐘 Postgres Next.js (Turbopack) 1.353s 2.007s 0.653s 15 1.12x
💻 Local Express 1.725s (-41.6% 🟢) 2.006s (-41.9% 🟢) 0.281s 15 1.43x
💻 Local Next.js (Turbopack) 1.749s 2.073s 0.324s 15 1.45x
💻 Local Nitro 1.750s (-44.3% 🟢) 2.006s (-48.4% 🟢) 0.255s 15 1.45x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 3.493s (-50.8% 🟢) 4.977s (-44.1% 🟢) 1.484s 7 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Promise.all with 50 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.305s (-62.6% 🟢) 2.008s (-49.9% 🟢) 0.703s 15 1.00x
🐘 Postgres Nitro 1.360s (-60.9% 🟢) 2.008s (-49.9% 🟢) 0.648s 15 1.04x
🐘 Postgres Next.js (Turbopack) 1.612s 2.007s 0.396s 15 1.24x
💻 Local Next.js (Turbopack) 4.832s 5.345s 0.513s 6 3.70x
💻 Local Nitro 5.161s (-38.2% 🟢) 5.679s (-37.0% 🟢) 0.519s 6 3.96x
💻 Local Express 5.508s (-33.9% 🟢) 5.846s (-35.2% 🟢) 0.338s 6 4.22x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 6.890s (-22.7% 🟢) 8.580s (-21.7% 🟢) 1.690s 4 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Promise.race with 10 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.145s (-9.0% 🟢) 2.010s (~) 0.866s 15 1.00x
🐘 Postgres Nitro 1.148s (-8.7% 🟢) 2.008s (~) 0.860s 15 1.00x
🐘 Postgres Next.js (Turbopack) 1.202s 2.008s 0.806s 15 1.05x
💻 Local Next.js (Turbopack) 1.305s 2.006s 0.701s 15 1.14x
💻 Local Express 1.410s (-25.5% 🟢) 2.006s (-15.1% 🟢) 0.596s 15 1.23x
💻 Local Nitro 1.415s (-24.2% 🟢) 2.007s (-14.3% 🟢) 0.592s 15 1.24x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 2.454s (-16.3% 🟢) 3.868s (-16.7% 🟢) 1.414s 8 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Promise.race with 25 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.208s (-48.4% 🟢) 2.008s (-33.3% 🟢) 0.800s 15 1.00x
🐘 Postgres Nitro 1.228s (-47.5% 🟢) 2.009s (-33.3% 🟢) 0.782s 15 1.02x
🐘 Postgres Next.js (Turbopack) 1.339s 2.007s 0.668s 15 1.11x
💻 Local Express 2.030s (-35.2% 🟢) 2.469s (-34.4% 🟢) 0.438s 13 1.68x
💻 Local Next.js (Turbopack) 2.048s 2.736s 0.688s 11 1.70x
💻 Local Nitro 2.101s (-31.5% 🟢) 2.509s (-35.4% 🟢) 0.408s 12 1.74x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 3.652s (+16.2% 🔺) 5.229s (+15.6% 🔺) 1.577s 6 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Promise.race with 50 concurrent steps

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.317s (-62.4% 🟢) 2.008s (-49.9% 🟢) 0.691s 15 1.00x
🐘 Postgres Nitro 1.375s (-60.5% 🟢) 2.008s (-49.9% 🟢) 0.633s 15 1.04x
🐘 Postgres Next.js (Turbopack) 1.614s 2.008s 0.394s 15 1.23x
💻 Local Next.js (Turbopack) 5.531s 6.211s 0.680s 5 4.20x
💻 Local Express 5.862s (-33.4% 🟢) 6.215s (-33.0% 🟢) 0.354s 5 4.45x
💻 Local Nitro 6.248s (-31.7% 🟢) 6.613s (-34.0% 🟢) 0.365s 5 4.74x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 6.242s (-7.6% 🟢) 7.983s (-6.6% 🟢) 1.742s 4 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 10 sequential data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Nitro 0.444s (-45.9% 🟢) 1.007s (~) 0.563s 60 1.00x
🐘 Postgres Express 0.449s (-46.5% 🟢) 1.007s (-1.6%) 0.558s 60 1.01x
💻 Local Express 0.464s (-52.9% 🟢) 1.004s (-6.7% 🟢) 0.540s 60 1.04x
💻 Local Nitro 0.474s (-51.6% 🟢) 1.004s (-8.2% 🟢) 0.530s 60 1.07x
🐘 Postgres Next.js (Turbopack) 0.667s 1.006s 0.339s 60 1.50x
💻 Local Next.js (Turbopack) 0.704s 1.004s 0.300s 60 1.59x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 5.314s (-63.4% 🟢) 6.832s (-57.5% 🟢) 1.518s 9 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 25 sequential data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.041s (-47.3% 🟢) 1.614s (-28.5% 🟢) 0.573s 56 1.00x
🐘 Postgres Nitro 1.076s (-44.2% 🟢) 1.693s (-19.4% 🟢) 0.616s 54 1.03x
💻 Local Express 1.186s (-60.7% 🟢) 2.006s (-44.1% 🟢) 0.820s 45 1.14x
💻 Local Nitro 1.186s (-60.9% 🟢) 2.006s (-46.6% 🟢) 0.820s 45 1.14x
🐘 Postgres Next.js (Turbopack) 1.629s 2.008s 0.378s 45 1.57x
💻 Local Next.js (Turbopack) 1.806s 2.028s 0.222s 45 1.73x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 13.860s (-72.2% 🟢) 15.887s (-69.3% 🟢) 2.027s 6 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 50 sequential data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Nitro 2.078s (-49.4% 🟢) 2.508s (-45.5% 🟢) 0.430s 48 1.00x
🐘 Postgres Express 2.110s (-47.1% 🟢) 2.675s (-38.8% 🟢) 0.565s 45 1.02x
💻 Local Express 2.669s (-71.0% 🟢) 3.008s (-70.0% 🟢) 0.339s 40 1.28x
💻 Local Nitro 2.684s (-71.1% 🟢) 3.008s (-70.0% 🟢) 0.324s 40 1.29x
🐘 Postgres Next.js (Turbopack) 3.262s 4.043s 0.780s 30 1.57x
💻 Local Next.js (Turbopack) 3.719s 4.008s 0.289s 30 1.79x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 26.999s (-74.8% 🟢) 29.226s (-73.2% 🟢) 2.227s 5 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 10 concurrent data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Nitro 0.175s (-38.4% 🟢) 1.006s (~) 0.831s 60 1.00x
🐘 Postgres Express 0.176s (-37.8% 🟢) 1.006s (~) 0.831s 60 1.01x
🐘 Postgres Next.js (Turbopack) 0.231s 1.006s 0.775s 60 1.33x
💻 Local Express 0.387s (-30.9% 🟢) 1.004s (~) 0.617s 60 2.22x
💻 Local Nitro 0.411s (-32.0% 🟢) 1.004s (-1.7%) 0.593s 60 2.35x
💻 Local Next.js (Turbopack) 0.483s 1.004s 0.521s 60 2.77x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 2.656s (+31.3% 🔺) 4.319s (+13.8% 🔺) 1.663s 14 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 25 concurrent data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Nitro 0.301s (-39.3% 🟢) 1.006s (~) 0.705s 90 1.00x
🐘 Postgres Express 0.330s (-35.3% 🟢) 1.028s (+2.2%) 0.698s 88 1.09x
🐘 Postgres Next.js (Turbopack) 0.436s 1.006s 0.570s 90 1.45x
💻 Local Next.js (Turbopack) 2.146s 2.884s 0.739s 32 7.12x
💻 Local Nitro 2.190s (-13.7% 🟢) 2.853s (-5.2% 🟢) 0.663s 32 7.26x
💻 Local Express 2.199s (-12.5% 🟢) 2.766s (-8.1% 🟢) 0.567s 33 7.29x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 5.974s (+69.0% 🔺) 7.826s (+50.7% 🔺) 1.852s 12 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

workflow with 50 concurrent data payload steps (10KB)

💻 Local Development

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 0.605s (-26.1% 🟢) 1.006s (-1.1%) 0.401s 120 1.00x
🐘 Postgres Nitro 0.628s (-20.6% 🟢) 1.006s (~) 0.378s 120 1.04x
🐘 Postgres Next.js (Turbopack) 0.901s 1.118s 0.217s 108 1.49x
💻 Local Express 10.018s (-10.5% 🟢) 10.778s (-9.7% 🟢) 0.760s 12 16.55x
💻 Local Nitro 10.268s (-8.2% 🟢) 10.861s (-6.9% 🟢) 0.593s 12 16.96x
💻 Local Next.js (Turbopack) 10.462s 11.117s 0.655s 11 17.28x

▲ Production (Vercel)

World Framework Workflow Time Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 16.620s (+60.9% 🔺) 18.683s (+52.1% 🔺) 2.063s 7 1.00x
▲ Vercel Express ⚠️ missing - - - -
▲ Vercel Nitro ⚠️ missing - - - -

🔍 Observability: Next.js (Turbopack)

Stream Benchmarks (includes TTFB metrics)
workflow with stream

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
💻 Local 🥇 Nitro 1.131s (+429.2% 🔺) 2.005s (+99.6% 🔺) 0.012s (-2.4%) 2.020s (+98.3% 🔺) 0.889s 10 1.00x
🐘 Postgres Express 1.137s (+454.3% 🔺) 1.998s (+100.1% 🔺) 0.001s (-25.0% 🟢) 2.011s (+98.8% 🔺) 0.874s 10 1.01x
🐘 Postgres Nitro 1.141s (+456.8% 🔺) 1.999s (+100.0% 🔺) 0.001s (-26.7% 🟢) 2.011s (+98.8% 🔺) 0.869s 10 1.01x
💻 Local Express 1.143s (+474.1% 🔺) 2.006s (+99.7% 🔺) 0.012s (+2.5%) 2.020s (+98.4% 🔺) 0.877s 10 1.01x
💻 Local Next.js (Turbopack) 1.167s 2.003s 0.012s 2.019s 0.851s 10 1.03x
🐘 Postgres Next.js (Turbopack) 1.210s 2.001s 0.001s 2.010s 0.800s 10 1.07x

▲ Production (Vercel)

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 2.147s (-68.7% 🟢) 3.299s (-61.9% 🟢) 2.045s (+223.6% 🔺) 5.833s (-40.4% 🟢) 3.685s 10 1.00x
▲ Vercel Express ⚠️ missing - - - - -
▲ Vercel Nitro ⚠️ missing - - - - -

🔍 Observability: Next.js (Turbopack)

stream pipeline with 5 transform steps (1MB)

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.496s (+137.4% 🔺) 2.004s (+99.1% 🔺) 0.004s (+4.4%) 2.024s (+97.9% 🔺) 0.529s 30 1.00x
💻 Local Nitro 1.524s (+81.7% 🔺) 2.011s (+98.7% 🔺) 0.010s (+5.3% 🔺) 2.022s (+81.2% 🔺) 0.499s 30 1.02x
🐘 Postgres Nitro 1.524s (+144.2% 🔺) 2.002s (+98.8% 🔺) 0.004s (-4.1%) 2.026s (+98.1% 🔺) 0.502s 30 1.02x
💻 Local Express 1.530s (+102.1% 🔺) 2.012s (+95.6% 🔺) 0.011s (+15.1% 🔺) 2.025s (+94.7% 🔺) 0.495s 30 1.02x
🐘 Postgres Next.js (Turbopack) 1.674s 2.010s 0.004s 2.025s 0.352s 30 1.12x
💻 Local Next.js (Turbopack) 1.845s 2.012s 0.009s 2.202s 0.357s 28 1.23x

▲ Production (Vercel)

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 6.069s (-64.1% 🟢) 7.539s (-58.7% 🟢) 0.289s (+36.9% 🔺) 8.337s (-56.0% 🟢) 2.267s 8 1.00x
▲ Vercel Express ⚠️ missing - - - - -
▲ Vercel Nitro ⚠️ missing - - - - -

🔍 Observability: Next.js (Turbopack)

10 parallel streams (1MB each)

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 0.641s (-33.3% 🟢) 1.031s (-19.3% 🟢) 0.000s (+19.0% 🔺) 1.051s (-19.6% 🟢) 0.410s 58 1.00x
🐘 Postgres Nitro 0.666s (-31.2% 🟢) 1.015s (-18.7% 🟢) 0.000s (-17.2% 🟢) 1.037s (-17.6% 🟢) 0.371s 58 1.04x
🐘 Postgres Next.js (Turbopack) 0.761s 1.054s 0.000s 1.060s 0.300s 57 1.19x
💻 Local Nitro 1.336s (+9.3% 🔺) 2.015s (~) 0.000s (+200.0% 🔺) 2.017s (~) 0.681s 30 2.08x
💻 Local Express 1.349s (+10.2% 🔺) 2.015s (~) 0.000s (-30.0% 🟢) 2.017s (~) 0.667s 30 2.10x
💻 Local Next.js (Turbopack) 1.374s 2.014s 0.000s 2.017s 0.643s 30 2.14x

▲ Production (Vercel)

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 3.459s (-66.0% 🟢) 4.699s (-59.2% 🟢) 0.000s (+Infinity% 🔺) 5.149s (-57.3% 🟢) 1.689s 12 1.00x
▲ Vercel Express ⚠️ missing - - - - -
▲ Vercel Nitro ⚠️ missing - - - - -

🔍 Observability: Next.js (Turbopack)

fan-out fan-in 10 streams (1MB each)

💻 Local Development

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
🐘 Postgres 🥇 Express 1.335s (-24.7% 🟢) 1.995s (-8.4% 🟢) 0.000s (+Infinity% 🔺) 2.010s (-8.6% 🟢) 0.675s 30 1.00x
🐘 Postgres Nitro 1.375s (-23.3% 🟢) 2.067s (-3.5%) 0.000s (-3.4%) 2.092s (-3.8%) 0.717s 29 1.03x
🐘 Postgres Next.js (Turbopack) 1.775s 2.262s 0.000s 2.269s 0.495s 27 1.33x
💻 Local Next.js (Turbopack) 2.654s 3.291s 0.000s 3.294s 0.641s 19 1.99x
💻 Local Nitro 3.093s (-8.7% 🟢) 3.776s (-6.3% 🟢) 0.000s (-18.0% 🟢) 3.782s (-6.3% 🟢) 0.689s 16 2.32x
💻 Local Express 3.192s (-8.0% 🟢) 4.029s (~) 0.000s (-41.7% 🟢) 4.031s (~) 0.839s 15 2.39x

▲ Production (Vercel)

World Framework Workflow Time TTFB Slurp Wall Time Overhead Samples vs Fastest
▲ Vercel 🥇 Next.js (Turbopack) 5.088s (-9.4% 🟢) 6.519s (-6.6% 🟢) 0.000s (-11.1% 🟢) 6.961s (-7.7% 🟢) 1.873s 9 1.00x
▲ Vercel Express ⚠️ missing - - - - -
▲ Vercel Nitro ⚠️ missing - - - - -

🔍 Observability: Next.js (Turbopack)

Summary

Fastest Framework by World

Winner determined by most benchmark wins

| World | 🥇 Fastest Framework | Wins |
| --- | --- | --- |
| 💻 Local | Nitro | 8/21 |
| 🐘 Postgres | Express | 15/21 |
| ▲ Vercel | Next.js (Turbopack) | 21/21 |

Fastest World by Framework

Winner determined by most benchmark wins

| Framework | 🥇 Fastest World | Wins |
| --- | --- | --- |
| Express | 🐘 Postgres | 19/21 |
| Next.js (Turbopack) | 🐘 Postgres | 15/21 |
| Nitro | 🐘 Postgres | 15/21 |

Column Definitions

  • Workflow Time: Runtime reported by workflow (completedAt - createdAt) - primary metric
  • TTFB: Time to First Byte - time from workflow start until first stream byte received (stream benchmarks only)
  • Slurp: Time from first byte to complete stream consumption (stream benchmarks only)
  • Wall Time: Total testbench time (trigger workflow + poll for result)
  • Overhead: Testbench overhead (Wall Time - Workflow Time)
  • Samples: Number of benchmark iterations run
  • vs Fastest: How much slower compared to the fastest configuration for this benchmark

Worlds:

  • 💻 Local: In-memory filesystem world (local development)
  • 🐘 Postgres: PostgreSQL database world (local development)
  • ▲ Vercel: Vercel production/preview deployment
  • 🌐 Turso: Community world (local development)
  • 🌐 MongoDB: Community world (local development)
  • 🌐 Redis: Community world (local development)
  • 🌐 Jazz: Community world (local development)
  • 🌐 Redis + BullMQ: Community world (local development)
  • 🌐 Cloudflare: Community world (local development)
  • 🌐 MySQL: Community world (local development)
  • 🌐 Azure: Community world (local development)
  • 🌐 NATS JetStream: Community world (local development)
  • 🌐 Upstash: Community world (local development)

📋 View full workflow run


Some benchmark jobs failed:

  • Local: success
  • Postgres: success
  • Vercel: failure

Check the workflow run for details.

@vercel vercel Bot left a comment

Additional Suggestion:

OCC fence parameters (lastKnownEventId, asOfTimestamp) are silently dropped for wait_completed and hook_received events because the lazy branch of createWorkflowRunEventInner doesn't forward them.

The lazy-refs branch of createWorkflowRunEventInner forgot to thread
`lastKnownEventId` and `asOfTimestamp` into the request body, so the
fence was silently dropped for any event whose type went through
the lazy path (i.e., not in `eventsNeedingResolve`). The resolve
branch already had the forwarding. Caught by Vercel Agent Review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Vercel Agent review acknowledged + addressed in 1e69c82 — the lazy branch of createWorkflowRunEventInner now forwards lastKnownEventId and asOfTimestamp alongside the resolve branch. Good catch — without this the fence was silently dropped for any event whose type didn't appear in eventsNeedingResolve (including wait_completed and hook_received).
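
The class of bug fixed here (one of two request-building branches forgetting the optional fence fields) can be illustrated with a sketch that routes both branches through a single helper. The shapes below are hypothetical: the real createWorkflowRunEventInner is not shown in this thread, and `pickFence`, `buildRequestBody`, and the `mode` field are names invented for the example.

```typescript
// Hypothetical parameter shapes; only lastKnownEventId and asOfTimestamp
// are names that appear in the PR.
type FenceParams = { lastKnownEventId?: string; asOfTimestamp?: number };
type EventInput = FenceParams & { eventType: string; payload: unknown };

// Copies only the fence fields that were actually provided, so unfenced
// callers produce a body with no fence keys at all.
function pickFence(input: FenceParams): FenceParams {
  const fence: FenceParams = {};
  if (input.lastKnownEventId !== undefined) fence.lastKnownEventId = input.lastKnownEventId;
  if (input.asOfTimestamp !== undefined) fence.asOfTimestamp = input.asOfTimestamp;
  return fence;
}

// Both branches spread the same fence object, so neither can drop it.
function buildRequestBody(input: EventInput, needsResolve: boolean): Record<string, unknown> {
  const fence = pickFence(input);
  if (needsResolve) {
    return { eventType: input.eventType, payload: input.payload, mode: "resolve", ...fence };
  }
  return { eventType: input.eventType, payload: input.payload, mode: "lazy", ...fence };
}
```

Sharing one helper between the branches is a design choice of this sketch, not necessarily how 1e69c82 made the change.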


Status after 1e69c82b:

  • All Local E2E (Dev / Prod / Postgres / Windows) green.
  • Vercel Prod E2E: 11/12 apps green (astro, example, express, fastify build, hono, nextjs-turbopack, nextjs-webpack, nitro, nuxt, sveltekit, tanstack-start, vite), including hono and vite, which were red on the previous push.
  • Vercel Prod fastify — single failure: abortAnyInStepWorkflow: AbortSignal.any inside a step composes deserialized signals (130/131 passed). The assertion is expect(returnValue.c2Aborted).toBe(true) — the workflow-side c2 controller didn't observe the abort by the time the workflow returned.

I don't think this is from anything in this PR:

  • The failing workflow has no sleep()/wait calls, so the elapsed-wait-scan fence path I touched never executes for it.
  • The abort signal here propagates from step → workflow via the controller's backing stream, not via hook_received. resumeHook is not in the path either.
  • Both stepResult.saw === true and stepResult.via === 'listener' passed on the same run, so the step-side composition worked correctly. Only the workflow-side controller hadn't caught up by the time the workflow returned.
  • The same test passes on Local fastify (and on every other Vercel Prod app on this PR).

Reads like a long-standing abort-stream-propagation timing flake that just happened to fire on fastify Vercel Prod this run. Will rerun the job once the workflow run is no longer in-progress; flagging here in case a reviewer hits it before I get back to it.

Comment thread .changeset/event-write-occ-fence.md Outdated
Comment thread packages/core/src/runtime/resume-hook.ts Outdated
Comment thread packages/world/src/events.ts Outdated
Comment thread packages/world/src/events.ts Outdated
Comment thread packages/core/src/runtime.ts Outdated
Comment thread packages/core/src/runtime.ts Outdated
Co-authored-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
@VaguelySerious
Member Author

Good catch — the answer is "yes, conceptually they can race the same way." Here's the breakdown of where we currently fence vs. don't, and what's at risk:

Where the SDK currently sends a fence

Only one site: the elapsed-wait scan in runtime.ts, when it writes wait_completed. The fence is the loaded events' tail eventId; on a fence conflict the tick retries in-place with a fresh fence (the loop I added).
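
For reference, a minimal sketch of that loop against an in-memory stand-in for the server. `FakeLog`, `completeWaitFenced`, and the 50 ms base delay are assumptions made for illustration, not the code in runtime.ts; the sketch is synchronous and records the backoff delays instead of sleeping.

```typescript
// Hypothetical in-memory model; not the real workflow-server or runtime API.
interface LogEvent {
  eventId: string;
  eventType: string;
  correlationId?: string;
}

type AppendResult =
  | { ok: true; event: LogEvent }
  | { ok: false; reason: 'fence_conflict' };

class FakeLog {
  readonly events: LogEvent[] = [];
  private seq = 0;

  // Appends an event. If a fence is supplied, rejects unless it equals the current tail.
  append(eventType: string, correlationId?: string, lastKnownEventId?: string): AppendResult {
    const tail =
      this.events.length > 0 ? this.events[this.events.length - 1].eventId : undefined;
    if (lastKnownEventId !== undefined && lastKnownEventId !== tail) {
      return { ok: false, reason: 'fence_conflict' };
    }
    this.seq += 1;
    const event: LogEvent = { eventId: `evt_${this.seq}`, eventType, correlationId };
    this.events.push(event);
    return { ok: true, event };
  }
}

interface FencedWriteOutcome {
  written: boolean;
  attempts: number;
  backoffsMs: number[];
}

function completeWaitFenced(
  log: FakeLog,
  snapshot: readonly LogEvent[],
  waitId: string,
  maxAttempts = 5,
): FencedWriteOutcome {
  let events: readonly LogEvent[] = snapshot;
  const backoffsMs: number[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // A concurrent writer may already have completed the wait; then skip the write.
    const alreadyDone = events.some(
      (e) => e.eventType === 'wait_completed' && e.correlationId === waitId,
    );
    if (alreadyDone) return { written: false, attempts: attempt - 1, backoffsMs };

    // The fence is the tail of whatever snapshot this attempt is working from.
    const fence = events.length > 0 ? events[events.length - 1].eventId : undefined;
    const result = log.append('wait_completed', waitId, fence);
    if (result.ok) return { written: true, attempts: attempt, backoffsMs };

    // Fence conflict: record the backoff (real code would sleep), then reload.
    backoffsMs.push(50 * 2 ** (attempt - 1));
    events = [...log.events];
  }
  throw new Error(`fence conflict persisted after ${maxAttempts} attempts`);
}
```

If a `hook_received` lands between the snapshot and the first write, the first attempt is rejected and the second succeeds with the refreshed fence.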

Where it doesn't, but could race in the same shape

Any write that the workflow runtime makes based on a branch decision driven by the loaded events array can race in the same way the production hook/sleep bug did. Concretely:

suspension-handler.ts — these are exactly the writes the workflow VM emits when its replay decides to allocate a new entity. If that decision was made on a stale snapshot, the write is "stale-branch":

| Write | Currently fenced? | Can race? |
| --- | --- | --- |
| hook_created | no | yes: workflow decided to allocate a hook based on its branch |
| hook_disposed | no | yes: workflow decided to dispose |
| step_created | no | yes: workflow decided to invoke a step |
| wait_created | no | yes: workflow decided to sleep() |

runtime.ts — terminal-state writes after a successful replay:

| Write | Currently fenced? | Can race? |
| --- | --- | --- |
| run_completed (line 974) | no | yes: workflow returned based on its branch |
| run_failed (catch path) | no | yes: workflow threw based on its branch |

What doesn't need a fence (and why)

  • step_started, step_completed, step_failed, step_retrying (step-executor.ts, step-handler.ts): these record facts about a step that's already in the log via step_created. They're not making a new branch decision; if the step was allocated, finishing it is just bookkeeping.
  • run_created (start.ts): no prior events for the run; nothing to be stale against.
  • run_started (initial tick): same — first thing the runtime writes.
  • run_failed from MAX_DELIVERIES_EXCEEDED / replay-budget exhaustion (replay-budget.ts): terminal escape hatch; needs to land regardless.
  • hook_received from resumeHook: deliberately unfenced — fencing would lose the user's signal.

Why the wait_completed fence alone catches the production bug today

The hook/sleep race needs: (1) tick takes sleep branch with stale events, (2) tick writes wait_completed, and (3) tick writes any sleep-branch follow-up events (more sleeps, the next iteration's allocations). My current fence trips at step (2): the wait_completed CAS fails, the retry loop reloads events, sees hook_received, and the next replay picks the hook branch — so step (3)'s sleep-branch writes never happen. That's why the repro went from corrupted to clean.

But that's specific to the hook-vs-sleep shape because wait_completed is on the critical path of the sleep branch. A race that doesn't go through wait_completed (e.g., a branch decision around Promise.race([hook, someStep]) where the workflow allocates a different step depending on which side wins) wouldn't be caught — the stale step_created/hook_created would land without a fence check.

Recommendation

For full coverage of the general "stale-snapshot branch decision" race, fence all 6 sites in the two tables above (four in suspension-handler.ts, two in runtime.ts). Implementation cost is modest: the suspension-handler already has the events array available (the runtime invokes it with the loaded events); we'd thread the tail eventId in the same way I did for the elapsed-wait scan, and add the same retry-on-fence-conflict pattern to each catch block. run_completed/run_failed need the same plus distinguishing fence-conflict from "run already finished" (mirror of the workflow-server-side fix).

Want me to do it in this PR or keep it as a follow-up? My slight lean is follow-up: the current PR cleanly resolves the production bug and it's already a non-trivial behavioral change; the broader fencing is mostly defensive against shapes we haven't seen in the wild. But the risk of leaving them unfenced is real if we're betting that no other workflow patterns trigger the same race.

@VaguelySerious
Member Author

You're right, I was sloppy. Let me walk through it.

Your example: Promise.race([someStep, someHook])

Timeline that breaks:

  1. T+0: workflow starts, runtime writes step_created for someStep. Suspends.
  2. T+1: step finishes elsewhere. Step executor writes step_completed (eventId timestamp ≈ T+1).
  3. T+3: resumeHook writes hook_received (eventId timestamp ≈ T+3).
  4. T+3.5: tick A is re-invoked, loads events.

If at T+3.5 the eventually-consistent read returns [..., step_created, hook_received] but misses step_completed, then tick A's Promise.race resolves with the hook (because the step's resolution event isn't in its events array). Tick A writes hook-branch follow-up events (say step_created for the post-hook step).

But canonically: step_completed (T+1) < hook_received (T+3). A future replay walks events in eventId order:

  • consumes step_completed → step subscriber resolves first via promiseQueue
  • then hook_received → hook subscriber resolves second

The step wins the race, not the hook. Tick A's hook-branch writes are orphaned. A future replay tries to take the step branch, hits unconsumed events, and fires CORRUPTED_EVENT_LOG.
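
A toy model of that divergence, assuming replay simply walks events in eventId order and the first resolution event it meets settles the race. `raceWinner` and the `evt_T…` ids are made up for illustration; the real replay goes through promiseQueue and subscribers.

```typescript
// Hypothetical event shape; ids are chosen so that string order matches time order.
interface LogEvent {
  eventId: string;
  eventType: string;
}

// Walks events in eventId order and reports which side of
// Promise.race([someStep, someHook]) resolves first.
function raceWinner(events: readonly LogEvent[]): 'step' | 'hook' | 'pending' {
  const ordered = [...events].sort((a, b) =>
    a.eventId < b.eventId ? -1 : a.eventId > b.eventId ? 1 : 0,
  );
  for (const e of ordered) {
    if (e.eventType === 'step_completed') return 'step';
    if (e.eventType === 'hook_received') return 'hook';
  }
  return 'pending';
}

const canonicalLog: LogEvent[] = [
  { eventId: 'evt_T0', eventType: 'step_created' },
  { eventId: 'evt_T1', eventType: 'step_completed' },
  { eventId: 'evt_T3', eventType: 'hook_received' },
];

// Tick A's eventually-consistent read missed step_completed.
const tickAView = canonicalLog.filter((e) => e.eventType !== 'step_completed');

const staleDecision = raceWinner(tickAView); // branch tick A writes for
const replayDecision = raceWinner(canonicalLog); // branch a later replay takes
```

Here `staleDecision` is the hook branch while `replayDecision` is the step branch, which is exactly the mismatch that leaves tick A's follow-up events unconsumed.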

So yes, the race exists and step_completed is on the critical path.

And my fence doesn't catch this one

This is the embarrassing part. The fence compares run.lastKnownEventId (latest written) against the tick's claim (its loaded tail). At T+3.5:

  • server run.lastKnownEventId = E_hook (because hook_received was the most-recently-written event, and my server .set() is unconditional, not max())
  • tick A's fence = E_hook (its loaded tail)
  • CAS: E_hook == E_hook → passes

The tick wrote based on a stale view of "everything ≤ E_hook" but the server's fence only tracks "latest write," which the tick did see. The fence catches "missed events that came after my tail" (the hook/sleep production shape) but not "missed events that came before my tail."

Why the production fix still held

The hook/sleep production race has the opposite shape: the tick's tail is from T+0 (a wait_created written at workflow start), and the missed event (hook_received) lands at T+5. That missed event has a higher eventId than the tick's tail, so the server's lastKnownEventId advances past it and CAS fails. That's the case the current fence covers.

The shape you're describing — missed event has a lower eventId than what the tick saw — slips past a single-attribute fence.
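
To make the asymmetry concrete, here is a small model of the single-attribute check. It assumes events are written in eventId order, so "latest written" is the last entry of the canonical list; `fenceCheck` and the `E…` ids are illustrative only.

```typescript
interface FenceCheckResult {
  casPasses: boolean;
  missed: string[]; // canonical events the tick never loaded
}

// canonical: every eventId the server has, in write order.
// loaded: the eventIds the tick's read returned, in the same order.
function fenceCheck(canonical: readonly string[], loaded: readonly string[]): FenceCheckResult {
  const serverLastKnown = canonical[canonical.length - 1];
  const tickTail = loaded[loaded.length - 1];
  const missed = canonical.filter((id) => loaded.indexOf(id) === -1);
  return { casPasses: serverLastKnown === tickTail, missed };
}

// Hook/sleep shape: the missed event is after the tick's tail, so CAS rejects.
const hookSleep = fenceCheck(['E0_wait_created', 'E5_hook_received'], ['E0_wait_created']);

// Promise.race shape: the missed event is before the tick's tail, so CAS passes.
const stepHook = fenceCheck(
  ['E0_step_created', 'E1_step_completed', 'E3_hook_received'],
  ['E0_step_created', 'E3_hook_received'],
);
```

In both cases `missed` is non-empty, but only `hookSleep` is rejected; `stepHook` passes the CAS despite the stale view.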

What would actually catch it

Three options I can think of, none free:

  1. Strongly-consistent reads on every event load instead of the split-read with the 1.5 s freshness window. Catches it definitively. Reads get more expensive; existing perf decisions go out the window.
  2. Server-side gap detection: server checks that the events table has no eventId in the range (tick.fence_value, tick.now()) other than the events the tick has acknowledged. Requires either sending the full set of seen eventIds or a per-run counter the tick increments.
  3. Reject any event whose eventId is < run.lastKnownEventId: makes eventId monotonic per run (no "stragglers"). Eliminates the failure mode but rejects legitimate concurrent writers whose clocks lag — turns into spurious 409s under any clock skew between SDK processes.

(2) is the cleanest but requires server-side work that isn't in this PR's scope. (3) is cheap but trades skew-tolerance for correctness: it closes the gap at the cost of spurious 409s. (1) is correct but a perf regression.

So how should we think about this PR

Honestly: the current fence is a partial fix. It closes the hook/sleep production shape but the general "stale-snapshot branch decision" hazard still exists for races where the missed event is canonically earlier than what the tick saw. The Promise.race([someStep, someHook]) shape you've described is a real instance of that.

I think we should:

  • Land this PR for the production bug as-is, but be honest in the description that its coverage is asymmetric, as described above.
  • File a follow-up to design proper gap-detection (probably option 2) — that requires server-side schema/logic that we shouldn't tack onto this PR.

Want me to update the PR description with this caveat, or take a swing at extending the fence in this PR? If the latter, I'd want to whiteboard the design with you first — none of the three options above are obviously right.

if (!EntityConflictError.is(err)) {
throw err;
}
// Fence conflicts surface a specific error
Contributor

nit: would be good to add a log message here so we can capture these in datadog

// too, but to guarantee correctness, will be re-tried here directly.
// TODO: We can remove the retry here after extensive validation.
// The cost is low in the meantime.
const isFenceConflict = /fence conflict/i.test(
Contributor

AI Review: brittle coupling to server error wording. /fence conflict/i.test(err.message) against the free-form 409 message means any rewording on the workflow-server side silently routes fence conflicts into the Wait already completed, skipping branch below — workflows keep moving but stale-snapshot protection is gone, and you'd only notice via the CORRUPTED_EVENT_LOG you're trying to eliminate. Prefer surfacing a typed code from the server (e.g. errorData.code === 'FENCE_CONFLICT' carried on EntityConflictError) and matching on that. Worth doing in the paired server PR so this regex never has to ship.
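
A rough sketch of the difference, assuming the server could attach a `code` field to its 409 payload. `ConflictErrorLike` and both predicates are hypothetical; the real `EntityConflictError` may carry its data differently.

```typescript
// Hypothetical error shape; `code` is the assumed typed discriminator.
interface ConflictErrorLike {
  status: number;
  message: string;
  code?: string;
}

// Current approach: depends on the server's free-form wording.
function isFenceConflictByMessage(err: ConflictErrorLike): boolean {
  return /fence conflict/i.test(err.message);
}

// Suggested approach: depends only on a stable, typed code.
function isFenceConflictByCode(err: ConflictErrorLike): boolean {
  return err.status === 409 && err.code === 'FENCE_CONFLICT';
}

// The server rewords its message; only the code-based check still matches.
const reworded: ConflictErrorLike = {
  status: 409,
  message: 'Write rejected: run log advanced past lastKnownEventId',
  code: 'FENCE_CONFLICT',
};
const byMessage = isFenceConflictByMessage(reworded);
const byCode = isFenceConflictByCode(reworded);
```

After the rewording, `byMessage` is false and the conflict would fall into the "already completed" branch, while `byCode` still routes it to the retry path.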

continue;
if (result.event) {
fenceEventId = result.event.eventId;
}
Contributor

AI Review: EventResultResolveWireSchema.event is .optional() (packages/world-vercel/src/events.ts:63), so on any response without event you don't refresh fenceEventId and the next iteration's write uses the now-stale tail — that triggers a spurious fence conflict and forces a full reload/backoff cycle for every subsequent wait in waitsToComplete. In practice the server probably always returns the created event, but the type allows the foot-gun. Either tighten the response schema to require event on create, or fall back to a deterministic value when missing.

!events.some((x) => x.eventId === e.eventId)
) {
events.push(e);
}
Contributor

AI Review: events.some(x => x.eventId === e.eventId) inside the for makes this O(n²) over the existing log on every fence-retry reload. Event logs aren't huge today, but the retry path is exactly where they'll be longest. A Set of existing ids built once before the loop avoids it for free.
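
Something along these lines would do it. `mergeNewEvents` is a hypothetical helper name, and the sketch assumes events are identified by `eventId` alone, as in the snippet above.

```typescript
interface LogEvent {
  eventId: string;
}

// Appends reloaded events that are not already present, in O(n + m)
// instead of scanning the whole log for every reloaded event.
function mergeNewEvents<T extends LogEvent>(events: T[], reloaded: readonly T[]): number {
  const seen = new Set<string>(events.map((e) => e.eventId));
  let added = 0;
  for (const e of reloaded) {
    if (!seen.has(e.eventId)) {
      seen.add(e.eventId); // also guards against duplicates within `reloaded`
      events.push(e);
      added += 1;
    }
  }
  return added;
}
```

The Set is built once per reload, so the cost stays linear even when the retry path runs against a long log.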

* `hook_received` after anything the caller could have observed without paying
* for a separate read. Ignored when `lastKnownEventId` is also set.
*/
asOfTimestamp?: number;
Contributor

AI Review: asOfTimestamp is added to the public CreateEventParams and threaded through world-vercel, but no caller in this PR uses it — the docstring points at resumeHook, which the PR explicitly keeps unfenced. Public API surface with no exerciser tends to rot (or drift from the server's interpretation) before its first real caller arrives. Consider dropping it from this PR until resumeHook (or another caller) actually needs it, or wiring up that single caller now so the contract is tested end-to-end.
