feat: enforce max queue deliveries in handlers with graceful failure #1344
Conversation
🦋 Changeset detected — latest commit: d5b4132. The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages.
🧪 E2E Test Results — ❌ Some tests failed

Failed tests:
- ▲ Vercel Production (1 failed): nextjs-turbopack (1 failed)
- 🌍 Community Worlds (58 failed): mongodb (3 failed), redis (2 failed), turso (53 failed)

Details by category:
- ❌ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ❌ 🌍 Community Worlds
- ✅ 📋 Other

❌ Some E2E test jobs failed; check the workflow run for details.
Warning: this pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite. This stack of pull requests is managed by Graphite.
Force-pushed 3ff6de8 → 9929b57
Force-pushed c2ed3e7 → 085a05a
Pull request overview
This PR moves enforcement of the queue delivery cap from the Vercel Queue trigger configuration into the workflow/step runtime handlers, and updates local-queue behavior/logging to support the new approach.
Changes:
- Add a shared `MAX_QUEUE_DELIVERIES` constant and enforce it in both workflow and step handlers with graceful failure (`run_failed`/`step_failed` + requeue workflow for step).
- Remove `maxDeliveries` from queue trigger definitions in `@workflow/builders`.
- Improve `world-local` queue logging with `runId`/`stepId` context and add a local retry safety limit.
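The delivery-cap check itself is small; a minimal sketch with hypothetical types, using the inclusive-cap semantics implied by the tests in this PR (attempt 65 trips the check, attempt 64 does not):

```typescript
// Sketch of the delivery-cap predicate. MAX_QUEUE_DELIVERIES and
// metadata.attempt come from the PR; the surrounding types are
// simplified assumptions, not the real handler signature.
const MAX_QUEUE_DELIVERIES = 64;

interface QueueMetadata {
  attempt: number;
}

function exceedsMaxDeliveries(metadata: QueueMetadata): boolean {
  // Deliveries are counted from 1, so attempt 64 is still allowed
  // and attempt 65 trips the cap.
  return metadata.attempt > MAX_QUEUE_DELIVERIES;
}
```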
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| packages/world-local/src/queue.ts | Adds structured identifiers to logs and replaces the old retry counter with a fixed safety-loop cap. |
| packages/errors/src/error-codes.ts | Introduces MAX_DELIVERIES_EXCEEDED run error code. |
| packages/core/src/runtime/step-handler.ts | Enforces max deliveries for steps and adjusts event creation/logging behavior. |
| packages/core/src/runtime/step-handler.test.ts | Adds test coverage for step max-deliveries behavior. |
| packages/core/src/runtime/constants.ts | Defines MAX_QUEUE_DELIVERIES. |
| packages/core/src/runtime.ts | Enforces max deliveries for workflow handler and records run_failed with a specific error code. |
| packages/builders/src/constants.ts | Removes VQS maxDeliveries from trigger constants. |
| .changeset/handler-max-deliveries.md | Changeset describing the behavior shift from VQS config to handler enforcement. |
```ts
const startResult = await world.events.create(workflowRunId, {
  eventType: 'step_started',
  specVersion: SPEC_VERSION_CURRENT,
  correlationId: stepId,
});
```
packages/core/src/runtime.ts (outdated)
```ts
try {
  const world = getWorld();
  await world.events.create(runId, {
    eventType: 'run_failed',
    specVersion: SPEC_VERSION_CURRENT,
    eventData: {
      error: {
        message: `Workflow exceeded maximum queue deliveries (${metadata.attempt}/${MAX_QUEUE_DELIVERIES})`,
      },
      errorCode: RUN_ERROR_CODES.MAX_DELIVERIES_EXCEEDED,
    },
  });
```
Before:

```ts
// Safety limit to prevent infinite loops in the local queue.
// The actual max delivery enforcement happens in the workflow/step handlers.
const MAX_LOCAL_SAFETY_LIMIT = 1000;
try {
  let defaultRetriesLeft = 3;
  for (let attempt = 0; defaultRetriesLeft > 0; attempt++) {
    defaultRetriesLeft--;
```

After:

```ts
  for (let attempt = 0; attempt < MAX_LOCAL_SAFETY_LIMIT; attempt++) {
```
yeah there's no reason this has to be 1000. just needs to be higher than the max queue attempts. let's go with 256
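A sketch of what the reviewer is asking for — the local safety loop capped at 256, with the real max-delivery enforcement left to the handlers. The function shape here is an assumption for illustration, not the actual world-local API:

```typescript
// Hypothetical shape of the local-queue delivery loop. The cap only
// guards against infinite local retries; the handlers themselves decide
// when a run/step has exceeded its delivery budget.
const MAX_LOCAL_SAFETY_LIMIT = 256;

async function deliverWithSafetyCap(
  handler: (attempt: number) => Promise<boolean> // true = message consumed
): Promise<boolean> {
  for (let attempt = 0; attempt < MAX_LOCAL_SAFETY_LIMIT; attempt++) {
    if (await handler(attempt)) return true;
  }
  // Safety cap hit without the handler ever consuming the message.
  return false;
}
```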
```ts
it('should post step_failed and re-queue workflow when delivery count exceeds max', async () => {
  const result = await capturedHandler(
    createMessage(),
    { ...createMetadata('myStep'), attempt: 65 }
  );

  expect(result).toBeUndefined();
  expect(mockEventsCreate).toHaveBeenCalledWith(
    'wrun_test123',
    expect.objectContaining({
      eventType: 'step_failed',
      correlationId: 'step_abc',
    })
  );
  expect(mockQueueMessage).toHaveBeenCalled();
  expect(mockRuntimeLogger.error).toHaveBeenCalledWith(
    expect.stringContaining('exceeded max deliveries'),
    expect.objectContaining({ workflowRunId: 'wrun_test123' })
  );
});

it('should consume message silently when step_failed fails with EntityConflictError', async () => {
  mockEventsCreate.mockRejectedValue(
    new EntityConflictError('Step already completed')
  );

  const result = await capturedHandler(
    createMessage(),
    { ...createMetadata('myStep'), attempt: 65 }
  );

  expect(result).toBeUndefined();
  expect(mockStepFn).not.toHaveBeenCalled();
});

it('should not trigger max deliveries check when under limit', async () => {
  const result = await capturedHandler(
    createMessage(),
    { ...createMetadata('myStep'), attempt: 64 }
  );
```
Replace VQS maxDeliveries cap with handler-level enforcement. Handlers now gracefully fail runs/steps after excessive queue redeliveries, preventing "phantom stuck" runs.

- Add MAX_QUEUE_DELIVERIES constant (64) and enforce in both workflow and step handlers with run_failed/step_failed events
- Remove maxDeliveries from VQS trigger configs (builders + sveltekit)
- Improve world-local queue: safety limit loop, structured logging with runId/stepId, backoff delay on failures
- Add MAX_DELIVERIES_EXCEEDED error code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
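The changeset mentions a backoff delay on failures in the local queue; one plausible shape is capped exponential backoff. The base, growth factor, and cap below are illustrative assumptions, not the shipped values:

```typescript
// Illustrative capped exponential backoff for local-queue retry delays.
// baseMs and capMs are made-up defaults for the sketch.
function backoffMs(attempt: number, baseMs = 100, capMs = 5000): number {
  // Double the delay each attempt, clamped so late attempts
  // don't wait unboundedly.
  return Math.min(baseMs * 2 ** attempt, capMs);
}

console.log(backoffMs(0)); // 100
console.log(backoffMs(3)); // 800
console.log(backoffMs(10)); // 5000 (clamped)
```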
📊 Benchmark Results

Benchmarks cover: workflows with 0/1/10/25/50 sequential steps; Promise.all and Promise.race with 10/25/50 concurrent steps; sequential and concurrent 10KB data-payload steps (10/25/50); and stream benchmarks with TTFB metrics (single stream, a 5-transform 1MB pipeline, 10 parallel 1MB streams, and fan-out/fan-in with 10 1MB streams). Each runs in 💻 Local Development and ▲ Production (Vercel), with observability links for Nitro, Express, and Next.js (Turbopack). The summary names the fastest framework per world and the fastest world per framework, determined by most benchmark wins.
```ts
// --- Max delivery check ---
// Enforce max delivery limit before any infrastructure calls.
// This prevents runaway steps from consuming infinite queue deliveries.
```
Suggested change:

```ts
// This prevents runaway steps from consuming infinite queue deliveries.
// At this point, we want to do the minimal amount of work (no fetching
// of the step details, etc.). We simply attempt to mark the step as failed
// and enqueue the workflow once, and if either of those fails, the message
// is still consumed but with adequate logging that an error occurred.
```
```ts
'Failed to post run_failed for max deliveries exceeded, consuming message anyway',
{
  workflowRunId: runId,
```
Like the step error message, this should also be more verbose and explain that a persistent outage is preventing us from failing the run normally etc. etc.
```ts
  EntityConflictError.is(err) ||
  RunExpiredError.is(err)
) {
  // Run already finished, consume the message
```
Suggested change:

```ts
// Run already finished, consume the message silently
```
```ts
  return;
}
runtimeLogger.error(
  'Failed to post run_failed for max deliveries exceeded, consuming message anyway',
```
Like the step error message, this should also be more verbose and explain that a persistent outage is preventing us from failing the run normally etc. etc.
```ts
// --- Max delivery check ---
// Enforce max delivery limit before any infrastructure calls.
// This prevents runaway workflows from consuming infinite queue deliveries.
```
Suggested change:

```ts
// This prevents runaway workflows from consuming infinite queue deliveries.
// At this point, we want to do the minimal amount of work (no fetching
// of the workflow events, etc.). We simply attempt to mark the run as failed
// and if that fails, the message is still consumed but with adequate logging
// that an error occurred preventing us from failing the run.
```

Summary

Replaces the VQS `maxDeliveries: 64` cap with handler-level enforcement. Handlers now gracefully fail runs/steps after excessive queue redeliveries, preventing "phantom stuck" runs.

Stacked on #1342 → #1340

Problem

When infrastructure is down (OOMs, network outages), VQS retries messages up to `maxDeliveries: 64` times at 5s intervals. After exhausting retries, VQS drops the message — the run stays in `running` status forever with no error and no failure event.

Solution

- Remove `maxDeliveries` from VQS config — allow infinite retries at queue level
- Keep `retryAfterSeconds: 5` — VQS owns retry timing (works even after SIGKILL/OOM)
- Check `metadata.attempt` in handlers — when > `MAX_QUEUE_DELIVERIES` (64), fail gracefully with the `MAX_DELIVERIES_EXCEEDED` error code

Queue error log examples (before → after)
Before (dumped full body, no run context):
After (structured, includes run/step IDs, separates HTTP status from handler error):
Local world queue

Test plan

- Verify the run ends `failed` with `MAX_DELIVERIES_EXCEEDED`

🤖 Generated with Claude Code