Skip to content

fix(execution): fix isolated-vm memory leak and add worker recycling#4108

Merged
waleedlatif1 merged 5 commits intostagingfrom
fix/isolated-vm-memory-leak
Apr 11, 2026
Merged

fix(execution): fix isolated-vm memory leak and add worker recycling#4108
waleedlatif1 merged 5 commits intostagingfrom
fix/isolated-vm-memory-leak

Conversation

@waleedlatif1
Copy link
Copy Markdown
Collaborator

Summary

  • Root cause: ECS app tasks were crashing with exit code 139 (SIGSEGV) after ~87 minutes of steady memory growth (~200MB/min). Three bugs:

    1. Host-side isolated-vm handles (Context, Script, Reference, ExternalCopy) were never explicitly released — only isolate.dispose() was called in the finally block. Per isolated-vm issue #198, disposing an isolate while child handles still exist causes stuck-GC states and native memory accumulation outside the V8 heap. Per issue #42, unreleased References hold their parent context alive indefinitely.
    2. Workers had no execution limit — under sustained load the 60s idle timeout never fires, so workers run indefinitely and accumulate V8 heap fragmentation that persists even with correct per-execution cleanup.
    3. --no-node-snapshot was missing from worker spawns. isolated-vm v6 README explicitly requires this for Node.js 20+ due to shared V8 snapshot heap incompatibility (issue #377).
  • Fix 1 (isolated-vm-worker.cjs): Hoist all ivm handle declarations before the try block so finally can reach them. Release children (scripts, references, external copies, context) before calling isolate.dispose() — correct order per the library's issue tracker. All releases wrapped in try/catch so a bad isolate state can't crash the worker.

  • Fix 2 (isolated-vm.ts): Add graceful worker recycling after N lifetime executions (default 500, tunable via IVM_MAX_EXECUTIONS_PER_WORKER). Workers are marked retiring — no new dispatches, in-flight executions drain normally, then the worker is cleaned up and a replacement spawns automatically. Zero user-visible impact.

  • Fix 3 (isolated-vm.ts): Add --no-node-snapshot flag to worker spawn — required by isolated-vm v6 for Node.js 20+.

References

Type of Change

  • Bug fix

Testing

Tested manually — build passes clean

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link
Copy Markdown

vercel bot commented Apr 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Apr 11, 2026 5:54pm

Request Review

@cursor
Copy link
Copy Markdown

cursor bot commented Apr 11, 2026

PR Summary

Medium Risk
Touches the isolated execution worker pool and cleanup logic; errors could cause stalled executions or reduced capacity, but changes are scoped to resource management and process spawning.

Overview
Fixes native memory growth in isolated-vm execution by explicitly tracking and release()ing all host-side handles (Context, Script, Reference, ExternalCopy, callbacks) before isolate.dispose() in the worker.

Adds worker recycling in the pool: workers track lifetimeExecutions, get marked retiring after IVM_MAX_EXECUTIONS_PER_WORKER (new env var, default 500), stop receiving new work, and are cleaned up once idle; pool sizing/selection now ignores retiring workers. Also spawns workers with node --no-node-snapshot for Node 20+ compatibility.

Reviewed by Cursor Bugbot for commit e05a36c. Configure here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 11, 2026

Greptile Summary

This PR fixes three root causes of SIGSEGV crashes (exit code 139) in ECS app tasks: (1) isolated-vm child handles (Script, Reference, Callback, ExternalCopy, Context) were not explicitly released before isolate.dispose(), causing stuck-GC states and native memory accumulation; (2) workers had no lifetime execution cap, accumulating V8 heap fragmentation indefinitely; (3) --no-node-snapshot was missing, causing heap incompatibility on Node.js 20+. The previous reviewer concern about retiring workers blocking the pool has been resolved in commit 33bf94a — retiring workers are now excluded from pool-size calculations and a replacement is spawned proactively.

Confidence Score: 5/5

Safe to merge — all three root causes of the SIGSEGV crash are correctly addressed and the previous P1 pool-sizing concern is resolved.

No P0 or P1 findings remain. The handle-release ordering (children before isolate.dispose), the pool-size accounting fix, and the --no-node-snapshot flag are all implemented correctly. The lifetimeExecutions counter is correctly incremented in the normal result and timeout paths with no double-counting risk. The retiring-worker drain-then-cleanup flow is sound.

No files require special attention.

Important Files Changed

Filename Overview
apps/sim/lib/execution/isolated-vm-worker.cjs Hoists all ivm handle declarations before try block; releases Script, Callback, Reference, ExternalCopy, and Context in child-before-parent order in finally; wraps each release in try/catch to survive bad-isolate states.
apps/sim/lib/execution/isolated-vm.ts Adds lifetimeExecutions counter and retiring flag to WorkerInfo; marks workers for retirement after MAX_EXECUTIONS_PER_WORKER executions; excludes retiring workers from pool-size accounting in acquireWorker and drainQueue; adds --no-node-snapshot to worker spawn args.
apps/sim/lib/core/config/env.ts Adds IVM_MAX_EXECUTIONS_PER_WORKER env var (default 500) used to tune the new worker recycling threshold.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([Execution request]) --> B[acquireWorker]
    B --> C{Non-retiring worker\nwith capacity?}
    C -- Yes --> D[dispatchToWorker]
    C -- No, pool not full\nexcluding retiring --> E[spawnWorker]
    E --> D
    C -- Pool at capacity --> F[enqueueExecution]

    D --> G[[Worker isolate executes]]
    G --> H{Complete / Timeout / SendFail}

    H -- Normal result --> I[lifetimeExecutions++]
    H -- Timeout +1s --> J[lifetimeExecutions++]
    H -- SendFail --> K[resolve error]

    I --> L{>= MAX_EXECUTIONS_PER_WORKER?}
    J --> L
    L -- Yes --> M[retiring = true]
    L -- No --> N{retiring &&\nactiveExecutions == 0?}
    M --> N
    N -- Yes --> O[cleanupWorker]
    N -- No --> P[resetWorkerIdleTimeout]
    K --> P

    O --> Q[resolve result / error]
    P --> Q
    Q --> R[drainQueue]
    R --> S{Queue items?\npool below POOL_SIZE?}
    S -- Yes --> E
    S -- No --> T([Done])
Loading

Reviews (4): Last reviewed commit: "fix(execution): increment lifetimeExecut..." | Re-trigger Greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit e05a36c. Configure here.

@waleedlatif1 waleedlatif1 merged commit 20cc018 into staging Apr 11, 2026
12 checks passed
@waleedlatif1 waleedlatif1 deleted the fix/isolated-vm-memory-leak branch April 11, 2026 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant