fix(sandbox): bound quick file ops during link partitions #4066

Merged: tlgimenes merged 1 commit into main from zeta-delphini on Jun 22, 2026

@tlgimenes (Contributor) commented Jun 22, 2026

Summary

  • Add a 10s fast-fail timeout for quick sandbox file operations when the daemon link is partitioned.
  • Update the WS partition resilience scenario timing expectations.
  • Align the Playwright config unit test with MCP_CACHE_ENABLED=true.

Test Plan

  • bunx biome format --write apps/mesh/src/api/routes/sandbox-proxy.ts tests/resilience/scenarios/link-dispatch-ws-partition.test.ts apps/mesh/playwright.config.test.ts
  • bun run check
  • bun test apps/mesh/playwright.config.test.ts

Note: full bun test was not completed in the sandbox because Docker-backed resilience setup did not progress.


Summary by cubic

Bound quick sandbox file ops to a 10s fast-fail so reads/writes don’t stall ~30s during link partitions and instead return a prompt 502. Updated resilience test timing and aligned Playwright config test with MCP_CACHE_ENABLED=true.

  • Bug Fixes
    • Added a 10s QUICK_FILE_OP_TIMEOUT_MS and quickFileOpSignal; applied to read/write/mkdir/unlink/rename/glob routes to fail fast on partitions.
    • Left streaming ops (exec, events, git/*) unbounded to preserve behavior.
    • Resilience test: assert quick-file-op errors within ~10s (15s ceiling) and extend reconnect waits to 90s for CI backoff.
    • Playwright config test: expect MCP_CACHE_ENABLED=true in the dev server command.
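
The fast-fail bound listed above can be sketched roughly as follows. This is an illustration only, not the code in sandbox-proxy.ts: the names QUICK_FILE_OP_TIMEOUT_MS and quickFileOpSignal come from this PR, but their signatures, the `forwardQuickFileOp` helper, and the way an abort is mapped to a 502 are assumptions.

```typescript
// Name taken from the PR; the value matches the 10s bound it describes.
const QUICK_FILE_OP_TIMEOUT_MS = 10_000;

// Name taken from the PR; this signature and body are assumed.
// AbortSignal.timeout() returns a signal that aborts after the given delay.
function quickFileOpSignal(
  timeoutMs: number = QUICK_FILE_OP_TIMEOUT_MS,
): AbortSignal {
  return AbortSignal.timeout(timeoutMs);
}

// Hypothetical wrapper: run one quick file op under the bound and turn a
// timeout abort into a prompt 502 instead of letting the request stall.
async function forwardQuickFileOp(
  op: (signal: AbortSignal) => Promise<Response>,
  timeoutMs: number = QUICK_FILE_OP_TIMEOUT_MS,
): Promise<Response> {
  const signal = quickFileOpSignal(timeoutMs);
  try {
    return await op(signal);
  } catch (err) {
    if (signal.aborted) {
      // The bound fired before the daemon answered.
      return new Response("sandbox link timed out", { status: 502 });
    }
    throw err; // unrelated failures are left to the caller
  }
}
```

A route handler might call it as `forwardQuickFileOp((signal) => fetch(daemonUrl, { signal }))`, where `daemonUrl` is a placeholder. Streaming routes (exec, events, git/*) would simply not use the wrapper, which keeps them unbounded.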

Written for commit 3d7c9cf. Summary will update on new commits.

@tlgimenes tlgimenes enabled auto-merge (squash) June 22, 2026 22:30
@tlgimenes tlgimenes merged commit ab68c91 into main Jun 22, 2026
15 checks passed
@tlgimenes tlgimenes deleted the zeta-delphini branch June 22, 2026 22:34
decocms Bot pushed a commit that referenced this pull request Jun 22, 2026
PR: #4066 fix(sandbox): bound quick file ops during link partitions
Bump type: patch

- decocms (apps/mesh/package.json): 3.43.2 -> 3.43.3

Deploy-Scope: server
tlgimenes added a commit that referenced this pull request Jun 22, 2026
…nnect assertion) (#4069)

The Resilience Tests workflow has been red on main since the NATS tunnel
transport landed (#3854) and worsened with #4066. Two independent root causes
in the sandbox↔studio WS-partition suite:

1. Baseline write/read 502s on a healthy link. #4066 bounded quick file ops
   at a 10s QUICK_FILE_OP_TIMEOUT_MS (sandbox-proxy.ts). The daemon gates the
   first quick file op on waitForFirstMounts(), which waits up to 10s
   (FIRST_MOUNT_WAIT_MS, entry.ts) for org-fs mounts. These containers have no
   FUSE, so rclone mount always fails and the wait runs for the full grace
   period; the first write therefore times out at exactly 10s. org-fs can't
   work here and this suite doesn't test it,
   so disable it via DISABLE_ORGFS_MOUNTS on the studio service. The daemon
   then never expects org-fs and skips the wait.

2. "WS restored → reconnects" can never pass. Since #3854 presence is an
   optimistic live /api/links/status probe; /api/links/me no longer returns a
   stored claim's connectedAt. The test asserted `claim.connectedAt > baseline`
   (undefined > undefined → always false), so it timed out every run. Detect
   reconnect by presence reading online again, matching the new model.
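
Both root causes can be modelled in a few lines. The sketch below is hypothetical: only QUICK_FILE_OP_TIMEOUT_MS, FIRST_MOUNT_WAIT_MS, and the `claim.connectedAt > baseline` comparison come from the commit message. The function names, the `Presence` type, and the assumption that a tie between the two 10s budgets goes to the proxy timeout are illustrative.

```typescript
// Root cause 1: two equal 10s budgets stacked on the first quick file op.
const QUICK_FILE_OP_TIMEOUT_MS = 10_000; // proxy-side bound (sandbox-proxy.ts)
const FIRST_MOUNT_WAIT_MS = 10_000; // daemon-side grace (entry.ts)

// How long the daemon holds the first op (hypothetical model).
// mountReadyAfterMs is null when the mount can never succeed (no FUSE).
function firstOpDelayMs(
  orgFsDisabled: boolean,
  mountReadyAfterMs: number | null,
): number {
  if (orgFsDisabled) return 0; // org-fs never expected, so no wait
  if (mountReadyAfterMs === null) return FIRST_MOUNT_WAIT_MS; // full grace
  return Math.min(mountReadyAfterMs, FIRST_MOUNT_WAIT_MS);
}

// Assumption: the proxy aborts once the delay reaches its own budget.
function firstOpTimesOut(delayMs: number): boolean {
  return delayMs >= QUICK_FILE_OP_TIMEOUT_MS;
}

// Root cause 2: the old assertion compared two fields that no longer exist.
// The casts only satisfy the type checker; at runtime both operands are
// undefined, each converts to NaN, and the comparison is false.
function reconnectedByTimestamp(
  claim: { connectedAt?: number },
  baseline?: number,
): boolean {
  return (claim.connectedAt as number) > (baseline as number);
}

// Presence-based check: the link was seen offline and later online again.
type Presence = "online" | "offline";

function reconnectedByPresence(samples: Presence[]): boolean {
  const firstOffline = samples.indexOf("offline");
  return (
    firstOffline !== -1 && samples.indexOf("online", firstOffline + 1) !== -1
  );
}
```

In this model, `firstOpTimesOut(firstOpDelayMs(false, null))` is true, while passing `orgFsDisabled = true` brings the delay to zero, which is the effect DISABLE_ORGFS_MOUNTS has in the commit's description.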

Also folds in the NATS healthcheck hardening (image/healthcheck no longer
depend on shell utilities, with a docker-compose.test.ts guard) and skips the
log-replay suite outright so its heavy beforeAll never runs while its only
tests are skipped.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>