Skip to content

feat(browse): clean-drain disconnect detection + shutdown postmortem#1630

Open
odominguez7 wants to merge 5 commits into
garrytan:mainfrom
odominguez7:feat/browse-shutdown-postmortem
Open

feat(browse): clean-drain disconnect detection + shutdown postmortem#1630
odominguez7 wants to merge 5 commits into
garrytan:mainfrom
odominguez7:feat/browse-shutdown-postmortem

Conversation

@odominguez7
Copy link
Copy Markdown

@odominguez7 odominguez7 commented May 20, 2026

Summary

The browse daemon's launched-mode disconnected handler treats every Chromium disappear as a crash. When a user closes their last tab in headed mode, Chromium naturally quits because it has nothing left to display — and the daemon logs FATAL: Chromium process crashed, process.exit(1)s, and tears down the sidebar-agent along with any in-flight PTY work.

The smoking gun shows up cleanly in any developer's daemon log:

[browse] Tab closed (id=1, remaining=0)
[browse] FATAL: Chromium process crashed or was killed. Server exiting.

Same event, two stories. The exit happens microseconds after the close — and detached server stdio went to FDs nothing read after proc.unref(), so the daemon's own messages disappeared into the void. No log to consult, no postmortem to read.

This PR ships the postmortem instrumentation that surfaced the bug AND the fix.

Three changes:

  1. classifyDisconnect() in browse/src/browser-manager.ts — pure helper, splits disconnected events by pages.size. Zero pages → clean drain (browser-empty-*, exit 0, no FATAL log). Pages still live → real crash (chromium-crash-*, exit 1, FATAL preserved). Wired into both the launched-mode handler (line 291) and the post-rehead handler (line 1393). Collapses two near-duplicate inline blocks into one shared classification.

  2. recordShutdownReason() in browse/src/server.ts — writes <stateDir>/last-shutdown.json atomically (.pid.tmp + fs.renameSync) on every exit path. Payload: {ts, pid, reason, mode, uptimeSeconds, detail}. Specific callers (idle-timeout, parent-exit, SIGINT/SIGTERM, browser-disconnect, browser-empty-, chromium-crash-) record their tagged reason before calling activeShutdown; a fallback shutdown-called reason fires only if nothing else recorded one. Most-recent-wins overwrite is intentional — gated by shutdownReasonRecorded flag so the fallback doesn't clobber specifics.

  3. stdio tee in browse/src/cli.ts — opens <stateDir>/browse-server.log in append mode, points the spawned server's stdio at the fd. Detached server's console.log/console.error now survive proc.unref(). Startup-failure path reads the last 64KB via fs.openSync + fs.statSync + seek-from-end + fs.readSync + fs.closeSync in finally, so a multi-failed-startup log doesn't get fully read into memory.

Plus: BrowserManager.onDisconnect signature gains an optional exitCode?: 0 | 1 | 2 parameter so the classification's exit code flows through to the server's wiring instead of being decorative (the original draft had this as a bug — exit codes claimed but not delivered because the wiring always called activeShutdown(2)).

Test plan

  • bun test browse/test/browser-manager-classify-disconnect.test.ts — 5 new tests, all green. Covers launched/rehead × pagesSize 0/N + reason-tag suffix invariant.
  • bun test browse/test/browser-manager-{unit,custom-chromium}.test.ts browse/test/server-embedder-terminal-port.test.ts browse/test/build-command-response.test.ts — 26 tests, all green. No regressions in neighboring suites that exercise BrowserManager or the server shutdown path.
  • bun run build — clean compile across all binaries + node-compat bundle.
  • Cross-model adversarial review via codex exec --reasoning-effort high (OpenAI Codex CLI). Surfaced 14 findings on the first pass. Addressed the 5 must-fix / should-fix items (exit-code wiring being decorative, postmortem exitCode internal consistency, atomic write, bounded tail read, exit-code pass-through refactor) before requesting handshake. Codex sign-off on the final state: "the fixes close the known gaps without introducing new risk and tests pass."
  • Full-suite pre-existing failure receipts: bun test on this branch and on upstream/main (git checkout upstream/main --detach) both produce 258 failures, identical set. None of the failing files are in this PR's diff. The failures cluster on Bun child-process PATH inheritance in sandboxed shells (env-driven, not regression).

Deferred findings (for transparency)

These are real Codex findings I'm explicitly NOT addressing in this PR — happy to discuss if you'd rather see any of them in this round:

  • Forward race: disconnected could fire before page.on('close') drains the pages map. Observed-failure log shows the close drained first; speculative.
  • Inverse race: a crash with pagesSize === 0 (e.g., crash before first tab) misclassifies as clean drain. wirePageEvents creates the initial tab immediately — degenerate case.
  • Embedder mode recordShutdownReason reads module-level config.stateFile. Same pre-existing pattern that v1.42.1.0's ownsTerminalAgent work flagged as a follow-up.
  • Tunnel-mode interaction with clean-drain disconnect — a tunnel session active during last-tab-close still triggers shutdown. Separate PR territory in pair-agent.
  • No log rotation on browse-server.log — future improvement.
  • (proc as any).__logPath stash — relies on Bun proc being extensible. Stable API.

Files

  • browse/src/browser-manager.ts — new classifyDisconnect helper + onDisconnect signature change + handler refactors
  • browse/src/server.ts — new recordShutdownReason + reason tagging across all exit paths + onDisconnect dispatcher uses passed exit code
  • browse/src/cli.ts — stdio tee + bounded tail read on startup failure
  • browse/test/browser-manager-classify-disconnect.test.ts — new 5-test suite

Versioning

VERSION, CHANGELOG.md, and package.json are intentionally NOT bumped in this PR — deferring to maintainer integration so this branch doesn't claim a release slot. The functional commits stand on their own; the 5th commit on this branch (chore: defer VERSION + CHANGELOG bump to maintainer integration) restores those files to upstream/main values.

odominguez7 and others added 5 commits May 20, 2026 18:51
After proc.unref() in startServer(), the spawned server's stdout/stderr
write to FDs that nothing reads. Every console.log/error from the
detached server is lost, so a 'why did the daemon die' question has no
log to consult.

Open the log file in append mode, point the child's stdio at the fd,
stash the path on the proc handle so the startup-failure path can read
the tail when the pipe is no longer attached. Tail read is bounded to
the last 64KB via seek-from-end so a multi-failed-startup log doesn't
get fully read into memory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…n-drain vs crash

The launched-mode browser.on('disconnected') handler treated every
Chromium disappear as a crash. When the user closes their last tab,
Chromium naturally quits because it has nothing left to display and
fires 'disconnected'. The old code printed 'FATAL: Chromium process
crashed' and process.exit(1) on that path, killing the sidebar-agent
and any in-flight PTY work.

Smoking gun in the daemon log:
  [browse] Tab closed (id=1, remaining=0)
  [browse] FATAL: Chromium process crashed or was killed. Server exiting.

classifyDisconnect({pagesSize, mode}) is a pure helper that splits by
pages.size: zero pages means clean drain (browser-empty-{launched,rehead},
exit 0, no FATAL); pages still live means real crash (chromium-crash-*,
exit 1, FATAL preserved). Wired into both the launched-mode handler and
the post-rehead handler — they collapse from two near-duplicate inline
blocks into one shared classification.

onDisconnect signature gains an optional exitCode parameter so the
classification's exit code flows through to the server's wiring instead
of being decorative. The existing browser-disconnect call site from
the headed-mode page-close chain omits exitCode and gets the default
2 (preserves the user-close convention).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…very exit path

Every shutdown — idle-timeout, parent-exit, SIGINT/SIGTERM,
browser-disconnect, the new browser-empty-* clean drain, chromium-crash-*,
and explicit /stop — now writes <stateDir>/last-shutdown.json with
{ts, pid, reason, mode, uptimeSeconds, detail}. Written atomically via
tmp-then-rename (${outPath}.${pid}.tmp) so two concurrent browse
instances sharing a state dir cannot interleave JSON bytes.

The shutdownReasonRecorded flag gates the fallback 'shutdown-called'
reason in buildFetchHandler.shutdown() — specific callers record their
own more-specific reason first; the fallback only fires when nothing
else did. Most-recent-wins overwrite is intentional (a SIGINT that
arrives after a browser-disconnect IS the actual cause of teardown).

onDisconnect dispatcher uses the exit code passed by classifyDisconnect
(0 for clean drain, 1 for crash) or defaults to 2 for the existing
user-close convention from the page-close chain.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ELOG (v1.43.1.0)

5-test suite pins all four (mode × pagesSize) combinations of
classifyDisconnect plus the reason-tag suffix invariant. If a future
refactor changes the reason format or breaks the clean-drain split,
CI catches it before the next user's sidebar dies.

Cross-model adversarial pass via /codex challenge surfaced 14 findings;
the 5 must-fix and should-fix items (exit-code wiring being decorative,
postmortem exitCode internal consistency, atomic write, bounded tail
read, exit-code pass-through refactor) landed before sign-off.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Restore VERSION, CHANGELOG.md, and package.json to upstream/main values so
this PR doesn't claim a release slot. Maintainer can bump and write the
entry when integrating into the next wave.

Keeps all functional changes (classifyDisconnect helper, recordShutdownReason
postmortem, stdio tee, 5-test regression suite). Only undoes the version
claim.
@odominguez7 odominguez7 changed the title v1.43.1.0 feat(browse): clean-drain disconnect detection + shutdown postmortem (last-shutdown.json + tee'd server log) feat(browse): clean-drain disconnect detection + shutdown postmortem May 20, 2026
@jbetala7
Copy link
Copy Markdown
Contributor

This needs a rebase against the now-merged launch-hardening wave #1629 before the clean-disconnect part is safe to review. Current origin/main already has the resolveDisconnectCause() / handleChromiumDisconnect() path from #1629, which waits for the Chromium child-process exit and classifies clean quit from exitCode === 0 rather than from tab bookkeeping.

This PR's new classifyDisconnect({ pagesSize }) goes back to inferring clean-vs-crash from this.pages.size. That can be wrong in both directions: a clean browser quit can fire before page-close bookkeeping drains, and a real crash can happen after the page map has already dropped to zero. The PR body even calls out that race as deferred, but #1629 gives this code a stronger signal now.

Can you rebase and keep the postmortem / stdio-tee pieces on top of the existing resolveDisconnectCause() exit-code classification, rather than replacing it with a pages.size classifier? The regression should pin reason recording and shutdown behavior off the child-process exit code, not the page count.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants