Diagnostic: capture a hard-pegged prerender render (CDP pause + gated V8 --prof → S3) #5243
habdelra wants to merge 19 commits into
…d --prof)

When a prerender render pegs the renderer's main thread, the existing timeout-path diagnostics can't name the offending function: the CDP `Profiler` only yields samples at `stop`, which the pegged thread can't serialize, and the out-of-band trace stream samples continuously, so its own overhead can perturb a timing-sensitive wedge enough to dissolve it.

Add a capture that reads the peg from outside without instrumenting it:

- `Debugger.pause` (primary, always on the timeout path): V8 honors the pause at the next interrupt check (a loop back-edge / call), so it lands inside a synchronous loop without the loop yielding. This is the mechanism behind the DevTools "pause" button on a hung page. Zero overhead until the single pause, so it can't mask the wedge. Reports the call frames, the live stack depth (a runaway recursion shows a huge total), and heap used (flat → tight compute/recursion; climbing → combinatorial breadth). Runs out-of-process over CDP, so it survives a pegged JS thread, and only on the timeout path, so it's free on a healthy render.
- `--prof` (gated behind `PRERENDER_V8_PROF`, off by default): V8's kernel-SIGPROF sampler, armed at Chrome launch. The signal preempts the thread regardless of what it's running, so it also samples a peg stuck in a non-yielding NATIVE call, where `Debugger.pause` can't get a back-edge and returns `pause-timeout`. It samples every render, so it can perturb a timing-sensitive wedge; enable it only as the native-peg fallback.

Both surface in the prerender-server timeout log line (`pausedStack:`, `v8ProfTopFrames:`).

Also lower the render timeout 90s → 60s: a healthy render finishes well under 30s, so 60s cleanly separates a genuine wedge from the slowest legitimate cards, and (with the request-timeout overhead) fires the render-level timeout, and this capture, before the request-level abort gives up on the render.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
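The one-shot pause described in this commit can be sketched roughly as follows. This is not the PR's `pause-capture.ts`: the `CdpLike` interface, `capturePausedStack`, `withTimeout`, and the timeout defaults are hypothetical names chosen so the sketch runs without a browser. The CDP methods it calls (`Debugger.enable`, `Debugger.pause`, `Debugger.resume`, `Runtime.getHeapUsage`) and the `Debugger.paused` event are real protocol members.

```typescript
// Minimal slice of a CDP session. Puppeteer's CDPSession offers send/once with
// richer types; this narrow interface is an assumption made for the sketch.
interface CdpLike {
  send(method: string, params?: object): Promise<any>;
  once(event: string, handler: (payload: any) => void): void;
}

interface PausedCapture {
  status: "paused" | "pause-timeout" | "debugger-enable-timeout";
  frames: string[]; // top frames, innermost first
  stackDepth: number; // total live frames (huge for a runaway recursion)
  jsHeapUsedMB?: number;
}

const TIMED_OUT = Symbol("timed-out");

// Resolve with TIMED_OUT if `p` has not settled within `ms`.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T | typeof TIMED_OUT> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve(TIMED_OUT), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

async function capturePausedStack(
  cdp: CdpLike,
  stepTimeoutMs = 2000, // hypothetical default
  maxFrames = 5, // hypothetical default
): Promise<PausedCapture> {
  const none = { frames: [] as string[], stackDepth: 0 };

  // A hard-pegged thread may never answer this; report that instead of hanging.
  const enabled = await withTimeout(cdp.send("Debugger.enable"), stepTimeoutMs);
  if (enabled === TIMED_OUT) return { status: "debugger-enable-timeout", ...none };

  // Subscribe before requesting the pause so the event cannot be missed.
  const pausedEvent = new Promise<any>((resolve) => cdp.once("Debugger.paused", resolve));
  cdp.send("Debugger.pause").catch(() => undefined);
  const paused = await withTimeout(pausedEvent, stepTimeoutMs);
  if (paused === TIMED_OUT) return { status: "pause-timeout", ...none };

  const callFrames: any[] = paused.callFrames ?? [];
  // CDP line numbers are 0-based, so add 1 for display.
  const frames = callFrames
    .slice(0, maxFrames)
    .map((f) => `${f.functionName || "(anonymous)"} (${f.url}:${f.location.lineNumber + 1})`);

  // While paused, the inspector still services messages, so heap usage is readable.
  const heap = await withTimeout(cdp.send("Runtime.getHeapUsage"), stepTimeoutMs);
  const jsHeapUsedMB =
    heap === TIMED_OUT ? undefined : Math.round(heap.usedSize / (1024 * 1024));

  cdp.send("Debugger.resume").catch(() => undefined);
  return { status: "paused", frames, stackDepth: callFrames.length, jsHeapUsedMB };
}
```

Each step is time-boxed, so a starved session yields a status string rather than a hung capture; that status is what tells the operator to fall back to `--prof`.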
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6154e9feba
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The timeout-path processor scanned the shared OS temp dir and picked the largest matching `--prof` log, so a stale log from an earlier run in a reused container (or another renderer that ran longer without wedging) could outrank the timed-out page and point `v8ProfTopFrames` at the wrong stack. Clear pre-existing logs at browser launch and stamp the launch time, then on the timeout path consider only logs written since launch and pick the most-recently-written one — the renderer still spinning into the timeout, not an older completed render's larger-but-stale log. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Host Test Results
1 files ±0, 1 suites ±0, 1h 37m 57s ⏱️ (-2m 24s)
Results for commit 87ad20d. ± Comparison against earlier commit fa10cf1.

Realm Server Test Results
1 files ±0, 1 suites ±0, 11m 7s ⏱️ (-41s)
Results for commit 87ad20d. ± Comparison against earlier commit fa10cf1.
Adds Mode I: reading a render wedged in a synchronous CPU peg from outside, when the thread is too pegged to service the CDP message that arms a debugger/profiler (`Debugger.enable` / `Profiler.enable` time out) and Mode H's captures are starved:

- `pausedStack`: the always-on one-shot `Debugger.pause` on the timeout path (zero overhead until the pause, so it can't mask), with its `<pause-timeout>` / `<debugger-enable-timeout>` states.
- `v8ProfTopFrames`: the `--prof` kernel-`SIGPROF` sampler (knob `PRERENDER_V8_PROF`) that preempts the pegged thread without cooperation, the fallback that reads a hard peg the CDP captures can't.

Also adds the `PRERENDER_V8_PROF` row to the Mode H knobs table and a cross-reference from Mode H's limits, and notes the 60s render timeout that fires the capture before the request-level abort.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The renderer's V8 `--prof` log was being written (PRERENDER_V8_PROF=true,
browser launched with the flag) but the timeout-path summarizer reported
`<no v8 --prof log found>`: V8's per-isolate logging prepends
`isolate-<addr>-` to the --logfile name, and the reader matched with
`startsWith('prerender-v8-prof-')`, which never matches the prefixed file.
Match the prefix with `includes` (in both the launch cleanup and the
reader), pick the LARGEST this-run log (the pegged isolate accumulates the
most samples over the peg), and on a miss list the `.log` files actually
present so a still-wrong pattern is self-diagnosing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
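The matching fix above can be illustrated with a small pure-function sketch. `LogEntry`, `isProfLog`, `thisRunProfLogs`, and `describeMiss` are hypothetical names, and the wording appended to the miss message is an assumption; only the `prerender-v8-prof-` marker, the `isolate-<addr>-` prefix behavior, and the `<no v8 --prof log found>` string come from the commit message.

```typescript
interface LogEntry {
  name: string;
  sizeBytes: number;
  mtimeMs: number;
}

// Base name given to V8's --logfile, as quoted in the commit message.
const PROF_LOG_MARKER = "prerender-v8-prof-";

// V8 prepends "isolate-<addr>-" to the file name, so a startsWith() check on
// the marker never matches; includes() does.
function isProfLog(name: string): boolean {
  return name.includes(PROF_LOG_MARKER);
}

// Keep only logs written since this browser launch, so a stale log from an
// earlier run in a reused container cannot be picked.
function thisRunProfLogs(entries: LogEntry[], launchedAtMs: number): LogEntry[] {
  return entries.filter((e) => isProfLog(e.name) && e.mtimeMs >= launchedAtMs);
}

// On a miss, list the .log files that are present so a still-wrong pattern
// is self-diagnosing. The exact message format here is an assumption.
function describeMiss(entries: LogEntry[]): string {
  const present = entries.filter((e) => e.name.endsWith(".log")).map((e) => e.name);
  return `<no v8 --prof log found> .log files present: ${present.join(", ") || "none"}`;
}
```

For example, a file named `isolate-0x1234-prerender-v8-prof-1.log` passes `isProfLog` but would have been skipped by the earlier `startsWith` check.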
The renderer's --prof logs live only in the container's ephemeral /tmp (no EFS mount on the prerenderer), so they're reclaimed when the ECS task replaces — but a long-lived task that had PRERENDER_V8_PROF on, then off, would keep the "on" period's logs around because the launch-time cleanup was gated on the flag being enabled. Run the stale-log sweep on every browser launch regardless of the flag, and only stamp the launch time / add the --prof launch flag when enabled. So flipping the flag off + restarting now actually clears the logs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
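A minimal sketch of the launch-time decision this commit describes: the sweep always runs, while the `--prof` flag and the launch stamp are added only when the knob is on. `planBrowserLaunch` and `LaunchPlan` are hypothetical names, and passing the V8 flags through Chrome's `--js-flags` switch is an assumption about how the flag is armed; the PR text only says `--prof` is added at launch with a `--logfile` name.

```typescript
interface LaunchPlan {
  staleLogsToDelete: string[]; // swept on every launch
  extraArgs: string[]; // Chrome args added only when --prof is enabled
  launchedAtMs: number | null; // stamped only when --prof is enabled
}

function planBrowserLaunch(
  profEnabled: boolean, // e.g. PRERENDER_V8_PROF === "true"
  tmpDirFiles: string[], // file names currently in the temp dir
  logPath: string, // hypothetical --logfile target
  nowMs: number,
): LaunchPlan {
  // Sweep regardless of the flag, so turning it off and restarting clears
  // logs left behind by an earlier "on" period.
  const staleLogsToDelete = tmpDirFiles.filter((n) => n.includes("prerender-v8-prof-"));

  if (!profEnabled) {
    return { staleLogsToDelete, extraArgs: [], launchedAtMs: null };
  }
  return {
    staleLogsToDelete,
    // Assumption: V8 flags reach the renderer via Chrome's --js-flags switch.
    extraArgs: [`--js-flags=--prof --logfile=${logPath}`],
    launchedAtMs: nowMs,
  };
}
```

Returning a plan rather than touching the filesystem keeps the gating logic testable without Chrome or a temp directory.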
The prerender renderer only ever renders the realm's own trusted cards (never untrusted web content) and always runs in a container where the sandbox must be off for Chrome to launch — and for the renderer to write the V8 `--prof` diagnostic log to /tmp. Drop the CI / PUPPETEER_DISABLE_SANDBOX gate and pass --no-sandbox unconditionally, so it never silently depends on an env var being set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
v8ProfTopFrames came back absent on the bxl wedge with no reason, because the reader's catch swallowed the error and returned null. That left the live-vs-exit question (is the on-disk --prof log usable while the renderer is still running, or only after it closes?) unanswerable. Surface it: report the picked log basename + size in every path; on a prof-process failure return a descriptive string (incl. a "killed: timed out" marker) instead of null; on any reader error return the message rather than null; raise the stdout cap to 128MB and the time-box to 40s. Now the timeout line always shows either the frames + log size, or the exact reason + size — which decides whether the log is written live or is empty/incomplete mid-run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…container A hard synchronous CPU peg starves CDP (Debugger.enable / Profiler.enable time out), so V8 `--prof` — kernel-SIGPROF sampled, written by a separate thread to a file — is the only capture that survives it. But that file accumulates every render on the isolate since browser launch (tens of MB), and `node --prof-process` on it blows the render-timeout budget — which is why the in-container summarizer kept returning nothing. Stop parsing in-container. Add a `v8log` artifact kind and, on the render timeout (when armed), stream the pegged isolate's raw `--prof` log to the prerender S3 artifacts bucket via the existing sink — keyed by realm/card/ job so the wedging task's log is self-identifying across the fleet (no shelling into 4 servers to find it). Symbolize it offline with `node --prof-process`, where there's no deadline and the peg dominates the top self-time frames. The timeout log line reports the upload + key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ization Mode I and the Mode H knob table now describe the real flow: on a render timeout the raw V8 --prof log is streamed to the prerender artifacts bucket as a `v8log` (the timeout line reports the upload + key), and is symbolized OFFLINE with `node --prof-process` (the log self-contains its code-creation map, so JS frames resolve without binaries/sourcemaps). Replaces the old "v8ProfTopFrames on the log line" in-container-parse model that blew the timeout budget on a large log. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
symbolize-prerender-wedge.sh resolves the newest `.v8log` artifact for a realm in the prerender artifacts bucket (or an exact --key), downloads it, and runs `node --prof-process` to print the [Summary]/[JavaScript]/[Bottom up] sections — the push-button version of the three manual aws+node steps. The log self-contains its code-creation map, so JS frames resolve offline without binaries or source maps; native/Chrome frames stay opaque. Mode I now points at the helper as the primary path and keeps the raw commands as the under-the-hood reference. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The raw --prof log runs tens of MB and `--logfile` accumulates them in the container's temp dir across the browser's long life. Now that the wedged page is evicted on a render timeout (its isolate tears down and the next visit writes a fresh log), the uploaded log strands nothing — so reclaim the disk by deleting it once the upload is durable. uploadArtifact now returns whether the object actually landed, so the delete is gated on a confirmed upload: a disabled/declined/failed sink keeps the local copy rather than destroying a capture that was never persisted. The timeout log line reports which path ran. Mode I of the indexing-diagnostics skill is updated to match the status wording. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
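The delete-only-after-confirmed-upload contract can be sketched as below. `uploadThenReclaim`, `UploadFn`, `RemoveFn`, and the returned status strings are hypothetical; the real `uploadArtifact` signature is not shown in this conversation, so the sketch only assumes it resolves to `true` when the object actually landed.

```typescript
// Resolves true only if the object actually landed in the bucket.
type UploadFn = (localPath: string) => Promise<boolean>;
type RemoveFn = (localPath: string) => Promise<void>;

async function uploadThenReclaim(
  localPath: string,
  upload: UploadFn,
  remove: RemoveFn,
): Promise<string> {
  let persisted = false;
  try {
    persisted = await upload(localPath);
  } catch {
    // A thrown upload counts as not persisted.
    persisted = false;
  }

  // Disabled, declined, or failed sink: keep the only copy of the capture.
  if (!persisted) return "kept local log (upload not persisted)";

  await remove(localPath);
  return "uploaded, local log deleted";
}
```

Injecting the uploader and remover lets the contract be exercised standalone, in the same spirit as the PR's tests that run without PG, Chrome, or S3.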
… selection Extend the artifact-sink unit test for the `v8log` kind suffix and the new boolean return (disabled sink → false, the contract the --prof delete gates on). Add a v8-prof unit test covering flag gating, the launch flags, the stale-log sweep, and how uploadV8ProfLog selects the pegged isolate's log: largest this-run wins, isolate-prefixed names match, logs from before the run are excluded, and a capture that can't be persisted is KEPT on disk rather than deleted. All run standalone (no PG / Chrome / S3). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Coerce the nullable status to a string before asserting on it rather than guarding with '&&' inside the assertion argument. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
Adds new prerender “hard wedge” diagnostics so synchronous main-thread pegs can still be inspected even when CDP is starved, by capturing a one-shot paused stack (CDP Debugger.pause) and optionally arming V8’s kernel --prof sampler and uploading the raw tick log to the artifacts bucket for offline symbolization.
Changes:
- Introduces `pause-capture.ts` (timeout-path debugger pause + heap usage) and `v8-prof.ts` (opt-in V8 `--prof` capture + artifact upload + local cleanup on confirmed persistence).
- Extends the prerender artifact sink with a new `v8log` kind and changes `uploadArtifact()` to return `boolean` indicating persistence success.
- Adds offline tooling/docs (`symbolize-prerender-wedge.sh`, indexing-diagnostics skill updates) plus realm-server tests for log selection and sink contract.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| packages/realm-server/prerender/pause-capture.ts | New CDP Debugger.pause-based timeout-path stack/heap capture utility. |
| packages/realm-server/prerender/v8-prof.ts | New opt-in V8 --prof log selection + streaming upload as v8log artifact. |
| packages/realm-server/prerender/artifact-sink.ts | Adds v8log artifact kind; uploadArtifact now returns true/false for persistence. |
| packages/realm-server/prerender/utils.ts | Hooks new timeout-path diagnostics into render timeout logging. |
| packages/realm-server/prerender/browser-manager.ts | Arms --prof at launch when enabled; adjusts sandbox flags behavior. |
| packages/realm-server/prerender/prerender-constants.ts | Reduces default render timeout from 90s to 60s. |
| packages/realm-server/scripts/symbolize-prerender-wedge.sh | New helper to fetch latest .v8log from S3 and run node --prof-process. |
| packages/realm-server/tests/prerender-v8-prof-test.ts | New unit tests for v8-prof gating, sweeping, and log-selection/retention behavior. |
| packages/realm-server/tests/prerender-artifact-sink-test.ts | Adds coverage for v8log suffix and uploadArtifact false-return contract when disabled. |
| packages/realm-server/tests/index.ts | Registers the new prerender-v8-prof-test in the realm-server test index. |
| .claude/skills/indexing-diagnostics/SKILL.md | Documents the new Mode I flow and PRERENDER_V8_PROF knob. |
…anups

- browser-manager: the renderer executes user-authored card code, so keep the Chromium sandbox on where it works and disable it only where it must be off: in the container (PUPPETEER_DISABLE_SANDBOX), in CI, and when the V8 --prof diagnostic is armed (a sandboxed renderer can't write the log).
- pause-capture: if createCDPSession outlives its timeout, detach the late-resolving session so it isn't orphaned on the page.
- utils: compute formatPausedStack once for the timeout log line.
- symbolize-prerender-wedge.sh: match the realm with grep -F so a realm containing regex metacharacters can't match unintended keys.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 003f1e5afc
…us pick Selecting the largest this-run isolate log is a heuristic: a 60s CPU peg samples continuously so its log dwarfs IO-bound renderers', but under several concurrent renderers a long-lived busy isolate could rival it. pickV8ProfLog now reports the candidate count + runner-up size, and the timeout line surfaces them so a mis-pick is visible (re-symbolize another with the helper's --key). The post-upload delete now runs only when there's a single this-run log — then it's unmistakably the wedged, evicted isolate's and the unlink frees real disk. With multiple candidates the largest may be a still-live renderer's log, where unlinking frees nothing (V8 holds the fd) and loses its samples; keep them all for the next browser-launch sweep instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
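A rough sketch of the selection heuristic with its mis-pick signals. `Candidate`, `PickResult`, `pickLargest`, and `formatPick` are hypothetical names and the log-line format is invented for illustration; the PR's `pickV8ProfLog` may be shaped differently. The rules it encodes (largest wins, report candidate count and runner-up size, delete only when there is a single this-run log) are taken from the commit message.

```typescript
interface Candidate {
  name: string;
  sizeBytes: number;
}

interface PickResult {
  picked: Candidate;
  candidateCount: number;
  runnerUpBytes: number | null; // null when there is no second candidate
  deleteAfterUpload: boolean; // true only when the pick is unambiguous
}

function pickLargest(candidates: Candidate[]): PickResult | null {
  if (candidates.length === 0) return null;
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...candidates].sort((a, b) => b.sizeBytes - a.sizeBytes);
  return {
    picked: sorted[0],
    candidateCount: sorted.length,
    runnerUpBytes: sorted.length > 1 ? sorted[1].sizeBytes : null,
    // With several candidates the largest may belong to a still-live renderer,
    // so keep everything for the next browser-launch sweep.
    deleteAfterUpload: sorted.length === 1,
  };
}

// Hypothetical log-line fragment; a runner-up close in size hints at a mis-pick.
function formatPick(p: PickResult): string {
  const runnerUp = p.runnerUpBytes === null ? "none" : `${p.runnerUpBytes}B`;
  return `${p.picked.name} (${p.picked.sizeBytes}B, candidates=${p.candidateCount}, runnerUp=${runnerUp})`;
}
```

When the runner-up size is close to the picked size, the operator can re-symbolize the other log with the helper's `--key` option, as the commit message suggests.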
The prerender task runs only in a container where the sandbox can't initialize, so Chrome won't start with it on — it has always run sandbox-off and is required to. Force it unconditionally rather than gating on CI / PUPPETEER_DISABLE_SANDBOX, so a missing env var can't silently break launch. The security boundary is the task itself: the prerenderer is a separate, segregated ECS task isolated from the realm-server. The comment no longer claims the renderer only runs trusted cards. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a card pegs the prerender renderer's main thread synchronously, the render-timeout path goes blind: CDP itself is starved, so `Debugger.enable` / `Profiler.enable` time out and we get no stack. This adds two out-of-process captures on the render-timeout path so a hard peg can still be read, plus the offline tooling and docs to use them.

Captures (both opt-in, render-timeout path only; free on a healthy render)

- `Debugger.pause` + heap (`pause-capture.ts`) → a `pausedStack` on the timeout log line, with `jsHeapUsedMB`. Works while the thread still services CDP; when it can't, it self-reports the reason (`<pause-timeout>` / `<debugger-enable-timeout>`) rather than hanging, which is the signal to reach for `--prof`.
- `--prof` kernel sampler (`v8-prof.ts`, knob `PRERENDER_V8_PROF`) → armed at Chrome launch. The kernel SIGPROF timer preempts the pegged thread and a separate thread writes the samples to disk, so it records the spinning frame even when CDP is dead. The raw log (tens of MB) is too large to symbolize in-container under the timeout budget, so on the timeout it's streamed to the prerender S3 artifacts bucket as a `v8log`, keyed by realm/card/job. Once the upload is durable the local log is deleted so these don't accumulate on the container disk; a sink that's disabled/declined/failed keeps the local copy rather than destroying an unrecoverable capture. The timeout line reports which path ran.

Supporting changes

- Render timeout lowered from 90s to 60s (`prerender-constants.ts`) so the render-level timeout, and its hang diagnostics, fire before the request-level abort gives up on the render. A healthy render completes well under 30s.
- `--no-sandbox` for the renderer (`browser-manager.ts`): the sandbox blocks the renderer from writing the `--prof` log.
- `symbolize-prerender-wedge.sh`: fetches the newest matching `.v8log` from S3 and runs `node --prof-process`. The log self-contains its `code-creation` map, so JS frames resolve offline with no binaries or source maps; native/Chrome frames stay opaque.
- Indexing-diagnostics skill: documents the Mode I flow and the `PRERENDER_V8_PROF` knob.

Tests

Standalone unit tests (no PG / Chrome / S3): the `v8log` artifact kind + key suffix, `uploadArtifact`'s `true`/`false` upload contract, and `v8-prof` log selection (the largest this-run isolate log wins, isolate-prefixed names match, pre-run logs are excluded), plus the keep-on-non-persisted-upload safety property.

Validated

Captured a real staging wedge end-to-end: a `render.meta` peg on a densely cross-referenced card, `mainThreadResponsive=false`, `pausedStack: <debugger-enable-timeout>` (CDP starved), and a 42MB `--prof` log uploaded then symbolized offline, showing hot frames in `getOrCreateRealmResource` / relationship-dependency tracking and per-`getFields` autosave subscribe/unsubscribe churn. This PR is observability only; the underlying wedge is addressed in CS-11580.