Diagnostic: capture a hard-pegged prerender render (CDP pause + gated V8 --prof → S3)#5243

Open
habdelra wants to merge 19 commits into main from worktree-prerender-wedge-observe

@habdelra habdelra commented Jun 15, 2026

Contributor

When a card pegs the prerender renderer's main thread synchronously, the render-timeout path goes blind: CDP itself is starved, so Debugger.enable / Profiler.enable time out and we get no stack. This adds two out-of-process captures on the render-timeout path so a hard peg can still be read, plus the offline tooling and docs to use them.

Captures (both opt-in, render-timeout path only — free on a healthy render)

  • CDP Debugger.pause + heap (pause-capture.ts) → a pausedStack on the timeout log line, with jsHeapUsedMB. Works while the thread still services CDP; when it can't, it self-reports the reason (<pause-timeout> / <debugger-enable-timeout>) rather than hanging — the signal to reach for --prof.
  • V8 --prof kernel sampler (v8-prof.ts, knob PRERENDER_V8_PROF) → armed at Chrome launch. The kernel SIGPROF timer preempts the pegged thread and a separate thread writes the samples to disk, so it records the spinning frame even when CDP is dead. The raw log (tens of MB) is too large to symbolize in-container under the timeout budget, so on the timeout it's streamed to the prerender S3 artifacts bucket as a v8log, keyed by realm/card/job. Once the upload is durable the local log is deleted so these don't accumulate on the container disk; a sink that's disabled/declined/failed keeps the local copy rather than destroying an unrecoverable capture. The timeout line reports which path ran.

Supporting changes

  • Render timeout 90s → 60s (prerender-constants.ts) so the render-level timeout — and its hang diagnostics — fire before the request-level abort gives up on the render. A healthy render completes well under 30s.
  • Unconditional --no-sandbox for the renderer (browser-manager.ts) — the sandbox blocks the renderer from writing the --prof log.
  • symbolize-prerender-wedge.sh — fetches the newest matching .v8log from S3 and runs node --prof-process. The log self-contains its code-creation map, so JS frames resolve offline with no binaries or source maps; native/Chrome frames stay opaque.
  • indexing-diagnostics skill — Mode I (the capture + offline-symbolization workflow) and the Mode H PRERENDER_V8_PROF knob.

Tests

Standalone unit tests (no PG / Chrome / S3): the v8log artifact kind + key suffix, uploadArtifact's true/false upload contract, and v8-prof log selection — the largest this-run isolate log wins, isolate-prefixed names match, pre-run logs are excluded — plus the keep-on-non-persisted-upload safety property.

Validated

Captured a real staging wedge end-to-end: a render.meta peg on a densely cross-referenced card, mainThreadResponsive=false, pausedStack: <debugger-enable-timeout> (CDP starved), and a 42MB --prof log uploaded then symbolized offline — hot frames in getOrCreateRealmResource / relationship-dependency tracking and per-getFields autosave subscribe/unsubscribe churn. This PR is observability only; the underlying wedge is addressed in CS-11580.

…d --prof)

When a prerender render pegs the renderer's main thread, the existing
timeout-path diagnostics can't name the offending function: the CDP
`Profiler` only yields samples at `stop`, which the pegged thread can't
serialize, and the out-of-band trace stream samples continuously, so its
own overhead can perturb a timing-sensitive wedge enough to dissolve it.

Add a capture that reads the peg from outside without instrumenting it:

- `Debugger.pause` (primary, always on the timeout path): V8 honors the
  pause at the next interrupt check (a loop back-edge / call), so it lands
  inside a synchronous loop without the loop yielding — the mechanism
  behind the DevTools "pause" button on a hung page. Zero overhead until
  the single pause, so it can't mask the wedge. Reports the call frames,
  the live stack depth (a runaway recursion shows a huge total), and heap
  used (flat → tight compute/recursion; climbing → combinatorial breadth).
  Runs out-of-process over CDP, so it survives a pegged JS thread, and only
  on the timeout path, so it's free on a healthy render.

- `--prof` (gated behind `PRERENDER_V8_PROF`, off by default): V8's
  kernel-SIGPROF sampler, armed at Chrome launch. The signal preempts the
  thread regardless of what it's running, so it also samples a peg stuck in
  a non-yielding NATIVE call — where `Debugger.pause` can't get a back-edge
  and returns `pause-timeout`. It samples every render, so it can perturb a
  timing-sensitive wedge; enable it only as the native-peg fallback.

Both surface in the prerender-server timeout log line (`pausedStack:`,
`v8ProfTopFrames:`).

Also lower the render timeout 90s → 60s: a healthy render finishes well
under 30s, so 60s cleanly separates a genuine wedge from the slowest
legitimate cards, and — with the request-timeout overhead — fires the
render-level timeout (and this capture) before the request-level abort
gives up on the render.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
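
The self-reporting behaviour described for `Debugger.pause` (returning a reason such as `<debugger-enable-timeout>` or `<pause-timeout>` instead of hanging) can be sketched by racing each CDP step against a timer. This is a minimal illustration, not the code in `pause-capture.ts`: the `CdpSessionLike` interface, `withTimeout`, and `capturePausedStack` are hypothetical names, and the real capture also reports stack depth and heap usage.

```typescript
// Hypothetical minimal view of a CDP session; NOT the real Puppeteer type.
interface PausedEvent {
  callFrames: { functionName: string }[];
}
interface CdpSessionLike {
  send(method: string): Promise<unknown>;
  // Resolves with the payload of the next event of the given name.
  waitForEvent(name: string): Promise<PausedEvent>;
}

const TIMED_OUT = Symbol('timed-out');

// Race a promise against a timer so a starved CDP call cannot hang the caller.
async function withTimeout<T>(
  p: Promise<T>,
  ms: number,
): Promise<T | typeof TIMED_OUT> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMED_OUT), ms);
  });
  try {
    return await Promise.race([p, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Returns the paused call frames, or a marker naming the step that starved.
async function capturePausedStack(
  session: CdpSessionLike,
  ms: number,
): Promise<string> {
  const enabled = await withTimeout(session.send('Debugger.enable'), ms);
  if (enabled === TIMED_OUT) return '<debugger-enable-timeout>';

  // Subscribe before requesting the pause so the event cannot be missed.
  const paused = session.waitForEvent('Debugger.paused');
  void session.send('Debugger.pause').catch(() => undefined);

  const event = await withTimeout(paused, ms);
  if (event === TIMED_OUT) return '<pause-timeout>';
  return event.callFrames
    .map((f) => f.functionName || '(anonymous)')
    .join(' < ');
}
```

Either marker on the timeout line is the cue that the thread is too pegged for CDP, which is the case the `--prof` path is meant to cover.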

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6154e9feba

Comment thread packages/realm-server/prerender/v8-prof.ts Outdated
habdelra and others added 2 commits June 15, 2026 19:20
The timeout-path processor scanned the shared OS temp dir and picked the
largest matching `--prof` log, so a stale log from an earlier run in a
reused container (or another renderer that ran longer without wedging)
could outrank the timed-out page and point `v8ProfTopFrames` at the wrong
stack. Clear pre-existing logs at browser launch and stamp the launch
time, then on the timeout path consider only logs written since launch and
pick the most-recently-written one — the renderer still spinning into the
timeout, not an older completed render's larger-but-stale log.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions Bot commented Jun 15, 2026

Contributor

Host Test Results

1 file ±0 · 1 suite ±0 · ⏱️ 1h 37m 57s (-2m 24s)
3 080 tests +2: 3 065 ✅ +2, 15 💤 ±0, 0 ❌ ±0
3 099 runs +2: 3 084 ✅ +2, 15 💤 ±0, 0 ❌ ±0

Results for commit 87ad20d. ± Comparison against earlier commit fa10cf1.

Realm Server Test Results

1 file ±0 · 1 suite ±0 · ⏱️ 11m 7s (-41s)
1 716 tests ±0: 1 716 ✅ ±0, 0 💤 ±0, 0 ❌ ±0
1 809 runs ±0: 1 809 ✅ ±0, 0 💤 ±0, 0 ❌ ±0

Results for commit 87ad20d. ± Comparison against earlier commit fa10cf1.

@habdelra habdelra marked this pull request as draft June 16, 2026 00:16
habdelra and others added 3 commits June 15, 2026 20:39
Adds Mode I — reading a render wedged in a synchronous CPU peg from
outside, when the thread is too pegged to service the CDP message that
arms a debugger/profiler (`Debugger.enable` / `Profiler.enable` time out)
and Mode H's captures are starved:

- `pausedStack` — the always-on one-shot `Debugger.pause` on the timeout
  path (zero overhead until the pause, so it can't mask), with its
  `<pause-timeout>` / `<debugger-enable-timeout>` states.
- `v8ProfTopFrames` — the `--prof` kernel-`SIGPROF` sampler (knob
  `PRERENDER_V8_PROF`) that preempts the pegged thread without cooperation,
  the fallback that reads a hard peg the CDP captures can't.

Also adds the `PRERENDER_V8_PROF` row to the Mode H knobs table and a
cross-reference from Mode H's limits, and notes the 60s render timeout
that fires the capture before the request-level abort.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The renderer's V8 `--prof` log was being written (PRERENDER_V8_PROF=true,
browser launched with the flag) but the timeout-path summarizer reported
`<no v8 --prof log found>`: V8's per-isolate logging prepends
`isolate-<addr>-` to the --logfile name, and the reader matched with
`startsWith('prerender-v8-prof-')`, which never matches the prefixed file.

Match the prefix with `includes` (in both the launch cleanup and the
reader), pick the LARGEST this-run log (the pegged isolate accumulates the
most samples over the peg), and on a miss list the `.log` files actually
present so a still-wrong pattern is self-diagnosing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
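
The selection rule in the commit above (match the `prerender-v8-prof-` marker anywhere in the name, ignore logs older than this browser launch, take the largest) can be sketched as a pure function over file metadata. This is an illustration under assumptions: `LogFileInfo` and `pickLargestThisRunLog` are hypothetical names, and the `.log` suffix check and the example file names in the usage note are guesses at the on-disk shape, not taken from `v8-prof.ts`.

```typescript
// Hypothetical metadata record; a real caller would fill it from fs.stat.
interface LogFileInfo {
  name: string;
  sizeBytes: number;
  mtimeMs: number;
}

// Marker from the commit message. V8 prepends `isolate-<addr>-` to the
// file name, so the marker must be matched with includes, not startsWith.
const LOG_MARKER = 'prerender-v8-prof-';

function pickLargestThisRunLog(
  files: LogFileInfo[],
  launchedAtMs: number,
): LogFileInfo | undefined {
  const candidates = files.filter(
    (f) =>
      f.name.includes(LOG_MARKER) &&
      f.name.endsWith('.log') && // assumed suffix
      f.mtimeMs >= launchedAtMs, // exclude logs from before this launch
  );
  // The pegged isolate accumulates the most samples, so the largest wins.
  return candidates.reduce<LogFileInfo | undefined>(
    (best, f) => (!best || f.sizeBytes > best.sizeBytes ? f : best),
    undefined,
  );
}
```

Called with a made-up entry such as `isolate-0x2-prerender-v8-prof-b.log`, the `includes` check matches where a `startsWith('prerender-v8-prof-')` check would not.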
The renderer's --prof logs live only in the container's ephemeral /tmp
(no EFS mount on the prerenderer), so they're reclaimed when the ECS task
is replaced. But a long-lived task that had PRERENDER_V8_PROF on, then off,
would keep the "on" period's logs around because the launch-time cleanup
was gated on the flag being enabled.

Run the stale-log sweep on every browser launch regardless of the flag,
and only stamp the launch time / add the --prof launch flag when enabled.
Flipping the flag off and restarting now actually clears the logs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
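
The split described in the commit above (sweep always, arm only when enabled) might look roughly like this. `LaunchPlan` and `planLaunch` are hypothetical names, and the exact way the PR composes the Chrome arguments is not shown in this conversation; passing V8's `--prof` and `--logfile` through Chrome's `--js-flags` switch is an assumption made for illustration.

```typescript
// Hypothetical description of what a browser launch should do.
interface LaunchPlan {
  sweepStaleLogs: boolean;
  extraArgs: string[];
  launchedAtMs: number | null;
}

function planLaunch(
  profEnabled: boolean,
  nowMs: number,
  logFile: string,
): LaunchPlan {
  return {
    // Always sweep, so turning the flag off and restarting clears old logs.
    sweepStaleLogs: true,
    // Assumed flag composition: V8 flags forwarded via Chrome's --js-flags.
    extraArgs: profEnabled ? [`--js-flags=--prof --logfile=${logFile}`] : [],
    // The launch stamp is only needed to scope log selection when armed.
    launchedAtMs: profEnabled ? nowMs : null,
  };
}
```

Keeping the sweep outside the flag check is the whole point of the change: the cleanup no longer depends on the diagnostic being switched on.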
The prerender renderer only ever renders the realm's own trusted cards
(never untrusted web content) and always runs in a container where the
sandbox must be off for Chrome to launch — and for the renderer to write
the V8 `--prof` diagnostic log to /tmp. Drop the CI /
PUPPETEER_DISABLE_SANDBOX gate and pass --no-sandbox unconditionally, so it
never silently depends on an env var being set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
v8ProfTopFrames came back absent on the bxl wedge with no reason, because
the reader's catch swallowed the error and returned null. That left the
live-vs-exit question (is the on-disk --prof log usable while the renderer
is still running, or only after it closes?) unanswerable.

Surface it: report the picked log basename + size in every path; on a
prof-process failure return a descriptive string (incl. a "killed: timed
out" marker) instead of null; on any reader error return the message rather
than null; raise the stdout cap to 128MB and the time-box to 40s. Now the
timeout line always shows either the frames + log size, or the exact
reason + size — which decides whether the log is written live or is
empty/incomplete mid-run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
habdelra and others added 3 commits June 16, 2026 10:09
…container

A hard synchronous CPU peg starves CDP (Debugger.enable / Profiler.enable
time out), so V8 `--prof` — kernel-SIGPROF sampled, written by a separate
thread to a file — is the only capture that survives it. But that file
accumulates every render on the isolate since browser launch (tens of MB),
and `node --prof-process` on it blows the render-timeout budget — which is
why the in-container summarizer kept returning nothing.

Stop parsing in-container. Add a `v8log` artifact kind and, on the render
timeout (when armed), stream the pegged isolate's raw `--prof` log to the
prerender S3 artifacts bucket via the existing sink — keyed by realm/card/
job so the wedging task's log is self-identifying across the fleet (no
shelling into 4 servers to find it). Symbolize it offline with
`node --prof-process`, where there's no deadline and the peg dominates the
top self-time frames. The timeout log line reports the upload + key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ization

Mode I and the Mode H knob table now describe the real flow: on a render
timeout the raw V8 --prof log is streamed to the prerender artifacts bucket
as a `v8log` (the timeout line reports the upload + key), and is symbolized
OFFLINE with `node --prof-process` (the log self-contains its code-creation
map, so JS frames resolve without binaries/sourcemaps). Replaces the old
"v8ProfTopFrames on the log line" in-container-parse model that blew the
timeout budget on a large log.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
symbolize-prerender-wedge.sh resolves the newest `.v8log` artifact for a
realm in the prerender artifacts bucket (or an exact --key), downloads it,
and runs `node --prof-process` to print the [Summary]/[JavaScript]/[Bottom
up] sections — the push-button version of the three manual aws+node steps.
The log self-contains its code-creation map, so JS frames resolve offline
without binaries or source maps; native/Chrome frames stay opaque.

Mode I now points at the helper as the primary path and keeps the raw
commands as the under-the-hood reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@habdelra habdelra marked this pull request as ready for review June 16, 2026 15:14
habdelra and others added 3 commits June 16, 2026 11:24
The raw --prof log runs tens of MB and `--logfile` accumulates them in the
container's temp dir across the browser's long life. Now that the wedged
page is evicted on a render timeout (its isolate tears down and the next
visit writes a fresh log), the uploaded log strands nothing — so reclaim
the disk by deleting it once the upload is durable.

uploadArtifact now returns whether the object actually landed, so the
delete is gated on a confirmed upload: a disabled/declined/failed sink
keeps the local copy rather than destroying a capture that was never
persisted. The timeout log line reports which path ran. Mode I of the
indexing-diagnostics skill is updated to match the status wording.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
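
The keep-on-non-persisted-upload property from the commit above can be sketched with the upload and delete steps injected as functions. This is a minimal illustration, not the PR's `uploadV8ProfLog`: `uploadThenReclaim`, the function types, and the status strings are hypothetical, and treating a thrown upload error as "not persisted" is an assumption.

```typescript
// Hypothetical injected dependencies, so the sketch needs no S3 or disk.
type UploadFn = (localPath: string) => Promise<boolean>;
type RemoveFn = (localPath: string) => Promise<void>;
type ReclaimStatus = 'uploaded-and-deleted' | 'kept-not-persisted';

async function uploadThenReclaim(
  localPath: string,
  upload: UploadFn,
  remove: RemoveFn,
): Promise<ReclaimStatus> {
  let persisted = false;
  try {
    // true only when the object actually landed in the bucket.
    persisted = await upload(localPath);
  } catch {
    // Assumption: a failed upload counts as not persisted.
    persisted = false;
  }
  // A disabled, declined, or failed sink keeps the only copy of the capture.
  if (!persisted) return 'kept-not-persisted';
  await remove(localPath);
  return 'uploaded-and-deleted';
}
```

The returned status corresponds to the "which path ran" note on the timeout log line; the real wording may differ.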
… selection

Extend the artifact-sink unit test for the `v8log` kind suffix and the new
boolean return (disabled sink → false, the contract the --prof delete gates
on). Add a v8-prof unit test covering flag gating, the launch flags, the
stale-log sweep, and how uploadV8ProfLog selects the pegged isolate's log:
largest this-run wins, isolate-prefixed names match, logs from before the
run are excluded, and a capture that can't be persisted is KEPT on disk
rather than deleted. All run standalone (no PG / Chrome / S3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Coerce the nullable status to a string before asserting on it rather than
guarding with '&&' inside the assertion argument.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@habdelra habdelra changed the title Diagnostic: capture a wedged prerender's stack (Debugger.pause + gated --prof) Diagnostic: capture a hard-pegged prerender render (CDP pause + gated V8 --prof → S3) Jun 16, 2026
@habdelra habdelra requested a review from Copilot June 16, 2026 15:32

Copilot AI left a comment

Contributor

Pull request overview

Adds new prerender “hard wedge” diagnostics so that synchronous main-thread pegs can still be inspected when CDP is starved. It captures a one-shot paused stack (CDP Debugger.pause) and can optionally arm V8’s kernel --prof sampler, uploading the raw tick log to the artifacts bucket for offline symbolization.

Changes:

  • Introduces pause-capture.ts (timeout-path debugger pause + heap usage) and v8-prof.ts (opt-in V8 --prof capture + artifact upload + local cleanup on confirmed persistence).
  • Extends the prerender artifact sink with a new v8log kind and changes uploadArtifact() to return boolean indicating persistence success.
  • Adds offline tooling/docs (symbolize-prerender-wedge.sh, indexing-diagnostics skill updates) plus realm-server tests for log selection and sink contract.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| packages/realm-server/prerender/pause-capture.ts | New CDP Debugger.pause-based timeout-path stack/heap capture utility. |
| packages/realm-server/prerender/v8-prof.ts | New opt-in V8 --prof log selection + streaming upload as v8log artifact. |
| packages/realm-server/prerender/artifact-sink.ts | Adds v8log artifact kind; uploadArtifact now returns true/false for persistence. |
| packages/realm-server/prerender/utils.ts | Hooks new timeout-path diagnostics into render timeout logging. |
| packages/realm-server/prerender/browser-manager.ts | Arms --prof at launch when enabled; adjusts sandbox flags behavior. |
| packages/realm-server/prerender/prerender-constants.ts | Reduces default render timeout from 90s to 60s. |
| packages/realm-server/scripts/symbolize-prerender-wedge.sh | New helper to fetch latest .v8log from S3 and run node --prof-process. |
| packages/realm-server/tests/prerender-v8-prof-test.ts | New unit tests for v8-prof gating, sweeping, and log-selection/retention behavior. |
| packages/realm-server/tests/prerender-artifact-sink-test.ts | Adds coverage for v8log suffix and uploadArtifact false-return contract when disabled. |
| packages/realm-server/tests/index.ts | Registers the new prerender-v8-prof-test in the realm-server test index. |
| .claude/skills/indexing-diagnostics/SKILL.md | Documents the new Mode I flow and PRERENDER_V8_PROF knob. |

Comment thread packages/realm-server/prerender/pause-capture.ts Outdated
Comment thread packages/realm-server/prerender/browser-manager.ts
Comment thread packages/realm-server/prerender/utils.ts Outdated
Comment thread packages/realm-server/scripts/symbolize-prerender-wedge.sh
…anups

- browser-manager: the renderer executes user-authored card code, so keep
  the Chromium sandbox on where it works and disable it only where it must
  be off — in the container (PUPPETEER_DISABLE_SANDBOX), in CI, and when the
  V8 --prof diagnostic is armed (a sandboxed renderer can't write the log).
- pause-capture: if createCDPSession outlives its timeout, detach the
  late-resolving session so it isn't orphaned on the page.
- utils: compute formatPausedStack once for the timeout log line.
- symbolize-prerender-wedge.sh: match the realm with grep -F so a realm
  containing regex metacharacters can't match unintended keys.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
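
The `grep -F` fix in the last bullet above is about literal versus pattern matching. The helper itself is a shell script; the TypeScript below is only an analogue of the same pitfall, with hypothetical function names and made-up keys, showing why a realm URL should be matched as a fixed string.

```typescript
// Buggy analogue: treats the realm as a regular expression, so `.` in a
// hostname matches any character.
function keysForRealmAsPattern(keys: string[], realm: string): string[] {
  const pattern = new RegExp(realm);
  return keys.filter((k) => pattern.test(k));
}

// Fixed analogue of `grep -F`: plain substring match, no metacharacters.
function keysForRealmLiteral(keys: string[], realm: string): string[] {
  return keys.filter((k) => k.includes(realm));
}

// Made-up keys for illustration; the real key layout is not shown here.
const exampleKeys = [
  'https://realms.example/demo/card/job.v8log',
  'https://realmsXexample/demo/card/job.v8log',
];
const exampleRealm = 'https://realms.example/demo/';

const loose = keysForRealmAsPattern(exampleKeys, exampleRealm);
const strict = keysForRealmLiteral(exampleKeys, exampleRealm);
console.log(loose.length, strict.length);
```

The pattern version accepts both keys because the dot is a wildcard, while the literal version accepts only the key that actually contains the realm URL.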

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 003f1e5afc

Comment thread packages/realm-server/prerender/browser-manager.ts
Comment thread packages/realm-server/prerender/v8-prof.ts
…us pick

Selecting the largest this-run isolate log is a heuristic: a 60s CPU peg
samples continuously so its log dwarfs IO-bound renderers', but under
several concurrent renderers a long-lived busy isolate could rival it.
pickV8ProfLog now reports the candidate count + runner-up size, and the
timeout line surfaces them so a mis-pick is visible (re-symbolize another
with the helper's --key).

The post-upload delete now runs only when there's a single this-run log —
then it's unmistakably the wedged, evicted isolate's and the unlink frees
real disk. With multiple candidates the largest may be a still-live
renderer's log, where unlinking frees nothing (V8 holds the fd) and loses
its samples; keep them all for the next browser-launch sweep instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
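
The reporting and delete-gating described in the commit above can be sketched over the sizes of the this-run candidate logs. `PickReport` and `reportPick` are hypothetical names, not the shape returned by `pickV8ProfLog`, and the delete remains gated on a confirmed upload as well, which this sketch does not model.

```typescript
// Hypothetical summary of a log pick, for surfacing on the timeout line.
interface PickReport {
  pickedBytes: number;
  candidateCount: number;
  runnerUpBytes: number | null;
  // Only one this-run log means it must be the wedged isolate's log.
  singleCandidate: boolean;
}

function reportPick(candidateSizesBytes: number[]): PickReport | null {
  if (candidateSizesBytes.length === 0) return null;
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...candidateSizesBytes].sort((a, b) => b - a);
  return {
    pickedBytes: sorted[0],
    candidateCount: sorted.length,
    runnerUpBytes: sorted.length > 1 ? sorted[1] : null,
    // With several candidates the largest may belong to a still-live
    // renderer, so the local file is kept for the next launch sweep.
    singleCandidate: sorted.length === 1,
  };
}
```

A runner-up close in size to the pick is the hint that the heuristic may have chosen the wrong isolate and another log is worth symbolizing.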
@habdelra habdelra requested a review from a team June 16, 2026 16:22
The prerender task runs only in a container where the sandbox can't
initialize, so Chrome won't start with it on — it has always run sandbox-off
and is required to. Force it unconditionally rather than gating on CI /
PUPPETEER_DISABLE_SANDBOX, so a missing env var can't silently break launch.
The security boundary is the task itself: the prerenderer is a separate,
segregated ECS task isolated from the realm-server. The comment no longer
claims the renderer only runs trusted cards.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
3 participants