CLI: capture eval-runner failures and timeouts in result by chubes4 · Pull Request #3330 · Automattic/studio

chubes4 · 2026-05-04T12:09:06Z

Summary

Extends the eval-runner result JSON with two new fields that any consumer of EVAL_RUNNER_RESULT_FILE can read:

error: string | null — set when an exception is thrown inside the message loop (auth blip, MCP crash, network error, SDK throw, etc.).
timedOut: boolean — flipped by the timeout callback before it calls query.interrupt().

Together these address the finalError ask in #3262 section 1 ("First actionable failure") and complete the failure-visibility story started by #3273 (firstToolError, toolEvents, phaseTimingsMs).

Why

Three failure modes today, three different visibility outcomes:

Failure shape	Today	After this PR
Model returns `success: false` cleanly	✅ Captured fully	Same
Timeout fires (`query.interrupt()` after `timeoutMs`)	⚠️ Captured as `success: false`, indistinguishable from a clean model failure	✅ `timedOut: true` distinguishes it
Mid-loop exception (auth, MCP, network, SDK throw)	⚠️ Caught at `main()`'s top-level → emits stripped-down `{ success: false, error }` JSON, losing all timings/tools/turns	✅ Caught inside `runEval()` → full structured result with `error` set alongside everything else

The third row is the most consequential: today, a run that completes meaningful work and then hits a late exception loses all the diagnostic state that was already captured. The new try { … } catch keeps the structured result intact.

What this enables

A consumer can now answer:

Did the eval time out vs. the model gave up cleanly? (Today: indistinguishable. After: check result.timedOut.)
Was there a runtime exception during the run, even if success was true? (Today: not representable — success: true and the exception path are mutually exclusive. After: both fields can coexist.)
When an exception fires, what tools ran before it, how long did each phase take? (Today: lost in the main()-level fallback JSON. After: preserved in the structured result.)

This is the same shape as #3273's contribution — producer-side observability improvements that any downstream consumer benefits from. Studio's own npm run eval, future internal CI, promptfoo configurations, and external benchmark harnesses all read the same JSON contract.

Diff

8 lines:

let error: string | null = null and let timedOut = false declarations
Timeout callback rewritten to set timedOut = true before calling query.interrupt()
try { … } catch ( caught ) { error = … } wrapper around the message loop
Two new fields in the return shape

The catch only changes behavior on exception paths; the existing successful and clean-failure paths are unchanged.

Origin

Originally drafted on the experimental Static Site Importer branch (#3309) where it was being used by an out-of-tree benchmark harness. Split out for upstream review because:

The change is generic — useful to any consumer of the eval-runner's structured output, not just the SSI experiment.
It shouldn't ship gated on the SSI experiment merging.
Reviewing it on its own keeps the surface area small (8 lines vs. ~200).

The SSI draft PR will rebase on top of this once it lands.

Tests

npm run typecheck (all workspaces) — clean
npx eslint apps/cli/ai/eval-runner.ts — clean

The change is small enough that it's covered by the existing eval-runner exercise paths (npm run eval, etc.). If reviewers want explicit unit tests around the new fields I'm happy to add them.

Refs

Issue Eval runner should expose structured diagnostics and benchmark metadata #3262 — Eval runner should expose structured diagnostics and benchmark metadata
PR CLI: add eval-runner diagnostics #3273 — CLI: add eval-runner diagnostics (the previous installment of this story)
PR Experiment: use Static Site Importer for generated local sites #3309 — Experiment: use Static Site Importer for generated local sites (where this code originally lived)

AI assistance

AI assistance: Yes
Tool(s): Claude Code (Sonnet 4.5)
Used for: Drafted the catch wrapper, the timedOut flag, and the result-shape extension under Chris's direction. Chris reviewed the diff, the issue framing, and the split rationale.

Extends the eval-runner result JSON with two new fields that any consumer of `EVAL_RUNNER_RESULT_FILE` can read: - `error: string | null` — set when an exception is thrown inside the message loop (auth blip, MCP crash, network error, SDK throw, etc.). Today those exceptions propagate out of `runEval()` to `main()`'s top-level catch, which emits a stripped-down `{ success: false, error }` JSON without the timings, tools, turns, or partial state that were captured before the throw. With this change, the exception is caught inside `runEval()`, so consumers get the full structured result alongside the error message. - `timedOut: boolean` — flipped by the timeout callback before it calls `query.interrupt()`. Today timeouts are indistinguishable from clean model failures: both surface as `success: false` with no other differentiator. Consumers can now tell them apart and attribute time-budget exhaustion correctly. Together these address the `finalError` ask in #3262 section 1 ("First actionable failure") and complete the failure- visibility story started by #3273 (which added `firstToolError`, `toolEvents`, `phaseTimingsMs`). The change is generic — it benefits any consumer of the eval-runner's structured output (Studio's own scripts, `npm run eval`, future internal CI, third-party benchmark harnesses). Originally drafted on the experimental Static Site Importer branch (#3309); split out for upstream review because the eval-runner changes are independently valuable and shouldn't ship gated on that experiment. Refs: #3262, #3273, #3309 ## AI assistance - **AI assistance:** Yes - **Tool(s):** Claude Code (Sonnet 4.5) - **Used for:** Drafted the catch wrapper, the `timedOut` flag, and the result-shape extension under Chris's direction. Chris reviewed the diff, the issue framing, and the split rationale.

Pairs with Automattic/studio#3330, which adds two new fields to the eval-runner result JSON: - result.error: string | null — set when an exception is caught inside the message loop (auth blip, MCP crash, network error, SDK throw, etc.) - result.timedOut: boolean — set when the timeout callback fires before query.interrupt() Bench changes: - studio-agent-runtime + studio-agent-site-info: failure-detail message now distinguishes 'timed out after Nms', 'exception: ...', and 'exit=N'. Lifts the timeout literal into a named constant so the message stays in sync with the actual budget. - studio-agent-site-build: agentSucceeded gate now also requires !result.timedOut. Adds two new metrics: - timed_out (1 when the run hit the time budget) - agent_runner_error (1 when an exception landed in result.error) These categorize regressions that today look like generic agent failures. Backwards-compatible: on Studio versions older than #3330, both fields are nullish and the bench behaves exactly as it did before. Refs: Automattic/studio#3330, Automattic/studio#3262, Automattic/studio#3273 ## AI assistance - **AI assistance:** Yes - **Tool(s):** Claude Code (Sonnet 4.5) - **Used for:** Drafted the gate-tightening, the constants, and the failure-detail messages under Chris's direction.

wpmobilebot · 2026-05-04T12:32:32Z

📊 Performance Test Results

Comparing f4ab2be vs trunk

app-size

Metric	trunk	`f4ab2be`	Diff	Change
App Size (Mac)	1557.92 MB	1557.92 MB	+0.00 MB	⚪ 0.0%

site-editor

Metric	trunk	`f4ab2be`	Diff	Change
load	1522 ms	1491 ms	31 ms	⚪ 0.0%

site-startup

Metric	trunk	`f4ab2be`	Diff	Change
siteCreation	8092 ms	8088 ms	4 ms	⚪ 0.0%
siteStartup	4949 ms	4948 ms	1 ms	⚪ 0.0%

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

github-actions Bot assigned chubes4 May 4, 2026

chubes4 mentioned this pull request May 4, 2026

Experiment: use Static Site Importer for generated local sites #3309

Draft

chubes4 requested a review from youknowriad May 4, 2026 12:10

chubes4 mentioned this pull request May 4, 2026

feat(studio): consume eval-runner failure-capture fields chubes4/homeboy-rigs#57

Merged

youknowriad approved these changes May 4, 2026

View reviewed changes

chubes4 merged commit b42be68 into trunk May 4, 2026
11 checks passed

chubes4 deleted the eval-runner-failure-capture branch May 4, 2026 13:16

chubes4 mentioned this pull request May 5, 2026

CLI: add eval-runner result diagnostics #3349

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLI: capture eval-runner failures and timeouts in result#3330

CLI: capture eval-runner failures and timeouts in result#3330
chubes4 merged 1 commit intotrunkfrom
eval-runner-failure-capture

chubes4 commented May 4, 2026

Uh oh!

wpmobilebot commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

chubes4 commented May 4, 2026

Summary

Why

What this enables

Diff

Origin

Tests

Refs

AI assistance

Uh oh!

wpmobilebot commented May 4, 2026

📊 Performance Test Results

app-size

site-editor

site-startup

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants